Advanced Topics in Machine Learning for Bioinformatics and Biomedical Engineering
March 9, 2026
Learning goals
A hidden layer of \(m\) units (neurons) takes an input vector \(x \in \mathbb{R}^d\) and computes:
\[ z = Wx + b,\qquad h = \phi(z) \] where
Input vector: \(x \in \mathbb{R}^{d}\)
Parameters: \(W \in \mathbb{R}^{m\times d}\), \(b \in \mathbb{R}^{m}\)
Hidden layer activation: \(h \in \mathbb{R}^{m}\)
Nonlinearity: \(\sigma(\cdot)\), or \(\phi(\cdot): \mathbb{R}^{m} \to \mathbb{R}^{m}\) (applied element-wise)
Output:
Loss: \(L \in \mathbb{R}\) (a scalar)
\[ z = Wx + b,\qquad h = \phi(z) \] where
d: input dimensionality = number of input features. Examples: gene-expression features, k-mer counts, embedding length.
m: hidden-layer width = number of neurons (units) in the hidden layer. This is a tunable architecture choice (a hyperparameter).
k: output dimensionality = number of outputs. Examples:
A classical neural network
Training loop:
Forward pass: compute predictions \(\hat{y}\) from inputs \(x\) and parameters \(\theta\).
Loss computation: measure the mismatch \(L(\hat{y}, y)\) between the prediction and the target on a given example.
Backward pass (backprop): compute \(\nabla_\theta L\) efficiently.
Update: \(\theta \leftarrow \theta - \eta \nabla_\theta L\) (or Adam, etc.).
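The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy data, the linear model, the learning rate `eta = 0.1`, and the number of steps are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical): y = 2*x0 - 1*x1 + small noise
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

theta = np.zeros(2)  # parameters of a linear model
eta = 0.1            # learning rate (assumed value)
losses = []

for step in range(100):
    y_hat = X @ theta                        # 1. forward pass
    loss = 0.5 * np.mean((y_hat - y) ** 2)   # 2. loss (mean of 1/2 squared error)
    grad = X.T @ (y_hat - y) / len(y)        # 3. backward pass: dL/dtheta
    theta = theta - eta * grad               # 4. update
    losses.append(loss)
```

For a deep network only step 3 changes: the gradient is obtained by backpropagation instead of a closed-form expression.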
A loss function measures how wrong a model’s prediction is on a given example. Given a prediction \(\hat{y}\) and a target \(y\),
the loss is a scalar: \[ L(\hat{y}, y) \in \mathbb{R} \]
We train the model by choosing parameters \(\theta\) that minimize the average loss over a dataset of \(n\) samples: \[ \min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} L\!\left(\hat{y}^{(i)}, y^{(i)}\right) \]
The loss defines what good predictions mean.
Different tasks (regression, binary classification, multiclass) use different losses.
Loss functions are a big family. But in neural nets you can think of them in a few categories:
In practice, use just a few core ones: MSE (regression), BCE (binary/multi-label), and softmax cross-entropy (multiclass).
Used when targets are real-valued (e.g., expression level prediction, continuous phenotype).
For a single scalar prediction \(\hat{y}\in\mathbb{R}\): \[ L_{\text{MSE}}(\hat{y},y)=\frac{1}{2}(\hat{y}-y)^2 \] (The factor \(1/2\) is convenient: it cancels in derivatives.)
Gradient w.r.t. prediction: \[ \frac{\partial L_{\text{MSE}}}{\partial \hat{y}} = \hat{y}-y \]
When it works well: Gaussian noise / squared-error is acceptable, smooth optimization.
Sensitivity: strongly penalizes outliers.
\[ L_{\text{MAE}}(\hat{y},y)=|\hat{y}-y| \] Subgradient: \[ \frac{\partial L_{\text{MAE}}}{\partial \hat{y}} = \mathrm{sign}(\hat{y}-y) \quad \text{for } \hat{y}\neq y \] (Not differentiable at \(\hat{y}=y\), where any value in \([-1,1]\) is a valid subgradient; optimizers use subgradients.)
When it works well: heavier-tailed noise, robust to outliers.
Tradeoff: gradient is constant (can be slower to converge near optimum).
Interpolates between MSE (near 0) and MAE (far away). For residual \(r=\hat{y}-y\) and threshold \(\delta>0\): \[ L_{\delta}(r)= \begin{cases} \frac{1}{2}r^2 & |r|\le \delta \\ \delta(|r|-\frac{1}{2}\delta) & |r|>\delta \end{cases} \] Derivative: \[ \frac{\partial L_\delta}{\partial \hat{y}}= \begin{cases} r & |r|\le \delta \\ \delta\,\mathrm{sign}(r) & |r|>\delta \end{cases} \]
The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it “smooths out” the former’s corner at the origin.
Practical: often a good default when outliers exist.
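To compare the three regression losses side by side, here is a small NumPy sketch of the formulas above. The function names and the two example residuals (one small, one outlier) are chosen for illustration only.

```python
import numpy as np

def mse(y_hat, y):
    # Half squared error, as defined above
    return 0.5 * (y_hat - y) ** 2

def mae(y_hat, y):
    return np.abs(y_hat - y)

def huber(y_hat, y, delta=1.0):
    r = y_hat - y
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear)

def huber_grad(y_hat, y, delta=1.0):
    # Derivative w.r.t. the prediction: r inside the threshold, clipped outside
    r = y_hat - y
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

y_true = np.zeros(2)
y_pred = np.array([0.5, 10.0])  # a small residual and an outlier

print(mse(y_pred, y_true))         # outlier dominates
print(mae(y_pred, y_true))
print(huber(y_pred, y_true))       # quadratic for 0.5, linear for 10
print(huber_grad(y_pred, y_true))  # gradient is capped at delta for the outlier
```

The outlier contributes 50 to the MSE but only 9.5 to the Huber loss with \(\delta=1\), which is the robustness the text describes.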
Logit \(z\in\mathbb{R}\), probability \(\hat{y}=\sigma(z)\in(0,1)\), label \(y\in\{0,1\}\): \[ \hat{y}=\sigma(z)=\frac{1}{1+e^{-z}} \] \[ L_{\text{BCE}}(\hat{y},y)= -\Big(y\log\hat{y} + (1-y)\log(1-\hat{y})\Big) \]
A key simplification for backprop is the derivative w.r.t. the logit: \[ \frac{\partial L_{\text{BCE}}}{\partial z}=\hat{y}-y \]
In implementations you typically avoid computing \(\log(1-\sigma(z))\) directly. A stable form is: \[ L_{\text{BCE-logits}}(z,y)=\log\big(1+e^{z}\big) - yz \] (which is equivalent to BCE after algebra).
Why it matters: prevents overflow/underflow for large \(|z|\).
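A minimal sketch of the stable form follows. Note one extra step that is an addition to the formula above: \(\log(1+e^{z})\) itself overflows for large positive \(z\), so the code uses the identity \(\log(1+e^{z}) = \max(z,0) + \log(1+e^{-|z|})\). The function names are illustrative.

```python
import numpy as np

def bce_with_logits(z, y):
    # log(1 + e^z) - y*z, with log(1 + e^z) computed as
    # max(z, 0) + log(1 + e^{-|z|}) so that exp never overflows
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))) - y * z

def bce_naive(z, y):
    # Direct sigmoid + log; only safe for moderate |z|
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z_small = np.array([-2.0, 0.3, 3.0])
y_small = np.array([0.0, 1.0, 1.0])
print(bce_with_logits(z_small, y_small))  # matches bce_naive on moderate logits

# Extreme logits: a confident wrong prediction gives a loss of about |z|
print(bce_with_logits(np.array([1000.0, -1000.0]), np.array([0.0, 1.0])))
```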
Logits \(z\in\mathbb{R}^{k}\), probabilities \(\hat{y}\in\Delta^{k-1}\) (the probability simplex): \[ \hat{y}_i = \mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}} \]
For one-hot \(y\in\{0,1\}^k\): \[ L_{\text{CE}}(\hat{y},y)= -\sum_{i=1}^k y_i \log(\hat{y}_i) \]
Backprop simplification: \[ \frac{\partial L_{\text{CE}}}{\partial z}=\hat{y}-y \]
It is the multiclass analogue of BCE+sigmoid.
Use: \[ \log\sum_j e^{z_j} = \alpha + \log\sum_j e^{z_j-\alpha},\quad \alpha=\max_j z_j \] to avoid overflow when some logits are large.
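The log-sum-exp shift can be sketched as follows. The helper names are illustrative; the example logits are deliberately huge to show that the shifted computation stays finite.

```python
import numpy as np

def log_softmax(z):
    # Subtract alpha = max_j z_j before exponentiating (log-sum-exp trick)
    z = np.asarray(z, dtype=float)
    alpha = np.max(z, axis=-1, keepdims=True)
    shifted = z - alpha
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def softmax_cross_entropy(z, y_onehot):
    # -sum_i y_i log softmax(z)_i, computed from logits
    return -np.sum(y_onehot * log_softmax(z), axis=-1)

z_big = np.array([1000.0, 1001.0, 999.0])  # exp(1000) would overflow
y = np.array([0.0, 1.0, 0.0])

loss_big = softmax_cross_entropy(z_big, y)
loss_shifted = softmax_cross_entropy(z_big - 1000.0, y)
print(loss_big, loss_shifted)  # identical: softmax is shift-invariant
```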
e.g. predicting multiple functional GO annotations per protein (several labels can be 1 simultaneously).
Use \(k\) independent logits \(z\in\mathbb{R}^k\) and apply sigmoid per label: \[ \hat{y}_j = \sigma(z_j) \] Loss is a sum (or mean) of BCE across labels: \[ L=\sum_{j=1}^k \mathrm{BCE}(\hat{y}_j, y_j) \]
Gradient still has a simple per-label form: \[ \frac{\partial L}{\partial z_j} = \hat{y}_j - y_j \]
Bioinformatics datasets often have strong imbalance (rare motifs, rare cell types, rare variants).
Let positive weight \(w_+>0\) and negative weight \(w_->0\): \[ L = -\left(w_+\,y\log\hat{y} + w_-\,(1-y)\log(1-\hat{y})\right) \] This changes the gradient scale so minority classes influence training more.
Often used for extreme imbalance. Define \(p_t=\hat{y}\) if \(y=1\), else \(p_t=1-\hat{y}\): \[ L_{\text{focal}} = -(1-p_t)^\gamma \log(p_t) \]
- \(\gamma>0\) focuses learning on hard cases (where \(p_t\) is small).
- Often combined with class weights.
Intuition: if an example is already confidently correct, it contributes less to the loss.
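The down-weighting of easy examples can be checked numerically. This sketch implements the focal loss formula above; the value \(\gamma=2\) and the two example probabilities are assumptions made for the illustration.

```python
import numpy as np

def focal_loss(y_hat, y, gamma=2.0):
    # p_t is the probability assigned to the true class
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, y_hat, 1.0 - y_hat)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

y = np.array([1, 1])
y_hat = np.array([0.9, 0.1])  # one easy positive, one hard positive

bce = focal_loss(y_hat, y, gamma=0.0)    # gamma = 0 recovers plain BCE
focal = focal_loss(y_hat, y, gamma=2.0)
print(focal / bce)  # the modulating factor (1 - p_t)^gamma per example
```

With \(\gamma=2\) the easy example keeps 1% of its BCE loss while the hard one keeps 81%.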
Some tasks are naturally ranking problems (e.g., scoring candidate interactions).
With label \(y\in\{-1,+1\}\) and score \(s\in\mathbb{R}\): \[ L_{\text{hinge}}(s,y)=\max(0, 1 - y s) \] Encourages \(ys\ge 1\) (a margin).
Often used in SVMs; less common than cross-entropy in modern deep nets, but useful conceptually.
If \(\theta\) is the collection of all parameters, say \(\theta = (\theta_1,\theta_2,\dots,\theta_n)\) then the gradient is the vector of partial derivatives: \(\nabla_\theta L = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_n} \right)\)
To train a network we will do:
\[ a = w^\top x + b,\qquad \hat{y}=\sigma(a)=\frac{1}{1+e^{-a}} \]
Binary cross-entropy (BCE) for label \(y\in\{0,1\}\): \[ L(\hat{y},y) = -\left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right) \]
We want gradients: \(\frac{\partial L}{\partial w}\) and \(\frac{\partial L}{\partial b}\).
Computation graph: \(x, w, b \to a = w^\top x + b \to \hat{y} = \sigma(a) \to L(\hat{y}, y)\), with the label \(y\) also feeding into \(L\).
Backprop = applying the chain rule along the edges of a graph. \[ a = w^\top x + b,\qquad \hat{y}=\sigma(a)=\frac{1}{1+e^{-a}} \]
\[ L(\hat{y},y) = -\left(y\log \hat{y} + (1-y)\log(1-\hat{y})\right) \]
If \(L\) depends on \(\hat{y}\), which depends on \(a\), which depends on \(w\):
The problem
\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a} \cdot \frac{\partial a}{\partial w} \]
Backprop organizes this so we reuse intermediate derivatives.

Once the gradient is known, take a gradient-descent step:
\[ w \leftarrow w - \eta \frac{\partial L}{\partial w} \]
Sigmoid: \[ \sigma'(a)=\sigma(a)\left(1-\sigma(a)\right) \]
Affine: \[ a = w^\top x + b \quad\Rightarrow\quad \frac{\partial a}{\partial w}=x,\; \frac{\partial a}{\partial b}=1 \]
Let’s derive \(\frac{\partial L}{\partial a}\)
Binary classifier (single neuron):
\[ a = w^\top x + b,\qquad \hat{y}=\sigma(a)=\frac{1}{1+e^{-a}} \]
Binary cross-entropy (BCE):
\[ L(\hat{y},y)= -\Big(y\log \hat{y} + (1-y)\log(1-\hat{y})\Big), \quad y\in\{0,1\} \]
We want the gradient w.r.t. the logit \(a\): \(\frac{\partial L}{\partial a}\).
Differentiate BCE w.r.t. \(\hat{y}\):
\[ \frac{\partial L}{\partial \hat{y}} = -\left( y\cdot \frac{1}{\hat{y}} + (1-y)\cdot \frac{-1}{1-\hat{y}} \right) = -\frac{y}{\hat{y}}+\frac{1-y}{1-\hat{y}} \]
Put over a common denominator:
\[ \frac{\partial L}{\partial \hat{y}} = \frac{-y(1-\hat{y})+(1-y)\hat{y}}{\hat{y}(1-\hat{y})} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \]
Sigmoid derivative:
\[ \hat{y}=\sigma(a)=\frac{1}{1+e^{-a}} \quad\Rightarrow\quad \frac{\partial \hat{y}}{\partial a} = \sigma(a)(1-\sigma(a)) = \hat{y}(1-\hat{y}) \]
By the chain rule:
\[ \frac{\partial L}{\partial a} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a} \]
Substitute the two results:
\[ \frac{\partial L}{\partial a} = \left(\frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\right) \cdot \left(\hat{y}(1-\hat{y})\right) = \hat{y}-y \]
The simplification
This cancellation is why sigmoid + BCE is so clean in backprop:
the gradient at the logit is just prediction minus label.
Since \(a=w^\top x + b\):
\[ \frac{\partial a}{\partial w}=x,\qquad \frac{\partial a}{\partial b}=1 \]
Therefore:
\[ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\frac{\partial a}{\partial w} = (\hat{y}-y)x, \qquad \frac{\partial L}{\partial b} = \hat{y}-y \]
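A quick way to build trust in a derived gradient is to compare it with finite differences. The sketch below does this for the single-neuron result \((\hat{y}-y)x\); the random inputs, the step size `eps`, and the function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_fn(w, b, x, y):
    y_hat = sigmoid(w @ x + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grads(w, b, x, y):
    # dL/da = y_hat - y, then dL/dw = delta * x and dL/db = delta
    delta = sigmoid(w @ x + b) - y
    return delta * x, delta

def numeric_grad_w(w, b, x, y, eps=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss_fn(w + e, b, x, y) - loss_fn(w - e, b, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.2
x, y = rng.normal(size=3), 1.0

grad_w, grad_b = analytic_grads(w, b, x, y)
print(np.max(np.abs(grad_w - numeric_grad_w(w, b, x, y))))  # should be tiny
```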
For multiclass logits \(z\in\mathbb{R}^k\), \(\hat{y}=\mathrm{softmax}(z)\), and one-hot \(y\):
\[ L = -\sum_{i=1}^k y_i\log(\hat{y}_i) \]
A key result (analogous to BCE+sigmoid) is:
\[ \frac{\partial L}{\partial z}=\hat{y}-y \]
So in both cases, the error signal at the final logits is \(\hat{y}-y\).
Batch of size \(n\):
Layer view: \[ a = XW^\top + b \qquad \hat{Y} = \sigma(a) \]
We use the mean BCE over the batch: \[ L = \frac{1}{n}\sum_{i=1}^n \ell(\hat{y}^{(i)},y^{(i)}) \]
For sample \(i\): \[ \ell(\hat{y}^{(i)},y^{(i)})= -\Big(y^{(i)}\log \hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\Big) \]
Derivative wrt \(\hat{y}^{(i)}\): \[ \frac{\partial \ell}{\partial \hat{y}^{(i)}} = \frac{\hat{y}^{(i)}-y^{(i)}}{\hat{y}^{(i)}(1-\hat{y}^{(i)})} \]
Sigmoid derivative: \[ \frac{\partial \hat{y}^{(i)}}{\partial a^{(i)}}=\hat{y}^{(i)}(1-\hat{y}^{(i)}) \]
Chain rule per sample: \[ \frac{\partial \ell}{\partial a^{(i)}} = \frac{\partial \ell}{\partial \hat{y}^{(i)}} \cdot \frac{\partial \hat{y}^{(i)}}{\partial a^{(i)}} = \hat{y}^{(i)}-y^{(i)} \]
Stacking all samples: \[ \frac{\partial \ell}{\partial a} = \hat{Y}-Y \quad \in \mathbb{R}^{n\times 1} \]
For the mean loss: \[ \boxed{ \frac{\partial L}{\partial a} = \frac{1}{n}(\hat{Y}-Y) } \]
Note
You will often see \(\delta := \frac{\partial L}{\partial a}\) called the error signal. With mean reduction, it carries the \(\tfrac{1}{n}\) factor.
Recall: \[ a = XW^\top + b \]
Using matrix calculus:
With mean loss: \[ \frac{\partial L}{\partial W}=\frac{1}{n}(\hat{Y}-Y)^\top X, \qquad \frac{\partial L}{\partial b}=\frac{1}{n}\sum_{i=1}^n(\hat{y}^{(i)}-y^{(i)}) \]
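The batched formulas translate almost literally into NumPy. This sketch assumes the shapes \(X\in\mathbb{R}^{n\times d}\), \(W\in\mathbb{R}^{1\times d}\), \(Y\in\{0,1\}^{n\times 1}\) and a scalar bias, which are consistent with \(a = XW^\top + b\) but are not stated explicitly above; the data is random.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(X, Y, W, b):
    n = X.shape[0]
    A = X @ W.T + b                # logits, shape (n, 1)
    Y_hat = sigmoid(A)
    loss = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    delta = (Y_hat - Y) / n        # error signal with the 1/n factor
    grad_W = delta.T @ X           # shape (1, d), same as W
    grad_b = delta.sum()
    return loss, grad_W, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = rng.integers(0, 2, size=(8, 1)).astype(float)
W = rng.normal(size=(1, 3))
b = 0.1

loss, grad_W, grad_b = forward_backward(X, Y, W, b)
print(loss, grad_W.shape)
```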
Now logits \(Z\in\mathbb{R}^{n\times k}\), probabilities \(\hat{Y}\in\mathbb{R}^{n\times k}\), one-hot targets \(Y\in\{0,1\}^{n\times k}\).
\[ \hat{Y}=\mathrm{softmax}(Z) \quad (\text{row-wise}) \] \[ L = -\frac{1}{n}\sum_{i=1}^n \sum_{c=1}^k Y_{ic}\log(\hat{Y}_{ic}) \]
Key result (same pattern): \[ \boxed{ \frac{\partial L}{\partial Z}=\frac{1}{n}(\hat{Y}-Y) } \]
So the last-layer error signal is still prediction minus target.
For an MLP ending in logits \(Z\):
Then backprop into the previous layer: \[ \delta^{(\text{prev})} = \left(\delta^{(\text{out})}W^{(\text{out})}\right)\odot \phi'(A^{(\text{prev})}) \]
This is the standard backprop recursion.
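As a sketch of the recursion, here is backprop for an MLP with one hidden layer. The architecture (tanh hidden layer, softmax output), the layer sizes, and all variable names are assumptions made for the example, with weights stored so that each layer computes `H @ W.T + b`.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # row-wise stability shift
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def forward(X, params):
    W1, b1, W2, b2 = params
    A1 = X @ W1.T + b1       # hidden pre-activations, (n, m)
    H = np.tanh(A1)          # hidden activations
    Z = H @ W2.T + b2        # output logits, (n, k)
    return A1, H, Z

def loss_fn(X, Y, params):
    Z = forward(X, params)[2]
    return -np.mean(np.sum(Y * np.log(softmax(Z)), axis=1))

def backward(X, Y, params):
    W1, b1, W2, b2 = params
    n = X.shape[0]
    A1, H, Z = forward(X, params)
    delta_out = (softmax(Z) - Y) / n                           # (n, k)
    grad_W2 = delta_out.T @ H
    grad_b2 = delta_out.sum(axis=0)
    delta_prev = (delta_out @ W2) * (1.0 - np.tanh(A1) ** 2)   # recursion step
    grad_W1 = delta_prev.T @ X
    grad_b1 = delta_prev.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
n, d, m, k = 5, 4, 3, 2
X = rng.normal(size=(n, d))
Y = np.eye(k)[rng.integers(0, k, size=n)]   # one-hot targets
params = (rng.normal(size=(m, d)), np.zeros(m),
          rng.normal(size=(k, m)), np.zeros(k))
grads = backward(X, Y, params)
```

Deeper networks repeat the `delta_prev` line once per layer, reusing the previous layer's error signal each time.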
For BCE with sigmoid output, the gradient w.r.t. the pre-activation \(a\) is:
\[ \frac{\partial L}{\partial a} = \hat{y} - y \]
Then \[ \frac{\partial L}{\partial w} = (\hat{y}-y)\,x,\qquad \frac{\partial L}{\partial b} = (\hat{y}-y) \]
Tip
This is why logistic regression and binary classifiers are so clean to implement.


Naively computing each partial derivative separately is expensive.
Backprop computes all parameter gradients in time proportional to one forward pass plus one backward pass.
Roughly:

During the backward pass each node receives an upstream gradient \[ \bar{v} := \frac{\partial L}{\partial v} \] and sends downstream gradients to its parents.
A single unit (neuron) takes an input vector \(x \in \mathbb{R}^d\), computes:
\[ z = Wx + b,\qquad a = \phi(z) \] where \(\phi\) is applied element-wise for most common activations.
A neural network layer typically computes:
\[ \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}, \qquad \mathbf{a} = \phi(\mathbf{z}) \]
Without a nonlinear activation \(\phi(\cdot)\), stacking layers collapses into a single linear transformation, so the network cannot represent nonlinear relationships (crucial in bioinformatics: regulatory interactions, epistasis, nonlinear signal-response curves, etc.).
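The collapse of stacked linear layers is easy to verify numerically. In this sketch the layer sizes and random weights are arbitrary; the point is that two affine layers equal one affine layer with \(W_{\text{eff}} = W_2 W_1\) and \(b_{\text{eff}} = W_2 b_1 + b_2\).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single affine layer
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

print(np.allclose(two_layers, one_layer))
```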
Activation functions mainly affect:
| Activation | Typical use | Range | Key risk |
|---|---|---|---|
| Identity (linear) | Regression outputs | \((-\infty,\infty)\) | No nonlinearity |
| Sigmoid | Binary probability outputs | \((0,1)\) | Saturation, vanishing gradients |
| Tanh | Hidden layers (older), RNNs | \((-1,1)\) | Saturation, vanishing gradients |
| ReLU | Default hidden layers | \([0,\infty)\) | “Dying ReLU” (zero gradient) |
| Leaky ReLU | Hidden layers | \((-\infty,\infty)\) | Slope choice |
| ELU | Hidden layers | \((-\alpha,\infty)\) | Slightly costlier than ReLU |
| Softplus | Smooth ReLU alternative | \((0,\infty)\) | Still saturates for very negative inputs |
| GELU | Transformers / modern deep nets | \((-\infty,\infty)\) | More compute than ReLU |
| Softmax | Multi-class probability outputs | \((0,1)\) with sum 1 | Numerical stability if naive |
\[ \phi(x) = x \]
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
\[ \mathrm{ReLU}(x) = \max(0, x) \]
\[ \phi(x)= \begin{cases} x & x\ge 0 \\ \alpha x & x<0 \end{cases} \quad \text{with } \alpha \in (0,1) \]
\[ \mathrm{ELU}(x)= \begin{cases} x & x\ge 0 \\ \alpha\left(e^x - 1\right) & x<0 \end{cases} \]
\[ \mathrm{Softplus}(x)=\ln\left(1+e^x\right) \]
A common approximation used in practice is:
\[ \mathrm{GELU}(x) \approx \frac{1}{2}x\left[1+\tanh\left(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)\right)\right] \]
For logits \(\mathbf{z} \in \mathbb{R}^K\),
\[ \mathrm{Softmax}(\mathbf{z})_i=\frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]
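The formulas above can be written as short NumPy functions. This is a sketch: the default slopes (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU) are common choices assumed here, not values given in the text, and the softmax includes a max-shift for stability.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # expm1 on min(x, 0) avoids overflow in the unused positive branch
    return np.where(x >= 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def softplus(x):
    # log(1 + e^x) computed stably
    return np.logaddexp(0.0, x)

def gelu_tanh(x):
    # Tanh approximation of GELU given above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(z):
    z = z - np.max(z)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
for f in (relu, leaky_relu, elu, softplus, gelu_tanh, softmax):
    print(f.__name__, f(x))
```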


Definition
A learning curve shows a score of an estimator for a varying number of training samples (or epochs).
Intuitively, a model's score should improve with more experience (more samples or more epochs).
It is common practice to plot dual curves simultaneously:
Learning curves are heavily used when training deep neural networks, where each epoch provides a natural evaluation point.
You can configure your learning curve to maximise or minimise a score:
During training, at each step you evaluate two quantities:
From the dynamics and shape of the learning curves we can diagnose:
About our model:
About our data:
These patterns hint at issues with model capacity, dataset size, and generalisation.
Underfit condition
The model cannot obtain a sufficiently low error value on the training set.

Flat curves at high loss
The model does not have sufficient capacity for the complexity of the dataset.
Causes:
Remedy: increase model capacity, add layers, or improve features.
Underfit condition
The model cannot obtain a sufficiently low error value on the training set.

Decreasing train loss · Decreasing val loss
Both curves are still falling — the model has capacity to keep improving but training was stopped prematurely.
Causes:
Remedy: train for more epochs; revisit early-stopping patience.
Overfit condition
The model learns the training data too well, at the cost of increased generalisation error.

Decreasing train loss · Increasing val loss
Training loss keeps decreasing while validation loss starts to increase — a clear sign the model is memorising rather than learning.
Remedies:
Good fit condition
The model learns the training data correctly and generalises well.

Stable Train loss · Stable Val loss
A small, stable gap between train and val loss is acceptable — a perfect gap of zero is unlikely in practice.
Unrepresentative training condition
The training data does not provide sufficient information to explain the validation distribution. (Sample size? Distribution shift?)

Decreasing train loss · Stable val loss · Growing gap
Remedies:
Unrepresentative validation condition
The validation data may not provide sufficient information to evaluate generalisation reliably.

Stable Train loss · Noisy Val loss
Causes:
Remedy: increase validation set size; use k-fold cross-validation.
Unrepresentative validation condition
Validation loss systematically lower than training loss — often an artefact of the training procedure.

Stable Train loss · Lower Val loss · Wide gap
Common cause: dropout or other stochastic regularisation active during training but disabled during evaluation, making val loss artificially lower.
Not always a problem — understand the cause first.
A deep model can look brilliant on the training set and still be useless on new samples.
Think about a classifier trained to distinguish tumour subtypes from gene-expression profiles. If the network learns quirks of the training cohort rather than stable biological patterns, it will not generalise to patients from another lab, hospital, or sequencing batch.
\[ \mathcal{L}_{\text{train}} \downarrow \quad \text{while} \quad \mathcal{L}_{\text{val}} \uparrow \]
\[ \text{generalisation gap} = \mathcal{L}_{\text{val}} - \mathcal{L}_{\text{train}} \]
Note
A widening gap between training and validation loss is the classic warning sign: the model is remembering the data, not learning the signal.
Tip
In biology this is especially dangerous — datasets are often small, noisy, high-dimensional, and heterogeneous.

A healthy model improves on both sets; an overfit model keeps improving only on training data.
About dropout
Dropout randomly switches off neurons during training. The network cannot rely on one convenient pathway, so it learns more robust, distributed representations.
\[ \tilde{h}_i = h_i \cdot m_i, \qquad m_i \sim \mathrm{Bernoulli}(1-p) \]
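At test time, dropout is disabled and activations are rescaled by \((1-p)\) to preserve their expected magnitude. (Many frameworks use the equivalent inverted dropout instead: surviving activations are divided by \((1-p)\) during training, so nothing needs rescaling at test time.)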
Tip
For omics tabular data, dropout is most effective in the fully connected layers — not at the raw input, where every feature may already be scarce.
The dropout mask changes every batch, forcing the network to learn redundant representations.
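A minimal sketch of dropout as defined above, with a Bernoulli\((1-p)\) mask during training and a \((1-p)\) rescaling at test time. The function names, the drop rate `p = 0.3`, and the all-ones activations are assumptions for the illustration.

```python
import numpy as np

def dropout_train(h, p, rng):
    # m_i ~ Bernoulli(1 - p): each unit is kept with probability 1 - p
    mask = rng.random(h.shape) >= p
    return h * mask

def dropout_test(h, p):
    # No mask at test time; rescale to match the expected training magnitude
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(100_000)
p = 0.3

train_out = dropout_train(h, p, rng)
test_out = dropout_test(h, p)
print(train_out.mean(), test_out.mean())  # both close to 1 - p
```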
\[ \mathcal{L} = \mathcal{L}_{\text{data}} + \lambda \sum_i w_i^2 \]
Large weights are penalised quadratically. The model keeps all features but with modest coefficients.
\[ \mathcal{L} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i| \]
Promotes sparsity — some weights shrink to exactly zero, acting like automatic feature selection.
Warning
\(\lambda\) controls regularisation strength. Too small → no effect. Too large → underfitting. Tune it on validation data.
Tip
In transcriptomics or proteomics, L1 is attractive because sparse solutions are easier to interpret biologically.
L2 discourages extreme weights; L1 pushes many weights to exactly zero — automatic feature selection.
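The different behaviour of the two penalties shows up in their gradients: the L2 gradient shrinks with the weight, while the L1 subgradient has constant magnitude \(\lambda\) for any nonzero weight. A small sketch, with an arbitrary weight vector and \(\lambda = 0.1\) assumed for the example:

```python
import numpy as np

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)

def l2_grad(w, lam):
    return 2.0 * lam * w

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))

def l1_subgrad(w, lam):
    # sign(0) = 0 is one valid subgradient choice at zero
    return lam * np.sign(w)

w = np.array([3.0, -0.01, 0.0])
lam = 0.1

print(l2_grad(w, lam))     # proportional to w: tiny pull on small weights
print(l1_subgrad(w, lam))  # same pull on large and small weights
```

The constant pull on small weights is what lets L1 drive them all the way to zero.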
Batch norm standardises each layer’s output during training:
\[ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta \]
where \(\mu_{\mathcal{B}}, \sigma^2_{\mathcal{B}}\) are computed over the current mini-batch and \(\gamma, \beta\) are learnable. The mini-batch dependence also injects a small, healthy amount of noise.
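The training-time forward pass of batch norm can be sketched as below. It covers only the mini-batch statistics in the formula above; the running averages that frameworks keep for inference are left out, and the value of `eps` and the toy batch are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Statistics are computed per feature over the mini-batch (axis 0)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features

out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```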
Instead of hard targets \((0, 1, 0)\), soften them slightly:
\[ y^{\text{smooth}} = (1-\alpha)\,y + \frac{\alpha}{K} \]
This reduces overconfidence and improves calibration.
Note
Label smoothing is particularly useful in biological classification tasks where labels are noisy or borderline cases exist.
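Applied to the hard target \((0, 1, 0)\), the smoothing formula gives the following. The value \(\alpha = 0.1\) is an assumed, commonly used setting.

```python
import numpy as np

def smooth_labels(y_onehot, alpha=0.1):
    # (1 - alpha) * y + alpha / K, with K the number of classes
    K = y_onehot.shape[-1]
    return (1.0 - alpha) * y_onehot + alpha / K

y = np.array([0.0, 1.0, 0.0])
y_smooth = smooth_labels(y, alpha=0.1)
print(y_smooth)        # the true class keeps most of the mass
print(y_smooth.sum())  # still sums to 1
```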
Batch norm stabilises the learning signal; label smoothing prevents the model from becoming overconfident.
Monitor validation loss. Stop when it has not improved for \(k\) consecutive epochs (patience):
\[ \text{stop if } \mathcal{L}_{\text{val}}^{(t)} > \min_{t' < t} \mathcal{L}_{\text{val}}^{(t')} \;\text{for}\; k \text{ steps} \]
No extra model parameters are needed. Save the best checkpoint and restore it.
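The patience rule can be sketched as a function over a recorded validation-loss history. The function name and the example losses are made up; in a real loop the check would run once per epoch and the best checkpoint would be saved whenever the loss improves.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best: this is where a checkpoint would be saved
            best, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.6]
print(early_stopping_epoch(history, patience=3))
print(early_stopping_epoch(history, patience=5))
```

With a patience of 3 training stops at epoch 5 and restores epoch 2, missing the later improvement at epoch 6; a patience of 5 reaches it. This is the tradeoff behind tuning the patience.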
Create extra training examples with label-preserving transforms:
| Data type | Example transforms |
|---|---|
| Microscopy / histology | flip, rotate, crop, colour jitter |
| DNA sequences | reverse complement, masking |
| Protein sequences | token masking, substitutions |
| Expression matrices | mild noise after normalisation |
Warning
Augmentation must preserve the label. A transform that changes the biology is not regularisation — it is corruption.
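As one concrete example from the table, reverse-complement augmentation for DNA can be sketched as below. Whether it preserves the label is an assumption that depends on the task: it holds for strand-symmetric problems but not for strand-specific ones. The function name and the handling of `N` are illustrative choices.

```python
# Translation table for the DNA alphabet; N (unknown base) maps to itself
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    # Complement each base, then reverse the string
    return seq.upper().translate(_COMPLEMENT)[::-1]

def augment(sequences, labels):
    # Each sequence is paired with its reverse complement, same label
    aug_seqs = list(sequences) + [reverse_complement(s) for s in sequences]
    aug_labels = list(labels) + list(labels)
    return aug_seqs, aug_labels

seqs, labels = augment(["ATGCN", "GGTA"], [1, 0])
print(seqs)
print(labels)
```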
Early stopping controls training time; augmentation increases effective dataset diversity.
Use a model no larger than necessary. In small biological datasets, architecture size matters as much as clever regularisation.
A common recipe is weight decay + early stopping, with dropout added if the dense layers are clearly overfitting.
Split by patient, experiment, or batch — not by random rows. In bioinformatics, leakage hides in the structure of the data.
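A group-aware split can be sketched with plain NumPy: whole patients, not rows, are assigned to the validation set. The function name, the patient IDs, and the validation fraction are hypothetical.

```python
import numpy as np

def group_split(groups, val_fraction, rng):
    """Return (train_idx, val_idx) so that no group appears on both sides."""
    groups = np.asarray(groups)
    unique = rng.permutation(np.unique(groups))   # shuffle the group IDs
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_groups = unique[:n_val]
    val_mask = np.isin(groups, val_groups)
    return np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# One entry per sample: the patient each sample came from
patients = np.array(["P1", "P1", "P2", "P3", "P3", "P3", "P4", "P5"])
rng = np.random.default_rng(0)

train_idx, val_idx = group_split(patients, val_fraction=0.4, rng=rng)
print(sorted(set(patients[train_idx])), sorted(set(patients[val_idx])))
```

The same idea applies to experiments or sequencing batches: pass the batch ID as the group.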
Note
The most important idea is not a specific trick. It is the habit of asking: “Will this model still work on new biological samples?”

b2slab.upc.edu alexandre.perera@upc.edu