Evaluate a mathematical function on the unseen instance
Measure similarities between the unseen instance and the training instances
Training a model
Choose a parameterized model family (life
satisfaction = θ0 + θ1 ⋅ GDP_per_capita)
Find parameter values that maximize a fitness function or minimize a
cost function
No free lunch
There is no model that is guaranteed to work best on every problem; a lot of testing
must happen to choose and fine-tune the model.
Testing and validation
Overfitting
The model doesn’t generalize enough: it learns your specific training
data but underperforms on new data.
Possible cures:
Use a bigger dataset
Simplify model
Reduce the noise in the dataset
Underfitting
When performance is poor even on the training data
Possible cures:
Increase number of parameters
Add more features
Reduce the regularization parameters
ML workflow
Problems with bad datasets
Sampling bias: the dataset can be non-representative, e.g. if
it has an underrepresented class.
Garbage in == Garbage out: a bad dataset is guaranteed to lead to a bad
(trained) model.
2. End-to-end ML project
Workflow
Exploratory Data Analysis
(EDA)
Get an initial feel of the data
Visualize and gain insight
Prepare the data
What to do with missing
values
Remove entry
Imputation: replace by mean, median, 0, …
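A minimal sketch of both options with pandas and scikit-learn (the column names and values are made up):

```python
# Sketch: handling missing values by dropping or imputing (toy data).
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"rooms": [3, None, 5], "price": [200, 150, None]})

dropped = df.dropna()  # option 1: remove incomplete entries

imputer = SimpleImputer(strategy="median")  # option 2: also "mean", "most_frequent", or a constant
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```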
Categorical attributes
Attributes that can only take a limited number of values (eg: T-shirt
sizes)
one-hot-encoding: Use on categorical variables to
transform them into a format the model can understand.
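A minimal sketch with pandas (the T-shirt sizes are illustrative):

```python
# Sketch: one-hot encode a categorical column.
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})
one_hot = pd.get_dummies(df, columns=["size"])
# One binary column per category (size_L, size_M, size_S) replaces the text column.
```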
Feature scaling
Some extremely big or small features may have an abnormally large
impact on the model. This can be solved by rescaling them, e.g. with
min-max scaling (to [0, 1]) or standardization (subtract the mean,
divide by the standard deviation).
Linear regression
Assumption (Inductive bias): There is a linear
relationship between the input features and the target.
Price = θ0 + Bedrooms ⋅ θ1
θ0: intercept (bias)
θ1: slope (weight)
Goal: find the optimal parameters that define the line that best fits
the data
The prediction is the weighted sum of the input features with an
additional bias
ŷ = hθ(x) = θ ⋅ x
ŷ : prediction
x: input features, extended with a “1” value (as bias)
θ: model parameters
Linear regression training
Minimize some loss function that measures how good θ is (Root Mean Square Error): $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}$
Multiple options to find θ̂
Direct “closed form” solution
Gradient descent
Batch
Mini-batch
Stochastic
Direct solution
Directly calculate the optimal parameter given a labelled dataset
(rare)
Like generating a trendline in Excel
Ordinary least squares
Calculate the partial derivatives of the loss function with respect
to the parameters (Θ0, Θ1)
Set derivatives to zero
Solve the set of equations
Relies on matrix inversion: O(n³)
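A minimal NumPy sketch of this closed-form solution (the data is made up):

```python
# Sketch: ordinary least squares via the normal equation, θ̂ = (XᵀX)⁻¹ Xᵀ y.
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one input feature
y = np.array([2.1, 3.9, 6.2, 8.1])          # noisy targets, roughly y = 2x

X_b = np.c_[np.ones((len(X), 1)), X]             # prepend the "1" bias column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # [θ0, θ1] ≈ [0, 2]
```

Note the O(n³) matrix inversion mentioned above; with many features this becomes the bottleneck.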
Gradient descent
Gradient descent
It is a first-order iterative algorithm for finding a local minimum of a
differentiable multivariate function.
Generic optimization algorithm
Procedure:
Calculate the partial derivatives of the Loss function with respect
to the parameters (Θ0, Θ1)
Pick random values for (Θ0, Θ1)
and evaluate the partial derivative function. These describe how the
loss function changes when you change (Θ0, Θ1)
Adjust (Θ0, Θ1)
in the opposite direction of the gradient
Repeat until convergence
Learning rate determines step size
Convex vs non-convex optimization problem
Convex: there is no local minimum, only one global minimum. Gradient
descent is guaranteed to find it.
Non-convex: there are local minima, meaning the model can easily get
stuck in a bad solution.
Batch Gradient descent
Calculate the gradient using all available training data before
performing a step
✅ Uses all data -> accurate gradient, great model
❌ Very slow for large datasets
Stochastic Gradient descent
Loop over all training data, calculate the gradient and take a step
for each sample
✅ Fast
✅ Only uses memory for single sample instead of entire dataset
❌ Gradient estimate will be noisy:
unlikely to settle on the exact optimal solution
randomness might help to escape local minima
Mini-batch Gradient descent
Combination of Batch and Stochastic: compute the gradient on small
random subsets (mini-batches) of the training data (see the sketch below)
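A NumPy sketch of one epoch of mini-batch gradient descent for linear regression with an MSE loss (data and hyperparameters are illustrative); batch_size = len(X) gives batch GD, batch_size = 1 gives stochastic GD:

```python
# Sketch: mini-batch gradient descent for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((100, 1)), rng.random((100, 1))]      # bias column + 1 feature
y = X_b @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)  # targets for θ ≈ [1, 2]

theta = rng.random(2)
eta, batch_size = 0.1, 16  # learning rate (step size), mini-batch size

for step in range(len(X_b) // batch_size):
    batch = rng.permutation(len(X_b))[:batch_size]        # random mini-batch
    Xm, ym = X_b[batch], y[batch]
    gradient = 2 / len(batch) * Xm.T @ (Xm @ theta - ym)  # ∂MSE/∂θ on the batch
    theta -= eta * gradient                               # step against the gradient
```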
Polynomial regression
If the relationship is not linear. (eg: temperature and sales of ice
cream. Hot days mean lots of sales, but too hot and eating ice cream
becomes inconvenient)
Fit a non-linear model: y = f(x) = Θ0 + Θ1x + Θ2x²
Find (Θ0, Θ1, Θ2) with gradient descent
Transform the data to a space where it can be solved with a linear
model (see the sketch below)
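A scikit-learn sketch of this transform-then-fit approach (the toy data is made up):

```python
# Sketch: polynomial regression = polynomial feature transform + linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(0, 40, 50).reshape(-1, 1)   # e.g. temperature
y = (-0.1 * X**2 + 4 * X).ravel()           # sales rise, peak, then drop

X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # [x, x²]
model = LinearRegression().fit(X_poly, y)   # learns θ1, θ2 (and intercept θ0)
```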
Model regularization
There are 3 kinds of generalization errors (generalization = making
useful predictions on unseen data):
Bias error
Mistakes because of wrong assumptions when designing the model =
underfitting (eg: assuming linear relations)
Variance error
Mistakes made because the model is very sensitive to small variations in
the input = overfitting
Irreducible error
Noise in the data
How to regularize a model:
Limit the model’s expressive power -> less overfitting ->
better generalization
Add a regularization term that forces the weights to be small
Ridge regression (L2 penalty): pushes weights toward 0, but
rarely exactly 0
Lasso regression (L1 penalty): can drive weights to exactly 0 (->
sparse model)
Elastic net: weighted sum of Ridge and LASSO
Cross validation
If there is too little data to afford splitting into train/validation/test
sets, you can reuse data using cross-validation (see the sketch below):
Split data into K equal folds (K=5 or K=10)
Train using all folds but one (e.g. folds 1-9 for K=10)
Test using the remaining fold
Repeat K times, each time with a different fold as test set
Report the average accuracy and standard deviation
✅: All data will be used for evaluation
❌: Computationally expensive
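A scikit-learn sketch (the dataset and model are placeholders):

```python
# Sketch: K-fold cross-validation, reporting mean accuracy ± standard deviation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)  # K = 10
print(f"{scores.mean():.3f} ± {scores.std():.3f}")
```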
Hyper parameter optimization
Automated procedures to find good hyperparameters:
Grid search: try every combination of a predefined set
of values for each hyperparameter
Random search: sample random combinations of hyperparameter values
⚠️ : Optimizing hyperparameters is also a form of
overfitting!
Best practice:
Split off test set and do not touch it!
Develop your model, optimizing hyperparameters with random search
in combination with cross-validation (see the sketch below)
Retrain your model on all training data using the best hyper
parameters
Evaluate model on test data
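A scikit-learn sketch of this best practice (model choice and parameter ranges are illustrative):

```python
# Sketch: held-out test set + random search with cross-validation.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)  # do not touch X_test!

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(10, 200), "max_depth": randint(2, 10)},
    n_iter=20, cv=5, random_state=42)          # refit=True retrains on all training data
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))  # evaluate once on the test set
```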
Classification
Why not use linear regression for classification?
Regression models predict an exact value. Gradient descent will keep
changing the weights to fit the latest training data, introducing
errors for other data.
Logistic and Softmax
regression
⚠️ : Despite the name, Logistic and Softmax regression are
classification models, not regression models
Logistic regression
p̂ = σ(0.25*X1+0.125*X2−1)
in general: p̂ = hθ(x) = σ(xTθ)
With σ = sigmoid function:
$\sigma(t) = \frac{1}{1 + \exp(-t)}$
>0.5 if input is positive
<0.5 if input is negative
p close to 0 or 1: data lies far from decision boundary
p close to 0.5: data close to decision boundary
Training logistic regression
Goal: find the optimal parameters θ̂ that define the line that best
separates the data.
Optimize a cost function
Use Log loss
Train using gradient descent with partial derivatives
Softmax regression
For when there are more than two classes
Predict a probability for each class and normalize them to sum to
one using the Softmax
The model is then trained using gradient descent with the cross
entropy loss
5. Support Vector
Machines: Linear classification
Objective: Learn discriminating function g(x) that separates samples
of two categories.
Inductive Bias: g(x) is linear in its features (a line in 2D, a plane
in 3D)
g(x) = w1x1 + w2x2 + b = w ⋅ x + b = wTx + b
b = bias term
w = direction of discriminating function
Decision rule:
wTx + b ≥ 0
-> Sample belongs to class 1
wTx + b < 0
-> Sample belongs to class 2
Largest margin classifier
Update decision rule to include margin:
g(x) = wTx + b ≥ 1
-> Sample belongs to class 1
g(x) = wTx + b ≤ − 1
-> Sample belongs to class 2
Width of the street becomes: $\frac{2}{\lVert w \rVert}$
Margin functions:
g(x+) = wTx+ + b = 1
g(x−) = wTx− + b = − 1
To maximize the street width, we need to minimize ∥w∥. For mathematical
convenience, we will minimize $\frac{1}{2} \lVert w \rVert^{2}$ instead.
Importance
of standardization before margin classification
When the features are not standardized, the distance calculation is
dominated by the feature with the largest scale. Therefore it is
necessary to scale/standardize all values using the following formula:
$x_{0, scaled} = \frac{x_{0} - \mu_{x_{0}}}{\sigma_{x_{0}}}$
Meaning: w is a weighted sum of the support vectors: points on the
margin, plus points inside the margin or misclassified
α(i) = 0: if x(i) is on the correct side of the boundary and outside
the margin
α(i) = C: if x(i) is on the wrong side of the boundary or within the
margin
0 < α(i) < C: if x(i) is exactly on the margin
SVM for non-linear
classification
Example of non-linear data
The kernel trick
It is possible to project the data to a higher dimension and solve the
problem there using linear classification. The kernel trick computes
the dot products in that higher-dimensional space directly, without
ever performing the projection explicitly.
❗: Due to hidden constants, the complexity of the solver scales
badly with the number of features, so explicitly projecting and solving
the generalized Lagrangian is no longer efficient; the kernel trick
avoids this.
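A scikit-learn sketch of a kernelized SVM on non-linearly-separable data (hyperparameters are illustrative):

```python
# Sketch: RBF-kernel SVM; standardize first (see the note on standardization above).
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)  # two interleaved half-moons

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)  # the kernel trick: no explicit high-dimensional projection is computed
```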
6. Decision Trees
Break down a decision into a sequence of smaller decisions, each based
on a single feature.
Example:
X1 = Age
X2 = Sex
X3 = Cholesterol
Y = Patient has heart disease
-> Age > 60?
├ yes -> Cholesterol > 300?
│ ├ yes -> Y = 1
│ └ No -> Y = 0
└ No -> Sex = Male?
├ yes -> ...
└ No -> Y = 0
Decision trees
for classification vs regression
Classification: predict class membership; leaves store class probabilities
Regression: predict a continuous value; leaves store values
Internal nodes define a split of the remaining data points
All items in the same subspace have the same prediction
Building a decision tree
Given a dataset with features X and labels Y, split data in two by
asking questions
Question = compare one feature with a threshold
Iteratively split two subsets again by asking another question
Repeat until no data is left or the tree has reached a certain height
CART
(classification and regression trees) algorithm
❗: High probability for a practical question on the exam!
Loop over all features
Pick a threshold
Split data points in two based on this threshold
Measure how good split is with a cost function
Pick best split
Repeat until a split is empty or the maximum tree height is reached
m = total number of instances in a node before the split
Gini impurity = $1-\sum_{i=1}^{C}(p_{i})^{2}$
C = number of classes
pi =
fraction of data points in the subset that belong to class i
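A small worked example of the formula (the class counts are made up):

```python
# Worked example: Gini impurity of a node with 8 samples of class A and 2 of class B.
p_a, p_b = 8 / 10, 2 / 10
gini = 1 - (p_a**2 + p_b**2)  # 1 - (0.64 + 0.04) = 0.32
# Pure node (10/0): gini = 0. Worst two-class split (5/5): gini = 0.5.
```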
Entropy
Instead of the Gini impurity we can also use Entropy
Entropy
Average level of “information” inherent to the variable’s possible
outcomes
In decision tree, entropy is zero if a particular node only contains
samples of a single class.
Regularization
Decision trees, too, can overfit. Therefore regularization is
important. It limits the freedom of the model via hyperparameters:
Min_samples_leaf
Min_weight_fraction_leaf
Max_height
Max_leaf_node
Are decision trees any good?
✅: Easy to understand
✅: Can model non-linear data
✅: Whitebox model (eg: useful for medical prediction to know what
influenced the diagnosis prediction)
✅: Easy to implement
✅: Allows to determine feature importance
✅: Supports multiple outputs
❌: Unstable (small change in data can drastically change
model)
❌: All decisions are made by orthogonal decision boundaries (only
straight lines perpendicular to feature axis)
❌: Relatively inaccurate
❌: Prone to overfitting
7. Ensembles
Combine many weak models into one strong model.
Procedure:
Make sure all models learn different things
Combine individual predictions
Different techniques: Voting, Boosting and Stacking
💡: Decision trees are a good candidate for ensembles: small changes
in the data can produce a completely different tree, and this diversity
is a good quality for ensembles.
Voting
Train models on slightly different datasets and combine them using
average or voting
Methods to make different versions:
Bagging: sample from the dataset with
replacement
Pasting: sample without replacement
Random subspaces: use different subsets of the
features
Random patches: select both random data points and
random feature subsets
Boosting
Train different models that correct each other’s mistakes.
Unlike voting, training is sequential and cannot be performed in
parallel.
Adaboost
Each sequential model focuses more on the data points that had
incorrect predictions for previous models
Two types of weights:
Instance weight w(i): weight of each sample; focuses training on
samples that were misclassified by the previous models
Predictor weight α(i): weight of each model, used to calculate the
final prediction
1. Initialize every instance weight to 1/m
2. Train the model (calculate its weighted error rate)
3. Calculate the predictor weight
4. Update the instance weights and normalize them
5. Repeat 2-4 until the max number of predictors is reached or
performance is sufficient
6. Calculate the weighted prediction of the different models
Gradient boost
Each sequential model tries to predict the residual error of the
previous models.
Final output is the sum of the predictions of the individual
models (see the sketch below)
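A hand-rolled sketch of this residual-fitting loop using small regression trees (all settings are illustrative):

```python
# Sketch: gradient boosting with squared loss; each tree fits the remaining residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 3 * X.ravel() ** 2 + rng.normal(0, 0.05, 100)

trees, residual = [], y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    residual -= tree.predict(X)  # the next tree targets what is still unexplained

y_pred = sum(tree.predict(X) for tree in trees)  # final output = sum of all predictions
```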
Stacking
Train a model using predictions of other models as input. Meaning
that the model will combine predictions of other models.
Split train set in two
Train different models on the first part
Extract the predictions of the models for the second part =
Blending data set
Train a model on the predictions of the first models
8. Dimensionality Reduction
Every data point is a point in an n-dimensional space.
Curse of dimensionality
As the number of features (= dimensions) grows, the amount of data we
need to generalize grows exponentially.
Data becomes sparser: instances are further from each other in
higher dimensions
Distance loses its meaning: all points become equally far
away from a given point
Manifold hypothesis
The hypothesis states that data is actually lower-dimensional than the
original input space. (Therefore, we should be able to reduce dimensions
while preserving the essential information in the data)
Approaches to dimensionality reduction:
Feature selection: Only keep the useful
features
Feature projection: Apply transformation and
projections to derive new features
Linear dimensionality
reduction
Dimensionality reduction on linear data.
Principal
Component Analysis (PCA): geometric interpretation
Procedure:
Minimizing the projection error is equivalent to maximizing the
variance of the projected data
Find direction of largest variance in data (eigenvector with largest
eigenvalue) (1st principal component)
Find direction of remaining largest variance, orthogonal to the 1st
principal component (2nd principal component)
etc…
How to find the optimal components?
Calculate the normalized eigenvectors of the covariance matrix.
(the matrix that contains the covariance between every pair of
variables in a dataset)
The eigenvectors v of a matrix A are the
vectors such that multiplying A with v yields a multiple
of v: Av = λv
The vector still points in the same direction but has a different
length
The scalar λ is called the eigenvalue that belongs to
v.
U = orthogonal matrix containing the left singular vectors
Σ = diagonal matrix with the singular values on the diagonal
V = matrix whose columns contain unit vectors corresponding with the
principal components
PCA steps
Center-mean the dataset
Compute eigenvectors and the corresponding eigenvalues of the
covariance matrix (via SVD (Singular Value Decomposition))
Sort the eigenvectors by decreasing eigenvalues
Select the eigenvectors of the d largest eigenvalues
Form an n x d matrix W containing the
selected eigenvectors as columns
Use eigenvector matrix to transform the samples onto the new
subspace
❗: Make exercises on slides 25 and 26!
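A minimal NumPy sketch of these steps via SVD (the data and the choice d = 2 are arbitrary):

```python
# Sketch: PCA by hand, following the steps above.
import numpy as np

X = np.random.default_rng(0).random((100, 3))
X_centered = X - X.mean(axis=0)       # 1. center the dataset

U, s, Vt = np.linalg.svd(X_centered)  # 2-3. SVD returns singular values sorted descending
d = 2
W = Vt[:d].T                          # 4-5. n x d matrix with the top-d components as columns
X_reduced = X_centered @ W            # 6. project the samples onto the new subspace
```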
Non-linear dimensionality
reduction
One such technique is
Locally Linear Embedding (LLE). In this technique, you preserve
the original relative distances to near neighbours in the reduced space
(reduced dimension).
9. Unsupervised learning
Distance-based algo
Goal: Find a group of similar instances
Clustering algo assigns a cluster number to each sample
Unsupervised: There is no ground truth to evaluate
the performance of the algorithm
K-means clustering
Strategy:
1. You want to make k clusters (the desired number of clusters)
2. Randomly select k points from the dataset as initial cluster
centroids C(1), …, C(k)
3. For each data point x(i) in the dataset, calculate its distance to
each centroid and assign the data point to the nearest centroid:
$b^{(i,k)} = \begin{cases} 1 & \text{if} \quad k = \operatorname{argmin}_{j} \sum_{l=1}^{n} (x_{l}^{(i)} - C_{l}^{(j)})^{2} \\ 0 & \text{otherwise} \end{cases}$
4. Calculate new values for the centroids: $C_{new}^{(k)} = \frac{\sum_{i=1}^{m} b^{(i,k)} x^{(i)}}{\sum_{i=1}^{m} b^{(i,k)}}$
5. Repeat 3-4 until convergence criteria are met
💡: Possible stop criteria: length of cluster center shift, #
iteration, …
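A plain NumPy sketch of the loop above (it ignores the empty-cluster edge case):

```python
# Sketch: k-means from scratch.
import numpy as np

rng = np.random.default_rng(0)
X, k = rng.random((200, 2)), 3
C = X[rng.choice(len(X), k, replace=False)]  # step 2: random initial centroids

for _ in range(100):
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (m, k) squared distances
    labels = dists.argmin(axis=1)                               # step 3: nearest centroid
    C_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 4: update
    if np.allclose(C, C_new):  # stop criterion: centroids stopped moving
        break
    C = C_new
```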
Problem
If initial randomly selected centroids are bad, it may never find the
right clusters.
Solution
Restart the k-means algo several times and pick the clustering with the
lowest inertia (= sum of squared distances of each point to its
closest centroid)
❌: cannot deal with difference in density between clusters
Variance of features must be similar for all
clusters
The silhouette score is a measure used to evaluate
the quality of clustering in a dataset. It provides an indication of how
well-separated the clusters are. This score ranges from -1 to 1,
where:
close to +1: the sample is far away from neighboring clusters,
indicating good separation.
close to 0: the sample is close to the decision boundary between two
neighboring clusters.
close to -1: sample might have been assigned to the wrong
cluster.
a(i): mean distance from the point to all other points in the same cluster
b(i): mean distance from the point to all instances in the next closest
cluster
Silhouette of a point: $s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max(a^{(i)}, b^{(i)})}$
DBScan clustering
Density-Based Spatial Clustering of Applications with Noise
is a clustering algorithm particularly effective at identifying clusters
of arbitrary shapes and handling noise in the data.
Core instance: point with at least minPts points
within distance ϵ
Pick a random point p
init a new cluster
inspect p
if p is not core: consider as noise
if p is core:
add all points within distance ϵ to cluster
recursively inspect all points in the neighbourhood
Sample new unclustered point and repeat
✅: no need to manually select number of clusters
✅: find clusters of arbitrary shape
✅: works well when clusters are dense enough and well-separated by
low-density regions
❌: not deterministic
❌: uses a distance metric (beware of scaling and the curse of
dimensionality)
❌: difficult to choose right hyperparam
Hierarchical clustering
Cluster data into a hierarchy of groups or clusters. It doesn’t
require specifying the number of clusters beforehand.
Hyperparams:
Affinity: distance metric between points
Linkage function: distance between clusters
complete linkage: max distance between elements of different
clusters
single linkage: min distance between elements of different
clusters
Probability-density
based (density estimation): Gaussian Mixture Model
Generative modelling
A ML technique used to create models that can generate new data
similar to a given dataset. This is commonly used for image or text
generation.
Inductive bias: instances in the dataset are samples
from an unknown statistical distribution over a feature space:
x(i) ∼ Pdata(x)
Task: learn from the data a probability density
function that matches this unknown distribution: Pmodel(x) ≈ Pdata(x)
Distribution family
Inductive bias: data is sampled from a member of a
particular distribution family: Pmodel(x) = Pmodel(x;θ)
Task: find the “best” values for the parameter
vector θ which we denote as
θ*
Use cases for generative
models
Anomaly detection: if Pmodel(xnew;θ*)
is lower than threshold -> anomaly
Classification: generate a new sample with the
features of the new data and add a feature that represents a class. Do
this for every class and then compare the probabilities. Pmodel(x1, new,x2, new,y=1,θ*) ≫ Pmodel(x1, new,x2, new,y=2,θ*)
Where x1 and x2 are features from the data
and y is the class
Generating new samples
How to sample from a
distribution?
Construct the inverse cumulative distribution function CDF⁻¹
Sample a random number u in [0, 1[ and compute xnew = CDF⁻¹(u)
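A sketch of this recipe, using the exponential distribution because its inverse CDF has a closed form:

```python
# Sketch: inverse-CDF sampling from Exp(1), whose CDF is 1 - e^(-x).
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(1000)     # uniform samples in [0, 1[
x_new = -np.log(1 - u)   # inverse CDF applied to u
# x_new now follows the exponential distribution.
```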
Density estimation:
Maximum likelihood
Inductive bias: a unimodal Gaussian distribution
Task: find “best” values for θ = [μ,σ]
Likelihood
function (to find best distribution)
Every item x in our dataset is a sample of an unknown
distribution Pdata(x).
We assume that this distribution is a member of a family of
distributions Pmodel(x;θ).
The best model should maximize the joint probability of
all items x(i).
In this formula, ϕ is the weighting factor with which to rescale the
individual distributions to maintain a valid density function
(surface = 1).
MLE of Gaussian Mixture
Model
The MLE (Maximum Likelihood Estimation) of a GMM involves finding the
parameters (μ, σ)
that best fit the observed data under the assumption that the data comes
from a mixture of Gaussian distributions.
However, under a GMM the likelihood doesn’t have a closed mathematical
solution, so we can’t simply maximize the likelihood function in one
step. Instead, we use our prior knowledge (inductive bias):
If we know which source or component generated each data point x,
we can estimate the parameters μ and σ by fitting a Gaussian
distribution.
If we know the means and variances, we can calculate the likelihood
of each data source to have generated x.
Soft labeling
We can label data by comparing the confidences 𝒩(x;μi,σi)
of the data under each GMM component.
Expectation-Maximization
algorithm (EM)
Calculate soft-labels based on current distribution params
(expectation step)
Update distribution params (maximization step)
Repeat
Soft Clustering
In contrast to k-means clustering, where the model decides which single
group a sample belongs to, soft clustering quantifies the likelihood
that a sample belongs to each group.
Process:
Identify the number of clusters you want to split the data
into.
Define each cluster by generating it a Gaussian model.
For every sample, calculate the probability it belongs to a certain
cluster.
Recalculate the Gaussian models using the above probabilities.
Repeat until the probabilities converge (see the sketch below).
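A scikit-learn sketch; GaussianMixture runs EM internally (the toy data is made up):

```python
# Sketch: soft clustering with a two-component Gaussian mixture.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)  # per-sample probability of belonging to each component
hard = soft.argmax(axis=1)   # collapse to hard labels if needed
```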
multivariate GMM
Multivariate unimodal Gaussian
A multi-dimensional (multivariate) Gaussian distribution that has a
single peak or mode: the data tends to cluster around a single central
point.
Multivariate multimodal Gaussian
A multi-dimensional (multivariate) Gaussian distribution that has
multiple peaks. For example, a dataset of dogs: every dog breed in that
distribution has its own multivariate unimodal Gaussian distribution
(e.g. height × weight).
10.
Introduction to Artificial Neural Networks with Keras
10.1 Wave 1: Perceptron
Perceptron
Single-layer NN that aims to solve binary classification problems.
Different components:
Inputs
Weights
Weighted sum
Step function
Weights get updated using the following formula:
$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta (y_{j} - \hat{y}_{j}) x_{i}$
wi,j(next step): next weight value
wi,j: old weight
η: learning rate (constant that determines the step size)
yj − ŷj: target − prediction (binary values)
xi: input value
The problem with the perceptron is that it is only capable of producing
linear decision boundaries (it is not even able to solve XOR). See the
sketch below.
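A NumPy sketch of the perceptron learning rule on a toy AND problem (which, unlike XOR, is linearly separable):

```python
# Sketch: perceptron trained on the AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])   # AND of the two inputs

X_b = np.c_[np.ones(4), X]   # prepend "1" so the bias is just another weight
w, eta = np.zeros(3), 0.1

for _ in range(20):
    for x_i, y_i in zip(X_b, y):
        y_hat = 1 if x_i @ w >= 0 else 0   # step function
        w += eta * (y_i - y_hat) * x_i     # w ← w + η (y − ŷ) x
```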
10.2 Wave 2: Distributed
representations
The concept of connectionism gained traction: every
input should be represented by multiple features, and every feature
should be involved in the representation of many inputs.
Multi-layer perceptrons (MLPs) were also introduced.
The added hidden layers, trained with backpropagation, enable
non-linear classification boundaries.
10.3 Wave 3: Deep learning
Because of the better hardware, we were able to make deep neural
networks. And train them with more readily available data. A deep neural
network consists of multiple stacked layers of neurons, typically
trained using backpropagation (gradient descent).
“Deep” refers to the fact that the model is able to
extract features from raw information instead of having humans extract
them manually.
Logistic regression can be seen as the simplest neural network: a single neuron with a sigmoid activation.
Activation functions
By default, each layer calculates a linear combination of the outputs
of the previous layer. By adding non-linear activations, we can make the
network behave in a non-linear way (most common: ReLU). See the sketch
below.
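A minimal Keras sketch of such a network (layer sizes and the 3-class output are arbitrary):

```python
# Sketch: small MLP with ReLU activations and a softmax output.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # one probability per class
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10) would then train it with backpropagation.
```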
Neural network training
There is no closed solution to finding the minimal loss as the loss
function is non-convex. Instead we rely on gradient
descent which is not guaranteed to find the global minima.
Backpropagation :Algorithm to calculate the gradient
with respect to the parameters of hidden layers.
Neural
networks for unsupervised learning / dimensionality reduction
Autoencoder: model trained to reconstruct its own
input
Autoencoder has a bottleneck between the encoder
and decoder, effectively compressing the data.
After compressing, a decoder tries to reconstruct
it to original state
If there were no correlations in the data, this would not work.
If there are, that structure can be learned
Autoencoders are used for:
Dimensionality reduction: discard decoder after training
Anomaly detection: reconstruction error is anomaly metric
Unsupervised pretraining
Neural networks have many hyperparams that need to get
configured:
Num of layers
Num of neurons/layer
Activation function
Learning rate & batch size
Training algo and loss function
Initialization of params
Noise and data augmentation
Regularization
11. Training Deep Neural
Networks
Vanishing / exploding
gradients
By stacking multiple layers with sigmoid activations, the gradients
become very small for the first layers (vanishing
gradients), so those layers barely change and are not trained
well.
Situation:
Gradients are propagated backward through layers, they get
multiplied by weight matrices and activation functions’
derivatives.
Activation functions have regions where their derivatives become
very small ( ≈ 0)
The deeper the gradient travels back through the network, the more the
effect of small derivatives gets compounded, until the gradient
effectively “vanishes”.
Solution 1: Weight
initialization
The distribution of the initial weights can have a big impact
on model training. It is best to give every weight a small random
initial value. A common strategy is to draw each weight from a
normal distribution (e.g. Glorot or He initialization).
Solution 2: Activation
functions
⚠️: Sigmoid activation functions are not the best choice because of
saturation. Saturation occurs because the activation
function squashes the input values to a number in the range of [0, 1].
Extreme input values (small or large) get a value close to 0 or 1 and
the gradient becomes very small.
To avoid saturation, the use of the ReLU (Rectified Linear Units)
activation function, or one of its derivatives, is preferred. (ReLU
returns the value itself if it is bigger than 0)
Solution 3: Batchnorm
Additional layers after neural network layers
Normalizes the activations by subtracting a mean and dividing by the
standard deviation of the activations across the batch during
training
Keeps track of a moving average of these statistics to be used for
inference.
It is supposed to reduce “internal covariate shift”, but it is not
quite understood how or why this happens.
Optimization algorithms
There are some algorithms that can perform better than mini-batch
gradient.
Momentum: Use previous gradient to keep track of
the direction (faster and can escape local min/max)
Adaptive learning rate: Algo that automatically
uses different learning rate for different parameter
Second order method: Use second order partial
derivatives to estimate curvature around a point (expensive but gives
more insight in how far you are from optimum)
Learning rate scheduling
💡: Learning rate determines the size of the step the model takes
when searching for minimum. It is perhaps the most important hyper
parameter to tune. A good strategy is to start with a high learning rate
and decrease it over time (decay).
Regularization
Regularization limits the freedom of the model. If you let the model
learn too much, it becomes prone to overfitting.
A couple regularization techniques:
Use more training data
Use data augmentation and randomness
L1 and L2 regularization
Dropouts
Randomly set some nodes to zero during training
This increases robustness
Forces each neuron to be more independent
Normalization
Normalization is the process of scaling and transforming the input
features to a standard scale. The goal is to bring all features to a
similar scale, preventing certain features from dominating the learning
process simply because they have larger magnitudes.
Techniques:
Min-max normalization: Scale to [0, 1]
Zero-center: subtract mean
Standardize: subtract mean, divide by standard deviation
Hyper parameter tuning
This can be done manually (“grad student descent”) or automatically
(grid search, random search, …).
A good general approach:
Use a simple model without regularization
Sanity check the loss
Overfit on a small subset of the data
try different models
try different learning rates
Train the most promising models on the full dataset for a few
epochs, try different optimizers
Apply regularization and train on the full dataset
14.
Deep Computer vision using convolutional neural networks
When working with input data with higher dimensions (like images),
traditional neural networks face some problems.
Let’s take an image for example (200x200x3): first of all, this data
has to be reshaped to 1D (flattened).
Problem 1: The model needs to learn that certain
inputs are related
Problem 2: 120000 input nodes + hidden layers make
for a very large neural network (consumes a lot of memory and is less
efficient to train)
Convolutional neural
networks
(CNN’s) are designed to solve these problems.
Inputs are kept in 2D or even 3D (the convolutional layers
also keep these dimensions)
Not all neurons are connected to all neurons from previous layer.
Instead, every neuron is connected to some patches (e.g. 3x3) of neurons
from the previous layer
Hierarchical structure
First layer (most nodes) focuses on small low-level features
Deeper layers assemble these into higher level features
Weights are reused
All neurons on the same layer use the same weights
These weights define a kernel (filter) that slides
over the input
By convolving the inputs with a suitable kernel, we
obtain a filtered version of the image, or a feature
map. Depending on the kernel, it is also possible to detect
spatial features (like horizontal vs vertical lines or
certain color transitions). A single layer applies N filters and thus
generates N feature maps. The weights of these filters are learned
using gradient descent.
Convolutional layer output
size
The output size is determined by the size of the kernel and the
stride (the step between the places the filter is applied). For input
width W, kernel size F, padding p and stride s:
$W_{out} = \frac{W - F + 2p}{s} + 1$
p: conv padding size (sometimes zero
padding is applied so that every value of the input is taken
into account)
s: conv stride size
Convolutional layer
number of parameters
number of parameters = Number of kernels * kernel size + number of
kernels (bias term)
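A worked example of this count (the layer configuration is made up):

```python
# Worked example: Conv2D with 32 kernels of size 3x3 over an RGB (3-channel) input.
n_kernels, k_h, k_w, in_channels = 32, 3, 3, 3
params = n_kernels * (k_h * k_w * in_channels) + n_kernels  # weights + one bias per kernel
# 32 * 27 + 32 = 896 parameters
```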
Pooling
After performing a convolution, the feature maps can still be quite
large. A solution to this is to subsample the features from the feature
maps. In max pooling this is done by dividing the
feature maps into slices. For every slice, only the most dominant (max)
value is kept. The result is a smaller feature map with one value per
slice. It is also possible to pool with strides.
$$W_{out} = \frac{W - F}{S} + 1$$
Wout = output width/height
W = input width/height
F = pool size
S = pool stride
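A worked example of the formula (sizes are made up):

```python
# Worked example: 2x2 max pooling with stride 2 on a 224-wide feature map.
W, F, S = 224, 2, 2
W_out = (W - F) // S + 1  # (224 - 2) / 2 + 1 = 112
```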
✅: Reduce memory required
✅: Reduce computational cost
✅: Allow next layer to have larger receptive field
✅: Improve invariance to small translations of the
input
❌: Some loss of information
CNN architecture
The final network is obtained by stacking
Conv layers
Non-linear activations
(opt) batchnorm
Pooling
To perform classification, the model is typically followed by:
flatten
fully connected layer
activation function (softmax)
Data augmentation
For CNNs, a common technique is to artificially increase the size of
the training set by generating variants of existing training instances
(rotate, flip, blur, crop, …). This often reduces overfitting.
15. Processing
Sequences using RNNs and CNNs
Recurrent Neural Networks
NNs with neurons that have connections (weights) that point backwards.
This characteristic is desirable when working with sequential data like
time series prediction, speech recognition, NLP, video analysis, …
The output of step t is fed back in together with the input
at step t+1:
$y_{(t)} = \phi(W_{x} x_{(t)} + W_{y} y_{(t-1)} + b)$
RNN architectures
There are a couple different RNN architectures
Seq2seq
For when both input and output are sequential. A use case of this
could be text-to-speech conversion.
Seq2vec
Input is sequential, but output is a fixed-size vector. A use case
for this could be sentiment analysis, or a speech-recognition tool
that only has to understand “hey google”.
Vec2seq
For when the input is a fixed-size vector, but the output needs to be
sequential. A use case for this could be image captioning; a model gets
an image as input and has to describe it.
Encoder-decoder
A combination of Seq2vec and Vec2seq, it serves as an alternative for
Seq2seq. A use case for this could be a language translation model the
translation system needs more than just one world, as the word could be
different depending on the context.
Backpropagation through time
To train an RNN, we unfold the network over time and train it like a
feed forward model.
Since the output of a neuron at a certain time step is a function of
all the inputs at previous time steps, a neuron in an RNN is also called
a memory cell (it preserves state).
RNNs are known to have some issues during training:
RNNs are slow and unstable (especially for long sequences)
Saturation: the same weights are used in every
step, so we need to make sure that the output does not keep
increasing.
Long-Short Term Memory
(LSTMs)
RNNs output depends on the input of all previous timesteps, but for
long sequences, RNNs tend to forget the first inputs. That is why LSTM
models explicitly define a memory slot.
Two memories:
Short term (previous outputs)
Long term
Model decides what to store in long term
output is based on input and both memories
💡: On exam, not necessary to reproduce schema, but know how to
explain it!
Two states
c(t): long-term state
h(t): short-term state
Three gate controllers
Forget gate: controls which part of long term gets
erased
Input gate: controls which part of
g(t) should get added to long term
Output gate: Which part of long term should get
added to output
Gated Recurrent Units (GRU)
A simplified version of LSTMs that work equally well.
Single state h(t)
Single gate controller layer which controls both forget gate and
input gate
No output gate
Different gate controller which decides what part of the state will
be used in current step
16.
Natural Language processing with RNNs and attention
(Transformers)
Natural Language
Processing (NLP)
ML techniques that use text as input and/or output.
Different approaches:
Traditional: feature extraction + ML model
RNN
Transformer based
Basic textual feature
extraction
ML algos typically require numeric input data
N-grams: often, words are grouped into sequences of N words. For
example, the bi-grams of the sentence “This is a sentence” are:
["This is", "is a", "a sentence"]
Large vocab -> sparse inputs
Text vectorization: convert each document to a vector of numerical
features
$$w_{x,y} = tf_{x,y} \cdot \log\left(\frac{N}{df_{x}}\right)$$
w: weight of word x in document y
tf: frequency of word x in document y
N: number of documents
df: number of documents where word x occurs
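A worked example of the formula (all counts are made up):

```python
# Worked example: tf-idf weight of one word in one document.
import math

N = 10   # total number of documents
tf = 3   # occurrences of the word in this document
df = 2   # documents containing the word
w = tf * math.log(N / df)  # 3 * log(5) ≈ 4.83
```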
RNNs for text generation
(seq2seq)
Pass a single input character at a time and predict the next
character in the sequence.
At inference time, we predict one char at a time passing the
previous output as input
Unsupervised pretraining: The text encodings learned by the
model can be useful for other tasks like sentiment analysis.
RNNs for text
classification (seq2vec)
Classification of text could be sentiment analysis. For the purpose
of classification, it’s best to pass the entire sentence as input. Every
word gets assigned a unique ID, and only the N most frequent words get
used to classify. The network typically has one embedding layer
that serves as a lookup table that converts every word to a learned
vector representation. The vector representation usually conveys how
close certain words are to each other.
Neural machine
translation (encoder-decoder)
Encoder receives input sentence, one word at a time
Decoder receives encoder state and correct previous output word
Bidirectional RNN
The representation of a word typically depends on both the previous
and the next word. That’s why a second RNN gets added that reads the
sentence backwards. Then, concatenate the two states of the encoder RNNs
before passing it to the decoder.
Beam search
Keep track of the k most likely potential output
sentences. Try to extend all of these sentences with one additional
word.
Attention mechanisms
The encoder-decoder architecture struggles with long sentences
because all the info needs to be encoded in the state. To solve this
issue, an attention mechanism gets used.
The attention mechanism keeps track of info related to all words in
the input sentence and allows the decoder to selectively attend to
parts of the input.
We store the encoder’s output while preprocessing the input
At each timestep, the decoder receives a weighted sum of those
outputs as input.
Weights are predicted by an alignment model based on the decoder’s
current state and encoder outputs.
Multi-head attention
For each word in the input sentence, the attention applies a linear
layer to obtain three tensors: key, value, query.
It then calculates the similarity between the word’s query vec with
all other key vecs. These similarities are used to calculate a weighted
average. This is repeated a couple of times to allow the model to
extract different types of relationships between input tokens.
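A NumPy sketch of (single-head) scaled dot-product attention; the random Q, K, V stand in for the outputs of the three learned linear layers:

```python
# Sketch: scaled dot-product attention for one head.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 5, 8   # sentence length, embedding size
Q, K, V = (rng.random((n_tokens, d)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)   # similarity of every query with every key
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
out = weights @ V               # weighted average of the values
# Multi-head attention runs this several times with different learned projections.
```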
Transformers
An architecture that does not make use of RNN or CNN, just fully
connected layers + attention. It does not suffer from vanishing gradient
and is better at capturing long-range patterns.
The sentence is split word by word
Each word is replaced by a learnable embedding
The sentence is then represented by a 3D tensor
Tensor is transformed while going through the model but the
dimensions remain the same
The encoder transforms every input word into a high level
representation that captures the meaning of that word in this specific
sentence. While the decoder transforms the embedding of a translated
word into the embedding of the next translated word based on the
information extracted by the encoder.
GPT
Generative pretrained transformer
Pretrained: predict the next token in the sentence
Finetuning: a set of different tasks (summarization, text
classification or sentiment analysis)
Previous models used supervised learning (e.g. translation) but
labeled datasets are expensive to collect
Vision transformer
Transformers can also be used for image classification, as an
alternative to ConvNets (e.g. the Vision Transformer)
17. Autoencoders, GANs
and diffusion models
Representation
learning and generative modelling
Representation learning:
Learn good feature descriptions from data without manually defining
them
Autoencoders can be used to do it in an unsupervised way
Generative modelling:
Can generate realistic “fake” samples
Can do stuff like colorization, image editing, future frame
prediction, …
The encoder transform data to a latent
representation (a compressed representation)
The decoder uses this representation to reconstruct input
Trained using gradient descent
Manifold hypothesis
A hypothesis which explains why autoencoders work. It suggests that
high-dimensional data, such as real-world data, lies on or near a
lower-dimensional manifold within the high-dimensional
space. In other words, even though data might exist in a
high-dimensional space, it often occupies a lower-dimensional structure
or subspace.
By assuming that real-world data lies on a lower-dimensional
manifold, autoencoders learn to represent and generate data in a way
that captures the essential structure and variations within the
data.
Autoencoder applications
Dimensionality reduction: sometimes autoencoders
(the encoder) are better than PCA because, in contrast to PCA, they
can also learn non-linear mappings
Unsupervised pretraining: when labeled data is
scarce:
Train an autoencoder on all the data (no labels needed)
Build a second model and reuse the trained encoder as its lower
layers, adding a classification layer on top
Train the second model
Anomaly detection: train the autoencoder on good data
only; anomalous inputs then show a high reconstruction error because
the model never learned to reconstruct them.
Autoencoder variants
💡: These are not all that common, so likely not an exam question
Tying weights: reduce the number of params by sharing the
same weights between the encoder and decoder
Greedy layerwise training: first train autoencoder
with two layers, then train in between these two layers
Sparse autoencoder: add regularization term during
training to encourage the learning of sparse representation
Convolutional autoencoder
For image inputs, it’s a good idea to use convolutional layers in the
autoencoder. The encoder reduces the spatial size and increases the
number of feature maps. The decoder increases the spatial size and
decreases the number of feature maps.
Transposed convolution
💡: Also not that common , so likely not an exam question
It’s an alternative for decoding, sometimes called “deconvolution”.
It upsamples the spatial dimensions of the input, equivalent to a normal
convolution with 0-padding and a flipped kernel.
Denoising autoencoder
Autoencoders can also be used to remove noise. Random noise is
applied to the inputs, and the model is assessed against the original
input (without the noise), so it learns to remove noise.
Variational autoencoders
In contrast to normal autoencoders, the output is not deterministic,
rather, it is probabilistic. It is also generative, so it can generate
new samples.
Latent vectors consists of two parts:
Mean coding (μ)
Standard deviation (σ)
During training, the encoder predicts both μ and σ.
Trained using two loss functions:
Reconstruction Loss (e.g. MSE)
Latent loss (to force the model to output codings that look like a
gaussian distribution)
After training we can sample from N(0,1) and pass these sampled
latent codes to the decoder to generate new samples.
VAEs encourage disentanglement of the latent factors.
Generative adversarial models
(GAN)
Two NNs get trained by competing with each other
Generator: generates a data sample based on
noise
Discriminator: Tries to distinguish between real
and fake samples
The generator never sees real samples, it is trained to generate
samples that will fool the discriminator.
Problems
Discriminator and generator compete with each other, this results in
a complex optimization problem
Very sensitive to hyper params
Mode collapse: the generator’s outputs become less
diverse
Diffusion models
Training
Keep adding noise to an image to generate increasingly noisy
versions
Train a denoising autoencoder to restore a less noisy version from a
noisy image
Generating a new image: start from random noise and apply all
denoising models in sequence
✅: Easy to train
✅: Can generate more diverse images
❌: Slow to generate new image
Latent diffusion models
Combine VAE and a diffusion model. The diffusion model operates on the
latent code of the VAE (generation is faster)
Stable diffusion
Trained on images together with human-provided captions. The text
is preprocessed by a transformer-based encoder. This approach powers
text-to-image generation models such as Stable Diffusion and DALL·E.
Regression = predict a target value from a sample’s
features