Yanaël Barbier

I build my path with an export!

Day-to-day: Learning Machine Learning

Machine Learning and Deep Learning are trendy topics, and the area is promising. As a Software Engineer, I saw it labeled as a Data Scientist's subject, but it sounded so cool that I decided to dig into it. Here are my day-to-day notes on this long journey to master the field.




2020-03-17

Hello world for CNNs: make a simple network that predicts the MNIST digits.

https://keras.io/examples/mnist_cnn/
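
For reference, a minimal sketch in the spirit of that example (assuming TensorFlow 2.x with its built-in Keras API):

import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize MNIST (28x28 grayscale digits, 10 classes)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))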

2020-03-18

Tasks:

I have seen how to download a dataset, how to create a DataLoaders and how to train a pre-trained model.

In the learner, I see different types of layers: Conv2d, BatchNorm2d, ReLU, MaxPool2d, etc.

Questions:

2020-03-19

Try to tweak a model.
Read the documentation and discover simple model creation.
The list of definitions: https://deepai.org/definitions

Questions:

2020-03-20

Read more documentation about callbacks; callbacks make it easy to add new functionality to the training loop.

Read about the learning rate.

Setting the learning rate of your neural network

Test all three phases to discover the optimal learning rate range.

Visualizing the Loss Landscape of Neural Nets (Paper)

Cyclical Learning Rates for Training Neural Networks (Paper)

Questions:

2020-03-21

I have watched the 3Blue1Brown videos about Neural Networks.

A great method to train a model is the one-cycle method.

A disciplined approach to neural network hyper-parameters (Paper)

ReLU (Rectified Linear Unit) is an activation function.

Question:

Tasks:

2020-03-22

I have trained a model with my own dataset. I have to clean up the top-losses data; Fastai provides a class to do that. Then load the cleaned data and retrain the model.

I have to export the model; this creates a pickle file (.pkl).

This file contains the model, the weights, and metadata such as the classes and the transforms.

I have built a small program that takes this file, loads it with Fastai, and returns predictions on custom images.

Now I have a good idea of how to export a model and run it in production.

“Inference” is the name of the prediction process.

Learning rate too high -> the validation loss will be high.

Learning rate too low -> the error rate will decrease very slowly between epochs. An indication is that the training loss stays higher than the validation loss.

A model is starting to overfit only when the error rate starts to get worse than in previous epochs, not when the training loss is lower than the validation loss.

Tensor with 3 dimensions for a colored image.

Visualize Matrix Multiplication

Acronyms:

Learning about linear regression and derivatives is useful to understand how loss calculation works.

To run gradient descent in practice, we compute the gradient on mini-batches instead of on the whole dataset.

Vocabulary:

To sharpen your math understanding

How (and why) to create a good validation set

Questions:

Tasks:

2020-03-23

Today learning about computer vision and image processing.

Rank: the number of dimensions a tensor has. A colored image has a rank of 3.

Visualize Neural Networks


Computational Linear Algebra for Coders

Linear Function:

y = ax+b
x, y	are the coordinates of any point on the line
a	is the slope of the line
b	is the y-intercept (where the line crosses the y-axis)

2020-03-24

Fast.ai Lesson 2 at Linear Regression Problem

Find the parameters that minimize the error. For a regression problem, MSE is a common error function (a.k.a. loss function).

Mean Squared Error → MSE, also RMSE → Root Mean Squared Error

MSE is the loss: the mean of the squared differences between the predictions and the actual values.

2020-03-25

Continuing with the SGD calculation.

Create mini-batches so we don't train on the whole dataset each time before updating the weights.
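
A rough NumPy sketch of the idea, fitting y = a*x + b on mini-batches instead of the full dataset at every step (the data and hyper-parameters here are made up for illustration):

import numpy as np

# Fake data: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = 3 * x + 2 + rng.normal(scale=0.1, size=1000)

a, b = 0.0, 0.0          # parameters to learn
lr, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(x))          # shuffle each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        pred = a * xb + b
        # Gradients of the MSE loss w.r.t. a and b, computed on the mini-batch only
        grad_a = 2 * ((pred - yb) * xb).mean()
        grad_b = 2 * (pred - yb).mean()
        a -= lr * grad_a
        b -= lr * grad_b

print(a, b)  # should approach 3 and 2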

2020-03-26

Starting lesson 3 about data blocks, multi-label, classification, and segmentation.

thresh is used when we have multiple labels per image: an activation above the threshold counts as a predicted label.

Python's partial is used to create a new function with a specific parameter already filled in.
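
For example, a sketch assuming fastai v1, where the multi-label accuracy metric is accuracy_thresh:

from functools import partial
from fastai.metrics import accuracy_thresh  # assumption: fastai v1 metric

# partial fixes the thresh argument so the metric can be passed to the learner as-is
acc_02 = partial(accuracy_thresh, thresh=0.2)
# learn = cnn_learner(data, models.resnet34, metrics=acc_02)  # hypothetical usage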

Use small images to experiment more quickly.

Segmentation labels the image pixel by pixel (each pixel gets a class label).

2020-03-27

Using progressive resizing to train the model.

If underfitting: train longer, train the last part with a reduced learning rate, and decrease regularization.

U-Net for segmentation training

The learning rate should be high at the beginning and be reduced afterwards. Don't get locked into finding the smallest one.

After each linear layer, use an activation function.
The sigmoid is not used anymore.
Mostly used now:
ReLU: Rectified Linear Unit → ReLU activation
max(x,0)
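
In code, ReLU is just an element-wise max with zero; a one-line NumPy version:

import numpy as np

def relu(x):
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]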

A visual proof that neural nets can compute any function
Activation Functions in Neural Networks

ImageNet models expect 3 channels; if we only have 2 channels, we can add a new channel set to 0.

Tasks:

2020-03-29

Watched the 3blue1brown videos (Essence of Linear Algebra) about linear algebra, vectors, linear transformations, and matrix multiplication.

Now working on fastai and data blocks API.

What is the imagenet_stats normalization?

Questions:

2020-03-31

Working with the Fastai library and practicing the data block API.

I read some articles; Basics of Linear Algebra for Machine Learning is super valuable. The author advises a top-down approach and warns against the mistake beginners make of learning the math too early in the ML journey.

ResNet solves the vanishing gradient problem.

Article: Deep Convolutional Neural Networks

CNN: Convolutional Neural Network

Question:

2020-04-01

I want to work with a segmentation dataset, so I'm importing the COCO dataset: http://cocodataset.org/

Train the model.

Python division with two slashes does integer division: 10 // 3 = 3.

Questions:

2020-04-02

I continue to train on CamVid.

Question:

2020-04-05

Working on segmentation, based on the camvid tiny dataset.

I’m exploring the mask and the function to display it. It calls an external library, and the convert mode is used to display different colors based on the value in the image. The values do not represent RGB values.

I check the min value and the max value in the tensor

torch.min(mask.data)
torch.max(mask.data)

I expect it should match the number of codes I have in my dataset.

It depends on the image; not every image contains all the classes in its mask. Instead of the min and max, I will check how many different values I have in my dataset.

len(np.unique(mask.data))

Now it makes sense: I can discover all the classes present in the images based on the mask codes.

I trained the model, still learning about the learning rate and the plot loss.

Tasks:

Starting fast.ai lesson 4

NLP: legal text classifier with universal language model fine-tuning.

Universal Language Model Fine-tuning for Text Classification (paper)

NLP uses transfer learning: start from a language model pre-trained on Wikipedia (Wikitext).

Language model → Specific language model → Classifier
Self supervised learning

Collaborative filtering:
Sparse matrix storage
Cold start problem; to solve it, use a metadata model

Terms:
Parameters/weights → Number inside the matrices
Activations → Result of the calculation (Result of matrix multiply or activation function)
Layers → Each step of the computation
Input → Entry point
Output → Result

It is common to have a sigmoid at the last layer, to squeeze the output between two values.

The loss function compares the output of the last layer with the target.

2020-04-06

Training on the head poses dataset.

Data augmentation can help generate more data for the training set.

2020-04-08

Working with tabular data today.

For categorical data, we will use embeddings.

A processor, similar to transformations in computer vision, will be used, but it’s applied beforehand.

The validation set should be a contiguous group of items.

2020-04-10

Regarding Collaborative Learning, what is the difference between Embedding NN and Embedding Dot Bias?

Will fastai keep the better parameters of the training when we change the learning rate?

How are columns selected in CollabDataBunch?

What is PCA?

Calculate the loss with RMSE (Root Mean Squared Error).

2020-04-11

I’m watching a course from MIT about computer vision CNNs.

Layers:

Apply layers to match the depth of the neural network.

Convolution:

Depth: h x w x d

Create a feature map.

Non-Linear Activation:

Pooling:

CNN:

TensorFlow provides a playground tool.

The playground is pretty nice; you can better understand the learning rate, see the impact of the activation function, and observe the effects of noise and batch size.

Create a COCO dataset from scratch

Questions:

Tasks:

2020-04-12

Continuing learning and training models.

I’ve read an article that summarizes the fastai course.

Datasets:

In NLP, the first step is training a language model, which is the step of guessing the next word in a sentence.

A language model has its own labels.

Now we can create a classifier.

MSE: Mean Squared Error. Always non-negative.

Article to understand and choose the last layer activation and loss function.

Questions:

Tasks:

2020-04-13

I’ve explored the batch size and tried to understand how it works. It’s related to PyTorch and involves loading data per batch to avoid loading everything into memory, I guess.

I’ll start lesson 5.

Fastai adds two layers at the end:

learn.fit(n_epoch, learning_rate)

# Learning rate parameter
# All layers receive the same learning rate
1e-3

# The final layer receives the indicated learning rate. The other layers receive the number divided by 3.
slice(1e-3)

# The first layer receives the first value,
# All intermediate layers receive multiplicatively equal spreads,
# The last layer receives the second value.
slice(1e-5, 1e-3)

Add learning rate per group, not per layer.

cnn_learner has 3 layer groups by default.

Pre-processing:

Matrix multiplications followed by ReLUs, when stacked together, have this amazing mathematical property known as the universal approximation theorem. If you have large enough weight matrices and enough of them, they can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy.

Fine-tuning: Based on a trained model, retrain the model to fit our use case.

Tasks:

Questions:

Why use square instead of absolute in the loss function?

Okay, the absolute version is called MAE, and it's used when we don't want outliers to weigh too heavily.

Loss functions: MSE, MAE, Huber

Another loss function that exists is named Huber.

2020-04-15

I want to write the MSE, MAE, and RMSE functions.

Clarifying the squaring and square rooting.

Squared:

$$ x^2 $$

Square root:

$$ \sqrt{x} $$

Here are the Python functions for MSE, RMSE, and MAE:

import math
import numpy as np

# Mean Squared Error (MSE); target and prediction are NumPy arrays
def mse(target, prediction):
    return ((target - prediction) ** 2).mean()

# Root Mean Squared Error (RMSE)
def rmse(target, prediction):
    return math.sqrt(mse(target, prediction))

# Mean Absolute Error (MAE)
def mae(target, prediction):
    return abs(target - prediction).mean()

These functions calculate the respective errors between the target values and the predictions.

2020-04-16

I’ll be focusing on writing a neural net from scratch.

I have started to rewrite the SGD (Stochastic Gradient Descent) functions. I’m following this article: https://www.charlesbordet.com/fr/gradient-descent/#et-les-maths-dans-tout-ca-. It’s well explained.

Questions:

2020-04-17

Google provides a course with a clear path and good explanations:

Introduction to Machine Learning

2020-04-18

Working with the basic implementation of the functions.

Machine Learning Glossary

Weight Decay:

We want a lot of parameters but less complexity.

How to penalize complexity:

Sum up the values of the parameters.

Some parameters are negative and others positive, so we can sum the squares of the parameters.

The problem is that this sum can get too big, and then the best way to lower the loss would be to set all the parameters to zero.

So let's multiply that sum by a number we choose. That number is called wd → weight decay.

A good wd is 0.1; the default is 0.01.

I have rewritten a matrix multiplication function.
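
Not the exact version I wrote, but a minimal from-scratch matrix multiplication (triple loop, checked against NumPy) looks like this:

import numpy as np

def matmul(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br, "inner dimensions must match"
    c = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[5., 6.], [7., 8.]])
print(np.allclose(matmul(a, b), a @ b))  # True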

The gradient is like the delta between two values.

The gradient of an array of values is the amount needed to reach the next value.

From 1 to 1.5, the gradient will be 0.5.

np.gradient([1, 1.5])

# Output: array([0.5, 0.5])

Derivative:

$$ \frac{d}{dx} $$

$$ f' $$

Notation of the Gradient of a function:

$$ \vec \nabla f $$

To calculate the new weights, we start from the previous weights.

The new weight is the weight of the previous step minus the learning rate multiplied by the derivative of the loss function.

$$ w_t = w_{t-1} - lr \times \dfrac{dL}{dw_{t-1}} $$

L → the loss (so dL/dw is the derivative of the loss with respect to the weights)

$$ L(x,w) = mse(m(x,w), y) + wd \times \sum w^2 $$

A gradient is a vector; a directional derivative is a scalar.

The gradient, built from the derivatives, gives the direction to follow to decrease the cost.

An implementation of gradient descent in python: gradient_descent.py
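
The file is not reproduced here; as a rough sketch, gradient descent for the loss above (MSE plus the weight-decay term) could look like this in NumPy, with made-up data:

import numpy as np

def gradient_descent(X, y, lr=0.1, wd=0.01, epochs=100):
    # Minimize mse(X @ w, y) + wd * sum(w**2) by following the gradient
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        pred = X @ w
        # dL/dw = 2/n * X.T @ (pred - y) for the MSE part, plus 2 * wd * w for the weight-decay part
        grad = 2 / n * X.T @ (pred - y) + 2 * wd * w
        w = w - lr * grad  # w_t = w_{t-1} - lr * dL/dw_{t-1}
    return w

# Toy example: recover weights close to [3, 2]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))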

Questions:

2020-04-19

Discussion with Natan about Fastai and the multi-label training on the Planet dataset.

lr_find() restores the weights at the end of the exploration.

The threshold should not impact the training, but what if we set the threshold to 1%? Should we get an accuracy of 100%?

To know the loss function used, use learn.loss_func.

To launch the debugger in Jupyter, use %debug.

Tutorial to implement a NN from scratch.

Logistic regression model → Neural net with one layer and no hidden layer.

CrossEntropyLoss is used for problems where the loss can't be computed as a distance between the predicted and actual values: with categorical labels, predicting class 5 instead of class 4 is just as wrong as predicting class 0 instead of class 4.

There's no relation between the class numbers; we can't measure the loss as the difference between 5 and 4.

Gradient of

$$ \frac{d\ (wd \times w^2)}{dw} = 2 \times wd \times w $$

where wd → constant, w → weights.

Weight Decay: Subtracts some constant times the weights every time we do a batch.

$$ wd \times w $$

L2 Regularization: Adds the square of the weights to the loss function.

$$ wd \times w^2 $$

Momentum:
Exponentially weighted moving average:

$$ S_t = \beta \times S_{t-1} + (1 - \beta) \times g_t $$

where S_t is the step at time t, g_t is the gradient at time t, and β is the momentum coefficient.

SGD with Momentum → Based on the current gradient plus the exponentially weighted moving average of the last few steps.
SGD Momentum is almost always set to 0.9.
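
A tiny sketch of that update on a toy problem (minimizing (w - 3)^2), just to see the moving average in action:

# Minimize f(w) = (w - 3)^2 with SGD + momentum
def grad(w):
    return 2 * (w - 3)

w, step = 0.0, 0.0
beta, lr = 0.9, 0.1                       # momentum coefficient and learning rate
for t in range(200):
    g = grad(w)
    step = beta * step + (1 - beta) * g   # exponentially weighted moving average of the gradients
    w -= lr * step

print(w)  # close to 3.0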

RMSPROP:
If the gradient is consistently very small and not volatile, make bigger jumps.

ADAM:
Combines RMSPROP and Momentum.

Cross-Entropy Loss:
Negative sum of the one-hot encoded variables times the log of the activations.

Can be solved with an if statement:
if cat then log(cat_pred) else log(1 - cat_pred).

Can be done by a lookup:
Look up the log of the activation for the correct answer.
Ensure the prediction sums up to one.
Using softmax in the last layer: All activations are greater than 0 and less than 1.

For single-label multi-class classification, use softmax as the activation function, and cross-entropy as the loss function.
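
A NumPy sketch of that combination: softmax over the raw activations, then the loss is the negative log of the probability looked up at the correct class:

import numpy as np

def softmax(z):
    z = z - z.max()                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()               # all values in (0, 1), summing to 1

def cross_entropy(activations, target):
    probs = softmax(activations)
    return -np.log(probs[target])    # lookup of the log of the correct-class probability

acts = np.array([2.0, 0.5, -1.0])        # raw outputs of the last layer for 3 classes
print(cross_entropy(acts, target=0))     # small loss: class 0 has the highest activation
print(cross_entropy(acts, target=2))     # larger loss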

Weight Decay, Dropout, and BatchNorm are regularization functions.

Tasks:

Questions:

2020-04-21

Refreshing my understanding of one-hot encoding and embedding.

☝ Affine function → Linear function → Matrix multiplication

Embedding is the use of an array lookup instead of multiplying by a one-hot encoded matrix.
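
A quick NumPy check that the array lookup and the one-hot matrix multiply give the same result:

import numpy as np

emb = np.random.randn(5, 3)   # embedding matrix: 5 categories, 3-dim vectors
idx = 2                       # category index

one_hot = np.zeros(5)
one_hot[idx] = 1.0

print(np.allclose(one_hot @ emb, emb[idx]))  # True: lookup == one-hot matrix multiply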

2020-04-22

Starting lesson 6 of fastai.

platform.ai: to classify unlabeled images.

Add more detail from the date; the date parts are used as one-hot/embedding features by the model.

add_datepart(train, "Date", drop=False)

2020-04-23

I’m working with a notebook that utilizes GPU processing.

To switch to CPU processing:

defaults.device = torch.device('cpu')

Fastai Preprocessing:

Categorify
FillMissing

Preventing overfitting with dropout:

Tasks:

Note: When using dropout, it is typically applied during the training phase to prevent overfitting. During the test or evaluation phase, dropout is not used; instead, the full network is utilized.
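
In PyTorch this switch is controlled with model.train() and model.eval(); a small illustration (the layer sizes are arbitrary):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(20, 2))
x = torch.randn(1, 10)

model.train()            # dropout active: units are randomly zeroed during training
print(model(x))

model.eval()             # dropout disabled: the full network is used for evaluation
print(model(x))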

2020-04-25

I’m continuing with Lesson 6.

Image Kernels

CNNs from Different Viewpoints

Convolution involves multiplying pixel blocks with a kernel.

A kernel can detect various features like top edges, left edges, etc.

In the case of Stride 2 convolution, every other pixel is skipped.

Average Pooling is another technique used.
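
A tiny NumPy sketch of a convolution: slide a kernel over the image, multiply each pixel block element-wise with the kernel, and sum (stride 2 skips every other position):

import numpy as np

def conv2d(img, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            block = img[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (block * kernel).sum()   # element-wise multiply, then sum
    return out

top_edge = np.array([[-1, -1, -1],
                     [ 0,  0,  0],
                     [ 1,  1,  1]])              # a kernel that responds to top edges
img = np.random.rand(8, 8)
print(conv2d(img, top_edge, stride=2).shape)     # (3, 3) feature map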

With Natan, we started a project focusing on NLP. Our goal is to process our conversations on WhatsApp or Messenger and detect the sentiments within them.

I imported my WhatsApp messages and performed some cleanup.

Now, I’ll focus on the first part of NLP. In Natural Language Processing, there are typically 3 parts:

  1. Language Model
  2. Specific Language Model
  3. Classifier

I will work on the Language Model, which will be based on Wikipedia content. Since we want to analyze conversations in French, the language model needs to be in French.

The language model’s task is to predict the next word in a sentence. There’s an existing Language Model created from Wikipedia: https://github.com/piegu/language-models/blob/master/lm2-french.ipynb

In NLP, there are Bidirectional Language Models where predictions can be made both backward and forward.

I’m figuring out how to load this model into Fastai. I have the weights and need to create a language_model_learner with them. It seems I need the data - to predict the next word, the model needs something to refer to.

The vocab appears to be included in the files, so I’m figuring out how to load the weights and vocab without needing the entire dataset, and still be able to use learn.predict. I might need a corpus; initially, I’ll try using the one from Wikipedia.

2020-04-26

Continuing with the NLP project, I’m researching how NLP is implemented in Fastai, aiming to customize some parts of the workflow with our custom data.

Fastai discusses MultiFiT for text classification, a method based on ULMFiT.

https://nlp.fast.ai/

For our NLP project, we scraped a French website for cinema reviews.

2020-04-29

I continue working on our NLP project. There was an issue with the data at some point; I’ll check why and try to fix the problem I’m encountering with the training of the language model.

2020-04-30

I’m exploring QRNNs and trying to figure out why they’re not working on the workstation.
