class: center, middle

# Machine learning models

Hugo RICHARD

.affiliations[ ENSAE, CREST ]

.footnote.tiny[Credits: Most of the slides are from Mathieu Blondel (see [here](https://data-psl.github.io/lectures2021/slides/02_intro_to_machine_learning/#1)). Much of the content and many figures are borrowed from the scikit-learn [mooc](https://github.com/inria/scikit-learn-mooc/) and the scipy [lecture notes](https://scipy-lectures.org/packages/scikit-learn/index.html).]

---

.center[
# Recap from yesterday
]

---

# Classification Dataset: Iris

Three species of iris:

--

.center[
]

--

For each plant, we have 4 "features": Sepal length, Sepal width, Petal length, Petal width

--

### Goal: given the features, predict the species

---

# Classification Dataset: Iris

### Goal: given the features, predict the species

--

Example:
- Sepal length = 5.1 cm
- Sepal width = 3.5 cm
- Petal length = 1.4 cm
- Petal width = 0.2 cm

--

### Algorithm: "The species is Setosa"

--

We want to build a function $f$ such that
$f($sepal length, sepal width, petal length, petal width$) = s$
where $s$ is the correct species.

---

# Classification Dataset: Iris

We want to build a function $f$ such that
$f($sepal length, sepal width, petal length, petal width$) = s$
where $s$ is the correct species.

## Today: how can we build such a function?

---

# Looking at the dataset

Consider just two features: petal length (l) and petal width (w).

We have $150$ samples:

--

- Sample 1: species = Setosa, l = 1.5, w = 0.2

--

- Sample 2: species = Versicolor, l = 4.2, w = 1.5

--

- ...

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

--

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

--

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... assigns to each point in space a predicted class. It splits space into decision regions.

.center[
]

---

# A **classification** machine learning model...

... assigns to each point in space a predicted class. It splits space into decision regions.

.center[
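]

---

# The Iris data in Python

As an aside, here is a minimal sketch of what this dataset looks like as arrays, using scikit-learn (the library itself is covered in more detail later); the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features, 3 species
iris = load_iris()
X = iris.data      # shape (150, 4): sepal length/width, petal length/width
y = iris.target    # shape (150,): species encoded as 0, 1, 2

# Keep only the two features used in the plots: petal length and petal width
X_2d = X[:, 2:4]
print(X_2d[:3])            # first three (petal length, petal width) pairs
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```

.center[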
]

---

# Today: Usual ways to build these regions.

.center[
]

---

# Regression dataset: California housing

Goal: predict the median price of houses in California districts given some characteristics (8 features)

--

- Median income in block group
- Median house age in block group
- Average number of rooms per household
- Number of people in block group
- ...

Block group = 600 to 3000 people

--

A model takes all these inputs and predicts the median price: **regression**

---

# Illustration in 1-D

To visualize the data, we can plot the target (price) against one feature (e.g. median income):

.center[
]

---

# Illustration in 1-D

To visualize the data, we can plot the target (price) against one feature (e.g. median income):

.center[
]

--

There is a clear trend, but perfect prediction of the target is impossible

---

# Illustration in 1-D

A machine learning model would try to predict the target from the feature:

.center[
]

---

# Illustration in 2-D

Using two features, we can visualize each sample as a point in 2-D, and associate the **target** with a **color**

.center[
]

---

# Illustration in 2-D

A machine learning model associates each point with a predicted value

.center[
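]

---

# The California housing data in Python

A minimal sketch of how this dataset can be loaded with scikit-learn; `fetch_california_housing` downloads the data the first time it is called, and the variable names below are only illustrative.

```python
from sklearn.datasets import fetch_california_housing

# 8 features per district, target = median house value of the district
housing = fetch_california_housing()
X = housing.data       # shape (n_districts, 8)
y = housing.target     # median house value, in units of $100,000

print(housing.feature_names)   # ['MedInc', 'HouseAge', 'AveRooms', ...]
print(X.shape, y.shape)
```

.center[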
]

---

# Today: Usual ways to build these regions.

.center[
]

---

# Decision trees

---

# Decision trees

Let's go back to the Iris dataset

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $= 1.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $< 2.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $< 2.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal width" $> 1.6$, you are fairly confident that the species is "Virginica"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal width" $> 1.6$, you are fairly confident that the species is "Virginica"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal length" $< 5$, you are fairly confident that the species is "Versicolor"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal length" $< 5$, you are fairly confident that the species is "Versicolor"

.center[
]

---

# Decision trees

The other points can be classified as "Virginica": we have built a set of rules to classify all points!

.center[
]

---

# Why are they called trees?

.center[
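]

---

# Decision trees in scikit-learn

A minimal sketch of how such a set of rules can be learned automatically; the choice of the two petal features and of `max_depth=2` mirrors the rules above but is otherwise an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:4]   # petal length, petal width
y = iris.target

# Each split learned by the tree is a simple rule such as "petal length < 2.45"
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules and classify a new flower
print(export_text(tree, feature_names=["petal length", "petal width"]))
print(tree.predict([[1.4, 0.2]]))   # small petal -> predicted class 0 (Setosa)
```

.center[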
]

---

# Advantages of trees

Fast inference:
- Sequentially follow a set of simple rules

--

Interpretability:
- It is easy to understand why the model made a prediction

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 1

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 2

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 3

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 4

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 5

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 6

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 7

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 8

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 9

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 10 **overfit**

.center[
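]

---

# Single trees overfit: a quick check

A minimal sketch of how this can be diagnosed in practice: hold out a test set and compare train and test accuracy as the depth grows. The toy dataset and the split below are illustrative assumptions, not the exact data used in the figures.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy 2-D toy dataset, split into a train set and a test set
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # A growing gap between train and test accuracy is the signature of overfitting
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```

.center[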
]

---

# Solution: grow a forest!

Idea:
- Instead of a single tree, grow multiple trees on random subsets of the data
- To make a prediction, each tree makes its own prediction, and the most common answer is chosen (i.e. a vote)

--

.center[
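]

---

# Random forests in scikit-learn

A minimal sketch with `RandomForestClassifier`; 100 trees matches the following slides, while the toy dataset and the depth are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# 100 trees, each grown on a random subset of the data;
# the forest predicts by taking the majority vote over the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))   # predictions of the forest on the first 5 samples
```

.center[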
]

---

# Random Forest: Using 100 trees

depth = 1

.center[
]

---

# Random Forest: Using 100 trees

depth = 2

.center[
]

---

# Random Forest: Using 100 trees

depth = 3

.center[
]

---

# Random Forest: Using 100 trees

depth = 4

.center[
]

---

# Random Forest: Using 100 trees

depth = 5

.center[
]

---

# Random Forest: Using 100 trees

depth = 6

.center[
]

---

# Random Forest: Using 100 trees

depth = 7

.center[
]

---

# Random Forest: Using 100 trees

depth = 8

.center[
]

---

# Random Forest: Using 100 trees

depth = 9

.center[
]

---

# Random Forest: Using 100 trees

depth = 10

.center[
]

---

# Linear models

---

# Linear models

Linear models for **classification** with two classes

Features: $x = (x_1, \dots, x_p)$

--

Model coefficients: $(w_1, \dots, w_p)$ and bias $b$

--

Prediction
- $x$ is from class 1 if $w_1 \times x_1 + \cdots + w_p \times x_p + b > 0$
- $x$ is from class 2 if $w_1 \times x_1 + \cdots + w_p \times x_p + b < 0$

---

# Linear models

Linear models for **regression**

Features: $x = (x_1, \dots, x_p)$

--

Model coefficients: $(w_1, \dots, w_p)$ and bias $b$

--

Prediction
- $y \simeq w_1 \times x_1 + \cdots + w_p \times x_p + b$

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
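]

---

# Linear models in scikit-learn

A minimal sketch of a linear classifier; logistic regression is one common way to fit the coefficients $(w_1, \dots, w_p)$ and the bias $b$. The feature choice is illustrative, and the fitted numbers will differ from the rounded rule shown on a later slide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Keep petal length / petal width and only the Versicolor vs Virginica classes
mask = iris.target > 0
X, y = iris.data[mask][:, 2:4], iris.target[mask]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The decision rule is: predict one class or the other depending on
# the sign of w1 * petal_length + w2 * petal_width + b
print(clf.coef_, clf.intercept_)
print(clf.predict([[4.5, 1.5]]))
```

.center[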
]

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
]

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
]

--

- Versicolor if $2.8 \times$ length $+ 2.4 \times$ width $- 17.6 < 0$
- Virginica otherwise

---

# Linear models: use case

Linear models are extremely simple. They cannot separate such a dataset:

.center[
]

Widely used when the number of samples is **not** much larger than the number of features

---

# Linear models and overfit

Despite their simplicity, linear models **can** overfit, especially when there are not enough samples.

--

In general, if the number of samples $n$ is $\leq$ the number of features $p$, you can always find linear coefficients such that the prediction is perfect on each sample

--

**Even if the dataset is pure noise**

--

In 2-D, if you only have $2$ points, you can always separate them with a line.
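---

# Perfect fit on pure noise: a quick check

A minimal NumPy sketch of this phenomenon: with as many features as samples (and no bias term, an assumption made to keep the example short), a linear model can reproduce completely random targets exactly.

```python
import numpy as np

rng = np.random.RandomState(0)
n = p = 5
X = rng.randn(n, p)   # n samples, p random features
y = rng.randn(n)      # pure noise targets

# As many unknowns as equations: solve X @ w = y exactly
w = np.linalg.solve(X, y)
print(np.allclose(X @ w, y))   # True: "perfect" predictions on the training data
```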
---

# Linear models and overfit

## Example: Between 22/08 and 24/08

Target $y$: number of new covid cases in France $y = (38702, 28988, 21289)$

--

3 features:
- Temperature in Paris: $x_1 = (24, 30, 31)$

--

- Maximum height of the tide in Brest: $x_2 = (5.23, 5.49, 5.82)$

--

- Day number: $x_3 = (22, 23, 24)$

--

You can check that (up to rounding of the coefficients):

$$ y \simeq -1863 \times x_1 - 104338 \times x_2 + 28596 \times x_3 $$

---

# Neural networks

---

# Neural networks

Neural networks are a powerful way to go beyond linear models.

--

- Sequence of layers
- One layer maps $p$ inputs to $q$ outputs

--

Input: $x_1, \cdots, x_p$

Output: $z_1, \cdots, z_q$

--

**Parameters**
- Weight vectors of $p$ coefficients $w^i \in \mathbb{R}^p$, $i = 1 \dots q$
- Biases $b^i \in \mathbb{R}$
- Non-linearity $\sigma$ (e.g. $\sigma = \mathrm{ReLU}$, $\sigma = \tanh$, ...)

--

$$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$$

---

# Non-linearities

- $\sigma$ is a function $\mathbb{R} \to \mathbb{R}$
- Non-linear: $\sigma(x) \neq a \times x$

**ReLU**

.center[
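]

---

# One layer in NumPy

A minimal NumPy sketch of a single layer $z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$ with a ReLU non-linearity; the sizes and the random values are illustrative.

```python
import numpy as np

def relu(x):
    # ReLU non-linearity: max(0, x), applied elementwise
    return np.maximum(0, x)

p, q = 4, 3                 # p inputs, q outputs
rng = np.random.RandomState(0)
W = rng.randn(q, p)         # one weight vector w^i per output unit
b = rng.randn(q)            # one bias b^i per output unit
x = rng.randn(p)            # input features x_1, ..., x_p

z = relu(W @ x + b)         # z_i = sigma(w^i . x + b^i)
print(z)
```

.center[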
]

---

# Neural networks

$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$

--

Layers are stacked:

.center[
]

---

# Neural networks

$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$

Layers are stacked:

.center[
]

Here, 2 layers: depth = 2

---

# Universal approximators

**Theorem:** a neural network with one hidden layer is a universal approximator. With enough hidden units, it can approximate any continuous function arbitrarily well.

--

**Striking** difference with linear models

.center[
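]

---

# Multi-layer networks in scikit-learn

A minimal sketch with `MLPClassifier`, whose `hidden_layer_sizes` parameter sets both the number of hidden layers and the number of hidden units per layer; the toy dataset and the chosen sizes are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# One hidden layer with 4 units ...
mlp_1 = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0).fit(X, y)
# ... versus two hidden layers with 4 units each
mlp_2 = MLPClassifier(hidden_layer_sizes=(4, 4), max_iter=2000, random_state=0).fit(X, y)

print(mlp_1.score(X, y), mlp_2.score(X, y))
```

.center[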
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

--

- Deeper networks can learn more complicated functions with fewer hidden units

--

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 2 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 3 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 4 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 5 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 6 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 2 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 3 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 4 hidden units

.center[
]

---

# Neural nets and overfit

- Neural nets have lots of parameters, so you need many samples to train them, otherwise they overfit
- Mystery: some very big neural networks (e.g. for image recognition) have many more parameters than training samples, yet they do not overfit

---

# Next:

- How do you easily use all these models in Python?