class: center, middle

# Machine learning models

Hugo RICHARD

.affiliations[ ENSAE, CREST ]

.footnote.tiny[Credits: Most of the slides are from Mathieu Blondel (see [here](https://data-psl.github.io/lectures2021/slides/02_intro_to_machine_learning/#1)). Much of the content and many figures are borrowed from the scikit-learn [mooc](https://github.com/inria/scikit-learn-mooc/) and the scipy [lecture notes](https://scipy-lectures.org/packages/scikit-learn/index.html).]

---

.center[
# Recap from yesterday
]

---

# Classification Dataset: Iris

Three species of iris:

--

.center[
]

--

For each plant, we have 4 "features": Sepal length, Sepal width, Petal length, Petal width

--

### Goal: given the features, predict the species

---

# Classification Dataset: Iris

### Goal: given the features, predict the species

--

Example:
- Sepal length = 5.1 cm
- Sepal width = 3.5 cm
- Petal length = 1.4 cm
- Petal width = 0.2 cm

--

### Algorithm: "The species is Setosa"

--

We want to build a function $f$ such that
$f($sepal length, sepal width, petal length, petal width$) = s$
where $s$ is the correct species.

---

# Classification Dataset: Iris

We want to build a function $f$ such that
$f($sepal length, sepal width, petal length, petal width$) = s$
where $s$ is the correct species.

## Today: how can we build such a function?

---

# Looking at the dataset

Consider just two features: petal length (l) and petal width (w).

We have $150$ samples:

--

- Sample 1: species = Setosa, l = 1.5, w = 0.2

--

- Sample 2: species = Versicolor, l = 4.2, w = 1.5

--

- ...

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

--

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# Looking at the dataset

Since we have only two features, each sample can be represented as a point in a two-dimensional plane:

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

--

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... will take as input a **new** pair (petal length, petal width) and tell which species it belongs to.

.center[
]

---

# A **classification** machine learning model...

... assigns to each point in space a predicted class. It splits space into decision regions.

.center[
]

---

# A **classification** machine learning model...

... assigns to each point in space a predicted class. It splits space into decision regions.

.center[
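]

---

# The Iris data in Python

As an aside, here is a minimal sketch of what this dataset looks like as arrays, using scikit-learn (the library itself is covered in more detail later); the variable names are illustrative.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features, 3 species
iris = load_iris()
X = iris.data      # shape (150, 4): sepal length/width, petal length/width
y = iris.target    # shape (150,): species encoded as 0, 1, 2

# Keep only the two features used in the plots: petal length and petal width
X_2d = X[:, 2:4]
print(X_2d[:3])            # first three (petal length, petal width) pairs
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```

.center[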
]

---

# Today: Usual ways to build these regions.

.center[
]

---

# Regression dataset: California housing

Goal: predict the median price of houses in California districts given some characteristics (8 features)

--

- Median income in block group
- Median house age in block group
- Average number of rooms per household
- Number of people in block group
- ...

Block group = 600 to 3000 people

--

A model takes all these inputs and predicts the median price: **regression**

---

# Illustration in 1-D

To visualize the data, we can plot the target (price) against one feature (e.g. median income):

.center[
]

---

# Illustration in 1-D

To visualize the data, we can plot the target (price) against one feature (e.g. median income):

.center[
]

--

There is a clear trend, but perfect prediction of the target is impossible

---

# Illustration in 1-D

A machine learning model would try to predict the target from the feature:

.center[
]

---

# Illustration in 2-D

Using two features, we can visualize each sample as a point in 2-D, and associate the **target** with a **color**

.center[
]

---

# Illustration in 2-D

A machine learning model associates each point with a predicted value

.center[
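]

---

# The California housing data in Python

A minimal sketch of how this dataset can be loaded with scikit-learn; `fetch_california_housing` downloads the data the first time it is called, and the variable names below are only illustrative.

```python
from sklearn.datasets import fetch_california_housing

# 8 features per district, target = median house value of the district
housing = fetch_california_housing()
X = housing.data       # shape (n_districts, 8)
y = housing.target     # median house value, in units of $100,000

print(housing.feature_names)   # ['MedInc', 'HouseAge', 'AveRooms', ...]
print(X.shape, y.shape)
```

.center[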
]

---

# Today: Usual ways to build these regions.

.center[
]

---

# Decision trees

---

# Decision trees

Let's go back to the Iris dataset

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $= 1.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $< 2.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

If I give you a new point where "petal length" $< 2.5$, you are fairly confident that the species is "Setosa"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal width" $> 1.6$, you are fairly confident that the species is "Virginica"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal width" $> 1.6$, you are fairly confident that the species is "Virginica"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal length" $< 5$, you are fairly confident that the species is "Versicolor"

.center[
]

---

# Decision trees

Then if I give you a new point with "petal length" $< 5$, you are fairly confident that the species is "Versicolor"

.center[
]

---

# Decision trees

The other points can be classified as "Virginica": we have built a set of rules to classify all points!

.center[
]

---

# Why are they called trees?

.center[
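]

---

# Decision trees in scikit-learn

A minimal sketch of how such a set of rules can be learned automatically; the choice of the two petal features and of `max_depth=2` mirrors the rules above but is otherwise an illustrative assumption.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:4]   # petal length, petal width
y = iris.target

# Each split learned by the tree is a simple rule such as "petal length < 2.45"
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules and classify a new flower
print(export_text(tree, feature_names=["petal length", "petal width"]))
print(tree.predict([[1.4, 0.2]]))   # small petal -> predicted class 0 (Setosa)
```

.center[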
]

---

# Advantages of trees

Fast inference:
- Sequentially follow a set of simple rules

--

Interpretability:
- It is easy to understand why the model made a prediction

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 1

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 2

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 3

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 4

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 5

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 6

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 7

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 8

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 9

.center[
]

---

# Problem with single trees

Fitting complicated datasets might require **very** deep trees: depth = 10 **overfit**

.center[
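]

---

# Single trees overfit: a quick check

A minimal sketch of how this can be diagnosed in practice: hold out a test set and compare train and test accuracy as the depth grows. The toy dataset and the split below are illustrative assumptions, not the exact data used in the figures.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy 2-D toy dataset, split into a train set and a test set
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [1, 2, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # A growing gap between train and test accuracy is the signature of overfitting
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```

.center[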
]

---

# Solution: grow a forest!

Idea:
- Instead of a single tree, grow multiple trees on random subsets of the data
- To make a prediction, each tree makes its own prediction, and the most common answer is chosen (i.e. a vote)

--

.center[
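]

---

# Random forests in scikit-learn

A minimal sketch with `RandomForestClassifier`; 100 trees matches the following slides, while the toy dataset and the depth are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# 100 trees, each grown on a random subset of the data;
# the forest predicts by taking the majority vote over the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))   # predictions of the forest on the first 5 samples
```

.center[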
]

---

# Random Forest: Using 100 trees

depth = 1

.center[
]

---

# Random Forest: Using 100 trees

depth = 2

.center[
]

---

# Random Forest: Using 100 trees

depth = 3

.center[
]

---

# Random Forest: Using 100 trees

depth = 4

.center[
]

---

# Random Forest: Using 100 trees

depth = 5

.center[
]

---

# Random Forest: Using 100 trees

depth = 6

.center[
]

---

# Random Forest: Using 100 trees

depth = 7

.center[
]

---

# Random Forest: Using 100 trees

depth = 8

.center[
]

---

# Random Forest: Using 100 trees

depth = 9

.center[
]

---

# Random Forest: Using 100 trees

depth = 10

.center[
]

---

# Linear models

---

# Linear models

Linear models for **classification** with two classes

Features: $x = (x_1, \dots, x_p)$

--

Model coefficients: $(w_1, \dots, w_p)$ and bias $b$

--

Prediction
- $x$ is from class 1 if $w_1 \times x_1 + \cdots + w_p \times x_p + b > 0$
- $x$ is from class 2 if $w_1 \times x_1 + \cdots + w_p \times x_p + b < 0$

---

# Linear models

Linear models for **regression**

Features: $x = (x_1, \dots, x_p)$

--

Model coefficients: $(w_1, \dots, w_p)$ and bias $b$

--

Prediction
- $y \simeq w_1 \times x_1 + \cdots + w_p \times x_p + b$

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
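]

---

# Linear models in scikit-learn

A minimal sketch of a linear classifier; logistic regression is one common way to fit the coefficients $(w_1, \dots, w_p)$ and the bias $b$. The feature choice is illustrative, and the fitted numbers will differ from the rounded rule shown on a later slide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Keep petal length / petal width and only the Versicolor vs Virginica classes
mask = iris.target > 0
X, y = iris.data[mask][:, 2:4], iris.target[mask]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# The decision rule is: predict one class or the other depending on
# the sign of w1 * petal_length + w2 * petal_width + b
print(clf.coef_, clf.intercept_)
print(clf.predict([[4.5, 1.5]]))
```

.center[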
]

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
]

---

# Linear models

Corresponds to drawing a line / plane between points

.center[
]

--

- Versicolor if $2.8 \times$ length $+ 2.4 \times$ width $- 17.6 < 0$
- Virginica otherwise

---

# Linear models: use case

Linear models are extremely simple. They cannot separate such a dataset:

.center[
]

Widely used when the number of samples is **not** much larger than the number of features

---

# Linear models and overfit

Despite their simplicity, linear models **can** overfit, especially when there are not enough samples.

--

In general, if the number of samples $n$ is $\leq$ the number of features $p$, you can always find linear coefficients such that the prediction is perfect on each sample

--

**Even if the dataset is pure noise**

--

In 2-D, if you only have $2$ points, you can always separate them with a line.
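---

# Perfect fit on pure noise: a quick check

A minimal NumPy sketch of this phenomenon: with as many features as samples (and no bias term, an assumption made to keep the example short), a linear model can reproduce completely random targets exactly.

```python
import numpy as np

rng = np.random.RandomState(0)
n = p = 5
X = rng.randn(n, p)   # n samples, p random features
y = rng.randn(n)      # pure noise targets

# As many unknowns as equations: solve X @ w = y exactly
w = np.linalg.solve(X, y)
print(np.allclose(X @ w, y))   # True: "perfect" predictions on the training data
```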
---

# Linear models and overfit

## Example: Between 22/08 and 24/08

Target $y$: number of new covid cases in France $y = (38702, 28988, 21289)$

--

3 features:
- Temperature in Paris: $x_1 = (24, 30, 31)$

--

- Maximum height of the tide in Brest: $x_2 = (5.23, 5.49, 5.82)$

--

- Day number: $x_3 = (22, 23, 24)$

--

You can check that (up to rounding of the coefficients):

$$ y \simeq -1863 \times x_1 - 104338 \times x_2 + 28596 \times x_3 $$

---

# Neural networks

---

# Neural networks

Neural networks are a powerful way to go beyond linear models.

--

- Sequence of layers
- One layer maps $p$ inputs to $q$ outputs

--

Input: $x_1, \cdots, x_p$

Output: $z_1, \cdots, z_q$

--

**Parameters**
- Weight vectors of $p$ coefficients $w^i \in \mathbb{R}^p$, $i = 1 \dots q$
- Biases $b^i \in \mathbb{R}$
- Non-linearity $\sigma$ (e.g. $\sigma = \mathrm{ReLU}$, $\sigma = \tanh$, ...)

--

$$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$$

---

# Non-linearities

- $\sigma$ is a function $\mathbb{R} \to \mathbb{R}$
- Non-linear: $\sigma(x) \neq a \times x$

**ReLU**

.center[
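]

---

# One layer in NumPy

A minimal NumPy sketch of a single layer $z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$ with a ReLU non-linearity; the sizes and the random values are illustrative.

```python
import numpy as np

def relu(x):
    # ReLU non-linearity: max(0, x), applied elementwise
    return np.maximum(0, x)

p, q = 4, 3                 # p inputs, q outputs
rng = np.random.RandomState(0)
W = rng.randn(q, p)         # one weight vector w^i per output unit
b = rng.randn(q)            # one bias b^i per output unit
x = rng.randn(p)            # input features x_1, ..., x_p

z = relu(W @ x + b)         # z_i = sigma(w^i . x + b^i)
print(z)
```

.center[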
]

---

# Neural networks

$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$

--

Layers are stacked:

.center[
]

---

# Neural networks

$z_i = \sigma(w^i_1 \times x_1 + \dots + w^i_p \times x_p + b^i)$

Layers are stacked:

.center[
]

Here, 2 layers: depth = 2

---

# Universal approximators

**Theorem:** a neural network with one hidden layer is a universal approximator. With enough hidden units, it can approximate any continuous function arbitrarily well.

--

**Striking** difference with linear models

.center[
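]

---

# Multi-layer networks in scikit-learn

A minimal sketch with `MLPClassifier`, whose `hidden_layer_sizes` parameter sets both the number of hidden layers and the number of hidden units per layer; the toy dataset and the chosen sizes are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# One hidden layer with 4 units ...
mlp_1 = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0).fit(X, y)
# ... versus two hidden layers with 4 units each
mlp_2 = MLPClassifier(hidden_layer_sizes=(4, 4), max_iter=2000, random_state=0).fit(X, y)

print(mlp_1.score(X, y), mlp_2.score(X, y))
```

.center[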
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

--

- Deeper networks can learn more complicated functions with fewer hidden units

--

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 2 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 3 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 4 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 5 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

1 layer, 6 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 2 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 3 hidden units

.center[
]

---

# Role of depth

If 1-layer neural networks can learn any function, why do you need more layers?

2 layers, 4 hidden units

.center[
]

---

# Neural nets and overfit

- Neural nets have lots of parameters, so you need many samples to train them, otherwise they overfit
- Mystery: some very big neural networks (e.g. for image recognition) have many more parameters than training samples, yet they do not overfit

---

# Next:

- How do you easily use all these models in Python?