Introduction to Machine Learning

class: center, middle

# Introduction to Machine Learning

Hugo RICHARD

.affiliations[
Criteo
]
.footnote.tiny[Credits: Most of the slides are from Mathieu Blondel (see [here](https://data-psl.github.io/lectures2021/slides/02_intro_to_machine_learning/#1)). Much of the content and many figures are borrowed from the scikit-learn [MOOC](https://github.com/inria/scikit-learn-mooc/) and the SciPy
[lecture notes](https://scipy-lectures.org/packages/scikit-learn/index.html).]
---
# Scope of this lecture

- A 1h30 long tutorial for absolute beginners in machine learning
- High-level concepts with almost zero maths

What will be covered later this week but not today:
- Model-specific explanations (linear models, trees, neural networks)
- scikit-learn
- Optimization
- Deep learning

---
# Outline

- Supervised learning
 - Classification
 - Regression
 - Generalization

- Model selection
 - Model complexity
 - Overfitting
 - Bias / variance trade-off

---
class: center, middle

# Supervised learning
---
# Supervised learning

- Given some input, **predict** some output

- Input domain $\mathcal{X}$

- Output domain $\mathcal{Y}$

- Prediction function $f \colon \mathcal{X} \to \mathcal{Y}$

- Calling the function: $y = f(\mathbf{x})$

---
# Supervised learning

Example 1: Classifying cats and dogs

- $\mathcal{X} = \text{ Set of all grayscale images of size 64 x 64 }$

- $\mathcal{Y} = \\{0, 1 \\}$

where $0$ means "dog" and $1$ means "cat"

- Given $f \colon \mathcal{X} \to \mathcal{Y}$

$f(x) = 0$ means the image is a dog according to the model

---

# Supervised learning

Example 2: Predicting salary given age

- $\mathcal{X} = \mathbb{R}$ (Set of all possible ages)

- $\mathcal{Y} = \mathbb{R}$ (Set of all possible salaries)

- Given the model $f \colon \mathcal{X} \to \mathcal{Y}$

- If $\mathbf{x}$ is the age of a person, then $f(\mathbf{x})$ is their predicted salary according to the model

---
# Supervised learning

- Learning from examples (inductive learning)

_Training phase_

- $X_{train} = \\{(x_1, y_1), \dots, (x_n, y_n)\\}$, with each pair $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$

Ex: $n$ images, each labeled as cat or dog

- $f$ is chosen based on $X_{train}$ via a *training algorithm*

_Inference phase_

- Use $f$ on unseen data $x_{test} \in \mathcal{X}$
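---
# Supervised learning

The two phases can be sketched in a few lines of Python. This is only an illustration: the training algorithm below (always predict the most frequent label) and the tiny training set are hypothetical choices, not the method used in this lecture.

```python
from collections import Counter


def train(training_set):
    """Toy training algorithm: return an f that always predicts the majority label."""
    labels = [y for _, y in training_set]
    majority_label = Counter(labels).most_common(1)[0][0]

    def f(x):
        # This baseline ignores the input x entirely.
        return majority_label

    return f


# Hypothetical training set of (x, y) pairs, with 0 = dog and 1 = cat.
X_train = [(0.2, 0), (0.9, 1), (0.8, 1)]

f = train(X_train)   # training phase: f is chosen based on X_train
y_pred = f(0.5)      # inference phase: f is applied to unseen data
```

Any real training algorithm has the same shape: it takes the training set and returns a prediction function.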

---
# Supervised learning

Feature extraction or Feature engineering

- Samples are often in a raw format

- It is necessary to extract useful features (descriptors) for the prediction task

- Traditional ML: human-designed extraction rules

- Deep learning: learned feature representations

---
# Supervised learning

Feature extraction or Feature engineering

- Predictions as downstream features

Example 1: the prediction "is there a pedestrian?" can be used as a feature for automatic braking

Example 2:

Input data = age $a$ of the person

Input feature = whether the person is older than $50$

The function $a \mapsto 1_{a > 50}$ is called a *feature extractor*

We then have $\mathcal{X} = \\{0, 1\\}$ instead of $\mathcal{X} = \mathbb{R}$
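---
# Supervised learning

A minimal sketch of the age feature extractor from the previous slide. The function name and the list of ages are made up for illustration.

```python
def older_than_50(age):
    """Feature extractor a -> 1_{a > 50}: returns 1 if age > 50, else 0."""
    return 1 if age > 50 else 0


ages = [23, 50, 51, 67]                       # hypothetical raw inputs
features = [older_than_50(a) for a in ages]   # inputs now live in {0, 1}
```

Note that the indicator uses a strict inequality, so an age of exactly 50 maps to 0.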

---
# Supervised learning

- Training phase

.center[<img src="figures/training-phase.png" width="85%" />]

- Inference (prediction) phase

.center[<img src="figures/inference-phase.png" width="85%" />]

---
# Classification

- Given an input, predict the class (category) it belongs to

- $y \in \mathcal{Y}$ is a categorical variable, i.e. $\mathcal{Y}$ is discrete

- Examples
  - Given the picture of an astronomical object, determine whether it is a star, a galaxy, etc.
  - Given the picture of a person, identify the person
  - Given an email, detect whether it is spam or not

---
# Types of classification

- Binary classification: $\mathcal{Y} = \\{0, 1\\}$
- Multiclass classification: $\mathcal{Y} = [K] = \\{1, \dots, K\\}$
- Multi-label classification: $\mathcal{Y} = \\{ 0, 1 \\}^K$

Example: The model should predict, for each of dog, lion, and cat, whether it appears in the image

We choose $\mathcal{Y} = \\{ 0, 1 \\}^3$

$ f(\mathbf{x}) = (1, 1, 0)$ means the image $ \mathbf{x}$ contains a dog and a lion but no cat according to the model.
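---
# Types of classification

A small sketch of how a multi-label prediction in $\\{0, 1\\}^3$ can be read back as class names. The class order (dog, lion, cat) follows the example on the previous slide; the helper name `decode` is hypothetical.

```python
CLASSES = ("dog", "lion", "cat")


def decode(prediction):
    """Return the names of the classes whose indicator is 1."""
    return [name for name, flag in zip(CLASSES, prediction) if flag == 1]


present = decode((1, 1, 0))   # the prediction from the example above
```

Unlike multiclass classification, several classes (or none) can be predicted for the same image.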

---
# Regression

- Given an input, predict a real (continuous) variable associated with it

- $y \in \mathcal{Y} = \mathbb{R}$ is a real variable

- Examples
 - Given the profile of a person, predict their salary
 - Given house characteristics, predict its price
 - Given a patient's DNA information, predict their life expectancy
 - Given camera input, predict steering wheel position (autonomous driving)

---
# The iris dataset

- A famous dataset in statistics and machine learning
.center[
<table>
  <tr>
    <td style="border-width: 0px">
      <img src="iris_setosa.jpg" style="width: 200px;" />
      <br />Setosa
    </td>
    <td style="border-width: 0px">
      <img src="iris_versicolor.jpg" style="width: 200px;" />
      <br />Versicolor
    </td>
    <td style="border-width: 0px">
      <img src="iris_virginica.jpg" style="width: 200px;" />
      <br />Virginica
    </td>
  </tr>
</table>
]

.small[
| Sepal length | Sepal width | Petal length | Petal width | | Iris type  |
| ------------ | ----------- | ------------ | ----------- |-| ---------- |
| 6cm          | 3.4cm       | 4.5cm        | 1.6cm       | | versicolor |
| 5.7cm        | 3.8cm       | 1.7cm        | 0.3cm       | | setosa     |
| 6.5cm        | 3.2cm       | 5.1cm        | 2cm         | | virginica  |
| 5cm          | 3cm         | 1.6cm        | 0.2cm       | | setosa     |
]
---
# Data matrix and target vector

$$
\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1 \\\\
\mathbf{x}_2 \\\\
\mathbf{x}_3 \\\\
\mathbf{x}_4 \\\\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
6 & 3.4 & 4.5 & 1.6 \\\\
5.7 & 3.8 & 1.7 & 0.3 \\\\
6.5 & 3.2 & 5.1 & 2 \\\\
5 & 3 & 1.6 & 0.2 \\\\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}
\quad
\mathbf{y}
=
\begin{bmatrix}
y_1 \\\\
y_2 \\\\
y_3 \\\\
y_4 \\\\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
1 \\\\
0 \\\\
2 \\\\
0 \\\\
\vdots
\end{bmatrix}
$$

$\mathbf{X}$ is called the data matrix / design matrix and consists of $n$ samples (observations) with $d$ features (attributes)

$\mathbf{x}_i$ is called a feature vector

$\mathbf{y}$ is called the target or label vector

$y_i$ is called a target or label
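---
# Data matrix and target vector

As an illustration, the four rows of the iris table can be turned into a data matrix and a target vector with plain Python lists. The integer encoding (setosa = 0, versicolor = 1, virginica = 2) is chosen to match the vector $\mathbf{y}$ shown on the previous slide; the variable names are assumptions.

```python
# Rows copied from the iris table: four measurements in cm, then the iris type.
rows = [
    (6.0, 3.4, 4.5, 1.6, "versicolor"),
    (5.7, 3.8, 1.7, 0.3, "setosa"),
    (6.5, 3.2, 5.1, 2.0, "virginica"),
    (5.0, 3.0, 1.6, 0.2, "setosa"),
]

label_to_int = {"setosa": 0, "versicolor": 1, "virginica": 2}

X = [list(row[:4]) for row in rows]          # data matrix: one feature vector per row
y = [label_to_int[row[4]] for row in rows]   # target vector: one label per sample

n, d = len(X), len(X[0])                     # n samples, d features
```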
---
# Samples as points in a space

.center[
    <img src="sphx_glr_plot_iris_scatter_001.png" style="width: 500px;" />
]
Samples from the iris dataset live in a 4-dimensional space;
we only use 2 dimensions for visualization purposes.
---
# Inferring rules from data

<table>
<thead><tr>
	<th>Sepal length</th>
	<th>Sepal width</th>
	<th>Petal length</th>
	<th>Petal width</th>
</tr></thead>
<tbody><tr>
    <td>
    <img src="figures/iris_sepal_length_cm_hist.svg" style="width: 100%;">
    </td>
    <td>
    <img src="figures/iris_sepal_width_cm_hist.svg" style="width: 100%;">
    </td>
    <td>
    <img src="figures/iris_petal_length_cm_hist.svg" style="width: 100%;">
    </td>
    <td>
    <img src="figures/iris_petal_width_cm_hist.svg" style="width: 100%;">
    </td>
    </tr></tbody>
</table>
<img src="figures/legend_irises.svg" style='float: left;'>

<br/><br/>We can *infer* from the data that setosa irises have small petals.

Learning algorithms automate this process!
---
# Learning a model from examples

- A model can be seen as a function $f$ that maps a feature vector $\mathbf{x}$ (input) to a target $y$ (output).

- Given input-output pairs (training examples) $(\mathbf{x}_1, y_1)$, $(\mathbf{x}_2, y_2)$,
..., $(\mathbf{x}_n, y_n)$, we want to find a function $f$ such that

$$
f(\mathbf{x}_i) \approx y_i
$$

---
# Example: the nearest neighbor algorithm

$f(\mathbf{x})$ searches for the example $\mathbf{x}_i$ with smallest distance
$d(\mathbf{x}, \mathbf{x}_i)$ and returns the label $y_i$ associated with this example.

.center[<img src="sphx_glr_plot_iris_knn_001.png" width="65%"/>]
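---
# Example: the nearest neighbor algorithm

A minimal sketch of the nearest neighbor rule, assuming $d$ is the Euclidean distance (the slides do not fix a particular distance). The training samples are the four rows of the iris table; the query point is a made-up example.

```python
import math


def nearest_neighbor_predict(x, X_train, y_train):
    """Return the label of the training example closest to x."""
    distances = [math.dist(x, x_i) for x_i in X_train]
    closest = min(range(len(X_train)), key=distances.__getitem__)
    return y_train[closest]


# Training examples from the iris table (0 = setosa, 1 = versicolor, 2 = virginica).
X_train = [
    [6.0, 3.4, 4.5, 1.6],
    [5.7, 3.8, 1.7, 0.3],
    [6.5, 3.2, 5.1, 2.0],
    [5.0, 3.0, 1.6, 0.2],
]
y_train = [1, 0, 2, 0]

x_new = [5.1, 3.5, 1.4, 0.2]   # hypothetical new flower
y_new = nearest_neighbor_predict(x_new, X_train, y_train)
```

There is no real training step here: the "model" is simply the stored training set, and all the work happens at prediction time.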

---
# US Census data

.super-tiny[

| Age | Workclass | Education    | Marital-status     | Occupation         | Relationship | Race  | Sex  | Capital-gain | Hours-per-week | Native-country | Class |
| --- | --------- | ------------ | ------------------ | ------------------ | ------------ | ----- | ---- | ------------ | -------------- | -------------- | ----- |
| 25  | Private   | 11th         | Never-married      | Machine-op-inspct  | Own-child    | Black | Male | 0            | 40             | United-States  | <=50K |
| 38  | Private   | HS-grad      | Married-civ-spouse | Farming-fishing    | Husband     | White  | Male | 0            | 50             | United-States   | <=50K |
| 28  | Local-gov | Assoc-acdm   | Married-civ-spouse | Protective-serv    | Husband      | White | Male | 0            | 40             | United-States   | >50K  |
| 44  | Private   | Some-college | Married-civ-spouse | Machine-op-inspct  | Husband      | Black | Male | 7688         | 40             | United-States   | >50K  |

]

Does this person make more than 50K per year?
---
# Generalizing

- In machine learning, we want to predict on new data instances.
That is, for a feature vector $\mathbf{x}$ not seen in the training set,
we want to compute $f(\mathbf{x})$.

- In the census example, we want to predict the income of new
individuals, with a combination of jobs and demographics that we have
never seen.

- Many sources of variability: age, workclass, education, marital-status,
occupation, ...

*+* Noise: unexplainable variance

---
# Memorizing

* Store all known individuals (the census)
* Given a new individual, predict the income of its closest match in our database
(nearest neighbor predictor)

Trying out this strategy on the data we have, the census, *what error
rate do we expect?*

> **0 errors**

.red[Yet, we will make errors on **new** data]
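---
# Memorizing

The zero-error claim can be checked on a toy example. The sketch below reuses the four iris rows from earlier in place of the census, and assumes the Euclidean distance and that no two stored samples are identical with different labels.

```python
import math

# Four labeled samples from the iris table (0 = setosa, 1 = versicolor, 2 = virginica).
X_train = [
    [6.0, 3.4, 4.5, 1.6],
    [5.7, 3.8, 1.7, 0.3],
    [6.5, 3.2, 5.1, 2.0],
    [5.0, 3.0, 1.6, 0.2],
]
y_train = [1, 0, 2, 0]


def predict(x):
    """Nearest neighbor predictor: label of the closest stored sample."""
    closest = min(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
    return y_train[closest]


# Evaluate on the very data that was memorized.
train_errors = sum(predict(x_i) != y_i for x_i, y_i in zip(X_train, y_train))
```

Each stored sample is its own closest match (distance 0), so the count of training errors is 0. This says nothing about the error on new samples.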

---
# Generalizing  ≠  Memorizing

.center["test" data  ≠  "train" data]

.pull-left[Data on which the predictive model is applied]

.pull-right[Data used to learn the predictive model]

* Different sampling of noise
* Unobserved combination of features
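---
# Generalizing  ≠  Memorizing

One common way to estimate the error on new data is to hold out part of the available samples and never use them for training. Below is a minimal sketch; `split_train_test` is a hypothetical helper written for this slide (not a library function), and the 25% test fraction is an arbitrary choice.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold out a fraction of them as test data."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(8))               # stand-ins for 8 labeled samples
train_data, test_data = split_train_test(samples)
```

The model is learned on `train_data` only, and its error is then measured on `test_data`.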

---
# Other types of learning

- Unsupervised: learning without human annotations
  - Clustering: discovers groups (clusters) in the data
  - Dimensionality reduction
  - Anomaly detection

- Semi-supervised learning: learning from both annotated and non-annotated data

- Active learning: the algorithm can choose which samples need to be annotated

- And many others...
---
class: center, middle

# Model selection
---
# Model complexity

- There exists a wealth of machine learning models, some more complex and expressive
than others.
- What are the implications of model complexity?
- How do we pick one model rather than another?

.center[
<table>
  <tr>
    <td style="border-width: 0px">
      <img src="sphx_glr_plot_svm_non_linear_001.png" style="width: 400px;" />
      <br/> Linear
    </td>
    <td style="border-width: 0px; padding: 0px">
      <img src="sphx_glr_plot_svm_non_linear_002.png" style="width: 400px;" />
      <br />Non-linear
    </td>
  </tr>
</table>
]
---
## Varying the degree of a polynomial

.center[<img src="figures/polynomial_overfit_0.svg" width=75%>]