class: center, middle

# Unsupervised learning

Hugo RICHARD

.affiliations[ ENSAE, CREST ]

.footnote.tiny[Credits: Based on the [slides](https://data-psl.github.io/lectures2021/slides/08_unsupervised_learning/) of Pierre Ablin]

---

# Overview:

- Dimension reduction methods
- Matrix factorization

---

# Unsupervised learning

We have some data, but no label :(

--

- Labelling is often time-consuming
- Most data is never labelled (e.g. the bulk of the images on Google)

--

Goal of unsupervised learning: analyse the structure of the dataset

--

.center[
]

---

# Dimension reduction

Data $\mathbf{x}$ lives in dimension $p$

### Notation:

- $\mathbf{x}_i$ is sample $i$ (a vector of size $p$)
- $x_i^a$ is the feature $a$ of $\mathbf{x}_i$ (a real number)

--

**Dimension reduction**: build a transform $\mathbf{z} = f(\mathbf{x})$, where $\mathbf{z}$ is of dimension $q < p$, such that most of the information carried by $\mathbf{x}$ is kept in $\mathbf{z}$

--

Dimension reduction questions:

- What kind of transform?
- How to define "information"?

---

# Principal component analysis (PCA)

- Linear transform
- Information measured as power

--

We have $n$ samples $\mathbf{x}_1, \dots, \mathbf{x}_n$

Assume that the dataset is centered ($\sum \mathbf{x}_i = 0$).

--

To extract $1$ component:

- Look for a projection: $z_i = \mathbf{w}^{\top}\mathbf{x}_i$, where $\mathbf{w}$ is a vector of unit norm
- We want to maximize the power $\sum z_i^2$

---

# PCA and correlation

- Look for a projection: $z_i = \mathbf{w}^{\top}\mathbf{x}_i$, where $\mathbf{w}$ is a vector of unit norm
- We want to maximize the power $P = \sum z_i^2$

--

We have:

$$P = \sum (\mathbf{w}^{\top}\mathbf{x}_i)^2 = \sum \mathbf{w}^{\top}\mathbf{x}_i\mathbf{x}_i^{\top}\mathbf{w} = \mathbf{w}^{\top}\left(\sum \mathbf{x}_i\mathbf{x}_i^{\top}\right)\mathbf{w} = \mathbf{w}^{\top}C\mathbf{w}$$

where $C$ is the $p\times p$ matrix

$$C = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^{\top}$$

--

$C$ is called the **correlation** matrix

---

# Correlation matrix

$$C = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^{\top}$$

Diagonal coefficients:

--

$C^{aa} = \sum_{i=1}^n (x_i^a)^2$

'Power' of the feature $a$

--

Off-diagonal coefficients:

$C^{ab} = \sum_{i=1}^n (x_i^a)(x_i^b)$

Correlation between features $a$ and $b$

---

# Correlation matrix

.center[

]
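---

# Correlation matrix: code sketch

A minimal `numpy` sketch of these quantities (toy data, assumed for illustration; `X` holds one sample per column):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 1000
X = rng.standard_normal((p, n))        # toy data, one sample per column
X -= X.mean(axis=1, keepdims=True)     # center the dataset

C = X @ X.T                            # correlation matrix, shape (p, p)
print(np.allclose(np.diag(C), (X ** 2).sum(axis=1)))  # diagonal = power of each feature

w = rng.standard_normal(p)
w /= np.linalg.norm(w)                 # unit-norm direction
z = w @ X                              # projections z_i = w^T x_i
print(np.allclose(z @ z, w @ C @ w))   # power of the projection = w^T C w
```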
---

# Correlation matrix

.center[
] --- # Correlation matrix .center[
] --- # Correlation matrix .center[
] --- # Correlation matrix .center[
] --- # Correlation matrix .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
] --- # Direction with most power .center[
]

---

# PCA and eigenvalue decomposition

The vector $\mathbf{w}$ that maximizes $\mathbf{w}^{\top}C\mathbf{w}$ is the leading eigenvector of $C$. The corresponding power is the eigenvalue.

--

First PCA dimension: $z^1 = \mathbf{w}_1^{\top}\mathbf{x}$

--

Unexplained data:

$$\tilde{\mathbf{x}} = \mathbf{x} - z^1 \mathbf{w}_1$$

--

- What if $\tilde{\mathbf{x}} = 0$?

---

# PCA: other components

Unexplained data:

$$\tilde{\mathbf{x}} = \mathbf{x} - z^1 \mathbf{w}_1$$

We can repeat the procedure and find $\mathbf{w}_2$ that maximizes the power of $z^2 = \mathbf{w}_2^{\top}\tilde{\mathbf{x}}$

--

Corresponds to the second eigenvector of the correlation matrix $C$ (see the code sketch a few slides below).

--

You can iterate as long as you want, giving you $q$ components.

--

If the unexplained data $= 0$ with $q$ components, it means that the data is of dimension $q$

---

# What is the dimension of real datasets?

As we have seen, the correlation matrix $C$ (equivalently, the covariance matrix of the centered data) is important: its eigenvalues give the corresponding powers.

Digits dataset: `from sklearn.datasets import load_digits`

.center[

]
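---

# PCA by eigendecomposition: code sketch

Going back to the previous slides: a short `numpy` sketch of the eigendecomposition and deflation steps, on toy data (`np.linalg.eigh` returns eigenvalues in increasing order):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 1000
X = rng.standard_normal((p, n))
X -= X.mean(axis=1, keepdims=True)      # centered data, one sample per column

C = X @ X.T                             # correlation matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues sorted in increasing order
w1 = eigvecs[:, -1]                     # leading eigenvector = first PCA direction
z1 = w1 @ X                             # first component
print(z1 @ z1, eigvals[-1])             # power of w1 = leading eigenvalue

# Deflation: remove what w1 explains; the best direction on the
# residual coincides with the second eigenvector of C
X_tilde = X - np.outer(w1, z1)
w2 = np.linalg.eigh(X_tilde @ X_tilde.T)[1][:, -1]
print(abs(w2 @ eigvecs[:, -2]))         # close to 1: w2 matches the second eigenvector
```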
---

# What is the dimension of real datasets?

Spectrum of the covariance:

.center[

]
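---

# Spectrum of the covariance: code sketch

A spectrum like the one above can be computed along these lines (a sketch, not necessarily the exact code behind the figure):

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data            # shape (n_samples, 64): scikit-learn convention, one sample per row
X = X - X.mean(axis=0)            # center the dataset

C = X.T @ X                       # correlation matrix, shape (64, 64)
spectrum = np.linalg.eigvalsh(C)[::-1]   # eigenvalues, sorted in decreasing order
print(spectrum[:10] / spectrum.sum())    # a few directions carry most of the power
```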
---

# 2-D PCA on digits:

.center[
]

---

# 2-D PCA on digits:

.center[

]
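---

# 2-D PCA on digits: code sketch

A sketch of how such a 2-D embedding is typically computed with `scikit-learn` (the plotting choices are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target          # X has shape (n_samples, 64)

Z = PCA(n_components=2).fit_transform(X)   # keep the 2 directions with most power
# The labels y are only used to color the plot, never to fit the PCA
plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```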
---

# Matrix factorization

---

# Another formulation of PCA

The data $\mathbf{x}_1,\dots, \mathbf{x}_n$ can be represented as a matrix $X$ of size $p \times n$: each *column* is a sample.

--

## **WARNING!!!!**

In `scikit-learn`, the data is represented as $n\times p$

---

# PCA as a matrix factorization

The data $\mathbf{x}_1,\dots, \mathbf{x}_n$ can be represented as a matrix $X$ of size $p \times n$: each *column* is a sample.

--

- PCA finds a set of weights $\mathbf{w}_1, \cdots, \mathbf{w}_q$, which can be seen as a matrix $W$ of size $p\times q$.
- The transformed data $z^1, \dots, z^q$ can be seen as a matrix $Z$ of size $q \times n$

We have:

--

$$X \simeq W Z$$

--

### Data $X$ is **factorized** as a product of two matrices

---

# Matrix factorization: general principle

Factorize $X$ as a product

$$X\simeq WZ$$

- The factorization can be *undercomplete*: $q \leq p$
- or *overcomplete*: $q \geq p$

--

Inherent indeterminacy: if $(W, Z)$ is a factorization of $X$, then for any invertible matrix $Q$ of size $q\times q$,

$$(WQ, Q^{-1}Z)$$

is also a factorization (because $WQQ^{-1}Z = WZ$)

---

# Usual matrix factorization methods

---

# Sparse dictionary learning

$W$ is called the *dictionary* and $Z$ the *code*

--

### Specificity:

The code is **sparse**:

- Each $\mathbf{z}_i$ only has a few non-zero coefficients
- Each sample $\mathbf{x}_i$ is represented as a linear combination of a **few** elements of the dictionary

---

# Sparse dictionary learning: example

On an image dataset, the learned dictionary:

.center[

]

--

### Gabor filters
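---

# Sparse dictionary learning: code sketch

A sketch with `scikit-learn`, on the digits data rather than on the image patches behind the figure above (note the $n \times p$ convention: here $X \simeq Z W$, with the code $Z$ of shape $n \times q$ and the dictionary $W$ of shape $q \times p$):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

X = load_digits().data                      # shape (n_samples, 64)
X = X - X.mean(axis=0)

# Learn a dictionary of 36 atoms; each sample is coded with few non-zero coefficients
dico = MiniBatchDictionaryLearning(n_components=36, alpha=1.0, random_state=0)
Z = dico.fit_transform(X)                   # sparse code, shape (n_samples, 36)
W = dico.components_                        # dictionary, shape (36, 64)

print("non-zeros per sample:", (Z != 0).sum(axis=1).mean())
print("relative error:", np.linalg.norm(X - Z @ W) / np.linalg.norm(X))
```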
---

# Non-negative matrix factorization

### Specificity

- Works only when the data features are all non-negative
- $W$ and $Z$ contain only non-negative values

---

# Non-negative matrix factorization: applications

$X$: spectrogram of a song. $p$ = frequency bins, $n$ = time frames

.center[

]
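---

# Non-negative matrix factorization: code sketch

A sketch with `scikit-learn`'s `NMF` on a synthetic non-negative matrix standing in for a magnitude spectrogram (the audio processing itself is not part of this example):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic "spectrogram": 3 spectral signatures with random non-negative activations
true_W = rng.random((128, 3))             # 128 frequency bins, 3 sources
true_Z = rng.random((3, 500))             # 500 time frames
X = true_W @ true_Z                       # non-negative by construction

model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)                # spectral signatures, shape (128, 3)
Z = model.components_                     # temporal activations, shape (3, 500)
print(np.linalg.norm(X - W @ Z) / np.linalg.norm(X))   # relative reconstruction error
```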
---

# Non-negative matrix factorization: applications

- $W$ contains the spectral signatures

--

.center[
] -- - $Z$ contains temporal activations -- .center[
] --- # Non-negative matrix factorization: applications .center[
] .center[
] ### Corresponds to this signal .center[
]

---

# Non-negative matrix factorization: applications

- Each element in $W$ corresponds to a distinct frequency signature:

### NMF can separate the musical instruments from each other (drums, piano, etc.)

--

This is completely unsupervised!

---

# Independent component analysis

### Specificity

- $W$ is square ($p=q$)
- The rows of $Z$ are statistically **independent**

--

PCA gives decorrelated rows in $Z$. ICA goes further by imposing independence

--

## Independence => decorrelation (but not the converse)

---

# Power of independence:

$$X$$

.center[
] --- # Power of independence: $$Z_{PCA}$$ .center[
]

---

# Power of independence:

$$Z_{ICA}$$

.center[

]
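---

# Power of independence: code sketch

A minimal `scikit-learn` sketch in the spirit of the toy example above (the sources and mixing matrix are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))].T    # 2 independent sources, shape (2, n)
A = np.array([[1.0, 0.5], [0.7, 1.0]])                # mixing matrix
X = A @ S                                             # observed mixtures

# scikit-learn expects one sample per row, hence the transposes
Z_pca = PCA(whiten=True).fit_transform(X.T).T         # decorrelated, but still mixed
Z_ica = FastICA(n_components=2, random_state=0).fit_transform(X.T).T

print(np.corrcoef(np.vstack([S, Z_pca]))[:2, 2:].round(2))  # PCA outputs: still mixtures
print(np.corrcoef(np.vstack([S, Z_ica]))[:2, 2:].round(2))  # ICA outputs: one source each
```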
---

### Electrocardiogram of a pregnant mother

.center[
] --- ### ICA decomposition .center[
]