Introductory talk on
Self-supervised learning and transformers
What is it and do I need it?
Julian C. Schäfer-Zimmermann
Max Planck Institute of Animal Behavior
Department for the Ecology of Animal Societies
Communication and Collective Movement (CoCoMo) Group
Self-supervised learning
The concepts
The concepts
Predict everything from everything else
Meta Research Blog (March 4, 2021)
Self-supervised learning
Pretext tasks
Pretext tasks
Three categories
- Context-based
- Contrastive learning
- Masked modeling
Pretext tasks
Contrastive learning: Negative example-based
Negative example-based methods push down the loss of compatible pairs (blue dots) and push up the loss of incompatible pairs (green dots)
Meta Research Blog (March 4, 2021)
Pretext tasks
Contrastive learning: Negative example-based
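As a rough illustration of the negative-example idea (not the method from the cited blog post): a minimal InfoNCE/NT-Xent-style loss sketch in PyTorch, where row i of the two view batches forms the compatible pair and all other rows act as incompatible pairs. The names (`info_nce_loss`, `temperature`) are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over a batch of embedding pairs.

    z_a, z_b: (N, D) embeddings of two augmented views. Row i of z_a and
    row i of z_b are the compatible ("positive") pair; every other row in
    the batch serves as an incompatible ("negative") example.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / temperature      # (N, N) cosine-similarity matrix
    targets = torch.arange(z_a.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# toy usage: 8 pairs of 128-dimensional embeddings
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```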
Pretext tasks
Contrastive learning
Pretext tasks
Contrastive learning: Self-distillation-based
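A minimal self-distillation sketch, assuming a BYOL/DINO-style setup (not code from any specific paper): the student is trained to match targets from a teacher whose weights are an exponential moving average (EMA) of the student's, so no negative examples are needed. The `Linear` encoder and `tau` value are stand-ins.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, tau=0.996):
    """Teacher weights follow an exponential moving average of the student's."""
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1.0 - tau)

student = torch.nn.Linear(64, 32)        # stand-in for a real encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)              # teacher is never trained directly

view_a, view_b = torch.randn(8, 64), torch.randn(8, 64)  # two augmented views
loss = F.mse_loss(student(view_a), teacher(view_b))      # match teacher targets
loss.backward()                          # gradients flow only into the student
ema_update(teacher, student)
```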
Pretext tasks
Contrastive learning
Pretext tasks
Contrastive learning: Feature decorrelation-based
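A minimal feature-decorrelation sketch, loosely following the Barlow Twins idea (the weighting `lam` and all shapes are illustrative): the cross-correlation matrix of the two views' standardized embeddings is driven toward the identity, which avoids collapse without any negative examples.

```python
import torch

def decorrelation_loss(z_a, z_b, lam=5e-3):
    """Push the views' cross-correlation matrix toward the identity."""
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)  # standardize each feature
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.T @ z_b / n                     # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag

loss = decorrelation_loss(torch.randn(16, 32), torch.randn(16, 32))
```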
Pretext tasks
Masked modeling
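A minimal masked-modeling sketch (masking ratio, shapes, and the `Linear` stand-in encoder are illustrative): random time steps are hidden and the loss is computed only at the masked positions; real systems replace masked steps with a learned mask token rather than zeros.

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 100, 16)              # (batch, time, features)
mask = torch.rand(4, 100) < 0.5          # hide roughly half of the time steps
x_in = x.clone()
x_in[mask] = 0.0                         # zeros here; a learned mask token in practice

model = torch.nn.Linear(16, 16)          # stand-in for a real encoder
pred = model(x_in)
loss = F.mse_loss(pred[mask], x[mask])   # loss only at the masked positions
```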
Self-supervised learning
Overview/Applicability to bioacoustics
Overview
What to choose when?
|               | CLR (contrastive learning)   | MM (masked modeling)        |
| Dataset size  | Data hungry                  | Can handle smaller datasets |
| Learning      | Focus on global views        | Focus on local views        |
| Scaling       | Scales with larger datasets  | Inferior data scaling       |
| Disadvantages | Vulnerable to overfitting    | Data-filling challenges     |
Applicability to bioacoustics
Bioacoustics lives in two domains
Sound is a temporal phenomenon governed by wave mechanics.
Information is encoded in the time and phase domains:
- Two similar tones cause uneven phase cancellations across frequencies (see the sketch below)
- Two simultaneous voices with similar pitch are difficult to tell apart
- Noisy and complex auditory scenes make it particularly difficult to distinguish sound events
Often-used log scaling biases the input towards human hearing
Computer vision models are empirically a good fit, but conceptually they're not
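A small NumPy sketch of the phase point above (frequencies and sample rate are illustrative): two tones only 3 Hz apart sum to a beating waveform whose envelope is produced entirely by time-varying phase cancellation.

```python
import numpy as np

sr = 16000                               # sample rate in Hz (illustrative)
t = np.arange(sr) / sr                   # one second of sample times
tone_a = np.sin(2 * np.pi * 440.0 * t)
tone_b = np.sin(2 * np.pi * 443.0 * t)
mix = tone_a + tone_b                    # beats at |443 - 440| = 3 Hz: the
                                         # envelope rises and falls as the two
                                         # phases align and then cancel
```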
Applicability to bioacoustics
Data sparsity
Animal vocalizations are short and rare events
- For example:
- MeerKAT [1]: 184h of labeled audio, of which 7.8h (4.2%) contain meerkat vocalizations
- BirdVox-full-night [2]: 4.5M clips, each of duration 150 ms, only 35k (0.7%) are positive
- Hainan gibbon calls [3]: 256h of fully labeled PAM data with 1,246 events of a few seconds each (0.01%)
- Marine datasets have even higher reported sparsity levels [4]
[1] Schäfer-Zimmermann, J. C., et al. (2024). Preprint at arXiv:2406.01253.
[2] Lostanlen, V., et al. (2018). IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[3] Dufourq, E., et al. (2021). Remote Sensing in Ecology and Conservation, 7(3), 475-487.
[4] Allen, A. N., et al. (2021). Frontiers in Marine Science, 8, 607321.
Applicability to bioacoustics
Our response (not necessarily the best) to these challenges
animal2vec
- Mask-based self-distillation using raw audio input
- Better suited to sparse data, since no negative mining is required
- Retains phase information, since no spectrograms are computed (see the sketch below)
Schäfer-Zimmermann, J. C., et al. (2024). Preprint at arXiv:2406.01253
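A loose sketch of the masked self-distillation idea on raw audio; emphatically not the actual animal2vec implementation (its encoder, masking scheme, and targets differ). The teacher sees the full waveform, the student sees a masked copy and regresses the teacher's features; no negative mining, no spectrograms.

```python
import copy
import torch
import torch.nn.functional as F

encoder = torch.nn.Conv1d(1, 32, kernel_size=400, stride=320)  # stand-in encoder
teacher = copy.deepcopy(encoder)
for p in teacher.parameters():
    p.requires_grad_(False)              # teacher is an EMA copy in practice

wav = torch.randn(2, 1, 16000)           # one second of raw audio at 16 kHz
masked = wav.clone()
masked[:, :, 4000:8000] = 0.0            # mask a contiguous span of samples

with torch.no_grad():
    targets = teacher(wav)               # features from the unmasked waveform
loss = F.mse_loss(encoder(masked), targets)  # real systems restrict the loss
                                             # to masked positions only
```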
Let's take a break
Next: Transformer
Attention matrix
\[\text{"Thinking": } \begin{pmatrix} a_{1} & a_{2} \end{pmatrix} \] \[\text{"Machines": } \begin{pmatrix} b_{1} & b_{2} \end{pmatrix} \]
\[\text{Input: } X=\begin{pmatrix} a_{1} & a_{2} \\ b_{1} & b_{2} \end{pmatrix} \]
Attention matrix
\[\text{Query Matrix: } Q= \begin{pmatrix} a_{1} & a_{2}\\ b_{1} & b_{2} \end{pmatrix} \cdot \begin{pmatrix} W^{q}_{1} & W^{q}_{2}\\ W^{q}_{3} & W^{q}_{4}\\ \end{pmatrix} \] \[ = \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \]
Attention matrix
\[\text{Query Matrix: } Q= \begin{pmatrix} a_{1} & a_{2}\\ b_{1} & b_{2} \end{pmatrix} \cdot \begin{pmatrix} W^{q}_{1} & W^{q}_{2}\\ W^{q}_{3} & W^{q}_{4}\\ \end{pmatrix} \] \[ = \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \]
\[\text{Key Matrix: } K= \begin{pmatrix} a_{1} & a_{2}\\ b_{1} & b_{2} \end{pmatrix} \cdot \begin{pmatrix} W^{k}_{1} & W^{k}_{2}\\ W^{k}_{3} & W^{k}_{4}\\ \end{pmatrix} \] \[ = \begin{pmatrix} a_{1}W^{k}_{1} + a_{2}W^{k}_{3} & a_{1}W^{k}_{2} + a_{2}W^{k}_{4} \\ b_{1}W^{k}_{1} + b_{2}W^{k}_{3} & b_{1}W^{k}_{2} + b_{2}W^{k}_{4} \end{pmatrix} \]
Attention matrix
\[\text{Attention Matrix: } Q \cdot K^{T}=\] \[ \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \cdot \begin{pmatrix} a_{1}W^{k}_{1} + a_{2}W^{k}_{3} & b_{1}W^{k}_{1} + b_{2}W^{k}_{3} \\ a_{1}W^{k}_{2} + a_{2}W^{k}_{4} & b_{1}W^{k}_{2} + b_{2}W^{k}_{4} \end{pmatrix} \]
\[ = \begin{pmatrix} (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3}) + (a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \\ (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \end{pmatrix} \]
Attention matrix
\[\text{Attention Matrix: } Q \cdot K^{T}=\] \[ \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \cdot \begin{pmatrix} a_{1}W^{k}_{1} + a_{2}W^{k}_{3} & b_{1}W^{k}_{1} + b_{2}W^{k}_{3} \\ a_{1}W^{k}_{2} + a_{2}W^{k}_{4} & b_{1}W^{k}_{2} + b_{2}W^{k}_{4} \end{pmatrix} \]
\[ = \begin{pmatrix} ({\bf\color{green} a_{1}}{\bf\color{purple}W^{q}_{1}} + {\bf\color{green} a_{2}}{\bf\color{purple}W^{q}_{3}})({\bf\color{green} a_{1}}{\bf\color{orange}W^{k}_{1}} + {\bf\color{green} a_{2}}{\bf\color{orange}W^{k}_{3}})+({\bf\color{green} a_{1}}{\bf\color{purple}W^{q}_{2}} + {\bf\color{green} a_{2}}{\bf\color{purple}W^{q}_{4}})({\bf\color{green} a_{1}}{\bf\color{orange}W^{k}_{2}} + {\bf\color{green} a_{2}}{\bf\color{orange}W^{k}_{4}}) & (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3}) + (a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \\ (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \end{pmatrix} \]
All weights from both weight matrices attend only to the first word
Attention matrix
\[\text{Attention Matrix: } Q \cdot K^{T}=\] \[ \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \cdot \begin{pmatrix} a_{1}W^{k}_{1} + a_{2}W^{k}_{3} & b_{1}W^{k}_{1} + b_{2}W^{k}_{3} \\ a_{1}W^{k}_{2} + a_{2}W^{k}_{4} & b_{1}W^{k}_{2} + b_{2}W^{k}_{4} \end{pmatrix} \]
\[ = \begin{pmatrix} (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3}) + (a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \\ (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & ({\bf\color{green}b_{1}}{\bf\color{purple}W^{q}_{1}} + {\bf\color{green}b_{2}}{\bf\color{purple}W^{q}_{3}})({\bf\color{green}b_{1}}{\bf\color{orange}W^{k}_{1}} + {\bf\color{green}b_{2}}{\bf\color{orange}W^{k}_{3}})+({\bf\color{green}b_{1}}{\bf\color{purple}W^{q}_{2}} + {\bf\color{green}b_{2}}{\bf\color{purple}W^{q}_{4}})({\bf\color{green}b_{1}}{\bf\color{orange}W^{k}_{2}} + {\bf\color{green}b_{2}}{\bf\color{orange}W^{k}_{4}}) \end{pmatrix} \]
All weights from both weight matrices attend only to the second word
Attention matrix
\[\text{Attention Matrix: } Q \cdot K^{T}=\] \[ \begin{pmatrix} a_{1}W^{q}_{1} + a_{2}W^{q}_{3} & a_{1}W^{q}_{2} + a_{2}W^{q}_{4} \\ b_{1}W^{q}_{1} + b_{2}W^{q}_{3} & b_{1}W^{q}_{2} + b_{2}W^{q}_{4} \end{pmatrix} \cdot \begin{pmatrix} a_{1}W^{k}_{1} + a_{2}W^{k}_{3} & b_{1}W^{k}_{1} + b_{2}W^{k}_{3} \\ a_{1}W^{k}_{2} + a_{2}W^{k}_{4} & b_{1}W^{k}_{2} + b_{2}W^{k}_{4} \end{pmatrix} \]
\[ = \begin{pmatrix} (a_{1}W^{q}_{1} + a_{2}W^{q}_{3})(a_{1}W^{k}_{1} + a_{2}W^{k}_{3})+(a_{1}W^{q}_{2} + a_{2}W^{q}_{4})(a_{1}W^{k}_{2} + a_{2}W^{k}_{4}) & ({\bf\color{green}a_{1}}{\bf\color{purple}W^{q}_{1}} + {\bf\color{green}a_{2}}{\bf\color{purple}W^{q}_{3}})({\bf\color{green}b_{1}}{\bf\color{orange}W^{k}_{1}} + {\bf\color{green}b_{2}}{\bf\color{orange}W^{k}_{3}}) + ({\bf\color{green}a_{1}}{\bf\color{purple}W^{q}_{2}} + {\bf\color{green}a_{2}}{\bf\color{purple}W^{q}_{4}})({\bf\color{green}b_{1}}{\bf\color{orange}W^{k}_{2}} + {\bf\color{green}b_{2}}{\bf\color{orange}W^{k}_{4}}) \\ ({\bf\color{green}b_{1}}{\bf\color{purple}W^{q}_{1}} + {\bf\color{green}b_{2}}{\bf\color{purple}W^{q}_{3}})({\bf\color{green}a_{1}}{\bf\color{orange}W^{k}_{1}} + {\bf\color{green}a_{2}}{\bf\color{orange}W^{k}_{3}})+({\bf\color{green}b_{1}}{\bf\color{purple}W^{q}_{2}} + {\bf\color{green}b_{2}}{\bf\color{purple}W^{q}_{4}})({\bf\color{green}a_{1}}{\bf\color{orange}W^{k}_{2}} + {\bf\color{green}a_{2}}{\bf\color{orange}W^{k}_{4}}) & (b_{1}W^{q}_{1} + b_{2}W^{q}_{3})(b_{1}W^{k}_{1} + b_{2}W^{k}_{3})+(b_{1}W^{q}_{2} + b_{2}W^{q}_{4})(b_{1}W^{k}_{2} + b_{2}W^{k}_{4}) \end{pmatrix} \]
All weights from both weight matrices attend to both words
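A small NumPy sketch that reproduces the walkthrough numerically and adds the steps the slides stop short of: the \(1/\sqrt{d_k}\) scaling, the row-wise softmax, and the value projection. The embeddings and weight matrices below are arbitrary illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.5],                # "Thinking" -> (a1, a2)
              [0.2, 1.5]])               # "Machines" -> (b1, b2)
Wq, Wk, Wv = (rng.standard_normal((2, 2)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv         # query, key, value projections
scores = Q @ K.T / np.sqrt(K.shape[1])   # scaled attention matrix from above
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
output = weights @ V                     # attention output, one row per word
```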
How to input data (Frontend)