Heterodyning Audiofrontend

Learning to look for the right frequencies

Julian C. Schäfer-Zimmermann

Max Planck Institute of Animal Behavior

Department for the Ecology of Animal Societies

Communication and Collective Movement (CoCoMo) Group

animal2vec

Quadratic Complexity in Attention: $O(N^2)$

Sequence Length ($N$): 5

Total Dot Products ($N^2$): 25

Sample Rate vs. Sequence Length

Sampling Rate ($f_s$): 12 kHz

Sequence Token Count ($N$): 6

Attention Matrix ($N^2$): 36

Signal $x(t)$

Example

$N = 3 \cdot 96{,}000 = 288{,}000$

Example

$N = 3 \cdot 4{,}000 = 12{,}000$

A $576\times$ reduction in attention matrix size: $(288{,}000 / 12{,}000)^2 = 24^2 = 576$
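The arithmetic behind this factor can be checked with a short sketch. It assumes, as the two examples do, that every audio sample becomes one token; the function name `attention_entries` is made up for illustration.

```python
def attention_entries(duration_s: float, sample_rate_hz: int) -> int:
    """Size of the N x N attention matrix when every audio sample is one token."""
    n_tokens = round(duration_s * sample_rate_hz)
    return n_tokens * n_tokens


full_rate = attention_entries(3, 96_000)    # N = 288,000
downshifted = attention_entries(3, 4_000)   # N = 12,000
reduction = full_rate // downshifted
print(reduction)  # 576
```

The sequence gets 24 times shorter, and because attention cost grows with $N^2$, the matrix shrinks by $24^2$.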

Heterodyning

Some math ... sorry

Heterodyning

Assuming two signals:

\[ S_1(t) = A_1 \cos(\omega_1 t) \]

\[ S_2(t) = A_2 \cos(\omega_2 t) \]

$S_1(t)$ is the signal we want to downshift

$S_2(t)$ is the signal we use to downshift $S_1(t)$

Heterodyning

Heterodyning is defined as the mathematical product of these two time-domain signals:

\[ S_{\text{out}}(t) = S_1(t) \cdot S_2(t) \]

Substituting the signal functions:

\[ S_{\text{out}}(t) = A_1 A_2 \cos(\omega_1 t) \cos(\omega_2 t) \]

Heterodyning

To resolve the frequency components, apply the trigonometric product-to-sum identity:

\[ \cos(\alpha) \cos(\beta) = \frac{1}{2} \left[ \cos(\alpha - \beta) + \cos(\alpha + \beta) \right] \]

Substituting $\alpha = \omega_1 t$ and $\beta = \omega_2 t$.

Heterodyning

Expanding the terms maps the output into two distinct frequencies:

\[ S_{\text{out}}(t) = \frac{A_1 A_2}{2} \left[ \cos\left( (\omega_1 - \omega_2)t\right) + \cos\left( (\omega_1 + \omega_2)t \right) \right] \]

Heterodyning

The system maps the original frequencies to two new spectral coordinates:

Difference Frequency (Intermediate Frequency / IF), used for down-conversion:
\[ \omega_{\text{IF}} = |\omega_1 - \omega_2| \]
Sum Frequency (Up-conversion), typically removed by lowpass or bandpass filtering:
\[ \omega_{\text{sum}} = \omega_1 + \omega_2 \]
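The product-to-sum result can be verified numerically. The following is a minimal sketch in plain Python, not the frontend's implementation: the 12 kHz input tone, the amplitudes, and the 96 kHz sampling rate are assumed values, and `tone_amplitude` is a helper written for this example (a single-bin DFT).

```python
import math


def tone_amplitude(signal, freq_hz, sample_rate_hz):
    """Amplitude of one frequency component, from a single-bin DFT."""
    step = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(s * math.cos(step * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(step * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


fs = 96_000                  # assumed sampling rate
n = 960                      # 10 ms: a whole number of periods for every tone below
f1, f2 = 12_000, 10_000      # assumed input tone, and the mixing tone
a1, a2 = 1.0, 0.5            # assumed amplitudes A_1, A_2

s1 = [a1 * math.cos(2 * math.pi * f1 * i / fs) for i in range(n)]
s2 = [a2 * math.cos(2 * math.pi * f2 * i / fs) for i in range(n)]
s_out = [x * y for x, y in zip(s1, s2)]   # heterodyning: sample-wise product

# Energy appears only at f1 - f2 and f1 + f2, each with amplitude A_1 * A_2 / 2.
for f in (f1 - f2, f1 + f2, f1, f2):
    print(f, round(tone_amplitude(s_out, f, fs), 6))
```

The original 12 kHz and 10 kHz components vanish from the product; only the 2 kHz difference and the 22 kHz sum remain.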

Example

We multiply by a 10 kHz signal to downshift

Example

Up- and Downshifted signals are visible $\Rightarrow$ Downsample

Example

Only the downshifted mixture remains
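How the sum component disappears can be sketched as a lowpass filter followed by downsampling. Everything specific here is an assumption for illustration: the 12 kHz input tone, the 96 kHz to 16 kHz rate change, and the 101-tap Hamming-windowed sinc filter. The slides do not say which filter or resampler the frontend uses.

```python
import math


def tone_amplitude(signal, freq_hz, sample_rate_hz):
    """Amplitude of one frequency component, from a single-bin DFT."""
    step = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(s * math.cos(step * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(step * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


def lowpass_taps(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc lowpass FIR, normalised to unit gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    taps = []
    for i in range(num_taps):
        x = i - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_valid(signal, taps):
    """Convolve, keeping only outputs computed from real (unpadded) samples."""
    k = len(taps)
    return [sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]


fs, factor = 96_000, 6
fs_low = fs // factor                        # 16 kHz after downsampling
taps = lowpass_taps(5_000, fs)
n = 960 + len(taps) - 1                      # extra samples for the filter warm-up
mixed = [math.cos(2 * math.pi * 12_000 * i / fs)
         * 0.5 * math.cos(2 * math.pi * 10_000 * i / fs) for i in range(n)]

naive = mixed[:960][::factor]                # no filter: 22 kHz aliases to 6 kHz
clean = fir_valid(mixed, taps)[::factor]     # filter first, then keep every 6th sample

print(tone_amplitude(naive, 6_000, fs_low))  # alias of the sum component survives
print(tone_amplitude(clean, 6_000, fs_low))  # suppressed by the lowpass
print(tone_amplitude(clean, 2_000, fs_low))  # downshifted component is kept
```

Skipping the filter does not remove the 22 kHz component; it folds it back to 6 kHz at the lower rate, which is why the lowpass step comes before the rate reduction in this sketch.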

Architecture and test scenario

Input Signal

Audio / SR

Heterodyning Module

Downshifting and adaptive interpolation

Hypernetwork

Predicts Filter Start Frequencies

Projecting Module

Convolutional stack & Projection

Loss

$\sum \mathcal{L}_{components}$

Look at changes, not absolutes

Mathematical Formulation

Power spectrogram from complex STFT $X(f, t)$:

$$P(f, t) = |X(f, t)|^2$$

Temporal variance per frequency bin $f$:

$$\sigma_P^2(f) = \text{Var}_t(P(f, t))$$

Combined soft Cauchy bandpass mask (soft-OR):

$$M(f) = 1 - \prod_{k=1}^K (1 - M_k(f))$$
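A minimal sketch of the soft-OR combination follows. The per-band shape is an assumption: a Cauchy (Lorentzian) profile $M_k(f) = 1 / (1 + ((f - c_k)/\gamma_k)^2)$ with a hypothetical centre $c_k$ and half-width $\gamma_k$. The slides only state that the hypernetwork predicts filter start frequencies, so the actual parameterization may differ.

```python
def cauchy_band(freq_hz, center_hz, half_width_hz):
    """Assumed soft bandpass shape: 1 at the centre, 0.5 one half-width away."""
    return 1.0 / (1.0 + ((freq_hz - center_hz) / half_width_hz) ** 2)


def soft_or_mask(freqs_hz, bands):
    """M(f) = 1 - prod_k (1 - M_k(f)) for a list of (center, half_width) bands."""
    mask = []
    for f in freqs_hz:
        miss = 1.0                      # probability-like "no band covers f"
        for center, half_width in bands:
            miss *= 1.0 - cauchy_band(f, center, half_width)
        mask.append(1.0 - miss)
    return mask


freqs = [0, 1000, 2000, 3000, 4000]
bands = [(1000, 200), (3000, 200)]      # hypothetical band parameters
mask = soft_or_mask(freqs, bands)
print(mask)
```

The soft-OR stays within $[0, 1]$ and is never smaller than any single $M_k(f)$, so overlapping bands reinforce each other without exceeding full coverage.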

Transient Energy Coverage Loss

Objective Function

Log-ratio of total variance to captured variance:

$$\mathcal{L}_{\text{energy}_{\text{var}}} = \log\left(\sum_f \sigma_P^2(f)\right) - \log\left(\sum_f M(f) \cdot \sigma_P^2(f) + \epsilon\right)$$

Physical Interpretation

Bypasses static energy (e.g., constant background hums like wind or water).
Measures energy changes over time ($\sigma_P^2$).
Forces bandpass filters to target frequency bins containing high-magnitude transient bursts.
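The effect described above can be reproduced on a toy spectrogram. This is a sketch in plain Python, not the training code: the spectrogram values and masks are invented, and $\text{Var}_t$ is assumed to be the population variance.

```python
import math
from statistics import pvariance


def energy_var_loss(power, mask, eps=1e-8):
    """power[f][t] is P(f, t); mask[f] is M(f) in [0, 1]."""
    var_per_bin = [pvariance(row) for row in power]            # sigma_P^2(f)
    total = sum(var_per_bin)
    captured = sum(m * v for m, v in zip(mask, var_per_bin))
    return math.log(total) - math.log(captured + eps)


power = [
    [5.0, 5.0, 5.0, 5.0],   # constant hum: high energy, zero variance
    [0.0, 4.0, 0.0, 0.0],   # short burst: variance 3.0
    [0.0, 0.0, 2.0, 0.0],   # weaker burst: variance 0.75
]
loss_on_hum = energy_var_loss(power, [1.0, 0.0, 0.0])
loss_on_burst = energy_var_loss(power, [0.0, 1.0, 0.0])
print(loss_on_hum, loss_on_burst)
```

The mask on the hum captures the most energy but no variance, so its loss is large; the mask on the burst bin gets a loss close to $\log(3.75 / 3)$. The sketch does not guard against a spectrogram with zero total variance, where the first logarithm is undefined.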

Surprise Coverage Loss

Mathematical Formulation

Probability distribution $p(f)$ of temporal variance across the spectrum:

$$p(f) = \frac{\sigma_P^2(f)}{\sum_{f'} \sigma_P^2(f')}$$

Information surprise (Shannon entropy component) per bin $f$:

$$S(f) = -p(f) \log(p(f) + \epsilon)$$

Total vs. captured spectral entropy:

$$S_{\text{total}} = \sum_f S(f) \quad \text{and} \quad S_{\text{captured}} = \sum_f M(f) \cdot S(f)$$

Surprise Coverage Loss

Objective Function

$$\mathcal{L}_{\text{surprise}} = \log(S_{\text{total}}) - \log(S_{\text{captured}} + \epsilon)$$

Physical Interpretation

Maximizes the proportion of spectral entropy captured by the filters.
If the entire spectrum fluctuates uniformly at a low level (like wind blowing leaves or moving water), the variance distribution $p(f)$ is nearly flat and no single bin stands out.
Forces bands to snap onto highly localized, narrow-band peaks (e.g., clean animal calls).
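The surprise coverage loss can be written out directly from the formulas above. The sketch below is plain Python with invented variance values and masks; only the formulas come from the slides.

```python
import math


def surprise_loss(var_per_bin, mask, eps=1e-8):
    """var_per_bin[f] is sigma_P^2(f); mask[f] is M(f) in [0, 1]."""
    total_var = sum(var_per_bin)
    p = [v / total_var for v in var_per_bin]                   # p(f)
    s = [-pf * math.log(pf + eps) for pf in p]                 # S(f)
    s_total = sum(s)
    s_captured = sum(m * sf for m, sf in zip(mask, s))
    return math.log(s_total) - math.log(s_captured + eps)


variances = [3.0, 0.75, 0.25, 0.0]       # hypothetical per-bin temporal variances
loss_full = surprise_loss(variances, [1.0, 1.0, 1.0, 1.0])
loss_partial = surprise_loss(variances, [1.0, 0.0, 0.0, 0.0])
print(loss_full, loss_partial)
```

A mask covering every bin captures all of $S_{\text{total}}$, so its loss is approximately zero, while a mask covering a single bin is penalized for the entropy it misses. If all variance sits in one bin, $S_{\text{total}}$ is zero and the first logarithm is undefined; this sketch does not handle that case.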

Synthetic dataset

Idealized synthetic bioacoustic dataset:

Short bursts at various frequencies with varying lengths
Each input sample has a different sampling rate, but constant length

Sampling rates are drawn from: 8000, 16000, 22050, 32000, 44100, 48000 Hz

Eval metrics by sampling rate

SR (Hz)	Samples	Both Cov	Any Cov	Energy Overlap	Surprise Overlap
6000	62	98.4%	100.0%	100.0%	100.0%
12000	72	93.1%	100.0%	100.0%	100.0%
18000	73	100.0%	100.0%	100.0%	100.0%
24000	70	97.1%	100.0%	100.0%	100.0%
38000	55	98.2%	100.0%	100.0%	100.0%
64000	68	77.9%	100.0%	100.0%	100.0%

Summary

We have an audio frontend that:

Always outputs a fixed sequence length regardless of the input sample rate
Finds the relevant frequencies on its own in a synthetic benchmark, regardless of the input sample rate
Is lightweight and faster than other frontends

What we still have to find out:

Does it work in a real-life scenario (complex vocalizations with pitch, structure, and harmonics surrounded by a lot of noise)?
Will it work in a larger pipeline solving complex tasks (e.g., self-supervised self-distillation, as in a2v)?
