Julian C. Schäfer-Zimmermann
Max Planck Institute of Animal Behavior
Department for the Ecology of Animal Societies
Communication and Collective Movement (CoCoMo) Group
$\text{N} = 3 \cdot 96,000 = 288,000$
$\text{N} = 3 \cdot 4,000 = 12,000$
A $576\times$ reduction ($24^2$) in attention matrix size
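As a quick check of the arithmetic above, here is a small sketch. It assumes that the factor 3 is a clip length in seconds, that 96,000 and 4,000 are sample rates in Hz, and that the attention matrix has $N \times N$ entries; the variable names are illustrative only.

```python
# Assumed reading of the slide: N = seconds * sample_rate, attention matrix is N x N.
seconds = 3            # assumption: the factor 3 is a clip length in seconds
sr_full = 96_000       # original sample rate (Hz)
sr_down = 4_000        # reduced sample rate (Hz)

n_full = seconds * sr_full          # sequence length at the original rate
n_down = seconds * sr_down          # sequence length at the reduced rate

length_ratio = n_full // n_down     # how much shorter the sequence becomes
attention_ratio = length_ratio ** 2 # attention matrix entries scale with N^2

print(n_full, n_down, length_ratio, attention_ratio)
```

The sequence is 24 times shorter, and because the attention matrix grows quadratically with sequence length, its size shrinks by $24^2 = 576$.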
Some math ... sorry
Assuming two signals:
\( S_1(t)\) is the signal we want to downshift
\( S_2(t)\) is the signal we use for downshifting \( S_1(t)\)
Heterodyning is defined as the mathematical product of these two time-domain signals:
Substituting the signal functions:
To resolve the frequency components, apply the trigonometric product-to-sum identity:
Substituting $\alpha = \omega_1 t$ and $\beta = \omega_2 t$.
Expanding the terms maps the output into two distinct frequencies:
The system maps the original frequencies to two new spectral coordinates:
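The steps above can be written out as follows. This is a sketch under the assumption that both signals are unit-amplitude cosines, $S_1(t) = \cos(\omega_1 t)$ and $S_2(t) = \cos(\omega_2 t)$; the symbol $S_{\text{mix}}$ and the names $f_{\text{down}}$, $f_{\text{up}}$ are introduced here for illustration.

$$S_{\text{mix}}(t) = S_1(t) \cdot S_2(t) = \cos(\omega_1 t) \cdot \cos(\omega_2 t)$$

$$\cos\alpha \cdot \cos\beta = \tfrac{1}{2}\left[\cos(\alpha - \beta) + \cos(\alpha + \beta)\right]$$

$$S_{\text{mix}}(t) = \tfrac{1}{2}\cos\big((\omega_1 - \omega_2)\,t\big) + \tfrac{1}{2}\cos\big((\omega_1 + \omega_2)\,t\big)$$

$$f_{\text{down}} = |f_1 - f_2| \quad \text{and} \quad f_{\text{up}} = f_1 + f_2, \qquad \omega_i = 2\pi f_i$$

With other phase conventions (for example sines instead of cosines) the signs and phases of the two terms change, but the difference and sum frequencies stay the same.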
We multiply by a 10 kHz signal for downshifting
Up- and downshifted signals are visible $\Rightarrow$ downsample
Only the downshifted mixture remains
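The three steps above can be reproduced numerically. The following is a minimal sketch, not the pipeline used for the slides: the 11 kHz test tone, the 96 kHz original rate, the 4 kHz target rate, and the brick-wall FFT low-pass filter are all assumptions chosen for illustration; only the 10 kHz mixing frequency comes from the text.

```python
import numpy as np

fs = 96_000        # assumed original sample rate (Hz)
f_signal = 11_000  # assumed test tone to be downshifted (Hz)
f_mix = 10_000     # mixing frequency from the slide (Hz)
fs_new = 4_000     # assumed target sample rate (Hz)

t = np.arange(fs) / fs                      # one second of samples
s1 = np.cos(2 * np.pi * f_signal * t)       # signal to downshift
s2 = np.cos(2 * np.pi * f_mix * t)          # mixing signal
mixed = s1 * s2                             # heterodyning: time-domain product


def top_peaks(x, rate, k):
    """Return the k strongest frequencies (Hz) in x, sorted ascending."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / rate)
    return sorted(float(f) for f in freqs[np.argsort(mag)[-k:]])


# After mixing, both the difference and the sum frequency are present.
peaks_mixed = top_peaks(mixed, fs, 2)

# Brick-wall low-pass below the new Nyquist frequency (illustrative filter),
# then keep every (fs // fs_new)-th sample.
spectrum = np.fft.rfft(mixed)
freqs = np.fft.rfftfreq(len(mixed), d=1 / fs)
spectrum[freqs >= fs_new / 2] = 0
lowpassed = np.fft.irfft(spectrum, n=len(mixed))
downsampled = lowpassed[:: fs // fs_new]

# After downsampling, only the downshifted component is left.
peaks_down = top_peaks(downsampled, fs_new, 1)

print("after mixing:", peaks_mixed)
print("after downsampling:", peaks_down)
```

Under these assumptions the mixed signal should show peaks near 1 kHz (difference) and 21 kHz (sum), and only the component near 1 kHz should survive the low-pass filter and downsampling. A practical implementation would use a proper anti-aliasing filter instead of zeroing FFT bins.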
Mathematical Formulation
Power spectrogram from complex STFT $X(f, t)$:
$$P(f, t) = |X(f, t)|^2$$
Temporal variance per frequency bin $f$:
$$\sigma_P^2(f) = \text{Var}_t(P(f, t))$$
Combined soft Cauchy bandpass mask (soft-OR):
$$M(f) = 1 - \prod_{k=1}^K (1 - M_k(f))$$
Objective Function
Log-ratio of total variance to captured variance:
$$\mathcal{L}_{\text{energy}_{\text{var}}} = \log\left(\sum_f \sigma_P^2(f)\right) - \log\left(\sum_f M(f) \cdot \sigma_P^2(f) + \epsilon\right)$$
Physical Interpretation
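Because $0 \le M(f) \le 1$, the captured variance can never exceed the total variance, so the loss is approximately non-negative and approaches zero when the mask covers the bands whose power fluctuates over time. The sketch below illustrates this numerically. It is not the original implementation: the Lorentzian form of each band $M_k(f)$, the hand-rolled Hann-window STFT, the amplitude-modulated test tone, and all function names are assumptions made for this example.

```python
import numpy as np


def power_spectrogram(x, n_fft=256, hop=128):
    """P(f, t) = |X(f, t)|^2 from a simple Hann-window STFT; shape (n_freq, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2


def cauchy_mask(freqs, center, width):
    """Assumed Cauchy (Lorentzian) band shape M_k(f) with peak value 1 at `center`."""
    return 1.0 / (1.0 + ((freqs - center) / width) ** 2)


def soft_or(masks):
    """Soft-OR combination: M(f) = 1 - prod_k (1 - M_k(f))."""
    return 1.0 - np.prod(1.0 - np.stack(masks), axis=0)


def energy_var_loss(P, M, eps=1e-8):
    """log(total temporal variance) - log(variance captured by the mask + eps)."""
    var_f = P.var(axis=1)  # sigma_P^2(f): variance over time for each frequency bin
    return np.log(var_f.sum()) - np.log((M * var_f).sum() + eps)


# Toy signal: a 1 kHz tone whose amplitude pulses at 2 Hz, plus weak noise.
fs = 8_000
rng = np.random.default_rng(0)
t = np.arange(2 * fs) / fs
envelope = 0.5 * (1 + np.sin(2 * np.pi * 2 * t))
x = envelope * np.sin(2 * np.pi * 1_000 * t) + 0.01 * rng.standard_normal(len(t))

P = power_spectrogram(x)
freqs = np.fft.rfftfreq(256, d=1 / fs)

m_on = cauchy_mask(freqs, center=1_000, width=100)   # band on the pulsing tone
m_off = cauchy_mask(freqs, center=3_000, width=100)  # band far away from it

loss_on = energy_var_loss(P, m_on)
loss_off = energy_var_loss(P, m_off)
loss_both = energy_var_loss(P, soft_or([m_on, m_off]))

print(loss_on, loss_off, loss_both)
```

In this toy setting the band centered on the pulsing tone should give a much smaller loss than the band placed away from it, and the soft-OR of both bands cannot be worse than the better single band, since it is at least as large as each individual mask at every frequency.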
Mathematical Formulation
Probability distribution $p(f)$ of temporal variance across the spectrum:
$$p(f) = \frac{\sigma_P^2(f)}{\sum_{f'} \sigma_P^2(f')}$$
Information surprise (Shannon entropy component) per bin $f$:
$$S(f) = -p(f) \log(p(f) + \epsilon)$$
Total vs. captured spectral entropy:
$$S_{\text{total}} = \sum_f S(f) \quad \text{and} \quad S_{\text{captured}} = \sum_f M(f) \cdot S(f)$$
Objective Function
$$\mathcal{L}_{\text{surprise}} = \log(S_{\text{total}}) - \log(S_{\text{captured}} + \epsilon)$$
Physical Interpretation
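Compared with a loss that weights each bin by its raw variance, the entropy term $-p \log p$ gives low-variance bins relatively more weight. The loss is still close to zero when the mask covers the bins that carry most of the spectral entropy. The following sketch evaluates the objective on a synthetic variance profile. The Gaussian-shaped profile, the Lorentzian mask, and all names are assumptions for illustration, not the original implementation.

```python
import numpy as np


def cauchy_mask(freqs, center, width):
    """Assumed Cauchy (Lorentzian) band shape with peak value 1 at `center`."""
    return 1.0 / (1.0 + ((freqs - center) / width) ** 2)


def surprise_loss(var_f, M, eps=1e-8):
    """log(S_total) - log(S_captured + eps) for a per-bin temporal variance var_f."""
    p = var_f / var_f.sum()            # p(f): variance normalized to a distribution
    S = -p * np.log(p + eps)           # per-bin entropy contribution S(f)
    s_total = S.sum()
    s_captured = (M * S).sum()
    return np.log(s_total) - np.log(s_captured + eps)


# Synthetic temporal-variance profile: small floor plus a bump around 1 kHz.
freqs = np.linspace(0, 4_000, 129)
var_f = 1e-3 + np.exp(-0.5 * ((freqs - 1_000) / 50) ** 2)

m_on = cauchy_mask(freqs, center=1_000, width=100)   # band on the bump
m_off = cauchy_mask(freqs, center=3_000, width=100)  # band away from the bump

loss_on = surprise_loss(var_f, m_on)
loss_off = surprise_loss(var_f, m_off)

print(loss_on, loss_off)
```

For this profile the band placed on the bump should yield the smaller loss, because it captures the bins that contribute most to $S_{\text{total}}$.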
Idealized synthetic bioacoustic dataset:
| SR (Hz) | Samples | Both Cov | Any Cov | Energy Overlap | Surprise Overlap |
|---|---|---|---|---|---|
| 6000 | 62 | 98.4% | 100.0% | 100.0% | 100.0% |
| 12000 | 72 | 93.1% | 100.0% | 100.0% | 100.0% |
| 18000 | 73 | 100.0% | 100.0% | 100.0% | 100.0% |
| 24000 | 70 | 97.1% | 100.0% | 100.0% | 100.0% |
| 38000 | 55 | 98.2% | 100.0% | 100.0% | 100.0% |
| 64000 | 68 | 77.9% | 100.0% | 100.0% | 100.0% |