Filter Banks and Log-Mel Spectrograms

https://apxml.com/courses/applied-speech-recognition/chapter-2-feature-extraction-for-speech/filter-banks-log-mel-spectrograms
Log-mel features are usually the first preprocessing step before audio is fed into a neural network.
For example, for LFM2-audio: “Raw 16 kHz waveforms are first transformed into 128-dimensional log-mel features using a 0.025s window and 0.01s stride”

Let’s unpack what this means.

1. Window and stride: chopping audio into frames

using a 0.025s window and 0.01s stride

Window = 0.025 s
- At 16 kHz → 0.025 × 16000 = 400 samples per window.
- For each 400-sample chunk, we’ll compute a short-time spectrum.
Stride (hop) = 0.01 s
- At 16 kHz → 0.01 × 16000 = 160 samples.
- So frames overlap:
  - Frame 1: samples 0–399
  - Frame 2: samples 160–559
  - Frame 3: samples 320–719
  - etc.

This gives you about 100 frames per second (because hop = 0.01 s = 1/100 s). Each frame is treated as approximately stationary, so we can talk about “the frequency content” within that short time span.
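The framing arithmetic above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular toolkit does it: the helper name `frame_signal` is hypothetical, and it assumes no padding, so trailing samples that do not fill a whole window are dropped.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz
WINDOW_S = 0.025       # window length in seconds
STRIDE_S = 0.010       # hop length in seconds

win_length = round(WINDOW_S * SAMPLE_RATE)   # 400 samples
hop_length = round(STRIDE_S * SAMPLE_RATE)   # 160 samples


def frame_signal(signal, win_length, hop_length):
    """Slice a 1-D signal into overlapping frames (assumption: no padding)."""
    if len(signal) < win_length:
        return np.empty((0, win_length), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - win_length) // hop_length
    starts = hop_length * np.arange(num_frames)
    # Row i holds samples starts[i] .. starts[i] + win_length - 1.
    idx = starts[:, None] + np.arange(win_length)[None, :]
    return signal[idx]


# Use the sample index as the "signal" so frame boundaries are easy to read off.
one_second = np.arange(SAMPLE_RATE, dtype=np.float32)
frames = frame_signal(one_second, win_length, hop_length)

print(frames.shape)                    # (98, 400)
print(frames[1, 0], frames[1, -1])     # second frame covers samples 160..559
```

Without padding, one second of audio yields 98 full frames rather than exactly 100; implementations that pad the signal edges land closer to the "about 100 frames per second" figure.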

2. From time to frequency: STFT (spectrogram)

For each 400-sample frame:

Multiply by a window function (e.g., Hamming) to reduce edge effects.
Compute the DFT → 400 complex values, of which 201 are unique.
- We only need to keep bins 0–200, because the DFT of a real-valued signal is conjugate-symmetric.
Keep the magnitude or power → a real-valued spectral vector.

What do these values, or at least their indices, represent?

Each index $k$ corresponds to a frequency:

$f_{k} = \frac{k}{N} f_{s} = \frac{k}{400} \cdot 16000 \text{ Hz} = 40k \text{ Hz}$

for a 16 kHz sampling rate and a frame length of $N = 400$ samples.

So:

k=0 → 0 Hz (DC)
k=1 → 40 Hz
…
k=200 → 8000 Hz (Nyquist)
k=201…399 → “negative” frequencies, mirrored around Nyquist (e.g., k=399 ↔ −40 Hz); redundant for real-valued input
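As a rough sketch of the per-frame steps above, the snippet below windows one synthetic 400-sample frame, takes the one-sided DFT, and maps bin indices to frequencies. The 1 kHz test tone is an assumption chosen so the peak lands on an exact bin (1000 / 40 = 25); `np.fft.rfft` returns only the 201 non-redundant bins.

```python
import numpy as np

SAMPLE_RATE = 16_000
N = 400  # samples per frame (0.025 s at 16 kHz)

# Assumed test input: one frame of a 1 kHz sine wave.
t = np.arange(N) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000.0 * t)

windowed = frame * np.hamming(N)     # taper the frame edges
spectrum = np.fft.rfft(windowed)     # one-sided DFT: bins k = 0 .. 200
power = np.abs(spectrum) ** 2        # real-valued power spectrum

# Frequency of each bin: f_k = k * fs / N, i.e. multiples of 40 Hz.
freqs = np.fft.rfftfreq(N, d=1.0 / SAMPLE_RATE)

peak_bin = int(np.argmax(power))
print(spectrum.shape)                # (201,)
print(peak_bin)                      # 25, i.e. the 1 kHz bin
```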

But we’re not done yet: these bins are still spaced linearly in frequency.

3. Mel scale: match human perception of pitch

Humans don’t hear frequency linearly:

We’re more sensitive to differences at low frequencies (e.g., 500 vs 1000 Hz) than at very high ones (10k vs 10.5k Hz).
The mel scale compresses high frequencies and spaces frequencies in a way roughly aligned with human pitch perception.
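To make the compression concrete, here is a small sketch using one common mel formula (the HTK-style variant, $m = 2595 \log_{10}(1 + f/700)$). The notes above do not say which variant is intended, so treat the formula as an assumption; other definitions exist and give somewhat different numbers.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Convert Hz to mel (assumption: HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=np.float64) / 700.0)


# The same 500 Hz gap, measured in mels at low and at high frequency.
low_step = float(hz_to_mel(1000.0) - hz_to_mel(500.0))
high_step = float(hz_to_mel(10500.0) - hz_to_mel(10000.0))

print(f"500 Hz -> 1000 Hz : {low_step:.1f} mel")
print(f"10 kHz -> 10.5 kHz: {high_step:.1f} mel")
```

Under this formula the low-frequency gap comes out at roughly 390 mel and the high-frequency gap at roughly 50 mel, which is the compression of high frequencies described above.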

To get mel features:

Define a bank of filters in frequency, each covering a band on the mel scale (e.g., 128 filters).
For each frame’s magnitude or power spectrum, pass it through this filterbank:
- Each filter computes a weighted sum of the energy in its frequency band.
Result: instead of 201 linear frequency bins, you get 128 mel bands.

A Mel filter bank is a set of triangular filters that we apply to the power spectrogram. These filters have two main characteristics:

Triangular Shape: Each filter is a triangle that starts at 0, ramps up to a peak amplitude of 1, and then ramps back down to 0.
Mel Spacing: The filters are narrow and tightly packed at low frequencies and become wider and more spread out at higher frequencies, following the Mel scale.
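A triangular mel filter bank along these lines could be built as follows. This is a sketch under stated assumptions: the HTK-style mel formula, filter edges spaced evenly in mel between 0 Hz and Nyquist, triangles that are linear in Hz with unit peak (no area normalization), and hypothetical function names. Real libraries differ in these details.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Hz -> mel (assumption: HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=np.float64) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels=128, n_fft=400, sample_rate=16_000, f_min=0.0, f_max=None):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    if f_max is None:
        f_max = sample_rate / 2.0
    # Center frequency of each one-sided DFT bin (0, 40, ..., 8000 Hz here).
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    # n_mels + 2 edge points, evenly spaced on the mel axis.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    lower = hz_points[:-2, None]
    center = hz_points[1:-1, None]
    upper = hz_points[2:, None]
    rising = (bin_freqs[None, :] - lower) / (center - lower)    # 0 -> 1 up to the peak
    falling = (upper - bin_freqs[None, :]) / (upper - center)   # 1 -> 0 after the peak
    # Each triangle is the smaller of the two ramps, clipped at zero.
    return np.maximum(0.0, np.minimum(rising, falling))


fbank = mel_filterbank()
print(fbank.shape)                   # (128, 201)

# Applying the bank to one frame is a matrix-vector product.
power = np.ones(201)                 # placeholder power spectrum
mel_energy = fbank @ power
print(mel_energy.shape)              # (128,)
```

One caveat about this particular configuration: with 128 filters over bins spaced 40 Hz apart, the narrowest low-frequency triangles are thinner than one bin, so a few rows of `fbank` can come out all zero. Using a longer FFT or fewer filters avoids this.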

4. log-mel?

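We take the log of the mel energies:

$\text{log\_mel}[i] = \log(\text{mel\_energy}[i] + \epsilon)$

where $\epsilon$ is a small constant that keeps the log finite when a band’s energy is zero. Taking the log:

- Matches the approximately logarithmic nature of human loudness perception.
- Makes the distribution of values more manageable for ML models.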

So each frame gives you a 128-D log-mel vector.
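The log step itself is a one-liner. The sketch below uses random placeholder mel energies with the shapes from the earlier sections (98 frames for one second without padding, 128 bands); the value `eps = 1e-10` is an assumption, since the notes do not specify one.

```python
import numpy as np


def log_mel(mel_energy, eps=1e-10):
    """Natural log of mel energies; eps (assumed value) guards against log(0)."""
    return np.log(np.asarray(mel_energy, dtype=np.float64) + eps)


# Placeholder input: (num_frames, n_mels) non-negative mel energies.
rng = np.random.default_rng(0)
mel_energy = rng.random((98, 128))
mel_energy[0, 0] = 0.0               # a silent band would give -inf without eps

features = log_mel(mel_energy)
print(features.shape)                # (98, 128)
print(np.isfinite(features).all())   # True
```

Each row of `features` is one frame’s 128-D log-mel vector.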
