Log-mel features are usually the first preprocessing step before feeding audio into a neural network.
For example, for LFM2-audio: “Raw 16 kHz waveforms are first transformed into 128-dimensional log-mel features using a 0.025s window and 0.01s stride”

Let’s unpack what this means.
1. Window and stride: chopping audio into frames
“using a 0.025s window and 0.01s stride”
- Window = 0.025 s
  - At 16 kHz → 0.025 × 16000 = 400 samples per window.
  - For each 400-sample chunk, we’ll compute a short-time spectrum.
- Stride (hop) = 0.01 s
  - At 16 kHz → 0.01 × 16000 = 160 samples.
  - So frames overlap:
    - Frame 1: samples 0–399
    - Frame 2: samples 160–559
    - Frame 3: samples 320–719
    - etc.
This gives you about 100 frames per second (because hop = 0.01 s = 1/100 s). Each frame is short enough to be treated as approximately stationary, so we can talk about “the frequency content” in that short time span.
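The framing above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: `frame_signal` is a hypothetical helper, the "audio" is a dummy ramp so that each sample's value equals its index, and no padding is applied at the edges (an assumption; real front ends often pad).

```python
import numpy as np

SAMPLE_RATE = 16_000                       # Hz, from the quoted setup
WIN_LENGTH = round(0.025 * SAMPLE_RATE)    # 400 samples per window
HOP_LENGTH = round(0.010 * SAMPLE_RATE)    # 160 samples between frame starts


def frame_signal(x, win_length=WIN_LENGTH, hop_length=HOP_LENGTH):
    """Split a 1-D signal into overlapping frames (hypothetical helper, no padding)."""
    n_frames = 1 + (len(x) - win_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    # Row i holds samples starts[i] .. starts[i] + win_length - 1
    idx = starts[:, None] + np.arange(win_length)[None, :]
    return x[idx]


# One second of dummy "audio": sample i has value i, so frame contents show their indices
x = np.arange(SAMPLE_RATE, dtype=np.float32)
frames = frame_signal(x)

print(frames.shape)                            # (98, 400)
print(int(frames[1, 0]), int(frames[1, -1]))   # 160 559
```

Without padding, only 98 full windows fit into one second, which is why the text says "about" 100 frames per second.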
2. From time to frequency: STFT (spectrogram)
For each 400-sample frame:
- Multiply by a window function (e.g., Hamming) to reduce edge effects.
- Compute the DFT → 400 complex values (of which only 201 are unique).
  - We only need to keep about half, because the spectrum of a real-valued signal is conjugate-symmetric.
- Keep magnitude/power → real-valued spectral vector.
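As a rough sketch of these three steps for a single frame, assuming NumPy's `np.hamming` and `np.fft.rfft` and a dummy 1 kHz sine as the frame (the test signal is an assumption made for illustration):

```python
import numpy as np

N = 400                  # samples per frame (0.025 s at 16 kHz)
SAMPLE_RATE = 16_000     # Hz

# Dummy frame: a 1 kHz sine, which fits exactly 25 periods into 400 samples
t = np.arange(N) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000.0 * t)

windowed = frame * np.hamming(N)     # taper the frame edges
spectrum = np.fft.rfft(windowed)     # keeps only the 201 unique bins
power = np.abs(spectrum) ** 2        # real-valued power spectrum

# The full DFT has 400 values, but bin k is the complex conjugate of bin N - k
full = np.fft.fft(windowed)
is_symmetric = np.allclose(full[1:200], np.conj(full[:200:-1]))

print(full.shape, spectrum.shape)    # (400,) (201,)
print(int(np.argmax(power)))         # 25
print(is_symmetric)                  # True
```

The peak lands in bin 25 because the sine completes 25 cycles within the frame.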
What do the indices of these values represent?
Each index k corresponds to a frequency:
f_k = k × f_s / N = k × 16000 / 400 = k × 40 Hz
for a 16 kHz sampling rate and a 400-sample frame.
So:
- k=0 → 0 Hz (DC)
- k=1 → 40 Hz
- …
- k=200 → 8000 Hz (Nyquist)
- k=201…399 → “negative” frequencies, mirrored around Nyquist (e.g., k=399 mirrors k=1)
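The index-to-frequency mapping can be checked numerically. A small sketch, assuming a 400-point DFT with no zero-padding (as in the text) and using NumPy's `rfftfreq` and `fftfreq` helpers for comparison:

```python
import numpy as np

N = 400                  # frame length in samples
SAMPLE_RATE = 16_000     # Hz

# Frequency of bin k is k * SAMPLE_RATE / N, for the 201 bins kept from a real-valued frame
k = np.arange(N // 2 + 1)
freqs = k * SAMPLE_RATE / N

print(freqs[:3].tolist())    # [0.0, 40.0, 80.0]
print(float(freqs[200]))     # 8000.0

# NumPy's helper produces the same grid
same_grid = np.allclose(freqs, np.fft.rfftfreq(N, d=1.0 / SAMPLE_RATE))

# The full-DFT helper labels the upper half as negative frequencies:
# full[399] is about -40 Hz, the mirror image of bin 1
full = np.fft.fftfreq(N, d=1.0 / SAMPLE_RATE)
```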
But we’re not done yet: these bins are still spaced linearly in frequency.
3. Mel scale: match human perception of pitch
Humans don’t hear frequency linearly:
- We’re more sensitive to differences at low frequencies (e.g., 500 vs 1000 Hz) than at very high ones (10k vs 10.5k Hz).
- The mel scale compresses high frequencies and spaces frequencies in a way roughly aligned with human pitch perception.
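To make the compression concrete, here is a sketch using one common mel formula, mel = 2595 · log10(1 + f / 700). This is an assumption: several mel variants exist and the quoted description does not say which one is used. `hz_to_mel` and `mel_to_hz` are hypothetical helper names.

```python
import numpy as np


def hz_to_mel(f_hz):
    """Convert Hz to mel (one common variant of the formula, assumed here)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


# The same 500 Hz gap covers far more mels at low frequencies than at high ones
low_gap = float(hz_to_mel(1000.0) - hz_to_mel(500.0))
high_gap = float(hz_to_mel(10500.0) - hz_to_mel(10000.0))

print(f"500 -> 1000 Hz spans {low_gap:.0f} mel")
print(f"10k -> 10.5k Hz spans {high_gap:.0f} mel")
```

Under this formula the low-frequency gap is several times larger in mels than the high-frequency one, matching the intuition in the bullets above.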
To get mel features:
- Define a bank of filters in frequency, each covering a band on the mel scale (e.g., 128 filters).
- Pass each frame’s spectrum through this filterbank:
  - Each filter takes a weighted sum of the energy in its frequency band.
  - Result: instead of 201 linear frequency bins, you get 128 mel bands.
A Mel filter bank is a set of triangular filters that we apply to the power spectrogram. These filters have two main characteristics:
- Triangular Shape: Each filter is a triangle that starts at 0, ramps up to a peak amplitude of 1, and then ramps back down to 0.
- Mel Spacing: The filters are narrow and tightly packed at low frequencies and become wider and more spread out at higher frequencies, following the Mel scale.
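A minimal sketch of building and applying such a filterbank follows. The assumptions are mine, not the source's: the mel formula mel = 2595 · log10(1 + f / 700), filters spanning 0 Hz to Nyquist, unnormalized triangles, and a 400-point DFT with no zero-padding. `mel_filterbank` is a hypothetical helper, not a library function, and the input spectrum is random dummy data.

```python
import numpy as np


def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels=128, n_fft=400, sample_rate=16_000):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    fft_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft   # 201 bin frequencies
    # n_mels + 2 points: left edge of the first filter, n_mels centres, right edge of the last
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    edges_hz = mel_to_hz(mel_points)

    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, ctr, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)     # 0 at lo, 1 at ctr
        falling = (hi - fft_freqs) / (hi - ctr)    # 1 at ctr, 0 at hi
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


fb = mel_filterbank()

# Apply to one frame's power spectrum (random dummy values standing in for real data)
rng = np.random.default_rng(0)
power = rng.random(201)
mel_energies = fb @ power      # weighted sum of power inside each band

print(fb.shape, mel_energies.shape)   # (128, 201) (128,)
```

One caveat with these exact settings: the narrowest low-frequency triangles are thinner than the 40 Hz bin spacing, so a few of them contain no DFT bin and come out as all zeros. Using a larger FFT size (zero-padding each frame) is one common workaround.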

4. Why log-mel?
We take the log of the mel energies:
- This roughly matches the logarithmic nature of human loudness perception.
- It makes the distribution of values more manageable for ML models.
So each frame gives you a 128-D log-mel vector.
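The log step itself is a one-liner. The sketch below uses made-up mel energies spanning many orders of magnitude, and a small floor `EPS` to avoid `log(0)`. Both the values and the floor are assumptions for illustration; real pipelines differ in the floor they use and in whether they take natural log, log10, or dB.

```python
import numpy as np

# Hypothetical mel energies for 2 frames x 4 bands (made-up values, including a silent band)
mel_energies = np.array([
    [0.0,  1e-6, 1e-3, 1.0],
    [1e-2, 1.0,  1e2,  1e3],
])

EPS = 1e-10   # assumed floor so that silent bands do not produce -inf
log_mel = np.log(np.maximum(mel_energies, EPS))

print(log_mel.shape)                     # (2, 4)
print(bool(np.isfinite(log_mel).all()))  # True
```

Values that differ by a factor of a billion in raw energy end up only a few tens of units apart after the log, which is the "more manageable" distribution mentioned above.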