Let’s unpack what this means.

1. Window and stride: chopping audio into frames

“using a 0.025s window and 0.01s stride”

  • Window = 0.025 s
    • At 16 kHz → 0.025 × 16000 = 400 samples per window.
    • For each 400-sample chunk, we’ll compute a short-time spectrum.
  • Stride (hop) = 0.01 s
    • At 16 kHz → 0.01 × 16000 = 160 samples.
    • So frames overlap:
      • Frame 1: samples 0–399
      • Frame 2: samples 160–559
      • Frame 3: samples 320–719
      • etc.

This gives you about 100 frames per second (because the hop is 0.01 s = 1/100 s). Each frame is short enough that the signal can be treated as approximately stationary, so we can talk about “the frequency content” of that short time span.
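The framing arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming no padding at the clip edges; the function name `frame_bounds` and its parameters are hypothetical, not taken from any particular library.

```python
def frame_bounds(num_samples, sample_rate=16000, window_s=0.025, stride_s=0.010):
    """Return (start, end) sample indices for each full frame (end is exclusive)."""
    win = round(window_s * sample_rate)   # 400 samples at 16 kHz
    hop = round(stride_s * sample_rate)   # 160 samples at 16 kHz
    if num_samples < win:
        return []                         # not enough audio for even one frame
    num_frames = 1 + (num_samples - win) // hop
    return [(i * hop, i * hop + win) for i in range(num_frames)]


bounds = frame_bounds(16000)  # one second of audio at 16 kHz
print(bounds[:3])             # [(0, 400), (160, 560), (320, 720)]
print(len(bounds))            # 98
```

Without padding, one second of audio yields 98 full frames rather than exactly 100, because the last frames would run past the end of the clip. Implementations that pad the signal get closer to 100.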

2. From time to frequency: STFT (spectrogram)

For each 400-sample frame:

  1. Multiply by a window function (e.g., Hamming) to reduce edge effects.
  2. Compute the DFT → 400 complex values, of which 201 are unique.
    1. We only need to keep bins 0–200, because the DFT of a real-valued signal is conjugate-symmetric: bin 400−k is the complex conjugate of bin k.
  3. Keep magnitude/power → real-valued spectral vector.
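The three steps above can be sketched with NumPy for a single frame. This is a rough illustration under assumptions: the helper name `power_spectrum` and the 1 kHz test tone are made up for the example, and power (rather than magnitude) is chosen arbitrarily as the output.

```python
import numpy as np


def power_spectrum(frame):
    """Hamming-window one frame and return its one-sided power spectrum."""
    frame = np.asarray(frame, dtype=np.float64)
    windowed = frame * np.hamming(len(frame))  # step 1: taper the frame edges
    spectrum = np.fft.rfft(windowed)           # step 2: keeps only bins 0..N/2
    return np.abs(spectrum) ** 2               # step 3: real-valued power per bin


# Hypothetical test signal: a 1 kHz sine sampled at 16 kHz, one 400-sample frame.
t = np.arange(400) / 16000
frame = np.sin(2 * np.pi * 1000 * t)
power = power_spectrum(frame)
print(power.shape)          # (201,)
print(int(power.argmax()))  # 25
```

`np.fft.rfft` returns only the non-redundant half of the spectrum, which is why the result has 201 entries instead of 400. The peak lands in bin 25 because the tone completes exactly 25 cycles within the frame.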

What do the indices of these values represent?

Each index k corresponds to a frequency:

  f_k = k × f_s / N = k × 16000 / 400 = k × 40 Hz

for a 16 kHz sampling rate (f_s) and a 400-point DFT (N).

So:

  • k=0 → 0 Hz (DC)
  • k=1 → 40 Hz
  • k=200 → 8000 Hz (Nyquist)
  • k=201…399 → “negative” frequencies, mirroring bins 199…1 around Nyquist (e.g., k=399 ↔ −40 Hz)
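The bin-to-frequency mapping in the list above (40 Hz per bin for a 400-point DFT at 16 kHz) can be checked with a small helper. The function name `bin_frequency` is hypothetical, and treating bins above N/2 as negative frequencies is one common convention, assumed here.

```python
def bin_frequency(k, n_fft=400, sample_rate=16000):
    """Frequency in Hz of DFT bin k; bins above n_fft/2 map to negative frequencies."""
    if not 0 <= k < n_fft:
        raise ValueError("k must be in [0, n_fft)")
    if k > n_fft // 2:
        k -= n_fft                      # wrap the upper half to negative frequencies
    return k * sample_rate / n_fft


for k in (0, 1, 200, 201, 399):
    print(k, bin_frequency(k))
# 0 0.0
# 1 40.0
# 200 8000.0
# 201 -7960.0
# 399 -40.0
```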

But we’re not done yet: this is still on a linear frequency scale.

3. Mel scale: match human perception of pitch

Humans don’t hear frequency linearly:

  • We’re more sensitive to differences at low frequencies (e.g., 500 vs 1000 Hz) than at very high ones (10k vs 10.5k Hz).
  • The mel scale compresses high frequencies and spaces frequencies in a way roughly aligned with human pitch perception.
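To make the compression concrete, here is a small sketch of a Hz-to-mel conversion. It assumes the widely used formula mel = 2595 · log10(1 + f/700); other variants of the mel scale exist, so treat the exact numbers as illustrative rather than as the definition any particular toolkit uses.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using the 2595 * log10(1 + f/700) variant (an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


low_gap = hz_to_mel(1000) - hz_to_mel(500)       # roughly 393 mel
high_gap = hz_to_mel(10500) - hz_to_mel(10000)   # roughly 51 mel
print(round(low_gap), round(high_gap))
```

The same 500 Hz gap covers several times more mel units at the low end than at the high end, which is the sense in which the scale compresses high frequencies.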

To get mel features:

  1. Define a bank of filters in frequency, each covering a band on the mel scale (e.g., 128 filters).
  2. For each frame’s magnitude (or power) spectrum, pass it through this filterbank:
    • Each filter takes a weighted sum of the energy in its frequency band.
  3. Result: instead of 201 linear frequency bins, you get 128 mel bands.

A mel filterbank is a set of triangular filters that we apply to the power spectrogram. These filters have two main characteristics:

  1. Triangular shape: Each filter is a triangle that starts at 0, ramps up to a peak amplitude of 1, and then ramps back down to 0.
  2. Mel spacing: The filters are narrow and tightly packed at low frequencies and become wider and more spread out at higher frequencies, following the mel scale.
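A triangular filterbank with these two properties could be built roughly as follows. This is a sketch under assumptions, not any library's exact implementation: it uses the 2595 · log10(1 + f/700) mel formula, peak-1 triangles without area normalization, and hypothetical names (`mel_filterbank`, `n_mels`, `n_fft`).

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=128, n_fft=400, sample_rate=16000):
    """Return an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    # Centre frequency of each DFT bin: 0, 40, ..., 8000 Hz for the defaults.
    fft_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    # n_mels + 2 edge points, equally spaced in mel, converted back to Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)

    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = hz_edges[m], hz_edges[m + 1], hz_edges[m + 2]
        rising = (fft_freqs - left) / (center - left)     # 0 at left edge, 1 at centre
        falling = (right - fft_freqs) / (right - center)  # 1 at centre, 0 at right edge
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


bank = mel_filterbank()
print(bank.shape)  # (128, 201)

# Applying it to one frame's power spectrum is a matrix-vector product:
# mel_energies = bank @ power   # shape (128,)
```

One caveat of this simple construction: with only 201 bins spaced 40 Hz apart, the narrowest low-frequency triangles among 128 filters can fall entirely between two bins and come out empty. Using fewer filters (for example `n_mels=40`) or a longer DFT avoids that.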


4. Log-mel: why take the log?

We take the log of the mel energies because it:

  1. Matches the approximate logarithmic nature of human loudness perception
  2. Makes the distribution of values more manageable for ML models.

So each frame gives you a 128-D log-mel vector.
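The log step itself is short. The sketch below assumes a base-10 log and a small floor to avoid log(0) on silent bands; both the base and the floor value `1e-10` are illustrative choices, and the name `log_mel` is hypothetical.

```python
import numpy as np


def log_mel(mel_energies, floor=1e-10):
    """Log-compress mel energies; `floor` keeps silent bands from producing -inf."""
    mel_energies = np.asarray(mel_energies, dtype=np.float64)
    return np.log10(np.maximum(mel_energies, floor))


# Hypothetical mel energies spanning many orders of magnitude.
energies = np.array([0.0, 1e-6, 1e-3, 1.0, 1e3])
compressed = log_mel(energies)
print(compressed)  # approximately [-10, -6, -3, 0, 3]
```

Values that originally spanned nine orders of magnitude end up in a range of a few units, which is the "more manageable distribution" mentioned above.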