https://github.com/nick7ong/mp3_codec
This project implements the Psychoacoustic Ear Model II used in MPEG-1 Layer III (MP3), as described in ISO/IEC 11172-3. The model estimates perceptual masking thresholds to determine which frequency components of an audio signal can be discarded without audible loss, enabling efficient compression. We implemented and simulated key steps of this codec pipeline in Python (NumPy, SciPy) and tested it on two audio files: flute.wav (a solo flute recording) and queen.wav (a segment from Queen's Bohemian Rhapsody).
*See the Results section to hear sound examples.
We began by loading the audio signals and computing short-time Fourier transforms using a Hann window with 50% overlap (frame size = 1024, hop size = 512). The resulting spectral energy was converted to Sound Pressure Level (SPL) and decibels relative to full scale (dBFS) representations. Frames with very low energy (below -96 dBFS) were skipped to reduce downstream processing.
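A minimal sketch of this front end is shown below. The helper name `frame_spectra` and the 96 dB offset used to map 0 dBFS to SPL are illustrative assumptions, not necessarily what the repository uses.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

FRAME_SIZE, HOP_SIZE = 1024, 512

def frame_spectra(path, silence_floor_dbfs=-96.0):
    fs, x = wavfile.read(path)
    x = x.astype(np.float64)
    if x.ndim > 1:
        x = x.mean(axis=1)                    # mix down to mono
    x /= np.max(np.abs(x)) + 1e-12            # normalize to [-1, 1]

    # Hann window, 50% overlap (hop = 512 for a 1024-sample frame)
    f, t, Z = stft(x, fs=fs, window="hann",
                   nperseg=FRAME_SIZE, noverlap=FRAME_SIZE - HOP_SIZE)

    power = np.abs(Z) ** 2                    # shape: (bins, frames)
    dbfs = 10.0 * np.log10(power + 1e-12)     # dB relative to full scale
    spl = 96.0 + dbfs                         # assumed offset: 0 dBFS -> ~96 dB SPL

    # Skip frames whose peak level never rises above the silence floor
    keep = dbfs.max(axis=0) > silence_floor_dbfs
    return f, spl[:, keep], fs
```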
We mapped the FFT bins onto the Bark scale, a psychoacoustically motivated frequency scale, to analyze how the human auditory system groups frequencies. Tonal and noise maskers were identified by scanning the SPL spectra for local maxima that met masking criteria based on spectral shape and frequency proximity. The search window for local peaks widens with frequency index to model the widening critical bands of human hearing.
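The sketch below illustrates the Bark mapping and a tonal-masker search with a frequency-dependent neighborhood. The 7 dB dominance criterion and the specific neighborhood widths follow the commonly cited ISO-style rule and are assumptions about this project's exact parameters.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Zwicker/Traunmueller-style Bark approximation."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def find_tonal_maskers(spl_frame):
    """Return FFT bin indices of tonal maskers in one SPL spectrum."""
    n = len(spl_frame)
    tonal = []
    for k in range(2, n - 6):
        # must be a local maximum over its immediate neighbors
        if not (spl_frame[k] > spl_frame[k - 1] and spl_frame[k] >= spl_frame[k + 1]):
            continue
        # search window widens with bin index (wider critical bands at high frequency)
        if k < 63:
            offsets = [2]
        elif k < 127:
            offsets = [2, 3]
        else:
            offsets = [2, 3, 4, 5, 6]
        neighbors = [k + d for d in offsets] + [k - d for d in offsets]
        # assumed criterion: peak dominates its neighborhood by at least 7 dB
        if all(spl_frame[k] - spl_frame[j] >= 7.0 for j in neighbors):
            tonal.append(k)
    return tonal
```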
Not all detected maskers are perceptually relevant. We eliminated those falling below the threshold of human audibility (the threshold in quiet) and merged maskers lying close together on the Bark scale. This step removes redundant or weak masking components, simulating the perceptual redundancy removal common in audio codecs.
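A sketch of this pruning step follows, using the Terhardt approximation of the threshold in quiet. The 0.5 Bark merge distance and the helper names are illustrative assumptions.

```python
import numpy as np

def threshold_in_quiet(f_hz):
    """Terhardt approximation of the absolute threshold of hearing (dB SPL)."""
    f_khz = np.maximum(f_hz, 20.0) / 1000.0   # clamp to avoid the 0 Hz singularity
    return (3.64 * f_khz ** -0.8
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)

def prune_maskers(masker_bins, spl_frame, freqs_hz, bark, merge_dist=0.5):
    # 1) audibility: discard maskers under the threshold in quiet
    kept = [k for k in masker_bins
            if spl_frame[k] >= threshold_in_quiet(freqs_hz[k])]
    # 2) decimation: of any pair closer than merge_dist Bark, keep the stronger
    kept.sort(key=lambda k: bark[k])
    out = []
    for k in kept:
        if out and bark[k] - bark[out[-1]] < merge_dist:
            if spl_frame[k] > spl_frame[out[-1]]:
                out[-1] = k
        else:
            out.append(k)
    return out
```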
Using spreading functions, we calculated how much each tonal or noise masker contributes to masking nearby frequency bins. Tonal maskers have narrow, sharply peaked masking curves, while noise maskers produce broader, flatter masking effects. These individual masking thresholds form the basis for computing global audibility.
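The sketch below shows one common parameterization: a piecewise-linear spreading function in Bark distance, with separate offsets for tonal and noise maskers so tonal curves sit lower (mask less) than noise curves. The specific constants are the widely cited textbook values and may differ from the project's.

```python
import numpy as np

def spreading_function(dz, p_masker):
    """dz = maskee Bark - masker Bark; p_masker = masker SPL in dB."""
    dz = np.asarray(dz, dtype=float)
    sf = np.full_like(dz, -np.inf)            # no masking outside [-3, 8) Bark
    m = (dz >= -3) & (dz < -1)
    sf[m] = 17.0 * dz[m] - 0.4 * p_masker + 11.0
    m = (dz >= -1) & (dz < 0)
    sf[m] = (0.4 * p_masker + 6.0) * dz[m]
    m = (dz >= 0) & (dz < 1)
    sf[m] = -17.0 * dz[m]
    m = (dz >= 1) & (dz < 8)
    sf[m] = (0.15 * p_masker - 17.0) * dz[m] - 0.15 * p_masker
    return sf

def individual_threshold(p_masker, z_masker, z_grid, tonal=True):
    """Masking contributed by one masker across all Bark points z_grid (dB SPL)."""
    dz = np.asarray(z_grid, dtype=float) - z_masker
    sf = spreading_function(dz, p_masker)
    # tonal maskers mask less than noise maskers of equal level
    offset = (-6.025 - 0.275 * z_masker) if tonal else (-2.025 - 0.175 * z_masker)
    return p_masker + offset + sf
```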
We aggregated all individual masking curves and the threshold in quiet to produce a global masking threshold for each frame. This threshold defines the frequency-dependent SPL below which any signal is considered inaudible due to simultaneous masking — critical for perceptual quantization and bit allocation.
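A sketch of the aggregation, assuming the helper names from the previous snippets: individual curves and the threshold in quiet are summed in the intensity domain, then converted back to dB.

```python
import numpy as np

def global_masking_threshold(tonal_maskers, noise_maskers,
                             spl_frame, bark, freqs_hz):
    """tonal_maskers / noise_maskers: lists of FFT bin indices for one frame."""
    intensity = 10.0 ** (threshold_in_quiet(freqs_hz) / 10.0)
    for k in tonal_maskers:
        t = individual_threshold(spl_frame[k], bark[k], bark, tonal=True)
        intensity += 10.0 ** (t / 10.0)       # -inf outside the spread -> adds 0
    for k in noise_maskers:
        t = individual_threshold(spl_frame[k], bark[k], bark, tonal=False)
        intensity += 10.0 ** (t / 10.0)
    return 10.0 * np.log10(intensity)         # dB SPL per FFT bin
```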
To simulate perceptual compression, we applied two key techniques:
Spectral Masking Removal: We compared each FFT bin's SPL against the global masking threshold. If a bin's energy fell below this threshold, it was zeroed out (i.e., discarded). This mimics quantization in perceptual codecs, reducing spectral resolution in inaudible regions.
SMR-Based Low-Pass Filtering: For each frame, we computed the Signal-to-Mask Ratio (SMR) across 32 uniformly spaced subbands. When SMR in high-frequency bands fell below a threshold (e.g., -10 dB), we applied a sharp biquad low-pass filter at the corresponding cutoff frequency (~15-16 kHz). This allowed further removal of masked high-frequency content.
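A combined sketch of both techniques is shown below. The -10 dB SMR floor, the ~15.5 kHz cutoff, and the 32 uniform subbands follow the description above; the choice of "high-frequency bands" as the top eight subbands and the second-order Butterworth biquad design are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def remove_masked_bins(spl_frame, spectrum_frame, global_thr):
    """Zero complex FFT bins whose SPL falls under the global masking threshold."""
    out = spectrum_frame.copy()
    out[spl_frame < global_thr] = 0.0
    return out

def smr_lowpass(x, fs, spl_frame, global_thr, smr_floor=-10.0, cutoff_hz=15500.0):
    """Low-pass the time signal when high-band SMR stays below smr_floor."""
    n_bins = len(spl_frame)
    bands = np.array_split(np.arange(n_bins), 32)     # 32 uniform subbands
    smr = np.array([spl_frame[b].max() - global_thr[b].min() for b in bands])
    if smr[-8:].max() < smr_floor:                    # assumed "high-frequency" bands
        sos = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")  # one biquad
        return sosfilt(sos, x)
    return x
```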