ERB scale and psychoacoustics
how the ERB scale models human hearing and why psychoacoustic frequency resolution matters for resonance suppression. the science behind perceptual audio processing.
your ears are not microphones
the ERB scale is one of the most important concepts in modern audio processing, and most producers have never heard of it. here is why it matters: your ears do not hear all frequencies with the same precision. at 100 Hz, you can distinguish tones about 35 Hz apart. at 1 kHz, that widens to about 130 Hz. at 8 kHz, you need nearly 900 Hz of separation before two tones sound different. (i had been building audio tools for a year before i learned this. it changed everything.)
this is not a limitation. it is how your auditory system evolved. low frequencies carry pitch information that your brain needs to resolve precisely. high frequencies carry noise-like content (consonants, transients, air) where fine detail matters less.
the problem is that most audio tools ignore this entirely. a standard FFT splits the spectrum into equal-width bins: 10.77 Hz wide at every frequency from 20 Hz to 20 kHz. that gives you absurd resolution at 10 kHz (where you cannot hear the difference) and barely enough at 200 Hz (where you can). the map does not match the territory.
the ERB scale fixes this mismatch.
key takeaway
the ERB scale models how your ears actually resolve frequency. it gives audio processors the same perceptual resolution as human hearing: fine detail where it matters, broader strokes where it does not.
what ERB actually measures
ERB stands for equivalent rectangular bandwidth. it answers a specific question: how wide is your auditory filter at a given frequency?
your cochlea, the spiral-shaped organ in your inner ear, acts as a bank of overlapping bandpass filters. each point along the cochlea responds to a narrow range of frequencies. at the base (the stiff end), the filters are tuned to high frequencies. at the apex (the flexible end), they are tuned to low frequencies. this physical arrangement is called tonotopic mapping, and the cochlea’s roughly 35 mm length maps approximately logarithmically across the audible range.[^1]
the width of each filter, measured in hertz, is the ERB at that frequency. the formula, derived by Glasberg and Moore in 1990 from careful psychoacoustic experiments, is:
the ERB formula
ERB(f) = 24.7 × (4.37 × f/1000 + 1) Hz. at 100 Hz, this gives ~35 Hz. at 1 kHz, ~130 Hz. at 8 kHz, ~888 Hz. the ERB rate (converting Hz to perceptual units) is: ERB_rate(f) = 21.4 × log10(0.00437 × f + 1). this maps 20 Hz to ~0.8 and 20 kHz to ~41.7, giving approximately 40 perceptual bands across the audible range.
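the two formulas above are short enough to sketch directly. a minimal Python version, reproducing the values quoted in this section (function names are mine, for illustration):

```python
import math

def erb_hz(f_hz: float) -> float:
    """equivalent rectangular bandwidth (Hz) at frequency f_hz,
    per Glasberg & Moore (1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_rate(f_hz: float) -> float:
    """frequency converted to perceptual ERB-rate units."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

for f in (100, 1000, 8000):
    print(f"{f} Hz -> ERB {erb_hz(f):.0f} Hz")
# 100 Hz -> ERB 35 Hz, 1000 Hz -> ERB 133 Hz, 8000 Hz -> ERB 888 Hz

print(f"20 Hz = {erb_rate(20):.1f}, 20 kHz = {erb_rate(20000):.1f} ERB-rate units")
# 20 Hz = 0.8, 20 kHz = 41.7 — roughly 40 bands across the audible range
```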
the key insight: auditory filter width grows with frequency. at 100 Hz, your ear resolves a 35 Hz window. at 8 kHz, that window is 25 times wider. any audio processor that treats all frequencies equally is working against how you actually hear.
how they measured it: the notched-noise experiment
the ERB formula did not come from theory. it came from a specific psychoacoustic experiment that Glasberg and Moore refined over years of work.[^2]
the setup is elegant. (this is one of those experiments that makes you appreciate how clever psychoacousticians are.) you play a listener a pure tone at a known frequency. then you surround that tone with noise, but the noise has a gap (a notch) centred on the tone’s frequency. you start with a narrow gap and ask: can you still hear the tone?
as you widen the notch, the tone gradually becomes easier to hear. the critical moment is when the notch width exceeds the listener’s auditory filter bandwidth at that frequency. at that point, the noise falls outside the filter and stops masking the tone.
by measuring the detection threshold at many notch widths and many centre frequencies, you can reconstruct the shape and width of the auditory filter at every frequency. the shape turns out to be asymmetric: steeper on the high-frequency side than the low-frequency side. the width follows the ERB formula.
this method avoids the artifacts that plagued earlier measurements. the older Bark scale (Zwicker, 1961) used narrowband maskers, which introduced beats and intermodulation products that contaminated the results. notched noise sidesteps those problems entirely.
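the stimulus itself is easy to sketch: broadband noise with a spectral gap carved out around the probe tone. a rough version, zeroing FFT bins inside the notch (the sample rate, duration, and notch width here are illustrative choices, not the paper's exact parameters):

```python
import numpy as np

def notched_noise(fs=44100, dur=1.0, f_centre=1000.0, notch_hz=130.0, seed=0):
    """white noise with a spectral notch of width notch_hz centred
    on f_centre — the masker used in notched-noise experiments."""
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    noise = rng.standard_normal(n)
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # zero every bin inside the notch around the probe frequency
    spectrum[np.abs(freqs - f_centre) < notch_hz / 2.0] = 0.0
    return np.fft.irfft(spectrum, n)

masker = notched_noise(notch_hz=130.0)  # notch ~one ERB wide at 1 kHz
```

widen `notch_hz` past the listener's auditory filter bandwidth and the probe tone pops out of the masking, which is exactly the transition the experiment measures.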
ERB vs Bark vs Mel
three psychoacoustic frequency scales exist, and they serve different purposes.
Bark scale (Zwicker, 1961): 24 critical bands from 0 to 24 Bark. derived from narrowband masking experiments. coarser than ERB, especially at low frequencies where the critical bandwidth floor is about 100 Hz. still used in some perceptual audio codecs and older literature.
Mel scale (Stevens, Volkmann & Newman, 1937): originally derived from pitch-matching experiments. maps frequencies based on perceived pitch distance. widely used in speech recognition (mel-frequency cepstral coefficients) but less accurate for suprathreshold hearing than ERB. the mel scale was designed for how pitch sounds, not how well you can distinguish frequencies.
ERB scale (Glasberg & Moore, 1990): 40 bands from 20 Hz to 20 kHz. derived from notched-noise masking. resolves about 4x finer than Bark at low frequencies (25 Hz floor vs 100 Hz). the current standard for computational auditory models and perceptual audio processing.
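the three scales are easy to compare numerically. a quick sketch: the Bark conversion below is the Zwicker-Terhardt arctangent approximation and the mel conversion is the common HTK-style formula, both stand-ins for the original tabulated data:

```python
import math

def hz_to_bark(f):
    # Zwicker & Terhardt arctangent approximation
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # HTK-style mel formula
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    # Glasberg & Moore ERB rate
    return 21.4 * math.log10(0.00437 * f + 1.0)

for f in (100, 1000, 10000):
    print(f"{f} Hz: bark {hz_to_bark(f):.1f}, mel {hz_to_mel(f):.0f}, "
          f"erb-rate {hz_to_erb_rate(f):.1f}")
```

note that at 100 Hz the ERB rate is already above 3 units while Bark is still below 1: that is the finer low-frequency resolution in action.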
tip
for music production tools, ERB is the best choice. it matches hearing resolution more accurately than Bark (especially below 500 Hz where bass and lower midrange detail matters) and is more perceptually valid than Mel for tasks beyond pitch matching.
why this matters for resonance suppression
a standard spectral processor uses a linear FFT. at 44.1 kHz with a 4096-point window, each bin is 10.77 Hz wide. this means:
- at 200 Hz: the ERB is about 46 Hz, covered by roughly 4 FFT bins. reasonable resolution
- at 1 kHz: the ERB is about 130 Hz, covered by about 12 bins. good resolution
- at 10 kHz: the ERB is about 1100 Hz, covered by over 100 bins. massively more resolution than your ears can use
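you can verify those bin counts in a few lines:

```python
fft_size, sample_rate = 4096, 44100.0
bin_width = sample_rate / fft_size  # ~10.77 Hz per bin

def erb_hz(f):
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

for f in (200.0, 1000.0, 10000.0):
    print(f"{f:.0f} Hz: ERB {erb_hz(f):.0f} Hz ~ {erb_hz(f) / bin_width:.1f} bins")
# 200 Hz: ~4 bins, 1 kHz: ~12 bins, 10 kHz: ~100+ bins
```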
this is backwards. the processor spends most of its resolution budget where your ears cannot tell the difference, and relatively little where they can. it is like putting all your studio treatment on the ceiling and none behind the monitors.
an ERB-based processor groups FFT bins into perceptual bands. instead of acting on 2049 individual bins (most of which your ears cannot distinguish), it acts on 40 bands that each correspond to one unit of auditory resolution.
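a minimal sketch of that grouping, placing band edges at equal steps on the ERB-rate scale and assigning each FFT bin to a band. the function names and the evenly-spaced-edge layout are my illustrative assumptions, not any particular product's implementation:

```python
import math

def erb_rate(f):
    return 21.4 * math.log10(0.00437 * f + 1.0)

def erb_rate_inv(r):
    # inverse of erb_rate: ERB-rate units back to Hz
    return (10.0 ** (r / 21.4) - 1.0) / 0.00437

def erb_band_edges(f_lo=20.0, f_hi=20000.0, n_bands=40):
    """n_bands + 1 edge frequencies, equally spaced in ERB rate."""
    r_lo, r_hi = erb_rate(f_lo), erb_rate(f_hi)
    return [erb_rate_inv(r_lo + (r_hi - r_lo) * i / n_bands)
            for i in range(n_bands + 1)]

def bin_to_band(fft_size=4096, sample_rate=44100.0, n_bands=40):
    """map each of the fft_size//2 + 1 linear bins to a band index 0..n_bands-1."""
    edges = erb_band_edges(n_bands=n_bands)
    mapping = []
    for k in range(fft_size // 2 + 1):
        f = k * sample_rate / fft_size
        # count interior edges below f; bins outside the range clamp to the end bands
        mapping.append(sum(1 for e in edges[1:-1] if f >= e))
    return mapping

mapping = bin_to_band()  # 2049 bins -> 40 band indices
```

per-band analysis then works on 40 energy sums instead of 2049 raw magnitudes, which is where the smoothness and the CPU savings come from.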
the practical benefits:
- fewer artifacts: acting on perceptual bands instead of individual bins means gain changes are smoother. no isolated bin gets a drastically different gain from its neighbours, which is what causes metallic “musical noise” artifacts
- better sensitivity matching: the processor is more sensitive to resonances in the 1-5 kHz range (where your ears are most sensitive) and less reactive in the highs (where your ears tolerate more variation)
- lower CPU cost: 40 bands require less computation than 2049 bins. the heavy lifting happens in the FFT, and the per-band analysis is lightweight
key takeaway
ERB-based processing gives a resonance suppressor the same frequency resolution as your ears. it detects resonances where you hear them, ignores variations where you do not, and produces fewer artifacts as a result.
equal loudness and the 2-5 kHz peak
the ERB scale tells you how precisely your ears resolve frequency. equal-loudness contours tell you how sensitively they respond to it.
the current standard is ISO 226:2023, which shows that your ears are most sensitive between 2 and 5 kHz.[^3] a 3 kHz tone at 40 dB SPL sounds as loud as a 100 Hz tone at about 60 dB SPL. that is a 20 dB difference in physical level for the same perceived loudness.
this sensitivity peak is not random. it corresponds to the resonant frequency of the ear canal (roughly a quarter-wavelength resonator at 2.5-3 kHz) and the range where speech formants carry the most information.
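the quarter-wavelength estimate is a one-liner. the canal length below (~2.9 cm) is a typical adult value, assumed for illustration:

```python
speed_of_sound = 343.0  # m/s in air at ~20 °C
canal_length = 0.029    # m, typical adult ear canal

# a tube closed at one end resonates at c / (4 * L)
f_resonance = speed_of_sound / (4.0 * canal_length)
print(f"{f_resonance:.0f} Hz")  # ~3 kHz, right in the sensitivity peak
```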
for resonance suppression, this means two things:
- resonances in the 2-5 kHz range are perceptually louder than the same dB peak at other frequencies. a 3 dB resonance at 3 kHz sounds significantly more offensive than a 3 dB resonance at 300 Hz
- suppression in this range is more audible, both the improvement and any artifacts. processing needs to be especially clean and well-smoothed in the sensitive region
a processor that combines ERB frequency resolution with loudness-weighted sensitivity gets both dimensions right: it resolves frequencies the way your ears do and prioritizes the ranges your ears care most about.
a common trap
when A/B testing resonance suppression, always level-match the bypassed signal. processing that reduces energy in the 2-5 kHz range will sound quieter due to equal-loudness effects, and quieter signals sound smoother regardless of any actual improvement. match to within 0.5 LUFS before judging.
beyond resonance: where ERB shows up
the ERB scale is not limited to resonance suppression. it underpins many modern audio tools:
- loudness metering (ITU-R BS.1770): uses frequency weighting derived from auditory filter models
- perceptual audio codecs (AAC, Opus): allocate bits per critical band, spending more where ears are sensitive
- noise reduction (Ephraim-Malah estimators): spectral smoothing follows perceptual band boundaries to avoid audible artifacts[^4]
- hearing aid algorithms: amplification curves match ERB resolution to compensate for specific hearing loss profiles
in music production, the most visible application is spectral resonance suppression. but as more tools adopt perceptual frequency models, you will see ERB-based thinking in equalizers, compressors, and spatial processors.
note
the ERB model has limitations. it was derived from experiments with normal-hearing listeners at moderate levels. at very high SPL, auditory filter widths change. for listeners with hearing damage, the filters broaden. the model is a good approximation for typical mixing and mastering conditions, not a universal truth.
frequently asked questions
what is the ERB scale in audio processing?
the ERB (equivalent rectangular bandwidth) scale models how your ears resolve frequency. at low frequencies, your hearing is precise: you can distinguish tones just 35 Hz apart at 100 Hz. at high frequencies, your resolution drops: at 8 kHz, you need almost 900 Hz of separation. the ERB scale captures this by defining frequency bands that match your actual perceptual resolution, derived from Glasberg & Moore 1990.
what is the difference between ERB and Bark scales?
both model human frequency perception, but they use different measurement methods. the Bark scale (Zwicker, 1961) uses narrowband maskers and has a floor of about 100 Hz. the ERB scale (Glasberg & Moore, 1990) uses notched noise, which avoids artifacts from beats and intermodulation. the result: ERB resolves roughly 4x finer at low frequencies (floor near 25 Hz vs 100 Hz) and is the current standard in computational auditory models.
why does ERB matter for resonance suppression?
a linear FFT treats all frequencies equally: the same bin width at 200 Hz and 10 kHz. but your ears do not work that way. ERB-based processing gives finer resolution where you are most sensitive (1-5 kHz) and coarser resolution where you are less sensitive. this means the processor targets resonances with the same precision your ears perceive them, reducing artifacts and over-processing.
how many ERB bands cover the audible range?
approximately 40. the ERB rate formula gives about 0.8 ERB-rate units at 20 Hz and about 41.7 at 20 kHz, spanning roughly 40 bands. this is why many perceptual audio processors use 40-band filterbanks: each band corresponds to one unit of perceptual frequency resolution.
is ERB processing better than linear FFT for audio?
it depends on the task. linear FFT is better when you need uniform frequency resolution (pitch detection, spectral analysis). ERB processing is better when the goal is perceptual: resonance suppression, loudness measurement, noise reduction. by matching human hearing, ERB processing avoids wasting resolution where your ears cannot tell the difference and adds resolution where they can.
references
a note from the developer
the Glasberg and Moore paper from 1990 is the single most important thing i have read for building plugins. not because the math is hard (it is one equation), but because it reframed the problem. i had been thinking about audio as frequencies. the paper made me think about audio as perception. those are different things.
when i built SMOOTH, the ERB scale was one of the first decisions. 40 perceptual bands instead of 2049 linear bins. every instinct says more resolution is better. it is not. the right resolution is better. the listening tests confirmed it immediately: suppression that sounds natural instead of metallic. 40 bands, matched to how the cochlea actually works. sometimes the simpler model wins because it matches the thing it is modelling.
if the psychoacoustics here do not match your experience, or if you think about frequency resolution differently, i want to hear it. jonas@kernaudio.io.
try it yourself
KERN SMOOTH: dynamic resonance suppression across 40 psychoacoustic bands. $29, no iLok, no subscription.
built on this research
SMOOTH applies this science in real time. five knobs. $29. no iLok.