how stereo perception works
the psychoacoustics of stereo hearing: interaural time differences, level differences, the duplex theory, precedence effect, and why bass should be mono.
your brain’s stereo decoder
a car horn blares from your left. you flinch and turn toward it before you consciously register what happened. your auditory system resolved the direction of that sound in milliseconds, using nothing but the tiny differences in what each ear received.
this is how stereo perception works. your eyes use the slight difference between left and right to perceive depth. your ears do the same thing with sound. your brain compares the signals arriving at your left and right ears and extracts spatial information from two types of differences: timing and level. these two cues are the foundation of everything in stereo audio, from basic panning to psychoacoustic widening.[^1]
key takeaway
your auditory system uses two cues to determine where a sound is coming from: timing differences between the ears (ITD, dominant below 1500 Hz) and level differences caused by head shadow (ILD, dominant above 1500 Hz). every stereo technique, from panning to decorrelation, exploits one or both of these mechanisms.
interaural time difference
when a sound comes from your left, it reaches your left ear a fraction of a millisecond before your right ear. this delay is the interaural time difference (ITD). your brain is extraordinarily sensitive to it: trained listeners can detect timing differences as small as 10 to 20 microseconds.[^1]
the maximum ITD for a sound directly to one side is about 690 microseconds (0.69 ms). this is determined by the physical distance between the ears: the average adult head is roughly 21 to 23 cm in diameter, and sound travels at 343 meters per second. the straight-line math gives 0.23 m / 343 m/s ≈ 0.00067 seconds; the slightly longer path as sound wraps around the head brings the maximum to about 690 microseconds.
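the arithmetic is easy to check in a few lines. the sketch below also includes the Woodworth spherical-head formula, ITD = (a/c)(θ + sin θ), a standard refinement that accounts for the curved path and the source angle (the head radius of 8.75 cm is an assumed average, not a value from this article):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at roughly 20 °C

def woodworth_itd(azimuth_rad, head_radius_m=0.0875):
    """Woodworth spherical-head model: ITD = (a/c) * (theta + sin theta)."""
    return (head_radius_m / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

# straight-line approximation from the text: ear spacing / speed of sound
straight_line = 0.23 / SPEED_OF_SOUND     # ≈ 671 microseconds
# source directly to one side (90 degrees)
side = woodworth_itd(math.pi / 2)         # ≈ 656 microseconds

print(f"straight line: {straight_line * 1e6:.0f} us")
print(f"Woodworth, 90 deg: {side * 1e6:.0f} us")
```

both land in the same few-hundred-microsecond range the text describes; measured maxima for real heads cluster around 600 to 700 microseconds.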
ITD is the dominant localization cue for low frequencies. below about 1500 Hz, sound waves are long enough to diffract around the head, maintaining phase coherence between the ears. your auditory system tracks the ongoing phase difference between the two ear signals and uses it to compute direction.
above 1500 Hz, the situation changes. the wavelength becomes shorter than the head diameter, which means the phase relationship between the ears becomes ambiguous: multiple cycles of the wave fit between the ears, and the brain cannot tell which cycle is which. ITD still works at high frequencies through a different mechanism: the brain tracks the envelope (the amplitude modulation) of high-frequency sounds rather than the carrier wave itself.
for producers, this has a direct practical consequence. when an inter-channel delay creates width, it is exploiting timing cues. below about 1 ms, the delay acts like an ITD and shifts the phantom image toward the earlier channel. above 1 ms, the precedence effect takes over and the brain fuses the two arrivals into one image, using the delay to add a sense of spaciousness.
interaural level difference
your head is not transparent. high-frequency sounds cannot bend around it easily, so the ear facing away from the sound source receives a weaker signal. this acoustic shadow creates an interaural level difference (ILD) that increases with frequency.
the numbers are significant. at 200 Hz, the maximum ILD is only about 3 dB: long wavelengths diffract around the head with little attenuation. at 1 kHz, it reaches about 10 dB. at 5 kHz, about 17 dB. at 10 kHz, the head shadow creates a massive 21 dB difference between the ears.[^1]
this is why simple amplitude panning (turning a knob left or right) sounds more convincing for high-frequency content than for bass. at low frequencies, ILD is negligible in real acoustic environments, so pure amplitude panning of bass creates a spatial impression that does not match how your brain expects low frequencies to behave. the image feels undefined rather than clearly positioned.
the 1500 Hz crossover
the transition between ITD dominance and ILD dominance happens around 1500 Hz. this is not a coincidence: at 1500 Hz, the wavelength of sound (343 / 1500 = 0.23 m) roughly equals the diameter of the average human head. below this frequency, waves diffract around the head and phase is the reliable cue. above it, the head casts an acoustic shadow and level becomes the reliable cue. this observation is the core of Lord Rayleigh’s duplex theory, first proposed in 1907.[^2]
the duplex theory
in 1907, Lord Rayleigh (John William Strutt) proposed that humans localize sounds using two complementary mechanisms: ITD dominates at low frequencies, ILD dominates at high frequencies, with a crossover around where the wavelength equals head size.
this was a breakthrough insight and remains the foundation of spatial hearing science. the 1500 Hz crossover is not a sharp boundary. it is a transition zone where both cues are active but with shifting dominance. modern research has refined Rayleigh’s theory in several ways: envelope ITD extends timing cues to higher frequencies, complex sounds use multiple cues simultaneously, and front-back confusion cannot be resolved by ITD or ILD alone (it requires spectral cues from the shape of the outer ear).
but for stereo music production, the duplex theory explains most of what you need to know. timing differences (delays, phase manipulation) are most effective below 1500 Hz. level differences (panning) are most effective above 1500 Hz. the best stereo tools operate across the full spectrum, using the right mechanism in the right frequency range.
the precedence effect
in the 1940s, Helmut Haas conducted experiments on how we hear in rooms full of reflections. his finding: when two similar sounds arrive within a short time window, the brain fuses them into a single event and localizes it at the position of the first arrival. the second arrival is perceptually suppressed for localization but still contributes to loudness and spaciousness.[^3]
the time windows matter for production:
0 to 1 ms: summing localization. the brain treats the two arrivals as a single source and computes its position from the combined ITD and ILD. this is how phantom center works in speaker stereo.
1 to 5 ms (clicks and transients): the fusion zone. two sounds fuse into one perceived event, localized at the first arrival. the delayed copy adds fullness and subtle spatial impression.
5 to 30 ms (speech and music): the precedence zone. the lagging sound is suppressed for localization but adds noticeable spaciousness and width. Haas showed that the delayed copy can be up to 10 dB louder than the original and still be suppressed.
beyond 30 to 50 ms: the echo threshold. the delayed sound becomes a distinct, audible echo.
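the time windows above can be summarized as a simple lookup. this is a rough sketch: the boundaries are perceptual tendencies, not hard thresholds, and real perception depends on signal content:

```python
def precedence_zone(delay_ms):
    """rough perceptual category for two similar arrivals delay_ms apart.
    boundaries are approximate, taken from the zones described above."""
    if delay_ms < 1.0:
        return "summing localization"  # one source, position from combined cues
    if delay_ms < 5.0:
        return "fusion"                # one event, localized at the first arrival
    if delay_ms < 30.0:
        return "precedence"            # suppressed for localization, adds spaciousness
    return "echo"                      # heard as a distinct repeat

print(precedence_zone(15))  # → precedence
```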
this is why Haas-based stereo widening works: a 10 to 25 ms delay falls squarely in the precedence zone, where the delayed copy adds spaciousness and width without being perceived as a separate sound. but it is also why Haas-based widening has a fundamental problem: on mono fold-down, the original and delayed signals sum together, creating comb filtering with notches at odd multiples of f = 1/(2 × delay).
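the notch positions from mono fold-down follow directly from that formula: cancellation occurs wherever the delay equals an odd number of half-cycles. a minimal sketch:

```python
def comb_notches(delay_s, f_max=2000.0):
    """notch frequencies when a signal sums with an equal-level delayed copy:
    cancellation at odd multiples of 1 / (2 * delay)."""
    notches = []
    k = 0
    while True:
        f = (2 * k + 1) / (2 * delay_s)
        if f > f_max:
            break
        notches.append(f)
        k += 1
    return notches

# a 10 ms Haas delay folded to mono: first notch at 50 Hz, then every 100 Hz
print([round(f) for f in comb_notches(0.010, f_max=500)])  # → [50, 150, 250, 350, 450]
```

note how a longer delay pushes the first notch lower and packs the notches closer together, which is why longer Haas delays fold down worse.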
why bass should be mono
at 100 Hz, the wavelength of sound is 3.43 meters. the distance between your ears is about 0.17 meters. your ear spacing is roughly 5% of the wavelength at 100 Hz.
the consequence is fundamental: your auditory system cannot resolve the spatial position of sounds with wavelengths much longer than your head. below about 80 to 150 Hz, sounds seem to come from everywhere regardless of which speaker they play from. this is why subwoofer placement is less critical than mid-range speaker placement, and why mono bass is not a creative compromise. it is a psychoacoustic reality.
the mono compatibility argument seals it. stereo bass means different bass content in the left and right channels. when those channels sum to mono, any phase differences between them cause cancellation. even partial decorrelation of bass creates partial low-end loss. on a club PA system with a mono subwoofer, on a phone speaker, on a bluetooth box: you lose the foundation of your mix.
most engineers mono-sum below 100 to 200 Hz. the exact frequency depends on the material: electronic music and club tracks often set the boundary at 100 to 150 Hz. pop and rock sessions typically use 80 to 120 Hz. there is no hard rule, but the principle is clear: do not waste stereo processing on frequencies where width is imperceptible and mono safety is critical.
tip
KERN WIDE’s FOCUS knob sets a crossover frequency that collapses all stereo content below it to mono, then applies width processing only above it. this is the psychoacoustically correct approach: do not spend processing power on frequencies where width cannot be heard, and guarantee a solid mono-compatible low end.
frequency-dependent width perception
not all frequencies contribute equally to the perception of stereo width. this has been demonstrated in listening tests on decorrelation, where researchers found that mid-frequencies (roughly 300 to 3000 Hz) contribute the most to perceived width and spaciousness.[^4]
the reasons are rooted in how your auditory system works. the ear processes sound in bands called critical bands (or equivalent rectangular bandwidths, ERBs). these bands are narrower at low frequencies and wider at high frequencies, following the Glasberg and Moore 1990 formula. within each band, the auditory system computes interaural differences independently.
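the Glasberg and Moore formula itself is compact: ERB(f) = 24.7 · (4.37 f / 1000 + 1) Hz. a quick sketch shows how band width grows with center frequency:

```python
def erb_bandwidth_hz(f_hz):
    """Glasberg & Moore (1990) equivalent rectangular bandwidth in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

for f in (100, 1000, 10000):
    print(f"{f} Hz -> ERB ≈ {erb_bandwidth_hz(f):.0f} Hz")
# → 100 Hz -> ERB ≈ 35 Hz
# → 1000 Hz -> ERB ≈ 133 Hz
# → 10000 Hz -> ERB ≈ 1104 Hz
```

a band roughly 35 Hz wide at 100 Hz versus over 1 kHz wide at 10 kHz: the auditory system's spatial analysis has far finer frequency resolution at the bottom of the spectrum than at the top.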
at low frequencies, the bands are narrow enough that decorrelation within a band is subtle. and because ITD is the dominant cue below 1500 Hz, phase manipulation alone does not create the same sense of spaciousness as it does in the midrange. at very high frequencies, excessive decorrelation creates a “phasey” or “smeary” quality that sounds like an artifact rather than natural width.
the practical takeaway: the most effective stereo widening is frequency-dependent. more decorrelation in the midrange (where it is most perceptible), less at the extremes (where it either does nothing useful or causes artifacts). this is the approach taken by psychoacoustically-informed stereo tools, including KERN WIDE’s STEREO mode, which uses a perceptually-weighted width curve that peaks around 600 Hz and rolls off at the frequency extremes.
frequently asked questions
what is the interaural time difference in audio?
the interaural time difference (ITD) is the tiny delay between when a sound reaches your near ear versus your far ear. the maximum ITD is about 690 microseconds (0.69 ms) for a sound directly to one side. your brain uses these timing differences to locate sounds. ITD is the dominant localization cue below about 1500 Hz.
why is 1500 Hz important for stereo perception?
at 1500 Hz, the wavelength of sound roughly equals the diameter of the human head (about 23 cm). below this frequency, sound waves bend around the head easily, so timing differences (ITD) are the main localization cue. above it, the head blocks high frequencies and creates a level difference (ILD) between the ears. this crossover at 1500 Hz is the core of Lord Rayleigh's duplex theory from 1907.
what is the Haas effect and how does it affect mixing?
the Haas effect (also called the precedence effect) describes how your brain handles two similar sounds arriving within 1 to 30 milliseconds of each other. the brain fuses them into a single perceived event and localizes it at the first arrival. the delayed copy adds spaciousness and width without being heard as a separate echo. this is why delay-based stereo widening works, but it creates comb filtering on mono fold-down.
why should bass frequencies be mono in a mix?
bass frequencies below about 150 Hz have wavelengths much longer than the distance between your ears (3.4 meters at 100 Hz vs 0.17 meters between ears). your auditory system cannot resolve the spatial position of these long wavelengths, so stereo bass provides no perceptual benefit. worse, stereo bass content that is out of phase between left and right cancels on mono playback, which costs you low-end power on phone speakers, bluetooth devices, and club systems.
do all frequencies contribute equally to perceived stereo width?
no. research on decorrelation and spatial hearing shows that mid-frequencies (roughly 300 to 3000 Hz) contribute most to the perception of width and spaciousness. very low frequencies are perceptually unlocalized, so decorrelation there adds nothing. very high frequencies are sensitive to comb filtering artifacts. the most effective stereo widening concentrates its effect in the midrange where your ears are most spatially sensitive.
a note from the developer
the psychoacoustics of stereo hearing became an obsession when i started working on KERN WIDE. i kept reading about producers who wanted wider mixes but did not understand why some widening techniques sounded great on headphones and fell apart on speakers. the answer was always in the psychoacoustics: the brain uses specific cues in specific frequency ranges, and any stereo processing that ignores these cues is working against the listener’s perception rather than with it.
the duplex theory is over a hundred years old, but it still explains most of what you need to know about stereo imaging in music production. ITD below 1500 Hz. ILD above. the Haas effect in the 5 to 30 ms range. these are the rules your ears follow, whether you know them or not.
building KERN WIDE meant translating these psychoacoustic principles into DSP code: allpass decorrelation that operates in ERB bands, a perceptual width curve that concentrates its effect where your ears are most spatially sensitive, and a bass crossover that respects the simple fact that stereo bass is both perceptually useless and acoustically dangerous.
if i got something wrong, or if you apply these principles in a way i have not described, email jonas@kernaudio.io. i read everything.
try it yourself
KERN WIDE: psychoacoustic stereo expansion that keeps mono compatibility. $29, no iLok, no subscription.