what causes harshness in audio

why your ears hurt between 2 and 5 kHz, how equal-loudness contours explain harshness, and why resonant peaks compound across tracks in your mix. a practical guide for music producers.

the frequency range that hurts

you are A/B-ing two takes of the same vocal. same singer, same mic, same room. one take sounds smooth and present. the other makes you want to reach for the volume knob after 30 seconds. you pull up a spectrum analyzer and the difference is right there: an extra 3 dB at 3.2 kHz. that is all it takes. harshness in audio is rarely about overall volume. it is about where in the spectrum the energy lives.

in stage 1, we covered how resonance builds up at specific frequencies. this stage explains why some of those frequencies hurt more than others.

the answer starts with a discovery from 1933. Harvey Fletcher and Wilden Munson at Bell Labs measured how loud tones at different frequencies need to be for people to perceive them as equally loud.[^1] the result was the first set of equal-loudness contours: curves showing that at moderate listening levels (around 60-80 phon), the human ear is 10 to 15 dB more sensitive at 3.5 kHz than at 200 Hz.

that is a massive difference. a tone at 3.5 kHz sounds as loud as a 200 Hz tone played 10 to 15 dB hotter, so a resonant peak in the presence range lands exactly where the ear needs the least energy to notice it. the ear emphasizes the presence range and de-emphasizes the extremes.
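as a rough numeric check on that sensitivity gap, here is a minimal python sketch. it does not implement the equal-loudness contours themselves, which come from measured data tables. instead it swaps in the standard A-weighting curve, which was derived from a low-level equal-loudness contour, as a stand-in. the function name and the choice of A-weighting are assumptions made for illustration.

```python
import math

def a_weighting_db(freq_hz: float) -> float:
    """approximate A-weighting gain in dB (analog formula, rounded constants)."""
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # the +2.00 dB offset normalizes the curve to about 0 dB at 1 kHz
    return 20.0 * math.log10(numerator / denominator) + 2.00

# difference in weighting between the presence range and the low mids
gap_db = a_weighting_db(3500.0) - a_weighting_db(200.0)
print(f"3.5 kHz vs 200 Hz: {gap_db:.1f} dB")
```

with this stand-in curve the gap comes out at roughly 12 dB, inside the 10 to 15 dB range quoted above. A-weighting is only an approximation of hearing at low levels, so treat the number as a sanity check, not a measurement.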

the reason is physical. your ear canal is a tube approximately 2.5 cm long. like any tube, it has a resonant frequency: around 2.5 to 3 kHz, where the quarter-wavelength matches the canal length. incoming sound at this frequency is physically amplified before it even reaches the eardrum. evolution reinforced this: speech consonants (s, t, k, f) carry information in the 2-5 kHz range, and the ability to distinguish them at a distance was a survival advantage.[^2]
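the quarter-wavelength relationship can be worked through in a few lines. this is a minimal sketch that assumes an ideal rigid tube, open at one end and closed at the other, and a speed of sound of 343 m/s. both are simplifications, and the function name is made up for illustration.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C

def quarter_wave_resonance_hz(tube_length_m: float) -> float:
    """first resonance of an ideal tube open at one end, closed at the other."""
    # the tube length equals a quarter of the wavelength at resonance
    return SPEED_OF_SOUND_M_S / (4.0 * tube_length_m)

print(round(quarter_wave_resonance_hz(0.025)))  # 2.5 cm ear canal
```

for a 2.5 cm tube the ideal model lands near 3.4 kHz, in the same region as the figure quoted above. real ear canals vary in length and are not ideal tubes, so measured resonances differ from this estimate.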

simplified equal-loudness sensitivity curve. higher values mean the ear is more sensitive at that frequency. the peak around 3.5 kHz explains why even small resonances in this range sound harsh.

key takeaway

harshness is not about volume. it is about energy in the specific frequency range where your hearing is most sensitive. a small peak at 3 kHz sounds far louder than the same peak at 200 Hz, because the ear physically amplifies the presence range.

resonance vs noise: what makes something harsh

not all loud content in the 2-5 kHz range sounds harsh. a bright cymbal crash has plenty of energy there but does not necessarily make you wince. a resonant vocal peak at the same frequency does.

the difference is spectral contrast. harshness correlates with narrow peaks that stand significantly above their spectral neighbors. a peak that protrudes 6 dB above the surrounding content sounds harsh. the same energy spread across a wider band sounds bright but tolerable.

this is why Q factor matters. a narrow peak (Q of 8 or higher) at 3 kHz sounds dramatically harsher than a broad shelf of the same overall energy. your auditory system is tuned to detect spectral contrast, not just absolute level. a resonant frequency that rings while everything else moves sounds like something that should not be there.[^3]
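to put a number on "narrow": Q is the center frequency divided by the -3 dB bandwidth. the sketch below converts a Q value into a bandwidth and band edges, assuming the usual band-pass convention where the edges are geometrically symmetric around the center. the function name is hypothetical.

```python
import math

def q_to_band(center_hz: float, q: float):
    """return (bandwidth_hz, lower_edge_hz, upper_edge_hz) for a given Q."""
    bandwidth_hz = center_hz / q
    half = 1.0 / (2.0 * q)
    root = math.sqrt(1.0 + half * half)
    # edges are placed so that lower * upper == center ** 2
    lower_hz = center_hz * (root - half)
    upper_hz = center_hz * (root + half)
    return bandwidth_hz, lower_hz, upper_hz

for q in (1.0, 8.0):
    bw, lo, hi = q_to_band(3000.0, q)
    print(f"Q={q}: {bw:.0f} Hz wide, {lo:.0f} to {hi:.0f} Hz")
```

at 3 kHz, a Q of 8 concentrates the peak into a band 375 Hz wide, while a Q of 1 spreads it across 3 kHz of spectrum.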

duration matters too. a snare hit with a 3 kHz transient lasts milliseconds. your ear barely registers it as harsh. a held vocal note with a resonant peak at the same frequency sustains for seconds. the longer the exposure, the more fatiguing it becomes.

spectral contrast

harshness perception correlates with the Q factor of a spectral peak, not just its level. a narrow peak (Q > 8) at 3 kHz sounds harsher than a broad boost of the same magnitude. this is because the auditory system detects contrast between neighboring frequency bands. a peak that stands out from its spectral context triggers a stronger perceptual response than one that blends in.

where harshness enters the signal

harshness rarely comes from one source. it accumulates through the signal chain, with each stage adding its own contribution.

room reflections

stage 1 covered room modes: standing waves at low frequencies caused by room dimensions. but rooms also create harshness through a different mechanism: early reflections. sound bouncing off a hard surface at close range (a desk, a wall behind the singer, a ceiling) arrives at the microphone milliseconds after the direct sound. the interference between direct and reflected signal creates a comb filter: a pattern of constructive and destructive peaks across the spectrum. when one of those constructive peaks lands in the 2-5 kHz range, you get harshness that was not in the source.
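the comb filter frequencies follow directly from the extra distance the reflection travels. here is a minimal sketch, assuming a single reflection that is not polarity-inverted and a speed of sound of 343 m/s. the function name and the 0.3 m example path are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at about 20 degrees C

def comb_peaks_hz(extra_path_m: float, f_lo_hz: float = 2000.0,
                  f_hi_hz: float = 5000.0) -> list[float]:
    """constructive peaks of a direct + single-reflection comb filter."""
    # peaks sit at integer multiples of 1 / delay, where delay = path / c
    spacing_hz = SPEED_OF_SOUND_M_S / extra_path_m
    peaks = []
    n = 1
    while n * spacing_hz <= f_hi_hz:
        if n * spacing_hz >= f_lo_hz:
            peaks.append(n * spacing_hz)
        n += 1
    return peaks

# reflection that travels 30 cm further than the direct sound
print([round(f) for f in comb_peaks_hz(0.3)])
```

with a 30 cm path difference, three constructive peaks fall between 2 and 5 kHz, one of them near 3.4 kHz. moving the mic or the reflecting surface by a few centimeters shifts all of them.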

microphone coloring

many condenser microphones, especially those voiced for vocals, have a presence peak between 3 and 5 kHz. this is usually intentional: it adds clarity to speech and vocals. but when the source already has energy in that range, the mic’s presence peak stacks on top. a bright singer through a bright condenser in a reflective room can create a 6-9 dB peak at 3 kHz before any processing.

compression revealing resonance

compressors reduce the loudest parts of a signal. transients go down, sustained content below the threshold stays relatively untouched, and make-up gain then lifts everything back up. when a vocal has a resonant tail at 3 kHz that was previously masked by louder transients, compression exposes it. the resonance was always there. compression just made it the loudest thing in the signal.
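a static compressor curve shows the effect in numbers. this is a minimal sketch of a hard-knee downward compressor with no attack or release behavior. the levels, threshold, and ratio are hypothetical values chosen only for illustration.

```python
def compress_db(level_db: float, threshold_db: float, ratio: float) -> float:
    """static curve of a hard-knee downward compressor (no attack/release)."""
    if level_db <= threshold_db:
        return level_db  # below threshold: untouched
    return threshold_db + (level_db - threshold_db) / ratio

# hypothetical levels in dBFS
transient_db = -6.0
resonant_tail_db = -20.0
threshold_db, ratio = -18.0, 4.0

gap_before = transient_db - resonant_tail_db
gap_after = (compress_db(transient_db, threshold_db, ratio)
             - compress_db(resonant_tail_db, threshold_db, ratio))
print(gap_before, gap_after)
```

in this example the transient sat 14 dB above the resonant tail before compression and only 5 dB above it afterwards. the tail did not get louder in absolute terms, but it is now much closer to the loudest part of the signal.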

digital processing artifacts

aggressive brick-wall limiting, poor-quality sample rate conversion, and certain digital EQ implementations can create narrow peaks or ringing in the presence range that analog processing would not. these are subtle but cumulative.

how harshness accumulates through the signal chain. each stage adds its own contribution, and the effects compound by the time signals reach the mix bus.

how harshness compounds across tracks

this is where small problems become big ones.

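a single vocal with a 3 dB peak at 3 kHz is mildly harsh. annoying, but manageable. add a second track with its own 3 dB peak at the same frequency, and the peak no longer stands just 3 dB above its surroundings. where the two resonances line up in phase, their amplitudes add (about +6 dB for two equal signals), while the uncorrelated content around them adds in power (about +3 dB). the peak now sits roughly 6 dB above its neighbors, and with three such tracks it is closer to 8 dB. that is the worst case, since real tracks are only partly coherent, but the direction is the same: three tracks with mild 3 kHz peaks produce a mix with a problem that none of the individual tracks had.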

this is the “last track” effect that every mixer has experienced. the mix sounds fine with 15 tracks. you bring in the 16th, and suddenly the presence range is fatiguing. the new track did not cause the problem. it tipped a cumulative build-up past the threshold of comfort.

the physics are simple. phase-coherent summation in a narrow frequency band produces constructive interference. the more sources with energy at the same frequency, the worse the build-up. and because most sources in a typical production (vocals, acoustic guitars, bright synths, cymbals) have significant energy in the 2-5 kHz range, the presence region is where compounding hits hardest.
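the two limiting cases of summation can be compared numerically. the sketch below assumes the worst case: the resonant peaks on every track add fully in phase, while the surrounding content is uncorrelated and adds in power. real tracks fall somewhere between the two, and the function names are made up for illustration.

```python
import math

def coherent_gain_db(n_tracks: int) -> float:
    """level increase when n equal, in-phase signals are summed."""
    return 20.0 * math.log10(n_tracks)

def incoherent_gain_db(n_tracks: int) -> float:
    """level increase when n equal, uncorrelated signals are summed."""
    return 10.0 * math.log10(n_tracks)

def summed_peak_prominence_db(peak_db: float, n_tracks: int) -> float:
    """worst case: peaks add in phase, surrounding content is uncorrelated."""
    return peak_db + coherent_gain_db(n_tracks) - incoherent_gain_db(n_tracks)

for n in (1, 2, 3):
    print(n, round(summed_peak_prominence_db(3.0, n), 1))
```

under these assumptions, a 3 dB peak on each track stands about 6 dB above its surroundings with two tracks and close to 8 dB with three.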

how a mild 3 kHz peak on individual tracks compounds into a significant problem when three tracks are summed. the build-up in the presence range is far worse than any single track suggests.

heads up

if your mix sounds harsh, do not reach for a mix bus EQ first. find the individual track causing the worst build-up and fix it at the source. a mix bus cut at 3 kHz affects everything: the punchy snare, the airy vocal, the bright acoustic guitar. fixing individual tracks preserves the character of sources that are not part of the problem.

source-specific harshness

different sources produce harshness through different mechanisms. understanding why a source is harsh tells you which tool will fix it. stage 5 covers the practical techniques for each. here is the science behind the problem.

vocals

the human voice produces formants: resonant peaks in the vocal tract that define vowel sounds. the second and third formants sit between 1.5 and 4 kHz, right in the harshness zone. when a singer changes vowels, the formants shift. when a singer pushes volume, the formants get louder. this is why vocal harshness is dynamic: it comes and goes with the performance, and no single EQ frequency can catch all of it.

cymbals

a cymbal’s overtone series is dense and inharmonic: dozens of resonant modes spread across 2-15 kHz, shifting with each strike. overhead microphones pick up both the direct cymbal sound and its room reflections, creating an especially complex spectral signature. the harshness is broadband and unpredictable.

acoustic guitar

the guitar body resonates aggressively in the 2-4 kHz range. string overtones and pick attack add more energy in the same region. close-miking at the soundhole amplifies the body resonance. further away picks up room coloring. either way, the 2-4 kHz region gets concentrated attention.

synths

digital oscillators have no natural high-frequency roll-off. a saw wave has harmonics extending to the Nyquist frequency with a predictable -6 dB/octave slope, but at 2-5 kHz, those harmonics are still significant. FM synthesis and resonant filter sweeps can create sharp peaks that move across the spectrum.
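the -6 dB/octave slope comes from the harmonic amplitudes of an ideal saw wave falling as 1/n. this minimal sketch lists the harmonics that land in the 2-5 kHz range for a given fundamental. the function name and the 110 Hz example note are hypothetical.

```python
import math

def saw_harmonics_in_band(f0_hz: float, f_lo_hz: float = 2000.0,
                          f_hi_hz: float = 5000.0) -> list[tuple[float, float]]:
    """(frequency, level in dB relative to the fundamental) for an ideal saw."""
    harmonics = []
    n = 1
    while n * f0_hz <= f_hi_hz:
        if n * f0_hz >= f_lo_hz:
            # amplitude 1/n means the level drops by 20*log10(n) dB
            harmonics.append((n * f0_hz, -20.0 * math.log10(n)))
        n += 1
    return harmonics

band = saw_harmonics_in_band(110.0)
print(len(band), "harmonics between 2 and 5 kHz")
print(f"first: {band[0][0]:.0f} Hz at {band[0][1]:.1f} dB")
```

for a 110 Hz saw, 27 harmonics fall inside the harshness zone. each one is individually well below the fundamental, but together they put a lot of energy in the range where the ear is most sensitive.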

the harshness zone mapped to perceptual frequency bands. relative to their center frequency, ERB-scale bands are narrower in the 2-5 kHz range than in the low end, reflecting the ear's finer relative frequency resolution where it matters most.
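for reference, the width of an ERB band can be estimated with the widely used Glasberg and Moore approximation. this is a minimal sketch of that formula, not the mapping used in the figure, and the function name is an assumption.

```python
def erb_hz(freq_hz: float) -> float:
    """equivalent rectangular bandwidth (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * freq_hz / 1000.0 + 1.0)

for f in (200.0, 1000.0, 3000.0):
    width = erb_hz(f)
    print(f"{f:.0f} Hz: ERB {width:.0f} Hz ({100.0 * width / f:.0f}% of center)")
```

in absolute terms the bands get wider as frequency rises, but as a fraction of the center frequency they shrink: roughly 23% at 200 Hz against roughly 12% at 3 kHz.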

when harshness is intentional

not all harshness is a problem to solve.

distorted electric guitars depend on presence-range energy. the “bite” of a driven amp is literally harmonic content concentrated in the 2-5 kHz range. remove it and the guitar disappears behind the drums and bass. the same energy that makes a clean vocal harsh makes a distorted guitar cut through a dense mix.

aggressive vocal production in hip-hop and pop deliberately pushes presence. a vocal sitting at 3-4 kHz with intentional brightness is a production choice, not a mistake. the goal is intelligibility over a dense beat, and presence-range energy achieves it.

mastering decisions depend on playback context. a track designed for earbuds needs more presence energy than one for studio monitors. the frequency response of consumer earbuds typically drops off above 4 kHz, so what sounds harsh on monitors may sound balanced on the listener’s actual playback system.

the difference between “harsh” and “present” is context, consistency, and duration. a 3 kHz peak that appears on loud phrases and disappears on quiet ones is a resonance problem. a 3 kHz peak that is consistent across the performance and intentional in the production is character.

key takeaway

harshness is not always a problem to fix. it is a problem when it is uncontrolled, when it compounds across sources, or when it fatigues the listener. the goal is to know the difference, and to have the tools to act when it matters.

frequently asked questions

why do my ears hurt at certain frequencies?

your auditory system has peak sensitivity between 2 and 5 kHz. this range evolved for speech intelligibility: distinguishing consonants at a distance was a survival advantage. the consequence is that any resonant peak in this range sounds dramatically louder than the same peak at 300 Hz or 10 kHz. at moderate levels, a 10 dB sensitivity advantage corresponds to roughly a doubling of perceived loudness, so even a 3 dB build-up at 3 kHz carries far more weight than the same build-up at lower frequencies.

what frequency range is responsible for harshness?

most harshness lives between 2 and 5 kHz, with a secondary zone around 6-8 kHz for sibilance. the 2-5 kHz range corresponds to the ear canal resonant frequency and to the region where the equal-loudness contours show peak sensitivity. content in this range needs less energy to sound loud, so any resonant build-up becomes fatiguing quickly.

is harshness the same as loudness?

no. loudness is overall level across the spectrum. harshness is the perception of excessive energy in a narrow frequency band, specifically in the range where your ears are most sensitive. a signal can be quiet overall but still sound harsh if it has a resonant peak at 3 kHz. harshness is about spectral shape, not volume.

can room treatment fix harshness?

room treatment fixes harshness caused by reflections and standing waves. it will not fix harshness from microphone presence peaks, signal chain resonances, or cumulative build-up from summing multiple tracks. treatment is one piece of the puzzle, not the whole solution.

when is harshness intentional and useful?

distorted guitars, aggressive synth leads, and certain vocal styles use presence-range energy for impact and cut-through. the difference between harsh and present is context: a 3 kHz peak on a rock vocal sitting above a dense arrangement might be exactly what it needs to stay intelligible. harshness becomes a problem when it is uncontrolled or fatiguing over time.

references

a note from the developer

this guide is built on four years of studying psychoacoustics and DSP research: reading papers, building prototypes, making mistakes, and learning from all of it. i am a solo developer in copenhagen, and i am still learning every day.

the equal-loudness discovery changed how i thought about KERN SMOOTH. i had been building a resonance suppressor that treated every frequency equally: same sensitivity, same threshold, same smoothing. the problem was that a 3 dB resonance at 3 kHz and a 3 dB resonance at 200 Hz are not the same problem. your ears treat them completely differently. so SMOOTH needed to treat them differently too. that led to the perceptual salience weighting: heavier sensitivity in the 2-5 kHz range, lighter sensitivity where the ear is naturally more tolerant. the result sounds more natural because it matches how you actually hear.

if i got something wrong, missed something important, or if you just want to share how you deal with harshness in your mixes, i genuinely want to hear from you. reach out at jonas@kernaudio.io. every piece of feedback makes these guides better.