how spectral processing works
how spectral processing works in audio production: FFT, frequency bins, magnitude and phase, STFT, and overlap-add. the technology behind tools that see every frequency independently.
the thousand-band EQ
spectral processing is what happens when your audio tools stop guessing and start seeing. think of it as a RAW editor for audio. instead of adjusting a few broad sliders, you can see and modify every frequency independently. imagine an EQ with two thousand bands, each about 10 Hz wide, each with its own gain control, all updating automatically every 23 milliseconds. you do not place them manually. the processor analyzes your signal, finds every frequency that needs attention, and adjusts the corresponding band in real time.
this is how spectral processing works. it is the technology behind noise reducers, spectral editors like iZotope RX, and resonance suppressors like KERN SMOOTH. instead of giving you 4 or 8 EQ bands to place by ear, a spectral processor sees every frequency independently and acts on all of them at once.
the core of every spectral processor is the FFT: the fast fourier transform. it is a mathematical operation that takes a short chunk of audio, a waveform in the time domain, and converts it into a frequency-domain representation: a list of how much energy exists at every frequency in that chunk.
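to make this concrete, here is a minimal numpy sketch (illustrative only, not any plugin's code): feed one 4096-sample chunk containing a 1 kHz sine into the FFT and find the bin holding the most energy.

```python
import numpy as np

sr = 44100
n = 4096
t = np.arange(n) / sr
chunk = np.sin(2 * np.pi * 1000 * t)           # time-domain waveform: a 1 kHz sine

spectrum = np.fft.rfft(chunk * np.hanning(n))  # frequency-domain representation
freqs = np.fft.rfftfreq(n, 1 / sr)             # center frequency of each bin

peak_bin = np.argmax(np.abs(spectrum))
print(f"{freqs[peak_bin]:.1f} Hz")             # close to 1000 Hz
```

the peak lands in the bin nearest 1000 Hz, within one bin width (10.77 Hz) of the true frequency.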
key takeaway
spectral processing is not a different kind of EQ. it is a fundamentally different approach: instead of you telling the processor which frequencies to target, the processor analyzes the signal and finds them itself. this is why it can handle problems that shift in frequency, like vocal formants or filter sweeps, where static EQ bands cannot follow.
frequency and time: the trade-off
the FFT works on chunks of audio called windows. the size of the window determines two things that pull in opposite directions: frequency resolution and time resolution.
a larger window gives you finer frequency resolution. a 4096-sample window at 44.1 kHz produces 2049 frequency bins, each 10.77 Hz wide. that is precise enough to distinguish two notes a semitone apart across most of the audible range. you can see individual harmonics, narrow resonances, and subtle spectral features.
but here is the thing: a larger window also means worse time resolution. those 4096 samples represent about 93 milliseconds of audio. everything that happens within that window gets blended together in the frequency analysis. a transient that lasts 5 milliseconds is smeared across the full 93-millisecond frame. the FFT sees its frequency content accurately, but loses the information about exactly when it happened.[^1] you get precision in one dimension by losing it in the other. always.
a smaller window reverses the trade-off. a 512-sample window gives you roughly 86 Hz per bin, too coarse to distinguish individual harmonics, but updates every 11.6 milliseconds, fast enough to track rapid changes.
most spectral processors in music production use 4096 or 8192 samples as a compromise. the frequency resolution is detailed enough for resonance detection, and the time resolution is adequate for musical signals that change on the scale of syllables or notes, not individual transient attacks.
the numbers
at 44.1 kHz with a 4096-point FFT: frequency resolution = 44100 / 4096 = 10.77 Hz per bin. number of bins = 4096 / 2 + 1 = 2049 (only the positive frequencies are unique). window duration = 4096 / 44100 = 92.9 ms. hop size (at 75% overlap) = 1024 samples = 23.2 ms between frames.
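the arithmetic above is worth checking yourself; it is four one-line divisions:

```python
sr, n, hop = 44100, 4096, 1024

bin_width = sr / n          # 10.77 Hz per bin
num_bins = n // 2 + 1       # 2049 unique (positive-frequency) bins
window_ms = 1000 * n / sr   # 92.9 ms per window
hop_ms = 1000 * hop / sr    # 23.2 ms between frames

print(bin_width, num_bins, window_ms, hop_ms)
```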
magnitude and phase
the FFT produces two pieces of information for every frequency bin: magnitude and phase.
magnitude tells you how much energy exists at that frequency. this is what you see on a spectrum analyzer: the height of each point in the display. when a spectral processor reduces a resonance, it is reducing the magnitude of specific bins.
phase tells you where in its cycle that frequency component is at the start of the window. a sine wave at 1000 Hz might be at its peak, at zero-crossing, or anywhere in between. the phase value records this timing offset.
for spectral processing, magnitude is almost always what matters. when you reduce the magnitude of a resonant peak, you are making that frequency quieter. the phase stays the same, which means the timing relationships between frequency components are preserved. the signal sounds like the same signal, just without the problematic peak.
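in code, magnitude-only processing looks like this sketch: split each complex bin into magnitude and phase, scale the magnitude of a target bin (bin 93 is an arbitrary example), and recombine with the original phase untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
spectrum = np.fft.rfft(rng.standard_normal(4096))

mag = np.abs(spectrum)      # how much energy at each frequency
phase = np.angle(spectrum)  # where in its cycle each component starts

mag[93] *= 0.25                      # pull one hypothetical resonant bin down ~12 dB
modified = mag * np.exp(1j * phase)  # recombine: new magnitude, original phase
```

the timing relationships between all frequency components survive intact; only the loudness of bin 93 changed.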
modifying phase is risky. the phase relationships between frequency components are what give a signal its temporal structure: its transients, its waveform shape, its stereo image. change the phase of even a few bins and you can hear it as smearing, pre-ringing, or metallic artifacts. this is why most spectral processors operate on magnitude only.[^2]
tip
when you read that a spectral processor “preserves phase,” this is what it means. the processor modifies how loud each frequency is without changing the timing relationships between them. this is a significant advantage over time-domain approaches that inevitably shift phase as a side effect of filtering.
the sliding window
a single FFT frame gives you a snapshot. to process continuous audio, you need a stream of snapshots: the short-time fourier transform (STFT).
the STFT works by sliding the analysis window across the audio, one hop at a time. with a 4096-sample window and a 1024-sample hop (75% overlap), each position shares 75% of its samples with the previous frame. this overlap is critical for two reasons.
first, it provides smooth transitions between frames. if you modify the magnitude of a frequency bin in one frame, the overlapping frames blend that change gradually across the output. without overlap, you would hear clicks and discontinuities at every frame boundary.
second, the overlap compensates for the windowing function. before each frame enters the FFT, it is multiplied by a window function, typically a Hann window, that fades the edges of the chunk to zero. this prevents spectral leakage (artifacts from the signal being abruptly cut off at the window edges), but it also attenuates the samples near the edges of each frame. with 75% overlap, every sample in the output is covered by four overlapping windows whose values sum to a constant. this is called the constant overlap-add (COLA) condition, and it guarantees perfect reconstruction: if you do nothing to the spectrum, the output exactly equals the input.[^3]
the reconstruction works in reverse. after processing, each frame is transformed back to the time domain with an inverse FFT (IFFT), multiplied by a synthesis window, and added to the output buffer at its original position. because the windows overlap, the output is a smooth, continuous signal.
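the full analyze-modify-resynthesize loop fits in a few lines. this is a bare-bones sketch (no real-time buffering, smaller window than a shipping plugin would use) that windows each frame, transforms it, transforms it back, and overlap-adds with normalization by the accumulated window-sum:

```python
import numpy as np

def stft_roundtrip(x, n=1024, hop=256):
    win = np.hanning(n)                      # Hann analysis window
    out = np.zeros(len(x) + n)
    norm = np.zeros(len(x) + n)
    for start in range(0, len(x) - n + 1, hop):
        frame = x[start:start + n] * win     # window the chunk
        spectrum = np.fft.rfft(frame)        # to the frequency domain
        # ... per-bin processing would happen here ...
        frame = np.fft.irfft(spectrum)       # back to the time domain
        out[start:start + n] += frame * win  # synthesis window + overlap-add
        norm[start:start + n] += win ** 2    # window-sum for COLA normalization
    out[norm > 1e-8] /= norm[norm > 1e-8]
    return out[:len(x)]
```

with no spectral modification, every fully-overlapped sample of the output equals the input: that is the perfect-reconstruction guarantee in practice.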
what spectral processors can do
the STFT framework is the foundation. what you build on top of it determines the tool.
spectral editing
tools like iZotope RX display the full spectrogram (time on one axis, frequency on the other, color for intensity) and let you select and modify regions. you can paint out a cough in a vocal recording, remove a phone ring from a film take, or isolate and extract a single instrument from a mix. this is spectral editing: manual intervention at the individual bin level. it is powerful but time-consuming, meant for repair work on individual files rather than real-time mixing.
spectral noise reduction
noise reduction was one of the first applications of STFT processing. the processor learns the noise profile from a “noise print” (a section of audio containing only the noise). during processing, it compares each frame’s spectrum to the noise profile and attenuates bins where the signal level is close to or below the noise floor. the key challenge is avoiding “musical noise,” the metallic, bubbly artifacts that appear when isolated bins are over-attenuated. temporal smoothing across frames helps: instead of cutting a bin sharply from one frame to the next, the gain changes gradually.
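a simplified per-frame version of this logic might look like the following (the function name, threshold, and smoothing constant are illustrative, not taken from any particular product):

```python
import numpy as np

def denoise_frame(mag, noise_profile, prev_gain, threshold_db=6.0, smooth=0.7):
    # signal-to-noise ratio of each bin against the learned noise print
    snr_db = 20 * np.log10(mag / (noise_profile + 1e-12) + 1e-12)
    # keep bins well above the noise floor, duck the rest
    target = np.where(snr_db > threshold_db, 1.0, 0.1)
    # temporal smoothing: glide toward the target instead of jumping,
    # which is what suppresses "musical noise"
    gain = smooth * prev_gain + (1 - smooth) * target
    return mag * gain, gain
```

each frame's gain is carried into the next call as `prev_gain`, so an isolated bin never snaps from full level to heavy attenuation in one hop.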
resonance suppression
a resonance suppressor analyzes the spectral envelope of each frame, identifies bins that protrude above the local average, and applies gain reduction to those specific bins. the “local average” is the key concept: it defines what counts as a resonance versus normal spectral content. a peak 6 dB above its neighbors is a resonance. a peak that is part of a broad spectral shape is not.
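the detection step can be sketched in a few lines: smooth the dB spectrum to estimate the local average, then flag bins that protrude past a threshold (the neighborhood size and threshold here are illustrative numbers, not any product's tuning):

```python
import numpy as np

def find_resonances(mag_db, neighborhood=15, threshold_db=6.0):
    kernel = np.ones(neighborhood) / neighborhood
    local_avg = np.convolve(mag_db, kernel, mode="same")  # smoothed spectral envelope
    excess = mag_db - local_avg                           # protrusion above the envelope
    return excess > threshold_db                          # True where a resonance sticks out
```

a narrow 20 dB spike gets flagged; a broad spectral shape raises its own local average and does not.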
spectral dynamics
spectral dynamics extends the concept to per-bin compression or expansion. instead of a single threshold across the whole spectrum, each frequency bin has its own dynamic range processing. this is how tools can selectively compress only the frequencies that are too loud while expanding the ones that are too quiet. full disclosure: KERN SMOOTH uses this approach, applying per-bin dynamic gain with independent attack and release times per frequency region.
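a generic per-bin downward compressor looks like this (a textbook gain law applied per bin; this is a sketch of the concept, not SMOOTH's actual implementation):

```python
import numpy as np

def spectral_compress(mag_db, thresholds_db, ratio=3.0):
    over = np.maximum(mag_db - thresholds_db, 0.0)  # per-bin overshoot in dB
    gain_db = -over * (1 - 1 / ratio)               # classic compressor gain law
    return mag_db + gain_db
```

because `thresholds_db` is an array, every bin can have a different threshold; a real implementation would also smooth the gain over time with per-region attack and release.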
the resolution problem
the FFT divides the frequency range into equal-width bins. at 44.1 kHz with 4096 samples, every bin is exactly 10.77 Hz wide, from 0 Hz (DC) at the bottom to 22050 Hz (the Nyquist frequency) at the top.
your ears do not work this way. not even close.
at low frequencies, your auditory system has extremely fine resolution. you can distinguish a 100 Hz tone from a 110 Hz tone easily. at high frequencies, your resolution is much coarser. you cannot distinguish 10000 Hz from 10050 Hz. (the Mercator projection makes Greenland look the same size as Africa. linear FFT bins do something similar to high frequencies: they give you resolution you cannot perceive.) the ERB (equivalent rectangular bandwidth) scale models this: an auditory filter at 1 kHz is about 130 Hz wide, while a filter at 8 kHz is about 890 Hz wide.
this mismatch has practical consequences. a 4096-point FFT at 44.1 kHz always has finer resolution than your ears, but the mismatch grows with frequency. at 200 Hz, the FFT’s 10.77 Hz bins are about 4 times finer than your auditory filter (an ERB band at 200 Hz is roughly 46 Hz wide). at 8 kHz, the mismatch is extreme: the FFT bins are still 10.77 Hz wide, but your auditory filter is nearly 900 Hz wide. the FFT has over 80 times more resolution than your ears can use.
smarter spectral processors solve this by grouping FFT bins into perceptual bands that match the ERB scale. instead of treating all 2049 bins independently, they group them into 30-40 ERB bands. each band represents a region of frequency that your ear treats as a unit. processing at this resolution produces more natural-sounding results because the gain changes align with what you actually perceive.
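the Glasberg & Moore formula for ERB width is simple enough to compute directly, and it makes the mismatch obvious:

```python
# ERB bandwidth (Glasberg & Moore): erb(f) = 24.7 * (4.37 * f_khz + 1)
def erb_width(freq_hz):
    return 24.7 * (4.37 * freq_hz / 1000 + 1)

bin_width = 44100 / 4096               # 10.77 Hz, the same at every frequency
for f in (200, 1000, 8000):
    print(f, erb_width(f) / bin_width)  # FFT bins per auditory filter
```

at 200 Hz, a few bins fit inside one auditory filter; at 8 kHz, more than 80 do. grouping bins into ERB bands collapses that excess resolution into the roughly 30-40 units your ear actually uses.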
this is covered in depth in the next guide: ERB scale and psychoacoustics.
practical implications
spectral processing is powerful but not free. there are real costs to consider.
latency
a spectral processor must collect a full window of samples before it can analyze them. a 4096-sample window at 44.1 kHz means roughly 93 milliseconds of delay before the first processed sample appears. your DAW handles this transparently with plugin delay compensation (PDC), aligning the output of all tracks so you hear everything in sync during playback.
the latency matters in two situations: live performance (93 ms is noticeable as a delay when monitoring through the plugin) and when stacking multiple spectral processors (each adds its own latency, and PDC has to account for all of them).
CPU cost
the FFT itself is efficient, but the per-bin processing that follows is not trivial. analyzing 2049 bins, computing gain for each one, and applying temporal smoothing every 23 milliseconds adds up. a well-optimized spectral processor typically uses 1-3% of a single CPU core at 44.1 kHz. this is more than a simple EQ (~0.1%) but less than most convolution reverbs (~5-10%).
when to use it
spectral processing is the right tool when the problem is too complex for conventional tools. resonances that shift in frequency, noise that sits underneath a signal, artifacts that span dozens of narrow frequencies simultaneously. if a simple EQ cut or a dynamic EQ with 2-3 bands solves the problem, reach for those first. they add less latency, use less CPU, and are easier to control.
spectral processing is the wrong tool when the problem is simple and static. a room mode at 240 Hz that never changes does not need 2049 frequency bins to fix it. that is like bringing a microscope to read a road sign.
heads up
spectral processing can create artifacts if pushed too hard. “musical noise” (metallic, watery artifacts) appears when individual frequency bins are attenuated aggressively without temporal smoothing. if you hear ringing or warbling in the processed signal, reduce the depth of processing and check whether a gentler approach would work.
frequently asked questions
what is spectral processing in audio?
spectral processing decomposes your audio into individual frequency components using a mathematical operation called the FFT (fast fourier transform). instead of working with the raw waveform, a spectral processor can see and modify every frequency independently. this is how tools like noise reducers, spectral editors, and resonance suppressors work: they identify problem frequencies in the spectrum and reduce them without affecting the rest of the signal.
what is an FFT and why does it matter for audio?
the FFT (fast fourier transform) converts a chunk of audio from the time domain (amplitude over time) into the frequency domain (amplitude at each frequency). a 4096-point FFT at 44.1 kHz gives you 2049 frequency bins, each about 10.8 Hz wide. this lets you analyze and modify individual frequencies with a precision that no EQ can match.
what is the difference between magnitude and phase in audio?
magnitude tells you how loud each frequency is. phase tells you where in its cycle each frequency is at a given moment. most spectral processors modify only the magnitude (making specific frequencies louder or quieter) and leave the phase untouched. modifying phase carelessly creates audible artifacts because it disrupts the timing relationships between frequencies.
does spectral processing add latency to your signal?
yes. a spectral processor needs to collect a full window of samples before it can analyze them. a 4096-sample window at 44.1 kHz adds about 93 milliseconds of latency. your DAW compensates for this automatically with plugin delay compensation (PDC), so you will not hear the delay during playback. but you will feel it when recording or playing live through the plugin.
when should you use spectral processing instead of EQ?
use spectral processing when the problem frequencies shift over time (vocal formants, filter sweeps, cymbal overtones) or when there are too many simultaneous resonances for a few EQ bands to handle. if the problem is a fixed resonance at a known frequency, a simple EQ or dynamic EQ is faster, lighter on CPU, and equally effective.
references
a note from the developer
the first time i got a working STFT prototype running, i fed it a vocal and watched the spectral display light up. every frequency, every frame, updating in real time. i remember thinking: this is it. this is what the tool needs to see.
and then i listened to the output. it sounded terrible. metallic, watery, thin. the math was correct. the sound was not. it took months to figure out why: the FFT gives you perfect frequency information, but the decisions you make on top of it determine everything. how to identify a resonance versus character. how to smooth gain changes so they do not create artifacts. how to handle transients that the windowed analysis smears. the math is the starting point. the listening is where the tool becomes musical. i am still learning that part.
if something in this guide does not click, or if spectral processing fits into your workflow in a way i have not described, tell me. jonas@kernaudio.io.
try it yourself
KERN SMOOTH: dynamic resonance suppression across 40 psychoacoustic bands. $29, no iLok, no subscription.
built on this research
SMOOTH applies this science in real time. five knobs. $29. no iLok.