[FFmpeg-user] resolution of the waterfall diagram of typical mp3 file
florin at andrei.myip.org
Mon Aug 8 01:48:06 EEST 2016
On 2016-08-07 14:28, Nicolas George wrote:
> You can not compute the spectrum of a single sample, that does not make
> sense mathematically. The spectrum needs to be computed on the whole
> or at least, if you want to observe how it evolves during time, over a
> window large enough.
Then it's my mistake. I'm explaining it wrong - sorry for that, and
allow me to rephrase.
I start with a mono audio file containing one song - a few minutes of
audio. Let's say the "quality" here is arbitrarily high, for simplicity.
Using python/numpy or some other tools, I calculate the spectrum of the
whole song, either all at once if possible, or using a reasonably large,
shifting time window.
I store that spectrum in a matrix. In the time dimension, the matrix has
T rows per second (so the total number of rows depends on the length of
the song). In the frequency dimension, the matrix has F columns
(frequency buckets or bins). In each cell, I store one value using B
bits (the color of the waterfall, or the height of the 3D representation
of the spectrum).
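To make the matrix concrete, here is a minimal numpy sketch of what I mean. The function name spectrum_matrix, the Hann window, the 96 dB dynamic range and the test tone are all assumptions chosen for illustration, not something taken from MP3. It uses equal-width FFT bins and keeps only magnitudes, so the phase is thrown away and the matrix alone is not enough for an exact conversion back to PCM.

```python
import numpy as np


def spectrum_matrix(pcm, sample_rate, rows_per_second, n_bins, bits,
                    dynamic_range_db=96.0):
    """Return a (time, frequency) matrix of quantized magnitude levels.

    Hypothetical helper: T = rows_per_second, F = n_bins, B = bits (<= 16).
    """
    hop = int(round(sample_rate / rows_per_second))  # samples between rows
    win_len = 2 * n_bins  # an rfft over 2*F samples yields F+1 bins
    if len(pcm) < win_len:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(win_len)
    n_rows = 1 + (len(pcm) - win_len) // hop

    mags = np.empty((n_rows, n_bins))
    for i in range(n_rows):
        frame = pcm[i * hop:i * hop + win_len] * window
        mags[i] = np.abs(np.fft.rfft(frame))[:n_bins]  # drop the Nyquist bin

    # Log-scale the magnitudes, keep the top dynamic_range_db decibels,
    # and map that range linearly onto the integers 0 .. 2**bits - 1.
    db = 20.0 * np.log10(np.maximum(mags, 1e-12))
    floor = db.max() - dynamic_range_db
    db = np.clip(db, floor, None)
    levels = 2 ** bits - 1
    return np.round((db - floor) / dynamic_range_db * levels).astype(np.uint16)


# Usage: one second of a 1 kHz test tone stands in for the real song.
rate = 44100
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 1000.0 * t)
matrix = spectrum_matrix(tone, rate, rows_per_second=75, n_bins=576, bits=8)
print(matrix.shape)
```

Each row of the result is one time step and each entry in a row is one frequency bin, so the size of the matrix is controlled directly by rows_per_second, n_bins and bits.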
I then convert the matrix back into a PCM representation.
I need to determine the matrix parameters T, F, and B, so that the final
PCM file has about as much information (about the same "sound quality",
however you want to define that) as if it was extracted from an MP3
file, 44.1 kHz, 128 kbps CBR.
I understand that the frequency bins do not have constant width, but
rather their upper/lower frequency limits have a constant ratio (similar
to octaves on a keyboard, but with a different ratio here).
The purpose of this whole exercise is to run some computations on the
full spectrum (the matrix). I need to minimize the size of the matrix,
while keeping the time and frequency resolutions pretty decent. I've
decided that the "sound quality" of MP3 / 44.1 / 128 CBR is good enough,
so I'm trying to imitate those respective resolutions, as used by MP3.
I suspect the MP3 encoding algorithm is more complex than using a
fixed-size matrix, so I'm only asking for a rough approximation, like a
back-of-the-envelope estimate. How many rows per second, how many
frequency buckets, how many bits per cell, so that the result is not
worse than that reference MP3/44.1/128 file? It doesn't have to be the
exact same signal degradation, but if it's subjectively close then
that's enough.
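One way to frame the estimate is as a bit budget: a fixed matrix spends T * F * B bits per second, and the reference file spends 128000. The sketch below assumes the MPEG-1 Layer III long-block layout, where each granule of 576 input samples yields 576 frequency lines, and treats those lines as equal-width bins. The function name mp3_like_budget is made up for illustration. The B it returns is only an average: part of the real bitrate goes to headers and side information, and the encoder distributes the rest unevenly across bins, so a matrix with the same fixed B in every cell would probably need more bits to sound comparable.

```python
def mp3_like_budget(sample_rate=44100, bitrate=128000, n_bins=576):
    """Back-of-the-envelope T and B for a fixed matrix with F = n_bins.

    Assumes one matrix row per block of n_bins input samples (critical
    sampling), as in an MPEG-1 Layer III granule with long blocks.
    """
    rows_per_second = sample_rate / n_bins                # T
    cells_per_second = rows_per_second * n_bins           # equals sample_rate
    bits_per_cell = bitrate / cells_per_second            # average B
    return rows_per_second, bits_per_cell


T, B = mp3_like_budget()
print(f"T = {T:.1f} rows/s, F = 576 bins, B = {B:.1f} bits per cell on average")
# prints: T = 76.6 rows/s, F = 576 bins, B = 2.9 bits per cell on average
```

Because the rows are critically sampled, the number of cells per second equals the sample rate, so the average B is simply the bitrate divided by the sample rate, whatever F is chosen.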