[FFmpeg-trac] #9307(undetermined:new): Decoding of Opus audio with missing packets produces noise spike (was: Seeking to the start of some files produces noise spike)
FFmpeg
trac at avcodec.org
Mon Jun 28 07:21:41 EEST 2021
#9307: Decoding of Opus audio with missing packets produces noise spike
-------------------------------------+-------------------------------------
Reporter: Misaki | Owner: (none)
Type: defect | Status: new
Priority: normal | Component: undetermined
Version: unspecified | Resolution:
Keywords: ffplay, opus, libopus | Blocked By:
Blocking: | Reproduced by developer: 0
Analyzed by developer: 0 |
-------------------------------------+-------------------------------------
Changes (by Misaki):
* summary: Seeking to the start of some files produces noise spike =>
Decoding of Opus audio with missing packets produces noise spike
New description:
Summary of the bug:
OLD: With some files, seeking to the start produces a noise spike.
UPDATED: Output levels from Opus decoding can vary greatly if some packets
at the start are missing, either because they weren't included in the
stream or because ffplay seeks to a video keyframe that comes after those
packets.
In the past, I found that whether the spike appears can depend on a slight
change in volume when encoding the audio. For example, the filter
"volume=0.8734" would produce a 'bugged' file that causes this noise spike,
while "volume=0.8733" would not. I waited to report it until I could test
with a more recent version of ffmpeg and ffplay.
How to reproduce (see below for interpretation):
{{{
$ /usr/bin/ffplay \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -af astats
ffplay version 4.3.2-0+deb11u1ubuntu1 Copyright (c) 2003-2021 the FFmpeg developers
built with gcc 10 (Ubuntu 10.2.1-20ubuntu1)
configuration: --prefix=/usr --extra-version=0+deb11u1ubuntu1
--toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu
--incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl
--disable-stripping --enable-avresample --disable-filter=resample
--enable-gnutls --enable-ladspa --enable-libaom --enable-libass
--enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio
--enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig
--enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm
--enable-libjack --enable-libmp3lame --enable-libmysofa
--enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
--enable-librabbitmq --enable-librsvg --enable-librubberband
--enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex
--enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame
--enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack
--enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid
--enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal
--enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx
--enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883
--enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264
--enable-shared
libavutil 56. 51.100 / 56. 51.100
libavcodec 58. 91.100 / 58. 91.100
libavformat 58. 45.100 / 58. 45.100
libavdevice 58. 10.100 / 58. 10.100
libavfilter 7. 85.100 / 7. 85.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 7.100 / 5. 7.100
libswresample 3. 7.100 / 3. 7.100
libpostproc 55. 7.100 / 55. 7.100
Input #0, matroska,webm, from '[pow at start][crop_1080]屏東潮州六姐妹in新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm':
Metadata:
ENCODER : Lavf57.83.100
Duration: 00:00:01.02, start: -0.007000, bitrate: 1804 kb/s
Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv), 1280x720, SAR 32:27 DAR 512:243, 24 fps, 24 tbr, 1k tbn, 1k tbc (default)
Metadata:
DURATION : 00:00:01.020000000
Stream #0:1(eng): Audio: opus, 48000 Hz, mono, fltp (default)
Metadata:
ENCODER : Lavc57.107.100 libopus
DURATION : 00:00:01.001000000
[Parsed_astats_0 @ 0x7f68b40151c0] Channel: 165KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level:
-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Crest factor: 1.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Dynamic range: inf
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings rate: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Overall
[Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Min level:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max level:
-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: 3082.547156
[Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
[Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
[Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of samples: 0
[Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0.000000
[end of initial filter output before playback starts]
[Parsed_astats_0 @ 0x7f689c004580] Channel: 1 0KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Crest factor: 3.824099
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Dynamic range: 318.880510
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings: 2403
[Parsed_astats_0 @ 0x7f689c004580] Zero crossings rate: 0.050390
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c004580] Overall
[Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
[Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
[Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
[Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
[Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
[Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
[Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
[Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
[Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
[Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c004580] Number of samples: 47688
[Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0.000000
[end of first playback]
[Parsed_astats_0 @ 0x7f689c0429c0] Channel: 1 0KB sq= 0B f=0/0
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Crest factor: 16.015339
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Dynamic range: 336.671657
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings: 2399
[Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings rate: 0.050999
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0
[Parsed_astats_0 @ 0x7f689c0429c0] Overall
[Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
[Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
[Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
[Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
[Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
[Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
[Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
[Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
[Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
[Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
[Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
[Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
[Parsed_astats_0 @ 0x7f689c0429c0] Number of samples: 47040
[Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0.000000
[Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0.000000
[end of second playback]
}}}
To get the above output, I enter the command. ffplay plays the 1-second
video and then stops. I press the left arrow to seek to the start. This
makes the astats filter finish processing, so it prints the output that
includes 'Peak level dB: -3.583367'. The video plays again, with the noise
spike at the start. I press Q to quit, and astats finishes for this second
playback, printing the output that includes 'Peak level dB: 14.207783'.
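To compare playbacks without reading the whole log, the 'Overall' block of each astats instance can be pulled out with a small script. This is a hypothetical helper (overall_stats is my own name, not part of FFmpeg), assuming the log was captured to a file (ffplay writes its log to stderr, e.g. with `2> ffplay.log`) and that each astats instance prints one 'Overall' block in the format shown above. Lines whose value was wrapped onto the next line are skipped.

```python
import re

# One astats log line: "[Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367"
ASTATS_LINE = re.compile(r"^\[Parsed_astats_\d+ @ (0x[0-9a-f]+)\] (.*)$")


def overall_stats(log_text):
    """Collect the 'Overall' block of each astats instance.

    Returns {instance_address: {stat_name: value}}. Values that parse as
    floats (including nan/inf) are converted; anything else is kept as text.
    """
    results = {}
    current = None  # address of the astats instance whose Overall block we are in
    for raw in log_text.splitlines():
        m = ASTATS_LINE.match(raw.strip())
        if not m:
            continue
        addr, rest = m.groups()
        if rest == "Overall":
            current = addr
            results[addr] = {}
            continue
        if addr != current or ": " not in rest:
            continue  # per-channel stats, or a wrapped "Min level:" line
        name, value = rest.split(": ", 1)
        try:
            results[addr][name] = float(value)
        except ValueError:
            results[addr][name] = value
    return results
```

Passing the captured log text to overall_stats gives one entry per playback, keyed by the filter address, so 'Peak level dB' and 'Number of samples' can be compared directly.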
I'm not sure whether this is specific to FFmpeg's native 'opus' decoder.
Specifying '-acodec libopus' gives output that sounds the same; for some
reason it results in s16 as the audio format going into the filter chain,
compared with 'fltp' for the default 'opus' decoder, as seen with
'-v verbose' or the ashowinfo filter. This changes the astats output: the
peak is 0 dB, but with a peak count of 184, presumably because the spike is
clipped at full scale in s16.
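The 'Peak level dB' values above are 20·log10 of the largest absolute sample, and a float spike above 1.0 cannot survive conversion to s16. The sketch below checks both, assuming float full scale is 1.0 and that the s16 conversion saturates (clips) rather than wraps; to_s16 is a hypothetical helper using one common scaling convention, not FFmpeg's or libopus's actual code. That clipping would explain the 0 dB peak with many peak samples in the libopus case is my inference, not something confirmed from the decoder.

```python
import math


def peak_db(peak):
    """Peak level in dB relative to full scale (float full scale = 1.0)."""
    return 20.0 * math.log10(abs(peak))


def to_s16(x):
    """Convert one float sample to s16, saturating at the int16 limits."""
    return max(-32768, min(32767, int(round(x * 32768.0))))


# Peaks from the astats 'Min level'/'Max level' lines above.
first = max(abs(-0.661960), abs(0.650380))   # first playback
second = max(abs(-1.093806), abs(5.133211))  # playback with the spike

print(round(peak_db(first), 4))   # -3.5834, matches 'Peak level dB: -3.583367'
print(round(peak_db(second), 4))  # 14.2078, matches 'Peak level dB: 14.207783'

# In s16 the spike cannot go past full scale, so it is clipped.
print(to_s16(5.133211))   # 32767
print(to_s16(-1.093806))  # -32768
```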
When using option '-vn' for no video, the noise spike does not happen when
seeking to the start of the file.
It's possible this isn't a bug, though the way a slight change in volume
led to the noise spike suggests that it is. If it isn't a bug, my guess is
that it's caused by concatenating Opus packets in the wrong way. In case it
helps with diagnosis, here is a problem I had when doing that: I was making
a video that I encoded in segments, each with H.264 video and Opus audio.
When I joined all the segments with the 'concat' demuxer and '-c copy'
(stream copy), some joins seemed fine, but between some segments there was
a noise spike.
That is, 'astats' would report a spike to something like 7 dB at the start
of a segment, even though the original audio did not have this spike and
'-c copy' was used. I tried uploading the joined audio to YouTube in case
the problem was my decoding software, and the spike was there too. My only
guess is that the Opus decoder keeps some kind of internal state, and each
packet depends on the state left by previous packets. (You can see
something like this if you force a DC bias into an audio stream: a
visualization in ffplay or similar shows the bias quickly going to zero
each time you seek to a new point in the file.) Something like this could
be the cause of the current bug. I can't explain, though, why a minuscule
change in volume while encoding would lead to greatly diverging results
during playback, or why there's no noise spike with '-vn'.
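To illustrate the idea that decoding depends on state left by earlier packets, here is a toy model, not Opus: a single one-pole filter whose output depends on its previous output. The coefficient 0.85 is an arbitrary value chosen for illustration, and one_pole is a hypothetical name. Decoding from the start and decoding after skipping the first 100 samples give different output right after the "seek", and the difference then decays. A real Opus decoder carries much more state than this (as far as I understand, CELT predicts band energies from the previous frame unless a frame is intra-coded), so the toy only shows why missing earlier packets can change the first output after a seek. It does not model how large the spike can get.

```python
def one_pole(x, coef=0.85, state=0.0):
    """Toy stateful 'decoder' stage: y[n] = x[n] + coef * y[n-1]."""
    y = []
    for sample in x:
        state = sample + coef * state
        y.append(state)
    return y


signal = [0.1] * 200              # constant input, like a forced DC bias
full = one_pole(signal)           # decoded from the very first sample
seeked = one_pole(signal[100:])   # "seek": the first 100 samples were never decoded

# Difference between the two decodes over the 100 samples they share.
errors = [abs(a - b) for a, b in zip(full[100:], seeked)]
print(f"first sample after the seek: {errors[0]:.4f}")
print(f"last shared sample:          {errors[-1]:.2e}")
```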
I do note that in this output there are fewer audio samples in the
playback with the noise spike (47040 instead of 47688), and I think this
might be the key to fixing this bug. With '-vn', the second playback gives
48000 samples but the same peak level as the first playback; it sounds
slightly different, but that's probably just from PulseAudio starting
later or something similar.
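The sample counts above are easier to compare as durations. At the stream's 48000 Hz (from the 'Audio: opus, 48000 Hz, mono' line), the arithmetic is:

```python
SAMPLE_RATE = 48000  # from 'Audio: opus, 48000 Hz, mono' above


def samples_to_ms(n, rate=SAMPLE_RATE):
    """Convert a sample count to milliseconds."""
    return 1000.0 * n / rate


first_run, spiky_run, vn_run = 47688, 47040, 48000
print(samples_to_ms(first_run - spiky_run))  # 13.5 (ms missing from the spiky playback)
print(samples_to_ms(vn_run - first_run))     # 6.5 (ms, '-vn' playback vs. first playback)
```

So the spiky playback is about 13.5 ms of audio shorter than the first one. This is only arithmetic on the reported counts; it is consistent with decoding starting from a later packet but does not show where decoding actually started.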
So I think what is happening is this: since the first video frame has a
presentation timestamp of 0.021 (because the Opus audio starts 0.007
seconds earlier each time the file is stream-copied, which might be another
bug that I'm not reporting here), ffplay seeks to the audio packet that
matches the start of the video. When decoding starts from this slightly
later packet, there is a noise spike.
If this explanation is correct, the questions are:
1) is ffmpeg/ffplay following the decoding specifications for opus?
2) can the problem be fixed even if it's due to following the spec?
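Regarding these questions: if the explanation above is right, the usual remedy is to start decoding some time before the seek target and discard the output up to the target, so the decoder state has time to converge. As far as I know, RFC 7845 (Ogg Opus) recommends at least 80 ms of pre-roll when seeking, and Matroska/WebM stores a SeekPreRoll value for Opus tracks for the same reason; I have not checked what ffplay actually does with it. The sketch below only shows the bookkeeping: plan_seek is a hypothetical helper, the packet start times are made up (20 ms packets starting at -0.007 s to mirror 'start: -0.007000' above), and 80 ms is the assumed pre-roll.

```python
import bisect

SAMPLE_RATE = 48000   # from 'Audio: opus, 48000 Hz' above
SEEK_PREROLL = 0.080  # seconds; assumed pre-roll (80 ms)


def plan_seek(target, packet_starts, preroll=SEEK_PREROLL, rate=SAMPLE_RATE):
    """Decide where to start decoding for a seek to `target` (seconds).

    packet_starts must be sorted packet start times in seconds. Returns
    (index of the first packet to decode, number of decoded samples to
    discard before the output reaches `target`).
    """
    # Start decoding `preroll` seconds before the target, but not before the stream.
    decode_from = max(packet_starts[0], target - preroll)
    # Last packet that starts at or before decode_from.
    idx = max(0, bisect.bisect_right(packet_starts, decode_from) - 1)
    # Samples decoded before the target only warm up the decoder state.
    discard = round((target - packet_starts[idx]) * rate)
    return idx, discard


# Hypothetical 20 ms packets starting at -0.007 s.
starts = [-0.007 + 0.020 * i for i in range(51)]
print(plan_seek(0.021, starts))  # (0, 1344): decode from the first packet
print(plan_seek(0.5, starts))    # (21, 4176)
```

With this approach, a seek to the first video frame at 0.021 s would decode from the very first audio packet and drop the samples before 0.021 s, instead of starting the decoder cold at a later packet.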
--
Ticket URL: <https://trac.ffmpeg.org/ticket/9307#comment:3>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker