[FFmpeg-trac] #9307(undetermined:new): Decoding of Opus audio with missing packets produces noise spike (was: Seeking to the start of some files produces noise spike)

FFmpeg trac at avcodec.org
Mon Jun 28 07:21:41 EEST 2021


#9307: Decoding of Opus audio with missing packets produces noise spike
-------------------------------------+-------------------------------------
             Reporter:  Misaki       |                    Owner:  (none)
                 Type:  defect       |                   Status:  new
             Priority:  normal       |                Component:
                                     |  undetermined
              Version:  unspecified  |               Resolution:
             Keywords:  ffplay,      |               Blocked By:
  opus, libopus                      |
             Blocking:               |  Reproduced by developer:  0
Analyzed by developer:  0            |
-------------------------------------+-------------------------------------
Changes (by Misaki):

 * summary:  Seeking to the start of some files produces noise spike =>
     Decoding of Opus audio with missing packets produces noise spike


Old description:

> Summary of the bug:
> With some files, seeking to the start produces a noise spike.
>
> In the past, I found this was the case with a slight change in volume
> when encoding audio. So, for example, the filter "volume=0.8734" would
> produce a 'bugged' file that would cause this noise spike, while
> "volume=0.8733" would not. I waited to report it until I could use a more
> recent version of ffmpeg and ffplay.
>
> How to reproduce (see below for interpretation):
> {{{
> $  /usr/bin/ffplay \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北
> 市三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -af astats
> ffplay version 4.3.2-0+deb11u1ubuntu1 Copyright (c) 2003-2021 the FFmpeg
> developers
>   built with gcc 10 (Ubuntu 10.2.1-20ubuntu1)
>   configuration: --prefix=/usr --extra-version=0+deb11u1ubuntu1
> --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu
> --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl
> --disable-stripping --enable-avresample --disable-filter=resample
> --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-
> libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-
> libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig
> --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm
> --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-
> libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
> --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-
> libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-
> libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-
> libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack
> --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid
> --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-
> openal --enable-opencl --enable-opengl --enable-sdl2 --enable-
> pocketsphinx --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-
> libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-
> libx264 --enable-shared
>   libavutil      56. 51.100 / 56. 51.100
>   libavcodec     58. 91.100 / 58. 91.100
>   libavformat    58. 45.100 / 58. 45.100
>   libavdevice    58. 10.100 / 58. 10.100
>   libavfilter     7. 85.100 /  7. 85.100
>   libavresample   4.  0.  0 /  4.  0.  0
>   libswscale      5.  7.100 /  5.  7.100
>   libswresample   3.  7.100 /  3.  7.100
>   libpostproc    55.  7.100 / 55.  7.100
> Input #0, matroska,webm, from '[pow at start][crop_1080]屏東潮州六姐妹i
> n新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm':
>   Metadata:
>     ENCODER         : Lavf57.83.100
>   Duration: 00:00:01.02, start: -0.007000, bitrate: 1804 kb/s
>     Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv), 1280x720, SAR
> 32:27 DAR 512:243, 24 fps, 24 tbr, 1k tbn, 1k tbc (default)
>     Metadata:
>       DURATION        : 00:00:01.020000000
>     Stream #0:1(eng): Audio: opus, 48000 Hz, mono, fltp (default)
>     Metadata:
>       ENCODER         : Lavc57.107.100 libopus
>       DURATION        : 00:00:01.001000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Channel: 165KB sq=    0B f=0/0
> [Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Min level:
> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Max level:
> -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Crest factor: 1.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
> [Parsed_astats_0 @ 0x7f68b40151c0] Dynamic range: inf
> [Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings rate: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Overall
> [Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Min level:
> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Max level:
> -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
> 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: 3082.547156
> [Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
> [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of samples: 0
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0.000000
> [Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0.000000
>
> [end of initial filter output before playback starts]
>
> [Parsed_astats_0 @ 0x7f689c004580] Channel: 1 0KB sq=    0B f=0/0
> [Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
> [Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
> [Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
> [Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
> [Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
> [Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
> [Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
> [Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
> [Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
> [Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
> [Parsed_astats_0 @ 0x7f689c004580] Crest factor: 3.824099
> [Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Peak count: 2
> [Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
> [Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454
> [Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
> [Parsed_astats_0 @ 0x7f689c004580] Dynamic range: 318.880510
> [Parsed_astats_0 @ 0x7f689c004580] Zero crossings: 2403
> [Parsed_astats_0 @ 0x7f689c004580] Zero crossings rate: 0.050390
> [Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0
> [Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0
> [Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0
> [Parsed_astats_0 @ 0x7f689c004580] Overall
> [Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
> [Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
> [Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
> [Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
> [Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
> [Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
> [Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
> [Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
> [Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
> [Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
> [Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Peak count: 2.000000
> [Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
> [Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454.000000
> [Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
> [Parsed_astats_0 @ 0x7f689c004580] Number of samples: 47688
> [Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0.000000
> [Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0.000000
>
> [end of first playback]
>
> [Parsed_astats_0 @ 0x7f689c0429c0] Channel: 1 0KB sq=    0B f=0/0
> [Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
> [Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
> [Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
> [Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
> [Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
> [Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
> [Parsed_astats_0 @ 0x7f689c0429c0] Crest factor: 16.015339
> [Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2
> [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
> [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454
> [Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
> [Parsed_astats_0 @ 0x7f689c0429c0] Dynamic range: 336.671657
> [Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings: 2399
> [Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings rate: 0.050999
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0
> [Parsed_astats_0 @ 0x7f689c0429c0] Overall
> [Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
> [Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
> [Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
> [Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
> [Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
> [Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
> [Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
> [Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
> [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of samples: 47040
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0.000000
> [Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0.000000
> [end of second playback]
> }}}
>
> For the above output, I enter the command. It plays the 1 second video,
> and then stops. I press left arrow to seek to the start. This causes the
> astats filter to finish processing, so it produces the output that
> includes 'Peak level dB: -3.583367'. It plays again, with the noise peak
> at start. I press Q to quit, and the astats filter finishes for this
> second playback, producing the output that includes 'Peak level dB:
> 14.207783'.
>

> I'm not sure if this is somehow caused by opus. Specifying '-acodec
> libopus' gives output that sounds the same; for some reason it seems to
> result in format s16 as the audio input for filter chain, compared to
> format 'fltp' for the default codec of 'opus', as seen with -v verbose or
> filter ashowinfo. This changes the output from astats, with peak of 0 dB
> but peak count of 184.
>
> When using option '-vn' for no video, the noise spike does not happen
> when seeking to the start of the file.
>

> It's possible this isn't a bug, though the result I had with a slight
> change in volume leading to the noise spike suggests it is a bug. If it
> isn't a bug, I'm guessing it's somehow caused by concatenating opus
> packets in the wrong way. Describing a problem I had when doing that in
> case it helps with diagnosing this bug: I was trying to make a video
> which I encoded in segments. I had each segment as H.264 video and Opus
> audio. When I joined all the segments with 'concat' demuxer and '-c copy'
> for stream copy, in some places it seemed to work fine, but between some
> segments there was a noise spike.
>
> That is, 'astats' would report a spike to something like 7 dB at the
> start of a segment, even though the original audio did not have this
> spike and '-c copy' was used. I tried uploading the joined audio to
> YouTube in case it was a problem with my decoding software and the
> problem was there too. I can only guess that Opus keeps some kind of
> information state, and packets depend on the state from previous packets.
> (Can kind of see this if you try to force a DC bias into an audio stream;
> output visualization with ffplay or something shows it quickly going to
> zero each time you seek to a new point in the file.) So something like
> this could be the cause of the current bug. I can't explain why a
> minuscule change in volume while encoding would lead to greatly diverging
> results during playback, or why there's no noise spike with '-vn',
> though.
>
> I do note that in this output, there are fewer audio samples in the
> playback with the noise spike (47040 instead of 47688), and I think this
> might actually be the key to fixing this bug. With -vn, the second
> playback gives 48000 samples but the same peak dB; the second playback
> sounds slightly different, but probably just from my pulseaudio starting
> later or something.
>
> So I think what is happening here is that, since the first video has a
> presentation timestamp of 0.021 (due to opus audio becoming 0.007 seconds
> earlier each time you copy it, which might be another bug which I'm not
> reporting here), ffplay seeks to the audio packet that matches the start
> of video. When it starts from this slightly later packet, there is a
> noise spike.
>
> If this explanation is correct, the questions are
> 1) is ffmpeg/ffplay following the decoding specifications for opus?
> 2) can the problem be fixed even if it's due to following the spec?

New description:

 Summary of the bug:
 OLD: With some files, seeking to the start produces a noise spike.

 UPDATED: Output levels from Opus decoding can vary greatly if some packets
 at the start are missing, either because they weren't included in the
 stream or because ffplay seeks to a video keyframe that comes after those
 packets.


 In the past, I found this was the case with a slight change in volume when
 encoding audio. So, for example, the filter "volume=0.8734" would produce
 a 'bugged' file that would cause this noise spike, while "volume=0.8733"
 would not. I waited to report it until I could use a more recent version
 of ffmpeg and ffplay.

 How to reproduce (see below for interpretation):
 {{{
 $  /usr/bin/ffplay \[pow\ at\ start\]\[crop_1080\]屏東潮州六姐妹in新北市
 三重正義堂遶境\ Part2\[2012-06-17\]\ \[vUY-EH3gTRU\].webm -af astats
 ffplay version 4.3.2-0+deb11u1ubuntu1 Copyright (c) 2003-2021 the FFmpeg
 developers
   built with gcc 10 (Ubuntu 10.2.1-20ubuntu1)
   configuration: --prefix=/usr --extra-version=0+deb11u1ubuntu1
 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu
 --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl
 --disable-stripping --enable-avresample --disable-filter=resample
 --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-
 libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-
 libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig
 --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm
 --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-
 libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
 --enable-librabbitmq --enable-librsvg --enable-librubberband --enable-
 libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-
 libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-
 libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack
 --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid
 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal
 --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx
 --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883
 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264
 --enable-shared
   libavutil      56. 51.100 / 56. 51.100
   libavcodec     58. 91.100 / 58. 91.100
   libavformat    58. 45.100 / 58. 45.100
   libavdevice    58. 10.100 / 58. 10.100
   libavfilter     7. 85.100 /  7. 85.100
   libavresample   4.  0.  0 /  4.  0.  0
   libswscale      5.  7.100 /  5.  7.100
   libswresample   3.  7.100 /  3.  7.100
   libpostproc    55.  7.100 / 55.  7.100
 Input #0, matroska,webm, from '[pow at start][crop_1080]屏東潮州六姐妹in
 新北市三重正義堂遶境 Part2[2012-06-17] [vUY-EH3gTRU].webm':
   Metadata:
     ENCODER         : Lavf57.83.100
   Duration: 00:00:01.02, start: -0.007000, bitrate: 1804 kb/s
     Stream #0:0(eng): Video: vp9 (Profile 0), yuv420p(tv), 1280x720, SAR
 32:27 DAR 512:243, 24 fps, 24 tbr, 1k tbn, 1k tbc (default)
     Metadata:
       DURATION        : 00:00:01.020000000
     Stream #0:1(eng): Audio: opus, 48000 Hz, mono, fltp (default)
     Metadata:
       ENCODER         : Lavc57.107.100 libopus
       DURATION        : 00:00:01.001000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Channel: 165KB sq=    0B f=0/0
 [Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Min level:
 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Max level:
 -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Crest factor: 1.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
 [Parsed_astats_0 @ 0x7f68b40151c0] Dynamic range: inf
 [Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Zero crossings rate: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Overall
 [Parsed_astats_0 @ 0x7f68b40151c0] DC offset: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Min level:
 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Max level:
 -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Min difference:
 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Max difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Mean difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS difference: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Peak level dB: nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS level dB: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS peak dB: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] RMS trough dB: 3082.547156
 [Parsed_astats_0 @ 0x7f68b40151c0] Flat factor: -nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Peak count: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor dB: nan
 [Parsed_astats_0 @ 0x7f68b40151c0] Noise floor count: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Bit depth: 0/0
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of samples: 0
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of NaNs: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of Infs: 0.000000
 [Parsed_astats_0 @ 0x7f68b40151c0] Number of denormals: 0.000000

 [end of initial filter output before playback starts]

 [Parsed_astats_0 @ 0x7f689c004580] Channel: 1 0KB sq=    0B f=0/0
 [Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
 [Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
 [Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
 [Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
 [Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
 [Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
 [Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
 [Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
 [Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
 [Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
 [Parsed_astats_0 @ 0x7f689c004580] Crest factor: 3.824099
 [Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Peak count: 2
 [Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
 [Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454
 [Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
 [Parsed_astats_0 @ 0x7f689c004580] Dynamic range: 318.880510
 [Parsed_astats_0 @ 0x7f689c004580] Zero crossings: 2403
 [Parsed_astats_0 @ 0x7f689c004580] Zero crossings rate: 0.050390
 [Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0
 [Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0
 [Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0
 [Parsed_astats_0 @ 0x7f689c004580] Overall
 [Parsed_astats_0 @ 0x7f689c004580] DC offset: 0.000042
 [Parsed_astats_0 @ 0x7f689c004580] Min level: -0.661960
 [Parsed_astats_0 @ 0x7f689c004580] Max level: 0.650380
 [Parsed_astats_0 @ 0x7f689c004580] Min difference: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Max difference: 0.132158
 [Parsed_astats_0 @ 0x7f689c004580] Mean difference: 0.021959
 [Parsed_astats_0 @ 0x7f689c004580] RMS difference: 0.028355
 [Parsed_astats_0 @ 0x7f689c004580] Peak level dB: -3.583367
 [Parsed_astats_0 @ 0x7f689c004580] RMS level dB: -15.233950
 [Parsed_astats_0 @ 0x7f689c004580] RMS peak dB: -13.131743
 [Parsed_astats_0 @ 0x7f689c004580] RMS trough dB: -16.290829
 [Parsed_astats_0 @ 0x7f689c004580] Flat factor: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Peak count: 2.000000
 [Parsed_astats_0 @ 0x7f689c004580] Noise floor dB: -3.921652
 [Parsed_astats_0 @ 0x7f689c004580] Noise floor count: 1454.000000
 [Parsed_astats_0 @ 0x7f689c004580] Bit depth: 32/32
 [Parsed_astats_0 @ 0x7f689c004580] Number of samples: 47688
 [Parsed_astats_0 @ 0x7f689c004580] Number of NaNs: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Number of Infs: 0.000000
 [Parsed_astats_0 @ 0x7f689c004580] Number of denormals: 0.000000

 [end of first playback]

 [Parsed_astats_0 @ 0x7f689c0429c0] Channel: 1 0KB sq=    0B f=0/0
 [Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
 [Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
 [Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
 [Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
 [Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
 [Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
 [Parsed_astats_0 @ 0x7f689c0429c0] Crest factor: 16.015339
 [Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2
 [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
 [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454
 [Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
 [Parsed_astats_0 @ 0x7f689c0429c0] Dynamic range: 336.671657
 [Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings: 2399
 [Parsed_astats_0 @ 0x7f689c0429c0] Zero crossings rate: 0.050999
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0
 [Parsed_astats_0 @ 0x7f689c0429c0] Overall
 [Parsed_astats_0 @ 0x7f689c0429c0] DC offset: 0.012485
 [Parsed_astats_0 @ 0x7f689c0429c0] Min level: -1.093806
 [Parsed_astats_0 @ 0x7f689c0429c0] Max level: 5.133211
 [Parsed_astats_0 @ 0x7f689c0429c0] Min difference: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Max difference: 1.122307
 [Parsed_astats_0 @ 0x7f689c0429c0] Mean difference: 0.024617
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS difference: 0.035662
 [Parsed_astats_0 @ 0x7f689c0429c0] Peak level dB: 14.207783
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS level dB: -9.882940
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS peak dB: -13.131624
 [Parsed_astats_0 @ 0x7f689c0429c0] RMS trough dB: -16.263167
 [Parsed_astats_0 @ 0x7f689c0429c0] Flat factor: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Peak count: 2.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor dB: -3.921652
 [Parsed_astats_0 @ 0x7f689c0429c0] Noise floor count: 1454.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Bit depth: 32/32
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of samples: 47040
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of NaNs: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of Infs: 0.000000
 [Parsed_astats_0 @ 0x7f689c0429c0] Number of denormals: 0.000000
 [end of second playback]
 }}}

 For the above output, I enter the command. It plays the 1 second video,
 and then stops. I press left arrow to seek to the start. This causes the
 astats filter to finish processing, so it produces the output that
 includes 'Peak level dB: -3.583367'. It plays again, with the noise peak
 at start. I press Q to quit, and the astats filter finishes for this
 second playback, producing the output that includes 'Peak level dB:
 14.207783'.
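
 The two peak figures can be cross-checked against the level values in the
 same output, assuming astats reports the peak in dB as 20*log10 of the
 largest absolute sample value, with full scale at 1.0. A minimal Python
 sketch (the function name peak_db is only illustrative, not part of FFmpeg):

```python
import math

def peak_db(min_level, max_level):
    """Peak level in dB relative to full scale (1.0), from astats Min/Max levels."""
    peak = max(abs(min_level), abs(max_level))
    return 20 * math.log10(peak)

# Values copied from the astats output above.
first = peak_db(-0.661960, 0.650380)    # astats printed -3.583367
second = peak_db(-1.093806, 5.133211)   # astats printed 14.207783

print(f"first playback:  {first:.3f} dB")
print(f"second playback: {second:.3f} dB")
print(f"linear ratio of peaks: {5.133211 / 0.661960:.2f}x")
```

 A max level above 1.0 only survives because the samples here are float
 (fltp). With s16 output such a spike would presumably clip at full scale,
 which would be consistent with the 0 dB peak seen with '-acodec libopus'.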


 I'm not sure if this is somehow caused by opus. Specifying '-acodec
 libopus' gives output that sounds the same; for some reason it seems to
 result in format s16 as the audio input for filter chain, compared to
 format 'fltp' for the default codec of 'opus', as seen with -v verbose or
 filter ashowinfo. This changes the output from astats, with peak of 0 dB
 but peak count of 184.

 When using option '-vn' for no video, the noise spike does not happen when
 seeking to the start of the file.


 It's possible this isn't a bug, though the result I had with a slight
 change in volume leading to the noise spike suggests it is a bug. If it isn't
 a bug, I'm guessing it's somehow caused by concatenating opus packets in
 the wrong way. Describing a problem I had when doing that in case it helps
 with diagnosing this bug: I was trying to make a video which I encoded in
 segments. I had each segment as H.264 video and Opus audio. When I joined
 all the segments with 'concat' demuxer and '-c copy' for stream copy, in
 some places it seemed to work fine, but between some segments there was a
 noise spike.

 That is, 'astats' would report a spike to something like 7 dB at the start
 of a segment, even though the original audio did not have this spike and
 '-c copy' was used. I tried uploading the joined audio to YouTube in case
 it was a problem with my decoding software and the problem was there too.
 I can only guess that Opus keeps some kind of information state, and
 packets depend on the state from previous packets. (Can kind of see this
 if you try to force a DC bias into an audio stream; output visualization
 with ffplay or something shows it quickly going to zero each time you seek
 to a new point in the file.) So something like this could be the cause of
 the current bug. I can't explain why a minuscule change in volume while
 encoding would lead to greatly diverging results during playback, or why
 there's no noise spike with '-vn', though.

 I do note that in this output, there are fewer audio samples in the
 playback with the noise spike (47040 instead of 47688), and I think this
 might actually be the key to fixing this bug. With -vn, the second
 playback gives 48000 samples but the same peak dB; the second playback
 sounds slightly different, but probably just from my pulseaudio starting
 later or something.
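
 To put those sample counts in time units, the following Python sketch
 converts each shortfall to milliseconds at the 48000 Hz rate shown in the
 log. The 48000-sample reference is the '-vn' count mentioned above; the
 helper name missing_ms is only illustrative.

```python
SAMPLE_RATE = 48000  # Hz, from "Audio: opus, 48000 Hz" in the log

def missing_ms(reference_samples, observed_samples, rate=SAMPLE_RATE):
    """Duration in milliseconds of the samples absent from a playback."""
    return (reference_samples - observed_samples) * 1000 / rate

REFERENCE = 48000  # sample count reported for the second playback with -vn
counts = {
    "first playback": 47688,
    "second playback (after seek)": 47040,
}

for name, n in counts.items():
    print(f"{name}: {REFERENCE - n} samples short, "
          f"{missing_ms(REFERENCE, n):.1f} ms")

# Difference between the two playbacks with video enabled.
print(f"first vs second: {47688 - 47040} samples, "
      f"{missing_ms(47688, 47040):.1f} ms")
```

 The shortfalls are 312 and 960 samples. 960 samples is 20 ms, the default
 libopus frame duration, and 312 samples is a pre-skip value commonly seen
 in libopus-encoded streams, but neither has been confirmed for this file.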

 So I think what is happening here is that, since the first video has a
 presentation timestamp of 0.021 (due to opus audio becoming 0.007 seconds
 earlier each time you copy it, which might be another bug which I'm not
 reporting here), ffplay seeks to the audio packet that matches the start
 of video. When it starts from this slightly later packet, there is a noise
 spike.
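
 A rough Python sketch of this hypothesis follows. It rests on assumptions
 not confirmed in this report: that the audio uses 20 ms Opus frames, that
 the first audio packet starts at the stream start of -0.007 s shown in the
 log, and that after the seek decoding begins at the packet whose interval
 contains 0.021 s. All names are illustrative.

```python
SAMPLE_RATE = 48000   # Hz, from the stream info in the log
FRAME_MS = 20         # ASSUMPTION: 20 ms Opus frames; not stated in the report
START_MS = -7         # "start: -0.007000" in the log, taken as the first audio pts
TARGET_MS = 21        # first video pts given in the report (0.021 s)

def packets_before(target_ms, start_ms=START_MS, frame_ms=FRAME_MS):
    """Count whole packets that end at or before target_ms."""
    if target_ms <= start_ms:
        return 0
    return (target_ms - start_ms) // frame_ms

skipped = packets_before(TARGET_MS)
skipped_samples = skipped * FRAME_MS * SAMPLE_RATE // 1000

print(f"packets skipped: {skipped}, samples skipped: {skipped_samples}")
print(f"remaining of 48000: {48000 - skipped_samples}")
```

 Under these assumptions one packet is skipped, and the remaining count
 equals the 47040 samples reported for the second playback. That agrees
 with the explanation above but does not prove it.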

 If this explanation is correct, the questions are
 1) is ffmpeg/ffplay following the decoding specifications for opus?
 2) can the problem be fixed even if it's due to following the spec?

-- 
Ticket URL: <https://trac.ffmpeg.org/ticket/9307#comment:3>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker

