> The following works here:
> $ ffmpeg -rtsp_transport tcp -i rtsp... -qscale 2 -acodec copy -y out.avi

Can confirm. And to think that I spent the last 36 hours on this issue believing I was recording the actual audio in the room. (The room was empty the whole time, until just one hour ago.)

> I am happy if somebody convinces me that there is a bug in FFmpeg (and not in the Hikvision server)!

As for the timestamps, I am fairly certain that our customized Hikvision implementation is to blame. We have a physical mixer device that merges the output of two or three microphones (real-time translations of what is happening in the room, spoken in different languages by operators in separate rooms) and routes it to a single camera's audio output. So when the NVR receives the RTSP stream from the camera, it records this merged audio track with several languages on it. Usually it is a combination of the language actually spoken in the room (at reduced volume) and an overlaid translation into another language. I can only assume that the mixer's merging of these tracks produces unstable or unpredictable output that somehow affects the format or the timestamps.
The same thing happens when I fetch the MPEG-TS stream from the NVR.
However, it would not be the first time I have encountered a buggy environment when working with Hikvision's internal API.

Thank you for your help. I will try to provide any information about this stream, or about the Hikvision NVR API in general, that the FFmpeg community asks for.

