[FFmpeg-devel] RFC: new packed pixel formats (machine vision)

Tue Oct 22 14:11:04 EEST 2024

On 22.10.24 08:50, Diederick C. Niehorster wrote:
>> I want to pick up a discussion i started last week
>> (https://ffmpeg.org/pipermail/ffmpeg-devel/2024-October/334585.html)
>> in a new thread, with the relevant information nicely organized. This
>> is about adding pixel formats common in machine vision to ffmpeg
>> (though i understand some formats may also be used by cinema cameras),
>> and supporting them as input formats in swscale so that it becomes
>> easy to use ffmpeg for machine vision purposes (I already have such
>> software, it will be open-sourced in good time, but right now there is
>> a proprietary conversion layer from Basler i need to replace (e.g. by
>> this proposal)).

Most of your points do not look particularly specific to machine
learning or computer vision, but more like typical/traditional video
tech peculiarities. More ML-related obstacles come into play if you
have to support optimized calculations with uncommonly small bit sizes,
etc. But most of the issues you describe should be easily solvable with
features already available in ffmpeg, if I'm not wrong.

>> Example formats are 10 and 12 bit Bayer formats, where the 10 bit
>> cannot be represented in AVPixFmtDescriptors as currently as effective
>> bit depth for the red and blue channels is 2.5 bits, but component
>> depths should be integers. 

As bits will always be distinct entities, you don't need more than 
simple natural numbers to describe their placement and amount precisely.

ffmpeg already supports the AV_PIX_FMT_FLAG_BITSTREAM flag to switch
some description fields from byte to bit values. That's enough to
describe the layout of most pixel formats -- even packed ones that are
not aligned to byte or 32-bit boundaries. You just have to use bit size
values for the step and offset struct members.
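
To illustrate what bit-valued step and offset mean in practice, here is
a minimal sketch. BitComp and read_comp() are hypothetical names made up
for this mail, not the real AVComponentDescriptor or any existing ffmpeg
function, and the MSb-first bit order within a byte is an assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one component descriptor.
 * All three fields are counted in bits, not bytes. */
typedef struct BitComp {
    unsigned step;   /* distance between two consecutive pixels, in bits */
    unsigned offset; /* position of the component's first bit, in bits   */
    unsigned depth;  /* number of bits that make up the component        */
} BitComp;

/* Read the component of pixel x from one line.
 * Assumption: bits are stored MSb-first within each byte. */
static unsigned read_comp(const uint8_t *line, const BitComp *c, size_t x)
{
    size_t pos = (size_t)c->step * x + c->offset; /* absolute bit position */
    unsigned v = 0;

    for (unsigned i = 0; i < c->depth; i++, pos++) {
        unsigned bit = (line[pos >> 3] >> (7 - (pos & 7))) & 1;
        v = (v << 1) | bit;
    }
    return v;
}
```

With step = 10, offset = 0 and depth = 10 this addresses four 10-bit
gray values in 5 bytes without any byte alignment.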

But there is another common case which indeed cannot be described with
ffmpeg's current struct: color components can be composed of separate
MSb and LSb parts at different places in the component sequence --
similar to BayerRG12g40 and BayerRG12g24 in your linked examples.
Those examples are admittedly a little more complex, because they may
describe arrangements that differ between even and odd lines. The bit
packing for 10- and 12-bit data in DNxUncompressed entails a similar
issue, as it packs all LSb information as one block at the end of every
scan line.

The simple case of just separated MSb and LSb locations within an
otherwise simply repeating group of pixel bits could be solved by
extending the description in a way similar to the RGBALayout
description sequence of MXF -- see G.2.40/p174 of
https://pub.smpte.org/latest/st377-1/st377-1-2019.pdf
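
A rough sketch of how such an extended description could look. This is
purely illustrative: BitPart, decode_group() and split10 are invented
names, the entry layout is not the actual MXF RGBALayout encoding, and
MSb-first bit order in the stream is an assumption. Each entry names the
sample it belongs to, how many bits it contributes, and where those bits
land in the output value:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout entry; not an existing ffmpeg or MXF structure. */
typedef struct BitPart {
    uint8_t comp;  /* index of the output value this part belongs to */
    uint8_t bits;  /* number of bits in this part                     */
    uint8_t shift; /* where the part lands inside the output value    */
} BitPart;

/* Decode one pixel group by walking the parts in stream order.
 * Assumption: bits are stored MSb-first within each byte. */
static void decode_group(const uint8_t *src, const BitPart *parts,
                         size_t nb_parts, uint16_t *out, size_t nb_out)
{
    size_t pos = 0; /* running bit position inside the group */

    for (size_t i = 0; i < nb_out; i++)
        out[i] = 0;

    for (size_t p = 0; p < nb_parts; p++) {
        unsigned v = 0;
        for (unsigned b = 0; b < parts[p].bits; b++, pos++)
            v = (v << 1) | ((src[pos >> 3] >> (7 - (pos & 7))) & 1);
        out[parts[p].comp] |= (uint16_t)(v << parts[p].shift);
    }
}

/* Example layout: four 10-bit samples, the 8 MSbs of each in bytes 0..3,
 * followed by the four 2-bit LSb parts together in byte 4. */
static const BitPart split10[] = {
    {0, 8, 2}, {1, 8, 2}, {2, 8, 2}, {3, 8, 2},
    {0, 2, 0}, {1, 2, 0}, {2, 2, 0}, {3, 2, 0},
};
```

The point is that the table stays a flat list of small natural numbers;
only the decoder has to know how to reassemble the parts.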

More complex arrangements should IMHO simply be converted to more common
formats by application-specific handling, rather than getting an overly
complex ffmpeg pixel description.

>> Other example formats are 10bit gray
>> formats where multiple values are packed without padding over multiple
>> bytes (e.g. 4 10-bit pixels packed into 5 bytes, so not aligned to 16
>> or 32 bits).

That's no problem, as already explained.

Unpacking this kind of data into sparser 16-bit aligned structures can
be handled very efficiently with the PDEP intrinsics of modern CPUs
(BMI2 on x86), as long as the order of components fits. Swapping the
component order is unfortunately a somewhat less efficient operation for
packed image data, while for planar data arrangements it can be solved
much more easily by pointer swaps.
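
A minimal sketch of the PDEP idea for four 10-bit values in 5 bytes.
pdep64() and unpack4x10() are names made up for this example. The sketch
assumes LSb-first packing, i.e. pixel 0 occupies the lowest 10 bits of
the little-endian 40-bit group; other bit orders need a different load
or mask. A plain C fallback is used when BMI2 is not enabled at compile
time:

```c
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* PDEP: deposit the low bits of src into the positions of the set bits
 * of mask, going from the lowest mask bit to the highest. */
static uint64_t pdep64(uint64_t src, uint64_t mask)
{
#if defined(__BMI2__) && defined(__x86_64__)
    return _pdep_u64(src, mask);
#else
    uint64_t dst = 0;
    for (uint64_t bit = 1; mask; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set mask bit */
        if (src & bit)
            dst |= lowest;
        mask &= mask - 1;                     /* clear that mask bit */
    }
    return dst;
#endif
}

/* Unpack four LSb-first packed 10-bit values into four 16-bit values. */
static void unpack4x10(const uint8_t src[5], uint16_t dst[4])
{
    uint64_t packed = 0;

    for (int i = 0; i < 5; i++)
        packed |= (uint64_t)src[i] << (8 * i); /* little-endian 40-bit load */

    /* One PDEP spreads the 4x10 bits into the low 10 bits of 4x16 bits. */
    uint64_t spread = pdep64(packed, 0x03FF03FF03FF03FFULL);

    for (int i = 0; i < 4; i++)
        dst[i] = (uint16_t)(spread >> (16 * i));
}
```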

>> Here a proposal for how these new formats could be encoded into
>> AVPixFmtDescriptor, so that these can then be used in ffmpeg/swscale.

I think swscale and the internal processing of ffmpeg should not
support an endless number of arbitrary pixel formats, but focus on a
really useful minimal set of required base formats.

I would look at Vulkan's pixel format list as a modern example of a more
systematic list of elementary pixel data storage variants.
(https://docs.vulkan.org/spec/latest/chapters/formats.html)

>> - AV_PIX_FMT_FLAG_BITPACKED_UNALIGNED which indicates formats that are
>> bit-wise packed in a way that is not aligned on 1, 2 or 4 bytes (e.g.
>> 4 10-bit values in 5 bytes). This flag is needed because
>> AV_PIX_FMT_FLAG_BITSTREAM
>> formats are aligned to 8 or 32 bits, ...

Is this really the case?

But in general you should rather describe byte/32-bit aligned bit-packed
formats by using explicit "fill" (X, etc.) pseudo components. Then you
can simply tell aligned and unaligned groups apart by the actual sum of
defined bits, resp. the remainder of dividing that sum by the alignment
size in bits.
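
As a small sketch of that check (LayoutEntry, group_bits() and
is_aligned() are hypothetical names, not ffmpeg API): with fill bits
listed explicitly, alignment falls out of a sum and a modulo, and no
extra flag is needed.

```c
#include <stddef.h>

/* Hypothetical (code, bits) entry; 'X' marks fill bits without data. */
typedef struct LayoutEntry {
    char     code;
    unsigned bits;
} LayoutEntry;

/* Total number of bits in one pixel group, fill bits included. */
static unsigned group_bits(const LayoutEntry *l, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += l[i].bits;
    return sum;
}

/* A group is aligned if its bit count leaves no remainder when divided
 * by the alignment size in bits. */
static int is_aligned(const LayoutEntry *l, size_t n, unsigned align_bits)
{
    return group_bits(l, n) % align_bits == 0;
}
```

For example, three 10-bit components plus a 2-bit 'X' entry sum to 32
bits and are 32-bit aligned, while four 10-bit components without fill
sum to 40 bits, which is byte aligned but not 32-bit aligned.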

I hope that's at least inspiring food for thought... ;)

Martin