[FFmpeg-devel] [PATCH] lavfi: add opencl tonemap filter.

Song, Ruiling ruiling.song at intel.com
Tue May 22 11:56:37 EEST 2018



> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces at ffmpeg.org] On Behalf Of
> Niklas Haas
> Sent: Tuesday, May 22, 2018 10:28 AM
> To: ffmpeg-devel at ffmpeg.org
> Cc: Mark Thompson <sw at jkqxz.net>
> Subject: Re: [FFmpeg-devel] [PATCH] lavfi: add opencl tonemap filter.
> 
> On Tue, 22 May 2018 01:18:30 +0100, Mark Thompson <sw at jkqxz.net> wrote:
> > On 21/05/18 07:50, Ruiling Song wrote:
> > > This filter does HDR(HDR10/HLG) to SDR conversion with tone-mapping.
> > >
> > > An example command to use this filter with vaapi codecs:
> > > ffmpeg -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device \
> > > opencl=ocl@va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format \
> > > vaapi -i INPUT -filter_hw_device ocl -filter_complex \
> > > '[0:v]hwmap,tonemap_opencl=t=bt2020:tonemap=linear:format=p010[x1]; \
> > > [x1]hwmap=derive_device=vaapi:reverse=1' -c:v hevc_vaapi -profile 2 OUTPUT
> > >
> > > Signed-off-by: Ruiling Song <ruiling.song at intel.com>
> > > ---
> >
> > I assume you're testing with Beignet for this sort of mapping to work?  I tried it
> > with Beignet on Coffee Lake with 10-bit videos and it looks sensible, though it is
> > rather hard to tell whether it is in some sense "correct".
> 
> It's also rather hard to define whether it is in some sense "correct".
> The methodology employed here is generally based on ITU-R
> recommendations; however, the ITU-R advises multiple possible ways of
> doing tone-mapping. They also highlight their own curve function, which
> we don't use (for performance/simplicity reasons - iirc I gave it a try
> and the result was not visually distinct enough from the hable function
> to justify it, but my memory could be wrong). There's nothing resembling
> an official "standard" way to tone-map defined by the ITU-R.
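For readers following along: the "hable" curve referred to here is the
Uncharted 2 filmic operator, roughly as already used by vf_tonemap and mpv.
A sketch (not copied verbatim from either):

    float hable(float x)
    {
        // filmic curve constants from John Hable's original formulation
        const float A = 0.15f, B = 0.50f, C = 0.10f,
                    D = 0.20f, E = 0.02f, F = 0.30f;
        return (x * (x * A + C * B) + D * E) / (x * (x * A + B) + D * F) - E / F;
    }

    // normalized so that the source peak maps to 1.0
    float tonemap_hable(float sig, float peak)
    {
        return hable(sig) / hable(peak);
    }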
> 
> This algorithm is also generally based on the results obtained from the
> "official" ACES implementation of HDR->SDR tone mapping (obtainable
> here: https://github.com/ampas/aces-dev), with the key difference that
> we do chroma-invariant tone mapping whereas Hollywood tends to use
> channel-independent tone mapping. I think the latter distorts the colors
> too much for my taste and generally results in a worse-looking image. The
> one thing needed to make chroma-invariant tone mapping work well,
> however, is a good desaturation algorithm. This one is
> based on original research and experimentation. The desaturation
> strength with a parameter of 1.0 is comparable to the one achieved by
> the ACES algorithm, although I pick a lower strength by default (0.5),
> because I found it too damaging for some types of sources (particularly
> bright skies) as a result of the chroma-invariant nature.
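To illustrate the distinction for the list (a rough sketch with illustrative
variable names, not the exact code from the patch):

    // chroma-invariant: derive a single intensity from the pixel, tone map
    // that, and rescale all three channels by the same factor so that hue
    // and saturation are preserved
    float sig = max(max(rgb.x, rgb.y), max(rgb.z, 1e-6f));
    float sig_orig = sig;
    sig = tonemap_hable(sig, peak);     // or any other curve
    rgb *= sig / sig_orig;

    // channel-independent mapping, by contrast, would tone map each channel
    // separately and thereby shift hue/saturation in bright areas:
    // rgb = (float3)(tonemap_hable(rgb.x, peak),
    //                tonemap_hable(rgb.y, peak),
    //                tonemap_hable(rgb.z, peak));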
> 
> In addition to the desaturation step, the other key innovation which I
> cannot find mentioned in ITU-R literature is the importance of adjusting
> the overall average brightness before tone mapping. I suspect the reason
> this isn't considered by the ITU-R is because the ITU-R assumes that HDR
> sources actually follow their standards, which in practice none seem to
> do. In theory, HDR material isn't supposed to have a significantly
> higher average brightness than SDR material. Apart from the addition of
> isolated super-highlights, nothing should have changed about the image
> appearance. In practice, HDR directors like to point their expensive
> cameras at very bright objects (e.g. the sun) and blind viewers' eyes by
> failing to adjust the brightness during the mastering step. Our
> algorithm compensates for this by essentially "correcting" the bad
> mastering in real-time. [1] Of course, the result here is not as good as
> doing it ahead of time by a human, but unfortunately we don't have a say
> in this matter.
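In code terms this pre-scaling is a one-liner applied before the curve
(sketch; `average` is the measured scene average, sdr_avg the SDR reference
average):

    // bring the measured scene average down to the SDR reference average
    // before applying the tone-mapping curve; never brighten (slope <= 1.0)
    float slope = min(1.0f, sdr_avg / average);
    sig  *= slope;
    peak *= slope;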
> 
> As long as the implementation is correct, I'd be confident in saying
> that this produces pleasing results for all the sources I've thrown
> at it, often even exceeding in quality the "official" SDR-mapped Blu-ray
> versions of the same sources on the same scenes. (Partially due to the
> preserved wider color gamut.)
> 
> In order to ascertain whether or not the implementation is correct, you
> could compare it to results obtained by latest `mpv` (might need git
> master) or `libplacebo`, both of which implement the same algorithm.
> 
> 
> [1] The algorithm I use in mpv and libplacebo does this with one frame
> of latency, because I don't want to round-trip through an intermediate
> buffer in my processing chain, and there's no other way to communicate
> back the measured frame statistics to the rest of the kernels in
> OpenGL/Vulkan land. I do this because of my realtime requirements as
> well as the structure of my processing chain.
> 
> Since you are implementing an offline filter and neither of these
> restrictions apply to you, I would recommend changing the implementation
> to separate the peak measurement step from the tone mapping step, so
> that the former completes first and then the second runs from scratch
> and can use the results computed by the former in the same frame.
Yes, your idea sounds reasonable. But it would take a fair amount of effort to restructure the code that way (it would launch two kernels, and we may need a wait between them) and to evaluate the performance impact.
Although this is an offline filter, I think performance is still very important, as is quality.
Given that the current implementation already works well for video transcoding, I would like to keep this on my TODO list. Does that sound OK?
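Just to confirm we are talking about the same thing, I understand the proposed
restructuring as roughly the following host-side flow (hypothetical kernel
names, error handling omitted):

    size_t global_size[2] = { width, height }; /* padded to the 16x16 grid */
    size_t local_size[2]  = { 16, 16 };
    cl_int cle;

    /* pass 1: measure this frame's peak/average into util_buf */
    cle = clEnqueueNDRangeKernel(queue, kernel_measure, 2, NULL,
                                 global_size, local_size, 0, NULL, NULL);
    /* an in-order command queue already serializes the two passes; an
     * out-of-order queue would need an event or a barrier here */
    /* pass 2: tone map using the statistics measured from the same frame */
    cle = clEnqueueNDRangeKernel(queue, kernel_tonemap, 2, NULL,
                                 global_size, local_size, 0, NULL, NULL);
    cle = clFinish(queue);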

> 
> If you don't do this, you run the risk of failing to tone map single
> frame data (e.g. screenshots), because no data about the previous frame
> is available at the time.
> 
> > > +// detect peak/average signal of a frame, the algorithm was ported from:
> > > +// libplacebo (https://github.com/haasn/libplacebo)
> > > +struct detection_result
> > > +detect_peak_avg(global uint *util_buf, __local uint *sum_wg,
> > > +            float signal, float peak) {
> > > +    global uint *avg_buf = util_buf;
> > > +    global uint *peak_buf = avg_buf + DETECTION_FRAMES + 1;
> > > +    global uint *counter_wg_p = peak_buf + DETECTION_FRAMES + 1;
> > > +    global uint *max_total_p = counter_wg_p + 1;
> > > +    global uint *avg_total_p = max_total_p + 1;
> > > +    global uint *frame_idx_p = avg_total_p + 1;
> > > +    global uint *scene_frame_num_p = frame_idx_p + 1;
> > > +
> > > +    uint frame_idx = *frame_idx_p;
> > > +    uint scene_frame_num = *scene_frame_num_p;
> > > +
> > > +    size_t lidx = get_local_id(0);
> > > +    size_t lidy = get_local_id(1);
> > > +    size_t lsizex = get_local_size(0);
> > > +    size_t lsizey = get_local_size(1);
> > > +    uint num_wg = get_num_groups(0) * get_num_groups(1);
> > > +    size_t group_idx = get_group_id(0);
> > > +    size_t group_idy = get_group_id(1);
> > > +    struct detection_result r = {peak, sdr_avg};
> > > +    *sum_wg = 0;
> >
> > This is technically a data race - maybe set it in only the first workitem?
> 
> I'm not sure where the data race is here. There's a barrier immediately
> below it, which ensures that all of the *sum_wg writes must complete
> before progressing further, no? So even though all of the threads conflict
> in their write to *sum_wg, they all write the same thing and wait for
> each other before continuing.
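If we want to be strict about it anyway, restricting the store to a single
work-item is cheap; something like:

    // let only the first work-item of the group initialize the local sum
    if (lidx == 0 && lidy == 0)
        *sum_wg = 0;
    barrier(CLK_LOCAL_MEM_FENCE);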
> 
> >
> > > +    barrier(CLK_LOCAL_MEM_FENCE);
> > > +
> > > +    // update workgroup sum
> > > +    atomic_add(sum_wg, (uint)(signal * REFERENCE_WHITE));
> >
> > I think the numbers you're adding together here sum to at most something like
> > 16 * 16 * 100 * 1023?  Can you make sure this can't overflow and add a
> > comment on that.
> 
> It's not * 1023; the highest possible peak in practice is 100.0 (PQ's
> peak brightness). So the limit per workgroup is 16 * 16 * 10000,
> requiring 22 bits to not overflow on a pathological input.
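Checking the arithmetic: 16 * 16 * 10000 = 2,560,000, and 2^22 = 4,194,304,
so a 32-bit uint has plenty of headroom for the per-workgroup sum.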
> 
> >
> > > +    barrier(CLK_LOCAL_MEM_FENCE);
> > > +
> > > +    // update frame peak/avg using work-group-average.
> > > +    if (lidx == 0 && lidy == 0) {
> > > +        uint avg_wg = *sum_wg / (lsizex * lsizey);
> > > +        atomic_max(&peak_buf[frame_idx], avg_wg);
> > > +        atomic_add(&avg_buf[frame_idx], avg_wg);
> >
> > Similarly this one?  (width/16 * height/16 * 100 * 1023, I think, which might
> > overflow for 8K?)
> 
> For 8K it's 8192/16 * 4320/16 * 10000, requiring 31 bits to store
> without theoretical risk of overflow.
> 
> And actually, there is a third source of overflow worth investigating,
> namely the *avg_total_p variable, since this accumulates across frames.
> It stores a value of 10000 * (PEAK_DETECTION_FRAMES+1). In practice,
> however, this shouldn't cause any issues for typical buffer sizes.
> (Needing 20 bits for a buffer size of 100).
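For reference: 8192/16 * 4320/16 * 10000 = 512 * 270 * 10000 = 1,382,400,000,
which is still below 2^31 = 2,147,483,648, and 10000 * 101 = 1,010,000 for a
100-frame buffer needs only 20 bits, so 32-bit uint accumulators are fine in
all three cases.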
> 
> Note: In practice, none of these considerations are really worth worrying
> about, since the average illumination of a scene is generally at most
> around 50, so it's more like 23 bits needed to store a typical scene
> rather than the 31-bit worst case I calculated earlier. The only scenario in
> which I could imagine a worst case like that occurring in normal content
> is if some mastering engineer mistakenly implements a "fade to white" by
> fading to the highest possible HDR peak, and this were to somehow
> survive being reviewed by other humans who presumably have functioning
> retinas that would be screaming in pain as their displays blasted 10000
> cd/m² during the fade.
> 
> > > +    // de-saturate
> > > +    if (desat_param > 0.0f) {
> > > +        float luma = get_luma_dst(rgb);
> > > +        float base = 0.18f * dst_peak;
> >
> > Magic number might want some explaination.
> 
> It is derived from experimentation and visual comparisons with e.g. the
> ACES algorithm. There is no theoretical basis for it.
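For readers without the full patch in front of them, the step continues
roughly like this (following the libplacebo approach; paraphrased, not a
verbatim excerpt):

    // blend each pixel towards its own luma; the blend coefficient grows as
    // the signal rises above `base`, shaped by desat_param (0 disables it)
    float coeff = max(sig - base, 1e-6f) / max(sig, 1e-6f);
    coeff = pow(coeff, 10.0f / desat_param);
    rgb = mix(rgb, (float3)(luma), coeff);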
> 
> > +float3 ootf_hlg(float3 c) {
> > +    float luma = get_luma_src(c);
> > +    // assume a reference display with 1000 nits peak
> > +    float factor = 1000.0f / REFERENCE_WHITE * pow(luma, 0.2f) / pow(12.0f, 1.2f);
> > +    return c * factor;
> > +}
> > +
> > +float3 inverse_ootf_hlg(float3 c) {
> > +    // assume a reference display with 1000 nits peak
> > +    c *=  pow(12.0f, 1.2f) / (1000.0f / REFERENCE_WHITE);
> > +    c /= pow(get_luma_dst(c), 0.2f / 1.2f);
> > +    return c;
> > +}
> 
> I would recommend parametrizing these by the peak variable. When you
> tone map from HLG to HLG at a lower peak, the inverse OOTF call needs to
> use the new peak. (You should also update the peak tagging in the
> frame's side channel data, not sure if you do).
Are you talking about display-referred HLG? I didn't update the frame's side data,
and I am not sure when I would need to update it. I thought all HLG content was
scene-referred; is that not the case?
Could you tell me more about display-referred HLG? I can't find anything about it.
What metadata in HEVC indicates that content is display-referred?
Is there any display-referred HLG sample video available?
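If I follow the parametrization you suggest (and explain further below), the
OOTF would end up looking something like this sketch, using the BT.2100
formula for the HLG system gamma (untested, just to check my understanding):

    // display_peak in nits; BT.2100 HLG system gamma is 1.2 at 1000 nits
    float3 ootf_hlg(float3 c, float display_peak) {
        float gamma = 1.2f + 0.42f * log10(display_peak / 1000.0f);
        float luma = get_luma_src(c);
        // scene-referred signal is in [0, 12]; map 12.0 to display_peak
        // expressed in units of REFERENCE_WHITE
        float factor = display_peak / REFERENCE_WHITE *
                       pow(luma, gamma - 1.0f) / pow(12.0f, gamma);
        return c * factor;
    }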

Thanks for your comment.
Ruiling
> 
> Ditto, for the forwards OOTF, the `peak` needs to match the value you
> assume for the src sig peak down below. You have it hard-coded as 12.0
> for HLG, which is the correct peak in scene-referred space, but that
> doesn't necessarily need to match the display referred case, which is
> what's relevant for tone mapping. If you tune the OOTF for a 1000 nits
> peak display, the source peak after applying the OOTF would be 10.0, not
> 12.0. Alternatively, you could tune the OOTF for 1200 nits instead.
> (This is what libplacebo does, although I think not intentionally. I'll
> change it to use 1000 nits as well.)