[FFmpeg-devel] [PATCH] lavfi: add opencl tonemap filter.

Niklas Haas ffmpeg at haasn.xyz
Tue May 22 05:28:16 EEST 2018

On Tue, 22 May 2018 01:18:30 +0100, Mark Thompson <sw at jkqxz.net> wrote:
> On 21/05/18 07:50, Ruiling Song wrote:
> > This filter does HDR(HDR10/HLG) to SDR conversion with tone-mapping.
> > 
> > An example command to use this filter with vaapi codecs:
> > FFMPEG -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device \
> > opencl=ocl at va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format \
> > vaapi -i INPUT -filter_hw_device ocl -filter_complex \
> > '[0:v]hwmap,tonemap_opencl=t=bt2020:tonemap=linear:format=p010[x1]; \
> > [x1]hwmap=derive_device=vaapi:reverse=1' -c:v hevc_vaapi -profile 2 OUTPUT
> > 
> > Signed-off-by: Ruiling Song <ruiling.song at intel.com>
> > ---
> I assume you're testing with Beignet for this sort of mapping to work?  I tried it with Beignet on Coffee Lake with 10-bit videos and it looks sensible, though it is rather hard to tell whether it is in some sense "correct".

It's also rather hard to define whether it is in some sense "correct".
The methodology employed here is generally based on ITU-R
recommendations; however, the ITU-R describes multiple possible ways of
doing tone mapping. They also highlight their own curve function, which
we don't use (for performance/simplicity reasons - IIRC I gave it a try
and the result was not visually distinct enough from the hable function
to justify it, but my memory could be wrong). There's nothing resembling
an official "standard" way to tone-map defined by the ITU-R.

This algorithm is also generally based on the results obtained from the
"official" ACES implementation of HDR->SDR tone mapping (obtainable
here: https://github.com/ampas/aces-dev), with the key difference that
we do chroma-invariant tone mapping whereas Hollywood tends to use
channel-independent tone mapping. I think the latter distorts the colors
too much for my taste and generally results in a worse-looking image.
The one thing needed to make chroma-invariant tone mapping work well,
however, is a good desaturation algorithm. This one is based on original
research and experimentation. The desaturation
strength with a parameter of 1.0 is comparable to the one achieved by
the ACES algorithm, although I pick a lower strength by default (0.5),
because I found it too damaging for some types of sources (particularly
bright skies) as a result of the chroma-invariant nature.

In addition to the desaturation step, the other key innovation which I
cannot find mentioned in ITU-R literature is the importance of adjusting
the overall average brightness before tone mapping. I suspect the reason
this isn't considered by the ITU-R is because the ITU-R assumes that HDR
sources actually follow their standards, which in practice none seem to
do. In theory, HDR material isn't supposed to have a significantly
higher average brightness than SDR material. Apart from the addition of
isolated super-highlights, nothing should have changed about the image
appearance. In practice, HDR directors like to point their expensive
cameras at very bright objects (e.g. the sun) and blind viewers' eyes by
failing to adjust the brightness during the mastering step. Our
algorithm compensates for this by essentially "correcting" the bad
mastering in real-time. [1] Of course, the result here is not as good as
doing it ahead of time by a human, but unfortunately we don't have a say
in this matter.

As long as the implementation is correct, I'd be confident in saying
that this produces pleasing results for all the sources I've thrown at
it, often even exceeding the quality of the "official" SDR-mapped
blu-ray versions of the same sources on the same scenes (partially due
to the preserved wider color gamut).

In order to ascertain whether or not the implementation is correct, you
could compare its results against the latest `mpv` (might need git
master) or `libplacebo`, both of which implement the same algorithm.

[1] The algorithm I use in mpv and libplacebo does this with one frame
of latency, because I don't want to round-trip through an intermediate
buffer in my processing chain, and there's no other way to communicate
back the measured frame statistics to the rest of the kernels in
OpenGL/Vulkan land. I do this because of my realtime requirements as
well as the structure of my processing chain.

Since you are implementing an offline filter and neither of these
restrictions apply to you, I would recommend changing the implementation
to separate the peak measurement step from the tone mapping step, so
that the former completes first and the latter can use the statistics
measured from the same frame.

If you don't do this, you run the risk of failing to tone map
single-frame inputs (e.g. screenshots), because no statistics from a
previous frame are available at the time.

> > +// detect peak/average signal of a frame, the algorithm was ported from:
> > +// libplacebo (https://github.com/haasn/libplacebo)
> > +struct detection_result
> > +detect_peak_avg(global uint *util_buf, __local uint *sum_wg,
> > +            float signal, float peak) {
> > +    global uint *avg_buf = util_buf;
> > +    global uint *peak_buf = avg_buf + DETECTION_FRAMES + 1;
> > +    global uint *counter_wg_p = peak_buf + DETECTION_FRAMES + 1;
> > +    global uint *max_total_p = counter_wg_p + 1;
> > +    global uint *avg_total_p = max_total_p + 1;
> > +    global uint *frame_idx_p = avg_total_p + 1;
> > +    global uint *scene_frame_num_p = frame_idx_p + 1;
> > +
> > +    uint frame_idx = *frame_idx_p;
> > +    uint scene_frame_num = *scene_frame_num_p;
> > +
> > +    size_t lidx = get_local_id(0);
> > +    size_t lidy = get_local_id(1);
> > +    size_t lsizex = get_local_size(0);
> > +    size_t lsizey = get_local_size(1);
> > +    uint num_wg = get_num_groups(0) * get_num_groups(1);
> > +    size_t group_idx = get_group_id(0);
> > +    size_t group_idy = get_group_id(1);
> > +    struct detection_result r = {peak, sdr_avg};
> > +    *sum_wg = 0;
> This is technically a data race - maybe set it in only the first workitem?

I'm not sure where the data race is here. There's a barrier immediately
below it, which ensures that all of the *sum_wg writes must complete
before progressing further, no? So even though all of the threads conflict
in their write to *sum_wg, they all write the same thing and wait for
each other before continuing.
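That said, a strict reading of the memory model may count concurrent
non-atomic stores as a race even when they store the same value, and the
single-work-item variant Mark suggests costs nothing, so it would
sidestep the question entirely (OpenCL C sketch):

```c
// Initialize the local sum from a single work-item, then synchronize,
// so no two work-items ever store to *sum_wg concurrently.
if (lidx == 0 && lidy == 0)
    *sum_wg = 0;
barrier(CLK_LOCAL_MEM_FENCE);
```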

> > +    barrier(CLK_LOCAL_MEM_FENCE);
> > +
> > +    // update workgroup sum
> > +    atomic_add(sum_wg, (uint)(signal * REFERENCE_WHITE));
> I think the numbers you're adding together here sum to at most something like 16 * 16 * 100 * 1023?  Can you make sure this can't overflow and add a comment on that.

It's not * 1023: the highest possible peak in practice is 100.0 (PQ's
peak brightness in signal units). So the limit per workgroup is
16 * 16 * 10000 = 2,560,000, requiring 22 bits to not overflow on a
pathological input.

> > +    barrier(CLK_LOCAL_MEM_FENCE);
> > +
> > +    // update frame peak/avg using work-group-average.
> > +    if (lidx == 0 && lidy == 0) {
> > +        uint avg_wg = *sum_wg / (lsizex * lsizey);
> > +        atomic_max(&peak_buf[frame_idx], avg_wg);
> > +        atomic_add(&avg_buf[frame_idx], avg_wg);
> Similarly this one?  (width/16 * height/16 * 100 * 1023, I think, which might overflow for 8K?)

For 8K it's 8192/16 * 4320/16 * 10000, requiring 31 bits to store
without theoretical risk of overflow.

And actually, there is a third potential source of overflow worth
investigating, namely the *avg_total_p variable, since it accumulates
across frames: in the worst case it stores 10000 * (DETECTION_FRAMES+1).
In practice, however, this shouldn't cause any issues for typical buffer
sizes (needing only 20 bits for a buffer size of 100).

Note: In practice, none of these considerations are worth worrying
about too much, since the average illumination of a scene is generally
at most around 50, so a typical scene needs more like 23 bits than the
31-bit worst case calculated above. The only scenario in
which I could imagine a worst case like that occurring in normal content
is if some mastering engineer mistakenly implements a "fade to white" by
fading to the highest possible HDR peak, and this were to somehow
survive being reviewed by other humans who presumably have functioning
retinas that would be screaming in pain as their displays blasted 10000
cd/m² during the fade.

> > +    // de-saturate
> > +    if (desat_param > 0.0f) {
> > +        float luma = get_luma_dst(rgb);
> > +        float base = 0.18f * dst_peak;
> Magic number might want some explaination.

It is derived from experimentation and visual comparisons with e.g. the
ACES algorithm. There is no theoretical basis for it.

> +float3 ootf_hlg(float3 c) {
> +    float luma = get_luma_src(c);
> +    // assume a reference display with 1000 nits peak
> +    float factor = 1000.0f / REFERENCE_WHITE * pow(luma, 0.2f) / pow(12.0f, 1.2f);
> +    return c * factor;
> +}
> +
> +float3 inverse_ootf_hlg(float3 c) {
> +    // assume a reference display with 1000 nits peak
> +    c *=  pow(12.0f, 1.2f) / (1000.0f / REFERENCE_WHITE);
> +    c /= pow(get_luma_dst(c), 0.2f / 1.2f);
> +    return c;
> +}

I would recommend parametrizing these by the peak variable. When you
tone map from HLG to HLG at a lower peak, the inverse OOTF call needs to
use the new peak. (You should also update the peak tag in the frame's
side-channel data; I'm not sure whether you do.)

Ditto, for the forward OOTF, the `peak` needs to match the value you
assume for the source signal peak further down. You have it hard-coded
as 12.0 for HLG, which is the correct peak in scene-referred space, but
that doesn't necessarily match the display-referred case, which is
what's relevant for tone mapping. If you tune the OOTF for a 1000-nit
peak display, the source peak after applying the OOTF would be 10.0, not
12.0. Alternatively, you could tune the OOTF for 1200 nits instead.
(This is what libplacebo does, although I think not intentionally; I'll
change it to use 1000 nits as well.)
