How to Upload HDR Video to YouTube (with a LUT)

Today YouTube announced official HDR streaming support on their blog. I alluded to this in my article about grading in HDR, because we've been working with them for the past month to get our latest HDR video onto the platform. It's officially live now, so we can go into detail.


How to Upload HDR Video to YouTube

Similar to VR support, there are no controls on the platform itself that let you manually flag a video as HDR after it's been uploaded, so the uploaded file must include the proper HDR metadata. But YouTube doesn't support uploading in HEVC, so there are two possible pathways to getting the right metadata into your file: DaVinci Resolve Studio 12.5.2 or higher, or the YouTube HDR Metadata Tool. Both are outlined on the YouTube support page, but not very clearly, so I think more detail is useful.

I did include a lengthy description on how to manage HDR metadata in DaVinci Resolve Studio 12.5.2+, with a lot more detail than they include on their support page, so if you want to use the Resolve method, head over there and check that out.  I've covered it once, so I don't see the need to cover the how-to's again.

I should note that Resolve doesn't include the necessary metadata for full HDR10 compatibility, lacking fields for MaxFALL, MaxCLL, and the Mastering Display values of SMPTE ST.2086.  It does mark the BT.2020 primaries and the transfer characteristics as either ST.2084 (PQ) or ARIB STD-B67 (HLG), which will let YouTube recognize the file as HDR Video.  YouTube will then fill in the missing metadata for you when it prepares the streaming version for HDR televisions, by assuming you're using the Sony BVM-X300.  So this works, and is relatively easy.  BUT, you don't get to include your own SDR cross conversion LUT; for that you'll need to use YouTube's HDR Metadata Tool.

 

***UPDATE: April 20, 2017*** We've discovered in our testing that if you pass uncompressed 24 bit audio in a QuickTime container from some versions of Adobe Media Encoder / Adobe Premiere into the mkvmerge tool described below, the audio will be distorted. We recommend using 16 bit uncompressed audio or AAC instead until a solution is found.

 

YouTube's HDR Metadata Tool

Okay, let's talk about option two: YouTube's HDR Metadata Tool.  

Alright, not to criticize or anything here, but while the VR metadata tool comes in a nice GUI, the link to the HDR tool sends you straight to GitHub. Awesome. Don't panic, just follow the link, download the whole package, and unzip the file.

So the bad news: whether you're working on Windows or on a Mac, you're going to need to use the command line to run the utility.  Fire up Command Prompt (Windows) or Terminal (MacOS) to get yourself a shell.

So the really bad news: if you're using a Mac, the binary you need to run is actually inside the app package mkvmerge.app. If you're on Windows, drag the 32 or 64 bit version of mkvmerge.exe into Command Prompt to get things started; if you're on MacOS, right click on mkvmerge.app, select "Show Package Contents", and drag the binary file ./Contents/MacOS/mkvmerge into Terminal to get started:

Right click on mkvmerge.app and select "Show Package Contents"

Drag the mkvmerge binary into Terminal

The README.md file includes some important instructions and the default syntax to run the tool, with the assumption that you're using the Sony BVM-X300 and mastering in SMPTE ST.2084. I've copied the relevant syntax here (I'm using a Mac; delete anything in bold before copying the command over, and replace the file paths between the asterisks with your own content):

./hdr_metadata-master/macos/mkvmerge.app/Contents/MacOS/mkvmerge \
-o *yourfilename.mkv* \
--colour-matrix 0:9 \
--colour-range 0:1 \
--colour-transfer-characteristics 0:16 \
--colour-primaries 0:9 \
--max-content-light 0:1000 \
--max-frame-light 0:300 \
--max-luminance 0:1000 \
--min-luminance 0:0.01 \
--chromaticity-coordinates 0:0.68,0.32,0.265,0.690,0.15,0.06 \
--white-colour-coordinates 0:0.3127,0.3290 \

If using a LUT, add the lines
--attachment-mime-type application/x-cube \
--attach-file *file-path-to-your-cube-LUT* \

In all cases end with
*yourfilename.mov*

Beyond the initial call to the binary or executable, the syntax is identical on MacOS and Windows.

The program's full syntax can be found here, but it's a little overwhelming. If you want to look it up, just focus on section 2.8, which includes the arguments we're using here. The first four arguments set the color matrix (BT.2020 non-constant), color range (Broadcast), transfer function (ST.2084), and color space (BT.2020) by referencing specific index values, which you can find on the linked page. If you want to use HLG instead of PQ, switch the value of --colour-transfer-characteristics to 0:18, which will flag for ARIB STD-B67.

(Note to the less code savvy: the backslashes at the end of each line allow you to break the syntax across multiple lines in the command prompt or terminal window.  You'll need them at the end of every line you copy and paste in, except for the last one)

The rest of the list of video properties should be fairly self explanatory, and match the metadata required by HDR10, which I go over in more detail here.

Now, if you want to include your own SDR cross conversion LUT, you'll need to include the arguments --attachment-mime-type application/x-cube, which tells the program you want to attach a file that's not processed (specifically, a cube LUT), and --attach-file filepath, which is the actual file you're attaching.
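If you'd rather script this than paste it by hand, here's a minimal sketch in Python that assembles the same argument list, with the LUT attachment as an optional extra. The function name, its parameters, and the file paths are my own placeholders and not part of YouTube's tool; the flags and values are the ones from the command above. It only builds and prints the command, so you would still hand the resulting list to something like subprocess.run yourself.

```python
import shlex

# Index values for --colour-transfer-characteristics, as used in this article.
TRANSFER_INDEX = {"pq": 16, "hlg": 18}  # ST.2084 (PQ) or ARIB STD-B67 (HLG)


def build_mkvmerge_args(binary, source, output, transfer="pq", lut_path=None):
    """Return the mkvmerge argument list that tags track 0 of `source` as HDR.

    All names here are illustrative; the light-level and chromaticity values
    are the README defaults, which assume a Sony BVM-X300.
    """
    args = [
        binary,
        "-o", output,
        "--colour-matrix", "0:9",      # BT.2020 non-constant
        "--colour-range", "0:1",       # broadcast range
        "--colour-transfer-characteristics", f"0:{TRANSFER_INDEX[transfer]}",
        "--colour-primaries", "0:9",   # BT.2020
        "--max-content-light", "0:1000",
        "--max-frame-light", "0:300",
        "--max-luminance", "0:1000",
        "--min-luminance", "0:0.01",
        "--chromaticity-coordinates", "0:0.68,0.32,0.265,0.690,0.15,0.06",
        "--white-colour-coordinates", "0:0.3127,0.3290",
    ]
    if lut_path is not None:
        # Attach the SDR cross conversion LUT as an unprocessed cube file.
        args += ["--attachment-mime-type", "application/x-cube",
                 "--attach-file", lut_path]
    args.append(source)  # the input file always comes last
    return args


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    command = build_mkvmerge_args("./mkvmerge", "yourfilename.mov",
                                  "yourfilename.mkv", lut_path="your_lut.cube")
    print(shlex.join(command))
```

Keeping the command as a list rather than one long string sidesteps the backslash and quoting issues, which helps if your file paths contain spaces.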

If you don't attach your own LUT, YouTube will handle the SDR cross conversion with their own internal LUT. It's not bad, but I don't like the hard clipping above 300 nits or the loss of detail in the reds, though that's largely a personal preference. See the comparison screenshots below to see how theirs works.

Once you've pasted in all of the arguments and set your input file path, hit enter to let it run and it'll make a new MKV.  It doesn't recompress any video data, just copies it over, so if you gave it ProRes, it'll still be the same ProRes stream but with the included HDR metadata and LUT that YouTube needs to recognize the file.

Overall, it's a pretty fast tool, and extremely useful beyond just YouTube applications. You can see what it's done in the set of screenshots below. The first is the source ProRes clip, the second is the same clip after passing it through mkvmerge to add the metadata only, and the third went through mkvmerge to get the metadata and my own LUT:

ProRes 422 UHD Upload Without Metadata Injection

ProRes 422 UHD Upload in MKV File. Derived from the ProRes File above and passed through the mkvmerge tool to add HDR Metadata, but no LUT.

ProRes 422 UHD Upload in MKV file. Derived from the ProRes file above and passed through the mkvmerge tool to add HDR Metadata and including our SDR cross conversion LUT. Notice the increased detail in the brights of the snake skin, and the regained detail in the red flower.


All of us at Mystery Box are extremely excited to see HDR support finally widely available on YouTube.  We've been working in the medium for over a year, and haven't been able to distribute any of our HDR content in a way that consumers would actually be able to use.  But now, there's a general content distribution platform available with full HDR support, and we're excited to see what all creators can do with these new tools!

Written by Samuel Bilodeau, Head of Technology and Post Production

HDR Video Part 3: HDR Video Terms Explained

To kick off our new weekly blog here on mysterybox.us, we’ve decided to publish five posts back-to-back on the subject of HDR video.  This is Part 3: HDR Video Terms Explained.

In HDR Video Part 1 we explored what HDR video is, and what makes it different from traditional video.  In Part 2, we looked at the hardware you need to view HDR video in a professional environment.  Since every new technology comes with a new set of vocabulary, here in Part 3, we’re going to look at all of the new terms that you’ll need to know when working with HDR video.  These fall into three main categories: key terms, standards, and metadata.


Key Terms

HDR / HDR Video - High Dynamic Range Video - Any video signal or recording using one of the new transfer functions (PQ or HLG) to capture, transmit, or display a dynamic range greater than the traditional CRT gamma or BT.1886 Gamma 2.4 transfer functions at 100-120 nits reference.

The term can also be used as a compatibility indicator, to describe any camera capable of capturing and recording a signal this way, or a display that either exhibits the extended dynamic range natively or is capable of automatically detecting an HDR video signal and renormalizing the footage for its more limited or traditional range.


SDR / SDR Video - Standard Dynamic Range Video - Any video signal or recording using the traditional transfer functions to capture, transmit, or display a dynamic range limited to the traditional CRT gamma or BT.1886 Gamma 2.4 transfer functions at 100-120 nits reference. SDR video is fully compatible with all pre-existing video technologies.


nit - A unit of brightness density, or luminance. It’s the colloquial term for the SI unit of candelas per square meter (1 nit = 1 cd/m²). It converts directly to the United States customary unit of foot-lamberts (1 fl = 1/π cd/ft²), with 1 fl = 3.426 nits = 3.426 cd/m².

Note that the peak nits / foot-lamberts value of a projector is often lower than that of a display, even in HDR video: because a projected image covers more area and is viewed in a darker environment than consumers’ homes, the same psychological and physiological responses occur at lower light levels.

For instance, a typical digital cinema screen will have a maximum brightness of 14 fl or 48 cd/m², vs. the display average of 80-120 nits for reference displays and 300 nits for LCDs and plasmas in the home. Actual light output ranges for HDR cinema are adjusted accordingly, since 1000 cd/m² on a theater’s 30 foot screen is perceived to be far brighter than on a 65” flat screen.
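As a quick sanity check on those numbers, here's a small Python sketch of the conversion. The function names are my own; the only fact it relies on is the 3.426 factor quoted above.

```python
NITS_PER_FOOT_LAMBERT = 3.426  # 1 fl = 3.426 cd/m^2, as quoted above


def foot_lamberts_to_nits(foot_lamberts):
    """Convert a luminance in foot-lamberts to nits (cd/m^2)."""
    return foot_lamberts * NITS_PER_FOOT_LAMBERT


def nits_to_foot_lamberts(nits):
    """Convert a luminance in nits (cd/m^2) to foot-lamberts."""
    return nits / NITS_PER_FOOT_LAMBERT


if __name__ == "__main__":
    # The 14 fl digital cinema figure from the text.
    print(round(foot_lamberts_to_nits(14)))  # prints 48
```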


EOTF - Electro-Optical Transfer Function - A mathematical equation or set of instructions that translates voltages or digital values into brightness values. It is the counterpart of the Opto-Electronic Transfer Function, or OETF, which defines how to translate brightness levels into voltages or digital values.

Traditionally, the OETF and EOTF were incidental to the behavior of the cathode ray tube, which could be approximated by a 0-1 exponential curve with a power value (gamma) of 2.4. Now they are defined values like “Linear”, “Gamma 2.4” or any of the various LOG formats. OETFs are used at the acquisition end of the video pipeline (by the camera) to convert brightness values into voltages/digital values, and EOTFs are used by displays to translate voltages/digital values into brightness values for each pixel.


PQ - Perceptual Quantization - Name of the EOTF curve developed by Dolby and standardized in SMPTE ST.2084, designed to allocate bits as efficiently as possible with respect to how the human vision perceives changes in light levels.

Perceptual Quantization (PQ) Electro-Optical Transfer Function (EOTF) with Gamma 2.4 Reference

Dolby’s design is built around the Barten Threshold (also called the Barten Limit or the Barten Ramp, after Peter Barten’s model of contrast sensitivity): the point at which the difference in light levels between two values becomes visible.

PQ is designed so that, when operating at 12 bits per channel, the stepping between single digital values is always below the Barten threshold for the whole range from 0.0001 to 10,000 nits, without being so far below that threshold that the resolution between bits is wasted. At 10 bits per channel, the PQ function is just slightly above the Barten threshold, where in some (idealized) circumstances stepping may be visible, but in most cases should be unnoticeable.
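To make the curve a little more concrete, here's a minimal Python sketch of the PQ EOTF and its inverse, using the constants as I understand them from the published ST.2084 definition. The function names and the step-size helper are my own, and the helper assumes full range code values for simplicity; treat this as an illustration rather than a reference implementation.

```python
# PQ constants as defined in SMPTE ST.2084.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32
PEAK_NITS = 10000.0


def pq_eotf(signal):
    """Translate a normalized PQ signal (0.0 to 1.0) into nits."""
    p = signal ** (1 / M2)
    return PEAK_NITS * (max(p - C1, 0.0) / (C2 - C3 * p)) ** (1 / M1)


def pq_inverse_eotf(nits):
    """Translate a luminance in nits into a normalized PQ signal."""
    y = (nits / PEAK_NITS) ** M1
    return ((C1 + C2 * y) / (1 + C3 * y)) ** M2


def step_in_nits(code, bits):
    """Luminance jump between two adjacent full range code values."""
    top = 2 ** bits - 1
    return pq_eotf((code + 1) / top) - pq_eotf(code / top)


if __name__ == "__main__":
    print(pq_eotf(1.0))                    # top of the PQ range, in nits
    print(pq_inverse_eotf(100))            # signal level for 100 nits
    print(step_in_nits(512, 10), step_in_nits(2048, 12))
```

Comparing the two step sizes on the last line, taken near the middle of the signal range, shows why 12 bits gives more headroom against visible stepping than 10 bits: the jump between adjacent code values is roughly four times smaller.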

Barten Thresholds for 10 bit and 12 bit Rec. 1886 and PQ curves. Source

For comparison, current log formats waste bits on the low end (making them suitable for acquisition to preserve details in the darks, but not transmission and exhibition), while the current standard gamma functions waste bits on the high end and create stepping in the darks.

HDR systems using PQ curves are not directly backwards compatible with standard dynamic range video.


HLG - Hybrid Log Gamma - A competing EOTF curve to PQ / SMPTE ST.2084 designed by the BBC and NHK to preserve a small amount of backwards compatibility.

Hybrid Log Gamma (HLG) Electro-Optical Transfer Function (EOTF) with Gamma 2.4 Reference

HLG vs. SDR gamma curve with and without knees.  Source


The first 50% of this curve follows the output light levels of standard Gamma 2.4, while the top 50% steeply diverges along a log curve, covering the brightness range from about 100 to 5000 nits. As with PQ, 10 bits per channel is the minimum permitted.
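Here's a rough Python sketch of the two-part shape described above. Note that it shows the camera-side (OETF) form of HLG, using the constants as I understand them from the published BT.2100 / ARIB STD-B67 definitions, where the lower segment is a square root (gamma-like) curve and the upper segment is logarithmic. The function name is my own, and scene light is normalized to 0.0 to 1.0.

```python
import math

# HLG constants as published in BT.2100 / ARIB STD-B67 (normalized form).
A = 0.17883277
B = 0.28466892  # equals 1 - 4 * A
C = 0.55991073  # equals 0.5 - A * ln(4 * A)


def hlg_oetf(scene_light):
    """Map normalized scene light (0.0 to 1.0) to an HLG signal (0.0 to 1.0)."""
    if scene_light <= 1 / 12:
        # Lower half of the signal range: square root (gamma-like) segment.
        return math.sqrt(3 * scene_light)
    # Upper half of the signal range: logarithmic segment.
    return A * math.log(12 * scene_light - B) + C


if __name__ == "__main__":
    for light in (0.0, 1 / 12, 0.5, 1.0):
        print(light, hlg_oetf(light))
```

The crossover sits at a signal value of 0.5, which is where the "first 50% / top 50%" split comes from.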

HLG does not expand the range of the darks like the PQ curve does. As an unfortunate side effect of the backwards compatibility, coupled with the MaxFALL limits imposed by the technology of HDR displays, whites can appear grey when viewed in standard gamma 2.4, especially when compared to footage natively graded in gamma 2.4.


Standards

SMPTE ST.2084 - The first official standardization of an HDR video transfer function by a standards body, and at the moment (October 2016) the most widely implemented. SMPTE ST.2084 officially defines the PQ EOTF curve for translating a set of 10 or 12 bit per channel digital values into a brightness range of 0.0001 to 10,000 nits. SMPTE ST.2084 provides the basis for the HDR 10 Media Profile and Dolby Vision implementation standards.

This is the transfer function to select in HEVC encoding to signal a PQ HDR curve.


ARIB STD-B67 - Standardized implementation of Hybrid Log Gamma by the Association of Radio Industries and Businesses. Defines the use of the HLG curve, with 10 or 12 bits per channel color and the same color primaries as BT.2020 color space.

This is the transfer function to select in HEVC encoding to signal an HLG HDR curve.


ITU-R BT.2100 - ITU-R Recommendation BT.2100 - ITU-R’s standardization of HDR for television broadcast. Ratified in 2016, this document is the HDR equivalent of ITU-R Recommendation BT.2020 (Rec.2020 / BT.2020). When compared with BT.2020, BT.2100 includes the FHD (1920x1080) frame size in addition to the UHD and FUHD, and defines two acceptable transfer functions (PQ and HLG) for HDR broadcast, instead of the single transfer function (BT.1886 equivalent) found in BT.2020.

BT.2100 uses the same color primaries and the same RGB to YCbCr signal format transform as BT.2020, and includes similar permissions of 10 or 12 bits per channel as BT.2020, although BT.2100 also permits full range code values in 10 or 12 bits where BT.2020 is limited to the traditional legal (narrow) range.
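To illustrate the difference between the two ranges, here's a small Python sketch that quantizes a normalized luma or RGB signal (0.0 to 1.0) into code values both ways. The scaling follows the narrow and full range quantization as I understand it from the ITU recommendations; the function names are my own, and chroma channels (which scale differently) are left out.

```python
import math


def narrow_range_code(signal, bits=10):
    """Legal (narrow) range: 64 to 940 at 10 bits, 256 to 3760 at 12 bits."""
    return math.floor((219 * signal + 16) * 2 ** (bits - 8) + 0.5)


def full_range_code(signal, bits=10):
    """Full range: 0 to 1023 at 10 bits, 0 to 4095 at 12 bits."""
    return math.floor((2 ** bits - 1) * signal + 0.5)


if __name__ == "__main__":
    for bits in (10, 12):
        print(bits,
              narrow_range_code(0.0, bits), narrow_range_code(1.0, bits),
              full_range_code(0.0, bits), full_range_code(1.0, bits))
```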

BT.2100 also includes considerations for a chroma subsampling methodology based on the LMS color space (human visual system tristimulus values), called ICtCp, and a transform for ‘gamma weighting’ (in the sense of the PQ and HLG equivalent of gamma weighting) the LMS response as L’M’S’.


HDR 10 Media Profile - The Consumer Technology Association (CTA)’s official HDR video standard for use in HDR televisions. HDR 10 requires the use of the SMPTE ST.2084 EOTF, BT.2020 color space, 10 bits per channel, 4:2:0 chroma subsampling, and the inclusion of SMPTE ST.2086 and associated MaxCLL and MaxFALL metadata values.

HDR 10 Media Profile defines the signal a television must be able to decode in order to use the term “HDR compatibility” in its marketing.

Note that “HDR compatibility” does not necessarily mean the ability to display the higher dynamic range, simply the ability to decode and renormalize footage in the HDR 10 specification for whatever the dynamic range and color space of the display happen to be.


Dolby Vision - Dolby’s proprietary implementation of the PQ curve, for theatrical setups and home devices. Dolby Vision supports both the BT.2020 and the DCI-P3 color space, at 10 and 12 bits per channel, for home and theater, respectively.

The distinguishing feature of Dolby Vision is the inclusion of shot-by-shot transform metadata that adapts the PQ graded footage into a limited range gamma 2.4 or gamma 2.6 output for SDR displays and projectors. The colorist grades the film in the target HDR space, and then runs a second adaptation pass to adapt the HDR grade into SDR, and the transform is saved into the rendered HDR output files as metadata. This allows for a level of backwards compatibility with HDR transmitted footage, while still being able to make the most of the SDR and the HDR ranges.

Because Dolby Vision is a proprietary format, it requires a license issued by Dolby and the use of qualified hardware, which at the moment (October 2016) means only the Dolby PRM-4220, the Sony BVM-X300, or the Canon DP-V2420 displays.


Metadata

MaxCLL Metadata - Maximum Content Light Level - An integer metadata value defining the maximum light level, in nits, of any single pixel within an encoded HDR video stream or file. MaxCLL should be measured during or after mastering. However, if you keep your color grade within the MaxCLL of your display’s HDR range, and add a hard clip for the light levels beyond your display’s maximum value, you can use your display’s maximum CLL as your metadata MaxCLL value.


MaxFALL Metadata - Maximum Frame Average Light Level - An integer metadata value defining the maximum average light level, in nits, for any single frame within an encoded HDR video stream or file. MaxFALL is calculated by averaging the decoded brightness values of all pixels within each frame (that is, converting the digital value of each frame into its corresponding nits value, and averaging all of the nits values within each frame).

MaxFALL is an important value to consider in mastering and color grading, and is usually lower than the MaxCLL value. The two values combined define how bright any individual pixel within a frame can be, and how bright the frame as a whole can be.
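The two definitions above can be sketched in a few lines of Python. This assumes each frame has already been decoded into a flat list of per-pixel light levels in nits, which is a simplification of how a real measurement tool works; the function names and the sample frames are purely illustrative.

```python
def frame_average_light_level(frame_nits):
    """Average the light level, in nits, of every pixel in one frame."""
    return sum(frame_nits) / len(frame_nits)


def max_cll_and_max_fall(frames):
    """Return (MaxCLL, MaxFALL) as integers for a sequence of frames.

    Each frame is a flat list of per-pixel light levels in nits.
    """
    max_cll = max(max(frame) for frame in frames)
    max_fall = max(frame_average_light_level(frame) for frame in frames)
    return round(max_cll), round(max_fall)


if __name__ == "__main__":
    # Two tiny made-up "frames" of four pixels each.
    frames = [
        [100, 200, 1000, 100],  # one bright highlight, average 350 nits
        [300, 300, 300, 300],   # evenly lit frame, average 300 nits
    ]
    print(max_cll_and_max_fall(frames))  # prints (1000, 350)
```

Notice that the frame with the single brightest pixel sets MaxCLL, while MaxFALL depends only on the brightest frame average, so the two values can come from entirely different frames.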

Displays are limited differently on both of those values, though typically only the peak (single pixel) brightness of a display is reported. As pixels get brighter and approach their peak output, they draw more power and heat up. With current technology levels, no display can push all of its pixels into the maximum HDR brightness level at the same time - the power draw would be extremely high, and the heat generated would severely damage the display.

As a result, displays will abruptly notch down the overall image brightness when the frame average brightness exceeds the rated MaxFALL, to keep the image under the safe average brightness level, regardless of what the peak brightness of the display or encoded image stream may be.

For example, while the BVM-X300 has a peak value of 1000 nits for any given pixel (MaxCLL = 1000), on average, the frame brightness cannot exceed about 180 nits (MaxFALL = 180). The MaxCLL and MaxFALL metadata included in the HDR 10 media profile allows consumer displays to adjust the entire stream’s brightness to match their own display limits.


SMPTE ST.2086 Metadata - Metadata information about the display used to grade the HDR content. SMPTE ST.2086 includes information on six values: the three RGB primaries used, the white point used, and the display maximum and minimum light levels.

The RGB primaries and the white point values are recorded as ½ of their (X,Y) values from the CIE XYZ 1931 chromaticity standard, and expressed as the integer portion of the first five significant digits, without a decimal place. Or, in other words:

f(XPrimary) = 100,000 × XPrimary ÷ 2

f(YPrimary) = 100,000 × YPrimary ÷ 2.

For example, the (X,Y) value of DCI-P3’s ‘red’ primary is (0.68, 0.32) in CIE XYZ; in SMPTE ST.2086 terms it’s recorded as

R(34000,16000)

because

for R(0.68,0.32):

f(XR) = 100,000 × 0.68 ÷ 2 = 34,000

f(YR) = 100,000 × 0.32 ÷ 2 = 16,000

Maximum and minimum luminance values are recorded as nits × 10,000, so that they too end up as positive integers. For instance, a display like the Sony BVM-X300 with a range from 0.0001 to 1000 nits would record its luminance as

L(10000000,1)

The full ST.2086 Metadata is ordered Green, Blue, Red, White Point, Luminance with the values as

G(XG,YG)B(XB,YB)R(XR,YR)WP(XWP,YWP)L(max,min)

all strung together, without spaces. For instance, the ST.2086 metadata for a DCI-P3 display with a maximum luminance of 1000 nits, a minimum of 0.0001 nits, and white point D65 would be:

G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)

while a display like the Sony BVM-X300, using BT.2020 primaries, with a white point of D65 and the same max and min brightness would be:

G(8500,39850)B(6550,2300)R(35400,14600)WP(15635,16450)L(10000000,1)
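If you'd rather not do that arithmetic by hand, here's a short Python sketch that builds the string from the chromaticity coordinates and luminance range. The function name and argument order are my own; the scaling (coordinates × 100,000 ÷ 2, luminance × 10,000) and the G, B, R, WP, L ordering are the ones described above. I round to the nearest integer rather than truncating, to avoid floating point values like 13249.999 landing one unit low.

```python
def st2086_string(green, blue, red, white_point, max_nits, min_nits):
    """Build an ST.2086 metadata string from (x, y) pairs and a nits range."""

    def scaled(point):
        x, y = point
        # 100,000 x value / 2 is the same as multiplying by 50,000.
        return f"({round(x * 50000)},{round(y * 50000)})"

    luminance = f"L({round(max_nits * 10000)},{round(min_nits * 10000)})"
    return (f"G{scaled(green)}B{scaled(blue)}R{scaled(red)}"
            f"WP{scaled(white_point)}{luminance}")


if __name__ == "__main__":
    d65 = (0.3127, 0.3290)
    # DCI-P3 primaries, 0.0001 to 1000 nits.
    print(st2086_string((0.265, 0.690), (0.150, 0.060), (0.680, 0.320),
                        d65, 1000, 0.0001))
    # BT.2020 primaries, same white point and luminance range.
    print(st2086_string((0.170, 0.797), (0.131, 0.046), (0.708, 0.292),
                        d65, 1000, 0.0001))
```

The two printed lines should match the DCI-P3 and BT.2020 strings given above.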

In an ideal situation, it would be best to use a colorimeter and measure the display’s native R-G-B and white point values; in practice, however, the RGB and white point values of the standard that the mastering display conforms to are sufficient for communicating information about the master to the end unit display.


That should be a good overview of the new terms that HDR video has (so far) introduced into the extended video technologies vocabulary, and a good starting point for diving deeper into learning about and using HDR video on your own, at the professional level.

In Part 4 of our series we’re going to take the theory of HDR video and start talking about the practice, and look specifically at how to shoot with HDR in mind.

Written by Samuel Bilodeau, Head of Technology and Post Production