Changes to fMP4 outputs in Encoder version 2.153.0
Overview
This article describes the changes to fMP4 outputs starting from version 2.153.0 of the Bitmovin Encoder. Starting with this version, fMP4 outputs with codecs H.264, AAC, HE-AAC and HE-AACv2 will use an overhauled implementation that aims to improve stability and correctness of ISO-BMFF files. AV1 has already been using the overhauled fMP4 implementation.
In this article, we show the differences in MP4 outputs explicitly by comparing excerpts of mp4dump output from fMP4 encodings up to version 2.152.0 with encodings starting from 2.153.0 that use the same configuration.
Finally, we'll list the devices/platforms used for testing playback.
Changes to ISO-BMFF boxes
ftyp boxes
Up to version 2.152.0, the ftyp box for an H.264 initialization segment looks like
[ftyp] size=8+20
major_brand = mp42
minor_version = 1
compatible_brand = isom
compatible_brand = mp42
compatible_brand = mp41
With version 2.153.0, the ftyp box of an H.264 initialization segment looks like
[ftyp] size=8+32
major_brand = mp41
minor_version = 0
compatible_brand = iso8
compatible_brand = isom
compatible_brand = mp41
compatible_brand = dash
compatible_brand = avc1
compatible_brand = cmfc
Audio initialization segments should look the same, except for the avc1 compatible brand, which is not present.
This change should not have any practical side effects for demuxers. We don't expect any issues and have not found any.
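To make the reported box sizes easier to check, here is a minimal sketch that serializes and re-parses an ftyp box with the brands listed above. The helper names `build_ftyp` and `parse_ftyp` are hypothetical and only illustrate the box layout (8-byte header, major brand, minor version, then 4-byte compatible brands). They are not part of the encoder.

```python
import struct

def build_ftyp(major_brand, minor_version, compatible_brands):
    """Serialize an ftyp box: size + type header, major brand, minor version, brands."""
    payload = major_brand.encode("ascii") + struct.pack(">I", minor_version)
    payload += b"".join(brand.encode("ascii") for brand in compatible_brands)
    return struct.pack(">I", 8 + len(payload)) + b"ftyp" + payload

def parse_ftyp(data):
    """Parse the bytes of a single ftyp box into a dictionary."""
    size, box_type = struct.unpack(">I4s", data[:8])
    if box_type != b"ftyp":
        raise ValueError("not an ftyp box")
    return {
        "size": size,
        "major_brand": data[8:12].decode("ascii"),
        "minor_version": struct.unpack(">I", data[12:16])[0],
        "compatible_brands": [
            data[i:i + 4].decode("ascii") for i in range(16, size, 4)
        ],
    }

# Brands as listed for an H.264 initialization segment from version 2.153.0.
video_brands = ["iso8", "isom", "mp41", "dash", "avc1", "cmfc"]
video_ftyp = parse_ftyp(build_ftyp("mp41", 0, video_brands))

# Audio initialization segments carry the same brands without avc1.
audio_brands = [brand for brand in video_brands if brand != "avc1"]
audio_ftyp = parse_ftyp(build_ftyp("mp41", 0, audio_brands))

print(video_ftyp["size"])  # 40, shown by mp4dump as size=8+32
print(audio_ftyp["size"])  # 36
```

The same parser can be pointed at the first bytes of a real initialization segment to confirm which brands a given encoding carries.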
Timescale in mvhd and tkhd boxes
Up to version 2.152.0, the timescale in the Movie Header (mvhd) box had a value of 1000, and the timescale in the Track Header (tkhd) box depended on:
- frame rate for video
- sampling rate for audio.
From version 2.153.0, the mvhd timescale is the same as the tkhd timescale.
Video timescale
The video timescale is the video frame rate, rounded to the nearest integer, multiplied by 1000. Some examples:
- 24 FPS: Timescale 24000
- 23.976 FPS: Timescale 24000
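The rule above can be sketched as a small function. The name `video_timescale` is hypothetical, and rounding halves upward is an assumption, since the article only says "rounded to the nearest integer". The 29.97 and 59.94 values follow from the stated rule and are not taken from encoder output.

```python
import math
from fractions import Fraction

def video_timescale(fps):
    """Frame rate rounded to the nearest integer (halves up, an assumption), times 1000."""
    return math.floor(fps + 0.5) * 1000

for fps in (24, 23.976, 29.97, 59.94):
    print(fps, video_timescale(fps))

# At 24000/1001 FPS (commonly written 23.976), one frame lasts a whole number of ticks.
ntsc_film = Fraction(24000, 1001)
ticks_per_frame = Fraction(video_timescale(float(ntsc_film))) / ntsc_film
print(ticks_per_frame)  # 1001
```

A timescale of 24000 lets both 24 FPS (1000 ticks per frame) and 24000/1001 FPS (1001 ticks per frame) be expressed with integer sample durations.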
Audio timescale
The audio timescale equals the sampling rate. This applies to all audio codecs, including HE-AAC and HE-AACv2 codecs, which up to version 2.152.0 had a timescale of half the sampling rate.
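For tooling that compares timestamps across the two versions, the change for HE-AAC and HE-AACv2 amounts to a rescaling of tick values. The following is a minimal sketch with a hypothetical helper `rescale_ticks` and an assumed sampling rate of 48 kHz.

```python
from fractions import Fraction

def rescale_ticks(ticks, old_timescale, new_timescale):
    """Convert a timestamp in ticks from one timescale to another without rounding."""
    return Fraction(ticks, old_timescale) * new_timescale

sampling_rate = 48000                      # assumed example value
old_he_aac_timescale = sampling_rate // 2  # up to version 2.152.0
new_he_aac_timescale = sampling_rate       # from version 2.153.0

# A timestamp two seconds into the stream, expressed in the old timescale.
two_seconds_old = 2 * old_he_aac_timescale
print(rescale_ticks(two_seconds_old, old_he_aac_timescale, new_he_aac_timescale))  # 96000
```

The position in seconds is unchanged. Only the number of ticks used to express it doubles.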
max_bitrate and avg_bitrate in DecoderConfig for audio
We noticed that our audio initialization segments, up to version 2.152.0, always reported 96 kbps as max_bitrate and avg_bitrate under the DecoderConfig box for audio, regardless of the configured bitrate. This was only a muxing problem: the audio itself was encoded at the correct bitrate.
[esds] size=12+27
[ESDescriptor] size=2+25
es_id = 0
stream_priority = 0
[DecoderConfig] size=2+17
stream_type = 5
object_type = 64
up_stream = 0
buffer_size = 6144
max_bitrate = 96000
avg_bitrate = 96000
DecoderSpecificInfo = 12 10
[Descriptor:06] size=2+1
Starting from version 2.153.0, this issue is fixed and the values are correctly signaled depending on the specified bitrate from the codec configuration.
Sample flags in trun entries
In video segments up to version 2.152.0, the sample flags were always optimized using the default sample flags from the tfhd box and the first sample flags from the trun box, as shown below:
[traf] size=8+832
[tfhd] size=12+8, flags=20020
track ID = 1
default sample flags = 0x1010000
[tfdt] size=12+8, version=1
base media decode time = 0
[trun] size=12+780, flags=a05
sample count = 96
data offset = 872
first sample flags = 0x2000000
entries:
( 0) sample_size = 429, sample_composition_time_offset = 2002
( 1) sample_size = 72, sample_composition_time_offset = 5005
( 2) sample_size = 70, sample_composition_time_offset = 2002
While this optimization is good for reducing muxing overhead, our muxer applied it even in situations where it shouldn't, e.g. when a segment contains key frames other than the first frame. This resulted in warnings in the Media inspector tab in Chrome:
ISO-BMFF container metadata for video frame indicates that the frame is not a keyframe, but the video frame contents indicate the opposite.
Starting from version 2.153.0, sample flags in the trun boxes of a given segment are optimized only when this is possible. Otherwise, the flags are written per sample, as shown below:
[traf] size=8+1212
[tfhd] size=12+12, flags=2000a
track ID = 1
sample description index = 1
default sample duration = 1001
[tfdt] size=12+4
base media decode time = 0
[trun] size=12+1160, flags=e01
sample count = 96
data offset = 1252
entries:
( 0) sample_size = 473, sample_flags = 0, sample_composition_time_offset = 2002
( 1) sample_size = 78, sample_flags = 0x1000, sample_composition_time_offset = 5005
( 2) sample_size = 76, sample_flags = 0, sample_composition_time_offset = 2002
( 3) sample_size = 75, sample_flags = 0x1000, sample_composition_time_offset = 0
While this may increase the muxing overhead by 4 extra bytes per sample when the optimization cannot be applied, it produces the most correct outputs, which may also fix playback in older players. Our muxer still optimizes the flags whenever possible.
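The 4 extra bytes per sample can be checked against the two trun dumps above. The sketch below decodes the trun flag bits as defined in ISO/IEC 14496-12 and recomputes the payload sizes. The helper names are hypothetical.

```python
# trun flag bits as defined in ISO/IEC 14496-12.
TRUN_FLAGS = {
    0x000001: "data_offset",
    0x000004: "first_sample_flags",
    0x000100: "sample_duration",
    0x000200: "sample_size",
    0x000400: "sample_flags",
    0x000800: "sample_composition_time_offset",
}
PER_SAMPLE_FIELDS = {
    "sample_duration",
    "sample_size",
    "sample_flags",
    "sample_composition_time_offset",
}

def describe_trun(tr_flags):
    """Return the fields signaled by tr_flags and the bytes each sample entry uses."""
    present = [name for bit, name in TRUN_FLAGS.items() if tr_flags & bit]
    bytes_per_sample = 4 * sum(1 for name in present if name in PER_SAMPLE_FIELDS)
    return present, bytes_per_sample

def trun_payload_size(tr_flags, sample_count):
    """Payload size after the 12-byte full box header, as mp4dump reports it."""
    present, bytes_per_sample = describe_trun(tr_flags)
    fixed = 4  # the sample_count field
    fixed += 4 * sum(1 for name in present if name not in PER_SAMPLE_FIELDS)
    return fixed + bytes_per_sample * sample_count

old_fields, old_bytes = describe_trun(0xA05)  # up to 2.152.0
new_fields, new_bytes = describe_trun(0xE01)  # 2.153.0, flags written per sample

print(old_fields, old_bytes)           # 8 bytes per sample
print(new_fields, new_bytes)           # 12 bytes per sample
print(trun_payload_size(0xA05, 96))    # 780
print(trun_payload_size(0xE01, 96))    # 1160
```

Both results match the sizes in the dumps (size=12+780 and size=12+1160), and the per-sample difference is the 4-byte sample_flags field.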
Furthermore, the values set in sample_flags as of version 2.153.0 (0 and 0x10000) differ slightly from the ones set up to 2.152.0 (0x2000000 and 0x1010000) because we no longer make use of the sample_depends_on bits (ISO/IEC 14496-12:2012 8.6.4.3).
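To see which bits differ between the old and new values, the 32-bit sample flags can be split into their fields. This is a minimal sketch based on the sample flags layout in ISO/IEC 14496-12. The function name `decode_sample_flags` is hypothetical.

```python
def decode_sample_flags(flags):
    """Split a 32-bit sample_flags value into its ISO/IEC 14496-12 fields."""
    return {
        "is_leading": (flags >> 26) & 0x3,
        "sample_depends_on": (flags >> 24) & 0x3,
        "sample_is_depended_on": (flags >> 22) & 0x3,
        "sample_has_redundancy": (flags >> 20) & 0x3,
        "sample_padding_value": (flags >> 17) & 0x7,
        "sample_is_non_sync_sample": (flags >> 16) & 0x1,
        "sample_degradation_priority": flags & 0xFFFF,
    }

for value in (0x2000000, 0x1010000, 0x0, 0x10000):
    fields = decode_sample_flags(value)
    print(
        hex(value),
        "sample_depends_on =", fields["sample_depends_on"],
        "non_sync =", fields["sample_is_non_sync_sample"],
    )
```

In the old values, sample_depends_on is 2 for the key frame and 1 for the other frames. In the new values it is 0 (unknown) in both cases, and only sample_is_non_sync_sample distinguishes key frames from non-key frames.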
Edit Lists with the edts box
We noticed that up to version 2.152.0, an edit list could be missing for H.264 streams that make use of B-frames. This edit list is required because of the delay between decoding and presenting frames (signaled in trun entries via sample_composition_time_offset). Without it, the first segment of some streams could report a non-zero start time, for example when checked with ffprobe:
Duration: 00:00:04.00, start: 0.083417, bitrate: 4396 kb/s
From version 2.153.0, the edit list is placed in the initialization segment when needed:
[edts] size=8+28
[elst] size=12+16
entry_count = 1
entry/segment duration = 0
entry/media time = 2002
entry/media rate = 1
With this, the first segment now correctly starts at 0:
Duration: 00:00:04.00, start: 0.000000, bitrate: 4398 kb/s
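The numbers above fit together arithmetically. The sketch below assumes a track timescale of 24000 for this stream, which is not stated in the dumps but is consistent with the timescale rule for 23.976 FPS. The helper name is hypothetical.

```python
from fractions import Fraction

TIMESCALE = 24000   # assumed track timescale for this stream
MEDIA_TIME = 2002   # entry/media time from the elst dump

def presentation_seconds(decode_time, cto, media_time=0, timescale=TIMESCALE):
    """Presentation time of a sample after applying the edit list's media_time."""
    return Fraction(decode_time + cto - media_time, timescale)

# First sample of the first segment: decode time 0, composition offset 2002.
without_edit_list = presentation_seconds(0, 2002)
with_edit_list = presentation_seconds(0, 2002, media_time=MEDIA_TIME)

print(f"{float(without_edit_list):.6f}")  # 0.083417
print(f"{float(with_edit_list):.6f}")     # 0.000000
```

Without the edit list, the first frame is presented 2002 ticks after time zero, which is the start value ffprobe reported. With the edit list, presentation starts at media time 2002, so the first frame maps to 0.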
To avoid the use of edit lists, it is also possible to use ALIGN_ZERO_NEGATIVE_CTO in the "ptsAlignMode" configuration of fMP4 muxings, which makes use of trun v1 boxes that allow negative sample_composition_time_offset values.
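The idea behind negative composition time offsets can be illustrated as follows. This is only a sketch of the general technique, not the encoder's actual algorithm. The helper names are hypothetical, and the offsets are taken from the trun dump shown earlier.

```python
import struct

def shift_ctos(ctos, delay):
    """Subtract the initial presentation delay from every composition time offset."""
    return [cto - delay for cto in ctos]

def pack_cto(cto, trun_version):
    """trun version 0 stores the offset as unsigned 32-bit, version 1 as signed."""
    fmt = ">i" if trun_version == 1 else ">I"
    return struct.pack(fmt, cto)

ctos = [2002, 5005, 2002, 0]           # offsets from the trun dump above
shifted = shift_ctos(ctos, delay=2002)
print(shifted)                         # [0, 3003, 0, -2002]

# The negative value only fits into a version 1 trun box.
print(pack_cto(-2002, trun_version=1).hex())
try:
    pack_cto(-2002, trun_version=0)
except struct.error as exc:
    print("version 0 cannot store a negative offset:", exc)
```

With the shifted offsets, the first sample is presented at time 0 without an edit list, at the cost of requiring players to support version 1 trun boxes.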
Playback testing
Before releasing this change, Bitmovin conducted extensive device testing to make sure that the new outputs don't cause playback regressions. The following devices/platforms were tested with non-DRM and DRM outputs, using the Bitmovin player:
- Chrome and Edge (stable, beta and dev) on macOS, Linux and Windows
- Firefox on macOS, Linux and Windows
- Safari on iPad Air 2 (iOS 13), iPad Mini 6 (iOS 15), iPhone 11 (iOS 14), iPhone 8+ (iOS 12)
- Samsung Tizen TVs, from 2016 to 2022 models
- LG webOS TVs, from 2016 to 2022 models
- Panasonic TV 2018
- Xbox One and Xbox Series S
- PlayStation 5
- Chromecast and Chromecast Ultra
- Android Pixel 2 with the browsers Chrome, Firefox and Samsung Internet
- Fire TV Stick 4K and Fire TV Stick 4K Max
- Roku Streaming Stick, Roku Streaming Stick 4K