Does a track run in a fragmented MP4 have to start with a key frame?

I'm ingesting an RTMP stream and converting it to a fragmented MP4 file in JavaScript. It took a week of work, but I'm almost finished. I'm generating valid ftyp, moov, and moof atoms, and the first frame of the video actually plays (with audio) before playback goes into infinite buffering, with no errors listed in chrome://media-internals.

Plugging the video into ffprobe, I get an error similar to:

[mov,mp4,m4a,3gp,3g2,mj2 @ 0x558559198080] Failed to add index entry
    Last message repeated 368 times
[h264 @ 0x55855919b300] Invalid NAL unit size (-619501801 > 966).
[h264 @ 0x55855919b300] Error splitting the input into NAL units.

This led me on a massive hunt for data alignment issues or invalid byte offsets in my tfhd and trun atoms. However, no matter where I looked or how I sliced the data, I couldn't find any problems in the moof atom.
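For anyone wanting to reproduce the check, here is a rough sketch of the kind of sanity test I mean: walk one sample's length-prefixed NAL units and confirm every length fits inside the sample. The function name checkNalUnits is made up for illustration, and the 4-byte length prefix is an assumption (it corresponds to lengthSizeMinusOne = 3 in the avcC box, which is common but not guaranteed).

```javascript
// Hypothetical helper: validates the NAL unit length prefixes inside one sample.
// `sample` is a Uint8Array holding the bytes written to mdat for a single frame.
// `lengthSize` is assumed to be 4 (avcC lengthSizeMinusOne = 3); adjust if yours differs.
function checkNalUnits(sample, lengthSize = 4) {
  const sizes = [];
  let pos = 0;
  while (pos < sample.byteLength) {
    // Not enough bytes left to even hold a length prefix.
    if (pos + lengthSize > sample.byteLength) {
      return { ok: false, sizes, badAt: pos };
    }
    // Read the big-endian length prefix.
    let size = 0;
    for (let i = 0; i < lengthSize; i++) {
      size = size * 256 + sample[pos + i];
    }
    const prefixPos = pos;
    pos += lengthSize;
    // A length larger than the remaining bytes means the offsets or sizes are off.
    if (size > sample.byteLength - pos) {
      return { ok: false, sizes, badAt: prefixPos };
    }
    sizes.push(size);
    pos += size;
  }
  return { ok: true, sizes, badAt: -1 };
}
```

If this reports ok for every sample as built in memory, but ffprobe still complains, that would point at the offsets or sizes in the trun rather than at the sample data itself.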

I then took the original FLV file and converted it to an MP4 in ffmpeg with the following command:

ffmpeg -i ~/Videos/rtmp/big_buck_bunny.flv -c copy -ss 5 -t 10 -movflags frag_keyframe+empty_moov+faststart test.mp4

I opened both the MP4 I was creating and the MP4 output by ffmpeg in an atom parser and compared the two.

The first thing that jumped out at me was that the ffmpeg-generated file has multiple video samples per moof. Specifically, every moof starts with one key frame, then contains all the difference frames up to the next key frame (which becomes the start of the following moof atom).

Contrast this with how I'm generating my MP4: I create a moof atom every time an FLV VIDEODATA packet arrives. This means my moof may not contain a key frame (and usually doesn't).
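To make the difference concrete, here is a minimal sketch of the ffmpeg-style grouping as I understand it: buffer incoming packets and only emit a group once the next key frame arrives. The names isKeyFrame and createFragmentBuffer are hypothetical, and the sketch assumes each packet is a Uint8Array containing the FLV VIDEODATA payload, whose first byte carries the frame type in its upper 4 bits (1 = key frame, 2 = inter frame).

```javascript
// The upper 4 bits of the first VIDEODATA byte are the FLV frame type; 1 means key frame.
function isKeyFrame(videoData) {
  return (videoData[0] >> 4) === 1;
}

// Hypothetical helper: collects packets and hands back a completed group
// (one key frame plus the inter frames that follow it) whenever a new key frame arrives.
function createFragmentBuffer() {
  let pending = [];
  return {
    // Returns an array of packets for one moof/mdat pair, or null if the group is still open.
    push(videoData) {
      const key = isKeyFrame(videoData);
      // Inter frames seen before the first key frame cannot be decoded, so skip them.
      if (!key && pending.length === 0) {
        return null;
      }
      let completed = null;
      if (key && pending.length > 0) {
        completed = pending;
        pending = [];
      }
      pending.push(videoData);
      return completed;
    },
    // Returns whatever is buffered when the stream ends.
    flush() {
      const rest = pending;
      pending = [];
      return rest;
    },
  };
}
```

Each non-null result from push would become one moof with a multi-sample trun, instead of one moof per packet as I do now. The trade-off is added latency of up to one key frame interval.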

Could this be why I'm having trouble? Or is there something else I'm missing?

The video files in question can be downloaded here:

Another issue I noticed was ffmpeg's prolific use of base_data_offset in the tfhd atom. However, when I tried tracking the total number of bytes appended and setting the base_data_offset myself, I got an error in Chrome along the lines of: "MSE doesn't support base_data_offset". Per the ISO/IEC 14496-12 spec:

If not provided, the base-data-offset for the first track in the movie fragment is the position of the first byte of the enclosing Movie Fragment Box, and for second and subsequent track fragments, the default is the end of the data defined by the preceding fragment.

This wording leads me to believe that the data_offset in the first trun atom should be equal to the size of the moof atom, and the data_offset in the second trun atom should be 0 (0 bytes from the end of the data defined by the preceding fragment). However, when I tried this I got an error that the video data couldn't be parsed. What did lead to data that could be parsed was the length of the moof atom plus the total length of the first track's data (as if the base offset for the second track were the first byte of the enclosing moof box, same as for the first track).
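Here is a small sketch of the calculation that produced parseable data for me, with every track's data_offset measured from the first byte of the moof. The function name trunDataOffsets is hypothetical. It assumes a single mdat immediately follows the moof, that each track's sample data is written into that mdat back to back in track order, and that the mdat uses the ordinary 8-byte box header (32-bit size plus type), which has to be skipped as well.

```javascript
// Assumption: the mdat box uses a 32-bit size, so its header is 4 (size) + 4 ('mdat') bytes.
const MDAT_HEADER_SIZE = 8;

// Hypothetical helper: returns one trun data_offset per track, all relative to the
// first byte of the enclosing moof.
// moofSize: total byte length of the moof box.
// trackPayloadSizes: byte length of each track's sample data, in the order written to mdat.
function trunDataOffsets(moofSize, trackPayloadSizes) {
  const offsets = [];
  // The first sample byte sits right after the moof and the mdat header.
  let cursor = moofSize + MDAT_HEADER_SIZE;
  for (const size of trackPayloadSizes) {
    offsets.push(cursor);
    cursor += size; // The next track's data starts where this track's data ends.
  }
  return offsets;
}

// Example: a 100-byte moof, 500 bytes of video followed by 200 bytes of audio.
console.log(trunDataOffsets(100, [500, 200])); // [ 108, 608 ]
```

This only reflects what parsed in my tests; it is not what I would expect from my reading of the spec text quoted above.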
