I'm ingesting an RTMP stream and converting it to a fragmented MP4 file in JavaScript. It took a week of work, but I'm almost finished with this task. I'm generating a valid ftyp atom, moov atom, and moof atom, and the first frame of the video actually plays (with audio) before it goes into infinite buffering, with no errors listed in chrome://media-internals.
Plugging the video into ffprobe, I get an error similar to:
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x558559198080] Failed to add index entry
Last message repeated 368 times
[h264 @ 0x55855919b300] Invalid NAL unit size (-619501801 > 966).
[h264 @ 0x55855919b300] Error splitting the input into NAL units.
This led me on a massive hunt for data alignment issues or invalid byte offsets in my tfhd and trun atoms. However, no matter where I looked or how I sliced the data, I couldn't find any problems in the moof atom.
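One way to narrow down the "Invalid NAL unit size" error is to walk each sample before it is written into the mdat and confirm it really is a clean run of length-prefixed NAL units. The sketch below assumes 4-byte big-endian length prefixes (lengthSizeMinusOne = 3 in the avcC box); `splitNalUnits` is a hypothetical helper name, not part of any library.

```javascript
// Walk one sample that is assumed to be a sequence of
// [length prefix][NAL unit] records and report each unit.
// Throws a RangeError as soon as a length does not fit in the sample,
// which is the condition ffprobe reports as "Invalid NAL unit size".
function splitNalUnits(sample, lengthSize = 4) {
  const view = new DataView(sample.buffer, sample.byteOffset, sample.byteLength);
  const units = [];
  let pos = 0;
  while (pos < sample.byteLength) {
    if (pos + lengthSize > sample.byteLength) {
      throw new RangeError(`Truncated length prefix at byte ${pos}`);
    }
    // Read the big-endian length prefix byte by byte so that
    // 1-, 2- and 4-byte prefixes all go through the same path.
    let size = 0;
    for (let i = 0; i < lengthSize; i++) {
      size = size * 256 + view.getUint8(pos + i);
    }
    pos += lengthSize;
    const remaining = sample.byteLength - pos;
    if (size === 0 || size > remaining) {
      throw new RangeError(`NAL unit size ${size} does not fit in remaining ${remaining} bytes`);
    }
    // The low 5 bits of the first NAL byte are nal_unit_type (5 = IDR slice).
    units.push({ offset: pos, size, type: view.getUint8(pos) & 0x1f });
    pos += size;
  }
  return units;
}
```

If every sample passes this check on its own but ffprobe still complains, the sample bytes are probably fine and the reader is starting at the wrong position, which points back at the offsets and sizes in the trun.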
I then took the original FLV file and converted it to an MP4 in ffmpeg with the following command:
ffmpeg -i ~/Videos/rtmp/big_buck_bunny.flv -c copy -ss 5 -t 10 -movflags frag_keyframe+empty_moov+faststart test.mp4
I opened both the MP4 I was creating and the MP4 output by ffmpeg in an atom parser and compared the two:
The first thing that jumped out at me was that the ffmpeg-generated file has multiple video samples per moof. Specifically, every moof started with one key frame, then contained all the difference frames until the next key frame (which was used as the start of the following moof atom).
Contrast this with how I'm generating my MP4. I create a moof atom every time an FLV VIDEODATA packet arrives. This means my moof may not contain a key frame (and usually doesn't).
Could this be why I'm having trouble? Or is there something else I'm missing?
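For comparison, grouping samples the way the ffmpeg output does could look roughly like the sketch below. It relies on the FLV layout where the upper four bits of the first VIDEODATA byte are the frame type (1 = key frame). `FragmentBatcher` and `onFragment` are hypothetical names, and the sketch assumes the caller only pushes AVC NALU packets (AVCPacketType 1), not the AVC sequence header.

```javascript
const FLV_FRAME_TYPE_KEY = 1; // frame type lives in the upper nibble of byte 0

function isKeyFrame(videoData) {
  return (videoData[0] >> 4) === FLV_FRAME_TYPE_KEY;
}

// Buffers VIDEODATA packets and hands them over one group at a time,
// where each group starts with a key frame and runs until the next one.
class FragmentBatcher {
  constructor(onFragment) {
    this.onFragment = onFragment; // called with an array of packets per moof
    this.pending = [];
    this.seenKeyFrame = false;
  }

  push(videoData) {
    if (isKeyFrame(videoData)) {
      // A new key frame closes the previous group, if there is one.
      this.flush();
      this.seenKeyFrame = true;
    } else if (!this.seenKeyFrame) {
      // Nothing can be decoded before the first key frame, so drop it.
      return;
    }
    this.pending.push(videoData);
  }

  flush() {
    if (this.pending.length === 0) return;
    this.onFragment(this.pending);
    this.pending = [];
  }
}
```

Each array passed to `onFragment` would then become one moof/mdat pair with one trun entry per packet, instead of one moof per packet.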
The video files in question can be downloaded here:
Another issue I noticed was ffmpeg's prolific use of base_data_offset in the tfhd atom. However, when I tried tracking the total number of bytes appended and setting the base_data_offset myself, I got an error in Chrome along the lines of: "MSE doesn't support base_data_offset". Per the ISO/IEC 14496-12 spec:
If not provided, the base-data-offset for the first track in the movie fragment is the position of the first byte of the enclosing Movie Fragment Box, and for second and subsequent track fragments, the default is the end of the data defined by the preceding fragment.
This wording leads me to believe that the data_offset in the first trun atom should be equal to the size of the moof atom and the data_offset in the second trun atom should be 0 (0 bytes from the end of the data defined by the preceding fragment). However, when I tried this I got an error that the video data couldn't be parsed. What did lead to data that could be parsed was the length of the moof atom plus the total length of the first track (as if the base offset were the first byte of the enclosing moof box, same as the first track).
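Written out as a calculation, the variant that parsed measures every trun data_offset from the first byte of the moof. This is only a sketch under some assumptions: a single mdat immediately follows the moof, each track's sample data is stored back to back in track order, and `trunDataOffsets` is a hypothetical helper. Whether the 8-byte mdat header has to be added depends on how the moof length is being counted, so it is a parameter here.

```javascript
// Size of a plain mdat header: 32-bit size + 'mdat' type (no 64-bit largesize).
const MDAT_HEADER_SIZE = 8;

// Returns one data_offset per track, all relative to the first byte of the moof.
// moofSize: total byte length of the moof box.
// trackByteLengths: total sample bytes per track, in the order they sit in mdat.
function trunDataOffsets(moofSize, trackByteLengths, mdatHeaderSize = MDAT_HEADER_SIZE) {
  const offsets = [];
  let cursor = moofSize + mdatHeaderSize; // first payload byte inside mdat
  for (const length of trackByteLengths) {
    offsets.push(cursor);
    cursor += length; // the next track starts right after this one
  }
  return offsets;
}

// Example: a 100-byte moof, 500 bytes of video, then 200 bytes of audio.
console.log(trunDataOffsets(100, [500, 200])); // [ 108, 608 ]
```

Under the reading of the spec quoted above, the second value would instead be 0, since it would be measured from the end of the first track's data.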