RTP fragmentation vs UDP fragmentation

Asked 5/6, 2015 at 16:25 Answered 21/3, 2019 at 17:40

I don't understand why we bother fragmenting at the RTP level if the UDP (or IP) layer does the fragmentation.

As I understand it, on an Ethernet link the MTU is 1500 bytes.

If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of 1500, 1500, and 940 bytes respectively (the IP header is 20 bytes, so the total overhead is 60 bytes).

If I do it at the UDP layer, the overhead will be 84 bytes (3x 28 bytes).

At the RTP layer, it's 120 bytes of overhead.

At the H264/NAL packetization layer, it's 3 more bytes (so 123 bytes in total) for FU-A mode.

For such a small packet, that is a final increase of 3.1% over the initial packet size, while fragmenting at the IP layer would waste only 1.5% overall.
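The arithmetic above can be reproduced with a small Python sketch. It only restates the figures used in the question: the packet count of 3 and the 3 extra FU-A bytes are taken as given, not derived, and the name `overhead_table` is hypothetical.

```python
# Header sizes in bytes: IPv4 without options, UDP, RTP without CSRC entries.
IP_HDR, UDP_HDR, RTP_HDR = 20, 8, 12

def overhead_table(payload=3880, packets=3, fu_a_extra=3):
    """Return {layer: (overhead_bytes, percent_of_payload)}.

    `packets` and `fu_a_extra` are the question's own assumptions.
    """
    per_layer = {
        "ip": packets * IP_HDR,
        "udp": packets * (IP_HDR + UDP_HDR),
        "rtp": packets * (IP_HDR + UDP_HDR + RTP_HDR),
    }
    per_layer["fu_a"] = per_layer["rtp"] + fu_a_extra
    return {k: (v, round(100.0 * v / payload, 2)) for k, v in per_layer.items()}

table = overhead_table()
for layer, (nbytes, pct) in table.items():
    print(f"{layer:5s} {nbytes:4d} bytes  {pct:.2f}%")
```

Changing `payload` or `packets` shows how the gap between the layers grows with the number of packets.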

Is there any valid reason to bother defining such complex packetization rules at the RTP layer, knowing they will always cost more than lower-layer fragmentation?

Bolometer answered 5/6, 2015 at 16:25 Comment(0)

Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers. Instead, the fragments are glued back together using the IP identification field and fragment offsets. This makes correct classification impossible for stateless intermediate network devices (switches and routers) that need to re-apply QoS markings (because 802.1p or DSCP flags were cleared by another device or never existed in the first place). Unless the device has the resources to manage per-session state, it either has to risk rate-limiting or prioritizing fragments from unrelated streams, or not prioritize any fragments, some of which can be voice/video.

AFAIK RTP packets never get IP-fragmented unless the network has MTU mismatches in it. Hence every packet carries a UDP header with source and destination port numbers, so if you can get your clients to use known port ranges, you can re-establish QoS markings based on this information, pass IP fragments as vanilla traffic, and not worry about dropping voice/video data.
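To make the point about ports concrete, here is a minimal Python sketch of what a stateless classifier can see in a single IPv4 packet. The function names, addresses, and port numbers are made up for illustration, and the test packets carry a zero checksum, so they are not valid on a real wire.

```python
import struct

def udp_ports_if_visible(ip_packet: bytes):
    """Return (src_port, dst_port) if this IPv4 packet exposes a UDP header, else None."""
    ihl = (ip_packet[0] & 0x0F) * 4               # IPv4 header length in bytes
    (flags_frag,) = struct.unpack_from("!H", ip_packet, 6)
    frag_offset = flags_frag & 0x1FFF             # fragment offset, in units of 8 bytes
    protocol = ip_packet[9]
    if protocol != 17 or frag_offset != 0:        # 17 = UDP; later fragments have no UDP header
        return None
    return struct.unpack_from("!HH", ip_packet, ihl)

def make_ipv4(frag_offset: int, more_fragments: bool, payload: bytes) -> bytes:
    """Build a hypothetical IPv4/UDP packet (checksum left at 0) for the demo."""
    flags_frag = (0x2000 if more_fragments else 0) | frag_offset
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), 0x1234,
                         flags_frag, 64, 17, 0,
                         bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    return header + payload

udp_header = struct.pack("!HHHH", 5004, 5006, 8 + 1572, 0)
first = make_ipv4(0, True, udp_header + bytes(1472))   # first fragment: ports present
second = make_ipv4(185, False, bytes(100))             # offset 185 * 8 = 1480: no ports

print(udp_ports_if_visible(first))
print(udp_ports_if_visible(second))
```

Only the first fragment can be matched against a port range; the second one gives the classifier nothing but the IP addresses and the identification field.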

Dugald answered 26/1, 2018 at 17:7 Comment(0)

RTP is designed with UDP in mind.

Applications typically run RTP on top of UDP to make use of its multiplexing and checksum services; both protocols contribute parts of the transport protocol functionality.

However, the services RTP adds on top of raw UDP, such as the ability to detect packet reordering and loss and to reconstruct timing, require that the UDP data carry service information in addition to the RTP payload.

The Internet, like other packet networks, occasionally loses and reorders packets and delays them by variable amounts of time. To cope with these impairments, the RTP header contains timing information and a sequence number that allow the receivers to reconstruct the timing produced by the source, so that in this example, chunks of audio are contiguously played out the speaker every 20 ms. This timing reconstruction is performed separately for each source of RTP packets in the conference. The sequence number can also be used by the receiver to estimate how many packets are being lost.
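As a rough illustration of the sequence number and timestamp mentioned in the quote, here is a minimal Python sketch that parses the 12-byte fixed RTP header and counts gaps in the sequence numbers. The names are hypothetical, and the gap counter assumes packets arrive in order; a real receiver also has to cope with reordering.

```python
import struct
from typing import NamedTuple

class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int

def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header (CSRC list and extensions are ignored)."""
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet, 0)
    return RtpHeader(b0 >> 6, bool(b1 & 0x80), b1 & 0x7F, seq, ts, ssrc)

def count_gaps(sequences):
    """Count packets missing between consecutive sequence numbers (16-bit wrap-around)."""
    lost = 0
    for prev, cur in zip(sequences, sequences[1:]):
        lost += (cur - prev - 1) & 0xFFFF
    return lost

# Version 2, marker set, dynamic payload type 96, made-up SSRC.
sample = struct.pack("!BBHII", 0x80, 0x80 | 96, 1000, 160000, 0xDEADBEEF)
hdr = parse_rtp_header(sample)
print(hdr.sequence, hdr.timestamp, hdr.marker)
print(count_gaps([65534, 65535, 1, 2]))   # sequence number 0 never arrived
```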

RTP is also designed to be extensible, with common headers and a data-specific payload format:

RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.

All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".

That is, the RTP overhead for an H.264 stream is not just a waste of bandwidth. RTP headers and H.264 payload formatting make it possible, at moderate cost, to stream video data more reliably, while building on a specification that is well defined and works for many kinds of data.

Outbreed answered 8/6, 2015 at 22:35 Comment(5)

While what you say is clearly true, it doesn't answer the question I asked. I'm not asking why to use RTP over UDP (that's clear to me), but why H264's FU-A fragmentation is needed, since sending a large packet this way ends up consuming more data than using IP layer fragmentation instead (which will not lead to reordering, since the complete RTP packet & header will be preserved). – Bolometer 9/6, 2015 at 11:2

You need some data to identify what kind of packet you received. Out of order packet? Packet with a NAL unit? Part of a NAL unit? Evidently, a few bytes have to be reserved to provide such identification. Also, H.264 transmission is designed in such a way that there is no need to detect start codes to parse NAL units out (which would otherwise consume a few bytes again), so it should be clear whether a packet is a NAL unit with padding, or several NAL units small enough to fit in a single UDP packet. This is where FU-A comes from: a method to split NAL units into parts. – Outbreed 9/6, 2015 at 11:40

What I meant is: if I have a NAL of 56kB and send a 56kB RTP packet, it'll be split by the IP layer into as many IP packets as required and will be received as a single 56kB RTP packet by the client (and in order, or it'll not be there at all). The (initial) NAL header will still be present. That scheme wastes a lot less overhead than sending 38x 1.5kB RTP packets with FU-A markers. Also, the Start and End bits in FU-A headers are redundant with the RTP header's marker, sequence number and timestamp, since they must match. – Bolometer 12/6, 2015 at 16:32

Well, imagine you lost a packet: what would it take to catch up with your method, and with the RFC-defined one? – Outbreed 12/6, 2015 at 18:0

If you lose any single packet (whether an FU-A/RTP packet or an IP fragment), you cannot rebuild the NAL unit. In the end, you'd get the same result with both methods. Hence I don't understand your last comment. – Bolometer 13/6, 2015 at 21:35

I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.

They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury they usually do a lot of time calculation / other handling between sending every part of the datagram.

This causes even more syscalls, sometimes even stretching the transmission of one packet over a long time, because they set no upper bound on when the packet should be finished, only that it is finished before the next batch of packets is sent.

Inefficient behavior like this seriously gets in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network, and CPU efficiency reasons, it's usually way better to hand the entire datagram to the kernel in one go and let it deal with fragmentation instead of having userspace try to figure it out.
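As a sketch of the "message-vector" approach, Python's `socket.sendmsg` (available on Unix-like systems) accepts a list of buffers and hands them to the kernel as one datagram in a single call. The helper name `send_datagram` is hypothetical, and the demo uses a local `AF_UNIX` datagram socket pair so it runs without a network; with a UDP socket you would pass the destination address instead.

```python
import socket

def send_datagram(sock, header: bytes, payload: bytes, addr=None) -> int:
    """Send header + payload as ONE datagram with a single sendmsg() call.

    The kernel gathers the buffers, so userspace never copies them into a
    joined buffer and never issues one syscall per part.
    """
    if addr is None:
        return sock.sendmsg([header, payload])        # connected socket
    return sock.sendmsg([header, payload], [], 0, addr)

# Local demonstration over a connected datagram socket pair.
tx, rx = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
rx.settimeout(2.0)
header = bytes(12)                  # placeholder for a 12-byte RTP header
payload = b"\x65" + bytes(3000)     # placeholder NAL unit larger than one MTU
sent = send_datagram(tx, header, payload)
received = rx.recv(65535)
tx.close()
rx.close()
print(sent, len(received))
```

The receiver gets the whole message back in one `recv`, which is the behavior the answer relies on when it leaves fragmentation to the kernel.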

Buttercup answered 21/3, 2019 at 17:40 Comment(0)

Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).

RFC6184 is clear: you can carry up to 64kB of NAL data this way, since each NAL unit is preceded by its size, encoded in exactly 2 bytes (16 bits), although staying below the MTU is preferred.
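A minimal Python sketch of that layout, assuming the STAP-A format of RFC 6184 (one STAP-A header byte with type 24, then a 16-bit big-endian size in front of every NAL unit). The function names and the sample NAL bytes are made up for illustration.

```python
import struct

STAP_A = 24  # NAL unit type used for STAP-A in RFC 6184

def build_stap_a(nal_units):
    """Pack NAL units (each starting with its 1-byte NAL header) into one STAP-A payload."""
    f_bit = 0
    nri = 0
    for nal in nal_units:
        f_bit |= nal[0] & 0x80              # F bit is set if any aggregated unit has it set
        nri = max(nri, nal[0] & 0x60)       # NRI is the maximum over the aggregated units
    out = bytearray([f_bit | nri | STAP_A])
    for nal in nal_units:
        if len(nal) > 0xFFFF:
            raise ValueError("NAL unit does not fit the 16-bit size field")
        out += struct.pack("!H", len(nal)) + nal
    return bytes(out)

def split_stap_a(payload):
    """Inverse of build_stap_a: return the list of aggregated NAL units."""
    units, pos = [], 1                      # skip the STAP-A header byte
    while pos < len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        units.append(payload[pos:pos + size])
        pos += size
    return units

sps = bytes([0x67, 1, 2, 3])                # placeholder SPS-like unit
pps = bytes([0x68, 4, 5])                   # placeholder PPS-like unit
stap = build_stap_a([sps, pps])
print(len(stap), split_stap_a(stap) == [sps, pps])
```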

What happens if the cumulated size of the "single-time" NAL units is larger than 64kB? RFC6184 does not say, but I guess you'll have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the only case where the Start/End bits in the FU-A header are useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).

The RFC states:

An aggregation packet can carry as many aggregation units as necessary; however, the total amount of data in an aggregation packet obviously MUST fit into an IP packet, and the size SHOULD be chosen so that the resulting IP packet is smaller than the MTU size

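When a "single NAL per frame" is larger than what fits in the MTU (for example, 1460 bytes of RTP payload on Ethernet, i.e. 1500 bytes minus the 20-byte IP, 8-byte UDP and 12-byte RTP headers), it has to be split with a fragmentation unit packetization (for example, FU-A).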

However, nothing in the RFC states that the limit should be 1460 bytes. And it makes sense to go larger than that when streaming over Ethernet only (as computed above).

If you have a NAL unit larger than 64kB, then you must use FU-A to send it, since it cannot fit in a single IP datagram.
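For reference, a minimal Python sketch of FU-A splitting as I read RFC 6184: each fragment starts with an FU indicator (F and NRI copied from the NAL header, type 28) and an FU header (Start bit, End bit, original NAL type), and the original NAL header byte is not repeated. The name `fragment_fu_a` and the `max_payload` value are assumptions for illustration.

```python
FU_A = 28  # NAL unit type used for FU-A in RFC 6184

def fragment_fu_a(nal: bytes, max_payload: int):
    """Split one NAL unit into RTP payloads of at most max_payload bytes each."""
    if len(nal) <= max_payload:
        return [nal]                            # fits as a single NAL unit packet
    indicator = (nal[0] & 0xE0) | FU_A          # keep F and NRI, set type to 28
    nal_type = nal[0] & 0x1F
    body = nal[1:]                              # NAL header byte is carried in the FU bytes
    chunk = max_payload - 2                     # room left after indicator + FU header
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    fragments = []
    for i, piece in enumerate(pieces):
        start = 0x80 if i == 0 else 0
        end = 0x40 if i == len(pieces) - 1 else 0
        fragments.append(bytes([indicator, start | end | nal_type]) + piece)
    return fragments

def reassemble_fu_a(fragments):
    """Rebuild the NAL unit from a complete, in-order list of FU-A payloads."""
    header = (fragments[0][0] & 0xE0) | (fragments[0][1] & 0x1F)
    return bytes([header]) + b"".join(f[2:] for f in fragments)

nal = bytes([0x65]) + bytes(3879)               # placeholder 3880-byte IDR-like unit
frags = fragment_fu_a(nal, 1460)
print([len(f) for f in frags])
```

With the 3880-byte example and a 1460-byte payload budget, this yields three fragments, and only the first carries the Start bit while only the last carries the End bit.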

The RFC states:

This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower-layer fragmentation (e.g., by IP) has the following advantages:

o The payload format is capable of transporting NAL units bigger than 64 kbytes over an IPv4 network that may be present in pre-recorded video, particularly in High-Definition formats (there is a limit of the number of slices per picture, which results in a limit of NAL units per picture, which may result in big NAL units).

o The fragmentation mechanism allows fragmenting a single NAL unit and applying generic forward error correction as described in Section 12.5.

Which I understand as: "If your NAL unit is smaller than 64 kbytes and you don't care about FEC, then don't use FU-A, but use a single RTP packet for it."

Another case where FU-A is necessary is when receiving an H264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even when they are sent on a reliable stream socket.
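A minimal Python sketch of that interleaved framing, assuming the layout from RFC 2326 (a `$` byte, a one-byte channel identifier, a two-byte big-endian length, then the data). The function name is hypothetical.

```python
import struct

def frame_interleaved(channel: int, rtp_packet: bytes) -> bytes:
    """Wrap one RTP packet for RTSP interleaved (TCP) transport."""
    if len(rtp_packet) > 0xFFFF:
        # The length field is only 16 bits, so bigger packets must be fragmented first.
        raise ValueError("RTP packet does not fit the 16-bit length field")
    return struct.pack("!cBH", b"$", channel, len(rtp_packet)) + rtp_packet

framed = frame_interleaved(0, b"abc")
print(framed)

try:
    frame_interleaved(0, bytes(0x10000))
    too_big_rejected = False
except ValueError:
    too_big_rejected = True
print(too_big_rejected)
```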

Bolometer answered 18/6, 2015 at 13:48 Comment(5)

Was hunting for different data and came across this 2015 post. If it's useful: I don't see latency mentioned in this discussion. If you're shipping out audio data with RTP, you want that 20ms audio packet off your box as soon as possible. You wouldn't want your listener to wait for 64k of Opus-encoded data before letting IP handle the fragmentation. The receiver may not hear you for several seconds, depending on the encoding compression. – Ritch 3/9, 2017 at 17:2

You're right. The question, however, is about the opposite: how to efficiently transfer as much data as possible on an existing link. In that specific case, using IP fragmentation is more efficient. This is valid only when latency is not an issue (as you mentioned). If latency is an issue, then small packets will work better (but with more bandwidth overhead, obviously). Audio does not use H264's NAL, AFAIK. – Bolometer 18/1, 2018 at 12:50

When errors occur on the channel, UDP would need to retransmit the whole packet, that is, all of its fragments. In contrast, RTP would need to retransmit only the lost fragment. The difference translates into different delays if bandwidth is not a concern. – Monition 27/10, 2019 at 18:41

If no retransmission is attempted, the difference translates into the amount of missing data. UDP would not give you the fragments received if any fragment of the same packet is lost. RTP will deliver whatever it has received. In the end, the presentation quality of the video or audio will differ. – Monition 27/10, 2019 at 18:47

@minghua: Please note that nothing in the RTP (and RTSP) RFCs specifies how to retransmit packets or signal that packets were lost. To be honest, it's very hard, even with FU-A fragmented RTP packets, to identify missing packets if they are lost in batches (which is the case for RTP fragmentation). Two or more missing packets aren't reconstructible because you don't know if the 2nd missing packet is from a later NAL (new timestamp) or the end of the current NAL. Also, retransmission with NACK is the worst possible method for a saturated network. – Bolometer 7/2 at 15:33
