Why does sliced threading affect realtime encoding so much with ffmpeg/x264?
I'm using ffmpeg with libx264 to encode a 720p screen captured from X11 in real time at 30 fps. When I use the -tune zerolatency parameter, the average encode time per frame can be as high as 12 ms with the baseline profile.

After studying the ffmpeg and x264 source code, I found that the key parameter leading to such a long encode time is sliced-threads, which is enabled by -tune zerolatency. After disabling it with -x264-params sliced-threads=0, the encode time drops to as low as 2 ms.

And with sliced-threads disabled, CPU usage is about 40%, versus only 20% when it is enabled.

Can someone explain the details of sliced threading, especially for realtime encoding (assume no frames are buffered; a frame is encoded as soon as it is captured)?
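For reference, here is a sketch of the two configurations being compared, assembled as ffmpeg command lines (the x11grab input options and the null output are illustrative assumptions; the libx264 flags are the ones discussed in this question):

```python
# Sketch: the two ffmpeg invocations being compared. The x11grab input
# options and the null output are illustrative assumptions; the libx264
# flags (-preset ultrafast, -tune zerolatency, sliced-threads=0) are the
# ones discussed in this question.
common = [
    "ffmpeg", "-benchmark_all",               # per-frame timing statistics
    "-f", "x11grab", "-framerate", "30", "-i", ":0.0",
    "-c:v", "libx264", "-preset", "ultrafast", "-tune", "zerolatency",
]
sliced = common + ["-f", "null", "-"]         # zerolatency default: sliced threads ON
unsliced = common + ["-x264-params", "sliced-threads=0",  # override: sliced threads OFF
                     "-f", "null", "-"]
print(" ".join(unsliced))
```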

Kiruna answered 10/11, 2015 at 6:14 Comment(7)
Are you using the default preset? What happens if you use -preset ultrafast? – Excurvate
The ultrafast preset is used in both cases above. – Kiruna
This is an interesting question. Are you using recent versions of ffmpeg and libx264, on what OS/CPU, and how are you measuring? – Excurvate
It's not the latest: the last commit on my ffmpeg source is from Feb 23 2014, and libx264 is from Feb 11 2014 (the source came from someone else, so I can only get the details from the git log). The host OS is Ubuntu 14.04 and the CPU is a Xeon(R) CPU E5-2630 v3. I used the -benchmark_all option, dumped all the output data to a file, then calculated the average encode time with a script. – Kiruna
The x264/doc/threads.txt says parts of the encoder are serial and slice-based threading doesn't scale well. Since you have 8 cores, I think it spawns 8 slice threads. You could override --threads 4 or --slices / --slices-max and see what happens. This is similar to your problem: mailman.videolan.org/pipermail/x264-devel/2010-April/… I don't think it's the scheduler though; your kernel is recent. – Excurvate
It seems the thread count does affect the encode time. As I measured, enabling sliced threads with threads=1 gives an encode time of about 2.6 ms, while threads=16 takes 4.3 ms. But with sliced threads disabled, the encode time is 0.8 ms. So I think some algorithmic difference affects the encode time besides the thread count. – Kiruna
Using too many threads can degrade performance, since the overhead of maintaining them exceeds the eventual gains. The docs also note that slice-based threading has lower throughput. I think the idea is that frame-based threading introduces a latency measured in frames. In realtime, low-latency encoding you want to send a frame as soon as possible rather than encode frames with maximum efficiency, so slice-based threading makes sense since all threads work on the same frame. I'll try to post an answer; maybe someone can add to it. – Excurvate

The documentation shows that frame-based threading has better throughput than slice-based. It also notes that the latter doesn't scale well due to parts of the encoder that are serial.

Speedup vs. number of encoding threads for the veryfast preset (non-realtime):

threads  speedup       psnr
      slice frame   slice  frame
x264 --preset veryfast --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.41x 2.29x  -0.005 -0.002
 3:   1.70x 3.65x  -0.035 +0.000
 4:   1.96x 3.97x  -0.029 -0.001
 5:   2.10x 3.98x  -0.047 -0.002
 6:   2.29x 3.97x  -0.060 +0.001
 7:   2.36x 3.98x  -0.057 -0.001
 8:   2.43x 3.98x  -0.067 -0.001
 9:         3.96x         +0.000
10:         3.99x         +0.000
11:         4.00x         +0.001
12:         4.00x         +0.001

The main difference seems to be that frame-based threading adds frame latency, since it needs different frames to work on, whereas with slice-based threading all threads work on the same frame. In realtime encoding, frame-based threading would have to wait for more frames to arrive to fill the pipeline, unlike in offline encoding.

Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.

From: Diary of an x264 Developer
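To put a number on that latency cost at 30 fps, each extra frame-encoding thread adds roughly one frame, i.e. about 33 ms. A small sketch (the 1.5 × cores figure is x264's usual automatic thread count, as mentioned in the comments below; the exact numbers are illustrative):

```python
def added_latency_ms(threads: int, fps: float) -> float:
    """Frame-based threading: roughly one extra frame of latency per
    additional encoding thread (illustrative approximation)."""
    return (threads - 1) * 1000.0 / fps

# x264 picks about 1.5 * logical cores by default, so a 16-thread Xeon
# would get around 24 encoding threads.
print(round(added_latency_ms(24, 30.0), 1))  # prints 766.7 (ms of pipeline latency)
print(round(added_latency_ms(1, 30.0), 1))   # prints 0.0 (single thread, no extra latency)
```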

Sliceless threading: example with 2 threads. Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.

From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt
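The motion-range restriction described above can be sketched as follows: a macroblock row being encoded may only search downward into rows of the reference frame that the other thread has already finished (illustrative only; x264's real per-row synchronization keeps additional margin):

```python
def max_down_search_rows(ref_rows_done: int, current_row: int) -> int:
    """Macroblock rows of the reference frame available below the current
    row. Illustrative sketch of the sliceless-threading restriction: the
    downward motion search is capped by the other thread's progress."""
    return max(ref_rows_done - 1 - current_row, 0)

ROWS = 45  # a 720p frame is 45 macroblock rows (720 / 16)

# Thread #1 starts frame #1 when thread #0 is halfway through frame #0:
print(max_down_search_rows(ref_rows_done=ROWS // 2, current_row=0))   # prints 21
# Once frame #0 is fully encoded, only the frame edge limits the search:
print(max_down_search_rows(ref_rows_done=ROWS, current_row=10))       # prints 34
```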

Therefore it makes sense to enable sliced threads with -tune zerolatency, as you need to send each frame as soon as possible rather than encode frames efficiently (performance- and quality-wise).

Conversely, using too many threads can hurt performance, since the overhead of maintaining them can exceed the potential gains.

Excurvate answered 12/11, 2015 at 10:3 Comment(5)
“In realtime encoding it would need to wait for more frames to arrive to fill the pipeline as opposed to offline.” This is talking about frame threading, right? And does either slice or frame threading increase the decoding time? How about the number of threads? Thanks – Kiruna
Yes, I was talking about frame-based threading, as it works on different frames. It's frame-threaded by default (#threads = 1.5 * cores), and IMO that's why you see lower values when enabling slices. Too many threads (16) means too much overhead. About decoding time, it seems that using slices lets the decoder take advantage of multi-threading and decode faster (e.g. Blu-ray requires 4 slices). – Excurvate
One more thing I'm wondering: if B-frames are not used, why does the encoder wait for later frames instead of using only previous frames? – Kiruna
See my updated answer. Each extra thread adds 1 frame of latency, as it needs that frame for motion estimation. – Excurvate
Thanks a lot for your patience and detailed answer. It really helps me a lot. – Kiruna
