The x264 documentation (doc/threads.txt) shows that frame-based threading has better throughput than slice-based threading. It also notes that the latter doesn't scale well because parts of the encoder are serial.
Speedup vs. encoding threads for the veryfast profile (non-realtime):

threads   speedup         psnr
          slice  frame    slice   frame
x264 --preset veryfast --tune psnr --crf 30
 1:       1.00x  1.00x    +0.000  +0.000
 2:       1.41x  2.29x    -0.005  -0.002
 3:       1.70x  3.65x    -0.035  +0.000
 4:       1.96x  3.97x    -0.029  -0.001
 5:       2.10x  3.98x    -0.047  -0.002
 6:       2.29x  3.97x    -0.060  +0.001
 7:       2.36x  3.98x    -0.057  -0.001
 8:       2.43x  3.98x    -0.067  -0.001
 9:       -      3.96x    -       +0.000
10:       -      3.99x    -       +0.000
11:       -      4.00x    -       +0.001
12:       -      4.00x    -       +0.001
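To put numbers on the scaling gap, here is a rough Python sketch that takes the 2 to 4 thread rows from the table and computes parallel efficiency (speedup divided by thread count). It also estimates the serial share that Amdahl's law would imply for slice threading. Amdahl's law is my own choice of model here, not something the x264 documentation uses, and the function names are purely illustrative.

```python
# Speedups copied from the table above (threads -> speedup).
SLICE_SPEEDUP = {2: 1.41, 3: 1.70, 4: 1.96}
FRAME_SPEEDUP = {2: 2.29, 3: 3.65, 4: 3.97}


def efficiency(speedup, threads):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / threads


def amdahl_serial_fraction(speedup, threads):
    """Serial fraction s implied by Amdahl's law (an assumed model):
    speedup = 1 / (s + (1 - s) / threads), solved for s."""
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


def summarize():
    """Return (threads, slice efficiency, frame efficiency, implied serial share)."""
    rows = []
    for n in sorted(SLICE_SPEEDUP):
        rows.append((
            n,
            efficiency(SLICE_SPEEDUP[n], n),
            efficiency(FRAME_SPEEDUP[n], n),
            amdahl_serial_fraction(SLICE_SPEEDUP[n], n),
        ))
    return rows


if __name__ == "__main__":
    for n, slice_eff, frame_eff, serial in summarize():
        print(f"{n} threads: slice {slice_eff:.0%}, frame {frame_eff:.0%} "
              f"of ideal; implied serial share for slices ~{serial:.0%}")
```

With these rows, frame threading stays at or above ideal scaling through 4 threads, while slice threading drops below half of ideal at 4 threads. The implied serial share is only a ballpark figure, since slices also lose efficiency for reasons other than serial code.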
The main difference seems to be that frame threading adds frame latency, since it needs several different frames to work on, while with slice-based threading all threads work on the same frame. In realtime encoding, the encoder would have to wait for more frames to arrive before the pipeline is full, which is not a concern when encoding offline.
Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency.
From: Diary of an x264 Developer
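As a back-of-the-envelope illustration of that cost, the sketch below turns "one more frame of latency per extra thread" into milliseconds. It only counts the delay the quote attributes to frame threading and ignores other sources of latency such as lookahead or B-frames. The function names and the example frame rate are my own assumptions.

```python
def added_latency_frames(threads):
    """Extra frames of delay under frame threading: one per extra thread."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return threads - 1


def added_latency_ms(threads, fps):
    """The same delay expressed in milliseconds at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return added_latency_frames(threads) * 1000.0 / fps


if __name__ == "__main__":
    # Illustrative values only: a 25 fps live source.
    for threads in (1, 2, 4, 8):
        print(threads, "threads:", added_latency_ms(threads, 25), "ms extra")
```

Slice-based threading would stay at zero extra frames in this model regardless of the thread count, which is the whole point of using it for live streams.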
Sliceless threading: example with 2 threads.
Start encoding frame #0. When it's half done, start encoding frame #1. Thread #1 now only has access to the top half of its reference frame, since the rest hasn't been encoded yet. So it has to restrict the motion search range. But that's probably ok (unless you use lots of threads on a small frame), since it's pretty rare to have such long vertical motion vectors. After a little while, both threads have encoded one row of macroblocks, so thread #1 still gets to use motion range = +/- 1/2 frame height. Later yet, thread #0 finishes frame #0, and moves on to frame #2. Thread #0 now gets motion restrictions, and thread #1 is unrestricted.
From: http://web.archive.org/web/20150307123140/http://akuvian.org/src/x264/sliceless_threads.txt
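The motion range restriction in that description can be sketched as follows. This is a simplified model, not x264's actual code: it assumes both threads advance one macroblock row at a time at the same speed, it ignores any safety margin the real encoder keeps, and all names (lag_rows, frame_rows, and so on) are hypothetical.

```python
def reference_rows_done(current_row, lag_rows, frame_rows):
    """Rows of the reference frame already encoded when the dependent thread
    starts current_row, given that the reference thread began lag_rows earlier."""
    return min(frame_rows, current_row + lag_rows)


def max_downward_mv_rows(current_row, lag_rows, frame_rows):
    """How many macroblock rows below current_row a motion vector may reach
    without touching reference rows that are not encoded yet."""
    last_done_row = reference_rows_done(current_row, lag_rows, frame_rows) - 1
    return max(0, last_done_row - current_row)


if __name__ == "__main__":
    # Illustrative: 68 macroblock rows (roughly 1080p) and 2 threads,
    # so the second thread starts when the first is half done.
    frame_rows, lag_rows = 68, 34
    for row in (0, 10, 34, 40, 67):
        print("row", row, "-> may reach",
              max_downward_mv_rows(row, lag_rows, frame_rows), "rows down")
```

In this model the downward reach stays at about half the frame height until the reference frame is finished, after which only the frame boundary limits it. A smaller lag (more threads on the same frame size) shrinks the reach, which matches the warning about using lots of threads on a small frame.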
Therefore it makes sense to enable sliced-threads with -tune zerolatency, as you need to send each frame as soon as possible rather than encode frames efficiently (performance and quality wise).
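For reference, here is a minimal Python sketch that assembles an ffmpeg command line using that tune. As far as I know, x264's zerolatency tune switches on sliced threads along with turning off B-frames and lookahead, so no separate threading flag is passed. The helper name and the file names are placeholders, and the optional thread cap is only there to make experiments easier.

```python
import shlex


def low_latency_cmd(src, dst, preset="veryfast", threads=None):
    """Build an ffmpeg/libx264 argument list tuned for low latency.

    src and dst are placeholder paths; threads optionally caps the
    number of encoder threads (None leaves it to the encoder).
    """
    cmd = ["ffmpeg", "-i", src,
           "-c:v", "libx264",
           "-preset", preset,
           "-tune", "zerolatency"]
    if threads is not None:
        cmd += ["-threads", str(threads)]
    cmd.append(dst)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this works without ffmpeg.
    print(shlex.join(low_latency_cmd("input.mkv", "output.mkv", threads=4)))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Passing the list to subprocess.run would be the obvious next step on a machine with ffmpeg installed.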
Conversely, using too many threads can hurt performance, as the overhead of maintaining them can exceed the potential gains.
preset? What happens if you use -preset ultrafast? – Excurvate
ffmpeg and libx264 and on what OS / CPU. Also, how are you measuring? – Excurvate
x264/doc/threads.txt says parts of the encoder are serial and sliced-based threading doesn't scale well. Since you have 8 cores I think it spawns 8 slice threads. You could override --threads 4 or --slices / --slices-max and see what happens. This is similar to your problem: mailman.videolan.org/pipermail/x264-devel/2010-April/… I don't think it's the scheduler though, your kernel is recent. – Excurvate