FFmpeg's xstack filter results in out-of-sync sound. Is it possible to mix the audio in a single encoding pass?

I wrote a Python script that generates an xstack complex filter command. The video inputs are a mixture of several formats, described here:

I have 2 commands generated, one for the xstack filter, and one for the audio mixing.

Here is the stack command: (sorry the text doesn't wrap!)

'c:/ydl/ffmpeg.exe',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-filter_complex',
'[0]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf0];[rsclbf0]fps=24[rscl0];[1]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf1];[rsclbf1]fps=24[rscl1];[2]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf2];[rsclbf2]fps=24[rscl2];[3]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf3];[rsclbf3]fps=24[rscl3];[4]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf4];[rsclbf4]fps=24[rscl4];[5]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf5];[rsclbf5]fps=24[rscl5];[6]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf6];[rsclbf6]fps=24[rscl6];[7]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf7];[rsclbf7]fps=24[rscl7];[8]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf8];[rsclbf8]fps=24[rscl8];[9]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf9];[rsclbf9]fps=24[rscl9];[10]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf10];[rsclbf10]fps=24[rscl10];[11]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf11];[rsclbf11]fps=24[rscl11];[12]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf12];[rsclbf12]fps=24[rscl12];[13]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, setsar=1[rsclbf13];[rsclbf13]fps=24[rscl13];[14]scale=480:270:force_original_aspect_ratio=decrease,pad=480:270:(ow-iw)/2:(oh-ih)/2, 
setsar=1[rsclbf14];[rsclbf14]fps=24[rscl14];[rscl0][rscl1][rscl2][rscl3][rscl4]concat=n=5[cct0];[rscl5][rscl6][rscl7]concat=n=3[cct1];[rscl8][rscl9][rscl10]concat=n=3[cct2];[rscl11][rscl12][rscl13][rscl14]concat=n=4[cct3];[cct0][cct1][cct2][cct3]xstack=inputs=4:layout=0_0|w0_0|0_h0|w0_h0',
'output.mp4',
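
To show how a filtergraph of this shape can be generated, here is a simplified Python sketch. It is not my exact script: the function name `video_filtergraph`, its parameters, and the `groups` structure (one list of input indices per grid cell) are illustrative assumptions. It also folds the intermediate `rsclbfN` label into a single chain per input, which should be equivalent to the command above.

```python
def video_filtergraph(groups, width=480, height=270, fps=24,
                      layout="0_0|w0_0|0_h0|w0_h0"):
    """Build a scale/pad/fps -> concat -> xstack filtergraph string.

    groups: list of lists of input indices; each inner list is the
            sequence of clips played back-to-back in one grid cell.
    layout: xstack layout string; the default is a 2x2 grid, so it
            only fits the case of exactly four groups.
    """
    chains = []

    # One normalisation chain per input: same size, SAR and frame rate.
    for group in groups:
        for i in group:
            chains.append(
                f"[{i}]scale={width}:{height}:force_original_aspect_ratio=decrease,"
                f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
                f"setsar=1,fps={fps}[rscl{i}]"
            )

    # One video-only concat per grid cell.
    for cell, group in enumerate(groups):
        labels = "".join(f"[rscl{i}]" for i in group)
        chains.append(f"{labels}concat=n={len(group)}[cct{cell}]")

    # Stack the concatenated cells into the grid.
    cells = "".join(f"[cct{c}]" for c in range(len(groups)))
    chains.append(f"{cells}xstack=inputs={len(groups)}:layout={layout}")

    return ";".join(chains)


# The grouping used in the sample command: 5 + 3 + 3 + 4 inputs.
sample_groups = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 13, 14]]
video_graph = video_filtergraph(sample_groups)
```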

Here is the mix_audio command:

'c:/ydl/ffmpeg.exe',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-i', 'inputX.mp4',
'-filter_complex',
'[0:a][1:a][2:a][3:a][4:a]concat=n=5:v=0:a=1[cct_a0];[5:a][6:a][7:a]concat=n=3:v=0:a=1[cct_a1];[8:a][9:a][10:a]concat=n=3:v=0:a=1[cct_a2];[11:a][12:a][13:a][14:a]concat=n=4:v=0:a=1[cct_a3];[cct_a0][cct_a1][cct_a2][cct_a3]amix=inputs=4[all_aud]',
'-map',
'15:v',
'-map',
'[all_aud]',
'-c:v',
'copy',
'output.mp4',
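
The audio graph follows the same grouping as the video graph. A minimal Python sketch that reproduces the filter string above is shown here; the function name `audio_filtergraph` and the `groups` argument are illustrative, not taken from my actual script.

```python
def audio_filtergraph(groups):
    """Build an audio-only concat -> amix filtergraph string.

    groups: list of lists of input indices; each inner list is the
            sequence of clips whose audio is joined back-to-back.
    """
    chains = []

    # Join the audio streams of each group end to end (no video).
    for cell, group in enumerate(groups):
        labels = "".join(f"[{i}:a]" for i in group)
        chains.append(f"{labels}concat=n={len(group)}:v=0:a=1[cct_a{cell}]")

    # Mix the concatenated tracks so they play simultaneously.
    cells = "".join(f"[cct_a{c}]" for c in range(len(groups)))
    chains.append(f"{cells}amix=inputs={len(groups)}[all_aud]")

    return ";".join(chains)


# Same grouping as the sample command: 5 + 3 + 3 + 4 inputs.
audio_graph = audio_filtergraph(
    [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 13, 14]]
)
```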

Of course, those are sample commands. I actually use many more videos as input; this sample is shortened for the sake of readability.

Here are the videos I use, with relevant ffprobe data, in some HTML table:

I'm getting this warning:

[swscaler @ 0000020bac5a19c0] Warning: data is not aligned! This can lead to a speed loss

I think this is unrelated to the audio desync. As far as I understand, the unaligned-data warning is about x264 resolutions needing to be multiples of 16, but my filter already takes this into account.

There is a perceptible audio desync, which is the main problem I am having. FFmpeg doesn't report any other errors. Is it because I use 2 commands and mix the audio afterwards? How could I do the xstack stage and the audio mixing in a single command?

I'm a bit confused as to how FFmpeg handles diverse frame rates. I was told to re-encode all the video inputs before performing the xstack stage, but that would create some disk overhead, so I'd rather do it in a single ffmpeg job if possible.
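
To make the question concrete, this is roughly the single-command version I have in mind, as an untested Python sketch. It assumes that the `concat` filter can carry video and audio together (`v=1:a=1`, with inputs interleaved as video, audio, video, audio, ...), and that the stacked video and mixed audio are then selected with `-map`. The function name `single_pass_command`, the labels `v{i}`, `cv{n}`, `ca{n}`, `vout` and `aout`, and the default parameters are all illustrative. I have not verified that this fixes the desync.

```python
def single_pass_command(inputs, groups, output, ffmpeg="ffmpeg",
                        width=480, height=270, fps=24,
                        layout="0_0|w0_0|0_h0|w0_h0"):
    """Return an ffmpeg argument list that stacks video and mixes audio
    in one filtergraph.

    inputs: list of input file paths, in index order.
    groups: list of lists of input indices, one inner list per grid cell.
    layout: xstack layout string; the default assumes exactly four groups.
    """
    args = [ffmpeg]
    for path in inputs:
        args += ["-i", path]

    chains = []

    # Normalise every video stream to the same size, SAR and frame rate.
    for group in groups:
        for i in group:
            chains.append(
                f"[{i}:v]scale={width}:{height}:force_original_aspect_ratio=decrease,"
                f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
                f"setsar=1,fps={fps}[v{i}]"
            )

    # Concatenate video and audio of each group together, so each
    # segment's audio stays attached to its own video segment.
    for cell, group in enumerate(groups):
        pairs = "".join(f"[v{i}][{i}:a]" for i in group)
        chains.append(
            f"{pairs}concat=n={len(group)}:v=1:a=1[cv{cell}][ca{cell}]"
        )

    n = len(groups)
    video_cells = "".join(f"[cv{c}]" for c in range(n))
    audio_cells = "".join(f"[ca{c}]" for c in range(n))
    chains.append(f"{video_cells}xstack=inputs={n}:layout={layout}[vout]")
    chains.append(f"{audio_cells}amix=inputs={n}[aout]")

    args += ["-filter_complex", ";".join(chains),
             "-map", "[vout]", "-map", "[aout]", output]
    return args


# Small example: two cells side by side, two clips per cell.
cmd = single_pass_command(
    [f"in{i}.mp4" for i in range(4)],
    [[0, 1], [2, 3]],
    "out.mp4",
    layout="0_0|w0_0",
)
```

The returned list can be passed directly to `subprocess.run(cmd)`, which avoids the intermediate file and the second encode.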
