Linux: uploading unfinished files - with file size check (scp/rsync) [closed]

I typically end up in the following situation: I have, say, a 650 MB MPEG-2 .avi video file from a camera. Then, I use ffmpeg2theora to convert it into a Theora .ogv video file, say some 150 MB in size. Finally, I want to upload this .ogv file to an ssh server.

Let's say the ffmpeg2theora encoding process takes some 15 minutes on my PC. The upload, on the other hand, runs at about 60 KB/s, which takes some 45 minutes for the 150 MB .ogv. So: if I first encode, wait for the encoding process to finish, and then upload, it takes approximately

15 min + 45 min = 1 hr

to complete the operation.

So, I thought it would be better if I could somehow start the upload in parallel with the encoding; then, in principle - since the upload (in transferred bytes/sec) is slower than the encoding (in generated bytes/sec) - the upload would always "trail behind" the encoder, and so the whole operation (enc+upl) would complete in just the 45 minutes of the upload, give or take a few minutes depending on the actual upload speed on the wire.
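
In other words, in the ideal overlapped case, the total time should be bounded by the slower of the two stages (using the same estimates as above):

max(15 min, 45 min) = 45 min

... plus whatever small head start the encoder needs before the upload has any bytes to send.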

My first idea was to pipe the output of ffmpeg2theora to tee (so as to keep a local copy of the .ogv), and then pipe it further to ssh - as in:

./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 -o /dev/stdout MVI.AVI | tee MVI.ogv | ssh user@remote "cat > ~/myvids/MVI.ogv"

While this command does indeed work, the running log from ffmpeg2theora in the terminal shows a predicted completion time of about 1 hour; that is, there appears to be no benefit in terms of total enc+upl time. (It is possible this was just network congestion at the time, but it seems to me that ffmpeg2theora is being throttled by the pipe itself: once the kernel's pipe buffer fills up, each write by ffmpeg2theora blocks until ssh has pushed some data out over the network, so the encoder effectively runs at upload speed - which would also explain how ffmpeg2theora can shift its completion estimate like that. Then again, maybe the estimate is wrong, and the operation would indeed complete in 45 min - dunno, I never had the patience to wait and time the process; I just get pissed at the 1 hr estimate and hit Ctrl-C ;) ...)
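
One way to test that theory, assuming the pv (pipe viewer) utility is installed, would be to splice it into the pipeline and watch the actual throughput; if the rate hovers around the ~60 KB/s upload speed rather than the encoder's native output rate, the pipe really is throttling the encoder:

./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 -o /dev/stdout MVI.AVI | pv | tee MVI.ogv | ssh user@remote "cat > ~/myvids/MVI.ogv"    # pv prints the data rate flowing through the pipe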

My second attempt was to run the encoding process in one terminal window, i.e.:

./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 MVI.AVI      # MVI.ogv is auto name for output

..., and the uploading process, using scp, in another terminal window (thereby "forcing" parallelization):

scp MVI.ogv user@remote:~/myvids/

The problem here is: let's say that at the time scp starts, ffmpeg2theora has already encoded 5 MB of the output .ogv file. scp takes this 5 MB to be the entire file size, starts uploading - and exits when it reaches the 5 MB mark; meanwhile, ffmpeg2theora may have produced another 15 MB, making the .ogv 20 MB in total by the time scp exits (having transferred only the first 5 MB).

Then I learned (joen.dk » Tip: scp Resume) that rsync supports 'resume' of partially completed uploads, as in:

rsync --partial --progress myFile remoteMachine:dirToPutIn/

..., so I tried using rsync instead of scp - but it behaves exactly the same as scp in terms of file size: it only transfers up to the size the file had when the process started, and then it exits.

So, my question to the community is: Is there a way to parallelize the encoding and uploading process, so as to gain the decrease in total processing time?

I'm guessing there could be several ways, as in:

  • A command-line option (that I haven't seen) that forces scp/rsync to continuously re-check the file size while the file is open for writing by another process (then I could simply run the upload in another terminal window)
  • A bash script, say running rsync --partial in a while loop that keeps going as long as the .ogv file is open for writing by another process (a rough sketch of this follows the list). I don't actually like this solution, since I can hear the hard disk seeking for the resume point every time rsync --partial restarts - which, I guess, cannot be good while the same file is still being written to
  • A different tool (other than scp/rsync) that does support uploading a "currently generated"/"unfinished" file (the assumption being that it only needs to handle growing files; it would exit if it found the local file suddenly smaller than the bytes already transferred)
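
To make the second idea concrete, here is a rough, untested sketch of such a loop (user@remote:~/myvids/ is a placeholder destination, and lsof only checks that some process still has the file open, not specifically for writing):

#!/bin/bash
# Sketch of idea 2: keep re-running rsync --partial while the encoder
# (started separately, as above) still holds MVI.ogv open.
FILE=MVI.ogv
DEST=user@remote:~/myvids/
while lsof -- "$FILE" > /dev/null 2>&1; do      # encoder still has the file open?
    rsync --partial --progress "$FILE" "$DEST"
    sleep 5                                     # pause between passes instead of hammering the disk
done
rsync --partial --progress "$FILE" "$DEST"      # final pass for the tail written after the last loop run

If your rsync build has the --append option, adding it might also spare the disk the re-scan for the resume point - though I haven't checked how it interacts with --partial.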

... but it could also be that I'm overlooking something - and 1 hr is as good as it gets (in other words, maybe it is logically impossible to achieve a 45 min total, even when parallelizing) :)

Well, I look forward to comments that would, hopefully, clarify this for me ;)

Thanks in advance,
Cheers!

Xylograph answered 27/11, 2010 at 9:12

Maybe you can try sshfs (http://fuse.sourceforge.net/sshfs.html). This being a file system, it should have some optimization, though I am not very sure.
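
For illustration, a minimal sketch of what that might look like (user@remote, the remote myvids directory and the local mount point are placeholders; assumes sshfs and FUSE are installed):

mkdir -p ~/mnt/myvids
sshfs user@remote:myvids ~/mnt/myvids
./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 -o ~/mnt/myvids/MVI.ogv MVI.AVI    # encoder writes straight onto the remote
fusermount -u ~/mnt/myvids    # unmount when done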

Shalon answered 13/1, 2011 at 22:0
thanks for the suggestion, but I don't think that using a filesystem instead will make much difference; essentially because the pipe sequence ffmpeg2theora .. | tee .. | ssh .. makes ffmpeg2theora wait until ssh has written a packet, i.e. it is still serial in nature - and even if I replace it with ffmpeg2theora .. | tee .. > /sshfs/.., I suspect the final write will still "block" (as the writing is still limited by network latency). I guess I'm looking for a way to parallelize the processes as if in threads, but without coding my own C solution :) - Xylograph
