I have a directory of <20MB PDF files (each PDF represents an ad) on an AWS EC2 large instance. I'm trying to upload each PDF file to S3 using Ruby and DM-Paperclip.
Most files upload successfully but some seem to take hours with the CPU hanging at 100%. I've located the line of code that causes the issue by printing debug statements in the relevant section.
# Takes an array of pdf file paths and uploads each to S3 using dm-paperclip
def save_pdfs(pdf_files)
  pdf_files.each do |path|
    pdf = File.open(path)
    ad = Ad.new
    ad.pdf.assign(pdf) # <= Last debug statement is printed before this line
    begin
      ad.save
    rescue => e
      # log error
    ensure
      pdf.close
    end
  end
end
To help troubleshoot the issue I attached strace to the process while it was stuck at 100%. The result was hundreds of thousands of lines like this:
...
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3543, ...}) = 0
... 500K lines
Followed by a few thousand:
...
brk(0x1224d0000) = 0x1224d0000
brk(0x1224f3000) = 0x1224f3000
brk(0x122514000) = 0x122514000
...
During an upload that doesn't hang, strace looks like this:
...
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
fstat(12, {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
write(12, "%PDF-1.3\n%\342\343\317\323\n8 0 obj\n<</Filter"..., 4096) = 4096
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
write(12, "S\34\367\23~\277u\272,h\204_\35\215\35\341\347\324\310\307u\370#\364\315\t~^\352\272\26\374"..., 4096) = 4096
ppoll([{fd=12, events=POLLOUT}], 1, NULL, NULL, 8) = 1 ([{fd=12, revents=POLLOUT}])
write(12, "\216%\267\2454`\350\177\4\36\315\211\7B\217g\33\217!e\347\207\256\264\245vy\377\304\256\307\375"..., 4096) = 4096
...
The PDF files that cause this issue seem random. They are all valid PDF files, and they are all relatively small. They range from ~100KB to ~50MB.
Is the strace with the seemingly excessive stat system calls related to my issue?
The ensure block is not executed when an exception occurs unless the exception is raised by ad.save. In this case, ad.pdf.assign(pdf) might be raising an exception, and the file would not be closed. That may have happened several hundred times before the file that's taking 100% CPU usage, leaving you with references to hundreds of files. If you wrap everything in a block and pass it to File.open, then you can be sure the file will always be closed correctly. Depending on how many files you are dealing with, that may improve performance significantly. – Oversold
<attachment>_file_size parameter (in HTTP: Content-Length header). – Loudspeaker
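For illustration, the block form of File.open suggested in the first comment might look like the sketch below. FakeAd and its assign_pdf method are hypothetical stand-ins for the Ad model and ad.pdf.assign, so the example runs without dm-paperclip or S3; save_pdfs_sketch is an invented name as well. The closed flag is recorded only to show that the handle really is closed after a failure.

```ruby
# Hypothetical stand-in for the Ad model (the real one uses dm-paperclip).
class FakeAd
  # Stands in for ad.pdf.assign(pdf); raises on an empty file to simulate a failure.
  def assign_pdf(io)
    raise ArgumentError, "empty file" if io.size.zero?
    @assigned_bytes = io.size
  end

  # Stands in for ad.save; always succeeds in this sketch.
  def save
    true
  end
end

# Uploads each path using the block form of File.open, which closes the
# handle when the block exits, whether normally or through an exception.
# Returns an array of [path, exception class, handle closed?] per failure.
def save_pdfs_sketch(pdf_files, ad_class: FakeAd)
  failures = []
  pdf_files.each do |path|
    handle = nil
    begin
      File.open(path, "rb") do |pdf|
        handle = pdf
        ad = ad_class.new
        ad.assign_pdf(pdf) # now covered by the same begin/rescue as save
        ad.save
      end
    rescue => e
      # The real code would log the error here.
      failures << [path, e.class, handle&.closed?]
    end
  end
  failures
end
```

Because both the assignment and the save happen inside the block, an exception from either one still leaves the file closed, and one bad file does not stop the rest of the loop. Whether this has any bearing on the 100% CPU hang is a separate question that the sketch does not settle.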