Can I make sure that two XLSX files (written with openxlsx::write.xlsx
) are identical, when given the same data to write? I think there's a timestamp written to the spreadsheet which means the same data written more than one second apart creates a different file.
For example, when written in rapid succession:
library(openxlsx)
write.xlsx(mtcars, "/tmp/t1.xlsx");write.xlsx(mtcars, "/tmp/t2.xlsx")
the files are identical:
$ md5sum /tmp/t?.xlsx
c9b5f6509e20dd62b158debfbef376fe /tmp/t1.xlsx
c9b5f6509e20dd62b158debfbef376fe /tmp/t2.xlsx
but if I sleep between writes:
unlink("/tmp/t1.xlsx") # remove previous
unlink("/tmp/t2.xlsx")
write.xlsx(mtcars, "/tmp/t1.xlsx");Sys.sleep(2);write.xlsx(mtcars, "/tmp/t2.xlsx")
its all different:
$ md5sum /tmp/t?.xlsx
460945a610df3bc8a1ccdae9eb86c1fa /tmp/t1.xlsx
a4865be49994092173792c9f7354e435 /tmp/t2.xlsx
My use case is a process that generates an XLSX file which is going into a git repository. If I automate this, then the XLSX file is going to change every time, even if the source data hasn't. I suppose could test if the data has changed earlier in the process and not generate a new XLSX file, but it just seems easier to let git do the "has this changed?" test but clearly invisible metadata in the XLSX breaks that. Call me lazy.
Can the XLSX metadata be set to prevent this? I guess maybe there's a "Creation Date" in there somewhere. I don't care if this is 1970-01-01 all the time.
Pre-emptive defence: No, I can't use a CSV, the XLSX has multiple sheets and that's what my end-users want. Yes, I'm already also writing it to a SQlite database and that is identical when writing the same data to it.
I don't think this can be done with openxlsx
as is, since the difference is due to the metadata XML created: https://github.com/ycphs/openxlsx/blob/7742063a4473879490d789c552bb8e6cc9a0d2c7/R/baseXML.R#L77 where it puts the current Sys.time()
into the created
field.
There appear to be two sources of difference. First, there's the Excel metadata written to the <dcterms:created>
metadata within the MS Excel document structure. But even setting that the same (by monkey-patching openxlsx
) still leaves a difference because the document is bundled using standard ZIP format and that also has datestamp headers.
Here's two XLSX files, unzipped, and this shows all the same CRC-32 values, so the files within are identical:
Archive: test1.xlsx
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
587 Defl:N 234 60% 2022-01-31 15:22 b5dbec60 _rels/.rels
1402 Defl:N 362 74% 2022-01-31 15:22 63422601 [Content_Types].xml
284 Defl:N 173 39% 2022-01-31 15:22 f9153db0 docProps/app.xml
552 Defl:N 278 50% 2022-01-31 15:22 37126cbe docProps/core.xml
696 Defl:N 229 67% 2022-01-31 15:22 14a147d3 xl/_rels/workbook.xml.rels
4500 Defl:N 311 93% 2022-01-31 15:22 285db1ad xl/printerSettings/printerSettings1.bin
601 Defl:N 203 66% 2022-01-31 15:22 211e1d6e xl/sharedStrings.xml
1127 Defl:N 464 59% 2022-01-31 15:22 0d8ee71d xl/styles.xml
7075 Defl:N 1361 81% 2022-01-31 15:22 050f988c xl/theme/theme1.xml
950 Defl:N 382 60% 2022-01-31 15:22 1b8cce29 xl/workbook.xml
612 Defl:N 223 64% 2022-01-31 15:22 f0584777 xl/worksheets/_rels/sheet1.xml.rels
12729 Defl:N 2204 83% 2022-01-31 15:22 18057777 xl/worksheets/sheet1.xml
-------- ------- --- -------
31115 6424 79% 12 files
$ unzip -v test2.xlsx
Archive: test2.xlsx
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
587 Defl:N 234 60% 2022-01-31 15:22 b5dbec60 _rels/.rels
1402 Defl:N 362 74% 2022-01-31 15:22 63422601 [Content_Types].xml
284 Defl:N 173 39% 2022-01-31 15:22 f9153db0 docProps/app.xml
552 Defl:N 278 50% 2022-01-31 15:22 37126cbe docProps/core.xml
696 Defl:N 229 67% 2022-01-31 15:22 14a147d3 xl/_rels/workbook.xml.rels
4500 Defl:N 311 93% 2022-01-31 15:22 285db1ad xl/printerSettings/printerSettings1.bin
601 Defl:N 203 66% 2022-01-31 15:22 211e1d6e xl/sharedStrings.xml
1127 Defl:N 464 59% 2022-01-31 15:22 0d8ee71d xl/styles.xml
7075 Defl:N 1361 81% 2022-01-31 15:22 050f988c xl/theme/theme1.xml
950 Defl:N 382 60% 2022-01-31 15:22 1b8cce29 xl/workbook.xml
612 Defl:N 223 64% 2022-01-31 15:22 f0584777 xl/worksheets/_rels/sheet1.xml.rels
12729 Defl:N 2204 83% 2022-01-31 15:22 18057777 xl/worksheets/sheet1.xml
but the files still differ:
$ md5sum test1.xlsx test2.xlsx
27783e8b19631039a1c940db214f25e1 test1.xlsx
ba0678946aea1e01093ce25130b2c467 test2.xlsx
because of the ZIP metadata, visible with exiftool
:
$ exiftool test*.xlsx | grep Zip | grep Date
Zip Modify Date : 2022:01:31 15:22:52
Zip Modify Date : 2022:01:31 15:22:54
Excel
both versions, save them toxml
, you'll see that the only difference lies in the field<Created>2022-01-31T14:24:33Z</Created>
– Fermiopenxlsx
sets this. Can't see an easy way round that short of forking the package. Maybe another package... – Primateshipsystem(sudo date --set [...])
. The drawback might be in different delays between setting the date and xls... – InveighCreated
doesn't mean the data is different. This doesn't mean the results aren't reproducible. You'd only care if you wanted to use the file's hash/checksum to check whether the file has changed instead of comparing the data – Mitzi