Can I write identical xlsx files from the same data frame in R?
Asked Answered
P

2

7

Can I make sure that two XLSX files (written with openxlsx::write.xlsx) are identical, when given the same data to write? I think there's a timestamp written to the spreadsheet which means the same data written more than one second apart creates a different file.

For example, when written in rapid succession:

library(openxlsx)
write.xlsx(mtcars, "/tmp/t1.xlsx");write.xlsx(mtcars, "/tmp/t2.xlsx")

the files are identical:

$ md5sum /tmp/t?.xlsx
c9b5f6509e20dd62b158debfbef376fe  /tmp/t1.xlsx
c9b5f6509e20dd62b158debfbef376fe  /tmp/t2.xlsx

but if I sleep between writes:

unlink("/tmp/t1.xlsx") # remove previous
unlink("/tmp/t2.xlsx")
write.xlsx(mtcars, "/tmp/t1.xlsx");Sys.sleep(2);write.xlsx(mtcars, "/tmp/t2.xlsx")

its all different:

$ md5sum /tmp/t?.xlsx
460945a610df3bc8a1ccdae9eb86c1fa  /tmp/t1.xlsx
a4865be49994092173792c9f7354e435  /tmp/t2.xlsx

My use case is a process that generates an XLSX file which is going into a git repository. If I automate this, then the XLSX file is going to change every time, even if the source data hasn't. I suppose could test if the data has changed earlier in the process and not generate a new XLSX file, but it just seems easier to let git do the "has this changed?" test but clearly invisible metadata in the XLSX breaks that. Call me lazy.

Can the XLSX metadata be set to prevent this? I guess maybe there's a "Creation Date" in there somewhere. I don't care if this is 1970-01-01 all the time.

Pre-emptive defence: No, I can't use a CSV, the XLSX has multiple sheets and that's what my end-users want. Yes, I'm already also writing it to a SQlite database and that is identical when writing the same data to it.

I don't think this can be done with openxlsx as is, since the difference is due to the metadata XML created: https://github.com/ycphs/openxlsx/blob/7742063a4473879490d789c552bb8e6cc9a0d2c7/R/baseXML.R#L77 where it puts the current Sys.time() into the created field.

There appear to be two sources of difference. First, there's the Excel metadata written to the <dcterms:created> metadata within the MS Excel document structure. But even setting that the same (by monkey-patching openxlsx) still leaves a difference because the document is bundled using standard ZIP format and that also has datestamp headers.

Here's two XLSX files, unzipped, and this shows all the same CRC-32 values, so the files within are identical:

Archive:  test1.xlsx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     587  Defl:N      234  60% 2022-01-31 15:22 b5dbec60  _rels/.rels
    1402  Defl:N      362  74% 2022-01-31 15:22 63422601  [Content_Types].xml
     284  Defl:N      173  39% 2022-01-31 15:22 f9153db0  docProps/app.xml
     552  Defl:N      278  50% 2022-01-31 15:22 37126cbe  docProps/core.xml
     696  Defl:N      229  67% 2022-01-31 15:22 14a147d3  xl/_rels/workbook.xml.rels
    4500  Defl:N      311  93% 2022-01-31 15:22 285db1ad  xl/printerSettings/printerSettings1.bin
     601  Defl:N      203  66% 2022-01-31 15:22 211e1d6e  xl/sharedStrings.xml
    1127  Defl:N      464  59% 2022-01-31 15:22 0d8ee71d  xl/styles.xml
    7075  Defl:N     1361  81% 2022-01-31 15:22 050f988c  xl/theme/theme1.xml
     950  Defl:N      382  60% 2022-01-31 15:22 1b8cce29  xl/workbook.xml
     612  Defl:N      223  64% 2022-01-31 15:22 f0584777  xl/worksheets/_rels/sheet1.xml.rels
   12729  Defl:N     2204  83% 2022-01-31 15:22 18057777  xl/worksheets/sheet1.xml
--------          -------  ---                            -------
   31115             6424  79%                            12 files
$ unzip -v test2.xlsx
Archive:  test2.xlsx
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     587  Defl:N      234  60% 2022-01-31 15:22 b5dbec60  _rels/.rels
    1402  Defl:N      362  74% 2022-01-31 15:22 63422601  [Content_Types].xml
     284  Defl:N      173  39% 2022-01-31 15:22 f9153db0  docProps/app.xml
     552  Defl:N      278  50% 2022-01-31 15:22 37126cbe  docProps/core.xml
     696  Defl:N      229  67% 2022-01-31 15:22 14a147d3  xl/_rels/workbook.xml.rels
    4500  Defl:N      311  93% 2022-01-31 15:22 285db1ad  xl/printerSettings/printerSettings1.bin
     601  Defl:N      203  66% 2022-01-31 15:22 211e1d6e  xl/sharedStrings.xml
    1127  Defl:N      464  59% 2022-01-31 15:22 0d8ee71d  xl/styles.xml
    7075  Defl:N     1361  81% 2022-01-31 15:22 050f988c  xl/theme/theme1.xml
     950  Defl:N      382  60% 2022-01-31 15:22 1b8cce29  xl/workbook.xml
     612  Defl:N      223  64% 2022-01-31 15:22 f0584777  xl/worksheets/_rels/sheet1.xml.rels
   12729  Defl:N     2204  83% 2022-01-31 15:22 18057777  xl/worksheets/sheet1.xml

but the files still differ:

$ md5sum test1.xlsx test2.xlsx 
27783e8b19631039a1c940db214f25e1  test1.xlsx
ba0678946aea1e01093ce25130b2c467  test2.xlsx

because of the ZIP metadata, visible with exiftool:

$ exiftool test*.xlsx | grep Zip | grep Date
Zip Modify Date                 : 2022:01:31 15:22:52
Zip Modify Date                 : 2022:01:31 15:22:54
Primateship answered 31/1, 2022 at 12:50 Comment(6)
If you open in Excel both versions, save them to xml, you'll see that the only difference lies in the field <Created>2022-01-31T14:24:33Z</Created>Fermi
@Fermi I've unzipped the excel files (.xlsx files are .zip files in disguise) and found the exact differing metadata entry, and where openxlsx sets this. Can't see an easy way round that short of forking the package. Maybe another package...Primateship
What about setting the system time prior to xlsx generation with system command? And resetting it after? Something like system(sudo date --set [...]). The drawback might be in different delays between setting the date and xls...Inveigh
@Primateship why do you care? A change in the file metadata, especially in Created doesn't mean the data is different. This doesn't mean the results aren't reproducible. You'd only care if you wanted to use the file's hash/checksum to check whether the file has changed instead of comparing the dataMitzi
@Primateship you'd also get differences if a different user produced the same file at the exact same time, simply because a different user name would be savedMitzi
@PanagiotisKanavos I care because git can't tell the difference between two excel files with just a different create time, and two excel files with totally different data. I only want to commit to git when the data is different.Primateship
F
1

A possible workaround is to redefine genBaseCore function in openxlsx namespace using assignInNamespace.

In the example below, xlsx file is created one day before Sys.time():

library(openxlsx)

genBaseCore <- function(creator = "", title = NULL, subject = NULL, category = NULL) {
  
  replaceIllegalCharacters <- function(v){
    
    vEnc <- Encoding(v)
    v <- as.character(v)
    
    flg <- vEnc != "UTF-8"
    if(any(flg))
      v[flg] <- iconv(v[flg], from = "", to = "UTF-8")
    
    v <- gsub('&', "&amp;", v, fixed = TRUE)
    v <- gsub('"', "&quot;", v, fixed = TRUE)
    v <- gsub("'", "&apos;", v, fixed = TRUE)
    v <- gsub('<', "&lt;", v, fixed = TRUE)
    v <- gsub('>', "&gt;", v, fixed = TRUE)
    
    ## Escape sequences
    v <- gsub("\a", "", v, fixed = TRUE)
    v <- gsub("\b", "", v, fixed = TRUE)
    v <- gsub("\v", "", v, fixed = TRUE)
    v <- gsub("\f", "", v, fixed = TRUE)
    
    return(v)
  }
  
  core <- '<coreProperties xmlns="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
  
  core <- stringi:::stri_c(core, sprintf("<dc:creator>%s</dc:creator>", replaceIllegalCharacters(creator)))
  core <- stringi:::stri_c(core, sprintf("<cp:lastModifiedBy>%s</cp:lastModifiedBy>", replaceIllegalCharacters(creator)))

# Modify creation date here
  core <- stringi:::stri_c(core, sprintf('<dcterms:created xsi:type="dcterms:W3CDTF">%s</dcterms:created>', format(Sys.time()-86400, "%Y-%m-%dT%H:%M:%SZ")))
  
  if (!is.null(title)) {
    core <- stringi:::stri_c(core, sprintf("<dc:title>%s</dc:title>", replaceIllegalCharacters(title)))
  }
  
  if (!is.null(subject)) {
    core <- stringi:::stri_c(core, sprintf("<dc:subject>%s</dc:subject>", replaceIllegalCharacters(subject)))
  }
  
  if (!is.null(category)) {
    core <- stringi:::stri_c(core, sprintf("<cp:category>%s</cp:category>", replaceIllegalCharacters(category)))
  }
  
  core <- stringi:::stri_c(core, "</coreProperties>")
  
  return(core)
}


assignInNamespace("genBaseCore", genBaseCore, ns="openxlsx")

write.xlsx(mtcars, "test.xlsx")
<Created>2022-01-30T15:13:27Z</Created>
Fermi answered 31/1, 2022 at 14:18 Comment(1)
Close. With this monkey patching, you get identical contents in the XLSX, but the XLSX is a zip file, and there's a ZIP header in each file, and that... has a modification time! I'll add this to my question...Primateship
S
0

You can try a simpler wrapper for comparing workbook objects (assuming you'd read in the previous workbook) and compare it to the current workbook.

library(openxlsx)
file1 <- temp_xlsx()
file2 <- temp_xlsx()
write.xlsx(mtcars, file1)
Sys.sleep(2)
write.xlsx(mtcars, file2)

wb1 <- loadWorkbook(file1)
wb2 <- loadWorkbook(file2)

all_equal_wb <- function(target, current) {
  exp <- "Workbook"
  attr(exp, "package") <- "openxlsx"
  stopifnot(identical(class(target), exp), identical(class(current), exp))
  target <- target$copy()
  current <- current$copy()
  target$core <- ""
  current$core <- ""
  # openxlsx::all.equal.Workbook
  all.equal(target, current)
}

all.equal(wb1, wb2)
#> [1] "Component \"core\": 1 string mismatch"
all_equal_wb(wb1, wb2)
#> [1] TRUE

Created on 2022-01-31 by the reprex package (v2.0.1)

openxlsx::all.equal.Workbook() doesn't have this sort of control (... is ignored) but it can be added. The package is still maintained so feel free to leave an issue: https://github.com/ycphs/openxlsx/issues

Septivalent answered 31/1, 2022 at 14:51 Comment(4)
Having identical workbook objects in R doesn't address the problem of having differing XLSX files on a git repo for git to think there's been a change.Primateship
So the process that builds the xlsx wouldn't be able to check if the new workbook needs to be uploaded into the repo? As stated elsewhere the only difference that you're seeing is the creation time stored in the XML. I suppose this is something that could be implemented into openxlsx (to modify the creation time) but then I suppose you'd have to be fine with using a fake time for this (not sure if excluding the time would have a bad side effect)Septivalent
It seems easier to build the data frames, dump it to XLSX, and then let git decide if its different than build the data, read in the exiting data, compare new to old, and then not write if there's no change. Thats what I mean by "I could test if the data has changed earlier in the process" but am trying to avoid.Primateship
It might have seemed easier but doesn't look like it anymore if there are multiple timestamps that would have to be managed. I think that's your answer.Septivalent

© 2022 - 2024 — McMap. All rights reserved.