Filter data highlighted in Excel by cell fill color using openxlsx
Asked Answered
P

3

6

I have a big Excel table (18k rows and 400 cols) which has some rows highlighted using different colors. Is there a way to filter the rows by colors using openxlsx?

I first loaded the workbook

wb <- loadWorkbook(file = "Items Comparison.xlsx")
getStyles(wb)
df <- read.xlsx(wb, sheet = 1)

I see the styles used in the workbook using getStyles(wb), but not sure how to use that information to filter all cells for each column by colors.

[[1]]
A custom cell style. 

 Cell formatting: GENERAL 
 Font name: Tahoma 
 Font size: 9 
 Font colour: #FFFFFF 
 Font decoration: BOLD 
 Cell borders: Top: thin, Bottom: thin, Left: thin, Right: thin 
 Cell border colours: #4E648A, #4E648A, #4E648A, #4E648A 
 Cell vert. align: top 
 Cell fill foreground:  rgb: #384C70 
 Cell fill background:  rgb: #384C70 
 wraptext: TRUE 


[[2]]
A custom cell style. 

 Cell formatting: GENERAL 
 Font name: Tahoma 
 Font size: 9 
 Font colour: #FFFFFF 
 Font decoration: BOLD 
 Cell borders: Top: thin, Bottom: thin, Left: thin, Right: thin 
 Cell border colours: #4E648A, #4E648A, #4E648A, #4E648A 
 Cell vert. align: top 
 Cell fill foreground:  rgb: #384C70 
 Cell fill background:  rgb: #384C70 
 wraptext: TRUE 

What can I do to filter data by fill colors?

UPDATE

Based on @Henrik solution, I tried to use his code but I kept getting error. So, to understand what was going on, I printed the output of x$style$fill$fillFg

       rgb 
"FF384C70" 
       rgb 
"FF384C70" 
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
       rgb 
"FF384C70" 
NULL
NULL
NULL
       rgb 
"FFFFFF00" 
       rgb 
"FFFFFF00" 
 theme 
   "0" 
 theme 
   "0" 
       rgb 
"FFFFFF00" 
NULL
 theme 
   "2" 
                theme                  tint 
                  "4" "0.79998168889431442" 
 theme 
   "8" 
 theme 
   "8" 
       rgb 
"FFFFC000" 
       rgb 
"FFFFC000" 
                theme                  tint 
                  "5" "0.39997558519241921" 
                theme                  tint 
                  "5" "0.39997558519241921" 
                theme                  tint 
                  "9" "0.39997558519241921" 
                theme                  tint 
                  "5" "0.79998168889431442" 
       rgb 
"FFFFFF00" 
       rgb 
"FF384C70" 
NULL
NULL
NULL
       rgb 
"FF384C70" 
       rgb 
"FF384C70" 
[[1]]
       rgb 
"FF384C70" 

[[2]]
       rgb 
"FF384C70" 

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL

[[6]]
NULL

[[7]]
NULL

[[8]]
NULL

[[9]]
NULL

[[10]]
NULL

[[11]]
NULL

[[12]]
NULL

[[13]]
       rgb 
"FF384C70" 

[[14]]
NULL

[[15]]
NULL

[[16]]
NULL

[[17]]
       rgb 
"FFFFFF00" 

[[18]]
       rgb 
"FFFFFF00" 

[[19]]
 theme 
   "0" 

[[20]]
 theme 
   "0" 

[[21]]
       rgb 
"FFFFFF00" 

[[22]]
NULL

[[23]]
 theme 
   "2" 

[[24]]
                theme                  tint 
                  "4" "0.79998168889431442" 

[[25]]
 theme 
   "8" 

[[26]]
 theme 
   "8" 

[[27]]
       rgb 
"FFFFC000" 

[[28]]
       rgb 
"FFFFC000" 

[[29]]
                theme                  tint 
                  "5" "0.39997558519241921" 

[[30]]
                theme                  tint 
                  "5" "0.39997558519241921" 

[[31]]
                theme                  tint 
                  "9" "0.39997558519241921" 

[[32]]
                theme                  tint 
                  "5" "0.79998168889431442" 

[[33]]
       rgb 
"FFFFFF00" 

[[34]]
       rgb 
"FF384C70" 

[[35]]
NULL

[[36]]
NULL

[[37]]
NULL

[[38]]
       rgb 
"FF384C70" 

[[39]]
       rgb 
"FF384C70" 

I'm still confused why there're only 39 items. total number of rows is variable but not 39. I'm also not understanding the operation - is it rowwise or columnwise?

Papilloma answered 22/6, 2020 at 16:45 Comment(2)
Does it have to be the openxlsx-package, or are other packages an option?Coorg
other packages are also an option. I just found tidyxl which I'm looking at right nowPapilloma
C
8

enter image description here

library(tidyxl)

formats <- xlsx_formats( "./temp/test_file.xlsx" )
cells <- xlsx_cells( "./temp/test_file.xlsx" )

#what colors are used?
formats$local$fill$patternFill$fgColor$rgb
# [1] NA         "FFC00000" "FF00B0F0" NA  

#find rows fo cells  with red background
cells[ cells$local_format_id %in%
         which( formats$local$fill$patternFill$fgColor$rgb == "FFC00000"), 
       "row" ]

# [1] 1
Coorg answered 22/6, 2020 at 19:4 Comment(2)
this is great. This is what I was also reading about. Thanks. I'll let you know what I get from thisPapilloma
turned out it's actually much simpler than this. I followed @nacnudus tutorial and saw how easy it is to filter based on the color. My intention was to simply filter the cells with any colors like the following. cells %>% behead("N", "program") %>% select(address, row, col, character, program, style_format, local_format_id) %>% filter(!is.na(fill[local_format_id])). I'm accepting your response as answer for pointing me to this direction.Papilloma
I
2

In your workbook object, you find the styleObjects element. There you can dig your way to the fill color (style$fill$fillFg) and row elements. Loop over the style objects (lapply), check if the color is desired one (e.g. red, "FFFF0000"; x$style$fill$fillFg == "FFFF0000", and grab the row index (x$rows[1]).

enter image description here

wb <- loadWorkbook(file = "foo.xlsx")
unlist(lapply(wb$styleObjects, function(x){
  x$rows[1][x$style$fill$fillFg == "FFFF0000"]}))

# [1] 3

If the colored cells are non-contiguous, you may want to grab both rows and columns:

enter image description here

l = lapply(wb$styleObjects, function(x){
  if(x$style$fill$fillFg == "FFFF0000"){
    data.frame(ri = x$rows, ci = x$cols, col = "FFFF0000")}})
l[lengths(l) > 0]

# [[1]]
#   ri ci      col
# 1  1  2 FFFF0000
# 2  2  3 FFFF0000
# 3  3  1 FFFF0000
Intwine answered 22/6, 2020 at 19:57 Comment(2)
I like your solution, but I'm getting some errors: Error in if (x$style$fill$fillFg == "FFFF0000") { : argument is of length zero. I updated my question to show the output of x$style$fill$fillFgPapilloma
Thank you for your feedback. My second example was apparently not of sufficient complexity. Can you please update your question with a truly minimal reproducible example which illustrates your issue?Intwine
W
0

A solution using openxlsx package Example below looks for color "FFC000" and looks in columns 1 and 6 The approach first identifies which defined styles have the font color of interest, then looks through the style objects to see which cells those styles have been applied to, returning the index of the rows that match the color and the pre-defined column search. The result would give all rows where at least one cell within the column search has the specified color.

excelwb <- openxlsx::loadWorkbook(excel_file)
strikestyles <- getStyles(excelwb)
goldcolors <- which(sapply(strikestyles,'[[','fontColour')=="FFFFC000") 
goldcols <- c(1,6) #these are the columns that have the gold color of interest -- could also be 1:ncol
goldrows <- lapply(excelwb$styleObjects[goldcolors],
                     function(x) {
                       value_cols <- which(x$cols %in% goldcols) 
                       if (length(value_cols)==0) return(NULL)
                       else return (x$rows[value_cols])
                     })
goldrows <- as.numeric(unlist(goldrows))
Wamsley answered 15/9, 2020 at 19:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.