How do I parse an Excel file that will give me data exactly as it appears visually?
Asked Answered
G

4

7

I'm on Rails 5 (Ruby 2.4). I want to read an .xls doc and I would like to get the data into CSV format, just as it appears in the Excel file. Someone recommended I use Roo, and so I have

book = Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
text = sheet.to_csv
arr_of_arrs = CSV.parse(text)

However what is getting returned is not the same as what I see in the spreadsheet. For isntance, a cell in the spreadsheet has

16:45.81

and when I get the CSV data from above, what is returned is

"0.011641319444444444"

How do I parse the Excel doc and get exactly what I see? I don't care if I use Roo to parse or not, just as long as I can get CSV data that is a representation of what I see rather than some weird internal representation. For reference the file type I was parsing givies this when I run "file name_of_file.xls" ...

Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Dwight Schroot, Last Saved By: Dwight Schroot, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 21 17:05:21 2010, Last Saved Time/Date: Wed Oct 13 16:52:14 2010, Security: 0
Granulocyte answered 28/3, 2017 at 19:5 Comment(2)
I tried your replicate your scenario but no such thing occurring with my tests. Maybe some formatting is there in file you're using. If possible post a link to the file ( 3 rows of data will be enough)Extemporize
I'm not able to post a link to the file but is there some other info I can provide? Some way I can analyze the cell and give you the info you're looking for?Granulocyte
G
3

You need to save the custom formula in a text format on the .xls side. If your opening the .xls file from the internet this won't work but this will fix your problem if you can manipulate the file. You can do this using the function =TEXT(A2, "mm:ss.0") A2 is just the cell I'm using as an example.

enter image description here

book = ::Roo::Spreadsheet.open(file_location)
puts book.cell('B', 2) 
=> '16.45.8' 

If manipulating the file is not an option you could just pass a custom converter to CSV.new() and convert the decimal time back to the correct format you need.

require 'roo-xls'
require 'csv'

CSV::Converters[:time_parser] = lambda do |field, info| 
  case info[:header].strip
  when "time" then  begin 
                      # 0.011641319444444444 * 24 hours * 3600 seconds = 1005.81 
                      parse_time =  field.to_f * 24 * 3600
                      # 1005.81.divmod(60) = [16, 45.809999999999999945]
                      mm, ss = parse_time.divmod(60)
                      # returns "16:45.81"
                      time = "#{mm}:#{ss.round(2)}"  
                      time 
                    rescue
                      field 
                    end
  else 
    field  
  end
end

book = ::Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
csv = CSV.new(sheet.to_csv, headers: true, converters: [:time_parser]).map {|row| row.to_hash}
puts csv 
=> {"time "=>"16:45.81"}
   {"time "=>"12:46.0"}
Gravy answered 5/4, 2017 at 2:25 Comment(7)
I'm downloading the file from teh Internet, but its not an option to manually stop the automated process I'm running, manipulate the file by hand, and then re-kick off the process. If you have a way of including code to execute yoru suggestion, I'm in.Granulocyte
Hey thanks for adding this converter. I gave that a go but it didn't take -- still getting "0.011641319444444444" for the cell "16:45.81"? When I right lcick on my cell in Excel (Office 2010 Professional) and select "Format Cell", it defaults to the "Number" tab, has "Custom" selected in the "Category" field and "mm:ss.0" selected in the "Type" field. Any way to make a converter so it presents the data the same way Excel does when you save the file to a CSV (taht is, the way it appears in the Excel doc)?Granulocyte
You're talking about this line -- "time = "#{mm}:#{ss.round(2)}", you're saying change that to match the format I encountered? By extension, woudl I need to have a case statement for every conceivable type of format?Granulocyte
Change when 'time' then I am running the case statement off of case info[:header].strip. time = "#{mm}:#{ss.round(2)}" is just a local variable. If you look at the excel image I uploaded I have 2 headers, time and time_text. The decimals in the example I want to be formated are in the row with "time" as the header. With your question on case statement and type of format. No you should be able to use Ruby's CSV built in converters to format most problems.Gravy
I'm not guaranteed taht the .xls headers will conform to any particular naming conventions. Is there a way to auto-apply your converter to any cell in the document when that converter is needed?Granulocyte
You have the field and info argument. info will return a struct #<struct CSV::FieldInfo index=0, line=2, header="time"> and field will return the excel cells attribute. You need to figure out a way to pattern match your needs.Gravy
My needs are to export to CSV the same data I can visually see when I open the XLS doc in Office. I don't want my data to look differently in the CSV than it does in Excel. I'm a novice at this so I need a lot of hand-holding, as you can tell from my questions. I appreciate your help all the same.Granulocyte
E
1

Under the hood roo-xls gem uses the spreadsheet gem to parse the xls file. There was a similar issue to yours logged here, but it doesn't appear that there was any real resolution. Internally xls stores 16:45.81 as a Number and associates some formatting with it. I believe the issue has something to do with the spreadsheet gem not correctly handling the cell format.

I did try messing around with adding a format mm:ss.0 by following this guide but I couldn't get it to work, maybe you'll have more luck.

Eppes answered 2/4, 2017 at 0:57 Comment(3)
I'm not clear on what you're suggesting here. WHat code shoudl I try out to attempt to get an answer here (remember links will break over time so its better to cut and paste code right into your answer).Granulocyte
I don't think what you're trying to do is possible right now without making changes to the spreadsheet gem :/ You could try checking out this guide though, it seems like it might be related, specifically something like: book.add_format Format.new(:number_format => 'mm:ss:0') I got the mm:ss:0 format by looking in excel at what the format of that cell is, but when I tried it out I couldn't get it to work with your use case.Eppes
As I said in my question, I'm not tied to using any particular type of gem. If spreadsheet doesn't do it, then I'm happy to use a different one. Also, I'm the eternal optimist and think everything is possible, so I'm not going to award a bounty if the answer is "it can't be done."Granulocyte
J
0

You can use converters option. It seems looking like this:

arr_of_arrs = CSV.parse(text, {converters: :date_time})

http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html

Jaggery answered 28/3, 2017 at 21:1 Comment(3)
Thanks but that didn't work. The string I posted still appears in decimal form -- "0.011641319444444444" .Granulocyte
Ok please give me link to original document. When I am creating xlsx document from Google docs Roo::Spreadsheet works correctly.Jaggery
Yeah but this isn't .xlsx, its .xls. I included more detailed file type information in my qestuion, the output of the "file" command.Granulocyte
D
0

Your problem seems to be with the way you're parsing (reading) the input file.

roo parses only Excel 2007-2013 (.xlsx) files. From you question, you want to parse .xls, which is a different format.

Like the documentation says, use the roo-xls gem instead.

Demonstrable answered 1/4, 2017 at 19:14 Comment(1)
I'm already using that. I included "gem 'roo-xls'" in my Gemfile.Granulocyte

© 2022 - 2024 — McMap. All rights reserved.