Getting "Ole::Storage::FormatError: OLE2 signature is invalid" when trying to get content out of a Word doc

I'm using Rails 5. I want to get text out of a Word document (.doc), so I'm using this code:

  text = nil
  MSWordDoc::Extractor.load(file_location) do |contents|
    text = contents.whole_contents
  end

but I'm getting the error below. I have this gem in my Gemfile:

gem 'msworddoc-extractor'

What else do I need to do to get the content out of a Word doc? It would be great if I could apply the same code to .docx files as I do to .doc files.

/Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/support.rb:201: warning: constant ::Fixnum is deprecated
Ole::Storage::FormatError: OLE2 signature is invalid
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:378:in `validate!'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:370:in `initialize'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:112:in `new'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:112:in `load'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:79:in `initialize'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:85:in `new'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:85:in `open'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/msworddoc-extractor-0.2.0/lib/msworddoc/extractor.rb:11:in `load'
    from /Users/davea/Documents/workspace/myproject/app/services/msword_processor_service.rb:12:in `pre_process_data'
    from /Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:88:in `process_race_data'
    from (irb):2
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'
Consueloconsuetude answered 19/3, 2017 at 21:18 Comment(0)

The gem you are using, msworddoc-extractor, depends on the ruby-ole gem. You can see this in its source:

ole = Ole::Storage.open(file)

When you import your Word document it is really being opened by the ruby-ole gem. That gem will raise an exception if it cannot validate that the file is the proper format:

raise FormatError, "OLE2 signature is invalid" unless magic == MAGIC

MAGIC is the 8-byte signature expected at the very start of the .doc file:

# i have seen it pointed out that the first 4 bytes of hex,
# 0xd0cf11e0, is supposed to spell out docfile. hmmm :)
MAGIC = "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # expected value of Header#magic

This matches the signature field in the header of the Compound File Binary Format (CFBF), the container format used by legacy .doc files:

BYTE _abSig[8];             // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
                            // 0x1a, 0xe1} for current version
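
To see what your file actually starts with, you can read its first 8 bytes and compare them to that signature yourself. This is only a minimal sketch using plain Ruby; the names OLE2_MAGIC, header_bytes, ole2_signature? and header_hex are made up for illustration and are not part of ruby-ole or msworddoc-extractor.

```ruby
# The 8-byte signature every OLE2 / CFBF file starts with
# (same bytes as MAGIC above, forced to binary encoding).
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Hypothetical helper: first 8 bytes of the file as a binary string
# (empty string if the file is empty).
def header_bytes(path)
  File.open(path, "rb") { |f| f.read(8) } || "".b
end

# Hypothetical helper: true when the file starts with the OLE2 signature.
def ole2_signature?(path)
  header_bytes(path) == OLE2_MAGIC
end

# Hypothetical helper: the header as lowercase hex, handy for eyeballing
# what was actually saved to disk.
def header_hex(path)
  header_bytes(path).unpack1("H*")
end
```

If ole2_signature?(file_location) returns false, the same file should also trigger the FormatError, and header_hex(file_location) shows which bytes are there instead.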

Either your file is not a valid OLE2 Word document, or it was saved by a newer version of Word in a format that ruby-ole does not support (for example, .docx files are ZIP archives, not OLE2 compound files).

I recommend retrying the operation with several different Word documents to find a compatible type, then re-save your original document in that format to try again.
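
Since you would like to handle both .doc and .docx, one option is to classify each file by its leading bytes before choosing an extractor, rather than trusting the extension. The sketch below assumes that a .docx file begins with the usual ZIP local-file signature "PK\x03\x04"; sniff_container and the two constants are hypothetical names, not part of any gem.

```ruby
# Leading bytes used to tell the container formats apart.
OLE2_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b # legacy .doc (CFBF)
ZIP_SIGNATURE  = "PK\x03\x04".b                        # .docx is a ZIP archive

# Hypothetical helper: classify a file by its header, not its extension.
# Returns :ole2, :zip, or :unknown.
def sniff_container(path)
  head = File.binread(path, 8) || "".b
  return :ole2 if head == OLE2_SIGNATURE
  return :zip  if head.start_with?(ZIP_SIGNATURE)
  :unknown
end
```

A caller could then hand :ole2 files to MSWordDoc::Extractor.load, route :zip files to a .docx-capable library, and treat :unknown as a bad input (for example, an HTML error page saved with a .doc name).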

Avocet answered 23/3, 2017 at 5:22 Comment(3)
This is a Word doc I downloaded from the Internet. Sadly, re-saving the doc in another format is not an option. I need to find a Ruby solution that can open the ".doc" file. I am able to open this doc with Word 2010. - Consueloconsuetude
On UNIX/Linux you can use the file command, e.g., file your.doc, and it will output the file type, for example: Microsoft Word 2007+, or Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252. This may help determine what kind of file it is. It uses the same kind of "magic test" against the file header. - Avocet
Hi, that Unix command really helped me out -- it allowed me to figure out I wasn't downloading the file properly. Anyway, if you feel pretty good about MS Word parsing, I've got another one open that will probably go to a bounty in a day or so -- #43078397 - Consueloconsuetude
