Merging Word documents in Ruby

3

7

I have N Word documents (Office 2003) from which I want to make a single Word document by merging all the N documents together in some order. How do I go about doing this in Ruby? Thanks!

It's just the documents that are created in MS Office. I do not use Windows and would prefer non-Windows solutions.

EDIT: Will this be easy if the docs are odt files rather than doc files?

Chadburn answered 16/7, 2010 at 20:13 Comment(8)
@Vijay Dev: To answer your edit, the answer is: maybe. You still have to do the conversion to ODT from DOC, which is one extra step. If you have to then convert them back to DOC, it's yet another step. If you're familiar with OOo and programming against it, it may be easier, but either way it's going to take a little elbow grease. - Spaceport
I use JODConverter in some other application. I can use it to do the odt to doc conversion I think. - Chadburn
@Vijay Dev: does the below answer your question? - Spaceport
Hi Otaku, Haven't had the time to check this out. Will let you know soon. Thanks! - Chadburn
@Otaku: Sorry, but how do I use what is mentioned in that link? - Chadburn
@Vijay Dev: That part you'll need to figure out. If you have a knowledge of the Word OM, this will be somewhat easier. - Spaceport
@Vijay Dev: just wanted to follow up to see if the below answers your question. - Spaceport
@Otaku: Hi! Needed to drop this problem due to changes in specifications. Haven't given a try after that. Thanks for the help! - Chadburn
3

There is a whole series of really good articles about Word and Ruby at http://rubyonwindows.blogspot.com/search/label/word. Word files are really complicated, at least before 2007, so you're better off automating Word to do it.

Trinatte answered 16/7, 2010 at 20:27 Comment(2)
Automate how? Can you explain? Also, mine is a Linux server, if it matters. - Chadburn
The blog posts are quite helpful for teaching you how to do the automation, but since they automate Word itself they will only work on Windows, or maybe under Wine. You would probably do better to look at automating OpenOffice. - Trinatte
2

The only non-Windows solution that I know of is the Ruby bindings for POI. After that, the code would be really similar to this .NET code: Merge Word Documents As Pages Of A Single Document Using VB.NET. The key is to call Selection.InsertFile for as many documents as you need, in the order you choose.

For ODT document merges, see this thread: http://cpanforum.com/threads/9938

Spaceport answered 1/8, 2010 at 16:15 Comment(1)
People have reported success in using docx4j via JRuby; we have a commercial component called MergeDocx which can also be used. - Bonitabonito
0

Understand, almost any answer to this question will depend on the constraints of the doc files you are using...

That being said, in my mind the first option if you are going to do this would be to convert them to a more easily parsed format. RTF is a great example, and if you can get them into this format, the RTF Pocket Guide from O'Reilly is a GREAT resource for understanding the structure of the files. Converting the files is pretty simple if you can install AbiWord on the Linux machine. From a command line, you'd just run:

abiword --to=rtf some_file_name.doc

Of course, in Ruby you'd just wrap these commands.
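As a rough sketch of what that wrapping might look like (not a tested recipe): the helper names abiword_command, converted_path and convert below are made up for illustration, and the output path assumes AbiWord writes the converted file next to the input with only the extension changed.

```ruby
require "open3"

# Build the argument list for one AbiWord conversion.
# `format` is the target extension, e.g. "rtf" or "txt".
def abiword_command(path, format)
  ["abiword", "--to=#{format}", path]
end

# Assumption: AbiWord writes its output beside the input file,
# keeping the base name and swapping the extension.
def converted_path(path, format)
  path.delete_suffix(File.extname(path)) + ".#{format}"
end

# Run the conversion and return the expected output path.
# Raises if abiword exits with a non-zero status.
def convert(path, format)
  _stdout, stderr, status = Open3.capture3(*abiword_command(path, format))
  raise "abiword failed for #{path}: #{stderr}" unless status.success?
  converted_path(path, format)
end
```

Passing the command as separate arguments (rather than one interpolated string) avoids shell quoting problems with file names that contain spaces. Something like `rtf_files = Dir["docs/*.doc"].sort.map { |f| convert(f, "rtf") }` would then convert a whole directory in name order.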

It's the merging that is more complicated, and it will depend on your files. You'll have to make some decisions as the programmer about whether you're going to combine the stylesheets from each individual doc, the font tables, and so on. The content just sits in the middle of the RTF file, but it's all the semantic and style data that you'll have to make choices about. There is no 'one way' here, simply because it depends on what you want on the other side. This is where the RTF Pocket Guide is a great help: basically you'll want to use it to understand the structure of your RTF files and decide what you do and don't want.

Otherwise, if you just want the content with NONE of the semantics, you could always convert them to txt files, then concat them. The command is very similar:

abiword --to=txt some_file_name.doc

This is dead simple: it will just spit out the text, and you can concat it and be done with it. But again, you'll lose ALL formatting of any sort.
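The concat step could be sketched in Ruby roughly like this. The method name concat_text and the blank-line separator between documents are my own choices for illustration, not anything AbiWord requires, and it assumes the .txt files have already been produced by the conversion above.

```ruby
# Concatenate already-converted text files into one output file,
# in exactly the order given. A separator is written between
# documents (not before the first or after the last).
def concat_text(inputs, output, separator: "\n\n")
  File.open(output, "w") do |out|
    inputs.each_with_index do |path, index|
      out.write(separator) unless index.zero?
      out.write(File.read(path))
    end
  end
  output
end
```

Usage would be along the lines of `concat_text(["intro.txt", "chapter1.txt"], "merged.txt")`, with the array order being the merge order.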

Emersonemery answered 5/8, 2010 at 14:3 Comment(0)
