How to extract plain text from a .docx file using R

Does anyone know of a tool they can recommend for extracting just the plain text from an article in .docx format (preferably with R)?

Speed isn't crucial, and we could even use a website with an API for uploading the files and extracting the text, but I've been unable to find one. I need to extract the introduction, the methods, the results and the conclusion. I want to drop the abstract, the references, and especially the figures and the tables. Thanks.

Foppery answered 20/5, 2018 at 21:48 Comment(0)
A
7

You can try the readtext package:

library(readtext)
x <- readtext("/path/to/file/myfile.docx")
# x$text will contain the plain text in the file

The variable x contains just the text without any formatting, so to extract specific pieces of information you need to search the string. For example, for the document you mentioned in your comment, one approach could be as follows:

library(readtext)
doc.text <- readtext("test.docx")$text

# Split text into parts using new line character:
doc.parts <- strsplit(doc.text, "\n")[[1]]

# First line in the document- the name of the Journal
journal.name <- doc.parts[1]
journal.name
# [1] "International Journal of Science and Research (IJSR)"

# Similarly we can extract some other parts from a header
issn <-  doc.parts[2]
issue <- doc.parts[3]

# Search for the Abstract:
abstract.loc <- grep("Abstract:", doc.parts)[1]

# Search for the Keyword
Keywords.loc <- grep("Keywords:", doc.parts)[1]

# The text in between these 2 keywords will be abstract text:
abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")

# Same way we can get Keywords text:
Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
Keywords.text
# [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"

# Assuming that Methods is part 2
Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")


# Assuming that Results is Part 3
Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")

# Similarly with other parts. For example for Acknowledgements section:
Ack.loc <- grep("Acknowledgements", doc.parts)[1]
Ref.loc <- grep("References", doc.parts)[1]
Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
Ack.text
# [1] "6. Acknowledgements We are especially grateful to the study participants. 
# This study was supported by a grant from the Tunisian Ministry of Health and 
# Ministry of Higher Education ...

The exact approach depends on the common structure of all the documents you need to search through. For example, if the first section is always named "Background", you can search for that word. However, if it is sometimes "Background" and sometimes "Introduction", you might want to search for the "1." pattern instead.
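
The grep-and-slice steps above can be wrapped in a small helper. This is only a sketch: extract_between is a hypothetical name, the sample doc.parts vector is made up for illustration, and the anchored patterns assume each heading starts its own line. It returns NA when a heading is missing, instead of failing with the "NA/NaN argument" error that an unmatched grep() produces in start:end.

```r
# Return the lines from the first match of `start_pattern` up to (but not
# including) the first later match of `end_pattern`, collapsed into one string.
# Returns NA_character_ when either heading cannot be found.
extract_between <- function(lines, start_pattern, end_pattern) {
  start <- grep(start_pattern, lines)[1]
  if (is.na(start)) return(NA_character_)
  # Only look for the end heading after the start heading
  end <- which(seq_along(lines) > start & grepl(end_pattern, lines))[1]
  if (is.na(end)) return(NA_character_)
  paste(lines[start:(end - 1)], collapse = " ")
}

# Made-up sample standing in for strsplit(readtext(...)$text, "\n")[[1]]
doc.parts <- c("Title", "Abstract: short summary.", "Keywords: a, b",
               "1. Introduction", "Some background.",
               "2. Methods", "What we did.")

extract_between(doc.parts, "^Abstract:", "^Keywords:")
# Accept either name for the first section
extract_between(doc.parts, "^1\\.\\s*(Background|Introduction)", "^2\\.")
# A missing heading gives NA rather than an error
extract_between(doc.parts, "^Abstract:", "^References")
```

Anchoring the patterns with ^ is a design choice: an unanchored "1\\." also matches text such as "11." or "0.1.5" in the middle of a paragraph.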

Aiguillette answered 20/5, 2018 at 22:13 Comment(3)
I'm really thankful for your help; that is what I was looking for. I have another question, if you don't mind. Is there a way to skip the tables and graphics (including captions) during extraction? The algorithm runs into problems in that case. - Foppery
@AzzaAbidi Is there anything consistent about these tables and graphs? For example, are they always at the end of the paper, after References, like in the file you gave me? - Aiguillette
The code worked pretty well at the beginning, but it throws an error while extracting the text between the two words: > abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ") #Error in abstract.loc:(Keywords.loc - 1) : NA/NaN argument - Foppery
S
4

Pandoc is a fantastic solution for tasks like this. With a document named a.docx, you would run this at the command line:

pandoc -f docx -t markdown -o a.md a.docx

You could then use regex tools in R to extract what you needed from the newly-created a.md, which is text. By default, images are not converted.

Pandoc is part of RStudio, by the way, so you may already have it.
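
The same conversion can be driven from R. This is a minimal sketch that assumes a pandoc executable is on the PATH; pandoc_args and docx_to_text_lines are hypothetical helper names, not part of any package.

```r
# Build the pandoc arguments for a docx -> text conversion.
# `to` is any pandoc output format, e.g. "markdown" or "plain".
pandoc_args <- function(input, output, to = "markdown") {
  c("-f", "docx", "-t", to, "-o", shQuote(output), shQuote(input))
}

# Convert one .docx file and return the converted text as a character vector.
# Assumption: pandoc is installed and reachable through the PATH.
docx_to_text_lines <- function(input, to = "markdown") {
  if (!nzchar(Sys.which("pandoc"))) stop("pandoc not found on PATH")
  output <- tempfile(fileext = ".txt")
  status <- system2("pandoc", pandoc_args(input, output, to))
  if (status != 0) stop("pandoc failed with status ", status)
  readLines(output, encoding = "UTF-8", warn = FALSE)
}

# Usage (assumes a.docx exists in the working directory):
# md <- docx_to_text_lines("a.docx")
# txt <- docx_to_text_lines("a.docx", to = "plain")
```

The returned lines can then be searched with grep() like any other character vector.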

Skeet answered 20/5, 2018 at 22:5 Comment(0)
D
4

You should find that one of these packages will do the trick for you.

Ultimately, the modern Office file formats (OpenXML) are simply *.zip files containing structured XML content, so if your content is well structured you may just want to open it that way. I would start here (http://officeopenxml.com/anatomyofOOXML.php), and you should be able to use the OpenXML SDK documentation for guidance as well (https://msdn.microsoft.com/en-us/library/office/bb448854.aspx)
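
To illustrate the zip-plus-XML idea with base R only, here is a rough sketch. The function names are hypothetical, and it uses regular expressions rather than a real XML parser, so it ignores XML entities such as &amp;amp; and will not cope with every document. It reads word/document.xml from the archive, splits it at paragraph ends (w:p), and joins the text runs (w:t) in each paragraph.

```r
# Pull paragraph text out of the raw contents of word/document.xml.
# Regex-based sketch: a proper XML parser is the more robust choice.
paragraphs_from_document_xml <- function(xml) {
  xml <- paste(xml, collapse = "")
  # Each paragraph ends with </w:p>
  paras <- strsplit(xml, "</w:p>", fixed = TRUE)[[1]]
  text <- vapply(paras, function(p) {
    # Text runs look like <w:t>...</w:t> or <w:t xml:space="preserve">...</w:t>
    runs <- regmatches(p, gregexpr("<w:t(?: [^>]*)?>[^<]*</w:t>", p, perl = TRUE))[[1]]
    paste(gsub("<[^>]+>", "", runs), collapse = "")
  }, character(1), USE.NAMES = FALSE)
  text[nzchar(text)]
}

# Read the main document part straight out of the .docx (zip) archive.
read_docx_paragraphs <- function(path) {
  con <- unz(path, "word/document.xml")
  on.exit(close(con))
  paragraphs_from_document_xml(readLines(con, warn = FALSE, encoding = "UTF-8"))
}

# Small hand-written fragment to show the behaviour
sample_xml <- paste0(
  "<w:body><w:p><w:r><w:t>Hello </w:t></w:r>",
  "<w:r><w:t xml:space=\"preserve\">world</w:t></w:r></w:p>",
  "<w:p><w:r><w:tab/><w:t>Second</w:t></w:r></w:p><w:sectPr/></w:body>"
)
paragraphs_from_document_xml(sample_xml)
```

Note that table cells also contain w:p elements, so this sketch returns table text along with body text.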

Domestic answered 20/5, 2018 at 22:8 Comment(2)
Well, they might be 'simply' XML, but the text is usually buried between a ludicrous number of tags that might as well not be human readable. - Readymade
The docx files I'm working on were converted from PDF, so their XML structure is not clean. - Foppery
A
1

You can do it with package officer:

library(officer)
example_docx <- system.file(package = "officer", "doc_examples/example.docx")
doc <- read_docx(example_docx)
summary_paragraphs <- docx_summary(doc)
summary_paragraphs[summary_paragraphs$content_type %in% "paragraph", "text"]
#>  [1] "Title 1"                                                                
#>  [2] "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "              
#>  [3] "Title 2"                                                                
#>  [4] "Quisque tristique "                                                     
#>  [5] "Augue nisi, et convallis "                                              
#>  [6] "Sapien mollis nec. "                                                    
#>  [7] "Sub title 1"                                                            
#>  [8] "Quisque tristique "                                                     
#>  [9] "Augue nisi, et convallis "                                              
#> [10] "Sapien mollis nec. "                                                    
#> [11] ""                                                                       
#> [12] "Phasellus nec nunc vitae nulla interdum volutpat eu ac massa. "         
#> [13] "Sub title 2"                                                            
#> [14] "Morbi rhoncus sapien sit amet leo eleifend, vel fermentum nisi mattis. "
#> [15] ""                                                                       
#> [16] ""                                                                       
#> [17] ""
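
Because docx_summary() labels each row with a content_type, the same data frame can be used to leave out table cells. The sketch below is an assumption-laden illustration: drop_tables_and_captions is a hypothetical helper, the caption pattern is a guess at how captions are worded, and the demo data frame is hand-built to mimic the content_type and text columns used above.

```r
# Keep only non-empty body paragraphs and drop lines that look like captions.
# `summary_df` needs the `content_type` and `text` columns used above.
drop_tables_and_captions <- function(summary_df,
                                     caption_pattern = "^(Table|Figure|Fig\\.)\\s*[0-9]+") {
  is_paragraph <- summary_df$content_type %in% "paragraph"
  text <- summary_df$text[is_paragraph]
  text <- text[!is.na(text) & nzchar(trimws(text))]
  text[!grepl(caption_pattern, text, ignore.case = TRUE)]
}

# Hand-built stand-in for docx_summary(read_docx("some_file.docx"))
demo <- data.frame(
  content_type = c("paragraph", "table cell", "paragraph", "paragraph", "paragraph"),
  text = c("1. Introduction", "42", "Table 1: Patient data", "", "Body text."),
  stringsAsFactors = FALSE
)
drop_tables_and_captions(demo)
```

One caveat of this design: a body sentence that happens to start with "Table 1 shows ..." would be dropped too, so the pattern may need tightening for a given set of articles.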
Anthropomorphic answered 21/5, 2018 at 11:31 Comment(6)
That's cool, thank you for your help. In my case, I'm working on a dataset of scientific articles and need to extract the introduction, methods, and results (without the abstract and references). Can you help me with a piece of code that does so? - Foppery
@AzzaAbidi Do you have an example document that contains this info? - Aiguillette
@Aiguillette I have a dataset of 400 scientific articles in docx format, structured like this: title, authors, abstract, introduction, methods, results, conclusion, references. - Foppery
@Aiguillette there is a sample - Foppery
@AzzaAbidi I added an attempt to handle the file you provided. The exact algorithm would depend on the common structure of all the documents you need to go through, as I mentioned in my example. - Aiguillette
In my experience, everything about the officer package is terrific (thanks, David) and I use it as my first resort. - Windermere
R
1

Here is another approach that can be considered (it drives Word through COM, so it requires Windows with Microsoft Word installed):

library(RDCOMClient)
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
path_To_Word_File <- "D:\\word_File_Table.docx"
doc <- wordApp[["Documents"]]$Open(normalizePath(path_To_Word_File), ConfirmConversions = FALSE)
# FileFormat = 2 is wdFormatText, i.e. save as plain text
wordApp[["ActiveDocument"]]$SaveAs2(FileName = "D:\\word_File_Table.txt", FileFormat = 2)
readLines("D:\\word_File_Table.txt")
Roseannaroseanne answered 19/4, 2023 at 1:10 Comment(0)
