How do I read information from text files?
I have hundreds of text files with the following information in each file:

*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24

Now I want to read all of these files, "collect" the Sen's slope statistic from each one (e.g., .24), and compile the values into one file along with the corresponding file names. I have to do this in R.

I have worked with CSV files, but I am not sure how to work with plain text files like these.

This is the code I am using now:

require(gtools)
GG <- grep("*.txt", list.files(), value = TRUE)
GG <- mixedsort(GG)
S <- sapply(seq(GG), function(i) {
    X <- readLines(GG[i])
    grep("SEN SLOPE", X, value = TRUE)
})
spl <- unlist(strsplit(S, ".*[^.0-9]"))
SenStat <- as.numeric(spl[nzchar(spl)])
SenStat <- data.frame(SenStat, file = GG)
write.table(SenStat, "sen.csv", sep = ", ", row.names = FALSE)

The current code is not reading all of the values correctly and gives this warning:

Warning message:
NAs introduced by coercion 

Also, I am not getting the file names in the other column of the output. Please help!


Diagnosis 1

The code is reading the = sign as well. This is the output of print(spl):

 [1] ""       "5.55"   ""       "-.18"   ""       "3.08"   ""       "3.05"   ""       "1.19"   ""       "-.32"  
[13] ""       ".22"    ""       "-.22"   ""       ".65"    ""       "1.64"   ""       "2.68"   ""       ".10"   
[25] ""       ".42"    ""       "-.44"   ""       ".49"    ""       "1.44"   ""       "=-1.07" ""       ".38"   
[37] ""       ".14"    ""       "=-2.33" ""       "4.76"   ""       ".45"    ""       ".02"    ""       "-.11"  
[49] ""       "=-2.64" ""       "-.63"   ""       "=-3.44" ""       "2.77"   ""       "2.35"   ""       "6.29"  
[61] ""       "1.20"   ""       "=-1.80" ""       "-.63"   ""       "5.83"   ""       "6.33"   ""       "5.42"  
[73] ""       ".72"    ""       "-.57"   ""       "3.52"   ""       "=-2.44" ""       "3.92"   ""       "1.99"  
[85] ""       ".77"    ""       "3.01"

Diagnosis 2

Found the problem, I think. The negative sign is a bit tricky. In some files it appears as

SEN SLOPE =-1.07
SEN SLOPE = -.11

Because the first one has no space after the =, I am getting an NA for it, while the code reads the second one correctly. How can I modify the regex to handle both? Thanks!
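
I am guessing the fix is to make the whitespace after the = optional, something like this (just a sketch of the idea using sub, not my current code):

lines <- c("SEN SLOPE =-1.07", "SEN SLOPE = -.11", "SEN SLOPE =  .24")
# drop everything up to and including the "=" plus any spaces that follow it
as.numeric(sub(".*=\\s*", "", lines))
## [1] -1.07 -0.11  0.24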

Dion answered 13/4, 2014 at 2:14 Comment(0)
F
10

Assume "text.txt" is one of your text files. Read into R with readLines, you can use grep to find the line containing SEN SLOPE. With no further arguments, grep returns the index number(s) for the element where the regular expression was found. Here we find that it's the 11th line. Add the value = TRUE argument to get the line as it reads.

x <- readLines("text.txt")
grep("SEN SLOPE", x)
## [1] 11
( gg <- grep("SEN SLOPE", x, value = TRUE) )
## [1] "SEN SLOPE =  .24"

To find all the .txt files in the working directory, we can use list.files with a regular expression.

list.files(pattern = "\\.txt$")
## [1] "text.txt"

LOOPING OVER MULTIPLE FILES

I created a second text file, text2.txt, with a different SEN SLOPE value to illustrate how to apply this method over multiple files. We can use sapply, followed by strsplit, to get the desired spl values.

GG <- list.files(pattern = "\\.txt$")
S <- sapply(seq_along(GG), function(i){
    X <- readLines(GG[i])
    ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
    ## added 04/23/14 to account for empty files (as per comment)
})
spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
## above regex changed to capture up to and including "=" and 
## surrounding space, if any - 04/23/14 (as per comment)
SenStat <- as.numeric(spl[nzchar(spl)])

Then we can put the results into a data frame and send it to a file with write.table

( SenStatDf <- data.frame(SenStat, file = GG) )
##   SenStat      file
## 1    0.46 text2.txt
## 2    0.24  text.txt

We can write it to a file with

write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE)

UPDATED 07/21/2014:

Since the result is being written to a file, this can be made much simpler (and faster) with

( SenStatDf <- cbind(
      SenSlope = c(lapply(GG, function(x){
          y <- readLines(x)
          z <- y[grepl("SEN SLOPE", y)]
          unlist(strsplit(z, split = ".*=\\s+"))[-1]
          }), recursive = TRUE),
      file = GG
 ) )
#      SenSlope file       
# [1,] ".46"   "text2.txt"
# [2,] ".24"   "text.txt" 

And then written out and read back into R with

write.table(SenStatDf, "myFile.txt", row.names = FALSE)
read.table("myFile.txt", header = TRUE)
#   SenSlope      file
# 1     0.46 text2.txt
# 2     0.24  text.txt
Furman answered 13/4, 2014 at 2:19 Comment(8)
Oops. Need some help. Some of the values are negative. And the code is reading them as positive. What change do I have to make to read them correctly?Dion
The code was working before, but now I am suddenly seeing this error: Error in strsplit(S, ".*[^(-|\\s).0-9]") : non-character argument. I am not sure what is going wrong. :-( Could you please help? Also, the expected values are between -5 and 5 @richardDion
Yes, same code, same place. :( Everything was working fine till few days back.Dion
Hi Richard, the code is working now. But I am not getting the file names in column 2. And some NA values are getting added. Don't know why. Not reading all values. Out of 44 files in the first run, I couldn't read 5 values. I have added some diagnosisDion
Because there was a problem with the naming of the files. Could you please check this: #23160587 It shouldn't affect your code, I think.Dion
Hurrah! It works. But the files names are still missing. Now I have all the values correctly.Dion
Haha. Thanks. Hard to keep calm with project report due in two days. Did all the work and now I am stuck with sorting. Thanks!:PDion
Can you add some logic somewhere for this: I have found that it is again not working because sometimes a file is completely blank. I found those files manually, but how can I handle that in the code? I will use your strsplit edit.Dion
N
4

First make a sample text file:

cat('*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24',file='samp.txt')

Then read it in:

tf <- readLines('samp.txt')

Now extract the appropriate line:

sen_text <- grep('SEN SLOPE',tf,value=T)

And then get the value past the equals sign:

sen_value <- as.numeric(unlist(strsplit(sen_text,'='))[2])

Then combine these results for each of your files (no file structure mentioned in the original question)
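
For example, assuming the .txt files are all in the working directory and every file contains a SEN SLOPE line, combining the per-file results might look like this (a sketch; the output file name is just a placeholder):

files <- list.files(pattern = "\\.txt$")
sen_values <- sapply(files, function(f) {
    tf <- readLines(f)
    sen_text <- grep('SEN SLOPE', tf, value = TRUE)
    as.numeric(unlist(strsplit(sen_text, '='))[2])
})
# one row per file, with the file name alongside its Sen's slope
result <- data.frame(file = files, sen_value = sen_values, row.names = NULL)
write.csv(result, 'sen_values.csv', row.names = FALSE)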

Norling answered 13/4, 2014 at 2:22 Comment(4)
This is exactly the same as my answer. :)Furman
@RichardScriven Well, then you have a pretty good answer ;-) readLines is really cool.Norling
Thank you. How do I add all the Sen values in a single dataframe in a loop? And then export it into a CSV file that basically has this format (filename,sen_value)Dion
See @richardscriven 's answer above on how to loop through each text fileNorling
P
1

If your text files always have that format (e.g., SEN SLOPE is always on line 11) and the text is identical across all your files, you can do what you need in just two lines.

char_vector <- readLines("Path/To/Document/sample.txt")
statistic <- as.numeric(strsplit(char_vector[11]," ")[[1]][5])

That will give you 0.24.

You then iterate over all your files via an apply statement or a for loop.
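
For instance, the loop might look like this (a sketch that assumes every file really does have SEN SLOPE on line 11 with the same spacing, and that the .txt files sit in the working directory):

files <- list.files(pattern = "\\.txt$")
statistics <- sapply(files, function(f) {
    char_vector <- readLines(f)
    as.numeric(strsplit(char_vector[11], " ")[[1]][5])
})
data.frame(file = files, sen_value = statistics, row.names = NULL)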

For clarity:

> char_vector[11]
[1] "SEN SLOPE =  .24"

and

> strsplit(char_vector[11]," ")
[[1]]
[1] "SEN"   "SLOPE" "="     ""      ".24"  

Thus you want [[1]][5] of the result from strsplit.

Plankton answered 13/4, 2014 at 2:21 Comment(0)
D
1

Step 1: Save the full paths of all the files in a single variable (here dataDir is the directory that contains your text files):

fileNames <- dir(dataDir,full.names=TRUE)

Step 2: Let's read and process one of the files and make sure it gives the correct result:

data.frame(
  file=basename(fileNames[1]), 
  SEN.SLOPE= as.numeric(tail(
    strsplit(grep('SEN SLOPE',readLines(fileNames[1]),value=T),"=")[[1]],1))
  )

Step 3: Do the same for all the files:

do.call(
  rbind,
  lapply(fileNames, 
         function(fileName) data.frame(
           file=basename(fileName), 
           SEN.SLOPE= as.numeric(tail(
             strsplit(grep('SEN SLOPE',
                           readLines(fileName),value=T),"=")[[1]],1)
             )
           )
         )
  )
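
To get the combined table into a CSV file as asked in the question, you could assign the result of Step 3 to a variable and write it out; a minimal sketch (the output file name is just a placeholder):

# senSlopes holds the data frame built in Step 3
senSlopes <- do.call(rbind, lapply(fileNames, function(fileName) data.frame(
  file = basename(fileName),
  SEN.SLOPE = as.numeric(tail(strsplit(grep('SEN SLOPE', readLines(fileName),
                                            value = TRUE), "=")[[1]], 1))
)))
write.csv(senSlopes, "sen_slope_all.csv", row.names = FALSE)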

Hope this helps!!

Dyanna answered 24/4, 2014 at 6:59 Comment(0)
