Skip metadata when importing a dataset in R
My question is how to skip metadata at the beginning of a file when importing data into R. My data is in .txt format; the first lines are metadata describing the data, and these need to be filtered out. Below is a minimal example of the data frame in tab-delimited format:

Type=GenePix Export                         
DateTime=2010/03/04 16:04:16                        
PixelSize=10                        
Wavelengths=635                     
ImageFiles=Not Saved                        
NormalizationMethod=None                        
NormalizationFactors=1                      
JpegImage=                      
StdDev=Type 1                       
FeatureType=Circular                        
Barcode=                        
BackgroundSubtraction=LocalFeature                      
ImageOrigin=150, 10                     
JpegOrigin=150, 2760                        
Creator=GenePix Pro 7.2.29.002                      
var1    var2    var3    var4    var5    var6    var7
1   1   1   molecule1   1F3 400 4020
1   2   1   molecule2   1B5 221 4020
1   3   1   molecule3   1H5 122 2110
1   4   1   molecule4   1D1 402 2110
1   5   1   molecule5   1F1 600 4020

If I knew the line on which the actual data starts, I could use the basic command below:

mydata <- read.table("mydata.txt", header = TRUE, skip = 15)

which would return:

mydata
  var1 var2 var3      var4 var5 var6 var7
1    1    1    1 molecule1  1F3  400 4020
2    1    2    1 molecule2  1B5  221 4020
3    1    3    1 molecule3  1H5  122 2110
4    1    4    1 molecule4  1D1  402 2110
5    1    5    1 molecule5  1F1  600 4020

The problem is that I need to write a script that can read various datasets in which the row where the actual data starts varies from one dataset to another. I could imagine using something like the sqldf package, but I am not very familiar with SQL.

Any assistance would be greatly appreciated.

Spielman answered 16/3, 2015 at 7:18 Comment(5)
Have you determined what marks the splitting point between metadata and actual data? This would need to work for all datasets; for example, maybe the data uses tabs as field separators and no tabs ever appear in the metadata.Silurian
@Spielman Do you have any specific patterns where the column names begin?Avril
Yes, one splitting point between metadata and actual data is that data uses tabs for field separators, and no tabs are ever present in metadata and this works for all datasets @DominicComtoisSpielman
Yes, there is a specific pattern where the column names begin i.e. var1 and this works for all datasets @AvrilSpielman
@Spielman Thanks, I updated the solution based on that infoAvril
You could use count.fields() to determine the skip argument. Here I call your file "x.txt":

read.table("x.txt", skip = which.max(count.fields("x.txt") == 7) - 1, 
    header = TRUE)
#   var1 var2 var3      var4 var5 var6 var7
# 1    1    1    1 molecule1  1F3  400 4020
# 2    1    2    1 molecule2  1B5  221 4020
# 3    1    3    1 molecule3  1H5  122 2110
# 4    1    4    1 molecule4  1D1  402 2110
# 5    1    5    1 molecule5  1F1  600 4020

This starts reading the file at the first line that contains seven fields.
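As suggested in the comments, this can be generalized by skipping to the first line with the maximum field count instead of hard-coding 7, and by passing sep = "\t" to count.fields. A sketch of that variant; the writeLines call just recreates a tiny stand-in for "x.txt", which in practice already exists:

```r
# Recreate a tiny version of the file (metadata lines with no tabs,
# then tab-separated data) -- in practice "x.txt" already exists.
writeLines(c("Type=GenePix Export",
             "Creator=GenePix Pro 7.2.29.002",
             "var1\tvar2\tvar3",
             "1\tmolecule1\t400",
             "2\tmolecule2\t221"), "x.txt")

# Count tab-separated fields per line: metadata lines (no tabs) count
# as one field, data lines as the full column count.
n <- count.fields("x.txt", sep = "\t")

# Skip to the first line with the maximum field count -- no need to
# hard-code the number of columns.
mydata <- read.table("x.txt", skip = which.max(n) - 1,
                     header = TRUE, sep = "\t")
```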

Yser answered 16/3, 2015 at 8:50 Comment(3)
Why not skip = which.max(count.fields("x.txt")) - 1, so that having a file with 6, 8 or 12 columns would also work?Silurian
Yes, I would consider just having skip = which.max(count.fields("x.txt")) - 1 an improvement to the solution as it works generally in different cases. @DominicComtoisSpielman
@Spielman And btw you can add sep="\t" as an argument to count.fields to be even more thorough.Silurian
How about using the existing functions for reading DNA microarray data? These are available in packages developed by the Bioconductor project.

For example, roughly something like this:

library(limma)

mydata <- read.maimages("mydata.txt", source = "genepix")

See the limma manual for more examples. It can readily import most DNA microarray formats.

Uel answered 16/3, 2015 at 7:53 Comment(1)
When I tried this code using the input data, read.maimages('mydata.txt', source='genepix') #Error in readGPRHeader(fullname) : File is not in Axon Text File (ATF) formatAvril
Suppose all the files have Creator as the last metadata line:

read.table(pipe("awk 'NR == 1, /Creator/ {next} {print}' mydata.txt"),
           header = TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020

If you know the number of columns, you could also do:

read.table(pipe("awk 'NF==7{print}' mydata.txt"), header=TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020

Update

If we need to start reading from the first occurrence of 'var1' to the end of the file:

read.table(pipe("awk '/var1/ { matched = 1 } matched { print }' mydata.txt"),
           header = TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020

The above solutions work fine on a Linux system; on Windows they failed (as per the comments). An option that should work on both systems is:

lines <- readLines('mydata.txt')
read.table(text = lines[grep('var1', lines):length(lines)], header = TRUE)
#  var1 var2 var3      var4 var5 var6 var7
#1    1    1    1 molecule1  1F3  400 4020
#2    1    2    1 molecule2  1B5  221 4020
#3    1    3    1 molecule3  1H5  122 2110
#4    1    4    1 molecule4  1D1  402 2110
#5    1    5    1 molecule5  1F1  600 4020
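If "var1" cannot be relied upon as the first column name in every file, a variation on the same cross-platform idea is to look for the first line containing a tab, since (per the comments) tabs never appear in the metadata. A sketch; the writeLines call only recreates a tiny stand-in for "mydata.txt", which in practice already exists:

```r
# Recreate a tiny sample file (tab-free metadata, then tab-separated
# data) -- in practice "mydata.txt" already exists.
writeLines(c("Type=GenePix Export",
             "Creator=GenePix Pro 7.2.29.002",
             "var1\tvar2",
             "1\tmolecule1",
             "2\tmolecule2"), "mydata.txt")

# Find the first line containing a literal tab character; per the
# comments, metadata lines never contain tabs.
lines <- readLines("mydata.txt")
first.data <- which(grepl("\t", lines, fixed = TRUE))[1]

# Read from the header line (the first tab-containing line) onward.
mydata <- read.table(text = lines[first.data:length(lines)],
                     header = TRUE, sep = "\t")
```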
Avril answered 16/3, 2015 at 7:40 Comment(10)
The character in the last metadata line varies from one data set to another so solution one might not work in all cases. About the two other possible solutions I get the errors: Error in read.table(pipe("awk 'NF==7{print}' mydata.txt"), header = TRUE) : no lines available in input and Error in read.table(pipe("awk '/var1/ { matched = 1}matched { print }' mydata.txt"), : no lines available in input. What might be wrong? @AvrilSpielman
@Spielman I was just copy/pasting the input data showed in the post and saved it as mydata.txt. So, I don't know what might be the error in your case as it works on my system.Avril
@DominicComtois I am using linux mint But, I think it could work on windows #21928444Avril
@Spielman If I understand your comments, the last two solutions get the error. Do you have the problem with the first code? (just for curiosity)Avril
There is actually an awk version for Windows but it's not there by default, gotta install it from gnuwin32.sourceforge.net/packages/gawk.htmSilurian
@DominicComtois Thanks for the link. I don't use windows that much, so didn't know that this will not workAvril
Yes, I have a problem with the first solution too @AvrilSpielman
I am on windows 8. It seems from the gawk installation site that there is no gawk version for Windows 8 @DominicComtoisSpielman
Ok... well that explains why this solution doesn't work. But it seems you have plenty of other very good options!Silurian
@Spielman Anyway, this will be an option if you have a chance to use linux system.Avril
A solution based on the presence of tabs in the actual data (and not in the metadata). As a "bonus", you have the option to display (via cat) whatever lines were considered metadata.

Main reading function

read.genepix <- function(filename, disp.meta = FALSE) {

    infile <- file(description = filename, open = "r" )

    # create a meta indicator function
    is.meta <- function(text) !grepl(pattern = "\\t", x = text)

    # Prepare to store meta text (if needed)
    meta.text <- c()
    meta <- TRUE

    while(isTRUE(meta)) {

        last.pos <- seek(infile, where = NA)
        current.line <- readLines(infile, n = 1)
        meta <- is.meta(current.line)

        if(isTRUE(meta)) {
            meta.text <- append(meta.text, current.line)
        } else {
            seek(infile, where = last.pos)
            data.txt <- paste0(readLines(infile),collapse="\n")
            close(infile)
            break
        }
    }

    if(isTRUE(disp.meta)) {
        cat(paste(meta.text, collapse="\n"))
    }

    return(read.table(text=data.txt, header = TRUE, sep = "\t", quote=""))
}

Usage / Results

my.data <- read.genepix("somefile.txt")

my.data

#   var1 var2 var3      var4 var5 var6 var7
# 1    1    1    1 molecule1  1F3  400 4020
# 2    1    2    1 molecule2  1B5  221 4020
# 3    1    3    1 molecule3  1H5  122 2110
# 4    1    4    1 molecule4  1D1  402 2110
# 5    1    5    1 molecule5  1F1  600 4020

Sample data used in this answer (saved to disk as "somefile.txt"). Note that Stack Overflow replaces tabs with runs of spaces in the data section, so in your text editor you will need to replace those spaces with tabs for the code to work.

capture.output(cat("Type=GenePix Export
DateTime=2010/03/04 16:04:16
PixelSize=10
Wavelengths=635
ImageFiles=Not Saved
NormalizationMethod=None
NormalizationFactors=1
JpegImage=
StdDev=Type 1
FeatureType=Circular
Barcode=
BackgroundSubtraction=LocalFeature
ImageOrigin=150, 10
JpegOrigin=150, 2760
Creator=GenePix Pro 7.2.29.002
var1    var2    var3    var4    var5    var6    var7
1   1   1   molecule1   1F3 400 4020
1   2   1   molecule2   1B5 221 4020
1   3   1   molecule3   1H5 122 2110
1   4   1   molecule4   1D1 402 2110
1   5   1   molecule5   1F1 600 4020
"), file="somefile.txt")
Silurian answered 16/3, 2015 at 9:25 Comment(4)
The function read.genepix seems to print the metadata irrespective of whether print.meta = TRUE or print.meta = FALSE. I guess there is a small bug in the function. @DominicComtoisSpielman
I updated the solution so that the file gets read only once; the printing of meta should also work fine now.Silurian
The problem with the printing of the metadata even when print.meta = FALSE still appears @DominicComtoisSpielman
I can't really say why... it didn't on my machine. I renamed the argument in case it had anything to do with it, and added a stronger condition; it should work fine now. If not, make sure you restart your session before re-testing it pls!Silurian
The comments to your question describe the following:

  1. The metadata section does not include any tabs
  2. The data section is tab-delimited

As such, I'm guessing that your data are something like the sample data found at the end of this answer.

If this is the case, you can use the magic of fread to automatically determine where your data starts.

Here's a demo:

cat(A, file = "mytest.txt", sep = "\n")
library(data.table)
fread("mytest.txt")
#    var1 var2 var3      var4 var5 var6 var7
# 1:    1    1    1 molecule1  1F3  400 4020
# 2:    1    2    1 molecule2  1B5  221 4020
# 3:    1    3    1 molecule3  1H5  122 2110
# 4:    1    4    1 molecule4  1D1  402 2110
# 5:    1    5    1 molecule5  1F1  600 4020
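As an aside, fread's skip argument also accepts a character string, in which case reading starts at the first line containing that string; this can serve as a fallback if the automatic detection ever guesses wrong. A sketch, assuming (as the comments state) the column names always begin with "var1"; the writeLines call only recreates a tiny stand-in for "mytest.txt":

```r
library(data.table)

# Recreate a tiny sample file -- in practice "mytest.txt" already exists.
writeLines(c("Type=GenePix Export",
             "var1\tvar2",
             "1\tmolecule1"), "mytest.txt")

# skip = "var1" tells fread to start reading at the first line that
# contains the string "var1", i.e. the header line.
mydata <- fread("mytest.txt", skip = "var1")
```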

Sample data:

A <- c("Type=GenePix Export", "DateTime=2010/03/04 16:04:16", "PixelSize=10", 
"Wavelengths=635", "ImageFiles=Not Saved", "NormalizationMethod=None", 
"NormalizationFactors=1", "JpegImage=", "StdDev=Type 1", "FeatureType=Circular", 
"Barcode=", "BackgroundSubtraction=LocalFeature", "ImageOrigin=150, 10", 
"JpegOrigin=150, 2760", "Creator=GenePix Pro 7.2.29.002", 
"var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7", 
"1\t1\t1\tmolecule1\t1F3\t400\t4020", "1\t2\t1\tmolecule2\t1B5\t221\t4020", 
"1\t3\t1\tmolecule3\t1H5\t122\t2110", "1\t4\t1\tmolecule4\t1D1\t402\t2110", 
"1\t5\t1\tmolecule5\t1F1\t600\t4020")

A
#  [1] "Type=GenePix Export"               
#  [2] "DateTime=2010/03/04 16:04:16"      
#  [3] "PixelSize=10"                      
#  [4] "Wavelengths=635"                   
#  [5] "ImageFiles=Not Saved"              
#  [6] "NormalizationMethod=None"          
#  [7] "NormalizationFactors=1"            
#  [8] "JpegImage="                        
#  [9] "StdDev=Type 1"                     
# [10] "FeatureType=Circular"              
# [11] "Barcode="                          
# [12] "BackgroundSubtraction=LocalFeature"
# [13] "ImageOrigin=150, 10"               
# [14] "JpegOrigin=150, 2760"              
# [15] "Creator=GenePix Pro 7.2.29.002"    
# [16] "var1\tvar2\tvar3\tvar4\tvar5\tvar6\tvar7"
# [17] "1\t1\t1\tmolecule1\t1F3\t400\t4020"      
# [18] "1\t2\t1\tmolecule2\t1B5\t221\t4020"      
# [19] "1\t3\t1\tmolecule3\t1H5\t122\t2110"      
# [20] "1\t4\t1\tmolecule4\t1D1\t402\t2110"      
# [21] "1\t5\t1\tmolecule5\t1F1\t600\t4020"    
Bacitracin answered 16/3, 2015 at 11:4 Comment(4)
The solution involving fread("mytest.txt") works with the data set created by converting the original data set to a different format using cat(A, file = "mytest.txt", sep = "\n") but it does not filter out the metadata in the original data set. @AnandaMahtoSpielman
@amo, you will always get best input if you provide a reproducible example. An easy way to do this would be to use something like dput(readLines("mytest.txt", n = 20)) (where "mytest.txt" should be replaced by your actual file name) so that we can see what exactly the first 20 lines of your file look like. With that, we can all use cat to recreate a sample file and test our solutions.Bacitracin
The process of recreating a sample file works fine with the minimal data set, but it might be quite a task for the real data set, which contains >3000 rows. @AnandaMahtoSpielman
@amo, My point was simply that since you didn't provide reproducible sample data, I created sample data that describes what you mentioned in your comments: a file with metadata that doesn't contain tabs followed by tabular data with each column separated by tabs. I pointed you to fread because fread handles such problems with ease, even datasets with millions of rows. Sorry it didn't work out. I'm guessing that your description of the problem was off. My guess is that there are tabs in your metadata. You can easily inspect that by setting your text editor to show whitespace characters.Bacitracin