Reading .dat and .dct directly from R

I need to read a .dat file using a .dct file. Has anyone done that using R?

The format is:

dictionary {
  # how many lines per record
  _lines(1)
  # start defining the first line
  _line(1)

  # starting column / storage type / variable name / read format / variable label
  _column(1)    str8    aid    %8s    "respondent identifier"
  ...
}

'read formats' are like:

%2f        2 column integer variable
%12s      12 column string variable
%8.2f      8 column number with 2 implied decimal places. 

Storage types are described here: http://www.stata.com/help.cgi?datatypes

Other sites used for info:

http://library.columbia.edu/indiv/dssc/technology/stata_write.html

http://www.stata.com/support/faqs/data-management/reading-fixed-format-data/

The .dat file is a bunch of numbers corresponding to the variables specified in the .dct file. (Presumably this is data in fixed width columns).

Here is a real example:

.dct file http://goo.gl/qHZOk

data http://goo.gl/FRGRF

A specific example from the stata site is:

The .dat file ("test.raw" in this instance)

C1245A101George Costanza
B1223B011Cosmo Kramer

The .dct file

dictionary using test2.raw {
 _column(1)     str5     code   %5s
 _column(2)     int      call   %4f
 _column(6)     str1     city   %1s
 _column(7)     int      neigh  %3f
 _column(10)    str16    name   %16s
}

The resulting data file:

      +-----------------------------------------------+
      |  code   call   city   neigh              name |
      |-----------------------------------------------|
   1. | C1245   1245      A     101   George Costanza |
   2. | B1223   1223      B      11      Cosmo Kramer |
      +-----------------------------------------------+
Silt answered 8/1, 2013 at 21:26 Comment(6)
Can you provide some documentation or reference about these files you're talking about? From some preliminary searching I'm guessing these are files from Stata? - Synchro
What is a .dct file? What specific .dat filetype are you talking about? We are going to need more detailed information to answer you. - Nicolella
Give us some example files. Complete examples. And more info about where they come from. Otherwise the solution is just as likely to be found by a million monkeys with a million typewriters. - Muenster
I wholeheartedly agree with @Spacedman. If these files come from Stata (which is guesswork), perhaps the memisc package will be useful, as suggested in the help for read.dta, which you would have navigated towards after reading the wonderful Data Import / Export Manual. - Timbal
@Silt - I have edited your question to provide some actual useful information. This was from 5 minutes on Google - could you verify if this looks ok? - Nicolella
Thank you @thelatemail, my question is just whether there is a way to read those files using R. I have several big .dct and .dat files that I would like to read using R. Any ideas? - Silt

@thelatemail is spot-on about how to proceed. Here's a small function I threw together to get you started on a more robust solution:

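read.dat.dct <- function(dat, dct) {
    # Read the dictionary as plain text, one element per line
    temp <- readLines(dct)
    # Capture groups: (1) start column, (2) storage type,
    # (3) variable name, (4) field width from the read format
    pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+([a-z0-9_]+)\\s+%([0-9]+).*"
    classes <- c("numeric", "character", "character", "numeric")
    metadata <- setNames(lapply(1:4, function(x) {
        # Keep only capture group x from each _column line
        out <- gsub(pattern, paste("\\", x, sep = ""), temp)
        # Trim whitespace and blank out the "dictionary ... {" and "}" lines
        out <- gsub("^\\s+|\\s+$|.*\\{|\\}", "", out)
        out <- out[out != ""]
        class(out) <- classes[x] ; out }), 
                         c("StartPos", "Str", "ColName", "ColWidth"))
    # Only the widths and names are passed on; StartPos is not used yet
    read.fwf(dat, widths = metadata[["ColWidth"]], 
             col.names = metadata[["ColName"]])
}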

There is still a lot you would have to do with respect to error checking, generalizing the function, and so on. For example, this function does not work with overlapping columns, as are present in the example that @thelatemail added to your question. A check that StartPos[n] + ColWidth[n] equals StartPos[n+1] could be used to stop with an error message whenever the columns are not contiguous. Additionally, the classes of the resulting data could be derived from the storage types in the "metadata" list generated by the function and passed to read.fwf through the colClasses argument.
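
A minimal sketch of that contiguity check, assuming the start positions and widths have already been extracted as numeric vectors. The helper name fwf_widths is hypothetical and not part of the function above. It stops on overlapping columns and, as an extra assumption beyond the check described here, turns gaps between fields into negative widths, which read.fwf treats as columns to skip.

```r
# Hypothetical helper (not part of the original answer): validate a column
# layout and build a `widths` vector suitable for read.fwf().
# `start` and `width` are numeric vectors of equal length, in dictionary order.
fwf_widths <- function(start, width) {
    stopifnot(length(start) == length(width))
    widths <- integer(0)
    pos <- 1                                   # next unread column in the line
    for (i in seq_along(start)) {
        if (start[i] < pos) {
            stop(sprintf("field %d starts at column %d but the previous field ends at column %d (overlap)",
                         i, start[i], pos - 1))
        }
        gap <- start[i] - pos
        if (gap > 0) widths <- c(widths, -gap)  # negative width = skip these columns
        widths <- c(widths, width[i])
        pos <- start[i] + width[i]
    }
    widths
}

# Contiguous layout from the test.dct used below: widths come back unchanged
fwf_widths(start = c(1, 2, 6, 7, 10), width = c(1, 4, 1, 3, 16))
```

The result could be passed as the widths argument of read.fwf in place of metadata[["ColWidth"]]; col.names would still list only the fields that are actually read.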

Here is a dat file and a dct file to demonstrate:

Copy and paste the following two lines into a text editor and save it in your working directory as "test.dat".

C1245A101George Costanza
B1223B011Cosmo Kramer

Copy and paste the following lines into a text editor and save it in your working directory as "test.dct"

dictionary using test.dat {
    _column(1)     str1     code   %1s
    _column(2)     int      call   %4f
    _column(6)     str1     city   %1s
    _column(7)     int      neigh  %3f
    _column(10)    str16    name   %16s
}

Now, run the function:

read.dat.dct(dat = "test.dat", dct = "test.dct")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
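
Note that this test.dct gives code a width of 1, whereas the Stata dictionary in the question reads it as %5s, which overlaps call. For that overlapping layout, one possible workaround is to slice each field out of the raw lines with substring() instead of using read.fwf. The sketch below assumes the start positions, widths and names are already known; read_overlapping is a hypothetical helper name, and the type guessing via type.convert is a simplification rather than a use of the dictionary's storage types.

```r
# Hypothetical helper (not part of the original answer): read fixed-width
# fields that may overlap by slicing each line with substring().
read_overlapping <- function(lines, start, width, col_names) {
    fields <- lapply(seq_along(start), function(i) {
        # substring() is vectorised over `lines`; trim padding afterwards
        trimws(substring(lines, start[i], start[i] + width[i] - 1))
    })
    out <- as.data.frame(setNames(fields, col_names), stringsAsFactors = FALSE)
    # Convert columns that look numeric; everything else stays character
    out[] <- lapply(out, type.convert, as.is = TRUE)
    out
}

# The two records and the overlapping layout from the Stata example
records <- c("C1245A101George Costanza", "B1223B011Cosmo Kramer")
read_overlapping(records,
                 start     = c(1, 2, 6, 7, 10),
                 width     = c(5, 4, 1, 3, 16),
                 col_names = c("code", "call", "city", "neigh", "name"))
```

For a file on disk, the records could come from readLines("test.dat"). This should reproduce the table shown in the question, with code read as the full five characters.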

Update: An improved function (with still a lot of room for improvement)

read.dat.dct <- function(dat, dct, labels.included = "no") {
    temp <- readLines(dct)
    # Keep only the lines that define a column
    temp <- temp[grepl("_column", temp)]
    # Pick the pattern depending on whether each line ends with a quoted label
    switch(labels.included,
           yes = {
               pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+)[a-z]\\s+(.*)"
               classes <- c("numeric", "character", "character", "numeric", "character")
               N <- 5
               NAMES <- c("StartPos", "Str", "ColName", "ColWidth", "ColLabel")
           },
           no = {
               pattern <- "_column\\(([0-9]+)\\)\\s+([a-z0-9]+)\\s+(.*)\\s+%([0-9]+).*"
               classes <- c("numeric", "character", "character", "numeric")
               N <- 4
               NAMES <- c("StartPos", "Str", "ColName", "ColWidth")
           })
    # Extract each capture group in turn, trim it, drop quotes, and coerce it
    metadata <- setNames(lapply(1:N, function(x) {
        out <- gsub(pattern, paste("\\", x, sep = ""), temp)
        out <- gsub("^\\s+|\\s+$", "", out)
        out <- gsub('\"', "", out, fixed = TRUE)
        class(out) <- classes[x] ; out }), NAMES)

    # Remove whitespace and make the names syntactically valid in R
    metadata[["ColName"]] <- make.names(gsub("\\s", "", metadata[["ColName"]]))

    myDF <- read.fwf(dat, widths = metadata[["ColWidth"]], 
             col.names = metadata[["ColName"]])
    # Attach the variable labels as an attribute of the data frame
    if (labels.included == "yes") {
        attr(myDF, "col.label") <- metadata[["ColLabel"]]
    }
    myDF
}

How does it work with your data?

temp <- read.dat.dct(dat = "http://dl.getdropbox.com/u/18116710/21600-0009-Data.txt", 
                     dct = "http://dl.getdropbox.com/u/18116710/21600-0009-Setup.dct",
                     labels.included = "yes")
dim(temp)                     # How big is the dataset?
# [1] 180  40
head(temp[, 1:6])             # What do the first few columns & rows look like?
#   CASEID      AID RRELNO RPREGNO H3PC1.H3PC1 H3PC2.H3PC2
# 1      1 57118381      5       1           1           1
# 2      2 57134970      1       2           1           1
# 3      3 57135078      1       1           1           1
# 4      4 57135078      5       1           1           1
# 5      5 57164981      1       1           7           3
# 6      6 57191909      1       3           1           1
head(attr(temp, "col.label")) # What are the variable labels?
# [1] "CASE IDENTIFICATION NUMBER"             "RESPONDENT IDENTIFIER"                 
# [3] "ROMANTIC RELATIONSHIP NUMBER"           "RELATIONSHIP PREGNANCY NUMBER"         
# [5] "S23Q1 1 TOLD PARTNER PREGNANT-W3"       "S23Q2 MONTHS PREG WHEN TOLD PARTNER-W3"

What about with the original example?

read.dat.dct("test.dat", "test.dct", labels.included = "no")
#   code call city neigh            name
# 1    C 1245    A   101 George Costanza
# 2    B 1223    B    11    Cosmo Kramer
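
One limitation worth noting: with labels.included = "yes", the pattern expects the width in the read format to be followed directly by a letter. A format with a decimal part, such as the %8.2f listed in the question, would therefore not match. Below is a sketch of a more tolerant way to split read formats. The helper parse_format is hypothetical and not wired into the function above, and it only reports the number of decimals without applying any implied-decimal scaling.

```r
# Hypothetical helper (not part of the original answer): split Stata-style
# read formats such as "%8.2f" into width, decimals and type letter.
parse_format <- function(fmt) {
    # %<width>[.<decimals>]<letter>; the decimal part is optional
    pattern <- "^%([0-9]+)(\\.([0-9]+))?([a-zA-Z])$"
    ok <- grepl(pattern, fmt)
    if (!all(ok)) {
        stop("unrecognised read format: ", paste(fmt[!ok], collapse = ", "))
    }
    decimals <- sub(pattern, "\\3", fmt)
    decimals[decimals == ""] <- "0"            # no decimal part given
    data.frame(width    = as.numeric(sub(pattern, "\\1", fmt)),
               decimals = as.numeric(decimals),
               type     = sub(pattern, "\\4", fmt),
               stringsAsFactors = FALSE)
}

parse_format(c("%2f", "%12s", "%8.2f", "%12.7f"))
```

The width column is what read.fwf needs. A similar optional group could be added to the "yes" pattern, at the cost of renumbering the capture groups.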
Respondent answered 9/1, 2013 at 18:52 Comment(5)
@Silt - not to be too snarky, but the people of stackoverflow are not your personal research staff. You asked a vague question, which I added to heavily to make it answerable. You have now been given a close-to-generalisable answer by Ananda. At some stage there is an expectation that you will build on the info provided rather than constantly moving the goal posts. - Nicolella
Thank you @thelatemail. I really appreciate all the great help you and others provide. It wasn't my intention to be snarky. Ananda's solution is simply great! I will try to go through it and see if I can solve my problem. The intention of my question was just to know if someone had done something similar before, but apparently this issue is not that common. It doesn't seem straightforward to read these .dat and .dct files directly from R. My last resort will be to use Stata first and then import into R. Thank you everyone again! - Silt
It looks great @Ananda Mahto. I still can't deal with my files. I added an example. Thanks! .dct file goo.gl/qHZOk data goo.gl/FRGRF - Silt
Thanks @AnandaMahto. I am not familiar with regular expressions, so I haven't been able to deal with this data format in the 4th column of one of my dct files: %12.7f. I tried using \\s+%(.*) but I get this: Warning message: In class(out) <- classes[x] : NAs introduced by coercion, and only NAs in the dataset produced. Your expression for this was: \\s+%([0-9]+)[a-z]. I will try with this: #5917582 - Silt
It worked well. Here you have another example: dct goo.gl/jmj9V dat goo.gl/Ix4yu. Most of the data I am using is restricted so I will use your function and if I have further problems I will let you know. I will try to improve the varname.varname variable name pattern using your function, so that we only get "varname". Thank you! - Silt

You may be able to read the dat files using ?read.fwf as the .dat data is essentially just a fixed width data file.

See here - Organizing Messy Notepad data - deriving the widths from the _column(X) start positions and read formats in the .dct dictionary file.

The dictionary file could be scraped using readLines to extract the info, which you could then pass to arguments in the read.fwf call.

E.g.: the 'variable names' align with the col.names= argument, and the 'storage types' align with the colClasses= argument.

There would be some manual handling in this though.
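
As a rough sketch of the storage-type mapping, assuming the dictionary uses the standard Stata storage types linked in the question (byte, int, long, float, double, strN and strL). The helper name storage_to_colclass is hypothetical, and mapping byte/int/long to "integer" assumes those fields carry no implied decimals.

```r
# Hypothetical helper: translate Stata storage types from a .dct file into
# values usable for the colClasses= argument of read.fwf().
storage_to_colclass <- function(storage) {
    storage <- tolower(storage)
    out <- rep(NA_character_, length(storage))
    out[storage %in% c("byte", "int", "long")] <- "integer"
    out[storage %in% c("float", "double")]     <- "numeric"
    out[grepl("^str([0-9]+|l)$", storage)]     <- "character"
    if (anyNA(out)) {
        stop("unknown storage type: ",
             paste(unique(storage[is.na(out)]), collapse = ", "))
    }
    out
}

# Storage types from the example dictionary in the question
storage_to_colclass(c("str5", "int", "str1", "int", "str16"))
```

The returned vector could then be passed as colClasses to read.fwf, alongside the widths and col.names scraped from the same dictionary.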

Nicolella answered 9/1, 2013 at 6:3 Comment(0)
