Read fixed width text file

Asked 17/1, 2013 at 16:33 Answered 31/3, 2016 at 14:34

I'm trying to load this ugly-formatted data-set into my R session: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

Weekly SST data starts week centered on 3Jan1990

Nino1+2      Nino3        Nino34        Nino4
Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA 
03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3 
10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3 
17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3

So far, i can read the lines with

  x = readLines(path)

But the file mixes 'white space' with '-' as separators, and i'm not a regex expert. I Appreciate any help on turning this into a nice and clean R data-frame. thanks!

Mozza answered 17/1, 2013 at 16:33 Comment(4)

And take a look at read.fwf to read read fixed width formatted data. – Panchito 17/1, 2013 at 16:37

I think it's a better idea to process each row. It mixes '-' with ' ' characters. – Mozza 17/1, 2013 at 16:38

Alternatively, you could say white-space or - is just one character, so first replace all multiple occurrences of a space with a tab character, then split all tab-separated entry's on - or white space. – Joettejoey 17/1, 2013 at 16:45

Fixed width = no separators. That means the "-" is a minus sign and the spaces are not separators either, they just occur when the number doesn't fill the entire available width – Barye 15/2, 2015 at 5:53

191

This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

            V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3

Update

The package readr (released April, 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",   
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf ().

Marrero answered 17/1, 2013 at 16:45 Comment(5)

@Marrero how did you know what were the widths and skips? – Harpoon 20/4, 2014 at 19:55

@Koba: I copied and pasted one of the lines into a text editor that had a column count and I manually counted the widths for each column (including whitespace when required). Also you can tell that you need to skip 4 whole lines before you get to the raw data. – Vita 21/4, 2014 at 1:57

@Pavithra's answer below with negative column widths for skipping unwanted whitespace might be better suited for the accepted answer. – Hungarian 19/11, 2014 at 20:49

@Marrero How did you get the fwf_widths values? – Afield 28/7, 2017 at 23:16

@Ala I believe readr::fwf_empty will attempt to guess the widths for you. The examples for readr::read_fwf shows the usage for readr::fwf_empty. – Rubin 30/7, 2017 at 19:56

Another way to determine widths...

df <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  widths=c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  skip=4
)

The -1 in the widths argument says there is a one-character column that should be ignored,the -5 in the widths argument says there is a five-character column that should be ignored, likewise...

ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6

Bromine answered 12/6, 2014 at 21:42 Comment(0)

First off, that question is directly from a the Coursera "Get Data and Clean It" course by Leeks. While there is another part of the question, the tough part is reading the file.

That said, the course is mostly intended for learning.

I hate R's fixed width procedure. It is slow and for large number of variables, it very quickly becomes a pain to negate certain columns, etc.

I think its easier to use readLines() and then from that use substr() to make your variables

x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))

# Skip 4 lines
x <- x[-(1:4)]

mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32)  # and so on and so on
                     )

Poplar answered 17/10, 2014 at 18:35 Comment(1)

This approach worked for me. Two additional tips: 1) you can define mydata to be only the data you need. So it could be as simple as mydata <- data.frame(var4 = substr(x,29,32)) if you only need the fourth column of data. Also, for Windows users, Notepad++ with the TextFX plugin will get you a plain and simple, counted character ruler so you can figure out what to put in the start and stop values in substr. Note, however that the stop value is one more than the position of the last character you want to preserve. – Boyles 19/3, 2017 at 11:22

You can now use the read_fwf() function in Hadley Wickham's readr package.

Annoucement: http://blog.rstudio.org/2015/04/09/readr-0-1-0/
Development page: https://github.com/hadley/readr
CRAN page: http://cran.r-project.org/web/packages/readr/index.html

A huge performance improvement is to be expected, compared to base read.fwf().

Remainderman answered 10/4, 2015 at 10:20 Comment(0)

I document here the list of alternatives for reading fixed-width files in R, as well as providing some benchmarks for which is fastest.

My preferred approach is to combine fread with stringi; it's competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table:

library(data.table)
library(stringi)

col_ends <- 
  list(beg = c(1, 10, 15, 19, 23, 28, 32, 36,
               41, 45, 49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

data = fread(
  "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for", 
  header = FALSE, skip = 4L, sep = NULL
  )[, lapply(1:(length(col_ends$beg)),
             function(ii) 
               stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))
    ][ , paste0("V", c(2, 5, 8, 11)) := NULL]
#              V1   V3   V4   V6   V7   V9  V10  V12  V13
#    1: 03JAN1990 23.4 -0.4 25.1 -0.3 26.6  0.0 28.6  0.3
#    2: 10JAN1990 23.4 -0.8 25.2 -0.3 26.6  0.1 28.6  0.3
#    3: 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6  0.3
#    4: 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4  0.2
#    5: 31JAN1990 25.1 -0.2 25.8 -0.2 26.7  0.1 28.4  0.2
#   ---                                                  
# 1365: 24FEB2016 27.1  0.9 28.4  1.8 29.0  2.1 29.5  1.4
# 1366: 02MAR2016 27.3  1.0 28.6  1.8 28.9  1.9 29.5  1.4
# 1367: 09MAR2016 27.7  1.2 28.6  1.6 28.9  1.8 29.6  1.5
# 1368: 16MAR2016 27.5  1.0 28.8  1.7 28.9  1.7 29.6  1.4
# 1369: 23MAR2016 27.2  0.9 28.6  1.4 28.8  1.5 29.5  1.2

Note that fread automatically strips leading and trailing whitespace -- sometimes, this is undesirable, in which case set strip.white = FALSE.

We could also have started with a vector of column widths ww by doing:

ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)

col_ends <-
  list(beg = c(1, nd[-length(nd)]+1L),
       end = nd)

And we could have picked which columns to exclude more robustly by using negative indices like:

col_ends <- 
  list(beg = c(1, -10, 15, 19, -23, 28, 32, -36,
               41, 45, -49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

Then replace col_ends$beg[ii] with abs(col_ends$beg[ii]) and in the next line:

paste0("V", which(col_ends$beg < 0))

Lastly, if you want the column names to be read programatically as well, you could clean up with readLines:

cols <-
  gsub("\\s", "", 
       sapply(1:(length(col_ends$beg)),
              function(ii) 
                stri_sub(readLines(URL, n = 4L)[4L], 
                         col_ends$beg[ii]+1L,
                         col_ends$end[ii]+1L)))

cols <- cols[cols != ""]

(note that combining this step with fread would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets)

Heliostat answered 31/3, 2016 at 14:34 Comment(0)

I don't know a thing about R, but I can provide you with a regex that will match such lines:

\s[0-9]{2}[A-Z]{3}[0-9]{4}(\s{5}[0-9]+\.[0-9]+[ -][0-9]+\.[0-9]+){4}

Deanndeanna answered 17/1, 2013 at 16:43 Comment(0)

Recommended topics

Hot tags