how to read text files and create a data frame in R

Asked 28/10, 2015 at 5:6 Answered 29/10, 2015 at 3:5

Need to read the txt file in https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt

and convert them into a data frame R with column number as: LastName, FirstName, streetno, streetname, city, state, and zip...

Tried to use sep command to separate them but failed...

Prescind answered 28/10, 2015 at 5:6 Comment(2)

For starters, dat = readLines("addr.txt") will get the data into R with each row as a text string. Then you'll have to parse it into a data frame using string manipulation functions. As @3209Cigs said, better to ask this question on StackOverflow. – Unconcerned 28/10, 2015 at 5:27

Actually, I think all you need to do after readLines is dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}"))) and you'll have the data in a data frame. Then just change the column names to the desired values. – Unconcerned 28/10, 2015 at 5:34

Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for.

library(stringr) # For str_trim 

# Read string data and split into data frame
dat = readLines("addr.txt")
dat = as.data.frame(do.call(rbind, strsplit(dat, split=" {2,10}")), stringsAsFactors=FALSE)
names(dat) = c("LastName", "FirstName", "address", "city", "state", "zip")

# Separate address into number and street (if streetno isn't always numeric,
# or if you don't want it to be numeric, then just remove the as.numeric wrapper).
dat$streetno = as.numeric(gsub("([0-9]{1,4}).*","\\1",  dat$address))
dat$streetname = gsub("[0-9]{1,4} (.*)","\\1",  dat$address)

# Clean up zip
dat$zip = gsub("O","0", dat$zip)
dat$zip = str_trim(dat$zip)

dat = dat[,c(1:2,7:8,4:6)]

dat
      LastName  FirstName streetno           streetname       city state        zip
1        Bania  Thomas M.      725    Commonwealth Ave.     Boston    MA      02215
2      Barnaby      David      373        W. Geneva St.   Wms. Bay    WI      53191
3       Bausch       Judy      373        W. Geneva St.   Wms. Bay    WI      53191
...
41      Wright       Greg      791  Holmdel-Keyport Rd.    Holmdel    NY 07733-1988
42     Zingale    Michael     5640        S. Ellis Ave.    Chicago    IL      60637

Unconcerned answered 29/10, 2015 at 2:26 Comment(0)

Try this.

x<-scan("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt" , 
  what = list(LastName="", FirstName="", streetno="", streetname="", city="", state="",zip=""))

data<-as.data.frame(x)

Lachance answered 28/10, 2015 at 6:44 Comment(1)

The content of the data frame is not consistent with expected, try head( data ) – Jenks 28/10, 2015 at 7:19

I found it easiest to fix up the file into a csv by adding the commas where they belong, then read it.

## get the page as text
txt <- RCurl::getURL(
    "https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt"
)
## fix the EOL (end-of-line) markers
g1 <- gsub(" \n", "\n", txt, fixed = TRUE)
## read it
df <- read.csv(
    ## add most comma-separators, then the last for the house number
    text = gsub("(\\d+) (\\D+)", "\\1,\\2", gsub("\\s{2,}", ",", g1)), 
    header = FALSE,
    ## set the column names
    col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip")
)
## result
head(df)
#     LastName  FirstName streetno        streetname     city state   zip
# 1      Bania  Thomas M.      725 Commonwealth Ave.   Boston    MA O2215
# 2    Barnaby      David      373     W. Geneva St. Wms. Bay    WI 53191
# 3     Bausch       Judy      373     W. Geneva St. Wms. Bay    WI 53191
# 4    Bolatto    Alberto      725 Commonwealth Ave.   Boston    MA O2215
# 5  Carlstrom       John      933       E. 56th St.  Chicago    IL 60637
# 6 Chamberlin Richard A.      111        Nowelo St.     Hilo    HI 96720

Borisborja answered 29/10, 2015 at 3:5 Comment(0)

Here your problem is not how to use R to read in this data, but rather it's that your data is not sufficiently structured using regular delimiters between the variable-length fields you have as inputs. In addition, the zip code field contains some alpha "O" characters that should be "0".

So here is a way to use regular expression substitution to add in delimiters, and then parse the delimited text using read.csv(). Note that depending on exceptions in your full set of text, you may need to adjust the regular expressions. I have done them step by step here to make it clear what is being done and so that you can adjust them as you find exceptions in your input text. (For instance, some city names like `Wms. Bay" are two words.)

addr.txt <- readLines("https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt")
addr.txt <- gsub("\\s+O(\\d{4})", " 0\\1", addr.txt)       # replace O with 0 in zip
addr.txt <- gsub("(\\s+)([A-Z]{2})", ", \\2", addr.txt)    # state
addr.txt <- gsub("\\s+(\\d{5}(\\-\\d{4}){0,1})\\s*", ", \\1", addr.txt) # zip
addr.txt <- gsub("\\s+(\\d{1,4})\\s", ", \\1, ", addr.txt) # streetno
addr.txt <- gsub("(^\\w*)(\\s+)", "\\1, ", addr.txt)       # LastName (FirstName)
addr.txt <- gsub("\\s{2,}", ", ", addr.txt)                # city, by elimination

addr <- read.csv(textConnection(addr.txt), header = FALSE,
                 col.names = c("LastName", "FirstName", "streetno", "streetname", "city", "state", "zip"),
                 stringsAsFactors = FALSE)
head(addr)
##     LastName   FirstName streetno         streetname      city state    zip
## 1      Bania   Thomas M.      725  Commonwealth Ave.    Boston    MA  02215
## 2    Barnaby       David      373      W. Geneva St.  Wms. Bay    WI  53191
## 3     Bausch        Judy      373      W. Geneva St.  Wms. Bay    WI  53191
## 4    Bolatto     Alberto      725  Commonwealth Ave.    Boston    MA  02215
## 5  Carlstrom        John      933        E. 56th St.   Chicago    IL  60637
## 6 Chamberlin  Richard A.      111         Nowelo St.      Hilo    HI  96720

Leslee answered 28/10, 2015 at 15:21 Comment(0)

Recommended topics

Hot tags