How do I read a fixed width format text file in pandas?
Asked Answered
D

4

16

I just got my hands on pandas and am figuring out how I can read a file. The file is from the WRDS database and is the SP500 constituents list all the way back to the 1960s. I checked the file and no matter what I do to import it using read_csv, I still can't display the data correctly.

df = read_csv('sp500-sb.txt')

df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 0 to 1230
Data columns: gvkeyx      from      thru     conm
                                        gvkey      co_conm
...(the column names)
dtypes: object(1)

What does the above chunk of output mean? Anything would be helpful.

Dorri answered 15/3, 2012 at 14:8 Comment(2)
That Pandas may have eaten your data?Welldressed
Could you show the first few lines of the file?Forfeit
D
10

Wes answered me in an email. Cheers.

This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.

I recommend following the issue on github as your e-mail is about to disappear from my inbox :)

https://github.com/pydata/pandas/issues/920

best, Wes

Dorri answered 16/3, 2012 at 1:26 Comment(1)
Follow up to the answer. Reading fix width format is now implementing by read_fwf(). See pandas.pydata.org/pandas-docs/dev/…Fraternity
C
27

pandas.read_fwf() was added in pandas 0.7.3 (April 2012) to handle fixed-width files.

  1. API reference

  2. An example from other question

Counteraccusation answered 27/10, 2015 at 6:15 Comment(0)
D
10

Wes answered me in an email. Cheers.

This is a fixed-width-format file (not delimited by commas or tabs as usual). I realize that pandas does not have a fixed-width reader like R does, though one can be fashioned very easily. I'll see what I can do. In the meantime if you can export the data in another format (like csv--truly comma separated) you'll be able to read it with read_csv. I suspect with some unix magic you can transform a FWF file into a CSV file.

I recommend following the issue on github as your e-mail is about to disappear from my inbox :)

https://github.com/pydata/pandas/issues/920

best, Wes

Dorri answered 16/3, 2012 at 1:26 Comment(1)
Follow up to the answer. Reading fix width format is now implementing by read_fwf(). See pandas.pydata.org/pandas-docs/dev/…Fraternity
C
0

What do you mean by display? Doesn't df['gvkey'] give you the data in the gvkey column?

If what you do is print the whole data frame to the console, then take a look at df.to_string(), but it'll be hard to read if you have too many columns. Pandas won't print the whole thing by default if you have too many columns:

import pandas
import numpy 

df1 = pandas.DataFrame(numpy.random.randn(10, 3), columns=['col%d' % d for d in range(3)] )
df2 = pandas.DataFrame(numpy.random.randn(10, 30), columns=['col%d' % d for d in range(30)] )

print df1   # <--- substitute by df2 to see the difference
print
print df1['col1']
print
print df1.to_string()
Conversion answered 15/3, 2012 at 23:28 Comment(2)
thanks TR. My data file is a fixed width format...currently not supported. Wes has graciously added to the source forge pandas issue list.Dorri
Do you still have access to WRDS? You should have an option there to output the file in csv format.Conversion
C
0

user, if you need to deal with the fixed format right now, you can use something like the following:

def fixed_width_to_items(filename, fields, first_column_is_index=False, ignore_first_rows=0):
    reader = open(filename, 'r')
    # skip first rows 
    for i in xrange(ignore_first_rows):
        reader.next()
    if first_column_is_index:
        index = slice(0, fields[1])
        fields = [slice(*x) for x  in zip(fields[1:-1], fields[2:])]
        return ((line[index], [line[x].strip() for x in fields]) for line in reader)
    else:
        fields = [slice(*x) for x  in zip(fields[:-1], fields[1:])]
        return ((i, [line[x].strip() for x in fields]) for i,line in enumerate(reader)) 

Here's a test program:

import pandas
import numpy
import tempfile

# create a data frame
df = pandas.DataFrame(numpy.random.randn(100, 5))
file_ = tempfile.NamedTemporaryFile(delete=True)
file_.write(df.to_string())
file_.flush()

# specify fields
fields = [0, 3, 12, 22, 32, 42, 52]
df2 = pandas.DataFrame.from_items( fixed_width_to_items(file_.name, fields, first_column_is_index=True, ignore_first_rows=1) ).T

# need to specify the datatypes, otherwise everything is a string
df2 = pandas.DataFrame(df2, dtype=float)
df2.index = [int(x) for x in df2.index]

# check
assert (df - df2).abs().max().max() < 1E-6

This should do the trick if you need it right now, but bear in mind that the function above is very simple, in particular it doesn't do anything about data types.

Conversion answered 17/3, 2012 at 19:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.