Create Pandas DataFrame from a string
Asked Answered
J

7

485

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

Jesselton answered 24/3, 2014 at 8:43 Comment(0)
J
819

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")
Jesselton answered 24/3, 2014 at 9:21 Comment(8)
FYI - pd.read_table() is an equivalent function, just slightly better nomenclature: df = pd.read_table(TESTDATA, sep=";").Infiltration
@AntonvBR Noted that one could use pandas.compat.StringIO. That way we don't have to import StringIO separately. However the pandas.compat package is considered private according to pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now.Jesselton
Time to sort out which import: Should we use pandas.compat.StringIO or Python 2/3 StringIO?Lyra
If you create TESTDATA with df.to_csv(TESTDATA), use TESTDATA.seek(0)Ruffi
I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',)Ambrosia
This is what I needed to do when uploading CSV files in colab. In order to make a dataframe out of the CSV, I used this methodKneedeep
@Ambrosia The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter.Tobe
@Asclepius the suggestion to use pandas.compat only works for versions before pandas 0.25, but will now raise ImportError.Townswoman
C
64

In one line, but first import io

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=";")
print(df)
Croat answered 26/11, 2020 at 9:10 Comment(2)
What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary.Koehler
I like this answer because it is explained with one example in one line.Terzetto
S
59

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
Shill answered 21/2, 2019 at 17:27 Comment(4)
If you want the first line to be used for column names, change the 2nd line to this: df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')])Brainwash
This is wrong, since on CSV files the newline (\n) character can be part of a field.Wagoner
This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at thomasburette.com/blog/2014/05/25/…Literacy
This is a weak and unnecessary approach. This circumvents the built-in pd.read_csv() call which handles any CSV and has been battle-tested in favor of a naive couple of string splits that don't handle quoting and are liable to fail in many cases. It's also more verbose. Please use the CSV parser Pandas kindly gives you!Nachison
G
26

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

Copy data for pasting into a Pandas dataframe

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe
Gratiana answered 30/11, 2018 at 8:17 Comment(1)
Not good for reproducibility, but otherwise a pretty neat solution!Brainwash
A
9

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using read_csv

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

Antimasque answered 28/9, 2017 at 14:42 Comment(1)
I found (by trial&error) that read_fwf takes more of read_csvs arguments than is documented, but it's true that some have no effect.Courtneycourtrai
B
4

Object: Take string make dataframe.

Solution

def str2frame(estr, sep = ',', lineterm = '\n', set_header = True):
    dat = [x.split(sep) for x in estr.strip(lineterm).split(lineterm)]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df

Example

estr = """
sym,date,strike,genus
APPLE,20MAY20,50.0,Malus
ORANGE,22JUL20,50.0,Rutaceae
"""

df = str2frame(estr)

print(df)
0     sym     date strike     genus
1   APPLE  20MAY20   50.0     Malus
2  ORANGE  22JUL20   50.0  Rutaceae
Bradbradan answered 18/6, 2021 at 9:35 Comment(4)
Please use descriptive variable names and not estr, dat, cdf.Kerby
The style is intentional; feel free to copy this answer and use your own notation.Bradbradan
Same remark as this answer. Please don't reinvent the wheel, just use pd.read_csv().Nachison
If you have one, a better solution depending only on pandas would be appreciated.Bradbradan
H
0

Emample:

text = [ ['This is the NLP TASKS ARTICLE written by Anjum**'] ,['IN this article I”ll be explaining various DATA-CLEANING techniques '], ['So stay tuned for FURther More && '],['Nah I dont think he goes to usf ; he lives around']]
df = pd.DataFrame({'text':text})

Output enter image description here

Hypnogenesis answered 20/11, 2022 at 8:51 Comment(3)
This has nothing to do with the question. OP wants to read a CSV from a string, completely missing here.Nachison
@ggorlen. First see the question: "Create Pandas DataFrame from a string" and then see my answer. lolHypnogenesis
Sorry, I did that and don't see what you mean. There's no CSV string in your answer. You've hardcoded a list, completely bypassing a conversion step that all of the other answers here address. The question has nothing to do with converting lists to dataframes. Even if you were going to hardcode lists, this is a poor solution because it adds extra single-element inner list wrappers on each of the strings for no apparent reason, making the text column an object rather than a string datatype.Nachison

© 2022 - 2024 — McMap. All rights reserved.