Create Pandas DataFrame from a string

Asked 24/3, 2014 at 8:43 Answered 20/11, 2022 at 8:51

Solved python string pandas csv csv-import

485

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

What is the simplest way to read that data into a Pandas DataFrame?

Jesselton answered 24/3, 2014 at 8:43 Comment(0)

819

A simple way to do this is to use StringIO.StringIO (python2) or io.StringIO (python3) and pass that to the pandas.read_csv function. E.g:

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

import pandas as pd

TESTDATA = StringIO("""col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(TESTDATA, sep=";")

Jesselton answered 24/3, 2014 at 9:21 Comment(8)

FYI - pd.read_table() is an equivalent function, just slightly better nomenclature: df = pd.read_table(TESTDATA, sep=";"). – Infiltration 6/12, 2017 at 23:17

@AntonvBR Noted that one could use pandas.compat.StringIO. That way we don't have to import StringIO separately. However the pandas.compat package is considered private according to pandas.pydata.org/pandas-docs/stable/api.html?highlight=compat so leaving the answer as is for now. – Jesselton 12/12, 2017 at 6:4

Time to sort out which import: Should we use pandas.compat.StringIO or Python 2/3 StringIO? – Lyra 11/5, 2018 at 0:30

If you create TESTDATA with df.to_csv(TESTDATA), use TESTDATA.seek(0) – Ruffi 24/10, 2019 at 8:30

I receive 'Error tokenizing data. C error: Expected 2 fields in line 26, saw 12\n',) – Ambrosia 31/7, 2020 at 8:49

This is what I needed to do when uploading CSV files in colab. In order to make a dataframe out of the CSV, I used this method – Kneedeep 5/1, 2021 at 13:12

@Ambrosia The "Expected fields error" is usually due to inconsistent number of fields in each row, or wrong sep/delimiter. – Tobe 26/3, 2021 at 16:38

@Asclepius the suggestion to use pandas.compat only works for versions before pandas 0.25, but will now raise ImportError. – Townswoman 22/12, 2021 at 0:11

In one line, but first import io

import pandas as pd
import io   

TESTDATA="""col1;col2;col3
1;4.4;99
2;4.5;200
3;4.7;65
4;3.2;140
"""

df = pd.read_csv(io.StringIO(TESTDATA), sep=";")
print(df)

Croat answered 26/11, 2020 at 9:10 Comment(2)

What is the difference between this and the accepted answer? Except you move io operation to read_csv, which makes no difference... Please always check if similar answer is not posted already, redundancy is unnecessary. – Koehler 26/11, 2020 at 9:35

I like this answer because it is explained with one example in one line. – Terzetto 6/6, 2022 at 18:19

Split Method

data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)

Shill answered 21/2, 2019 at 17:27 Comment(4)

If you want the first line to be used for column names, change the 2nd line to this: df = pd.DataFrame([x.split(';') for x in data.split('\n')[1:]], columns=[x for x in data.split('\n')[0].split(';')]) – Brainwash 18/10, 2019 at 1:34

This is wrong, since on CSV files the newline (\n) character can be part of a field. – Wagoner 3/4, 2020 at 13:13

This is not very robust, and most people would be better with the accepted answer. There is a very partial list of things that can go wrong with this at thomasburette.com/blog/2014/05/25/… – Literacy 15/5, 2020 at 17:21

This is a weak and unnecessary approach. This circumvents the built-in pd.read_csv() call which handles any CSV and has been battle-tested in favor of a naive couple of string splits that don't handle quoting and are liable to fail in many cases. It's also more verbose. Please use the CSV parser Pandas kindly gives you! – Nachison 16/3, 2023 at 4:59

A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

Select the content of the string with your mouse:

In the Python shell use read_clipboard()

>>> pd.read_clipboard()
  col1;col2;col3
0       1;4.4;99
1      2;4.5;200
2       3;4.7;65
3      4;3.2;140

Use the appropriate separator:

>>> pd.read_clipboard(sep=';')
   col1  col2  col3
0     1   4.4    99
1     2   4.5   200
2     3   4.7    65
3     4   3.2   140

>>> df = pd.read_clipboard(sep=';') # save to dataframe

Gratiana answered 30/11, 2018 at 8:17 Comment(1)

Not good for reproducibility, but otherwise a pretty neat solution! – Brainwash 18/10, 2019 at 1:27

This answer applies when a string is manually entered, not when it's read from somewhere.

A traditional variable-width CSV is unreadable for storing data as a string variable. Especially for use inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin to format pipe-separated text into a neat table.

Using `read_csv`

Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Read a Pandas object from a pipe-separated table contained within a string.

    Input example:
        | int_score | ext_score | eligible |
        |           | 701       | True     |
        | 221.3     | 0         | False    |
        |           | 576       | True     |
        | 300       | 600       | True     |

    The leading and trailing pipes are optional, but if one is present,
    so must be the other.

    `kwargs` are passed to `read_csv`. They must not include `sep`.

    In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
    be used to neatly format a table.

    Ref: https://stackoverflow.com/a/46471952/
    """

    substitutions = [
        ('^ *', ''),  # Remove leading spaces
        (' *$', ''),  # Remove trailing spaces
        (r' *\| *', '|'),  # Remove spaces between columns
    ]
    if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
        substitutions.extend([
            (r'^\|', ''),  # Remove redundant leading delimiter
            (r'\|$', ''),  # Remove redundant trailing delimiter
        ])
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)

Non-working alternatives

The code below doesn't work properly because it adds an empty column on both the left and right sides.

df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')

As for read_fwf, it doesn't actually use so many of the optional kwargs that read_csv accepts and uses. As such, it shouldn't be used at all for pipe-separated data.

Antimasque answered 28/9, 2017 at 14:42 Comment(1)

I found (by trial&error) that read_fwf takes more of read_csvs arguments than is documented, but it's true that some have no effect. – Courtneycourtrai 20/1, 2020 at 16:12

Object: Take string make dataframe.

Solution

def str2frame(estr, sep = ',', lineterm = '\n', set_header = True):
    dat = [x.split(sep) for x in estr.strip(lineterm).split(lineterm)]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop = True).T # flip, set ix, flip back
    return df

Example

estr = """
sym,date,strike,genus
APPLE,20MAY20,50.0,Malus
ORANGE,22JUL20,50.0,Rutaceae
"""

df = str2frame(estr)

print(df)
0     sym     date strike     genus
1   APPLE  20MAY20   50.0     Malus
2  ORANGE  22JUL20   50.0  Rutaceae

Bradbradan answered 18/6, 2021 at 9:35 Comment(4)

Please use descriptive variable names and not estr, dat, cdf. – Kerby 7/9, 2022 at 7:3

The style is intentional; feel free to copy this answer and use your own notation. – Bradbradan 24/11, 2022 at 16:24

Same remark as this answer. Please don't reinvent the wheel, just use pd.read_csv(). – Nachison 16/3, 2023 at 5:1

If you have one, a better solution depending only on pandas would be appreciated. – Bradbradan 18/3, 2023 at 13:49

Emample:

text = [ ['This is the NLP TASKS ARTICLE written by Anjum**'] ,['IN this article I”ll be explaining various DATA-CLEANING techniques '], ['So stay tuned for FURther More && '],['Nah I dont think he goes to usf ; he lives around']]
df = pd.DataFrame({'text':text})

Output

Hypnogenesis answered 20/11, 2022 at 8:51 Comment(3)

This has nothing to do with the question. OP wants to read a CSV from a string, completely missing here. – Nachison 16/3, 2023 at 5:2

@ggorlen. First see the question: "Create Pandas DataFrame from a string" and then see my answer. lol – Hypnogenesis 4/4, 2023 at 19:53

Sorry, I did that and don't see what you mean. There's no CSV string in your answer. You've hardcoded a list, completely bypassing a conversion step that all of the other answers here address. The question has nothing to do with converting lists to dataframes. Even if you were going to hardcode lists, this is a poor solution because it adds extra single-element inner list wrappers on each of the strings for no apparent reason, making the text column an object rather than a string datatype. – Nachison 4/4, 2023 at 19:54

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Using read_csv

Non-working alternatives

Solution

Example

Recommended topics

Hot tags

Using `read_csv`