How to make good reproducible pandas examples

Having spent a decent amount of time watching both the r and pandas tags on SO, the impression I get is that pandas questions are less likely to contain reproducible data. This is something that the R community has been pretty good about encouraging, and thanks to guides like this, newcomers are able to get some help putting together these examples. People who read these guides and come back with reproducible data will often have much better luck getting answers to their questions.

How can we create good reproducible examples for pandas questions? Simple dataframes can be put together, e.g.:

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane', 'Alice'], 
                   'income': [40000, 50000, 42000]})

But many example datasets need more complicated structure, e.g.:

  • datetime indices or data
  • Multiple categorical variables (is there an equivalent to R's expand.grid() function, which produces all possible combinations of some given variables?)
  • MultiIndex data

For datasets that are hard to mock up using a few lines of code, is there an equivalent to R's dput() that allows you to generate copy-pasteable code to regenerate your datastructure?

Siemens answered 20/11, 2013 at 23:31 Comment(2)
If you copy the output of printing, most of the time answerers can use read_clipboard()... except for MultiIndex :s. That said, dict is a good addition. – Goff
In addition to what Andy said, I think copy-pasting df.head(N).to_dict(), where N is some reasonable number, is a good way to go. Bonus +1's for adding pretty line breaks to the output. For timestamps, you'll typically just need to add from pandas import Timestamp to the top of the code. – Twentyfour

Note: Most of the ideas here are pretty generic for Stack Overflow, indeed questions in general. See Minimal, Reproducible Example or Short, Self Contained, Correct Example.

Disclaimer: Writing a good question is hard.

The Good:

  • Do include a small example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it "copy and pasteable" by printing it, so that readers can parse the output with pd.read_clipboard(sep=r'\s\s+'):

    In [2]: df
    Out[2]:
       A  B
    0  1  2
    1  1  3
    2  4  6
    

    Test it yourself to make sure it works and reproduces the issue.
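If the clipboard isn't available (e.g. you run Python on a remote server, as noted in the comments below), the same printed table can be pasted into a string and parsed with io.StringIO instead. A minimal sketch, reusing the frame above:

```python
import pandas as pd
from io import StringIO

# The printed DataFrame, pasted as a string instead of via the clipboard.
text = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# Whitespace-separated parsing; because the header row has one fewer
# field than the data rows, pandas uses the first column as the index,
# just like read_clipboard would.
df = pd.read_csv(StringIO(text), sep=r'\s+')
```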

    • You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.

    • I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows,[citation needed] and I bet I can do it in 5. Can you reproduce the error with df = df.head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

      But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate:

      df = pd.DataFrame(np.random.randn(100000000, 10))
      

      Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.

    • For getting runnable code, df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the data and columns from df.to_dict('split').
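For the example frame above, that looks like the following sketch (the rebuilt df2 just shows the round trip):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# orient='split' separates data, columns, and index, which makes it
# easy to paste the pieces back into a constructor.
d = df.to_dict('split')
# d == {'index': [0, 1, 2], 'columns': ['A', 'B'],
#       'data': [[1, 2], [1, 3], [4, 6]]}

# Rebuild the frame from the dict:
df2 = pd.DataFrame(d['data'], columns=d['columns'], index=d['index'])
```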

  • Write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]:
       A  B
    0  1  5
    1  4  6
    

    Explain where the numbers come from:

    The 5 is the sum of the B column for the rows where A is 1.

  • Do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]:
       B
    A
    1  5
    4  6
    

    But say what's incorrect:

    The A column is in the index rather than a column.

  • Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby documentation doesn't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().
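Putting the pieces together, that aside runs as:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# as_index=False keeps 'A' as a regular column instead of the index.
result = df.groupby('A', as_index=False).sum()
#    A  B
# 0  1  5
# 1  4  6
```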

  • If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.

    df['date'] = pd.to_datetime(df['date'])  # this column ought to be datetime
    

    Sometimes this is the issue itself: they were strings.
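A quick check of the dtype before and after conversion makes that failure mode obvious. A minimal sketch (the column names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2013-01-01', '2013-01-02'], 'value': [1, 2]})
print(df['date'].dtype)   # object: the "dates" are still strings

df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)   # datetime64[ns]
```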

The Bad:

  • Don't include a MultiIndex, which we can't copy and paste (see above). This is kind of a grievance with Pandas' default display, but nonetheless annoying:

    In [11]: df
    Out[11]:
         C
    A B
    1 2  3
      2  6
    

    The correct way is to include an ordinary DataFrame with a set_index call:

    In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 6]], columns=['A', 'B', 'C'])
    
    In [13]: df = df.set_index(['A', 'B'])
    
    In [14]: df
    Out[14]:
         C
    A B
    1 2  3
      2  6
    
  • Do provide insight into what it is when giving the outcome you want:

       B
    A
    1  1
    5  0
    

    Be specific about how you got the numbers (what are they)... double check they're correct.

  • If your code throws an error, do include the entire stack trace. This can be edited out later if it's too noisy. Show the line number and the corresponding line of your code which it's raising against.

  • Pandas 2.0 introduced a number of changes, and Pandas 1.0 before that, so if you're getting unexpected output, include the version:

    pd.__version__
    

    On that note, you might also want to include the version of Python, your OS, and any other libraries. You could use pd.show_versions() or the session_info package (which shows loaded libraries and Jupyter/IPython environment).

The Ugly:

  • Don't link to a CSV file we don't have access to (and ideally don't link to an external source at all).

    df = pd.read_csv('my_secret_file.csv') # ideally with lots of parsing options
    

    Most data is proprietary, we get that. Make up similar data and see if you can reproduce the problem (something small).

  • Don't explain the situation vaguely in words, like you have a DataFrame which is "large", mention some of the column names in passing (be sure not to mention their dtypes). Try and go into lots of detail about something which is completely meaningless without seeing the actual context. Presumably no one is even going to read to the end of this paragraph.

    Essays are bad; it's easier with small examples.

  • Don't include 10+ (100+??) lines of data munging before getting to your actual question.

    Please, we see enough of this in our day jobs. We want to help, but not like this.... Cut the intro, and just show the relevant DataFrames (or small versions of them) in the step which is causing you trouble.

Goff answered 23/11, 2013 at 6:19 Comment(18)
+1 for the pd.read_clipboard(sep='\s\s+') tip. When I post SO questions that need a special but easily shared dataframe, like this one, I build it in Excel, copy it to my clipboard, then instruct SOers to do the same. Saves so much time! – Irremeable
The pd.read_clipboard(sep='\s\s+') suggestion does not seem to work if you're using Python on a remote server, which is where a lot of large data sets live. – Phrygia
Why pd.read_clipboard(sep='\s\s+'), and not a simpler pd.read_clipboard() (with the default '\s+')? The first needs at least 2 whitespace characters, which may cause problems if there is only 1 (e.g. see such in @JohnE's answer). – Weighted
@Weighted the reason that \s\s+ is so popular is that there is often one space e.g. in a column name, but multiple is rarer, and pandas output nicely puts in at least two between columns. Since this is just for toy/small datasets it's pretty powerful and covers the majority of cases. Note: tab-separated would be a different story, though Stack Overflow replaces tabs with spaces, but if you have a tsv then just use \t. – Goff
Ugh, I always use pd.read_clipboard(). When there are spaces, I do: pd.read_clipboard(sep='\s{2,}', engine='python') :P – Dominance
I think it should be mentioned that df.head().to_dict() often produces a minimal representation of the dataset which can then be used as copy-and-paste code for the question. When it doesn't, it's usually because there are too many columns. If there are too many columns, a slice of the columns using df.columns[...] or df.select_dtypes will be very helpful. – Grasso
Concerning read_csv: you can use StringIO to import the data from a string. That way you can mimic this as: import pandas as pd; from io import StringIO; text = """ Product,Perc,Storage,Price Azure,(2.4%,Server,£540 AWS,,Server,£640 GCP,,Server,£540 """; data = pd.read_csv(StringIO(text)) – Luba
Could we add a point about column alignment? I wrote a gist with what I'm proposing. In short, tabs get mangled, so make sure to use print(df) instead of just df if you're on a GUI. I'm not a Pandas expert, and Andy's account has been inactive for years, so I'm not sure about just going ahead and adding it myself. – Sarilda
It's worth noting that negative numbers and datetime columns only get one space of separation when printed. – Sarilda
The Ugly: Don't link to a CSV. lol, even worse are all these answers. Nobody has presented a good way of converting actual csv to df; csv has properties in columns but DataFrames have properties in rows. – Cribb
@Cribb Sorry, what are you talking about exactly? If you have a question about parsing a CSV into a df, you can post the CSV; nobody's arguing against that. Andy's saying, if you have data, you need to post it; you can't just put a filename in your code and expect us to assume what the contents are. And I'm not sure what you mean about columns vs rows; CSVs and DFs are actually laid out the same in that respect... – Sarilda
Look at the OP's question and consider the data compared to a csv. Again, e.g.: df = pd.DataFrame({'num_legs': [2, 4, 8, 0], 'num_wings': [2, 0, 0, 0], 'num_specimen_seen': [10, 2, 1, 8]}, index=['falcon', 'dog', 'spider', 'fish']) – Cribb
@Cribb Well, that's only one way of constructing a df. Other ways include a list of dicts, which is more like a CSV, or pd.read_csv() ofc. What I meant was, if you print it, you get a table, like a CSV (but more readable). – Sarilda
@AndyHayden I provided a df. Feel free to demonstrate the dictionary approach, but I don't think it changes my original comment. – Cribb
@Cribb Did you reply to the wrong person? I'm still not really sure what you're talking about. – Sarilda
For the record, flywire invited me to a chat. – Sarilda
As shown in the chat, StringIO() and pd.read_csv() seem useful for including csv data in code without requiring the csv to be transformed. A row-oriented list is also demonstrated. – Cribb
Reading this again, my comments from Nov 6 and Jan 27 seem to be covered by the one sentence: "Test it yourself." – Sarilda

How to create sample datasets

This is mainly to expand on AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) NumPy give you a variety of tools for this, such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing NumPy and Pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

import numpy as np
import pandas as pd

np.random.seed(123)

A kitchen sink example

Here's an example showing a variety of things you can do. All kinds of useful sample dataframes could be created from a subset of this:

df = pd.DataFrame({

    # some ways to create random data
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),

    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})

This produces:

          a   b       c  d  e          f          g
0 -1.085631 NaN   panda  0  0 2011-01-01 2011-08-12
1  0.997345   7   shark  0  1 2011-01-02 2011-11-10
2  0.282978   5   panda  1  0 2011-01-03 2011-10-30
3 -1.506295   7  python  1  1 2011-01-04 2011-09-07
4 -0.578600 NaN   shark  2  0 2011-01-05 2011-02-27
5  1.651437   7  python  2  1 2011-01-06 2011-02-03

Some notes:

  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can easily duplicate R's expand.grid(), and is also more flexible in its ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for R's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
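As a sketch of the itertools approach mentioned in note 2 (the helper name expand_grid is illustrative, echoing the recipe in the pandas cookbook rather than an actual pandas API):

```python
import itertools
import pandas as pd

# A small expand.grid() analogue: one row per combination of the
# keyword arguments, with the argument names as column names.
def expand_grid(**kwargs):
    columns = list(kwargs)
    rows = itertools.product(*kwargs.values())
    return pd.DataFrame.from_records(rows, columns=columns)

grid = expand_grid(animal=['panda', 'python', 'shark'], n=[1, 2])
# 6 rows: every (animal, n) combination
```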

Fake stock market data

In addition to taking subsets of the above code, you can further combine the techniques to do just about anything. For example, here's a short example that combines np.repeat, np.tile, and date_range to create sample ticker data for 4 stocks covering the same dates:

stocks = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

Now we have a sample dataset with 100 lines (25 dates per ticker), but we have only used 4 lines to do it, making it easy for everyone else to reproduce without copying and pasting 100 lines of code. You can then display subsets of the data if it helps to explain your question:

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft
Ornament answered 24/5, 2015 at 14:22 Comment(1)
Great answer. After writing this question I actually did write a very short, simple implementation of expand.grid() that's included in the pandas cookbook; you could include that in your answer as well. Your answer shows how to create more complex datasets than my expand_grid() function could handle, which is great. – Siemens

Diary of an Answerer

My best advice for asking questions would be to play on the psychology of the people who answer questions. Being one of those people, I can give insight into why I answer certain questions and why I don't answer others.

Motivations

I'm motivated to answer questions for several reasons

  1. Stackoverflow.com has been a tremendously valuable resource to me. I wanted to give back.
  2. In my efforts to give back, I've found this site to be an even more powerful resource than before. Answering questions is a learning experience for me and I like to learn. Read this answer and comment from another vet. This kind of interaction makes me happy.
  3. I like points!
  4. See #3.
  5. I like interesting problems.

All my purest intentions are great and all, but I get that satisfaction whether I answer 1 question or 30. What drives my choice of which questions to answer has a huge component of point maximization.

I'll also spend time on interesting problems, but those are few and far between, and that doesn't help an asker who needs a solution to a non-interesting question. Your best bet to get me to answer a question is to serve that question up on a platter, ripe for me to answer with as little effort as possible. If I'm looking at two questions and one has code I can copy-paste to create all the variables I need... I'm taking that one! I'll come back to the other one if I have time, maybe.

Main Advice

Make it easy for the people answering questions.

  • Provide code that creates variables that are needed.
  • Minimize that code. If my eyes glaze over as I look at the post, I'm on to the next question or getting back to whatever else I'm doing.
  • Think about what you're asking and be specific. We want to see what you've done because natural languages (English) are inexact and confusing. Code samples of what you've tried help resolve inconsistencies in a natural language description.
  • PLEASE show what you expect!!! I have to sit down and try things. I almost never know the answer to a question without trying some things out. If I don't see an example of what you're looking for, I might pass on the question because I don't feel like guessing.

Your reputation is more than just your reputation.

I like points (I mentioned that above). But those points aren't really really my reputation. My real reputation is an amalgamation of what others on the site think of me. I strive to be fair and honest and I hope others can see that. What that means for an asker is, we remember the behaviors of askers. If you don't select answers and upvote good answers, I remember. If you behave in ways I don't like or in ways I do like, I remember. This also plays into which questions I'll answer.


Anyway, I can probably go on, but I'll spare all of you who actually read this.

Zobkiw answered 19/7, 2016 at 18:35 Comment(0)

The Challenge

One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.

The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.

Please clearly state your question upfront. After taking the time to write your question and any sample code, re-read it and provide an 'Executive Summary' for your reader which summarizes the problem and clearly states the question.

Original question:

I have this data...

I want to do this...

I want my result to look like this...

However, when I try to do [this], I get the following problem...

I've tried to find solutions by doing [this] and [that].

How do I fix it?

Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.

Revised Question:

Question: How can I do [this]?

I've tried to find solutions by doing [this] and [that].

When I've tried to do [this], I get the following problem...

I'd like my final results to look like this...

Here is some minimal code that can reproduce my problem...

And here is how to recreate my sample data: df = pd.DataFrame({'A': [...], 'B': [...], ...})

PROVIDE SAMPLE DATA IF NEEDED!!!

Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:

stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):

>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-02 00:00:00'),
  2: Timestamp('2011-01-03 00:00:00'),
  3: Timestamp('2011-01-04 00:00:00'),
  4: Timestamp('2011-01-05 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319},
 'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}

>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
  1: Timestamp('2011-01-02 00:00:00'),
  2: Timestamp('2011-01-03 00:00:00'),
  3: Timestamp('2011-01-04 00:00:00'),
  4: Timestamp('2011-01-05 00:00:00'),
  5: Timestamp('2011-01-21 00:00:00'),
  6: Timestamp('2011-01-22 00:00:00'),
  7: Timestamp('2011-01-23 00:00:00'),
  8: Timestamp('2011-01-24 00:00:00'),
  9: Timestamp('2011-01-25 00:00:00')},
 'price': {0: 10.284260107718254,
  1: 11.930300761831457,
  2: 10.93741046217319,
  3: 10.884574289565609,
  4: 11.78005850418319,
  5: 10.017209045035006,
  6: 10.57090128181566,
  7: 11.442792747870204,
  8: 11.592953372130493,
  9: 12.864146419530938},
 'ticker': {0: 'aapl',
  1: 'aapl',
  2: 'aapl',
  3: 'aapl',
  4: 'aapl',
  5: 'msft',
  6: 'msft',
  7: 'msft',
  8: 'msft',
  9: 'msft'}}

You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):

stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date      100 non-null datetime64[ns]
price     100 non-null float64
ticker    100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)

NOTE: If your DataFrame has a MultiIndex:

If your DataFrame has a MultiIndex, you must first reset the index before calling to_dict. You then need to recreate the index using set_index:

# MultiIndex example.  First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
                       price
date       ticker           
2011-01-01 aapl    10.284260
2011-01-02 aapl    11.930301
2011-01-03 aapl    10.937410
2011-01-04 aapl    10.884574
2011-01-05 aapl    11.780059
...

# After resetting the index and passing the DataFrame to `to_dict`, make sure to use 
# `set_index` to restore the original MultiIndex.  This DataFrame can then be restored.

d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
                       price
date       ticker           
2011-01-01 aapl    10.284260
2011-01-02 aapl    11.930301
2011-01-03 aapl    10.937410
2011-01-04 aapl    10.884574
2011-01-05 aapl    11.780059
Gardie answered 12/9, 2015 at 7:6 Comment(0)

Here is my version of dput - the standard R tool for producing reproducible examples - for Pandas DataFrames. It will probably fail for more complex frames, but it seems to do the job in simple cases:

import pandas as pd

def dput(x):
    if isinstance(x, pd.Series):
        return "pd.Series(%s,dtype='%s',index=pd.%s)" % (list(x), x.dtype, x.index)
    if isinstance(x, pd.DataFrame):
        return "pd.DataFrame({" + ", ".join(
            "'%s': %s" % (c, dput(x[c])) for c in x.columns) + (
                "}, index=pd.%s)" % (x.index))
    raise NotImplementedError("dput", type(x), x)

now,

df = pd.DataFrame({'a':[1,2,3,4,2,1,3,1]})
assert df.equals(eval(dput(df)))
du = pd.get_dummies(df.a,"foo")
assert du.equals(eval(dput(du)))
di = df
di.index = list('abcdefgh')
assert di.equals(eval(dput(di)))

Note that this produces a much more verbose output than DataFrame.to_dict, e.g.,

pd.DataFrame({
  'foo_1':pd.Series([1, 0, 0, 0, 0, 1, 0, 1],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_2':pd.Series([0, 1, 0, 0, 1, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_3':pd.Series([0, 0, 1, 0, 0, 0, 1, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1)),
  'foo_4':pd.Series([0, 0, 0, 1, 0, 0, 0, 0],dtype='uint8',index=pd.RangeIndex(start=0, stop=8, step=1))},
  index=pd.RangeIndex(start=0, stop=8, step=1))

vs

{'foo_1': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1}, 
 'foo_2': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}, 
 'foo_3': {0: 0, 1: 0, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0}, 
 'foo_4': {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}}

for du above, but it preserves column types. E.g., in the above test case,

du.equals(pd.DataFrame(du.to_dict()))
==> False

because du.dtypes is uint8 and pd.DataFrame(du.to_dict()).dtypes is int64.

Dalmatic answered 16/12, 2016 at 17:57 Comment(3)
It is clearer, though I admit I don't see why I would want to use it over to_dict. – Twentyfour
Because it preserves column types. More specifically, du.equals(eval(dput(du))). – Dalmatic
I like this. I have a more modern version with interpolated strings, which also breaks up the output with line breaks: def dput(x): indent = " " if isinstance(x,pd.Series): return f"pd.Series({list(x)},dtype='{x.dtype}',index=pd.{x.index}),\r\n" if isinstance(x,pd.DataFrame): temp = "pd.DataFrame({\r\n" + indent temp += indent.join([ f"'{c}': {dput(x[c])}" for c in x.columns]) temp += (f"}}, index=pd.{x.index})") return temp.replace("nan", "float(\'NaN\')") raise NotImplementedError("dput",type(x),x) – Vitals

© 2022 - 2024 — McMap. All rights reserved.