The Challenge One of the most challenging aspects of responding to SO questions is the time it takes to recreate the problem (including the data). Questions which don't have a clear way to reproduce the data are less likely to be answered. Given that you are taking the time to write a question and you have an issue that you'd like help with, you can easily help yourself by providing data that others can then use to help solve your problem.
The instructions provided by @Andy for writing good Pandas questions are an excellent place to start. For more information, refer to how to ask and how to create Minimal, Complete, and Verifiable examples.
Please clearly state your question upfront. After taking the time to write your question and any sample code, try to read it and provide an 'Executive Summary' for your reader which summarizes the problem and clearly states the question.
Original question:
I have this data...
I want to do this...
I want my result to look like this...
However, when I try to do [this], I get the following problem...
I've tried to find solutions by doing [this] and [that].
How do I fix it?
Depending on the amount of data, sample code and error stacks provided, the reader needs to go a long way before understanding what the problem is. Try restating your question so that the question itself is on top, and then provide the necessary details.
Revised Question:
Qustion: How can I do [this]?
I've tried to find solutions by doing [this] and [that].
When I've tried to do [this], I get the following problem...
I'd like my final results to look like this...
Here is some minimal code that can reproduce my problem...
And here is how to recreate my sample data:
df = pd.DataFrame({'A': [...], 'B': [...], ...})
PROVIDE SAMPLE DATA IF NEEDED!!!
Sometimes just the head or tail of the DataFrame is all that is needed. You can also use the methods proposed by @JohnE to create larger datasets that can be reproduced by others. Using his example to generate a 100 row DataFrame of stock prices:
stocks = pd.DataFrame({
'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
'price':(np.random.randn(100).cumsum() + 10) })
If this was your actual data, you may just want to include the head and/or tail of the dataframe as follows (be sure to anonymize any sensitive data):
>>> stocks.head(5).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319},
'ticker': {0: 'aapl', 1: 'aapl', 2: 'aapl', 3: 'aapl', 4: 'aapl'}}
>>> pd.concat([stocks.head(), stocks.tail()], ignore_index=True).to_dict()
{'date': {0: Timestamp('2011-01-01 00:00:00'),
1: Timestamp('2011-01-01 00:00:00'),
2: Timestamp('2011-01-01 00:00:00'),
3: Timestamp('2011-01-01 00:00:00'),
4: Timestamp('2011-01-02 00:00:00'),
5: Timestamp('2011-01-24 00:00:00'),
6: Timestamp('2011-01-25 00:00:00'),
7: Timestamp('2011-01-25 00:00:00'),
8: Timestamp('2011-01-25 00:00:00'),
9: Timestamp('2011-01-25 00:00:00')},
'price': {0: 10.284260107718254,
1: 11.930300761831457,
2: 10.93741046217319,
3: 10.884574289565609,
4: 11.78005850418319,
5: 10.017209045035006,
6: 10.57090128181566,
7: 11.442792747870204,
8: 11.592953372130493,
9: 12.864146419530938},
'ticker': {0: 'aapl',
1: 'aapl',
2: 'aapl',
3: 'aapl',
4: 'aapl',
5: 'msft',
6: 'msft',
7: 'msft',
8: 'msft',
9: 'msft'}}
You may also want to provide a description of the DataFrame (using only the relevant columns). This makes it easier for others to check the data types of each column and identify other common errors (e.g. dates as string vs. datetime64 vs. object):
stocks.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 0 to 99
Data columns (total 3 columns):
date 100 non-null datetime64[ns]
price 100 non-null float64
ticker 100 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
NOTE: If your DataFrame has a MultiIndex:
If your DataFrame has a multiindex, you must first reset before calling to_dict
. You then need to recreate the index using set_index
:
# MultiIndex example. First create a MultiIndex DataFrame.
df = stocks.set_index(['date', 'ticker'])
>>> df
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
...
# After resetting the index and passing the DataFrame to `to_dict`, make sure to use
# `set_index` to restore the original MultiIndex. This DataFrame can then be restored.
d = df.reset_index().to_dict()
df_new = pd.DataFrame(d).set_index(['date', 'ticker'])
>>> df_new.head()
price
date ticker
2011-01-01 aapl 10.284260
aapl 11.930301
aapl 10.937410
aapl 10.884574
2011-01-02 aapl 11.780059
df.head(N).to_dict()
, whereN
is some reasonable number is a good way to go. Bonus +1's for adding pretty-line breaks to the output. For timestamps, you'll typically just need to addfrom pandas import Timestamp
to the top of the code. – Twentyfour