Pandas scatterplot categorical and timeseries axes

2

5

I'm looking to create a chart much like NLTK's lexical dispersion plot, but I'm drawing a blank on how to construct it. I was thinking that scatter would be my best geom, using '|' as markers and setting the alpha, but I am running into all sorts of issues setting the parameters. An example of the kind of plot I mean is below:

enter image description here

I have the dataframe arranged with a datetime index, freq='D', over a 5-year period, and each column represents the count of a particular word used on that date. For example:

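import datetime
from random import randint

import pandas as pd

# 32 daily rows (2010-01-01 through 2010-02-01), three integer count columns
tst = pd.DataFrame(
    index=pd.date_range(datetime.datetime(2010, 1, 1), end=datetime.datetime(2010, 2, 1), freq='D'),
    data=[[randint(0, 5), randint(0, 1), randint(0, 2)] for x in range(32)])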

Currently I'm trying something akin to the following:

plt.figure()
tst.plot(kind='scatter', x=tst.index, y=tst.columns, marker='|', color=sns.xkcd_rgb['dodger blue'], alpha=.05, legend=False)
yticks = plt.yticks()[0]
plt.yticks(yticks, top_words)

The above code yields a KeyError:

KeyError: "['2009-12-31T19:00:00.000000000-0500' '2010-01-01T19:00:00.000000000-0500'\n '2010-01-02T19:00:00.000000000-0500' '2010-01-03T19:00:00.000000000-0500'\n '2010-01-04T19:00:00.000000000-0500' '2010-01-05T19:00:00.000000000-0500'\n '2010-01-06T19:00:00.000000000-0500' '2010-01-07T19:00:00.000000000-0500'\n '2010-01-08T19:00:00.000000000-0500' '2010-01-09T19:00:00.000000000-0500'\n '2010-01-10T19:00:00.000000000-0500' '2010-01-11T19:00:00.000000000-0500'\n '2010-01-12T19:00:00.000000000-0500' '2010-01-13T19:00:00.000000000-0500'\n '2010-01-14T19:00:00.000000000-0500' '2010-01-15T19:00:00.000000000-0500'\n '2010-01-16T19:00:00.000000000-0500' '2010-01-17T19:00:00.000000000-0500'\n '2010-01-18T19:00:00.000000000-0500' '2010-01-19T19:00:00.000000000-0500'\n '2010-01-20T19:00:00.000000000-0500' '2010-01-21T19:00:00.000000000-0500'\n '2010-01-22T19:00:00.000000000-0500' '2010-01-23T19:00:00.000000000-0500'\n '2010-01-24T19:00:00.000000000-0500' '2010-01-25T19:00:00.000000000-0500'\n '2010-01-26T19:00:00.000000000-0500' '2010-01-27T19:00:00.000000000-0500'\n '2010-01-28T19:00:00.000000000-0500' '2010-01-29T19:00:00.000000000-0500'\n '2010-01-30T19:00:00.000000000-0500' '2010-01-31T19:00:00.000000000-0500'] not in index" 

Any help would be appreciated.

With help, I was able to produce the following:

plt.plot(tst.index, tst, marker='|', color=sns.xkcd_rgb['dodger blue'], alpha=.25, ms=.5, lw=.5)
plt.ylim([-1, 20])
plt.yticks(range(20), top_words)

enter image description here

Unfortunately, it appears that the upper bars only show up when there is a bar beneath them to build on. That's not how my data looks.

Grapple answered 2/9, 2015 at 15:35 Comment(0)
L
6

I am not sure you can do this with the DataFrame .plot method. However, it is easy to do directly in matplotlib:

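import matplotlib.pyplot as plt
# lw=0 hides the connecting lines, so only the '|' markers are drawn
plt.plot(tst.index, tst, marker='|', lw=0, ms=10)
plt.ylim([-0.5, 5.5])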

enter image description here
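Note that the call above uses each count as the y value. A minimal sketch of one way to get a dispersion-style layout instead, assuming the goal is one horizontal row per column with a tick on every date where the count is non-zero. The helper name `dispersion_plot` is made up for illustration, and the random data is seeded only so the example is repeatable:

```python
import datetime
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to display a window
import matplotlib.pyplot as plt
import pandas as pd

random.seed(0)  # repeatable sample data
tst = pd.DataFrame(
    index=pd.date_range(datetime.datetime(2010, 1, 1), end=datetime.datetime(2010, 2, 1), freq='D'),
    data=[[random.randint(0, 5), random.randint(0, 1), random.randint(0, 2)] for x in range(32)])


def dispersion_plot(df, ax=None):
    """Hypothetical helper: draw one row per column, with a '|' tick
    on each date where that column's count is greater than zero."""
    if ax is None:
        ax = plt.gca()
    for row, col in enumerate(df.columns):
        dates = df.index[df[col] > 0]          # keep only dates with a non-zero count
        ax.plot(dates, [row] * len(dates),     # constant height = the column's row
                linestyle='none', marker='|', ms=10)
    ax.set_yticks(list(range(len(df.columns))))
    ax.set_yticklabels([str(c) for c in df.columns])
    ax.set_ylim(-0.5, len(df.columns) - 0.5)
    ax.figure.autofmt_xdate()                  # tilt the date labels
    return ax


ax = dispersion_plot(tst)
```

Because zero counts are filtered out before plotting, they leave a gap in the row rather than a mark at the bottom of the axes.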

Loaves answered 2/9, 2015 at 16:47 Comment(1)
Worked almost exactly as expected. I do have a bit of a shift on my axes though. My argument for 0 forms a small bar at the bottom, where every other integer forms a line going up. I'll post the result in my question. – Grapple
G
2

If you can install seaborn, try stripplot():

import seaborn as sns
sns.stripplot(data=tst, orient='h', marker='|', edgecolor='blue');

plot

Note that I changed your data to make it look a bit more interesting:

import numpy as np
tst = pd.DataFrame(index=pd.date_range(datetime.datetime(2010, 1, 1), end=datetime.datetime(2010, 2, 1), freq='D'), 
                   data=(150000 * np.random.rand(32, 3)).astype('int'))

More information on seaborn:

http://stanford.edu/~mwaskom/software/seaborn/tutorial/categorical.html

Gladysglagolitic answered 3/9, 2015 at 8:27 Comment(2)
Yes, this works very well. I had come across this module in the docs, but couldn't access it previously. I was using an outdated version of seaborn. Thanks for the suggestion! – Grapple
I do want to say, though, that the scale on the bottom should be reading the dates. From my original dataset, the scatter point should be at the intersection of the column and the index, with the point darkened according to the degree in the data. – Grapple
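A rough sketch of what that last comment describes, using plain matplotlib rather than stripplot: dates on the x axis, one row per column, and each tick's opacity scaled by its count. This is only one possible approach. The column names (`alpha`, `beta`, `gamma`) and the seeded random data are placeholders, not the asker's data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to display a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.colors import to_rgba

# Placeholder data: daily counts in 0..5 for three made-up words.
rng = np.random.default_rng(0)
idx = pd.date_range("2010-01-01", "2010-02-01", freq="D")
tst = pd.DataFrame(rng.integers(0, 6, size=(len(idx), 3)),
                   index=idx, columns=["alpha", "beta", "gamma"])

# Long format: one record per (date, word) pair, dropping zero counts.
long = (tst.rename_axis("date").reset_index()
           .melt(id_vars="date", var_name="word", value_name="count"))
long = long[long["count"] > 0]

# Map each word to an integer row on the y axis.
words = list(tst.columns)
y = long["word"].map({w: i for i, w in enumerate(words)})

# One RGBA colour per point; the alpha channel grows with the count.
colors = np.tile(to_rgba("dodgerblue"), (len(long), 1))
colors[:, 3] = long["count"].to_numpy() / tst.to_numpy().max()

fig, ax = plt.subplots()
ax.scatter(long["date"], y, marker="|", s=200, c=colors)
ax.set_yticks(list(range(len(words))))
ax.set_yticklabels(words)
ax.set_ylim(-0.5, len(words) - 0.5)
fig.autofmt_xdate()  # tilt the date labels
```

Passing per-point RGBA values through `c` is used here because a single `alpha` value would apply the same opacity to every marker.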
