How can I load data from mongodb collection into pandas' DataFrame?

I am new to pandas (well, to all things "programming"...), but have been encouraged to give it a try. I have a MongoDB database - "test" - with a collection called "tweets". I access the database in IPython:

from pymongo import MongoClient
connection = MongoClient()
db = connection.test
tweets = db.tweets

The structure of a document in the tweets collection is as follows:

{u'entities': {u'hashtags': [],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
 u'id': 349223842700472320L,
 u'id_str': u'349223842700472320',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': {u'attributes': {},
  u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate]]],
   u'type': u'Polygon'},
  u'country': u'placeholder country',
  u'country_code': u'example',
  u'full_name': u'name, xx',
  u'id': u'user id',
  u'name': u'name',
  u'place_type': u'city',
  u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'example text',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'example description',
  u'favourites_count': 100,
  u'follow_request_sent': None,
  u'followers_count': 100,
  u'following': None,
  u'friends_count': 100,
  u'geo_enabled': True,
  u'id': placeholder_id,
  u'id_str': u'placeholder_id',
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'example place',
  u'name': u'example name',
  u'notifications': None,
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_tile': False,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
  u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
  u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
  u'profile_link_color': u'000000',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'000000',
  u'profile_text_color': u'000000',
  u'profile_use_background_image': False,
  u'protected': False,
  u'screen_name': u'placeholder screen_name',
  u'statuses_count': xxxx,
  u'time_zone': u'placeholder time_zone',
  u'url': None,
  u'utc_offset': -21600,
  u'verified': False}}

Now, as far as I understand, pandas' main data structure - a spreadsheet-like table - is called DataFrame. How can I load the data from my "tweets" collection into pandas' DataFrame? And how can I query for a subdocument within the database?

Imbecilic answered 23/7, 2013 at 8:39 Comment(1)
There ought to be a way to do this using read_json, which would be more efficient (especially for large datasets).Maxwell

Convert the cursor you get from MongoDB to a list before passing it to DataFrame:

import pandas as pd
df = pd.DataFrame(list(tweets.find()))
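
If you only need part of each document (for example the nested entities.hashtags field from the question), you can pass a projection to find() so that only those fields end up in the DataFrame. A minimal sketch, assuming the tweets collection from the question:

import pandas as pd

# Keep only tweets that actually carry hashtags, and project just the fields of
# interest; _id is excluded so it does not become a column.
cursor = tweets.find(
    {"entities.hashtags": {"$ne": []}},
    {"_id": 0, "user.screen_name": 1, "entities.hashtags": 1},
)
df = pd.DataFrame(list(cursor))

# The projected fields come back as nested dicts/lists, so unpack them per row.
df["screen_name"] = df["user"].apply(lambda u: u["screen_name"])
df["hashtags"] = df["entities"].apply(lambda e: [h["text"] for h in e["hashtags"]])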
Morocco answered 23/7, 2013 at 8:54 Comment(10)
Great, by passing "df" the documents of the collection are brought up in a data column. However, I need to query for a subdocument - "hashtags.text" - in one of the documents, "entities". Any idea how I can do this from within pandas?Imbecilic
Can you show some examples for your documents so that I can give you a help?Morocco
What do you need? The hashtags field?Morocco
Yes, I am interested in the hashtags field.Imbecilic
What's your expected output? What are your columns and rows?Morocco
I'd like pandas to give me an overview of who tweets with hashtags and how many times they have done it. One way to do this would be to have a dataframe structure with names of user ("screen_name") in the rows, and frequency/count of hashtagged tweets in the columns.Imbecilic
Perhaps you should post a new problem about what you've just asked. Provide some sample examples and the expected output format.Morocco
Is there a way to avoid the list() constructor and use the generator directly?Arawak
I have a collection with 283000 rows, each with 10 columns (5 doubles, 2 Longs, 2 Strings and 1 ISODate). It takes 3-5 seconds to give me the DataFrame. I expected this to take around zero seconds. I see that list() takes the most of the time. Is that expected or do I have some bad configuration somewhere? (FYI I am reading the whole collection, i.e., using find())Dovetailed
@Dovetailed I'm having the same problemCorundum

If you have data in MongoDB like this:

[
    {
        "name": "Adam", 
        "age": 27, 
        "address":{
            "number": 4, 
            "street": "Main Road", 
            "city": "Oxford"
        }
     },
     {
        "name": "Steve", 
        "age": 32, 
        "address":{
            "number": 78, 
            "street": "High Street", 
            "city": "Cambridge"
        }
     }
]

You can put the data straight into a dataframe like this:

from pandas import DataFrame

df = DataFrame(list(db.collection_name.find({})))

And you will get this output:

df.head()

|    | name    | age  | address                                                      |
|----|---------|------|--------------------------------------------------------------|
| 0  | "Adam"  | 27   | {"number": 4, "street": "Main Road", "city": "Oxford"}       |
| 1  | "Steve" | 32   | {"number": 78, "street": "High Street", "city": "Cambridge"} |

However, the subdocuments will just appear as nested JSON inside a single cell. If you want to flatten them so that subdocument properties show up as individual columns, you can use json_normalize without any parameters.

import pandas as pd

datapoints = list(db.collection_name.find({}))

df = pd.json_normalize(datapoints)

df.head()

This will give a DataFrame in this format:

|    | name  | age  | address.number | address.street | address.city |
|----|-------|------|----------------|----------------|--------------|
| 0  | Adam  | 27   | 4              | Main Road      | Oxford       |
| 1  | Steve | 32   | 78             | High Street    | Cambridge    |
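
If you prefer a different separator than a dot in the flattened column names, json_normalize also takes a sep argument; a small variation on the snippet above:

df = pd.json_normalize(datapoints, sep="_")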
Phagocytosis answered 20/12, 2016 at 10:50 Comment(5)
If we don't want to specify a collection name, how can we fetch data for all collections?Georama
Will this work for several GB of data from MongoDB, or does the pandas DataFrame suffer and we need to try another approach? I have tweets JSON data of almost 15 GB imported into MongoDB and I am trying to get it into a CSV.Banal
Traceback File "C:\DEV\Python\lib\site-packages\pymongo\network.py", line 235, in _receive_data_on_socket buf = bytearray(length) MemoryErrorBanal
`result_df = pd.json_normalize(data=pymongo_collection.find())` works just fine without converting the pymongo cursor to a list or tuple.Bruges
Loading a large collection leads to memory issues. Is there a way I can pull data in small batches?Epicotyl

You can load your MongoDB data into a pandas DataFrame using this code. It works for me.

import pandas as pd
from pymongo import MongoClient
connection = MongoClient()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
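
If you don't want MongoDB's internal _id column in the DataFrame, one option (a small sketch using the names above) is to drop it after loading:

data = data.drop(columns=["_id"])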
Madel answered 23/12, 2014 at 9:1 Comment(1)
Here we specify the collection name. If we don't want to specify a collection name, how can we achieve that?Georama

Use PyMongoArrow. This is a tool built by MongoDB that lets you work directly with both MongoDB and pandas. In addition to pandas, it also supports moving data out of MongoDB into other formats such as NumPy arrays and Apache Arrow Tables.
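
A minimal sketch, assuming PyMongoArrow is installed alongside pymongo and using the test.tweets collection from the question:

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo collections

client = MongoClient()
tweets = client.test.tweets

# Pull the matching documents straight into a pandas DataFrame.
df = tweets.find_pandas_all({})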

Eidolon answered 2/10, 2023 at 19:42 Comment(0)

This is the simplest technique to achieve your aim.

import pandas as pd
from pymongo import MongoClient
conn = MongoClient()
db = conn.your_database_name
input_data = db.your_collection_name
pandas_data_frame = pd.DataFrame(list(input_data.find()))
print(pandas_data_frame)
Ass answered 3/6, 2022 at 20:49 Comment(0)

Use:

df = pd.DataFrame.from_dict(collection)
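
Note that collection here has to be an iterable of documents (for example the result of a find() call), not the pymongo Collection object itself; otherwise you get the TypeError mentioned in the comment below. A hedged variant:

df = pd.DataFrame.from_records(list(db.your_collection.find()))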
Rail answered 18/5, 2020 at 20:20 Comment(1)
There is a lack of context. For example, if I try it as you say without any context, I get the following error: "TypeError: 'Collection' object is not iterable".Broadside
