How to import data from MongoDB to pandas?

16

129

I have a large amount of data in a collection in MongoDB which I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
Crowfoot answered 27/4, 2013 at 7:59 Comment(1)
Using a custom field type with MongoEngine can make storing and retrieving Pandas DataFrames as simple as mongo_doc.data_frame = my_pandas_dfMalvin
175

pymongo might give you a hand; the following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
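
For example, assuming a local MongoDB instance with a hypothetical sensors database holding the SensorReport collection from the question, the function above could be called like this:

# Hypothetical database/collection names; adjust them to your own setup
df = read_mongo('sensors', 'SensorReport', query={'sensorName': '56847890-0'})
print(df.head())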
Fivefold answered 27/4, 2013 at 18:45 Comment(8)
Thanks, this is the method I ended up using. I also had an array of embedded documents in each row, so I had to iterate over that as well within each row. Is there a better way to do this?Crowfoot
Is it possible to provide some samples of your mongodb's structure?Fivefold
Look at the edit for a sample data row. An array of the embedded document "Reading" is stored inside Readings. Now I am doing a query to find multiple records and then iterating through each reading in the Readings array for each and every record. Is there any easier way to import data in my case?Crowfoot
Note the list() inside df = pd.DataFrame(list(cursor)) evaluates as a list or generator, to keep the CPU cool. If you have a zillionty-one data items, and the next few lines would have reasonably partitioned, level-of-detailed, and clipped them, the whole shmegegge is still safe to drop in. Nice.Sarcophagus
It's very slow @ df = pd.DataFrame(list(cursor)). Pure db querying is much faster. Could we change the list casting to something else?Towill
@Peter that line also caught my eyes. Casting a database cursor, which is designed to be iterable and potentially wraps large amounts of data, into an in-memory list does not seem clever to me.Bulldoze
I have a cursor with ~150k rows and the list(cursor) alone takes 20s while the query takes 0.01s. Is there a faster way to load mongodb data into a pandas dataframe?Impletion
@Towill @Rafa Use iter: pd.DataFrame(iter(cursor)). Unless you want to use something like dask to never have all the data loaded into memory (for the most part).Unifoliate
52

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully for you too.

import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
Lentigo answered 23/12, 2014 at 9:15 Comment(0)
27

As per PEP 20 (The Zen of Python), simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include query conditions as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc.
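
For instance, a filtered query might look like this (a sketch assuming a local server and a hypothetical mydb database; the field names come from the question):

import pandas as pd
from pymongo import MongoClient

client = MongoClient()  # defaults to localhost:27017
# Only reports with more than 5 readings, dropping _id via a projection
df = pd.DataFrame.from_records(
    client.mydb.SensorReport.find({'reportCount': {'$gt': 5}}, {'_id': 0})
)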

and voila!

Ascot answered 23/10, 2016 at 11:43 Comment(2)
pd.DataFrame.from_records seems to be as slow as DataFrame(list()), but the results are very inconsistent. %%time showed anything from 800 ms to 1.9 sComplacent
This isn't good for huge result sets: instead of raising a memory error it hangs the system when the data is too big, whereas pd.DataFrame(list(cursor)) at least shows a memory error.Sfax
25

Monary does exactly that, and it's super fast.

See this cool post which includes a quick tutorial and some timings.

Indisposition answered 19/12, 2013 at 22:33 Comment(3)
Does Monary support string data type ?Balch
I tried Monary, but it is taking a lot of time. Am I missing some optimization? Tried client = Monary(host, 27017, database="db_tmp"); columns = ["col1", "col2"]; data_type = ["int64", "int64"]; arrays = client.query("db_tmp", "coll", {}, columns, data_type). For 50000 records it takes around 200s.Lucilla
That sounds extremely slow... Frankly, I don't know what the status of this project is, now, 4 years later...Indisposition
17

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your python/pandas versions).

This way you get the unfolding of nested mongodb documents for free.
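
For instance, with the nested documents from the question, something like this (a sketch; my_collection is assumed to be a pymongo collection and the field names come from the sample data) gives one row per reading while keeping fields from the parent document:

docs = list(my_collection.find())
readings = json_normalize(docs, record_path='Readings', meta=['sensorName', 'reportCount'])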

Oneiric answered 29/3, 2018 at 8:57 Comment(3)
I got an error with this method TypeError: data argument can't be an iteratorMorphology
Strange, this works on my python 3.6.7 using pandas 0.24.2. Maybe you can try df = json_normalize(list(cursor)) instead?Practise
+1. Per the docs, the max_level argument defines the maximum level of dict depth. I just made a test and it's not true, so some columns would still need to be split with the .str accessor. Still, a very nice feature for working with MongoDB.Burnejones
15
import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)
Busty answered 20/10, 2016 at 23:33 Comment(0)
10

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

An article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).
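
As a rough illustration of the out-of-core idea (a sketch of my own, not taken from the article; the connection URI, database/collection names and chunk size are placeholders), Dask can be pointed at a MongoDB collection by loading it in chunks:

import dask
import dask.dataframe as dd
import pandas as pd
from pymongo import MongoClient

URI, DB, COLL, CHUNK = 'mongodb://localhost:27017', 'my_db', 'my_collection', 10_000

def load_chunk(skip, limit):
    """Load one slice of the collection as a regular pandas DataFrame."""
    coll = MongoClient(URI)[DB][COLL]
    return pd.DataFrame(list(coll.find().skip(skip).limit(limit)))

count = MongoClient(URI)[DB][COLL].count_documents({})

# One lazy task per chunk; Dask schedules them in parallel and works chunk by chunk
parts = [dask.delayed(load_chunk)(skip, CHUNK) for skip in range(0, count, CHUNK)]
ddf = dd.from_delayed(parts)
print(ddf.describe().compute())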

P.S. I'm not affiliated with any of these technologies.

Nagging answered 27/9, 2016 at 0:16 Comment(2)
I've also written a post using a Jupyter Notebook with an example of how Dask helps to speed up execution even on data fitting into memory, by using multiple cores on a single machine.Nagging
Looks like blaze is deprecated.Miniskirt
6

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end:

import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
  """Turn an iterator into multiple small pandas.DataFrame

  This is a balance between memory and efficiency
  """
  records = []
  frames = []
  for i, record in enumerate(iterator):
    records.append(record)
    if i % chunk_size == chunk_size - 1:
      frames.append(pd.DataFrame(records))
      records = []
  if records:
    frames.append(pd.DataFrame(records))
  return pd.concat(frames)
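
For example, assuming collection is a pymongo collection object, it could be used as:

df = iterator2dataframes(collection.find(), chunk_size=10000)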
Bet answered 12/9, 2016 at 8:19 Comment(0)
6

You can also use PyMongoArrow -- it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
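
For example, after pip install pymongoarrow, a minimal sketch (not from the answer; database and collection names are placeholders) might look like this:

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo's Collection
collection = MongoClient()['mydb']['SensorReport']  # placeholder names
df = collection.find_pandas_all({})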

Ormand answered 11/5, 2021 at 16:57 Comment(2)
This library is almost useless due to very limited type support; it does not even support str.Shareeshareholder
Hi @Wang, PyMongoArrow supports the majority of the BSON data types, including string. You can see the list of all supported data types here: mongo-arrow.readthedocs.io/en/latest/data_types.htmlWray
3

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
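
For example (a sketch; the database, collection and field names are placeholders), the export can be driven from Python and read back with pandas:

import subprocess
import pandas as pd

# Run the mongoexport CLI; the same command can equally be run directly in a shell
subprocess.run([
    'mongoexport', '--db=mydb', '--collection=SensorReport',
    '--type=csv', '--fields=sensorName,reportCount,latestReportTime',
    '--out=sensor_report.csv',
], check=True)

df = pd.read_csv('sensor_report.csv')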

Teague answered 27/4, 2013 at 11:32 Comment(1)
It's DataFrame.from_records().Paraselene
2

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregation pipeline first, filtering out the data you do not want and then mapping it to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by the reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

Sardella answered 5/8, 2020 at 20:19 Comment(1)
Cool tool but doesn't work with TLS/activedirectory auth.Miniskirt
1

Following the great answer by waitingkuo, I would like to add the possibility of doing that with chunksize, in line with .read_sql() and .read_csv(). I enlarge the answer from Deu Leung by avoiding going one by one through each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.

from pymongo import MongoClient
import pandas as pd


def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and store into a DataFrame, chunk by chunk """

    # Connect to MongoDB
    # db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    client = MongoClient(host=host, port=port)
    db_aux = client[db]

    # Chunk boundaries: 0, chunksize, 2*chunksize, ..., total document count
    # (count_documents replaces the Cursor.count() used in older pymongo versions)
    count = db_aux[collection].count_documents(query)
    boundaries = list(range(0, count, int(chunksize))) + [count]

    # Iteration to create the dataframe in chunks
    frames = []
    for i in range(1, len(boundaries)):
        # Expand a slice of the cursor and construct a partial DataFrame
        cursor = db_aux[collection].find(query)[boundaries[i - 1]:boundaries[i]]
        df_aux = pd.DataFrame(list(cursor))

        # Delete the _id
        if no_id and '_id' in df_aux:
            del df_aux['_id']

        frames.append(df_aux)

    # Concatenate the chunks into a single df
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
Rootstock answered 6/3, 2018 at 10:43 Comment(0)
1

A similar approach to those of Rafael Valero, waitingkuo and Deu Leung, using pagination:

def read_mongo(
        db,
        collection, query=None,
        host='localhost', port=27017, username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB (reuses _connect_mongo from waitingkuo's answer above)
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate the number of documents to skip for the requested page
    skips = chunksize * (page_num - 1)

    # Defaulting query to None and replacing it here avoids the mutable-default-argument
    # pitfall, explained (in Spanish) at:
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
Vanatta answered 20/3, 2018 at 1:19 Comment(0)
1
  1. Start the mongo shell with: mongosh

  2. Scroll up in the shell until you see the connection string mongo is connected to. It should look something like this: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4

  3. Copy and paste that connection string into MongoClient

  4. Here is the code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')

mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df
Doak answered 24/8, 2022 at 19:48 Comment(0)
1

Although this is an old post, I think it’s still very relevant to this day, as the popularity of both MongoDB and Pandas has only increased over time and will continue to increase.

MongoDB recently created a new library called "PyMongoArrow", which allows you to easily move data from a MongoDB database to many other formats such as a Pandas DataFrame, a NumPy array, or an Apache Arrow Table in just a few lines of code.

It has out-of-the-box support for a lot of data types, including both the float and datetime types you mentioned. For more details on which data types are supported, see their documentation. It is built on top of PyMongo.
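
For illustration, here is a sketch along those lines (the database, collection and field names are placeholders, and the schema is optional):

from datetime import datetime
from pymongo import MongoClient
from pymongoarrow.api import Schema, find_pandas_all

collection = MongoClient()['mydb']['SensorReport']  # placeholder names
# The optional schema maps field names to types; float fields work the same way
schema = Schema({'sensorName': str, 'latestReportTime': datetime})
df = find_pandas_all(collection, {}, schema=schema)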

Wray answered 2/10, 2023 at 19:20 Comment(0)
0

You can use the "pandas.json_normalize" method:

import pandas as pd
display(pd.json_normalize(x))
display(pd.json_normalize(x, record_path="Readings"))

It should display two tables, where x is either your cursor (on recent pandas versions you may need list(cursor)) or the sample document below:

from bson import ObjectId
def ISODate(st):
    return st

x = {
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
Yea answered 7/11, 2021 at 3:2 Comment(0)
