Insert a Pandas Dataframe into mongodb using PyMongo
What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?

Attempts

db.myCollection.insert(df.to_dict())

gave an error

InvalidDocument: documents must have only string keys, the key was Timestamp('2013-11-23 13:31:00', tz=None)


 db.myCollection.insert(df.to_json())

gave an error

TypeError: 'str' object does not support item assignment


 db.myCollection.insert({id: df.to_json()})

gave an error

InvalidDocument: documents must have only string keys, key was <built-in function id>


df

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount    150  non-null values
price     150  non-null values
tid       150  non-null values
dtypes: float64(2), int64(1)
Chaves answered 23/11, 2013 at 20:1 Comment(3)
what do you want to do afterwards? do you want one doc per record or one doc per dataframe?Connolly
Each mongo record will have the fields date, amount, price, and tid. tid should be a unique fieldChaves
you can convert the dataframe to a dict-list by: records = json.loads(df.to_json(orient='records')), the result will look like: [{'c1': 1, 'c2': 1},{'c1': 2, 'c2': 2},{'c1': 3, 'c2': 3}], then just use db.coll.insert_many(records). btw, using df.to_dict('records') directly may run into a type errorMoxa
C
49

I doubt there is a method that is both quickest and simple. If you don't worry about data conversion, you can do

>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
                           A
1 2013-11-23 21:14:34.118531

>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)

But in case you try to load data back, you'll get:

>>> df = read_mongo(db, 'myCollection')
>>> df
                     A
0  1385241274118531000
>>> df.dtypes
A    int64
dtype: object

so you'll have to convert the 'A' column back to datetimes, as well as any fields in your DataFrame that are not int, float or str. For this example:

>>> df['A'] = pd.to_datetime(df['A'])
>>> df
                           A
0 2013-11-23 21:14:34.118531
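
(read_mongo above is not a pandas or pymongo built-in; a minimal sketch of such a helper, assuming it just wraps a find() into a DataFrame and drops Mongo's _id field:)

import pandas as pd

def read_mongo(db, collection_name, query=None):
    # Hypothetical helper: load documents from the collection into a DataFrame,
    # dropping MongoDB's _id field so only the original columns remain.
    cursor = db[collection_name].find(query or {})
    frame = pd.DataFrame(list(cursor))
    return frame.drop(columns=['_id'], errors='ignore')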
Connolly answered 23/11, 2013 at 21:17 Comment(1)
db.myCollection.insert(records) should be replaced by db.myCollection.insert_many(records) see warning //anaconda/bin/ipython:1: DeprecationWarning: insert is deprecated. Use insert_one or insert_many instead. #!/bin/bash //anaconda/bin/python.appFederalize
C
77

Here is the quickest way: use the insert_many method from pymongo 3 together with the 'records' parameter of the to_dict method.

db.collection.insert_many(df.to_dict('records'))
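
If, as in the question, the dates live in a DatetimeIndex, to_dict('records') will silently drop them; a hedged sketch that first moves the index into a regular column (the 'date' column name is an assumption) before converting:

records = df.reset_index().rename(columns={'index': 'date'}).to_dict('records')
db.collection.insert_many(records)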
Chindwin answered 26/6, 2015 at 16:34 Comment(3)
This is the best idea imo, although I don't think the syntax is going to work for the original use case. The basic problem is that mongo needs string keys, whereas your df has a Timestamp index. You need to use the parameters passed to to_dict() to make the keys in mongo be something other than dates. A frequent use case that I've had is where you actually want each row in the df to be a record with an additional 'date' field.Amaryllidaceous
You should correct the code snippet to include the collection.Ultimate
This doesn't preserve any data types though does it? Example {'numfield': NumberLong("16797951")}Avril
F
9

odo can do it using

odo(df, db.myCollection)
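
For reference, a hedged sketch of what the full call might look like (database name and connection details are assumptions; odo is a separate package whose development has slowed, as a comment below notes):

from odo import odo
from pymongo import MongoClient

db = MongoClient('localhost', 27017)['mydatabase']  # database name assumed
odo(df, db.myCollection)  # writes the DataFrame into the collection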
Federalize answered 27/12, 2015 at 17:37 Comment(2)
I really like odo, but it fails terribly when the mongo uri has non alpha username, passwd. I wouldn't recommend it for anything but using an unauthenticated mongo.Hew
I think odo's development has halted or been delayed recently, as of 2019-ish.Prudential
B
5

There are some cool ideas in this question. In my case I have spent more time dealing with the movement of large dataframes. In those cases pandas tends to give you a chunksize option (for example in pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I use for that purpose.

from pymongo import MongoClient


def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    Write a DataFrame to a MongoDB collection in chunks.

    Parameters / Input
        my_df: the DataFrame to send to MongoDB
        database_name: database name
        collection_name: collection name (created if it does not exist)
        server: host of the MongoDB server
            Example: server = 'XXX.XXX.XX.XX' or 'localhost'
        mongodb_port: the port where the database is operating
            For example: mongodb_port = 27017
        chunk_size: number of records sent to the database per insert_many
            call. Default is 100.

    Output
        Prints "Done" when finished.

    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # To write: empty the collection first
    collection.delete_many({})
    # my_df = my_df.drop_duplicates(subset=None, keep='last')  # To avoid repetitions

    my_list = my_df.to_dict('records')
    l = len(my_list)
    # Chunk boundaries: every chunk_size records, plus the final partial chunk
    steps = list(range(chunk_size, l, chunk_size)) + [l]

    # Insert chunks of the dataframe
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j

    print('Done')
    return
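
A hedged usage example, assuming a DataFrame df and a MongoDB instance reachable on the given host/port:

write_df_to_mongoDB(df, database_name='mydb', collection_name='prices',
                    server='localhost', mongodb_port=27017, chunk_size=500)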
Bankston answered 6/3, 2018 at 9:48 Comment(2)
This is really useful, thanks. You may want to update the Args (Input) section with the current inputs.Scalariform
AttributeError: 'range' object has no attribute 'extend'Litho
R
4

If your dataframe has missing data (i.e. None, nan) and you don't want null key values in your documents:

db.insert_many(df.to_dict("records")) will insert keys with null values. If you don't want the empty key values in your documents you can use a modified version of the pandas .to_dict("records") code below:

from pandas.core.common import _maybe_box_datetimelike  # private pandas helper; removed in recent pandas versions
my_list = [dict((k, _maybe_box_datetimelike(v)) for k, v in zip(df.columns, row) if v != None and v == v) for row in df.values]
db.insert_many(my_list)

In the if v != None and v == v clause I've added checks to make sure the value is not None or nan before putting it in the row's dictionary. Now your .insert_many will only include keys with values in the documents (and no null data types).
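
On newer pandas versions, where that private helper no longer exists, a minimal sketch of the same idea using only public APIs (assuming you just want to skip None/NaN scalar values):

import pandas as pd

# Build one dict per row, keeping only keys whose values are not missing.
records = [
    {k: v for k, v in row.items() if not pd.isna(v)}
    for row in df.to_dict('records')
]
db.insert_many(records)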

Runic answered 15/6, 2016 at 0:0 Comment(1)
This is a good way because dealing with null values is indeed necessary when uploading a dataframe to mongodb, and this method is faster than DataFrame.to_dict(). BTW, columns = list(df.columns), then [{k: _maybe_box_datetimelike(v) for k, v in zip(columns, row) if v != None and v == v} for row in df.values] is even faster.Unpeopled
D
4

I use the following to insert the dataframe into a collection in the database.

df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
Dmz answered 5/6, 2021 at 20:59 Comment(1)
Question, if my data_dict has, say 1M rows, is it better to do one single insertMany? or should one do "batches" of insertMany?Relativize
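On that batching question: PyMongo's insert_many already splits a large list into server-sized batches internally, but if building one huge list in memory is a concern, a hedged sketch of manual chunking (the chunk size is arbitrary):

# Hypothetical chunked insert, assuming data_dict is the list from df.to_dict("records")
chunk = 10_000
for start in range(0, len(data_dict), chunk):
    myCollection.insert_many(data_dict[start:start + chunk])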
R
2

how about this:

db.myCollection.insert({id: df.to_json()})

id should be a unique string that you generate for that df
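
As the comments below point out, a bare id refers to Python's built-in function, so the key has to be an actual string; a hedged sketch (the key name and id value are made up):

db.myCollection.insert_one({'df_id': 'prices-2013-11-23', 'data': df.to_json()})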

Rufescent answered 23/11, 2013 at 20:20 Comment(5)
Thanks, I get the error InvalidDocument: documents must have only string keys, key was <built-in function id>Chaves
you have to generate that id by yourselfRufescent
Is this id the same as the usual _.id in mongo documents? If so, it looks like a random hash, how do I generate it?Chaves
It fails for @Chaves since id is a built-in function in Python, and overriding it is not recommended. You can generate a simple test id by using id(df), but since object IDs aren't persistent across sessions this might cause you problems depending on how you use it. Works for testing though.Trinidadtrinitarian
I got maximum recursion level reached error. Fixed it with sys.setrecursionlimit(1000000)Adventurous
D
1

Just make string keys!

import json
dfData = json.dumps(df.to_dict('records'))
saveData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = client.insert_one(saveData)  # 'client' here must be a pymongo Collection, not a MongoClient

##### load dfData
data = client.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
Deerskin answered 27/10, 2018 at 3:26 Comment(0)
J
1

If you want to send several rows at one time:

db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
Joellejoellen answered 23/8, 2020 at 15:34 Comment(0)
H
0

If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.

import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col


def createDocsFromDF(df, collection=None, insertToDB=False):
    docs = []
    for i in range(len(df)):
        doc = {c: df[c].iloc[i] for c in df.columns if c != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if type(val) == np.int64:
                doc[key] = int(val)
            if type(val) == np.float64:
                doc[key] = float(val)
            if type(val) == np.bool_:
                doc[key] = bool(val)
        docs.append(doc)
    if insertToDB and collection is not None:
        collection.insert_many(docs)
    return docs
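
A hedged usage example, assuming df is a DataFrame holding numpy dtypes:

docs = createDocsFromDF(df, collection=col, insertToDB=True)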
Honeysweet answered 22/1, 2021 at 0:29 Comment(0)
S
0

For upserts this worked.

for r in df2.to_dict(orient="records"):
    db['utest-pd'].update_one({'a': r['a']}, {'$set': r}, upsert=True)  # upsert=True inserts the record if no match exists

It does this one record at a time; there doesn't seem to be an upsert variant of insert_many that can take a different filter value for each record.
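
If you need a different filter per record in a single round trip, a hedged sketch using bulk_write with per-record UpdateOne operations (collection name taken from the answer above):

from pymongo import UpdateOne

ops = [UpdateOne({'a': r['a']}, {'$set': r}, upsert=True)
       for r in df2.to_dict(orient="records")]
db['utest-pd'].bulk_write(ops)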

Suttle answered 14/4, 2022 at 2:50 Comment(0)
L
0

Use PyMongoArrow, a python library built by MongoDB just for this purpose. It is built on top of PyMongo.

PyMongoArrow allows you to easily and quickly move data from MongoDB into many other data formats such as Pandas DataFrame, NumPy Array, or Apache Arrow Table. You can also directly write data back to MongoDB from these other data formats using PyMongoArrow.

All this can be achieved in only a few lines of code. Check out the documentation: https://mongo-arrow.readthedocs.io/en/latest/quickstart.html
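
A hedged sketch based on the quickstart (database and collection names are assumptions):

from pymongo import MongoClient
from pymongoarrow.api import write, find_pandas_all

coll = MongoClient()['mydb']['myCollection']  # names assumed
write(coll, df)                               # DataFrame -> MongoDB
df_back = find_pandas_all(coll, {})           # MongoDB -> DataFrame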

Liddy answered 2/10, 2023 at 19:33 Comment(0)
L
0

Very simple answer: first read the Excel file and convert it into a DataFrame using the pandas library.

df = pd.read_excel("/path/file.xlsx")

Then convert the DataFrame into a list of dictionaries:

fields = df.to_dict('records')

Then make a PyMongo connection and use the insert_many command:

mycol1.insert_many(fields)

Here is the full code:

import pandas as pd
import pymongo

myclient1 = pymongo.MongoClient("mongodb://localhost:27017/")
mydb1 = myclient1["Database_name"]
mycol1 = mydb1["Col_name"]

df = pd.read_excel("path/Excel.xlsx")
fields = df.to_dict('records')
mycol1.insert_many(fields)
Leffler answered 30/11, 2023 at 13:17 Comment(0)
