Convert CSV to mongoimport-friendly JSON using Python

I have a 300 MB CSV with 3 million rows of city information from Geonames.org. I am trying to convert this CSV to JSON so I can import it into MongoDB with mongoimport. The reason I want JSON is that it lets me specify the "loc" field as an array rather than a string, for use with the geospatial index. The CSV is encoded in UTF-8.

A snippet of my CSV looks like this:

"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,

The desired JSON output (charset issues aside) that works with mongoimport is below:

{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}

I have tried all the online CSV-to-JSON converters I could find, and they fail because of the file size. The closest I got was with Mr Data Converter, whose output would import into MongoDB after I removed the opening and closing brackets and the commas between documents. Unfortunately, that tool cannot handle a 300 MB file.

The JSON above is supposed to be UTF-8 encoded, but it still has charset problems, most likely due to a conversion error.

I have spent the last three days learning Python, trying csvkit, trying every CSV-to-JSON script on Stack Overflow, importing the CSV into MongoDB and converting the "loc" string to an array (which unfortunately retains the quotation marks), and even manually copying and pasting 30,000 records at a time. A lot of reverse engineering, trial and error, and so forth.

Does anyone have a clue how to produce the JSON above while keeping the encoding intact, as in the CSV above? I am at a complete standstill.

Credent answered 20/11, 2012 at 16:59 Comment(3)
Possible duplicate: https://mcmap.net/q/1410950/-csv-to-json-script – Compliant
My question is about formatting, not error messages. I am not getting any errors, but neither am I getting the desired output. – Credent
This question is not a duplicate: there are both encoding issues and special output-format requirements not present in the other question referred to above. – Yorktown

The Python standard library (plus simplejson for decimal encoding support) has all you need:

import csv, simplejson, decimal, codecs

data = open("in.csv")
reader = csv.DictReader(data, delimiter=",", quotechar='"')

with codecs.open("out.json", "w", encoding="utf-8") as out:
    for r in reader:
        for k, v in r.items():
            # make sure nulls are generated
            if not v:
                r[k] = None
            # parse and generate decimal arrays
            elif k == "loc":
                r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
            # generate a number
            elif k == "geonameid":
                r[k] = int(v)
        out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True) + "\n")

Where "in.csv" contains your big csv file. The above code is tested as working on Python 2.6 & 2.7, with about 100MB csv file, producing a properly encoded UTF-8 file. Without surrounding brackets, array quoting nor comma delimiters, as requested.

It is also worth noting that both the ensure_ascii and use_decimal arguments are required for the encoding to work properly in this case.

Finally, since the Python stdlib json package is based on simplejson, it will sooner or later gain decimal encoding support as well, at which point only the standard library will be needed.
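In the meantime, if exact decimal precision is not critical, a stdlib-only variant is already possible. A minimal sketch, assuming Python 3 and parsing the coordinates as floats instead of decimals:

import csv, json

with open("in.csv", encoding="utf-8", newline="") as data, \
     open("out.json", "w", encoding="utf-8") as out:
    for r in csv.DictReader(data):
        for k, v in r.items():
            if not v:
                r[k] = None          # empty CSV fields become JSON null
            elif k == "loc":
                r[k] = [float(n) for n in v.strip("[]").split(",")]
            elif k == "geonameid":
                r[k] = int(v)
        out.write(json.dumps(r, ensure_ascii=False) + "\n")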

Yorktown answered 20/11, 2012 at 18:44 Comment(3)
Petri, thank you, it worked! You are the best! Is it possible to order the output the same way as the CSV and preserve the geonameid field as a number instead of a string? The script added quotes to the geonameid field. – Credent
Updated the example so geonameid is encoded as a number as well. Does order really matter here, or are you just aiming for perfection for its own sake? :) You can switch to a regular csv.reader: first read the header row with headers = reader.next(), then use it to build an ordered dictionary for each row, i.e. r = OrderedDict(zip(headers, row)). Try it out; I am sure you can make it work. – Yorktown
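A minimal sketch of that suggestion, matching the Python 2 answer above (the per-field conversions from the answer would still need to be applied to each row):

import csv, simplejson, codecs
from collections import OrderedDict

data = open("in.csv")
reader = csv.reader(data, delimiter=",", quotechar='"')
headers = reader.next()    # the first row holds the column names

with codecs.open("out.json", "w", encoding="utf-8") as out:
    for row in reader:
        # zip keeps each value paired with its header, in CSV column order
        r = OrderedDict(zip(headers, row))
        out.write(simplejson.dumps(r, ensure_ascii=False) + "\n")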
I have noticed that queries on the alternatenames field are very slow because the entire field is treated as a single string. Search would be much faster if each alternative name were quoted and the field were made into an array, like this: alternatenames: ["Zamin Sukhteh", "Zamīn Sūkhteh"]. Is it possible to update the solution to do this in Python? I think anyone converting the Geonames database to MongoDB would find this better, because queries on the field are currently not practical. – Credent
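A hedged sketch of that change (not standalone code): one extra branch in the per-field loop of the answer above would turn the comma-separated names into a JSON array:

            # extra branch for the per-field loop in the answer above
            elif k == "alternatenames":
                r[k] = v.split(",")    # "Zamin Sukhteh,Zamīn Sūkhteh" -> ["Zamin Sukhteh", "Zamīn Sūkhteh"]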

Maybe you could try importing the CSV directly into MongoDB using:

mongoimport -d <dB> -c <collection> --type csv --file location.csv --headerline
Coupon answered 17/9, 2015 at 12:25 Comment(2)
This approach saved me quite a lot of memory on one of my servers versus running a Python script that reads the .csv file first. – Semiannual
I'm very glad to hear that :-) – Coupon