I have a 300 mb CSV with 3 million rows worth of city information from Geonames.org. I am trying to convert this CSV into JSON to import into MongoDB with mongoimport. The reason I want JSON is that it allows me to specify the "loc" field as an array and not a string for use with the geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this:
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output (except the charset) that works with mongoimport is below:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
I have tried all available online CSV-JSON converters and they do not work because of the file size. The closest I got was with Mr Data Converter (the one pictured above) which would import to MongoDb after removing the start and end bracket and commas between documents. Unfortunately that tool doesn't work with a 300 mb file.
The JSON above is set to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error?
I spent the last three days learning Python, trying to use Python CSVKIT, trying all the CSV-JSON scripts on stackoverflow, importing CSV to MongoDB and changing "loc" string to array (this retains the quotation marks unfortunately) and even trying to manually copy and paste 30,000 records at a time. A lot of reverse engineering, trial and error and so forth.
Does anyone have a clue how to achieve the JSON above while keeping the encoding proper like in the CSV above? I am at a complete standstill.