Python JSON to CSV - bad encoding, UnicodeDecodeError: 'charmap' codec can't decode byte
I have a problem converting nested JSON to CSV. For this I use https://github.com/vinay20045/json-to-csv (forked slightly to support Python 3.4); here is the full json-to-csv.py file. The conversion works if I set

    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('utf8','ignore')

and

fp = open(json_file_path, 'r', encoding='utf-8')

but when I import the CSV into MS Excel I see garbled Cyrillic characters, for example \xe0\xf1; English text is fine. I experimented with setting encode('cp1251','ignore'), but then I got an error: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined> (as in UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>)

import sys
import json
import csv

##
# This function converts an item like 
# {
#   "item_1":"value_11",
#   "item_2":"value_12",
#   "item_3":"value_13",
#   "item_4":["sub_value_14", "sub_value_15"],
#   "item_5":{
#       "sub_item_1":"sub_item_value_11",
#       "sub_item_2":["sub_item_value_12", "sub_item_value_13"]
#   }
# }
# To
# {
#   "node_item_1":"value_11",
#   "node_item_2":"value_12",
#   "node_item_3":"value_13",
#   "node_item_4_0":"sub_value_14", 
#   "node_item_4_1":"sub_value_15",
#   "node_item_5_sub_item_1":"sub_item_value_11",
#   "node_item_5_sub_item_2_0":"sub_item_value_12",
#   "node_item_5_sub_item_2_1":"sub_item_value_13"
# }
##
def reduce_item(key, value):
    global reduced_item

    #Reduction Condition 1
    if type(value) is list:
        i=0
        for sub_item in value:
            reduce_item(key+'_'+str(i), sub_item)
            i=i+1

    #Reduction Condition 2
    elif type(value) is dict:
        sub_keys = value.keys()
        for sub_key in sub_keys:
            reduce_item(key+'_'+str(sub_key), value[sub_key])

    #Base Condition
    else:
        reduced_item[str(key)] = (str(value)).encode('cp1251','ignore')


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("\nUsage: python json_to_csv.py <node_name> <json_in_file_path> <csv_out_file_path>\n")
    else:
        #Reading arguments
        node = sys.argv[1]
        json_file_path = sys.argv[2]
        csv_file_path = sys.argv[3]

        fp = open(json_file_path, 'r', encoding='cp1251')
        json_value = fp.read()
        raw_data = json.loads(json_value)

        processed_data = []
        header = []
        for item in raw_data[node]:
            reduced_item = {}
            reduce_item(node, item)

            header += reduced_item.keys()

            processed_data.append(reduced_item)

        header = list(set(header))
        header.sort()

        with open(csv_file_path, 'wt+') as f:#wb+ for python 2.7
            writer = csv.DictWriter(f, header, quoting=csv.QUOTE_ALL, delimiter=',')
            writer.writeheader()
            for row in processed_data:
                writer.writerow(row)

        print("Just completed writing csv file with %d columns" % len(header))

How do I convert Cyrillic correctly, and also skip bad characters?

Multitudinous answered 27/1, 2015 at 13:29

You need to know the Cyrillic encoding of the file you are going to open. For example, this is enough in Python 3:

with open(args.input_file, 'r', encoding="cp866") as input_file:
    data = input_file.read()
    structure = json.loads(data)

In Python 3 the data variable is then a plain Unicode str. In Python 2 there might be a problem feeding the input to json.

Also try printing a line in the Python interpreter to see whether the symbols look right. Without the input file it is hard to tell if everything is correct. And are you sure it is a Python problem and not an Excel one? Did you try opening the file in Notepad++ or a similar encoding-aware editor?

The most important thing when working with encodings is checking that the input and output encodings are right. I would suggest looking here.
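To make that concrete, here is a small sketch (the Cyrillic sample string is just an illustration) showing that only the codec the bytes were actually written in round-trips the text, while a wrong codec either raises or silently produces mojibake:

```python
# A short Cyrillic string encoded with two different codecs.
text = "Привет"
cp1251_bytes = text.encode("cp1251")
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong codec: latin-1 never raises, it just
# silently produces mojibake characters.
assert cp1251_bytes.decode("latin-1") != text

# Decoding with the codec the bytes were written in round-trips.
assert cp1251_bytes.decode("cp1251") == text
assert utf8_bytes.decode("utf-8") == text
```

A strict codec such as UTF-8 will instead raise UnicodeDecodeError on most cp1251-encoded Cyrillic bytes, which is the error class seen in the question.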

Pilaf answered 27/1, 2015 at 17:48
To clarify, that was my fault: I had added encoding settings myself on both sides of the code from GitHub, which was wrong. So the answer is to use encoding="utf8" in with open(args.input_file, 'r', encoding="utf8") as input_file:, and after that the file opens correctly in Excel. Also delete the encoding declarations from (str(value)).encode('cp1251','ignore') and fp = open(json_file_path, 'r', encoding='cp1251'). – Multitudinous
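For completeness, a minimal end-to-end sketch of this fix (the file paths and the tiny JSON payload are made up for illustration): read the JSON with the encoding it was actually saved in, keep the parsed values as str without any .encode() calls, and write the CSV as "utf-8-sig" so Excel sees the BOM and renders Cyrillic correctly:

```python
import csv
import json
import os
import tempfile

# Hypothetical input: a UTF-8 JSON file containing Cyrillic values.
tmp = tempfile.mkdtemp()
json_path = os.path.join(tmp, "in.json")
csv_path = os.path.join(tmp, "out.csv")
with open(json_path, "w", encoding="utf-8") as fp:
    json.dump({"items": [{"name": "Привет", "value": "мир"}]},
              fp, ensure_ascii=False)

# Read with the encoding the file was actually written in; the
# parsed values are plain str, so no .encode() calls are needed.
with open(json_path, "r", encoding="utf-8") as fp:
    data = json.load(fp)

# Write as "utf-8-sig": the BOM lets Excel auto-detect UTF-8,
# so Cyrillic text displays correctly instead of as mojibake.
with open(csv_path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, ["name", "value"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for item in data["items"]:
        writer.writerow(item)
```

The csv module's own recommendation of newline="" also avoids the blank extra rows that Excel otherwise shows on Windows.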

Maybe you could use chardet to detect the file's encoding.

import json
import chardet

File = 'arq.GeoJson'
# Detect the encoding from the raw bytes, then reopen as text.
enc = chardet.detect(open(File, 'rb').read())['encoding']
with open(File, 'r', encoding=enc) as f:
    data = json.load(f)

This avoids having to guess the encoding by hand.
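If installing chardet is not an option, a stdlib-only alternative is to try a short list of likely encodings in order (the helper name and candidate list below are just a sketch, not part of the original code):

```python
import json

def load_json_any_encoding(path, candidates=("utf-8", "cp1251", "cp866")):
    """Try each candidate encoding until one both decodes and parses."""
    last_err = None
    for enc in candidates:
        try:
            with open(path, "r", encoding=enc) as f:
                return json.load(f), enc
        except (UnicodeDecodeError, json.JSONDecodeError) as err:
            last_err = err
    raise last_err
```

Order matters here: strict codecs like UTF-8 should come first, because permissive single-byte codecs such as cp1251 will happily decode almost any byte sequence into mojibake without raising.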

Mescal answered 25/7, 2015 at 10:36
