Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]

3

9

The following sample code merges the text files from a given folder and its subfolders. I occasionally get the traceback shown below and am not sure where to look. I would also like help improving the code so that blank lines are not merged, and so that it displays the number of lines in the merged master file. It is probably a good idea either to clean up each file before merging or simply to ignore blank lines during the merge.

Each text file in the folder has no more than 1000 lines, but the aggregated master file could easily exceed 10,000 lines.

import os
root = 'C:\\Dropbox\\ans7i\\'
files = [(path,f) for path,_,file_list in os.walk(root) for f in file_list]
out_file = open('C:\\Dropbox\\Python\\master.txt','w')
for path,f_name in files:
    in_file = open('%s/%s'%(path,f_name), 'r')

    # write out root/path/to/file (space) file_contents
    for line in in_file:
        out_file.write('%s/%s %s'%(path,f_name,line))
    in_file.close()

    # enter new line after each file
    out_file.write('\n')

with open('master.txt', 'r') as f:
  lines = f.readlines()
with open('master.txt', 'w') as f:
  f.write("".join(L for L in lines if L.strip())) 



Traceback (most recent call last):
  File "C:\Dropbox\Python\master.py", line 9, in <module>
    for line in in_file:
  File "C:\PYTHON32\LIB\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 972: character maps to <undefined>
Hampshire answered 31/8, 2012 at 10:4 Comment(5)
Does the error really occur in for line in in_file? I assume it's the line after that, but I'm not really sure. Can you test if Python runs into the loop? - Ovariotomy
@Fabian: it's that line, the traceback is quite clear. While reading the file it throws an error. - Spontaneous
@MartijnPieters but a UnicodeDecodeError? Weird. - Ovariotomy
@Fabian: not weird at all, Python 3 decodes text files automatically. - Spontaneous
@MartijnPieters ahhh, Python 3. Okay then. - Ovariotomy
20

The error is thrown because Python 3 opens your files with a default encoding that doesn't match the contents.

If all you are doing is copying file contents, you'd be better off using the shutil.copyfileobj() function together with opening the files in binary mode. That way you avoid encoding issues altogether (as long as all your source files are the same encoding of course, so you don't end up with a target file with mixed encodings):

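import shutil
import os.path

# `files` is the list of (path, f_name) tuples built in the question
with open('C:\\Dropbox\\Python\\master.txt', 'wb') as output:
    for path, f_name in files:
        # binary mode: bytes are copied as-is, so nothing is decoded
        with open(os.path.join(path, f_name), 'rb') as source:
            shutil.copyfileobj(source, output)
        output.write(b'\n')  # insert extra newline between files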

I've cleaned up the code a little to use context managers (so your files get closed automatically when done) and to use os.path to create the full path for your files.
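If blank lines should also be dropped without leaving binary mode, the copy can be done line by line instead. The following is only a sketch: the helper name merge_skipping_blank_lines and its parameters are made up for illustration, and it assumes the files use an ASCII-compatible encoding (for example UTF-8 or cp1252), so that b'\n' really is the line terminator.

```python
def merge_skipping_blank_lines(paths, out_path):
    """Append the non-blank lines of every file in `paths` to `out_path`.

    Hypothetical helper, not part of the original answer.
    Returns the number of lines written.
    """
    written = 0
    with open(out_path, 'wb') as output:
        for path in paths:
            with open(path, 'rb') as source:
                for line in source:           # bytes objects, split on b'\n'
                    if not line.strip():      # empty or ASCII whitespace only
                        continue
                    if not line.endswith(b'\n'):
                        line += b'\n'         # last line of a file may lack a newline
                    output.write(line)
                    written += 1
    return written
```

Because nothing is decoded, a stray byte such as 0x81 is copied through unchanged instead of raising UnicodeDecodeError.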

If you do need to process your input line by line, you'll need to tell Python what encoding to expect, so it can decode the file contents to Python string objects:

open(path, mode, encoding='UTF8')

Note that this requires you to know up front what encoding the files use.
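As a rough sketch of that line-by-line approach applied to the merge in the question: the function name merge_text_files, its parameters, and the 'utf-8' default below are assumptions for illustration, not something taken from the question. Decoding is strict, so a file in a different encoding still raises UnicodeDecodeError rather than being merged incorrectly.

```python
import os


def merge_text_files(root, out_path, encoding='utf-8'):
    """Merge every file under `root` into `out_path`, skipping blank lines.

    Hypothetical helper: each output line is prefixed with the path of the
    file it came from, as in the question. Returns the number of lines written.
    """
    out_abs = os.path.abspath(out_path)
    written = 0
    with open(out_path, 'w', encoding=encoding) as output:
        for path, _, file_list in os.walk(root):
            for f_name in sorted(file_list):
                full_path = os.path.join(path, f_name)
                if os.path.abspath(full_path) == out_abs:
                    continue  # do not merge the master file into itself
                with open(full_path, 'r', encoding=encoding) as source:
                    for line in source:
                        if line.strip():  # ignore blank / whitespace-only lines
                            output.write('%s %s\n' % (full_path, line.rstrip('\r\n')))
                            written += 1
    return written
```

The return value is the number of lines written, so it can be printed as a line count for the master file.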

Read up on the Python Unicode HOWTO if you have further questions about Python 3, files and encodings.

Spontaneous answered 31/8, 2012 at 10:21 Comment(11)
My test files are very basic, but some contain pasted content, and that seems to be causing the problem. - Hampshire
Just tried the first code sample above and now get a more precise error: TypeError: invalid file:, pointing at a file. The strange thing is that this file is just blank. - Hampshire
@user1582596: can you use pastie.org to show me the traceback? - Spontaneous
Also, the copyfileobj function implementation is exceedingly simple; all it does is read a buffer-size amount from the input and write that to the output, until all of the input file has been read. - Spontaneous
Thanks, here is the latest code with the traceback. As a test I deleted that .txt file, but the same error then came back for another text file. pastie.org/4630939 - Hampshire
@user1582596: Ah, my mistake, updated the example. The os.path.join() function takes multiple args, not a tuple. - Spontaneous
Sorry, use .write(b'\n') instead; we need to write bytes, not Python (unicode) strings, as the file is open in binary mode. Answer adjusted. - Spontaneous
@user1582596: that's the code below my changes, where you read all lines in the master file to strip them. You'll still have to figure out an encoding for those, or devise a way to read and write in binary mode where you remove all whitespace around newlines. Also see the second half of my answer for that part. Dealing with stripping lines in binary mode would be an interesting new question on Stack Overflow perhaps. - Spontaneous
Is there any way to determine which text file is causing the problem? Maybe I can fix the issue manually and re-run the program. - Hampshire
@user1582596: you can do that by looping over all the filenames, printing the filename, then opening the file in text mode (open(filename, 'r')) and looping over the file to read all the lines, until you get the exception. The last filename printed is your problem file. - Spontaneous
"print (in_file)" did the trick; I found issues with two text files and fixed them manually. Many thanks for your time and recent help. Hats off to your knowledge and your spirit of sharing it. - Hampshire
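The file-hunting procedure described in the comments can also be automated. This is a minimal sketch under stated assumptions: the helper name find_undecodable_files is hypothetical, and the 'cp1252' default is only chosen because that is the codec named in the question's traceback.

```python
def find_undecodable_files(paths, encoding='cp1252'):
    """Return a list of (path, error) pairs for files that fail to decode.

    Hypothetical helper; `encoding` should be whatever codec the merge
    script would otherwise use.
    """
    problems = []
    for path in paths:
        try:
            with open(path, 'r', encoding=encoding) as source:
                for _ in source:  # iterating forces the whole file to be decoded
                    pass
        except UnicodeDecodeError as exc:
            problems.append((path, exc))
    return problems
```

Unlike the print-and-wait approach, this keeps going after the first failure, so every problem file is reported in one run.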
6

I faced a similar issue while removing a file with the os module's remove function.

The change I made was from:

file = open(filename)

to

file = open(filename, encoding="utf8")

That is, add encoding="utf-8".

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. ... UTF-8 uses the following rules: If the code point is < 128, it's represented by the corresponding byte value.

Handwriting answered 8/11, 2021 at 23:44 Comment(0)
-2

Handling import and decode errors when working with files

  1. Open the file with its full absolute path

(source is the absolute path of the folder; os.listdir returns all the files inside it)

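import os
# `source` must already hold the absolute path of the folder
file_list = os.listdir(source)
for name in file_list:
    absolute_file_path = os.path.join(source, name)
    file = open(absolute_file_path)
  2. Specify the encoding when opening the file

# encoding must be passed by keyword (the third positional argument of open() is buffering)
file = open(absolute_file_path, mode, encoding=encoding, errors='ignore')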
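To make the effect of the errors argument concrete, here is a small illustration using bytes.decode(), which accepts the same error-handler names as open(). The sample bytes are invented for the example; 0x81 is the byte reported in the question's traceback.

```python
raw = b'caf\x81'  # 0x81 has no mapping in cp1252

try:
    raw.decode('cp1252')  # default is errors='strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode('cp1252', errors='ignore')    # undecodable byte is dropped
replaced = raw.decode('cp1252', errors='replace')  # byte becomes U+FFFD

print(strict_failed, repr(ignored), repr(replaced))
```

Note that errors='ignore' silently discards data, so it hides the problem rather than fixing the encoding mismatch.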

Jacquettajacquette answered 29/11, 2022 at 7:24 Comment(0)
