So I have a series of huge files (several GB) of tabular data. They are plain text files, and each column has a fixed width. The width is indicated by a run of dashes right below the headers. So far so good: I have a script that reads those files line by line and outputs them as XML.
One challenge is that most, but NOT all, of the content is encoded in UTF-8. Trying to decode the content while processing will throw an error somewhere down the line, so my script only reads and processes byte strings. This causes readability issues in the output, but that's tolerable and not my concern here.
My problem: the widths were calculated with the decoded content in mind, so non-ASCII characters that are represented by several bytes in UTF-8 are not accounted for.
Example: the string 'Zürich, Albisgütli' has a length of 18 characters and is found in a column with a fixed width of 19. Its UTF-8 representation, however, is 'Z\xc3\xbcrich, Albisg\xc3\xbctli', which is 20 bytes long and thus throws off the parsing of the rest of the data row.
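The mismatch is easy to reproduce (a minimal example, using the string from above):

s = 'Zürich, Albisgütli'
b = s.encode('utf-8')
print(len(s))   # 18 -- the character count the fixed widths are based on
print(len(b))   # 20 -- the byte count that my byte-level parsing actually sees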
Solution attempts so far:
- Decoding the data first so that the lengths are correct. But as mentioned, a few data entries aren't actually UTF-8, and I'd prefer to avoid the whole encoding business (see the sketch right after this list).
- Identifying all non-ASCII characters that could come up so that I can adjust the parsing. This is a problem because the data are huge and I'm not confident I can compile an exhaustive list of the non-ASCII characters that might occur. I also don't know yet how to correct the parsing efficiently in those cases.
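To make the first attempt (and why it fails) concrete, a minimal sketch; the second byte string anticipates the value shown in the edit below:

line = b'Z\xc3\xbcrich, Albisg\xc3\xbctli'
print(len(line))                  # 20 bytes
print(len(line.decode('utf-8')))  # 18 characters -- matches the fixed widths again

bad = b'Z\xfcrich'                # not valid UTF-8 (Windows-1252 instead)
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)                    # this is the error that derails the decode-first route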
Another issue is that I'm using copied code for the parsing, so I don't know how I could change its behavior to count non-ASCII characters differently.
Thankful for any pointers to a possible approach!
Code as it is now:
import re
import struct
import time

# get_widths, xml_escape, table_w_num, infostep and start_time are assumed to be
# defined elsewhere in the script; infile and outfile are opened in binary mode.

def convert(infile, outfile):
    secondline = infile.readline()              # the dashes are in this line
    maxlen = len(secondline)
    fieldwidths = get_widths(secondline)        # counts the dashes to get the widths

    # code taken from: https://mcmap.net/q/224867/-how-to-efficiently-parse-fixed-width-files
    # builds a struct format string such as '19s 1x 10s' from the field widths
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                         for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from

    c = 0
    outfile.write(b"<?xml version='1.0' encoding='UTF-8'?>\n")
    namespace = f'xmlns="http://www.bar.admin.ch/xmlns/siard/2/table.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.bar.admin.ch/xmlns/siard/2/table.xsd {table_w_num}.xsd" version="2.1"'.encode()
    outfile.write(b'<table ' + namespace + b'>\n')

    for line in infile:
        # pad short lines with spaces so unpack_from always has enough bytes
        padded_line = line + b' ' * (maxlen - len(line))
        data = [elem.strip() for elem in parse(padded_line)]

        if b"Albis" in line:                    # debugging output for the problem row
            print(line)
            print(data)

        row = b''
        for n, elem in enumerate(data, start=1):
            # timestamp fix: 'YYYY-MM-DD hh:mm:ss[.fff]' -> 'YYYY-MM-DDThh:mm:ssZ'
            elem = re.sub(rb"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}(\.\d+)?)\S*?",
                          rb"\g<1>T\g<2>Z", elem)
            if elem == b'' or elem == b'NULL':
                pass                            # skip empty and NULL fields
            else:
                row = b'%s<c%d>%s</c%d>' % (row, n, xml_escape(elem), n)
        row = b"<row>%s</row>" % (row,)
        outfile.write(row + b'\n')

        c += 1
        if c % infostep == 0:
            timestamp = int(time.time() - start_time)
            print(f"Quarter done, time needed: {timestamp} seconds")
    outfile.write(b'</table>')
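For reference, a stripped-down, standalone illustration of what the copied struct-based splitting does, with made-up widths instead of the ones derived from the real dash line (a negative width means "skip these bytes", which is what the 'x' format character is for):

import struct

fieldwidths = [19, -1, 10]   # hypothetical: 19-byte column, 1-byte gap, 10-byte column
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                     for fw in fieldwidths)
print(fmtstring)                       # '19s 1x 10s'

fieldstruct = struct.Struct(fmtstring)
line = b'Zurich, Albisguetli 2020-01-01'
print(fieldstruct.unpack_from(line))   # (b'Zurich, Albisguetli', b'2020-01-01')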
EDIT:
I'm now trying to get away from my handwritten and likely-to-fail code, but the problem from the second paragraph is kicking in. After a few thousand rows I suddenly find this:
b'Z\xfcrich'
This decodes fine as ANSI/Windows-1252. Running the ftfy library over the whole file first somehow didn't catch this. I'm hesitant to write chaotic code with a bunch of try/except blocks that attempt to decode the lines, and I don't even know whether whole lines are suddenly in ANSI or just single fields.
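If I do go down the try/except road, the fallback could at least be contained in a single small helper, roughly like this (assuming Windows-1252 really is the only other encoding in the files, which I haven't verified; latin-1 is the last resort because decoding with it never fails):

def decode_lenient(raw):
    # try UTF-8 first, then Windows-1252 ('ANSI'), then latin-1 as a catch-all
    for enc in ('utf-8', 'cp1252'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.decode('latin-1')

print(decode_lenient(b'Z\xc3\xbcrich'))  # 'Zürich' via UTF-8
print(decode_lenient(b'Z\xfcrich'))      # 'Zürich' via the Windows-1252 fallback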
Comments:

'Z\xc3\xbcrich, Albisg\xc3\xbctli' looks suspiciously like what UTF-8 bytes look like in a debugger. This page is UTF-8, like most web sites. Check the source, you won't find any escape sequences. – Gerlac

… working behind a corporate firewall …

Does corporate want to pay 3x your gross salary due to time wasted replicating libraries? And then handling the bugs and edge cases that were already solved? For example, there are a lot of XML libraries that can produce valid XML from data. You could even use a template engine if you think XML libraries are slow (they aren't). Are you sure you aren't missing some closing bracket or slash in the raw text you generated? How much time would it take you to identify such a problem? With an XML library, it's 0. – Gerlac

… codec error when trying to read a line. Reading with the wrong encoding isn't something that can be solved after the fact. – Gerlac