Dealing with non-ASCII characters when parsing fixed-width txt file
So I have a series of huge files (several GB) of tabular data. They are plain text files and each column has a fixed width, indicated by a number of dashes right below the headers. So far so good: I have a script that reads those files line by line and outputs them to XML.

One challenge is that most but NOT all of the content is encoded in UTF-8. Trying to decode the content while processing will throw an error somewhere down the line. Hence, my script only reads and processes byte strings. This will cause readability issues in the output but that's tolerable and not my concern.

My problem: The widths were calculated with the decoded content in mind. Non-ASCII characters that are represented by several bytes in UTF-8 are not accounted for.

Example: The string 'Zürich, Albisgütli' has a length of 18 characters and sits in a column with a fixed width of 19. Its UTF-8 representation, however, is 'Z\xc3\xbcrich, Albisg\xc3\xbctli', which is 20 bytes long and thus throws off the parsing of the rest of the data row.
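To illustrate the mismatch in an interpreter session (not part of the script):

>>> s = "Zürich, Albisgütli"
>>> len(s)                   # characters - what the column widths are based on
18
>>> len(s.encode("utf-8"))   # bytes - what my byte-level parsing actually sees
20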

Solution attempts so far:

  • Tried decoding the data first so that the lengths come out right, but as mentioned, a few data entries aren't actually UTF-8 and I'd prefer to avoid the whole encoding question.
  • Identify all non-ASCII characters that could come up so that I can adjust the parsing. This is an issue because the data are huge and I'm not confident I can come up with an exhaustive list. Also, I don't know yet how to efficiently correct the parsing in these cases.

One issue is also that I'm using copied code for the parsing, so I don't know how I could change its behavior to count non-ASCII characters differently.

Thankful for any pointers what a possible approach could be!

Code as it is now:

import re
import struct
import time

# get_widths, xml_escape, table_w_num, infostep and start_time are defined elsewhere in the script
def convert(infile, outfile):
    secondline = infile.readline() # the dashes are in this line
    maxlen = len(secondline)
    fieldwidths = get_widths(secondline) # counts the dashes to get the widths

    # code taken from: https://mcmap.net/q/224867/-how-to-efficiently-parse-fixed-width-files
    fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                        for fw in fieldwidths)
    fieldstruct = struct.Struct(fmtstring)
    parse = fieldstruct.unpack_from
    
    c = 0
    
    outfile.write(b"<?xml version='1.0' encoding='UTF-8'?>\n")
    
    namespace = f'xmlns="http://www.bar.admin.ch/xmlns/siard/2/table.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.bar.admin.ch/xmlns/siard/2/table.xsd {table_w_num}.xsd" version="2.1"'.encode()
    
    outfile.write(b'<table ' + namespace + b'>\n')
    for line in infile:
        diff = maxlen - len(line)
        # pad short lines with spaces so unpack_from always has enough bytes
        padded_line = bytearray(line)
        padded_line += b' ' * diff

        data = [elem.strip() for elem in parse(padded_line)]
        
        if b"Albis" in line:
            print(line)
            print(data)
        row = b''
        for elem, n in zip(data, range(1, len(data)+1)):
            # Timestamp-Fix
            elem = re.sub(rb"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}(\.\d+)?)\S*?", rb"\g<1>T\g<2>Z", elem)
            if elem == b'' or elem == b'NULL':
                pass
            else:
                row = b'%s<c%s>%s</c%s>' % (row, str(n).encode(), xml_escape(elem), str(n).encode())
        row = b"<row>%s</row>" % (row)
        outfile.write(b''.join([row, b'\n']))
        
        c += 1
        if c % infostep == 0:
            timestamp = int(time.time() - start_time)
            print(f"Quarter done, time needed: {str(timestamp)} seconds")
    
    outfile.write(b'</table>')

EDIT:

Now trying to get away from my handwritten and likely-to-fail code, but the problem from the second paragraph is kicking in. After a few thousand rows suddenly I find this:

b'Z\xfcrich'

This decodes fine as ANSI/Windows-1252. Running the ftfy library on the whole file first somehow didn't catch this. I'm hesitant to write chaotic code with a bunch of try/except blocks that try to decode the lines. I don't even know whether whole lines are suddenly in ANSI or just single fields.
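If I do end up decoding, the least chaotic version I can think of is a single fallback helper instead of scattered try/except blocks - a minimal sketch, assuming the stray bytes are Windows-1252 (which, as noted, may not hold for the whole file):

def decode_field(raw: bytes) -> str:
    # Try UTF-8 first, then Windows-1252; latin-1 as a last resort,
    # since it maps every possible byte and therefore never raises.
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")

This only works per field, though, so the fixed-width splitting would still have to happen on the raw bytes first.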

Decurion answered 26/7, 2022 at 16:37 Comment(9)
Z\xc3\xbcrich, Albisg\xc3\xbctli looks suspiciously like what UTF-8 bytes look like in a debugger. This page is UTF-8, like most web sites. Check the source, you won't find any escape sequences. - Gerlac
Where do you load that file? The code you posted doesn't seem relevant. You're reading a text file, so it's character counts that matter, not bytes. Python strings are Unicode already, so by definition, Python can already count characters instead of bytes. Pandas's read_fwf has no problem with UTF-8 and neither do other libraries. - Gerlac
What you used is a 2011 hack that splits on bytes instead of characters. It doesn't work with UTF-8. Simply reading lines one by one and slicing would work. Pandas and numpy can read fixed-width files and even detect the schema. - Gerlac
Pandas was unavailable in my Python installation last I checked (working behind a corporate firewall). Not sure about numpy. I'll try to get either of them working, should work well then. - Decurion
Why not just load the lines and split the strings? That's the simplest solution. Get the code to work first and then try to optimize it. A more meaningful optimization would be to split the loading and XML generation code into separate methods. Right now you can't even debug your data loading code, much less test it. - Gerlac
"working behind a corporate firewall": does corporate want to pay 3x your gross salary for time wasted replicating libraries? And then handling the bugs and edge cases that were already solved? For example, there are a lot of XML libraries that can produce valid XML from data. You could even use a template engine if you think XML libraries are slow (they aren't). Are you sure you aren't missing some closing bracket or slash in the raw text you generated? How much time would it take you to identify such a problem? With an XML library, it's zero. - Gerlac
BTW Pandas has both to_xml and read_fwf. What you do can be done in 2 lines. 3 if you need to call rename to rename columns. - Gerlac
Another BTW: The answer you used handles UTF8 in the Update 3 snippet. You used the old Python 2.7 snippet. - Gerlac
How is the file opened? If it contains invalid byte sequences you'd get a codec error when trying to read a line. Reading with the wrong encoding isn't something that can be solved after the fact. - Gerlac

The first snippet in the related answer only works for single-byte codepages because it counts bytes, not characters. It doesn't even work for UTF-16, which usually uses 2 bytes per character, and certainly not for UTF-8, which uses a variable number of bytes per character. That was pointed out in the comments.
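For example, applying the question's byte-based struct format to the sample string shows the 18-byte first field swallowing part of the next column, because the two ü characters each take two bytes (the widths here mirror the example further down):

>>> import struct
>>> line = "Zürich, Albisgütli3456789".encode("utf-8")
>>> struct.Struct("18s 3s 4s").unpack_from(line)
(b'Z\xc3\xbcrich, Albisg\xc3\xbct', b'li3', b'4567')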

The same answer shows how to handle UTF8 in the third snippet:

from itertools import accumulate, zip_longest
def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse

Using it on a string that includes the question's sample text produces the expected results. Note that the original string has 18 characters, not 19:

>>> parser = make_parser([18,3,4])
>>> line="Zürich, Albisgütli3456789"
>>> parser(line)
('Zürich, Albisgütli', '345', '6789')

And to prove this really does work with UTF8 text files:

with open('testutf8.txt',encoding="utf-8",mode='w') as f:
    for i in range(3):
        f.write(line)
        f.write('\n')

with open('testutf8.txt',encoding="utf-8",mode='r') as f1:
    for line in f1.readlines():
        print(parser(line))
-------
('Zürich, Albisgütli', '345', '6789')
('Zürich, Albisgütli', '345', '6789')
('Zürich, Albisgütli', '345', '6789')

The question's code should be split into separate functions, one to read the data and another to generate the XML output. Both operations are already available through various modules, though. There are several modules that can read fixed-width files, several XML parsing and serialization libraries, and some, like Pandas, can read multiple data formats, process the data and export it as XML.
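A rough sketch of that split, keeping the question's byte-oriented reading but isolating it from the output generation (function names are illustrative):

def read_rows(infile, parse, maxlen):
    # Yield one tuple of stripped raw field values per data line.
    for line in infile:
        padded = line.ljust(maxlen, b' ')
        yield tuple(field.strip() for field in parse(padded))

def write_xml(rows, outfile):
    # Serialize already-parsed rows; see the ElementTree sketch further down.
    ...

With the reading isolated like this, the loading code can be tested on a few sample lines without touching the XML side at all.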

Using Pandas

Pandas is one of the most popular, if not the most popular, modules for analytics and data science. It's a great tool for data processing too. It's part of Python distributions created for analysis, like Anaconda.

For example, with Pandas, this code could be replaced with just 2 function calls:

import pandas as pd

namespaces = {
    "xmlns": "http://www.bar.admin.ch/xmlns/siard/2/table.xsd",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    ...
}

df = pd.read_fwf('data.csv')
df.to_xml('data.xml', root_name='table', namespaces=namespaces)

This will be faster and use less memory than the explicit string manipulations in the question's code. String manipulations create new temporary strings each time, costing in both CPU and RAM.

By default read_fwf will try to infer the column widths based on the first 100 rows. You can increase the number of rows with the infer_nrows parameter or specify a list of (from,to) tuples with the colspecs parameter, eg colspecs=[(1,3),(3,5),(10,14)]
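Since the question already derives the field widths from the dashes line, those widths can also be passed in directly; the file name, the encoding and the skiprows value are assumptions about the file layout, and fieldwidths stands for the list the question computes with get_widths:

# widths is an alternative to colspecs: a list of field widths in characters;
# skiprows=[1] skips the dashes line under the headers
df = pd.read_fwf('data.txt', widths=fieldwidths, skiprows=[1], encoding='utf-8')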

to_xml offers several parameters for controlling the XML output, like namespaces, what element names to use for the root and rows, which columns to output as attributes and which as child elements, etc. It can even write to a compressed file.

Attribute names can be specified through the attr_cols parameter, eg:

df.to_xml(attr_cols=[
          'index', 'shape', 'degrees', 'sides'
          ]) 

You can also rename Dataframe columns, change their type eg to parse string fields into dates or numbers:

df['Timestamp'] = pd.to_datetime(df['Col3'],infer_datetime_format=True)
df=df.rename(columns={'Col1':'Bananas',...})

Using Standard Library xml modules

Even if you can't use Pandas to solve the entire problem in a couple of lines, you can use one of Python's XML processing modules. These are part of the Python Standard Library, which means they're available in all distributions.

The Building XML Documents example of the ElementTree module shows how to create XML docs programmatically:

>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)
<a><b /><c><d /></c></a>
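Applied to the question's table/row/cN structure, a rough sketch (it assumes the fields are already decoded strings; the SIARD namespace attributes are left out for brevity):

import xml.etree.ElementTree as ET

def rows_to_xml(rows, path):
    # rows: an iterable of sequences of decoded field strings
    table = ET.Element('table')
    for fields in rows:
        row = ET.SubElement(table, 'row')
        for n, value in enumerate(fields, start=1):
            if value and value != 'NULL':
                # ElementTree escapes the text, so no manual xml_escape is needed
                ET.SubElement(row, f'c{n}').text = value
    ET.ElementTree(table).write(path, encoding='utf-8', xml_declaration=True)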

Corporate Policies

That's not a logical argument, though it may sound like one to non-technical managers with limited security experience, and certainly limited Python (or C#, Go, JavaScript) experience. Security is one of the main reasons even "standard" libraries/modules/assemblies are distributed and updated through package managers. Nobody wants (or can afford) to wait 2 years for security and bug fixes any more.

The Python Standard Library contains modules that started as third-party packages; that's certainly the case with ElementTree. The docs often advise using external packages to handle more complex cases.

Besides, Pandas is almost certainly used in the company already. It's one of the most common, if not the most common, libraries in analytics and data science. In fact, distributions like Anaconda already include it.

Does whoever wrote that policy understand they're telling you to keep using already-disclosed security vulnerabilities and consciously deploy insecure code? Because that's what you get if you don't upgrade packages. And every Python distribution comes with a lot of packages that require updating after a while.

Finally, with your hand-written code you have to reinvent and support code that's already written, tested and vetted for several years in various packages. That's time that's not spent producing valuable, billable work for the company.

A rule of thumb is that a company needs to make 2-3x your gross salary to justify the work. How much will be the cost of hand-written text reading and XML generation over the product's lifetime, when you include bug fixes and downtime incurred due to those bugs?

Gerlac answered 27/7, 2022 at 7:14 Comment(17)
"That was pointed out in the comments" - by me. That was pointed out by me. - Decurion
You said several times that "there are several libraries". Great. I agree with you it would have been more efficient. Now... maybe name those libraries? - Decurion
@chartmann I did. I also included explanations of why the company doesn't really want to prevent you from using well tested and secure libraries. Even the Python Standard Library includes modules that are developed externally. Worst case, ask them to install Anaconda. - Gerlac
@chartmann as for "pointed out", I meant the linked answer. The very first comment by Reiner Gerecke warns that the first snippet doesn't work with Unicode. The final snippet was posted in 2013 and has been updated with various fixes since. - Gerlac
While your paragraphs about corporate policy will be helpful for other users, you're making a few assumptions about my employer that are not true. People are working to get us full access to Python, but it apparently takes a while. Your arguments are logical, yes, but my employer is not. You're also shooting the messenger here, which is not appreciated, just saying. - Decurion
In that case you misunderstood the answer. I'm not shooting the messenger. Half of the real problem here isn't technical, so I included the answers given to similar questions in the past. - Gerlac
As for shooting the messenger - you should see what the author of the answer you linked to said when I asked for an edit that puts the UTF-8 code on top. - Gerlac
BTW you haven't addressed a central issue yet, the random non-UTF-8 elements present in the file. It's the whole reason I went looking for byte-based solutions in the first place. Some research has led me to the ftfy library. It looks very promising. - Decurion
@chartmann you didn't post any non-UTF-8 element. Z\xc3\xbcrich, Albisg\xc3\xbctli is how a debugger displays the bytes of the UTF-8 string Zürich, Albisgütli. In UTF-8 the US-ASCII characters, ie characters in the range 0x00-0x7F, are encoded using the same byte as US-ASCII. Characters above 0x7F are encoded using two or more bytes. ü is encoded using the bytes 0xC3 and 0xBC (compart.com/en/unicode/U+00FC). - Gerlac
I did not post any, but the second paragraph explains that not all of the data are in UTF-8. It's a classic "not my mistake, but my problem now" situation. I have a few solution options on the horizon right now that I am testing. Got myself full Python with all libraries now (sometimes miracles happen when you insist on them...) - thanks for your help so far. - Decurion
Post an example then. You can't ask people to guess what's going on. "Not UTF-8" explains little, especially when you already posted UTF-8 bytes. b'Z\xc3\xbcrich, Albisg\xc3\xbctli' is UTF-8. Are you saying that someone wrote half of the file using UTF-8 and another half using a single-byte codepage? You wouldn't be able to open such a file in any text editor without garbling or losing the non-US-ASCII characters. - Gerlac
@chartmann I posted the actual results of the method on the string you provided. If the file contained a mix of UTF-8 and non-UTF-8 characters you'd get a codec error when you tried to read the line with the invalid byte sequence. - Gerlac
I posted an EDIT with the problematic byte sequence. It's random ANSI content in this huge file. The file is 99% UTF-8 and has ANSI (and possibly other codecs) used in random places. This makes Pandas fail, unfortunately. Since I don't know what other codecs and what kind of symbols come up, I'm hesitant. I think this is why I initially tried to do my task by reading the data as byte strings. I open the file with the "rb" argument and iterate through it line by line because it is too large to be opened all at once. - Decurion
@chartmann ask whoever created that file to fix it then. You can't just assume it's Latin1. The byte 0xFC is valid in other codepages as well. In ISO 8859-7 and Windows-1253 it's ό. In ISO 8859-5 (Cyrillic) it's ќ. It's extremely unlikely that the same file has mixed encodings though. It's more likely that some files are UTF-8 and some are Latin1. - Gerlac
@chartmann you can use the chardet module to try and determine a file's encoding. This is shown in this SO question. - Gerlac
I agree that I can't assume the meanings of the bytes. That is why I wanted to just move all the byte strings into XML - handing that problem over, if you will. But it's not possible because the fields with special characters appear longer than they actually are, throwing off the fixed-width parsing. Currently trying ftfy.fix_file because it handled the one example well. However, you point out correctly that assuming Windows-1252 isn't wise. I do suspect that if I'm unlucky, ISO-8859-2 could be present as well. - Decurion
So Pandas ended up dealing with a lot of the noise on its own; it is clearly the superior way to go about this. Some problems remain, but all in all your comment contains 100% of the solution based on what I posted. The remaining problems are new to me and will have to be handled by the people who created this data. - Decurion
