Iterate a large .xz file line by line in python

Asked 18/3, 2018 at 12:52 Answered 18/3, 2018 at 13:4

I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset, and I need to read it line by line because it is too big to load into memory at once. Does anyone have an idea how to do this?

I already tried the approach from "How to open and read LZMA file in-memory", but it's not working.

EDIT: I got this error: 'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)

It is raised on the line "for line in uncompressed:" of the code from the link.

EDIT2: My code (using Python 3.5):

with open(filename) as compressed:
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

Cottonwood answered 18/3, 2018 at 12:52 Comment(5)

How is it not working? – Upwards 18/3, 2018 at 12:53

Questions seeking help debugging should include a minimal reproducible example – Isolde 18/3, 2018 at 12:55

I'll edit the question – Cottonwood 18/3, 2018 at 12:56

Can we see the code that you are using not just the error message? And what version of Python are you using? – Graecize 18/3, 2018 at 12:57

I have edited the question – Cottonwood 18/3, 2018 at 13:0

I faced the same problem a few weeks ago. This snippet worked for me:

import lzma
with lzma.open('filename.xz', mode='rt') as file:
    for line in file:
        print(line)

This assumes that the text in the compressed file is encoded in UTF-8 (which was the case for my data). lzma.open() has an encoding argument that lets you set another encoding if needed.

EDIT (after your own edit): try forcing encoding='utf-8' in lzma.open()
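
For reference, here is a minimal sketch of the same idea with the encoding passed explicitly. The byte 0xfd in the error message is the first byte of the .xz file header, which suggests the compressed file itself was being decoded as text because it was opened with plain open(filename) in text mode. Letting lzma.open() open the path directly avoids that. The helper name iter_xz_lines and the sample file are made up for illustration, and UTF-8 is an assumption about the data:

```python
import lzma
import os
import tempfile


def iter_xz_lines(path, encoding="utf-8"):
    """Yield decoded lines from an .xz file one at a time (hypothetical helper)."""
    # mode="rt" makes lzma.open() return a text-mode file object.
    # Passing encoding explicitly avoids relying on the platform default.
    with lzma.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            # Lines are decompressed and decoded lazily, so the whole
            # file never has to fit in memory.
            yield line.rstrip("\n")


# Build a small sample archive so the sketch can run on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.xz")
with lzma.open(sample_path, mode="wt", encoding="utf-8") as out:
    out.write("café au lait\nsecond line\n")

lines = list(iter_xz_lines(sample_path))
print(lines)
```

If the data is not UTF-8, pass the matching encoding name instead. Passing errors="replace" to lzma.open() is another option when a few undecodable bytes are acceptable.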

Norway answered 18/3, 2018 at 13:4 Comment(1)

Thx man! I used your code with the encoding parameter and it worked. :) – Cottonwood 18/3, 2018 at 13:10
