Python readline with custom delimiter

Asked 23/8, 2018 at 7:49 Answered 23/8, 2018 at 8:32

Novice here. I am trying to read lines from a file. However, a single line in a .txt file has a \n somewhere in the middle, and when I read that line with .readline, Python cuts it in the middle and outputs it as two lines.

When I copy and paste the line into this window, it shows up as two lines, so I uploaded the file here: https://ufile.io/npt3n
I also added a screenshot of the file as it appears in the txt file.
This is group chat history exported from WhatsApp, if you are wondering.
Please help me read one line completely, as shown in the txt file.

f= open("f.txt",mode='r',encoding='utf8')

for i in range(4):
    lineText=f.readline()
    print(lineText)

f.close()

Constitute answered 23/8, 2018 at 7:49 Comment(6)

How can a line have \n in the middle? \n is the thing that separates each line from the next. – Gaiser 23/8, 2018 at 7:58

@Gaiser not on Windows. It's OS specific. – Alikee 23/8, 2018 at 8:1

Python recognizes \n as an end-of-line marker. However, Windows uses \r\n, so a mere \n does not split the line in e.g. Notepad. Maybe this question might help you with it. – Lamanna 23/8, 2018 at 8:5

I think it's because there is a 'next line character' in the sentence. Maybe the person on the chat entered the text 'Kocaeli 24...' on a new line. But on my system the file shows it on a different line by default in Notepad++. So maybe it's an issue with Notepad. – Ringed 23/8, 2018 at 8:10

Python has "universal newlines support". Basically all of \n, \r and \r\n are considered a newline. If you open the file in text mode, Python will convert those 3 line endings into just \n. If you have to interpret the text differently, you want to open the file in binary mode and handle lines by hand. – Violante 23/8, 2018 at 8:16

Related: #47927539 – Celibate 27/9, 2021 at 21:42

Python 3 allows you to define what is the newline for a particular file. It is seldom used, because the default universal newlines mode is very tolerant:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.

So here you should make explicit that only '\r\n' is an end of line:

with open("f.txt", mode='r', encoding='utf8', newline='\r\n') as f:
    # use enumerate to show that the second line is read as a whole
    for i, line in enumerate(f):
        print(i, line)
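
To see the difference without the original file, here is a small self-contained sketch. The sample bytes are made up for illustration (three records, the second one containing a bare "\n"); only the `newline` argument of `open` comes from the answer above.

```python
import os
import tempfile

# Hypothetical sample data: three records, the second has a bare "\n" inside it.
raw = b"first\r\nsecond, part a\nsecond, part b\r\nthird\r\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default (newline=None): universal newlines, so the bare "\n" splits the record.
    with open(path, mode="r", encoding="utf8") as f:
        default_lines = f.readlines()

    # newline="\r\n": only "\r\n" ends a line, and endings are returned untranslated.
    with open(path, mode="r", encoding="utf8", newline="\r\n") as f:
        crlf_lines = f.readlines()
finally:
    os.remove(path)

print(len(default_lines), len(crlf_lines))
print(repr(crlf_lines[1]))
```

Note that with an explicit `newline` value each returned line still ends in "\r\n", so you may want to strip it before printing.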

Divan answered 23/8, 2018 at 8:32 Comment(4)

It can be one of None, '', '\n', '\r', and '\r\n'. I tried giving it "/>\n" for an xml file I had and it gave me a ValueError: illegal newline value. The file is too big to fit in memory by doing a full read, so I can't do that and split. And of the millions of rows I have, one is bound to have a bad "\n" instead of the proper "\\n" as the data has a free text field which is escaped by double quotes. Worst case scenario that line, and a poor neighbour get corrupted as I use regex and drop the line if corrupt. My concern is that poor neighbouring line. – Tuantuareg 27/9, 2018 at 7:9

@devssh: It is a different question. I would read lines with the standard newline value (None) and concatenate them if a line does not end with "/>". But anyway, using regexes for XML is generally a poor solution. BTW, xml.sax can be used to process an XML file without loading everything in memory... – Divan 27/9, 2018 at 7:43

Aah, such a good idiom it would have been for storing multiple JSONs in single file for stream-like parsing! Universe is sadistic. – Indisposition 24/10, 2019 at 9:58

It should be noted that the only allowed values for the newline argument are None, '', '\n', '\r', and '\r\n'. This may bite you when lines in your file are separated by one of the various Unicode newline characters. – Potence 5/3, 2021 at 13:37
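
Since `newline` rejects arbitrary strings, one possible workaround for the cases raised in these comments is to read the file in chunks and split on the delimiter yourself. The following is only a sketch: `iter_records` and `chunk_size` are hypothetical names, not part of the standard library, and the delimiter is removed from each yielded record.

```python
import io


def iter_records(stream, delimiter, chunk_size=8192):
    """Yield pieces of a text stream separated by `delimiter`, reading in chunks."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the tail may be partial,
        # so it stays in the buffer until more data arrives.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    if buffer:
        yield buffer


# Tiny chunk size to show that a delimiter straddling two chunks is still found.
sample = io.StringIO("a/>\nb\nc/>\nd")
records = list(iter_records(sample, "/>\n", chunk_size=2))
print(records)
```

When using this with a real file, opening it with `newline=""` keeps line endings untranslated, so a delimiter containing "\r\n" is matched as it appears on disk. Only one chunk plus one unfinished record is held in memory at a time.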

Instead of using the readline function, you can read the whole content and split it into lines with a regex:

import re

with open("f.txt", "r", encoding="utf8") as f:
    content = f.read()
    # remove end-of-line characters
    content = content.replace("\n", "")
    # split on the leading [...] prefix; the capturing group keeps each prefix
    lines = re.compile(r"(\[[0-9/, :\]]+)").split(content)
    # drop empty strings
    lines = [x for x in lines if x != ""]
# join each prefix with the text that follows it
lines = [i + j for i, j in zip(lines[::2], lines[1::2])]

If every entry starts with the same [...] prefix, you can split on it, then drop the "" elements from the result. Finally, you can rejoin each prefix with the text that follows it using the zip function (https://mcmap.net/q/239722/-joining-pairs-of-elements-of-a-list-duplicate)

Picofarad answered 23/8, 2018 at 8:11 Comment(1)

As stated in a comment (after your answer), the file is too big to fit in memory. – Espouse 12/7, 2020 at 17:14
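
If memory is the concern, the same idea can be applied line by line instead of on the whole content. This is a rough sketch, assuming every message starts with a bracketed timestamp such as "[23/8, 2018 7:49]" (the actual export format is not shown in the question); `RECORD_START` and `iter_messages` are hypothetical names.

```python
import io
import re

# Assumed message prefix, e.g. "[23/8, 2018 7:49]"; adjust to the real export format.
RECORD_START = re.compile(r"\[[0-9/, :]+\]")


def iter_messages(lines):
    """Group physical lines into messages; a new message starts at a timestamp."""
    current = None
    for line in lines:
        text = line.rstrip("\r\n")
        if current is None or RECORD_START.match(text):
            if current is not None:
                yield current
            current = text
        else:
            # Continuation of the previous message; like the answer above,
            # the embedded newline is simply dropped.
            current += text
    if current is not None:
        yield current


sample = io.StringIO(
    "[23/8, 2018 7:49] hello\n"
    "[23/8, 2018 7:50] split \n"
    "message\n"
    "[23/8, 2018 7:51] bye\n"
)
messages = list(iter_messages(sample))
print(messages)
```

An open file object can be passed in place of `sample`, since iterating over it yields one physical line at a time and only the current message is kept in memory.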
