Replace multiple newlines with single newlines during reading file
D

7

22

I have the following code, which reads from multiple files, parses the lines it obtains, and prints the result:

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files'+str(f), 'r') as a:
       pars.append(re.sub('someword=|\,.*|\#.*','',a.read()))

for k in pars:
   print k

But the output contains multiple consecutive newlines:

test1


test2

Instead, I want the following result, without empty lines in the output:

 test1
 test2

and so on.

I tried playing with regexp:

pars.append(re.sub('someword=|\,.*|\#.*|^\n$','',a.read()))

But it doesn't work. I also tried strip(), rstrip(), and replace(), and none of them removed the empty lines.

Dru answered 6/3, 2017 at 15:34 Comment(0)
S
26

You could use a second regex to collapse each run of newlines into a single newline, then call strip() to drop any leading or trailing newline.

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files/'+str(f), 'r') as a:
        word = re.sub(r'someword=|\,.*|\#.*','', a.read())
        word = re.sub(r'\n+', '\n', word).strip()
        pars.append(word)

for k in pars:
   print k
Sivie answered 6/3, 2017 at 15:54 Comment(2)
Could you do this line-wise, not file-wise, e.g. with for line in f:? And can you explain what the re.sub does? The comma and hash are escaped, but I do not understand the someword=. There is no = in the example. - Zeldazelde
Sure, you can do it line-wise, but f is the filename in this case, not the content. re.sub replaces whatever matches the first argument with whatever you put in the second argument. Check the docs and try it out. - Sivie
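
The line-wise variant asked about in the comments could look roughly like the sketch below. It is written for Python 3, and clean_lines is a hypothetical helper name, not something from the answer; the pattern is the one from the question. Note that strip() here also trims leading and trailing spaces on each line, which is an assumption about what is wanted.

```python
import re

# Same pattern as in the question: drops "someword=", anything from a comma on,
# and anything from a hash on. "." does not match a newline by default.
PATTERN = re.compile(r'someword=|,.*|#.*')

def clean_lines(lines):
    """Yield each cleaned line, skipping lines that end up empty."""
    for line in lines:
        cleaned = PATTERN.sub('', line).strip()
        if cleaned:
            yield cleaned

# Hypothetical sample data standing in for the lines of one file.
sample = ["someword=test1, extra\n", "\n", "# only a comment\n", "test2\n"]
print("\n".join(clean_lines(sample)))
```

Because iterating over an open file yields its lines, a file object from open(...) could be passed to clean_lines in place of the sample list.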
S
2

Without changing your code much, one easy way would just be to check if the line is empty before you print it, e.g.:

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files'+str(f), 'r') as a:
        pars.append(re.sub('someword=|\,.*|\#.*','',a.read()))

for k in pars:
    if not k.strip() == "":
        print k

*** EDIT: Since each element in pars is actually the entire content of a file (not just a line), you need to go through and replace any repeated newlines, which is easiest to do with re:

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files'+str(f), 'r') as a:
        pars.append(re.sub('someword=|\,.*|\#.*','',a.read()))

for k in pars:
    k = re.sub(r"\n+", "\n", k)
    if not k.strip() == "":
        print k

Note that this doesn't take care of the case where a file ends with a newline and the next one begins with one. If that's a case you are worried about, you need to either add extra logic to deal with it or change the way you're reading the data in.

Sybille answered 6/3, 2017 at 15:42 Comment(4)
Or just if k.strip(): - Hatter
This should also be done while adding to pars and not when iterating over pars. - Foxworth
Unfortunately it didn't give an appropriate result. With if not k.strip() == "" I still get multiple empty lines. If I display the list without iterating through it, I get: test1[]\n\n\n test2\n test5\ntest7[]\ntest[*]\n etc... - Dru
Oh I see, because you are reading the entire file into each item in pars, it isn't printing line by line. I edited my answer; it just uses a regular expression to replace any repeated \n with a single \n. - Sybille
G
1

Just a simple one, but it may not be efficient.

entire_file = "whatever\nmay\n\n\n\nhappen"

while '\n\n' in entire_file:
    entire_file = entire_file.replace("\n\n", "\n")

print(entire_file)
Gull answered 23/2, 2022 at 1:4 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Woven
This has a bug. It only cuts the \n in half. It replaces every double set, \n\n, with one \n. So if there are 4, \n\n\n\n, there will now be 2, \n\n. - Latimer
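
Whether the loop leaves doubled newlines behind can be checked directly. The small sketch below (Python 3; collapse_with_loop is a hypothetical name wrapping the loop from the answer) suggests that a single replace call does only halve a run, as the comment describes, while the while loop keeps going until no doubled newline remains.

```python
def collapse_with_loop(text):
    """Repeat the replacement until no doubled newline is left."""
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    return text

four = "a\n\n\n\nb"
single_pass = four.replace("\n\n", "\n")  # one call, no loop
looped = collapse_with_loop(four)         # the loop from the answer

print(repr(single_pass))
print(repr(looped))
```

The cost of the loop is that each iteration builds a new copy of the string, which is the efficiency concern the answer itself mentions.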
A
1

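One-liner that collapses any run of three or more line-break characters (\r or \n) into exactly two newlines, leaving a single blank line: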

re.sub(r'[\r\n][\r\n]{2,}', '\n\n', sourceFileContents)
Admit answered 1/1 at 18:53 Comment(0)
L
0

Use a lookahead regular expression, r'\n(?=\n)', to find every newline that is immediately followed by another newline, and replace it with nothing. This finds and replaces all of these cases in one pass:

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files'+str(f), 'r') as a:
       pars.append(re.sub(r'\n(?=\n)','',a.read()))

for k in pars:
   print k

Note: this won't help if the last character of files[0] is '\n' and the first character of files[1] is also '\n'. You can use strip() for this, and print will take care of the single newline between files:

import os
import re

files=[]
pars=[]

for i in os.listdir('path_to_dir_with_files'):
    files.append(i)

for f in files:
    with open('path_to_dir_with_files'+str(f), 'r') as a:
       pars.append(re.sub(r'\n(?=\n)','',a.read().strip()))

for k in pars:
   print k
Lightship answered 18/3, 2022 at 22:7 Comment(0)
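
To see the lookahead on its own, here is a small sketch that works on in-memory strings rather than files (Python 3; the sample strings and the name squeeze are made up for illustration).

```python
import re

def squeeze(text):
    """Strip the ends, then remove every newline directly followed by another."""
    return re.sub(r'\n(?=\n)', '', text.strip())

# Hypothetical file contents: the first ends with a newline and the second
# starts with newlines, which is the boundary case the note describes.
contents = ["test1\n\n\ntest2\n", "\n\ntest3\n\ntest4\n"]
result = "\n".join(squeeze(c) for c in contents)
print(result)
```

Only the last newline of each run survives, because it is the only one not followed by another newline.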
P
0

Using a regex is the most direct solution here (apart from using a loop to iterate over the string):

text = re.sub(r'[\n]+', '\n', text)

Phthalocyanine answered 7/3 at 20:45 Comment(0)
U
-3

Just would like to point out: regexes aren't the best way to handle that. Replacing two consecutive newlines with one in a Python str is quite simple, no need for re:

entire_file = "whatever\nmay\n\nhappen"
entire_file = entire_file.replace("\n\n", "\n")

And voila! Much faster than re and (in my opinion) much easier to read.

Ustulation answered 9/8, 2019 at 18:43 Comment(3)
This won't work if the file contains more than 2 consecutive "\n", like "whatever\nmay\n\n\nhappen". - Kinesthesia
It's true, but it could still be done with a loop: while "\n\n" in text: text = text.replace("\n\n", "\n") - Ustulation
This form of 'elision' is fragile and requires adaptation based on the length of the desired run. E.g. desiring two newlines between "paragraphs" would require three .replace("\n\n\n", "\n\n") calls. Iterative reconstruction means a duplication of the entire string per iteration. Regular expressions can far more easily combine actual measured runs of repeating characters, with explicit control over run length: \n{min,max}, and perform such an operation in a single pass over the string without excessive memory duplication. - Struggle
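
The run-length control mentioned in the last comment can be sketched as follows (Python 3; limit_newlines and its keep parameter are hypothetical names used only for illustration). Runs of more than keep newlines are cut down to exactly keep, so keep=1 removes blank lines and keep=2 leaves one blank line between paragraphs.

```python
import re

def limit_newlines(text, keep=1):
    """Shorten every run of more than `keep` newlines to exactly `keep`."""
    pattern = r'\n{%d,}' % (keep + 1)  # e.g. keep=1 gives \n{2,}
    return re.sub(pattern, '\n' * keep, text)

text = "para1\n\n\n\npara2\npara3"
no_blank_lines = limit_newlines(text, keep=1)
one_blank_line = limit_newlines(text, keep=2)

print(repr(no_blank_lines))
print(repr(one_blank_line))
```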
