Replacing a multi character pattern that includes a newline with some characters

Asked 26/8, 2024 at 22:30 Answered 27/8, 2024 at 2:22

I have a file in which some lines are continued on the following lines, each continuation starting with "+ ". I need to unwrap these continuations onto the line they belong to.

Example:

X123
+ a b c
+ d e f g
Y4567
+ a1 b2
+ c1 d2
+ e1 f2

Expected:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2

I tried: perl -00pe 's/\n\+ / /g'

But it failed with:

Substitution loop at -e line 1, <> chunk 1.
11.715u 18.455s 0:33.14 91.0%   0+0k 13426056+0io 155pf+0w

Jeffers answered 26/8, 2024 at 22:30 Comment(6)

The command works as expected when run properly. E.g. perl -00pe 's/\n\+ / /g' file.txt. You must be doing something else. – Ambidexter 26/8, 2024 at 22:45

@TLP. Could it be a memory issue then? Does perl read the entire file into a memory buffer and only then start parsing? The file has 85M lines and is 7.6GB in size. Is there a way to start perl with more RAM? – Jeffers 26/8, 2024 at 22:56

That would change the question, yes. But -00 is paragraph mode, meaning readline splits records on blank lines (two or more consecutive newlines). – Ambidexter 26/8, 2024 at 23:4

I am actually surprised the code works with paragraph mode. It would not work if a line starting with + came directly after a blank line. Perhaps it is a coincidence. – Ambidexter 26/8, 2024 at 23:5

If you load the entire file into memory it might become an issue. You could do a line-by-line read with logic instead of a regex solution. – Ambidexter 26/8, 2024 at 23:13

Please don't change your question after it has answers. If you have a new question, post it in a new post, not in this one. – Consensus 29/8, 2024 at 18:36

You operated on a string that is more than 2³¹ chars in length, which is longer than the regex engine can handle. To handle strings that long, upgrade to Perl 5.22 or higher.

perl5220delta:

s///g now works on very long strings (where there are more than 2 billion iterations) instead of dying with 'Substitution loop'. [GH #11742]. [GH #14190].

Alternatively, you could mess with the line terminator.

perl -pe'BEGIN { $/ = "\n+ " } s/\n\+ \z/ /'

Depending on how many lines start with the sequence, this risks not fixing the problem: if the sequence is rare, a single record can still be longer than the regex engine can handle. So you could use a solution which doesn't read more than one line at a time.

perl -ne'
   chomp;
   print "\n" if !s/^\+ / / && $. != 1;
   print;
   END { print "\n"; }
'

Same idea, but shortened at the cost of readability:

perl -pe'print $l if !s/^\+ / /; $l = chop; END { print $l }'

Lumbering answered 26/8, 2024 at 23:7 Comment(4)

I said alternatively, but you should upgrade anyway seeing as you're using a version of Perl that's at least a decade old. – Lumbering 26/8, 2024 at 23:19

perl -pe '$a=chomp; print $/ if !s/^\+ / / && $.>1; END{print $/ if $a}' ? – Anett 27/8, 2024 at 21:7

@jhnc, The && $.>1 can be removed by switching to chop: perl -pe'print $l if !s/^\+ / /; $l = chop; END { print $l }'. Even though chop is used, this handles a missing terminating LF! – Lumbering 27/8, 2024 at 21:18

In an update, the OP mentions they are searching for \n + (newline, space, plus), in contradiction to their existing code and examples. This answer searches for \n+ followed by a space, as their code does. – Lumbering 28/8, 2024 at 13:43

If you want a line-by-line version, you could change the input record separator to \n+ and then remove that with chomp. It would in effect just delete those characters from the file with a normal -p one-liner. I.e.:

$ perl -pe'BEGIN{$/="\n+"}; chomp;' file.txt

The process is that it reads a "line" that ends with newline and a plus and puts that in $_, then chomp removes that ending, and the line is printed.

Ambidexter answered 26/8, 2024 at 23:33 Comment(17)

Not quite the same since the OP looked for ␊+␠, but probably good enough. /// Also, depending on how many lines start with the sequence, this risks not fixing the problem. It probably does, though – Lumbering 27/8, 2024 at 13:27

@Lumbering I don't know what ␊+␠ stands for, and I'm not sure what you mean by the number of lines being an issue. This will read an infinite number of lines. – Ambidexter 27/8, 2024 at 17:9

Re "I don't know what ␊+␠ stands for", It's a LF, followed by a PLUS SIGN, followed by a SPACE. – Lumbering 27/8, 2024 at 19:26

The problem is not with a large number of strings being read in; it's with a small number of them. The problem the OP is facing is that they are reading a string that is over 2 GiB in size. That's still possible with your solution (e.g. if the matching pattern is only found near the top and/or end of the file). – Lumbering 27/8, 2024 at 19:27

@Ambidexter I had to scale up my terminal font to around 50pt before I could read it :-) – Anett 27/8, 2024 at 21:19

@Lumbering I'm sorry, but that does not clarify what you mean. OP said "85Mlines and 7.6GB size." Are you just trying to cast doubt here? I see you made the same comment on all other answers except your own. – Ambidexter 27/8, 2024 at 21:53

@Lumbering You really shouldn't do that. Your answer is clearly the best already, I was just trying to add something. – Ambidexter 27/8, 2024 at 21:54

Re "I see you made the same comment on all other answers except your own", Cause mine does what the OP asked. /// Re "You really shouldn't do that.", What? Point out that your answer doesn't actually do what the OP asked? I disagree. – Lumbering 27/8, 2024 at 22:52

The perl line worked: cat xtors | perl -pe'BEGIN{$/="\n+"}; chomp;' | head – Jeffers 28/8, 2024 at 0:38

@Lumbering You have claimed that all the other answers except yours don't work, all the while failing to describe how they don't work. You're just being vague and hinting at problems that you cannot seem to describe. – Ambidexter 28/8, 2024 at 13:17

@GertGottschalk That's great. Know that you don't need to use cat, that will just use up more system resources. Add the filename as argument to perl, e.g. perl '....' xtors | head. – Ambidexter 28/8, 2024 at 13:19

Re "all the while failing to describe how they don't work", I did describe how they deviate from the OP. The OP looked for ␊+␠ (LF + PLUS SYMBOL + SPACE). Yours looks for ␊+ (LF + PLUS SYMBOL) instead. – Lumbering 28/8, 2024 at 13:19

@Lumbering Yeah, well those squiggles might be characters to you, but I don't know what they are. Do you mean that there is no space after the plus? If they are essential to the pattern, I'm sure the OP is smart enough to put them in. – Ambidexter 28/8, 2024 at 13:22

@Lumbering Also, as per the update, he is actually looking for: replace char sequence '\n +' into ' ', not '\n+ '. – Ambidexter 28/8, 2024 at 13:25

@TLP, I had not seen that. My comment was made before that update. Also, that edit is obviously wrong. They're clearly not looking for \n +. Even your own answer doesn't search for that. Maybe you're arguing that what the OP wants is ambiguous. If so, that is also worth mentioning. I will therefore add a comment on my own answer. – Lumbering 28/8, 2024 at 13:45

@Lumbering I think we've explored the solutions enough that he can continue on his own. – Ambidexter 28/8, 2024 at 13:48

I'm just answering your questions :) – Lumbering 28/8, 2024 at 14:4

Given your input example, here is an awk:

awk '/^[^+]/{if (s) print s; s=$0; next} 
            {sub(/^\+/,""); s=s $0} 
     END{print s}' file

Or another awk:

awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf} END{print ""}' file

Or a Ruby:

ruby -ne 'chomp
puts if !$_.sub!(/^\+\s*/," ") && $. > 1
print $_ + ($<.eof? ? "\n" : "")' file

Any of those prints:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2

Salicylate answered 27/8, 2024 at 0:13 Comment(4)

Not quite the same since the OP looked for ␊+␠, but probably good enough – Lumbering 27/8, 2024 at 13:27

I am counting on that space to be the field separator once the + is gone and the line is appended to the previous line... – Salicylate 27/8, 2024 at 14:15

Tried the 'awk' code without success. Command: cat xtors | awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf} END{print ""}' | head. Error: awk: cmd. line:1: (FILENAME=- FNR=1) fatal: printf: no arguments – Jeffers 28/8, 2024 at 0:35

So older awk I think. Just modify to {printf "%s", $0} and you should be g2g – Salicylate 28/8, 2024 at 0:50
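Spelled out, the fix from the comment above would look roughly like this, run here on a cut-down sample (the `out` variable is only for illustration). Giving printf an explicit "%s" format and passing $0 as its argument is the usual portable form.

```shell
# Same logic as the second awk, with printf given a format and an argument.
out=$(printf 'X123\n+ a b c\n+ d e f g\nY4567\n+ a1 b2\n' |
  awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf "%s", $0} END{print ""}')
printf '%s\n' "$out"
```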

Borrowing @TLP's solution, with gawk which allows RS to be a regex (standard awk doesn't):

gawk 1 RS='\n[+]' ORS= file

As @ikegami notes, this may not do the right thing if you have input like:

X123
+ a b c
+d e f g

that should become

X123 a b c
+d e f g

Anett answered 27/8, 2024 at 2:22 Comment(5)

pretty much only gawk -c and gawk -P refuse to deal with RS being multi-byte or regex nowadays. Other than worthless awk variants on Solaris, I couldn't think of any awks in widespread use that can't handle that RS regex. – Beaulahbeaulieu 27/8, 2024 at 4:22

@Ambidexter oops, apparently I completely rewrote my original perl version, and didn't test it properly (I did test the original with + in middle of line) – Anett 27/8, 2024 at 8:37

@Ambidexter hmm, looks like I broke the first awk too - this answer is getting short... – Anett 27/8, 2024 at 8:42

Sometimes shorter is better. ;) I liked the gawk best tbh. – Ambidexter 27/8, 2024 at 9:21

Bravo! This works on BSD / MacOS awk also. – Salicylate 27/8, 2024 at 11:48
