Replacing a multi character pattern that includes a newline with some characters
J

4

10

I have a file in which long lines are wrapped: each continuation line starts with "+ ". I need to unwrap them.

Example:

X123
+ a b c
+ d e f g
Y4567
+ a1 b2
+ c1 d2
+ e1 f2

Expected:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2

I tried: perl -00pe 's/\n\+ / /g'

But it failed with:

Substitution loop at -e line 1, <> chunk 1.
11.715u 18.455s 0:33.14 91.0%   0+0k 13426056+0io 155pf+0w
Jeffers answered 26/8, 2024 at 22:30 Comment(6)
The command works as expected when run properly. E.g. perl -00pe 's/\n\+ / /g' file.txt. You must be doing something else. - Ambidexter
@TLP. Could it be a memory issue then? Does perl read the entire file into a memory buffer and only then start parsing? The file has 85M lines and is 7.6 GB in size. Is there a way to start perl with more RAM? - Jeffers
That would change the question, yes. But -00 is paragraph mode, meaning readline treats blank lines (two or more consecutive newlines) as the record separator. - Ambidexter
I am actually surprised the code works with paragraph mode. It should not work if the line after a blank line starts with +. Perhaps it is a coincidence. - Ambidexter
If you load the entire file into memory it might become an issue. You could do a line-by-line read with logic instead of a regex solution. - Ambidexter
Please don't change your question after it has answers. If you have a new question, post it in a new post, not in this one. - Consensus
L
11

You operated on a string that is more than 2^31 chars in length, which is longer than the regex engine can handle. To handle strings that long, upgrade to Perl 5.22 or higher.

perl5220delta:

s///g now works on very long strings (where there are more than 2 billion iterations) instead of dying with 'Substitution loop'. [GH #11742]. [GH #14190].
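
To check whether the installed perl already has that fix, something like the following should work. It is only a sketch: status is a variable name made up for this example, and it relies on $] holding the interpreter version as a number (5.022 or more for Perl 5.22 and later).

```shell
# $] is the running interpreter's version as a number, e.g. 5.036000 for Perl 5.36.0.
if perl -e 'exit($] >= 5.022 ? 0 : 1)'; then
  status="has the fix"
else
  status="needs an upgrade"
fi

# Report the version together with the verdict.
echo "perl $(perl -e 'print $]'): $status"
```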

Alternatively, you could change the input record separator ($/) so that each record ends at the sequence you want to replace.

perl -pe'BEGIN { $/ = "\n+ " } s/\n\+ \z/ /'

If only a few lines start with the sequence, a single record can still be longer than the regex engine can handle, so this risks not fixing the problem. So you could use a solution which doesn't read more than one line at a time.

perl -ne'
   chomp;
   print "\n" if !s/^\+ / / && $. != 1;
   print;
   END { print "\n"; }
'

Same idea, but shortened at the cost of readability:

perl -pe'print $l if !s/^\+ / /; $l = chop; END { print $l }'
Lumbering answered 26/8, 2024 at 23:7 Comment(4)
I said alternatively, but you should upgrade anyway seeing as you're using a version of Perl that's at least a decade old. - Lumbering
perl -pe '$a=chomp; print $/ if !s/^\+ / / && $.>1; END{print $/ if $a}' ? - Anett
@jhnc, The && $.>1 can be removed by switching to chop: perl -pe'print $l if !s/^\+ / /; $l = chop; END { print $l }'. Even though chop is used, this handles a missing terminating LF! - Lumbering
In an update, the OP mentions they are searching for \n + in contradiction to their existing code and examples. This answer searches for \n+ as they do. - Lumbering
A
6

If you want a line-by-line version, you could change the input record separator to \n+ and then remove that with chomp. It would in effect just delete those characters from the file with a normal -p one-liner. I.e.:

$ perl -pe'BEGIN{$/="\n+"}; chomp;' file.txt

The process is that it reads a "line" that ends with newline and a plus and puts that in $_, then chomp removes that ending, and the line is printed.

Ambidexter answered 26/8, 2024 at 23:33 Comment(17)
Not quite the same since the OP looked for ␊+␠, but probably good enough. /// Also, depending on how many lines start with the sequence, this risks not fixing the problem. It probably does, though. - Lumbering
@Lumbering I don't know what ␊+␠ stands for, and I'm not sure what you mean by number of lines being an issue. This will read an infinite number of lines. - Ambidexter
Re "I don't know what ␊+␠ stands for", It's a LF, followed by a PLUS SIGN, followed by a SPACE. - Lumbering
The problem is not with a larger number of strings read in; it's with a small number of them. The problem the OP is facing is that they are reading a string that is over 2 GiB in size. That's still possible with your solution (e.g. if the matching pattern is only found near the top and/or end of the file). - Lumbering
@Ambidexter I had to scale up my terminal font to around 50pt before I could read it :-) - Anett
@Lumbering I'm sorry, but that does not clarify what you mean. OP said "85Mlines and 7.6GB size." Are you just trying to cast doubt here? I see you made the same comment on all other answers except your own. - Ambidexter
@Lumbering You really shouldn't do that. Your answer is clearly the best already, I was just trying to add something. - Ambidexter
Re "I see you made the same comment on all other answers except your own", Cause mine does what the OP asked. /// Re "You really shouldn't do that.", What? Point out that your answer doesn't actually do what the OP asked? I disagree. - Lumbering
The perl line worked: cat xtors | perl -pe'BEGIN{$/="\n+"}; chomp;' | head - Jeffers
@Lumbering You have claimed that all the other answers except yours don't work, all the while failing to describe how they don't work. You're just being vague and hinting at problems that you cannot seem to describe. - Ambidexter
@GertGottschalk That's great. Know that you don't need to use cat, that will just use up more system resources. Add the filename as argument to perl, e.g. perl '....' xtors | head. - Ambidexter
Re "all the while failing to describe how they don't work", I did describe how they deviate from the OP. The OP looked for ␊+␠ (LF + PLUS SYMBOL + SPACE). Yours looks for ␊+ (LF + PLUS SYMBOL) instead. - Lumbering
@Lumbering Yeah, well those squiggles might be characters to you, but I don't know what they are. Do you mean that there is no space after the plus? If they are essential to the pattern, I'm sure the OP is smart enough to put them in. - Ambidexter
@Lumbering Also, as per the update, he is actually looking for: replace char sequence '\n +' into ' ', not '\n+ '. - Ambidexter
@TLP, I had not seen that. My comment was made before that update. Also, that edit is obviously wrong. They're clearly not looking for \n +. Even your own answer doesn't search for that. Maybe you're arguing that what the OP wants is ambiguous. If so, that is also worth mentioning. I will therefore add a comment on my own answer. - Lumbering
@Lumbering I think we've explored the solutions enough that he can continue on his own. - Ambidexter
I'm just answering your questions :) - Lumbering
S
5

Given your input example, here is an awk:

awk '/^[^+]/{if (s) print s; s=$0; next} 
            {sub(/^\+/,""); s=s $0} 
     END{print s}' file

Or another awk:

awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf "%s", $0} END{print ""}' file

Or a Ruby:

ruby -ne 'chomp
puts if !$_.sub!(/^\+\s*/," ") && $. > 1
print $_ + ($<.eof? ? "\n" : "")' file

Any of those prints:

X123 a b c d e f g
Y4567 a1 b2 c1 d2 e1 f2
Salicylate answered 27/8, 2024 at 0:13 Comment(4)
Not quite the same since the OP looked for ␊+␠, but probably good enough. - Lumbering
I am counting on that space to be the field separator once the + is gone and the line is appended to the previous line... - Salicylate
Tried the 'awk' code to no success. cat xtors | awk 'sub(/^\+/,"")==0 && FNR>1 {print ""} {printf} END{print ""}' | head awk: cmd. line:1: (FILENAME=- FNR=1) fatal: printf: no arguments - Jeffers
So older awk I think. Just modify to {printf "%s", $0} and you should be g2g. - Salicylate
A
4

Borrowing @TLP's solution, with gawk which allows RS to be a regex (standard awk doesn't):

gawk 1 RS='\n[+]' ORS= file

As @ikegami notes, this may not do the right thing if you have input like:

X123
+ a b c
+d e f g

that should become

X123 a b c
+d e f g
Anett answered 27/8, 2024 at 2:22 Comment(5)
Pretty much only gawk -c and gawk -P refuse to deal with RS being multi-byte or regex nowadays. Other than worthless awk variants on Solaris, I couldn't think of any awks in widespread use that can't handle that RS regex. - Beaulahbeaulieu
@Ambidexter oops, apparently I completely rewrote my original perl version, and didn't test it properly (I did test the original with + in middle of line) - Anett
@Ambidexter hmm, looks like I broke the first awk too - this answer is getting short... - Anett
Sometimes shorter is better. ;) I liked the gawk best tbh. - Ambidexter
Bravo! This works on BSD / MacOS awk also. - Salicylate
