Editing the last instance in a file

I have a huge text file (~1.5GB) with numerous lines ending with ".Ends".
I need a Linux one-liner (perl / awk / sed) that finds the last place '.Ends' appears in the file and adds a couple of lines before it.

I tried using tac twice, and stumbled with my perl:

When I use:
tac ../../test | perl -pi -e 'BEGIN {$flag = 1} if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It first prints "someline\n" and only then prints the .Ends line. The result is:
…
.Ends
someline

When I use:
tac ../../test | perl -e 'BEGIN {$flag = 1} print ; if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It doesn’t print anything.

And when I use:
tac ../../test | perl -p -e 'BEGIN {$flag = 1} print $_ ; if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It prints everything twice:
…
.Ends
someline
.Ends

Is there a smooth way to perform this edit?
It doesn't have to follow my approach, I'm not picky...
Bonus: if the inserted lines can come from a different file, that would be great (but really not a must).

Edit
test input file:

gla2 
fla3 
dla4 
rfa5 
.Ends
shu
sha
she
.Ends
res
pes
ges
.Ends  
--->
...
pes
ges
someline
.Ends  
# * some irrelevant junk * #

Bohon answered 19/11, 2022 at 20:45 Comment(10)

You're right. Done. – Bohon 19/11, 2022 at 20:56

will the last line of the file always end with .Ends? – Thromboembolism 19/11, 2022 at 20:58

No. there are various other lines after the last .Ends, but I don't care about these – Bohon 19/11, 2022 at 20:59

while you may not care about them (lines after the last .Ends) it would matter when coming up with a solution, ie, it's easier to always replace the last line – Thromboembolism 19/11, 2022 at 21:6

I'm certain it's easier, but it's not relevant - all the lines after the last .Ends are comments and information, nothing functional, so the insertion must be within the .Ends bound. – Bohon 19/11, 2022 at 21:14

Why do you need an automated function to edit "a file" in one place? Sounds like all you need to do is use a text editor with a search function. – Frigg 19/11, 2022 at 22:56

Regarding it's not relevant - yes, it is. If you don't state in your question that there could be lines after the last .Ends and don't include lines after the last .Ends in your example then someone trying to help you might reasonably create and test a solution that relies on .Ends being the last line and thereby waste their time and, to a much lesser extent, yours. – Pilpul 20/11, 2022 at 0:28

You added some white space to the end of the last .Ends line in your input now - can that really be present or is it a mistake? – Pilpul 20/11, 2022 at 11:43

2 whitespaces, to skip line. theoretically they can also exist in the input (nobody promised it will be ^\.Ends$), but I just wanted to have the added lines, as you requested above. I'll remove them if skip line can be taken without them – Bohon 20/11, 2022 at 11:56

You said you wanted to find lines ending with ".Ends", not lines ending with ".Ends" possibly followed by spaces or other characters. Does this mean the lines might also be foobar.Ends or foo.Ends.bar or other sequences of characters with .Ends in the middle? I don't know what 2 whitespaces, to skip line. and if skip line can be taken without them means. – Pilpul 20/11, 2022 at 11:59

Inputs:

$ cat test.dat
dla4
.Ends
she
.Ends
res
.Ends
abc

$ cat new.dat
newline 111
newline 222

One awk idea that sticks with OP's tac | <process> | tac approach:

$ tac test.dat | awk -v new_dat="new.dat" '1;/\.Ends/ && !(seen++) {system("tac " new_dat)}' | tac
dla4
.Ends
she
.Ends
res
newline 111
newline 222
.Ends
abc

Another awk idea that replaces the dual tac calls with a dual-pass of the input file:

$ awk -v new_dat="new.dat" 'FNR==NR { if ($0 ~ /\.Ends/) lastline=FNR; next} FNR==lastline { system("cat "new_dat) }; 1' test.dat test.dat
dla4
.Ends
she
.Ends
res
newline 111
newline 222
.Ends
abc

NOTES:

both of these solutions write the modified data to stdout (same thing OP's current code does)
neither of these solutions modify the original input file (test.dat)
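
A possible variation on the second command, offered as a sketch rather than a drop-in replacement: it reads new.dat with a getline loop, so awk itself prints the inserted lines instead of relying on how system() output interleaves with awk's own output, and it anchors the pattern to the end of the line. The scratch directory, the out.dat name, and the allowance for trailing blanks after .Ends are assumptions made for this illustration.

```shell
#!/bin/sh
# Demo setup (hypothetical): work in a scratch directory with the sample inputs.
cd "$(mktemp -d)" || exit 1
printf '%s\n' dla4 .Ends she .Ends res .Ends abc > test.dat
printf '%s\n' 'newline 111' 'newline 222' > new.dat

awk -v new_dat="new.dat" '
    # Pass 1: remember the number of the last line that ends in ".Ends"
    # (trailing blanks tolerated; this is an assumption about the input).
    FNR == NR { if ($0 ~ /\.Ends[[:space:]]*$/) lastline = FNR; next }

    # Pass 2: just before that line, copy new_dat through awk itself.
    FNR == lastline {
        while ((getline line < new_dat) > 0) print line
        close(new_dat)
    }

    # Print every line of the second pass unchanged.
    1
' test.dat test.dat > out.dat

cat out.dat
```

Like the commands above, this writes to stdout (redirected to out.dat here) and leaves test.dat untouched; it still reads the input twice.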

Thromboembolism answered 19/11, 2022 at 21:14 Comment(9)

nice! I really liked the definition of seen in the middle, also the call to system from the oneliner is new for me. Will keep the post open for a while longer, to see if anyone can suggest a trick for in-place editing, but your answer is working and is totally legit! Thanks. – Bohon 19/11, 2022 at 21:27

wow, edit is interesting. will try that as well. – Bohon 19/11, 2022 at 21:29

Thanks @Thromboembolism both your answers work nicely. first method (with both tac commands) works slightly faster and does the better job. – Bohon 20/11, 2022 at 10:12

/.Ends/ would match a line that contains FooEndsBar and you can't rely on the output of system("tac " new_dat) appearing where you want it inside the output of the awk command that calls it (not sure exactly why, buffering maybe, but I've seen the called command output come after all of the awk output rather than in the middle of it), you'd need to call the command and use a while getline loop then print it from awk to robustly ensure the output order. – Pilpul 20/11, 2022 at 11:50

I just tried and can't reproduce that system() issue I mentioned using tac in the middle of a large input/output stream so maybe it happens in some other context (pipes in the command?), idk, but I personally still wouldn't trust it. – Pilpul 20/11, 2022 at 12:18

@EdMorton It worked for me on the 1.5GB file. in any case, can look for /^.Ends/ and be certain I get what I need. – Bohon 20/11, 2022 at 14:30

Things that aren't guaranteed to work usually do work until they don't. You can't test something that might not work, find it works in your test(s) and deduce from that that it'll always work. For example an awk loop like for ( i in arr ) print i will usually print i in some specific order but then sometimes it won't. Similarly /^.Ends/ will match what you want but also strings you don't want, e.g. BEnds, so it'll probably do what you want for the data you're testing with but then it'll fail later with different data. – Pilpul 20/11, 2022 at 17:11

In my test, your first solution is 50x slower than zdim's, and your second solution is 2x slower than your first. TLP's is off the chart slow. – Resultant 22/11, 2022 at 19:21

@Resultant that doesn't surprise me .... both answers are reading the entire source file twice; as for zdim's solution, again, doesn't surprise me ... gets to the row quickly (assuming near the end of the file); did you time the ed solution? – Thromboembolism 22/11, 2022 at 20:18

If the last instance of that phrase is far enough down the file it helps performance greatly to process the file from the back, for example using File::ReadBackwards. This approach in fact helps in any case as we need to read only what is strictly necessary (the rest after the last instance of the phrase), and once.

Since you need to add other text to the file before the last marker, we have to copy the rest of it so as to be able to put it back after the addition.

use warnings;
use strict;
use feature 'say';
use Path::Tiny;
use File::ReadBackwards;
    
my $file = shift // die "Usage: $0 file\n"; 

my $bw = File::ReadBackwards->new($file);

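# Lines from the last .Ends marker through the end of the file, in file order
my @rest_after_marker;

# Read backwards until the marker line is found; defined() guards
# against a line that evaluates as false (such as a lone "0")
while ( defined( my $line = $bw->readline ) ) {
    unshift @rest_after_marker, $line;
    last if $line =~ /\.Ends/;
}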
# Position after which to add text and copy back the rest
my $pos = $bw->tell;    
$bw->close;

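open my $fh, '+<', $file or die "Can't open $file: $!";
seek $fh, $pos, 0;
truncate $fh, $pos;
# Write the new text first, then the saved marker line and everything after it
print $fh $_ for path("add.txt")->slurp, @rest_after_marker;
close $fh or die "Can't close $file: $!";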

The new text to add before the last .Ends is assumed to be in a file named add.txt.

The remaining question is how much of the file comes after the last .Ends marker. We copy all of that into memory in order to write it back. If that is too much, copy it to a temporary file instead, write it back from there, and remove that file at the end.
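
The temporary-file variant can be sketched with shell tools instead of File::ReadBackwards (a swapped-in technique, not the Perl code above). It assumes GNU tac and truncate, that every line ends with a newline, and that the text to insert is in add.txt; the scratch directory and sample files are hypothetical and exist only to make the sketch runnable.

```shell
#!/bin/sh
# Sketch only: assumes GNU tac/truncate and that every line ends in "\n".
set -eu
cd "$(mktemp -d)"                        # scratch directory for the demo
printf '%s\n' dla4 .Ends she .Ends res .Ends abc > test.dat   # sample input
printf '%s\n' 'newline 111' 'newline 222' > add.txt           # text to insert

file=test.dat

# Count the bytes from the start of the last ".Ends" line to end of file.
# The file is read backwards, so scanning stops at the first match.
tail_bytes=$(tac "$file" | LC_ALL=C awk '{ n += length($0) + 1 } /\.Ends/ { print n; exit }')
[ -n "$tail_bytes" ] || { echo "no .Ends line found" >&2; exit 1; }

size=$(wc -c < "$file")
pos=$((size - tail_bytes))               # byte offset where the new text goes

tmp=$(mktemp)
tail -c "$tail_bytes" "$file" > "$tmp"   # save the tail in a temporary file
truncate -s "$pos" "$file"               # cut the file at the insertion point
cat add.txt "$tmp" >> "$file"            # append the new text, then the saved tail
rm -f "$tmp"

cat "$file"
```

As with the Perl version, this edits the file in place and rewrites only the part from the last .Ends onward, so keep a backup while trying it out.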

Erysipeloid answered 19/11, 2022 at 22:44 Comment(5)

Note, this edits the input file in place. – Erysipeloid 19/11, 2022 at 23:35

This isn't a one-liner. Code seems valid (and I really prefer in-place editing), but that's not what I asked for... – Bohon 20/11, 2022 at 7:52

@Bohon Well, yeah ... I just removed a note on that, which I had in text, since I consider it a bit irrelevant in general. (Also, people often mention it only to turn out that it doesn't matter -- and some other requirements here are unclear.) This code does exactly what's asked and is about as efficient as possible, and that may matter on a gig-and-a-half file. But feel free to discard if it being a "one"-liner matters (this can of course be shortened and turned into a command-line program but that'd be misplaced in my opinion). I hope it's still of use to others. – Erysipeloid 20/11, 2022 at 8:14

I agree, and voted you up regardless. let it be for the greater good :) – Bohon 20/11, 2022 at 9:4

@user2141046, Re "This isn't a one-liner.", Sure it is. Nothing stops you from putting it on one line. – Resultant 20/11, 2022 at 17:39
