Editing the last instance in a file

I have a huge text file (~1.5GB) with numerous lines ending with ".Ends".
I need a Linux one-liner (perl / awk / sed) to find the last place '.Ends' appears in the file and add a couple of lines before it.

I tried using tac twice, and stumbled with my perl:

When I use:
tac ../../test | perl -pi -e 'BEGIN {$flag = 1} if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It first prints the "someline\n" and only then prints the .Ends. The result is:

.Ends
someline

When I use:
tac ../../test | perl -e 'BEGIN {$flag = 1} print ; if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It doesn’t print anything.

And when I use:
tac ../../test | perl -p -e 'BEGIN {$flag = 1} print $_ ; if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac
It prints everything twice:

.Ends
someline
.Ends

Is there a smooth way to perform this edit?
It doesn't have to follow my approach; I'm not picky.
Bonus: if the lines to insert can come from a different file, that would be great (but it's really not a must).

Edit
test input file:

gla2 
fla3 
dla4 
rfa5 
.Ends
shu
sha
she
.Ends
res
pes
ges
.Ends  
--->
...
pes
ges
someline
.Ends  
# * some irrelevant junk * #
Bohon answered 19/11, 2022 at 20:45 Comment(10)
You're right. Done.Bohon
will the last line of the file always end with .Ends?Thromboembolism
No. there are various other lines after the last .Ends, but I don't care about theseBohon
while you may not care about them (lines after the last .Ends) it would matter when coming up with a solution, ie, it's easier to always replace the last lineThromboembolism
I'm certain it's easier, but it's not relevant - all the lines after the last .Ends are comments and information, nothing functional, so the insertion must be within the .Ends bound.Bohon
Why do you need an automated function to edit "a file" in one place? Sounds like all you need to do is use a text editor with a search function.Frigg
Regarding it's not relevant - yes, it is. If you don't state in your question that there could be lines after the last .Ends and don't include lines after the last .Ends in your example then someone trying to help you might reasonably create and test a solution that relies on .Ends being the last line and thereby waste their time and, to a much lesser extent, yours.Pilpul
You added some white space to the end of the last .Ends line in your input now - can that really be present or is it a mistake?Pilpul
2 whitespaces, to skip line. theoretically they can also exist in the input (nobody promised it will be ^\.Ends$), but I just wanted to have the added lines, as you requested above. I'll remove them if skip line can be taken without themBohon
You said you wanted to find lines ending with ".Ends", not lines ending with ".Ends" possibly followed by spaces or other characters. Does this mean the lines might also be foobar.Ends or foo.Ends.bar or other sequences of characters with .Ends in the middle? I don't know what 2 whitespaces, to skip line. and if skip line can be taken without them means.Pilpul
T
4

Inputs:

$ cat test.dat
dla4
.Ends
she
.Ends
res
.Ends
abc

$ cat new.dat
newline 111
newline 222

One awk idea that sticks with OP's tac | <process> | tac approach:

$ tac test.dat | awk -v new_dat="new.dat" '1;/\.Ends/ && !(seen++) {system("tac " new_dat)}' | tac
dla4
.Ends
she
.Ends
res
newline 111
newline 222
.Ends
abc

Another awk idea that replaces the dual tac calls with a dual-pass of the input file:

$ awk -v new_dat="new.dat" 'FNR==NR { if ($0 ~ /\.Ends/) lastline=FNR; next} FNR==lastline { system("cat "new_dat) }; 1' test.dat test.dat
dla4
.Ends
she
.Ends
res
newline 111
newline 222
.Ends
abc

NOTES:

  • both of these solutions write the modified data to stdout (the same thing OP's current code does)
  • neither of these solutions modifies the original input file (test.dat)
Thromboembolism answered 19/11, 2022 at 21:14 Comment(9)
nice! I really liked the definition of seen in the middle, also the call to system from the oneliner is new for me. Will keep the post open for a while longer, to see if anyone can suggest a trick for in-place editing, but your answer is working and is totally legit! Thanks.Bohon
wow, edit is interesting. will try that as well.Bohon
Thanks @Thromboembolism both your answers work nicely. first method (with both tac commands) works slightly faster and does the better job.Bohon
/.Ends/ would match a line that contains FooEndsBar and you can't rely on the output of system("tac " new_dat) appearing where you want it inside the output of the awk command that calls it (not sure exactly why, buffering maybe, but I've seen the called command output come after all of the awk output rather than in the middle of it), you'd need to call the command and use a while getline loop then print it from awk to robustly ensure the output order.Pilpul
I just tried and can't reproduce that system() issue I mentioned using tac in the middle of a large input/output stream so maybe it happens in some other context (pipes in the command?), idk, but I personally still wouldn't trust it.Pilpul
@EdMorton It worked for me on the 1.5GB file. in any case, can look for /^.Ends/ and be certain I get what I need.Bohon
Things that aren't guaranteed to work usually do work until they don't. You can't test something that might not work, find it works in your test(s) and deduce from that that it'll always work. For example an awk loop like for ( i in arr ) print i will usually print i in some specific order but then sometimes it won't. Similarly /^.Ends/ will match what you want but also strings you don't want, e.g. BEnds, so it'll probably do what you want for the data you're testing with but then it'll fail later with different data.Pilpul
In my test, your first solution is 50x slower than zdim's, and your second solution is 2x slower than your first. TLP's is off the chart slow.Resultant
@Resultant that doesn't surprise me .... both answers are reading the entire source file twice; as for zdim's solution, again, doesn't surprise me ... gets to the row quickly (assuming near the end of the file); did you time the ed solution?Thromboembolism
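The getline suggestion from the comments could look roughly like the sketch below. It keeps the two-pass idea from the answer, but awk reads new.dat itself instead of calling system(), so the inserted lines are written through awk's own output stream. The stricter marker regex, the sample data, and the temporary directory are assumptions made for illustration, not part of the answer.

```shell
# Sketch: two-pass awk that inserts new.dat before the last .Ends line,
# reading the new lines with getline instead of system().
dir=$(mktemp -d)
printf '%s\n' dla4 .Ends she .Ends res .Ends abc > "$dir/test.dat"
printf '%s\n' 'newline 111' 'newline 222' > "$dir/new.dat"

result=$(awk -v new_dat="$dir/new.dat" '
    # Pass 1: remember the line number of the last marker line
    FNR == NR { if ($0 ~ /^\.Ends[ \t]*$/) last = FNR; next }
    # Pass 2: just before printing that line, copy new_dat to the output
    FNR == last {
        while ((getline line < new_dat) > 0) print line
        close(new_dat)
    }
    1
' "$dir/test.dat" "$dir/test.dat")

printf '%s\n' "$result"
rm -rf "$dir"
```

Like the original two-pass command, this reads the input twice and writes to stdout; it does not modify test.dat.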
E
6

If the last instance of that phrase is far enough down the file, it helps performance greatly to process the file from the back, for example using File::ReadBackwards. This approach in fact helps in any case, as we need to read only what is strictly necessary (the part of the file from the last instance of the phrase onward), and only once.

Since you need to add other text to the file before the last marker, we have to copy the rest of the file (the marker line and everything after it) so as to be able to put it back after the addition.

use warnings;
use strict;
use feature 'say';
use Path::Tiny;
use File::ReadBackwards;

my $file = shift // die "Usage: $0 file\n";

my $bw = File::ReadBackwards->new($file)
    or die "Can't read $file: $!";

# Read lines from the end, keeping them in file order, up to the last marker
my @rest_after_marker;
my $found;

while ( defined( my $line = $bw->readline ) ) {
    unshift @rest_after_marker, $line;
    if ( $line =~ /\.Ends/ ) { $found = 1; last }
}
die "No .Ends marker found in $file\n" if not $found;

# Position after which to add text and copy back the rest
my $pos = $bw->tell;
$bw->close;

open my $fh, '+<', $file or die $!;
seek $fh, $pos, 0;
truncate $fh, $pos;
print $fh $_ for path("add.txt")->slurp, @rest_after_marker;
close $fh or die $!;

New text to add before the last .Ends is presumably in a file add.txt.

The remaining question is how much of the file follows the last .Ends marker. We copy all of that into memory in order to write it back. If that is too much, copy it to a temporary file instead, write it back from there, and remove the temporary file at the end.
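The temporary-file variant can be sketched in shell along the same lines: read from the back, save the tail, truncate, then append. This is a rough sketch, not a translation of the Perl above. It assumes GNU tac and truncate, a file that ends with a newline, and a hypothetical function name insert_before_last_marker; the sample files are made up for the demonstration.

```shell
# Sketch: insert the contents of $2 before the last line matching \.Ends
# in $1, editing $1 in place. Assumes GNU tac/truncate and that $1 ends
# with a newline.
insert_before_last_marker() {
    file=$1 add=$2
    # tac reads a regular file from the end and grep -m1 stops at the first
    # hit, so n is the number of lines from the last marker to end of file.
    n=$(tac "$file" | grep -n -m1 '\.Ends' | cut -d: -f1)
    [ -n "$n" ] || { echo "no .Ends marker in $file" >&2; return 1; }

    rest=$(mktemp) || return 1
    tail -n "$n" "$file" > "$rest"       # marker line and everything after it
    keep=$(( $(wc -c < "$file") - $(wc -c < "$rest") ))
    truncate -s "$keep" "$file"          # cut the file just before the marker
    cat "$add" "$rest" >> "$file"        # new text, then the saved tail
    rm -f "$rest"
}

# Demonstration on small sample files
dir=$(mktemp -d)
printf '%s\n' dla4 .Ends she .Ends res .Ends abc > "$dir/test.dat"
printf '%s\n' 'newline 111' 'newline 222' > "$dir/add.txt"
insert_before_last_marker "$dir/test.dat" "$dir/add.txt"
result=$(cat "$dir/test.dat")
printf '%s\n' "$result"
rm -rf "$dir"
```

Only the part of the file from the last marker onward is read and rewritten, which is what makes the approach attractive when the marker sits near the end of a very large file.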

Erysipeloid answered 19/11, 2022 at 22:44 Comment(5)
Note, this edits the input file in place.Erysipeloid
This isn't a one-liner. Code seems valid (and I really prefer in-place editing), but that's not what I asked for...Bohon
@Bohon Well, yeah ... I just removed a note on that, which I had in text, since I consider it a bit irrelevant in general. (Also, people often mention it only for it to turn out that it doesn't matter -- and some other requirements here are unclear.) This code does exactly what's asked and is about as efficient as possible, and that may matter on gig-and-a-half files. But feel free to discard if it being a "one"-liner matters (this can of course be shortened and turned into a command-line program but that'd be misplaced in my opinion). I hope it's still of use to others.Erysipeloid
I agree, and voted you up regardless. let it be for the greater good :)Bohon
@user2141046, Re "This isn't a one-liner.", Sure it is. Nothing stops you from putting it on one line.Resultant
L
4

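Using GNU sed, -i.bak edits the file in place and keeps a backup of the original with a .bak extension. Note that -z makes sed treat the whole file as a single record (assuming it contains no NUL bytes), so the entire file is held in memory: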

$ sed -Ezi.bak 's/(.*)(\.Ends)/\1newline\nnewline\n\2/' input_file
$ cat input_file
gla2
fla3
dla4
rfa5
.Ends
shu
sha
she
.Ends
res
pes
ges
.Ends
--->
...
pes
ges
someline
newline
newline
.Ends
Lunulate answered 19/11, 2022 at 21:45 Comment(2)
I got to give it a try - this solution will probably work fine for a small file, but for a file the size I'm dealing with, I suspect problems...Bohon
yeah, as I thought - it couldn't handle the larger fileBohon
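For a file this large, a streaming alternative avoids holding the whole file in memory. The sketch below is one possible approach, not the answerer's method: grep finds the line number of the last marker, then sed's r command reads the new lines in after the preceding line. The file names and sample data are assumptions, and the output goes to stdout rather than back into the file.

```shell
# Sketch: insert new.dat before the last .Ends line without slurping the file.
dir=$(mktemp -d)
printf '%s\n' gla2 .Ends she .Ends ges '# junk' > "$dir/input_file"
printf '%s\n' newline newline > "$dir/new.dat"

# grep prints "N:line" for every hit; keep only the last line number
n=$(grep -n '\.Ends' "$dir/input_file" | tail -n 1 | cut -d: -f1)

if [ -z "$n" ]; then
    echo "no .Ends marker found" >&2
    result=$(cat "$dir/input_file")
elif [ "$n" -eq 1 ]; then
    # Marker is the first line: the new text simply goes in front
    result=$(cat "$dir/new.dat" "$dir/input_file")
else
    # "r file" appends the file after the addressed line, i.e. before line n
    result=$(sed "$((n - 1))r $dir/new.dat" "$dir/input_file")
fi

printf '%s\n' "$result"
rm -rf "$dir"
```

This reads the input twice (once with grep, once with sed), but both passes are line by line.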
T
2

Inputs:

$ cat test.dat
dla4
.Ends
she
.Ends
res
.Ends
abc

$ cat new.dat
newline 111
newline 222

One ed approach:

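$ ed test.dat >/dev/null 2>&1 <<EOF
1
?\.Ends
-1r new.dat
wq
EOF

Or as a one-liner:

$ ed test.dat < <(printf '%s\n' 1 '?\.Ends' '-1r new.dat' wq) >/dev/null 2>&1

Where:

  • >/dev/null 2>&1 - brute force suppression of diagnostic and info messages
  • 1 - go to line #1
  • ?\.Ends - search backwards in file for the string .Ends, wrapping around from line 1 to the end of the file (ie, find last .Ends in file); the dot is escaped so it matches only a literal dot, and the pattern is quoted in the one-liner so the shell does not treat ? as a glob character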
  • -1r new.dat - move back/up 1 line (-1) in file and read in the contents of new.dat
  • wq - write and quit (aka save and exit)

This generates:

$ cat test.dat
dla4
.Ends
she
.Ends
res
newline 111
newline 222
.Ends
abc

NOTE: unlike OP's current code which writes the modified data to stdout, this solution modifies the original input file (test.dat)

Thromboembolism answered 19/11, 2022 at 22:18 Comment(6)
I believe your answer works (heck, both of your previous answers work and I'm still trying to figure out the second), but this is not a one-liner.Bohon
@Bohon re: not a one-liner ... an 'easy' solution is to place the code in a function wrapper, or place in a file and then source the file ... both methods can allow for a 'one-liner' solution at the command promptThromboembolism
to be honest ... I'm not an ed user so this answer took about 15 minutes to research and test but during that research I recall a few examples where a multi-line answer (like above) was collapsed into a single line ... something like (but don't quote me): ed '1;?.Ends;-1r new.dat;wq' test.datThromboembolism
net result ... in many cases a multi-liner can be reduced to a one-linerThromboembolism
@Bohon fwiw ... after a few minutes of chatting with Mr Google I was able to figure out how to write this one as a one-liner, too; answer updatedThromboembolism
Thanks, but I'll stick to your other answer, with the awk. as the rule says, if it works - don't fix it :)Bohon
P
2

Since you want to read the new lines from a file:

$ cat new
foo
bar
etc
$ tac file | awk 'NR==FNR{str=$0 ORS str; next} {print} $0==".Ends"{printf "%s", str; str=""}' new - | tac
gla2
fla3
dla4
rfa5
.Ends
shu
sha
she
.Ends
res
pes
ges
.Ends
--->
...
pes
ges
someline
foo
bar
etc
.Ends
# * some irrelevant junk * #

The above assumes the white space after .Ends on some lines of your posted sample input is a mistake. If it really can be present then change $0==".Ends" to /^\.Ends[[:space:]]*$/ or even /^[[:space:]]*\.Ends[[:space:]]*$/ if there might also be leading white space on those lines or just /\.Ends/ if there can be any chars before/after .Ends.
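To see how these alternatives differ, the sketch below counts which of a few made-up sample lines each pattern accepts. grep stands in for awk here (the regular expressions are the same; grep -x -F plays the role of the exact comparison $0==".Ends"), and the sample lines are assumptions chosen to include the problem cases raised in the comments, such as BEnds and FooEndsBar.

```shell
# Sketch: count how many sample lines each marker test accepts.
dir=$(mktemp -d)
sample=$dir/sample
printf '%s\n' '.Ends' '.Ends  ' '  .Ends' 'foo.Ends.bar' 'BEnds' 'FooEndsBar' > "$sample"

exact=$(grep -c -x -F '.Ends' "$sample")                        # like $0==".Ends"
trailing=$(grep -c -E '^\.Ends[[:space:]]*$' "$sample")         # trailing blanks allowed
both=$(grep -c -E '^[[:space:]]*\.Ends[[:space:]]*$' "$sample") # leading blanks too
anywhere=$(grep -c -E '\.Ends' "$sample")                       # literal .Ends anywhere
unescaped=$(grep -c -E '.Ends' "$sample")                       # any char before Ends
anchored_unescaped=$(grep -c -E '^.Ends' "$sample")             # also accepts BEnds

echo "exact=$exact trailing=$trailing both=$both anywhere=$anywhere"
echo "unescaped=$unescaped anchored_unescaped=$anchored_unescaped"
rm -rf "$dir"
```

The unescaped patterns accept lines that contain no literal dot at all, which is why the escaped forms are the safer choice.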

Pilpul answered 20/11, 2022 at 0:3 Comment(2)
Can you please explain what's the dash after "new" is doing in this awk command? Not familiar with single dash (and aliased - to less in my env, so want to prevent collisions)Bohon
By convention, a lone - given as an input file name represents stdin. Don't alias it to less (I didn't know you COULD alias symbols!) or you'll run into problems.Pilpul
B
0

First let grep do the searching, then inject the lines with awk.

$ cat insert
new content
new content

$ line=$(cat insert)

$ awk -v var="${line}" '
      NR==1{last=$1; next} 
      FNR==last{print var}1' <(grep -n "^\.Ends$" file | cut -f 1 -d : | tail -1) file
rfa5 
.Ends
she
.Ends
ges
.Ends  
ges
new content
new content
.Ends
ges
ges

Data

$ cat file
rfa5 
.Ends
she
.Ends
ges
.Ends  
ges
.Ends
ges
ges
Beacham answered 19/11, 2022 at 21:37 Comment(4)
Your answer relies on certain OS shenanigans that my OS (csh) isn't supporting - such as round braces and having the spaces saved when performing set line=`cat insert`, so I can't check it.Bohon
@Bohon please read some/all of the articles that google.com/search?q=csh+why+not will find.Pilpul
@EdMorton it's nothing I can control - that's what I'm given and what my tools require. I read these articles when I tried aliasing something with commas and ended up with 5 chars per each comma sign...Bohon
@Bohon if your boss is forcing you to write scripts in csh, you should push back as it's hurting your productivity and ability to write concise, robust, efficient, portable solutions and I'd hope your boss would appreciate that feedback. I'm not aware of any tools that must call or be called from csh rather than any other shell but if they exist they are poorly thought out and should be replaced with other portable tools (or if shell scripts you should add a csh shebang at the top).Pilpul
G
0

Two general points in advance:

  1. When you pipe the output of perl to tac, it doesn't make sense to run perl -i for in-place edit.

  2. An uninitialized $flag is false by default, so you can invert its meaning and drop the BEGIN block, which makes the code handier:

    - BEGIN {$flag = 1} if ($flag==1 && /.Ends/) {$flag = 0 ; print "..."}
    + if (!$f && /.Ends/) {$f=1; print "..."}
    

Now to the questions:

When I use:

tac ../../test | perl -pi -e 'BEGIN {$flag = 1} if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac

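It first prints the someline\n and only then prints the .Ends. The result is: .Ends\nsomeline.

Yes: with -p, your print runs before the current line is printed automatically, and since the stream is reversed, the new line ends up after .Ends once the final tac restores the order. You can swap the output order of the current line and the new line: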

perl -pe 'if (!$f && /.Ends/) {$f=1 ; print $_ . "someline\n" ; $_=""}'

When I use:

tac ../../test | perl  -e 'BEGIN {$flag = 1} print ; if ($flag==1 && /.Ends/) {$flag = 0 ; print "someline\n"}' | tac

It doesn’t print anything.

You're just missing -n; without it, perl never loops over the input lines. With it, this works:

perl -ne ...

[...] It prints everything twice:

No explanations needed for that :)

In general, using three commands is not a bad idea: you can avoid high memory usage by writing the perl output to a tmp file and running the second tac on that file. Otherwise the second tac, which cannot seek in a pipe, has to buffer the entire input before it can print anything.

awk looks very similar:

tac test | awk '!f && $0==".Ends" {print $0 ORS "newline2" ORS "newline1"; f=1; next}1' | tac
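The temp-file suggestion above can be sketched as follows, using the awk command from this answer as the middle step. The sample data, the mktemp paths, and the output name test.new are assumptions for illustration; tac is assumed to be the GNU coreutils version.

```shell
# Sketch: reverse, insert after the first marker seen, write to a temp file,
# then reverse the temp file (a regular file, so tac can read it from the end).
dir=$(mktemp -d)
printf '%s\n' dla4 .Ends she .Ends res .Ends abc > "$dir/test"
tmp=$(mktemp)

# In the reversed stream the new lines go AFTER the marker, in reverse order
tac "$dir/test" |
    awk '!f && $0==".Ends" {print $0 ORS "newline2" ORS "newline1"; f=1; next}1' > "$tmp"

tac "$tmp" > "$dir/test.new"
result=$(cat "$dir/test.new")
printf '%s\n' "$result"
rm -rf "$dir" "$tmp"
```

The original file is left untouched; test.new can be moved over it once the result has been checked.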
Gerund answered 22/9, 2023 at 22:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.