Print first few and last few lines of file through a pipe with "..." in the middle

Problem Description

This is my file

1
2
3
4
5
6
7
8
9
10

I would like to send the cat output of this file through a pipe and receive this

% cat file | some_command
1
2
...
9
10

Attempted solutions

Here are some solutions I've tried, with their output

% cat temp | (head -n2 && echo '...' && tail -n2)
1
2
...
% cat temp | tee >(head -n3) >(tail -n3) >/dev/null
1
2
3
8
9
10
# I don't know how to get the ...
% cat temp | sed -e 1b -e '$!d'
1
10

% cat temp | awk 'NR==1;END{print}'
1
10
# Can only get 2 lines
Leanora answered 7/12, 2021 at 21:3 Comment(0)
B
8

An awk:

awk -v head=2 -v tail=2 'FNR==NR && FNR<=head
FNR==NR && cnt++==head {print "..."}
NR>FNR && FNR>(cnt-tail)' file file

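Or, if a single pass is important (and memory allows, since this slurps the whole file), you can use perl. Note that -a splits on any whitespace, so this assumes the lines contain no spaces: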

perl -0777 -lanE 'BEGIN{$head=2; $tail=2;}
END{say join("\n", @F[0..$head-1],("..."),@F[-$tail..-1]);}' file   

Or, an awk that is one pass:

awk -v head=2 -v tail=2 'FNR<=head
{lines[FNR]=$0}
END{
    print "..."
    for (i=FNR-tail+1; i<=FNR; i++) print lines[i]
}' file

Or, if the input is a regular file rather than a pipe, there is nothing wrong with the direct 'caveman' approach:

head -2 file; echo "..."; tail -2 file

Any of these prints:

1
2
...
9
10

In terms of efficiency, here are some stats.

For small files (i.e., less than 10 MB or so), all of these take less than 1 second, and the 'caveman' approach takes 2 ms.

I then created a 1.1 GB file with seq 99999999 >file

  • The two pass awk: 50 seconds
  • One pass perl: 10 seconds
  • One pass awk: 29 seconds
  • 'Caveman': 2 ms
Betaine answered 7/12, 2021 at 21:13 Comment(7)
Now handle the cases where the line count is less than head and tail, and where the head and tail lines intersect ^^ - Pecan
They all handle overlapping head and tail. - Betaine
Especially with large files, the "caveman" approach is the best, because it's the only one that won't read the whole file (head stops after a few lines, and tail seeks to the end and works its way back). Try the perl version with a file that's larger than your available RAM and you're in for a surprise. - Seleta
@dawg, I think that by overlapping head and tail, they mean e.g. a case where the file has only three lines. Given three lines 1, 2, and 3, that last head+tail solution would print 1, 2, ..., 2, 3, which is probably technically correct at least for some phrasings of the problem, but it might also be considered misleading. Looks like the others print the same. - Causal
@ilkkachu: I think the case of a three-line file is at best ambiguous as to what the 'correct result' is. In my view, 1\n2\n...\n2\n3 is the most correct. What do you think would be a better result for that? - Betaine
@GuntramBlohm: Agreed and I added a note to that effect. The two pass awk is reasonable as well in that situation. - Betaine
@dawg, in the narrow context of this Q, we don't know, since the post doesn't say. But more generally, 1\n2\n...\n2\n3 implies that something was removed where it says ..., and that's not true for a three- or four-line file. In general, it would make more sense to me to print a three-line file as-is, without the ellipsis. Of course, we don't know what they're doing in this particular case. If there's a use case that requires or expects all four lines and the ..., and where the doubled 2 line makes sense, then that needs to be done. - Causal
M
1

You may consider this awk solution:

awk -v top=2 -v bot=2 'FNR == NR {++n; next} FNR <= top || FNR > n-bot; FNR == top+1 {print "..."}' file{,}

1
2
...
9
10
Megdal answered 7/12, 2021 at 21:12 Comment(0)
S
1

Two single pass sed solutions:

sed '1,2b
     3c\
...
     N
     $!D'

and

sed '1,2b
     3c\
...
     $!{h;d;}
     H;g'
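
For readers wondering how these work, here is an annotated sketch of both programs. The sed commands are the same as above, with comments added. The wrapper names sed_window and sed_hold are made up for illustration. Both are hard-wired to 2 head and 2 tail lines, and the comments assume input of at least five lines and a sed that accepts # comment lines inside a script (GNU sed does).

```shell
# sed_window and sed_hold are hypothetical wrapper names, used only here.

# First program: keeps a two-line sliding window in the pattern space.
sed_window() {
  sed '
    # Lines 1-2: branch to the end of the script, so they print unchanged.
    1,2b
    # Line 3: replace it with the literal text "..." and start the next cycle.
    3c\
...
    # From line 4 on: append the next input line to the pattern space (N).
    N
    # Unless that was the last line, delete up to the first newline and rerun
    # the script without reading input (D), which slides the window by one.
    $!D'
}

# Second program: remembers the previous line in the hold space.
sed_hold() {
  sed '
    # Lines 1-2: branch to the end of the script, so they print unchanged.
    1,2b
    # Line 3: replace it with the literal text "..." and start the next cycle.
    3c\
...
    # Any later line that is not the last: copy it to the hold space (h)
    # and delete the pattern space (d), so nothing is printed.
    $!{h;d;}
    # Last line: append it to the hold space (H), which holds the previous
    # line, then copy the hold space back (g) so both lines are printed.
    H;g'
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sed_window
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sed_hold
```

Both calls print 1, 2, ..., 9, 10 for the ten-line sample. Changing the tail count would need a different script in either case, since the window size (one N per cycle, or one held line) is what fixes it at two.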
Samarskite answered 7/12, 2021 at 22:4 Comment(1)
How does this work? It would be more helpful for future readers with related problems (like a count other than 2) if you commented the code and said what you're doing with the pattern / hold space. - Merger
C
0

Assumptions:

  • as OP has stated, a solution must be able to work with a stream from a pipe
  • the total number of lines coming from the stream is unknown
  • if the total number of lines is less than the sum of the head/tail offsets then we'll print duplicate lines (we can add more logic if OP updates the question with more details on how to address this situation)

A single-pass awk solution that implements a queue in awk to keep track of the most recent N lines; the queue allows us to limit awk's memory usage to just N lines (as opposed to loading the entire input stream into memory, which could be problematic when processing a large volume of lines/data on a machine with limited available memory):

h=2 t=3

cat temp | awk -v head=${h} -v tail=${t} '
    { if (NR <= head) print $0
      lines[NR % tail] = $0
    }

END { print "..."

      if (NR < tail) i=0
      else           i=NR

      do { i=(i+1)%tail
           print lines[i]
         } while (i != (NR % tail) )
    }'

This generates:

1
2
...
8
9
10

Demonstrating the overlap issue:

$ cat temp4
1
2
3
4

With h=3;t=3 the proposed awk code generates:

$ cat temp4 | awk -v head=${h} -v tail=${t} '...'
1
2
3
...
2
3
4

Whether or not this is the 'correct' output will depend on OP's requirements.
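
If the requirement turns out to be "print short input as-is, with no duplicates and no ellipsis", one possible variation is sketched below. This is an assumption about what OP might want, not something stated in the question. The function name head_tail and its positional arguments are made up for illustration, and the sketch assumes tail is at least 1. It still reads the stream once and keeps at most tail lines in memory.

```shell
# head_tail is a hypothetical helper: head_tail HEAD TAIL < input
# Assumes TAIL >= 1 (the ring buffer index uses "% tail").
head_tail() {
  awk -v head="$1" -v tail="$2" '
    # The first "head" lines are printed immediately.
    NR <= head { print; next }
    # Later lines go into a ring buffer holding at most "tail" lines.
    { buf[(NR - head - 1) % tail] = $0 }
    END {
      rest = NR - head                  # lines seen after the head
      if (rest > tail) print "..."      # mark a gap only if lines were dropped
      n = (rest < tail) ? rest : tail   # number of buffered lines to print
      for (i = rest - n; i < rest; i++) print buf[i % tail]
    }'
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | head_tail 2 2
printf '%s\n' 1 2 3 4 | head_tail 3 3
```

The first call prints 1, 2, ..., 9, 10. The second prints 1, 2, 3, 4 with no ellipsis and no repeated lines, because nothing was skipped between the head and the tail.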

Catt answered 8/12, 2021 at 16:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.