split file on Nth occurrence of delimiter

Asked 21/3, 2013 at 23:19 Answered 21/3, 2013 at 23:49

Is there a one-liner to split a text file into pieces / chunks after every Nth occurrence of a delimiter?

example: the delimiter below is "+"

entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
...

There are several million entries, so splitting on every occurrence of delimiter "+" is a bad idea. I want to split on, say, every 50,000th instance of delimiter "+".

Unix commands "split" and "csplit" just don't seem to do this...

Sirkin asked 21/3, 2013 at 23:19 Comment(0)

Using awk you could:

awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt
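To see where the chunk boundaries fall, here is a small sketch with the chunk size reduced to 2. The file names sample.txt and demo_chunk*.txt and the variable n are made up for the illustration; the counting logic is the same as in the command above.

```shell
# Hypothetical sample: four single-line entries, each followed by a "+" line.
printf '%s\n' 'entry 1' + 'entry 2' + 'entry 3' + 'entry 4' + > sample.txt

# Same counting logic as above, with the chunk size passed in as n.
# ">" is used so that a rerun starts each demo chunk afresh.
awk -v n=2 '/^\+$/ { delim++ } { file = sprintf("demo_chunk%d.txt", int(delim / n)); print > file }' sample.txt

# The n-th "+" bumps the chunk index before it is printed,
# so it becomes the first line of the following chunk.
head -n 1 demo_chunk1.txt
```

With this input the trailing "+" also lands in a chunk of its own (demo_chunk2.txt), because it is the 4th delimiter and 4 / 2 = 2.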

Update:

To not include the delimiter, try this:

awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt

The next keyword causes awk to stop processing rules for this record and advance to the next one (the next line). I also changed the >> to >, since if you run it more than once you probably don't want to append to the old chunk files.
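A possible variation, for the case where a chunk should neither begin nor end with a delimiter line: hold each delimiter back and only write it once more data for the same chunk turns up. This is a sketch under assumptions, not a tested production command; the names sample.txt, clean*.txt, n and pending are invented here, and the chunk size is 2 instead of 50000.

```shell
# Hypothetical sample: four entries, each terminated by a "+" line.
printf '%s\n' 'entry 1' 'some more' + 'entry 2' + 'entry 3' + 'entry 4' + > sample.txt

awk -v n=2 '
    /^\+$/ {
        if (++delim % n == 0) {
            close(file)      # every n-th delimiter ends a chunk and is dropped
            pending = 0
        } else {
            pending = 1      # hold this delimiter until more data follows
        }
        next
    }
    {
        file = sprintf("clean%d.txt", int(delim / n))
        if (pending) { print "+" > file; pending = 0 }   # separator between entries
        print > file
    }
' sample.txt
```

Calling close() on each finished chunk also keeps the number of simultaneously open files at one, which may matter with awk implementations that have a low limit on open files.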

Coomer answered 21/3, 2013 at 23:41 Comment(5)

But this would append each line individually... won't that be incredibly slow because of so much I/O? – Sirkin 21/3, 2013 at 23:47

From the gawk manual: "Redirecting output using `>', `>>', or `|' asks the system to open a file or pipe only if the particular file or command you've specified has not already been written to by your program, or if it has been closed since it was last written to." So it's a bit different than doing it in a shell. – Coomer 21/3, 2013 at 23:51

Wow, that is an extremely technical catch. But useful! – Sirkin 22/3, 2013 at 0:31

One final question for bonus points - with this method, the first line in each "chunk" file that is created is the delimiter (+ above). What if I want NEITHER the first NOR last line of each file to be a delimiter? (i.e., begin and end "cleanly"). – Sirkin 22/3, 2013 at 1:41

I always keep coming back to this useful little gem. It has saved me countless times and countless hours. Thank you! – Cannice 31/3 at 5:14

It isn't very hard to do in Perl if you can't find a suitable alternative (and it will perform pretty well):

#!/usr/bin/env perl
use strict;
use warnings;

# Configuration items - could be set by argument handling
my $prefix = "rs.";     # File prefix
my $number = 1;         # First file number
my $width  = 4;         # Number of digits to use in file name
my $rx     = qr/^\+$/;  # Match regex
my $limit  = 3;         # 50,000 in real case
my $quiet  = 0;         # Set to 1 to suppress file names

sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing: $!";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter
while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;

That's far from being a one-liner; I'm not sure whether that's a merit or not. The items that should be configured are grouped together, and could be set via command line options, for example. You could end up with an empty file (when the input ends exactly on a chunk boundary); you could spot that and remove it if necessary. You'd need a second counter; the existing one is a 'match counter' but you'd also need a line counter, and if the line counter was zero at the end you'd remove the last file. You'd also need the name to be able to remove it... fiddly, but not difficult.
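If changing the script feels like too much trouble, a different route is to clean up afterwards in the shell. The sketch below is only an illustration of that alternative, not part of the script: the rs_demo.* names are made up, and the three files merely simulate a run that left an empty trailing chunk.

```shell
# Simulate a finished run: two populated chunks and one empty trailing chunk.
printf 'entry 1\n+\n' > rs_demo.0001
printf 'entry 2\n+\n' > rs_demo.0002
: > rs_demo.0003

# Remove any chunk that has zero size ([ -s FILE ] is true for non-empty files).
for f in rs_demo.*; do
    [ -s "$f" ] || rm -- "$f"
done

ls rs_demo.*
```

This avoids the second counter, at the cost of briefly creating the empty file.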

Given the input (basically two copies of your sample data), the output from repsplit.pl (repeat split) was as shown:

$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$

Trivalent answered 21/3, 2013 at 23:49 Comment(0)

Using perl with + as the input record separator, in a concise "one-liner":

If you'd like to do $_ > newprefix.part.$c as stated in your comment:

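$ limit=50000 perl -053 -Mautodie -ne '
    # Records end at each "+" (octal 053); without -l the "+" stays in $_.
    # Caveat: the newline after each "+" line begins the next record.
    # Open a new part before the first record and after every $limit records.
    if ($count++ % $ENV{limit} == 0) {
        close $fh if $fh;
        open $fh, ">", "newprefix.part." . ++$c;
    }
    print $fh $_;
' file.txt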

$ ls -l newprefix.part.*

Reefer answered 21/3, 2013 at 23:32 Comment(2)

"doSomethingWith" would have to be something like cat $_ > newprefix.part.$c right? – Sirkin 21/3, 2013 at 23:46

doSomethingWith() can be what ever you want to do with every chunk, so yes. Do you want it like that ? – Irons 22/3, 2013 at 0:20
