Randomize txt file in Linux but guarantee no repetition of lines

I have a file called test.txt which looks like this:

Line 1
Line 2
Line 3
Line 3
Line 3
Line 4
Line 8

I need some code that will randomize these lines but GUARANTEE that the same text cannot appear on consecutive lines, i.e. "Line 3" must be split up and must not appear twice (or even three times) in a row.
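
Note that such an ordering is only possible when the most frequent line occurs at most ceil(n/2) times out of n total lines. As a quick feasibility check (a sketch using sort, uniq and awk), the following pipeline exits 0 when a valid arrangement exists:

$ sort test.txt | uniq -c | sort -rn | awk 'NR==1{max=$1} {n+=$1} END{exit !(max <= int((n+1)/2))}'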

I've seen many variations of this problem answered on here but as yet, none that deal with the repetition of lines.

So far I have tested the following:

shuf test.txt

awk 'BEGIN{srand()}{print rand(), $0}' test.txt | sort -n -k 1 | awk 'sub(/\S /,"")'

awk 'BEGIN {srand()} {print rand(), $0}' test.txt | sort -n | cut -d ' ' -f2-

cat test.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-

perl -e 'print rand()," $_" for <>;' test.txt | sort -n | cut -d ' ' -f2-

perl -MList::Util -e 'print List::Util::shuffle <>' test.txt

All of which randomize the lines within the file but often end up with the same lines appearing consecutively within the file.
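
For reference, a quick way to check whether a given shuffle still contains consecutive duplicates (a sketch in awk; it prints nothing when the output is clean):

$ shuf test.txt | awk '$0==prev{print "consecutive duplicate at line " NR} {prev=$0}'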

Is there any way I can do this?

This is the data before editing. You can see that the number 82576483 appears on consecutive lines:

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

NOTE: asterisks added to highlight lines of interest; asterisks do not exist in the data file

This is what I need to happen: the number 82576483 is spread out across the file rather than appearing on consecutive lines:

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
Fatidic answered 7/9, 2023 at 10:30 Comment(0)

General approach:

  • use an associative array (linecnt[]) to count the number of times each line is seen
  • break linecnt[] into two separate normal arrays: single[1]=<lineX>; single[2]=<lineY> and multi[1]=<lineA_copy1>; multi[2]=<lineA_copy2>; multi[3]=<lineB_copy1>
  • while we have at least one entry in both arrays (single[] / multi[]), intersperse the printing (i.e. print random(single[]), print random(multi[]), print random(single[]), print random(multi[])); NOTE: obviously not truly random, but this maximizes the chances of separating dupes while limiting CPU overhead (i.e. no need to repeatedly shuffle hoping for a 'random' ordering that splits the dupes)
  • if we have any single[] entries left, print random(single[]) until exhausted
  • if we have any multi[] entries left, print random(multi[]) until exhausted; NOTE: assumes OP's comment re: tough!! means dupes can be printed consecutively if that is all that's left

One awk idea:

$ cat dupes.awk

function print_random(a, acnt,     ndx) {
    ndx=int(1 + rand() * acnt)                       # pick a random index in 1..acnt
    print a[ndx]
    if (acnt>1) { a[ndx]=a[acnt]; delete a[acnt] }   # swap the last entry into the hole
    return --acnt                                    # caller tracks the shrunken count
}

BEGIN { srand() }

      { linecnt[$0]++ }                              # count occurrences of each line

END   { for (line in linecnt) {
            if (linecnt[line] == 1)
               single[++scnt]=line
            else
               for (i=1; i<=linecnt[line]; i++)
                   multi[++mcnt]=line
            delete linecnt[line]
        }

        while (scnt>0 && mcnt>0) {
              scnt=print_random(single,scnt)
              mcnt=print_random(multi,mcnt)
        }

        while (scnt>0)
              scnt=print_random(single,scnt)

        while (mcnt>0)
              mcnt=print_random(multi,mcnt)
      }

NOTES:

  • srand() seeds from the time of day, so it isn't truly random (e.g. two quick, successive runs within the same second can generate the exact same output)
  • additional steps could be added to ensure quick, successive runs don't generate identical output (e.g. providing an OS-level seed for use in srand(); see the sketch below)
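
For example, a minimal sketch (assuming a system with /dev/urandom and the od utility): change the script's BEGIN block to BEGIN { srand(seed) } and supply the seed from the shell:

$ awk -v seed="$(od -An -N4 -tu4 /dev/urandom)" -f dupes.awk test.txt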

Running against OP's sample set of data:

$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N>

NOTES:

  • data lines cut for brevity
  • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the single[] entries
  • repeated runs will generate different results

An example of processing duplicates ...

$ cat test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>

Result of running our awk script:

$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

NOTES:

  • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the multi[] entries
  • repeated runs will generate different results
Wellspoken answered 7/9, 2023 at 15:25 Comment(5)
Thank you so much for this. Does the trick exactly as intended & I'm familiar with most of what's going on in here...Thanks for everyone's efforts in helping me in all this. It's much appreciatedFatidic
Suppose you have the input seq 0 10 | awk 'BEGIN{ ff[0]="A"; ff[1]="B"; ff[2]="C"} {print ff[$1%(FNR>3?3:2)]}' | sort | awk '{print $1, "Line", FNR-1}' >file Your approach does not work (using first column as key.)Boutonniere
@Boutonniere not sure what you're getting at; your code generates a bunch of lines like A line 0; A line 1; B line 2; ..., with none of the lines being duplicated; OP's question refers to the whole line being the 'key'; at no point does anyone (OP, me) suggest this approach would work for some other data set where we're looking at duplicate columns (as opposed to duplicate rows) ... ??????Wellspoken
Delete the ` Line \d` part then. seq 0 10 | awk 'BEGIN{ ff[0]="A"; ff[1]="B"; ff[2]="C"} {print ff[$1%(FNR>3?3:2)]}' | sort >file does not work either...Boutonniere
Define "does not work": your code generates nothing but duplicates; OP hasn't defined how to process an excess number of duplicates; my previous answer (see the edit history) went a bit further to randomize excessive duplicates but still had limitations (eg, how to randomize 3 lines that are all A)Wellspoken

An efficient approach, at least compared to repeatedly trying at random:

  1. Shuffle all the unique strings.
  2. For each duplicate:
    1. Identify the positions in which it could be placed.
    2. Pick one at random.
    3. Insert the duplicate there.

use strict;
use warnings;

use List::Util qw( shuffle );

my %counts; ++$counts{ $_ } while <>;    # count each distinct line (newlines kept)

my @strings = shuffle keys %counts;      # unique lines, in random order

for my $string ( keys( %counts ) ) {
   my $count = $counts{ $string };
   for ( 2 .. $count ) {                 # re-insert each duplicate copy
      # positions where neither neighbour is the same string
      my @safe =
         grep { $_ == 0        || $strings[ $_ - 1 ] ne $string }
         grep { $_ == @strings || $strings[ $_ - 0 ] ne $string }
         0 .. @strings;

      # no safe slot left: fall back to a fully random position
      my $pick = @safe ? $safe[ rand( @safe ) ] : rand( @strings+1 );

      splice( @strings, $pick, 0, $string );
   }
}

print( @strings );

(Just wrap with perl -e'...' to run from the shell.)

Tested. There may be an even better approach.

Pergrim answered 7/9, 2023 at 16:45 Comment(0)

Ruby has some nice syntax for a concise approach.

https://stackoverflow.com/a/65843200 is easily modified for your data:

ruby -e '

regex = /<CUST-ACNT-N>\d+<\/CUST-ACNT-N>/

arr = readlines.map {|line| {:k => line[regex], :v => line}}   # key = account tag
arr = arr.sort_by {|kv| kv[:k]}                                # group equal keys together
mid = arr.size.succ / 2                                        # split point, rounded up
arr = arr[0..mid-1].zip(arr[mid..-1]).flatten.compact.map {|kv| kv[:v]}   # interleave the two halves
idx = (1..arr.size-1).find { |i| arr[i] == arr[i-1] }          # first remaining adjacent dupe, if any

puts idx ? arr.rotate(idx) : arr

' file.txt

Pages answered 7/9, 2023 at 20:19 Comment(0)

Another approach: first shuffle the lines, then go line by line, collecting dupes as they come; for each line, check the existing dupes to slip them in if possible. After the input has been processed this way, go over the result from the beginning to try to place the remaining dupes.

use warnings;
use strict;
use feature 'say';    
use List::Util qw(shuffle any);

# Push dupes to data unless same as last element or has been added already
sub add_dupes {
    my ($data, $dupes, $mask) = @_;

    for my $idx (0..$#$dupes) {
        next if $dupes->[$idx] eq $data->[-1];
        next if any { $idx == $_ } @$mask;

        push @$data, $dupes->[$idx];
        push @$mask, $idx;
    }
}

my @lines = <>;
chomp @lines;

my @res = shift @lines;
my (@dupes, @mask_dupes);    # dupes collected so far, and indices of dupes already placed

foreach my $line (shuffle @lines) {
    if ($line eq $res[-1]) { push @dupes, $line }
    else                   { push @res, $line }

    # Redistribute dupes found so far if possible
    add_dupes(\@res, \@dupes, \@mask_dupes);
}

# Redistribute remaining (unused) dupes
my @final;
foreach my $line (@res) {
    if (@final and $line eq $final[-1]) { push @dupes, $line }
    else                                { push @final, $line }

    add_dupes(\@final, \@dupes, \@mask_dupes);
}

say "\nFinal (", scalar @final, " items):";
say for @final;

This stores dupes on an array as they are found, and for each line checks whether it can slip in existing dupe(s). It uses an ancillary mask array to mark indices of dupes that have been used.

Notes

  • Shuffling first helps since many of the consecutive duplicate lines will get moved around, with an overwhelming probability

  • The duplicates array is searched for each line of data, so in principle the worst case is O(N²) (or, rather, O(N·M)). This, I think, has to be done in some way in any approach, but it should be possible to minimize these cross searches.

    However, the array of dupes is expected to be rather short and most of the time not the whole array is searched. So if the input isn't gigantic with a lot of dupes this should perform well.

  • If there happen to be no duplicates in the end we are copying an array needlessly. But that's not a terrible sin, if it's once.

Tested with various input, with many duplicates of multiple lines, but needs more testing. (At least, add basic diagnostic prints and run repeatedly -- it shuffles each time so repeated runs help -- and examine the output.)

Bushelman answered 7/9, 2023 at 23:50 Comment(0)

Using any awk:

$ cat tst.awk
match($0,/<CUST-ACNT-N>[^<]+<\/CUST-ACNT-N>/) {
    key = substr($0,RSTART,RLENGTH)
    gsub(/^<CUST-ACNT-N>|<\/CUST-ACNT-N>$/,"",key)
    keys[NR] = key
    lines[NR] = $0
}
END {
    srand()
    maxAttempts = 1000
    while ( (output == "") && (++attempts <= maxAttempts) ) {
        output = distribute()
    }
    printf "%s", output
    if ( output == "" ) {
        print "Error: Failed to distribute the input." | "cat>&2"
        exit 1
    }
}

function distribute(    iters,numLines,maxIters,tmpLines,tmpKeys,idx,i,ret) {
    for ( idx in keys ) {
        tmpKeys[idx] = keys[idx]
        tmpLines[idx] = lines[idx]
        numLines++
    }

    maxIters = 1000
    while ( (numLines > 0) && (++iters <= maxIters) ) {
        idx = int(1+rand()*numLines)

        if ( tmpKeys[idx] != prev ) {
            ret = ret tmpLines[idx] ORS
            prev = tmpKeys[idx]
            for ( i=idx; i<numLines; i++ ) {
                tmpKeys[i] = tmpKeys[i+1]
                tmpLines[i] = tmpLines[i+1]
            }
            numLines--
        }
    }

    if ( numLines ) {
        ret = ""
    }
    return ret
}

$ awk -f tst.awk file
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

So, in one attempt to produce output, it tries up to 1000 times (maxIters) to find, at random from the set of unprocessed lines, a next line to output that isn't the same as the line it just added to the output. That attempt could ultimately still fail, so it makes up to 1000 attempts (maxAttempts) to produce output. Even that could fail - increase those values if you like, but some input simply can't be organized as you'd like (e.g. only 2 lines of input where both lines are identical).

You could make it more efficient and increase its chances of success by changing this code:

        ret = ret tmpLines[idx] ORS
        prev = tmpKeys[idx]
        for ( i=idx; i<numLines; i++ ) {
            tmpKeys[i] = tmpKeys[i+1]
            tmpLines[i] = tmpLines[i+1]
        }
        numLines--

to create/use secondary arrays consisting of only the keys+lines that do not have the same key as the one just processed. Then we wouldn't need the if ( tmpKeys[idx] != prev ) test above it, and we wouldn't run the risk of idx = int(1+rand()*numLines) randomly finding the same key 1000 times when there were others to choose from. That enhancement is left as an exercise :-) but one possible sketch follows.
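
A minimal, illustrative sketch of that enhancement (assuming the rest of tst.awk stays unchanged; candIdx and numCand are names introduced here): the random pick is made only among entries whose key differs from the one just output, so no iterations are wasted on retries:

function distribute(    numLines,tmpLines,tmpKeys,candIdx,numCand,idx,i,ret,prev) {
    for ( idx in keys ) {
        tmpKeys[idx] = keys[idx]
        tmpLines[idx] = lines[idx]
        numLines++
    }

    while ( numLines > 0 ) {
        # collect indices whose key differs from the key just output
        numCand = 0
        for ( i=1; i<=numLines; i++ )
            if ( tmpKeys[i] != prev )
                candIdx[++numCand] = i

        if ( numCand == 0 ) return ""        # stuck: only copies of prev remain

        idx = candIdx[int(1+rand()*numCand)]
        ret = ret tmpLines[idx] ORS
        prev = tmpKeys[idx]

        # shift the remaining entries down to fill the gap
        for ( i=idx; i<numLines; i++ ) {
            tmpKeys[i] = tmpKeys[i+1]
            tmpLines[i] = tmpLines[i+1]
        }
        numLines--
    }
    return ret
}

A greedy pick like this can still paint itself into a corner (e.g. when only duplicates of the previous key remain), so keeping the outer maxAttempts retry loop is still worthwhile.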

Hallam answered 7/9, 2023 at 12:4 Comment(2)
Thanks Ed. It's working most of the time but I'm getting the odd "Failed to distribute the input" error when I try to run it with the "file" in your example. Is this just a quirk of the random nature of what we're trying to do?Fatidic
You're welcome. I explained that in the paragraph at the bottom of my answer. It'll try up to 1000*1000 = 1,000,000 times to produce the desired output before giving up.Hallam

Using TXR Lisp:

$ txr spread-sort.tl < data
Line 2
Line 4
Line 3
Line 1
Line 3
Line 8
Line 3
$ txr spread-sort.tl < data
Line 4
Line 3
Line 1
Line 3
Line 8
Line 3
Line 2
$ txr spread-sort.tl < data
Line 4
Line 3
Line 8
Line 3
Line 1
Line 3
Line 2

The code:

(set *random-state* (make-random-state))

(let ((dupstack (vec)))
  (labels ((distrib (single)
             (build
               (pend single)
               (each ((i 0..(len dupstack)))
                 (iflet ((item (pop [dupstack i])))
                   (add item)))
               (upd dupstack (remq nil))))
           (distrib-push (dupes)
             (prog1
               (distrib nil)
               (vec-push dupstack dupes))))
    (flow (get-lines)
      sort-group
      shuffle
      (mapcar [iff cdr distrib-push distrib])
      (mapcar distrib)
      tprint)))

This is not a correct algorithm, in that if the input has a high enough ratio of duplicates that only a specific ordering is correct, such as:

1
2
2

it will not consistently produce the 2 1 2 order that separates the duplicates.

The main flow of the algorithm is in the flow form. Lines are obtained from standard input and passed through sort-group, which will group the duplicates and sort, resulting in a list of lists of strings. Lines which aren't duplicates are lists of length 1. We shuffle this list of lists randomly, which means that the duplicates stay together.

We then distribute the duplicates using two passes, which use a vector called dupstack.

In the first pass, we map the list of lists such that the singletons are passed through distrib and the duplicates are passed through distrib-push. This moves around the duplicates in the way described below. After this pass, some items remain in the dupstack; so the list-of-lists does not have all the items. We make another pass, this time just passing every list through distrib, which distributes the items out of dupstack.

The dupstack is a vector of lists, which are lists of duplicate lines. E.g. [dupstack 0] might contain ("Line 3" "Line 3" "Line 3") and such.

How distrib works is: it sweeps through dupstack, pops one element off the front of each element and appends it to the input list, returning that input list. If we map using this operation, it means that to each list we visit, we add one item from each duplicate set. After each sweep through this stack, we condense it using (upd dupstack (remq nil)) to purge it of lists that have become empty.

The function distrib-push is used in the first pass when processing lists that have more than one element (indicated by the Lisp cdr function returning nonempty). What distrib-push does is call distrib with an empty list, just to collect any available duplicates, one of each. These items cherry-picked from dupstack then replace the current items. Those items, identical strings, are pushed into a new slot in the duplicate stack.

Heddi answered 8/9, 2023 at 2:49 Comment(0)

Here is a sample file:

$ cat file
A Line 0
A Line 1
A Line 2
A Line 3
A Line 4
B Line 5
B Line 6
B Line 7
B Line 8
C Line 9
C Line 10
C Line 11
D Line 12

With the first column defined as the key and the constraint that no key can be next to the same key, randomize the file. Given that constraint, the result will only be random-ish, since there are more A's than any other key and an A will have to be at the start of the sequence. (An odd number of total items has more even indexes than odd ones, since indexing starts at 0 and 0 is even.)

The general approach would be:

  1. Group all similar key lines together;
  2. Randomly choose from groups of input lines to be odd or even so the defined group is distributed;
  3. Randomly choose the remaining lines for output.

This is easily done in Ruby:

ruby -e '
BEGIN{keys=Hash.new { |h, k| h[k] = []} }
data=$<.read.split(/\R/)
data.each.with_index{|s,i| 
    s.match(/^(\S+)/); keys[$1]<<i 
} # regex for key goes here
olines=(0..data.length-1).to_a; nlines=Hash.new()
grp_cnt=keys.values.map{|sa| sa.length if sa.length>1}.compact.sum
keys.sort_by{|k,v| [-v.length, v[0]]}.each{|k, grp| 
    if grp.length>1 then
        evens, odds=olines.partition{|n| n.even?}
        if grp_cnt.to_f/data.length > 0.6 then
            pool=evens.length>odds.length ? evens[0...grp.length] : odds.reverse[0...grp.length]
        else
            pool=evens.length>odds.length ? evens : odds 
        end
        if pool.length<grp.length then pool=olines end
    else
        pool=olines
    end
    
    this_grp=pool.sample(grp.length)
    grp.zip(this_grp).each{|ks, vs| nlines[ks]=vs}
    
    olines.reject!{|line| this_grp.include?(line) }      # remove the used lines
}
nlines.sort_by{|k,v| v}.each{|v,k| puts "Line #{v} in => Line #{k} out; \"#{data[k]}\" => \"#{data[v]}\""}
' file 

Prints:

Line 0 in => Line 0 out; "A Line 0" => "A Line 4"
Line 12 in => Line 1 out; "A Line 1" => "D Line 12"
Line 2 in => Line 2 out; "A Line 2" => "A Line 2"
Line 10 in => Line 3 out; "A Line 3" => "C Line 10"
Line 3 in => Line 4 out; "A Line 4" => "A Line 3"
Line 7 in => Line 5 out; "B Line 5" => "B Line 7"
Line 1 in => Line 6 out; "B Line 6" => "A Line 1"
Line 5 in => Line 7 out; "B Line 7" => "B Line 5"
Line 4 in => Line 8 out; "B Line 8" => "A Line 0"
Line 6 in => Line 9 out; "C Line 9" => "B Line 6"
Line 11 in => Line 10 out; "C Line 10" => "C Line 11"
Line 8 in => Line 11 out; "C Line 11" => "B Line 8"
Line 9 in => Line 12 out; "D Line 12" => "C Line 9"

This is trivial to change to accommodate the OP's example input. Only the regex for the key and the output line are changed:

ruby -e '
BEGIN{keys=Hash.new { |h, k| h[k] = []} }
data=$<.read.split(/\R/)
data.each.with_index{|s,i| 
    s.match(/<CUST-ACNT-N>([^<]+)</); keys[$1]<<i 
} # regex for key goes here
olines=(0..data.length-1).to_a; nlines=Hash.new()
grp_cnt=keys.values.map{|sa| sa.length if sa.length>1}.compact.sum
keys.sort_by{|k,v| [-v.length, v[0]]}.each{|k, grp| 
    if grp.length>1 then
        evens, odds=olines.partition{|n| n.even?}
        if grp_cnt.to_f/data.length > 0.6 then
            pool=evens.length>odds.length ? evens[0...grp.length] : odds.reverse[0...grp.length]
        else
            pool=evens.length>odds.length ? evens : odds 
        end
        if pool.length<grp.length then pool=olines end
    else
        pool=olines
    end
    
    this_grp=pool.sample(grp.length)
    grp.zip(this_grp).each{|ks, vs| nlines[ks]=vs}
    
    olines.reject!{|line| this_grp.include?(line) }      # remove the used lines
}
nlines.sort_by{|k,v| v}.each{|v,k| puts "#{data[v]}"}
' file 

Prints:

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
Boutonniere answered 9/9, 2023 at 13:36 Comment(0)
