implementing a `maxsplit` function in POSIX awk
Asked Answered
C

4

3

I'm trying to implement an awk function for splitting a string into an array; the fundamental difference with the built-in split is that one is able to limit the number of "splits" that are performed, just like in Python's str.split

The prototype would be something like this:

maxsplit(s, a, n[, fs ])
Split the string s into array elements, performing at most n splits (thus, the array will have at most n+1 elements a[1], a[2], ..., a[n+1]), and return the actual number of elements in the array. All elements of the array shall be deleted before the split is performed. The separation shall be done with the ERE fs or with the field separator FS when fs is not provided.
At the moment, the effects of a null string as fs value, or a negative number as n value, are unspecified.

Here's the code I came-up with:
edit1: fixed the border cases pointed-out by @markp-fuso and @Daweo
edit2: added handling of non-provided fs argument (thanks @EdMorton)

function maxsplit(s,a,n,fs,    i,j) {
    delete a
    if (fs == 0 && fs == "")
        fs = FS
    if (fs == " ")
    {
        for (i = 1; (i <= n) && match(s,/[^[:space:]]+/); i++) {
            a[i] = substr(s,RSTART,RLENGTH)
            s = substr(s,RSTART+RLENGTH)
        }
        sub(/^[[:space:]]+/, "", s)
        if (s != "")
           a[i] = s
        else
           --i
    }
    else if (length(fs) == 1)
    {
        for (i = 1; (i <= n) && (j = index(s,fs)); i++) {
            a[i] = substr(s,1,j-1)
            s = substr(s,j+1)
        }
        a[i] = s
    }
    else
    {
        i = 1
        while ( (i <= n) && (s != "") && match(s,fs) ) {
            if (RLENGTH) {
                a[i] = a[i] substr(s,1,RSTART-1)
                s = substr(s,RSTART+RLENGTH)
                ++i
            } else {
                a[i] = a[i] substr(s,1,1)
                s = substr(s,2)
            }
        }
        a[i] = s
    }
    return i
}

As you can see, I wrote multiple code paths for handling the different types of fs. My question is: what other edge cases need to be handled?


Here's a set of test cases:
edit1: reordered the input/output fields and added a few more tests.
edit2: converted the input to a colon-delimited format (instead of xargs format), and added a few more tests.
note: I'm still looking for test-cases that could be tricky to handle.

#inputString:fieldSep:maxSplit
: :0
: :1
: :100
:,:0
:,:1
:,:100
:[ ]:0
:[ ]:1
:[ ]:100
,:,:0
,:,:1
,:,:2
,:,:100
  foo bar   baz    : :0
  foo bar   baz    : :1
  foo bar   baz    : :2
  foo bar   baz    : :3
  foo bar   baz    : :100
foo=bar=baz:=:0
foo=bar=baz:=:1
foo=bar=baz:=:2
foo=bar=baz:=:3
foo=bar=baz:=:100
foo|bar|baz|:|:0
foo|bar|baz|:|:1
foo|bar|baz|:|:2
foo|bar|baz|:|:3
foo|bar|baz|:|:4
foo|bar|baz|:|:100
 foo  bar :[ ]:0
 foo  bar :[ ]:1
 foo  bar :[ ]:2
 foo  bar :[ ]:3
 foo  bar :[ ]:4
 foo  bar :[ ]:5
 foo  bar :[ ]:6
 foo  bar :[ ]:100
_.{2}__.?_[0-9]*:_+:0
_.{2}__.?_[0-9]*:_+:1
_.{2}__.?_[0-9]*:_+:2
_.{2}__.?_[0-9]*:_+:3
_.{2}__.?_[0-9]*:_+:4
_.{2}__.?_[0-9]*:_+:100
foo:^[fo]:0
foo:^[fo]:1
foo:^[fo]:2
foo:^[fo]:100
foo_bar_:_?:0
foo_bar_:_?:1
foo_bar_:_?:2
foo_bar_:_?:3
foo_bar_:_?:100

And the code wrapper for processing the test-cases.txt:

awk -F: '
    function maxsplit(s,a,n,fs,    i,j) {
        # ...
    }
    function shell_quote(s) {
        gsub(/\047/,"\047\\\047\047")
        return "\047" s "\047"
    }
    function array_quote(a    ,s,i) {
        s=""
        for (i=1;i in a;i++)
            s = s " " shell_quote(a[i])
        return "(" s " )"
    }
    NR > 1 {
        s  = $1
        fs = $2
        n  = $3
        maxsplit(s,a,n,fs)
        print "s=" shell_quote(s),    \
             "fs=" shell_quote(fs),   \
              "n=" sprintf("%-3d",n), \
              "a=" array_quote(a)
    }
' test-cases.txt

My expected output is:

s='' fs=' ' n=0   a=( )
s='' fs=' ' n=1   a=( )
s='' fs=' ' n=100 a=( )
s='' fs=',' n=0   a=( )
s='' fs=',' n=1   a=( )
s='' fs=',' n=100 a=( )
s='' fs='[ ]' n=0   a=( )
s='' fs='[ ]' n=1   a=( )
s='' fs='[ ]' n=100 a=( )
s=',' fs=',' n=0   a=( ',' )
s=',' fs=',' n=1   a=( '' '' )
s=',' fs=',' n=2   a=( '' '' )
s=',' fs=',' n=100 a=( '' '' )
s='  foo bar   baz    ' fs=' ' n=0   a=( 'foo bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=1   a=( 'foo' 'bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=2   a=( 'foo' 'bar' 'baz    ' )
s='  foo bar   baz    ' fs=' ' n=3   a=( 'foo' 'bar' 'baz' )
s='  foo bar   baz    ' fs=' ' n=100 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=0   a=( 'foo=bar=baz' )
s='foo=bar=baz' fs='=' n=1   a=( 'foo' 'bar=baz' )
s='foo=bar=baz' fs='=' n=2   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=3   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=100 a=( 'foo' 'bar' 'baz' )
s='foo|bar|baz|' fs='|' n=0   a=( 'foo|bar|baz|' )
s='foo|bar|baz|' fs='|' n=1   a=( 'foo' 'bar|baz|' )
s='foo|bar|baz|' fs='|' n=2   a=( 'foo' 'bar' 'baz|' )
s='foo|bar|baz|' fs='|' n=3   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=4   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=100 a=( 'foo' 'bar' 'baz' '' )
s=' foo  bar ' fs='[ ]' n=0   a=( ' foo  bar ' )
s=' foo  bar ' fs='[ ]' n=1   a=( '' 'foo  bar ' )
s=' foo  bar ' fs='[ ]' n=2   a=( '' 'foo' ' bar ' )
s=' foo  bar ' fs='[ ]' n=3   a=( '' 'foo' '' 'bar ' )
s=' foo  bar ' fs='[ ]' n=4   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=5   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=6   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=100 a=( '' 'foo' '' 'bar' '' )
s='_.{2}__.?_[0-9]*' fs='_+' n=0   a=( '_.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=1   a=( '' '.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=2   a=( '' '.{2}' '.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=3   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=4   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=100 a=( '' '.{2}' '.?' '[0-9]*' )
s='foo' fs='^[fo]' n=0   a=( 'foo' )
s='foo' fs='^[fo]' n=1   a=( '' 'oo' )
s='foo' fs='^[fo]' n=2   a=( '' 'oo' )
s='foo' fs='^[fo]' n=100 a=( '' 'oo' )
s='foo_bar_' fs='_?' n=0   a=( 'foo_bar_' )
s='foo_bar_' fs='_?' n=1   a=( 'foo' 'bar_' )
s='foo_bar_' fs='_?' n=2   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=3   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=100 a=( 'foo' 'bar' '' )
Catchfly answered 27/5 at 14:35 Comment(35)
Please provide some concise, testable sample input and expected output that demonstrates your needs and we can copy/paste to test a potential solution against.Motorcade
@EdMorton It's a little hard to be concise as it is a draft for general-purpose function based on the behaviors of AWK's split and Python's str.split. I added a few test-cases but I'm sure I've missed some tricky usagesCatchfly
@EdMorton : side note : maybe I'm missing something, but this code u wrote split($0,tmp); $0 = ""; for ( i=1; i<=(NF-2); i++ ) { $i = tmp[i] } shouldn't run any for( ) loop cycles at all since u just blanked out $0 = "", thus NF auto set to 0, causing 2nd arg of the for( ) loop to check for i <= (0 - 2). The alternative is much worse since any i that actually passes that filtering criteria would be assigning into a negative field number. Maybe just pre-delete the 2 highest indices in tmp[ ] then use the hands-free for ( i in tmp ) { } iterator?Lefthander
@Catchfly : one optimization I could think of is to see if :::::::::::::::::: :::::::::::: ::::::::::::: :::::::: fs == FS && s == $0 && NF <= n. If all 3 are true, then awk's built-in field splitting has already done all the work for you, so you could either directly clone field contents into the array, or straight up split( s, a , fs ). A second optimization would be checking for either n < 1 || s !~ fs. If either is true, then just make a 1 cell array and throw the entire string there :: a[ 1 ] = sLefthander
@Catchfly : you can also obtain an upfront total potential split count with theo_max_n = gsub( fs, "&", s), and clamp n accordingly. Oh one more optimization I could think of - if fs happens to be a string that isn't regex at all (e.g. fs == "<html>"), then use index( ) instead of match( ) each round for major speed gainLefthander
@RAREKpopManifesto Regarding "this code u wrote..." - I didn't [yet] write any code in this question and the code you show in your comment doesn't look familiar so idk what you're referring to. If it's about code I wrote in some other question then please provide a reference.Motorcade
Since your question is tagged posix: delete a is an extension, for(i in a) delete a[i] is how to do it POSIXly.Dubiety
Nothing to do with edge cases but have you considered simplifying your function by using awk's split function followed by concatenating and removing 'excess' elements if the array is longer than n+1, placing them as the as the last element. An advantage of this, besides simplicity, is that you could use negative fs values to fill the array backwards, by concatenating and deleting elements from the start of an over-length array.Ferocious
@Dubiety split("",a) is how to efficiently delete an array POSIXly,Motorcade
@Catchfly compared to POSIX split(), gawk's split() has an extra argument to populate an array of the separators between fields which comes in extremely useful at times. It'd be very little extra effort for you to add such an argument to your function - just something to consider.Motorcade
@DavePritlove I did consider it but I think It will be difficult, maybe impossible, to split and reassemble the last part correctly when fs is a regexCatchfly
@DavePritlove Fravadona's right. Since POSIX split() doesn't save the strings that match the field separators so you can't reconstruct the remainder of the original string plus you can't use negative fs values when fs can't easily be negated, e.g. how do you write code to negate an fs of something like )+(%.*[a-z]|[][]{3,9})* (note that the meaning of a-z is locale-specific!).Motorcade
@EdMorton You're right, I didn't think of adding the GNU "fourth" argument of split, which might prove helpful. For now I'm trying to "validate" the function on edge-cases that I might have overlooked.Catchfly
Regarding delete arr vs split("",arr) btw - although the former is technically not specific by POSIX yet, it will be in an upcoming release of the spec and every modern awk, POSIX or not, supports delete arr so IMO it's not worth avoiding delete arr.Motorcade
@Catchfly yes, fs makes concatenation difficult but instead a substring could be made on $0 once the position was known.Ferocious
@EdMorton I was referring to interpreting negative arguments specifying fs. the - would be removed to create fs but the negation would indicate the count was to be made from the end of the string. As per "At the moment, the effects of a null string as fs value, or a negative number as n value, are unspecified" in the Q. (acually, I misread that to mean -ve possible fs)Ferocious
@DavePritlove it's not so simple to determine the starting position (in bytes) of the last field. A way could be to build a regex using the previous fields, separated with fs, but you'll have to escape the contents before using them as part of a regexCatchfly
@DavePritlove So a negative n would be like splitting the string from the right? That's an interesting idea; I can't see how to implement it easily thoughCatchfly
I don't understand why the first line of expected output is s='' n=0 fs=' ' a=( ) but the 4th is s='' n=0 fs='_' a=( '' ) - why should splitting a null string (s='') produce different output for one fs char (a blank) vs another (an underscore)?Motorcade
@Catchfly yes, that's what I was thinking. Sorry for crowding the discussion, my initial comment was thinking out loud but you've considered and decided your approach. I can't help with edge cases. Good luck!Ferocious
@EdMorton I'm not so sure about this; apart being easier to implement, that's the behavior pf Python's str.split: "".split() => [] and "".split(sep="_") => ['']Catchfly
@Catchfly awk split() for both cases would just produce an empty array so IMO that should be the expected output for both. Unless, of course, you really are trying to recreate pythons split() functionality (you did say "just like Python's str.split") instead of awks in which case it might become quite a different question depending on what else pythons split() does differently from awks.Motorcade
@EdMorton, I should reconsider it then; that's basically a special case that needs to be handled after the for and while loops: if (i == 1 && s == "")Catchfly
I just updated my answer to mention a 2nd case of undefined behavior you'd probably need to handle differently depending on which awk version your function is being called from (or make a decision that may not match the existing split() for that awk) so now there's 2 - 1) null fs, and 2) an fs with \ before an ordinary character.Motorcade
Another dark corner would be handling things like hex escape sequences (\x... and \u...) and shorthand for character classes such as \s that POSIX doesn't specify but the awk implementation your function is being called from (e.g. gawk) might support.Motorcade
I think I'm coming to the reluctant conclusion that the only way to create a portable function that is POSIX compliant but also behaves like the split() in the awk it's being called from is to call split() as there's too many ways that different POSIX compliant awks could behave differently from each other due to how they choose to implement various behaviors that aren't specified by POSIX.Motorcade
@EdMorton escaping a literal string for using in an ERE seems possible: gsub(/[.[\(*^$+?{|]/, "\\\\&", str)Catchfly
@Catchfly it is but you just have to treat \ and ^ differently from every other characters - they get a backslash in front while every other character gets enclosed in [...]. You can't escape every char as some become have a different meaning while most are undefined behavior when escaped while ^ and \ have a different meaning inside a bracket expression. See how I did it in the if ( length(fs) == 1 ) { block in the script in my answer below.Motorcade
Hmm, watch out for a regexp like a\tb - you need the \t in that to remain \t so it matches a literal tab in the input, it can't become \\t or \\\t or \\[t] or [\][t] or anything else.Motorcade
@EdMorton Won't "\t" become a literal TAB when used in for eg. split(s,a,"\t+"). I may be wrong but I don't think we need to care about those escape sequences; plus, we wouldn't be escaping fs but the stored literal stringsCatchfly
@Catchfly that depends how the string is defined. If you do printf 'xa\tby\n' | awk 'function x(fs) {printf "%s\n", fs} BEGIN{fs=ARGV[1]; x(fs); ARGV[ARGC--]=""} {n=split($0,f,ARGV[1]); print n, f[1], f[2]}' 'a\tb', for example, then you'll see in the output that the value of the fs parameter inside your function (x() here) will literally be the 4 characters a\tb and split() will still be able to split the input using that as a tab and match($0,fs) would also still handle \t as a tab in the fs. So AFAIK what we can't do is change a\tb into a\\tb or similar.Motorcade
None of that discussion about \t applies if the literal string you're asking about escaping is just a single character of course.Motorcade
@EdMorton I didn't know that the \ escaping also applies to the literal strings taken from ARGV or $0 when used in split. It seems to be the case only for fs though, and the ERE in match also have this behavior, so I think it's still possible to automatically build an ERE that matches the first N fields: regesc(a[1]) "(" fs ")" regesc(a[2]) "(" fs ")" ... regesc(a[N]) "(" fs ")"Catchfly
@DavePritlove At the end of the day, using split and building a regex that matches the first N fields seems to be the only possible solution; and requires a lot of code: https://mcmap.net/q/1476019/-implementing-a-maxsplit-function-in-posix-awkCatchfly
Thanks for the feedback, interesting. I had a play after last commenting and was able to reconstruct with the parts removed by FS using the 4th optional parameter of split(), which stores the removed separators into an array indexed consecutively like the fields. I think that option is a gawk extension but is available in at least some versions of awk. Usage split(target, fieldarray, separator, sepsarray). I found some oddities where the string began with a separator that might need workaround, but it allowed reconstruction if used carefully.Ferocious
C
1

A while match approach will fail with, for example, maxsplit("ab",_,2,/^[ab]/) because the loop will match a in the first iteration and b in the second one, which isn't what a user expects from a regex like ^[ab]. IMHO there's no simple way to fix this problem.

Instead, here's a different approach, which is to build an ERE that matches the first N fields (after a call to awk's split):

edit1: Duplicated the backslash in the regex of ere_escape, for compatibility with GNU awk (thanks @EdMorton)
edit2: corrected a typo bug found by @EdMorton

UPDATE: I'm using ere_parenthesize from here for enclosing fs in parentheses safely.

function ere_parenthesize(...) {
    # see https://mcmap.net/q/1482547/-awk-function-for-enclosing-an-arbitrary-regex-in-parentheses
}
function ere_escape(s) {
   gsub(/[.[\\(*^$+?{|]/, "\\\\&", s)
   return s
}
function maxsplit(str,arr,nf,fs    ,n,i,r) {
    if (fs == 0 && fs == "")
        fs = FS
    n = split(str, arr, fs)
    if (n > nf) {
        r = ""
        if (length(fs) == 1) {
            if (fs == " ") {
                fs = "[[:space:]]+"
                r = "^[[:space:]]*"
            } else
                fs = ere_escape(fs)
        }
        fs = ere_parenthesize(fs)
        for (i = 1; i <= nf; i++) {
            r = r ere_escape(arr[i]) fs
        }
        sub(r,"",str)
        arr[nf+1] = str
        for (i = nf+2; i <= n; i++) {
            delete arr[i]
        }
        n = nf+1
    }
    return n
}

Here are the results:

s='' fs=' ' n=0   a=( )
s='' fs=' ' n=1   a=( )
s='' fs=' ' n=100 a=( )
s='' fs=',' n=0   a=( )
s='' fs=',' n=1   a=( )
s='' fs=',' n=100 a=( )
s='' fs='[ ]' n=0   a=( )
s='' fs='[ ]' n=1   a=( )
s='' fs='[ ]' n=100 a=( )
s=',' fs=',' n=0   a=( ',' )
s=',' fs=',' n=1   a=( '' '' )
s=',' fs=',' n=2   a=( '' '' )
s=',' fs=',' n=100 a=( '' '' )
s='  foo bar   baz    ' fs=' ' n=0   a=( 'foo bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=1   a=( 'foo' 'bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=2   a=( 'foo' 'bar' 'baz    ' )
s='  foo bar   baz    ' fs=' ' n=3   a=( 'foo' 'bar' 'baz' )
s='  foo bar   baz    ' fs=' ' n=100 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=0   a=( 'foo=bar=baz' )
s='foo=bar=baz' fs='=' n=1   a=( 'foo' 'bar=baz' )
s='foo=bar=baz' fs='=' n=2   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=3   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=100 a=( 'foo' 'bar' 'baz' )
s='foo|bar|baz|' fs='|' n=0   a=( 'foo|bar|baz|' )
s='foo|bar|baz|' fs='|' n=1   a=( 'foo' 'bar|baz|' )
s='foo|bar|baz|' fs='|' n=2   a=( 'foo' 'bar' 'baz|' )
s='foo|bar|baz|' fs='|' n=3   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=4   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=100 a=( 'foo' 'bar' 'baz' '' )
s=' foo  bar ' fs='[ ]' n=0   a=( ' foo  bar ' )
s=' foo  bar ' fs='[ ]' n=1   a=( '' 'foo  bar ' )
s=' foo  bar ' fs='[ ]' n=2   a=( '' 'foo' ' bar ' )
s=' foo  bar ' fs='[ ]' n=3   a=( '' 'foo' '' 'bar ' )
s=' foo  bar ' fs='[ ]' n=4   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=5   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=6   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=100 a=( '' 'foo' '' 'bar' '' )
s='_.{2}__.?_[0-9]*' fs='_+' n=0   a=( '_.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=1   a=( '' '.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=2   a=( '' '.{2}' '.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=3   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=4   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=100 a=( '' '.{2}' '.?' '[0-9]*' )
s='foo' fs='^[fo]' n=0   a=( 'foo' )
s='foo' fs='^[fo]' n=1   a=( '' 'oo' )
s='foo' fs='^[fo]' n=2   a=( '' 'oo' )
s='foo' fs='^[fo]' n=100 a=( '' 'oo' )
s='foo_bar_' fs='_?' n=0   a=( 'foo_bar_' )
s='foo_bar_' fs='_?' n=1   a=( 'foo' 'bar_' )
s='foo_bar_' fs='_?' n=2   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=3   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=100 a=( 'foo' 'bar' '' )
Catchfly answered 28/5 at 20:2 Comment(16)
I believe it's fine to double the \ in /[.[\(*^$+?{|]/ for ensuring compatibility, as with strict POSIX it will be treated as two literal \, which doesn't hurtCatchfly
If you want to write an answer with GNU awk I'll upvote it as it'll still be useful to have a maxsplit for GNU awk.Catchfly
When I put your code in a file named frav1.awk, put a simple wrapper script of { nf=maxsplit($0,flds,max); print nf; for (i=1; i<=nf; i++) print i, flds[i] } in a file named wrap.awk and then call it as echo 'a b c d' | gawk -v max=1 -f frav1.awk -f wrap.awk for a max of 1 or 2 I get the full input string back as the final array element.Motorcade
@EdMorton your GNU maxsplit works fine with the set of tests ;-)Catchfly
but now I ran into another problem - r = r ere_escape(arr[i]) "(" fs ")" is obviously putting parens around the fs but what if the fs contains a closing paren? Then the ) within fs closes the ( in "(" fs ")" and so r can't match in str. Try echo 'bar)bar)bar)' | awk -F')b' -v max=1 -f frav1.awk -f wrap.awk and it'll put the whole input line into arr[2] because r gets set to bar()b) and there is no b) in str. Maybe you just need to not add the parens and use r = r ere_escape(arr[i]) fs instead? That works for the couple of tests I ran.Motorcade
@EdMorton Aaarggg, a closing ) in fs won't be easy to fix...Catchfly
Unfortunately no, a simple fs="aa|bb" will break the regex if not enclosed in parenthesisCatchfly
Yeah, I'm not coming up with a good way to handle that. I thought maybe looping char by char through fs and escaping an unmatched ) but then you'd need to count any \ sequence preceding the ) and only escape the ) if that count was an even number and similarly test for ) possibly being inside a bracket expression and count escapes before the [ if so, etc. A real mess.Motorcade
Just brainstorming - what if we did something like (not complete) n=1; for (i=1; i<=length(str); i++) { if (split(substr(str,1,i),a,fs) == n) { n++; flds[n]=a[n]; seps[n]=substr(str,length(a[n]+1) } } i.e. split() an ever growing substring of str counting the number of split() results to give us the number of fields and then using length() to give us the strings that matched the fs in seps[]?Motorcade
No, that'd fail with a fs like [^x]+ as it'd match too few chars given input like xabx.Motorcade
Maybe special-case an fs that starts with ^ to only try to split at the start of the string but for any other fs loop matching it in the remainder of str after a leading previous match()ing string is removed? That should address your earlier comment and sounds promising running scenarios through my head...Motorcade
Ah, but then we'd need to understand that in an fs like (((^x))) the ^ would still be an anchor we need to special-case despite not being the first char in the fs and we'd need to consider (^x)|y, etc.Motorcade
If we could implement gawks split() in a POSIX awk then we could implement your maxsplit() in a POSIX awk. It might be worth one of us posting a question about how to do THAT since it's a more general problem with 1 less complication than this one (the max splits) and enumerate all of the considerations, things we've tried and in what ways they fail given various input and FS values. I could ask the GNU awk developers to take a look at that as they've considered it in the past (hence their split()!). We'd end up with at least a best-possible solution with stated caveats/considerations.Motorcade
I'm out of time to do any more with this just now but if you posted the question I'd take a look later and edit it if necessary to add any information you may have missed (we've covered a LOT of ground). Otherwise I might post it myself later - just in a bit of a time crunch with the job that actually pays the bills right now.Motorcade
If you do post the question, I'd recommend just providing sample input/output that covers the problematic use cases - the current input/output in this question is too lengthy for me at least to be able to understand what it's telling me without a LOT of time spent considering it all (so I've mostly ignored it to be honest) and it doesn't test all of the problems we've encountered.Motorcade
@EdMorton I posted a question about enclosing an ERE in parenthesis: https://mcmap.net/q/1482547/-awk-function-for-enclosing-an-arbitrary-regex-in-parentheses/3387716 ; that's not exactly what you were thinking about.Catchfly
D
4

what other "border-cases" exist that should be handled?

I would point to n being equal 0, in python this result in list with one-element which is whole text, e.g.

text = "Able Baker Charlie"
split = text.split(maxsplit=0)
print(split)  # ['Able Baker Charlie']
Depersonalize answered 27/5 at 15:29 Comment(1)
There was indeed a problem with n=0 (fixed the regex in the sub). +1Catchfly
H
4

A bit long for a comment ...

Assuming there are no limitations on what can be stored in fs ...

Sample data:

$ cat f1
a1,b2,c3,d4,e5
zee9,zee8,zee7,zee6,zee5

Sample awk script:

$ cat maxsplit.awk

function maxsplit(s,a,n,fs,    i,j) {
   ### see question
}

{ n = maxsplit($0,a,n_in,fs_in)
  for (i=1;i<=n;i++)
      print i,a[i]
}

Test cases:


GOOD - no split since there's no match on fs = '/,':

$ awk -v n_in=3 -v fs_in='/,' -f maxsplit.awk f1
1 a1,b2,c3,d4,e5
1 zee9,zee8,zee7,zee6,zee5

BAD - fs = '<anychar>?' - function reports 4 elements in array while entire line is stored in last array element:

$ awk -v n_in=3 -v fs_in='/?' -f maxsplit.awk f1
$ awk -v n_in=3 -v fs_in=',?' -f maxsplit.awk f1
$ awk -v n_in=3 -v fs_in='x?' -f maxsplit.awk f1

#### all generate the same result:

1
2
3
4 a1,b2,c3,d4,e5
1
2
3
4 zee9,zee8,zee7,zee6,zee5

NOTES:

  • split() actually splits on , when using ,? as the delimiter but in other cases ...
  • split() treats <anychar>? as 2 literal characters so the line is not split

GOOD (?) - single period treated as literal period (ie, no match so no split):

$ awk -v n_in=3 -v fs_in='.' -f maxsplit.awk f1
1 a1,b2,c3,d4,e5
1 zee9,zee8,zee7,zee6,zee5

BAD (?) - but double period treated as wildcards:

$ awk -v n_in=3 -v fs_in='..' -f maxsplit.awk f1
1
2
3
4 c3,d4,e5
1
2
3
4 ee8,zee7,zee6,zee5

NOTE: split() behaves similarly re: single period vs double period


GOOD (surprisingly?) - splits correctly on fs set to a character class:

$ awk -v n_in=3 -v fs_in='[^[:alnum:]]' -f maxsplit.awk f1
1 a1
2 b2
3 c3
4 d4,e5
1 zee9
2 zee8
3 zee7
4 zee6,zee5
Housemaster answered 27/5 at 17:52 Comment(0)
M
4

These are the types of Field Separator (FS) to consider, off the top of my head:

  1. A single blank character (the default) - ignore all leading/training white space and then split at sequences of contiguous white space.
  2. Any other single character (treated as a literal character) when the RS is NOT null or we are using split() - split at that literal character.
  3. Any other single character (treated as a literal character) when the RS IS null and we are doing built-in field splitting of $0 into $1, $2, etc., i.e. not calling split() - split at that literal character and also at newlines.
  4. Any multi-character string (treated like a dynamic regexp) - split at strings that match that regexp.
  5. A null string - undefined behavior by POSIX so some awks will split into characters, others will treat the input as a single field, and (unlikely) yet others could do anything else, including producing warning messages and/or fail with a fatal error.
  6. A backslash before a character that is not a regexp metacharacter in the field separator - undefined behavior by POSIX so some awks will ignore it (so a\qc is treated the same as aqc) while others will treat it literally (so a\qc is treated the same as a\\qc) and yet others could do anything else, including producing warning messages (which gawk would do) and/or (unlikely) fail with a fatal error.

Your code should probably do whatever the split() in the awk version calling it would do in the "undefined behavior" cases above which probably means calling some exiting function to see how it handles each (e.g. calling split() to entirely handle the fs="" case, and calling `match("aqc",/a\qc/) or similar to test the backslash case (but that'd produce a warning message from gawk at least).

A couple of thoughts on your code:

  1. You might want to change this:

if (fs == "")
    fs = FS

to this:


if (fs == 0 && fs == "")
    fs = FS

to differentiate between no arg passed in vs the null string passed in as an arg,

  1. This case:
if (length(fs) == 1)

could be split into 2 - one to handle when RS is null (and so you need to add |\n+ to the FS and do regexp matching) and the other when it's not null (and so you do literal string matching with index()) IF you wanted to split the way $0 is split into $1, $2, etc, but if you want to do it the way the split() function does then it's fine as-is.

FWIW it'd need more work but in case it gives you any ideas, here's how I'd start writing a function like this:

function maxsplit(str, flds, max, fs, seps,     i, nf) {
    delete flds
    delete seps

    if ( (fs == 0) && (fs == "") ) {
        fs = FS
    }

    if ( fs == " " ) {
        fs = "[[:space:]]+"
        if ( match(str,"^"fs) ) {
            seps[0] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }
    else if ( length(fs) == 1 ) {
        if ( (fs == "^") || (fs == "\\") ) {
            fs = "\\" fs
        }
        else {
            fs = "[" fs "]"
        }
    }

    if ( fs == "" ) {
        nf = split(str,flds,fs)
        for ( i=0; i<=nf; i++ ) {
            seps[i] = ""
        }
    }
    else {
        while ( (nf <= max) && match(str,fs) ) {
            ++nf
            flds[nf] = substr(str,1,RSTART-1)
            seps[nf] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }
    return nf+0
}

I'm calling split() when the fs is null because that's undefined behavior per POSIX so by doing it this way my function will just do whatever the version of awk it's being called from would do.

I'm also populating a seps[] array like gawk does as I think that'd be useful.

Input of foo_ with fs="_?" is an interesting case:

$ awk 'BEGIN{str="foo_"; n=split(str,f,"_?"); for (i in f) print n, i, "<"f[i]">"}'
2 1 <foo>
2 2 <>

showing that the _? only matches at the end of str when calling split() while

$ awk 'BEGIN{str="foo_"; match(str,/_?/); print RSTART, RLENGTH, "<"substr(str,RSTART,RLENGTH)">"}'
1 0 <>

shows that the _? matches the null string before the first character of str when calling match(). So a simple while(match()) loop as I have in my code above so far won't work. I expect a test on RLENGTH == 0 inside the loop can handle it but I'll have to think about that if/when I have a bit more time.

As requested in a comment, here's a gawk script that I think does what you want but I've only tested it with the one input string and FS used below:

$ cat tst.awk
function maxsplit(str, flds, max, fs, seps,    tail, nf)  {
    nf = split(str,flds,fs,seps)
    if ( nf > max ) {
        for ( ; nf > max; nf-- ) {
            tail = flds[nf] seps[nf] tail
            delete flds[nf]
            delete seps[nf]
        }
        flds[++nf] = tail
    }
    return nf
}

{
    nf = maxsplit($0, flds, max, FS, seps)

    printf "\t\tseps[0]=\"%s\"\n", seps[0]
    for ( i=1; i<=nf; i++ ) {
        printf "flds[%d]=\"%s\",\tseps[%d]=\"%s\"\n", i, flds[i], i, seps[i]
    }
}

$ echo '  a b   c  d   e  ' | awk -v max=3 -f tst.awk
                seps[0]="  "
flds[1]="a",    seps[1]=" "
flds[2]="b",    seps[2]="   "
flds[3]="c",    seps[3]="  "
flds[4]="d   e  ",      seps[4]=""

$ echo '  a b   c  d   e  ' | awk -v max=4 -f tst.awk
                seps[0]="  "
flds[1]="a",    seps[1]=" "
flds[2]="b",    seps[2]="   "
flds[3]="c",    seps[3]="  "
flds[4]="d",    seps[4]="   "
flds[5]="e  ",  seps[5]=""

$ echo '  a b   c  d   e  ' | awk -v max=5 -f tst.awk
                seps[0]="  "
flds[1]="a",    seps[1]=" "
flds[2]="b",    seps[2]="   "
flds[3]="c",    seps[3]="  "
flds[4]="d",    seps[4]="   "
flds[5]="e",    seps[5]="  "

$ echo '  a b   c  d   e  ' | awk -v max=6 -f tst.awk
                seps[0]="  "
flds[1]="a",    seps[1]=" "
flds[2]="b",    seps[2]="   "
flds[3]="c",    seps[3]="  "
flds[4]="d",    seps[4]="   "
flds[5]="e",    seps[5]="  "
Motorcade answered 28/5 at 1:15 Comment(0)
C
1

A while match approach will fail with, for example, maxsplit("ab",_,2,/^[ab]/) because the loop will match a in the first iteration and b in the second one, which isn't what a user expects from a regex like ^[ab]. IMHO there's no simple way to fix this problem.

Instead, here's a different approach, which is to build an ERE that matches the first N fields (after a call to awk's split):

edit1: Duplicated the backslash in the regex of ere_escape, for compatibility with GNU awk (thanks @EdMorton)
edit2: corrected a typo bug found by @EdMorton

UPDATE: I'm using ere_parenthesize from here for enclosing fs in parentheses safely.

function ere_parenthesize(...) {
    # see https://mcmap.net/q/1482547/-awk-function-for-enclosing-an-arbitrary-regex-in-parentheses
}
function ere_escape(s) {
   gsub(/[.[\\(*^$+?{|]/, "\\\\&", s)
   return s
}
function maxsplit(str,arr,nf,fs    ,n,i,r) {
    if (fs == 0 && fs == "")
        fs = FS
    n = split(str, arr, fs)
    if (n > nf) {
        r = ""
        if (length(fs) == 1) {
            if (fs == " ") {
                fs = "[[:space:]]+"
                r = "^[[:space:]]*"
            } else
                fs = ere_escape(fs)
        }
        fs = ere_parenthesize(fs)
        for (i = 1; i <= nf; i++) {
            r = r ere_escape(arr[i]) fs
        }
        sub(r,"",str)
        arr[nf+1] = str
        for (i = nf+2; i <= n; i++) {
            delete arr[i]
        }
        n = nf+1
    }
    return n
}

Here are the results:

s='' fs=' ' n=0   a=( )
s='' fs=' ' n=1   a=( )
s='' fs=' ' n=100 a=( )
s='' fs=',' n=0   a=( )
s='' fs=',' n=1   a=( )
s='' fs=',' n=100 a=( )
s='' fs='[ ]' n=0   a=( )
s='' fs='[ ]' n=1   a=( )
s='' fs='[ ]' n=100 a=( )
s=',' fs=',' n=0   a=( ',' )
s=',' fs=',' n=1   a=( '' '' )
s=',' fs=',' n=2   a=( '' '' )
s=',' fs=',' n=100 a=( '' '' )
s='  foo bar   baz    ' fs=' ' n=0   a=( 'foo bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=1   a=( 'foo' 'bar   baz    ' )
s='  foo bar   baz    ' fs=' ' n=2   a=( 'foo' 'bar' 'baz    ' )
s='  foo bar   baz    ' fs=' ' n=3   a=( 'foo' 'bar' 'baz' )
s='  foo bar   baz    ' fs=' ' n=100 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=0   a=( 'foo=bar=baz' )
s='foo=bar=baz' fs='=' n=1   a=( 'foo' 'bar=baz' )
s='foo=bar=baz' fs='=' n=2   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=3   a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=100 a=( 'foo' 'bar' 'baz' )
s='foo|bar|baz|' fs='|' n=0   a=( 'foo|bar|baz|' )
s='foo|bar|baz|' fs='|' n=1   a=( 'foo' 'bar|baz|' )
s='foo|bar|baz|' fs='|' n=2   a=( 'foo' 'bar' 'baz|' )
s='foo|bar|baz|' fs='|' n=3   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=4   a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=100 a=( 'foo' 'bar' 'baz' '' )
s=' foo  bar ' fs='[ ]' n=0   a=( ' foo  bar ' )
s=' foo  bar ' fs='[ ]' n=1   a=( '' 'foo  bar ' )
s=' foo  bar ' fs='[ ]' n=2   a=( '' 'foo' ' bar ' )
s=' foo  bar ' fs='[ ]' n=3   a=( '' 'foo' '' 'bar ' )
s=' foo  bar ' fs='[ ]' n=4   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=5   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=6   a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=100 a=( '' 'foo' '' 'bar' '' )
s='_.{2}__.?_[0-9]*' fs='_+' n=0   a=( '_.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=1   a=( '' '.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=2   a=( '' '.{2}' '.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=3   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=4   a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=100 a=( '' '.{2}' '.?' '[0-9]*' )
s='foo' fs='^[fo]' n=0   a=( 'foo' )
s='foo' fs='^[fo]' n=1   a=( '' 'oo' )
s='foo' fs='^[fo]' n=2   a=( '' 'oo' )
s='foo' fs='^[fo]' n=100 a=( '' 'oo' )
s='foo_bar_' fs='_?' n=0   a=( 'foo_bar_' )
s='foo_bar_' fs='_?' n=1   a=( 'foo' 'bar_' )
s='foo_bar_' fs='_?' n=2   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=3   a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=100 a=( 'foo' 'bar' '' )
Catchfly answered 28/5 at 20:2 Comment(16)
I believe it's fine to double the \ in /[.[\(*^$+?{|]/ for ensuring compatibility, as with strict POSIX it will be treated as two literal \, which doesn't hurtCatchfly
If you want to write an answer with GNU awk I'll upvote it as it'll still be useful to have a maxsplit for GNU awk.Catchfly
When I put your code in a file named frav1.awk, put a simple wrapper script of { nf=maxsplit($0,flds,max); print nf; for (i=1; i<=nf; i++) print i, flds[i] } in a file named wrap.awk and then call it as echo 'a b c d' | gawk -v max=1 -f frav1.awk -f wrap.awk for a max of 1 or 2 I get the full input string back as the final array element.Motorcade
@EdMorton your GNU maxsplit works fine with the set of tests ;-)Catchfly
but now I ran into another problem - r = r ere_escape(arr[i]) "(" fs ")" is obviously putting parens around the fs but what if the fs contains a closing paren? Then the ) within fs closes the ( in "(" fs ")" and so r can't match in str. Try echo 'bar)bar)bar)' | awk -F')b' -v max=1 -f frav1.awk -f wrap.awk and it'll put the whole input line into arr[2] because r gets set to bar()b) and there is no b) in str. Maybe you just need to not add the parens and use r = r ere_escape(arr[i]) fs instead? That works for the couple of tests I ran.Motorcade
@EdMorton Aaarggg, a closing ) in fs won't be easy to fix...Catchfly
Unfortunately no, a simple fs="aa|bb" will break the regex if not enclosed in parenthesisCatchfly
Yeah, I'm not coming up with a good way to handle that. I thought maybe looping char by char through fs and escaping an unmatched ) but then you'd need to count any \ sequence preceding the ) and only escape the ) if that count was an even number and similarly test for ) possibly being inside a bracket expression and count escapes before the [ if so, etc. A real mess.Motorcade
Just brainstorming - what if we did something like (not complete) n=1; for (i=1; i<=length(str); i++) { if (split(substr(str,1,i),a,fs) == n) { n++; flds[n]=a[n]; seps[n]=substr(str,length(a[n]+1) } } i.e. split() an ever growing substring of str counting the number of split() results to give us the number of fields and then using length() to give us the strings that matched the fs in seps[]?Motorcade
No, that'd fail with a fs like [^x]+ as it'd match too few chars given input like xabx.Motorcade
Maybe special-case an fs that starts with ^ to only try to split at the start of the string but for any other fs loop matching it in the remainder of str after a leading previous match()ing string is removed? That should address your earlier comment and sounds promising running scenarios through my head...Motorcade
Ah, but then we'd need to understand that in an fs like (((^x))) the ^ would still be an anchor we need to special-case despite not being the first char in the fs and we'd need to consider (^x)|y, etc.Motorcade
If we could implement gawks split() in a POSIX awk then we could implement your maxsplit() in a POSIX awk. It might be worth one of us posting a question about how to do THAT since it's a more general problem with 1 less complication than this one (the max splits) and enumerate all of the considerations, things we've tried and in what ways they fail given various input and FS values. I could ask the GNU awk developers to take a look at that as they've considered it in the past (hence their split()!). We'd end up with at least a best-possible solution with stated caveats/considerations.Motorcade
I'm out of time to do any more with this just now but if you posted the question I'd take a look later and edit it if necessary to add any information you may have missed (we've covered a LOT of ground). Otherwise I might post it myself later - just in a bit of a time crunch with the job that actually pays the bills right now.Motorcade
If you do post the question, I'd recommend just providing sample input/output that covers the problematic use cases - the current input/output in this question is too lengthy for me at least to be able to understand what it's telling me without a LOT of time spent considering it all (so I've mostly ignored it to be honest) and it doesn't test all of the problems we've encountered.Motorcade
@EdMorton I posted a question about enclosing an ERE in parenthesis: https://mcmap.net/q/1482547/-awk-function-for-enclosing-an-arbitrary-regex-in-parentheses/3387716 ; that's not exactly what you were thinking about.Catchfly

© 2022 - 2024 — McMap. All rights reserved.