I'm trying to implement an awk function for splitting a string into an array; the fundamental difference with the built-in split is that one is able to limit the number of "splits" that are performed, just like in Python's str.split.
The prototype would be something like this:
maxsplit(s, a, n[, fs ])
Split the string s into array elements, performing at most n splits (thus, the array will have at most n+1 elements a[1], a[2], ..., a[n+1]), and return the actual number of elements in the array. All elements of the array shall be deleted before the split is performed. The separation shall be done with the ERE fs, or with the field separator FS when fs is not provided.
At the moment, the effects of a null string as fs value, or a negative number as n value, are unspecified.
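
To make the "when fs is not provided" part concrete, here is a minimal sketch of one way to tell a missing argument from an explicit one. It relies on an unset awk variable comparing equal to both 0 and "". The helper name provided() is hypothetical and only for illustration. Note that an unset variable passed explicitly would also look "not provided".

```shell
out=$(awk '
# Hypothetical helper: 1 if the caller supplied a value, 0 if the parameter is still unset.
# Only an unset variable compares equal to both the number 0 and the empty string.
function provided(x) { return !(x == 0 && x == "") }
BEGIN {
    # no argument, empty string, number zero, single space
    print provided(), provided(""), provided(0), provided(" ")
}')
printf "%s\n" "$out"
```

This should print 0 1 1 1, so an explicit empty string stays distinguishable from an omitted fs.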
Here's the code I came up with:
edit1: fixed the border cases pointed out by @markp-fuso and @Daweo
edit2: added handling of a non-provided fs argument (thanks @EdMorton)
function maxsplit(s,a,n,fs,    i,j) {
    delete a
    if (fs == 0 && fs == "")
        fs = FS
    if (fs == " ")
    {
        for (i = 1; (i <= n) && match(s,/[^[:space:]]+/); i++) {
            a[i] = substr(s,RSTART,RLENGTH)
            s = substr(s,RSTART+RLENGTH)
        }
        sub(/^[[:space:]]+/, "", s)
        if (s != "")
            a[i] = s
        else
            --i
    }
    else if (length(fs) == 1)
    {
        for (i = 1; (i <= n) && (j = index(s,fs)); i++) {
            a[i] = substr(s,1,j-1)
            s = substr(s,j+1)
        }
        a[i] = s
    }
    else
    {
        i = 1
        while ( (i <= n) && (s != "") && match(s,fs) ) {
            if (RLENGTH) {
                a[i] = a[i] substr(s,1,RSTART-1)
                s = substr(s,RSTART+RLENGTH)
                ++i
            } else {
                a[i] = a[i] substr(s,1,1)
                s = substr(s,2)
            }
        }
        a[i] = s
    }
    return i
}
As you can see, I wrote multiple code paths for handling the different types of fs. My question is: what other edge cases need to be handled?
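
For context on why three code paths are needed, here is a small sketch of how the built-in split treats the three kinds of separator that maxsplit has to mimic. The sample string is made up for illustration.

```shell
out=$(awk 'BEGIN {
    s = " foo  bar|baz "
    n1 = split(s, a, " ")      # single space: runs of blanks separate, edges are trimmed
    n2 = split(s, b, "|")      # any other single character: taken literally
    n3 = split(s, c, "[ ]")    # anything longer: an ERE, here exactly one space
    print n1, n2, n3
    print "<" a[1] "> <" b[1] "> <" c[1] ">"
}')
printf "%s\n" "$out"
```

The field counts come out as 2 2 5, and the first elements as <foo> < foo  bar> <>: only the ERE form keeps the empty leading field.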
Here's a set of test cases:
edit1: reordered the input/output fields and added a few more tests.
edit2: converted the input to a colon-delimited format (instead of xargs format), and added a few more tests.
note: I'm still looking for test-cases that could be tricky to handle.
#inputString:fieldSep:maxSplit
: :0
: :1
: :100
:,:0
:,:1
:,:100
:[ ]:0
:[ ]:1
:[ ]:100
,:,:0
,:,:1
,:,:2
,:,:100
 foo bar baz : :0
 foo bar baz : :1
 foo bar baz : :2
 foo bar baz : :3
 foo bar baz : :100
foo=bar=baz:=:0
foo=bar=baz:=:1
foo=bar=baz:=:2
foo=bar=baz:=:3
foo=bar=baz:=:100
foo|bar|baz|:|:0
foo|bar|baz|:|:1
foo|bar|baz|:|:2
foo|bar|baz|:|:3
foo|bar|baz|:|:4
foo|bar|baz|:|:100
 foo  bar :[ ]:0
 foo  bar :[ ]:1
 foo  bar :[ ]:2
 foo  bar :[ ]:3
 foo  bar :[ ]:4
 foo  bar :[ ]:5
 foo  bar :[ ]:6
 foo  bar :[ ]:100
_.{2}__.?_[0-9]*:_+:0
_.{2}__.?_[0-9]*:_+:1
_.{2}__.?_[0-9]*:_+:2
_.{2}__.?_[0-9]*:_+:3
_.{2}__.?_[0-9]*:_+:4
_.{2}__.?_[0-9]*:_+:100
foo:^[fo]:0
foo:^[fo]:1
foo:^[fo]:2
foo:^[fo]:100
foo_bar_:_?:0
foo_bar_:_?:1
foo_bar_:_?:2
foo_bar_:_?:3
foo_bar_:_?:100
And the code wrapper for processing the test-cases.txt:
awk -F: '
    function maxsplit(s,a,n,fs,    i,j) {
        # ...
    }
    function shell_quote(s) {
        # quote s for the shell; the third argument makes gsub act on s, not $0
        gsub(/\047/,"\047\\\047\047",s)
        return "\047" s "\047"
    }
    function array_quote(a    ,s,i) {
        s=""
        for (i=1;i in a;i++)
            s = s " " shell_quote(a[i])
        return "(" s " )"
    }
    NR > 1 {
        s = $1
        fs = $2
        n = $3
        maxsplit(s,a,n,fs)
        print "s=" shell_quote(s), \
              "fs=" shell_quote(fs), \
              "n=" sprintf("%-3d",n), \
              "a=" array_quote(a)
    }
' test-cases.txt
My expected output is:
s='' fs=' ' n=0 a=( )
s='' fs=' ' n=1 a=( )
s='' fs=' ' n=100 a=( )
s='' fs=',' n=0 a=( )
s='' fs=',' n=1 a=( )
s='' fs=',' n=100 a=( )
s='' fs='[ ]' n=0 a=( )
s='' fs='[ ]' n=1 a=( )
s='' fs='[ ]' n=100 a=( )
s=',' fs=',' n=0 a=( ',' )
s=',' fs=',' n=1 a=( '' '' )
s=',' fs=',' n=2 a=( '' '' )
s=',' fs=',' n=100 a=( '' '' )
s=' foo bar baz ' fs=' ' n=0 a=( 'foo bar baz ' )
s=' foo bar baz ' fs=' ' n=1 a=( 'foo' 'bar baz ' )
s=' foo bar baz ' fs=' ' n=2 a=( 'foo' 'bar' 'baz ' )
s=' foo bar baz ' fs=' ' n=3 a=( 'foo' 'bar' 'baz' )
s=' foo bar baz ' fs=' ' n=100 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=0 a=( 'foo=bar=baz' )
s='foo=bar=baz' fs='=' n=1 a=( 'foo' 'bar=baz' )
s='foo=bar=baz' fs='=' n=2 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=3 a=( 'foo' 'bar' 'baz' )
s='foo=bar=baz' fs='=' n=100 a=( 'foo' 'bar' 'baz' )
s='foo|bar|baz|' fs='|' n=0 a=( 'foo|bar|baz|' )
s='foo|bar|baz|' fs='|' n=1 a=( 'foo' 'bar|baz|' )
s='foo|bar|baz|' fs='|' n=2 a=( 'foo' 'bar' 'baz|' )
s='foo|bar|baz|' fs='|' n=3 a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=4 a=( 'foo' 'bar' 'baz' '' )
s='foo|bar|baz|' fs='|' n=100 a=( 'foo' 'bar' 'baz' '' )
s=' foo  bar ' fs='[ ]' n=0 a=( ' foo  bar ' )
s=' foo  bar ' fs='[ ]' n=1 a=( '' 'foo  bar ' )
s=' foo  bar ' fs='[ ]' n=2 a=( '' 'foo' ' bar ' )
s=' foo  bar ' fs='[ ]' n=3 a=( '' 'foo' '' 'bar ' )
s=' foo  bar ' fs='[ ]' n=4 a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=5 a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=6 a=( '' 'foo' '' 'bar' '' )
s=' foo  bar ' fs='[ ]' n=100 a=( '' 'foo' '' 'bar' '' )
s='_.{2}__.?_[0-9]*' fs='_+' n=0 a=( '_.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=1 a=( '' '.{2}__.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=2 a=( '' '.{2}' '.?_[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=3 a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=4 a=( '' '.{2}' '.?' '[0-9]*' )
s='_.{2}__.?_[0-9]*' fs='_+' n=100 a=( '' '.{2}' '.?' '[0-9]*' )
s='foo' fs='^[fo]' n=0 a=( 'foo' )
s='foo' fs='^[fo]' n=1 a=( '' 'oo' )
s='foo' fs='^[fo]' n=2 a=( '' 'oo' )
s='foo' fs='^[fo]' n=100 a=( '' 'oo' )
s='foo_bar_' fs='_?' n=0 a=( 'foo_bar_' )
s='foo_bar_' fs='_?' n=1 a=( 'foo' 'bar_' )
s='foo_bar_' fs='_?' n=2 a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=3 a=( 'foo' 'bar' '' )
s='foo_bar_' fs='_?' n=100 a=( 'foo' 'bar' '' )
… split and Python's str.split. I added a few test-cases but I'm sure I've missed some tricky usages – Catchfly
… split($0,tmp); $0 = ""; for ( i=1; i<=(NF-2); i++ ) { $i = tmp[i] } shouldn't run any for( ) loop cycles at all, since you just blanked out $0 with $0 = "", so NF is automatically set to 0, causing the 2nd expression of the for( ) loop to check for i <= (0 - 2). The alternative is much worse, since any i that actually passes that filtering criterion would be assigning into a negative field number. Maybe just pre-delete the 2 highest indices in tmp[ ], then use the hands-free for ( i in tmp ) { } iterator? – Lefthander
… fs == FS && s == $0 && NF <= n. If all 3 are true, then awk's built-in field splitting has already done all the work for you, so you could either directly clone the field contents into the array, or straight up split( s, a, fs ). A second optimization would be checking for n < 1 || s !~ fs. If either is true, then just make a 1-cell array and throw the entire string in there: a[ 1 ] = s – Lefthander
… theo_max_n = gsub( fs, "&", s), and clamp n accordingly. Oh, one more optimization I could think of: if fs happens to be a string that isn't a regex at all (e.g. fs == "<html>"), then use index( ) instead of match( ) each round for a major speed gain – Lefthander
… posix: delete a is an extension; for(i in a) delete a[i] is how to do it POSIXly. – Dubiety
… split("",a)
is how to efficiently delete an array POSIXly. – Motorcade
… split(), gawk's split() has an extra argument to populate an array of the separators between fields, which comes in extremely useful at times. It'd be very little extra effort for you to add such an argument to your function - just something to consider. – Motorcade
… fs is a regex – Catchfly
… split() doesn't save the strings that match the field separators, so you can't reconstruct the remainder of the original string; plus, you can't use negative fs values when fs can't easily be negated, e.g. how do you write code to negate an fs of something like )+(%.*[a-z]|[][]{3,9})* (note that the meaning of a-z is locale-specific!). – Motorcade
… split, which might prove helpful. For now I'm trying to "validate" the function on edge-cases that I might have overlooked. – Catchfly
… delete arr vs split("",arr) btw - although the former is technically not specified by POSIX yet, it will be in an upcoming release of the spec and every modern awk, POSIX or not, supports delete arr, so IMO it's not worth avoiding delete arr. – Motorcade
… - would be removed to create fs but the negation would indicate the count was to be made from the end of the string. As per "At the moment, the effects of a null string as fs value, or a negative number as n value, are unspecified" in the Q. (actually, I misread that to mean -ve possible fs) – Ferocious
… fs, but you'll have to escape the contents before using them as part of a regex – Catchfly
… n would be like splitting the string from the right? That's an interesting idea; I can't see how to implement it easily though – Catchfly
… s='' n=0 fs=' ' a=( ) but the 4th is s='' n=0 fs='_' a=( '' ) - why should splitting a null string (s='') produce different output for one fs char (a blank) vs another (an underscore)? – Motorcade
… str.split: "".split() => [] and "".split(sep="_") => [''] – Catchfly
… split() for both cases would just produce an empty array, so IMO that should be the expected output for both. Unless, of course, you really are trying to recreate Python's split() functionality (you did say "just like Python's str.split") instead of awk's, in which case it might become quite a different question depending on what else Python's split() does differently from awk's. – Motorcade
… for
and while loops: if (i == 1 && s == "") – Catchfly
… split() for that awk) so now there's 2 - 1) null fs, and 2) an fs with \ before an ordinary character. – Motorcade
… \x... and \u...) and shorthand for character classes such as \s that POSIX doesn't specify but the awk implementation your function is being called from (e.g. gawk) might support. – Motorcade
… split() in the awk it's being called from is to call split(), as there are too many ways that different POSIX-compliant awks could behave differently from each other due to how they choose to implement various behaviors that aren't specified by POSIX. – Motorcade
… gsub(/[.[\(*^$+?{|]/, "\\\\&", str) – Catchfly
… \ and ^ differently from every other character - they get a backslash in front while every other character gets enclosed in [...]. You can't escape every char, as some then have a different meaning while most are undefined behavior when escaped, while ^ and \ have a different meaning inside a bracket expression. See how I did it in the if ( length(fs) == 1 ) { block in the script in my answer below. – Motorcade
… a\tb - you need the \t in that to remain \t so it matches a literal tab in the input; it can't become \\t or \\\t or \\[t] or [\][t] or anything else. – Motorcade
… "\t"
become a literal TAB when used in e.g. split(s,a,"\t+"). I may be wrong but I don't think we need to care about those escape sequences; plus, we wouldn't be escaping fs but the stored literal strings – Catchfly
… printf 'xa\tby\n' | awk 'function x(fs) {printf "%s\n", fs} BEGIN{fs=ARGV[1]; x(fs); ARGV[ARGC--]=""} {n=split($0,f,ARGV[1]); print n, f[1], f[2]}' 'a\tb', for example, then you'll see in the output that the value of the fs parameter inside your function (x() here) will literally be the 4 characters a\tb, and split() will still be able to split the input using that as a tab, and match($0,fs) would also still handle \t as a tab in the fs. So AFAIK what we can't do is change a\tb into a\\tb or similar. – Motorcade
… \t applies if the literal string you're asking about escaping is just a single character of course. – Motorcade
… \ escaping also applies to the literal strings taken from ARGV or $0 when used in split. It seems to be the case only for fs though, and the ERE in match also has this behavior, so I think it's still possible to automatically build an ERE that matches the first N fields: regesc(a[1]) "(" fs ")" regesc(a[2]) "(" fs ")" ... regesc(a[N]) "(" fs ")" – Catchfly
… split and building a regex that matches the first N fields seems to be the only possible solution; and requires a lot of code: https://mcmap.net/q/1476019/-implementing-a-maxsplit-function-in-posix-awk – Catchfly
… split(target, fieldarray, separator, sepsarray). I found some oddities where the string began with a separator that might need a workaround, but it allowed reconstruction if used carefully. – Ferocious
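
The escaping scheme described in the comments above (a backslash in front of \ and ^, every other character wrapped in a bracket expression) could be sketched roughly as follows. The name regesc comes from the comment thread, but this body is only an assumption about what such a helper might look like, not the code from any of the answers. It walks the string one character at a time to avoid the backslash rules of gsub replacement strings.

```shell
out=$(awk '
# Hypothetical helper: build an ERE that matches the literal string s and nothing else.
# Backslash and caret get a leading backslash; every other character is bracketed.
function regesc(s,    out, i, c) {
    out = ""
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "^")
            out = out "\\" c
        else
            out = out "[" c "]"
    }
    return out
}
BEGIN {
    lit = "a.b^c"
    re = "^" regesc(lit) "$"
    print regesc(lit)
    # the dot must match only a literal dot, the caret only a literal caret
    print (lit ~ re), ("axb^c" ~ re)
}')
printf "%s\n" "$out"
```

With this input the escaped form is [a][.][b]\^[c], and the two match tests print 1 0. Strings stored in the array could be passed through such a helper before being concatenated with "(" fs ")" as suggested in the thread.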