AWK equivalent to `read -r _ _ remainder`
M

6

7

Let's say that you have a file which contains N whitespace-delimited columns and an additional column which has spaces in it that you want to keep.

Example with N = 2:

1.1 1.2 data for row1
  2.1   2.2    data   for    row2
?  ?   data   for   row3
 \ * data for   row4

I would like to output:

data for row1
data   for    row2
data   for   row3
data for   row4

In the shell you can do it easily with:

while read -r _ _ data
do
    printf "%s\n" "$data"
done < data.txt

But with awk it's kind of difficult. Is there a method in awk for splitting only the first N columns?

Managua answered 26/5 at 8:0 Comment(5)
If you didn't care about retaining the exact amount of whitespace following the last split, you could set $1 = ""; $2 = "" and then leverage the reconstituted $0. (Not building this into an answer because that's obviously not what you want, but it may work for someone else.) - Professorate
That loop doesn't keep ALL of the spaces that might be in "an additional column which has spaces in it that you want to keep". Try while read -r _ _ data; do printf '<%s>\n' "$data"; done <<< 'a b c ', i.e. with spaces after the c, and note that it outputs just <c>, not <c > with the spaces after c that you want to keep. - Bluestone
@EdMorton that’s right, it strips the trailing space chars, which might or might not be desirable. That’s why I would prefer AWK over it, but there's no builtin function that does it. - Managua
I see you're getting some answers that assume you want to print from where the string data appears in the input on rather than wanting to print from the 3rd field on. Those are 2 very different problems that each can benefit from quite different solutions. If you change one of the 2nd field values, e.g. 1.2 to data, and one of the 3rd field values from data to foodatabar, that should make what you're asking about even clearer (though I thought it was already clear that you wanted "a method in awk for splitting only the first N columns"). - Bluestone
@EdMorton I would say that it's typical of SO; when there's a sample input and expected output for illustrating the problem, the answers tend to get overly simplified to just handle the sample data (sometimes right on the bullet). Here I can only blame myself for not having chosen a better sample. - Managua
B
5

The premise of the awk language is that it should only have constructs for things that aren't easy to do with other constructs. That keeps the language concise and avoids the bloat that some other tools/languages suffer from. For example, some people like that perl has many unique language constructs to do anything you could possibly want to do, while others express their opposing view of the language in cartoons like https://www.zoitz.com/comics/perl_small.png.

This is just one of the many things that it'd be nice to have a function for. But it's so easy to code whatever you actually need to skip a couple of fields for any specific input that such a function would just clutter up the language, and if we had a function for THIS there are 100s of other functions that should also be created for all of the other things it'd be nice to have a function to do.

Using GNU awk for \s/\S shorthand:

$ awk 'sub(/^\s*(\S+\s+){2}/,"")' file
data for row1
data   for    row2
data   for   row3
data for   row4

and the same with any POSIX awk:

$ awk 'sub(/^[[:space:]]*([^[:space:]]+[[:space:]]+){2}/,"")' file
data for row1
data   for    row2
data   for   row3
data for   row4

Note that the awk output from above would retain any trailing white space, unlike a shell read loop.

Both of those rely on the FS being the default blank character but are easily modified for any other FS that can be negated in a bracket expression (or opposite character class).

Note that the entire approach relies on being able to negate the FS in a bracket expression so it wouldn't work if the FS was some arbitrary regexp or even a multi-char string but then neither would the shell read loop you're asking to duplicate the function of.
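
As a rough illustration of that modification, here is a sketch for comma-separated input. It assumes FS is a single comma and that empty fields are allowed, so the negated bracket expression uses * rather than +. The helper name skip_two_comma_fields is made up for this example, and the two fields are spelled out instead of using {2} because of the mawk issue raised in the comments.

```shell
# skip_two_comma_fields is a hypothetical helper name, not part of the answer.
# It reads lines on stdin and drops the first two comma-separated fields.
skip_two_comma_fields() {
    # [^,]* is "anything but the separator", i.e. the FS negated in a
    # bracket expression; each "[^,]*," consumes one field and its comma.
    awk '{ sub(/^[^,]*,[^,]*,/, ""); print }'
}

printf '%s\n' 'a,b,c d,e' ',x,  keep  this ' | skip_two_comma_fields
```

Everything after the second comma, including leading and trailing blanks and any later commas, should come through untouched.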

If you do happen to have a FS you can't just negate in a bracket expression, e.g. if your fields are separated by 3 digits or 2 punctuation characters so you have something like:

$ echo 'abc345def;%ghi+klm;%nop345qrs' |
    awk -v FS='[[:digit:]]{3}|[[:punct:]]{2}' '{for (i=1; i<=NF; i++) print i, $i}'
1 abc
2 def
3 ghi+klm
4 nop
5 qrs

then here's a more general approach using GNU awk for the 4th arg to split():

$ echo 'abc345def;%ghi+klm;%nop345qrs' |
    awk -v FS='[[:digit:]]{3}|[[:punct:]]{2}' '{
        split($0,f,FS,s)
        print substr( $0, length(s[0] f[1] s[1] f[2] s[2]) + 1 )
    }'
ghi+klm;%nop345qrs
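
Where gawk is not available, a similar effect might be had in POSIX awk by calling match() in a loop and cutting the line after each separator. This is a sketch of a different technique from the split() call above, not a drop-in equivalent. The names skip_fields, n and fs are hypothetical, the separator regex is written without interval expressions (an assumption, made because of the mawk issue raised in the comments), and the regex is assumed never to match the empty string.

```shell
# skip_fields is a hypothetical helper: skip_fields N SEPARATOR_REGEX < input
skip_fields() {
    awk -v n="$1" -v fs="$2" '{
        rest = $0
        # Each pass cuts off one field plus the separator that follows it.
        for (i = 0; i < n && match(rest, fs); i++)
            rest = substr(rest, RSTART + RLENGTH)
        # Fewer than n separators found: nothing remains after field n.
        print (i == n ? rest : "")
    }'
}

echo 'abc345def;%ghi+klm;%nop345qrs' | skip_fields 2 '[0-9][0-9][0-9]|;%'
```

For this input the call should print the same remainder as the gawk version, ghi+klm;%nop345qrs.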
Bluestone answered 26/5 at 17:1 Comment(6)
Worth noting that the POSIX one won't work on mawk (default awk on Debian) due to a bug in its regex engine. It'll match only one field regardless of the number between curly braces. - Herson
@oguzismail: see my answer below to deal with the issue you mentioned. - Derm
@oguzismail it probably will work in mawk 2, just not mawk 1. It wouldn't work without --re-interval or --posix in versions of gawk before 4.0, and probably not in other older awk variants/versions either, as it requires support for RE-intervals and character classes as defined by POSIX, which aren't supported in some older awk variants/versions. Hence the "with any POSIX awk" statement. - Bluestone
@EdMorton No, it doesn't work with mawk 2 either. No other awk I have on my computer has this problem though, so I'm not suggesting that you change your answer but just letting you and future readers know. - Herson
@oguzismail Huh, I guess mawk2 isn't POSIX compliant then. Weird as it shares so much code with GNU awk. Do you know if it's the character classes (e.g. [:space:]) or interval expressions ({2}) that's the problem? Oh well.... thanks for letting us know. - Bluestone
It's the interval expressions (a few other people spotted the bug in mawk 1.x before, see "mawk mishandles {2,} in regular expressions"). [[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+ works on both versions. - Herson
A
4

If the data is separated by 1 or more spaces, you can remove the first 1 or 2 columns with sub, where each column is a single word consisting of non-whitespace characters.

As your example shell script will also remove the word if there is just a single word, you can use an optional part for the second word.

awk '{
    sub(/^[[:space:]]*[^[:space:]]+([[:space:]]+[^[:space:]]+)?[[:space:]]*/, "");
}1' file

The pattern matches:

  • ^ Start of string
  • [[:space:]]*[^[:space:]]+ Match optional whitespace and 1+ non-whitespace chars
  • ([[:space:]]+[^[:space:]]+)? Optionally match 1+ whitespace chars followed by 1+ non-whitespace chars
  • [[:space:]]* Match trailing spaces

Input

1.1 1.2 data for row1
  2.1   2.2    data   for    row2
test

?  ?   data   for   row3
 \ * data for   row4

Output

data for row1
data   for    row2


data   for   row3
data for   row4
Avitzur answered 26/5 at 8:20 Comment(0)
G
4

With your shown samples only, please try the following awk code.

awk '
match($0,/[[:space:]]+data[[:space:]]+.*$/){
  value=substr($0,RSTART,RLENGTH)
  sub(/^[[:space:]]+/,"",value)
  print value
}
'  Input_file
Gemagemara answered 27/5 at 6:44 Comment(1)
As I started all the additional fields with data, it is a valid approach indeed, +1. - Managua
D
3

Disclaimer: this solution assumes you are using the default GNU AWK understanding of fields, i.e. the field separator is one or more white-space characters; if this does not hold, ignore this answer entirely.

Is there a method in awk for splitting only the first N columns?

If you know N a priori, you can prepare a regular expression and use it in the sub string function. In particular, let the file.txt content be

1.1 1.2 data for row1
  2.1   2.2    data   for    row2
?  ?   data   for   row3
 \ * data for   row4

then

awk '{sub(/[[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/,"");print}' file.txt

gives output

data for row1
data   for    row2
data   for   row3
data for   row4

Explanation: the regular expression consists of alternating white-space characters [[:space:]] and non-white-space characters [^[:space:]]. The leading white-space characters are optional, so there are zero or more of them (*); all the others are one or more (+) in number.

If you need an easy way to adjust N, use a for loop to remove the leftmost column one at a time. For example, if you want N=3 and process file.txt as shown above, you could do

awk 'BEGIN{n=3}{for(i=0;i<n;i+=1){sub(/[[:space:]]*[^[:space:]]+[[:space:]]+/,"")};print}' file.txt

which gives output

for row1
for    row2
for   row3
for   row4

Explanation: this removes the leftmost column and the adjacent field separator on each turn of the for loop.

(tested in GNU Awk 5.1.0)
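
A possible variation on the loop above, sketched under the same default-FS assumption: build the regular expression once in BEGIN and apply a single anchored sub per line. The names skip_blank_fields, n and re are hypothetical, and [ \t] stands in for [[:space:]] on the assumption that only spaces and tabs separate the columns.

```shell
# skip_blank_fields is a hypothetical helper: skip_blank_fields N < input
skip_blank_fields() {
    awk -v n="$1" '
        BEGIN {
            # Optional leading blanks, then n times "field + blanks".
            re = "^[ \t]*"
            for (i = 0; i < n; i++)
                re = re "[^ \t]+[ \t]+"
        }
        # Lines with too few fields do not match and are printed unchanged.
        { sub(re, ""); print }
    '
}

printf '%s\n' '1.1 1.2 data for row1' '  2.1   2.2    data   for    row2' |
    skip_blank_fields 2
```

Because the pattern is a plain concatenation, it does not depend on interval expressions such as {2}, which the comments on another answer report as unreliable in mawk.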

Dialysis answered 26/5 at 11:6 Comment(0)
D
3

You mean like this?

awk '($!_=_)^_' FS=data OFS=data

data for row1
data   for    row2
data   for   row3
data for   row4
Derm answered 27/5 at 12:36 Comment(5)
Well, yes and no; check https://mcmap.net/q/1476019/-implementing-a-maxsplit-function-in-posix-awk/3387716 - Managua
@Managua : how is this different from RavinderSingh13's solution ? Cuz both solutions are dependent on that fixed string "data" being present to act as pivoting point. I set FS directly to this fixed string, so all fancy combos of spaces and tabs are fully preserved. And if there aren't any spaces or tabs to left of the first data, then $1 == "" to begin with, and me setting it to an empty string wouldn't end up deleting the parts u want to keep. - Derm
@Managua : The best part of not explicitly pre-defining what _ should be is that in that tiny piece of code the same variable had 3 different roles - $!_ is logically negating the empty string, making it a 1, $!_=_ is assigning the empty string into $1, while )^_ is taking the 0th power, taking advantage of the duality property that unassigned vars are both empty string and numeric zero, bypassing the extra type casting. - Derm
This tiny bit of code is cunning indeed. - Managua
@Managua : i suppose you can write it as awk '! ($1 = _)' or awk 'NF += $1 = _' or awk '($1 = _) < NF' (they're not totally identical to each other though, namely, the middle variant skips blank lines) - Derm
D
1

You don't need gawk and the 4-argument variant of split( ) just to get the seps.

Using the same example as above :

 echo 'abc345def;%ghi+klm;%nop345qrs' | 

 mawk 'BEGIN { __ = (_ = "[0-9]") (_)_ "|" (_ = "[[:punct:]]")_
                _ = "\3&\5"
              ___ = "[\3\5]+"  } gsub(__, _) + gsub(___, "\f")_'

abc
   345
      def
         ;%
           ghi+klm
                  ;%
                    nop
                       345
                          qrs

The odd-numbered "steps" are the fields[] you can get from any split(), while the even-numbered ones in between are the seps[].

Change it to a single gsub(__, "\11 \14| &\12")_' and it should be even more obvious:

abc  
         | 345
def  
         | ;%
ghi+klm  
         | ;%
nop  
         | 345
qrs

This would, by and large, emulate gawk's FPAT and patsplit() functionalities.

Using this filtering criteria instead : __ = "[A-Za-z]+([[:punct:]][A-Za-z]+)*", the roles of fields[] and seps[] have just been swapped.

         | abc
345  
         | def
;%   
         | ghi+klm
;%   
         | nop
345  
         | qrs

This approach should work on all awk variants.

Derm answered 27/5 at 13:6 Comment(0)
