How does string.split("\\S") work [duplicate]

I was doing a question out of the book oracle_certified_professional_java_se_7_programmer_exams_1z0-804_and_1z0-805 by Ganesh and Sharma.

One question is:

  1. Consider the following program and predict the output:

      class Test {
    
        public static void main(String args[]) {
          String test = "I am preparing for OCPJP";
          String[] tokens = test.split("\\S");
          System.out.println(tokens.length);
        }
      }
    

    a) 0

    b) 5

    c) 12

    d) 16

Now I understand that \S is a regex that means treat non-space characters as the delimiters. But I was puzzled as to how the regex does its matching and what tokens split actually produces.

I added code to print out the tokens as follows:

for (String str: tokens){
  System.out.println("<" + str + ">");
}

and I got the following output:

16

<>

< >

<>

< >

<>

<>

<>

<>

<>

<>

<>

<>

< >

<>

<>

< >

So there are a lot of empty string tokens. I just do not understand this.

I would have thought that if the delimiters are non-space characters, then every alphabetic character in the text above serves as a delimiter, so maybe there should be 21 tokens if empty strings count as tokens too. I just don't understand how Java's regex engine is working this out. Are there any regex gurus out there who can shed light on this code for me?

Naturalize answered 9/10, 2014 at 14:21 Comment(4)
I tried your example and it makes much more sense if you replace \\S with \\s, could this be a typo ?Donnie
@Donnie This is for a certification exam, why would it seem strange that they throw in a tricky case like this? The fact that they included the correct answer (16) as one of the choices makes it very unlikely that this was unintentional.Customable
P.S. If 21 had been one of the choices, I probably would have gotten this wrong.Customable
Hi no it was meant to be \\S the opposite of \\s. Tricky one this.Naturalize
L
7

First, start with \s (lower case), which is the regular expression character class for whitespace: by default in Java that is space ' ', tab '\t', newline '\n', carriage return '\r', form feed '\f' and vertical tab '\x0B'.

\S (upper case) is the opposite of this, so it matches any non-whitespace character.

So when you split the String "I am preparing for OCPJP" using \S, you are effectively splitting it at every letter. That is how your token array ends up with a length of 16.

Now as for why these are empty.

Consider the String "Hello,World". If we split it using ",", we end up with a String array of length 2 containing "Hello" and "World". Notice that the "," is in neither String; it has been erased.

The same thing has happened with the "I am preparing for OCPJP" String: it has been split, and the characters matched by your regex are not in any of the returned values. Because most of the letters in that String are followed immediately by another letter, you end up with a load of Strings of length zero; only the whitespace characters are preserved.
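As a rough illustration of the point above, here is a minimal sketch showing that the delimiter never appears in the tokens and that two adjacent delimiters leave an empty string between them. The class name DelimiterDemo, the helper tokens and the "a,,b" input are made up for this example.

```java
import java.util.Arrays;

class DelimiterDemo {
    // Hypothetical helper for illustration: splits text on the given regex.
    static String[] tokens(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // The comma is consumed by the split; it appears in no token.
        System.out.println(Arrays.toString(tokens("Hello,World", ",")));
        // Two adjacent delimiters leave an empty string between them.
        System.out.println(Arrays.toString(tokens("a,,b", ",")));
    }
}
```

The second call returns three tokens, the middle one being the empty string, which is the same effect that produces the <> entries in the question's output.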

Loire answered 9/10, 2014 at 14:33 Comment(2)
The point of the questions is: why 16 and not 21? Why is "OCPJP" not treated as a bunch of separators? There are 21 letters, but last ones are ignored...Angulate
Fair point, missed that part of the question! Thanks for pointing that out and highlighting the documentation in your answer.Loire
A
12

Copied from the API documentation (emphasis mine):

public String[] split(String regex)

Splits this string around matches of the given regular expression. This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

The string "boo:and:foo", for example, yields the following results with these expressions:

 Regex  Result
   :    { "boo", "and", "foo" }
   o    { "b", "", ":and:f" }

Check the second example, where the empty strings produced by the last two "o" characters are simply removed. That answers your question: the "OCPJP" substring is treated as a run of separators that is not followed by any non-empty string, so the empty strings it produces are trimmed.
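To see the trimming from the documentation example directly, the sketch below compares the one-argument split (equivalent to a limit of zero) with the two-argument form using a negative limit, which keeps trailing empty strings. The class name LimitDemo is only an illustrative choice.

```java
import java.util.Arrays;

class LimitDemo {
    public static void main(String[] args) {
        String text = "boo:and:foo";

        // Limit 0 (same as the one-argument split): trailing empty strings are dropped.
        String[] trimmed = text.split("o", 0);
        System.out.println(trimmed.length + " " + Arrays.toString(trimmed));

        // Negative limit: the pattern is applied as often as possible and nothing is dropped.
        String[] full = text.split("o", -1);
        System.out.println(full.length + " " + Arrays.toString(full));
    }
}
```

With limit 0 the array has 3 elements; with limit -1 it has 5, the last two being the empty strings that follow the final two "o" characters.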

Angulate answered 9/10, 2014 at 14:43 Comment(4)
Thanks Pablo that makes sense if you ignore the empty strings after the last space. That would explain the number. 16 instead of 21 ish.Naturalize
This is on a slightly different point, but say you had a comma-separated file where the values at the end are empty, for example from an Excel spreadsheet where the user did not enter a value. Would String.split throw them away? That might lead to nasty bugs if you were expecting to process the data. Just thinking aloud :-).Naturalize
Yes, that's the reason you have to check the length of the array when splitting a CSV line. If you mix that and the fact that CSV format lacks any standard...Angulate
@FrankBrosnan In that case you may want to consider split(",", -1).Serpens
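To make the CSV concern in these comments concrete, here is a small sketch of split(",", -1) next to the plain split(","). The class name CsvTrailingDemo, the helper fields and the sample line are assumptions for illustration, and real CSV data with quoted fields would need a proper parser rather than split.

```java
import java.util.Arrays;

class CsvTrailingDemo {
    // Hypothetical helper: splits one simple CSV line, keeping trailing empty fields.
    static String[] fields(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "a,b,,"; // last two fields were left blank

        // Plain split drops the two trailing empty fields.
        System.out.println(line.split(",").length);

        // A negative limit keeps them, so the column count stays stable.
        System.out.println(fields(line).length + " " + Arrays.toString(fields(line)));
    }
}
```

Here the plain split reports 2 fields while the version with limit -1 reports 4.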
C
8

The reason the result is 16 and not 21 is this, from the javadoc for String.split:

Trailing empty strings are therefore not included in the resulting array.

This means, for example, that if you say

"/abc//def/ghi///".split("/")

the result will have five elements. The first will be "", since it's not a trailing empty string; the others will be "abc", "", "def", and "ghi". But the remaining empty strings are removed from the array.

In the posted case:

"I am preparing for OCPJP".split("\\S")

it's the same thing. Since non-space characters are delimiters, each letter is a delimiter, but the OCPJP letters essentially don't count, because those delimiters result in trailing empty strings that are then discarded. So, since there are 15 letters in "I am preparing for", they are treated as delimiting 16 substrings (the first is "" and the last is " ").
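The count above can be checked with a short sketch that splits the posted string twice, once with limit 0 and once with a negative limit so that the trailing empty strings survive. The class name TrailingTokensDemo and the helper count are hypothetical names used only here.

```java
class TrailingTokensDemo {
    // Hypothetical helper: number of tokens produced for a given limit argument.
    static int count(String text, String regex, int limit) {
        return text.split(regex, limit).length;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";

        // Limit 0 behaves like test.split("\\S"): trailing empty strings are discarded.
        System.out.println(count(test, "\\S", 0));

        // A negative limit keeps every token, including those after "OCPJP".
        System.out.println(count(test, "\\S", -1));
    }
}
```

The first line prints 16 and the second prints 21, so the five tokens that go missing are exactly the trailing empty strings produced by the letters of "OCPJP".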

Customable answered 9/10, 2014 at 14:46 Comment(0)
