How exactly does String.split() method in Java work when regex is provided?

I'm preparing for OCPJP exam and I ran into the following example:

class Test {
   public static void main(String args[]) {
      String test = "I am preparing for OCPJP";
      String[] tokens = test.split("\\S");
      System.out.println(tokens.length);
   }
}

This code prints 16. I was expecting something like no_of_characters + 1. Can someone explain what the split() method actually does in this case? I just don't get it...

Alethaalethea answered 7/3, 2014 at 20:8 Comment(5)
The source code is freely available. So is the javadoc for Pattern. - Guarneri
Look at the source code of String.split. - Tessie
A string literal "\\S" becomes \S in the regex engine, which is the metacharacter for "non-whitespace char". - Superposition
Use System.out.println(Arrays.toString(tokens)); and you'll see what it was split into. - Anemology
This is an interesting question. I don't know why no one has explained why it's 16, and why it's only 4 for the input "I am preparing". I also wonder why someone downvoted it. - Floris

It splits on every match of "\\S", which the regex engine sees as \S, i.e. any non-whitespace character.

So let's try to split "x x" on non-whitespace (\S). Since this regex matches exactly one character at a time, let's iterate over the characters and mark the places where we split (we will use the pipe | for that).

  • is 'x' non-whitespace? YES, so let's mark it | x
  • is ' ' non-whitespace? NO, so we leave it as is
  • is the last 'x' non-whitespace? YES, so let's mark it | |

So as a result we need to split our string at the start and at the end, which initially gives us the result array

["", " ", ""]
   ^    ^ - here we split

But since trailing empty strings are removed, result would be

[""," "]     <- result
        ,""] <- removed trailing empty string

so split returns the array ["", " "], which contains only two elements.

BTW, to turn off the removal of trailing empty strings you need to use split(regex, limit) with a negative limit, like split("\\S", -1).
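
The "x x" walkthrough above can be checked with a small program. This is only a sketch: the class name SplitTrailingDemo and its helper methods are made up for illustration, and the quoted() helper exists just to make empty strings visible in the printed output.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class SplitTrailingDemo {
    // Default split: trailing empty strings are dropped from the result.
    static String[] splitDefault(String input) {
        return input.split("\\S");
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepTrailing(String input) {
        return input.split("\\S", -1);
    }

    // Hypothetical helper: wraps every element in quotes so "" and " " are easy to tell apart.
    static String quoted(String[] parts) {
        return Arrays.stream(parts)
                .map(s -> "\"" + s + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(quoted(splitDefault("x x")));      // ["", " "]
        System.out.println(quoted(splitKeepTrailing("x x"))); // ["", " ", ""]
    }
}
```

Printing the elements in quotes is usually easier to read than plain Arrays.toString here, because an empty string and a single space are otherwise hard to distinguish.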


Now let's get back to your example. With your data you are splitting on each of the marked characters:

I am preparing for OCPJP
| || ||||||||| ||| |||||

which means

 ""|" "|""|" "|""|""|""|""|""|""|""|""|" "|""|""|" "|""|""|""|""|""

So this represents this array

[""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  

but since trailing empty strings "" are removed (provided they were created by the split itself; more info at: Confusing output from String.split)

[""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  
                                                     ^^ ^^ ^^ ^^ ^^

the result you get is an array which contains only this part:

[""," ",""," ","","","","","","","",""," ","",""," "]  

which is exactly 16 elements.
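
To double-check the counting above, here is a minimal sketch that compares the default split with a negative-limit split on the string from the question. The class name SplitCountDemo and its two helper methods are hypothetical, chosen only for this illustration.

```java
class SplitCountDemo {
    // Number of elements after the default split (trailing empty strings removed).
    static int countDefault(String input) {
        return input.split("\\S").length;
    }

    // Number of elements when trailing empty strings are kept (negative limit).
    static int countKeepTrailing(String input) {
        return input.split("\\S", -1).length;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";
        System.out.println(countKeepTrailing(test));         // 21 = 20 non-whitespace characters + 1
        System.out.println(countDefault(test));              // 16 = 21 minus the 5 trailing empty strings
        System.out.println(countDefault("I am preparing"));  // 4, as mentioned in the comments
    }
}
```

With the negative limit the count matches the "number of characters + 1" intuition from the question, as long as only non-whitespace characters are counted.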

Roubaix answered 7/3, 2014 at 20:17 Comment(18)
Do you know why it's 16, and why it's only 4 for "I am preparing"? - Floris
Thanks! Now I get it. It's removing the last non-whitespace section if it is adjacent to the end of the line! - Floris
That's it... I didn't realize that the trailing empty strings are removed. That explains the result. Thank you very much! - Alethaalethea
@SabujHassan Exactly. If you want to turn off this default mechanism so that trailing empty elements are not removed, just pass a negative limit as the second argument, like split(regex, -1). - Roubaix
@Alethaalethea No problem. But in the future, start by reading the javadocs of the methods you are using. The split method documentation mentions that "... trailing empty strings will be discarded". - Roubaix
@Roubaix My mistake, I totally missed that part :S - Alethaalethea
If \\S represents a non-whitespace character, then why is it considering the last whitespace? - Mealworm
Since split removes trailing spaces, if we replace split("\\S") with split("\\s") then we will get strings split by whitespace. It will not remove trailing spaces and strings. - Mealworm
@LokeshS I am not sure what you mean, but let me rephrase my answer a little. When you split on \S you are splitting on each non-whitespace character. This means that for a string like "a_b" (where _ represents a space) your result array will at first look like ["", "_", ""], but because split removes trailing empty strings the returned array will be ["", "_"]. Now which step is confusing you? - Roubaix
The confusing part to me is, in the example "a_b", as it splits on each non-whitespace character, why is "_" included in the result? Similarly, you gave the example "x x" above, which results in ["", " "]. But you made the statement "Important part is that by default split removes trailing empty strings". If it removes trailing empty strings and the part after it, why is that empty string included in the result? - Mealworm
If by default split removes trailing whitespace, let's say I have a string like "I am a good boy" and I am splitting with "\\s", which splits on every whitespace. I should get the result ["I","am","a","good"," "], as split removes trailing whitespace like you said. But I get the result ["I","am","a","good","boy"]. What makes the difference exactly? - Mealworm
@LokeshS "why "_" is included in the result one": _ in my example represents a space, and \S represents non-whitespace. So let's step through the splitting process of "a_b". Let's iterate over each character (since \S can be matched only by a single character). Is a non-whitespace? Yes. So we split on it: |_b (| represents the place we split on). Is _ non-whitespace? No, it is whitespace, so we don't split here. Is b non-whitespace? Yes, so we split on it: |_| which gives us the array ["", "_", ""]. - Roubaix
@LokeshS You may also misunderstand what is considered an empty string. In Java (and many other languages) an empty string is a string which doesn't contain any characters. This means that " " is not empty (because it contains one character representing a space). Only "" is considered an empty string (its length is 0). - Roubaix
@LokeshS Now, trailing empty strings are a series of empty strings placed at the end of the array. So an array like [a, "", ""] has two trailing empty strings, but an array like [a, "", "", b] doesn't have any trailing empty strings (because it ends with a non-empty string). - Roubaix
@Roubaix How come "ab".split(" ").length returns 1? - Except
@Roubaix Also "".split("[^A-Za-z]+").length returns 1. Using the pipe strategy you mentioned, I couldn't figure out how these two happen. Please explain/clarify how to deal with such cases. - Except
@AbhinavVutukuri In the case of "ab".split(" ") no split happened, so the result array contains the original string ["ab"] and its length is 1. The second case is more interesting. Here a split also couldn't happen, because "" doesn't contain any character outside of the ranges A-Za-z, so the result is an array with the original string [""]. The confusing part which I didn't mention in my answer is that removing trailing empty strings makes sense only if they were created by the splitting process. But here there was no split, so there is no need to remove anything. - Roubaix
@AbhinavVutukuri I tried to explain it in my answer to a different question: #25057107 - Roubaix
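
The edge cases raised in the last few comments can be tried out with a short sketch like the one below. The class name SplitEdgeCases and the pieces() helper are hypothetical; the ",".split(",") line is an extra contrast case that was not discussed in the thread, added to show a situation where the empty strings do come from the split.

```java
class SplitEdgeCases {
    // Length of the array returned by the default (limit 0) split.
    static int pieces(String input, String regex) {
        return input.split(regex).length;
    }

    public static void main(String[] args) {
        // No match at all: the array holds just the original string.
        System.out.println(pieces("ab", " "));                  // 1
        // Empty input, nothing to match: the array is [""], which is not removed.
        System.out.println(pieces("", "[^A-Za-z]+"));           // 1
        // Splitting on whitespace leaves no trailing empty strings, so "boy" stays.
        System.out.println(pieces("I am a good boy", "\\s"));   // 5
        // Contrast: here both empty strings were produced by the split, so both are dropped.
        System.out.println(pieces(",", ","));                   // 0
    }
}
```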
