Java StringTokenizer.nextToken() skips over empty fields
B

6

15

I am using a tab (\t) as the delimiter, and I know there are some empty fields in my data, e.g.:

one->two->->three

Where -> stands for a tab. As you can see, the empty field is still correctly surrounded by tabs. The data is read in a loop:

 while ((strLine = br.readLine()) != null) {
     StringTokenizer st = new StringTokenizer(strLine, "\t");
     String test = st.nextToken();
     ...
 }

Yet Java ignores this empty string and skips the field.

Is there a way to circumvent this behaviour and force Java to read the empty fields anyway?

Bobbiebobbin answered 10/7, 2012 at 8:22 Comment(3)
Use string.split("\t") instead.Semasiology
From the Javadoc of StringTokenizer: "StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."Chuckchuckfull
Just a heads up that it looks like using string.split("\t") won't return any trailing empty tokens at the end. If that matters, use string.split("\t", -1).Holst
B
9

Thank you all. The first comment pointed me to a solution; yes, you are right, and thanks for the reference:

 Scanner s = new Scanner(new File("data.txt"));
 while (s.hasNextLine()) {
      String line = s.nextLine();
      String[] items= line.split("\t", -1);
      System.out.println(items[5]);   // assumes every line has at least six fields
      //System.out.println(Arrays.toString(items));
 }
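
To illustrate why the -1 limit matters, here is a minimal sketch. The class name SplitLimitDemo and the helper fieldOrEmpty are hypothetical names made up for this example; only String.split and Arrays.toString come from the standard library. It compares split("\t") with split("\t", -1) on a line that ends in empty fields, and guards the index access so that short lines do not throw.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Splits a tab-separated line and keeps every empty field,
    // including the trailing ones that split("\t") would drop.
    static String[] fields(String line) {
        return line.split("\t", -1);
    }

    // Hypothetical helper: returns the field at index, or "" if the line is too short.
    static String fieldOrEmpty(String[] items, int index) {
        return index < items.length ? items[index] : "";
    }

    public static void main(String[] args) {
        String line = "one\ttwo\t\tthree\t\t";
        // Without a limit, trailing empty strings are removed: 4 elements.
        System.out.println(Arrays.toString(line.split("\t")));
        // With limit -1, all 6 fields survive.
        System.out.println(Arrays.toString(fields(line)));
        // Safe access instead of a bare items[5] on a possibly short line.
        System.out.println("[" + fieldOrEmpty(fields("a\tb"), 5) + "]");
    }
}
```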
Bobbiebobbin answered 10/7, 2012 at 11:15 Comment(0)
B
16

There is an RFE in Sun's bug database about this StringTokenizer issue, with the status Will not fix.

The evaluation of this RFE states:

With the addition of the java.util.regex package in 1.4.0, we have basically obsoleted the need for StringTokenizer. We won't remove the class for compatibility reasons. But regex gives you simply what you need.

It then suggests using the String#split(String) method.
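
As a rough sketch of the difference, the following compares the two approaches on the line from the question. The class and method names (TokenizerVsSplit, withTokenizer, withSplit) are made up for this example; the StringTokenizer and String.split calls are the standard ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplit {
    // Collects tokens the way the question's loop does; runs of tabs count as one separator.
    static List<String> withTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits on every single tab, so an empty field between two tabs is kept.
    static List<String> withSplit(String line) {
        return Arrays.asList(line.split("\t"));
    }

    public static void main(String[] args) {
        String line = "one\ttwo\t\tthree";
        System.out.println(withTokenizer(line)); // [one, two, three]
        System.out.println(withSplit(line));     // [one, two, , three]
    }
}
```

Note that split(String) without a limit still drops empty fields at the very end of the line.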

Bucksaw answered 10/7, 2012 at 8:27 Comment(0)
C
5

You can use Apache Commons StringUtils.splitPreserveAllTokens(). It does exactly what you need.

Caseation answered 10/7, 2012 at 8:26 Comment(0)
D
1

I would use Guava's Splitter, which doesn't need all the big regex machinery and is better behaved than String's split() method:

Iterable<String> parts = Splitter.on('\t').split(string);
Deration answered 10/7, 2012 at 8:30 Comment(3)
Call me paranoid, but I really think introducing a new dependency for something so simple (not to mention already covered by the standard library) is a bit of overkill. I still appreciate the info about Guava's Splitter not needing regex, though :)Hendecagon
I agree, generally, but Guava is so useful and provides so many additional useful classes that it's part of my "default" dependencies for nearly all my projects (unless it's a very small self-contained library).Deration
Guava is awesome, for sure. I still haven't fully explored the awesomeness that is Guava, so it's always nice to learn new stuff about it.Hendecagon
G
0

As you can see in the Javadoc http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html you can use the constructor public StringTokenizer(String str, String delim, boolean returnDelims) with returnDelims set to true.

It then returns each delimiter as a separate string!
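
For illustration only, here is one way the returned delimiters could be used to recover empty fields. This is a sketch under assumptions, not something the Javadoc prescribes: the class ReturnDelimsDemo and its fields method are hypothetical, and it only handles a single-character delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReturnDelimsDemo {
    // Rebuilds the field list from a tokenizer that also returns delimiters.
    static List<String> fields(String line, char delim) {
        String d = String.valueOf(delim);
        StringTokenizer st = new StringTokenizer(line, d, true); // returnDelims = true
        List<String> out = new ArrayList<>();
        boolean lastWasDelim = true; // the start of the line counts as "just after a delimiter"
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals(d)) {
                if (lastWasDelim) {
                    out.add(""); // leading delimiter or two delimiters in a row: empty field
                }
                lastWasDelim = true;
            } else {
                out.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            out.add(""); // trailing delimiter (or empty line): final empty field
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("one\ttwo\t\tthree", '\t')); // [one, two, , three]
    }
}
```

This needs noticeably more bookkeeping than a single split call.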

Edit:

DON'T do it this way. As @npe already pointed out, StringTokenizer shouldn't be used any more! See the Javadoc:

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

Guadalupeguadeloupe answered 10/7, 2012 at 8:26 Comment(2)
I still have the problem that when there are multiple tabs in a row (indicating blank fields), the blank value is NOT put into the array. How can I fix this?Bobbiebobbin
returnDelims returns the delimiters. This does not answer the question.Stilly
G
0
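import java.util.StringTokenizer;

public class TestStringTokenStrict {

/**
 * Stricter variant of StringTokenizer that keeps empty fields.
 *
 * @param str    the input line
 * @param delim  a single-character delimiter
 * @param strict pass true to keep empty fields; each one becomes a
 *               single-space token
 * @return a tokenizer over the rebuilt string
 */
static StringTokenizer getStringTokenizerStrict(String str, String delim, boolean strict) {
    // strict is passed as returnDelims, so delimiters come back as tokens
    StringTokenizer st = new StringTokenizer(str, delim, strict);
    StringBuilder sb = new StringBuilder();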

    while (st.hasMoreTokens()) {
        String s = st.nextToken();
        if (s.equals(delim)) {
            sb.append(" ").append(delim);
        } else {
            sb.append(s).append(delim);
            if (st.hasMoreTokens())
                st.nextToken();
        }
    }
    return (new StringTokenizer(sb.toString(), delim));
}

static void altStringTokenizer(StringTokenizer st) {
    while (st.hasMoreTokens()) {
        String type = st.nextToken();
        String one = st.nextToken();
        String two = st.nextToken();
        String three = st.nextToken();
        String four = st.nextToken();
        String five = st.nextToken();

        System.out.println(
                "[" + type + "] [" + one + "] [" + two + "] [" + three + "] [" + four + "] [" + five + "]");
    }
}

public static void main(String[] args) {
    String input = "Record|One||Three||Five";
    altStringTokenizer(getStringTokenizerStrict(input, "|", true));
}
}
Gilboa answered 7/5, 2020 at 8:36 Comment(0)
