Tokenize a string with a space in Java
M

11

9

I want to tokenize a string like this

String line = "a=b c='123 456' d=777 e='uij yyy'";

I cannot simply split on spaces like this, because the quoted values contain spaces:

String [] words = line.split(" ");

Any idea how I can split it so that I get tokens like these?

a=b
c='123 456'
d=777
e='uij yyy'
Managerial answered 1/10, 2009 at 0:21 Comment(3)
Couldn't you just use a regex to split on spaces except inside quotes? (I don't know regex well, but I'm fairly sure that's possible.) - Cl
Your code works perfectly here using JDK 1.6.0_13. - Medan
@LePad The code above outputs [a=b, c='123, 456', d=777, e='uij, yyy']. - Sesquioxide
R
9

The simplest way to do this is to implement a simple finite state machine by hand. In other words, process the string one character at a time:

  • When you hit a space, break off a token;
  • When you hit a quote, keep reading characters until you hit the closing quote.
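The two rules above can be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: the class name QuoteAwareTokenizer and the method tokenize are made up for the example, and it assumes single quotes are always balanced and never escaped or nested.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareTokenizer {

    // Hypothetical helper: splits on spaces, except inside single quotes.
    // Assumption: quotes are balanced and there is no escape character.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;   // the only state the machine needs

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\'') {
                inQuotes = !inQuotes;   // entering or leaving a quoted section
                current.append(ch);     // keep the quote as part of the token
            } else if (ch == ' ' && !inQuotes) {
                // A space outside quotes ends the current token.
                if (current.length() > 0) {   // ignore runs of spaces
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        // Flush the last token, if any.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "a=b c='123 456' d=777 e='uij yyy'";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

For the line from the question, this prints a=b, c='123 456', d=777 and e='uij yyy', each on its own line. Supporting escaped quotes would need one more state.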
Rickard answered 1/10, 2009 at 0:26 Comment(2)
Well, a finite state machine is equivalent to a regular expression, so you could just stick with that, right? - Infinitesimal
Beware that you may need to handle escaped quotes such as \". - Sperm
R
3

Depending on the format of your original string, you should be able to pass a regular expression to Java's "split" method: Click here for an example.

The example doesn't use the regular expression that you would need for this task though.

You can also use this SO thread as a guideline (although it's in PHP) which does something very close to what you need. Manipulating that slightly might do the trick (although having quotes be part of the output or not may cause some issues). Keep in mind that regex is very similar in most languages.

Edit: going much further with this type of task may be beyond the capabilities of regex, so you may need to write a simple parser.
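One regex that is commonly used for this kind of split treats a space as a separator only when an even number of quotes follows it, which means the space is not inside a quoted section. The sketch below is an assumption about what such a pattern could look like, not the expression from the linked example; the names QuoteAwareSplit and SEPARATOR are invented here, and it only works when the single quotes are balanced and unescaped.

```java
import java.util.Arrays;

public class QuoteAwareSplit {

    // One or more spaces, followed (lookahead) by text that contains only
    // complete pairs of single quotes up to the end of the string.
    // Assumption: quotes are balanced and never escaped.
    static final String SEPARATOR = " +(?=(?:[^']*'[^']*')*[^']*$)";

    public static String[] split(String line) {
        return line.split(SEPARATOR);
    }

    public static void main(String[] args) {
        String line = "a=b c='123 456' d=777 e='uij yyy'";
        System.out.println(Arrays.toString(split(line)));
    }
}
```

The lookahead rescans the rest of the string at every space, so this is fine for short lines but a hand-written parser is likely the better choice for long input.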

Resource answered 1/10, 2009 at 0:29 Comment(0)
J
3
line.split(" (?=[a-z]+=)")

correctly gives:

a=b
c='123 456'
d=777
e='uij yyy'

Make sure you adapt the [a-z]+ part if the structure of your keys changes.

Edit: this solution can fail miserably if there is a "=" character in the value part of the pair, for example a quoted value that contains a space followed by letters and "=".

Jato answered 21/6, 2010 at 7:15 Comment(0)
M
1

StreamTokenizer can help, although it is easiest to set up to break on '=', as it will always break at the start of a quoted string:

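// Requires java.io.StreamTokenizer and java.io.StringReader;
// nextToken() can throw IOException.
String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
// Treat digits as word characters instead of parsing them as numbers
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
    switch (st.ttype) {
    case StreamTokenizer.TT_NUMBER:
        System.out.println(st.nval);
        break;
    case StreamTokenizer.TT_WORD:
        System.out.println(st.sval);
        break;
    case '=':
        System.out.println("=");
        break;
    default:
        // Quoted string: ttype is the quote character, sval its contents
        System.out.println(st.sval);
    }
}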

outputs

Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy

If you leave out the two lines that turn digits into word characters, 777 is parsed as a number and printed as 777.0, which might be useful to you.

Medford answered 1/10, 2009 at 1:14 Comment(0)
C
1

Assumptions:

  • Your variable name ('a' in the assignment 'a=b') can be of length 1 or more
  • Your variable name ('a' in the assignment 'a=b') cannot contain the space character; anything else is fine.
  • Validation of your input is not required (input assumed to be in valid a=b format)

This works fine for me.

Input:

a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk'              123sdkljhSDFjflsakd@*#&=456sldSLKD)#(

Output:

a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'             
123sdkljhSDFjflsakd@*#&=456sldSLKD)#(

Code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {

    // SPACE CHARACTER                                          followed by
    // sequence of non-space characters of 1 or more            followed by
    // first occurring EQUALS CHARACTER
    final static String regex = " [^ ]+?=";


    // static pattern defined outside so that you don't have to compile it 
    // for each method call
    static final Pattern p = Pattern.compile(regex);

    public static List<String> tokenize(String input, Pattern p){
        input = input.trim(); // this is important for "last token case"
                                // see end of method
        Matcher m = p.matcher(input);
        ArrayList<String> tokens = new ArrayList<String>();
        int beginIndex=0;
        while(m.find()){
            int endIndex = m.start();
            tokens.add(input.substring(beginIndex, endIndex));
            beginIndex = endIndex+1;
        }

        // LAST TOKEN CASE
        //add last token
        tokens.add(input.substring(beginIndex));

        return tokens;
    }

    private static void println(List<String> tokens) {
        for(String token:tokens){
            System.out.println(token);
        }
    }


    public static void main(String args[]){
        String test = "a=b " +
                "abc='123 456' " +
                "&=777 " +
                "#='uij yyy' " +
                "ABC='slk slk'              " +
                "123sdkljhSDFjflsakd@*#&=456sldSLKD)#(";
        List<String> tokens = RegexTest.tokenize(test, p);
        println(tokens);
    }
}
Colorless answered 1/10, 2009 at 1:27 Comment(0)
R
1

Or, with a regex for tokenizing, and a little state machine that just adds the key/val to a map:

String line = "a = b c='123 456' d=777 e =  'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
    String found = m.group();
    if (state.equals("key")) {
        if (found.equals("=") || found.startsWith("'"))
            { System.err.println ("ERROR"); }
        else { key = found; state = "equals"; }
    } else if (state.equals("equals")) {
        if (! found.equals("=")) { System.err.println ("ERROR"); }
        else { state = "value"; }
    } else if (state.equals("value")) {
        if (key == null) { System.err.println ("ERROR"); }
        else {
            if (found.startsWith("'"))
                found = found.substring(1,found.length()-1);
            keyval.put (key, found);
            key = null;
            state = "key";
        }
    }
}
if (! state.equals("key"))  { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);

prints out

map: {d=777, e=uij yyy, c=123 456, a=b}

It does some basic error checking, and takes the quotes off the values.

Rader answered 24/1, 2013 at 22:17 Comment(0)
P
0

This solution is both general and compact (it is effectively the regex version of cletus' answer):

String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
  System.out.println(m.group()); // or whatever you want to do
}

In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).

Piperidine answered 1/10, 2009 at 4:13 Comment(0)
C
0
public static void main(String[] args) {
    String token;
    String value = "";
    HashMap<String, String> attributes = new HashMap<String, String>();
    String line = "a=b c='123  456' d=777 e='uij yyy'";
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    while (tokenizer.hasMoreTokens()) {
        token = tokenizer.nextToken();
        value = token.contains("'") ? value + " " + token : token;
        if (!value.contains("'") || value.endsWith("'")) {
            // Split the string and put the key and value into the map
            attributes.put(value.split("=")[0].trim(), value.split("=")[1]);
            value = "";
        }
    }
    System.out.println(attributes);
}

output: {d=777, a=b, e='uij yyy', c='123 456'}

In this case, consecutive spaces inside a value are collapsed to a single space. The attributes HashMap holds the resulting key/value pairs.

Chaworth answered 21/6, 2010 at 8:6 Comment(0)
F
0
import java.io.*;
import java.util.Scanner;

public class ScanXan {
    public static void main(String[] args) throws IOException {

        Scanner s = null;

        try {
            s = new Scanner(new BufferedReader(new FileReader("<file name>")));

            // Scanner splits on whitespace by default, so quoted values
            // that contain spaces are still broken into separate tokens.
            while (s.hasNext()) {
                System.out.println(s.next());
                // <write for output file>
            }
        } finally {
            if (s != null) {
                s.close();
            }
        }
    }
}
Fogarty answered 15/1, 2017 at 5:19 Comment(1)
Yes @YoungHobbit, my working environment is Linux (Ubuntu 15.01), coded in Sublime Text 3. - Fogarty
W
-1
java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    int index = token.indexOf('=');
    String key = token.substring(0, index);
    String value = token.substring(index + 1);
}
Westbrooks answered 1/10, 2009 at 1:15 Comment(0)
H
-2

Have you tried splitting by '=' and creating a token out of each pair of the resulting array?

Hy answered 1/10, 2009 at 0:36 Comment(2)
This has the same problem as the .split() solution mentioned in the question. - Maimonides
@Hy This solution doesn't work, but you could do something like split on spaces, then go through each of the split strings: if one starts with ' (assuming the input is well formatted), append the following strings to it until you find one that ends with '. StringTokenizer or a state machine (or a stack, if you want to allow multiple levels of nested quotes by alternating quote types à la Python) may be more efficient, but this can work too! - Isahella
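The split-then-rejoin idea from the comment above could look roughly like this. It is a sketch under assumptions rather than a tested solution: the class SplitAndRejoin and its helpers are hypothetical names, and because the quote in this input follows the "=" (as in c='123), an odd number of quotes in a piece is used as the signal that a quoted section opens or closes, instead of checking only the first character.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndRejoin {

    // Counts the single quotes in one piece.
    private static int countQuotes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\'') {
                n++;
            }
        }
        return n;
    }

    // Assumption: quotes are balanced, unescaped and not nested.
    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder pending = null;   // non-null while a quote is open

        for (String piece : line.split(" ")) {
            if (pending != null) {
                // Inside a quoted value: glue the piece back on.
                pending.append(' ').append(piece);
                if (countQuotes(piece) % 2 == 1) {   // this piece closes the quote
                    tokens.add(pending.toString());
                    pending = null;
                }
            } else if (countQuotes(piece) % 2 == 1) {
                pending = new StringBuilder(piece);  // this piece opens a quote
            } else if (!piece.isEmpty()) {
                tokens.add(piece);                   // plain token; skip empty pieces
            }
        }
        if (pending != null) {
            tokens.add(pending.toString());          // unterminated quote: keep what we have
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a=b c='123  456' d=777 e='uij yyy zzz'"));
    }
}
```

Because the empty strings that split(" ") produces for consecutive spaces are glued back with a single space each, runs of spaces inside a quoted value are preserved.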
