shlex alternative for Java

Asked 4/7, 2009 at 20:52 Answered 19/7, 2020 at 22:41

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, if I send:

one two "three four"

and perform a split, I'd like to receive the tokens

one
two
three four

Lacustrine answered 4/7, 2009 at 20:52 Comment(1)

Notably -- "like the shell would process them" is a fairly hard task; shlex does it well, but many naive algorithms won't. For instance, in shell, "three four" and "three"' 'four are exactly equivalent, as is three\ four. – Kagu 5/2, 2013 at 17:9

I had a similar problem today, and none of the standard options such as StringTokenizer, StrTokenizer, or Scanner looked like a good fit. However, it's not hard to implement the basics.

This example handles all the edge cases raised so far in comments on the other answers. Be warned, I haven't checked it for full POSIX compliance yet. A Gist including unit tests is available on GitHub, released into the public domain via the Unlicense.

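import java.util.ArrayList;
import java.util.List;

// Splits a string into tokens roughly the way a shell would.
public List<String> shellSplit(CharSequence string) {
    List<String> tokens = new ArrayList<String>();
    boolean escaping = false;   // previous character was a backslash
    char quoteChar = ' ';       // active quote character; only meaningful while quoting
    boolean quoting = false;
    int lastCloseQuoteIndex = Integer.MIN_VALUE; // lets '' and "" produce an empty token
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        if (escaping) {
            // Character after a backslash is taken literally.
            current.append(c);
            escaping = false;
        } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
            // Backslash escapes everywhere except inside single quotes.
            escaping = true;
        } else if (quoting && c == quoteChar) {
            // Closing quote: remember where it was in case the token is empty.
            quoting = false;
            lastCloseQuoteIndex = i;
        } else if (!quoting && (c == '\'' || c == '"')) {
            // Opening quote.
            quoting = true;
            quoteChar = c;
        } else if (!quoting && Character.isWhitespace(c)) {
            // Unquoted whitespace ends the current token, if there is one.
            if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                tokens.add(current.toString());
                current = new StringBuilder();
            }
        } else {
            current.append(c);
        }
    }
    // Flush the final token (including an empty quoted one at the very end).
    if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
        tokens.add(current.toString());
    }

    return tokens;
}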
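
Since the answer flags that POSIX compliance is unchecked, here is a minimal, self-contained sketch of one way to tighten a single known difference: inside double quotes, POSIX shells (and shlex in POSIX mode) keep a backslash literally unless it precedes a character it can actually escape. The class name ShellSplitSketch, the method split, and the constant DQ_ESCAPABLE are hypothetical names chosen for illustration, not part of the answer's Gist. The sketch makes simplifying assumptions: backslash-newline line continuation is not handled, a trailing unquoted backslash is kept literally, and an unterminated quote throws.

```java
import java.util.ArrayList;
import java.util.List;

public class ShellSplitSketch {
    // Assumption: inside double quotes a backslash only escapes these characters
    // (backslash-newline continuation is deliberately left out of this sketch).
    private static final String DQ_ESCAPABLE = "\"\\$`";

    public static List<String> split(CharSequence input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inToken = false; // true once a token has started, even if still empty ('' or "")
        char quote = 0;          // 0 = unquoted, otherwise '\'' or '"'
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote == '\'') {
                // Single quotes: everything is literal until the closing quote.
                if (c == '\'') quote = 0; else current.append(c);
            } else if (quote == '"') {
                if (c == '"') {
                    quote = 0;
                } else if (c == '\\' && i + 1 < input.length()
                        && DQ_ESCAPABLE.indexOf(input.charAt(i + 1)) >= 0) {
                    current.append(input.charAt(++i)); // keep escaped char, drop the backslash
                } else {
                    current.append(c); // includes a backslash that escapes nothing
                }
            } else if (c == '\\' && i + 1 < input.length()) {
                current.append(input.charAt(++i)); // unquoted backslash escapes any character
                inToken = true;
            } else if (c == '\'' || c == '"') {
                quote = c;
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c); // ordinary character (or a trailing lone backslash)
                inToken = true;
            }
        }
        if (quote != 0) {
            throw new IllegalArgumentException("Unterminated quote");
        }
        if (inToken) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("one two \"three four\"")); // [one, two, three four]
        System.out.println(split("\"three\"' 'four"));       // [three four]
        System.out.println(split("\"foo\"'bar'baz"));        // [foobarbaz]
        System.out.println(split("\"a\\b\""));               // [a\b]
        System.out.println(split("''").size());              // 1
    }
}
```

The inToken flag plays the same role as lastCloseQuoteIndex in the answer above: it records that a token has begun even when nothing has been appended yet, so '' and "" yield an empty string rather than disappearing.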

Pectize answered 22/12, 2013 at 0:44 Comment(8)

Would you consider attaching a license to this (or explicitly donating it to the public domain)? – Kagu 5/3, 2014 at 22:27

Ah, there it is, last line of this page: user contributions licensed under cc by-sa 3.0 with attribution required – Surveillance 10/3, 2014 at 17:58

@RayMyers: We still need to know whether this is your own work, otherwise the license is unknown. Also, the CC-BY-SA license isn't completely compatible with Hadoop's Apache license (I would need to use it unmodified). If you'd dedicate this code under the Unlicense these problems go away, otherwise I'll have to write something similar from scratch. ...I wish SO would change their default license. – Surveillance 10/3, 2014 at 19:5

bukzor and others: Thanks for pointing this out. Yes, it is my work. I've updated it to be explicitly public domain. – Pectize 12/3, 2014 at 20:2

@RayMyers: While this is good enough for me (thanks!), you should know that the expert advice I've often seen is that the 'public domain' is a legal concept on a very shaky foundation (e.g. it doesn't even exist outside the US), and any work without a license (including those "released to the public domain") is best considered to have NoLicense. The license closest to what you are trying to do is the Unlicense. – Surveillance 12/3, 2014 at 21:32

While it surprises me, this appears to be the best code Java has to offer for this problem. Enjoy your bounty :) – Surveillance 12/3, 2014 at 21:32

Beware: this code improperly handles quoted empty strings. e.g. the input "''" will get parsed to an empty list rather than a list containing "". – Wes 29/8, 2019 at 17:59

@j3h: Good catch. Updated and added unit tests in the Gist. – Pectize 9/9, 2019 at 20:38

Look at Apache Commons Lang:

org.apache.commons.lang.text.StrTokenizer should be able to do what you want:

new StrTokenizer("one two \"three four\"", ' ', '"').getTokenArray();

Valtin answered 4/7, 2009 at 23:11 Comment(4)

Unfortunately, unlike shlex, commons.lang is not POSIX compatible. (-> (StrTokenizer. "\"foo\"'bar'baz") (.getTokenList)) returns a single entry containing "foo"'bar'baz, as opposed to the (correct) foobarbaz. – Kagu 5/2, 2013 at 17:2

@CharlesDuffy do you know the true answer? – Surveillance 5/3, 2014 at 21:41

@bukzor, that presumes that there is one. To my knowledge, such a tool has not been written at this time, short of using Python's shlex from Java via Jython (possible, but rather a large dependency chain to pull in). – Kagu 5/3, 2014 at 22:25

...though the answer from @RayMyers looks like a possible candidate. – Kagu 5/3, 2014 at 22:26

I had success with the following Scala code, which uses fastparse. I can't vouch for its completeness:

val kvParser = {
  import fastparse._
  import NoWhitespace._
  def nonQuoteChar[_:P] = P(CharPred(_ != '"'))
  def quotedQuote[_:P] = P("\\\"")
  // Try the escaped quote first; otherwise nonQuoteChar consumes the backslash.
  def quotedElement[_:P] = P(quotedQuote | nonQuoteChar)
  def quotedContent[_:P] = P(quotedElement.rep)
  def quotedString[_:P] = P("\"" ~/ quotedContent.! ~ "\"")
  def alpha[_:P] = P(CharIn("a-zA-Z"))
  def digit[_:P] = P(CharIn("0-9"))
  def hyphen[_:P] = P("-")
  def underscore[_:P] = P("_")
  def bareStringChar[_:P] = P(alpha | digit | hyphen | underscore)
  def bareString[_:P] = P(bareStringChar.rep.!)
  def string[_:P] = P(quotedString | bareString)
  def kvPair[_:P] = P(string ~ "=" ~ string)
  def commaAndSpace[_:P] = P(CharIn(" \t\n\r").rep ~ "," ~ CharIn(" \t\n\r").rep)
  def kvPairList[_:P] = P(kvPair.rep(sep = commaAndSpace))
  def fullLang[_:P] = P(kvPairList ~ End)

  def res(str: String) = {
    parse(str, fullLang(_))
  }

  res _
}

Dolomite answered 19/7, 2020 at 22:41 Comment(0)
