shlex alternative for Java

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, given the input:

one two "three four"
I'd like a split to return the tokens:
one
two
three four
Lacustrine answered 4/7, 2009 at 20:52 Comment(1)
Notably, "like the shell would process them" is a fairly hard task; shlex does it well, but many naive algorithms won't. For instance, in shell, "three four" and "three"' 'four are exactly equivalent, as is three\ four. - Kagu
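To make the comment above concrete, here is a minimal sketch of the kind of naive tokenizer it warns about. The class name NaiveSplitDemo, the method naiveSplit and the regex are hypothetical, written only for illustration; they are not taken from any library. The sketch handles the example from the question but breaks on the two equivalent spellings mentioned in the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveSplitDemo {
    // Hypothetical naive rule: a token is either a double-quoted run
    // or a run of non-whitespace characters. No escapes, no single quotes,
    // no joining of adjacent quoted and unquoted pieces.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    public static List<String> naiveSplit(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Group 1 is the content of a quoted run, group 2 a bare word.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Works for the question's example: [one, two, three four]
        System.out.println(naiveSplit("one two \"three four\""));
        // A shell sees "three"' 'four as the single word "three four";
        // the naive rule yields three separate pieces instead.
        System.out.println(naiveSplit("one two \"three\"' 'four"));
        // A shell sees three\ four as one word; here the backslash is kept
        // and the word is split at the space.
        System.out.println(naiveSplit("one two three\\ four"));
    }
}
```

The point is only that quote handling by pattern matching tends to miss concatenation and escaping, which a character-by-character state machine handles more naturally.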

I had a similar problem today, and it didn't look like any standard options such as StringTokenizer, StrTokenizer, Scanner were a good fit. However, it's not hard to implement the basics.

This example handles all the edge cases raised so far in the comments on the other answers. Be warned: I haven't checked it for full POSIX compliance yet. A Gist including unit tests is available on GitHub, released into the public domain via the Unlicense.

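import java.util.ArrayList;
import java.util.List;

public List<String> shellSplit(CharSequence string) {
    List<String> tokens = new ArrayList<String>();
    boolean escaping = false;   // previous character was a backslash outside single quotes
    char quoteChar = ' ';       // quote character that opened the current quoted run
    boolean quoting = false;    // currently inside a quoted run
    // Index of the most recent closing quote, so that '' and "" yield an empty token.
    int lastCloseQuoteIndex = Integer.MIN_VALUE;
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < string.length(); i++) {
        char c = string.charAt(i);
        if (escaping) {
            // The character after a backslash is taken literally.
            current.append(c);
            escaping = false;
        } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
            // Backslash escapes everywhere except inside single quotes.
            escaping = true;
        } else if (quoting && c == quoteChar) {
            // Closing quote: the token continues until unquoted whitespace.
            quoting = false;
            lastCloseQuoteIndex = i;
        } else if (!quoting && (c == '\'' || c == '"')) {
            // Opening quote.
            quoting = true;
            quoteChar = c;
        } else if (!quoting && Character.isWhitespace(c)) {
            // Unquoted whitespace ends a token; keep empty tokens that came from quotes.
            if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                tokens.add(current.toString());
                current = new StringBuilder();
            }
        } else {
            current.append(c);
        }
    }
    // Flush the final token, including a trailing quoted empty string.
    if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
        tokens.add(current.toString());
    }

    return tokens;
}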
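As a quick check of the edge cases mentioned in the comments on this page, the method can be wrapped in a small class and run against them. This is only a sketch: the class name ShellSplitDemo is hypothetical, and the method body is the shellSplit logic from above, made static so that main can call it. The last case shows one place where the behavior differs from a POSIX shell, which keeps a backslash inside double quotes unless it precedes $, `, ", \ or a newline.

```java
import java.util.ArrayList;
import java.util.List;

public class ShellSplitDemo {
    // Same state machine as shellSplit above, static for use from main.
    public static List<String> shellSplit(CharSequence string) {
        List<String> tokens = new ArrayList<>();
        boolean escaping = false;
        char quoteChar = ' ';
        boolean quoting = false;
        int lastCloseQuoteIndex = Integer.MIN_VALUE;
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (escaping) {
                current.append(c);
                escaping = false;
            } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
                escaping = true;
            } else if (quoting && c == quoteChar) {
                quoting = false;
                lastCloseQuoteIndex = i;
            } else if (!quoting && (c == '\'' || c == '"')) {
                quoting = true;
                quoteChar = c;
            } else if (!quoting && Character.isWhitespace(c)) {
                if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                    tokens.add(current.toString());
                    current = new StringBuilder();
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The question's example: three tokens, the last one "three four".
        System.out.println(shellSplit("one two \"three four\""));
        // Adjacent quoted pieces and an escaped space each form one token.
        System.out.println(shellSplit("\"three\"' 'four"));
        System.out.println(shellSplit("three\\ four"));
        // Mixed quoting is concatenated into the single token foobarbaz.
        System.out.println(shellSplit("\"foo\"'bar'baz"));
        // A quoted empty string yields one empty token, so the size is 1.
        System.out.println(shellSplit("''").size());
        // Deviation from POSIX: "a\b" becomes ab here, while a shell keeps a\b.
        System.out.println(shellSplit("\"a\\b\""));
    }
}
```

If strict POSIX behavior matters, the double-quote branch would need to restrict which characters a backslash may escape; that change is not part of the code above.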
Pectize answered 22/12, 2013 at 0:44 Comment(8)
Would you consider attaching a license to this (or explicitly donating it to the public domain)? - Kagu
Ah, there it is, last line of this page: user contributions licensed under cc by-sa 3.0 with attribution required. - Surveillance
@RayMyers: We still need to know whether this is your own work, otherwise the license is unknown. Also, the CC-BY-SA license isn't completely compatible with Hadoop's Apache license (I would need to use it unmodified). If you'd dedicate this code under the Unlicense these problems go away, otherwise I'll have to write something similar from scratch. ...I wish SO would change their default license. - Surveillance
bukzor and others: Thanks for pointing this out. Yes, it is my work. I've updated it to be explicitly public domain. - Pectize
@RayMyers: While this is good enough for me (thanks!), you should know that the expert advice I've often seen is that the 'public domain' is a legal concept on a very shaky foundation (e.g. it doesn't even exist outside the US), and any work without a license (including those "released to the public domain") is best considered to have NoLicense. The license closest to what you are trying to do is the Unlicense. - Surveillance
While it surprises me, this appears to be the best code Java has to offer for this problem. Enjoy your bounty :) - Surveillance
Beware: this code improperly handles quoted empty strings, e.g. the input "''" will get parsed to an empty list rather than a list containing "". - Wes
@j3h: Good catch. Updated and added unit tests in the Gist. - Pectize

Look at Apache Commons Lang:

org.apache.commons.lang.text.StrTokenizer should be able to do what you want:

new StrTokenizer("one two \"three four\"", ' ', '"').getTokenArray();
Valtin answered 4/7, 2009 at 23:11 Comment(4)
Unfortunately, unlike shlex, commons.lang is not POSIX compatible. (-> (StrTokenizer. "\"foo\"'bar'baz") (.getTokenList)) returns a single entry containing "foo"'bar'baz, as opposed to the (correct) foobarbaz. - Kagu
@CharlesDuffy do you know the true answer? - Surveillance
@bukzor, that presumes that there is one. To my knowledge, such a tool has not been written at this time, short of using Python's shlex from Java via Jython (possible, but rather a large dependency chain to pull in). - Kagu
...though the answer from @RayMyers looks like a possible candidate. - Kagu

I had success with the following Scala code, which uses fastparse. I can't vouch for it being complete:

val kvParser = {
  import fastparse._
  import NoWhitespace._
  def nonQuoteChar[_:P] = P(CharPred(_ != '"'))
  def quotedQuote[_:P] = P("\\\"")
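  // Try the escaped quote first; otherwise nonQuoteChar consumes the backslash and the quote ends the string.
  def quotedElement[_:P] = P(quotedQuote | nonQuoteChar)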
  def quotedContent[_:P] = P(quotedElement.rep)
  def quotedString[_:P] = P("\"" ~/ quotedContent.! ~ "\"")
  def alpha[_:P] = P(CharIn("a-zA-Z"))
  def digit[_:P] = P(CharIn("0-9"))
  def hyphen[_:P] = P("-")
  def underscore[_:P] = P("_")
  def bareStringChar[_:P] = P(alpha | digit | hyphen | underscore)
  def bareString[_:P] = P(bareStringChar.rep.!)
  def string[_:P] = P(quotedString | bareString)
  def kvPair[_:P] = P(string ~ "=" ~ string)
  def commaAndSpace[_:P] = P(CharIn(" \t\n\r").rep ~ "," ~ CharIn(" \t\n\r").rep)
  def kvPairList[_:P] = P(kvPair.rep(sep = commaAndSpace))
  def fullLang[_:P] = P(kvPairList ~ End)

  def res(str: String) = {
    parse(str, fullLang(_))
  }

  res _
}
Dolomite answered 19/7, 2020 at 22:41 Comment(0)
