How to parse strings to look like sys.argv

I would like to parse a string like this:

-o 1  --long "Some long string"  

into this:

["-o", "1", "--long", 'Some long string']

or similar.

This is different from what getopt and optparse do: both expect input that has already been split the way sys.argv is (like the output I have above). Is there a standard way to do this? Basically, this is "splitting" while keeping quoted strings together.

My best function so far:

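import csv
import io

def split_quote(string, quotechar='"'):
    '''
    Split a string on spaces, keeping quoted substrings together.

    >>> split_quote('--blah "Some argument" here')
    ['--blah', 'Some argument', 'here']

    >>> split_quote("--blah 'Some argument' here", quotechar="'")
    ['--blah', 'Some argument', 'here']
    '''
    # io.StringIO instead of csv.StringIO: the latter is not part of the
    # csv API and does not exist on Python 3.
    s = io.StringIO(string)
    reader = csv.reader(s, delimiter=" ", quotechar=quotechar)
    return list(reader)[0]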
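One caveat with the csv-based approach, shown as a rough sketch: csv.reader treats every single space as a field delimiter, so doubled or trailing spaces (as in the example input above) come back as empty fields. The name split_quote_csv is made up for this illustration, and it uses io.StringIO.

```python
import csv
import io

def split_quote_csv(string, quotechar='"'):
    # Hypothetical variant of the function above; each space is a delimiter.
    reader = csv.reader(io.StringIO(string), delimiter=" ", quotechar=quotechar)
    return next(reader, [])

raw = split_quote_csv('-o 1  --long "Some long string"  ')
# Doubled and trailing spaces show up as empty strings in `raw`.
cleaned = [field for field in raw if field]
print(raw)
print(cleaned)
```

Filtering out the empty fields also throws away a deliberately empty argument such as "", so this is only a workaround, not a real fix.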
Spickandspan answered 22/5, 2009 at 18:24 Comment(2)
My own true forgetfulness revealed: stackoverflow.com/questions/92533 has me using shlex.split. Clearly I just forgot about it. - Spickandspan
If what you actually need is "to process options" and not just "to parse strings on the command line", you could consider docs.python.org/2/library/argparse.html - Quiroz
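For the "process options" case, a minimal sketch of feeding a split string into argparse could look like this. The option names -o and --long come from the question; the int type for -o is an assumption made only for the example.

```python
import argparse
import shlex

line = '-o 1 --long "Some long string"'

# Split the string the way a POSIX shell would, then hand it to argparse
# in place of sys.argv[1:].
argv = shlex.split(line)

parser = argparse.ArgumentParser()
parser.add_argument("-o", type=int)   # assumed to be an integer option
parser.add_argument("--long")
args = parser.parse_args(argv)

print(args.o, args.long)
```

parse_args accepts any list of strings, so the parser itself doesn't care whether the list came from sys.argv or from shlex.split.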

I believe you want the shlex module.

>>> import shlex
>>> shlex.split('-o 1 --long "Some long string"')
['-o', '1', '--long', 'Some long string']
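A few more cases, as a quick sketch of the shell-style rules shlex.split follows in its default POSIX mode: single quotes, backslash-escaped spaces, and an unbalanced quote (which raises ValueError). The input strings are made up for illustration.

```python
import shlex

# Single quotes group words just like double quotes do.
single_quoted = shlex.split("--blah 'Some argument' here")

# A backslash keeps the following space as part of the word.
escaped = shlex.split(r'--name one\ word --msg "two words"')

# An unterminated quote is reported as a ValueError.
error_message = None
try:
    shlex.split('--long "unterminated')
except ValueError as exc:
    error_message = str(exc)

print(single_quoted)
print(escaped)
print(error_message)
```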
Haematozoon answered 22/5, 2009 at 18:33 Comment(3)
Thank you! I knew there was something like this! - Spickandspan
That's great, except that it doesn't seem to support Unicode strings. The doc says that Python 2.7.3 supports Unicode strings, but I'm trying it and shlex.split(u'abc 123 →') gives me a UnicodeEncodeError. - Gagliardi
I guess list(a.decode('utf-8') for a in shlex.split(u'abc 123 →'.encode('utf-8'))) will work. - Gagliardi
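The encode/decode workaround above can be wrapped in a small helper, sketched here. On Python 3, str is already Unicode and shlex.split takes it directly, so the round trip is only needed on Python 2. The name split_unicode is hypothetical.

```python
import shlex
import sys

def split_unicode(text):
    """Split a Unicode string shell-style on both Python 2 and Python 3."""
    if sys.version_info[0] >= 3:
        # Python 3: shlex.split works on str (Unicode) directly.
        return shlex.split(text)
    # Python 2: round-trip through UTF-8 bytes, as suggested in the comment.
    return [part.decode("utf-8") for part in shlex.split(text.encode("utf-8"))]

parts = split_unicode(u"abc 123 \u2192")
print(parts)
```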

Before I was aware of shlex.split, I made the following:

import sys

_WORD_DIVIDERS = set((' ', '\t', '\r', '\n'))

_QUOTE_CHARS_DICT = {
    '\\':   '\\',
    ' ':    ' ',
    '"':    '"',
    'r':    '\r',
    'n':    '\n',
    't':    '\t',
}

def _raise_type_error():
    raise TypeError("Bytes must be decoded to Unicode first")

def parse_to_argv_gen(instring):
    is_in_quotes = False
    instring_iter = iter(instring)
    join_string = instring[0:0]

    c_list = []
    c = ' '
    while True:
        # Skip whitespace
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if c not in _WORD_DIVIDERS:
                    break
                c = next(instring_iter)
        except StopIteration:
            break
        # Read word
        try:
            while True:
                if not isinstance(c, str) and sys.version_info[0] >= 3:
                    _raise_type_error()
                if not is_in_quotes and c in _WORD_DIVIDERS:
                    break
                if c == '"':
                    is_in_quotes = not is_in_quotes
                    c = None
                elif c == '\\':
                    c = next(instring_iter)
                    c = _QUOTE_CHARS_DICT.get(c)
                if c is not None:
                    c_list.append(c)
                c = next(instring_iter)
            yield join_string.join(c_list)
            c_list = []
        except StopIteration:
            yield join_string.join(c_list)
            break

def parse_to_argv(instring):
    return list(parse_to_argv_gen(instring))

This works with Python 2.x and 3.x. On Python 2.x, it works directly with byte strings and Unicode strings. On Python 3.x, it only accepts [Unicode] strings, not bytes objects.

This doesn't behave exactly the same as shell argv splitting: it also accepts \r, \n and \t as escapes for CR, LF and TAB and converts them to the real characters (shlex.split doesn't do that). So writing my own function was useful for my needs. I guess shlex.split is better if you just want plain shell-style argv splitting. I'm sharing this code in case it's useful as a baseline for doing something slightly different.
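For comparison, here is a small sketch of what shlex.split does with the same escapes in its default POSIX mode, as far as I understand its rules. Outside quotes the backslash is dropped and the next character is kept literally; inside double quotes a backslash before an ordinary letter is preserved. In neither case do you get a real TAB or LF. The inputs are made up for illustration.

```python
import shlex

# Outside quotes: the backslash disappears, 't' stays a plain letter.
unquoted = shlex.split(r'a\tb')

# Inside double quotes: the backslash is kept, giving backslash + 'n'.
quoted = shlex.split(r'"c\nd"')

print(unquoted)
print(quoted)

# Neither result contains an actual tab or newline character.
has_control_chars = any("\t" in s or "\n" in s for s in unquoted + quoted)
print(has_control_chars)
```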

Gagliardi answered 13/5, 2013 at 22:55 Comment(0)
