Split a string each time a Deterministic Finite Automaton reaches a final state?

I have a problem that can be solved by iteration, but I'm wondering if there's a more elegant solution using regular expressions and split().

I have a string (which Excel puts on the clipboard) that is, in essence, comma delimited. The caveat is that when a cell value contains a comma, the whole cell is surrounded with quotation marks (presumably to escape the commas within that value). An example string is as follows:

123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"

Now, I want to elegantly split this string into individual cells, but the catch is that I cannot use a normal split expression with a comma as the delimiter, because it will divide cells that contain a comma in their value. Another way of looking at this problem is that I can ONLY split on a comma if there is an EVEN number of quotation marks preceding the comma.
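As a point of reference, the even-quote rule described above can be sketched as a plain loop. This is only an illustrative Python sketch, not the asker's code: the function name `split_cells` is made up here, and it assumes quotes always come in balanced pairs and are never escaped inside a cell.

```python
def split_cells(line):
    """Split on commas that are preceded by an even number of quotes."""
    cells = []
    current = []
    in_quotes = False  # True while an odd number of quotes has been seen
    for ch in line:
        if ch == '"':
            # A quote toggles the parity; the quote itself is dropped.
            in_quotes = not in_quotes
        elif ch == ',' and not in_quotes:
            # Even number of quotes so far: this comma ends the cell.
            cells.append(''.join(current))
            current = []
        else:
            current.append(ch)
    cells.append(''.join(current))  # last cell has no trailing comma
    return cells
```

Called on the example string, this would return the ten cells with the surrounding quotation marks removed, for example `12,345` as a single cell.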

This is easy to solve with a loop, but I'm wondering if there's a regular-expression split function capable of capturing this logic. In an attempt to solve this problem, I constructed the Deterministic Finite Automaton (DFA) for the logic.

[Image: diagram of the DFA]

The question now is reduced to the following: is there a way to split this string such that a new array element (corresponding to /s) is produced each time the final state (state 4 here) is reached in a DFA?

Puff answered 16/12, 2010 at 15:5 Comment(0)

Using regex (unescaped): (?:(?:"[^"]*")|(?:[^,]*))

Use that with Regex.Matches() (.NET) or its analog on other platforms.

You could further expand the above to this: ^(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*))(?:,(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*)))*$

This will parse the whole string in one shot, but you need named groups and multiple captures per group for this to work (.NET supports both).
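For readers outside .NET, the per-cell matching approach can be sketched in Python with `re.finditer`. This is a rough sketch under stated assumptions, not the answer's exact pattern: the prefix `(?:^|(?<=,))` is an addition made here so that a cell can only start at the beginning of the string or right after a comma, which avoids the spurious empty matches the bare pattern would produce at each comma. The names `FIELD` and `match_cells` are hypothetical.

```python
import re

# Assumption: a cell is either a quoted run or a run of non-comma characters.
# The (?:^|(?<=,)) anchor is added for this sketch; it is not in the answer.
FIELD = re.compile(r'(?:^|(?<=,))(?:"([^"]*)"|([^,]*))')

def match_cells(line):
    """Collect one value per match, without the surrounding quotes."""
    cells = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        # Exactly one of the two groups participates in each match.
        cells.append(quoted if quoted is not None else bare)
    return cells
```

Because the anchor uses a look-behind, this variant would not carry over to VBScript as written.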

Cleancut answered 16/12, 2010 at 15:10 Comment(1)
I'm in VBA for this one, I think, so I'll have to resort to VBScript syntax. Lucky for me, I believe they are extremely similar (although the VBScript implementation does not support the look-behind capabilities of .NET). I can't wait to test it, thanks! - Puff

Eligible commas are also followed by an even number of quotes, and VBScript does support lookaheads. Try splitting on this:

",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
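The pattern above is written as a VB string literal, so each `""` stands for a single quotation mark. A rough Python equivalent, offered only as a sketch, uses the same lookahead with `re.split`; the names `SPLIT_COMMA` and `split_cells_lookahead` are made up for illustration. Note that splitting leaves the surrounding quotes on quoted cells, so the sketch strips them in a second step.

```python
import re

# Split on a comma only if an even number of quotes follows it up to the end.
SPLIT_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_cells_lookahead(line):
    """Split with the lookahead pattern, then strip enclosing quotes."""
    cells = SPLIT_COMMA.split(line)
    return [
        c[1:-1] if len(c) >= 2 and c[0] == '"' and c[-1] == '"' else c
        for c in cells
    ]
```

One design caveat: the lookahead rescans the rest of the string at every comma, so on very long lines a single-pass loop is likely to be faster.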
Banian answered 17/12, 2010 at 3:31 Comment(0)
