Parse CSV, ignoring commas inside string literals in VBA?
Asked Answered
I

14

18

I have a VBA application that runs every day. It checks a folder where CSVs are downloaded automatically, and adds their contents to a database. When parsing them, I realized that certain values had commas as a part of their name. These values were contained in string literals.

So I'm trying to figure out how to parse this CSV and ignore commas that are contained in string literals. For example...

1,2,3,"This should,be one part",5,6,7

should return:

1
2
3
"This should,be one part"
5
6
7

I have been using VBA's split() function, because I don't wanna reinvent the wheel, but if I have to I guess I'll do something else.

Any suggestions would be appreciated.

Indispose answered 21/7, 2011 at 18:20 Comment(6)
You will have to write a parser to do this; Split won't handle it. One good example of such a parser is the Python csv module's Reader class.Always
I was afraid of that. I just did a bit of Googling. I guess you can use regular expressions in VBA. I might do that.Indispose
Yes, regular expressions are also an option and may work, but it depends on how complicated the format of your CSV is...Always
I've looked at the CSV, and I will only run into one string literal. This string literal may or may not have a comma. The format of the CSV is int,int,"String literal, will have at most one comma", and more values that don't really matter. Would something like [^","] work? Edit: there should be asterisks on each side of the column for wildcards, but the formatting is messing me upIndispose
Here is an excellent post describing how to regex in Excel VBAMahmud
Here is an excellent SO post on how to use regexp in Excel VBA.Mahmud
C
12

A simple regex for parsing a CSV line, assuming no quotes inside quoted fields, is:

"[^"]*"|[^,]*

Each match will return a field.
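A minimal sketch of how that pattern might be used from VBA (late-bound, with the quotes doubled to embed them in a VBA string literal; the helper name is illustrative):

Function SplitCsvLine(line As String) As Collection
    Dim re As Object, m As Object
    Dim fields As New Collection
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = """[^""]*""|[^,]*"
    re.Global = True
    For Each m In re.Execute(line)
        'Skip the zero-length matches the pattern also produces
        '(note this drops genuinely empty fields such as the middle one in 1,,3)
        If m.Length > 0 Then fields.Add m.Value
    Next
    Set SplitCsvLine = fields
End Function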

Cervical answered 21/7, 2011 at 18:46 Comment(2)
What if the field also has " in it?Mallina
Awesome - but is it possible to ignore " characters that are preceded by a \, like "This is a \"single\" result!" ...?Schweitzer
S
17

The first way to solve this problem is to look at the structure of the line from the CSV file (int,int,"String literal, will have at most one comma", etc.). A naive solution would be (assuming that the line doesn't contain any semicolons):

Function splitLine1(line As String) As String()

   Dim temp() As String
   'Splits the line in three. The string delimited by " will be at temp(1)
   temp = Split(line, Chr(34)) 'chr(34) = "

   'Replaces the commas in the numeric fields by semicolons
   temp(0) = Replace(temp(0), ",", ";")
   temp(2) = Replace(temp(2), ",", ";")

   'Joins the temp array with quotes and then splits the result using the semicolons
   splitLine1 = Split(Join(temp, Chr(34)), ";")

End Function

This function only solves this particular problem. Another way to do the job is using the regular expression object from VBScript.

Function splitLine2(line As String) As String()

    Dim regex As Object
    Set regex = CreateObject("vbscript.regexp")
    regex.IgnoreCase = True
    regex.Global = True

    'This pattern matches only commas outside quotes
    'Pattern = ",(?=([^"]*"[^"]*")*(?![^"]*"))"
    regex.Pattern = ",(?=([^" & Chr(34) & "]*" & Chr(34) & "[^" & Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"

    'regex.replaces will replace the commas outside quotes with semicolons and then the
    'Split function will split the result based on the semicollons
    splitLine2 = Split(regex.Replace(line, ";"), ";")

End Function

This method seems much more cryptic, but it does not depend on the structure of the line.

You can read more about regular expression patterns in VBScript here.
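For example, with the line from the question, either function returns the same seven fields:

Dim parts() As String
parts = splitLine2("1,2,3,""This should,be one part"",5,6,7")
Debug.Print parts(3)    'prints: "This should,be one part" (the quotes are kept)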

Spanos answered 7/11, 2012 at 12:43 Comment(2)
I love this answer. You can replace the comma in the Pattern with any character and make this capable of parsing by any delimiter (I added a second argument to the function). You should also probably strip out any semicolons in the initial line, or you'll get problems that way. I also removed any doubled double quotes "", not sure if I needed to do that, but no looking back now that it works.Exhibitioner
I believe this is the only answer that handles the scenario where double quotes are in the data. MRAB's pattern is not sufficient. thx.Pretor
H
12

@Gimp said...

The current answers do not contain enough detail.

I'm running into the same problem. Looking for more detail in this answer.

To elaborate on @MRAB's answer:

Function ParseCSV(FileName)
    Dim Regex       'As VBScript_RegExp_55.RegExp
    Dim MatchColl   'As VBScript_RegExp_55.MatchCollection
    Dim Match       'As VBScript_RegExp_55.Match
    Dim FS          'As Scripting.FileSystemObject
    Dim Txt         'As Scripting.TextStream
    Dim CSVLine
    ReDim ToInsert(0)

    Set FS = CreateObject("Scripting.FileSystemObject")
    Set Txt = FS.OpenTextFile(FileName, 1, False, -2)
    Set Regex = CreateObject("VBScript.RegExp")

    Regex.Pattern = """[^""]*""|[^,]*"    '<- MRAB's answer
    Regex.Global = True

    Do While Not Txt.AtEndOfStream
        ReDim ToInsert(0)
        CSVLine = Txt.ReadLine
        For Each Match In Regex.Execute(CSVLine)
            If Match.Length > 0 Then
                ReDim Preserve ToInsert(UBound(ToInsert) + 1)
                ToInsert(UBound(ToInsert) - 1) = Match.Value
            End If
        Next
        InsertArrayIntoDatabase ToInsert
    Loop
    Txt.Close
End Function

You need to customize the InsertArrayIntoDatabase Sub for your own table. Mine has several text fields named f00, f01, etc...

Sub InsertArrayIntoDatabase(a())
    Dim rs As DAO.Recordset
    Dim i, n
    Set rs = CurrentDb().TableDefs("tbl").OpenRecordset()
    rs.AddNew
    For i = LBound(a) To UBound(a)
        n = "f" & Format(i, "00") 'fields in table are f00, f01, f02, etc..
        rs.Fields(n) = a(i)
    Next
    rs.Update
End Sub

Note that instead of using CurrentDb() in InsertArrayIntoDatabase(), you should really use a global variable that gets set to the value of CurrentDb() before ParseCSV() runs, because running CurrentDb() in a loop is very slow, especially on a very large file.
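A minimal sketch of that arrangement (the variable name, procedure name and path are just for illustration); InsertArrayIntoDatabase would then open its recordset from db rather than calling CurrentDb():

Private db As DAO.Database   'module-level handle, set once per import run

Sub RunImport()
    Set db = CurrentDb()                      'resolve the database handle a single time
    ParseCSV "C:\Downloads\todays_file.csv"   'hypothetical path to a downloaded CSV
End Sub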

Houston answered 10/11, 2012 at 4:41 Comment(2)
Thanks for posting the answer, I awarded you the bounty.Rimma
To open a recordset for every line is slow. Better pass the DAO.Recordset handle down to InsertArrayIntoDatabase().Perambulate
A
3

If you are working with MS Access tables, there are advantages in simply importing text from disk. For example:

''If you have a reference to the Windows Script Host Object Model
Dim fs As New FileSystemObject
Dim ts As TextStream

''For late binding
''Dim fs As Object
''Dim ts As Object
''Set fs=CreateObject("Scripting.FileSystemObject")

Set ts = fs.CreateTextFile("z:\docs\import.csv", True)

sData = "1,2,3,""This should,be one part"",5,6,7"

ts.Write sData
ts.Close

''Just for testing, your table will already exist
''sSQL = "Create table Imports (f1 int, f2 int, f3 int, f4 text, " _
''     & "f5 int, f6 int, f7 int)"
''CurrentDb.Execute sSQL

''The fields will be called F1,F2 ... Fn in the text file
sSQL = "INSERT INTO Imports SELECT * FROM " _
     & "[text;fmt=delimited;hdr=no;database=z:\docs\].[import.csv]"
CurrentDb.Execute sSQL
Arbe answered 5/11, 2012 at 11:11 Comment(0)
M
2

I know this is an old post, but thought this may help others. This was plagiarized/revised from http://n3wt0n.com/blog/comma-separated-values-and-quoted-commas-in-vbscript/, but works really well and is set as a function that you can pass your input line to.

Function SplitCSVLineToArray(Line, RemoveQuotes) 'Pass it a line and whether or not to remove the quotes
    ReplacementString = "#!#!#"  'Random String that we should never see in our file
    LineLength = Len(Line)
    InQuotes = False
    NewLine = ""
    For x = 1 to LineLength 
        CurrentCharacter = Mid(Line,x,1)
        If CurrentCharacter = Chr(34) then  
            If InQuotes then
                InQuotes = False
            Else
                InQuotes = True
            End If
        End If
        If InQuotes Then 
            CurrentCharacter = Replace(CurrentCharacter, ",", ReplacementString)
        End If
        NewLine = NewLine & CurrentCharacter
    Next    
    LineArray = split(NewLine,",")
    For x = 0 to UBound(LineArray)
        LineArray(x) = Replace(LineArray(x), ReplacementString, ",")
        If RemoveQuotes = True then 
            LineArray(x) = Replace(LineArray(x), Chr(34), "")
        End If
    Next 
    SplitCSVLineToArray = LineArray
End Function
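For example, with the line from the question:

Dim fields
fields = SplitCSVLineToArray("1,2,3,""This should,be one part"",5,6,7", True)
Debug.Print fields(3)    'prints: This should,be one part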
Minstrelsy answered 11/9, 2017 at 12:29 Comment(0)
M
1

I realize this is an old post, but I just bumped into it looking for a solution to the same problem the OP had, so the thread is still relevant.

To import data from a CSV, I add a query to a worksheet

wksTarget.QueryTables.Add Connection:=strConn, Destination:=wksTarget.Range("A1")

then set the appropriate QueryTable parameters (e.g. Name, FieldNames, RefreshOnFileOpen, etc.).

Querytables can handle various delimiters via the TextFileCommaDelimiter, TextFileSemiColonDelimiter and others. And there are a number of other parameters (TextfilePlatform, TextFileTrailingMinusNumbers, TextFileColumnTypes, TextFileDecimalSeparator, TextFileStartRow, TextFileThousandsSeparator) that handle source file idiosyncrasies.

Relevant to the OP, QueryTables also has a parameter designed to handle commas that are within double quotes: TextFileTextQualifier = xlTextQualifierDoubleQuote.

I find QueryTables much simpler than writing code to import the file, split/parse strings, or use regular expressions.

All together, a sample code snippet would look something like this:

    strConn = "TEXT;" & "C:\Desktop\SourceFile.CSV"
    varDataTypes = Array(5, 1, 1, 1, 1, 1, 5, 5)
    With wksTarget.QueryTables.Add(Connection:=strConn, _ 
         Destination:=wksTarget.Range("A1"))
        .Name = "ImportCSV"
        .FieldNames = True
        .RefreshOnFileOpen = False
        .SaveData = True
        .TextFilePlatform = xlMSDOS
        .TextFileStartRow = 1
        .TextFileParseType = xlDelimited
        .TextFileCommaDelimiter = True
        .TextFileTextQualifier = xlTextQualifierDoubleQuote
        .TextFileColumnDataTypes = varDataTypes
        .Refresh BackgroundQuery:=False
    End With

I prefer to delete the QueryTable once the data is imported (wksTarget.QueryTables("ImportCSV").Delete), but I suppose it could be created just once and then simply refreshed if the source and destinations for the data don't change.

Maximalist answered 13/1, 2017 at 19:53 Comment(0)
N
1

I made another variant of a solution for parsing CSV files with "quoted" text strings that may contain delimiters, like a comma inside the double quotes. This method doesn't require regular expressions or any other add-ins. Also, this code deals with multiple commas in between the quotes. Here is a subroutine for testing:

Sub SubstituteBetweenQuotesSub()
    'In-string character replacement function by Maryan Hutsul   1/29/2019
    Dim quote, quoteTwo As Integer
    Dim oddEven As Integer
    Dim i, counter As Integer
    Dim byteArray() As Byte

    'LineItems are lines of text read from CSV file, or any other text string
    LineItems = ",,,2019NoApocalypse.ditamap,[email protected],Approver,""JC, ,Son"",Reviewer,[email protected],""God, All-Mighty,"",2019-01-29T08:47:29.290-05:00"

    quote = 1
    oddEven = 0

    Do Until quote = 0
        quote = InStr(quote, LineItems, Chr(34))
        quoteTwo = InStr(quote + 1, LineItems, Chr(34))

        oddEven = oddEven + 1
        If oddEven Mod 2 = 1 And quote <> 0 Then
            counter = 0
            For i = quote To quoteTwo
                byteArray = StrConv(LineItems, vbFromUnicode)
                If i <> 0 Then
                    If byteArray(i - 1) = 44 Then   '44 represents comma, can also do Chr(44)
                        counter = counter + 1
                    End If
                End If
            Next i

            LineItems = Left(LineItems, quote - 1) & Replace(LineItems, ",", ";", quote, counter)
            quote = quote + 1
        ElseIf quote <> 0 Then
            quote = quote + 1
        End If
    Loop

End Sub

Here is a function to which you can pass lines from .csv, .txt or any other text files:

Function SubstituteBetweenQuotes(LineItems)
    'In-string character replacement function by Maryan Hutsul   1/29/2019
    'LineItems are lines of text read from CSV file, or any other text string
    Dim quote, quoteTwo As Integer
    Dim oddEven As Integer
    Dim i, counter As Integer
    Dim byteArray() As Byte

    quote = 1
    oddEven = 0

    Do Until quote = 0
        quote = InStr(quote, LineItems, Chr(34))
        quoteTwo = InStr(quote + 1, LineItems, Chr(34))

        oddEven = oddEven + 1
        If oddEven Mod 2 = 1 And quote <> 0 Then
            counter = 0
            For i = quote To quoteTwo
                byteArray = StrConv(LineItems, vbFromUnicode)
                If i <> 0 Then
                    If byteArray(i - 1) = 44 Then   '44 represents "," comma, can also do Chr(44)
                        counter = counter + 1
                    End If
                End If
            Next i

            LineItems = Left(LineItems, quote - 1) & Replace(LineItems, ",", ";", quote, counter)
            quote = quote + 1
        ElseIf quote <> 0 Then
            quote = quote + 1
        End If
    Loop

    SubstituteBetweenQuotes = LineItems

End Function

And below is the code for reading a CSV file using that function:

Dim fullFilePath As String
Dim i As Integer

'fullFilePath - full link to your input CSV file
Open fullFilePath For Input As #1
row_number = 0
column_number = 0
'EOF - End Of File (1) - file #1
Do Until EOF(1)
    Line Input #1, LineFromFile
    LineItems = Split(SubstituteBetweenQuotes(LineFromFile), ",")
    For i = LBound(LineItems) To UBound(LineItems)
        ActiveCell.Offset(row_number, i).Value = LineItems(i)
    Next i
    row_number = row_number + 1
Loop
Close #1

All delimiters and the replacement character may be modified to suit your needs. I hope this is useful, as I had quite a journey solving some problems with CSV imports.

Nansen answered 29/1, 2019 at 22:15 Comment(0)
A
1

We had a similar CSV parsing challenge in Excel recently, and implemented a solution adapted from JavaScript code for parsing CSV data:

Function SplitCSV(csvText As String, delimiter As String) As String()

    ' Create a regular expression to parse the CSV values
    Dim RegEx As New RegExp

    ' Create pattern which will match each column in the CSV, with submatches for each of the groups in the regex
    ' Match Groups:  Delimiter            Quoted fields                  Standard fields
    RegEx.Pattern = "(" + delimiter + "|^)(?:\""([^\""]*(?:\""\""[^\""]*)*)\""|([^\""\""" + delimiter + """]*))"
    RegEx.Global = True
    RegEx.IgnoreCase = True

    ' Create an array to hold all pattern matches (i.e. columns)
    Dim Matches As MatchCollection
    Set Matches = RegEx.Execute(csvText)

    ' Create an array to hold output data
    Dim Output() As String

    ' Create int to track array location when iterating
    Dim i As Integer
    i = 0

    ' Manually add blank if first column is blank, since VBA regex misses this
    If csvText Like ",*" Then
        ReDim Preserve Output(i)
        Output(i) = ""
        i = i + 1
    End If

    ' Iterate over all pattern matches and get values into output array
    Dim Match As Match
    Dim MatchedValue As String
    For Each Match In Matches

        ' Check to see which kind of value we captured (quoted or unquoted)
        If (Len(Match.SubMatches(1)) > 0) Then
            ' We found a quoted value. When we capture this value, unescape any double quotes
            MatchedValue = Replace(Match.SubMatches(1), """""", """")
        Else
            ' We found a non-quoted value
            MatchedValue = Match.SubMatches(2)
        End If

        ' Now that we have our value string, let's add it to the data array
        ReDim Preserve Output(i)
        Output(i) = MatchedValue
        i = i + 1

    Next Match

    ' Return the parsed data
    SplitCSV = Output

End Function
Autorotation answered 14/5, 2019 at 9:45 Comment(1)
For this to work with a regular CSV with rows, make sure you replace the row delimiter (carriage return with linefeed) with commas: Replace(str, vbCrLf, ",")Brambly
E
0

Taking your comments into account, you could take the easy way out here (a rough sketch follows the list):

  • split on " --> gives you 3 or more entries (could be more due to double quotes inside the string literal)
  • split the first part on ,
  • keep parts 2 to n-1 together (this is your string literal)
  • split the last part on ,
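A rough sketch of those steps, assuming exactly one quoted field that is neither the first nor the last field and contains no embedded double quotes (the helper name is just for illustration):

Function SplitWithOneQuotedField(line As String) As Collection
    Dim parts() As String, pre() As String, post() As String
    Dim result As New Collection
    Dim i As Long

    parts = Split(line, Chr(34))              'Chr(34) = " so parts(0) = before, parts(1) = the literal, parts(2) = after

    pre = Split(parts(0), ",")
    For i = LBound(pre) To UBound(pre) - 1    'last element is empty because parts(0) ends with a comma
        result.Add pre(i)
    Next

    result.Add parts(1)                       'the quoted field, with its quotes stripped

    post = Split(parts(2), ",")
    For i = LBound(post) + 1 To UBound(post)  'first element is empty because parts(2) starts with a comma
        result.Add post(i)
    Next

    Set SplitWithOneQuotedField = result
End Function

Called with the example line from the question, it returns the seven values, with the string literal as a single item.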
Escutcheon answered 21/7, 2011 at 18:47 Comment(1)
I sort of need the contents of the split array to have the same format, because I'm creating INSERT queries on the fly. So if I have 3 or more entries, that could mess things up for meIndispose
F
0

Try this! Make sure to have "Microsoft VBScript Regular Expressions 5.5" ticked in References under the Tools menu.


Function Splitter(line As String, n As Integer)
Dim s() As String
Dim regex As Object
    Set regex = CreateObject("vbscript.regexp")
    regex.IgnoreCase = True
    regex.Global = True
    regex.Pattern = ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)"
    s = split(regex.Replace(line, "|/||\|"), "|/||\|")
    Splitter = s(n - 1)
End Function
Fluidextract answered 27/9, 2019 at 4:1 Comment(0)
R
0

If the source CSV has every field in double quotes, then split(strLine, """, """) may work well
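For instance, a minimal sketch of the idea, assuming there is no space after the commas (use ", " as the delimiter, as in the answer, if your file has one); note that the stray quote on the first and last fields still has to be trimmed:

Dim raw As String, fields() As String
Dim i As Long

raw = """1"",""2"",""3"",""This should,be one part"",""5"",""6"",""7"""
fields = Split(raw, """,""")    'split on quote-comma-quote

'strip the leftover quote from the first and last fields
fields(LBound(fields)) = Mid$(fields(LBound(fields)), 2)
fields(UBound(fields)) = Left$(fields(UBound(fields)), Len(fields(UBound(fields))) - 1)

For i = LBound(fields) To UBound(fields)
    Debug.Print fields(i)
Next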

Rechaba answered 25/9, 2020 at 17:24 Comment(0)
A
0

I find that solutions based on split() and join() tend to be very fast compared to looping through characters. Getting a regular expression to work is also challenging if there can be multiple commas or multiple line breaks within a quoted string. I was just working with such a file, found here.

The function below uses the same basic mechanism as the top answer, but deals with the whole file rather than just a single line. Declarations are omitted for brevity.

Function CSVToArray(sourceText, rowDelim, columnDelim, Optional stringNotInSourceText = "|/", Optional removeErrorRows = False)
'Converts CSV text to a two-dimensional array.  It's fast thanks to its use of split() and join().
'To de-activate any combination of delimiter characters in quoted strings, they are first converted using the stringNotInSourceText argument
'The delimiter characters in the quoted strings are returned to their original values

    'Validate stringNotInSourceText
    If InStr(1, sourceText, stringNotInSourceText) > 0 Then
        Debug.Print "Error: The provided stringNotInSourceText appears in the sourceText"
    End If
    
    'Make replacement delimiters
    rowDelimReplacement = stringNotInSourceText & "R"
    columnDelimReplacement = stringNotInSourceText & "C"
    
    'Now, we need to separate quoted strings out so we can replace the delimiters inside them
    splitQuotes = Split(sourceText, """")
    
    'Amazingly, if we loop through the array in steps of 2, starting at 1, we get all the quoted strings
    For i = 1 To UBound(splitQuotes) Step 2
        splitQuotes(i) = Replace(splitQuotes(i), rowDelim, rowDelimReplacement)
        splitQuotes(i) = Replace(splitQuotes(i), columnDelim, columnDelimReplacement)
    Next
    
    'Rejoin to a now disambiguated text (a rowDelim and columnDelim character are now always actual delimiters)
    disambiguatedText = Join(splitQuotes, """")
    
    'Now we can split the disambiguated text to rows, without interference from characters in quotes
    rowArray = Split(disambiguatedText, rowDelim)
    
    'Use a sample row to count the number of columns
    rowSample = Split(rowArray(0), columnDelim)
    rowSampleUBound = UBound(rowSample)
    
    'Populate the two-dimensional array, restoring the original characters inside quotes
    Set goodRowList = CreateObject("System.Collections.ArrayList")
    errorTemplate = "Error: Row #R has #U of #SU expected columns. "
    errorTemplate = errorTemplate & IIf(removeErrorRows, "Row removed.", "Row kept with up to #SU columns.")
    ReDim returnArray(0 To UBound(rowArray), 0 To rowSampleUBound)
    On Error Resume Next 'If a row has insufficient columns, debug.print the error template but keep going
        For r = 0 To UBound(returnArray, 1)
            SplitRow = Split(rowArray(r), columnDelim)
            rowUbound = UBound(SplitRow)
            If rowUbound <> rowSampleUBound Then
                Debug.Print Replace(Replace(Replace(errorTemplate, "#R", r), "#U", rowUbound), "#SU", rowSampleUBound)
            ElseIf removeErrorRows Then 'Storing good rows to remove the rest at the end
                goodRowList.Add r
            End If
            For c = 0 To rowSampleUBound
                restoredValue = SplitRow(c)
                restoredValue = Replace(restoredValue, rowDelimReplacement, rowDelim)
                restoredValue = Replace(restoredValue, columnDelimReplacement, columnDelim)
                returnArray(r, c) = restoredValue
            Next
        Next
    On Error GoTo 0
    
    'If removeErrorRows is set to true, this will remove the rows that were designated as having the wrong number of columns
    If removeErrorRows Then
        ReDim cleanArray(0 To goodRowList.Count - 1, 0 To rowSampleUBound)
        'Copy only the rows whose indexes were recorded in goodRowList
        For r = 0 To goodRowList.Count - 1
            For c = 0 To rowSampleUBound
                cleanArray(r, c) = returnArray(goodRowList(r), c)
            Next
        Next
        returnArray = cleanArray
    End If

    CSVToArray = returnArray
    
End Function
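A possible call, assuming the whole file has first been read into a string (the path below is hypothetical):

Dim fileText As String, data As Variant

Open "C:\Downloads\today.csv" For Binary As #1   'hypothetical path to the CSV
fileText = Space$(LOF(1))
Get #1, , fileText
Close #1

data = CSVToArray(fileText, vbCrLf, ",")
Debug.Print data(0, 3)                           'fourth column of the first row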
Acetum answered 6/9, 2021 at 20:59 Comment(0)
M
0

The easiest solution might be to download a CSV parser written in VBA from GitHub. There are at least three available, and I'm the author of this one:

https://github.com/PGS62/VBA-CSV

Then the answer to the OP's question is to call the function CSVRead, passing in the example string given in the question:

CSVRead("1,2,3,""This should,be one part"",5,6,7")

which returns a 1x7 array.

Manna answered 9/9, 2021 at 13:26 Comment(0)
A
0

Regex is slow and this is a problem of limited variability.

You have context-dependent behaviour for commas and quotes only.

So brute-force logic is easy to write, fast to run and easy to understand. This code is much faster than Regex; without timing it, maybe 5-10x faster. Important for batch jobs.

' A fast, hard-coded method for splitting a CSV string which contains quoted sections
' e.g. 1,2,"comma,Separated,Values",Comma,Separated,Values will be split into 1 | 2 | comma,Separated,Values | Comma | Separated | Values (the quote characters themselves are not kept)
Public Function TokenizeCsvFast(sourceLine As String)
    
    Dim tokens() As String
    ReDim tokens(1 To 1)
    
    Dim processedTokenNumber As Long
    Dim newToken As String
    Dim newTokenNumber As Long
    newTokenNumber = 0
    
    Dim inQuotes As Boolean
    
    Dim stringPosition As Long
    For stringPosition = 1 To Len(sourceLine)
        
        Dim newCharacter As String
        newCharacter = Mid$(sourceLine, stringPosition, 1)
        
        Dim newTokenComplete As Boolean
        newTokenComplete = False
        
        If newCharacter = """" Then   ' Handle quotes as an explicit case
            inQuotes = Not inQuotes
        ElseIf newCharacter = "," Then

            If inQuotes Then
                ' if in quotes, just build up the new token
                newToken = newToken & newCharacter
            Else
                ' Outside of quotes, a comma separates values
                newTokenComplete = True
            End If

        ElseIf stringPosition = Len(sourceLine) Then
            ' The terminal token may not have a terminal comma
            newToken = newToken & newCharacter
            newTokenComplete = True
        Else
            ' Build up the new token one character at a time
            newToken = newToken & newCharacter
        End If
        
        If newTokenComplete Then
            processedTokenNumber = processedTokenNumber + 1
            
            ' Add the completed new token to the return array
            newTokenNumber = newTokenNumber + 1
            If newTokenNumber > UBound(tokens) Then
                ReDim Preserve tokens(1 To newTokenNumber)
            End If
            tokens(newTokenNumber) = newToken
            ' Debug.Print newToken
            
            ' Start new token afresh
            newToken = ""
            
        End If
        
    Next
    
    TokenizeCsvFast = tokens
    
End Function
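For example, with the string from the question:

Dim fields As Variant
fields = TokenizeCsvFast("1,2,3,""This should,be one part"",5,6,7")
Debug.Print fields(4)    'prints: This should,be one part (the array is 1-based; quote characters are dropped)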
Arthropod answered 16/10, 2021 at 15:21 Comment(0)
