How do I correctly handle CR when reading text files with OleDB
Asked Answered
P

2

0

I have text files that are Tab delimited. I created a Schema.ini like so:

[MY_FILE.TAB]
Format=TabDelimited
ColNameHeader=False
Col1=id Short
Col2=data Text

This is the code I use to read it (C#):

using (var connection = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\FolderToData\;Extended Properties='text;FMT=delimited'"))
{
  using (var command = new OleDbCommand("SELECT * FROM MY_FILE.TAB", connection))
  {
    var table = new DataTable();
    using (var adapter = new OleDbDataAdapter(command)
    {
      adapter.Fill(table);
    }
  }
}

Everything works fine, except for one thing. The data in the text file contains Carriage Returns [CR]. Records itself are separated by Carriage Return Line Feeds [CR][LF]. Unfortunately, OleDB / MicrosoftJet (or whatever parses these files) treats both ([CR], [CRLF]) the same.

Example of MY_FILE.TAB (there should be a Tab between numbers and text):

1   One[CR][LF]
2   Two[CR][LF]
3   Th[CR]
ree[CR][LF]
4   Four[CR][LF]

Gives me 5 (malformed) Rows in the DataTable instead of 4.

What I need is:

1   "One"
2   "Two"
3   "Th\nree"
4   "Four2

But I get:

1    "One"
2    "Two"
3    "Th"
null null
4    "Four"

"ree" can't be converted to Int32 so first colum in fourth row is null.

How can I configure OleDB to treat [CR] different than [CR][LF]? Or any other Ideas?

Phocis answered 27/11, 2009 at 15:6 Comment(2)
Can you please specify what constraints do you have? Meaning, the file/schema is of third party or you can update it according to your needs?Malloch
I got these files "as-is" and need to load it. No way I can influence it.Phocis
K
2

I don't believe you can reconfigure OLEDB to do this directly.

An alternative approach would be to use a TextReader and TextWriter to process the file into a temporary file, scanning for and replacing CR alone into some special escape sequence. Then use OLEDB to read this replacement temporary file; finally, replace the special escape sequence back to a CR.

Kermes answered 27/11, 2009 at 15:11 Comment(5)
Yes this could be a quick'n'dirty hack. 1) Replace [CR][LF] with X 2) Replace [CR] with Y 3) Replace X with [CR][LF] 4) Read using OleDB 5) Replace Y with '\n' X and Y should be something unique...Phocis
@Phocis - I never pretended it would be pretty, but neither is your file format!Kermes
+1 for a nice workaround Jeremy; Just that since a temporary file shall be used therefore I do not understand "finally, replace the special escape sequence" - I would say just remove the temporary file after use.Malloch
@Jeremy: Don't get me wrong. I actually like your solution. And I don't like the file format either. But I guess that's what legacy means...Phocis
@KMan: He doesn't mean the temp file but the special excape sequence (former CR) needs to be converted back into a newline.Phocis
P
0

Wouldn't it be easy to read the file contents in a string, split it by Environment.NewLine or \r\n, which will get you an array for each line, which you can further split by tab?

Provocation answered 27/11, 2009 at 15:11 Comment(1)
Maybe, but that could lead to a lot of other problems, i.e. tab inside escaped sequence.Phocis

© 2022 - 2024 — McMap. All rights reserved.