MySQL LOAD DATA INFILE: works, but unpredictable line terminator

MySQL has a nice CSV import function LOAD DATA INFILE.

I have a large dataset that needs to be imported from CSV on a regular basis, so this feature is exactly what I need. I've got a working script that imports my data perfectly.

.....except.... I don't know in advance what the end-of-line terminator will be.

My SQL code currently looks something like this:

LOAD DATA INFILE '{fileName}'
 INTO TABLE {importTable}
 FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
 LINES TERMINATED BY '\n'
 IGNORE 1 LINES
( {fieldList} );

This works great for some import files.

However, the import data is coming from multiple sources. Some of them have the \n terminator; others have \r\n. I can't predict which one I'll have.

Is there a way using LOAD DATA INFILE to specify that my lines may be terminated with either \n or \r\n? How do I deal with this?

Archaeopteryx answered 7/6, 2012 at 15:44 Comment(3)
Do you only need to deal with \n (most *nix systems, including OS X) and \r\n (Windows)--in which case @Devart's answer looks perfect--or might you encounter other line termination sequences such as \n\r (e.g. from RISC OS), \r (e.g. Mac OS 9) and others?Graph
@Graph - that's a very good question. I had only considered the *nix/Windows scenarios, but as I said the creation of the import files is out of my control, so I guess I need to cater for anything. I've gone with a basic pre-processing solution: Devart's solution looked great, but had too many weak points to make it robust enough.Archaeopteryx
Sadly, in that case, I don't think there's any simple solution. One could parse the file counting occurrences of each candidate line termination sequence in an attempt to guess which it uses, but beware of files containing multiline text fields where the line termination differs from the actual record termination (you might have to count/verify that the expected number of fields appear between each record terminator). I don't know what code/framework you are using for your application, but there may well be libraries that can assist you here. Else, prompt the user to tell you which it is.Graph
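The terminator-counting heuristic suggested in the comments could be sketched in shell like this (a rough sketch, not a robust solution: `data.csv` is a placeholder name, a sample file is created for the demo, and the multiline-quoted-field caveat above still applies):

```shell
# Guess the line terminator by counting lines that end in CR.
# data.csv is a placeholder; sample CRLF data is written for the demo.
file="data.csv"
printf 'a,b\r\nc,d\r\n' > "$file"

crlf=$(grep -c "$(printf '\r')\$" "$file")   # lines ending in \r (i.e. CRLF)
total=$(wc -l < "$file")                     # total number of lines

if [ "$crlf" -eq "$total" ]; then
    echo "CRLF"
elif [ "$crlf" -eq 0 ]; then
    echo "LF"
else
    echo "MIXED"
fi
```

A real import script would branch on that result to pick the `LINES TERMINATED BY` value (or to decide whether pre-processing is needed).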

I'd just pre-process it. A global search/replace to change \r\n to \n done from a command line tool as part of the import process should be simple and performant.
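Such a pre-processing step might look like this (a minimal sketch; `input.csv` and `clean.csv` are placeholder names, and a sample file is created for the demo):

```shell
# Normalize line endings before LOAD DATA INFILE: strip every CR byte
# so both CRLF and LF inputs end up LF-terminated.
# input.csv / clean.csv are placeholder names for this sketch.
printf 'a,b\r\nc,d\n' > input.csv      # sample file mixing CRLF and LF

tr -d '\r' < input.csv > clean.csv     # delete all \r characters
```

One caveat: `tr -d '\r'` removes every CR, including any embedded inside quoted fields; if that matters for your data, a tool that only strips CR at end-of-line (e.g. GNU sed's `s/\r$//`) is safer.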

Lisabeth answered 7/6, 2012 at 16:18 Comment(3)
This is currently looking like my best bet. I was hoping for a pure SQL solution, but if it doesn't exist then a scripted pre-processing step will have to do. Will wait to see if any better answers come along but if not then this will probably be the one.Archaeopteryx
Yep. Most ETL systems pulling data from a variety of sources end up with some pre-processing. If all you have to do is normalize line terminators you probably ought to count your blessings ;)Lisabeth
Although @Devart's answer looked good on the surface, I've gone with a simple pre-processing utility. It seems like it'll be more robust, and as you say it is reasonably performant. Would have loved to have seen a pure SQL solution, but it looks like there isn't one. (seems bizarre that LOAD DATA INFILE couldn't have been given the option to accept any line terminator, but it doesn't, so that's the end of that)Archaeopteryx

You can specify the line separator as '\n' and, if necessary, remove a trailing '\r' from the last field during loading.

For example -

Suppose we have the 'entries.txt' file. The line separator is '\r\n', except after the line ITEM2 | CLASS3 | DATE2, where the separator is just '\n':

COL1  | COL2   | COL3
ITEM1 | CLASS1 | DATE1
ITEM2 | CLASS3 | DATE2
ITEM3 | CLASS1 | DATE3
ITEM4 | CLASS2 | DATE4

CREATE TABLE statement:

CREATE TABLE entries(
  column1 VARCHAR(255) DEFAULT NULL,
  column2 VARCHAR(255) DEFAULT NULL,
  column3 VARCHAR(255) DEFAULT NULL
);

Our LOAD DATA INFILE query:

LOAD DATA INFILE 'entries.txt' INTO TABLE entries
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(column1, column2, @var)
SET column3 = TRIM(TRAILING '\r' FROM @var);

Show results:

SELECT * FROM entries;
+---------+----------+---------+
| column1 | column2  | column3 |
+---------+----------+---------+
| ITEM1   |  CLASS1  |  DATE1  |
| ITEM2   |  CLASS3  |  DATE2  |
| ITEM3   |  CLASS1  |  DATE3  |
| ITEM4   |  CLASS2  |  DATE4  |
+---------+----------+---------+
Gan answered 12/6, 2012 at 11:12 Comment(6)
ah, now this looks promising.... :-) I'll investigate this further and let you know how it goes. Thank you for the suggestion.Archaeopteryx
What if COL3 in the CSV is quoted? Won't this result in a parser error when MySQL encounters an unexpected character between the end of the quoted field and the line termination? What if column3 is unquoted text that is supposed to end with a \r character (might be better to do something slightly more ugly involving testing the final character and, if it's \r, then removing only that one character)?Graph
I also thought about a quoted COL3. I have tried some variants, and can say that a quoted COL3 will cause problems with line termination. So, the last field should be unquoted. About the ...\r\r\r - yes, this method can be rewritten to remove only the last \r.Gan
in the absence of an accept I don't know if this worked, and I'm conscious of the points raised in the comments, but I'm giving you the bounty because this is the only answer that attempted to solve the problem within the bounds of the sql query. pre-processing the file was always the easy way out, but good to see a creative solution using just sql. Thank you.Thole
I've gone with a simple pre-processing solution. I really liked your idea here, but as with the comments above I couldn't rely on it working for all input. The import files are from multiple third parties, so I don't have enough control over the content to know that I won't be tripped up by quotes, etc. (I guess if I did have that kind of control, this question wouldn't have been asked anyway). But thanks for the suggestion; it was interesting and creative.Archaeopteryx
Yes, this solution does not cover all situations. It is just an approach to use MySQL in this question. Anyway, thanks for the bounty!Gan

I assume you need to handle this only through MySQL, not via any programming language. Before using LOAD DATA, convert the file to Windows format \r\n (CR LF), e.g. in Notepad++, and then run the LOAD DATA query. Make sure it uses LINES TERMINATED BY '\r\n'.

Edit:

Editors are often unsuitable for converting larger files. For larger files, the following commands are often used on both Windows and Linux.

1) To convert to Windows format on Windows:

TYPE [unix_file] | FIND /V "" > dos_file

2) To convert to Windows format on Linux:

unix2dos [file]

Other commands are also available.

A Windows-format file can be converted to Unix format by simply removing all ASCII CR (\r) characters: tr -d '\r' < inputfile > outputfile

grep -PL $'\r\n' myfile.txt # list UNIX-format files (LF-terminated)
grep -Pl $'\r\n' myfile.txt # list Windows-format files (CRLF-terminated)

On Linux/Unix, the file command detects the type of end-of-line (EOL) used, so the file type can be checked using this command.
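Putting the two together, an import script could let file(1) decide whether conversion is needed (a sketch; `import.csv` is a placeholder name and sample data is created for the demo):

```shell
# Convert to LF only when file(1) reports CRLF terminators.
# import.csv is a placeholder; sample CRLF data is written for the demo.
f="import.csv"
printf 'x,y\r\n1,2\r\n' > "$f"

if file "$f" | grep -q 'CRLF'; then
    tr -d '\r' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"   # normalize to LF
fi
```

After this runs, LINES TERMINATED BY '\n' is always correct for the LOAD DATA statement.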

Coterie answered 12/6, 2012 at 6:47 Comment(2)
it's not the answer I was hoping for, but this kind of conversion is looking like my best bet at the moment. Not in a program like Notepad++ though, because the file is several hundred meg; better to use a commandline tool.Archaeopteryx
@Archaeopteryx See the Edit part for command line conversion both windows and unixCoterie

You could also look into one of the data integration packages out there. Talend Open Studio has very flexible data input routines. For example you could process the file with one set of delimiters and catch the rejects and process them another way.

Academy answered 7/6, 2012 at 17:44 Comment(0)

If the first load inserts 0 rows, run the same statement with the other line terminator. This should be doable with some basic counting logic.

At least it stays all in SQL, and if it works the first time you win. It could also cause less headache than re-scanning all the rows and removing a particular character.

Bendigo answered 12/6, 2012 at 21:5 Comment(0)

Why not first just take a peek at how the lines end?

$handle = fopen('inputFile.csv', 'r');

$i = 0;
if ($handle) {
    while (($buffer = fgets($handle)) !== false) {

        // Look at the end of the line; fgets keeps the terminator,
        // so any \r and/or \n will be in this substring.
        $s = substr($buffer, -50);

        echo $s;
        echo preg_match('/\r/', $s) ? 'cr ' : '-- ';
        echo preg_match('/\n/', $s) ? 'nl<br>' : '--<br>';

        // A handful of lines is enough to tell.
        if ($i++ > 5)
            break;

    }

    fclose($handle);
}
Sheryl answered 5/11, 2013 at 3:50 Comment(1)
This helped me out with this issue - once you have determined the line ending you can wrap the SQL command into a conditional statement.Solemnize

You can use LINES STARTING BY to separate the usual line endings in the text from a new row:

LOAD DATA LOCAL INFILE '/home/laptop/Downloads/field3-utf8.csv' 
IGNORE INTO TABLE Field FIELDS 
TERMINATED BY ';' 
OPTIONALLY ENCLOSED BY '^' 
LINES STARTING BY '^' 
TERMINATED BY '\r\n' 
(Id, Form_id, Name, Value)

For usual CSV files with " as the enclosing character, it will be:

...
LINES STARTING BY '"' 
...
Gestalt answered 20/7, 2018 at 7:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.