Sparklyr ignoring line delimiter

I'm trying to read a ~2 GB .csv (about 5 million lines) into sparklyr with:

bigcsvspark <- spark_read_csv(sc, "bigtxt", "path", 
                              delimiter = "!",
                              infer_schema = FALSE,
                              memory = TRUE,
                              overwrite = TRUE,
                              columns = list(
                                  SUPPRESSED COLUMNS AS = 'character'))

And getting the following error:

Job aborted due to stage failure: Task 9 in stage 15.0 failed 4 times, most recent failure: Lost task 9.3 in stage 15.0 (TID 3963,
10.1.4.16):  com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content: ---lines of my csv---[\n]
---begin of a splited line --- Parser Configuration: CsvParserSettings:     ... default settings ...

and:

CsvFormat:
    Comment character=\0
    Field delimiter=!
    Line separator (normalized)=\n
    Line separator sequence=\n
    Quote character="
    Quote escape character=\
    Quote escape escape character=null Internal state when error was thrown:
        line=10599, 
        column=6, 
        record=8221, 
        charIndex=4430464, 
        headers=[---SUPRESSED HEADER---], 
        content parsed=---more lines without the delimiter.---

As shown above, at some point the line separator starts to be ignored. In pure R the file can be read without problems, just read.csv with the path and the delimiter.
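
A workaround that sometimes helps with this particular error (not confirmed in this thread) is to loosen the univocity limit through spark_read_csv's options argument; a minimal sketch, assuming a Spark version that understands the maxCharsPerColumn CSV option and using cols as a stand-in for the suppressed column list:

library(sparklyr)

# cols stands in for the suppressed named list from above,
# e.g. cols <- list(col_a = 'character', col_b = 'character', ...)
bigcsvspark <- spark_read_csv(sc, "bigtxt", "path",
                              delimiter = "!",
                              infer_schema = FALSE,
                              memory = TRUE,
                              overwrite = TRUE,
                              columns = cols,
                              # quote = "",  # one way to turn off quote handling, in case
                              #              # stray '"' characters are what makes the parser
                              #              # run past line ends
                              # forwarded to the Spark CSV source / univocity parser;
                              # -1 removes the 1,000,000-character per-column cap
                              options = list(maxCharsPerColumn = "-1"))

Even if this lets the read finish, a line that swallows its neighbours usually points at unescaped quote characters in the data rather than at the parser settings themselves.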

Geriatric answered 13/10, 2017 at 19:1 Comment(2)
As suggested by the author, try dplyr's filter to remove/identify the unwanted row. github.com/rstudio/sparklyr/issues/83 – Fructuous
I will try it. At first I suspected that the buffer couldn't deal with the data, but since the data is a huge mess it may well be a data problem. I'm also trying to write a Scala script to convert it to Parquet. – Geriatric

It looks like the file is not really a CSV; I wonder if spark_read_text() would work better in this situation. You should be able to bring all the lines into Spark and split them into fields in memory. That last part will be the trickiest.
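
A rough sketch of that approach, assuming the raw text column created by spark_read_text() keeps its default name line, and using col_a, col_b, ... as placeholders for the real, suppressed header names:

library(sparklyr)
library(dplyr)

# bring every line in untouched
raw <- spark_read_text(sc, "bigtxt_raw", "path")

# sanity check: how many '!'-separated fields does each line actually have?
# (split() and size() are passed through to Spark SQL)
raw %>%
  mutate(n_fields = size(split(line, "!"))) %>%
  count(n_fields) %>%
  collect()

# split on '!' in memory and fan the resulting array out into named columns
bigtxt <- raw %>%
  mutate(parts = split(line, "!")) %>%
  sdf_separate_column("parts", into = c("col_a", "col_b", "col_c")) %>%
  select(-line, -parts)

Rows whose field count is off can then be isolated with a plain filter() before the split, which is essentially the dplyr suggestion from the comments above.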

Amalgamate answered 20/10, 2017 at 1:56 Comment(0)
