dataFrame.coalesce(1).write().save("path")
sometimes writes only the _SUCCESS and ._SUCCESS.crc files, without the expected *.csv.gz, even when the input DataFrame is non-empty.
file save code:
private static void writeCsvToDirectory(Dataset<Row> dataFrame, Path directory) {
    // coalesce(1) collapses the result into a single partition,
    // so exactly one compressed part file is expected in the output directory
    dataFrame.coalesce(1)
            .write()
            .format("csv")
            .option("header", "true")
            .option("delimiter", "\t")
            .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
            .mode(SaveMode.Overwrite)
            .save("file:///" + directory);
}
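One of the comments below suggests ruling out an empty DataFrame as a cause. A minimal pre-write check along those lines (a sketch; the exception and its message are illustrative, not part of the original code) could be:
if (dataFrame.limit(1).count() == 0) {
    // limit(1) keeps the emptiness probe cheap on large inputs
    throw new IllegalStateException("input DataFrame is empty, nothing to write to " + directory);
}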
file get code:
static Path getTemporaryCsvFile(Path directory) throws IOException {
    String glob = "*.csv.gz";
    // pick the first (and, after coalesce(1), only) compressed part file
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(directory, glob)) {
        return stream.iterator().next();
    } catch (NoSuchElementException e) {
        throw new RuntimeException(getNoSuchElementExceptionMessage(directory, glob), e);
    }
}
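For context, the two helpers are used together roughly like this (a sketch; the temp-directory creation is an assumption that mirrors the /tmp/temp… paths in the error below, not code taken from the question):
Path directory = Files.createTempDirectory("temp");
writeCsvToDirectory(dataFrame, directory);
// on a successful run the glob matches the single compressed part file
Path csvFile = getTemporaryCsvFile(directory);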
file get error example:
java.lang.RuntimeException: directory /tmp/temp5889805853850415940 does not contain a file with glob *.csv.gz. Directory listing:
/tmp/temp5889805853850415940/_SUCCESS,
/tmp/temp5889805853850415940/._SUCCESS.crc
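For comparison, a successful run of the same code is expected to leave a listing roughly like the one below; the exact part-file name varies by Spark version, and <uuid> is a placeholder for the generated task ID:
/tmp/temp5889805853850415940/part-00000-<uuid>-c000.csv.gz,
/tmp/temp5889805853850415940/.part-00000-<uuid>-c000.csv.gz.crc,
/tmp/temp5889805853850415940/_SUCCESS,
/tmp/temp5889805853850415940/._SUCCESS.crc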
I rely on this expectation; can someone explain why it works this way?
.format("csv").save
. – RedressdataFrame.coalesce(1).limit(1).collect
to eliminate dataframe emptiness as a cause? – Daleydeploy-mode
are you running your Spark app? – Acrolein
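Acrolein's deploy-mode question is relevant because with a file:/// output path each Spark JVM writes to its own local filesystem, so the part file may land on a different machine than the one being inspected. A quick way to report the mode (a sketch, assuming a SparkSession named spark is in scope) is:
// "spark" is an assumed SparkSession; spark.submit.deployMode defaults to "client"
String master = spark.sparkContext().master();
String deployMode = spark.sparkContext().getConf().get("spark.submit.deployMode", "client");
System.out.println("master=" + master + ", deployMode=" + deployMode);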