I know that files starting with "_" and "." are treated as hidden files, and that the hiddenFileFilter is always applied; it is added inside the method org.apache.hadoop.mapred.FileInputFormat.listStatus. From my research I understood that we can use FileInputFormat.setInputPathFilter to set our own custom PathFilter, but that the hiddenFileFilter remains active on top of it.
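For reference, my understanding is that the built-in filter behaves roughly like the Scala paraphrase below (the real filter is Java code hard-coded in Hadoop's FileInputFormat, so this is only an approximation, not the actual source):

// Scala paraphrase of Hadoop's built-in hiddenFileFilter, which is applied
// inside org.apache.hadoop.mapred.FileInputFormat.listStatus
import org.apache.hadoop.fs.{Path, PathFilter}

val hiddenFileFilter = new PathFilter {
  override def accept(p: Path): Boolean =
    !p.getName.startsWith("_") && !p.getName.startsWith(".")
}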
For this purpose, I created a MyPathFilter class as follows:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

class MyPathFilter implements PathFilter {
    public boolean accept(Path path) {
        // Accept every path, including file names starting with "_" or "."
        return true;
    }
}
and I know it should be set before the input files are read, with something like this:
FileInputFormat.setInputPathFilter(job, MyPathFilter.class);
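In a plain MapReduce job, I believe the wiring would look roughly like the minimal Scala sketch below (using the new-API org.apache.hadoop.mapreduce classes; the input path is just a placeholder):

// Minimal sketch of the usual MapReduce wiring for setInputPathFilter
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("/some/input/dir"))  // placeholder path
FileInputFormat.setInputPathFilter(job, classOf[MyPathFilter])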
But the problem with my Spark/Scala-based data processing app/pipeline is that we read the files as text, as follows:
val spark = context.sparkSession
import spark.implicits._

val rawDF = spark.read
  .text(list: _*)
  .map { r =>
    // do something
  }
  .toDF()
There is no way I can change how we read the files, because the read is tied to capturing metadata from the nested folder structure of the file locations. So, with spark.read staying intact, how can I make sure that files whose names start with "_" (underscore) are also read? How can FileInputFormat.setInputPathFilter be used in this situation?
We run our jobs on AWS EMR, so can we parameterize FileInputFormat.setInputPathFilter while creating the EMR cluster? Or can we use spark-submit options to reconfigure the job and turn a "read hidden files" behaviour on?
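To make the question concrete, the kind of thing I am imagining is sketched below. It relies on Hadoop's mapreduce.input.pathFilter.class property (the one read by the new-API FileInputFormat); I have no idea whether spark.read.text consults that property at all, which is part of what I am asking:

// Untested idea: register my filter class in the Hadoop configuration that
// Spark carries around, hoping the text reader picks it up.
import org.apache.hadoop.fs.PathFilter

spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[MyPathFilter],
  classOf[PathFilter])

The same setting could presumably be passed at launch time, for example with spark-submit --conf spark.hadoop.mapreduce.input.pathFilter.class=<fully qualified MyPathFilter> (since Spark copies spark.hadoop.* properties into its Hadoop Configuration), or through an EMR configuration classification, but I do not know whether any of that actually affects spark.read.text.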
Please help me with your valuable suggestions. Thanks.
Instead of spark.read.text, use the hadoopRdd functions. For example, in streaming: https://mcmap.net/q/1919046/-spark-streaming-textfilestream-watch-output-of-rdd-saveastextfile – Infamous
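A rough sketch of what that suggestion might look like in this pipeline is below (untested; the input path is a placeholder, and the built-in hiddenFileFilter may still drop "_" files on top of any custom filter, as noted above):

// Sketch of the hadoopRdd-style suggestion from the comment above: read via
// the Hadoop input-format API so a custom PathFilter can be plugged in.
// Assumes `import spark.implicits._` is in scope, as in the snippet above.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.PathFilter
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setClass("mapreduce.input.pathFilter.class", classOf[MyPathFilter], classOf[PathFilter])

val rawDF = spark.sparkContext
  .newAPIHadoopFile(
    "s3://my-bucket/input/",          // placeholder input path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, line) => line.toString }   // do something with each line
  .toDF()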