I know that files starting with "_" and "." are treated as hidden files, and that the hiddenFileFilter is always applied; it is added inside the method org.apache.hadoop.mapred.FileInputFormat.listStatus. From my research I understood that we can use FileInputFormat.setInputPathFilter to set our own custom PathFilter, but that the hiddenFileFilter remains active on top of it.
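For reference, my understanding is that the built-in filter behaves roughly like the Scala paraphrase below (the real filter is Java code hard-coded in Hadoop's FileInputFormat, so this is only an approximation, not the actual source):

// Scala paraphrase of Hadoop's built-in hiddenFileFilter, which is applied
// inside org.apache.hadoop.mapred.FileInputFormat.listStatus
import org.apache.hadoop.fs.{Path, PathFilter}

val hiddenFileFilter = new PathFilter {
  override def accept(p: Path): Boolean =
    !p.getName.startsWith("_") && !p.getName.startsWith(".")
}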
For this purpose, I created a MyPathFilter class as follows:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

class MyPathFilter implements PathFilter {
    public boolean accept(Path path) {
        // Accept every path, including file names starting with "_" or "."
        return true;
    }
}
and I know it should be set before the input files are read, with something like this:
FileInputFormat.setInputPathFilter(job, MyPathFilter.class);
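In a plain MapReduce job, I believe the wiring would look roughly like the minimal Scala sketch below (using the new-API org.apache.hadoop.mapreduce classes; the input path is just a placeholder):

// Minimal sketch of the usual MapReduce wiring for setInputPathFilter
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()
FileInputFormat.addInputPath(job, new Path("/some/input/dir"))  // placeholder path
FileInputFormat.setInputPathFilter(job, classOf[MyPathFilter])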
But the problem with my Spark/Scala-based data processing app/pipeline is that we read the files as text, as follows:
val spark = context.sparkSession
import spark.implicits._

val rawDF = spark.read
  .text(list: _*)
  .map { r =>
    // do something
  }
  .toDF()
There is no way I can change how we read the files, because the read is tied to capturing metadata from the nested folder structure of the file locations. So, with spark.read staying intact, how can I make sure that files whose names start with "_" (underscore) are also read? How can FileInputFormat.setInputPathFilter be used in this situation?
We run our jobs on AWS EMR, so can we parameterize FileInputFormat.setInputPathFilter while creating the EMR cluster? Or can we use spark-submit options to reconfigure the job and turn a "read hidden files" behaviour on?
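To make the question concrete, the kind of thing I am imagining is sketched below. It relies on Hadoop's mapreduce.input.pathFilter.class property (the one read by the new-API FileInputFormat); I have no idea whether spark.read.text consults that property at all, which is part of what I am asking:

// Untested idea: register my filter class in the Hadoop configuration that
// Spark carries around, hoping the text reader picks it up.
import org.apache.hadoop.fs.PathFilter

spark.sparkContext.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class",
  classOf[MyPathFilter],
  classOf[PathFilter])

The same setting could presumably be passed at launch time, for example with spark-submit --conf spark.hadoop.mapreduce.input.pathFilter.class=<fully qualified MyPathFilter> (since Spark copies spark.hadoop.* properties into its Hadoop Configuration), or through an EMR configuration classification, but I do not know whether any of that actually affects spark.read.text.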
Please help me with your valuable suggestions. Thanks.
Instead of spark.read.text, use the hadoopRdd functions. For example, in streaming: https://mcmap.net/q/1919046/-spark-streaming-textfilestream-watch-output-of-rdd-saveastextfile – Infamous
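A rough sketch of what that suggestion might look like in this pipeline is below (untested; the input path is a placeholder, and the built-in hiddenFileFilter may still drop "_" files on top of any custom filter, as noted above):

// Sketch of the hadoopRdd-style suggestion from the comment above: read via
// the Hadoop input-format API so a custom PathFilter can be plugged in.
// Assumes `import spark.implicits._` is in scope, as in the snippet above.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.PathFilter
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.setClass("mapreduce.input.pathFilter.class", classOf[MyPathFilter], classOf[PathFilter])

val rawDF = spark.sparkContext
  .newAPIHadoopFile(
    "s3://my-bucket/input/",          // placeholder input path
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    conf)
  .map { case (_, line) => line.toString }   // do something with each line
  .toDF()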