How to re-configure Spark/Hadoop to read files starting with an "_" (underscore)?
I know that files starting with "_" and "." are treated as hidden files, and that the hiddenFileFilter is always applied. It is added inside the method org.apache.hadoop.mapred.FileInputFormat.listStatus.

From research, I understood that we can use FileInputFormat.setInputPathFilter to set our own custom PathFilter, but that the hiddenFileFilter remains active on top of it.

For this purpose I created a MyPathFilter class as follows:

class MyPathFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    // Accept everything except dot-files, so "_"-prefixed files pass
    return !path.getName().startsWith(".");
  }
}

and I know it should be registered on the job, something like this, before we read the input files:

FileInputFormat.setInputPathFilter(job, MyPathFilter.class);
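To make the filtering behaviour concrete, here is a minimal sketch of the two predicates in plain Scala, with the Hadoop Path/PathFilter types left out for brevity (the function names are my own, not Hadoop API):

```scala
// Hadoop's built-in hiddenFileFilter: rejects names starting with "_" or "."
def hiddenFileFilter(name: String): Boolean =
  !name.startsWith("_") && !name.startsWith(".")

// The custom filter sketched above: still hides dot-files,
// but lets underscore-prefixed files through
def acceptUnderscoreFiles(name: String): Boolean =
  !name.startsWith(".")

// "_SUCCESS" is rejected by the default filter but accepted by the custom one
println(hiddenFileFilter("_SUCCESS"))        // false
println(acceptUnderscoreFiles("_SUCCESS"))   // true
```

This is only the predicate logic; wiring it in still requires the PathFilter class above plus the setInputPathFilter call.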

But the problem with my Spark/Scala data-processing pipeline is that we read the files as text, as follows:

val spark = context.sparkSession
import spark.implicits._

val rawDF = spark.read
  .text(list: _*)
  .map { r =>
    // do something
  }
  .toDF()

There is no way I can change how we read the files, because that logic is tied to capturing metadata from the nested folder structure of the file location. So, with spark.read staying intact, how can I make sure that files whose names start with "_" (underscore) are also read? How can I use FileInputFormat.setInputPathFilter in this situation?

We run our jobs on AWS EMR, so can we parameterize FileInputFormat.setInputPathFilter while creating the EMR cluster? Or can we use spark-submit options to re-configure the job and turn reading of hidden files on?
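On the spark-submit idea, one avenue worth exploring is that any --conf key prefixed with spark.hadoop. is copied into the job's Hadoop Configuration, and mapreduce.input.pathFilter.class is the property that setInputPathFilter writes. A sketch, with a hypothetical filter class name (caveat: spark.read's own file listing applies its hidden-file rules independently, so this may only affect RDD-level Hadoop input formats):

```shell
# Hypothetical: pass a Hadoop-side path filter at submit time.
# com.example.MyPathFilter is a placeholder for your compiled filter class.
spark-submit \
  --conf spark.hadoop.mapreduce.input.pathFilter.class=com.example.MyPathFilter \
  my-pipeline.jar
```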

Please help me with your valuable suggestions. Thanks.

Dipterocarpaceous answered 6/9, 2018 at 19:27 Comment(4)
- Infamous: Don't use spark.read.text. Use the hadoopRdd functions. For example, in streaming: https://mcmap.net/q/1919046/-spark-streaming-textfilestream-watch-output-of-rdd-saveastextfile
- Dipterocarpaceous: Hi cricket_007, can you please help me with how to use a Hadoop RDD in my scenario? The streaming example is not that helpful. Thank you in advance!
- Infamous: I've not written Spark code in years. I just remember using something similar to that other post to read those files.
- Biographer: Is there a SOLUTION for this question? Please link.

© 2022 - 2024 — McMap. All rights reserved.