Avoid creation of _$folder$ keys in S3 with hadoop (EMR)
Asked Answered
A

5

13

I am using an EMR Activity in AWS Data Pipeline. This EMR Activity runs a Hive script on an EMR cluster. It takes a DynamoDB table as input and stores the data in S3.

This is the EMR step used in the EMR Activity:

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where

output.directoryPath is:

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3 (technically speaking, it creates two keys, 2017-03-18/<some_random_number> and 2017-03-18_$folder$):

2017-03-18
2017-03-18_$folder$

How can I avoid the creation of these extra empty _$folder$ files?

EDIT: I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400, but I don't know how to implement it in AWS Data Pipeline.

Albinus answered 18/3, 2017 at 15:27 Comment(0)
I
9

EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, but Hadoop is designed to process data even if these empty files are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.
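
For example, a minimal Lambda sketch along those lines (it assumes an S3 event notification with a "_$folder$" suffix filter is configured to invoke the function; everything else is just the standard S3 event structure):

import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record describes one newly created object that matched the notification filter
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.endswith("_$folder$"):
            s3.delete_object(Bucket=bucket, Key=key)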

Intercessor answered 18/3, 2017 at 19:11 Comment(2)
I found a solution listed at issues.apache.org/jira/browse/HADOOP-10400 but I don't know how to implement it in AWS data pipeline.Albinus
@saurabhagarwal I believe you can't, with EMR -- it's a managed service.Intercessor
E
18

Use s3a:// while writing to the S3 bucket; it will avoid the _$folder$ markers. I have tested this in Glue; I'm not sure whether it applies to EMR clusters.

Credit: answered by someone on Reddit.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read via s3://, but write via s3a:// so no _$folder$ marker is created
df = spark.read.format("parquet").load("s3://testingbucket/")
df.write.format("parquet").save("s3a://testingbucket/parttest/")

spark.stop()
Endocarp answered 9/10, 2020 at 19:55 Comment(5)
Tried on glue ETL and it worked as expected. Many thanks!Fogarty
Wow...THIS IS THE ANSWER!Insult
Thanks a lot ! This Answer should be the accepted one..Baran
Thanks, worked for me in EMR Hadoop, while writing to S3.Latreshia
Worked great for me on a Glue Development Endpoint. Thank you.Phosphene
P
5

There's no way in S3 to actually create an empty folder. S3 is an object store, so everything in it is an object.

When Hadoop uses it as a filesystem, it needs to organize those objects so that they appear as a file system tree, so it creates special marker objects to flag an object as a directory.

You just store data files, but you can organize those data files into paths, which gives you a concept similar to folders for traversal.

Some tools, including the AWS Management Console, mimic folders by interpreting the /s in object names. The Amazon S3 console supports the folder concept as a means of grouping objects, and so does Bucket Explorer.

If you simply skip creating a folder and place files at the path you want, that should work for you.

You don't have to create a folder before writing files to it in S3, because /all/path/including/filename is the whole key in S3 storage.
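
For example, a minimal boto3 sketch (the key name here is just an illustration): writing an object under a full key is all it takes, and the console will then show the intermediate path segments as folders.

import boto3

s3 = boto3.client("s3")

# No "create folder" step: the whole path is simply the object's key
s3.put_object(
    Bucket="my-s3-bucket",
    Key="output/2017-03-18/part-00000",
    Body=b"some data",
)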

Painting answered 18/3, 2017 at 17:11 Comment(5)
"There's no way in S3 to actually create an empty folder." That isn't true. While it's true that folders do not really exist, any object whose key ends with a trailing slash is interpreted by the console as a folder. Unfortunately, Hadoop uses this goofy _$folder$ construct, entirely unnecessarily, since it could just use / -- which is what happens when you "create a folder" in the console.Intercessor
@Michael-sqlbot It's true about S3, it has only buckets and keys. But some tools can mimic folders by interpreting /s in object names. The Amazon S3 console supports the folder concept as a means of grouping objects. So does the Bucket Explorer. See here: bucketexplorer.com/documentation/…Painting
Hadoop s3n client uses the $folder$ marker for historical reasons; I think originally you couldn't use "/". The newer S3a Client uses "/"; it ignores $folder$ files in listings. Amazon EMR's S3 connector is their own code, it appears to still use $folder$. Their decision.Epsom
@SteveLoughran Are there any links detailing the switch from "_$folder$" to "/" ?Granulose
Not AFAIK, you could look through the Hadoop NativeS3FileSystem code historyEpsom
L
1

Instead of using s3://, use s3a://; that will solve your issue.

This happens because of the S3 path you use during writing.

s3:// vs s3a://

s3:// will create the _$folder$ marker; s3a:// will not.

The prefixes s3:// and s3a:// are both used to specify the protocol for accessing data stored in Amazon S3 within Apache Spark.

  1. s3://: On EMR (and in Glue) this prefix is handled by Amazon's own S3 connector rather than the open-source Hadoop client. It is that connector which writes the empty _$folder$ marker objects when it creates output prefixes.

  2. s3a://: This prefix selects the open-source Hadoop S3A connector (org.apache.hadoop.fs.s3a.S3AFileSystem). S3A uses keys ending in "/" as its directory markers and ignores _$folder$ files in listings, so writing through s3a:// does not leave those marker objects behind.

In general, it is recommended to use s3a:// instead of s3:// when working with open-source Spark and S3, as s3a:// offers better performance and reliability. The choice may still depend on your use case and on what your platform supports.

For example, to specify the input or output path for reading or writing data to S3 using s3://, you can use the following syntax:

inputPath = "s3://your-bucket/your-input-path"
outputPath = "s3://your-bucket/your-output-path"

Similarly, to use s3a://, you can replace s3:// with s3a:// in the path:

inputPath = "s3a://your-bucket/your-input-path"
outputPath = "s3a://your-bucket/your-output-path"
Lunar answered 10/9, 2023 at 10:50 Comment(0)
A
0

Use the script below as an EMR bootstrap action to solve this issue. The patch is provided by AWS:

#!/bin/bash

# NOTE: This script replaces the s3-dist-cp RPM on EMR versions 4.6.0+ with s3-dist-cp-2.2.0.
# This is intended to remove the _$folder$ markers when creating the destination prefixes in S3.

set -ex

RPM=bootstrap-actions/s3-dist-cp-2.2.0/s3-dist-cp-2.2.0-1.amzn1.noarch.rpm

LOCAL_DIR=/var/aws/emr/packages/bigtop/s3-dist-cp/noarch

# Get the region from metadata
REGION=$(curl http://169.254.169.254/latest/meta-data/placement/availability-zone/ 2>/dev/null | head -c -1)

# Choose correct bucket for region
if [ $REGION = "us-east-1" ]
then
    BUCKET=awssupportdatasvcs.com
else
    BUCKET=$REGION.awssupportdatasvcs.com
fi

# Download new RPM
sudo rm $LOCAL_DIR/s3-dist-cp*.rpm
aws s3 cp s3://$BUCKET/$RPM /tmp/
sudo cp /tmp/s3-dist-cp-2.2.0-1.amzn1.noarch.rpm $LOCAL_DIR/

echo Rebuilding Repo
sudo yum install -y createrepo
sudo createrepo --update -o /var/aws/emr/packages/bigtop /var/aws/emr/packages/bigtop
sudo yum clean all
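
If you launch the cluster yourself rather than through Data Pipeline, here is a hedged boto3 sketch of attaching the script as a bootstrap action (the script location, release label and instance settings are placeholders; within Data Pipeline, the EmrCluster object's bootstrapAction field serves the same purpose):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="cluster-with-s3distcp-patch",
    ReleaseLabel="emr-5.4.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "replace-s3-dist-cp",
            "ScriptBootstrapAction": {
                # Assumed location: upload the script above to your own bucket
                "Path": "s3://my-s3-bucket/bootstrap/replace-s3-dist-cp.sh",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])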
Arrowy answered 20/3, 2017 at 19:35 Comment(0)
