spark read partitioned data in S3 partly in glacier

Asked 21/8, 2017 at 13:12 Answered 24/6 at 20:24

apache-spark amazon-s3 partitioning amazon-glacier

I have a dataset in parquet in S3 partitioned by date (dt) with oldest date stored in AWS Glacier to save some money. For instance, we have...

s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]

I want to read this dataset, but only the a subset of date that are not yet in glacier, eg:

val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))

Unfortunately, I have the exception

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)

I seems that spark does not like partitioned dataset when some partitions are in Glacier. I could always read specifically each date, add the column with current date and reduce(_ union _) at the end, but it is ugly like hell and it should not be necessary.

Is there any tip to read available data in the datastore even with old data in glacier?

Reneta answered 21/8, 2017 at 13:12 Comment(6)

I don't believe that it's possible. AWS Glacier doesn't seem to support pushdown predicates... – Denote 21/8, 2017 at 13:17

Don't use glacier if you need to retrieve some recent data. It doesn't worth the trouble, and you might walking into a pricing trap. Just store you data and standard-IA. medium.com/@karppinen/… – Roby 21/8, 2017 at 14:20

Hi mootmoot. I do not want to unfreeze the data in Glacier (that would be prohibitively expensive). My point was that I want to use partitions of my dataset that are NOT in glacier yet, but other partitions (that I don't want to read) are! I think the only way is to modify the AWS SDK, so that it does not try to scan frozen partitions at all. – Reneta 22/8, 2017 at 19:23

@Boris: Were you able to find an answer to this? Given that there are even more glacier options available now(deep glacier), this process has become painful. Any suggestions? PS: I did see the discussion here on jira.apache.org/jira/browse/SPARK-21797 – Footboard 2/4, 2019 at 2:17

no, it's still not fixed. If you implemented issues.apache.org/jira/browse/HADOOP-14837 then spark could see which was offline, and maybe skip it, but really, that's a level of concern which Spark shouldn't have to do. Split offline and hot data or expect stack traces – Unconstitutional 3/4, 2019 at 9:41

@ManishRanjan did you got your answer? – Magisterial 10/4, 2019 at 18:16

Error you are getting not related to Apache spark , you are getting exception because of Glacier service in short S3 objects in the Glacier storage class are not accessible in the same way as normal objects, they need to be retrieved from Glacier before they can be read.

Apache Spark cannot handle directly glacier storage TABLE/PARTITION mapped to an S3 .

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)

When S3 moves any objects from S3 storage classes

STANDARD,
STANDARD_IA,
REDUCED_REDUNDANCY

to GLACIER storage class, you have object S3 has stored in Glacier which is not visible to you and S3 will bill only Glacier storage rates.

It is still an S3 object, but has the GLACIER storage class.

When you need to access one of these objects, you initiate a restore, which temporary copy into S3 .

Move data into S3 bucket read into Apache Spark will resolve your issue.

https://aws.amazon.com/s3/storage-classes/

Note : Apache Spark , AWS athena etc cannot read object directly from glacier if you try will get 403 error.

If you archive objects using the Glacier storage option, you must inspect the storage class of an object before you attempt to retrieve it. The customary GET request will work as expected if the object is stored in S3 Standard or Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object is archived in Glacier. In this case, you must use the RESTORE operation (described below) to make your data available in S3.

https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/

Inwrought answered 3/4, 2019 at 5:26 Comment(0)

403 error is due to the fact you can not read object that is archieve in Glacier, source

Reading Files from Glacier

If you want to read files from Glacier, you need to restore them to s3 before using them in Apache Spark, a copy will be available on s3 for the time mentioned during restore command, for details see here, you can use S3 console, cli or any language to do that too

Discarding some Glacier files that you do not want to restore

Let's say you do not want to restore all the files from Glacier and discard them during processing, from Spark 2.1.1, 2.2.0 you can ignore those files (with IO/Runtime Exception), by setting spark.sql.files.ignoreCorruptFiles to true source

Magisterial answered 3/4, 2019 at 22:36 Comment(0)

If you define your table through Hive, and use the Hive metastore catalog to query it, it won't try to go onto the non selected partitions. Take a look at the spark.sql.hive.metastorePartitionPruning setting

Impasse answered 6/2, 2018 at 12:38 Comment(0)

try this setting: ss.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER") or add the spark-defaults.conf config:

spark.sql.hive.caseSensitiveInferenceMode NEVER_INFER

Culm answered 29/3, 2021 at 9:16 Comment(0)

The S3 connectors from Amazon (s3://) and the ASF (s3a://) don't work with Glacier. Certainly nobody tests s3a against glacier. and if there were problems, you'd be left to fix them yourself. Just copy the data into s3 or onto local HDFS and then work with it there

Unconstitutional answered 22/8, 2017 at 9:28 Comment(0)

I finally got around this error with the following pyspark code:

df = spark.read.option("excludeStorageClasses", ["GLACIER", "DEEP_ARCHIVE"]).parquet(location)

Tankard answered 24/6 at 20:24 Comment(0)

Reading Files from Glacier

Discarding some Glacier files that you do not want to restore

Recommended topics

Hot tags