pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

Asked 2/11, 2018 at 16:54 Answered 5/9, 2024 at 16:28

This question has a different answer from those given in the linked post.

I am getting an error that reads

pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

when I try to read a Parquet file as follows, using Spark 2.1.0:

data = spark.read.parquet('/myhdfs/location/')

I have checked that the file/table is not empty by looking at the Impala table through the Hue web portal. Other files that I have stored in similar directories read absolutely fine. For the record, the file names contain hyphens but no underscores or full stops/periods.

Hence, none of the answers in the following post apply: Unable to infer schema when loading Parquet file

Any ideas?

Proust answered 2/11, 2018 at 16:54 Comment(5)

Have you checked the answers on this post first: #44955392 – Maureenmaureene 2/11, 2018 at 18:1

Possible duplicate of Unable to infer schema when loading Parquet file – Electrical 2/11, 2018 at 18:47

Yeap. I’ve read that and none of the answers apply. – Proust 3/11, 2018 at 1:0

Try reading an individual Parquet file by providing its full path and report the outcome. – Flowering 3/11, 2018 at 23:52

Ah hah! It turns out there was another level in the directory structure! – Proust 6/11, 2018 at 11:19

It turns out I was getting this error because there was another level in the directory structure. The following was what I needed:

data = spark.read.parquet('/myhdfs/location/anotherlevel/')
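
A quick way to spot an unexpected extra directory level is to list where the Parquet files actually sit before calling `spark.read.parquet`. The sketch below is only an illustration: `find_parquet_dirs` and `read_first_parquet_dir` are hypothetical helpers, and it assumes the path is reachable through the local filesystem (on HDFS, something like `hdfs dfs -ls -R /myhdfs/location/` gives the same information).

```python
import os


def find_parquet_dirs(root):
    """Return the sorted directories under `root` that directly contain .parquet files."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith(".parquet") for name in filenames):
            hits.append(dirpath)
    return sorted(hits)


def read_first_parquet_dir(spark, root):
    """Read the first directory that holds Parquet files.

    `spark` is an existing SparkSession; this helper is hypothetical and
    assumes `root` is a path Spark can also resolve.
    """
    dirs = find_parquet_dirs(root)
    if not dirs:
        raise FileNotFoundError("no .parquet files found under " + root)
    return spark.read.parquet(dirs[0])
```

If the list comes back with a subdirectory such as `anotherlevel` rather than the directory you passed in, that is the path to hand to `spark.read.parquet`.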

Proust answered 6/11, 2018 at 11:21 Comment(0)

I got the same problem but none of the answers I found online worked for me. It turns out that I was writing the code in this way:

data = spark.read.parquet("/myhdfs/location/anotherlevel/")

That is, I was using double quotes ("). When I switched to single quotes ('), my problem was solved.

data = spark.read.parquet('/myhdfs/location/anotherlevel/')

Sharing in case it helps anybody.
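
For anyone trying this: in Python, single-quoted and double-quoted string literals produce the same string, so the quote style on its own should not change what Spark receives. Something else, such as the path itself, may have changed at the same time. A minimal check, using the paths from this answer:

```python
# Both literals describe the same path; only the quoting style differs.
double_quoted = "/myhdfs/location/anotherlevel/"
single_quoted = '/myhdfs/location/anotherlevel/'

# The two values compare equal, so spark.read.parquet gets identical input.
same_path = double_quoted == single_quoted
print(same_path)
```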

Clarkclarke answered 25/3, 2022 at 16:2 Comment(1)

This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review – Loser 29/3, 2022 at 5:56

For me, it worked when I specified the properties (columns) manually, as shown below:

data = spark.read.parquet("/myhdfs/location/anotherlevel/").select("Property1", "Property2", "Property3")
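
The error message asks for the schema to be specified manually, and that is normally done with `DataFrameReader.schema` before the read, not with `select` after it. The sketch below is an illustration under assumptions: the column names come from this answer, the `STRING` types are guesses, and `to_ddl` and `read_with_schema` are hypothetical helpers. Passing a DDL string to `schema` needs a newer Spark than the 2.1.0 in the question (2.3 or later, as far as I recall); on older versions a `StructType` from `pyspark.sql.types` would be passed instead.

```python
# Column names are taken from the answer above; the types are assumptions.
COLUMNS = [
    ("Property1", "STRING"),
    ("Property2", "STRING"),
    ("Property3", "STRING"),
]


def to_ddl(columns):
    """Build a DDL schema string such as 'a STRING, b INT' from (name, type) pairs."""
    return ", ".join("{} {}".format(name, sql_type) for name, sql_type in columns)


def read_with_schema(spark, path, columns=COLUMNS):
    """Read Parquet with an explicit schema so Spark skips schema inference.

    `spark` is an existing SparkSession; this helper is hypothetical.
    """
    return spark.read.schema(to_ddl(columns)).parquet(path)
```

An explicit schema only suppresses inference. If the path points at a directory with no Parquet files directly inside it, the read may simply return an empty DataFrame, so it is still worth confirming the path first.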

Spermic answered 5/9, 2024 at 16:28 Comment(1)

Please use code formatting to improve clarity. – Ailey 19/9, 2024 at 6:36
