pyspark: ValueError: Some of types cannot be determined after inferring

I have a pandas data frame my_df, and my_df.dtypes gives us:

ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I try to convert the pandas data frame my_df to a Spark data frame like this:

spark_my_df = sc.createDataFrame(my_df)

However, I get the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!

Bevon answered 9/11, 2016 at 23:11 Comment(0)

In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.

Manually defining a schema will resolve the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
Asta answered 15/11, 2016 at 18:28 Comment(2)
Can I just give the schema for the entire None column and skip the rest of the columns?Saddle
@AviralSrivastava No, you have to specify them all. But if the schema can be inferred by auto-loading, you can print it with df.schema and use that as a starting point; edit it in Python to fix the problematic fields, then load or cast the dataset with the corrected schema. It is laborious, but it works.Catboat
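
A rough sketch of that workflow (not from the original answers; it assumes a SparkSession named spark, the my_df pandas frame from the question, and that inference succeeds once the all-null columns are dropped):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# 1. Let Spark infer what it can, e.g. from the columns that are not all-null,
#    and print the result to use as a starting point.
trial_df = spark.createDataFrame(my_df.dropna(axis='columns', how='all'))
print(trial_df.schema)

# 2. Copy the printed StructType into your code and hand-edit it so every column
#    of my_df is covered, giving the all-null columns an explicit type, e.g.:
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])

# 3. Load the full data with the corrected schema.
spark_my_df = spark.createDataFrame(my_df, schema=schema)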
C
20

To fix this problem, you can provide your own schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
Catchings answered 15/1, 2018 at 18:10 Comment(2)
If we have more than 2 columns, and only 1 column is fully null, is there a more elegant way to pass the schema without explicitly defining the schema for all the columns?Tierratiersten
Why can't we simply convert to a Spark DF with all nulls? For me, it worked fine the other way around when converting from Spark with toPandas(). I'm converting a Spark df with toPandas() to use pandas functionality, but can't convert it back now.Optometry

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null values in all fields of your RDD, it becomes more likely that they are sampled as you increase the sampleRatio towards 1.0.
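
For illustration, a hedged sketch (not from the original answer) of when this matters: the non-null values only appear after the leading rows that default inference looks at, so sampling across the whole RDD is what lets the type be found. It assumes a SparkSession named spark:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The first 200 rows have a null "score"; the non-null values only come later,
# so inferring from just the leading rows would fail.
rows = [Row(name="a", score=None)] * 200 + [Row(name="b", score=float(i)) for i in range(200)]
rdd = sc.parallelize(rows)

# Sampling a fraction of the whole RDD makes it likely that non-null "score"
# values are seen, so DoubleType can be inferred for that field.
df = rdd.toDF(sampleRatio=0.5)
df.printSchema()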

Catboat answered 24/4, 2019 at 17:54 Comment(2)
If your RDD is very large, make your sample ratio more like 0.01, or Spark will take a long time at the very end of the job.Correspondent
@Correspondent I'll amend the answer, this is a better default, thanks.Catboat

I've run into this same issue. If you do not need the columns that are all null, you can simply drop them from the pandas dataframe before importing to Spark:

my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
Zoroaster answered 13/9, 2019 at 13:49 Comment(2)
How do you do this if you are not importing from pandas?Affirmation
It would depend on what you are using to import, the original question is about importing from Pandas.Zoroaster

This is probably because of columns that have all null values. You should drop those columns before converting your data to a Spark DataFrame.
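
As a hedged sketch of that approach (the StringType fallback and the spark/my_df names are assumptions, not part of this answer), you can find the all-null columns, drop them for the conversion, and add them back afterwards with an explicit type if you still need them:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# Columns in the pandas DataFrame that contain only null values.
all_null_cols = [c for c in my_df.columns if my_df[c].isna().all()]

# Convert without the problematic columns so type inference succeeds.
spark_my_df = spark.createDataFrame(my_df.drop(columns=all_null_cols))

# Optionally add the dropped columns back as typed null columns
# (StringType is just a placeholder; pick whatever type each column should have).
for c in all_null_cols:
    spark_my_df = spark_my_df.withColumn(c, F.lit(None).cast(StringType()))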

Frostbitten answered 29/8, 2019 at 15:9 Comment(0)

The reason for this error is that Spark is not able to determine the data types of your pandas dataframe, so one way to solve this is to pass the schema separately to Spark's createDataFrame function.

For example, say your pandas dataframe looks like this:

import pandas as pd

d = {
  'col1': [1, 2],
  'col2': ['A', 'B']
}
df = pd.DataFrame(data = d)
print(df)

   col1 col2
0     1    A
1     2    B

When you want to convert it into a Spark dataframe, start by defining the schema and pass it to createDataFrame as follows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
  StructField("col1", LongType()),
  StructField("col2", StringType()),
])


spark_df = spark.createDataFrame(df, schema = schema)
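
To confirm the types were applied, you can print the resulting schema; with the LongType/StringType schema above, the output should look like this:

spark_df.printSchema()
# root
#  |-- col1: long (nullable = true)
#  |-- col2: string (nullable = true)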
Bakemeier answered 30/10, 2022 at 21:36 Comment(0)

This happens when at least one of your columns is completely empty and has no values at all. If you remove the completely empty column and rerun, it will not throw the error.

Homicide answered 19/5 at 11:39 Comment(0)
