pyspark: ValueError: Some of types cannot be determined after inferring

I have a pandas data frame my_df, and my_df.dtypes gives us:

ts              int64
fieldA         object
fieldB         object
fieldC         object
fieldD         object
fieldE         object
dtype: object

Then I try to convert the pandas data frame my_df to a Spark data frame like this:

spark_my_df = sc.createDataFrame(my_df)

However, I get the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-29-d4c9bb41bb1e> in <module>()
----> 1 spark_my_df = sc.createDataFrame(my_df)
      2 spark_my_df.take(20)

/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
    520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    521         else:
--> 522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    523         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    524         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    384 
    385         if schema is None or isinstance(schema, (list, tuple)):
--> 386             struct = self._inferSchemaFromList(data)
    387             if isinstance(schema, (list, tuple)):
    388                 for i, name in enumerate(schema):

/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchemaFromList(self, data)
    318         schema = reduce(_merge_type, map(_infer_schema, data))
    319         if _has_nulltype(schema):
--> 320             raise ValueError("Some of types cannot be determined after inferring")
    321         return schema
    322 

ValueError: Some of types cannot be determined after inferring

Does anyone know what the above error means? Thanks!

Bevon answered 9/11, 2016 at 23:11 Comment(0)

In order to infer the field type, PySpark looks at the non-None records in each field. If a field only has None records, PySpark cannot infer the type and will raise that error.

Manually defining a schema will resolve the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
Asta answered 15/11, 2016 at 18:28 Comment(2)
Can I just give the schema for the entire None column and skip the rest of the columns?Saddle
@AviralSrivastava No, you have to specify them all. But if the schema can be inferred by auto-loading, you can print it with df.schema and use that as a starting point; edit it in Python to fix the problematic fields, then load or cast the dataset with the corrected schema. It is laborious, but it works.Catboat
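
A rough sketch of that workflow (not from the original answers; it assumes a SparkSession named spark, the my_df pandas frame from the question, and that inference succeeds once the all-null columns are dropped):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# 1. Let Spark infer what it can, e.g. from the columns that are not all-null,
#    and print the result to use as a starting point.
trial_df = spark.createDataFrame(my_df.dropna(axis='columns', how='all'))
print(trial_df.schema)

# 2. Copy the printed StructType into your code and hand-edit it so every column
#    of my_df is covered, giving the all-null columns an explicit type, e.g.:
schema = StructType([
    StructField("ts", LongType(), True),
    StructField("fieldA", StringType(), True),
    StructField("fieldB", StringType(), True),
    StructField("fieldC", StringType(), True),
    StructField("fieldD", StringType(), True),
    StructField("fieldE", StringType(), True),
])

# 3. Load the full data with the corrected schema.
spark_my_df = spark.createDataFrame(my_df, schema=schema)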
C
20

To fix this problem, you can provide your own schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
Catchings answered 15/1, 2018 at 18:10 Comment(2)
If we have more than 2 columns, and only 1 column is fully null, is there a more elegant way to pass the schema without explicitly defining the schema for all the columns?Tierratiersten
Why can't we simply convert to a Spark DF with all nulls? For me, it worked fine the other way around when converting from Spark with toPandas(). I'm converting a Spark df with toPandas() to use pandas functionality, but can't convert it back now.Optometry

If you are using the RDD[Row].toDF() monkey-patched method, you can increase the sample ratio to check more than 100 records when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null values in all fields of your RDD, it becomes more likely that they are sampled as you increase the sampleRatio towards 1.0.
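
For illustration, a hedged sketch (not from the original answer) of when this matters: the non-null values only appear after the leading rows that default inference looks at, so sampling across the whole RDD is what lets the type be found. It assumes a SparkSession named spark:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The first 200 rows have a null "score"; the non-null values only come later,
# so inferring from just the leading rows would fail.
rows = [Row(name="a", score=None)] * 200 + [Row(name="b", score=float(i)) for i in range(200)]
rdd = sc.parallelize(rows)

# Sampling a fraction of the whole RDD makes it likely that non-null "score"
# values are seen, so DoubleType can be inferred for that field.
df = rdd.toDF(sampleRatio=0.5)
df.printSchema()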

Catboat answered 24/4, 2019 at 17:54 Comment(2)
If your RDD is very large, make your sample ratio more like 0.01, or Spark will take a long time at the very end of the job.Correspondent
@Correspondent I'll amend the answer, this is a better default, thanks.Catboat

I've run into this same issue. If you do not need the columns that are all null, you can simply drop them from the pandas dataframe before importing to Spark:

my_df = my_df.dropna(axis='columns', how='all') # Drops columns with all NA values
spark_my_df = sc.createDataFrame(my_df)
Zoroaster answered 13/9, 2019 at 13:49 Comment(2)
How do you do this if you are not importing from pandas?Affirmation
It would depend on what you are using to import, the original question is about importing from Pandas.Zoroaster

This is probably because of columns that have all null values. You should drop those columns before converting your data to a Spark DataFrame.
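
As a hedged sketch of that approach (the StringType fallback and the spark/my_df names are assumptions, not part of this answer), you can find the all-null columns, drop them for the conversion, and add them back afterwards with an explicit type if you still need them:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# Columns in the pandas DataFrame that contain only null values.
all_null_cols = [c for c in my_df.columns if my_df[c].isna().all()]

# Convert without the problematic columns so type inference succeeds.
spark_my_df = spark.createDataFrame(my_df.drop(columns=all_null_cols))

# Optionally add the dropped columns back as typed null columns
# (StringType is just a placeholder; pick whatever type each column should have).
for c in all_null_cols:
    spark_my_df = spark_my_df.withColumn(c, F.lit(None).cast(StringType()))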

Frostbitten answered 29/8, 2019 at 15:9 Comment(0)

The reason for this error is that Spark is not able to determine the data types of your pandas dataframe, so one way to solve this is to pass the schema separately to Spark's createDataFrame function.

For example, say your pandas dataframe looks like this:

import pandas as pd

d = {
  'col1': [1, 2],
  'col2': ['A', 'B']
}
df = pd.DataFrame(data = d)
print(df)

   col1 col2
0     1    A
1     2    B

When you want to convert it into a Spark dataframe, start by defining the schema and pass it to createDataFrame as follows:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
  StructField("col1", LongType()),
  StructField("col2", StringType()),
])


spark_df = spark.createDataFrame(df, schema = schema)
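
To confirm the types were applied, you can print the resulting schema; with the LongType/StringType schema above, the output should look like this:

spark_df.printSchema()
# root
#  |-- col1: long (nullable = true)
#  |-- col2: string (nullable = true)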
Bakemeier answered 30/10, 2022 at 21:36 Comment(0)

This happens when at least one of your columns is completely empty and has no values at all. If you remove the completely empty column and rerun, it will not throw the error.

Homicide answered 19/5 at 11:39 Comment(0)
