AnalysisException: u"cannot resolve 'name' given input columns: [ list] in sqlContext in spark

7

21

I tried a simple example like:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

It works well, but when I try something very similar:

data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)

data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)

It raises an error: AnalysisException: u"cannot resolve 'timedelta' given input columns: [ data_channel_is_tech,...

Of course, I imported LabeledPoint and LinearRegression.

What could be wrong?

Even the simpler case

df_cleaned = df_cleaned.select("shares")

raises the same AnalysisException.

Please note: df_cleaned.printSchema() works well.
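A quick way to see exactly what Spark thinks the column names are (a small diagnostic sketch, assuming the DataFrame is df_cleaned) is to print their repr(), which makes any stray characters visible:

# Print each raw column name; repr() exposes hidden leading/trailing characters
for c in df_cleaned.columns:
    print(repr(c))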

Technology answered 18/8, 2016 at 10:57 Comment(0)
14

I found the issue: some of the column names contain whitespace before the name itself. So

data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()

worked. I could catch the whitespace using

assert " " not in ''.join(df.columns)  

Now I am thinking of a way to remove the whitespace. Any ideas are much appreciated!
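One possible approach (a sketch, not part of the original answer, assuming the data DataFrame from above) is to rename every column to its stripped name before selecting:

# Strip leading/trailing whitespace from every column name (sketch)
data = data.toDF(*[c.strip() for c in data.columns])

# The select now resolves without the leading spaces
data = data.select("timedelta", "shares")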

Technology answered 18/8, 2016 at 17:48 Comment(0)
4

The header contains spaces or tabs; remove them and try again.

1) My example script

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv(r'test.csv', header=True, sep='^')  # '^' is the field separator
print("#################################################################")
print(df.printSchema())
df.createOrReplaceTempView("test")
re = spark.sql("select max_seq from test")
print(re.show())
print("################################################################")

2) Input file: here 'max_seq ' contains a trailing space, so we get the exception below

Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date

Traceback (most recent call last):
  File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
    re=spark.sql("select max_seq from test")
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r

3) Remove the space after the 'max_seq' column and it will work fine

Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date


17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
  max_seq
    10
    23
    22
    22
only showing top 20 rows

None
##############################################################
Ashram answered 20/3, 2017 at 6:54 Comment(2)
I'm experiencing the same error with a parquet file. I cannot change the header separator. Your solution doesn't completely fix the issue (the title does not specify CSV files only). Any suggestions? – Huss
Hi Jeremy, skip the header when reading and add the header to the DataFrame later. – Ashram
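For the parquet case raised in the comments, one option (a sketch, assuming spark and a hypothetical file path) is to rename the columns after reading, since there is no header line to edit:

# Read the parquet file, then strip whitespace from each column name (sketch)
pdf = spark.read.parquet("data.parquet")  # hypothetical path
for c in pdf.columns:
    pdf = pdf.withColumnRenamed(c, c.strip())
pdf.printSchema()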
1
As there were tabs in my input file, removing the tabs or spaces from the header allowed the result to display correctly.

My example:

saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema=True, sep='\t')


saledf.printSchema()

root
|-- ProductID: string (nullable = true)
|-- Name: string (nullable = true)
|-- ProductNumber: string (nullable = true)

saledf.describe('ProductNumber').show()

 +-------+-------------+
 |summary|ProductNumber|
 +-------+-------------+
 |  count|          295|
 |   mean|         null|
 | stddev|         null|
 |    min|      BB-7421|
 |    max|      WB-H098|
 +-------+-------------+
Procopius answered 26/2, 2020 at 7:9 Comment(0)
0

Even if you don't have whitespace in the headers, this error is also raised when you don't specify a header for the CSV at all, like this:

df = sqlContext.read.csv('data.csv')

So you need to change it to this:

df = sqlContext.read.csv('data.csv', header=True)
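Without header=True, Spark treats the first row as data and auto-names the columns _c0, _c1, ..., so a query by the real header name cannot resolve. A small check (a sketch, assuming a hypothetical data.csv):

# Without the header flag the columns are auto-named _c0, _c1, ...
sqlContext.read.csv('data.csv').printSchema()

# With header=True the first row supplies the real column names
sqlContext.read.csv('data.csv', header=True).printSchema()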
Calotte answered 12/3, 2021 at 10:15 Comment(0)
0

Recently, I came across this issue while working in Azure Synapse Analytics; my error was the same.

AnalysisException: cannot resolve '`xxxxxx`' given input columns: [];;
'Filter ('passenger_count > 0)
+- Relation[] csv

Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1364, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stacktrace)

This error came from an incorrect column name in our code or in the CSV file. Use this code to read the CSV file:

df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=";", inferSchema="true", header="true")

If you get stuck again somewhere in Synapse or PySpark, visit this site for error info: https://docs.actian.com/avalanche/index.html#page/User/Common_Data_Loading_Error_Messages.htm

and for more info, see the documentation: https://spark.apache.org/docs/latest/api/python/

Chromaticity answered 18/7, 2021 at 10:6 Comment(0)
0

I had a similar issue while trying to do a max aggregation on a DataFrame. It turned out I hadn't imported pyspark.sql.functions.max and was using Python's built-in max() function instead, which caused the error.
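A minimal sketch of the fix (assuming a DataFrame df with a hypothetical numeric column named shares):

# Use Spark's max, not Python's builtin max(), for column aggregation
from pyspark.sql import functions as F

df.agg(F.max("shares")).show()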

Ence answered 31/5, 2023 at 11:32 Comment(1)
I had the same issue and this answer helped me; I was using a PySpark interactive session. – Retraction
-1

I have solved the problem by changing count(field_name) to count(1).

Burgher answered 4/10, 2023 at 14:48 Comment(2)
Can you explain how this answers the question? – Luff
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Fem
