Spark join throws 'function' object has no attribute '_get_object_id' error. How could I fix it?

I am running a query in Spark on Databricks, and I have a problem when I try to join two dataframes. The two dataframes are the following:

  • "names_df" which has 2 columns: "ID", "title" that refer to the id and the title of films.

    +-------+-----------------------------+
    |ID     |title                        |
    +-------+-----------------------------+
    |1      |Toy Story                    |
    |2      |Jumanji                      |
    |3      |Grumpier Old Men             |
    +-------+-----------------------------+
    
  • "info" which has 3 columns: "movieId", "count", "average" that refer to the id of the film, the number of ratings that it has, and the average of those ratings.

    +-------+-----+------------------+
    |movieId|count|average           |
    +-------+-----+------------------+
    |1831   |7463 |2.5785207021305103|
    |431    |8946 |3.695059244355019 |
    |631    |2193 |2.7273141814865483|
    +-------+-----+------------------+
    

This "info" dataframe was created this way:

info =  ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))

Where "ratings_df" is another dataframe that contains 3 columns: "userId", "movieId" and "rating", that refer to the id of the user who voted, the id of the film that the user rated, and the rating given to that film:

+-------+-------+-------------+
|userId |movieId|rating       |
+-------+-------+-------------+
|1      |2      |3.5          |
|1      |29     |3.5          |
|1      |32     |3.5          |
+-------+-------+-------------+

I am trying to join these two dataframes to get another one with these columns: "movieId", "title", "count", "average":

+-------+-----------------------------+-----+-------+
|average|title                        |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0    |Ella Lola, a la Trilby (1898)|1    |94431  |
|5.0    |Serving Life (2011)          |1    |129034 |
|5.0    |Diplomatic Immunity (2009? ) |1    |107434 |
+-------+-----------------------------+-----+-------+

So the operation I ran was the following:

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()

The problem is that I get the following error message:

AttributeError: 'function' object has no attribute '_get_object_id'

And I know that this error occurs because info.count is treated as a function, and not as the column I defined previously.

So, how could I make that join correctly to get what I want?

Thank you so much!

Copyist answered 7/9, 2016 at 7:57 Comment(2)
Have you tried to use the alternative column access using brackets? E.g. info["count"]? - Kiger
Thank you! That's it! I was trying to make it using parentheses. :) - Copyist

Adding my comment as an answer since it solved the problem. count is already a method of the DataFrame API (DataFrame.count() returns the number of rows), so with dot notation info.count resolves to that method instead of your column, which makes count a risky column name. In your case you can avoid the error by using bracket-based column access instead of dot notation, e.g.

info["count"]
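To illustrate why the dot notation fails while the brackets work, here is a minimal pure-Python sketch. ToyFrame and ToyColumn are hypothetical stand-ins, not PySpark classes; the sketch only assumes that column access by attribute goes through Python's __getattr__ fallback, which is consulted only when no real attribute or method with that name exists.

```python
class ToyColumn:
    """Hypothetical stand-in for a column reference; holds only the name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"ToyColumn({self.name!r})"


class ToyFrame:
    """Toy model of a DataFrame-like object (not PySpark's implementation)."""

    def __init__(self, columns):
        self.columns = list(columns)

    def count(self):
        # A real method, analogous to DataFrame.count() returning a row count.
        return 0

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so it is never reached for names that are methods, such as "count".
        if name in self.columns:
            return ToyColumn(name)
        raise AttributeError(name)

    def __getitem__(self, name):
        # Bracket access always looks the name up among the columns.
        if name in self.columns:
            return ToyColumn(name)
        raise KeyError(name)


info = ToyFrame(["movieId", "count", "average"])
dot_average = info.average     # a ToyColumn: no method is named "average"
dot_count = info.count         # the bound method: it wins over the column
bracket_count = info["count"]  # a ToyColumn: brackets skip attribute lookup
```

Applied to the query in the question, the same idea means selecting info["count"] instead of info.count.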
Kiger answered 7/9, 2016 at 12:19 Comment(4)
I ran into a similar problem with a column named alias. Thanks for this answer! - Clements
Same here with sample! - Quimper
Is there a full list of reserved keywords somewhere? - Coplin
I had the same problem with name - Coplin
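On the question of a list of problematic names: since the collisions come from attributes and methods that already exist on the dataframe class, one way to find them is by introspection. The sketch below is an assumption-laden illustration: shadowed_columns is a hypothetical helper, and ToyFrame is a stand-in class that only carries a few of the method names mentioned in this thread.

```python
def shadowed_columns(columns, frame_class):
    """Return the column names that are also attributes of frame_class."""
    return [name for name in columns if hasattr(frame_class, name)]


class ToyFrame:
    # Hypothetical stand-in with a few method names mentioned in this thread.
    def count(self):
        return 0

    def alias(self, name):
        return self

    def sample(self, fraction):
        return self


# Column names to check; "count" and "sample" collide with ToyFrame methods.
collisions = shadowed_columns(
    ["movieId", "count", "average", "sample"], ToyFrame
)
```

With PySpark one would presumably pass the real DataFrame class and df.columns instead of the toy values; that usage is an assumption and was not run here. Any name the helper reports is safer to access with brackets.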

Try calling info.count as a function, i.e. info.count().

movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count()).show()
Transgression answered 15/10, 2022 at 9:49 Comment(0)
