I am making a query in Spark in Databricks, and I have a problema when I am trying to make a join between two dataframes. The two dataframes that I have are the next ones:
"names_df" which has 2 columns: "ID", "title" that refer to the id and the title of films.
+-------+-----------------------------+ |ID |title | +-------+-----------------------------+ |1 |Toy Story | |2 |Jumanji | |3 |Grumpier Old Men | +-------+-----------------------------+
"info" which has 3 columns: "movieId", "count", "average" that refer to the id of the film, the number of ranks that it has, and the average of those ratings.
+-------+-----+------------------+ |movieId|count|average | +-------+-----+------------------+ |1831 |7463 |2.5785207021305103| |431 |8946 |3.695059244355019 | |631 |2193 |2.7273141814865483| +-------+-----+------------------+
This "info" dataframe was created this way:
info = ratings_df.groupBy('movieId').agg(F.count(ratings_df.rating).alias("count"), F.avg(ratings_df.rating).alias("average"))
Where "ratings_df" is another dataframe that contains 3 columns: "userId", "movieId" and "rating", that refer to the id of the user that voted, the id of the film that the user voted to, and the rating for that film:
+-------+-------+-------------+
|userId |movieId|rating |
+-------+-------+-------------+
|1 |2 |3.5 |
|1 |29 |3.5 |
|1 |32 |3.5 |
+-------+-------+-------------+
I am trying to make a join between these two dataframes to get another one with those columns: "movieId", "title", "count", "average":
+-------+-----------------------------+-----+-------+
|average|title |count|movieId|
+-------+-----------------------------+-----+-------+
|5.0 |Ella Lola, a la Trilby (1898)|1 |94431 |
|5.0 |Serving Life (2011) |1 |129034 |
|5.0 |Diplomatic Immunity (2009? ) |1 |107434 |
+-------+-----------------------------+-----+-------+
So the operation I did was the next one:
movie_names_df = info.join(movies_df, info.movieId == movies_df.ID, "inner").select(movies_df.title, info.average, info.movieId, info.count).show()
The problem is that I get the next error message:
AttributeError: 'function' object has no attribute '_get_object_id'
And I know that this error occurs because it consider that info.count is a function, and not an attribute, as I defined previously.
So, how could I make that join correctly to get what I want?
Thank you so much!
info["count"]
? – Kiger