According to the accepted answer in "pyspark collect_set or collect_list with groupby", when you do a collect_list on a certain column, the null values in this column are removed. I have checked and this is true.
But in my case, I need to keep the null values. How can I achieve this?
I did not find any info on this kind of variant of the collect_list function.
Background context to explain why I want nulls:
I have a dataframe df as below:
cId | eId | amount | city
1 | 2 | 20.0 | Paris
1 | 2 | 30.0 | Seoul
1 | 3 | 10.0 | Phoenix
1 | 3 | 5.0 | null
I want to write this to an Elasticsearch index with the following mapping:
"mappings": {
    "doc": {
        "properties": {
            "eId": { "type": "keyword" },
            "cId": { "type": "keyword" },
            "transactions": {
                "type": "nested",
                "properties": {
                    "amount": { "type": "keyword" },
                    "city": { "type": "keyword" }
                }
            }
        }
    }
}
In order to conform to the nested mapping above, I transformed my df so that for each combination of eId and cId, I have an array of transactions like this:
from pyspark.sql.functions import collect_list, struct
df_nested = df.groupBy('eId','cId').agg(collect_list(struct('amount','city')).alias("transactions"))
df_nested.printSchema()
root
 |-- cId: integer (nullable = true)
 |-- eId: integer (nullable = true)
 |-- transactions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- amount: float (nullable = true)
 |    |    |-- city: string (nullable = true)
Saving df_nested as a JSON file, these are the JSON records that I get:
{"cId":1,"eId":2,"transactions":[{"amount":20.0,"city":"Paris"},{"amount":30.0,"city":"Seoul"}]}
{"cId":1,"eId":3,"transactions":[{"amount":10.0,"city":"Phoenix"},{"amount":5.0}]}
As you can see, when cId=1 and eId=3, the array element with amount=5.0 does not have the city attribute because this was a null in my original data (df). The nulls are being removed when I use the collect_list function.
However, when I try writing df_nested to Elasticsearch with the above index, it errors because there is a schema mismatch. This is the reason why I want to retain my nulls after applying the collect_list function.
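To make the target concrete, here is a minimal plain-Python sketch (no Spark involved) of the record shape I am after. The rows are hard-coded from the table above, with None standing in for the null city; the helper name nest_transactions is made up purely for this illustration and is not a PySpark function.

```python
import json
from collections import defaultdict

# Rows copied from the example table; None stands in for the null city.
rows = [
    {"cId": 1, "eId": 2, "amount": 20.0, "city": "Paris"},
    {"cId": 1, "eId": 2, "amount": 30.0, "city": "Seoul"},
    {"cId": 1, "eId": 3, "amount": 10.0, "city": "Phoenix"},
    {"cId": 1, "eId": 3, "amount": 5.0, "city": None},
]


def nest_transactions(rows):
    """Group rows by (cId, eId), keeping every key even when its value is None.

    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["cId"], row["eId"])].append(
            {"amount": row["amount"], "city": row["city"]}
        )
    return [
        {"cId": c_id, "eId": e_id, "transactions": txns}
        for (c_id, e_id), txns in grouped.items()
    ]


records = nest_transactions(rows)
for record in records:
    # json.dumps serializes None as an explicit JSON null.
    print(json.dumps(record))
```

In this sketch the last transaction keeps its city key and is serialized with an explicit null value instead of the key being dropped, which is the shape I would like df_nested to produce.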
null values with something else, perhaps the string 'null'? – Bakeman