PySpark - Add Row Number After Using Explode Function [duplicate]
Asked Answered
R

1

8

I'm doing an nlp project and have reviews that contain multiple sentences. I am using the spark-nlp package that outputs one column containing a list of the sentences in each review. I am using explode to create a row for each sentence but I want to add numbering so I know which sentence was 1st, 2nd, etc. I don't know how to use row_number() because I don't really have anything to orderBy.

Here's what my data looks like:

REVIEW_ID REVIEW_COMMENTS     SENTENCES_LIST           
    1     Hi. Sent1. Sent2.   [Hi., Sent1., Sent2.]   
    2     Yeah. Ok.           [Yeah., Ok.]  

Here's what I want it to look like:

REVIEW_ID REVIEW_COMMENTS     SENTENCES_LIST           SENTENCE  SENT_NUMBER
    1     Hi. Sent1. Sent2.   [Hi., Sent1., Sent2.]    Hi.       1
    1     Hi. Sent1. Sent2.   [Hi., Sent1., Sent2.]    Sent1.    2
    1     Hi. Sent1. Sent2.   [Hi., Sent1., Sent2.]    Sent2.    3
    2     Yeah. Ok.           [Yeah., Ok.]             Yeah.     1
    2     Yeah. Ok.           [Yeah., Ok.]             Ok.       2

I'm using the code below and not sure how to use row_number() because I don't have a column to use as the "orderBy" except for it's placement in the SENTENCES_LIST.

df2 = df.withColumn('SENTENCE', F.explode('SENTENCES_LIST'))
df3 = df2.withColumn('SENT_NUMBER',row_number().over(Window.partitionBy('REVIEW_ID').orderBy('????'))) 
Reynalda answered 9/6, 2020 at 15:46 Comment(3)
Seems like you're looking for pyspark.sql.functions.posexplode? Try: df2 = df.withColumn('SENTENCE', F.posexplode('SENTENCES_LIST').alias("SENT_NUMBER", "SENT_NUMBER"))Mccarthy
@Mccarthy 'The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got SENTENCE ;'Reynalda
This works though df.select('*', F.posexplode('SENTENCES_LIST').alias('SENT_NUMBER', 'SENTENCE'))Reynalda
O
5

For Spark2.4+, you can use explode with higher order function transform.

from pyspark.sql import functions as F

df.withColumn("list", F.explode(F.expr("""transform(SENTENCES_LIST,(x,i)-> struct(x as SENTENCE,(i+1) as SENT_NUMBER))""")))\
  .select("*", "list.*").show()

#+---------+-----------------+--------------------+-----------+--------+-----------+
#|REVIEW_ID|  REVIEW_COMMENTS|      SENTENCES_LIST|       list|SENTENCE|SENT_NUMBER|
#+---------+-----------------+--------------------+-----------+--------+-----------+
#|        1|Hi. Sent1. Sent2.|[Hi., Sent1., Sen...|   [Hi., 1]|     Hi.|          1|
#|        1|Hi. Sent1. Sent2.|[Hi., Sent1., Sen...|[Sent1., 2]|  Sent1.|          2|
#|        1|Hi. Sent1. Sent2.|[Hi., Sent1., Sen...|[Sent2., 3]|  Sent2.|          3|
#|        2|        Yeah. Ok.|        [Yeah., Ok.]| [Yeah., 1]|   Yeah.|          1|
#|        2|        Yeah. Ok.|        [Yeah., Ok.]|   [Ok., 2]|     Ok.|          2|
#+---------+-----------------+--------------------+-----------+--------+-----------+
Otisotitis answered 9/6, 2020 at 16:11 Comment(3)
Thanks, your code worked but I ended up using: df.select('*', F.posexplode('SENTENCES_LIST').alias('SENT_NUMBER', 'SENTENCE')) which was more appropriateReynalda
@Reynalda true that is more appropriateOtisotitis
In retrospect this is overkill in comparison to posexplode. Thanks @MccarthyOtisotitis

© 2022 - 2024 — McMap. All rights reserved.