I'm doing an NLP project and have reviews that contain multiple sentences. I'm using the spark-nlp package, which outputs a column containing a list of the sentences in each review. I'm using explode to create a row for each sentence, but I want to add numbering so I know which sentence was 1st, 2nd, etc. I don't know how to use row_number() because I don't really have anything to orderBy.
Here's what my data looks like:
REVIEW_ID  REVIEW_COMMENTS    SENTENCES_LIST
1          Hi. Sent1. Sent2.  [Hi., Sent1., Sent2.]
2          Yeah. Ok.          [Yeah., Ok.]
Here's what I want it to look like:
REVIEW_ID  REVIEW_COMMENTS    SENTENCES_LIST         SENTENCE  SENT_NUMBER
1          Hi. Sent1. Sent2.  [Hi., Sent1., Sent2.]  Hi.       1
1          Hi. Sent1. Sent2.  [Hi., Sent1., Sent2.]  Sent1.    2
1          Hi. Sent1. Sent2.  [Hi., Sent1., Sent2.]  Sent2.    3
2          Yeah. Ok.          [Yeah., Ok.]           Yeah.     1
2          Yeah. Ok.          [Yeah., Ok.]           Ok.       2
I'm using the code below and am not sure what to pass to row_number() as the orderBy, because I don't have a column to order by other than each sentence's position in SENTENCES_LIST.
import pyspark.sql.functions as F
from pyspark.sql import Window

df2 = df.withColumn('SENTENCE', F.explode('SENTENCES_LIST'))
df3 = df2.withColumn('SENT_NUMBER', F.row_number().over(Window.partitionBy('REVIEW_ID').orderBy('????')))
Comments:
pyspark.sql.functions.posexplode? Try: df2 = df.withColumn('SENTENCE', F.posexplode('SENTENCES_LIST').alias('SENT_NUMBER', 'SENTENCE')) – Mccarthy
withColumn can only add a single column; use select instead: df.select('*', F.posexplode('SENTENCES_LIST').alias('SENT_NUMBER', 'SENTENCE')) – Reynalda