pyspark parse fixed width text file

Trying to parse a fixed width text file.

My text file looks like the following, and I need a row id, a date, a string, and an integer:

00101292017you1234
00201302017 me5678

I can read the text file into an RDD with sc.textFile(path), and I can call createDataFrame with a parsed RDD and a schema. It's the parsing in between those two steps that I'm stuck on.
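
For reference, a minimal sketch of that middle step, slicing each line by character position in an RDD map and then building the DataFrame with an explicit schema (the positions are taken from the sample lines above):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("date", StringType(), True),
    StructField("string", StringType(), True),
    StructField("integer", IntegerType(), True),
])

# slice each fixed-width line by position (0-based Python slicing)
rdd = sc.textFile(path).map(
    lambda line: (line[0:3], line[3:11], line[11:14], int(line[14:18])))

df = spark.createDataFrame(rdd, schema)  # or sqlContext.createDataFrame on Spark 1.x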

Forbiddance answered 30/1, 2017 at 20:19 Comment(0)

Spark's substr function can handle fixed-width columns, for example:

df = spark.read.text("/tmp/sample.txt")
df.select(
    # substr(startPos, length) uses 1-based positions
    df.value.substr(1, 3).alias('id'),
    df.value.substr(4, 8).alias('date'),
    df.value.substr(12, 3).alias('string'),
    df.value.substr(15, 4).cast('integer').alias('integer')
).show()

will result in:

+---+--------+------+-------+
| id|    date|string|integer|
+---+--------+------+-------+
|001|01292017|   you|   1234|
|002|01302017|    me|   5678|
+---+--------+------+-------+

Once you have split the columns out, you can reformat and use them as in a normal Spark DataFrame.
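
For instance, the date slice can be turned into a proper DateType column with to_date (a sketch, assuming the MMddyyyy layout from the sample data; to_date with a format argument needs Spark 2.2+):

from pyspark.sql.functions import to_date

df = spark.read.text("/tmp/sample.txt")
df.select(
    df.value.substr(1, 3).alias('id'),
    # parse the 8-character slice as a real date
    to_date(df.value.substr(4, 8), 'MMddyyyy').alias('date'),
    df.value.substr(12, 3).alias('string'),
    df.value.substr(15, 4).cast('integer').alias('integer')
).show()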

Pantheas answered 30/1, 2017 at 20:42 Comment(3)
I had to use sqlContext instead: df = sqlContext.read.text("blah.txt"). Then it worked. I think I need to learn a little bit about contexts, but otherwise you have answered the question. – Forbiddance
spark refers to the SparkSession in versions >= 2.0. If you use Spark 1.6 or lower you need sqlContext instead, but in terms of accessing data it behaves pretty much the same; see the sketch after these comments. – Pantheas
Is there a way to get the values for column name and width from a dictionary? I would like to use a dictionary to programmatically map my columns. – Malenamalet
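
For what it's worth, a minimal sketch of the Spark 2.x entry point mentioned above (the app name is arbitrary):

from pyspark.sql import SparkSession

# SparkSession (Spark >= 2.0) wraps the older sqlContext entry point
spark = SparkSession.builder.appName("fixed-width").getOrCreate()
df = spark.read.text("blah.txt")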

Someone asked how to do this based on a schema. Building on the answers above, here is a simple example:

# sample fixed-width data (note the leading padding on each line)
x = '''    1 123121234 joe
    2 234234234jill
    3 345345345jane
    4abcde12345jack'''

# each entry is (column name, 1-based start position, length)
schema = [
    ("id", 1, 5),
    ("ssn", 6, 10),
    ("name", 16, 4)
]

with open("personfixed.csv", "w") as f:
    f.write(x)

df = spark.read.text("personfixed.csv")
df.show()

# add one column per schema entry; substr uses 1-based positions
df2 = df
for colinfo in schema:
    df2 = df2.withColumn(colinfo[0], df2.value.substr(colinfo[1], colinfo[2]))

df2.show()

Here is the output:

+-------------------+
|              value|
+-------------------+
|    1 123121234 joe|
|    2 234234234jill|
|    3 345345345jane|
|    4abcde12345jack|
+-------------------+

+-------------------+-----+----------+----+
|              value|   id|       ssn|name|
+-------------------+-----+----------+----+
|    1 123121234 joe|    1| 123121234| joe|
|    2 234234234jill|    2| 234234234|jill|
|    3 345345345jane|    3| 345345345|jane|
|    4abcde12345jack|    4|abcde12345|jack|
+-------------------+-----+----------+----+
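
A small follow-up sketch: the sliced columns keep their fixed-width padding (visible in the output above), so a trim pass over the same schema list strips it:

from pyspark.sql.functions import trim

# re-slice, trimming the padding spaces from each column
df3 = df
for name, start, length in schema:
    df3 = df3.withColumn(name, trim(df3.value.substr(start, length)))

df3.drop("value").show()
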
Voiceless answered 17/6, 2021 at 12:28 Comment(0)

Here is a one-liner for you:

from pyspark.sql.functions import trim

df = spark.read.text("/folder/file.txt")

df.select(*map(lambda x: trim(df.value.substr(col_idx[x]['idx'], col_idx[x]['len'])).alias(x), col_idx)).show()

where col_idx is something like this:

col_idx = {'col1': {'idx': 1, 'len': 2}, 'col2': {'idx': 3, 'len': 1}}

It's practical when you have a lot of columns, and a single select is also more efficient than chaining multiple withColumn calls (see "The hidden cost of Spark withColumn").

Delozier answered 14/10, 2022 at 20:26 Comment(0)

df = spark.read.text("fixedwidth")

df.withColumn("id", df.value.substr(1, 5)) \
  .withColumn("name", df.value.substr(6, 11)) \
  .drop("value") \
  .show()

The result is:

+-----+------+
|   id|  name|
+-----+------+
|23465|ramasg|
|54334|hjsgfd|
|87687|dgftre|
|45365|ghfduh|
+-----+------+
Ricky answered 12/5, 2021 at 20:33 Comment(0)
