How to read whole file in one string

I want to read a JSON or XML file in PySpark. My file is split across multiple lines, and I read it with

rdd = sc.textFile(json_or_xml_path)

Input

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna"
    }
  ]
}

Input is spread across multiple lines.

Expected output: {"employees":[{"firstName":"John",...}]}

How can I get the complete file as a single string using PySpark?

Celanese answered 25/5, 2015 at 20:0 Comment(8)
The whitespace doesn't matter, really; it's only there for display purposes. JSON with line breaks/indentation is still JSON... – Cocainism
How do I append everything into one single string? – Celanese
How do I append everything into one line (string) by removing the whitespace? – Celanese
Do you want the entire RDD in one string, or do you want everything of a single record together? – Mohammedanism
I want everything of a single record together. – Celanese
Does your input file contain more than one record? – Potpie
Yes, I have a multi-line file that I want to merge into a single line. – Celanese
Unclear why you'd want this if Spark has a built-in JSON parser. – Potentiate

There are three ways (the first two are standard built-in Spark functions; I invented the third). The solutions here are in PySpark:

textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line from that file; this is kind of a mix between the two standard ways to parse files).

1.) textFile

input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

output: an array containing one line of the file per entry, i.e. [line1, line2, ...]
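
If all you need is the whole file as one string, a minimal sketch building on this (assuming the file fits in driver memory) is to collect the lines and join them:

whole_file = ' '.join(sc.textFile('/home/folder_with_text_files/input_file').collect())  # one string, lines joined by spaces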

2.) wholeTextFiles

input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

output: an array of tuples; the first item is the "key" with the file path, the second item contains one file's entire contents, i.e.

[(u'file:/home/folder_with_text_files/file1', u'file1_contents'), (u'file:/home/folder_with_text_files/file2', u'file2_contents'), ...]
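
To pull a single file's contents out of that as one string, a minimal sketch (assuming the file fits in driver memory):

contents = sc.wholeTextFiles('/home/folder_with_text_files/*').values().first()  # body of the first matched file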

3.) "Labeled" textFile

input:

import glob
from pyspark import SparkContext

# sc.stop()  # stop any pre-existing SparkContext first
sc = SparkContext("local", "example")  # if running locally

data_dir = '/home/folder_with_text_files'  # directory holding the input files
Spark_Full = sc.emptyRDD()

for filename in glob.glob(data_dir + "/*"):
    # key every line by its source file; binding filename as a default
    # argument pins each lazily evaluated lambda to its own file
    Spark_Full += sc.textFile(filename).keyBy(lambda line, fn=filename: fn)

output: an array in which each entry is a tuple using the filename as key, with value = each line of the file. (Technically, using this method you can also use a different key besides the actual file path, perhaps a hashed representation to save memory.) i.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]

You can also recombine the lines of each file into a list:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file path):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()

Weigand answered 6/10, 2016 at 17:52 Comment(1)
Since you're joining lines, you should use '\n'.join(list(x[1])), but what is Spark_Full? A list? – Potentiate

If your data is not formed with one record per line as textFile expects, then use wholeTextFiles.

This will give you the whole file so that you can parse it down into whatever format you would like.
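
For example, a minimal sketch of that parse step for the asker's JSON (hypothetical path; using Python's standard json module):

import json

# read (path, contents) pairs, keep the contents, parse each file as JSON
parsed = sc.wholeTextFiles('/home/folder_with_text_files/employees.json').values().map(json.loads)
parsed.first()['employees'][0]['firstName']  # 'John'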

Stoller answered 25/5, 2015 at 20:17 Comment(0)

This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))
Phocomelia answered 12/12, 2016 at 19:50 Comment(0)

"How to read whole [HDFS] file in one string [in Spark, to use as sql]":

e.g.

// Put file to hdfs from edge-node's shell...

hdfs dfs -put <filename>

// Within spark-shell...

// 1. Load file as one string
val f = sc.wholeTextFiles("hdfs:///user/<username>/<filename>")
val hql = f.take(1)(0)._2

// 2. Use string as sql/hql
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = hiveContext.sql(hql)
Spasmodic answered 11/4, 2017 at 16:17 Comment(0)

Python way

rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json_str = rdd.collect()[0][1]  # first file's contents; named json_str to avoid shadowing the json module
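
From there, a sketch of producing the compact single-line string the question asked for (assuming Python's standard json module):

import json

# parse, then re-serialize without whitespace
single_line = json.dumps(json.loads(json_str), separators=(',', ':'))
# '{"employees":[{"firstName":"John","lastName":"Doe"},{"firstName":"Anna"}]}'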
Ramer answered 8/10, 2020 at 8:46 Comment(0)

According to https://spark.apache.org/docs/latest/sql-data-sources-text.html, you can read with:

text_df = spark.read.text("your_path", wholetext=True)
text = text_df.first().value
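
And as a commenter on the question noted, if the end goal is parsed data rather than the raw string, Spark's built-in JSON reader handles multi-line files directly; a minimal sketch (hypothetical path):

df = spark.read.json("your_path", multiLine=True)  # parse multi-line JSON into a DataFrame
df.select("employees").show(truncate=False)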
Noisome answered 25/3 at 10:49 Comment(0)
