How to write try except for loading data
I'm pretty new to coding, so I apologize if this is a stupid question. I'm writing a Spark function that takes in a file path and file type and creates a dataframe. If the input is invalid, I want to just print some sort of error message and return an empty dataframe. Would I use try/except?

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except ____ as error:
        print(error)
        return ""  # I want to return an empty RDD here, but I can't figure out how to make one

How do I know what goes in the ____? I tried org.apache.spark.SparkException because that's the error I get when I pass in a .csv file as a parquet and it breaks but that isn't working

Denticulation answered 21/5, 2020 at 19:36 Comment(5)
Why is org.apache.spark.SparkException not working? What error do you get in the traceback? You could try a generic except Exception as error and see what actual error you get.Ful
@RafaelBarros it's saying that, but some other bad inputs I put in are giving different exceptions. Is there anything wrong with just using except Exception?Denticulation
well, it does hide the errors in your code and may be a problem when debugging. Check his comment on my answerMugwump
to create an empty df we would need to know the schema you expect.Ful
@RafaelBarros like the number of rows and columns? I'm sorry, I'm not entirely sure what you mean, but I do appreciate the helpDenticulation
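The suggestion in the comments (catch broadly once just to see what the concrete error is, then narrow the except clause) can be sketched without a Spark session; `exception_name` below is a made-up helper for illustration:

```python
# A quick way to discover what goes in the blank: catch everything once,
# report the concrete exception class, then use that class in the
# except clause of the real function.
def exception_name(action):
    try:
        action()
    except Exception as error:
        return type(error).__name__
    return None

print(exception_name(lambda: int("not a number")))  # ValueError
```

Once you see the printed class name, replace the broad `Exception` with that specific type.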
You can catch multiple exceptions in the try-except block; for instance:

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except (SparkException, TypeError) as error:
        print(error)
        return ""

You could replace or add errors to that tuple.
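The same tuple-of-exceptions pattern can be shown Spark-free with standard exceptions (`parse_port` is a made-up helper, not part of the question's code):

```python
# Catch only the specific failures you expect, listed as a tuple, and
# fall back to a default; any other exception type still propagates.
def parse_port(value, default=8080):
    try:
        return int(value)
    except (TypeError, ValueError) as error:
        print(error)
        return default
```

Here `parse_port("8042")` returns 8042, while `parse_port(None)` and `parse_port("abc")` print the error and return 8080; an unrelated exception would still bubble up to the caller.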

Catching a bare Exception will potentially silence errors that are unrelated to your code (like a networking issue if name is an S3 path). Those are probably errors you do not want your program to swallow.

Ful answered 21/5, 2020 at 19:49 Comment(4)
great point here to highlight the danger of just sticking with generic, non-specific exceptions!Archegonium
Where do you import SparkException from?Boorer
Since Spark 3.5 we can import PySparkException from pyspark.errors. In previous versions you will have to use the broad Exception.Maisiemaison
Within pytest we can also do with pytest.raises(Exception, match="SparkException"):Maisiemaison
Use Exception if you don't know what exception it might be:

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except Exception as error:
        print(error)
        return ""

WARNING: This is not good practice as it could silence errors that would be useful during debugging and troubleshooting. (Thanks to @RafaelBarros)
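A tiny, Spark-free illustration of the warning above (the names are hypothetical): the typo `pth` raises a NameError, the broad handler swallows it, and the function quietly returns "" instead of surfacing the bug.

```python
def load_text(path):
    try:
        return open(pth).read()  # bug: `pth` is a typo for `path`
    except Exception as error:   # the NameError gets swallowed here
        print(error)
        return ""

# The call "succeeds" with an empty result, hiding the real defect.
result = load_text("data.csv")
```

With a narrower clause such as `except FileNotFoundError`, the NameError would have crashed immediately and pointed straight at the typo.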

Mugwump answered 21/5, 2020 at 19:41 Comment(2)
This is not good practice as it could silence errors that you want to catch. Silencing exceptions like this will make troubleshooting very complicated.Ful
yes, but @Denticulation asked what to put as the exception and this is the answer; I will put your warning in the answer :)Mugwump
If anyone else is just trying to read a file without crashing the script when it doesn't exist, this approach does not use exceptions, but it worked for me.

import fsspec

def readDfParquetMaybe(path, schema, filesystemType):
    fs = fsspec.filesystem(filesystemType)
    if fs.exists(path):
        rawDf = spark.read.parquet(path)
        return rawDf
    else:
        print("Could not find file: " + path)
        # An empty DataFrame still needs an explicit schema
        empty_df = spark.createDataFrame([], schema)
        return empty_df

See the fsspec documentation for more info and to figure out what filesystemType needs to be for your application.

Here is an example of how to call this function (from M$ Synapse).

from pyspark.sql.types import StructType, StructField, StringType, DateType

mySchema = StructType([
    StructField("RowKey", StringType()),
    StructField("Timestamp", DateType()),
    StructField("property1", StringType()),
    StructField("when", DateType())
])

myRawDf = readDfParquetMaybe('abfss://[email protected]/path/to/data', mySchema, 'abfss')

Please keep in mind the data load could still fail for other reasons.
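The same look-before-you-leap idea works with the standard library alone; a minimal sketch using os.path and plain open() in place of fsspec and spark.read (`read_maybe` is a made-up name):

```python
import os

def read_maybe(path, fallback=""):
    # Check existence first, but keep a try/except as well: the file
    # can disappear or become unreadable between the check and the read.
    if not os.path.exists(path):
        print("Could not find file: " + path)
        return fallback
    try:
        with open(path) as f:
            return f.read()
    except OSError as error:
        print(error)
        return fallback
```

The explicit fallback value plays the same role as the empty DataFrame built from the schema above.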

Jeannettajeannette answered 10/9 at 2:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.