How to write try except for loading data
I'm pretty new to coding, so I apologize if this is a stupid question. I'm writing a Spark function that takes in a file path and file type and creates a dataframe. If the input is invalid, I want to just print some sort of error message and return an empty dataframe. Would I use try/except?

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except ____ as error:
        print(error)
        return ""  # I want to return an empty RDD here, but I can't figure out how to make one

How do I know what goes in the ____? I tried org.apache.spark.SparkException because that's the error I get when I pass in a .csv file as a parquet and it breaks but that isn't working

Denticulation answered 21/5, 2020 at 19:36 Comment(5)
Why is org.apache.spark.SparkException not working? What error do you get in the traceback? You could try a generic except Exception as error and see what actual error you get.Ful
@RafaelBarros it's saying that, but some other bad inputs I put in are giving different exceptions. Is there anything wrong with just using except Exception?Denticulation
well, it does hide the errors in your code and may be a problem when debugging. Check his comment on my answerMugwump
to create an empty df we would need to know the schema you expect.Ful
@RafaelBarros like the number of rows and columns? I'm sorry, I'm not entirely sure what you mean, but I do appreciate the helpDenticulation
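The suggestion in the comments (catch broadly once just to see what the concrete error is, then narrow the except clause) can be sketched without a Spark session; `exception_name` below is a made-up helper for illustration:

```python
# A quick way to discover what goes in the blank: catch everything once,
# report the concrete exception class, then use that class in the
# except clause of the real function.
def exception_name(action):
    try:
        action()
    except Exception as error:
        return type(error).__name__
    return None

print(exception_name(lambda: int("not a number")))  # ValueError
```

Once you see the printed class name, replace the broad `Exception` with that specific type.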
You can catch multiple exceptions in the try-except block; for instance:

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except (SparkException, TypeError) as error:
        print(error)
        return ""

You could replace or add errors to that tuple.
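The same tuple-of-exceptions pattern can be shown Spark-free with standard exceptions (`parse_port` is a made-up helper, not part of the question's code):

```python
# Catch only the specific failures you expect, listed as a tuple, and
# fall back to a default; any other exception type still propagates.
def parse_port(value, default=8080):
    try:
        return int(value)
    except (TypeError, ValueError) as error:
        print(error)
        return default
```

Here `parse_port("8042")` returns 8042, while `parse_port(None)` and `parse_port("abc")` print the error and return 8080; an unrelated exception would still bubble up to the caller.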

Catching a bare Exception will potentially silence errors that are unrelated to your code (like a networking issue if name is an S3 path). Those are probably errors you do not want your program to swallow.

Ful answered 21/5, 2020 at 19:49 Comment(4)
great point here to highlight the danger of just sticking with generic, non-specific exceptions!Archegonium
Where do you import SparkException from?Boorer
Since Spark 3.5 we can import PySparkException from pyspark.errors. In previous versions you will have to use the broad Exception.Maisiemaison
Within pytest we can also do with pytest.raises(Exception, match="SparkException"):Maisiemaison
Use Exception if you don't know what exception it might be:

def rdf(name, type):
    try:
        df = spark.read.format(type).load(name)
        return df
    except Exception as error:
        print(error)
        return ""

WARNING: This is not good practice as it could silence errors that would be useful during debugging and troubleshooting. (Thanks to @RafaelBarros)
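A tiny, Spark-free illustration of the warning above (the names are hypothetical): the typo `pth` raises a NameError, the broad handler swallows it, and the function quietly returns "" instead of surfacing the bug.

```python
def load_text(path):
    try:
        return open(pth).read()  # bug: `pth` is a typo for `path`
    except Exception as error:   # the NameError gets swallowed here
        print(error)
        return ""

# The call "succeeds" with an empty result, hiding the real defect.
result = load_text("data.csv")
```

With a narrower clause such as `except FileNotFoundError`, the NameError would have crashed immediately and pointed straight at the typo.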

Mugwump answered 21/5, 2020 at 19:41 Comment(2)
This is not good practice as it could silence errors that you want to catch. Silencing exceptions like this will make troubleshooting very complicated.Ful
yes, but @Denticulation asked what to put as the exception and this is the answer; I will put your warning in the answer :)Mugwump
If anyone else is just trying to read a file without crashing the script when it doesn't exist, this approach does not use exceptions, but it worked for me.

import fsspec

def readDfParquetMaybe(path, schema, filesystemType):
    fs = fsspec.filesystem(filesystemType)
    if fs.exists(path):
        rawDf = spark.read.parquet(path)
        return rawDf
    else:
        print("Could not find file: " + path)
        # An empty DataFrame still needs an explicit schema
        empty_df = spark.createDataFrame([], schema)
        return empty_df

See the fsspec documentation for more info and to figure out what filesystemType needs to be for your application.

Here is an example of how to call this function (from M$ Synapse).

from pyspark.sql.types import StructType, StructField, StringType, DateType

mySchema = StructType([
    StructField("RowKey", StringType()),
    StructField("Timestamp", DateType()),
    StructField("property1", StringType()),
    StructField("when", DateType())
])

myRawDf = readDfParquetMaybe('abfss://[email protected]/path/to/data', mySchema, 'abfss')

Please keep in mind the data load could still fail for other reasons.
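The same look-before-you-leap idea works with the standard library alone; a minimal sketch using os.path and plain open() in place of fsspec and spark.read (`read_maybe` is a made-up name):

```python
import os

def read_maybe(path, fallback=""):
    # Check existence first, but keep a try/except as well: the file
    # can disappear or become unreadable between the check and the read.
    if not os.path.exists(path):
        print("Could not find file: " + path)
        return fallback
    try:
        with open(path) as f:
            return f.read()
    except OSError as error:
        print(error)
        return fallback
```

The explicit fallback value plays the same role as the empty DataFrame built from the schema above.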

Jeannettajeannette answered 10/9 at 2:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.