AWS GLUE import xls/xlsx file
Asked Answered
I

3

5

I'm setting up a AWS GLUE job for my customers. Their files are excel with xls/xlsx extension and have multiple sheets and they don't want to do any convert job before uploading. How do I extract data from xls/xlsx file directly or can GLUE convert xls/xlsx file to csv file?

Ieshaieso answered 12/8, 2019 at 6:35 Comment(0)
B
8

xls/xlsx is not supported in GLUE as of now. The easiest way is to write a python shell job in glue to convert excel to csv and then run crawler over it. Below is sample in python

 import xlrd
 import csv

 def csv_from_excel():

    wb = xlrd.open_workbook('your_workbook.xls')
    sh = wb.sheet_by_name('Sheet1')
    your_csv_file = open('your_csv_file.csv', 'wb')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)

    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))

    your_csv_file.close()
Bawdyhouse answered 12/8, 2019 at 8:40 Comment(2)
how to import latest version of xlrd in glue?Apriorism
Why iterate row by row instead of just using pd.to_csv?Wetnurse
F
1

Excel read is possible in glue version 2 not in glue version 3.

Steps:

  1. Go to glue job and edit script with below code

code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()

job.commit()
  1. Save script

  2. Goto Action - Edit Job - Select Glue version2 and set key value under security configuration

    key : --additional-python-modules
    value : pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4

  3. Save and run the job

It will resolve your error and you will be able to read the excel file

Faun answered 18/5, 2022 at 7:46 Comment(0)
D
0

The xls/xlsx file type is not available at the moment. You need to write transformer script to convert the file from xls/xlsx to CSV.

Classifier API

Debit answered 12/8, 2019 at 7:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.