I'm setting up a AWS GLUE job for my customers. Their files are excel with xls/xlsx extension and have multiple sheets and they don't want to do any convert job before uploading. How do I extract data from xls/xlsx file directly or can GLUE convert xls/xlsx file to csv file?
AWS GLUE import xls/xlsx file
Asked Answered
xls/xlsx is not supported in GLUE as of now. The easiest way is to write a python shell job in glue to convert excel to csv and then run crawler over it. Below is sample in python
import xlrd
import csv
def csv_from_excel():
wb = xlrd.open_workbook('your_workbook.xls')
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open('your_csv_file.csv', 'wb')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in xrange(sh.nrows):
wr.writerow(sh.row_values(rownum))
your_csv_file.close()
how to import latest version of xlrd in glue? –
Apriorism
Why iterate row by row instead of just using pd.to_csv? –
Wetnurse
Excel read is possible in glue version 2 not in glue version 3.
Steps:
- Go to glue job and edit script with below code
code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()
job.commit()
Save script
Goto Action - Edit Job - Select Glue version2 and set key value under security configuration
key : --additional-python-modules
value : pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4Save and run the job
It will resolve your error and you will be able to read the excel file
The xls/xlsx
file type is not available at the moment. You need to write transformer script to convert the file from xls/xlsx
to CSV.
© 2022 - 2024 — McMap. All rights reserved.