AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or another AWS-internal source. Is that even possible? Has anyone done it? Please help!

Barner answered 13/1, 2020 at 9:55 Comment(0)
F
15

Yes, I extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they start faster (relatively small cold start). When the extraction finishes, it triggers a Spark job that reads only the JSON items I need. I use the requests Python library.
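One possible way to do the hand-off from the Python Shell job to the Spark job is to start the second job through the Glue API at the end of the extraction script. This is only a sketch: the argument name `--input_path` and the function names are made up for illustration, and a Glue trigger or workflow could do the same job without any code.

```python
def build_job_arguments(bucket: str, key: str) -> dict:
    """Build the arguments for the Spark job.

    Glue job arguments are plain strings whose names start with '--'.
    '--input_path' is a hypothetical name chosen for this sketch.
    """
    return {"--input_path": f"s3://{bucket}/{key}"}


def start_spark_job(job_name: str, bucket: str, key: str) -> str:
    """Start the downstream Glue job and return its run id."""
    import boto3  # imported lazily so build_job_arguments works without boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        JobName=job_name,
        Arguments=build_job_arguments(bucket, key),
    )
    return response["JobRunId"]
```

Calling `start_spark_job("my-spark-job", "my-tweets", "tweets.json")` as the last line of the extraction script would queue the Spark job once the file is in S3 (the job name here is a placeholder).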

To save the data to S3, you can do something like this:

import boto3
import json

# Initialize the S3 resource
s3 = boto3.resource('s3')

tweets = []
# Code that extracts tweets from the API and appends them to `tweets` goes here
tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
# Upload the serialized JSON string
obj.put(Body=tweets_json)
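The extraction step that the snippet leaves out could be sketched as follows. The endpoint `https://api.example.com/items`, the `page` query parameter, and the "stop on an empty page" rule are all assumptions for illustration; a real API will have its own paging scheme. The output is written as JSON Lines (one object per line), which is the layout Spark's JSON reader expects by default.

```python
import json
from typing import Callable, Iterable, Iterator


def fetch_all_pages(get_page: Callable[[int], list], max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page until the API returns an empty page."""
    for page in range(1, max_pages + 1):
        items = get_page(page)
        if not items:
            break
        yield from items


def to_json_lines(items: Iterable[dict]) -> str:
    """Serialize items as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(item) for item in items)


def get_page_from_api(page: int) -> list:
    """Fetch one page from a hypothetical paginated endpoint."""
    import requests  # third-party; imported lazily

    response = requests.get(
        "https://api.example.com/items",  # placeholder URL
        params={"page": page},            # placeholder paging parameter
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def upload_to_s3(body: str, bucket: str, key: str) -> None:
    """Write the serialized data to S3."""
    import boto3

    boto3.resource("s3").Object(bucket, key).put(Body=body.encode("utf-8"))
```

A run would then be `upload_to_s3(to_json_lines(fetch_all_pages(get_page_from_api)), "my-tweets", "tweets.jsonl")`.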
Feeding answered 13/1, 2020 at 18:21 Comment(0)
D
4

The AWS Glue Python Shell executor has a limit of 1 DPU max. If that's an issue, like in my case, a solution could be running the script in ECS as a task.

You can run about 150 requests/second using libraries like asyncio and aiohttp in Python (see example 1 and example 2).
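A minimal sketch of that pattern, assuming the API returns JSON and that a fixed cap on in-flight requests is enough. The function names and the `max_concurrent` default are made up for this example; the throughput you actually get depends on the API and the network.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    urls: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 10,
) -> List[dict]:
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> dict:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fetch_with_aiohttp(urls: List[str], max_concurrent: int = 10) -> List[dict]:
    """Fetch all URLs over one shared aiohttp session."""
    import aiohttp  # third-party; imported lazily

    async with aiohttp.ClientSession() as session:

        async def fetch(url: str) -> dict:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.json()

        return await gather_limited(urls, fetch, max_concurrent)
```

From a script, `asyncio.run(fetch_with_aiohttp(urls))` runs the whole batch.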

Then you can distribute your requests across multiple ECS tasks or Kubernetes pods using Ray. Here you can find a few examples of what Ray can do for you.
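A rough sketch of fanning the work out with Ray, assuming a Ray cluster is already running on those tasks or pods. The batching helper, the batch size, and `fetch_batch` (a function you supply that takes a list of URLs and returns a list of results) are assumptions for illustration.

```python
from typing import Callable, List


def chunk(items: list, size: int) -> List[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_batches_with_ray(
    urls: List[str],
    batch_size: int,
    fetch_batch: Callable[[List[str]], list],
) -> list:
    """Send each batch of URLs to a Ray worker and flatten the results."""
    import ray  # third-party; imported lazily

    # With no address given, this starts a local Ray instance;
    # pass address="auto" to join an existing cluster instead.
    ray.init(ignore_reinit_error=True)
    remote_fetch = ray.remote(fetch_batch)
    futures = [remote_fetch.remote(batch) for batch in chunk(urls, batch_size)]
    return [item for batch in ray.get(futures) for item in batch]
```

Each worker can run an asyncio loop internally, so the per-process concurrency and the cross-process distribution stack on top of each other.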

This also allows you to cater for APIs with rate limiting.
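For rate-limited APIs, one common approach is to retry with exponential backoff when the server answers HTTP 429. The sketch below is not tied to any particular HTTP library: `RateLimited` is a hypothetical exception that your own request function would raise on a 429, optionally carrying the value of the `Retry-After` header in seconds.

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Raised by the caller's request function when the API answers HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_backoff(
    request: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call request(), waiting and retrying while it raises RateLimited."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Prefer the server's hint; otherwise double the delay each time
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.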

Once you've gathered all the data you need, run it through AWS Glue.

Demonstrative answered 31/7, 2020 at 9:59 Comment(0)
C
3

Yes, it is possible. You can use AWS Glue to extract data from REST APIs. Although Glue has no direct connector to the public internet, you can set up a VPC with a public and a private subnet. In the private subnet, create an ENI that allows only outbound connections so that Glue can fetch data from the API. In the public subnet, install a NAT gateway.

Additionally, you might also need to set up a security group to limit inbound connections. Hope this answers your question.
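To attach a Glue job to that private subnet, one option is a Glue connection of type NETWORK, which the job then references. The sketch below assumes the VPC, subnet, NAT gateway, and security group already exist; the function names and all IDs you pass in are placeholders.

```python
from typing import Iterable


def network_connection_input(
    name: str,
    subnet_id: str,
    security_group_ids: Iterable[str],
    availability_zone: str,
) -> dict:
    """Build the ConnectionInput for a Glue NETWORK connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},  # no extra properties in this sketch
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,  # the private subnet routed via the NAT gateway
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input: dict) -> None:
    """Register the connection with Glue."""
    import boto3  # imported lazily

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

The connection name is then added to the job's list of connections so that the job's network interfaces are created in that subnet.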

Chaffee answered 27/12, 2020 at 22:52 Comment(0)
A
1

A new option since the original answer was accepted is to not use Glue at all but to build a custom connector for Amazon AppFlow.

I would argue that AppFlow is the AWS tool most suited to data transfer between API-based data sources, while Glue is more intended for ODP-based discovery of data already in AWS.

Antonietta answered 28/11, 2022 at 5:42 Comment(0)
