Luigi Pipeline beginning in S3
Asked Answered
A

1

15

My initial files are in AWS S3. Could someone point me how I need to setup this in a Luigi Task?

I reviewed the documentation and found luigi.S3 but is not clear for me what to do with that, then I searched in the web and only get links from mortar-luigi and implementation in top of luigi.

UPDATE

After following the example provided for @matagus (I created the ~/.boto file as suggested too):

# coding: utf-8

import luigi

from luigi.s3 import S3Target, S3Client

class MyS3File(luigi.ExternalTask):
    def output(self):
        return S3Target('s3://my-bucket/19170205.txt')

class ProcessS3File(luigi.Task):

    def requieres(self):
        return MyS3File()

    def output(self):
        return luigi.LocalTarget('/tmp/resultado.txt')

    def run(self):
        result = None

        for input in self.input():
           print("Doing something ...")
           with input.open('r') as f:
               for line in f:
                   result = 'This is a line'

        if result:
            out_file = self.output().open('w')
            out_file.write(result)

When I execute it nothing happens

DEBUG: Checking if ProcessS3File() is complete
INFO: Informed scheduler that task   ProcessS3File()   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) running   ProcessS3File()
INFO: [pid 21171] Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) done      ProcessS3File()
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   ProcessS3File()   has status   DONE
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=226574718, workers=1, host=heliodromus, username=nanounanue, pid=21171) was stopped. Shutting down Keep-Alive thread

As you can see, the message Doing something... never prints. What is wrong?

Anthropomorphous answered 25/10, 2015 at 16:28 Comment(3)
The error is in def requieres(self):. It must be requires.Dionysiac
Luigi checks for that method in order to get the input files, and since requires method does not exist, it returns an empty list.Dionysiac
You are absolutly right! I am such a durk! Thank you!Anthropomorphous
D
21

The key here is to define an External Task that has no inputs and which outputs are those files you already have in living in S3. Luigi docs mention this in Requiring another Task:

Note that requires() can not return a Target object. If you have a simple Target object that is created externally you can wrap it in a Task class

So, basically you end up with something like this:

import luigi

from luigi.s3 import S3Target

from somewhere import do_something_with


class MyS3File(luigi.ExternalTask):

    def output(self):
        return luigi.S3Target('s3://my-bucket/path/to/file')

class ProcessS3File(luigi.Task):

    def requires(self):
        return MyS3File()

    def output(self):
        return luigi.S3Target('s3://my-bucket/path/to/output-file')

    def run(self):
        result = None
        # this will return a file stream that reads the file from your aws s3 bucket
        with self.input().open('r') as f:
            result = do_something_with(f)

        # and the you 
        out_file = self.output().open('w')
        # it'd better to serialize this result before writing it to a file, but this is a pretty simple example
        out_file.write(result)

UPDATE:

Luigi uses boto to read files from and/or write them to AWS S3, so in order to make this code work, you'll need to provide your credentials in your boto config file ~/boto (look for other possible config file locations here):

[Credentials]
aws_access_key_id = <your_access_key_here>
aws_secret_access_key = <your_secret_key_here>
Dionysiac answered 26/10, 2015 at 17:8 Comment(9)
There are some problems with your code, please, Could you fix them? (e.g. the return in the first output method is wrong should be return S3Target(...Anthropomorphous
Another issue. In which part I should provide my aws credentials?Anthropomorphous
Done updating my answer. Hope it helps.Dionysiac
Should I use S3Client or something similar?Anthropomorphous
It's better if you use Luigi Targets. Tasks consume Targets that were created by some other task and they usually also output targets. And that way you can delegate to Luigi all retries and dependencies handling logic, which is why it was created.Dionysiac
I just open another question about an UnicodeDecodeError Do you have any idea?Anthropomorphous
Doesn't this solution read the entire content of the input file into memory before uploading it to S3? (or not?) If so, it isn't suitable for large datasets.Donoho
@Donoho Luigi uses multipart upload with 64Mb chunks by default but you may specify another size by passing part_size parameter to S3Target. So the answer is yes, this is suitable for large files.Dionysiac
@matagus, I am getting following error "FileNotFoundError: [Errno 2] No such file or directory: 's3://PATH_TO_S3BUCKET_WITH_FILENAME", but when reading it read properly from S3BUCKET, can you help me with this?Guanine

© 2022 - 2024 — McMap. All rights reserved.