AWS Athena Returning Zero Records from Tables Created by Glue Crawler from CSV Input in S3

Part One:

I ran a Glue Crawler on a dummy CSV loaded in S3. It created a table, but when I view the table in Athena and query it, it shows "Zero records returned."

But the ELB demo data in Athena works fine.

Part Two (Scenario):

Suppose I have an Excel file and a data dictionary describing how, and in what format, the data is stored in that file, and I want that data to be loaded into AWS Redshift. What would be the best way to achieve this?

Juni answered 13/11, 2017 at 14:41 Comment(1)
Update the question with some sample data from the CSV file that was read by the crawler, and also mention the schema structure. Don't just plainly mention "Zero records returned." – Konstanz

I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried feeding the folder name to the crawler and it worked. Hope this helps. Let me know. Thanks.
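
A minimal sketch of that setup with boto3, assuming placeholder names for the bucket, IAM role, and Glue database (none of these come from the original post); the key point is that the crawler target is the folder prefix, not the object key:

import boto3

glue = boto3.client("glue")

# Hypothetical names; substitute your own role, database, and bucket.
glue.create_crawler(
    Name="csv-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="my_catalog_db",
    Targets={
        "S3Targets": [
            # Point at the folder (prefix), e.g. "s3://my-bucket/data/orders/",
            # not at the file "s3://my-bucket/data/orders/orders.csv".
            {"Path": "s3://my-bucket/data/orders/"}
        ]
    },
)
glue.start_crawler(Name="csv-crawler")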

Reprobation answered 26/12, 2017 at 1:33 Comment(2)
This worked for me. I experienced the same issue. However, it worked when I put each dataset in a folder of its own and pointed the crawler at the parent folder. Something I've noticed is that the names of the tables created this way did not contain the file format, while my previous attempts did. This can be used as a visual identifier to check whether the tables have been read correctly. – Drench
This works only if a single file (e.g. one CSV) is in the bucket. – Advection

I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue Crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.

Eudocia answered 27/12, 2017 at 6:11 Comment(0)

Delete the crawler and create it again (only one CSV file should be available in S3, no more), then run the crawler. Important note: with a single CSV file, after running the crawler, we can view the records in Athena.

Zomba answered 13/12, 2017 at 4:40 Comment(0)

I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").

It turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with an underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
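
Athena (like Hive) treats objects whose names start with an underscore or a dot as hidden, so one workaround is to copy such objects to keys without the leading underscore. A minimal sketch with boto3, using placeholder bucket and prefix names:

import boto3

s3 = boto3.client("s3")
bucket = "my-log-bucket"   # hypothetical bucket
prefix = "nginx-logs/"     # hypothetical prefix the crawler points at

# Copy every object whose file name starts with "_" to a key without the
# leading underscore, then delete the original object.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        filename = key.rsplit("/", 1)[-1]
        if filename.startswith("_"):
            new_key = key[: len(key) - len(filename)] + filename.lstrip("_")
            s3.copy_object(Bucket=bucket, Key=new_key,
                           CopySource={"Bucket": bucket, "Key": key})
            s3.delete_object(Bucket=bucket, Key=key)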

Jampack answered 15/8, 2018 at 16:10 Comment(1)
Yep, this is the correct answer. The Glue Crawler could find the data, but Athena does not work when a file name starts with _. – Thom

Here's what worked for me: I needed to move each of my CSVs into its own folder; just pointing the Glue Crawler at the parent folder ('csv/' for me) was not enough.

csv/allergies.csv -> fails

csv/allergies/allergies.csv -> succeeds

Then I just pointed the AWS Glue Crawler at csv/ and everything was parsed out well.
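
A minimal sketch of that restructuring with boto3, assuming a placeholder bucket name; it moves each csv/<name>.csv into its own csv/<name>/ folder:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # hypothetical bucket

# List only the objects sitting directly under csv/ (the Delimiter keeps
# us out of any per-file folders that already exist).
resp = s3.list_objects_v2(Bucket=bucket, Prefix="csv/", Delimiter="/")
for obj in resp.get("Contents", []):
    key = obj["Key"]                        # e.g. "csv/allergies.csv"
    if not key.endswith(".csv"):
        continue
    name = key.rsplit("/", 1)[-1][:-4]      # "allergies"
    new_key = f"csv/{name}/{name}.csv"      # "csv/allergies/allergies.csv"
    s3.copy_object(Bucket=bucket, Key=new_key,
                   CopySource={"Bucket": bucket, "Key": key})
    s3.delete_object(Bucket=bucket, Key=key)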

Breeching answered 2/2, 2023 at 16:56 Comment(0)

The structure of the S3 bucket/folder is very important:

s3://<bucketname>/<data-folder>/
    /<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    /<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
    ...
    /<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>

and specify in the "include path" of the Glue Crawler:

s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Circumfluent answered 7/9, 2020 at 8:33 Comment(0)

Solution: select the folder path even if the folder contains many files. This will generate one table, and the data will be displayed.

Revolving answered 30/1, 2020 at 17:33 Comment(0)

In many such cases, using an exclude pattern in the Glue Crawler helps me.

It is true that instead of pointing the crawler directly at the file, we should point it at the directory; but when even doing so returns no records, the exclude pattern comes to the rescue. You will have to devise a pattern by which only the files you want get crawled and the rest are excluded. (I suggest doing this instead of creating a different directory for each file; most of the time, in a production bucket, making such changes is not feasible.)

I had data in an S3 bucket. There were multiple directories, and inside each directory there were a Snappy-compressed Parquet file and a JSON file. The JSON file was causing the issue.

So I ran the crawler on the master directory that contained the many subdirectories, and as the exclude pattern I gave */*.json.

This time it did not create any table for the JSON files, and I was able to see the records of the table using Athena.

For reference: https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
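
A minimal sketch of that configuration with boto3, assuming a placeholder crawler name and bucket; the exclude glob is set per S3 target:

import boto3

glue = boto3.client("glue")

# Hypothetical existing crawler name; the Targets below describe the
# crawler's full target list.
glue.update_crawler(
    Name="parquet-crawler",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/master-dir/",
                # Skip the JSON files sitting next to the Parquet files.
                "Exclusions": ["*/*.json"],
            }
        ]
    },
)
glue.start_crawler(Name="parquet-crawler")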

Cardiology answered 17/6, 2020 at 6:16 Comment(0)

Pointing the Glue Crawler at the S3 folder and not the actual file did the trick.

Sabadell answered 8/4, 2021 at 20:14 Comment(0)

I was having the same issue. My crawler was pointed at the root bucket path, where there were many directories with very different structures. What worked for me was adding an additional data source location with a more specific path for my new data.

Cobham answered 7/5, 2024 at 19:59 Comment(0)
