How to make MSCK REPAIR TABLE execute automatically in AWS Athena
I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to investigate it using Athena. I would also like to visualize it in QuickSight, connecting to Athena as a data source.

The problem is that after each run of my Spark batch, the newly generated data stored in S3 will not be discovered by Athena, unless I manually run the query MSCK REPAIR TABLE.

Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?

Jessalin answered 29/11, 2017 at 6:49 Comment(1)
@samuel_liew the question is not broad, it just provides some extra information for context around the question. It probably doesn't matter technically that the OP wants a fully automated data viz pipeline, but then context is important to allow people to provide guidance to solve the underlying challenge. That specific challenge is managing partitions in Athena, because they are distinct metadata objects that need to be created. That they aren't automatically created or discovered is quite unexpected, as evident by the number of upvotes on this question.Cohin

There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?

From any of these, you should be able to fire off the following CLI command.

$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
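The same call can also be made from Python inside whichever scheduler you use. A minimal sketch with illustrative names, separating statement building from the AWS call:

```python
def repair_sql(database, table):
    # Build the statement separately so it can be unit-tested
    # without AWS credentials.
    return f'MSCK REPAIR TABLE {database}.{table}'

def run_repair(database, table, output_location):
    # boto3 is imported lazily here so the SQL helper above stays
    # dependency-free.
    import boto3
    client = boto3.client('athena')
    response = client.start_query_execution(
        QueryString=repair_sql(database, table),
        ResultConfiguration={'OutputLocation': output_location},
    )
    # Athena executes asynchronously; poll get_query_execution with
    # this id if the scheduler needs to wait for completion.
    return response['QueryExecutionId']
```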

Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

An example Lambda function could be written as follows:

import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')

    # Query results and metadata are written to this location
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters; named query_context so it does not
    # shadow the Lambda `context` argument above
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)

You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
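If you would rather add only the partition for the object that triggered the function (instead of a full repair), the partition values can be parsed out of the S3 key carried in the event (event['Records'][0]['s3']['object']['key']). A sketch assuming the DATA/YEAR=…/MONTH=…/DATE=… layout from the question:

```python
import re

def partition_from_key(key):
    # Expects keys like 'DATA/YEAR=2017/MONTH=11/DATE=29/datafile';
    # returns (year, month, date), or None if the key doesn't match.
    m = re.match(r'DATA/YEAR=(\d+)/MONTH=(\d+)/DATE=(\d+)/', key)
    return tuple(int(g) for g in m.groups()) if m else None
```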

Ultimately, explicitly rebuilding the partitions after your Spark job runs, using a job scheduler, has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.

Ulu answered 29/11, 2017 at 14:12 Comment(5)
I think another possibility is to use boto within your Glue script. You should be able to use this to execute the relevant MSCK REPAIR TABLE commandLassie
I used a scheduled AWS Crawler to crawl the database to update the table. What's your opinion on this solution?Jessalin
That can work. The nice thing about using a Lambda function is that Lambda can respond dynamically to events, such as adding a file to an S3 in the case of Athena. The bad thing about Lambda functions is that they can be somewhat confusing to manage from a continuous integration and version control perspective.Ulu
Another bad thing about a Lambda in this scenario is that its execution must finish within 5 minutes, which is likely short for a REPAIR TABLE (but enough for an ADD PARTITION)Ebullient
any idea on this question #63150282Canon

You should be running ADD PARTITION instead:

aws athena start-query-execution --query-string "ALTER TABLE some_table ADD PARTITION..."

This adds the newly created partition from your S3 location. Athena leverages Hive for partitioning data. To create a table with partitions, you must define them in the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
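For the question's key layout, the full statement could be assembled like this (table and bucket names are placeholders; IF NOT EXISTS makes it safe to re-run after every batch):

```python
def add_partition_sql(table, year, month, date, bucket='some_bucket'):
    # Points the new partition at the matching S3 prefix from the
    # question's DATA/YEAR=?/MONTH=?/DATE=? layout.
    location = f's3://{bucket}/DATA/YEAR={year}/MONTH={month}/DATE={date}/'
    return (f'ALTER TABLE {table} ADD IF NOT EXISTS '
            f'PARTITION (YEAR={year}, MONTH={month}, DATE={date}) '
            f"LOCATION '{location}'")
```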

Maxiemaxilla answered 14/12, 2018 at 20:56 Comment(2)
Not sure this is possible if you're using Firehose to put data into an Athena bucket. You still need to specify the partitions, even if using 'dynamic' partitions though :-(Jonathonjonati
@RalphBolton When logging with Firehose you can also use partition projection. See my answer.Landseer

There are multiple ways to solve the issue and get the table updated:

  1. Call MSCK REPAIR TABLE. This will scan ALL data. It's costly as every file is read in full (at least it's fully charged by AWS). Also it's painfully slow. In short: Don't do it!

  2. Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, since yours is a nicely organised S3 key pattern). There are also downsides to this approach: A) It's hard to maintain. B) All partitions have to be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they all need to be read out and passed on to Athena's and EMR's Hadoop infrastructure.

  3. Use partition projection. There are two different styles you might want to evaluate. Here's the variant which creates the partitions for Hadoop at query time. This means no Glue catalog entries are sent over the network, so large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that don't exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, whether they exist on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.

CREATE EXTERNAL TABLE `mydb`.`mytable`
(
   ...
)
  PARTITIONED BY (
    `YEAR` int,
    `MONTH` int,
    `DATE` int)
  ...
  LOCATION
    's3://DATA/'
  TBLPROPERTIES(
      "projection.enabled" = "true",
      "projection.YEAR.type" = "integer",
      "projection.YEAR.range" = "2020,2025",
      "projection.MONTH.type" = "integer",
      "projection.MONTH.range" = "1,12",
      "projection.DATE.type" = "integer",
      "projection.DATE.range" = "1,31",
      "storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
  );

https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
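To underline the "always add partition filters" point above: with projection, an unfiltered query enumerates every partition in the configured ranges, so queries should constrain each projected key. An illustrative helper, using the table name from the DDL sketch above:

```python
def daily_query(year, month, date):
    # Constraining every projected key keeps the generated partition
    # list to a single entry instead of the full YEAR x MONTH x DATE
    # cartesian product.
    return ('SELECT * FROM mydb.mytable '
            f'WHERE YEAR = {year} AND MONTH = {month} AND DATE = {date}')
```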

  4. Just to list all options: you can also use Glue crawlers. But this doesn't seem to be a favourable approach, as it's not as flexible as advertised.

  5. You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table.

In short:

  • If your application is SQL-centric and you want the leanest approach with no scripts, use partition projection
  • If you have many partitions, use partition projection
  • If you have a few partitions or partitions do not have a generic pattern, use approach #2
  • If you're script-heavy, and scripts do most of the work anyway and are easier for you to handle, consider approach #5
  • If you're confused and have no clue where to start - try partition projection first! It should fit 95% of the use cases.
Landseer answered 21/4, 2021 at 7:12 Comment(9)
A word of caution about projection: read the Considerations and Limitations section on the AWS doc.Cory
RTFM is always a good idea. Can you elaborate on your particular concern (as I'm using that approach myself, I'd be highly interested in any pitfalls)? The official AWS docs for projection are already linked in my answer, by the way. Thanks!Landseer
Sure! Very nice answer btw. The part about "if too many partitions are empty." I just reread it and I might have read it wrongly... It means if there is nothing within the partition, not if the partition does not exist. I am using it with Firehose to S3Cory
The two projection mechanisms differ here. The example one creates all possible partitions, e.g. if you define only "projection.YEAR.range" = "2000,3000" and do NOT apply a filter to the query, the mechanism will create 1000 partitions (with multiple partition keys it will create a cartesian product if unfiltered). The partitions will be passed to the execution (to the cluster). Presto skips empty partitions, but you're falling into the same pitfall as with Glue: the data transfer is killing you. I found out (by accident) that creating tens of thousands of partitions will be very slow.Landseer
@Landseer I'm not sure about point 1, that Athena scans the whole data. I have a bucket where partition directories are like /dt=2021-05-31, /dt=2021-06-01, etc. When I run MSCK REPAIR PARTITION some_table, the output is: (Run time: 8.61 seconds, Data scanned: 0 KB) and only the new partitions are added.Sima
@Sima That's what I thought too at the beginning. Correct: only new partitions are created. But it does read data and is charged (believe me - I'm VERY sure about that, as it hit us unexpectedly). Also, runtime increases. Didn't you wonder why a file list in 2021 needs almost 9 secs to read? There's info on Presto stating that the files need to be opened. There's a certain mode/driver/fs layer patch/whatsoever for Presto which can get around the issue, but not in Athena 1 and 2. Use projection and you'll never look back.Landseer
Thanks @supernova. Going to try projections very soon.Sima
@Landseer I tried using partition projections, but the queries seem to take 2-4x longer using this method. You mention there are "two ways" to do partition projection, but I only see the one you explained, what is the other way?Rudiger
You can also use the Glue console docs.aws.amazon.com/athena/latest/ug/… You might be creating too many [empty] partitions. So using year with range 2020 to 2022 is 3 partitions, while year from 1900 to 2022 is 123 partitions - and all will be checked! This also applies to combinations, which can be problematic depending on how your data looks. Also make sure the partition keys are used in the query! Year=2022 is needed, as date>'2022-…' will not automatically use the partition! (Check data size read).Landseer

© 2022 - 2024 — McMap. All rights reserved.