AWS Glue: Crawler does not recognize Timestamp columns in CSV format
When I run the AWS Glue crawler, it does not recognize timestamp columns.

My CSV file contains correctly formatted ISO 8601 timestamps. I expected Glue to classify these as timestamps automatically, but it does not.

I also tried a custom timestamp classifier, following https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html

Here is what my classifier looks like:

[screenshot: grok classifier configuration]

This also does not correctly classify my timestamps.

I have put my data into the Grok Debugger (https://grokdebug.herokuapp.com/), for example:

id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056

and it matches both of these patterns:

%{TIMESTAMP_ISO8601:timestamp}

%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
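For a local check, the ISO 8601 grok pattern can be approximated with a plain regular expression (a rough Python equivalent of TIMESTAMP_ISO8601, not Glue's exact internal definition):

```python
import re

# Rough equivalent of grok's TIMESTAMP_ISO8601 pattern — an approximation
# for local testing, not Glue's internal definition.
ISO8601 = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:?\d{2}(?::?\d{2}(?:\.\d+)?)?"
    r"(?:Z|[+-]\d{2}:?\d{2})?"
)

print(bool(ISO8601.fullmatch("2019-05-16T22:47:33.409056")))  # True
print(bool(ISO8601.fullmatch("not a timestamp")))             # False
```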

The test file was generated with:

import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')

    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])

    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])
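For completeness, the values this script emits do round-trip through Python's own ISO 8601 parser, so the data itself is well formed (a small in-memory check reusing the same generation logic):

```python
import csv
import io
from datetime import datetime, timedelta

# Re-generate a few rows in memory and parse them back with
# datetime.fromisoformat, which accepts exactly what isoformat() produces.
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])
for i in range(3):
    w.writerow([i, datetime.utcnow().isoformat(),
                (datetime.utcnow() - timedelta(days=1)).isoformat()])

buf.seek(0)
rows = list(csv.DictReader(buf))
parsed = [datetime.fromisoformat(r["iso_8601_now"]) for r in rows]
print(len(parsed))  # 3
```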

I expect AWS Glue to automatically classify the iso_8601 columns as timestamps. Even with the custom grok classifier added, it still does not classify either of the columns as timestamp.

Both columns are classified as strings.

The classifier is active on the crawler:

[screenshot: crawler classifier active]

Output of the timestamp_test table produced by the crawler:

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "bigint",
                    "comment": ""
                },
                {
                    "name": "iso_8601_now",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "iso_8601_yesterday",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://REDACTED/_csv_timestamp_test/",
        "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "parameters": {
                "field.delim": ","
            }
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "skip.header.line.count": "1",
            "sizeKey": "58926",
            "objectCount": "1",
            "UPDATED_BY_CRAWLER": "REDACTED",
            "CrawlerSchemaSerializerVersion": "1.0",
            "recordCount": "1227",
            "averageRecordSize": "48",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "compressionType": "none",
            "classification": "csv",
            "columnsOrdered": "true",
            "areColumnsQuoted": "false",
            "delimiter": ",",
            "typeOfData": "file"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "skip.header.line.count": "1",
        "sizeKey": "58926",
        "objectCount": "1",
        "UPDATED_BY_CRAWLER": "REDACTED",
        "CrawlerSchemaSerializerVersion": "1.0",
        "recordCount": "1227",
        "averageRecordSize": "48",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "false",
        "delimiter": ",",
        "typeOfData": "file"
    }
}
Taipan answered 16/5, 2019 at 23:12

Did you get this working? – Tacitus
Sorry for the late reply, but I can't remember if I ever got this working since I moved on to a new job. I think I handled it by parsing everything back to dates in a downstream service. – Taipan
According to the CREATE TABLE docs, the timestamp format is yyyy-mm-dd hh:mm:ss[.f...]

If you must keep the ISO 8601 format, add this SerDe parameter: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'

You can alter the table from Glue (1) or recreate it from Athena (2):

  1. Glue console > tables > edit table > add the above to the SerDe parameters. You will also need to click "edit schema" and change the data types from string to timestamp.
  2. From Athena, delete the table and run:
CREATE EXTERNAL TABLE `table1`(
  `id` bigint, 
  `iso_8601_now` timestamp, 
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS') 
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
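As a sanity check on the pattern itself, the Java SimpleDateFormat string yyyy-MM-dd'T'HH:mm:ss.SSSSSS corresponds to %Y-%m-%dT%H:%M:%S.%f in Python's strptime, and it does parse the sample values (an equivalence sketch; the SerDe itself uses the Java pattern):

```python
from datetime import datetime

# Python strptime equivalent of the Java pattern used in timestamp.formats;
# %f covers the six-digit fractional seconds.
fmt = "%Y-%m-%dT%H:%M:%S.%f"
ts = datetime.strptime("2019-05-16T22:47:33.409056", fmt)
print(ts)  # 2019-05-16 22:47:33.409056
```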
Vietnam answered 19/5, 2019 at 14:41
Thanks for the answer. I just tried the above test with the timestamps in yyyy-mm-dd hh:mm:ss and the Glue crawler still classified my data as a string. I need to make sure the crawler correctly classifies the timestamp columns rather than having to manually edit the table definition! – Taipan
After the crawler runs again, the table is overwritten with the original schema. – Greatcoat
This answer works! The only tweak needed: if adding directly to the table properties in the Glue Data Catalog, do not escape the quotes. What worked for me was adding timestamp.formats with the value yyyy-MM-dd'T'HH:mm:ss.SSSSSS to the SerDe parameters section; it now queries correctly as a timestamp in Athena. – Orcein
@cmcapellan, where exactly did you set the SerDe parameters? In the classifier, or the crawler? I couldn't find the option to configure it. Any help is appreciated. – Beeves
Hi @SaiTejaChowderpally, I did this project a long time ago and unfortunately no longer have access to it. I think this was in the UI of the Glue catalog on the AWS console, not in the Athena query window... so I guess the "Edit Classifier" screen? – Orcein
Not sure if you found your answer, but note that according to the docs:

"To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier."

Source

Barry answered 9/9, 2019 at 17:59
Try the following grok pattern if you wish to have timestamp as the data type:

%{TIME:timestamp}
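Note that grok's TIME pattern matches only the time-of-day portion, not a full date-time, which is easy to confirm locally (a rough regex approximation of grok's TIME, not its exact definition):

```python
import re

# Rough regex equivalent of grok's TIME pattern (HH:mm:ss with optional
# fractional seconds) — an approximation for local testing.
TIME = re.compile(r"(?:2[0123]|[01]?\d):[0-5]\d(?::(?:[0-5]?\d)(?:\.\d+)?)?")

print(bool(TIME.fullmatch("22:47:33.409056")))             # True
print(bool(TIME.fullmatch("2019-05-16T22:47:33.409056")))  # False
```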

Whitelaw answered 10/9, 2019 at 17:36
It seems that JSON and CSV each use their respective built-in classifier, which only reads data as primitive data types.

For JSON these are string, number, array, etc. For CSV it may be only number and string.

The grok patterns appear to work only on plain text files, and with the JSON classifier you only specify the path, not the data pattern, so there is no way to do this in the Glue crawler.

Tacitus answered 20/1, 2021 at 8:30
For what it's worth, I had the same issue with plain old dates. My CSV had YYYY-MM-DD, but Glue kept converting it to string.

I edited the table schema, changing my columns to date, and then changed the crawler setup to "Add new columns only":

Edit Crawler > Output > Configuration

(It had been set to "Update the table definition in the data catalog".)

Obviously this only works if you're not expecting schema changes.

Armlet answered 7/12, 2021 at 9:12