Regex to split columns of an Amazon S3 Bucket Log?

I'm setting up an ETL process for my company's S3 buckets so we can track our usage, and I've run into some trouble breaking up the columns of the S3 log file because Amazon uses spaces, double quotes, and square brackets to delimit columns.

I found this regex: [^\\s\"']+|\"([^\"]*)\"|'([^']*)' on this SO post: Regex for splitting a string using space when not surrounded by single or double quotes, and it's gotten me pretty close. I just need help adjusting it to ignore single quotes and to avoid splitting on spaces between a "[" and a "]".

Here's an example line from one of our files:

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - 013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt "GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" -

And here's the format definition: http://s3browser.com/amazon-s3-bucket-logging-server-access-logs.php

Any help would be appreciated!

EDIT: in response to FaileDev, the output should be any string contained between two square brackets, e.g. [foo bar], between double quotes, e.g. "foo bar", or delimited by spaces, e.g. foo bar (where both foo and bar would match individually). I've broken each match from the example line into its own line in the following block:

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 
ourbucket.name.config 
[31/Oct/2011:17:00:04 +0000] 
184.191.213.218 
- 
013259AC1A20DF37 
REST.GET.OBJECT 
ourbucket.name.config.txt 
"GET /ourbucket.name.config.txt HTTP/1.1" 
200 
- 
325 
325 
16 
16 
"-" 
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" 
-
Descendible answered 1/11, 2011 at 0:38 Comment(2)
What exactly should the output be? - Golfer
I can't believe more people don't need this information! Great question, thanks! - Quenchless

You can't do it using string.Split; you need to iterate through all the captures of the 'column' group (if you're using C#).

This matches a non-quoted, non-bracketed field: [^\s\"\[\]]+
This matches a bracketed field: \[[^\]\[]+\] 
This matches a quoted field: \"[^\"]+\"

It's easiest to leave the quotes and brackets on during matching, then strip them off using Trim('[', ']', '"').

@"^((?<column>[^\s\"\[\]]+|\[[^\]\[]+\]|\"[^\"]+\")\s+)+$"
Psychotechnics answered 1/11, 2011 at 1:16 Comment(2)
Thanks, the ORing patterns worked fine. This string pattern works best for C#: @"([^\s\""[]]+)|([[^][]+])|(\""[^\""]+\"")" - Descendible
Thanks. It seems Stack Overflow removed my slashes... I forgot to embed it in a code block. Updating now. - Psychotechnics

Here is a dumb regex I wrote to parse s3 log files in node:

/^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(\".*?\")\s(.*?)$/

As I said, this is "dumb" - it relies heavily on them not changing the log format, and each field not containing any weird characters.
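
Since the asker is working in C#, a literal port of this positional pattern might look like the sketch below (my own translation, not part of the original answer; each \" from the JavaScript literal becomes a doubled quote in a C# verbatim string, and the sample line is shortened):

using System;
using System.Text.RegularExpressions;

class DumbS3LogParser
{
    static void Main()
    {
        // The same "dumb" positional pattern, split across two verbatim strings
        // purely for readability.
        var pattern = new Regex(
            @"^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s("".*?"")\s" +
            @"(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s("".*?"")\s("".*?"")\s(.*?)$");

        string line = "owner ourbucket [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - REQID " +
                      "REST.GET.OBJECT key \"GET /key HTTP/1.1\" 200 - 325 325 16 16 \"-\" \"Mozilla/5.0\" -";

        Match m = pattern.Match(line);
        if (m.Success)
        {
            // Groups 1-18 are purely positional: owner, bucket, time, remote IP, requester,
            // request id, operation, key, request URI, status, error code, bytes sent,
            // object size, total time, turn-around time, referrer, user agent, version id.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                Console.WriteLine(i + ": " + m.Groups[i].Value);
            }
        }
    }
}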

Shumate answered 20/11, 2013 at 16:26 Comment(0)

This is a Python solution that may help someone. It also removes the quotes and square brackets for you:

import re
log = '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be mybucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be A1206F460EXAMPLE REST.GET.BUCKETPOLICY - "GET /mybucket?policy HTTP/1.1" 404 NoSuchBucketPolicy 297 - 38 - "-" "S3Console/0.4" -'

regex = r'(?:"([^"]+)")|(?:\[([^\]]+)\])|([^ ]+)'

# Result is a list of triples, with only one having a value
# (due to the three group types: '""' or '[]' or '')
result = re.compile(regex).findall(log)
for a, b, c in result:
    print(a or b or c)

Output:

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
mybucket
06/Feb/2014:00:00:38 +0000
192.0.2.3
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
A1206F460EXAMPLE
REST.GET.BUCKETPOLICY
-
GET /mybucket?policy HTTP/1.1
404
NoSuchBucketPolicy
297
-
38
-
-
S3Console/0.4
-
Rackrent answered 4/8, 2015 at 13:13 Comment(0)

I agree with @andy! I can't believe more people aren't dealing with S3's access logs, considering how long they have been around.


This is the regexp I used:

/(?:([a-z0-9]+)|-) (?:([a-z0-9\.-_]+)|-) (?:\[([^\]]+)\]|-) (?:([0-9\.]+)|-) (?:([a-z0-9]+)|-) (?:([a-z0-9.-_]+)|-) (?:([a-z\.]+)|-) (?:([a-z0-9\.-_\/]+)|-) (?:"-"|"([^"]+)"|-) (?:(\d+)|-) (?:([a-z]+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:"-"|"([^"]+)"|-) (?:"-"|"([^"]+)"|-) (?:([a-z0-9]+)|-)/i

If you are using node.js you can use my module to make this much easier to deal with, or port it to C#; the basic ideas are all there.

https://github.com/icodeforlove/s3-access-log-parser
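
For anyone who does port it rather than use the module, a rough C# sketch of the idea is below (my own illustration, covering only the first four fields (owner, bucket, time, remote IP), with the dash moved to the end of the character class to avoid an accidental range; the remaining fields continue in the same (?:(...)|-) style):

using System;
using System.Text.RegularExpressions;

class S3AccessLogPortSketch
{
    static void Main()
    {
        // Only the first four fields of the pattern above, to show how the
        // (?:(...)|-) optional-dash groups translate to .NET; extend in the same style.
        var pattern = new Regex(
            @"(?:([a-z0-9]+)|-) (?:([a-z0-9\._-]+)|-) (?:\[([^\]]+)\]|-) (?:([0-9\.]+)|-)",
            RegexOptions.IgnoreCase);

        string line = "dd8d30dd0855 ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - ...";

        Match m = pattern.Match(line);
        if (m.Success)
        {
            Console.WriteLine("owner:  " + m.Groups[1].Value);
            Console.WriteLine("bucket: " + m.Groups[2].Value);
            Console.WriteLine("time:   " + m.Groups[3].Value);
            Console.WriteLine("ip:     " + m.Groups[4].Value);
        }
    }
}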

Sigfried answered 30/9, 2015 at 12:38 Comment(0)

I tried using this in C# but found there were some incorrect characters in the answer above, and you have to put the regex for the non-quoted, non-bracketed field at the end, otherwise it matches everything (tested with http://regexstorm.net/tester).

The full regex has the bracketed field first, the quoted field second, and the non-quoted, non-bracketed field last, as used in the code below.

A simple C# implementation:

    MatchCollection matches = Regex.Matches(contents, @"(\[[^\]\[]+\])|(""[^""]+"")|([^\s""\[\]]+)");
    for (int i = 0; i < matches.Count; i++)
    {
        Console.WriteLine(i + ": " + matches[i].ToString().Trim('[', ']', '"'));
    }
Caprine answered 30/12, 2016 at 18:47 Comment(0)

Here is the regex I copied from the AWS Knowledge Center and modified a bit to make it work in ASP.NET Core.

private static readonly Regex ACCESS_LOG_REGEX = new Regex("([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)");

It is working fine for us. If anyone wants to use a C# class to store the access log, below is the code to parse each line of the log file and create an S3ServerAccessLog object for it.

private List<S3ServerAccessLog> ParseLogs(string accessLogs)
{
    // split log file per new line since each log will be on a single line.
    var splittedLogs = accessLogs.Split("\r\n", StringSplitOptions.RemoveEmptyEntries);
    var parsedLogs = new List<S3ServerAccessLog>();

    foreach (var logLine in splittedLogs)
    {
        var parsedLog = ACCESS_LOG_REGEX.Split(logLine).Where(s => s.Length > 0).ToList();
                
        // construct the strongly typed log entry from the split fields
        var logModel = new S3ServerAccessLog
        {
            BucketOwner = parsedLog[0],
            BucketName = parsedLog[1],
            RequestDateTime = DateTimeOffset.ParseExact(parsedLog[2], "dd/MMM/yyyy:HH:mm:ss K", CultureInfo.InvariantCulture),
            RemoteIP = parsedLog[3],
            Requester = parsedLog[4],
            RequestId = parsedLog[5],
            Operation = parsedLog[6],
            Key = parsedLog[7],
            RequestUri = parsedLog[8].Replace("\"", ""),
            HttpStatus = int.Parse(parsedLog[9]),
            ErrorCode = parsedLog[10],
            BytesSent = parsedLog[11],
            ObjectSize = parsedLog[12],
            TotalTime = parsedLog[13],
            TurnAroundTime = parsedLog[14],
            Referrer = parsedLog[15].Replace("\"", ""),
            UserAgent = parsedLog[16].Replace("\"", ""),
            VersionId = parsedLog[17],
            HostId = parsedLog[18],
            Sigv = parsedLog[19],
            CipherSuite = parsedLog[20],
            AuthType = parsedLog[21],
            EndPoint = parsedLog[22],
            TlsVersion = parsedLog[23]
        };

        parsedLogs.Add(logModel);
    }

    return parsedLogs;
}
Bistort answered 17/4, 2021 at 10:37 Comment(0)

I wasn't able to get any of the posted solutions to parse a log file entry that has a request URI containing double quotes, so this is what I ended up with in Python:

import json
import re
from collections import namedtuple

FILENAME = '/tmp/2022-11/2022-11-01-20-21-34-AB64DC3459FF2F2B'

# define a named tuple to represent each log entry
LogEntry = namedtuple(
    'LogEntry',
    [
        'bucket_owner',
        'bucket',
        'timestamp',
        'remote_ip',
        'requester',
        'request_id',
        'operation',
        's3_key',
        'request_uri',
        'http_version',
        'status_code',
        'error_code',
        'bytes_sent',
        'object_size',
        'total_time',
        'turn_around_time',
        'referrer',
        'user_agent',
        'version_id',
        'host_id',
        'sigv',
        'cipher_suite',
        'auth_type',
        'endpoint',
        'tls_version',
        'access_point_arn'
    ]
)

# compile the regular expression for parsing log entries
LOG_ENTRY_PATTERN = re.compile(
    r'(\S+) (\S+) \[(.+)\] (\S+) (\S+) (\S+) (\S+) (\S+) "(.*) HTTP\/(\d\.\d)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "(\S+)" "(.*)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)'
)

# open the access log file
with open(FILENAME, 'r') as f:
    # iterate over each line in the file
    for line in f:
        # ignore certain types of operations
        if 'BATCH.DELETE.OBJECT' not in line \
                and 'S3.TRANSITION_SIA.OBJECT' not in line \
                and 'REST.COPY.OBJECT_GET' not in line:
            # parse the log entry using the regular expression
            match = LOG_ENTRY_PATTERN.match(line)

            if match:
                # create a LogEntry named tuple from the parsed log entry
                log_entry = LogEntry(*match.groups())
                log_entry = dict(log_entry._asdict())

                for key in log_entry:
                    if log_entry[key] == '-':
                        log_entry[key] = None

                print(json.dumps(log_entry, indent=4, default=str))

I personally find it cleaner to work with a namedtuple, which I then convert to a dict for easy insertion into a MySQL database, than to work with a plain list.

Jotunheim answered 6/12, 2022 at 16:40 Comment(0)
