Filter log file entries based on date range

My server is showing unusually high CPU usage, and I can see Apache is using way too much memory. I have a feeling I'm being DoS'd by a single IP - maybe you can help me find the attacker?

I've used the following line to find the 10 most "active" IPs:

cat access.log | awk '{print $1}' |sort  |uniq -c |sort -n |tail

The top 5 IPs have about 200 times as many requests to the server as the "average" user. However, I can't tell whether these 5 are just very frequent visitors or whether they are attacking the server.

Is there a way to restrict the above search to a time interval, e.g. the last two hours, or between 10 and 12 today?

Cheers!

UPDATED 23 OCT 2011 - The commands I needed:

Get entries within last X hours [Here two hours]

awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print Date FS $4}' access.log

Get most active IPs within the last X hours [Here two hours]

awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date) print $1}' access.log | sort  |uniq -c |sort -n | tail

Get entries within relative timespan

awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print Date FS Date2 FS $4}' access.log

Get entries within absolute timespan

awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $0}' access.log 

Get most active IPs within absolute timespan

awk -vDate=`date -d '13:20' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'13:30' +[%d/%b/%Y:%H:%M:%S` ' { if ($4 > Date && $4 < Date2) print $1}' access.log | sort  |uniq -c |sort -n | tail
Columella answered 9/10, 2011 at 19:47 Comment(4)
I'm lazy; I'd copy the log into Excel and create a pivot table...Serial
@Serial "Now you have two problems."Censurable
This isn't a hard problem to solve but all of the scripts listed in the question are wrong as they don't compare month names correctly, most of the answers, including the accepted answers, are wrong, and the rest of the answers are overly complicated. The problem, I think, is that the OP didn't provide any sample input/output at all, never mind some that cover all of the potential use cases and so everyone's guessing at what might work with no minimal reproducible example to test it.Cupboard
The only answer that looks like it might be correct and concise is actually hidden in @Patrick's comment rather than posted as an answer but with no input/output to test against, I can't say for sure.Cupboard

Yes, there are multiple ways to do this. Here is how I would go about it. For starters, there's no need to pipe the output of cat; just open the log file with awk.

awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print Date, $0}' access_log

Assuming your log looks like mine (they're configurable), the date is stored in field 4 and is bracketed. What I am doing above is finding everything within the last 2 hours. Note the -d'now-2 hours', literally "now minus 2 hours", which for me formats to something like this: [10/Oct/2011:08:55:23

So what I am doing is storing the formatted value of two hours ago and comparing it against field four. The conditional expression should be straightforward. I am then printing Date, followed by the output field separator (OFS, a space in this case), followed by the whole line, $0. You could use your previous expression and just print $1 (the IP addresses):

awk -vDate=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date {print $1}' access_log | sort  |uniq -c |sort -n | tail
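To see the comparison in action, here is a self-contained sketch with fabricated log lines (hypothetical IPs and timestamps) and a hardcoded cutoff in place of date -d, so the result is reproducible. Note that this is a plain string comparison, so it is only reliable while all entries share the same month and year:

```shell
# Fabricated sample log; all entries in the same month so the
# lexicographic comparison on field 4 behaves like a date comparison.
cat > /tmp/sample_access_log <<'EOF'
10.0.0.1 - - [10/Oct/2011:07:10:00 +0000] "GET / HTTP/1.1" 200 123
10.0.0.2 - - [10/Oct/2011:09:30:00 +0000] "GET / HTTP/1.1" 200 123
10.0.0.3 - - [10/Oct/2011:10:15:00 +0000] "GET / HTTP/1.1" 200 123
EOF

# Keep only entries after 09:00 and print their IPs:
awk -v Date='[10/Oct/2011:09:00:00' '$4 > Date {print $1}' /tmp/sample_access_log
# prints:
# 10.0.0.2
# 10.0.0.3
```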

If you want to use a range, specify two date variables and construct your expression appropriately.

So if you wanted to find something between 2 and 4 hours ago, your expression might look something like this:

awk -vDate=`date -d'now-4 hours' +[%d/%b/%Y:%H:%M:%S` -vDate2=`date -d'now-2 hours' +[%d/%b/%Y:%H:%M:%S` '$4 > Date && $4 < Date2 {print Date, Date2, $4}' access_log

Here is a question I answered regarding dates in bash that you might find helpful: Print date for the monday of the current week (in bash)

Senator answered 10/10, 2011 at 15:11 Comment(10)
Thanks man! Great examples with good explanations. I've elaborated your code for my specific needs, and added it to the original question for future reference for myself and others in need.Columella
i'm glad it could be of help.Senator
One last thing. How do I search through multiple log files? I am trying with find and xargs but still no luck: find -name 'access.log' | awk -vDate=date -d '13:20' +[%d/%b/%Y:%H:%M:%S -vDate2=date -d'13:40' +[%d/%b/%Y:%H:%M:%S ' { if ($4 > Date && $4 < Date2) print $1}' xargs | sort |uniq -c |sort -n | tailColumella
busy day, so I will give you a detailed answer after work. If you do not need to retain the document name you could use a glob like access_logs.2011-* which would find all logs from 2011, assuming your logs look like access_log.YYYY-MM; if you need to keep the names, try using a for loop.Senator
Is awk somehow smart enough to guess that you're comparing dates ? Because I'd say it's just comparing strings, and since dates don't sort the same as strings (in the default nginx format you're using)... well I did some quick tests and I get less results for past month than past day, so it does seem kind of brokenHire
@Antoine, I originally answered this question in 2011. It was looking at apache log files not nginx. I am unfamiliar with nginx. Does it use the same default format as apache?Senator
@Senator Sorry to resurrect this, but I'm pretty sure my point does not depend on the version of awk, and indeed nginx in 2018 seems to be using the same date format as apache in 2011. The question is how to deal with the fact that [01/Feb/20XX < [02/Feb/20XX < [31/Jan/20XX ?Hire
Yes, same here! I ran into this too: [27/May/2018:03:12:01 > [01/Jun/2018:09:14:26Maurer
Yes, this is a plain string compare of the two date strings which may fail if the two dates are in different months. Try something like this: awk -F'[][]' -v dstart=`date -d"-2 hours" +%Y%m%dT%0H:%0M:%0S` '{ $2 = substr($2,8,4)sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",substr($2,4,3))+2)/3)substr($2,1,2)"T"substr($2,13,8); if ($2 > dstart) print }' access_logAnnisannissa
@Antoine, I've edited my answer to add an intro regarding this bug and a new bash version which is quicker than the Perl version.Holly

Introduction

As the accepted answer from matchew is wrong, per Antoine's comment: awk does alphanumeric comparisons. So if your logfile lists events across the end of one month and the beginning of the next:

  • [27/Feb/2023:00:00:00
  • [28/Feb/2023:00:00:00
  • [01/Mar/2023:00:00:00

awk will consider:

[01/Mar/2023:00:00:00 < [27/Feb/2023:00:00:00 < [28/Feb/2023:00:00:00

Which is wrong! You have to compare date strings properly!
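This mis-ordering is easy to reproduce: sort uses the same lexicographic comparison that awk's > applies to strings, so March lands before February:

```shell
# Sort the three bracketed timestamps as plain strings (LC_ALL=C
# pins the collation order, matching byte-wise comparison):
printf '%s\n' '[27/Feb/2023:00:00:00' '[28/Feb/2023:00:00:00' '[01/Mar/2023:00:00:00' | LC_ALL=C sort
# prints:
# [01/Mar/2023:00:00:00
# [27/Feb/2023:00:00:00
# [28/Feb/2023:00:00:00
```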

For this, you could use libraries, depending on the language you use.

I will present two different ways here: one in Perl using the Date::Parse library, and another (quicker) in bash using GNU tools.

As this is a common task

And because this is not exactly the same as extract last 10 minutes from logfile, which deals with a span of time up to the end of the logfile, and because I've needed them, I (quickly) wrote this:

#!/usr/bin/perl -ws
# This script parse logfiles for a specific period of time

sub usage {
    printf "Usage: %s -s=<start time> [-e=<end time>] <logfile>\n", $0;
    die $_[0] if $_[0];
    exit 0;
}

use Date::Parse;

usage "No start time submitted" unless $s;
my $startim=str2time($s) or die;

my $endtim = $e ? str2time($e) : time();

usage "Logfile not submitted" unless $ARGV[0];
open my $in, "<" . $ARGV[0] or usage "Can't open '$ARGV[0]' for reading";
$_=<$in>;
exit unless $_; # empty file
# Determining regular expression, depending on log format
my $logre=qr{^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+)};
$logre=qr{^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\]} unless /$logre/;

while (<$in>) {
    /$logre/ && do {
        my $ltim=str2time($1);
        print if $endtim >= $ltim && $ltim >= $startim;
    };
};

This could be used like:

./timelapsinlog.pl -s=09:18 -e=09:24 /path/to/logfile

for printing logs between 09h18 and 09h24.

./timelapsinlog.pl -s='2017/01/23 09:18:12' /path/to/logfile

for printing from January 23rd, 9h18'12" up to now.

In order to reduce the Perl code, I've used the -s switch to permit auto-assignment of variables from the command line: -s=09:18 will populate a variable $s which will contain 09:18. Take care not to miss the equal sign = and to use no spaces!

Note: the script holds two different kinds of regex for two different log standards. If you require different date/time format parsing, either post your own regex or post a sample of the formatted date from your logfile:

^(\S{3}\s+\d{1,2}\s+(\d{2}:){2}\d+)         # ^Jan  1 01:23:45
^[^\[]*\[(\d+/\S+/(\d+:){3}\d+\s\+\d+)\]    # ^... [01/Jan/2017:01:23:45 +0000]

Quicker** bash version:

Answering Gilles Quénot's comment, I've tried to create a bash version.

As this version seems quicker than the Perl version: you can find a full version of grepByDates.sh with comments on my website (not on gith...); I post a shorter version here:

#!/bin/bash

prog=${0##*/}
usage() {
    cat <<EOUsage
        Usage: $prog <start date> <end date> <logfile>
            Each argument is required. End date could be `now`.
EOUsage
}

die() {
    echo >&2 "ERROR $prog: $*"
    exit 1
}

(($#==3))|| { usage; die 'Wrong number of arguments.';}

[[ -f $3 ]] || die "File not found."
# Conversion of argument to EPOCHSECONDS by asking `date` for the two conversions
{
    read -r start
    read -r end
} < <(
    date -f - +%s <<<"$1"$'\n'"$2"
)

# Determine which kind of log format, between "apache logs" and "system logs":
read -r oline <"$3"   # read one log line
if [[ $oline =~ ^[^\ ]{3}\ +[0-9]{1,2}\ +([0-9]{2}:){2}[0-9]+ ]]; then
    # Look like syslog format
    sedcmd='s/^\([^ ]\{3\} \+[0-9]\{1,2\} \+\([0-9]\{2\}:\)\{2\}[0-9]\+\).*/\1/'
elif [[ $oline =~ ^[^\[]+\[[0-9]+/[^\ ]+/([0-9]+:){3}[0-9]+\ \+[0-9]+\] ]]; then
    # Look like apache logs
    sedcmd='s/^[0-9.]\+ \+[^ ]\+ \+[^ ]\+ \[\([^]]\+\)\].*$/\1/;s/:/ /;y|/|-|'
else
    die 'Log format not recognized'
fi
# Print lines beginning with `1<tabulation>`
sed -ne s/^1\\o11//p <(
    # paste `bc` tests with log file
    paste <(
        # bc will do comparison against EPOCHSECONDS returned by date and $start - $end
        bc < <(
            # Create a bc function for testing against $start - $end.
            cat <<EOInitBc
                define void f(x) {
                    if ((x>$start) && (x<$end)) { 1;return ;};
                    0;}
EOInitBc
            # Run sed to extract date strings from logfile, then
                # run date to convert string to EPOCHSECONDS
            sed "$sedcmd" <"$3" |
                date -f - +'f(%s)'
        )
    ) "$3" 
)

Explanation

  • The script runs sed to extract the date strings from the logfile.
  • It passes the date strings to date -f - +%s to convert all of them to EPOCH (Unix timestamps) in one run.
  • It runs bc for the tests: print 1 if min < date < max, else print 0.
  • It runs paste to merge the bc output with the logfile.
  • Finally, it runs sed to find lines that match 1<tab>, replace the match with nothing, and print.

So this script forks 5 subprocesses to do dedicated jobs with specialised tools, but doesn't run a shell loop over each line of the logfile!
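The steps above can be sketched in a stripped-down form (fabricated sample lines; awk stands in for bc here so the comparison step is easier to follow, and GNU date/sed are assumed):

```shell
# Fabricated apache-style lines:
cat > /tmp/mini_log <<'EOF'
172.16.0.3 - - [17/Feb/2023:17:48:41 +0000] "GET / HTTP/1.1" 200 1
172.16.0.4 - - [17/Feb/2023:17:25:41 +0000] "GET / HTTP/1.1" 200 1
172.16.0.5 - - [17/Feb/2023:16:15:41 +0000] "GET / HTTP/1.1" 200 1
EOF

start=$(date -d '2023-02-17 17:00:00 +0000' +%s)   # range begin, as EPOCH
end=$(date -d '2023-02-17 18:00:00 +0000' +%s)     # range end, as EPOCH

# 1. sed extracts the bracketed date string and reshapes it for GNU date
# 2. date -f - +%s converts every line to EPOCH in one run
# 3. paste glues the EPOCH column (tab-separated) onto the original lines
# 4. awk keeps lines whose EPOCH falls inside the range
sed 's/^[0-9.]\+ \+[^ ]\+ \+[^ ]\+ \[\([^]]\+\)\].*$/\1/;s/:/ /;y|/|-|' /tmp/mini_log |
  date -f - +%s |
  paste - /tmp/mini_log |
  awk -F'\t' -v s="$start" -v e="$end" '$1 > s && $1 < e { print $2 }'
# prints the 17:48:41 and 17:25:41 lines only
```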

** Note:

Of course, this is quicker on my host because it runs on a multicore processor; each task runs in parallel!

Conclusion:

This is not a program! This is an aggregation script!

If you consider bash not as a programming language, but as a super language or a tools aggregator, you can harness the full power of all your tools!

Holly answered 24/1, 2017 at 15:25 Comment(6)
Very nice reply, I added this to a loop, and I can easily investigate what happened on a server.Retractile
Date::Parse is not considered reliable. Check https://mcmap.net/q/22327/-filter-log-file-entries-based-on-date-rangeGilgilba
@GillesQuénot Did you consider my $logre? Alternatively, you could do this under bash, by using GNU date, bc, sed and paste: { read start;read end ;}< <(date -f - +%s <<<$'Feb 17 2023 17:15:42 +0200\nFeb 17 2023 17:48:42 +0200');sed -ne s/^1\\o11//p <(paste <(bc < <(echo "define void f(x) { if (x>$start) { if ( x <$end ) { 1 ;return;};} ; 0 ;}"; sed 's/^[0-9.]\+ \+[^ ]\+ \+[^ ]\+ \[\([^]]\+\)\].*/\1/;s/:/ /;y|/|-|' <logs | date -f - +'f(%s)' )) logs) (Tried with your 3-line log sample.)Holly
@GillesQuénot I've edited my answer regarding my answer to your comment!Holly
@GillesQuénot Answer edited in a try to make more readable!Holly
Before trying to compare readability of your version and mine, Please make your version able to work on syslog or apache logs indifferently.Holly

If someone encounters the awk: invalid -v option error, here's a script to get the most active IPs in a predefined time range:

cat <FILE_NAME> | awk '$4 >= "[04/Jul/2017:07:00:00" && $4 < "[04/Jul/2017:08:00:00"' | awk '{print $1}' | sort -n | uniq -c | sort -nr | head -20
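For reference, the same filter works as a single awk invocation without cat (fabricated sample lines below; like the original, this is a string comparison, so it only behaves correctly within a single month):

```shell
# Fabricated sample log within one hour window boundary:
cat > /tmp/range_log <<'EOF'
203.0.113.9 - - [04/Jul/2017:07:15:00 +0000] "GET / HTTP/1.1" 200 1
203.0.113.9 - - [04/Jul/2017:07:20:00 +0000] "GET / HTTP/1.1" 200 1
198.51.100.7 - - [04/Jul/2017:08:05:00 +0000] "GET / HTTP/1.1" 200 1
EOF

# One awk does both the range test and the IP extraction:
awk '$4 >= "[04/Jul/2017:07:00:00" && $4 < "[04/Jul/2017:08:00:00" { print $1 }' /tmp/range_log |
  sort | uniq -c | sort -nr | head -20
# prints "2 203.0.113.9" (with uniq's count padding)
```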
Labor answered 4/7, 2017 at 9:30 Comment(2)
The cat is (still) useless.Censurable
Again this may fail if the dates are in different months (eg, "May">"Jun"). See my comment above for a way to convert from the string to the number. Briefly, monthnum=(match("JanFebMarAprMayJunJulAugSepOctNovDec",monthstr)+2)/3Annisannissa

A very quick and readable way to do it in Python. This seems to be faster than the bash version. (The computed time is displayed using an internal module which has been stripped from this code.)

./ext_lines.py -v -s 'Feb 12 00:23:00' -e 'Feb 15 00:23:00' -i /var/log/syslog.1

Total time                : 445 ms 187 musec
Time per line             : 7 musec 58 ns
Number of lines           : 63,072
Number of extracted lines : 29,265

I can't compare this code against the daemon.log file used by others... but here is my config:

Operating System: Kubuntu 22.10 KDE Plasma Version: 5.25.5 KDE Frameworks Version: 5.98.0
Qt Version: 5.15.6
Kernel Version: 6.2.0-060200rc8-generic (64-bit)
Graphics Platform: X11 Processors: 16 × AMD Ryzen 7 5700U with Radeon Graphics
Memory: 14.9 GiB of RAM

The essential code could fit in just one line (dts = ...), but to make it more readable it's split across three. It's not only rather fast, it's also very compact :-)

from argparse import ArgumentParser, FileType
from datetime import datetime
from os.path import basename
from sys import argv
from time import mktime, strptime

__version__ = '1.0.0'                     # Workaround (internal use)

now = datetime.now

progname = basename(argv[0])

parser = ArgumentParser(description = 'Is Python strptime faster than sed and Perl ?',
                        prog = progname)

parser.add_argument('--version',
                    dest = 'version',
                    action = 'version',
                    version = '{} : {}'.format(progname,
                                               str(__version__)))
parser.add_argument('-i',
                    '--input',
                    dest = 'infile',
                    default = '/var/log/syslog.1',
                    type = FileType('r',
                                    encoding = 'UTF-8'),
                    help = 'Input file (stdin not yet supported)')
parser.add_argument('-f',
                    '--format',
                    dest = 'fmt',
                    default = '%b %d %H:%M:%S',
                    help = 'Date input format')
parser.add_argument('-s',
                    '--start',
                    dest = 'start',
                    default = None,
                    help = 'Starting date : >=')
parser.add_argument('-e',
                    '--end',
                    dest = 'end',
                    default = None,
                    help = 'Ending date : <=')
parser.add_argument('-v',
                    dest = 'verbose',
                    action = 'store_true',
                    default = False,
                    help = 'Verbose mode')

args = parser.parse_args()
verbose = args.verbose
start = args.start
end = args.end
infile = args.infile
fmt = args.fmt

############### Start code ################

lines = tuple(infile)

# Use defaut values if start or end are undefined
if not start :
    start = lines[0][:14]

if not end :
    end = lines[-1][:14]

# Convert start and end to timestamp
start = mktime(strptime(start,
                        fmt))
end = mktime(strptime(end,
                      fmt))

# Extract matching lines
t1 = now()
dts = [(x, line) for x, line in [(mktime(strptime(line[:14],
                                                  fmt)),
                                  line) for line in lines] if start <= x <= end]
t2 = now()

# Print stats
if verbose :
    total_time = 'Total time'
    time_p_line = 'Time per line'
    n_lines = 'Number of lines'
    n_ext_lines = 'Number of extracted lines'

    print(f'{total_time:<25} : {((t2 - t1) * 1000)} ms')
    print(f'{time_p_line:<25} : {((t2 -t1) / len(lines) * 1000)} ms')
    print(f'{n_lines:<25} : {len(lines):,}')
    print(f'{n_ext_lines:<25} : {len(dts):,}')

# Print extracted lines
print(''.join([x[1] for x in dts]))
Overrun answered 19/2, 2023 at 20:29 Comment(1)
By my tests, your script's performance is comparable to my bash script's, maybe a little better. Interesting, thanks, I was waiting for a Python version!Holly
O
0

To parse the access.log precisely in a specified range, in this case only the last 10 minutes (based on EPOCH, aka the number of seconds since 1970/01/01):

Input file:

172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.4 - - [17/Feb/2023:17:25:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
172.16.0.5 - - [17/Feb/2023:17:15:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"

Perl's oneliner:

This uses the reliable Time::Piece time parser, with strptime() to parse the date and strftime() to format the new one. This module is installed in core (by default), which is not the case for the unreliable Date::Parse.

$ perl -MTime::Piece -sne '
    BEGIN{
        my $t = localtime;
        our $now = $t->epoch;
        our $monthsRe = join "|", $t->mon_list;
    }
    m!\[(\d{2}/(?:$monthsRe)/\d{4}:\d{2}:\d{2}:\d{2})\s!;
    my $d = Time::Piece->strptime("$1", "%d/%b/%Y:%H:%M:%S");
    my $old = $d->strftime("%s");
    my $diff = (($now - $old) + $gap);
    if ($diff > $min and $diff < $max) {print}
' -- -gap=$({ echo -n "0"; date "+%:::z*3600"; } | bc) \
     -min=0 \
     -max=600 access.log

Explanations of arguments: -gap, -min, -max switches

  • -gap: the $((7*3600)), aka 25200 seconds, is the offset from UTC: +7 hours in seconds in my current case 🇹🇭 (Thai TZ) ¹, rewritten as { echo -n "0"; date "+%:::z*3600"; } | bc if you have GNU date. If not, use another way to set the gap.
  • -min: the minimum number of seconds back from now for which matching log line(s) are printed.
  • -max: the maximum number of seconds back from now for which matching log line(s) are printed.
  • To find your gap from UTC, take a look at:

¹

$ LANG=C date
Fri Feb 17 15:50:13 +07 2023

The +07 is the gap.

This way, you can filter on an exact range of seconds with this snippet.

Sample output

172.16.0.3 - - [17/Feb/2023:17:48:41 +0200] "GET / HTTP/1.1" 200 123 "" "Mozilla/5.0 (compatible; Konqueror/2.2.2-2; Linux)"
Odericus answered 17/2, 2023 at 10:59 Comment(8)
You missed a BEGIN{} part!! Without it, you reassign $t, $now and $monthsRe for each line of the log!! Adding a BEGIN{} section divides execution time by 2 (but it stays longer than my bash version ;-) perl -MTime::Piece -sne 'BEGIN{ my $t = localtime; my $now = $t->epoch;print $now; $monthsRe = join "|", $t->mon_list; }; m!((?:$monthsRe) \d{2} \d{2}:\d{2}:\d{2})\s!;... ...And drop the my for $monthsRe!!Holly
My version is more readable ;)Gilgilba
By addon a BEGIN { my $t = localtime; $now = $t->epoch; $monthsRe = join "|", $t->mon_list; } section, your script will stay readable!! And should become a lot quicker!!Holly
Yes, I'm working on. Thanks for reporting ;)Gilgilba
Re-declaring $monthRe before reading each line of logfile is a bug, I think.Holly
Fixed, post edited accordingly and testedGilgilba
Have the courtesy to at least upvote my comment if you find them useful!Holly
I already said thanks in comment ^^ Merci beaucoup confrère Francophone =)Gilgilba

© 2022 - 2024 — McMap. All rights reserved.