Split access.log file by dates using command line tools
I have an Apache access.log file that is around 35 GB in size. Grepping through it is no longer practical without a long wait.

I want to split it into many small files, using the date as the splitting criterion.

The date is in the format [15/Oct/2011:12:02:02 +0000]. Any idea how I could do this using only bash scripting, standard text-manipulation programs (grep, awk, sed and the like), piping and redirection?

The input file name is access.log. I'd like the output files to have a format such as access.apache.15_Oct_2011.log (that would do the trick, although it is not nice for sorting).

Fatalism answered 27/7, 2012 at 11:41 Comment(0)

One way using awk:

awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4,array,"[:/]")
    year = array[3]
    month = m[array[2]]

    print > FILENAME"-"year"_"month".txt"
}' incendiary.ws-2010

This will output files like:

incendiary.ws-2010-2010_04.txt
incendiary.ws-2010-2010_05.txt
incendiary.ws-2010-2010_06.txt
incendiary.ws-2010-2010_07.txt

Against a 150 MB log file, the answer by chepner took 70 seconds on a 3.4 GHz 8-core Xeon E31270, while this method took 5 seconds.

Original inspiration: "How to split existing apache logfile by month?"
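The same technique extends to one file per day; a sketch (the sample data, and extending to days, are mine, not part of the answer):

```shell
# Made-up one-line sample log, just for illustration
printf '%s\n' '1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1' > access.log

awk 'BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = sprintf("%02d", a)
}
{
    split($4, array, "[:/]")
    day   = substr(array[1], 2)   # array[1] is "[15"; drop the leading bracket
    fname = FILENAME "-" array[3] "_" m[array[2]] "_" day ".txt"
    print > fname
}' access.log
```

With the sample line above this would write access.log-2011_10_15.txt.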

Tager answered 30/7, 2012 at 1:14 Comment(6)
You are right, Sir. I've just tested the perl solution as well, and the awk solution was faster by 3x. I suspect this is because the awk example doesn't use regular expressions but simple string splitting, which might be more efficient. Marking as the accepted answer.Fatalism
Oh, and I'm definitely using this in production against 20 GB files with no problems now. It takes about 2 GB/minute on my system.Tager
Similar performance here as well: ~1 minute / ~2.5 GB. Thanks!Fatalism
The only thing is, I need the date extracted as well - my daily log sizes are well over 400 MB these days. Could you modify the script to include dates as well?Fatalism
Shouldn't the 'split($4,array,"[:/]")' instruction come before 'year = array[3]'?Mound
@TheodoreR.Smith Your output file names are wrong because you encoded the month variable with two digits (sprintf("%02d", a)). Can you please fix your output file names to avoid confusion?Tireless

Pure bash, making one pass through the access log:

while IFS= read -r; do
    [[ $REPLY =~ \[(..)/(...)/(....): ]]

    d=${BASH_REMATCH[1]}
    m=${BASH_REMATCH[2]}
    y=${BASH_REMATCH[3]}

    #printf -v fname "access.apache.%s_%s_%s.log" ${BASH_REMATCH[@]:1:3}
    printf -v fname "access.apache.%s_%s_%s.log" $y $m $d

    echo "$REPLY" >> "$fname"
done < access.log
Unclose answered 27/7, 2012 at 13:44 Comment(3)
The method in my answer is dramatically faster: Against a 150 MB log file, this answer took 70 seconds on an 3.4 GHz 8 Core Xeon E31270, while the method in mine took 5 seconds.Tager
However, this answer creates log files per day, not per month. The monthly split does less, so no wonder it is faster.Comeon
@Comeon The reason this is slower is that iterating through the input is much faster in awk than in bash; the number of output files is not really relevant.Unclose

Here is an awk version that outputs lexically sortable log files.

Some efficiency enhancements: everything is done in one pass, fname is regenerated only when it changes, and the previous file is closed when switching to a new one (otherwise you might run out of file descriptors).

awk -F"[]/:[]" '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  if($4 != pyear || $3 != pmonth || $2 != pday) {
    pyear  = $4
    pmonth = $3
    pday   = $2

    if(fname != "")
      close(fname)

    fname  = sprintf("access_%04d_%02d_%02d.log", $4, m2n[$3], $2)
  }
  print > fname
}' access-log
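The zero-padded, fixed-width date fields are what make the names lexically sortable; a quick illustration using made-up file names in the same format:

```shell
# Fixed-width yyyy_mm_dd fields mean byte order equals date order,
# so a plain sort (or ls) lists the files chronologically.
printf '%s\n' access_2011_10_15.log access_2011_02_03.log access_2010_12_31.log | sort
```

This prints the 2010 file first, then February 2011, then October 2011.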
Bortz answered 27/7, 2012 at 14:26 Comment(0)

Perl came to the rescue:

perl -n -e'm@\[(\d{1,2})/(\w{3})/(\d{4}):@; open(LOG, ">>access.apache.$3_$2_$1.log"); print LOG $_;' access.log

Well, it's not exactly a "standard" text-manipulation program, but it is made for text manipulation nevertheless.

I've also changed the order of the captured groups in the file name, so that files are named like access.apache.yyyy_mon_dd.log for easier sorting.

Fatalism answered 27/7, 2012 at 12:38 Comment(0)

I combined Theodore's and Thor's solutions to get Thor's efficiency improvements plus daily files, while retaining the original support for IPv6 addresses in combined-format logs.

awk '
BEGIN {
  m2n["Jan"] =  1;  m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4;
  m2n["May"] =  5;  m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8;
  m2n["Sep"] =  9;  m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12;
}
{
  split($4, a, "[]/:[]")
  if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
    pyear  = a[4]
    pmonth = a[3]
    pday   = a[2]

    if(fname != "")
      close(fname)

    fname  = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
  }
  print >> fname
}'
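A quick way to sanity-check the script: feed it a couple of synthetic lines (made-up data, including an IPv6 client address) on standard input and look at the files it creates:

```shell
# Two made-up combined-format lines; the first has an IPv6 client address.
printf '%s\n' \
  '2001:db8::1 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1' \
  '1.2.3.4 - - [16/Oct/2011:01:00:00 +0000] "GET / HTTP/1.1" 200 1' |
awk '
BEGIN {
  m2n["Jan"] =  1; m2n["Feb"] =  2; m2n["Mar"] =  3; m2n["Apr"] =  4
  m2n["May"] =  5; m2n["Jun"] =  6; m2n["Jul"] =  7; m2n["Aug"] =  8
  m2n["Sep"] =  9; m2n["Oct"] = 10; m2n["Nov"] = 11; m2n["Dec"] = 12
}
{
  split($4, a, "[]/:[]")
  if(a[4] != pyear || a[3] != pmonth || a[2] != pday) {
    pyear = a[4]; pmonth = a[3]; pday = a[2]
    if(fname != "")
      close(fname)
    fname = sprintf("access_%04d-%02d-%02d.log", a[4], m2n[a[3]], a[2])
  }
  print >> fname
}'
```

This writes access_2011-10-15.log and access_2011-10-16.log, one line each.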
Preconception answered 20/1, 2015 at 21:42 Comment(1)
This is really impressive! Thank youTager

Kind of ugly, that's bash for you:

for year in 2010 2011 2012; do
    for month in jan feb mar apr may jun jul aug sep oct nov dec; do
        # seq -w zero-pads the day so "01" does not also match "11", "21", "31"
        for day in $(seq -w 1 31); do
            grep -i "$day/$month/$year" access.log > "$day-$month-$year.log"
        done
    done
done
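The two-pass idea suggested in the comments can be sketched like this (a sketch of mine, assuming Apache's zero-padded dd/Mon/yyyy dates; the sample data is made up): first split by year, so the inner greps run against much smaller files.

```shell
# Made-up one-line sample log for illustration
printf '%s\n' '1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1' > access.log

# Pass 1: one full scan per year, producing smaller per-year files
for year in 2010 2011 2012; do
    grep "/$year:" access.log > "access-$year.log"
done

# Pass 2: grep each (much smaller) per-year file by day
for year in 2010 2011 2012; do
    for month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec; do
        for day in $(seq -w 1 31); do
            grep -i "$day/$month/$year" "access-$year.log" > "$day-$month-$year.log"
        done
    done
done
```

With the sample line above, 15-Oct-2011.log ends up with the entry and the other files stay empty.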
Matutinal answered 27/7, 2012 at 12:45 Comment(4)
very clever, thanks ;) this would work great for small files (file size less than the amount of RAM), as it loops through the entire file about 1,116 times :)Fatalism
very true, it's not an efficient script. It would be good for occasional use. Thanks!Matutinal
it would be faster to unroll the outer loop and process the file in two passes - the first pass splits the file into per-year files; the second pass then processes each year file and splits the entries by date. It may even be faster to unroll the second loop and process the file in three passes.Matutinal
grepping for the date will accidentally drop stack traces etc., i.e. any lines that don't contain a date will be lost. Usually it is these lines that are the most interesting.Blackbird

I made a slight improvement to Theodore's answer so I could see progress when processing a very large log file.

#!/usr/bin/awk -f

BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    current = year "-" month
    if (last != current)
        print current
    last = current

    print >> FILENAME "-" year "-" month ".txt"
}

Also I found that I needed to use gawk (brew install gawk if you don't have it) for this to work on Mac OS X.
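For non-GNU awks, parenthesizing the redirection target is the usual portable workaround for the concatenation after >> that makes gawk necessary; a self-contained sketch of mine with made-up sample data:

```shell
# Made-up one-line sample log for illustration
printf '%s\n' '1.2.3.4 - - [15/Oct/2011:12:02:02 +0000] "GET / HTTP/1.1" 200 1' > access.log

awk '
BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", months, " ")
    for (a = 1; a <= 12; a++)
        m[months[a]] = a
}
{
    split($4, array, "[:/]")
    year = array[3]
    month = sprintf("%02d", m[array[2]])

    current = year "-" month
    if (last != current)
        print current          # progress marker, printed once per new month
    last = current

    # Parentheses make the concatenated file name portable beyond gawk
    print >> (FILENAME "-" year "-" month ".txt")
}' access.log
```

With the sample line this prints 2011-10 as progress and writes access.log-2011-10.txt.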

Brownedoff answered 1/5, 2014 at 5:15 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.