Logfile analysis in R?

Asked 14/4, 2011 at 14:44 Answered 9/4, 2016 at 3:55

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. It might not be the first thought to do it in R, but R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?

Ollie answered 14/4, 2011 at 14:44 Comment(2)

to paraphrase @Dirk : I'm looking forward to that package... – Fredel 14/4, 2011 at 15:9

I wrote a pretty nice (and long) readLogFile function for a project last year. It turned out to be quite useful. It's not too hard to write your own. Just text processing and indexing. – Ambie 25/4, 2014 at 15:3

In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my GitHub repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the Python bytecode files in /exec along with a suitable Python runtime, but I have no idea).

I was actually surprised that I needed to undertake a project of this sort. There are several excellent free and open source log file parsers/viewers (including the excellent Webalizer and AWStats), but neither of those two parses server error logs (parsing server access logs is the primary use case for both).

If you are not familiar with error logs or with the difference between them and access logs: Apache servers (likewise nginx and IIS) record two distinct logs and by default store them on disk next to each other in the same directory. On Mac OS X, that directory is under /var, just below root:

$> pwd
   /var/log/apache2

$> ls
   access_log   error_log

For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and more significantly, because the data file you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
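To make the "irregular time series" point concrete, here is a minimal sketch of gridding: counting entries per fixed-width interval so that every interval, including the empty ones, gets a value. It is written in Python only for illustration (the processor described in this answer does this step in R). The function name `grid_counts`, the 5-second bucket width, and the sample timestamps are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

def grid_counts(timestamps, width_seconds=5):
    """Count events per fixed-width bucket, filling empty buckets with 0.

    Hypothetical helper: returns a list of (bucket_start, count) pairs
    covering the span from the first to the last event.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    width = timedelta(seconds=width_seconds)
    # Map each event to the index of the bucket it falls into
    counts = Counter((t - start) // width for t in timestamps)
    n_buckets = max(counts) + 1
    return [(start + i * width, counts.get(i, 0)) for i in range(n_buckets)]

# Made-up sample: two entries share a timestamp, the next is 3 s later,
# then nothing until 12 s after the first entry.
t0 = datetime(2011, 7, 13, 1, 50, 0)
events = [t0, t0, t0 + timedelta(seconds=3), t0 + timedelta(seconds=12)]
for bucket_start, n in grid_counts(events):
    print(bucket_start.time(), n)
```

The regular series that comes out can then be handed to ordinary time-series tools, which generally expect evenly spaced observations.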

I wanted an app that I could toss raw error logs into (of any size, but usually several hundred MB at a time) and have something useful come out the other end--which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in Python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
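The parser itself is not shown in this answer. As a rough sketch of what a first parsing stage might look like, the following assumes the Apache 2.2-style error log layout, `[timestamp] [level] [client ip] message`. The regular expression, the function name `parse_error_line`, and the sample line are illustrative assumptions, not the author's code; other servers and Apache versions format these lines differently.

```python
import re
from datetime import datetime

# Assumed layout: [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] message
ERROR_LINE = re.compile(
    r'^\[(?P<timestamp>[^\]]+)\]\s+'
    r'\[(?P<level>[^\]]+)\]\s+'
    r'(?:\[client (?P<client>[^\]]+)\]\s+)?'   # the client field is optional
    r'(?P<message>.*)$'
)

def parse_error_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ERROR_LINE.match(line.rstrip('\n'))
    if match is None:
        return None
    fields = match.groupdict()
    # Day and month names are parsed using the current locale (English assumed)
    fields['timestamp'] = datetime.strptime(
        fields['timestamp'], '%a %b %d %H:%M:%S %Y')
    return fields

sample = ('[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] '
          'client denied by server configuration: /export/home/live/ap/htdocs/test')
print(parse_error_line(sample))
```

Returning `None` for lines that do not match keeps the unstructured leftovers out of the parsed output, so they can be counted or inspected separately.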

I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R--is what a pleasure R is to work with and how well suited it is for tasks of this sort. A few welcome surprises:

Serialization. Persisting working data in R takes a single command (save). I knew this, but I didn't know how efficient this binary format is. The actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB--100 : 1 compression. (Note: I pushed this down further to about 300 : 1 by using the data.table library and manually setting the compression level argument to the save function);
IO. My data warehouse relies heavily on redis, a lightweight data structure server that resides entirely in RAM and writes to disk asynchronously. The project itself is only about two years old, yet there's already a redis client for R on CRAN (by B.W. Lewis, version 1.6.1 as of this post);
Primary Data Analysis. The purpose of this project was to build a library for our Network Ops guys to use. My goal was a "one command = one data view" type of interface. So for instance, I used the excellent googleVis package to create professional-looking scrollable/paginated HTML tables with sortable columns, into which I loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elements--e.g., sorting a column--delivered useful descriptive analytics. Another example: I wrote a lot of thin wrappers over some basic data juggling and table-like functions; each of these functions I would, for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part because quite often the function required no wrapper; the single command with the arguments supplied was enough to generate a useful view of the data.

A couple of examples of the last bullet:

# what are the most common issues that cause an error to be logged?

err_order = function(df){
    # count log entries per issue description
    t0 = xtabs(~Issue_Descr, df)
    # sort causes by count, most frequent first
    t0 = sort(t0, decreasing=TRUE)
    m1 = data.frame(Cause=names(t0), Count=as.numeric(t0),
                    CountAsProp=100*as.numeric(t0)/nrow(df))
    # keep only causes that account for at least 1% of all entries
    subset(m1, CountAsProp >= 1)
}

# calling this function, passing in a data frame, returns something like:


                        Cause       Count    CountAsProp
1  'connect to unix://var/ failed'    200        40.0
2  'object buffered to temp file'     185        37.0
3  'connection refused'                94        18.8

The Primary Data Cube Displayed for Interactive Analysis Using googleVis:

[Image: a contingency table (from an xtabs function call) displayed using googleVis]

Erythroblast answered 13/7, 2011 at 1:50 Comment(1)

Doug, that sounds lovely. I can try to help with the R packaging -- e.g. Python scripts are a non-issue as other packages come with their own Perl (cf gdata which uses a Perl package to read xls files) or Java jars (several packages). – Analysis 13/7, 2011 at 22:16

It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, has three different regexp engines for parsing, and so on.

And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.

Analysis answered 14/4, 2011 at 14:48 Comment(3)

+1, an experienced opinion does definitely help an alienated user here. In turn if the idea wasn't that bad I wonder whether the most basic stuff like parsing the line based logs and choosing distinct ips wasn't already done... – Ollie 14/4, 2011 at 15:7

"So much to do, so little time." Not everything that ought to get done also gets done. That's why Open Source is fun: Your itch, your project. – Analysis 14/4, 2011 at 15:19

ok, I'm starting to like it. So I guess I need to start reading to see whether the itching stops or continues... any suggestions from tools written in other languages / concepts? – Ollie 14/4, 2011 at 15:22

I have used R to load and parse IIS log files with some success. Here is my code.

# Load IIS log files
library(data.table)

setwd("Log File Directory")

# get a list of all the log files
log_files <- Sys.glob("*.log")

# This line
# 1) reads each log file
# 2) concatenates them
IIS <- do.call( "rbind", lapply( log_files,  read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) )

# Add field names: copy them from the "#Fields:" header line of one of the log files
# (hyphens replaced with underscores so the names can be used unquoted)
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time_taken")

#Change it to a data.table
IIS <- data.table( IIS )

#Query at will
IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]
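The column names above are copied by hand from the "#Fields:" line. As an alternative sketch (in Python, standard library only), the names can be read from that directive directly, so the parser follows whatever fields the server was configured to log. The function name `read_iis_log` and the sample text are assumptions for illustration, and the sketch assumes that field values contain no spaces.

```python
import io

def read_iis_log(handle):
    """Yield one dict per entry, keyed by the names in the '#Fields:' line."""
    fields = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('#'):
            # '#Fields: date time s-ip ...' names the columns that follow
            if line.startswith('#Fields:'):
                fields = line[len('#Fields:'):].split()
            continue
        if fields is None:
            continue  # data before any '#Fields:' line cannot be labelled
        # '-' marks a missing value, mirroring na.strings = "-" in the R code
        values = [None if v == '-' else v for v in line.split(' ')]
        yield dict(zip(fields, values))

# Made-up sample in the W3C extended layout that IIS writes
sample = io.StringIO(
    "#Software: Microsoft Internet Information Services 7.5\n"
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken\n"
    "2014-04-25 15:00:01 GET /index.html 200 31\n"
    "2014-04-25 15:00:02 GET /missing.html 404 15\n"
)
rows = list(read_iis_log(sample))
print(rows[1]['cs-uri-stem'], rows[1]['sc-status'])
```

Because the field list is re-read whenever a new "#Fields:" line appears, a file whose logging configuration changed partway through is still labelled correctly.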

Nautch answered 25/4, 2014 at 15:0 Comment(0)

I did a logfile analysis recently using R. It was nothing really complex, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was the data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but in the end I decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem in reading logfiles in R.
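As a rough illustration of why normalization shrinks the data, here is a sketch that stores each distinct user-agent string once and lets the log rows refer to it by id. It uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table names, column names, and sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory stand-in for a MySQL backend
conn.executescript("""
    CREATE TABLE agents (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE hits (
        ip TEXT, ts TEXT, status INTEGER,
        agent_id INTEGER REFERENCES agents(id)
    );
""")

def insert_hit(ip, ts, status, agent):
    # Store the long, repetitive string once; reuse its id afterwards
    conn.execute("INSERT OR IGNORE INTO agents (name) VALUES (?)", (agent,))
    (agent_id,) = conn.execute(
        "SELECT id FROM agents WHERE name = ?", (agent,)).fetchone()
    conn.execute("INSERT INTO hits VALUES (?, ?, ?, ?)",
                 (ip, ts, status, agent_id))

# Made-up sample rows
insert_hit('10.0.0.1', '2011-04-15 07:06:00', 200, 'Mozilla/5.0 (X11; Linux)')
insert_hit('10.0.0.2', '2011-04-15 07:06:01', 404, 'Mozilla/5.0 (X11; Linux)')
insert_hit('10.0.0.1', '2011-04-15 07:06:03', 200, 'curl/7.21.0')

# A simple descriptive table: hits per status code
for status, n in conn.execute(
        "SELECT status, COUNT(*) FROM hits GROUP BY status ORDER BY status"):
    print(status, n)
```

The same idea applies to any column with few distinct but long values, such as URLs or referrers. The aggregated result, rather than the raw log, is then what gets pulled into R.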

Buoyant answered 15/4, 2011 at 7:6 Comment(2)

in fact that sounds interesting as I have already worked a bit with RMySQL. But what I don't see is how I should read the logfiles into MySQL. I think what we were discussing above is a package to analyze the most common logfile formats (locally). However, do you have any resource, kickstart, link or whatsoever? – Ollie 15/4, 2011 at 8:45

I found a perl-script here. I had a custom logfile-format, so I changed the script a little bit. With usual logfiles that should work. – Buoyant 19/4, 2011 at 7:32

#!/usr/bin/env python3

import argparse
import csv
import io
import shlex

# Demo lines, used when no log file is given with -f/--source
DEMO_LINES = [
    '54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"',
    '54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"',
]

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='source', default=None)
arguments = parser.parse_args()

if arguments.source:
    # open for reading (the 'wb' mode used before would have truncated the log)
    with open(arguments.source) as fin:
        lines = fin.read().splitlines()
else:
    lines = DEMO_LINES

# 13 columns: the sample line carries a referer before the user agent and one
# more quoted field after it; 'Extra' is only a placeholder name for that field
header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'Referer', 'User-Agent', 'Extra']

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(header)

for line in lines:
    # shlex keeps each double-quoted field together as one token
    tokens = shlex.split(line.replace('[', '').replace(']', ''))
    # tokens[5] is the request line, e.g. 'GET / HTTP/1.1'; split it in three
    writer.writerow(tokens[:5] + tokens[5].split(' ', 2) + tokens[6:])

print(out.getvalue())

Demo output:

IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,Referer,User-Agent,Extra
54.67.81.141,-,-,01/Apr/2015:13:39:22,+0000,GET,/,HTTP/1.1,502,173,-,curl/7.41.0,-
54.67.81.141,-,-,01/Apr/2015:13:39:22,+0000,GET,/,HTTP/1.1,502,173,-,curl/7.41.0,-

This format can easily be read into R using read.csv. And, it doesn't require any 3rd party libraries.

Demonology answered 9/4, 2016 at 3:55 Comment(0)
