Blog de François Maillet

Efficient log processing



I’ve recently learned a couple of neat tricks for processing large amounts of text more efficiently from my new co-worker @nicolaskruchten. Our use case is going through tens of gigabytes of logs to extract specific lines and do some operation on them. Here are a couple of things we’ve done to speed things up.

Keep everything gzipped

Often, the bottleneck will be IO. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to read the compressed files directly and pipe the output to whichever script we need. This reduces the amount of data that needs to be read from the disk, at the cost of some CPU time for decompression. For example:

zcat log.gz | grep pattern

If you’re piping into a Python script, you can easily loop over lines coming from the standard input by using the fileinput module, like this:

import fileinput
for line in fileinput.input():
    process(line)
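
If you would rather skip zcat and have Python open the compressed file itself, the standard gzip module can decompress on the fly. The following is only a minimal sketch: the function name matching_lines and the choice of UTF-8 with errors="replace" are assumptions made for illustration, not something from our pipeline.

```python
import gzip
import re
from typing import Iterator


def matching_lines(path: str, pattern: str) -> Iterator[str]:
    """Yield the lines of a gzip-compressed text file that match a regex.

    Hypothetical helper: the name and the encoding choice are assumptions.
    """
    regex = re.compile(pattern)
    # "rt" decompresses on the fly and decodes to text, so the file is
    # streamed line by line instead of being loaded whole into memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if regex.search(line):
                yield line.rstrip("\n")
```

Because it is a generator, you can loop over matching_lines("log.gz", "pattern") the same way you loop over fileinput.input().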

Use parallel to use all available cores

GNU parallel is the coolest utility I’ve discovered recently. It lets you run a command on each file in a list, with several of those runs executing in parallel. For example, suppose we have 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and need to extract the lines that contain gmail.com. We could run grep on each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? It can be done like this using parallel:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz

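Breaking down the previous command: -j4 tells parallel to run up to 4 jobs at a time (one per core on a 4-core machine), the quoted string zcat {} | grep gmail.com is the command to run, and {} is replaced in turn by each file matching the glob exim_reject*.gz given after the :::. Note that grep treats the dot in gmail.com as a regular-expression wildcard that matches any character; use grep -F gmail.com if you want a literal match.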

What’s great about it is that you can also collect all the results from the parallel executions and pipe them into another command. We could for example decide to keep the resulting lines in a new file like this:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz | gzip > results.txt.gz
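
If the per-file work already lives in Python, the same fan-out and collect pattern can be approximated without GNU parallel. The sketch below is an analogue, not a replacement for the command above. The names grep_gz and parallel_grep are hypothetical. It uses a thread pool for simplicity and returns results in input order, which is closer to running parallel with --keep-order. For CPU-heavy processing a process pool would probably be the better fit.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import List


def grep_gz(path: str, needle: str) -> List[str]:
    """Return the lines of one gzip file that contain `needle` (literal match)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if needle in line]


def parallel_grep(paths: List[str], needle: str, out_path: str, jobs: int = 4) -> int:
    """Search the files concurrently, gzip the matches, return how many there were.

    Hypothetical helper; `jobs` plays the role of -j4 in the shell command.
    """
    total = 0
    with ThreadPoolExecutor(max_workers=jobs) as pool, \
            gzip.open(out_path, "wt", encoding="utf-8") as out:
        # map() hands back one list of lines per file, in input order.
        for lines in pool.map(lambda p: grep_gz(p, needle), paths):
            for line in lines:
                out.write(line + "\n")
            total += len(lines)
    return total
```

For the files above, you would pass the sorted list of exim_reject*.gz paths, the string "gmail.com", and "results.txt.gz" as the output path.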

Use a ramdisk

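If you’ll be doing a lot of reading and writing to the disk on the same files and have lots of RAM, you should consider using a ramdisk. Because the files then live in memory, this can save a lot of IO time. On Linux, it is very easy to do. The following commands create the mount point and mount a tmpfs ramdisk of up to 8GB on it. Note that nr_inodes=1k caps the ramdisk at about 1024 files and directories, and mode=777 makes it writable by every user:

sudo mkdir -p /media/ramdisk
sudo mount -t tmpfs -o size=8G,nr_inodes=1k,mode=777 tmpfs /media/ramdisk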
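
Since a ramdisk has a fixed size, it can be worth checking that a file fits before copying it over. Here is a rough sketch of that check. The function name stage_on_ramdisk is made up for this example, and it assumes the ramdisk is already mounted at whatever directory you pass in.

```python
import os
import shutil


def stage_on_ramdisk(src: str, ramdisk_dir: str) -> str:
    """Copy `src` into `ramdisk_dir` if it fits and return the new path.

    Hypothetical helper: assumes `ramdisk_dir` is an existing, mounted ramdisk.
    """
    needed = os.path.getsize(src)
    # disk_usage reports the free space of the filesystem holding the directory.
    free = shutil.disk_usage(ramdisk_dir).free
    if needed > free:
        raise OSError(
            "%s needs %d bytes but only %d are free in %s"
            % (src, needed, free, ramdisk_dir)
        )
    return shutil.copy(src, ramdisk_dir)
```

With the mount above, calling stage_on_ramdisk("log.gz", "/media/ramdisk") would return the path of the copy to read from.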

In the end…

By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. It was well worth the time it took to refactor our initial naive pipeline.


2 Comments

  1. Marc
    February 23, 2012 at 10:02 am

    There are also other “z” utilities similar to the zcat command you mentioned. Those commands make it very easy to work with compressed log files: zcat, zless, zmore, zgrep, zegrep, zdiff, zcmp.

    Source: http://www.thegeekstuff.com/2009/05/zcat-zless-zgrep-zdiff-zcmp-zmore-gzip-file-operations-on-the-compressed-files/

  2. Cédrik
    May 31, 2012 at 6:42 am

    When you have huge files to compress, I also recommend replacing gzip with pigz, which will use all available cores to speed up the compression.
