To Optimize or Not To Optimize

TL;DR

More often than not, coding is about finding a balance between competing concerns. Some of the obvious factors we measure code on are –

  • Performance
  • Efficiency
  • Speed
  • Correctness

Occasionally, we miss out on factors such as –

  • Readability
  • Ease of Troubleshooting

In this post, I will walk you through an interesting case I encountered while developing a custom log parser at work, and the choices I had to make along the way.

Apologies for the lack of specific examples here; I cannot divulge the actual data. However, I will try my best to provide illustrative examples wherever possible to support my decisions.

The Challenge

The Expectation

We needed an easy way to write log definitions for some of our log types which would match a log line and split it into individually named fields.

Our expectations with the parsers were –

  • It should identify the log type and return a key–value pair for each field.
  • It should not involve long background processing tasks which would require us to poll frequently for a response.
  • It should return a response quickly.
  • It should be easy to update and maintain in case we want to add / remove / edit log types.

Enter Logstash

We knew about Logstash, which uses Grok, from a previous project that relies heavily on it. In a nutshell, Grok is just a cleaner way of representing the objects/entities we want to identify and isolate in a log line. Grok uses PCRE-style regexes to define about 120 basic, commonly used patterns.
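
To make that concrete, here is a purely illustrative Grok expression. The pattern names are real Grok built-ins, but the line format and the expansion shown are simplified assumptions, not our actual definitions:

# Illustrative only: a Grok expression labels the interesting pieces of a line,
# and each %{PATTERN:field} is backed by a plain, PCRE-style regex.
grok_expression = "%{IP:client} %{WORD:method} %{URIPATHPARAM:request}"

# Under the hood this expands to something roughly like (simplified):
#   (?P<client>\d{1,3}(\.\d{1,3}){3}) (?P<method>\w+) (?P<request>\S+)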

We discovered that, given an input via HTTP, Logstash could not return the result to a UI except over WebSockets. This presents its own challenges: a WebSocket response is sent to all clients connected to the socket, not just the one that made the request. Also, in my opinion, adding Logstash would have complicated the solution too much.

Enter pygrok

My friend and teammate (Divya Meharwade) discovered pygrok, a Grok implementation in Python. This made it much easier for us to sidestep the Logstash challenges while still retaining the simplicity of Grok.
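
If memory serves from the pygrok README, usage looks roughly like this (the pattern and sample line below are illustrative, not our data):

from pygrok import Grok

# You must already know which pattern applies to the line you are parsing.
pattern = '%{WORD:name} is %{WORD:gender}, %{NUMBER:age} years old'
grok = Grok(pattern)
print(grok.match('gary is male, 25 years old'))
# {'name': 'gary', 'gender': 'male', 'age': '25'}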

As I started looking deeper into it, one thing became clear – pygrok couldn't be used as-is for our use case. My reasons were –

  • It expected us to know the log type and pass it as a parameter before it returned the fields.
  • Its custom pattern definition options did not preload our patterns in a way I could reuse (despite spending hours on it).
  • It parsed one line at a time, making the process slow. The primary reasons for this were –
    • The regex pre-compilation happened during object creation, and object creation relied on passing the pattern. I did not know the pattern offhand and wanted the parser to not only identify the pattern but also split out the fields.
    • Another implementation could have been to pass all the log lines through each pattern and then isolate the ones that matched. This would have meant longer iterations.
  • The worst-case time complexity I estimated for the code was O(m x n), where –
    • m is the total number of log lines
    • n is the number of log types that need to be tried before a log can be identified and its fields returned.

The Solution

Enter Custom Parser

Looking at the pygrok code, I felt that I could easily reuse parts of it and rewrite the parser for our use case.

I was targeting these areas specifically –

  • Regexes need to be pre-compiled before they can be used (remember, Grok is just a lot of regexes).
  • All patterns can be read from a file.
  • Isolate and compile only the patterns that really matter to us. (I did this by assigning a specific prefix to each of our log types – a sketch of this follows the list.)
  • The time complexity of the parser should be O(n), where –
    • n is the total number of log lines
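
As a rough illustration of the second and third points, this is the kind of loader I had in mind. The file format handling is simplified (Grok's %{...} reference expansion is omitted), and the prefix and function names are assumptions for the sketch, not our actual definitions:

# Grok pattern files are plain text, one "NAME regex" definition per line.
# We keep only the definitions whose name carries our own prefix, so
# unrelated patterns never get compiled.
LOG_TYPE_PREFIX = "MYCO_"   # hypothetical prefix

def load_log_types(path, prefix=LOG_TYPE_PREFIX):
    log_types = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, pattern = line.partition(" ")
            if name.startswith(prefix):
                log_types[name] = pattern
    return log_types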

In essence, in a single pass the regex should identify the log line and spit out its corresponding log fields.

The quest for an optimal solution took me to an obscure paragraph in the Python regex documentation. The idea was simple – build a lexical-tokenizer-style pattern that combines all the log types we need to match, separated by the logical OR (“|”), and then run each log line through it once.
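
A minimal sketch of that idea, with two made-up log types standing in for our real Grok-derived patterns:

import re  # the standard library module; the next section explains why this had to change

# Hypothetical log-type regexes; the real ones came from our Grok definitions.
LOG_TYPES = {
    "ssh_login": r"(?P<timestamp>\S+) sshd\[(?P<pid>\d+)\]: (?P<message>.*)",
    "http_access": r'(?P<client>\S+) "(?P<method>\w+) (?P<path>\S+)" (?P<status>\d{3})',
}

# Combine every log type into one alternation; wrapping each pattern in a
# named group also tells us which type matched.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in LOG_TYPES.items())
)

def identify(line):
    match = COMBINED.match(line)
    return match.groupdict() if match else None   # one regex pass per line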

re – No support for atomic grouping

This challenge I ran into is very specific to Python. Python’s “re” module doesn’t support PCRE-style atomic grouping, which caused the parser code to raise a lot of errors. I tore my hair out in exasperation until I ran into a Stack Overflow thread and stumbled upon the indomitable regex library.

The same code I wrote worked flawlessly by changing one line – the import statement. 🙂
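
In other words, something like this (regex is a third-party package, installed separately, and is designed to be largely drop-in compatible with re):

# Before: the standard library module, which rejects (?>...) atomic groups.
# import re

# After: the third-party regex package, which supports PCRE-style atomic grouping.
import regex as re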

O(1) Results

Some furious hacking got me to a point where I finally got results, but they contained extra fields with None values, coming from the other log types that had been compiled into the pattern with the logical OR operator. This happens because of how Python’s regex implementation works: any named group that does not participate in the match comes back as None.

Here’s a simple example for you to understand.

# Notice how, with the OR operator, only the first alternative takes part in
# the match; the second group, although a perfect match for the same input,
# comes back as None.
>>> import re
>>> re.match(r"(?P<first_name>\w+)|(?P<last_name>\w+)", "Malcolm").groupdict()
{'first_name': 'Malcolm', 'last_name': None}

This violated our implicit constraint of correctness in the output. Some of you might ask: why not simply drop the keys that came back as None? That is an excellent point. However, what if one of the fields in the log line legitimately had no value? We’d be dropping that field as well, as the example below shows.
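
Here is a contrived example of why that is dangerous. Assume a hypothetical log format in which the user field is optional:

>>> import re
>>> # Hypothetical pattern: the "user" field may legitimately be absent.
>>> pattern = re.compile(r"(?P<ip>\S+) (?P<user>\S+)? (?P<status>\d{3})")
>>> pattern.match("10.0.0.1  200").groupdict()
{'ip': '10.0.0.1', 'user': None, 'status': '200'}

Dropping every None-valued key would silently drop user here, even though it is a real field of this log type – exactly the ambiguity we could not afford.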

O(m x n) Solution

My final solution ended up very similar to pygrok. The algorithm steps I used were (a rough sketch follows the list) –

  • Get all the log types and their corresponding regexes from static files.
  • Compile each log type’s regex and store them in a hashmap.
  • Take each log line, iterate over the compiled regex hashmap, and store the output before returning it to the caller.
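
A minimal sketch of that algorithm, again using made-up log types (the real patterns come from our static Grok-derived files, e.g. via the hypothetical load_log_types helper sketched earlier):

import regex   # drop-in replacement for re, with atomic-group support

# Step 1: hypothetical log types standing in for the ones read from disk.
LOG_TYPES = {
    "ssh_login": r"(?P<timestamp>\S+) sshd\[(?P<pid>\d+)\]: (?P<message>.*)",
    "http_access": r'(?P<client>\S+) "(?P<method>\w+) (?P<path>\S+)" (?P<status>\d{3})',
}

# Step 2: compile every pattern once, up front, into a hashmap.
COMPILED = {name: regex.compile(pattern) for name, pattern in LOG_TYPES.items()}

def parse(lines):
    """Step 3: try each compiled pattern against each line – O(m x n)."""
    results = []
    for line in lines:
        for log_type, pattern in COMPILED.items():
            match = pattern.match(line)
            if match:
                results.append({"log_type": log_type, **match.groupdict()})
                break
        else:
            results.append({"log_type": None, "raw": line})
    return results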

Numbers Game

Running some performance tests on the code, I measured the time taken and memory consumed using the time and memory_profiler modules.
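
For reference, a rough sketch of how one might take such measurements with those two modules (run_parser is a hypothetical wrapper around the parse() sketched above, not the exact harness I used):

import time
from memory_profiler import profile

@profile   # memory_profiler reports line-by-line memory usage for this function
def run_parser(lines):
    start = time.perf_counter()
    results = parse(lines)                      # the parse() sketched above
    print(f"Parsed {len(lines)} lines in {time.perf_counter() - start:.2f}s")
    return results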

Parser Type                  File Size / Data Size   Number of Lines   Time Taken   Memory Consumed
pygrok-based Parser          50 kB / 2.5 kB          302               1.43 s       40.3 MiB
Custom Parser [O(m x n)]     50 kB / 2.5 kB          302               0.14 s       40.5 MiB
Improvements:                ~920% decrease in time taken, ~0.49% increase in memory consumed

Parser Type                  File Size / Data Size   Number of Lines   Time Taken (median of 5 runs)
pygrok-based Parser          1 MB / 54.5 kB          6193              27.15 s
Custom Parser [O(m x n)]     1 MB / 54.5 kB          6193              1.72 s
Improvements:                ~1478% decrease in time taken

Notes:

  • Tests were performed independently, using a file as input.
  • No tests were performed with the Custom Parser [O(n)] solution since it produced incorrect results.
  • UPDATE [2017-10-13]: I noticed after publishing all this that the size of the file on disk was 50 kB whereas the actual data size was about 2.5 kB. I have yet to understand the reason for the disparity.
  • UPDATE [2017-10-13]: The time taken in the first table also accounts for the time taken by the profiler, since the measurements were run together. The second run was done without any profiling, since profiling added significant time for both parsers.

Closing Thoughts

In summary, I’d like to quote from the Zen of Python –

Special cases aren’t special enough to break the rules.
Although practicality beats purity.

I guess these lines sum it up: optimization is a tradeoff and must be applied wisely.

I wish to contribute to the pygrok code very soon so that people do not reinvent the wheel like I did. I’ll post an update here when that happens.
