Retries using td-agent

At my current workplace, we use td-agent for data collection purposes. As with any system, network endpoints can break when there are issues on the server. Let’s take the following data flow –

[Figure: td-agent retry architecture diagram]

Considerations

  • td-agent fetches data from service A and then sends an HTTP API call to service B.
  • service A is highly reliable (say, it is an Amazon cloud solution) and service B exposes an HTTP API for data consumption.
  • service B relies on td-agent for all data.

Reliability Challenges

  • Input Data Reliability
    • td-agent input data should be reliable (achieved using a plugin for the Amazon cloud service)
  • td-agent Issues
    • td-agent processes are killed forcefully, i.e. using the kill or pkill commands
    • td-agent service is stopped manually
  • API Connectivity Issues
    • No data loss should occur if connectivity drops (typically handled with retries and file buffers)

My testing showed that td-agent handled data pushes to the API gracefully in both cases, whether its processes were killed or the service was stopped manually. td-agent buffers worked well, giving me options to handle bulk data being fetched through parameters such as flush_interval, buffer_chunk_limit and buffer_queue_limit (documented here). The one question on my mind was “How do I ensure retries if the API was down for x minutes?”

Open Source Plugins

I looked at the following plugins –

https://github.com/ento/fluent-plugin-out-http – This is a standard td-agent output plugin with no buffering of its own; it relies on the bufferize plugin for buffering. The major challenge here was that retries weren’t handled.

Tests showed the following:

  • No data loss observed when the API was down and data was being written to the buffer.
  • Data loss observed when the API was down and data was being pushed to the API from the buffer (because the buffer limits had been reached).
  • Without buffers, data loss observed when the API was down.

https://github.com/ablagoev/fluent-plugin-out-http-buffered/ – This plugin was inspired by the previous plugin with two significant changes.

  • Buffered plugin from the start (not relying on another plugin for buffering)
  • Handled retries for specific HTTP codes

Tests showed the following:

  • No data loss observed when the API was down with application errors.
  • Data loss observed when the API server had problems such as lost connectivity.
  • Retries were instantaneous but not really “intelligent”, i.e. I couldn’t customize them.

The major catch in the code was its use of the fail keyword.

Analysing fail

With assistance from the comment on line 81, I added the fail keyword to the rescue section.

rescue IOError, EOFError, SystemCallError => e
    # server didn't respond
    $log.warn "Net::HTTP.#{request.method.capitalize} raises exception: #{e.class}, '#{e.message}'"
    fail "Server issues"

Restarting td-agent and trying to send data displayed the error –

2015-10-08 12:08:55 +0000 [warn]: Net::HTTP.Post raises exception: Errno::ECONNREFUSED, 'Connection refused - connect(2) for "localhost" port 5000'
2015-10-08 12:08:55 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2015-10-08 12:08:59 +0000 error_class="Errno::ECONNREFUSED" error="Connection refused - connect(2) for \"localhost\" port 5000" plugin_id="object:3f96df96b148"
2015-10-08 12:08:55 +0000 [warn]: suppressed same stacktrace
2015-10-08 12:08:59 +0000 [warn]: retry succeeded. plugin_id="object:3f96df96b148"

The retry succeeded, but the log carried extra warnings that I could not explain, which was clearly not a good implementation. Looking through the issues reported on the official forum, I came across an elasticsearch plugin use-case and went on to read the plugin’s code. On line 182, I stumbled upon retry!

Time to retry

Reading the code, retry looked simple, and what was more, I (a Ruby noob) had never realised that the language supported this feature. So I started playing around with retry, using it to replace fail.

rescue IOError, EOFError, SystemCallError => e
    # server didn't respond
    $log.warn "Net::HTTP.#{request.method.capitalize} raises exception: #{e.class}, '#{e.message}'"
    retry

This worked like magic, producing only the warning log I wanted –

2015-10-08 12:08:55 +0000 [warn]: Net::HTTP.Post raises exception: Errno::ECONNREFUSED, 'Connection refused - connect(2) for "localhost" port 5000'

However, when I tried to replace a fail outside the rescue block with retry, td-agent refused to start and I saw this error –

* Restarting td-agent  td-agent
/etc/td-agent/plugin/out_buffered_http.rb: 
/etc/td-agent/plugin/out_buffered_http.rb:79: Invalid retry (SyntaxError)
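
Both behaviours can be reproduced outside td-agent. The following is a minimal sketch in plain Ruby, not the plugin’s code: flaky_post, send_with_retry and retry_outside_rescue are hypothetical names, and the outage is simulated with a counter instead of a real HTTP call.

```ruby
# Hypothetical stand-in for the plugin's HTTP call: fails twice, then succeeds.
def flaky_post(state)
  state[:attempts] += 1
  raise Errno::ECONNREFUSED, "simulated outage" if state[:attempts] < 3
  "200 OK"
end

# retry inside a rescue clause re-runs the method body from the top.
def send_with_retry(state)
  flaky_post(state)
rescue SystemCallError => e
  state[:log] << "#{e.class}: retrying"
  retry
end

# retry outside a rescue clause is rejected when the code is parsed,
# before anything runs. eval is used here only to capture that error.
def retry_outside_rescue
  eval("def broken; retry; end")
  "parsed"
rescue SyntaxError => e
  e.class.name
end

state = { attempts: 0, log: [] }
puts send_with_retry(state)
puts state[:attempts]
puts retry_outside_rescue
```

Because the SyntaxError is raised while the file is being loaded, a misplaced retry stops td-agent at startup rather than at the first failed request.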

Adding custom retry times

The fascinating part was adding robust retry times. The logic was inspired by Celery: wait 2 ^ n (2 to the power n) seconds before each retry, where n is the retry count, with a static cut-off. I expanded the scope a bit by making it m ^ n (m to the power n).

Here is the code:

# define the configuration parameters

# base (m) of the m ^ n sleep time, in seconds; default 2
config_param :retry_interval, :integer, default: 2

# maximum exponent (n) for the sleep time; default 6
config_param :retry_threshold, :integer, default: 6

# include the retry code in the rescue clause
# (retries = 0 must be set before the code that rescue protects)
rescue => e
    # server didn't respond
    $log.warn "Net::HTTP.#{request.method.capitalize} raises exception: #{e.class}, '#{e.message}'"
    # Cap the exponent at the threshold so sleep times don't shoot up
    retries += 1 if retries < @retry_threshold
    # Set m^n (m to the power n) as sleep time
    sleep_time = @retry_interval ** retries
    $log.info "Sleeping for #{sleep_time} seconds"
    sleep sleep_time
    retry

With retries = 0 set at the beginning of the function, the exponent is capped at @retry_threshold while the sleep time grows as m ^ n. Note that this caps the sleep time, not the number of attempts: with the defaults the service sleeps for 2, 4, 8, 16, 32, 64, 64, 64… seconds and keeps retrying until the API is back up.
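
To check the schedule without waiting for real sleeps, the same arithmetic can be pulled into a small standalone function. This is a sketch under my own naming: backoff_seconds and backoff_schedule are hypothetical helpers, and base and threshold stand in for retry_interval and retry_threshold.

```ruby
# Sleep time before the n-th consecutive retry: base ** min(n, threshold).
def backoff_seconds(attempt, base: 2, threshold: 6)
  base**[attempt, threshold].min
end

# Sleep times for the first `count` retries, starting at attempt 1.
def backoff_schedule(count, base: 2, threshold: 6)
  (1..count).map { |attempt| backoff_seconds(attempt, base: base, threshold: threshold) }
end

schedule = backoff_schedule(8)
puts schedule.inspect
# Total time spent sleeping before the cap is reached (first six retries).
puts schedule.first(6).inject(:+)
```

With the defaults, the first six retries add up to 126 seconds of sleeping, after which every further retry waits 64 seconds. Raising the base or the threshold stretches that window quickly, so it is worth computing the schedule before changing either value.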

Conclusion

While td-agent does provide a template for writing custom plugins, here are some notes from my experience on retries for output plugins.

  • retry can be used only in the rescue block
  • fail can be used either in the rescue block or main block
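  • fail is an alias for raise in Ruby, so the two behave identically; preferably use raise for readability. If either one escapes the plugin, td-agent takes over the retry and logs warnings like: [warn]: temporarily failed to flush the buffer. Rescue the exception and retry inside the plugin to avoid them.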
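
As a quick check of how fail and raise relate, here is a small sketch in plain Ruby. The method names with_fail, with_raise and capture_error are hypothetical and exist only for this illustration.

```ruby
# Both calls raise a RuntimeError when given a String message.
def with_fail
  fail "Server issues"
end

def with_raise
  raise "Server issues"
end

# Returns the class and message of whatever the block raises, or nil.
def capture_error
  yield
  nil
rescue StandardError => e
  [e.class, e.message]
end

puts capture_error { with_fail }.inspect
puts capture_error { with_raise }.inspect
```

Both calls yield the same exception class and message, so the choice between them does not change what td-agent sees; what matters is whether the exception is rescued inside the plugin or allowed to propagate.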

Happy coding!
