Avoiding GREEDYDATA in Logstash's grok

In working on performance optimization, there seems to be a constant theme of avoiding GREEDYDATA in grok matches. That's all very well and good, but I don't see how that's possible with standard syslog messages. For the logs I'm working on, most of my fields are simple key-value pairs, which I split out with the kv filter. Before I do that, however, I have to parse and remove the leading syslog headers (timestamp, priority, facility, etc.). The only way I know to get just the kv portion is to use a single GREEDYDATA match for everything after my header fields and to pass the resulting match down to the kv filter. This works great, but seems to really limit performance.
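For reference, the approach described above looks roughly like this (a sketch with assumed field names, not the poster's actual config, which is linked below): grok consumes the syslog header, a single GREEDYDATA captures everything after it, and the kv filter then splits the key-value pairs.

filter {
  grok {
    # Strip the syslog header and capture the remainder in one greedy match.
    match => ["message", "<%{NONNEGINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:syslog_host} %{GREEDYDATA:kv_payload}"]
  }
  kv {
    # Split the remaining "key=value key=value ..." payload into fields.
    source => "kv_payload"
  }
}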

I see that even the included pattern sets for Cisco and other devices make extensive use of greedy matches. How do other folks work with syslog data without GREEDYDATA? For folks processing syslog-style messages, are you achieving 10K+ EPS rates (as others seem to talk about) without massive horizontal scaling? I can just eke out 2-2.5K EPS on an 8-core AWS instance. While horizontal scaling works well enough for me, that's a lot more compute resources than I expected to need. My single-node commercial SIEM product can do much higher rates with a whole lot less, and that's what I'm looking to largely replace with ELK!

I'm running 8 worker threads on my 8-core instance, with all cores nearly pegged and the vast majority of the time going to the worker threads. Apart from a bit of mutate (casting and renaming), the only thing going on is the grok filter followed by the kv filter (a template version of the filter config is at https://github.com/davidski/cookbooks/blob/master/sch-logstash/templates/default/config/filter_sidewinder.conf.erb). The consensus I hear from #logstash on Freenode, and mentioned by elasticsearch.org most recently in the ELK for DevOps webinar, is to avoid GREEDYDATA. It's possible there's other back pressure in the pipeline, but my ES index rates scale linearly when I add more indexer nodes, and I don't know of any way to directly monitor Logstash's internal pipeline queues to see where any pressure is coming from.

I'm really mostly curious what sorts of rates folks are seeing on syslog-style traffic ingestion. Most of the benchmarks I've seen are either off of generated events or with well-structured data. Are folks really getting 10K+ EPS rates with typical syslog ingestion without big farms of indexers? So far, that seems to be a unicorn.
URL:http://comments.gmane.org/gmane.comp.sysutils.logstash.user/10463

Use Grok Sparingly

Thanks to @wood in the group for this tip: under heavy load, Grok really does tend to be the first thing to become a bottleneck. So if the log format is under your control, it's best to find a way to skip the Grok step entirely. The earlier cookbook already showed how to customize LogFormat to emit JSON directly, and two years ago I wrote up an example of doing this with nginx on my blog: http://chenlinux.com/2012/09/21/json-event-for-logstash/index.html.

url:http://chenlinux.com/2014/06/11/nginx-access-log-to-elasticsearch/
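The idea from those posts, sketched very roughly (field selection, log path, and format name here are illustrative assumptions, not the exact config from the linked articles): have nginx write JSON directly, so Logstash can read it with the json codec and skip grok entirely.

# nginx.conf: emit access logs as JSON
log_format logstash_json '{"@timestamp":"$time_iso8601",'
                         '"remote_addr":"$remote_addr",'
                         '"request":"$request",'
                         '"status":"$status",'
                         '"body_bytes_sent":"$body_bytes_sent",'
                         '"request_time":"$request_time"}';
access_log /var/log/nginx/access.json logstash_json;

# logstash config: read the JSON file directly; no grok needed
input {
  file {
    path  => "/var/log/nginx/access.json"
    codec => "json"
  }
}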

Logstash Grok Speeds

21 DECEMBER 2014 on Logstash, Grok

Logstash and its grok filter are excellent and I love them, but ingestion was going so slowly that the data was useless by the time I had finally loaded it for review. Here's what was wrong and how I fixed it.

Issue

The grok filter in Logstash was only able to handle between 10 and 30 log lines per second because of my many regex patterns. This is slower than the rate at which we generate log lines (~50/sec).

Background

At Pindrop we send all log messages to Syslog for aggregation. This means that all logs for all purposes are in the same file. I set up an ELK Stack on my laptop to:

  • Get deeper insight into what kinds of response times our customers should be expecting
  • Determine where we were spending the most time in generating our API response
  • Look for anomalies, similarities, and other interesting points in what customers are querying for.

The grok filter config was immediately very complicated, due to the widely varying nature of our log lines, the mixture of applications sending them, and the detail I needed to pluck from them. For example, some of our log lines have a Python array pretty-printed, whereas others serialize the data as JSON. Extracting phone numbers and usernames from this data requires a dizzying set of regexes.

Thankfully, the Grok Debugger exists for developing these grok lines; it will save your sanity.

Here's an example log line:

Dec 18 07:49:05 box-004 MainProcess[36381]: pindrop.profiler INFO [req bdcecd58a4ab7c2e@box-004] fetch_data - finished at 1418888945.046632, taking 0.006296 seconds (0.00 cpu)

This is one of the easier examples; the data I want from this line is:

  • Timestamp (Dec 18 07:49:05)
  • Who logged it? (pindrop.profiler)
  • Function (fetch_data)
  • How long did it take? (0.006296)

So I have a grok configuration that looks like this:

filter {
  grok {
    match => ["message", "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds"]
    patterns_dir => "patterns/"
  }
}

I also have all these reusable patterns that I wrote/modified from the Logstash base.

pd_UNIQUE_ID [a-f0-9]{12,16}
pd_SOMECLASS (?:[a-zA-Z0-9-_]+[.]?)+[A-Za-z0-9$_]+
pd_SYSLOG_BASE %{SYSLOGTIMESTAMP:timestamp} %{HOST} %{SYSLOGPROG}:
pd_SYSLOG_NORMAL %{pd_SYSLOG_BASE} %{pd_SOMECLASS} (INFO|ERROR|WARN)
pd_REQUEST \[req %{pd_UNIQUE_ID:unique_id}@%{HOST}\]

This is a pretty specific pattern, I realize. But with our many different types of log lines, it's hard to be more generic, capture all the data I need, and still be able to come back and edit them later while maintaining my sanity (which is already wearing thin).

Debugging the Problem

I knew that since I had about 15 different pattern styles, all tried in sequence from most specific to least specific (with break_on_match => true), testing each of these in sequence was going to be the biggest time sink. I optimized and limited these patterns as much as I could, but with little success.
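To put that in concrete terms, the setup looked roughly like the sketch below (the second and third patterns and their field names are hypothetical, not the actual Pindrop ones). With break_on_match => true (the default), grok stops at the first pattern that matches, but a line that only matches a late pattern, or none at all, still pays for every earlier attempt.

filter {
  grok {
    # Patterns ordered most specific to least specific; grok tries them in turn.
    match => {
      "message" => [
        "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds",
        "%{pd_SYSLOG_NORMAL} %{pd_REQUEST} %{GREEDYDATA:request_detail}",
        "%{pd_SYSLOG_NORMAL} %{GREEDYDATA:detail}"
      ]
    }
    break_on_match => true
    patterns_dir   => "patterns/"
  }
}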

I dove into Logstash's Grok Filter and started timing the regex match, like so:

now = Time.now
grok, match = grok.match(input.to_s) # Original Line
total = Time.now - now
if total > @average_time
  @logger.warn("Took #{total}s which is #{total - @average_time} longer than average of #{@average_time}", :event => input)
  @average_time = (@average_time + total) / 2
end

This gave me some good insight into:

  1. What the average time spent handling these patterns was.
  2. If our patterns were taking progressively longer over time.
  3. Which input lines were taking longer than the average to process.

My longest offenders were taking 1.5s, where the quick ones were less than 0.01s.

Solution

In the end, I knew that simplifying the many patterns we would inevitably test these log lines against was the only thing that was going to make a difference. As I was scouring the Logstash Docs, I noticed that there was this section on overwriting a field with a match from a grok line.

This gave me a 'eureka' moment in which I realized I should match the Syslog base of the line once and only once, and then in a separate grok filter, match my speciality parts. Now my grok line is just:

match => ["message", "%{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds"]  

pd_SYSLOG_NORMAL actually expands into 5 different patterns. The result of removing it is that I now only have 5 patterns left in the match line (pd_REQUEST counts as 2). That's a 50% decrease in patterns for just this one line, but more than that: in the worst case a line was being run against those initial 5 syslog patterns 15 times, once for each pattern style. Matching the syslog base up front therefore takes about 70 pattern evaluations per line out of the mix (14 of the 15 attempts times 5 patterns, assuming one of those attempts has to match).
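One way to wire this up (a minimal sketch, assuming the first-stage grok overwrites message with whatever follows the syslog base; this is not the exact config from the post):

filter {
  grok {
    # Match the common syslog base once, then replace "message" with the remainder.
    match        => { "message" => "%{pd_SYSLOG_NORMAL} %{GREEDYDATA:message}" }
    overwrite    => ["message"]
    patterns_dir => "patterns/"
  }
  grok {
    # The specialty patterns now run against the already-shortened message.
    match        => { "message" => "%{pd_REQUEST} %{DATA:proc_detail} - %{DATA}, taking %{NUMBER:duration_seconds:float} seconds" }
    patterns_dir => "patterns/"
  }
}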

The speed-up was dramatic: I went from matching 10-30 lines/s to 500 lines/s. So if you're trying to match on the same thing over and over again, pull it out into a separate grok filter; it will save you time and sanity.

URL:http://ghost.frodux.in/logstash-grok-speeds/

The problem is that if I disable the only filter I have enabled, my StatsD counter says the processing rate is 12k events/s. If I add the grok filter, Logstash is unable to catch up with its position in the Redis queue. Server specs are the following:

  • 8 x Xeon L5420 @ 2.5GHz
  • 12GB memory

Grok disabled
  • 12k events/s
  • CPU usage 70-75%

Grok enabled
  • 2.9k events/s
  • CPU usage 50%

I use the StatsD counter as the metric. Here is a graph of the different scenarios:
http://imgur.com/KHPQYHr
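For reference, the counting could be done with an output along these lines (a sketch; the host and metric name are assumptions, not the reporter's actual config): every event that reaches the output stage bumps a StatsD counter, and the counter's rate gives the events/s figures above.

output {
  statsd {
    host      => "localhost"
    namespace => "logstash"
    # Hypothetical metric name; each processed event increments this counter.
    increment => ["events.processed"]
  }
}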

So, is there any real performance issue when using grok filters, or am I doing something wrong? I have no experience with Java apps, so please let me know what information would be useful for you to figure out what the problem is.

URL:https://logstash.jira.com/browse/LOGSTASH-480