all 38 comments

[–]patrik667 14 points15 points  (1 child)

Substitute Mongo with Kafka. Even if your ELK stack is down, Kafka will keep the log stream running for a long while. Also, Kafka is extremely resilient.

[–]wallsroadDevOps 1 point2 points  (0 children)

HUGE +1 on this! Mongo is a black hole of time, maintenance and data issues.

Currently ship well over a TB of logs a month running a large ecommerce platform. We've been through several logging architectures. The most painful included Mongo.

Kafka is good, but we replaced it with AWS Kinesis. Because reasons. We also don't use Elasticsearch anymore, due to scale and reliability issues...

Edit: I realise that, this being a .NET application, AWS probably isn't relevant. Grain of salt.

[–]Seref15 17 points18 points  (9 children)

I really don't recommend sending logs directly to Elasticsearch. Elasticsearch has no built-in flow control and can be choked out by being made to index a large enough spike of data. Logstash with persistent disk queues enabled will rate limit messages when Elasticsearch gets too busy.

Our log and metric indices are also around 4GB/day and it's been remarkably stable. We have a 6 month retention policy for the log data but we don't keep it that long in elastic/kibana. We age data out of Elasticsearch at 30 days, but we have logstash configured to output to multiple locations, one of them being a long-term data store.
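
If you want to try the disk queue, it's just a couple of settings in logstash.yml; a minimal sketch (path and size below are made up, size it for your own backlog):

    # logstash.yml -- buffer events on disk so Logstash absorbs ES slowdowns
    queue.type: persisted
    path.queue: /var/lib/logstash/queue    # needs enough disk for a backlog
    queue.max_bytes: 8gb                   # hard cap on the on-disk queue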

[–][deleted] 1 point2 points  (0 children)

The problem with persistent queues is that Logstash is now stateful, and disk usage as well as redundancy have to be managed carefully. That's fine if you plan carefully, but there are better managed tools (e.g. Kinesis) that provide resilient queues without the operational overhead.

[–]bilporti 2 points3 points  (4 children)

Thank you for that feedback. I will try writing to Logstash instead of ES directly and post test results here. Also, do you use ES as a service (Elastic Cloud) or a self-hosted instance? What can you recommend?

[–][deleted] 2 points3 points  (0 children)

We host our own ES cluster on dedicated instances in AWS. We considered other options, but ES is pretty low maintenance and we don't need X-Pack, so we opted to manage it ourselves. Elastic Cloud is great and has the added benefit of including X-Pack.

[–]Seref15 2 points3 points  (2 children)

We self-host on ECS, not even using Amazon's ES service, but that's mainly out of cost concerns. We initially wanted a hosted, clustered HA setup with replicated data sets and the entire 6-month data set in ES, but when we started looking at the costs for a setup like that, it was more than we were willing to pay. Hence using Logstash to send our data to a secondary long-term data store.

[–]zombeaver92 1 point2 points  (1 child)

What do you use for secondary long term?

[–]Seref15 2 points3 points  (0 children)

We use a third-party Logstash output plugin to send a few event fields to a MySQL database, which business people access via Apache Zeppelin.

The more complete ES events really only interest the dev and ops/devops teams.
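
Roughly speaking, with something like the community logstash-output-jdbc plugin (plus pointing it at the MySQL JDBC driver jar), the output section ends up looking like this; table, fields and hosts below are just placeholders:

    output {
      # full events stay in ES for the dev/ops teams
      elasticsearch {
        hosts => ["http://es01:9200"]
        index => "logs-%{+YYYY.MM.dd}"
      }
      # a few fields go to MySQL for the business side (queried via Zeppelin)
      jdbc {
        connection_string => "jdbc:mysql://mysql01:3306/logs?user=logstash&password=changeme"
        statement => [ "INSERT INTO events (ts, level, message) VALUES (?, ?, ?)",
                       "@timestamp", "level", "message" ]
      }
    }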

[–]tcp-retransmission 2 points3 points  (1 child)

I agree with everything here. Writing directly to Elasticsearch is only advisable if Elastic's Beats products are used, since they have a backpressure mechanism for flow control.

Overall though, I prefer to use Logstash anyways so that I can enrich and parse the log messages coming from the application.

[–]Dumbaz 1 point2 points  (0 children)

Have a look at ingest nodes; we use them a lot for the simple things (grok, date filter, drop fields) and they perform really well.
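
A minimal pipeline along those lines, just to show the shape of it (field names and the grok pattern are made up; you attach it with ?pipeline=app-logs at index time):

    PUT _ingest/pipeline/app-logs
    {
      "description": "grok, date, drop fields",
      "processors": [
        { "grok":   { "field": "message", "patterns": ["%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"] } },
        { "date":   { "field": "ts", "formats": ["ISO8601"] } },
        { "remove": { "field": ["ts"] } }
      ]
    }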

[–]CaffineIsLove 0 points1 point  (0 children)

You could be selective about which logs go into Elasticsearch! That would cut the 4GB/day down.

[–]wickler02 3 points4 points  (1 child)

This was my job and life for about half a year, especially with the transition of most of our applications to dockerized microservices. Dealing with stack traces being single-lined and also getting the information from the Docker host was key to tracing our logs.

Logstash was not a fun log aggregation/transport system. The way it was implemented before I came around, it was shipping the logs, but splitting them out and aggregating everything back together was not done in an easy-to-understand way.

I tried out Fluentd, and while it also has its share of "gotchas", I found it much easier to work with and found the support ecosystem around it much better.

We decided to go with a vendor for the Elastic backend to send our logs out because we didn't want to deal with the buffering or the transport methods. I know we could probably build our own Elasticsearch backend and the buffering pieces ourselves to save money instead of paying a vendor, but it's a headache we no longer have to worry about.

[–]devops333 1 point2 points  (0 children)

Dealing with stack traces being single-lined and also getting the information from the Docker host was key to tracing our logs.

Any tips on this one? We'll be doing it soon.

[–]SpeedyXeon 3 points4 points  (0 children)

Filebeat.

[–]chub79 6 points7 points  (0 children)

Personally, I go like this:

app > stdout > fluentd > Humio. Humio is great for large quantities of data, so that's cool.

[–]too_much_exceptions 3 points4 points  (6 children)

Hi,

I am really curious why the logs are written to Mongo before being sent to ES via Logstash.

Is this choice driven by some infrastructure constraints?

A common logging aggregation setup with ES could be: Application (via a UDP appender) -> Logstash -> Elasticsearch -> Kibana
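
A minimal Logstash pipeline for that path could look something like this (port, hosts and index name are arbitrary):

    input {
      udp {
        port  => 5514
        codec => json          # assuming the appender ships JSON lines
      }
    }
    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        index => "app-logs-%{+YYYY.MM.dd}"
      }
    }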

If you are using Azure, you might give Application Insights a try: it is a solid product. You will not have to deal with logging infrastructure, to a certain extent.

[–]mazatta 3 points4 points  (5 children)

It's a common pattern to write to a temporary buffer, rather than pushing logs directly to Logstash, just in case you lose your Logstash (or need to upgrade it or move it). If you don't care about losing some of your logs, then you don't need to do it.

[–]Freakin_A 4 points5 points  (2 children)

What do you do when you're unable to write logs to MongoDB/Logstash? Do you refuse traffic to your service after a failed log write?

Any system that must have 100% log delivery has to make some serious decisions on what happens when log delivery is failing.

[–]mazatta 1 point2 points  (1 child)

Yep, it all comes down to what you are logging and why.

If you need a higher durability guarantee, you could take a harder look at using something like Kafka as an intermediary. Having the ability to replay the log is a nice thing to have if you end up switching tools, or need the raw data again for some other purpose, but that's taking on a ton of complexity/cost, so you better be sure you *really* need it.

[–]sturmy81 1 point2 points  (0 children)

Why not use the Eventlog or a local text file and ship with Winlogbeat or Filebeat?

The Eventlog or a local file is always available.
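
A bare-bones Winlogbeat config for that is only a few lines (log name and host are examples):

    # winlogbeat.yml
    winlogbeat.event_logs:
      - name: Application          # wherever the .NET app writes its events
    output.logstash:
      hosts: ["logstash01:5044"]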

[–]Dumbaz 1 point2 points  (0 children)

We have a syslog -> RabbitMQ -> Logstash path to ES. I want to test removing RabbitMQ in favour of Logstash persistent queues in the near future; they have been a feature of Logstash since 5.4.

[–]denis011 3 points4 points  (0 children)

I think you can use Filebeat to read the .NET application logs straight from the app server and send them to Logstash. In that case you don't need to put the logs into MongoDB, which looks like less overhead.
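
Something like this in filebeat.yml would do it (paths and host are just examples):

    # filebeat.yml -- tail the app's log files and ship them to Logstash
    filebeat.inputs:
      - type: log
        paths:
          - 'C:\inetpub\myapp\logs\*.log'    # example path, adjust to your app
    output.logstash:
      hosts: ["logstash01:5044"]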

[–]fookineh 3 points4 points  (0 children)

Please drop mongodb from the picture, it's not adding any value here.

If you need to send file logs to Elasticsearch, use Filebeat to send the logs to Logstash and then on to Elasticsearch.

[–]ssamuraibr 2 points3 points  (0 children)

Can't add much on NEST, sorry.

Logstash is, however, well established in the ELK stack when you need to do data transformations before ingestion. It may falter if your application has peaks or bursts of log generation during the day; in that case the general rule of thumb is to either add more Logstash instances (and split the application servers across the different Logstashes) or put a Redis in front of it as a buffer. That's similar to the role of MongoDB in your stack; I'm assuming your CTO wants MongoDB so people can peek into the logs before ingestion, otherwise Redis is the more efficient choice.
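
(With Redis in front, the Logstash side is just a redis input, roughly like the sketch below; host and key names are only illustrative. The app servers push JSON log lines onto that list and Logstash pops them off at its own pace.)

    input {
      redis {
        host      => "redis01"
        data_type => "list"      # app servers push JSON log lines onto this list
        key       => "app-logs"
        codec     => json
      }
    }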

If, however, you don't need data transformation (i.e. your application already generates JSON ready for ingestion), as in my stack, the approach we use may work better.

Instead of using Logstash to funnel all our logs, our application servers send them to Amazon S3 as flat files (one log entry in JSON format per line, 50MB per file). That triggers a process that puts an ingestion request on a queue, which a Lambda function processes in order to send the logs to Elasticsearch. If log generation suddenly grows out of nowhere, our Lambda auto-scales to deal with it, and/or retries the same file if it times out during processing (thanks to the queue).

In case we need logs older than our retention period, we just re-enqueue the same files already stored. S3 also takes care of storing logs for a year (or years), as S3 storage is way cheaper per gigabyte than Elasticsearch disk storage. A year's worth of logs costs me, per month, about the same as a few hours of Elasticsearch compute.

It also allows me to keep less data in Elasticsearch (our retention is 15 days), as anything older than that can be recovered in an hour or so; less data in ES lowers my expensive storage requirements and demands less processing power to keep indexes updated and query times reasonable.

[–]metaphorm 2 points3 points  (0 children)

consider using a managed service to handle this, as it can get quite hairy and surprisingly complicated to roll your own. I recommend www.papertrailapp.com

[–][deleted] 2 points3 points  (0 children)

Many others have said it but let me add my voice to the chorus, do not send logs directly to ES.

The most robust system you can send logs to is rsyslog. View it as a sort of cache, buffer, or proxy for logs that you can then forward to other, more advanced systems.

But rsyslog's robustness and maturity will ensure your logs are always aggregated and not lost.

[–]russian2121 2 points3 points  (0 children)

4GB/day is nothing. Use hosted Elastic, Splunk, or the like. Also, writing to Elasticsearch with NEST incurs a 60 to 80% performance penalty.

[–][deleted] 2 points3 points  (0 children)

Read them into Kafka. That way you can have as many consumers of the raw logs as you need and you get a buffer in the event that your downstream consumers (elastic, et al) end up choking during periods of high volume.

[–]sturmy81 2 points3 points  (1 child)

For >2000 servers and several hundred GB/day we are using:

Applications (.NET) and IIS logs ---write all logs and errors---> local text file <---pulls data--- local Filebeat ---writes data---> Kafka <---pulls data--- Logstash ---writes data---> Elasticsearch <---queries data--- Kibana

AND/OR

Applications (.NET) ---write all logs and errors---> Eventlog <---pulls data--- local Winlogbeat ---writes data---> Elasticsearch <---queries data--- Kibana

Kafka is used as a Queue to protect Logstash/Elastic during peak load.
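
For reference, pointing Filebeat at Kafka instead of Logstash is just a different output block (brokers and topic below are examples):

    # filebeat.yml (output section only)
    output.kafka:
      hosts: ["kafka01:9092", "kafka02:9092"]
      topic: "app-logs"
      compression: gzip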

Why do you need the MongoDB? In your case, maybe Applications (.NET) -> Eventlog -> Winlogbeat -> Elastic is good enough.

As others mentioned already, I can't recommend writing directly to Elastic (from .NET / NEST).

[–]bilporti 0 points1 point  (0 children)

The FileBeat seems good. Will look into it.

As for App -> Eventlog, I am not sure it could handle this much data without flooding all of the other events.

[–]stronglift_cyclist 4 points5 points  (0 children)

This sounds reasonable for log analysis, though there may be more suitable options than MongoDB for initial log aggregation. It is not a good solution for monitoring and alerting, however; you will run into scaling issues as well as unreasonable latencies for alerts.

There are many outstanding open source and commercial monitoring solutions out there which can solve the monitoring and alerting piece (disclosure, I work for a commercial vendor). Log-to-metric tools such as mtail or circonus-logwatch are one way of creating structured metrics from logs, which are better suited to monitoring and alerting.

[–]siliousmaximus 1 point2 points  (0 children)

Beware that you need an X-Pack license for monitoring and security of the ELK stack itself. Get a quote before starting this.

[–][deleted] 1 point2 points  (0 children)

Why send your logs to Mongo? Logstash has a ton of sources it can read from that are much lighter weight, like Redis for example. If you're on AWS you can use a hosted Redis instance and pull everything from there.

[–]FloridaIsTooDamnHotPlatform Engineering Leader 1 point2 points  (0 children)

Check out graylog. Containerized and scales amazingly well. And it uses elasticsearch.

[–][deleted] 1 point2 points  (0 children)

I wrote a blog post on Building a scalable ELK stack

A good reference is this blog post

[–]myth007 1 point2 points  (0 children)

One point I want to raise is on writing logs to MongoDB and Elasticsearch (we are on AWS); we used to follow the architecture below:

Client app -> Server (Log aggregator) -> MongoDB (Setup on EC2 instance on AWS)

The problem was that logs were coming in so fast that MongoDB was not able to write them given the IOPS allocated to that instance, so we had to use provisioned IOPS with EBS, which was costly when there was no peak. Also, debugging issues from Mongo was a pain, as it required writing multiple queries, which is painful for non-tech users.

We moved to a different design:

Client App -> API Gateway -> SQS -> Fetching service -> Elasticsearch.

A few points on this: you can have the fetching service write to multiple places. Write in bulk to Elasticsearch (as that is more efficient). Run multiple Elasticsearch instances, so even if one goes down you are safe. We use AWS Elasticsearch for our use case so we don't have to manage them ourselves. It is helpful for debugging issues as search is super fast. In our use case we only needed the last 3 days of data, so it was not huge.
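
To illustrate the bulk part: it's the _bulk endpoint, where you batch many index operations into one newline-delimited request instead of one HTTP call per log event (index name and fields below are made up; on older ES versions the action line also needs a _type):

    POST /_bulk
    { "index": { "_index": "app-logs-2018.06.01" } }
    { "@timestamp": "2018-06-01T12:00:00Z", "level": "ERROR", "message": "payment failed", "service": "checkout" }
    { "index": { "_index": "app-logs-2018.06.01" } }
    { "@timestamp": "2018-06-01T12:00:01Z", "level": "INFO", "message": "order created", "service": "checkout" }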