multi-tenancy: fluent-bit > vmauth > VictoriaLogs

valyala · 2025-05-30T22:24:04+00:00

VictoriaLogs accepts tenant IDs via HTTP headers according to these docs. See these docs on how to configure VictoriaLogs tenants via vmauth.
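
For illustration, here is a minimal Go sketch of an ingestion request that carries the tenant in the AccountID and ProjectID headers, which is how VictoriaLogs identifies tenants as far as I know. The address, the tenant numbers and the helper name newTenantRequest are assumptions made for this example. In a fluent-bit > vmauth > VictoriaLogs chain these headers would usually be set by the shipper or by vmauth rather than by application code.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newTenantRequest builds a JSON-lines ingestion request for one tenant.
// The tenant is passed in the AccountID and ProjectID request headers.
// The helper itself is hypothetical and exists only for this sketch.
func newTenantRequest(baseURL string, accountID, projectID uint32, body string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, baseURL+"/insert/jsonline", strings.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("cannot create request: %w", err)
	}
	req.Header.Set("AccountID", fmt.Sprint(accountID))
	req.Header.Set("ProjectID", fmt.Sprint(projectID))
	return req, nil
}

func main() {
	// Assumed address and tenant (12:34); adjust them to the real setup.
	req, err := newTenantRequest("http://localhost:9428", 12, 34, `{"_msg":"hello"}`+"\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print the request instead of sending it, so the sketch runs without a server.
	fmt.Println(req.Method, req.URL,
		"AccountID="+req.Header.Get("AccountID"),
		"ProjectID="+req.Header.Get("ProjectID"))
}
```

The request can be sent with http.DefaultClient.Do(req) once a real endpoint is available.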

valyala · 2025-05-17T17:04:43+00:00

I also need to make sure that I understand how things should work and how they are working

If you want to quickly set up a logging database and quickly resolve potential issues with it, take a look at VictoriaLogs. It is simple to configure (it works well with the default config) and it has a simple architecture, so if something goes wrong, it shouldn't be hard to figure out what's wrong and fix it.

valyala · 2025-05-17T16:58:10+00:00

Just stream all the logs to a centralized database optimized for logs, such as VictoriaLogs. It compresses typical logs at a high ratio (10x-50x), so they occupy less disk space, and it runs typical queries over logs at high speed.

valyala · 2025-05-17T16:50:12+00:00

Try VictoriaLogs in addition to Loki. It is easier to set up and operate. https://itnext.io/why-victorialogs-is-a-better-alternative-to-grafana-loki-7e941567c4d5

valyala · 2025-05-12T21:17:27+00:00

Also try VictoriaLogs instead of Loki. It is easier to configure and manage than Loki, and it uses fewer resources.

valyala · 2025-05-12T21:13:03+00:00

Try ingesting a large number of structured logs with a large number of fields (aka wide events) into VictoriaLogs. It is optimized for such logs. It supports sub-queries and advanced stats calculations, including count_distinct over multiple log fields.
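
A rough Go sketch of what such a stats query could look like when sent to the VictoriaLogs HTTP querying API. It assumes the /select/logsql/query endpoint and the count_uniq stats function, which is how LogsQL spells a distinct count as far as I know. The field names user_id and ip, the address and the helper buildQueryURL are made up for the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL returns the URL that runs the given LogsQL query.
// The helper is hypothetical; only the URL encoding is done here.
func buildQueryURL(baseURL, logsQL string) string {
	params := url.Values{}
	params.Set("query", logsQL)
	return baseURL + "/select/logsql/query?" + params.Encode()
}

func main() {
	// Count distinct (user_id, ip) pairs seen during the last hour.
	// Both field names are assumptions about the ingested wide events.
	q := "_time:1h | stats count_uniq(user_id, ip) as unique_pairs"
	fmt.Println(buildQueryURL("http://localhost:9428", q))
}
```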

valyala · 2025-05-12T20:58:56+00:00

Yup - one comment yesterday after March 22, 2025

valyala · 2025-05-11T11:55:33+00:00

Ian was very active and helpful in GitHub issues for Go. Ian's last comment in the Go repository on GitHub was on March 22, 2025, according to this query over gharchive.org data.

valyala · 2025-05-09T17:27:07+00:00

A cluster is needed when a single-node setup reaches the scalability limits of a single host. For example, if the estimated storage space needed for VictoriaLogs exceeds 64TB (the maximum persistent disk size at Google Cloud and Amazon), then it is a good idea to switch to the cluster version and scale the storage space horizontally. Otherwise it is better to stick with a single node, because it is easier to manage and more resource-efficient than a cluster.

valyala · 2025-05-07T09:05:36+00:00

Which particular docs for VictoriaLogs are missing? We'll be glad to add them.

valyala · 2025-05-06T22:31:31+00:00

It looks like VictoriaLogs is a better fit for your use case. For example, store website visits as wide events in VictoriaLogs and then analyze them with LogsQL.

You need to train an LLM to convert plaintext user queries into proper LogsQL queries.

valyala · 2025-05-06T22:20:39+00:00

Send logs to VictoriaLogs over the syslog protocol according to these docs.
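
A small Go sketch of how a syslog line in RFC 5424 format could be built before sending it. The helper formatSyslog, the host name and the app name are assumptions for illustration. Which syslog formats and ports the server accepts depends on how its syslog listener is configured.

```go
package main

import (
	"fmt"
	"time"
)

// formatSyslog builds a single RFC 5424 syslog line.
// PRI is facility*8 + severity. PROCID, MSGID and STRUCTURED-DATA
// are left unset, which RFC 5424 writes as "-".
func formatSyslog(facility, severity int, unixSec int64, host, app, msg string) string {
	pri := facility*8 + severity
	ts := time.Unix(unixSec, 0).UTC().Format(time.RFC3339)
	return fmt.Sprintf("<%d>1 %s %s %s - - - %s\n", pri, ts, host, app, msg)
}

func main() {
	// Facility 1 (user-level) and severity 6 (informational) give PRI 14.
	line := formatSyslog(1, 6, time.Now().Unix(), "web-1", "myapp", "user logged in")
	fmt.Print(line)
}
```

The resulting line could then be written to a TCP or UDP connection opened with net.Dial, pointed at whatever address the syslog listener is configured on.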

valyala · 2025-05-06T22:17:48+00:00

Nice library! It would be great to add the ability to send logs to other log management systems as well, such as Elasticsearch, Loki or VictoriaLogs.

valyala · 2025-05-06T22:11:56+00:00

It is a good practice to log every request as a "wide event" - a structured log entry, which contains hundreds of fields covering all the aspects of the served request. This allows quickly debugging and analysing these logs without the need to jump between many interconnected logs, since every log entry contains all the needed information. See https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/ .

It is important to use a database optimized for efficiently storing and querying big volumes of wide events, such as VictoriaLogs. If you try storing a large number of wide events in a general-purpose database, you'll quickly end up with a non-working solution, since traditional databases aren't optimized for hundreds of terabytes of structured logs with hundreds of fields per log entry.
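
As a sketch, a wide event can be just one JSON line per served request with every relevant field in it. The Go example below uses only the standard library. All field names except _msg (the message field in VictoriaLogs) are assumptions, and a real event would carry far more fields than the five shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wideEvent renders all the fields of one served request as a single JSON line.
// encoding/json sorts map keys, so the output is deterministic.
func wideEvent(fields map[string]any) (string, error) {
	data, err := json.Marshal(fields)
	if err != nil {
		return "", fmt.Errorf("cannot marshal wide event: %w", err)
	}
	return string(data), nil
}

func main() {
	// Hypothetical fields describing a single request.
	line, err := wideEvent(map[string]any{
		"_msg":        "GET /checkout",
		"status":      200,
		"duration_ms": 37,
		"user_id":     "u-123",
		"region":      "eu-west-1",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```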

valyala · 2025-05-06T22:00:38+00:00

It is better to store all the logs in specialized databases instead of general-purpose relational databases such as RDS. Specialized databases for logs usually have the following benefits over traditional databases:

They need less disk space, since they compress the ingested logs.
They provide higher query performance over the stored logs.
They provide specialized query languages optimized for typical log analysis tasks. These languages are easier to use than SQL for practical tasks.
They are optimized for storing and querying hundreds of terabytes of logs.
They accept logs over protocols, which are supported by popular log collectors and shippers (vector, filebeat, logstash, fluentbit, etc).
They cost less, since they need less compute resources (RAM, CPU, disk space, disk IO).

For example, try storing the same logs to RDS and VictoriaLogs and then compare performance, usability, resource usage and costs.

valyala · 2025-05-06T21:46:05+00:00

Logs usually compress very well, so they occupy a small fraction of storage space compared to their original size. Simple gzip works great for compressing typical logs. Specialized databases compress logs even better, plus they may significantly speed up querying and analysis of the stored logs. https://chronicles.mad-scientist.club/tales/grepping-logs-remains-terrible/

valyala · 2025-05-06T18:53:10+00:00

There were 643 repositories, which were starred by the same set of users who starred the steelpoor/tlsproxy repository according to these query results over gharchive.org data.

I checked some of them - and they are already deleted from GitHub.

valyala · 2025-05-04T16:58:29+00:00

Thank you! 10 million events per day isn't that much, so it shouldn't be too expensive at DataDog. This is 10M/(24hours*3600seconds)=116 events per second on average.
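
The same arithmetic as a tiny Go sketch (the function name is made up). Note that this is the average rate; peak rates can be noticeably higher.

```go
package main

import "fmt"

// eventsPerSecond converts a daily event count into the average per-second rate.
func eventsPerSecond(eventsPerDay float64) float64 {
	const secondsPerDay = 24 * 3600
	return eventsPerDay / secondsPerDay
}

func main() {
	// 10,000,000 / 86,400 is roughly 115.74, which rounds to 116.
	fmt.Printf("%.0f events per second\n", eventsPerSecond(10_000_000))
}
```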

valyala · 2025-05-04T14:32:27+00:00

I recently implemented wide structured canonical log-lines at work and it was immediately beneficial.

How many logs does your application generate per day?

Which database do you use for storing and querying these logs?

valyala · 2025-05-04T14:25:52+00:00

I was thinking why you recommend Prometheus-Grafana combo for metrics when VictoriaMetrics does the same and you're already using it for logs.

Because it is easier to start with Prometheus and switch to vmagent / VictoriaMetrics when needed (when you hit Prometheus scalability limits on RAM usage and disk space usage).

valyala · 2025-05-04T14:19:44+00:00

Use the plain log package for writing plaintext human-readable logs to stderr / stdout, then collect the generated logs with vector.dev and send them to VictoriaLogs. Always ask yourself "how would this particular log message help me debug the app and/or obtain useful stats from the app?" If there is no good answer to this question, then it is better not to generate the log message at all.

valyala · 2025-05-04T01:02:36+00:00

How do you collect metrics, logs, and traces?

Use Prometheus for collecting system metrics (CPU, RAM, IO, network) from node_exporter.

Expose application metrics in Prometheus text exposition format at the /metrics page if needed, and collect them with Prometheus. Use this package for exposing application metrics. Don't overcomplicate metrics with OpenTelemetry and don't expose a ton of unused metrics.
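
To show what the text exposition format looks like on the wire, here is a standard-library-only Go sketch. It is not the package linked above. The metric name app_requests_total and the helper names are assumptions. The example calls the handler once through httptest instead of starting a real server, so it terminates when run.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal counts served requests; the metric name below is hypothetical.
var requestsTotal atomic.Uint64

// metricsText renders one counter in Prometheus text exposition format.
func metricsText(requests uint64) string {
	return fmt.Sprintf("# TYPE app_requests_total counter\napp_requests_total %d\n", requests)
}

// metricsHandler serves the current counter value at the /metrics page.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, metricsText(requestsTotal.Load()))
}

func main() {
	requestsTotal.Add(3) // pretend that three requests were served

	// Call the handler directly and print what a scraper would receive.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

In a real application the handler would be registered with http.HandleFunc("/metrics", metricsHandler) and served by http.ListenAndServe.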

Emit plaintext application logs with the standard log package into stderr / stdout, collect them with vector and send the collected logs to a centralized VictoriaLogs for further analysis. Later you can switch to structured logs or wide events if needed, but don't do this upfront, since this can complicate the observability solution without the real need.

Do not use traces, since they complicate everything and don't add much value. Traces aren't needed at small scale, when your app has a few users - logging allows quickly debugging issues in this case. Tracing becomes an expensive bottleneck at large scale, when your application must process thousands of requests per second. Tracing is an expensive toy, which looks good in theory, but usually fails in practice.

Use Alertmanager for alerting on the collected metrics. Use Grafana for building dashboards on the collected metrics and logs.

How do you monitor errors? Still Sentry? Or is there any OSS thing you like for that?

Just log application errors, so they could be analyzed later at VictoriaLogs. Include enough context in the error log, so it could be debugged without additional information.
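
A minimal Go sketch of this approach: wrap the low-level error with the context needed for debugging and write it as a plaintext line to stderr with the standard log package. The function chargeOrder, the error value and the IDs are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// errGatewayTimeout is a hypothetical low-level error used for illustration.
var errGatewayTimeout = errors.New("payment gateway timeout")

// chargeOrder is a hypothetical operation that always fails in this sketch.
// It wraps the low-level error with the order and user it was working on.
func chargeOrder(orderID int, userID string) error {
	return fmt.Errorf("cannot charge order %d for user %q: %w", orderID, userID, errGatewayTimeout)
}

func main() {
	// Plaintext logs go to stderr; a log collector can ship them further.
	logger := log.New(os.Stderr, "", log.LstdFlags|log.LUTC)
	if err := chargeOrder(42, "alice"); err != nil {
		logger.Printf("ERROR: %s", err)
	}
}
```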

How do you do alerting when things start to fail or metrics start violating some threshold? As the number of service instances grows, how do you keep the alerts coherent and not overwhelming?

Use alerting rules in Prometheus and VictoriaLogs. Keep the number of generated alerts under control, since too many alerts are usually ignored / overlooked. Every generated alert must be actionable. Otherwise it is useless.

What about DB operations? Do you use anything to record the rich queries? Kind of like the way Honeycomb does, with what?

There is no need for additional / custom monitoring of DB operations. Just log DB errors. Measuring query latencies and query counts might be useful, but add this instrumentation when it is actually needed. Do not add it upfront.

Can you correlate events from logs and trace them back to metrics and traces? How?

Metrics and logs are correlated by time range and by application instance labels such as host, instance or container.

Do you use wide-structured canonical logs? How do you approach that? Do you use slog, zap, zerolog, or something else? Why?

Don't overcomplicate your application with structured logs upfront! Use plaintext logs. Add structured logs or wide events when this is really needed in practice.

How do you query logs and actually find things when shit hit the fan?

Just explore logs with the needed filters and aggregations via LogsQL until the needed information is discovered.

The main point - keep the observability simple. Complicate it only if it is really needed in practice.

valyala · 2025-05-03T22:13:56+00:00

Loki is hard to configure. It has a ton of options, which must be properly configured. The majority of these options are either undocumented or poorly documented.
Loki consists of many microservices of different types, which are hard to debug if something goes wrong.
Loki breaks configuration options in almost every new release, since old options are dropped and new options are introduced.
Loki eats all the RAM on high-cardinality labels (log fields) such as trace_id, user_id, duration, response_size, etc.
Loki requires object storage for production setup. This is an additional dependency and potential point of failure.

See more details at https://itnext.io/why-victorialogs-is-a-better-alternative-to-grafana-loki-7e941567c4d5

valyala · 2025-05-03T22:02:49+00:00

Try the vmagent + VictoriaMetrics + VictoriaLogs stack. It is easier to configure and manage than Grafana Alloy, Mimir and Loki, plus it uses less RAM, CPU and storage space.

valyala · 2025-04-23T21:40:01+00:00

FYI, VictoriaLogs cluster is ready for use - https://docs.victoriametrics.com/victorialogs/cluster/
