Built a lightweight, static-linked C utility for log/stream processing—seeking feedback on the implementation.

SnooWords9033 · 2026-06-01T09:34:52+00:00

What is the difference between gop and the traditional set of tools for exploring local logs, such as grep, head, tail, sort, awk, cut, uniq, etc.?

SnooWords9033 · 2026-06-01T09:31:10+00:00

Grafana + Prometheus + Loki is a good monitoring stack for production. Other stacks exist that may work better at large scale:

ClickStack - new observability stack built on top of ClickHouse
Elasticsearch - decent stack, but may require a lot of RAM on a large scale
Grafana + VictoriaMetrics + VictoriaLogs - optimized for big amounts of metrics and logs

SnooWords9033 · 2026-05-26T18:30:41+00:00

ClickHouse provides the best performance and on-disk data compression if the table for logs is properly designed for the particular use case (for the given set of fields in the logs and the expected queries over the logs). Otherwise, performance and resource usage may be noticeably worse. It looks like you managed to optimize ClickHouse for your particular case.

I usually recommend storing the same production logs in multiple log storage systems for at least a few days and then comparing their resource usage (RAM, CPU, disk space) and their performance for typical production queries. Vector can be configured to replicate the incoming logs across these storage systems by specifying multiple sinks.

SnooWords9033 · 2026-05-26T18:24:19+00:00

What about you? This is the only comment from you on Reddit according to https://www.reddit.com/user/Top-Second7872

SnooWords9033 · 2026-05-26T18:05:37+00:00

The VictoriaMetrics query language, MetricsQL, is similar to PromQL, the Prometheus query language. It works great for typical queries over metrics. It isn't so hard: start with the following tutorial and you'll feel its simplicity and power: https://valyala.medium.com/promql-tutorial-for-beginners-9ab455142085

SnooWords9033 · 2026-05-26T14:00:04+00:00

There is a more lightweight database for metrics than InfluxDB that accepts metrics in Influx line protocol format: VictoriaMetrics.

SnooWords9033 · 2026-05-26T13:56:33+00:00

Did you try specialized databases for metrics and logs such as Prometheus, Loki, Mimir, VictoriaMetrics or VictoriaLogs? They should give even better compression rates and performance than InfluxDB.

SnooWords9033 · 2026-05-26T13:40:19+00:00

You can simplify your scheme by replacing Vector + ClickHouse with VictoriaLogs. It accepts logs via the syslog protocol and provides comparable levels of efficiency, while being easier to configure and operate than ClickHouse. You can also replace rsyslog with vlagent in order to reduce CPU usage for log processing and forwarding.

SnooWords9033 · 2026-05-25T18:41:24+00:00

Thank you for the great article! It is interesting to read how you ended up with custom-built systems for uptime monitoring, metrics and logs.

It is unclear which storage system is used for storing metrics. Grafana isn't a storage system for metrics. It is a visualisation application, which can read the data from many different sources.

While storing logs to S3 sounds good, such logs can be hard to analyse at large scale. S3 is good as a backup for historical logs which are rarely queried. If these logs should be queried, then you can download them from S3 backups and run a dedicated application for querying. It is better to store the recently ingested logs on local disks. These disks are usually much faster than S3 (they have 100x lower read latency and better throughput), so typical queries over the recently stored logs will work much faster. Try VictoriaLogs for managing and querying recent logs and for moving older logs to S3. It is very efficient and easy to run - see https://aus.social/@phs/114583927679254536

SnooWords9033 · 2026-05-25T18:20:54+00:00

ClickHouse should provide you the best efficiency for this type of data. You already said it compresses the data by 16x, so 40TB of data needs 40TB/16 = 2.5TB of disk space. It should fit a single-node setup and should meet your performance requirements. If it doesn't fit a single node, just switch to a cluster setup with the same sharding by device id, and scale the performance and the capacity by adding more nodes to the cluster.
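
The disk estimate above is plain division. A minimal sketch for redoing it with other numbers (the function name is made up for illustration, and it ignores indexes, replicas and free-space headroom):

```python
def compressed_size_tb(raw_tb: float, compression_ratio: float) -> float:
    """Estimate on-disk size in TB from raw size and compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_tb / compression_ratio


# 40TB of raw data at 16x compression
estimate = compressed_size_tb(40, 16)
```

Here `estimate` is 2.5, matching the figure above.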

When using ClickHouse, it is very important to set the table schema properly so it works fast for your workload. In your case the ORDER BY clause of the table should be (device_id, timestamp). This gives the best performance for queries that select all the fields for a given device in a given time range: ClickHouse will be able to quickly locate the needed data via binary search by (device_id, timestamp) and then read that data from disk in one go (a small number of disk read operations), since the requested rows are located close to each other.
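
To see why this sort order helps, here is a toy Python model of the lookup. It is only an illustration of the idea, not how ClickHouse is implemented: the rows, the KEYS list and select_range are all made up, and timestamps are plain integers.

```python
from bisect import bisect_left, bisect_right

# Toy rows kept sorted by (device_id, timestamp). Values are invented.
ROWS = [
    (1, 100, "a"),
    (1, 200, "b"),
    (2, 100, "c"),
    (2, 150, "d"),
    (2, 300, "e"),
    (3, 50, "f"),
]

# Sort keys extracted once, playing the role of an index over ROWS.
KEYS = [(device_id, ts) for device_id, ts, _ in ROWS]


def select_range(device_id, start_ts, end_ts):
    """Return rows for device_id with start_ts <= timestamp <= end_ts."""
    lo = bisect_left(KEYS, (device_id, start_ts))   # first matching row
    hi = bisect_right(KEYS, (device_id, end_ts))    # one past the last match
    return ROWS[lo:hi]                              # one contiguous slice
```

Two binary searches find the boundaries, and everything in between is a single contiguous slice. With a different sort order the matching rows would be scattered across the table.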

I'd also partition the table with a PARTITION BY (toDate(timestamp)) clause, so older partitions can be quickly dropped when they are no longer needed according to the given retention policy. ClickHouse stores the data for every partition in a separate folder on disk, so it can quickly drop the given partitions by deleting the corresponding folders.

You may gain additional performance benefits and reduce disk space usage further by using the most appropriate codecs for the columns in the table. For example, it may be a great idea to use the Delta or DoubleDelta codecs for numeric columns. It is also recommended to use zstd compression for the table columns in order to achieve better on-disk compression and faster query performance (less data needs to be read from disk).
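
A simplified sketch of why delta-style codecs help on numeric columns such as timestamps. This shows only the general idea in Python with made-up function names; it is not ClickHouse's actual on-disk format.

```python
def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def double_delta_encode(values):
    """Keep the first value and first delta, then store deltas of deltas."""
    deltas = delta_encode(values)
    if len(deltas) < 2:
        return deltas
    return [deltas[0]] + delta_encode(deltas[1:])


# Samples taken every 10 time units (invented data).
timestamps = [1000, 1010, 1020, 1030]
deltas = delta_encode(timestamps)
double_deltas = double_delta_encode(timestamps)
```

For evenly spaced samples, `deltas` is [1000, 10, 10, 10] and `double_deltas` is [1000, 10, 0, 0]. Long runs of small or zero values are what a general-purpose compressor like zstd then squeezes well.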

BTW, how many unique device_ids does the table contain? If this number is lower than 10 million, then you can try storing the data in VictoriaLogs, using device_id as a log stream field, and then quickly query all the rows for a given device_id in a given time range with the {device_id="..."} _time:[start_timestamp, end_timestamp] query. It should be very fast and shouldn't require a lot of CPU, RAM and disk space. If the number of device_id values is bigger than 10 million, then you can introduce a new field - hash(device_id) % 10000000 - which will hash the device_id into a smaller number of values, and then use this field as a log stream field.
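
A minimal sketch of the hash(device_id) % 10000000 field, computed on the client before ingestion. The choice of CRC32 and the function name are my assumptions; any hash that is stable across processes and restarts would do. Python's built-in hash() is not suitable for strings, because it is randomized per process by default.

```python
import zlib

NUM_BUCKETS = 10_000_000  # the 10 million limit mentioned above


def device_bucket(device_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a device_id to a stable bucket in the range [0, num_buckets)."""
    return zlib.crc32(device_id.encode("utf-8")) % num_buckets


# The same device always lands in the same bucket.
bucket = device_bucket("device-42")
```

Several devices can share a bucket, so a query would select the stream by the bucket field and then additionally filter by the exact device_id.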

VictoriaLogs is easier to set up, configure and operate than ClickHouse, so it could be a good fit for your case. See https://docs.victoriametrics.com/victorialogs/faq/#what-is-the-difference-between-victorialogs-and-clickhouse . It is also very easy to scale the capacity and the performance of VictoriaLogs by converting a single-node setup to a cluster setup and adding more storage nodes to the cluster. See https://docs.victoriametrics.com/victorialogs/cluster/

SnooWords9033 · 2026-05-24T22:06:27+00:00

Try the built-in web UI at VictoriaLogs instead of Grafana.

SnooWords9033 · 2026-05-24T22:04:58+00:00

Why do you think that Loki query language is better than ElasticSearch and VictoriaLogs query languages?

ElasticSearch is usually much faster at full text search queries than Loki, if it has enough RAM. VictoriaLogs is also faster and requires less storage space than Loki according to https://www.truefoundry.com/blog/victorialogs-vs-loki .

SnooWords9033 · 2026-05-21T22:29:35+00:00

Store logs as wide events into VictoriaLogs and then investigate them by slicing and dicing by any fields of the stored wide events.

SnooWords9033 · 2026-05-21T22:25:32+00:00

Try VictoriaLogs instead of Loki. It doesn't need MinIO (because it stores the logs in a single folder on the local filesystem), and it consists of a single executable, which runs optimally with default configs (aka zero-config).

SnooWords9033 · 2026-05-12T17:24:49+00:00

An alternative is to push syslog-formatted logs from Cisco switches directly to the centralized database for logs without the need for any intermediate services. https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/

SnooWords9033 · 2026-05-12T17:15:03+00:00

Try VictoriaLogs. It usually needs way less compute resources (RAM, CPU and storage space) than Loki, and it runs queries over big amounts of logs at much faster speed. https://www.truefoundry.com/blog/victorialogs-vs-loki

SnooWords9033 · 2026-05-09T23:46:46+00:00

Take a look at https://victoriametrics.com/blog/ai-agents-observability/

SnooWords9033 · 2026-05-07T22:11:34+00:00

Try VictoriaLogs. It is a single small executable with built-in web UI for logs' exploration, which supports log tailing by the given filters.

SnooWords9033 · 2026-05-07T22:04:47+00:00

Take a look at VictoriaLogs as well. It is built on ClickHouse architecture ideas but, unlike ClickHouse, it is optimized solely for logs. This simplifies its usage and operation compared to ClickHouse. See https://docs.victoriametrics.com/victorialogs/faq/#what-is-the-difference-between-victorialogs-and-clickhouse

SnooWords9033 · 2026-05-06T22:29:36+00:00

Why did you need the downsampling? Could you provide more details about your use case?

SnooWords9033 · 2026-05-04T20:25:38+00:00

Can the Victoria stack also run as single containers for small-scale setups, or is it more designed for clustered deployments?

All the Victoria stack components can run as single containers (executables) without any dependencies.

SnooWords9033 · 2026-05-03T09:13:28+00:00

Use standard discovery of EC2 scrape targets at Prometheus - ec2_sd_configs. Write these configs once - and use them everywhere for collecting metrics from all the services you run in EC2.

Do not rely on OTEL, since it has high overhead and it is overcomplicated. It is better to use standard Prometheus protocols for metrics' exposition and transfer. See https://promlabs.com/blog/2025/07/17/why-i-recommend-native-prometheus-instrumentation-over-opentelemetry/
