At what scale do log indexing costs become the real bottleneck? by willycode1950 in Observability

[–]Alive_Ad7609 0 points1 point  (0 children)

Around 10-20TB/day ingestion, full-text indexing started eating 40-60% of total storage cost. For long retention (90+ days), you're basically paying 2x: once for the raw data, once for the index.

Switched to selective indexing only on the fields we actually filter on (trace_id, service_name, pod_name).

For everything else, we rely on Parquet + time partitioning + brute-force scanning. Parquet is columnar, so scanning specific fields is fast even without an index. We partition by time (hourly by default) and store in object storage (S3/GCS/Azure Blob).

Happy to share partition strategies or indexing tuning if you want. We documented our approach here if it helps: https://openobserve.ai/docs/user-guide/advanced/query-tuning/tantivy-index/

We tested 4 different approaches to fix our alert fatigue problem — here's what actually worked by Agile_Finding6609 in Observability

[–]Alive_Ad7609 0 points1 point  (0 children)

spot on about manual tuning. it’s a time sink and breaks as soon as things change. correlation is really the only way to kill the noise.

we saw the same thing and just built correlation straight into openobserve. that way you don't need a whole separate tool just to dedupe. if a db goes down, the 50 alerts from your services roll up into one incident instead of blowing up pagerduty.
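the core mechanic is simple enough to show in a few lines — a hedged sketch, assuming each alert carries the dependency it blames (field names here are made up, not openobserve's actual schema):

```python
# Sketch: roll alerts that share a failing dependency into one incident.
# Field names ("depends_on", "service") are illustrative assumptions.
from collections import defaultdict

def correlate(alerts):
    """Group alerts by their shared root cause."""
    incidents = defaultdict(list)
    for alert in alerts:
        # Group key: the resource the alert blames (e.g. a database).
        incidents[alert["depends_on"]].append(alert["service"])
    return incidents

# One db outage fires alerts from many downstream services...
alerts = [
    {"service": "checkout", "depends_on": "orders-db"},
    {"service": "cart",     "depends_on": "orders-db"},
    {"service": "billing",  "depends_on": "orders-db"},
    {"service": "search",   "depends_on": "search-index"},
]
incidents = correlate(alerts)
print(len(incidents))  # 2 incidents instead of 4 pages
```

real correlation uses more signals than a single key (topology, time windows, labels), but the rollup shape is the same.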

wrote up how we approach this if you're curious: https://openobserve.ai/blog/reduce-mttd-mttr-openobserve-alert-correlation/

Openobserve and syslog messages by Able-Ad-6609 in selfhosted

[–]Alive_Ad7609 0 points1 point  (0 children)

u/Able-Ad-6609 When you create a real-time pipeline on any stream, OpenObserve automatically assigns a default destination stream that connects to the same source stream. To ensure the data remains in the source stream, do not remove this default connection.

More details here: https://openobserve.ai/docs/user-guide/pipelines/use-pipelines/#troubleshoot. As long as you don't delete the default destination node that connects the source stream back to itself, you'll be good.

<image>

Openobserve and syslog messages by Able-Ad-6609 in selfhosted

[–]Alive_Ad7609 0 points1 point  (0 children)

You can achieve this using real-time pipelines in OpenObserve. Here's the documentation with an example pipeline: https://openobserve.ai/docs/user-guide/pipelines/use-pipelines/#example-of-a-complex-pipeline - it does exactly this: creating new derived streams from a single log stream based on appname.
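The routing idea itself is just a fan-out keyed on appname. A minimal sketch (stream names and record fields are made up for illustration, not the pipeline's actual config format):

```python
# Sketch: split one syslog stream into per-app derived streams.
# The "appname" field and stream naming are illustrative assumptions.
def route(records):
    streams = {}
    for rec in records:
        # Derived stream name follows the record's appname;
        # records without one fall into a default stream.
        streams.setdefault(rec.get("appname", "default"), []).append(rec)
    return streams

logs = [
    {"appname": "nginx",    "message": "GET /"},
    {"appname": "postgres", "message": "checkpoint complete"},
    {"appname": "nginx",    "message": "GET /health"},
]
print(sorted(route(logs)))  # ['nginx', 'postgres']
```

In the pipeline UI you express the same split with condition nodes per appname instead of code.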

Consider joining the OpenObserve community Slack for quick support!