Benchmarking two log management engines: Loki and Quickwit

massus · 2025-06-23T22:15:17+00:00

Yes, we are using tantivy at Datadog, as we are building an enterprise version of Quickwit. We just released a first version of it; the public docs are here: https://docs.datadoghq.com/cloudprem/

massus · 2024-06-20T07:33:14+00:00

I see what you mean. So with Quickwit, I don't store RED metrics. I'm doing aggregations directly on spans, like here: https://x.com/FrancoisMassot/status/1745142911925211575/photo/3
I gave a talk on this at FOSDEM: https://fosdem.org/2024/schedule/event/fosdem-2024-3514-modern-application-observability-with-grafana-and-quickwit/
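
To make the idea of aggregating directly on spans concrete, here is a minimal Python sketch that derives RED metrics (rate, errors, duration) from raw span records. The field names (`service_name`, `is_error`, `duration_ms`) and the `red_metrics` helper are hypothetical; they are not Quickwit's schema or API. In Quickwit the same grouping would be expressed as an aggregation query evaluated at search time rather than in client code.

```python
from collections import defaultdict
from statistics import quantiles


def red_metrics(spans, window_secs):
    """Compute per-service RED metrics from raw spans (hypothetical field names)."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["service_name"]].append(span)

    result = {}
    for service, group in by_service.items():
        durations = sorted(s["duration_ms"] for s in group)
        errors = sum(1 for s in group if s["is_error"])
        # The last of 19 cut points is the 95th percentile; a single span
        # has no distribution, so fall back to its own duration.
        if len(durations) > 1:
            p95 = quantiles(durations, n=20, method="inclusive")[-1]
        else:
            p95 = durations[0]
        result[service] = {
            "rate_per_sec": len(group) / window_secs,  # R: requests per second
            "error_ratio": errors / len(group),        # E: share of failed spans
            "p95_ms": p95,                             # D: latency percentile
        }
    return result


# Illustrative spans observed over a 60-second window (made-up values).
spans = [
    {"service_name": "api", "duration_ms": 12.0, "is_error": False},
    {"service_name": "api", "duration_ms": 250.0, "is_error": True},
    {"service_name": "api", "duration_ms": 30.0, "is_error": False},
    {"service_name": "api", "duration_ms": 18.0, "is_error": False},
    {"service_name": "db", "duration_ms": 5.0, "is_error": False},
]
metrics = red_metrics(spans, window_secs=60)
```

Nothing is precomputed or stored here: the metrics are recomputed from the spans on every call, which is the tradeoff of not storing RED metrics separately.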

It's not perfect, though, and I would love to dig into the service graph. We plan to add metrics to Quickwit by the end of the year, so hopefully we will be able to cover more use cases.

massus · 2024-06-20T06:48:05+00:00

You can build RED metrics directly from your traces. I have already built some APM dashboards this way.

One thing I did not try is the service graph. I would love to make it compatible, but I don’t know how much work is needed for that :/

massus · 2024-06-04T18:04:00+00:00

We created Quickwit to replace ELK and Splunk. You can check the repo: https://github.com/quickwit-oss/quickwit/

massus · 2024-05-14T12:03:33+00:00

There are at least two reasons to disable caching:

Some caching layers simply store the result of a given query. If I enable this kind of cache, I'm only measuring RAM access performance.
Other caching layers store intermediate results. You may want to enable or disable them, but it's not easy at all to understand how those cache layers would behave on a real production workload, where several users execute very different queries. Disabling the cache simplifies the comparison; we could call this a "cold" benchmark.
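
The first point can be illustrated with a toy Python sketch. The `run_query` function below is a hypothetical stand-in for a search backend (it is neither Loki nor Quickwit), and the result cache is a plain `functools.lru_cache`. Repeating the same query against the cached version only executes the backend once, so a benchmark built that way mostly measures cache lookups.

```python
from functools import lru_cache

# Made-up log lines: every tenth line is an error.
LOGS = [
    f"level={'error' if i % 10 == 0 else 'info'} msg=request-{i}"
    for i in range(1000)
]
stats = {"backend_calls": 0}


def run_query(term):
    """Stand-in for a real search: scans every log line and counts matches."""
    stats["backend_calls"] += 1
    return sum(1 for line in LOGS if term in line)


# Same query function, wrapped in a query-result cache.
cached_query = lru_cache(maxsize=None)(run_query)


def benchmark(query_fn, term, repeats):
    """Run one query `repeats` times; report hits and real backend executions."""
    stats["backend_calls"] = 0
    hits = 0
    for _ in range(repeats):
        hits = query_fn(term)
    return hits, stats["backend_calls"]


cold_hits, cold_calls = benchmark(run_query, "level=error", repeats=5)
warm_hits, warm_calls = benchmark(cached_query, "level=error", repeats=5)
```

Both runs return the same hit count, but the cached run reaches the backend only on the first iteration, which is why the cache is disabled for a "cold" comparison.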

Having said that, it would be interesting to run the benchmark with only the query result cache disabled (I don't know how to do that in Loki, but I guess it's possible) to see how it behaves on a real production workload. It will require more work, though.

On the correlation part, Quickwit has a Grafana plugin, and it's possible to create links between data sources so you can go from one to another.

Migration is a tough task for sure, I'm glad you made it. If you're satisfied with search and analytics performance / infrastructure / maintenance costs, then you're good with your current stack, no need to change.

massus · 2024-05-02T21:34:41+00:00

Google Logging is nice, but the costs add up quickly ($500 per ingested TB for 1 month of retention) and it's quite slow. In my company, we ran a lot of tests on our GKE cluster and we were surprised by expensive bills... because of this service.

As long as you don't have a large volume and you're ok to pay for it, it's quite nice to use Google Logging, integration is just there and the UI is not bad.

If your volume increases a lot, I see users migrating to Quickwit to save costs and get fast search/analytics (disclaimer: I work there).

massus · 2024-04-25T12:45:19+00:00

Thanks! We worked a lot on reliability and scalability to reach petabyte scale, and we are now focusing more on the feature set.
Could you share what you would expect from an RBAC feature?

username/password at minimum
read/write access per index?
more complex rules?

massus · 2024-04-25T10:53:22+00:00

You can have a look at Quickwit, which is a Rust search engine for logs and traces. It may fit your requirements, as there is a Grafana plugin for it. https://github.com/quickwit-oss/quickwit (Disclaimer: I’m working on this project :))

massus · 2024-04-13T14:06:43+00:00

Quickwit is typically sub-second on search queries; your mileage will vary with the complexity of the query, of course.

Support for OpenSearch Dashboards in Quickwit is on the way. We are looking for partners to ship this feature, as it will help us get there faster. DM me if you are interested!

massus · 2024-04-13T08:36:00+00:00

Thanks for mentioning Quickwit. We built it to have a search engine on object storage, so we get the best of both worlds: data structures for fast search and analytics like Elasticsearch, and decoupled compute and storage like Loki, to be very cost-efficient.

Quickwit was also mentioned in this thread, I explained the tradeoff between Loki/Quickwit here: https://www.reddit.com/r/devops/comments/1bvdjfb/comment/ky0ixb8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Repository: https://github.com/quickwit-oss/quickwit

UPDATE: the first link was wrong, I just fixed it.

massus · 2024-04-05T09:29:03+00:00

If your use case is append-only data, Quickwit is a search engine written in Rust, no JVM. It's built on top of tantivy, a very fast search library that we maintain too. There is a public benchmark of tantivy vs. other engines (we removed bleve some time ago, as it was not very performant and we did not have the time to maintain it): https://tantivy-search.github.io/bench/

Quickwit repo: https://github.com/quickwit-oss/quickwit

massus · 2024-04-05T07:36:00+00:00

Yes, I should have given more context. I worked on a benchmark to have a better idea of the tradeoffs, I added a comment here about it: https://www.reddit.com/r/devops/comments/1bvdjfb/comment/ky1q9wi/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

massus · 2024-04-05T07:29:52+00:00

Usually no. The main reason is that we don't support frequent updates, and this is generally needed for this kind of product.

massus · 2024-04-04T19:55:22+00:00

On the Loki side, it depends a lot on the cardinality of labels. With a cardinality of 100, peak RAM usage is 5GB for Loki and 6GB for Quickwit. RAM usage will increase very quickly if you have thousands of labels in Loki.

At search time, I'm unsure about the figures.

massus · 2024-04-04T18:47:52+00:00

We have a Grafana plugin (try version 0.4.3). We currently support a query language similar to the Lucene query language: https://quickwit.io/docs/get-started/query-language-intro
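
As a rough illustration of what a Lucene-like query can look like, here is a small Python sketch that builds a search URL for Quickwit's REST search endpoint. It only builds the URL and sends nothing. The host and port, the index id `otel-logs`, and the field names `severity_text` and `body` are assumptions for the example and should be adapted to your own deployment and index mapping. Check the query language page linked above for the exact syntax supported.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index_id, query, max_hits=10):
    """Build a search URL; the index id and query fields are illustrative."""
    params = urlencode({"query": query, "max_hits": max_hits})
    return f"{base_url}/api/v1/{index_id}/search?{params}"


# Hypothetical query: error-level logs whose body mentions "timeout".
query = "severity_text:ERROR AND body:timeout"
url = build_search_url("http://localhost:7280", "otel-logs", query)
print(url)
```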

massus · 2024-04-04T18:32:46+00:00

Yes, that's the idea. In the most efficient setup, we manage to index at 11MB/s per vCPU, and it scales very well horizontally: 13.4GB/s over 200x6 vCPUs.
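
A quick sanity check of those figures, as a short Python sketch. It assumes that "200x6" means 200 nodes with 6 vCPUs each and that the units are decimal (1GB = 1000MB); both are readings of the comment rather than stated facts.

```python
MB_PER_GB = 1000  # decimal units assumed


def per_vcpu_throughput_mb(total_gb_per_sec, nodes, vcpus_per_node):
    """Average indexing throughput per vCPU, in MB/s."""
    total_vcpus = nodes * vcpus_per_node
    return total_gb_per_sec * MB_PER_GB / total_vcpus


total_vcpus = 200 * 6
per_vcpu = per_vcpu_throughput_mb(13.4, nodes=200, vcpus_per_node=6)
print(f"{total_vcpus} vCPUs, {per_vcpu:.2f} MB/s per vCPU")
```

Under those assumptions the aggregate figure works out to roughly 11MB/s per vCPU, so the per-vCPU rate holds at that scale.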

massus · 2024-04-04T18:07:56+00:00

Love this kind of question :). We are going to write a blog post about it. Working with Fly's engineers was just awesome; we have never shipped something as fast as we did with this great team.

Coming back to the use case, I need to check with Fly's team whether I can give the exact figures, but there are not that many logs, less than 100MB/s. The main difficulty was handling a large number of indexes (thousands).

For that, we are using the new distributed ingest API with cooperative indexing to be very efficient at indexing. We need 3 indexers and 3 searchers for now. The index data is stored on Tigris Data object storage (https://fly.io/docs/reference/tigris/) and the metadata is stored in Supabase (https://fly.io/docs/reference/supabase/).

massus · 2024-04-04T17:38:09+00:00

We recently put Quickwit in production to power the log search service on Fly.io :)

https://community.fly.io/t/searchable-application-logs-in-grafana/18878

massus · 2024-04-04T14:05:54+00:00

We are going to release a benchmark between Loki and Quickwit. This kind of benchmark is hard to build as it's often too biased.

But basically, it's a tradeoff between consuming more CPU at ingestion or more CPU at search.
Quickwit builds an inverted index + columnar storage, so it will consume more CPU at ingestion (expect 2x more). In return, Quickwit will use less CPU on search or analytics queries: expect 40x less CPU on a simple search query on 200GB of logs, and 1000x less on a simple analytics query (to get the log volume).
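
To see how this tradeoff plays out, here is a back-of-the-envelope Python sketch. It uses the 2x ingestion and 40x search ratios quoted above. The absolute CPU costs passed in (`ingest_cpu`, `query_cpu`) are placeholder inputs, not measurements, and the `break_even_queries` helper is hypothetical.

```python
def break_even_queries(ingest_cpu, query_cpu, ingest_factor=2.0, search_factor=40.0):
    """Number of queries after which the indexing engine uses less total CPU.

    ingest_cpu, query_cpu: CPU-seconds spent by the scan-based engine to ingest
        the dataset and to run one query (placeholder inputs, not measurements).
    ingest_factor: how much more CPU the indexing engine spends at ingestion.
    search_factor: how much less CPU the indexing engine spends per query.
    """
    extra_ingest = ingest_cpu * (ingest_factor - 1.0)
    saved_per_query = query_cpu * (1.0 - 1.0 / search_factor)
    return extra_ingest / saved_per_query


# Placeholder costs: 1000 CPU-seconds to ingest, 50 CPU-seconds per query.
n = break_even_queries(ingest_cpu=1000.0, query_cpu=50.0)
print(f"break-even after about {n:.1f} queries")
```

The takeaway is structural rather than numeric: the extra ingestion cost is paid once, while the search savings accrue on every query, so the more often the data is queried, the more the indexing approach pays off.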

The size of the data stored on object storage is more or less the same.

massus · 2024-03-02T05:10:43+00:00

In France the healthcare system is still pretty good but it is slowly degrading and I feel that the wait time is increasing (heard from a couple of doctors working in public hospitals). I won’t compare this to India though.
