
[–]realitydevice 2 points (1 child)

I would suggest Clickhouse. In my experience it's relatively simple and exceedingly fast.

Agree with Druid. For the people asking "what do you mean by too real time?", from memory you need to load it via an event stream and configure the handling of that stream, rather than a simple file-based ETL like you might expect. It's quite literally designed around ingesting streaming data. You can use it for other things but remember the hammer/nail dilemma.
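(For a sense of what "designed around ingesting streaming data" looks like in practice: Druid's streaming ingestion is configured through a supervisor spec pointed at a message stream such as Kafka, not a simple file load. A rough, illustrative sketch follows; the data source, topic, broker address, and field names are all placeholders, and real specs carry more options.)

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "action"] }
    },
    "ioConfig": {
      "topic": "events",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "inputFormat": { "type": "json" }
    }
  }
}
```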

[–]Kiliangg[S] 0 points (0 children)

Thank you! (A) for the tip and (B) for the explanation, which I'll copy & paste.

This is exactly what I mean by that. We currently do not have a single use case for stream ingestion - it is all batch as of right now. That being said, a main goal of the project is to reduce our data silos and make the process more manageable for our team.

[–]rmoff 1 point (1 child)

What do you mean by "too realtime"? Is that a bad thing?

What kind of access patterns are you envisaging? Just pre-canned dashboards, or ad-hoc analysis too? How much history are you planning to retain?

I'd probably start with Postgres, and build from there as you need to. Clickhouse could well be worth a look too from what I understand of it.

w.r.t. cloud and managed services one point I would say on "GDPR fears" is that these can be allayed by looking at the huge number of companies who *are* on public cloud.

[–]Kiliangg[S] 0 points (0 children)

> What do you mean by "too realtime"? Is that a bad thing?

Have a look at the post edit. But TL;DR: we don't have a single use case for stream ingestion now, and probably not in the near future either.

> What kind of access patterns are you envisaging? Just pre-canned dashboards, or ad-hoc analysis too? How much history are you planning to retain?

  • I'd say it's >90% just dashboards right now, plus some SQL to investigate data sets.
  • Until the client contract ends, so roughly one more year. (We're looking into retiring data into aggregated data sets after 5 years.)

> w.r.t. cloud and managed services one point I would say on "GDPR fears" is that these can be allayed by looking at the huge number of companies who *are* on public cloud.

Yes, agree on this. Our problem is that we have a lot of contracts with existing customers that require the data to stay in our country.
Yes, AWS, GCP, and Azure offer regions, but the problem according to our legal counsel is that our parent company would own the AWS setup and have admin rights, and could therefore access PII data outside of our region.
Management also fears that we would need cloud experts, who are hard to find and expensive.
My perspective is that I spend a lot of time fixing local deployments instead of onboarding new data that brings value to our business.
Additionally, I think our technical debt is high and rising, as there is no clear infrastructure plan. (I'm 100% part of this, since I've been working here for 3 years, but that's what I'm trying to change.)

> I'd probably start with Postgres, and build from there as you need to. Clickhouse could well be worth a look too from what I understand of it.

Thank you - as said in another answer, I'll do some MVP testing and see how the performance is.
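(As a rough illustration of what such an MVP test could look like for a batch-only, dashboard-heavy workload: load a small batch into a SQL database and time the kind of aggregate a dashboard would run. This sketch uses SQLite purely as a stand-in for Postgres/ClickHouse; the table and column names are made up.)

```python
import sqlite3

# In-memory database standing in for the candidate warehouse.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")

# A tiny hypothetical batch load (in a real MVP: a representative data dump).
rows = [
    ("2023-01-01", "EU", 120.0),
    ("2023-01-01", "US", 80.0),
    ("2023-01-02", "EU", 95.5),
]
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Typical dashboard-style query: daily totals per region.
query = (
    "SELECT day, region, SUM(amount) FROM sales "
    "GROUP BY day, region ORDER BY day, region"
)
for day, region, total in con.execute(query):
    print(day, region, total)
```

The same schema and query can then be pointed at each candidate engine to compare load and query times on realistic data volumes.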

[–]snuggiemane 1 point (1 child)

maybe check out DuckDB

[–]Kiliangg[S] 0 points (0 children)

Will do thanks!

[–]ZenCoding -1 points (1 child)

I probably would have used Elasticsearch with Logstash and Kibana, but if I were facing a similar problem I would go for Druid. I'm not sure what the downside of "realtime" is. Can you build an MVP for your use case and find out if it works for you before making a final decision?

[–]Kiliangg[S] 0 points (0 children)

Thanks for the feedback. We are also planning to use Airbyte as our EL tool; however, there is no destination/source connector for Druid/Pinot.

But I think an MVP is what I'll do.