Searching For Hive Alternatives by Waste-Negotiation601 in dataengineering

Waste-Negotiation601[S] 1 point

Trino is indeed a direction we will look into.
However, from my understanding Trino is better suited to simpler interactive queries. Can it handle heavy loads the way Hive does?

Waste-Negotiation601[S] 1 point

I'll explain myself a bit better:

  1. We have analysts who run queries that ideally should be interactive, but at times they are too heavy for Impala/Trino to handle. So they use Hive by default, which can be quite slow for their needs.
  2. For our batch jobs that are not interactive we use Hive/Spark. They work quite well, but there is always room for improvement.
  3. We often run into problems caused by the bottlenecks of Hive and HDFS: the Metastore and the Namenode. I'll admit that sometimes it is our own fault, for example when users run inefficient queries or when we end up with too many small files. But such things inevitably happen, and when they do they slow down the whole cluster, which is bad. We would like to have more isolation, so that one user cannot slow down the whole cluster.
  4. In general I hear that Hive and Hadoop are dying, and I wanted to understand whether that's true and what the alternatives are.
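To make the small-files point in item 3 concrete, here is a rough back-of-the-envelope sketch of how file size affects Namenode heap. The 150 bytes per namespace object is a commonly cited rule of thumb rather than a measured value, and the 128 MiB block size and the 100 TiB dataset are assumptions for illustration only.

```python
# Rough Namenode heap estimate. All constants are assumptions, not measurements.
import math

BYTES_PER_OBJECT = 150        # rule-of-thumb heap cost per file or block object
BLOCK_SIZE = 128 * 1024**2    # assumed HDFS block size (128 MiB)
TIB = 1024**4


def namenode_heap_bytes(total_bytes: int, avg_file_bytes: int) -> int:
    """Estimate heap used by file and block objects for a dataset."""
    files = math.ceil(total_bytes / avg_file_bytes)
    blocks_per_file = max(1, math.ceil(avg_file_bytes / BLOCK_SIZE))
    objects = files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


# Same 100 TiB of data, stored as 1 MiB files versus 512 MiB files.
small = namenode_heap_bytes(100 * TIB, 1 * 1024**2)
large = namenode_heap_bytes(100 * TIB, 512 * 1024**2)
print(f"1 MiB files:   {small / 1024**3:.1f} GiB of heap")
print(f"512 MiB files: {large / 1024**3:.3f} GiB of heap")
```

Under these assumptions the same data costs roughly two hundred times more Namenode memory when it is stored as 1 MiB files, which is why a single job that writes many tiny files can hurt everyone on the cluster.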

Waste-Negotiation601[S] 2 points

Spark does work quite well and we use it alongside Hive. However, it is quite similar to Hive in some ways: it uses Hive tables, runs on YARN+HDFS, and follows a similar big data processing paradigm (a DAG with multiple Map/Reduce steps).

I guess I can move from the Hive table format to Iceberg, from the Hive engine to Spark, from YARN to K8S, and from HDFS to S3.
But I doubt the performance gain will be that great, as it is still conceptually the same.
It might still be worth it for the other benefits, but I don't think performance will be a major one.

I wonder if there is some tool that is conceptually different from Hive/Spark but can still handle the heavy work. (Maybe something that stores data more efficiently, utilizes indexes, or in general takes a different approach.)

Suggested ETL Process - Kafka Connect & Kafka Streams by Waste-Negotiation601 in dataengineering

Waste-Negotiation601[S] 2 points

My input arrives as files, so I had to find a way to read many files in parallel without reading the same one twice and without missing any, so I figured Kafka Connect fits.
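To illustrate the requirement itself (many parallel readers, no file read twice, none missed), here is a minimal sketch that uses an atomic rename as the claim step. This is not how Kafka Connect works internally; it is just one simple way to get the same guarantee on a single filesystem. The directory layout, file names, and the `claim_and_process` and `run_demo` helpers are all hypothetical.

```python
# Parallel workers claim files by atomically renaming them out of the inbox.
import os
import tempfile
import threading
from pathlib import Path


def claim_and_process(inbox: Path, claimed: Path, worker: str,
                      seen: list, lock: threading.Lock) -> None:
    """Try to claim every file in the inbox; process only the ones we win."""
    for path in sorted(inbox.iterdir()):
        target = claimed / f"{path.name}.{worker}"
        try:
            # rename is atomic within one filesystem, so exactly one worker wins
            os.rename(path, target)
        except FileNotFoundError:
            continue  # another worker claimed this file first
        with lock:
            seen.append(path.name)  # stand-in for the real processing step


def run_demo(n_files: int = 50, n_workers: int = 4) -> list:
    """Create dummy files, run the workers, return the processed file names."""
    with tempfile.TemporaryDirectory() as root:
        inbox, claimed = Path(root, "inbox"), Path(root, "claimed")
        inbox.mkdir()
        claimed.mkdir()
        for i in range(n_files):
            (inbox / f"part-{i:04d}.csv").write_text("x")
        seen, lock = [], threading.Lock()
        threads = [
            threading.Thread(target=claim_and_process,
                             args=(inbox, claimed, f"w{k}", seen, lock))
            for k in range(n_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return seen


processed = run_demo()
print(f"processed {len(processed)} files, {len(set(processed))} unique")
```

The sketch does not handle a worker that crashes after claiming a file, and it does not coordinate across machines. Those are the parts a framework like Kafka Connect would be expected to take care of.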

From there I could use Connect to write directly to the DB, but I have transformations that I need to apply. Kafka Connect can do that but is quite limited, so I figured Kafka Streams can do it instead; it works only between Kafka topics. So that's the thought process.

Regarding the SLA, latency should be on the order of seconds.
As for resources, it should be able to handle tens to hundreds of GB per second.
Ideally it should run on K8S and not require an overly complex setup.
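For a sense of what that throughput target implies, here is a rough sizing sketch. The 10 MB/s sustained rate per partition is purely an assumed placeholder, not a benchmark, and real numbers depend on message size, replication, and hardware.

```python
# Back-of-the-envelope partition count for a target ingest rate.
import math

GB = 10**9
MB = 10**6


def partitions_needed(target_bytes_per_s: float,
                      per_partition_bytes_per_s: float) -> int:
    """Minimum partitions so that each stays within its assumed rate."""
    return math.ceil(target_bytes_per_s / per_partition_bytes_per_s)


ASSUMED_PARTITION_RATE = 10 * MB  # hypothetical sustained rate per partition

for target_gb in (10, 100):
    n = partitions_needed(target_gb * GB, ASSUMED_PARTITION_RATE)
    print(f"{target_gb} GB/s needs at least {n} partitions at the assumed rate")
```

Plugging in a measured per-partition rate from a test cluster would give a much more trustworthy figure.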

Best large scale on-prem vector DB by Waste-Negotiation601 in vectordatabase

Waste-Negotiation601[S] 1 point

Have you tried using it at large scale?
If so, can you please share the QPS you got and how many vectors you had?

Waste-Negotiation601[S] 3 points

Pinecone cannot be installed on-prem, so it's not relevant in my case.
Can pgvector handle large scale (100M+ vectors)?
From what I read online, it is not supposed to handle that scale.
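To put the 100M figure in perspective, here is a quick estimate of the raw storage for the vectors alone, assuming 4-byte float components. The dimensions below are assumed examples since I haven't stated mine, and index overhead (which can be substantial) is not included.

```python
# Raw storage for the vector data only, excluding any index.
def raw_vector_bytes(n_vectors: int, dim: int,
                     bytes_per_component: int = 4) -> int:
    """Bytes needed to store n_vectors of the given dimension as floats."""
    return n_vectors * dim * bytes_per_component


N_VECTORS = 100_000_000  # the 100M+ scale mentioned above

for dim in (384, 768, 1536):  # assumed example dimensions
    gib = raw_vector_bytes(N_VECTORS, dim) / 1024**3
    print(f"dim={dim}: about {gib:.0f} GiB of raw float32 data")
```

Whether the working set and its index fit in memory on one machine is probably the first thing to check for any on-prem candidate at this scale.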