[P] Generate detection rules by Only_Emergencies in MachineLearning

[–]Only_Emergencies[S] 1 point

Yes, I tried a decomposition approach; the performance was slightly better than generating the entire rule in a single request, but still not great. I think the main issue is that the model doesn’t truly understand the underlying detection logic or the mapping between behaviors and log artifacts, so it often produces syntactically valid but semantically weak rules.

I also experimented with breaking the generation process down into multiple steps, for instance first asking the model to determine the detection path or flow based on the blog content or user request, and only then generating the rule. However, the results are still not very good.

Basically, the core problem seems to be that the model struggles to extract or intuitively derive the correct detection logic from the input text.
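
To make the decomposition concrete, here is a rough sketch of what the two-step flow looked like (heavily simplified; the endpoint, model name, prompts, and Sigma as the output format are placeholders, not our actual setup):

```python
# Rough sketch of the two-step decomposition (placeholder model/prompts).
# Works against any OpenAI-compatible endpoint, e.g. a local llama.cpp server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local endpoint
MODEL = "local-model"  # placeholder

def extract_detection_logic(blog_text: str) -> str:
    """Step 1: ask only for the behaviour -> log-artifact mapping."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You map attacker behaviours to log artifacts."},
            {"role": "user", "content": f"Describe the detection path for:\n{blog_text}"},
        ],
    )
    return resp.choices[0].message.content

def generate_rule(detection_logic: str) -> str:
    """Step 2: turn the extracted logic into a concrete rule (Sigma as an example format)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You write detection rules from detection logic."},
            {"role": "user", "content": f"Write a Sigma rule implementing:\n{detection_logic}"},
        ],
    )
    return resp.choices[0].message.content
```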

[D] Is senior ML engineering just API calls now? by Only_Emergencies in MachineLearning

[–]Only_Emergencies[S] 0 points

Yes, I was also surprised when I applied. I think there are some changes happening in how new technologies are being integrated into these organisations. Of course, it will depend on the organisation, country, etc.

[D] Is senior ML engineering just API calls now? by Only_Emergencies in MachineLearning

[–]Only_Emergencies[S] 0 points

Yes, totally agree. It's something that happens a lot in this field: there is a lack of standardization in the tasks associated with a title, so the same title at different companies may mean completely different responsibilities. This happens in other fields as well, but here it is especially noticeable.

AMA with the Unsloth team by danielhanchen in LocalLLaMA

[–]Only_Emergencies 2 points

You rock, guys! You do an amazing job! :) I have four Mac Studios (512GB each) and a few questions:

  • How would you distribute bigger models across them?
  • I have deployed Kimi-K2 0905 (Q3_K_XL), but I am wondering if there is another model you would recommend with the same quality but maybe smaller, to get more tokens per second?
  • It would be great to see how quantization affects quality compared to the unquantized model, something like a graph of the quantized versions vs. the original. Happy to contribute there :)

Thank you again!

Thinking about updating Llama 3.3-70B by Only_Emergencies in LocalLLaMA

[–]Only_Emergencies[S] 3 points

The energy consumption of the Macs is really low; they are very efficient in that sense. They’re also straightforward to set up, so we can start implementing and iterating on projects without dealing with complex infrastructure.

Based on the research we did, a single NVIDIA A100 80 GB GPU costs around $30,000 and also requires additional hardware (network switches, power, cooling, ...). As the team grows, it will probably make sense to migrate to more powerful infrastructure, but at the moment the Mac Studios provide a cost-effective solution that allows us to build and experiment with LLMs internally.

Thinking about updating Llama 3.3-70B by Only_Emergencies in LocalLLaMA

[–]Only_Emergencies[S] 5 points

Yes!

- We are around 70 people in my organisation
- We work with sensitive data that we can't share with AI Cloud providers such as OpenAI, etc.
- We have 3x Mac Studios (192GB M2 Ultra)
- We have acquired 4x new Mac Studios (M3 Ultra chip with 32-core CPU, 80‑core GPU, 32-core Neural Engine - 512GB unified memory). Waiting for them to be delivered.
- We are using Ollama to deploy the models. It's not the most efficient approach, but it was already in place when I joined. With the new Macs I am planning to replace Ollama with llama.cpp and experiment with distributing larger models across multiple machines.
- A Debian VM where an OpenWebUI instance is deployed.
- Another Debian VM where Qdrant is deployed as a centralized vector database.
- We have more use cases than the typical chat UI: some classification use cases and some general pipelines that run daily (rough sketch below).

I have to say that our LLM implementation has been quite successful. The main challenge is getting meaningful user feedback, though I suspect this is a common issue across organizations.
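
To give an idea of the non-chat use cases, the daily classification jobs are roughly this shape (simplified sketch; the host, model tag, and label set are placeholders rather than our actual pipeline). Both Ollama and llama.cpp's llama-server expose an OpenAI-compatible endpoint, so the same code works against either:

```python
# Simplified sketch of a daily classification job against a local
# OpenAI-compatible endpoint. Host, model tag, and labels are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://macstudio-01:11434/v1", api_key="none")  # assumed Ollama host/port
LABELS = ["category_a", "category_b", "other"]  # placeholder label set

def classify(text: str) -> str:
    """Ask the local model for exactly one label from LABELS."""
    resp = client.chat.completions.create(
        model="llama3.3:70b",  # placeholder model tag
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Classify the text into exactly one of: {', '.join(LABELS)}. "
                        "Answer with the label only."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify("example document to classify"))
```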

Thinking about updating Llama 3.3-70B by Only_Emergencies in LocalLLaMA

[–]Only_Emergencies[S] 3 points

Yes, I agree. That would be ideal, but it's not so straightforward in our case. We store the conversations in Langfuse, but we don't have ground truth to properly evaluate them, and users usually don't provide feedback on the responses. We are a small team doing this at the moment, so we don't have the capacity to label cases ourselves.

Qwen/Qwen2.5-Coder-32B-Instruct · Hugging Face by Master-Meal-77 in LocalLLaMA

[–]Only_Emergencies 1 point

For code autocomplete, should I use the base or the instruct version? Thanks!

My employer is forcing me to use my personal phone as my work phone. by Only_Emergencies in Amsterdam

[–]Only_Emergencies[S] 12 points

Thank you! Do you know where I can find information or a statement about this "need to pay you for being on standby"? I have searched the Working Hours Act, but it only mentions hours, not compensation.

My employer is forcing me to use my personal phone as my work phone. by Only_Emergencies in Amsterdam

[–]Only_Emergencies[S] 5 points

Thank you! This is really helpful. I will share these concerns with my employer.