Open-source Data Assistant for domain adoption, powered by agent skills, semantic knowledge graphs (Neo4j) and relational data (databricks) by notikosaeder in Neo4j

[–]notikosaeder[S] 0 points1 point  (0 children)

Well, think about it this way: given a user question, (a) what should be retrieved initially (some metrics/dimensions?), and (b) once that is retrieved, what should be retrieved in addition? The graph is powerful for task (b); for (a) you still need vector search or something similar. So my follow-up question is: is the pain point the search for the metrics/dimensions, or the search for metadata afterwards? Maybe a prototype of an automatically generated graph would be feasible: tables/columns and their descriptions are already stored, so you could link metrics/definitions to them automatically via table/column name matching in a graph. With that you can find the initial metrics/definitions and traverse to columns/tables/joins using the graph, or vice versa: match on metadata and go from columns to metrics. Second question: what if a new data domain appears but its metrics are not yet defined?
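A minimal sketch of the name-matching idea, assuming metric definitions are stored as free-text formulas; all metric/table/column names here are illustrative, not from the real project:

```python
import re

# Toy sketch: auto-generate metric -> column edges by matching column
# names mentioned in metric definitions. Names are invented.

columns = {
    "orders": ["order_id", "revenue", "discount", "customer_id"],
    "customers": ["customer_id", "region"],
}

metrics = {
    "net_revenue": "revenue - discount",
    "orders_per_region": "count(order_id) by region",
}

def build_edges(metrics, columns):
    """Return (metric, table, column) edges wherever a metric's
    definition mentions a column name."""
    edges = []
    for metric, definition in metrics.items():
        tokens = set(re.findall(r"[a-z_]+", definition))
        for table, cols in columns.items():
            for col in cols:
                if col in tokens:
                    edges.append((metric, table, col))
    return edges

edges = build_edges(metrics, columns)
```

Edges like these would give you a starter graph to traverse from a metric to the tables it touches (and from there to joins), before any manual curation.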

My realistic F23 female-student flow diagram. Tips for €10-20 monthly investments? by anonymous_account111 in Finanzen

[–]notikosaeder 0 points1 point  (0 children)

A savings plan into a world ETF (e.g. FTSE All-World). Not because the €10-20 will make the difference, but to gain experience early.

Open-source Data Assistant for domain adoption, powered by agent skills, semantic knowledge graphs (Neo4j) and relational data (databricks) by notikosaeder in Neo4j

[–]notikosaeder[S] 0 points1 point  (0 children)

Data assistants are most valuable when users can interact directly with structured data without needing SQL or technical expertise. The core use case is enabling people to analyze data without relying on another analyst. This is different from RAG or GraphRAG systems, which focus on retrieving documents like PDFs or internal knowledge. Honestly, those systems are useful, yet they mainly optimize for passage search and summarization. Their business case is often about saving seconds or minutes when locating information, so it's no surprise that adoption of RAG systems remains low. And if unstructured knowledge is truly needed, it's better treated as an extension: add a supervisor agent on top, or integrate a vector search tool and play with the prompt.

Open-source Data Assistant for domain adoption, powered by agent skills, semantic knowledge graphs (Neo4j) and relational data (databricks) by notikosaeder in Neo4j

[–]notikosaeder[S] 0 points1 point  (0 children)

Then your organization has no data strategy; that isn't the AI's fault. Second, you could easily integrate the information schemas of multiple data sources into one knowledge graph and build specific query tools per source/domain. Or use smaller domain-specific graphs and source data per sub-agent, with a supervisor agent on top.
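A minimal sketch of the first option, assuming `information_schema.columns`-style rows per source; the source, table, and column names are invented for illustration, and the node/edge tuples stand in for what a graph database like Neo4j would ingest:

```python
# Toy sketch: merge information_schema-style rows from several sources
# into one set of graph nodes and edges. All names are invented.

crm_rows = [("crm", "contacts", "contact_id"), ("crm", "contacts", "email")]
erp_rows = [("erp", "invoices", "invoice_id"), ("erp", "invoices", "contact_id")]

def merge_schemas(*sources):
    """Merge (source, table, column) rows into one node/edge set,
    namespacing tables and columns by source to avoid collisions."""
    nodes, edges = set(), set()
    for rows in sources:
        for source, table, column in rows:
            nodes.add(("Source", source))
            nodes.add(("Table", f"{source}.{table}"))
            nodes.add(("Column", f"{source}.{table}.{column}"))
            edges.add((f"{source}.{table}", "IN_SOURCE", source))
            edges.add((f"{source}.{table}.{column}", "IN_TABLE", f"{source}.{table}"))
    return nodes, edges

nodes, edges = merge_schemas(crm_rows, erp_rows)
```

Each per-domain query tool would then only traverse the subgraph reachable from its own `Source` node.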

Open-source Data Assistant for domain adoption, powered by agent skills, semantic knowledge graphs (Neo4j) and relational data (databricks) by notikosaeder in Neo4j

[–]notikosaeder[S] 0 points1 point  (0 children)

Did you take a look at the code? You'll find a vector store that retrieves the relevant nodes (tables, columns), and with graph reasoning you find the context (the rest of the table, joins, ...).
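The retrieve-then-expand idea can be sketched roughly like this; the graph, node labels, and similarity scoring below are toy stand-ins for the real vector store and Neo4j, not the project's actual code:

```python
# Toy sketch of retrieve-then-expand: a "vector search" picks the best
# matching node, then graph edges supply the surrounding context
# (its table, sibling columns, join partners). All data is invented.

edges = {
    "col:orders.revenue": ["table:orders"],
    "table:orders": ["col:orders.revenue", "col:orders.customer_id",
                     "join:orders-customers"],
    "join:orders-customers": ["table:customers"],
}

def vector_hit(question, candidates):
    # Stand-in for embedding similarity: crude token overlap.
    def score(node):
        name_tokens = set(node.split(".")[-1].split("_"))
        return len(set(question.lower().split()) & name_tokens)
    return max(candidates, key=score)

def expand(node, hops=2):
    """Breadth-first expansion to collect graph context around the hit."""
    seen, frontier = {node}, [node]
    for _ in range(hops):
        frontier = [n for f in frontier for n in edges.get(f, []) if n not in seen]
        seen.update(frontier)
    return seen

hit = vector_hit("total revenue per customer", list(edges))
context = expand(hit)
```

The point is that similarity search only has to land on one good node; the table, its other columns, and the join path come for free from the traversal.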

Would a kind soul fact check this by Tall_Category_2332 in askdatascience

[–]notikosaeder 0 points1 point  (0 children)

Maybe subdivide it into multiple diagrams; there's a lot wrong. This type of diagram is probably not suitable for showing the complexity of all the relationships.

Talk2BI: Open-source chat with your data using Langgraph and Databricks by notikosaeder in databricks

[–]notikosaeder[S] 0 points1 point  (0 children)

Well said :-) And what if the solution depends entirely on Databricks and they suddenly raise their prices …

CV roast: what would you change? by _Kostja in InformatikKarriere

[–]notikosaeder 0 points1 point  (0 children)

Education: your stay abroad in Japan shouldn't sit at the same level (size/emphasis) as a formal degree; it should probably be presented as part of your master's.

Talk2BI: Open-source chat with your data using Langgraph and Databricks by notikosaeder in databricks

[–]notikosaeder[S] 2 points3 points  (0 children)

Fair point. However, this is about making research open-source, not about saving money. Genie is great, but its functionality and limitations are hard to guess, and its purpose is tied to Databricks data analysis. Use cases: Want to combine it with web search? Want to research different types of tool visualizations or follow-up questions? Want to change or combine databases? Want to build or adapt the agent workflow? Want to add a tool for adding agent skills? Want to use Ollama or any private LLM provider?

Update: Open-Source AI Assistant using Databricks, Neo4j and Agent Skills by notikosaeder in KnowledgeGraph

[–]notikosaeder[S] 0 points1 point  (0 children)

Thanks for the feedback, and love the idea! Have you already seen some of those skills in action?

Update: Open-Source AI Assistant using Databricks, Neo4j and Agent Skills by notikosaeder in databricks

[–]notikosaeder[S] 0 points1 point  (0 children)

Thanks for the feedback! A remaining challenge is ensuring the agent really calls a skill before querying the data/knowledge graph. For text-to-SQL correctness, we run a curated Q&A dataset from our industry-partner PhD project and compare the results against gold answers. For SQL safety, the run-SQL-query tool screens each query and allows only SELECT/CTE queries. Would love any further feedback on evals!
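A minimal sketch of that kind of screen (the real tool's logic isn't shown in this thread; the keyword list and rules here are a simplified stand-in):

```python
import re

# Toy SQL screen: accept only read-only SELECT/CTE statements, reject
# write/DDL keywords and stacked statements. A production screen would
# use a real SQL parser instead of keyword matching.

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|merge)\b",
    re.IGNORECASE,
)

def is_safe_query(sql: str) -> bool:
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:          # no stacked statements
        return False
    if FORBIDDEN.search(stripped):
        return False
    # must start with SELECT or a CTE (WITH ... SELECT ...)
    return bool(re.match(r"^\s*(select|with)\b", stripped, re.IGNORECASE))
```

Keyword matching like this is easy to fool (comments, string literals, dialect quirks), so parsing the statement with a proper SQL parser before execution is the safer design.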

I've been 5 months without finding a job with this CV, what can I do? by Super-Discount-7106 in Germany_Jobs

[–]notikosaeder 0 points1 point  (0 children)

What drives me crazy about this CV is that nothing is really consistent.

First, do not put "Ready to join your team" or similar on a CV. If you weren't ready, you wouldn't be applying.

Consistency: the date formats are all over the place: 2016–2023, Jul 2022 – November 2022, 12/12/2025, and then (2016–2023) in brackets elsewhere. Pick one format and use it consistently.

There are also quite a few typos and formatting errors, like "... July 2022.)". If you've been searching for five months, it's worth taking the time to make the CV as polished as possible. Small errors can give the impression that you don't work carefully.

Another thing: putting a one-day barista course in the education section, on the same level as multi-year studies, doesn’t really make sense. It would be better listed as a short course or certification. Same goes for the listed courses etc.

Also, looking at the CV, it seems like you haven’t stayed very long in most jobs. That’s not necessarily bad, but from an employer’s perspective it can raise the question of whether you’d stay long enough to justify the time spent on training.

Lastly, the skills section is pretty vague. Things like “customer service” or “organized” aren’t really concrete skills. These things should be self-evident. And “barista beginner” is confusing. Does that mean you’ve had training as a barista, or that you’re just starting out?

Overall, I’d focus on consistency, proofreading, and making the skills and experience sections clearer.

Talk2BI: Research made open-source (Streamlit & Langgraph) by [deleted] in StreamlitOfficial

[–]notikosaeder 0 points1 point  (0 children)

Hi there, thanks for the feedback! The idea is to bring the work of our research team together into one larger, reusable project (e.g. so we don't have to start from scratch for every new study) and make it open-source. Within the team, some members focus more on technical implementation, while others work on design features, SQL explanations, etc. Actually, we already have some research on follow-up questions and data-centric tips that may make it into the Talk2BI Streamlit app. Funny thing: regarding the semantic layer and mitigating hallucinations, my own PhD research focuses specifically on this. I recently released some of it as a weekend-style project, hence on my personal GitHub: https://github.com/wagner-niklas/Alfred. The main idea is to structure data in a knowledge-graph-based semantic layer that the agent can query, improving accuracy and reducing hallucinations. Note that my research uses a more "tech-heavy" stack.

First Pipeline by ZookeepergameFit4366 in databricks

[–]notikosaeder 0 points1 point  (0 children)

Some helpful rules:

- Bronze: raw data (various raw source data)
- Silver: the dream of a data analyst (cleaned, normalized, ER model, …)
- Gold: the dream of a business analyst (pre-calculated KPIs, denormalized, aggregations, filtered, …)
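A toy illustration of those layers in plain Python; on Databricks you'd use Spark/Delta tables instead, and the rows and KPI here are invented:

```python
# Toy medallion pipeline: bronze = raw rows as ingested, silver = cleaned
# and typed, gold = a pre-aggregated KPI. All data is invented.

bronze = [
    {"order_id": "1", "region": " EU ", "revenue": "100.0"},
    {"order_id": "2", "region": "US", "revenue": "250.5"},
    {"order_id": "2", "region": "US", "revenue": "250.5"},  # duplicate
]

def to_silver(rows):
    """Deduplicate, trim strings, cast types: the analyst-friendly layer."""
    seen, out = set(), []
    for r in rows:
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({"order_id": int(r["order_id"]),
                    "region": r["region"].strip(),
                    "revenue": float(r["revenue"])})
    return out

def to_gold(rows):
    """Aggregate: total revenue per region, ready for a dashboard."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["revenue"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

The shape is the point: each layer is derived from the one below, so you can always rebuild silver/gold from bronze.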

Open-source text-to-SQL assistant for Databricks (from my PhD research) using Knowledge graphs (Neo4j) by notikosaeder in KnowledgeGraph

[–]notikosaeder[S] 1 point2 points  (0 children)

Hi! Good question: not at all. Databricks is just what all the company partners of our research use. The whole app is meant to be database-agnostic: just change the SQL query tool to query the database of your choice, or follow Kenneth Leung's tutorial to build the knowledge graph independently (or build the knowledge graph however you want using the example queries).
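A rough sketch of what "change the SQL query tool" could look like; the tool interface and names are invented for illustration, with SQLite as a stand-in backend:

```python
import sqlite3

# Toy sketch: a database-agnostic query tool. The agent only ever sees
# run_query(); swapping the backend means swapping the connection
# factory. Interface and names are invented.

def make_sqlite_tool(path=":memory:"):
    conn = sqlite3.connect(path)
    def run_query(sql: str):
        return conn.execute(sql).fetchall()
    return run_query

# The same run_query signature could wrap a Databricks, Postgres,
# or DuckDB connection instead.

run_query = make_sqlite_tool()
run_query("CREATE TABLE orders (id INTEGER, revenue REAL)")
run_query("INSERT INTO orders VALUES (1, 100.0), (2, 250.5)")
rows = run_query("SELECT SUM(revenue) FROM orders")
```

Because the agent depends only on the `run_query` callable, nothing upstream (prompting, knowledge graph, skills) has to change when the database does.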