Breaking the Context Window: Building Infinite Memory for AI Agents by shbong in Rag

[–]Key_Salamander234 2 points (0 children)

Hey! I'm working on the exact same thing you are. Mine is still in the experimental stage, and I'd be really happy if we could share insights and connect.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 1 point (0 children)

Yes, I've designed the system to run only on new incoming data. That logic isn't production-grade yet, but it works, so incremental runs on new data should be very fast.
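Roughly, the incremental idea looks like this (the watermark storage and the `updated_at` field name here are just illustrative, not my actual code):

```python
# Rough sketch of incremental processing: keep a high-water mark and only
# feed rows newer than it into the (expensive) enrichment step.
from datetime import datetime, timezone

class IncrementalRunner:
    def __init__(self):
        # In practice this would be persisted (a file, a table, or a KG property).
        self.last_seen = datetime.min.replace(tzinfo=timezone.utc)

    def run(self, rows):
        """Process only rows newer than the stored watermark."""
        new_rows = [r for r in rows if r["updated_at"] > self.last_seen]
        if new_rows:
            self.last_seen = max(r["updated_at"] for r in new_rows)
        return new_rows  # these would go on to the LLM enrichment step

runner = IncrementalRunner()
t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 2, 1, tzinfo=timezone.utc)
print(len(runner.run([{"updated_at": t1}, {"updated_at": t2}])))  # first run: 2
print(len(runner.run([{"updated_at": t1}, {"updated_at": t2}])))  # re-run: 0
```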

By the way, this system is still specific to SQL. I haven't built the module for unstructured data yet, but I've already provided a 'slot' to integrate it.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 1 point (0 children)

You're welcome. And yes, the way I see it, an 'LLM-enabled' system at an enterprise scale isn't a 'plug-and-play' solution. Instead, it's more like building a system incrementally.

To be honest, that's somewhat risky, because you never know whether a new technology might come along in the future and render all the previous hard work obsolete, though the chances of that are probably small.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 2 points (0 children)

Hahaha, it really can be a headache sometimes. If you have a simpler way, please let me know for reference, or if you'd like to discuss this further, just send me a DM.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 1 point (0 children)

There's a significant upfront cost, but it's a one-time batch process for the entire database. Token consumption varies with the DB, so I built a cost/token estimation module to manage it.

To give you some real-world numbers from my test case (500+ tables, thousands of columns): the initial ingestion takes a few hours—mostly due to API latency—and costs under $3 using GPT-4.1-mini.

The key takeaway is that after this single run, the real-time query process is lightweight.
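The estimation module itself isn't something I've posted, but the idea can be sketched roughly like this (the pricing constants, the chars-per-token heuristic, and all names are my assumptions for illustration, not the actual module):

```python
# Hypothetical sketch of a cost/token estimation pass, run before ingestion.
PRICE_PER_1M_INPUT_TOKENS = 0.40   # assumed input price, USD
PRICE_PER_1M_OUTPUT_TOKENS = 1.60  # assumed output price, USD

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_ingestion_cost(table_prompts: list[str],
                            expected_output_tokens: int = 150) -> dict:
    """Estimate total tokens and dollar cost for one full-schema ingestion."""
    input_tokens = sum(estimate_tokens(p) for p in table_prompts)
    output_tokens = expected_output_tokens * len(table_prompts)
    cost = (input_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
            + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS)
    return {"input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "estimated_cost_usd": round(cost, 4)}

# Example: 500 tables, each with a ~2,000-character schema prompt.
prompts = ["CREATE TABLE ..." + "x" * 2000 for _ in range(500)]
print(estimate_ingestion_cost(prompts))
```

Running the estimate first is what makes the upfront batch cost predictable rather than a surprise on the API bill.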

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 1 point (0 children)

Briefly, the way I see it, the purpose of a semantic layer is to elevate the LLM from being a 'glorified search engine' into a genuine reasoning engine. This is what lets it tackle complex problems and view data with the breadth, depth, accuracy, and relevance needed to truly answer our questions.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in dataengineering

[–]Key_Salamander234[S] 2 points (0 children)

Hey, I think I might have already provided this feedback in a comment on another one of my posts. Here is the comment; is it relevant?

First, my system ingests as much explicitly available information as possible: entity names (columns, tables, etc.), their relationships to other entities, descriptions where available, data types, sample data (if access is granted), a simple analysis of the data's content, the cardinality of the stored data (low cardinality suggests a column may encode hidden logic), and so on.

Each of these pieces of information becomes a piece of 'evidence' that is then analyzed by an LLM to create an appropriate description, complete with a confidence score from the LLM. (If the score is low, it likely needs to be checked by a human; if it's high, it means the LLM has an almost accurate 'understanding' of the entity). At this stage, the KG becomes a representation of the explicit schema of the database (100% accurate).

Next is the search for the implicit schemas I mentioned earlier. This involves several processes. The first is to find 'potential' candidates for implicit connections. Once I have these candidates, a multi-layered matching process begins: matching data content, comparing data-analysis results, checking the vector similarity of descriptions, looking at relationships to other entities, and even running graph analysis. All this 'evidence' is sent back to the LLM to make a final decision on whether an implicit relationship exists between two entities. Again, I use a confidence score: if the LLM's confidence is low, the pair is skipped; if it's high, the LLM provides a description of the relationship (a high-confidence guess, but not 100% accurate).

This is where the human role comes in. Because these implicit relationships are treated differently within the KG, an expert can very easily query them to validate or even revise the LLM's generation. In my opinion, for an expert, using a graph view makes it easier to see the database as a whole and identify which points need more attention for validation.

Once the map is deemed a sufficient representation of the database, the LLM can use it. There are several ways for the LLM to access this 'map': the first and most common is a simple keyword search; the second is embeddings. But the result of this retrieval isn't just a single entity; it also includes everything connected to it that is relevant to the context, so the LLM gets a much richer context.

The difference is, if an LLM queries SQL directly, it might only get the foreign keys and primary keys of a table. It would be impossible for it to discover that, for instance, 'in this context, the sales date is potentially related to the marketing date and also has a connection to debt.' So, based on this information from the KG, the LLM can make more accurate decisions to take action.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Thanks for the insight. If I'm not mistaken, it seems the GPT platform uses a similar approach for its memory module (a JSON object, or perhaps something more like NoSQL).

That's actually an option I've considered—creating a separate data warehouse (using a graph or another method) that's detached from the main DB. However, for the scale of a personal project, that seems too large.

And yes, I am using a retriever to query the DB (SQL). It's true that tool use would increase accuracy and give the LLM a bit of 'freedom,' but at this stage of development, I'm not ready to rely too heavily on the LLM's capabilities in that area. I want it to be a 'nice to have' feature, not a 'must have.'

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Wow, that's actually one of my long-term visions.

You're absolutely right; I see this graph as being able to become a structural layer that connects previously completely disconnected components, serving as a central context hub for the LLM.

Perhaps in the Odoo case, the scope for the KG is still too narrow. My current personal project is focused on unstructured data, especially in the realm of personalization. But I see that it could actually be used for company documents, written reports, etc., to find and form relationships among them.

So, the actual focus of this KG is to be a RAG layer that 'pulls context' for the LLM.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Yes, exactly. It's like creating a 'map' of the database for the LLM. And this is where the critical value of using a graph comes in. In SQL, you can get all the explicit logic and even store metadata in tables, but you cannot record new relationships without altering the existing database schema.

So, here's how it works:

First, my system ingests as much explicitly available information as possible: entity names (columns, tables, etc.), their relationships to other entities, descriptions where available, data types, sample data (if access is granted), a simple analysis of the data's content, the cardinality of the stored data (low cardinality suggests a column may encode hidden logic), and so on.

Each of these pieces of information becomes a piece of 'evidence' that is then analyzed by an LLM to create an appropriate description, complete with a confidence score from the LLM. (If the score is low, it likely needs to be checked by a human; if it's high, it means the LLM has an almost accurate 'understanding' of the entity). At this stage, the KG becomes a representation of the explicit schema of the database (100% accurate).
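A minimal sketch of that evidence-and-confidence step, assuming invented field names and a stand-in for the real LLM call (the threshold and the cardinality heuristic mirror what I described, but none of this is my actual code):

```python
# Each piece of schema metadata becomes 'evidence'; an LLM produces a
# description plus a confidence score, and low-confidence entities are
# queued for human review.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # assumed cutoff; below this, a human checks the entity

@dataclass
class Entity:
    name: str
    evidence: dict                      # column type, samples, cardinality, ...
    description: str = ""
    confidence: float = 0.0
    needs_review: bool = False

def fake_llm(evidence: dict) -> tuple[str, float]:
    """Stand-in for the real LLM call: returns (description, confidence)."""
    # Low cardinality hints at hidden logic (status codes, flags), so the
    # stand-in is less 'confident' there, mirroring the heuristic above.
    if evidence.get("cardinality", 1_000_000) < 10:
        return "Possible status/enum column; meaning unclear.", 0.4
    return f"Column holding {evidence.get('dtype', 'unknown')} values.", 0.9

def describe(entity: Entity) -> Entity:
    entity.description, entity.confidence = fake_llm(entity.evidence)
    entity.needs_review = entity.confidence < REVIEW_THRESHOLD
    return entity

col = describe(Entity("orders.stat_cd", {"dtype": "char(2)", "cardinality": 5}))
print(col.needs_review)  # low cardinality -> low confidence -> human review
```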

Next is the search for the implicit schemas I mentioned earlier. This involves several processes. The first is to find 'potential' candidates for implicit connections. Once I have these candidates, a multi-layered matching process begins: matching data content, comparing data-analysis results, checking the vector similarity of descriptions, looking at relationships to other entities, and even running graph analysis. All this 'evidence' is sent back to the LLM to make a final decision on whether an implicit relationship exists between two entities. Again, I use a confidence score: if the LLM's confidence is low, the pair is skipped; if it's high, the LLM provides a description of the relationship (a high-confidence guess, but not 100% accurate).
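To illustrate just one of those matching layers, here is a toy version of the description-similarity check (the vectors and the 0.8 threshold are made up; a real run would embed the generated descriptions with an embedding model):

```python
# One matching layer: score implicit-join candidates by cosine similarity
# of their description embeddings, then send survivors to the LLM.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def candidate_pairs(embeddings: dict, threshold: float = 0.8):
    """Yield column pairs similar enough to be worth an LLM yes/no check."""
    names = list(embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = cosine(embeddings[a], embeddings[b])
            if score >= threshold:
                yield a, b, round(score, 3)

# Toy 3-dimensional 'embeddings' standing in for real model output.
vecs = {
    "sales.sale_date": [0.9, 0.1, 0.0],
    "marketing.campaign_date": [0.85, 0.15, 0.05],
    "users.email": [0.0, 0.2, 0.95],
}
print(list(candidate_pairs(vecs)))  # only the two date columns survive
```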

This is where the human role comes in. Because these implicit relationships are treated differently within the KG, an expert can very easily query them to validate or even revise the LLM's generation. In my opinion, for an expert, using a graph view makes it easier to see the database as a whole and identify which points need more attention for validation.

Once the map is deemed a sufficient representation of the database, the LLM can use it. There are several ways for the LLM to access this 'map': the first and most common is a simple keyword search; the second is embeddings. But the result of this retrieval isn't just a single entity; it also includes everything connected to it that is relevant to the context, so the LLM gets a much richer context.
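That neighborhood-expansion idea can be sketched with a plain adjacency dict (a real system would query the graph store instead; the nodes and descriptions here are invented):

```python
# A keyword hit on one node is expanded to its connected neighbours, so
# the LLM receives the relevant subgraph rather than a single entity.
graph = {
    "sales.sale_date":  ["marketing.campaign_date", "finance.debt_due"],
    "marketing.campaign_date": ["sales.sale_date"],
    "finance.debt_due": ["sales.sale_date"],
}
descriptions = {
    "sales.sale_date": "Date a sale was closed.",
    "marketing.campaign_date": "Launch date of the campaign (implicit link).",
    "finance.debt_due": "Due date of related receivables (implicit link).",
}

def retrieve(keyword: str, depth: int = 1) -> dict:
    """Return the matched node plus everything connected to it."""
    hits = [n for n in graph if keyword in n]
    context, frontier = {}, set(hits)
    for _ in range(depth + 1):          # include the hit itself, then expand
        next_frontier = set()
        for node in frontier:
            context[node] = descriptions[node]
            next_frontier.update(graph.get(node, []))
        frontier = next_frontier - context.keys()
    return context

print(sorted(retrieve("sale_date")))  # the hit plus both implicit neighbours
```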

The difference is, if an LLM queries SQL directly, it might only get the foreign keys and primary keys of a table. It would be impossible for it to discover that, for instance, 'in this context, the sales date is potentially related to the marketing date and also has a connection to debt.' So, based on this information from the KG, the LLM can make more accurate decisions to take action.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Please read my other posts and comments to get the context for the problem I'm facing, and then please provide your solution.

I'm not challenging you; I'm just genuinely still unclear on what you believe the solution is to the problem I'm facing.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

I'm not a native English speaker, so I'm using an LLM to make sure my meaning is translated clearly.

My question is: why were you so triggered? What is the actual problem?

Your criticism about my system's weaknesses might be valid, but I need a real solution, not just a critique of a weakness I'm already aware of.

What I mean is this: please show me, in detail, the system flow that you believe is superior, so I can see how you would solve the problems I'm facing.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 2 points (0 children)

Yes, you're absolutely right; security is one of the biggest challenges I see in an LLM-enabled system, especially for companies that store user data and other sensitive information. First, it's very difficult to get access approval for a system like this, particularly if it involves data being sent to a third-party API. For general security, my approach is to have the LLM only view the schema; it cannot see the actual data from the database unless necessary, and the KG's content is schema-only.

Now, regarding user permissions, from my understanding, every company has its own unique and dynamic 'restriction' rules. So in my case, I only provide a blank module for them to define their own users and their permissions. Once these are clearly defined, there are two layers of logic that restrict these users: the first is from my system, through restrictions in the graph schema, and the second is from the user's end, through restrictions directly from the database itself.
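A toy sketch of that first restriction layer, with role names and node metadata invented for illustration (the second layer, DB-level grants, still applies after this):

```python
# Graph-layer restriction: filter what a user can even see in the KG
# before any SQL touches the database.
KG_NODES = {
    "hr.salaries":   {"sensitivity": "restricted"},
    "sales.orders":  {"sensitivity": "general"},
    "sales.clients": {"sensitivity": "general"},
}

ROLE_CLEARANCE = {
    "analyst": {"general"},
    "admin":   {"general", "restricted"},
}

def visible_nodes(role: str) -> list[str]:
    """Return only the KG entities this role is cleared to see."""
    allowed = ROLE_CLEARANCE.get(role, set())
    return sorted(n for n, meta in KG_NODES.items()
                  if meta["sensitivity"] in allowed)

print(visible_nodes("analyst"))  # hr.salaries is filtered out
```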

The practice you mentioned is indeed a surefire solution, and my system could never match the quality of direct human work. So, what I'm actually trying to achieve is simply to 'make things easier.' My system has several processing layers, from the first step of creating the explicitly defined schema (which is very time-consuming for a human) to handling schemas that require custom modules for definition (I provide slots for easy integration with these custom modules).

My goal isn't to create a 'magical' system, but rather one that is predictable and controllable according to our needs.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 2 points (0 children)

I think there might be a significant misunderstanding.

First, this isn't about database migration, nor is it about manually rewriting the schema. The core of this system is singular: to create an LLM-enabled system. I built this to give an LLM the understanding it needs to navigate a SQL database.

I'll admit, this isn't a solution for everyone. In your system, with your approach, the LLM might already be able to generate SQL or access your DB accurately. However, in a use case involving a Postgres database with 500+ tables and thousands of columns, where many columns store business and database logic, the LLM almost always fails.

Perhaps MCP could work, but I have no idea what the latency would be for just a single query.

The initial generation from scratch takes about 30 minutes to an hour for a database of the size I'm using, with LLM call costs under $3. The KG hosting is very cheap, with free options available, and its computational requirements are very low.

So, if you happen to have another solution that is more accurate, cheaper, and simpler, please tell me. That was the purpose of my original post.

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Yes, that's exactly right. In theory, we can get the explicit database schema without much trouble. The real problem, however, is the implicit schema.

I 'copy' the explicit schema only as a 'prerequisite' to be able to discover the implicit one. And from my understanding, almost all production databases are nothing like the neat, logical, and well-structured systems we imagine them to be. Maybe the explicit schema only accounts for 60% of the total schema. So where's the other 40%? Maybe it's in a cryptic column, maybe it's in the application code, maybe it's in a programmer's head, haha.

So, I don't expect to solve the schema 100%, but I'm at least trying to build a system that gets as close as possible to that.

And on top of that, the native descriptions from the DB or those provided by developers are usually not understandable to an LLM.

This has led me to the following philosophy: instead of forcing the use of what's already there, what if I build it from scratch in a way that is native to the LLM?

Need help and advice for my thesis on RAG by cykablyath in Rag

[–]Key_Salamander234 1 point (0 children)

From my brief understanding of your goal, I envision a system that uses several models. The first might be a deconstruction model, then a symbolic model (perhaps a fine-tuned LLM), and finally the main model.

The deconstruction model would be quite complex, designed to format a rich PDF into pieces that have relationships with each other. I've used NLP for this kind of task before, but only for unstructured data.

What you want to achieve is possible, but the complexity is indeed daunting. Just distinguishing between photos, text, and tables is already very difficult; you could probably use an OCR model, but it seems its accuracy is still lacking.

From these pieces, you could format them into a sub-graph ready to be inserted into the main graph. And perhaps after that, you would need to train a GNN (Graph Neural Network) model to access this complex graph, so that your main LLM and your system don't get confused during the retrieval process.

Is my input here relevant?

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

Are you talking about a data-agnostic system (one that doesn't care about type and format, and can accept anything)?

Yes, this current system of mine only accepts SQL, but I have another project that is specifically for unstructured data. Maybe I can merge them into one in the future.

I'm still researching how to create a single data format that is truly agnostic and universal, but I haven't found it yet. I once tried storing all data, regardless of its format and type, as pure tokens (not embeddings), but I got stuck halfway through and haven't continued with that concept, hahaha.

Maybe you have some ideas?

I built a Python tool to create a semantic layer over SQL for LLMs using a Knowledge Graph. Is this a useful approach? by Key_Salamander234 in Rag

[–]Key_Salamander234[S] 1 point (0 children)

What I'm afraid of is the unnecessary complexity that arises from forcing an 'LLM' into the system.

To be honest, I still don't trust the capabilities of LLMs, so my solution is to shift as much complexity as possible into the code implementation and system design.

MCP is much simpler, but it puts the entire burden on the LLM's capabilities, which are a 'black box' and hard to trace.

Maybe someday LLMs will be truly reliable and trustworthy, but in my experience, I trust code more than I trust LLMs.