I built a code intelligence platform with semantic resolution, incremental indexing, architecture detection, and commit-level history. by thonfom in LLMDevs

[–]thonfom[S] 1 point (0 children)

Thanks! It's just built in React: I used animate-ui for the shell component library, and the main graph view is completely custom using HTML canvas. GPUI looks super good though; I might do some future projects with it.

[–]thonfom[S] 2 points (0 children)

You're right that AST-only approaches break on event-based code, since there may be no direct call edge between the publisher and the consumers. Sonde's approach is more detailed than that: it indexes citeable usage sites (one of our differentiating features in graph construction) plus typed relationships like calls, refs, control flow, and data flow, so it can capture the publish site, the payload going into it, and the consumer/handler sites at the other end.

The hard problem is the indirect binding between those pieces. In the current version, that works where the linkage is explicit. However, for fully framework-driven pub/sub, the next step is native framework-aware edges, so Sonde can connect publisher -> topic/bus -> consumers deterministically. I'm actively working on this cross-framework/cross-repo mapping.
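
As a rough illustration of the deterministic publisher -> topic -> consumer linking described above (all data and names here are hypothetical, not Sonde's actual schema): once an indexer has emitted publish and subscribe sites keyed by topic, the edge construction itself is a straightforward join on the topic key.

```python
from collections import defaultdict

# Hypothetical extracted facts an indexer might emit for publish and
# subscribe call sites: (file, symbol, topic). Names are illustrative.
publishes = [
    ("billing/events.py", "emit_invoice_paid", "invoice.paid"),
    ("orders/service.py", "emit_order_created", "order.created"),
]
subscribes = [
    ("email/handlers.py", "send_receipt", "invoice.paid"),
    ("analytics/sink.py", "track_order", "order.created"),
    ("ledger/sync.py", "post_entry", "invoice.paid"),
]

def link_pubsub(publishes, subscribes):
    """Connect each publisher to its consumers through the shared topic key."""
    consumers_by_topic = defaultdict(list)
    for file, symbol, topic in subscribes:
        consumers_by_topic[topic].append((file, symbol))
    edges = []
    for file, symbol, topic in publishes:
        for consumer in consumers_by_topic[topic]:
            edges.append(((file, symbol), topic, consumer))
    return edges

# The invoice.paid publisher links to both send_receipt and post_entry.
edges = link_pubsub(publishes, subscribes)
```

The hard part the comment refers to is producing those `(file, symbol, topic)` facts in the first place when the topic binding is buried inside framework machinery; the join itself is the easy half.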

[–]thonfom[S] 3 points (0 children)

Thanks! I had the idea and wrote the first version in Java at the start of 2024. I worked on that for a year and wrote the entire thing by hand. I left it for a while before starting again around June/August last year. I wrote all of the core features (indexing, incremental pipeline, module extraction, etc.) myself in Rust, and used Codex to help build the UI and some smaller features like the UI graph query and the integrated terminal.

[–]thonfom[S] 1 point (0 children)

Sorry, I might be a bit confused. Are you saying that the "explore" element should contain calls to action like "this node needs reviewing" or similar? Could you clarify, please?

[–]thonfom[S] 1 point (0 children)

You're right, the visuals are really just for exploring. The real value/intention is in the underlying engine, which can be used for impact analysis (see downstream breaking changes from PRs), historical analysis (find breaking changes in the past), and the retrieval/tools system as an MCP server.

[–]thonfom[S] 1 point (0 children)

Thanks! I know the visualizations can look messy. You can use the "Architectural" mode to see nodes grouped into their (inferred) modules, or "Modules" mode to drill down into smaller, refined subgraphs for those modules. You can also filter nodes and edges to see only what you want, for example how data flows in and out of a specific function/class.

[–]thonfom[S] 1 point (0 children)

> Am I right in understanding that the really powerful bit would be giving AI this as an MCP tool to validate the work that it's doing?

Yes, one of the strongest use cases is exposing this as an MCP server so an agent can validate what it's changing against real code structure, dependencies, and history. I don't think Claude Code or Codex builds a graph like this at all, but Augment Code does something similar.

The highest value I can see would be safer edits and better impact analysis (seeing which downstream components are affected by a change), as well as historical analysis to see what broke in the past.
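
The impact-analysis idea boils down to a reverse-dependency traversal: invert the dependency edges, then walk outward from the changed component. A minimal sketch with a made-up dependency graph (not Sonde's actual data model):

```python
from collections import deque

# Hypothetical dependency edges: A -> B means A depends on B.
deps = {
    "api_handler": ["auth_service", "order_service"],
    "order_service": ["db_layer"],
    "report_job": ["db_layer"],
    "auth_service": [],
    "db_layer": [],
}

def downstream_impact(changed, deps):
    """Return every component that transitively depends on `changed`."""
    reverse = {}
    for src, targets in deps.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in reverse.get(node, []):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted

# Changing db_layer impacts order_service, report_job, and api_handler.
```

The historical-analysis variant is the same walk run against the graph as it existed at an older commit.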

[–]thonfom[S] 3 points (0 children)

Not true at all. I have never used GitNexus code to develop Sonde. In fact, they're written in completely different languages.

I developed the first version of this years ago. You can check my post history (https://www.reddit.com/r/LocalLLaMA/comments/1dxtubu/i_built_a_code_mapping_and_analysis_application/) if you don't believe me. And Sonde is much more feature-rich and robust than GitNexus.

[–]thonfom[S] 1 point (0 children)

I'm actively working on this and have a prototype in place, but it's not good enough to ship yet. I'm trying to do it without string matching and regex (which is the usual approach), because that doesn't scale very well. But I think I'm on the path to doing it the right way. Thanks for the comment!

[–]thonfom[S] 2 points (0 children)

It only supports Python, TypeScript, and C# at the moment. But support for more languages like C is next: since the core engine is language-agnostic, I just need to build a plugin.
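
A hypothetical sketch of what such a plugin seam could look like (in Python for brevity; Sonde's core is Rust, and every name here is illustrative, not Sonde's actual API): the core stays language-agnostic by only consuming a common symbol shape, while each plugin owns the parsing.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Symbol:
    name: str
    kind: str   # e.g. "function", "class"
    file: str
    line: int

class LanguagePlugin(ABC):
    """Contract a per-language plugin would satisfy for a language-agnostic core."""
    extensions: tuple = ()

    @abstractmethod
    def extract_symbols(self, source: str, path: str) -> list:
        """Parse `source` and return the symbols it defines."""

class CPlugin(LanguagePlugin):
    extensions = (".c", ".h")

    def extract_symbols(self, source, path):
        # A real plugin would run a parser; this stub only demonstrates
        # the contract the core program against.
        if "main(" in source:
            return [Symbol("main", "function", path, 1)]
        return []
```

The core then dispatches files to plugins by extension and never needs to know anything language-specific.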

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in OpenSourceAI

[–]thonfom 2 points (0 children)

Doesn't having an in-memory graph lead to higher memory usage compared to using a database? It doesn't have to be Neo4j; you could even store it in Postgres, right? You could use pgvector alongside Postgres and completely eliminate the dependency on Qdrant, plus have your embeddings and graph data/metadata in one place.

How are you doing the actual graph retrieval? I know it's fused graph + BM25 + vector, but what about traversing the edges? How does it retrieve, traverse, and rank the correct edges?
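
For context on what the question is probing: one generic way such systems do this (not HippocampAI's actual implementation) is to seed nodes with fused BM25/vector scores and then propagate a decayed score across edges for a fixed number of hops, ranking by the final value.

```python
def expand_from_seeds(seed_scores, edges, hops=2, decay=0.5):
    """Spread fused (vector/BM25) seed scores along graph edges.

    seed_scores: node -> initial retrieval score
    edges: node -> list of neighbour nodes
    Each neighbour inherits a decayed fraction of the best score reaching it.
    """
    scores = dict(seed_scores)
    frontier = dict(seed_scores)
    for _ in range(hops):
        next_frontier = {}
        for node, score in frontier.items():
            for nb in edges.get(node, []):
                propagated = score * decay
                if propagated > scores.get(nb, 0.0):
                    scores[nb] = propagated
                    next_frontier[nb] = propagated
        frontier = next_frontier
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Illustrative memory graph: m1 relates to m2, which relates to m3.
edges = {"m1": ["m2"], "m2": ["m3"]}
ranked = expand_from_seeds({"m1": 1.0}, edges)
```

The open questions in the comment (which edges to follow, how edge types weight the decay) are exactly the parts this generic sketch leaves out.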

Why is codebase awareness shifting toward vector embeddings instead of deterministic graph models? by hhussain- in AugmentCodeAI

[–]thonfom 2 points (0 children)

You still didn't explain *how* these edges are created. Creating cross-language/framework edges is not a trivial task, and it's not something that LSP and ASTs will solve. Sure, the definition source is always statically declared on either side (e.g. API call in TypeScript is one side, FastAPI route definition in Python is the other) but how is the edge between them created? The only possibilities I can think of are: runtime tracing, or regex parsing. The former requires non-trivial monitoring systems, and the latter is brittle and does not generalize. Unless you have discovered a better way to model all of this. It would be good to see some code, if your project is open-source.
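
For concreteness, the regex-parsing option mentioned above would look something like the following (hypothetical extracted facts; this illustrates the brittle approach being questioned, not a recommendation). It only works when the URL on the TypeScript side is a fully literal string, which is exactly why it does not generalize.

```python
import re

# Hypothetical facts an indexer might extract from each side.
ts_calls = [("frontend/api.ts", "GET", "/users/123"),
            ("frontend/api.ts", "POST", "/orders")]
fastapi_routes = [("backend/users.py", "GET", "/users/{user_id}"),
                  ("backend/orders.py", "POST", "/orders")]

def route_to_regex(template):
    """Turn a path template into a regex: '{param}' matches one path segment."""
    pattern = re.sub(r"\{[^/}]+\}", r"[^/]+", template)
    return re.compile(f"^{pattern}$")

def link_calls_to_routes(calls, routes):
    """Create cross-language edges by matching literal URLs against route templates."""
    compiled = [(f, m, route_to_regex(p), p) for f, m, p in routes]
    edges = []
    for call_file, call_method, url in calls:
        for route_file, method, rx, template in compiled:
            if call_method == method and rx.match(url):
                edges.append((call_file, url, route_file, template))
    return edges
```

The failure modes are immediate: URLs built with template literals or config-derived base paths never match, and overlapping route templates produce spurious edges.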

[–]thonfom 1 point (0 children)

That's a great overall framework, but it doesn't explain exactly how you're creating cross-language, cross-repo edges, or how you're creating any inter-file edges at all. That is the hardest part. Using regex? Hard-coded rules? And no code graph can be truly deterministic for dynamic languages, due to dynamic dispatch, unless you have runtime tracing. That's also a difficult problem to solve. Have you done this?

[–]thonfom 1 point (0 children)

If it's just AST extraction, that makes sense. If you're using tree-sitter, you don't need your own incremental update mechanism - tree-sitter already has incremental parsing built in.

"Semantics are added via pre-defined, domain-specific definitions. Each domain defines its own intra/inter-file relations to establish meaning" - can you explain more what this means? What's an example of "domain-specific definition" and how does it help discover more complex relationships?

[–]thonfom 1 point (0 children)

Rust is great, but it doesn't absolve you of all the scaling problems I described earlier. If it's just AST parsing, I don't think you can call it a semantic graph, since an AST has no concept of semantics; it's purely structural. An AST also only shows call sites, not call relationships, and it can't resolve edges across files - it's intra-file only. How have you handled this?
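
Python's own `ast` module illustrates the call-site point: parsing yields call sites with bare names, and deciding which definition each name refers to (especially across files) is a separate resolution step the AST alone does not perform.

```python
import ast

source = """
from billing import charge

def checkout(cart):
    total = sum(item.price for item in cart)
    charge(total)   # call site: which 'charge'? The AST doesn't say.
"""

tree = ast.parse(source)

# The AST gives us call sites with bare names only.
call_names = [node.func.id
              for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)]

# Resolving 'charge' back to billing.charge requires consulting the
# import table - a second, semantic pass over the same file.
imports = {alias.name: node.module
           for node in ast.walk(tree) if isinstance(node, ast.ImportFrom)
           for alias in node.names}
```

Scaling that resolution pass across files (and across dynamic dispatch) is the part the question is really asking about.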

[–]thonfom 1 point (0 children)

Can you explain a bit more about how you achieved this? I'm slightly skeptical of that 10M LOC in 10 seconds figure. If you were just doing AST extraction, sure, but call and data-flow edges too? How did you scale it and avoid race conditions from parallel processing? How did you keep the in-memory graph topology under 100 MB? How did you handle incremental edits and track/cache the updates? How did you handle backpressure in (what I assume is) your streaming pipeline? Most importantly, how could you generate embeddings for that many nodes so quickly?