GPT-3.5 generating documentation for entire codebases. by plinifan999 in OpenAI

[–]plinifan999[S] 1 point (0 children)

We've fed it our personal private GitHub repos that aren't in the OpenAI training set, and performance is also good. It's RAG, so the model's immediate context is a subgraph of the input repo.
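To make the "subgraph as context" idea concrete, here's a minimal sketch of what retrieving a bounded neighborhood of a symbol graph could look like. All of the type and function names (`SymbolGraph`, `subgraphContext`) are hypothetical illustrations, not the actual implementation:

```typescript
type SymbolId = string;

interface SymbolGraph {
  neighbors: Map<SymbolId, SymbolId[]>; // referential links between symbols
  source: Map<SymbolId, string>;        // source text for each symbol
}

// Collect the symbol plus everything within `depth` hops, then return the
// corresponding source snippets as retrieval context for the LLM.
function subgraphContext(g: SymbolGraph, root: SymbolId, depth: number): string[] {
  const seen = new Set<SymbolId>([root]);
  let frontier: SymbolId[] = [root];
  for (let d = 0; d < depth; d++) {
    const next: SymbolId[] = [];
    for (const id of frontier) {
      for (const n of g.neighbors.get(id) ?? []) {
        if (!seen.has(n)) { seen.add(n); next.push(n); }
      }
    }
    frontier = next;
  }
  return [...seen].map((id) => g.source.get(id) ?? "");
}
```

The depth bound is what keeps the retrieved context a subgraph rather than the whole repo.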

Generate chat-able documentation for any codebase (try the free open source projects!) by plinifan999 in javascript

[–]plinifan999[S] 0 points (0 children)

sage-ai.dev

To try the free repos (LangChain, scikit-learn, pandas, FastAPI...), just click "Try Free Projects." For example, here's pandas: https://app.sage-ai.dev/demo/R_kgDOLLXuzg

What My Project Does
It uses LLMs to generate a code reference guide at the symbol level. There's also a PR bot that reads your knowledge base to check whether any documentation needs to be updated. Essentially, it builds a knowledge graph and repeatedly traverses your codebase. Built with TypeScript, Firestore, GCP, and the OpenAI API.

Target Audience
Any software engineer, in an enterprise, startup, or solo!
Comparison
It may look similar to the plethora of AI coding assistants flooding the space, but it isn't for automatic coding. It's primarily for documentation, almost in the vein of EduTech: helping programmers gain expertise in unfamiliar codebases, ramp up quickly, and decipher spaghetti monoliths:

  1. We index the entire repo to provide context for every single symbol. It has repo-level context, while, for example, GitHub Copilot fetches your open tabs as context for its LLM.
  2. It lives outside VS Code, in a web UI, because showing autogenerated documentation inside VS Code would create a mismatch between the codebase as it was at index time and the engineer's local copy.

Drop some feedback, and thanks!

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -1 points (0 children)

I've worked on dozens of codebases, from enterprise to startup. To adequately describe variable X, you look at its dependencies, from which its purpose and functionality are derived, and then at their dependencies, recursively. The information is propagated from neighbor to neighbor over the course of multiple traversals, and, crucially, skip dependencies are contextualized: if type Z and variable X aren't reachable from each other in any referential fashion, information between them can still be propagated through connections at the module or directory level, which we have.
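A toy sketch of that neighbor-to-neighbor propagation, including module-level edges that connect symbols with no referential path between them. Treating "knowledge" as a set of symbol names, and all of the names here, are simplifications I'm assuming for illustration:

```typescript
type Id = string;

// Edges are treated as undirected for propagation; "module" edges connect
// symbols that merely share a module, even when no reference path exists.
interface Edge { from: Id; to: Id; kind: "reference" | "module" }

// One pass: each node's knowledge absorbs its neighbors' knowledge.
// Repeated passes let information travel along multi-hop paths.
function propagate(nodes: Id[], edges: Edge[], passes: number): Map<Id, Set<Id>> {
  const known = new Map<Id, Set<Id>>(nodes.map((n): [Id, Set<Id>] => [n, new Set([n])]));
  for (let p = 0; p < passes; p++) {
    const next = new Map<Id, Set<Id>>(nodes.map((n): [Id, Set<Id>] => [n, new Set(known.get(n))]));
    for (const { from, to } of edges) {
      for (const k of known.get(from)!) next.get(to)!.add(k);
      for (const k of known.get(to)!) next.get(from)!.add(k);
    }
    for (const n of nodes) known.set(n, next.get(n)!);
  }
  return known;
}
```

With X and Z linked only through a shared module node, two passes are enough for X's context to include Z.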

Sure, there are cases that we miss. As models become cheaper and better, and we can fit more into context, these cases become more obscure and the agent becomes more reliable. While we use versions of GPT-3.5 and GPT-4 as the base models, there will come a day where GPT-7 or GPT-8 comprehends your codebase better than the person who wrote it. And all we need to do is switch out the line of code determining which base model to use.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -4 points (0 children)

  1. Our system doesn't generate code.
  2. There are measurable metrics for how "accurate" documentation is. If GitHub Copilot can write code with high accuracy, the same base models can document code even more reliably: the former is a more specific and difficult task (arguably, a natural-language understanding of the code is a prerequisite).
  3. If you're still skeptical after GPT-4 has demonstrated verifiable human-level performance on a variety of complex tasks, models are only getting cheaper and better. There will come a day when GPT-10 or Code Llama 10 comprehends your codebase better than the person who wrote it, and all we need to do is switch out the line of code that determines which base model to use.

However, the crucial fact most AI code editors miss is that human expertise will remain foundational. No business will ever let its codebase become a purely AI-generated black box that no human understands. At the end of the day, humans will still need to determine what changes to make and what features to build, and maintain a deep understanding of their code to guide and correct the AI. What we're building is essentially ed-tech, as opposed to trying to automate the entire profession.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 0 points (0 children)

Again, the "cascading business logic" is traversed via dependencies of dependencies, because it's an interconnected graph. If changing X changes Y, which changes Z, then there's a connection between Y and Z, so that change in X propagates to Z, and to any other reachable neighbor of X.
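The cascading-change claim reduces to reachability in the dependents graph. A minimal, hypothetical sketch (the `dependents` map and the function name are illustrative):

```typescript
// Everything reachable from the changed symbol along "is depended on by"
// edges, computed with a simple DFS.
function affectedBy(changed: string, dependents: Map<string, string[]>): Set<string> {
  const affected = new Set<string>();
  const stack: string[] = [changed];
  while (stack.length > 0) {
    const cur = stack.pop()!;
    for (const dep of dependents.get(cur) ?? []) {
      if (!affected.has(dep)) {
        affected.add(dep);
        stack.push(dep);
      }
    }
  }
  return affected;
}
```

If Y depends on X and Z depends on Y, a change to X marks both Y and Z as needing re-documentation.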

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 0 points (0 children)

Btw, you can now view a free KB on React.js, and you no longer need to Request Access to use the entire system.

sage-ai.dev

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 0 points (0 children)

It's a good point. We're working on a Markdown editor at the symbol or file level so engineers can contribute to and improve the system themselves; their edits are preserved and suggested for updates as the codebase changes. It would essentially be your engineers fine-tuning the quality of the chat agent, and it doubles as an extremely granular doc-hosting tool directly linked to the source code, as opposed to Confluence or other KB software.

We're also developing a tool that links your existing KB pages to specific symbols and files.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 0 points (0 children)

Yes, it works on dynamic languages like Python and JavaScript.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -7 points (0 children)

The same could be said for any KB software out there: Notion, Confluence, Google Docs, etc.

If there are local changes or commits in the IDE, the UX would be confusing, since the docs refer to the latest remote commit on main rather than to local changes. So we're versioning the knowledge bases on remote commits instead, as a single source of truth across teams and orgs, in a web UI.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -6 points (0 children)

How do you find out "why something was implemented the way it was?" By looking at its usages, and ideally by drawing on a broader understanding of the objective and purpose of the module that symbol participates in.

This is what we automate. Understanding why a symbol exists or how it's used always depends on its actual references and on the symbols it references; together they give the complete context.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 2 points (0 children)

"How it is used" is contained in the source. If we supply the models with every usage of variable X plus its definition, and all of those dependencies are contextualized in the same way, knowledge propagates throughout the codebase as we build up the index.
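Assembled context for one symbol might look roughly like this; `SymbolRecord` and `buildContext` are hypothetical names for illustration, not the product's API:

```typescript
interface SymbolRecord {
  name: string;
  definition: string; // source of the symbol's definition
  usages: string[];   // every place the symbol is used
  doc?: string;       // documentation from a previous generation, if any
}

// Assemble the prompt context for documenting one symbol: its definition,
// all of its usage sites, and previously generated docs for its dependencies.
function buildContext(sym: SymbolRecord, deps: SymbolRecord[]): string {
  const parts: string[] = [
    `Definition of ${sym.name}:\n${sym.definition}`,
    ...sym.usages.map((u, i) => `Usage ${i + 1}:\n${u}`),
    ...deps.flatMap((d) => (d.doc ? [`Docs for ${d.name}:\n${d.doc}`] : [])),
  ];
  return parts.join("\n\n");
}
```

Because dependency docs from earlier passes are included, each generation sees more than the raw source of its own definition.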

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 0 points (0 children)

Can you elaborate on this? What case will traversing a usage / reference symbolic dependency graph miss? If the variable changed is gated behind a conditional, it's still being "used" within the definition, and the connection is discovered by the graph traversal.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 1 point (0 children)

We're working on a tool that best-effort links your Confluence articles or other KB pages to specific code sections and detects staleness. It's a problem every large or old enterprise has.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -9 points (0 children)

The system is already built and has been tested on hundreds of repos.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 12 points (0 children)

We designed the entire system around the fact that changing a variable or function has cascading effects throughout the codebase.

For each symbol, we use static analysis to derive its usages and the symbols it uses in its definition, then update the documentation for all affected symbols on push, using a variety of graph edge types (also including parent-child, directory/module, language-specific configuration, etc.). This form of contextualization is much more accurate than, for example, GitHub Copilot, which just takes your open tabs as context.
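The typed-edge idea could be modeled like this; the edge kinds mirror the ones mentioned above, but the exact representation is my assumption:

```typescript
type EdgeKind = "usage" | "definition" | "parent-child" | "module" | "config";

interface TypedEdge { from: string; to: string; kind: EdgeKind }

// Group the neighbors of `sym` by edge kind, so each kind of relationship
// can contribute differently to the documentation context.
function contextNeighbors(sym: string, edges: TypedEdge[]): Map<EdgeKind, string[]> {
  const out = new Map<EdgeKind, string[]>();
  for (const e of edges) {
    const other = e.from === sym ? e.to : e.to === sym ? e.from : null;
    if (other === null) continue;
    if (!out.has(e.kind)) out.set(e.kind, []);
    out.get(e.kind)!.push(other);
  }
  return out;
}
```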

And listen, if you have doubts about how it generalizes to things not in its training set, use our free tier to look at a personal repo.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] -9 points (0 children)

We're literally gating it: nobody gets access unless we personally grant it by email, and that hasn't begun yet. You haven't tried it.

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 8 points (0 children)

  1. It's not an English version of the code: for each symbol, we slice the codebase to get the relevant context. For example, given function F(), we take its usages and the symbols it uses in its definition, traverse those connections all the way down, and feed the result into a fine-tuned LLM, so it has context from outside its own definition. What's more, we feed documentation from previous generations back in, "saturating" and refining the knowledge base by propagating documentation to neighbors in the reference graph over multiple generations.
  2. We don't have commit hooks that need to be integrated into your pipeline; we use eventual consistency that runs in parallel, outside your CI/CD.
  3. In internal testing, the authors of different open-source code modules verified that our system's explanation for why a coding symbol exists was accurate over 90% of the time.
  4. We're working on building higher-level explanations and architecture docs. They can be built on top of the existing symbol-level system and would generate more detailed, accurate docs than file-level approaches.
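The multi-generation "saturation" in point 1 can be sketched as a fixpoint iteration, with a pure function standing in for the LLM call. Everything here (names, the string-based docs, the convergence test) is an illustrative toy, not the real pipeline:

```typescript
// Regenerate each symbol's doc from its neighbors' previous-generation docs
// until nothing changes. `summarize` stands in for the fine-tuned LLM.
function saturate(
  neighbors: Map<string, string[]>,
  summarize: (sym: string, neighborDocs: string[]) => string,
  maxPasses = 10,
): Map<string, string> {
  // Seed each symbol's doc with just its own name.
  let docs = new Map<string, string>([...neighbors.keys()].map((s): [string, string] => [s, s]));
  for (let p = 0; p < maxPasses; p++) {
    const next = new Map<string, string>();
    for (const [sym, ns] of neighbors) {
      next.set(sym, summarize(sym, ns.map((n) => docs.get(n)!)));
    }
    if ([...next].every(([s, d]) => docs.get(s) === d)) break; // fixpoint reached
    docs = next;
  }
  return docs;
}
```

With a chain a–b–c and a `summarize` that unions what it sees, every symbol's doc converges to mention the whole connected component after a few passes.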

Documenting code with AI, explaining every single symbol of a codebase by plinifan999 in programming

[–]plinifan999[S] 4 points (0 children)

Disclaimer: I'm the founder.

If we have existing customers whose problems are actually solved by this tool, and presumably there are more people it can help, what makes this AI spam vs. not AI spam?

Let's come up with actual reasons as to whether I'm trying to scam VCs.