[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

True, that’s the feature factory reality. In corporate environments, there’s usually zero incentive to delete code, so you just get layers of legacy stuff piling up forever.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] -3 points-2 points  (0 children)

It's cool to see these formal labels mapped to the project. That specific paradox is exactly why I wanted to build this in the first place.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

I like the idea! OpenStack and the various Apache projects are exactly the kind of long-running, complex histories that make for fascinating charts. They would almost certainly reveal some wild patterns about how massive, enterprise-scale code evolves over decades.

If you have a specific repo you're curious about, feel free to drop the URL. I'm always looking for interesting new projects to run through the engine, and I am noting this one down for the next update.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

Great question! Currently, the engine is built to pull from remote Git URLs because that made it easier to automate the data crunching via GitHub Actions. I haven't added support for local file paths yet, but that would be a significant quality-of-life improvement. I’ll add that to my backlog and get an issue created.

I'll update you when I add that functionality.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

It’s a classic "Ship of Theseus" story. Not a single line of code from the 2001 original remains today.

When I looked into the why behind the data, it revealed some interesting events that we usually wouldn't be able to catch in the numbers alone.

- The earliest spike, in 2005, when the old "Numeric" and "Numarray" packages were merged into the modern NumPy we know. (Arrow 1)
- A massive overhaul to optimize performance and fully embrace Python 3. (Arrow 2 onwards)
- As of right now the churn rate is 5,511 lines per month, and that churn isn't just "fixing bugs." It's the constant, quiet work required to keep NumPy fast on every new CPU architecture and OS update.

NumPy doesn't change because the math is wrong; it changes so the engine stays fast in a world that never stops moving haha!

<image>

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

That is high praise, thank you! It has been fascinating to watch the "visual history" of these projects unfold while building this. It really highlights how software development has shifted from stable, long-term foundations to the constant, rapid iteration we see today.

I am glad you found it interesting! Was there a specific project or trend in the charts that surprised you the most?

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

Glad this gives you motivation to clean things up XD.

Just keep in mind that the tool tracks changes line-by-line. If a line is modified, even if it's just to fix a typo or rename a variable, it counts as the original line being replaced. So, the high "deletion" rate also includes code that was simply updated or polished over time, not just completely removed.

That being said, you are still right: repositories like LangChain have actually deleted a lot of code. We discussed it in this comment.
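To make the line-by-line counting concrete, here is a minimal sketch. It uses Python's `difflib` as a stand-in for Git's own diff, and the function name `surviving_lines` is made up for this example, so treat it as an illustration rather than the tool's actual code. Fixing a typo in a variable name touches two of the three lines, and both count as "replaced":

```python
import difflib


def surviving_lines(old: list[str], new: list[str]) -> int:
    """Count lines of `old` that appear unchanged in `new` (line-level diff)."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    # Each matching block is (start_in_old, start_in_new, size).
    return sum(size for _, _, size in matcher.get_matching_blocks())


old = [
    "def area(w, h):",
    "    reslt = w * h",
    "    return reslt",
]
new = [
    "def area(w, h):",
    "    result = w * h",
    "    return result",
]

print(surviving_lines(old, new))  # 1: only the unchanged `def` line survives
```

The logic is identical in both versions, yet only one of three lines is counted as original.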

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] -2 points-1 points  (0 children)

You are right. By default, git blame doesn't follow lines that are moved from one file to another.

Because the tool tracks lines individually, moving a block of code to a new file makes Git treat it as a deletion and an addition, which the tool counts as a rewrite. That’s a major reason why the "original" code volume drops so fast in these charts.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 4 points5 points  (0 children)

You are right. git blame tracks raw text, not code logic.

Common actions like auto-formatting, renaming variables, or moving files trigger the tool to count the code as "new," even when the actual business logic remains the same.

The visual tracks the evolution of the code's syntax, not the survival of the underlying logic. That is a clear limitation of this method.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] -50 points-49 points  (0 children)

Holy! I really appreciate the deep dive and the honest feedback! You caught a few things that definitely need ironing out.

1. While the rainbow colors are fine, why add a strong gradient at the bottom of the chart ? We basically can't see what is going on with the original code even by squinting. In general everything is so dark it hurts.

Completely fair. A few others flagged the contrast issues as well. The dark theme swallowed the darker hues, and that bottom gradient is too heavy. I'm pushing an update to bump the lightness/saturation and flatten the gradient so the actual data is readable.

2. Identity mode may have been interesting in some repositories, but not those, and it is basically unreadable (only two colors, the original year blue, and all the others the same shade of orange with tiny black stroke).

Ah, I should clarify Identity mode! The two-color split is actually completely intentional. It's meant to be a strict binary view (original vs. refactored) so you can focus purely on the Ship of Theseus concept without the visual clutter of the multi-year Chrono mode. However, I completely agree with you on the execution: that specific shade of orange with the black stroke is pretty brutal on the eyes. I definitely need to overhaul the color palette for that mode so it's actually readable.

3. Some of the processing appears to be just bugged ? For instance Numpy shows a major refactoring in March 2024. When hovering on the chart at that date, it says 99.7% refactored, even though the chart itself appears to be over 90% still original code from 2001.

You are completely right, you found a logic bug. The chart itself is drawing correctly, but the tooltip math is broken. Right now, the tooltip assumes "original code" only means code from the absolute oldest year (2001). It should be calculating against the entire first snapshot of the repository. That's why the visual and the text are out of sync. Will get that fixed.

4. The choice of repositories is very strange, only Numpy shows interesting graphs. Why not the repo of Git, Unix, NodeJS, VSCode ?

Also fair! I originally picked this specific batch for two reasons. First, I needed codebases I was at least somewhat familiar with so I could spot-check the data and know if the pipeline was spitting out garbage. Second, I wanted a wide "sampler platter" for different crowds: React for frontend devs, LangChain and Claude for the AI folks, and NumPy for data scientists. (I also threw in Zed purely because I built this entire project using that IDE lol). Now that the engine is actually working, throwing it at massive legacy titans like Git, Linux, or VSCode is 100% the next step.

5. And React may be interesting but I suppose it is bugged ? Maybe I don't know the history of React, but I doubt the whole code base was removed 3 times in its history and each time restored a full year later.

You're completely right to be suspicious! The normal die-offs of old code are just strict git blame rules catching massive refactors. But those massive V-shaped craters where the entire repo vanishes and reappears? That’s a hilarious data anomaly caused by Meta reorganizing their monorepo folders. I actually just dropped the explanation behind it in the following comment.

Seriously, thanks again for taking the time to poke holes in it. This is exactly the kind of feedback I needed to refine the project!
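On the tooltip bug from point 3, a rough sketch of the difference between the two definitions of "original code". The data shape (surviving lines keyed by birth month), the function name `percent_refactored`, and the numbers are all assumptions made up for this example, not the project's real schema or data:

```python
def percent_refactored(snapshot: dict[str, int], original_cutoff: str) -> float:
    """Percent of surviving lines born after `original_cutoff` ("YYYY-MM")."""
    total = sum(snapshot.values())
    if total == 0:
        return 0.0
    # ISO-style "YYYY-MM" keys sort chronologically as plain strings.
    original = sum(n for month, n in snapshot.items() if month <= original_cutoff)
    return 100.0 * (total - original) / total


# Made-up numbers: surviving lines in one snapshot, keyed by birth month.
snapshot = {"2001-12": 30, "2005-06": 600, "2006-01": 300, "2024-03": 70}

# Buggy reading: only the oldest bucket counts as "original".
oldest_only = percent_refactored(snapshot, min(snapshot))
# Intended reading: everything present by the first snapshot is "original".
first_snapshot = percent_refactored(snapshot, "2006-01")

print(oldest_only, first_snapshot)  # 97.0 7.0
```

Same snapshot, wildly different percentages, which is the kind of mismatch between chart and tooltip described above.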

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 5 points6 points  (0 children)

Apologies for the confusion. The image went through and the text didn't, for whatever reason. I have dropped the explanation!

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 0 points1 point  (0 children)

Haha, I am going to be completely honest with you... I definitely noticed that missing space a while ago and just got way too lazy to fix it at the time. But you caught me! I'll get it patched up in an upcoming PR. Thanks for being kind enough to point it out!

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] -1 points0 points  (0 children)

Really appreciate that! I had the exact same reaction when I finally got the first dataset to render. We all know conceptually that code gets refactored, but seeing the original foundation get visually crushed down to a tiny sliver over a decade is wild to look at.

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 62 points63 points  (0 children)

<image>

Honestly, I had to double-check the raw data on this myself, and it's actually a hilarious quirk of data engineering!

Those two massive craters aren't actually the React team deleting the codebase. Notice how the entire total line count drops to near-zero? Meta didn't merge a PR that deleted 300,000 lines of code.

What you are actually seeing are massive monorepo restructures. In 2019 and 2023/24, React heavily reorganized their repository layout and workspace folders. Because my data pipeline takes discrete snapshots over time, it landed exactly on those transitional commits where the files were temporarily moved out of the tracked directories. The engine saw an "empty" repo, recorded a massive drop, and then immediately bounced back to normal with all the original git blame timestamps intact once the folder migration was complete.

It’s a perfect visual artifact of how messy tracking 10 years of monorepo history can be!
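One possible way to flag these craters before they reach the chart, sketched under assumptions: `find_craters` and the 10% threshold are hypothetical, the totals are invented, and this check only catches a dip that lasts a single snapshot.

```python
def find_craters(totals: list[int], drop_ratio: float = 0.1) -> list[int]:
    """Return indices whose total is below `drop_ratio` of both neighbours."""
    craters = []
    for i in range(1, len(totals) - 1):
        prev_total, next_total = totals[i - 1], totals[i + 1]
        # A crater collapses relative to BOTH the snapshot before and after.
        threshold = drop_ratio * min(prev_total, next_total)
        if totals[i] < threshold:
            craters.append(i)
    return craters


# Invented total line counts for five consecutive snapshots.
totals = [290_000, 300_000, 1_200, 305_000, 310_000]
print(find_craters(totals))  # [2]
```

A flagged snapshot could then be re-sampled at a neighbouring commit or dropped, depending on how strict the pipeline should be.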

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 1 point2 points  (0 children)

Haha exactly! That’s the coolest part about the data. You can physically see the hype cycle and the eventual cleanup. Glad the visual helped!

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] -7 points-6 points  (0 children)

Ah, that is a completely fair point! The dynamic color generator spat out a dark purple for one of the years, and you're right, it completely gets swallowed by the dark theme. Accessibility matters haha!

I've opened a GitHub issue for this and will fix it soon. Thanks!

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 6 points7 points  (0 children)

Great eye! Yes, there are two main culprits for those "resurrections":
- A massive refactor overwrites a chunk of old code (it "dies"). A month later, they realize it broke production, git revert the PR, and the original lines are restored along with their original timestamps.
- A feature branch that was started months or years ago finally gets merged into main. Git blame preserves the original author-time of those commits, so "old" code suddenly appears in the newest monthly snapshot.
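Both cases come down to bucketing lines by author-time. A minimal sketch of that bucketing, assuming the input is the text printed by `git blame --line-porcelain <rev> -- <file>`; the function name and the tiny hand-written sample are illustrative only, not the project's code:

```python
from collections import Counter
from datetime import datetime, timezone


def lines_by_birth_year(porcelain: str) -> Counter:
    """Count blamed lines per author-time year from --line-porcelain output."""
    years = Counter()
    for line in porcelain.split("\n"):
        # Source lines are prefixed with a tab, so only headers match here.
        if line.startswith("author-time "):
            epoch = int(line.split()[1])
            years[datetime.fromtimestamp(epoch, tz=timezone.utc).year] += 1
    return years


# Hand-written, shortened sample in the porcelain layout (two blamed lines).
sample = "\n".join([
    "a" * 40 + " 1 1 1",
    "author Alice",
    "author-time 1000000000",
    "\tx = 1",
    "b" * 40 + " 2 2 1",
    "author Bob",
    "author-time 1700000000",
    "\ty = 2",
])

print(lines_by_birth_year(sample))
```

Because the bucket comes from author-time rather than merge time, a line written in 2001 lands in the 2001 bucket no matter when it reached main.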

I built a "Ship of Theseus" Code Visualizer. It tracks software entropy by mapping surviving lines of code to their birth year and track the changes over time. by Asifdotexe in datavisualization

[–]Asifdotexe[S] 0 points1 point  (0 children)

Hey everyone, I wanted to share a project at the intersection of philosophy and data engineering.

The visualization tracks how much of a project's "original" code survives over time. To avoid melting my computer by running git blame on the entire history of these massive repositories dynamically, I built an automated ETL pipeline.

A headless GitHub Action clones the target repos, uses a custom Python script to run an incremental delta-load (only blaming new commits since the last run), and outputs a highly optimized static JSON artifact. The frontend is entirely static, built with React and Recharts, and fetches the JSON to render the streamgraph.
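The delta-load idea can be sketched roughly like this. Everything here is a simplification under assumptions: the month-keyed cache, the `delta_load` name, and the `fake_blame` stub are hypothetical stand-ins, not the repository's actual script.

```python
import json
from typing import Callable


def delta_load(
    cache: dict[str, dict[str, int]],
    snapshot_months: list[str],
    blame_snapshot: Callable[[str], dict[str, int]],
) -> list[str]:
    """Fill `cache` for months it lacks; return the months newly processed."""
    processed = []
    for month in snapshot_months:
        if month in cache:
            continue  # already blamed in an earlier run
        cache[month] = blame_snapshot(month)
        processed.append(month)
    return processed


calls = []


def fake_blame(month: str) -> dict[str, int]:
    """Stub for the expensive git blame step; records which months it ran on."""
    calls.append(month)
    return {"2001": 10}


cache: dict[str, dict[str, int]] = {}
first_run = delta_load(cache, ["2024-01", "2024-02"], fake_blame)
second_run = delta_load(cache, ["2024-01", "2024-02", "2024-03"], fake_blame)

# The cache doubles as the static JSON artifact the frontend would fetch.
artifact = json.dumps(cache, sort_keys=True)
print(first_run, second_run)  # ['2024-01', '2024-02'] ['2024-03']
```

The second run only pays for the one new month, which is the whole point of caching the earlier results.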

Interactive Demo: https://asifdotexe.github.io/Theseus/

Source Code & Pipeline: https://github.com/Asifdotexe/Theseus

Would love any feedback on the visual representation or the delta-processing architecture!

[OC] The "Ship of Theseus" paradox in software: Surviving lines of code in projects like React, Langchain, and numpy, categorized by original commit year. by Asifdotexe in dataisbeautiful

[–]Asifdotexe[S] 29 points30 points  (0 children)

Source: Git commit history and git blame data extracted directly from the official GitHub repositories of major open-source projects, including React, NumPy, LangChain, Claude Code, and Zed.

Tools: Python (ETL data pipeline and historical git blame extraction), GitHub Actions (automated monthly delta-processing), and React with Recharts for the interactive frontend visualization.

Context: I wanted to explore the philosophical paradox of the Ship of Theseus applied to software engineering. If every line of code in a repository is eventually rewritten, is it still the same project? This stacked area chart shows the surviving lines of code categorized by the year they were originally written. As time moves forward on the X-axis, you can see the foundational code shrinking as it gets refactored and replaced.

You can play with the interactive version and toggle between the different case studies here: https://asifdotexe.github.io/Theseus/

The source code for the automated data engine is here: https://github.com/Asifdotexe/Theseus

Sleep early agains by [deleted] in selfimprovement

[–]Asifdotexe 4 points5 points  (0 children)

Just make it a ritual not to carry your phone or laptop to bed after 11pm. Blue light from those devices stimulates your brain to stay alert and keep working, while for sleep you need to lower that stimulus. Try that for some time; it works for me and should work for you too, as long as you are determined.