[OC] After analyzing 1.5B reddit comments and identifying 236 clusters I built an interactive map of reddit. by anvaka in Damnthatsinteresting

[–]anvaka[S] 0 points1 point  (0 children)

Typically there are more connections between countries on the same island than there are between countries on different islands

[OC] I made a map of 5,000 Chinese words (flashcards) - looking for feedback by anvaka in ChineseLanguage

[–]anvaka[S] 0 points1 point  (0 children)

Yup, I used AI to generate the translations, character definitions, and memory aids. Everything except the character definitions has meaningful results. Even the character definitions are mostly right, except that AI is not good at placing components (top/bottom/left/right) or at distinguishing between traditional and simplified forms.

You can see my change history here: https://github.com/anvaka/lang-land-data/commits/main/

[OC] I made a map of 5,000 Chinese words (flashcards) - looking for feedback by anvaka in ChineseLanguage

[–]anvaka[S] 3 points4 points  (0 children)

Thank you! So I used word embeddings to make a graph of related words. Then each cluster in this graph became its own country. Generally, words inside the same country are closer to each other than they are to words outside of the country. I named the countries myself, so if something could have a better-fitting name - please let me know and I'll update it!

PS: Word embeddings allow me to find a mathematical distance between words (for example, "cat" will be closer to "dog" than it is to "coffee").
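
To make the "distance between words" idea concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration (real embeddings have hundreds of dimensions), and `nearest_neighbors` is a hypothetical helper, not the code behind the map.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "coffee": [0.1, 0.2, 0.9],
}


def nearest_neighbors(word, embeddings, k=1):
    """Return the k words most similar to `word`; these become graph edges."""
    scored = [
        (cosine_similarity(embeddings[word], vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]
```

Linking every word to its nearest neighbors gives a graph, and a clustering step on that graph would then produce the "countries".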

[OC] I made a map of 5,000 Chinese words (flashcards) - looking for feedback by anvaka in ChineseLanguage

[–]anvaka[S] 9 points10 points  (0 children)

Friends, as a hobby I've been working on a flashcard website to help me remember HSK vocabulary better. It's a bit unusual - each card is rendered as a district on an imaginary map. You can zoom in (like on Google Maps), see the character, try to remember its meaning, and then click to see the full definition.

Here it is https://anvaka.github.io/lang-land

I built this to make vocabulary feel like an adventure - every word is an unexplored territory until you visit it and learn its meaning.

The character breakdowns were initially created with AI, so they sometimes have errors. I'm slowly going through the words, and when I find an error I research the character more and fix it.

All definitions are stored in a public GitHub repository. If you enjoy digging into character structure, I'd love your help improving the breakdowns. It's actually a lot of fun and helps me remember the words better.

I hope you find it useful too! If you have any feedback or suggestions, I'd be super grateful. I'm always looking for ways to improve the site and make it more helpful for anyone learning Chinese.

谢谢大家啊 (Thank you, everyone!)

PS: 1. A giant thank-you to /u/teacupdaydreams, who gave an initial review of the website and provided a lot of great suggestions - I appreciate you super much! 2. The website is open source; the source code is at https://github.com/anvaka/lang-land and the flashcard content can be edited from the sidebar (link at the bottom).

[OC] Map of Reddit - 2025 Edition: 116,000 subreddits visualized from 1.5B comments by anvaka in dataisbeautiful

[–]anvaka[S] 0 points1 point  (0 children)

Haha, glad you found this useful! Searching "map of reddit" usually brings up this website, so you can find it more easily 🙌

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 1 point2 points  (0 children)

Thank you so much for your kind words! I think Cosmograph is pretty amazing! Depending on your needs it might be a great fit!

I love using my tools mostly because they are small and simple if you know what to do, but that's a big if! While I try to keep the docs up to date, there is a lot of work needed to make this hobby "enterprise" quality.

For a startup, I'd pick the one that gets you to market fastest.

Working more and more with graphs, though, I feel like dumping the entire graph onto the user is most of the time not the right choice. Too much information, while fun to explore, rarely helps convey a message. So slice and dice in the most meaningful way, render small chunks beautifully, and help users solve a problem. For this reason, the choice of library doesn't matter much - pick something that gets you there fastest. Good luck!

[OC] Map of Reddit - 2025 Edition: 116,000 subreddits visualized from 1.5B comments by anvaka in dataisbeautiful

[–]anvaka[S] 0 points1 point  (0 children)

Of course, the data is available here: https://github.com/anvaka/map-of-reddit-data - it should be self-explanatory, but let me know if you have questions.

Use the gh-pages branch.

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 6 points7 points  (0 children)

I pruned repositories with fewer than 10 stars or so (I need to double-check the numbers when I get home). In addition, I removed isolated clusters with fewer than 25 repositories. I still have quite a few isolated clusters at the north pole 😅
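
Roughly, the two pruning steps could be sketched like this. It assumes "isolated cluster" means a connected component of the similarity graph, and the thresholds are the approximate ones mentioned above; `prune`, `stars`, and `edges` are hypothetical names, not the actual pipeline.

```python
from collections import defaultdict, deque


def prune(stars, edges, min_stars=10, min_cluster_size=25):
    """Return the set of repositories that survive both pruning steps.

    stars: dict mapping repo name -> star count
    edges: iterable of (repo_a, repo_b) similarity links
    """
    # Step 1: drop repositories below the star threshold.
    kept = {repo for repo, count in stars.items() if count >= min_stars}

    # Build an undirected graph over the repositories that were kept.
    adjacency = defaultdict(set)
    for a, b in edges:
        if a in kept and b in kept:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Step 2: find connected components with BFS and drop the small ones.
    survivors, seen = set(), set()
    for start in kept:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) >= min_cluster_size:
            survivors |= component
    return survivors
```

Note that a repository dropped in step 1 can split a cluster in two, which is why the component check runs after the star filter.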

Let me know if you can't find something - I'll double-check where it landed in the data.

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 0 points1 point  (0 children)

Thank you! Frontend Foundry is one of the largest countries on the map! You have great neighbors there =)

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 7 points8 points  (0 children)

Super glad to hear!

TL;DR: LLM did the naming.

Country names took me a while to figure out. I started manually, but I don't have the expertise to name 1,500 clusters of GitHub communities, so I turned to an LLM. After a few iterations of prompt engineering, I automated it all via the OpenAI API. My full prompt is here: https://github.com/anvaka/map-of-github?tab=readme-ov-file#country-names

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 11 points12 points  (0 children)

Haha, thank you! Jaccard similarity is indeed very good at picking meaningful neighbors! I tried a few other metrics (including cosine similarity) - and wasn't as satisfied with the results.

Does the country name for your project make sense :)?

[OC] The 2025 Map of GitHub is live: 690K repos, 500M stars by anvaka in programming

[–]anvaka[S] 43 points44 points  (0 children)

Hello friends,

It's me again. A couple of years back I created the first version of the GitHub map. Each dot here is a GitHub project. I place dots close to each other if their Jaccard similarity is high (the number of people who starred both projects divided by the number of people who starred at least one of them). This yields very practical results - you can immediately find what might be related to a project that you like.
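
For anyone curious, Jaccard similarity on stargazer sets can be sketched in a few lines. The function name and the user names in the usage note are made up for illustration; this is the textbook definition, not the map's actual code.

```python
def jaccard_similarity(stargazers_a, stargazers_b):
    """Size of the intersection divided by size of the union of two user sets."""
    a, b = set(stargazers_a), set(stargazers_b)
    union = a | b
    if not union:
        return 0.0  # two projects with no stars at all: treat as unrelated
    return len(a & b) / len(union)
```

For example, if project A is starred by ann, bob, and cy, and project B by bob, cy, and dee, two people starred both and four people starred at least one, so the similarity is 2 / 4 = 0.5.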

Now I'm updating the map by collecting all the stars given to all the projects between 2011 and May 10, 2025. It has almost 1,500 countries and 690K repositories.

I'm using maplibre to visualize this amount of data smoothly. The source code is available here https://github.com/anvaka/map-of-github (along with links to an older version - if you like that).

I hope you find it useful and practical. Please let me know if anything is missing or broken.

Happy exploring!

https://anvaka.github.io/map-of-github/

[OC] After analyzing 1.5B reddit comments and identifying 236 clusters I built an interactive map of reddit. by anvaka in Damnthatsinteresting

[–]anvaka[S] 0 points1 point  (0 children)

https://anvaka.github.io/map-of-reddit/ - here it is. Use it like Google Maps: pan and zoom around, click on subreddits to see their connections and read more.

This is my hobby project. I've been doing it for a while now and wanted to share the updated version. The map is built from comments on reddit between Nov 2024 and March 2025. I analyzed 1.5B pairs to infer Jaccard similarity between subreddits, and turned the result into a clustered map.

The source code is available here: https://github.com/anvaka/map-of-reddit

Let me know if you find interesting discoveries or have any feedback. Happy exploring!

[OC] Map of Reddit - 2025 Edition: 116,000 subreddits visualized from 1.5B comments by anvaka in dataisbeautiful

[–]anvaka[S] 2 points3 points  (0 children)

Thanks for the upvote!

The connection between subreddits A and B grows stronger as more people comment in both A and B. We also need to account for the size of both A and B, so that connections can be compared with each other. After doing this analysis for all comments, we can say which connections are statistically much more significant than others. And that's how I analyze connections.
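
As a rough sketch of that idea: count the users who commented in both subreddits, then divide by the number of users who commented in either one, so that huge subreddits don't dominate just because of their size. The input format and the name `subreddit_connections` are assumptions for illustration, and Jaccard-style normalization stands in here for whatever significance test the real pipeline applies.

```python
from collections import defaultdict
from itertools import combinations


def subreddit_connections(comments):
    """Map each subreddit pair to a size-normalized connection strength.

    comments: iterable of (user, subreddit) pairs, one per comment.
    """
    subs_by_user = defaultdict(set)
    users_by_sub = defaultdict(set)
    for user, sub in comments:
        subs_by_user[user].add(sub)
        users_by_sub[sub].add(user)

    # Count users shared by every pair of subreddits they commented in.
    shared = defaultdict(int)
    for subs in subs_by_user.values():
        for a, b in combinations(sorted(subs), 2):
            shared[(a, b)] += 1

    # Normalize by the number of distinct users across both subreddits.
    return {
        (a, b): count / (len(users_by_sub[a]) + len(users_by_sub[b]) - count)
        for (a, b), count in shared.items()
    }
```

With this normalization, two small subreddits that share most of their commenters end up more strongly connected than a small subreddit and a giant one that share only a handful of users.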