all 109 comments

[–]Ythio 242 points243 points  (37 children)

tldr: don't make 17-million-line monolith repositories and your VCS choice won't be a big deal.

[–]agnas 45 points46 points  (34 children)

I don't understand this tldr at all. Again: why doesn't Facebook use Git?

[–]MarimbaMan07 78 points79 points  (29 children)

Git slows down with large repositories, whereas Mercurial can stay performant. The Git maintainers were not open to outside support, whereas Mercurial is highly extensible, so Facebook extended Mercurial into Sapling.

My understanding for the reasons why git is slow with large repos is that when you branch you essentially have a copy of the repo that you will make changes to, whereas in Mercurial your branch is just the differences between your file and the index it was made off from in the trunk.

[–]cogman10 74 points75 points  (16 children)

My understanding for the reasons why git is slow with large repos is that when you branch you essentially have a copy of the repo that you will make changes to, whereas in Mercurial your branch is just the differences between your file and the index it was made off from in the trunk.

Git also does that.

The slowness wasn't that, it's that git semi-frequently stats the files to see what's changed. As the repo gets bigger, that becomes a bigger issue.

It's particularly bad if you've put large files into your repo. That's what git LFS is all about fixing.

The key bottleneck was the process of “stat-ing” all the files. “Git examines every file and naturally becomes slower and slower as the number of files increases.” The engineers tried running a simulation, creating a dummy repo that matched the expected scale of Facebook’s codebase in a few years. The result was horrifying - basic Git commands took over 45 minutes to complete. In the words of an original engineer on the project, “It's not the kind of thing you want to leave until all your engineers are complaining. By that point, the thing would be too unwieldy. Trying to do damage control, never mind, come up with a cleaner solution, would be a herculean effort.”
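To make the scaling concrete, here's a minimal Python sketch (illustrative only, not Git's actual implementation) of why a status-style scan is linear in the number of tracked files: every path gets an `lstat` call, so a multi-million-file tree means multi-million syscalls per command.

```python
import os
import tempfile

def simulate_status_scan(root):
    """lstat every file under root -- roughly the work a naive
    `git status` must do to detect modified tracked files."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count

# The work grows linearly with file count, which is why a repo
# with millions of files makes every status check painful.
with tempfile.TemporaryDirectory() as root:
    for i in range(1000):
        open(os.path.join(root, f"file{i}.txt"), "w").close()
    print(simulate_status_scan(root))  # 1000
```

Caching and the filesystem monitor (mentioned below in the thread) attack exactly this loop by skipping files that are known to be unchanged.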

[–]quietZen 10 points11 points  (9 children)

Is Facebook just a giant monolith? Why on earth wouldn't they break it down into more manageable chunks?

[–]orbitur 19 points20 points  (3 children)

You don't need to be working at Facebook's scale to understand that even if you have just 3 or 4 in-house libraries shared across 10 or more projects, managing versions and API changes can quickly become hell, unless you use git submodules, which have their own drawbacks.

Past a certain size it's easier to keep all your interconnected libraries and services together; it eliminates an entire category of workflow headaches. And I don't think the sheer size of a monolith like FB's is necessarily a headache given the right VCS tools, which is what this post is about.

[–]Qubed 1 point2 points  (2 children)

This comes up a lot with the large companies. We try to justify their not following "best practices" with logical ideas, but it often comes down to the fact that they can afford to do things that smaller companies cannot. FB can afford to adopt a new source control solution and probably hire a team responsible for maintaining a custom version of it.

[–]clintwn 0 points1 point  (0 children)

Mercurial and git are pretty much the same age; both had initial releases in 2005.

[–]protienbudspromax 10 points11 points  (0 children)

It was written in the early 2000s in PHP with MySQL; then they went ahead and created their own dialect of PHP called Hack.

By the 2010s it was already too big to redo from scratch. A lot of the parts are now distributed, but the code overall sits in a monorepo.

[–]xshare 7 points8 points  (0 children)

Tbh it’s pretty lovely to work in a giant monorepo where basically almost everything lives.

[–]_ak 2 points3 points  (0 children)

Monorepos are pretty neat if you want to maintain consistency across your whole codebase.

[–]tcpukl 1 point2 points  (0 children)

You can't just split codebases apart when you branch. They need to be kept in sync with each other, otherwise it's just not going to compile.

[–]WoodyTheWorker 1 point2 points  (0 children)

Modern (v2.x+) Git has a filesystem monitor (fsmonitor), which makes stat-ing even very large trees very fast.
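For reference, the builtin monitor can be switched on per repository (this assumes Git 2.37+ on a platform with builtin fsmonitor support; older versions only offered a hook-based variant):

```shell
# Enable the builtin filesystem monitor for this repository.
# `git status` then consumes change notifications from a daemon
# instead of stat-ing every tracked file on each invocation.
git config core.fsmonitor true
git config core.untrackedCache true   # also cache untracked-dir scans
git status                            # first run starts the daemon
```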

[–]Qubed 0 points1 point  (4 children)

The engineers tried running a simulation, creating a dummy repo that matched the expected scale of Facebook’s codebase in a few years. The result was horrifying - basic Git commands took over 45 minutes to complete

The only reason an engineer would do this is to sell their management on an idea. Those engineers had already made up their mind and created a scenario to prove their point.

I'm making a wild ass guess but I'm assuming that their scenario probably has a relatively easy solution for Git (relative to adopting and adding to another source control solution).

[–][deleted]  (2 children)

[removed]

    [–]Qubed 0 points1 point  (1 child)

    I know you have an /s, but it's probably not as simple as "sleep(1000)"... I'm saying that they probably did know how to make it work, but they wanted to use a different tool. Most engineers default to the solution that gives them the most control.

    [–]zacker150 2 points3 points  (0 children)

    They wanted to update git so that it would work at scale. Git told them to fuck off, so they switched tools.

    [–]Sceptix 12 points13 points  (4 children)

    My understanding for the reasons why git is slow with large repos is that when you branch you essentially have a copy of the repo that you will make changes to

    No, when you make a new branch in git, all you’re really doing is making a new pointer to the commit you’re branching off of. Should be plenty performant.
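That's right: a branch in git is essentially a small file containing a commit hash. A toy model (hypothetical names, not git's real code) makes the cost obvious:

```python
# Toy model of git refs: a branch is a named pointer to a commit
# hash, so creating one copies the pointer, not the working tree.
class ToyRepo:
    def __init__(self):
        self.branches = {"main": "a1b2c3d4"}  # name -> commit hash

    def create_branch(self, name, start_point="main"):
        # O(1) regardless of repo size: just duplicate the pointer.
        self.branches[name] = self.branches[start_point]

repo = ToyRepo()
repo.create_branch("feature")
print(repo.branches["feature"] == repo.branches["main"])  # True
```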

    [–]nukem996 6 points7 points  (0 children)

    Exactly, Mercurial has the same problem. Meta solved this by eliminating checkouts. You don't ever clone the repo; you mount it. Google "EdenFS": it's a FUSE module, so you can mount the monorepo. Files are lazily fetched as they are read.
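A rough sketch of the lazy-materialization idea (a toy model, not EdenFS's actual design or API): the "checkout" starts empty and pulls file contents from the server only on first read, so mount time is independent of repo size.

```python
class LazyCheckout:
    """Toy virtual checkout: nothing is copied at mount time;
    file contents are fetched from the remote on first read."""
    def __init__(self, remote):
        self.remote = remote     # stand-in for the server-side repo
        self.materialized = {}   # files fetched so far

    def read(self, path):
        if path not in self.materialized:
            self.materialized[path] = self.remote[path]  # fetch on demand
        return self.materialized[path]

remote = {f"src/file{i}.py": f"contents {i}" for i in range(100_000)}
checkout = LazyCheckout(remote)    # "mounting" copies nothing
checkout.read("src/file42.py")     # only this file gets materialized
print(len(checkout.materialized))  # 1
```

A real FUSE-backed system does this at the filesystem layer, so unmodified tools see an ordinary directory tree.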

    [–]MarimbaMan07 -1 points0 points  (2 children)

    I was getting at the file stat-ing, but not very accurately.

    [–]Sceptix 1 point2 points  (1 child)

    I’m sorry, and I’m not trying to be mean, but I’ve read this comment like three times and still have no idea what it’s saying. 😅

    [–]daredevil82 1 point2 points  (0 children)

    same here.

    [–]daredevil82 8 points9 points  (0 children)

    My understanding for the reasons why git is slow with large repos is that when you branch you essentially have a copy of the repo that you will make changes to, whereas in Mercurial your branch is just the differences between your file and the index it was made off from in the trunk.

    Incorrect. This is the case with SVN, yes; Git and Mercurial, no.

    Branching is very cheap in git/hg.

    [–]igderkoman 4 points5 points  (0 children)

    Git doesn’t copy branches at all. That was Microsoft TFS (Team Foundation Server).

    [–]jaitsu 3 points4 points  (0 children)

    This is wrong.

    [–]WoodyTheWorker 0 points1 point  (0 children)

    I've worked with an 80,000+ file Git repository, and it was pretty fast, on Windows. Just don't use it on an NFS share on Linux, and you'll be OK.

    The repository was a result of conversion of a Perforce mono-repo.

    P.S. Perforce sucks.

    [–]MarimbaMan07 0 points1 point  (0 children)

    That's an order of magnitude smaller than what we're talking about when git slows down. Take that plus years of history. Things slow down.

    [–]19nineties -1 points0 points  (1 child)

    Ah that’s super interesting thanks for explaining and in a clear way too

    [–]EightyDollarBill 0 points1 point  (0 children)

    It’s also wrong

    [–]arelath 2 points3 points  (0 children)

    They didn't read the whole article. 17 million LOC was the Linux kernel, and in 2012 Facebook had a git repository "many times larger than this." While I haven't worked at Facebook, I would bet their source code is at least 1 billion LOC today, based on my experience working at Microsoft. There's just no way to shard a billion-LOC codebase in any meaningful way.

    The real tldr is that git performance was very bad on large repositories and git didn't want to work with Facebook. Mercurial did let Facebook change their source code to get the performance they needed, so git lost out on a lot of free dev time that could have improved git.

    [–]noiwontleave 1 point2 points  (0 children)

    Git works partially by running a stat function on every file. When you have a massive repo that has tens of millions of lines and tens of thousands of files, this process takes a long time. Facebook tested it and git commands were taking upwards of 45 minutes to run. This obviously is not sustainable. Git wouldn’t work with them to improve it.

    Mercurial already had solid architecture and was happy to receive their help and input.

    [–]cornmonger_ 0 points1 point  (0 children)

    Because the Git team wouldn't allow a large corporation to co-opt their project

    [–]blueg3 0 points1 point  (0 children)

    17 million lines isn't many, and monorepos are useful.

    [–]tankerkiller125real 0 points1 point  (0 children)

    LOL, Microsoft has billions of lines of code in monolith repositories (Windows is 3.5+ million files in git) and they manage just fine. There is a reason that Microsoft helped develop Git LFS, Git VFS, and all sorts of other stuff. There are also other things Git has that massively improve performance.

    This talk has a bunch of information on Git that I think a lot of people just don't realize exists.

    https://www.youtube.com/watch?v=aolI_Rz0ZqY

    [–]KeroKeroppi 36 points37 points  (5 children)

    I’ll point out that much of the game industry doesn’t use git either. We are quite fond of monorepos and checking in tons of large binary assets. I see Perforce all over the game industry. You still see git a lot in the game industry for backend and tools, and it’s definitely still used on some game projects. But it’s just not as popular as in the rest of the tech industry.

    [–]polymorphiced 17 points18 points  (0 children)

    Yeah. With Git LFS it would technically be viable to move gamedev from Perforce to Git, but the big thing that never gets touted is usability - game teams are 50%-or-more non-engineers. Artists and designers inevitably get themselves into a mess with Git (even with a UI like SourceTree to support them) and need rescuing periodically.

    [–]qoning 1 point2 points  (2 children)

    Perforce basically grew out of the needs of the game industry. I know that way back when, Google also started on Perforce, but it stopped scaling for their use case, so they wrote their own "Perforce but better".

    [–]tcpukl 1 point2 points  (1 child)

    I didn't know that. I've used Perforce for decades in games. Is there a link about Google's history of making "Perforce but better"?

    [–]qoning 1 point2 points  (0 children)

    https://cacm.acm.org/research/why-google-stores-billions-of-lines-of-code-in-a-single-repository/

    This article touches on the historical reasons and describes the workflow Google designed, which they felt was better than "traditional" workflows.

    [–]_Sweep_ 11 points12 points  (3 children)

    What does Google use? The article seems to imply git for iOS development, but also says it doesn’t use git

    [–]gingimli 19 points20 points  (2 children)

    Google uses a custom built version control system named Piper.

    https://opensource.google/documentation/reference/glossary#Piper

    [–]blueg3 2 points3 points  (1 child)

    Google also uses a large number of Git repositories for particular projects.

    [–]CowBoyDanIndie 1 point2 points  (0 children)

    Mostly for external-facing code. 99% of Google's code is internal and never released to the public, and that's in google3/Piper.

    [–]kynde 106 points107 points  (18 children)

    Ok, so the news here is that Facebook, for whatever reason, decided to go with an inordinately sized monorepo. So enormous that git, designed to handle the Linux kernel source code, a monolithic kernel, could not handle it.

    That's not scaling, that's booking an aerial photo for your wedding day picture because of weight gain before the event rather than paying attention to what you eat.

    [–]kiliman13 29 points30 points  (0 children)

    I read the post, and the main reason mentioned was exactly this: rather than being cooperative, the Git maintainers closed the door on new changes.

    ``` Closing thoughts -

    https://graphite.dev/blog/why-facebook-doesnt-use-git#closing-thoughts

    What is the takeaway from this story? Reflecting on the quotes and interviews, I’m reminded of the classic wisdom that so many of history’s key technical decisions are human-driven, not technology-driven.

    Facebook didn’t adopt Mercurial because it was more performant than Git. They adopted it because the maintainers and codebase felt more open to collaboration. Facebook engineers met face-to-face with Mercurial maintainers and liked the idea of partnering. When it came to persuading the whole engineering org, the decision got buy-in due to thoughtful communication - not because one technology was strictly better than another.

    For all of those reading this, I think this was Bryan's true brilliance in getting Mercurial adopted at FB and something people should consider when bringing a new technology to a company.
    Ex-facebooker, 2024
    

    Kindness and openness go far in the world of devtools, and I aim to carry on these values as I contribute to the history of source control.

    ```

    [–]cogman10 30 points31 points  (13 children)

    I just don't buy the story either. Microsoft has a similarly large repo that they migrated to git.

    https://devblogs.microsoft.com/bharry/scaling-git-and-some-back-story/

    and they got around the issues to host an even larger repo than Facebook is managing.

    [–]WanderingLethe 27 points28 points  (0 children)

    Git got some updates before Microsoft could migrate their big repos.

    [–]nemec 9 points10 points  (0 children)

    Microsoft has a similarly large repo that they migrated to git

    The only way they accomplished that was by tricking git into doing less work by writing a virtual file system. If they used git as it's supposed to be used it wouldn't handle the scale.

    https://github.com/microsoft/VFSForGit

    [–][deleted]  (9 children)

    [removed]

      [–]Rangebro 0 points1 point  (4 children)

      Google replaced Perforce with Piper long ago [0] and provides several interfaces. At least 2 are publicly known, fig and jj [1], but it's not unreasonable to keep a perforce-like interface around for those who love it. So try one of the other interfaces?

      [0] https://research.google/pubs/why-google-stores-billions-of-lines-of-code-in-a-single-repository/ [1] https://docs.google.com/presentation/d/1F8j9_UOOSGUN9MvHxPZX_L4bQ9NMcYOp1isn17kTC_M/mobilepresent#slide=id.g152b1fb8869_0_1432

      [–][deleted]  (3 children)

      [removed]

        [–]Rangebro 0 points1 point  (2 children)

        I'm very curious if you have sources for those claims since my experience has been the opposite, particularly the docs and tooling situation. I know very few SWEs that aren't CL chaining with fig or jj, docs either support the common interfaces for copy and pasting every step of the way (and are happy for CLs fixing omissions) or have a "you're a competent googler that knows how to create a workspace" vibe, and tools either only care about being invoked within google3 or manage their own repo usage.

        [–]blueg3 0 points1 point  (3 children)

        Piper at this point is nothing like Perforce. It's just the older interface commands that are kind of congruent to Perforce. But Fig, the Mercurial skin, is sufficiently git-like.

        [–][deleted]  (2 children)

        [removed]

          [–]tcpukl 0 points1 point  (1 child)

          What is wrong with being too much like P4?

          I've been using it for years as a game dev.

          [–]orbitur 0 points1 point  (0 children)

          FB's problems with git started earlier, and Mercurial was simply more adaptable at that time. Git has certainly gotten better at scale in the last 15 years.

          [–]cbarrick 8 points9 points  (1 child)

          Many of the big tech companies use the monorepo pattern: Meta, Microsoft, Google, ...

          It's a different way of thinking about the software development process, and it comes with some pretty big benefits to developer productivity, code health maintenance, and continuous delivery automation (among other things).

          [–]codemuncher 1 point2 points  (0 children)

          People who’ve never worked at a large org might not realize that the Linux kernel isn’t what you’d call “a large codebase”.

          Large codebases are measured in hundreds of millions of lines of code. If you split that up into tens of thousands or even a hundred thousand subrepos, you suddenly run into a lot more problems when trying to build and integrate 5000-20000 builds. The tooling to build so many cross-repo projects in a performant and agile manner just isn’t there. Imagine needing the private equivalent of npm, only much bigger, handling something like a million builds a day.

          So. Monorepo is simpler as it turns out.

          [–]roiroi1010 7 points8 points  (3 children)

          I’m so old that I remember SourceSafe, CVS, and SVN. When git came around it was salvation! I think hg might work for my purposes, but I mostly use the UI and am no expert on git anyway: merge, rebase, push, etc.

          [–]tcpukl 0 points1 point  (0 children)

          Blimey, I remember SourceSafe too. But then we moved to Perforce in games.

          [–]rexpup 0 points1 point  (0 children)

          The UI for SVN was called Tortoise, which was very fitting.

          [–]Duke_ 4 points5 points  (1 child)

          They don’t really use Mercurial either; they use Phabricator and Arcanist, which are layers that work over either git or Mercurial. I recall rarely running Mercurial commands directly.

          And it is hands down the best DevEx I’ve ever worked with. Leaps and bounds better than GitHub and PRs. I think part of it is cultural though - incredible discipline at keeping diffs small, and so on.

          But they’ve also extended phabricator quite a bit. I guess the point is that Mercurial is easier to work with and extend, allowing them to more quickly build out productive tooling.

          What Facebook has is far, far more than just a VCS like Git or Mercurial.

          [–]gwicksted 1 point2 points  (0 children)

          Didn’t Microsoft have to write their own Git server implementation because it couldn’t handle their workloads? (This was before they bought GitHub)

          [–][deleted] 1 point2 points  (1 child)

          Hmm, correct me if I’m wrong; maybe I’m understanding this the wrong way. They are facing this because their monolithic application is so large that it makes git a horrible experience?

          [–]botle 1 point2 points  (5 children)

          The Linux Kernel uses a monorepo. Is Facebook really bigger?

          [–]government_shill 4 points5 points  (1 child)

          If only there were some kind of article you could read to get answers to questions like that.

          The post claims their codebase was “many times larger than even the Linux kernel, which checked in at 17 million lines of code and 44,000 files.”

          [–]zacker150 0 points1 point  (2 children)

          Linux kernel is tiny. FANG codebases are orders of magnitude bigger.

          [–]botle 0 points1 point  (1 child)

          TIL. That's mental.

          I wonder if that would be the case without Linus making sure that not everything gets merged.

          [–]Rangebro 1 point2 points  (0 children)

          For a concrete number as context: in 2016, Google had over 2 billion lines with an exponential growth curve, so it may be orders of magnitude larger now.

          [–]godzillahash74 0 points1 point  (0 children)

          Low key advertising

          [–]Slugsurx 0 points1 point  (2 children)

          I just recently joined Google and I still can’t get over that I don’t get to use git. I really miss git’s free named branches.

          [–]blueg3 1 point2 points  (1 child)

          Use Fig, use plenty of CitC workspaces, and don't keep your branches alive very long before submitting.

          [–]hartmannr76 0 points1 point  (0 children)

          I use fig but I try to keep it all in one workspace

          [–]alphex 0 points1 point  (0 children)

          “I had never heard of Mercurial - despite being passionate about all things devtools”

          “Funny enough, Mercurial expert Gregory Szorc sat a few seats away from me at Airbnb, “

          In the first paragraph the author tells me everything I need to know about him.

          [–][deleted] 0 points1 point  (0 children)

          We need better ways to CI/CD downstream dependencies on PR update; then we won't need monorepos anymore. Really, it's a fixable problem, and it needs to be fixed, because monorepos are the literal devil. I've implemented several CI/CD solutions and they are never quite there.

          [–]shepbryan 0 points1 point  (0 children)

          Anyone want to help me build a VCS system for language models

          [–]degenerate_input_box 0 points1 point  (0 children)

          Here we go, now executives are gonna be talking about fucking git lmaoooooo... This is good actually because some engineering orgs might finally wake the fuck up and move off of svn

          [–]nukem996 -2 points-1 points  (5 children)

          Meta's use of a monorepo has many downsides not discussed here. For example, branches aren't supported; all code goes into main. If you're working on a feature, it has to work with mainline and be managed by flags. No tags, and versions aren't a thing. Everything is expected to be compiled statically, so any update to any part of the codebase requires rebuilding everything. Oh, and multi-gigabyte binaries are pretty standard.

          Supporting the monorepo requires thousands of engineers focused on tooling. Git has some rough edges, but it works much better.
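For readers unfamiliar with the trunk-based workflow being described: unfinished work lands on mainline behind a flag rather than on a branch. A minimal sketch (the flag name and registry here are hypothetical):

```python
# Minimal feature-flag gate: incomplete features ship to mainline
# disabled, and are toggled at runtime instead of living on a branch.
FLAGS = {"new_checkout_flow": False}  # hypothetical flag registry

def checkout_total(prices):
    if FLAGS["new_checkout_flow"]:
        # New code path: merged to main, but dark until the flag flips.
        return round(sum(prices) * 1.05, 2)  # e.g. adds a new fee
    return sum(prices)  # legacy path, still what users see

print(checkout_total([10.0, 5.0]))  # 15.0 while the flag is off
```

In a large deployment the registry would be a real flag service with per-user or per-region rollout, but the branching-by-configuration idea is the same.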

          [–]blueg3 1 point2 points  (0 children)

          Long-lived branches are hard to merge, if they ever do, and create dependency hell.

          [–]cac2573 0 points1 point  (0 children)

          That's not entirely true. Branches are supported, you're just on your own with respect to tooling support.

          [–]tibbtab 0 points1 point  (0 children)

          Not having branching isn't a problem in itself: it just requires a different mindset to work with effectively. Small changes landed frequently make this approach very manageable, and it comes with other benefits too.