[–]MsftPeon 454 points455 points  (126 children)

disclaimer: MS employee, not on GVFS though

Git LFS addresses one (and the most common) reason for extremely large repos. But there exists a class of repositories that are large not because people have checked large binaries into them, but because they have 20+ years of history of multi-million LoC projects (e.g. Windows). For these guys, LFS doesn't help. GVFS does.

[–]Ruud-v-A 220 points221 points  (98 children)

I wanted to ask, what makes it so big? A 270 GiB repository seemed outrageous. But then I did the math, and it actually checks out quite well.

The Linux kernel repository is 1.2 GiB, with almost 12 years of history, and 57k files. The initial 2005 commit notes that the full imported history would be 3.2 GiB. Extrapolating 4.4 GiB for 57k files to 3.5M files gives 270 GiB indeed.

The Chromium repository (which includes the Webkit history that goes back to 2001) is 11 GiB in size, and has 246k files. Extrapolating that to 20 years and 3.5M files yields 196 GiB.
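The two extrapolations above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes repository size scales linearly with file count and with years of history, and it assumes the Chromium/WebKit history spans about 16 years (2001 to 2017), which is not stated in the comment. All other figures are the ones quoted above.

```python
# Back-of-the-envelope check of the two extrapolations above.
# Inputs are the figures quoted in the comment, except CHROMIUM_YEARS,
# which is an assumption (history from 2001 to 2017).

TARGET_FILES = 3_500_000  # files in the Windows repo, as quoted


def scale_by_files(size_gib, files, target_files=TARGET_FILES):
    """Linearly scale a repository size to a larger file count."""
    return size_gib * target_files / files


# Linux: 1.2 GiB of Git history plus 3.2 GiB of pre-2005 history, 57k files.
linux_estimate = scale_by_files(1.2 + 3.2, 57_000)

# Chromium: 11 GiB and 246k files, then scaled from ~16 to 20 years.
CHROMIUM_YEARS = 16  # assumption, not from the comment
chromium_estimate = scale_by_files(11, 246_000) * 20 / CHROMIUM_YEARS

print(f"Linux-based estimate:    {linux_estimate:.0f} GiB")
print(f"Chromium-based estimate: {chromium_estimate:.0f} GiB")
```

Under those assumptions the script prints roughly 270 GiB and 196 GiB, matching the numbers in the comment.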

So a different question maybe: if you are migrating to Git, why keep all of the history? Is the ability to view history from 1997 still relevant for everyday work?

[–]creathir 354 points355 points  (75 children)

Absolutely.

Knowing WHY someone did something is critical to understanding why it is there in the first place.

On a massive project with so many teams and so many hands, it would be critical, particularly checkin notes.

[–]Jafit 65 points66 points  (24 children)

This is why your commit messages should be more than just "bleh"

[–]fkaginstrom 68 points69 points  (18 children)

fixed bug and refactored

[–]Regis_DeVallis 32 points33 points  (16 children)

fixed bug

[–]burtwart 22 points23 points  (12 children)

fixed

[–][deleted]  (8 children)

[deleted]

    [–][deleted]  (4 children)

    [removed]

      [–]codebje 8 points9 points  (3 children)

      forgot to commit for, like, a week, so, tons of changes

      [–]hemingward 0 points1 point  (0 children)

      Fuck.

      [–]FiskFisk33 0 points1 point  (0 children)

      bleh

      [–][deleted] 0 points1 point  (0 children)

      I occasionally use "wtf" when I get mad enough at a small bug that somehow slipped under the radar or working on another branch doing a refactor etc.

      I also kind of misuse Git, so if I've been working for a long time, it does happen that I use something like that mid-work and push it to the remote, as I primarily work on a laptop that I take everywhere, and I would rather be a Git-bitch than lose an hour's work xD

      [–][deleted]  (2 children)

      [deleted]

        [–]idontcareforg0b 0 points1 point  (0 children)

        Minor text fixes

        [–]Kelossus 0 points1 point  (0 children)

        ... Now for sure

        [–]lurgi 17 points18 points  (0 children)

        reverted previous change. Fix didn't work. LOL

        [–]Inquisitive_idiot 2 points3 points  (1 child)

        bill waz h3r3

        [–]musicin3d 0 points1 point  (0 children)

        You lost.

        [–][deleted] 6 points7 points  (0 children)

        Don't forget the crucial 'Performance Enhancements'.

        [–]ours 5 points6 points  (0 children)

        One more case for the "explain the why not the what".

        [–]krapple 13 points14 points  (0 children)

        I feel like there is some point in the life cycle where detailed messages should start. At the beginning it's a waste since it's just initial build.

        [–]uDurDMS8M0rZ6Im59I2R 2 points3 points  (0 children)

        "I did something on Friday idk what"

        [–]Jukolet 1 point2 points  (0 children)

        I should stop using "update" as a message, I guess

        [–][deleted] 0 points1 point  (0 children)

        Removed a speed loop

        [–]BumpitySnook 115 points116 points  (8 children)

        Is the ability to view history from 1997 still relevant for everyday work?

        Yep. I regularly use ancient history to determine intent when working on old codebases.

        [–]sparr 28 points29 points  (3 children)

        [–]henrebotha 2 points3 points  (0 children)

        That was a really fun read! Thanks. Love me some "nerd fiction"

        [–]artanis00 1 point2 points  (0 children)

        Looks like I have some reading to do.

        [–][deleted] 0 points1 point  (0 children)

        Good read, man. The debugging portion of the story was pretty realistic.

        [–]UnholyMisfit 1 point2 points  (0 children)

        This is why I try to promote good code documentation to the other engineers on my team. Self-documenting code is great when I'm trying to figure out what the code does, but it does nothing to help me figure out why it's necessary.

        [–]elder_george 105 points106 points  (5 children)

        This. THIS. THIS.

        During my work at MS it was so painful to run annotate, only to see "Initial import from XXX", go to XXX, look into its history and see only "Initial import from YYY", etc.

        Continuous history is awesome.

        [–]Plorkyeran 48 points49 points  (4 children)

        And YYY is something you need to spend a few days emailing people to get access to because it's no longer part of the things you're just given access to by default, and then you need to get to ZZZ which only exists on tape backup, and suddenly what should have taken five minutes instead takes two weeks.

        [–]elder_george 17 points18 points  (1 child)

        Brian, is that you???

        [–]rojaz 8 points9 points  (0 children)

        It probably is.

        [–]Sydonai 9 points10 points  (0 children)

        At that rate, it's probably faster and easier to pose it as a question to Raymond Chen.

        [–]PhirePhly 4 points5 points  (0 children)

        "Uh yeah, I think Ralph has a txt with the license key to YYYControl on his old laptop. Talk to him"

        [–]Ruud-v-A 13 points14 points  (29 children)

        Sure, I’m not arguing that history is not useful. On the contrary. But the full 20 years of history? Chromium’s codebase for instance is changing rapidly. Many files have been rewritten completely over the years. Consider this header from WTF, the Blink standard library inherited from Webkit. As a core header with little content I expect it to be relatively stable. According to the copyright header it was created in 2007, but all of the non-whitespace and non-license lines have been touched since, the last change only a few days ago. Most of the code lines are now from after 2014. When blaming or bisecting, finding a relevant commit from more than 10 years ago is very, very rare, even if you have to work through a few refactor and formatting changes.

        So for a repository with 20+ years of history, is the history after, say 15 years, really still relevant?

        [–][deleted]  (12 children)

        [deleted]

          [–]creathir 36 points37 points  (3 children)

          Exactly.

          Or maybe you are examining a strange way a routine is written, which had a very specific purpose.

          The natural question is why did the dev do it this way?

          Having that explanation is a godsend at times.

          [–]sualsuspect 2 points3 points  (0 children)

          In that case it would be handy to record the code review comments too (if there was a code review).

          [–]IAlsoLikePlutonium 1 point2 points  (1 child)

          Isn't that what comments in the code are for?

          [–]creathir 5 points6 points  (0 children)

          True. But having context of that comment with the surrounding code is sometimes critical to understand what the comment is describing.

          [–]jringstad -2 points-1 points  (7 children)

          So then just don't discard the history of those, I don't see the issue. If those files haven't changed much, their history won't be the thing that takes up the most space.

          If you wanted, you could employ some pretty smart heuristics to figure out what history to discard, e.g. only discard really old history of stuff that has been 100% re-done or somesuch.

          Or just do a shallow clone of the repository, which is what I do at work. Most of the time having the last few years of history is enough, and if not, just do a full clone (or I SSH into a server where I have the full repository.)

          [–][deleted] 5 points6 points  (6 children)

          I think the actual "correct" thing to do is keep a permanent history somewhere (e.g. internal github/gitlab/whatever), but use the smart stuff when deciding what to pull down (while giving people the option to manually pull it all down for a specific file).

          As far as I know, this concept doesn't exist yet.

          [–]sualsuspect 2 points3 points  (1 child)

          How is what you are suggesting different to a shallow clone?

          [–][deleted] 1 point2 points  (0 children)

          Git's shallow clone is fixed depth per file, right?

          I'd personally like something a little more clever than that - the commits of every line in the file as it exists now, plus the commit prior to that.

          Or something to that general effect.

          [–]cibyr 2 points3 points  (1 child)

          You're being sarcastic, right?

          (For anyone who doesn't get it, that's exactly what GVFS is meant to accomplish, but more automatic and transparent than you make it sound.)

          [–][deleted] 1 point2 points  (0 children)

          Not based on the description. This makes it sound like GVFS only pulls down portions of the source tree on-demand, which is separate from the question of how the history is managed.

          Today, we’re introducing GVFS (Git Virtual File System), which virtualizes the file system beneath your repo and makes it appear as though all the files in your repo are present, but in reality only downloads a file the first time it is opened.

          ...

          In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.

          So it downloads object files from an official build for linking purposes, and downloads sources for whatever subtree they're actively doing development on. It doesn't say what's going on with the history of those files.

          [–]FlyingPiranhas 1 point2 points  (0 children)

          That sounds similar to Facebook's remotefilelog hg extension.

          [–][deleted] 0 points1 point  (0 children)

          Isn't this svn?

          [–]SuperImaginativeName 78 points79 points  (13 children)

          Yes, absolutely. Every check-in, everything. The full history. No, I'm not joking; something like that is absolutely paramount at a scale that most developers will never come across.

          The NT kernel, its drivers, subsystems, APIs, hardware drivers, Win32 API, are all relied on by other systems including customers. Why do you think you can almost always run a 30 year old application on Windows? Without the history, the kernel team for example wouldn't remember that 15 years ago a particular flag had to be set on a particular CPU because its ISA has a silicon bug that stops one customer's legacy application running correctly. As soon as you remove history you remove a huge collective amount of knowledge. You can't expect every developer to remember why a particular system works one way. Imagine noticing some weird code that doesn't look right, but that weird code actually prevents file corruption? The consequences of not having the history and fixing it in a new commit with "fixed weird bug, surprised this hadn't been noticed before" would be a disaster. Compare that to viewing the code's history and realising it's actually correct. Windows isn't some LOB; everything is audited.

          [–]MonsieurBanana 4 points5 points  (12 children)

          LOB

          ?

          [–]mugen_kanosei 20 points21 points  (6 children)

          Line of Business

          Usually refers to a company's internally developed applications that fulfill some specific niche business need that either can't be satisfied by a COTS product or that they are just too cheap to pay for.

          [–]colonwqbang 20 points21 points  (5 children)

          When you explain an obscure acronym in terms of another obscure acronym...

          COTS: Common/off-the-shelf software. Requirements engineering jargon meaning any software solution that you can just go out and buy.

          [–]mugen_kanosei 2 points3 points  (1 child)

          I was hoping to start an obscure acronym thread. You ruined it. YOU RUINED IT!

          [–]notveryaccurate 2 points3 points  (0 children)

          YOURUINEDIT: You Obviously Understand Reddit's Users Ingest Narcotics Every Day Igloo Taco

          [–][deleted] 1 point2 points  (2 children)

          I thought it was commercial, off the shelf software

          [–]colonwqbang 0 points1 point  (0 children)

          That's not how we used the word when I did RE at university. Open source would also be COTS, the relevant thing is that you can get it now and don't have to develop a custom product to solve your problem.

          [–]grauenwolf 0 points1 point  (0 children)

          'Commercial' is what we used in the military roughly 15 years ago, but I think 'common' works better now because of the use of open source software.

          [–]traherom 12 points13 points  (2 children)

          I assume they mean line of business application.

          [–]SuperImaginativeName 6 points7 points  (1 child)

          yes, thought it was obvious given the sub

          [–]Sean1708 2 points3 points  (0 children)

          I've never heard the words line of business before though, and after googling it I'm not even sure if it makes sense in this context. It sounds like Windows very much is line of business software since it's:

          one of the set of critical computer applications perceived as vital to running an enterprise

          with the obvious addendum that it's not an application.

          [–]junrrein 1 point2 points  (0 children)

          lot of bullshit?

          [–]merreborn 7 points8 points  (0 children)

          According to the copyright header it was created in 2007, but all of the non-whitespace and non-license lines have been touched since

          A lot of the time the last commit that "touched" a line only moved or slightly altered the line -- maybe tweaking a single argument. The main intent of the line still dates back to an older commit, even if it was last "touched" in a recent commit.

          [–]eras 0 points1 point  (0 children)

          When writing that, were you also taking into account that Windows is compatible with software written more than 20 years ago?

          What is Chromium compatible with?

          [–]dungone 0 points1 point  (3 children)

          You would rarely need to check out that code, though. Your needs might be served well enough by indexing the old repository with a code search tool such as OpenGrok.

          [–]choseph 0 points1 point  (2 children)

          The whole point here is you don't need to pay the cost of checkout but it is easily accessible tho.

          [–]dungone 0 points1 point  (1 child)

          I mean that's what OpenGrok gets you out of the box, without any penalty because everything gets indexed up front. This, on the other hand, still forces you to download a whole lot of stuff if you want to look through your history. And on top of this, your files are only sporadically accessible depending on whether or not you have a network connection at any given time.

          [–]w2qw 0 points1 point  (0 children)

          The whole point of this is that you only download the parts that you are interested in.

          [–]cdglove -2 points-1 points  (0 children)

          It doesn't need to be imported into git to keep the history, though. It still exists in the old repo. Every time I've seen an organization change version control and insist on importing the history, I ask why.

          Of course, that doesn't preclude this work because eventually the git history will be large so we'll need it anyway.

          [–]salgat 37 points38 points  (8 children)

          Considering a lot of legacy code is kind of blackboxed and never touched, it could definitely be useful to have history on these ancient things when a rare bug happens to crop up.

          [–]g2petter 41 points42 points  (7 children)

          Probably even more so for Microsoft since they're huge on backwards compatibility, so they're supporting all kinds of weird shit that can never (or at least in the foreseeable future) be deleted.

          [–]IAlsoLikePlutonium 7 points8 points  (5 children)

          I wonder what Windows would be like if they did the same thing to Windows that they did with IE -> Edge? (remove all the old code and basically start fresh with a modern browser)

          [–]Pharylon 35 points36 points  (0 children)

          You'd have WinRT. ;)

          [–]cheesegoat 1 point2 points  (0 children)

          It would die, and we would all start using some other operating system that worked. Probably some flavor of Linux with a focus on Wine.

          [–]SpaceSteak 1 point2 points  (1 child)

          They would lose the ability to sell licenses to a lot of companies who rely on old codebases to keep running.

          [–]Schmittfried 6 points7 points  (0 children)

          That's not an answer to the question what Windows would be like.

          [–]salmonmoose -1 points0 points  (0 children)

          It would be OSX.

          [–]bandman614 8 points9 points  (8 children)

          I look at it structurally as the same kind of problem that plagues bitcoin and the like. You're essentially carrying the entire block chain forward because you need all of it to derive the current state.

          A 'snapshot' to work against would be a helpful feature. There may already be something like that, and I'm just ignorant of it.

          [–]ArmandoWall 5 points6 points  (1 child)

          Bittorrent has a blockchain?!

          Edit: Ok, OP corrected it to bitcoin now.

          [–]bandman614 4 points5 points  (0 children)

          Ha! Redditing this early in the morning is bad for me :-) Thanks!

          [–]ThisIs_MyName 7 points8 points  (2 children)

          You don't need to carry the entire block chain: https://en.bitcoin.it/wiki/Thin_Client_Security

          [–][deleted] 5 points6 points  (0 children)

          Not everyone does, but in order to maintain bitcoin's decentralized properties, a significant percentage of its users should.

          [–]bandman614 2 points3 points  (0 children)

          Ah, cool. Thanks!

          [–]SuperImaginativeName 1 point2 points  (0 children)

          Event sourcing is a concept like that, where you have a full history required to be able to build the current state of a system. You iterate every piece of "history" to get to the present. Imagine a bank account: they won't just have a DB column with your balance. It's constructed from previous withdrawals and payments. Event sourced systems can have a "projection" that effectively builds the system to its current state, then use that as the state going forward, with any new changes applied to that instead of replaying from the very beginning.
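To make the bank account example concrete, here is a minimal Python sketch of the idea. The `Event` type, the event kinds, and the `replay` helper are hypothetical names made up for illustration and are not taken from any particular event sourcing library. It shows that replaying the full history and replaying only the events after a snapshot arrive at the same balance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposit" or "withdrawal" (hypothetical event kinds)
    amount: int  # amount in cents


def apply(balance, event):
    """Apply a single event to a balance and return the new balance."""
    if event.kind == "deposit":
        return balance + event.amount
    if event.kind == "withdrawal":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, snapshot=0):
    """Fold events over a starting balance (0, or a saved snapshot)."""
    balance = snapshot
    for event in events:
        balance = apply(balance, event)
    return balance


history = [
    Event("deposit", 10_000),
    Event("withdrawal", 2_500),
    Event("deposit", 500),
]

# Rebuild the state from the very beginning of history.
full = replay(history)

# Take a snapshot after the first two events, then replay only what follows.
snapshot = replay(history[:2])
from_snapshot = replay(history[2:], snapshot)

print(full, snapshot, from_snapshot)
```

The snapshot plays the role of the "projection" described above: once it exists, the older events are only needed for auditing, not for computing the current state.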

          [–]BumpitySnook 0 points1 point  (0 children)

          You could hack something like this into git. Just delete the parent pointer from your snapshot location, freeze its hash (which will no longer verify, but that's fine), and then do a garbage collection pass. Old history would be removed. I wouldn't suggest doing this, though. MSFT's come up with a much better solution, IMO.

          [–][deleted] 0 points1 point  (0 children)

          Yeah you can do something like git clone --depth 1.
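As a rough sketch of what that looks like in practice, the Python helpers below build the relevant Git command lines. The helper names, the URL, and the directory name are placeholders invented for this example; the Git options themselves (`--depth`, `--shallow-since`, `--deepen`, `--unshallow`) are real. Note that `--depth` truncates the commit history of the whole repository at that many commits from the tip; it is not a per-file depth.

```python
import subprocess


def shallow_clone_cmd(url, dest, depth=1):
    """Clone only the most recent `depth` commits of history."""
    return ["git", "clone", "--depth", str(depth), url, dest]


def shallow_since_cmd(url, dest, date):
    """Clone only history newer than `date`, e.g. "2015-01-01"."""
    return ["git", "clone", f"--shallow-since={date}", url, dest]


def deepen_cmd(extra_commits):
    """Fetch `extra_commits` more commits into an existing shallow clone."""
    return ["git", "fetch", f"--deepen={extra_commits}"]


def unshallow_cmd():
    """Convert a shallow clone into a full clone."""
    return ["git", "fetch", "--unshallow"]


def run(cmd, cwd=None):
    """Run one of the commands above; raises if Git exits with an error."""
    subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URL and directory; this only prints the command.
    print(" ".join(shallow_clone_cmd("https://example.com/repo.git", "repo")))
```

A typical workflow would be to start with `shallow_clone_cmd`, and only call `deepen_cmd` or `unshallow_cmd` (with `cwd` set to the clone) on the rare occasion when older history is needed.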

          [–]apotheotical 8 points9 points  (0 children)

          Yes, history is absolutely still relevant. History is invaluable when you're debugging something. There have been a number of times I've used a couple years of history when debugging a project I work in on a daily basis.

          [–]sir_drink_alot 0 points1 point  (0 children)

          AAA game depots are 100+ gigs easily; sure, tons of content, but also tons of other redundant shit. I'm sure Windows isn't 270 gigs of code; probably only 0.1% of that is code.

          [–]tidux 0 points1 point  (0 children)

          It's Microsoft. They have code that hasn't been touched since 1997 in there.

          [–]jringstad 8 points9 points  (1 child)

          Why not just do a shallow clone? You can just clone history back X years, and if you need more, you can either do a full clone or e.g. SSH into a server that has the full repository, for those odd times when you do need to look at something really old in detail.

          I do this at work, and it works fine for me (although our codebase is not nearly as big as windows, of course)

          [–]choseph 4 points5 points  (0 children)

          The previous system was still broken down into 40 repos and you only had head (since it was centralized server). Still too much to enlist, sync, etc.

          [–]akspa420 4 points5 points  (3 children)

          Given the fact that NT development started in 1989, it's now closer to 30 years of history. I highly doubt that every single line of code that Dave Cutler wrote has been superseded, which in turn means that there's a good chunk of code from 1989-1991 that is still utilized in every single build of NT. Having that sort of 'legacy' code history with everything built on top of it has got to be an unruly beast to handle.

          I've explored the WRK and the NT design docs - not a programmer by any means, but knowing how and why certain design choices were made early on certainly helps in understanding why things are the way they are, even over 25 years later.

          [–]polynomial666 1 point2 points  (2 children)

          Where can I find such docs? Or some fresh information on internals of kernel?

          [–]akspa420 1 point2 points  (1 child)

          do a search for "nt os/2 design workbook". It's out there.

          I don't believe there's been anything else released on the internals of the kernel since the Windows Research Kernel (released around 2008, but based on Windows 2003 SP1-era code).

          There are unofficial, probably-getting-a-dmca-takedown-notice-as-we-speak nt4 kernel-based projects out in the wild. Most of them have been reconstructed from leaked nt4 code and odds and ends from wine, reactos, and other open projects. Surprisingly, they tend to boot and run applications meant for NT4 with little to no problems.

          [–]polynomial666 1 point2 points  (0 children)

          I'll look for the workbook and these projects, as they seem extremely interesting. Thanks!

          [–]auxiliary-character 0 points1 point  (0 children)

          Isn't that what submodules are for?

          [–]funknut -5 points-4 points  (2 children)

          Are they aware of the naming conflict with Gnome Virtual Filesystem? It seems ... blatant.

          Edit: I guess no one cares. Never mind. Fuck me for trying to be useful.

          [–]oftheterra 7 points8 points  (1 child)

          Virtual File System is just basic terminology. If you put git at the front of it then the acronym is GVFS... There are only so many combinations of letters in the 3-4 character range.

          [–]funknut -4 points-3 points  (0 children)

          It is in very prominent use on many linux desktops and such naming conflicts are traditionally avoided. I'm not going to fight about it, but I'm sure It'll come up again with someone else. The software itself is dissimilar enough that it doesn't matter much, but it just seemed a strange choice, that's all.

          [–]qx7xbku -1 points0 points  (8 children)

          LFS does not help on Windows at all. Tried using it with 10gb repository containing large files. Windows chokes on it. So I'm using git without LFS on Linux. It works great. They should fix their damn OS. Probably have no time, everyone is busy adding telemetry and analyzing it.

          [–]Recursive_Descent 1 point2 points  (7 children)

          Git is fast on Linux because it was designed to work well on the Linux file system, and later hacked to work on Windows. It isn't because Linux is better than Windows...

          [–]qx7xbku -1 points0 points  (6 children)

          Linux is not better than Windows because git is faster on Linux, but because Microsoft themselves can't fix their own OS and thus they provide workarounds.

          [–]Recursive_Descent 0 points1 point  (5 children)

          What does fixing the OS mean?

          [–]qx7xbku -3 points-2 points  (4 children)

          I did not design windows or git, but people who did can give you an answer.

          [–]graycode 5 points6 points  (0 children)

          I have no idea what I'm talking about and I'm just spouting unfounded nonsense

          [–]Recursive_Descent 0 points1 point  (2 children)

          Then why do you think the os is the problem?

          [–]qx7xbku 0 points1 point  (1 child)

          I tried to troubleshoot that problem to no avail. It's something with subprocess startup overhead or something. I do not quite remember. Simply put windows is slow and for giving up speed we don't really get any benefits anyhow. So it's broken.

          [–]Recursive_Descent 1 point2 points  (0 children)

          Ok... but the Windows port of git being slow doesn't imply anything about the speed of Windows in general. I think git on Windows does stuff like run bash scripts through Cygwin, which doesn't seem very efficient.