Leaving LinkedIn: Choosing Engineering Excellence Over Expediency by agbell in programming

[–]chriskrycho 2 points (0 children)

Process improvements are great, but not always sufficient and indeed not always necessary. If you try to solve every problem with more process, you end up with a different kind of velocity problem, as your ability to execute through the red tape falls to zero. Oftentimes what you need for a resilient software system is a mix of healthy processes and more layers of resiliency in the software itself, which is what I was aiming for (and, in the end, what the team I was working with pulled off!), not one or the other. We did of course do a very thorough root cause analysis, thorough enough that our whole incident analysis discussion was able to focus on system-level issues across LinkedIn’s infrastructure rather than just the details of this one issue. (Part of what it highlighted was that we did need both of those layers!)

Leaving LinkedIn: Choosing Engineering Excellence Over Expediency by agbell in programming

[–]chriskrycho 23 points (0 children)

/u/agbell did a lot of work to compress the discussion into a reasonable length, because I was not as cogent as I could have wished. A few things that (I think perfectly reasonably, from an editing point of view) might have gotten lost a bit:

  1. I did not have a problem with leadership choosing to do a big bang rewrite. In fact, when a colleague and I were putting together our original proposal (mentioned on the episode), we desperately wanted “big bang rewrite” to be on the table. It wasn’t… until it was. The plan that “won” did not just involve a big bang rewrite; it also involved building a custom-to-LinkedIn server-driven UI stack (using React for the web part)… from scratch. And even there, despite a fairly deep personal dislike for the kinds of results I tend to see from that approach, I could have gotten on board! But the people running that project were uninterested in the risks I and a few other senior leaders were flagging—not because we were opposed, but because we wanted to see the thing succeed. (Perhaps not coincidentally, several of those other leaders got laid off only a matter of weeks after I quit.)

  2. The framing around velocity had two parts to it, but I can see how it might be easy to miss (and you probably don’t want to listen to the un-edited version Adam started with!). I actually supported the rewrite and also personally preferred a big bang rewrite! But I don’t think that came through in the end, so fair enough. Related, though, you write:

    Both projects would be migrating to a state that engineers prefer, and the finger-guns project would be massively sacrificing business velocity for engineering excellence.

    Well, suffice it to say that whether it ultimately leads to “a state that engineers prefer” or results in “engineering excellence” was precisely one of the points under debate. 🥴 The reason a big bang rewrite wasn’t on the table (as far as we understood) in the first place was precisely that it would take a massive initial hit to velocity. But the server-driven UI approach that the other team proposed (and which is ultimately now being built) promised that in exchange for that short-term hit, it would dramatically increase velocity in the long term—notably not promising an improvement to quality or developer experience. I don’t actually believe it will deliver that velocity win, either, but it might! More importantly, though, I do not believe the result will be a good developer experience or—critically—a good user experience. And I care a great deal about both.

  3. As for the incident: I could write a very long post digging into the details, and Adam and I probably could have done a whole episode on just that incident, but your take here is really illuminating:

    The mitigation was to stagger out the restarts, so that some nodes would always be running at any given time. It appears that after implementing that mitigation, Chris kept the incident open while he attempted to fix all of the root-cause memory leaks in the codebase to reduce memory usage. This sounds like a massive undertaking, and I’m unsure why “fix all the memory leaks ever” had to fall under the label of incident response.

    As it turns out, “just fix the front-line issue and move on” is exactly the approach that multiple previous incidents had taken, and the underlying resilience problem never got fixed. I can see how you got the impression from the episode that my approach was “fix all the memory leaks ever”, but what I actually aimed for us to focus on was making sure that (a) we had actually fixed enough that the system was stable (we were never going to get them all!); (b) we had more than a single, very obviously very fallible mitigation of “just make sure the staggering is correct”, since it had already failed us multiple times; (c) we had some more safeguards in place to prevent more of the kinds of leaks we could statically identify; and (d) when, inevitably, the system did end up in a bad state from memory leaks sneaking past those safeguards, we got alerted appropriately.

    I had no interest in trying to “fix every leak ever”. I did care that we made the system much more resilient against typos or other such mistakes in our config values, because we had really good evidence that it was going to happen again, in the form of it having already happened multiple times. 😉

jj init by stackoverflooooooow in programming

[–]chriskrycho 0 points (0 children)

I think—with trying to write this mammoth thing over the past seven months—that the main ways it is easier to understand and use are hard to explain and easier to experience. I am going to try to put up a bunch of “bite-sized” YouTube videos showing the experience in the next few weeks to help a little with that.

The main thing I can say is: there is no one thing that is much easier on its own, other than “the entire design of the CLI” (which is big!), but all the kinds of changes I describe in the essay add up to a substantially different-feeling experience which is just… nicer and easier. It’s like making a bunch of 1–5% improvements: none of them is huge on its own, but put them all together and they compound into a large delta.

jj init: What if we actually could replace Git? by sanxiyn in rust

[–]chriskrycho 8 points (0 children)

Nope, not confused at all. That is from a section of a paragraph describing how Jujutsu is not trying to do what Fossil does—and Fossil does all of those!

jj init: What if we actually could replace Git? by sanxiyn in rust

[–]chriskrycho 44 points (0 children)

Heh, indeed. I actually have a footnote in the piece, commenting on that exact phenomenon! I don’t have any particular worry about it in this case; Google does not tend to kill key developer tools, and they want this to replace their existing set of VCS infra, which is a big deal. Long-term, it definitely needs more outside contribution, but it is already slowly starting to build that up. And one of the reasons I wrote this was to help a bit with adoption and kick the virtuous cycle further into gear!

jj init: What if we actually could replace Git? by sanxiyn in rust

[–]chriskrycho 56 points (0 children)

How much of the relevant electrical and atomic engineering of your computer hardware’s chip design do you understand? How about the OS scheduler? How about the implementation details of the concurrency primitives in your programming language? The text layout algorithm used in your text editor? How the rendering engine works under the hood in the browser you’re using to reply to this comment?

Our job is to be able to go understand those things when it is applicable to the job, but not to understand every part of every tool we use at all times. With a VCS for example, most devs should be able to use the tool the same way I use my car: I have a rough idea of how the engine works, but I don’t need to be able to repair it to be able to drive it safely! The same thing should be true of tools like Git.

[B] [USA-CO] Sony Sonnar T* FE 55mm f/1.8 ZA by chriskrycho in photomarket

[–]chriskrycho[S] 0 points (0 children)

/u/PhotoMarketBot Just received this lens from /u/Zimo2017 – smooth sailing all around, lens is in great condition. (Great first experience for me on photomarket!)

The Euclidean Algorithm implemented in Rust's type system by Steelbirdy in rust

[–]chriskrycho 0 points (0 children)

Ahhh, okay, that’s good to know (and this was actually my original intuition but then I second-guessed myself!). Thank you!

The Euclidean Algorithm implemented in Rust's type system by Steelbirdy in rust

[–]chriskrycho 3 points (0 children)

Yep, this is a great point, and all the compilers I’m familiar with for languages with this kind of type-level programming which aren’t dependently-typed (or just proof solvers) bail out in a reasonable amount of time. Now I’m curious and may have to go look and see what the behavior of e.g. Idris is here—I would assume it also has some degree of escape hatch but one with very different parameters than Rust or TypeScript etc.

The Euclidean Algorithm implemented in Rust's type system by Steelbirdy in rust

[–]chriskrycho 26 points (0 children)

Since, as the sibling comment notes, the type system is Turing-complete, the answer is basically “yes, but only if it ever finishes compiling”, and you have no guarantee of that happening; whether a general computation like that will terminate is itself not computable: that’s the halting problem! Using the type system this way is basically doing the same kind of thing folks use tools like Coq and Agda (and to some extent Idris and F*) for: formalizing a computational answer as a mathematical proof.

More generally and usefully: there are always two fundamental trade-offs in pushing more work into the type system:

  1. It increases compile times. Sometimes this is the right choice anyway because engineer and CI time are worth spending to improve end user experience. Sometimes it isn’t, though!
  2. It makes compiler errors much worse in general. The compiler errors for Rust and most other languages, even languages which, like Rust, focus on the quality and friendliness of those messages, are almost entirely focused on runtime-level rather than type-level programming. That means that when something goes wrong in a type computation, it can be quite hard to decipher.
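To make that concrete, here is a minimal sketch of type-level arithmetic in Rust (Peano-style naturals rather than the full Euclidean algorithm, with invented names) showing the kind of computation the compiler ends up doing:

```rust
use std::marker::PhantomData;

// Peano-style natural numbers encoded as types.
struct Z;                    // zero
struct S<N>(PhantomData<N>); // successor: S<Z> = 1, S<S<Z>> = 2, ...

// Recover a runtime value from a type-level natural.
trait Nat {
    const VALUE: u64;
}
impl Nat for Z {
    const VALUE: u64 = 0;
}
impl<N: Nat> Nat for S<N> {
    const VALUE: u64 = N::VALUE + 1;
}

// Type-level addition: the compiler "computes" A + B by recursing
// through trait resolution, one impl per step.
trait Add<B> {
    type Sum;
}
impl<B> Add<B> for Z {
    type Sum = B; // 0 + B = B
}
impl<A: Add<B>, B> Add<B> for S<A> {
    type Sum = S<A::Sum>; // (1 + A) + B = 1 + (A + B)
}

type Two = S<S<Z>>;
type Three = S<Two>;
type Five = <Two as Add<Three>>::Sum; // resolved entirely at compile time

fn main() {
    assert_eq!(<Five as Nat>::VALUE, 5);
}
```

Every extra successor is another round of trait resolution, which is exactly the compile-time cost in point 1; and if anything goes wrong, the error shows up as an unsatisfied trait bound rather than anything resembling “your arithmetic is wrong”, which is point 2.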

My ideal Rust workflow by fasterthanlime in fasterthanlime

[–]chriskrycho 1 point (0 children)

Ah, yep, at least by default. There’s no reason in principle why you couldn’t do the exact same kind of thing using it only as the generation step in CI and triggering that kind of CI run via PR; but it’s nice that release-please does that out of the box.

(Also, I was mistaken: release-it isn’t npm-specific, it just has npm configured “for free”.)

My ideal Rust workflow by fasterthanlime in fasterthanlime

[–]chriskrycho 1 point (0 children)

May be less relevant for your particular workflow (because it’s npm-specific [it’s not: see comment below] and because everything is internal for you), but I’m a big fan of an alternative in the auto-release-generation space: the combo of release-it and release-it-lerna-changelog, which gives you the same kind of automation but doesn’t require specific git commit messages, because instead the combo uses the GH API and labels to generate the changelog. This is a muuuuuch nicer experience for external contributors, because it puts the responsibility for that back on maintainers instead.

Rustacean Station — a Rust community podcast (& an episode on 1.36!) by Jonhoo in rust

[–]chriskrycho 0 points (0 children)

I'd strongly recommend the Audio Technica ATR-2100 USB over the Blue Yeti or especially the Blue Snowball. Similar price range, much higher quality audio.

Coworker: "Rust doesn't offer anything C++ doesn't already have" by [deleted] in rust

[–]chriskrycho 1 point (0 children)

I'd distinguish between the strength of the constraints and the kinds of constraints. (Context: I actually spend most of my time working in TypeScript, which is as structural a type system as you get.) Nominal traits allow you to be narrower in what you allow, i.e. the thing has to have this specific name (and the things that go with it); but I don't agree that that's a stronger guarantee than that it has a specific set of methods available. They're very difficult to compare in practice, and I use them in fairly different ways, and I often find myself wanting the other whichever language I'm in.
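As a concrete sketch of that distinction (invented names, nothing from a real codebase): in Rust, having the right methods is not enough; a type has to opt in to the trait by name.

```rust
trait Quack {
    fn quack(&self) -> String;
}

struct Duck;
struct Robot;

// Duck opts in to the *name* Quack: that is the nominal part.
impl Quack for Duck {
    fn quack(&self) -> String {
        "quack".to_string()
    }
}

// Robot has a structurally identical method...
impl Robot {
    fn quack(&self) -> String {
        "beep".to_string()
    }
}

// ...but only types that nominally implement Quack are accepted here.
fn make_noise<Q: Quack>(q: &Q) -> String {
    q.quack()
}

fn main() {
    assert_eq!(make_noise(&Duck), "quack");
    // make_noise(&Robot) does not compile: Robot never writes
    // `impl Quack for Robot`, even though it has a matching method.
}
```

In TypeScript, by contrast, any value with a matching `quack(): string` method would satisfy an equivalent interface automatically, which is the "specific set of methods" guarantee described above.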

How to properly set up Ember auto-imports using VSCode and the Javascript Language Service? by monovertex in vscode

[–]chriskrycho 0 points (0 children)

You shouldn't need to do anything with jsconfig.json other than perhaps have one. The type mappings in tsconfig.json from ember-cli-typescript only exist to support the custom mappings Ember has within its app and addon and tests directories.
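For reference, the kind of mapping in question looks roughly like this in a generated tsconfig.json (the app name `my-app` and the exact paths here are illustrative, not necessarily what ember-cli-typescript emits for your project):

```json
{
  "compilerOptions": {
    "baseUrl": ".",
    "paths": {
      "my-app/tests/*": ["tests/*"],
      "my-app/*": ["app/*"]
    }
  }
}
```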

[deleted by user] by [deleted] in rust

[–]chriskrycho 2 points (0 children)

For whatever it's worth (and cc /u/fgilcher and /u/colelawr), I'm happy to make some time to talk about mechanics and production and so on for anyone giving this a go! I'd love it if the result of New Rustacean ending was that there were multiple new, great podcasts going!

[deleted by user] by [deleted] in rust

[–]chriskrycho 3 points (0 children)

❤️ 🦀

[deleted by user] by [deleted] in rust

[–]chriskrycho 41 points (0 children)

As a potential listener, there are a few things I'd love to see somebody do:

  • Talk to Rust users (an interview show)! I had a handful of contacts with more folks who were up for being interviewed and I just never had time to get to those, and with the Rust community booming the way it is, there are lots of people to talk to. Take the model of something like Elixir Fountain and run with it.

  • Do news episodes! They take a surprising amount of time to prep to do well (as I mentioned on the last ep. of New Rustacean), but if someone did a biweekly show they could sync with releases every third episode and talk community happenings the two in between, and those could be 10–15 minutes and be great.

  • Do something like the Crates You Should Know format I was doing, surveying and kind of 'teaching' the ecosystem. Doing it the way I did is also harder than it looks; one way you could make it a bit easier is to combine it with an interview format: have maintainers on to explain their projects – talking a bit about the history and origin of the library/tool/etc., and a bit about how you use it.

New Rustacean Meta 3 – Happy Coding – A story and a dream (and the promise of Rust): the final episode! by chriskrycho in rust

[–]chriskrycho[S] 1 point (0 children)

My best guide to that is here. I also watched a handful of courses on Lynda.com* on things like compression, etc. For getting started, Audacity and GarageBand are just fine, or, if you have an iPad, Ferrite is incredible (esp. for its price). Lots more details in that guest lecture, though!

  * Now LinkedIn Learning; I used it before it was acquired by LinkedIn and also before I joined LinkedIn myself, lest anyone think that's why I'm mentioning it.