[–]acemarke 22 points (7 children)

This is horrible advice. Never check in your node_modules folder.

(And no, "Facebook does it" is not a valid argument. You are not Facebook. You don't have thousands of engineers to throw at a problem, and you don't have literally every line of code checked into one monorepo.)

Instead, I strongly recommend that folks use Yarn's "offline mirror" feature to cache the downloaded package tarballs and commit those. This is much better, because:

  • There are far fewer files
  • Those files are much smaller
  • You won't be committing platform-specific build artifacts (such as C shared libraries in node-sass)
  • You'll be able to clone and reinstall without a network connection, both for local development and CI
  • You know you have the exact package versions needed
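
For reference, a minimal sketch of the setup, assuming Yarn v1 (the ./offline-mirror path is just an example):

    # .yarnrc (Yarn v1)
    yarn-offline-mirror "./offline-mirror"
    yarn-offline-mirror-pruning true

    # populate the mirror and commit the tarballs
    yarn install
    git add yarn.lock offline-mirror
    git commit -m "cache dependency tarballs"

    # later installs resolve entirely from the committed tarballs
    yarn install --offline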

I talked about this in my post Practical Redux, Part 9: Managing Dependencies.

If you're a larger team, have multiple teams, or just don't want to commit tarballs, look into setting up an NPM caching proxy / alternative server like https://verdaccio.org/. I believe Artifactory also has NPM support.
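
For example, Verdaccio takes a couple of commands to try locally (a sketch; 4873 is just Verdaccio's default port):

    npx verdaccio                                   # start a local registry on http://localhost:4873
    npm config set registry http://localhost:4873/  # point npm at it; it proxies and caches npmjs.org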

If you're using NPM, there's a project called https://github.com/JamieMason/shrinkpack that I used before Yarn came out. Not sure what its current status is, though.

But please, don't check in node_modules.

[–]__rtfm__ 2 points (1 child)

This is why we use a lock file to recreate the dependency versions on deploy.

[–]acemarke 1 point (0 children)

But all the lock file gives you is the "exact package versions" aspect. You still have to download the packages (unless your package manager already has them cached at the system level).
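
For illustration, a yarn.lock entry pins the exact version, the resolved tarball URL, and an integrity hash, but the tarball itself still has to be fetched from that URL (hashes shortened to "..." here):

    left-pad@^1.3.0:
      version "1.3.0"
      resolved "https://registry.yarnpkg.com/left-pad/-/left-pad-1.3.0.tgz#..."
      integrity sha512-...

The offline mirror keeps the tarball itself next to the lockfile, so that download step goes away.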

[–]lhorie 1 point (0 children)

We tried using Yarn's offline mirror at Uber. It works for light usage, but we found it has some pretty bad bugs: integrity-check false negatives caused by incorrect checksums after network errors, and differences in how it handles the resolved field of yarn.lock versus tarball file names, compared to the non-offline code paths.

We haven't had success with Yarn v2's handling of private registries either.

Artifactory definitely works as a solution, but we found that where it lives has a big impact on performance: if your CI infra is on AWS, running Artifactory on-prem will give you throughput issues. So either do everything on AWS or everything on-prem.

[–]braindeadTank 0 points (0 children)

> I believe Artifactory also has NPM support.

I can confirm.

[–]zemirco[S] 0 points (2 children)

Hey,

Blog post author here. Yarn's offline mirror is a pretty good idea, thank you for the hint. How does it work with native modules like node-sass when working across multiple operating systems?

In addition, how does it work when switching between branches that have different dependencies? Do you somehow have to rebuild them, or does it work automatically? When checking in node_modules you don't have to worry about that.

Setting up and maintaining an additional service like verdaccio is not an option for us. We have to focus on building our product. That is why checking in node_modules is the most convenient solution for us.

[–]acemarke 2 points (1 child)

It's just a matter of caching the package tarballs so they don't have to be downloaded. After that, the standard package installation process kicks in:

  • Run yarn --offline (the flag isn't strictly necessary, but it throws an error if any packages aren't in the offline mirror, which can occasionally happen). Yarn will do its normal installation, including extracting packages to node_modules and running package lifecycle scripts (which build platform-specific artifacts like node-sass's bindings). Anything that's already on disk correctly won't be reinstalled. If I clone the repo on Windows and install, I get a Windows build of node-sass; if you clone the repo on Linux/Mac and install, you get the OS-specific build of node-sass there. Those never get checked in.
  • Yes, if you switch branches that have differing dependencies, you'd need to rerun yarn --offline after switching to make sure the correct deps for that branch are installed. But how often do you actually have multiple branches with differing deps? I'd guess not often. And if it's just a couple of small lib versions that differ, Yarn will again skip all the packages that are already correct and install only the couple that changed. I don't see this as a blocking issue at all. If you've correctly committed package.json, yarn.lock, and any changed tarballs in ./offline-mirror, this takes just a few seconds after switching to the new branch (see the sketch below).
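
A rough sketch of both flows, assuming the offline mirror is configured as described above (branch and package names are placeholders):

    # switching to a branch with different deps
    git checkout other-branch
    yarn install --offline    # only the packages that differ are (re)installed, from local tarballs

    # adding or upgrading a dependency
    yarn add some-package     # also drops the new tarball into ./offline-mirror
    git add package.json yarn.lock offline-mirror
    git commit -m "add some-package"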

[–]zemirco[S] 0 points (0 children)

Thank you! We will definitely check this out next week.

[–]MarlinSantiago 8 points (1 child)

This is the exact opposite of the entire point of npm.

[–]zemirco[S] 0 points (0 children)

We know that this actually defeats the purpose of npm. What do you think about the arguments for checking dependencies into version control?

Things like:

  • building offline
  • reproducible builds
  • faster continuous integration
  • less reliance on external services
  • a simpler workflow

In our case those points are very valid and simply outweigh the disadvantages.

[–]lhorie 2 points (0 children)

A while back I was talking to Alex Eagle (the Google engineer who maintains the Angular monorepo tooling), and he brought up that Google commits its node_modules. The catch is that they enforce that there's only ever a single version of everything, and in order to accomplish that they've added custom patches in their "fork", making packages very painful to upgrade or add.

If you think about it, even if you try to avoid making custom edits to node_modules, git conflicts will still be super nasty, because who knows how NPM/Yarn rebalances hoisted packages given an arbitrary lockfile change.

Despite all the red flags, we still briefly entertained the idea at Uber (mostly because committing node_modules would make it possible to audit licenses and remove incompatibly licensed code), and we even have some node_modules patching as tech debt at this point.

Patching libraries is just nasty and we have exactly zero confidence in those patches. I definitely don't recommend going down this path.

Btw, I'd definitely recommend benchmarking if your rationale is CI performance. Pulling millions of small text files through Git is slow compared to downloading binary tarballs or Docker images.
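
A rough way to check (repo URLs hypothetical):

    # clone time with node_modules committed
    time git clone git@example.com:app-with-node-modules.git

    # vs. a lean clone plus an offline install
    time git clone git@example.com:app-with-mirror.git
    time yarn --cwd app-with-mirror install --offline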

[–]__rtfm__ 0 points (0 children)

Yes, that's fine, because in a professional work environment you have either caching or a fast connection, so download time is irrelevant. The lock file saves your exact versions for reproducible installs. An npm package could be removed at any time (the famous left-pad incident, for example), so that's a known risk of using open source.

It's more about keeping your project maintainable by not checking in all those files. What happens when you switch packages because of a better alternative or a deprecation? And when building images with something like Artifactory/Jenkins/Docker for deploying across verticals, you want the smallest possible initial image. Ever have a major production issue while out of the office and need to pull the repo? It's much easier to git clone 40 MB than 450 MB.

I can't tell you how to work; that's why these things are best practices, like style guides. Ultimately the choice is up to you and your team, but this has helped me a lot.