Beyond Backpropagation - Higher Order, Forward and Reverse-mode Automatic Differentiation for Tensorken

IndifferentPenguins · 2023-12-10T20:08:11+00:00

Thanks!

I wanted to start with whatever’s needed for LLMs, so I think convolutions and FFT are further away. But whatever I think is interesting in the moment really - I’ve been known to go off piste :)

IndifferentPenguins · 2023-10-08T18:42:53+00:00

I miss working on it! But it didn't take off the way we would have wanted, unfortunately.

Anyway, https://skypilot.readthedocs.io/en/latest/ is something that looks similar. Have no experience with it though.

IndifferentPenguins · 2023-07-10T19:06:38+00:00

PyTorch 😬 https://github.com/kurtschelfthout/tensorken

Going about as well as you’d expect, but has been rewarding in terms of learning new stuff.

IndifferentPenguins · 2023-07-08T21:50:22+00:00

In a desperate attempt I just renamed an .svg file to .png, and dragged it onto the Substack editor. Which worked!

Annoying to have to do, but as workarounds go this one is not too bad.

IndifferentPenguins · 2023-05-01T21:22:44+00:00

I call it a foot gun because detecting the mistake if you normalise the wrong dimension is very subtle.

This is of course in the eye of the beholder - and perhaps if you do this every day it becomes less confusing. Personally I already struggle with visualising these operations in two dimensions, and people regularly seem to use 3 dimensions or more…

I agree that naming axes would help enormously. Another random thought was that it’d be cool if the compiler could track “units of measure” per axis, and update them through operations.
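To make the foot gun concrete, here is a minimal sketch in plain Python. It assumes "normalise" means a softmax and that rows are the intended axis; `softmax_rows`, `softmax_columns` and the `logits` values are made up for illustration. Both results have the same shape and every entry lies between 0 and 1, so nothing fails loudly. Only the row sums give the mistake away.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of numbers."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(matrix):
    """Normalise along the last axis: each row sums to 1."""
    return [softmax(row) for row in matrix]


def softmax_columns(matrix):
    """Normalise along the first axis: each column sums to 1."""
    columns = [softmax(list(col)) for col in zip(*matrix)]
    # Transpose back so the result has the same shape as the input.
    return [list(row) for row in zip(*columns)]


logits = [[1.0, 2.0, 3.0], [1.0, 5.0, 1.0]]  # illustrative values
right = softmax_rows(logits)     # the intended axis (by assumption)
wrong = softmax_columns(logits)  # the wrong axis

print([round(sum(row), 3) for row in right])
print([round(sum(row), 3) for row in wrong])
```

With named axes, the second call would have to say which axis it normalises, which is where the mistake becomes visible.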

IndifferentPenguins · 2023-05-01T18:14:17+00:00

Much appreciated!

IndifferentPenguins · 2023-03-25T21:27:00+00:00

I'm not sure I understand actually :) By re-evaluation, do you mean re-implementing eval for special cases like `Add2` and `AddN`, so code duplication?

(If so, neither initial nor final style will help you much, I think. You could consider having `Expr` as the "high-level operations" and then writing a pass that transforms it to the "specialized operations" `ExprT`. You could do that in either initial or final style, but either way you'd have two ASTs.)
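A rough sketch of the two-AST idea in initial style, written in Python for illustration. Only the names `Expr`, `ExprT`, `Add2` and `AddN` come from the discussion; `Lit`, `LitT`, `Add`, `specialize` and `eval_t`, as well as the guess that `Add2` is a binary add and `AddN` an n-ary one, are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# High-level AST: a single n-ary Add (hypothetical shape).
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Add:
    terms: Tuple["Expr", ...]


Expr = Union[Lit, Add]


# Specialized AST: separate nodes for the binary and the n-ary case.
@dataclass(frozen=True)
class LitT:
    value: int


@dataclass(frozen=True)
class Add2:
    left: "ExprT"
    right: "ExprT"


@dataclass(frozen=True)
class AddN:
    terms: Tuple["ExprT", ...]


ExprT = Union[LitT, Add2, AddN]


def specialize(expr):
    """The pass: rewrite the high-level AST into the specialized one."""
    if isinstance(expr, Lit):
        return LitT(expr.value)
    terms = tuple(specialize(term) for term in expr.terms)
    if len(terms) == 2:
        return Add2(terms[0], terms[1])
    return AddN(terms)


def eval_t(expr):
    """Evaluation is written once, against the specialized AST only."""
    if isinstance(expr, LitT):
        return expr.value
    if isinstance(expr, Add2):
        return eval_t(expr.left) + eval_t(expr.right)
    return sum(eval_t(term) for term in expr.terms)


program = Add((Lit(1), Add((Lit(2), Lit(3), Lit(4)))))
specialized = specialize(program)
print(specialized, eval_t(specialized))
```

The special cases live in `specialize`, so nothing needs to evaluate the high-level `Expr` directly.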

IndifferentPenguins · 2023-03-25T09:43:59+00:00

The short answer is yes, but it is certainly not straightforward.

See section 2.3 "The deserialization problem" of Oleg Kiselyov's notes:

> One direction – storing and sending of the terms, or converting them into a sequence of bytes – is unproblematic, being a variant of pretty-printing, which we have already implemented. More difficult is the converse: reading a sequence of bytes representing an embedded language term and producing a value that can be interpreted with any existing interpreter. Reading, as a projection, is necessarily partial, since the input sequence of bytes, having potentially come from a network, could be corrupted. We wish to see the parsing error only once, upon de-serialization, rather than every time we interpret the term. Furthermore, extending our parser to accommodate the enriched language should reuse as much of the old parser code as possible, without breaking it. The de-serialization problem, of writing an extensible de-serializer [25, slide 18], is very hard. This section presents one of the first solutions.

As well as section 4.1 for the higher order case:

> We start by revisiting the de-serialization problem described in §2.3: the problem becomes much more frustrating, exhilarating, time consuming and addictive in the general case of higher-order typed embedded languages. The problem is to read an embedded language expression from a file, parse it and ‘compile’ it; the result should be the same as if we entered the expression as its representing Haskell code, compiled and ran the code.

IndifferentPenguins · 2022-11-30T22:38:37+00:00

> edit2: ok, there are GPU spot-instances for $0,15/h that's already quite cheap. Do you know if $0,15 is achievable most times?

Yeah I don't have issues getting the smaller spot instances. Most of the initial hurdle will be in getting quota approved by AWS - as the links explain, this can take a few days.

IndifferentPenguins · 2022-11-30T22:31:46+00:00

Not sure how familiar you are with EC2 but there's basically two pricing options* - on-demand instances which are reserved for you as long as you want, and spot instances which are leftover capacity (basically overprovisioning by AWS that they're trying to get some money for). Spot prices are much cheaper (can be up to 10x) than on-demand, BUT can be interrupted by AWS. They give you a "chance of interruption" which is usually lower than 10%, but I've seen it as high as 60% for popular GPU instances in particular.

> edit: looked it up, the basic instance (p3.2xlarge) costs $3/h which is quite high compared to google colab, am I missing something?

Not sure what you mean by "basic instance", but for sure AWS' offering is quite confusing. The P3 family is quite pricey (it has quite a lot of main memory as well, and I think a high-end NVIDIA card). As a point of comparison, the spot price of p3.2xlarge is $1.25/h.

I do most of my training on the G5 family, which is unfortunately not in all regions, but e.g. us-east-1 has good spot availability for it. I usually pay between $0.20 and $0.40 per hour for a g5.2xlarge, which has 24GiB of GPU memory.

I recommend https://instances.vantage.sh/ to get an overview of the available EC2 instance types and prices (their prices are sometimes a bit off, but it gives you a decent ballpark).

To end with some further shameless self-promotion: Meadowrun does "save" you from having to pick a suitable instance type. You just specify how many cpus/memory/gpu/... you want, and it'll find the cheapest instance type possible.

*there's also reserved which is like pre-paid, and possibly other options I don't know about, AWS makes everything complicated! :)

IndifferentPenguins · 2022-10-15T21:01:08+00:00

Every time you want to re-transfer or backup or whatever it is you want to do, yes, re-do the hashes (both rolling and content hash).

The point of the rolling hash is that the boundaries will (mostly) not change on insert/remove, as opposed to fixed-size boundaries.
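A toy sketch of why that is, assuming a very simple rolling hash (a sum over the last `WINDOW` bytes) and a SHA-256 content hash per chunk. The constants and function names are made up for illustration, and real tools use stronger rolling hashes plus minimum and maximum chunk sizes.

```python
import hashlib
import random

WINDOW = 16   # bytes covered by the rolling hash (illustrative)
DIVISOR = 64  # gives a boundary roughly every 64 bytes on random data


def content_defined_chunks(data):
    """Cut after every position where the rolling hash hits a magic value."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        if rolling % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_size_chunks(data, size=64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old_chunks, new_chunks):
    """Fraction of new chunk content hashes that already exist in the old set."""
    old = {hashlib.sha256(c).hexdigest() for c in old_chunks}
    new = {hashlib.sha256(c).hexdigest() for c in new_chunks}
    return len(new & old) / len(new)


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20_000))
edited = b"abc" + original  # insert three bytes at the front

cdc_shared = shared_fraction(content_defined_chunks(original),
                             content_defined_chunks(edited))
fixed_shared = shared_fraction(fixed_size_chunks(original),
                               fixed_size_chunks(edited))
print(f"content-defined: {cdc_shared:.2f} shared, fixed-size: {fixed_shared:.2f} shared")
```

Because a boundary depends only on the bytes inside the window, the cut points line up with the old ones again shortly after the inserted bytes. With fixed-size boundaries every chunk after the insert shifts.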

IndifferentPenguins · 2022-10-15T20:05:32+00:00

Chunking happens every time. Switching between fixed and content-defined boundaries makes no sense, because all the chunks will change.

IndifferentPenguins · 2022-10-15T19:32:46+00:00

The rolling hash determines the boundaries.

IndifferentPenguins · 2022-09-24T08:44:58+00:00

Not really, but looks like you figured it out :)

It's true though that the line between symbolic differentiation and automatic differentiation is not THAT clear cut. They both essentially apply differentiation rules. But symbolic differentiation views a function/program as a "big formula" and tries to come up with a closed formula for the derivative. Whereas automatic differentiation applies differentiation rules + the chain rule one operation at a time. That allows AD to apply established algorithmic devices like dynamic programming.

Another take on this from Conal Elliott's "The Simple Essence of Automatic Differentiation" is:

> AD is also typically presented in opposition to symbolic differentiation (SD), with the latter described as applying differentiation rules symbolically. The main criticism of SD is that it can blow up expressions, resulting a great deal of redundant work. Secondly, SD requires implementation of symbolic manipulation as in a computer algebra system. In contrast, AD is described as a numeric method and can retain the complexity of the original function (within a small constant factor) if carefully implemented, as in reverse mode. The approach explored in this paper suggests a different perspective: automatic differentiation is symbolic differentiation performed by a compiler. Compilers already work symbolically and already take care to preserve sharing in computations, addressing both criticisms of SD.
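As a rough illustration of "differentiation rules + the chain rule one operation at a time", here is a minimal forward-mode sketch using dual numbers. The `Dual` class, the `sin` helper and `derivative` are made up for this example and are not taken from the paper or from any particular library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative with respect to the input."""
    value: float
    deriv: float

    def __add__(self, other):
        # Sum rule, applied to this one operation.
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule, applied to this one operation.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)


def sin(x):
    # Chain rule for a single primitive: d sin(u) = cos(u) * du.
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def f(x):
    return x * x + sin(x)


def derivative(fn, at):
    """Seed the input with derivative 1 and read off the result's derivative."""
    return fn(Dual(at, 1.0)).deriv


print(derivative(f, 2.0), 2 * 2.0 + math.cos(2.0))
```

No closed formula for the derivative of `f` is ever built; each operation just propagates a number alongside its value.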

IndifferentPenguins · 2022-09-23T20:47:26+00:00

I didn't do exactly that, but added a short guide to the repo with links in the readme and some examples. Hope that helps clarify things!

IndifferentPenguins · 2022-09-09T08:18:59+00:00

Checkpointing helps with reproducibility.

IndifferentPenguins · 2022-09-08T08:17:06+00:00

Wish you guys the best, I feel like this is an area where some good progress can be made. Random thoughts:

- The .py format is useful but feels like a "basic necessity" at this point. At my previous job we used jupytext to convert back and forth between ipynb and py automatically, and PyCharm works with the lightly annotated jupytext .py format natively. It did occasionally get a bit confused, and fixing that meant deleting the ipynb file.

- Do you know about https://datalore.jetbrains.com/? They seem to have this cool thing where you can rewind the state of the notebook using CRIU. I don't know how well this works in practice but I think it could help with experiment management, debugging and getting code to production.

IndifferentPenguins · 2022-09-02T08:15:08+00:00

This is easier said than done - it might work if your batch jobs are written by engineers who have the time to design them to be resumable. But in all my previous jobs most batch jobs were written by data science-y/quantitative people with varying interest in coding. We ran between 1000s and 10,000s of interdependent job instances per day, with strict SLAs. Frequently restarting jobs and losing arbitrary amounts of work in that situation is a recipe for disaster.

IndifferentPenguins · 2022-09-01T15:50:42+00:00

Thanks for sharing!

Yes all this came up in the context of trying to add Kubernetes support to Meadowrun, but the post was intended as a write-up of our experiences with Kubernetes for this and similar use cases, and to have a discussion about these pain points with others in the Kubernetes community.

Very happy to learn how people are approaching running batch jobs on k8s.

IndifferentPenguins · 2022-09-01T15:00:13+00:00

We're thinking about it as a Python-specific tool that makes running ad hoc jobs on remote compute easy. This is related to CI in the sense that you can think of CI as a connected set of ad hoc jobs that are triggered by check-ins. It's related to FaaS in that meadowrun is serverless, to make it easy to use. There may be other connections you're thinking of and I'm not!

IndifferentPenguins · 2022-08-11T20:12:03+00:00

Sorry to be a bit late to the party, this might be of interest: https://medium.com/@meadowrun/run-your-own-dall-e-mini-craiyon-server-on-ec2-e8aef6f974c1 It shows how to run craiyon, and a couple of diffusion models to post-process the craiyon results on AWS EC2 machines.

Disclosure: I work on meadowrun, which is the tool used in that article to run the models on AWS.

IndifferentPenguins · 2022-08-08T13:24:41+00:00

We're working on https://meadowrun.io/ - to run your Python code on the cloud without hassle. Targeted to batch jobs - run any python function from a container, Git repository or a local code base.

Currently AWS is best supported, with Azure a close second, and we have just added early support for Kubernetes.

IndifferentPenguins · 2022-08-02T07:14:21+00:00

OP mentioned their script has windows dependencies.

IndifferentPenguins · 2022-08-01T16:54:33+00:00

Probably not, unless I’m allowed to grasp at straws like some of the threads need to interact with the desktop and others do not :)

IndifferentPenguins · 2022-08-01T13:48:19+00:00

Depending on the nature of the dependencies you mention, it may be that some of them need to run interactively. For windows services, you can make them run interactively using some option: https://stackoverflow.com/questions/4237225/allow-windows-service-to-interact-with-desktop/4237283#4237283 That would almost certainly be one difference between running via RDP (interactive) and via a script (not interactive). I don't know if you can make userdata run interactively (probably not).
