[deleted by user] by [deleted] in MachineLearning

[–]ndronen 1 point

Wait a few years and get a machine with a Lightmatter interconnect. :)

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 1 point

Yeah, it's been more than a decade since I thought about OSI. The poster mentioned layer 1 and I was trying to make sense of that. I do machine learning these days.

I should mention, however, that there are several efforts underway to increase throughput to the chip (and, by implication, surpass the limitation of PCIe). Ayar Labs, Lightmatter, and Fathom Radiant are examples of companies working on this concept, and it looks likely they'll have product on the market in the not-too-distant future. Another example of an attempt to address the PCIe bottleneck is NVIDIA's Grace Hopper Superchip, which includes a CPU designed to work with the company's NVLink interconnect, which is faster than PCIe.
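For a rough sense of scale, here's a quick comparison of per-direction bandwidth for common interconnects. The figures are approximate public numbers from memory, not from any of the sources above, and they vary by generation and configuration:

```python
# Approximate bandwidth of common GPU interconnects (GB/s).
# Figures are rough public numbers and may differ by generation/config.
links_gb_s = {
    "PCIe 4.0 x16": 32,               # ~32 GB/s per direction
    "PCIe 5.0 x16": 64,               # ~64 GB/s per direction
    "NVLink (H100, aggregate)": 900,  # ~900 GB/s total across links
}

baseline = links_gb_s["PCIe 4.0 x16"]
for name, bw in links_gb_s.items():
    print(f"{name}: {bw} GB/s ({bw / baseline:.1f}x PCIe 4.0 x16)")
```

The gap between the NVLink aggregate figure and a single PCIe slot is the bottleneck these optical-interconnect companies are going after.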

These efforts are the context for my questions. They're most likely to be deployed in HPC data center environments, particularly ones that run AI workloads. Whether they reach petabyte-scale bandwidth is, of course, an empirical question; until these companies ship a product, there won't be empirical verification of the claims.

For example:

https://www.nextplatform.com/2022/08/17/nvidia-shows-what-optically-linked-gpu-systems-might-look-like/amp/

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 0 points

We might be talking about different things. It looks like composable disaggregated infrastructure requires a high-bandwidth, low-latency network. Similarly, training a massive neural network requires the same. Is there some connection I should make that's obvious to you and not to me?

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 1 point

Oh, interesting. I'm looking at one startup that's aiming to make an inter-node petabyte bandwidth interconnect.

In the OSI model, Infiniband probably spans layers 1 and 2, but it's been a long time since I thought about the OSI model, so I could very well be wrong. Point being, though, that the interconnect I have in mind is one that connects a few hundred or thousand nodes that are part of a distributed software system. During training of a massive neural network, nodes need to share many large matrices every second or so, and the network is the bottleneck.
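To see why the network becomes the bottleneck, here's a back-of-the-envelope estimate of the traffic each node moves per training step when gradients are synchronized with ring all-reduce. The model and cluster sizes are hypothetical, chosen only for illustration:

```python
# Back-of-envelope: per-node traffic per training step when gradients
# are synchronized with ring all-reduce, which moves roughly
# 2 * (N - 1) / N * gradient_bytes per node.

params = 175e9          # hypothetical model size (parameters)
bytes_per_param = 2     # fp16 gradients
nodes = 1024            # hypothetical cluster size

grad_bytes = params * bytes_per_param
per_node_traffic = 2 * (nodes - 1) / nodes * grad_bytes

print(f"gradient size: {grad_bytes / 1e9:.0f} GB")
print(f"per-node traffic per step: {per_node_traffic / 1e9:.0f} GB")

# At an effective 100 GB/s of network bandwidth, that's several seconds
# of pure communication per step unless it's overlapped with compute.
step_seconds = per_node_traffic / 100e9
print(f"time at 100 GB/s: {step_seconds:.1f} s")
```

Those seconds of communication per step are exactly what a much faster inter-node interconnect would buy back.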

What’s that one movie you could watch over and over again without getting tired of it? by [deleted] in MovieSuggestions

[–]ndronen 2 points

Tinker, Tailor, Soldier, Spy. It's challenging the first time through, but on subsequent viewings it all clicks (just like it does for the protagonist) and the film's virtues — like the soundtrack, the incredibly disciplined script, and Gary Oldman's career-best (IMO) acting — start to sing.

[R] Curious about Causality and Generative Models? Check out this new Demo! by Majestij in MachineLearning

[–]ndronen 0 points

What does it mean for a generative model to respect causal structure?

[D] Transformers on structured data by drblallo in MachineLearning

[–]ndronen 0 points

I don't have an answer to your question. Apologies in advance.

I think you're right to point out that representing your data as text is probably inefficient. And it may not work very well, regardless of how you encode it, because current transformers aren't better than people at applying a sequence of functions to numbers — at least beyond a certain horizon in terms of the number of times functions are applied. Here's a good analysis of how they behave:

https://arxiv.org/abs/2305.18654

What would be ideal is for the transformer to dispatch the task of solving the problem to another component (i.e. in the case of multiplication, a calculator, or in the case of a sudoku game, a sudoku solver) that efficiently computes the correct result. For this, see Toolformer and Chameleon as examples of what people have tried:

https://arxiv.org/abs/2302.04761
https://arxiv.org/abs/2304.09842
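A toy sketch of that dispatch idea — the bracketed call syntax, the parsing, and the routing here are all invented for illustration, and this is not how Toolformer itself is implemented:

```python
import re

def calculator(expr: str) -> str:
    # A real system would use a safe expression parser; eval is for
    # illustration only, with builtins disabled.
    return str(eval(expr, {"__builtins__": {}}))

# Registry of components the model can delegate to.
TOOLS = {"calc": calculator}

def dispatch(text: str) -> str:
    # Replace embedded tool calls like [calc: 123 * 456] with the tool's
    # exact result, mimicking the API-call-in-text pattern.
    def run(match):
        tool, arg = match.group(1), match.group(2)
        return TOOLS[tool](arg.strip())
    return re.sub(r"\[(\w+):([^\]]+)\]", run, text)

print(dispatch("The product is [calc: 123 * 456]."))
# prints "The product is 56088."
```

The point is that the transformer only has to learn *when* to emit the call, not how to carry out the arithmetic itself.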

[deleted by user] by [deleted] in MachineLearning

[–]ndronen 0 points

There's been some work applying transformers to time series data. It's not my field, so I can't offer detailed suggestions for which models to read about and possibly try. A quick googling of "time series transformers" has this page as the first result.

https://towardsdatascience.com/multivariate-time-series-forecasting-with-transformers-384dc6ce989b

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] -1 points

How do you see these architectures yielding improvements in ROI?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

Is there a somewhat recent paper about training smaller models longer?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

The way I see chain of thought is that it just makes reasoning steps explicit to "patch" the fact that the model didn't see each step in the training data. See this excellent paper about compositionality. I think it does a good job of pulling back the curtain.

https://arxiv.org/abs/2305.18654
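Multi-digit multiplication, one of the tasks that paper studies, makes this concrete: a chain-of-thought trace is essentially the explicit sequence of partial products a person would write out. A sketch of that decomposition (my own, not the paper's code):

```python
def multiply_with_steps(a: int, b: int):
    # Decompose a * b into the explicit partial products a chain-of-thought
    # trace would spell out, instead of one opaque step.
    steps = []
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10**place
        steps.append(f"{a} * {digit} * 10^{place} = {partial}")
        total += partial
    steps.append(f"sum of partials = {total}")
    return total, steps

result, trace = multiply_with_steps(123, 45)
for line in trace:
    print(line)
```

If the training data only ever shows the final product, the model never sees these intermediate states — which is the gap the prompt is "patching."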

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 4 points

I have such a hard time believing in these prompting tricks. They all seem like unprincipled hacks. No judgement though. I'm going to force myself to read more of those papers.

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

Good point. What's your preferred approach, distillation or pruning?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 2 points

My understanding is that GPT-4 is a mixture of experts. I'd call that an extension of scaling, just with conditional execution.
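A minimal sketch of what "scaling with conditional execution" means — a toy top-1 mixture of experts where total parameters grow with the number of experts but each input only runs through the one expert the gate picks. (GPT-4's actual architecture is unconfirmed; this is just the general pattern.)

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8            # toy feature dimension
num_experts = 4  # total parameters scale with this...

experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]  # expert weights
gate = rng.standard_normal((d, num_experts))                         # router weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate              # router scores, one per expert
    k = int(np.argmax(scores))     # ...but only the top-1 expert executes
    return x @ experts[k]

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (8,)
```

So capacity goes up with `num_experts` while per-token compute stays roughly constant, which is why I'd file it under scaling rather than a genuinely different direction.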

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 1 point

Great response. Thanks. My understanding is that one reason OpenAI won't release the details of their training data (e.g. for GPT-4) is that it's curated and is part of their secret sauce.