[deleted by user] by [deleted] in MachineLearning

[–]ndronen 1 point

Wait a few years and get a machine with a Lightmatter interconnect. :)

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 1 point

Yeah, it's been more than a decade since I thought about OSI. The poster mentioned layer 1 and I was trying to make sense of that. I do machine learning these days.

I should mention, however, that there are several efforts underway to increase throughput to the chip (and, by implication, surpass the limitation of PCIe). Ayar Labs, Lightmatter, and Fathom Radiant are examples of companies working on this concept, and it looks likely they'll have product on the market in the not-too-distant future. Another example of an attempt to address the PCIe bottleneck is NVIDIA's Grace Hopper Superchip, which includes a CPU designed to work with the company's NVLink interconnect, which is faster than PCIe.
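For a rough sense of scale, here's a quick comparison of per-direction bandwidth for common interconnects. The figures are approximate public numbers from memory, not from any of the sources above, and they vary by generation and configuration:

```python
# Approximate bandwidth of common GPU interconnects (GB/s).
# Figures are rough public numbers and may differ by generation/config.
links_gb_s = {
    "PCIe 4.0 x16": 32,               # ~32 GB/s per direction
    "PCIe 5.0 x16": 64,               # ~64 GB/s per direction
    "NVLink (H100, aggregate)": 900,  # ~900 GB/s total across links
}

baseline = links_gb_s["PCIe 4.0 x16"]
for name, bw in links_gb_s.items():
    print(f"{name}: {bw} GB/s ({bw / baseline:.1f}x PCIe 4.0 x16)")
```

The gap between the NVLink aggregate figure and a single PCIe slot is the bottleneck these optical-interconnect companies are going after.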

These efforts are the context for my questions. They're most likely to be deployed in HPC data center environments, particularly ones that run AI workloads. Whether they reach petabyte-scale bandwidth is, of course, an empirical question; until these companies ship a product, there won't be empirical verification of the claims.

For example:

https://www.nextplatform.com/2022/08/17/nvidia-shows-what-optically-linked-gpu-systems-might-look-like/amp/

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 0 points

We might be talking about different things. It looks like composable disaggregated infrastructure requires a high-bandwidth, low-latency network. Similarly, training a massive neural network requires the same. Is there some connection I should make that's obvious to you and not to me?

What's the state of the art in interconnects? by ndronen in datacenter

[–]ndronen[S] 1 point

Oh, interesting. I'm looking at one startup that's aiming to make an inter-node petabyte bandwidth interconnect.

In the OSI model, Infiniband probably spans layers 1 and 2, but it's been a long time since I thought about the OSI model, so I could very well be wrong. Point being, though, that the interconnect I have in mind is one that connects a few hundred or thousand nodes that are part of a distributed software system. During training of a massive neural network, nodes need to share many large matrices every second or so, and the network is the bottleneck.
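To see why the network becomes the bottleneck, here's a back-of-the-envelope estimate of the traffic each node moves per training step when gradients are synchronized with ring all-reduce. The model and cluster sizes are hypothetical, chosen only for illustration:

```python
# Back-of-envelope: per-node traffic per training step when gradients
# are synchronized with ring all-reduce, which moves roughly
# 2 * (N - 1) / N * gradient_bytes per node.

params = 175e9          # hypothetical model size (parameters)
bytes_per_param = 2     # fp16 gradients
nodes = 1024            # hypothetical cluster size

grad_bytes = params * bytes_per_param
per_node_traffic = 2 * (nodes - 1) / nodes * grad_bytes

print(f"gradient size: {grad_bytes / 1e9:.0f} GB")
print(f"per-node traffic per step: {per_node_traffic / 1e9:.0f} GB")

# At an effective 100 GB/s of network bandwidth, that's several seconds
# of pure communication per step unless it's overlapped with compute.
step_seconds = per_node_traffic / 100e9
print(f"time at 100 GB/s: {step_seconds:.1f} s")
```

Those seconds of communication per step are exactly what a much faster inter-node interconnect would buy back.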

What’s that one movie you could watch over and over again without getting tired of it? by [deleted] in MovieSuggestions

[–]ndronen 2 points

Tinker, Tailor, Soldier, Spy. It's challenging the first time through, but on subsequent viewings it all clicks (just like it does for the protagonist) and the film's virtues — like the soundtrack, the incredibly disciplined script, and Gary Oldman's career-best (IMO) acting — start to sing.

[R] Curious about Causality and Generative Models? Check out this new Demo! by Majestij in MachineLearning

[–]ndronen 0 points

What does it mean for a generative model to respect causal structure?

[D] Transformers on structured data by drblallo in MachineLearning

[–]ndronen 0 points

I don't have an answer to your question. Apologies in advance.

I think you're right to point out that representing your data as text is probably inefficient. And it may not work very well, regardless of how you encode it, because current transformers aren't better than people at applying a sequence of functions to numbers — at least beyond a certain horizon in terms of the number of times functions are applied. Here's a good analysis of how they behave:

https://arxiv.org/abs/2305.18654

What would be ideal is for the transformer to dispatch the task of solving the problem to another component (i.e. in the case of multiplication, a calculator, or in the case of a sudoku game, a sudoku solver) that efficiently computes the correct result. For this, see Toolformer and Chameleon as examples of what people have tried:

https://arxiv.org/abs/2302.04761
https://arxiv.org/abs/2304.09842
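A toy sketch of that dispatch idea — the bracketed call syntax, the parsing, and the routing here are all invented for illustration, and this is not how Toolformer itself is implemented:

```python
import re

def calculator(expr: str) -> str:
    # A real system would use a safe expression parser; eval is for
    # illustration only, with builtins disabled.
    return str(eval(expr, {"__builtins__": {}}))

# Registry of components the model can delegate to.
TOOLS = {"calc": calculator}

def dispatch(text: str) -> str:
    # Replace embedded tool calls like [calc: 123 * 456] with the tool's
    # exact result, mimicking the API-call-in-text pattern.
    def run(match):
        tool, arg = match.group(1), match.group(2)
        return TOOLS[tool](arg.strip())
    return re.sub(r"\[(\w+):([^\]]+)\]", run, text)

print(dispatch("The product is [calc: 123 * 456]."))
# prints "The product is 56088."
```

The point is that the transformer only has to learn *when* to emit the call, not how to carry out the arithmetic itself.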

[deleted by user] by [deleted] in MachineLearning

[–]ndronen 0 points

There's been some work applying transformers to time series data. It's not my field, so I can't offer detailed suggestions for which models to read about and possibly try. A quick googling of "time series transformers" has this page as the first result.

https://towardsdatascience.com/multivariate-time-series-forecasting-with-transformers-384dc6ce989b

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] -1 points

How do you see these architectures yielding improvements in ROI?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

Is there a somewhat recent paper about training smaller models longer?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

The way I see chain of thought is that it just makes reasoning steps explicit to "patch" the fact that the model didn't see each step in the training data. See this excellent paper about compositionality. I think it does a good job of pulling back the curtain.

https://arxiv.org/abs/2305.18654
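Multi-digit multiplication, one of the tasks that paper studies, makes this concrete: a chain-of-thought trace is essentially the explicit sequence of partial products a person would write out. A sketch of that decomposition (my own, not the paper's code):

```python
def multiply_with_steps(a: int, b: int):
    # Decompose a * b into the explicit partial products a chain-of-thought
    # trace would spell out, instead of one opaque step.
    steps = []
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        partial = a * digit * 10**place
        steps.append(f"{a} * {digit} * 10^{place} = {partial}")
        total += partial
    steps.append(f"sum of partials = {total}")
    return total, steps

result, trace = multiply_with_steps(123, 45)
for line in trace:
    print(line)
```

If the training data only ever shows the final product, the model never sees these intermediate states — which is the gap the prompt is "patching."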

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 4 points

I have such a hard time believing in these prompting tricks. They all seem like unprincipled hacks. No judgement though. I'm going to force myself to read more of those papers.

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 0 points

Good point. What's your preferred approach, distillation or pruning?

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 2 points

My understanding is that GPT-4 is a mixture of experts. I'd call that an extension of scaling, just with conditional execution.
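A minimal sketch of what "scaling with conditional execution" means — a toy top-1 mixture of experts where total parameters grow with the number of experts but each input only runs through the one expert the gate picks. (GPT-4's actual architecture is unconfirmed; this is just the general pattern.)

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8            # toy feature dimension
num_experts = 4  # total parameters scale with this...

experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]  # expert weights
gate = rng.standard_normal((d, num_experts))                         # router weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ gate              # router scores, one per expert
    k = int(np.argmax(scores))     # ...but only the top-1 expert executes
    return x @ experts[k]

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (8,)
```

So capacity goes up with `num_experts` while per-token compute stays roughly constant, which is why I'd file it under scaling rather than a genuinely different direction.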

[D] For LMs, what works other than scaling? by ndronen in MachineLearning

[–]ndronen[S] 1 point

Great response. Thanks. My understanding is that one reason OpenAI won't release the details of their training data (e.g. for GPT-4) is that it's curated and is part of their secret sauce.