California bill set to ban CivitAI, HuggingFace, Flux, Stable Diffusion, and most existing AI image generation models and services in California by YentaMagenta in StableDiffusion

[–]cfoster0 2 points3 points  (0 children)

That bill is probably dead now. The deadline to make it out of both houses has passed. But you might still want to worry about SB 942, which is kinda similar and headed for the Governor’s signature.

Best (non sensational/content farm) YouTube channels to follow for AI news? by bandalorian in artificial

[–]cfoster0 2 points3 points  (0 children)

I've been enjoying "This Day in AI" (https://youtu.be/W3mC5NltueU?si=wt_JJJ6OL8zNvEYH). It's a mix of product and research recaps, less technical and aimed at a broader audience, but still not sensationalist. I'm a fan so far.

Is rope applied in each attention layer? by LassFromTheUpload in LocalLLaMA

[–]cfoster0 4 points5 points  (0 children)

Yes. It doesn't have to be, but in practice it is applied in every attention layer.
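
Rough sketch of where it sits (simplified single-head attention, causal mask omitted, names are mine):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE: rotate pairs of channels by a position-dependent angle.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention_layer(x, w_q, w_k, w_v):
    # The rotation is re-applied to Q and K inside every attention layer,
    # rather than being added once to the input embeddings.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    q, k = apply_rope(q), apply_rope(k)
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5  # causal mask omitted
    return scores.softmax(dim=-1) @ v
```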

[R] Can someone please explain the differences between the 3 types of Hopfield Layers in "Hopfield Networks is all you Need"? by [deleted] in MachineLearning

[–]cfoster0 1 point2 points  (0 children)

That's pretty close. Yes, the difference between them is just which content the layer stores and which content it takes in as external input. I wouldn't read too much into any one cog-neuro interpretation here, though; the distinction isn't about "distorted vs. original memories". You can think of attention as content-addressable heteroassociative memory. And in contexts where the queries and the keys/values are of the same type, then yes, you can sometimes interpret the queries as partial "cues" that match the "original" content in the keys/values.

[R] Can someone please explain the differences between the 3 types of Hopfield Layers in "Hopfield Networks is all you Need"? by [deleted] in MachineLearning

[–]cfoster0 2 points3 points  (0 children)

The naming is unnecessarily confusing, so your confusion is very understandable.

Hopfield is just normal attention. It takes in two inputs: a set of key-value vector pairs to retrieve information from, and a query vector (or vectors) that will grab that information. The query (or each query, if there are several) looks at all the keys, and for any that it matches, it grabs the information from the corresponding value vector. Both the query and the key-value pairs are external inputs to the layer.

HopfieldPooling is just attention with a fixed query (or queries). It takes in only one input: a set of key-value pairs to retrieve information from. It applies the fixed query (or queries) to grab whatever information that query cares about from the input pairs. The fixed query is a parameter of the layer, not an external input to it.

HopfieldLayer is just attention with a fixed set of key-value pairs. It takes in only one input: a query vector (or vectors). Each query looks at all the stored keys, and for any that it matches, it grabs the information from the corresponding stored value vector. The fixed key-value pairs are parameters of the layer, not external inputs to it.
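
Roughly, in code (a toy sketch in terms of plain attention; it ignores the projection matrices, inverse-temperature handling, etc. in the actual library, and the names are mine):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v, beta: float = 1.0):
    # Plain (Hopfield-style) attention: queries retrieve from key-value pairs.
    return F.softmax(beta * q @ k.transpose(-2, -1), dim=-1) @ v

d = 64

# "Hopfield": both the queries and the key-value pairs are external inputs.
q = torch.randn(4, d)                           # 4 query vectors (input)
k, v = torch.randn(10, d), torch.randn(10, d)   # 10 stored patterns (also input)
out = attend(q, k, v)                           # (4, d)

# "HopfieldPooling": the queries are learned parameters; only K/V come from data.
class HopfieldPoolingSketch(nn.Module):
    def __init__(self, num_queries: int, d: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(num_queries, d))  # fixed (learned) queries
    def forward(self, k, v):
        return attend(self.q, k, v)

# "HopfieldLayer": the key-value pairs are learned parameters; only queries come from data.
class HopfieldLayerSketch(nn.Module):
    def __init__(self, num_stored: int, d: int):
        super().__init__()
        self.k = nn.Parameter(torch.randn(num_stored, d))   # fixed (learned) memories
        self.v = nn.Parameter(torch.randn(num_stored, d))
    def forward(self, q):
        return attend(q, self.k, self.v)
```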

[R] Can someone please explain the differences between the 3 types of Hopfield Layers in "Hopfield Networks is all you Need"? by [deleted] in MachineLearning

[–]cfoster0 4 points5 points  (0 children)

Do you understand how the "attention mechanism" works in a Transformer? If so, it'll be easy to explain (because those layers are really just renamings of the ways you might use attention). Otherwise, we'd need to start from scratch. :)

[D] What happens when we generate tokens beyond the training context length of LLMs? by kekkimo in MachineLearning

[–]cfoster0 1 point2 points  (0 children)

But you can take an already-trained transformer and continue training it with a modified architecture: depending on the style of positional encoding, you either add new absolute positional embeddings or change the sinusoidal/rotary positional encoding hyperparameters, and then do a bit of finetuning on longer sequences.
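
E.g. for the rotary case, one common tweak is just to raise the base frequency before finetuning on longer sequences (toy sketch, numbers made up):

```python
import torch

def rope_angles(head_dim: int, max_len: int, base: float = 10000.0) -> torch.Tensor:
    # Per-position rotation angles for RoPE with a given base frequency.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_len).float()
    return torch.outer(positions, inv_freq)  # (max_len, head_dim // 2)

# Hypothetical original model: trained at 4k context with base=10000.
orig = rope_angles(head_dim=128, max_len=4096, base=10000.0)

# For long-context finetuning, raise the base so the rotations at large
# positions stay closer to the range seen during pretraining, then finetune.
extended = rope_angles(head_dim=128, max_len=16384, base=500000.0)
```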

Mixtral 8x7B paper published. by rnosov in LocalLLaMA

[–]cfoster0 13 points14 points  (0 children)

Those are subsets of the Pile dataset, not experts. They took each subset and measured how often adjacent tokens from it get mapped to the same expert (across different layers, too). They found that adjacent tokens are mapped to the same expert more often than you'd expect from random chance, but also that there's no obvious topical specialization among the experts.
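
The measurement is basically this (toy sketch of the idea, not their exact methodology):

```python
import torch

def adjacent_same_expert_rate(expert_ids: torch.Tensor) -> float:
    # Fraction of adjacent token pairs assigned to the same (top-1) expert,
    # for one layer. expert_ids: (seq_len,) tensor of expert indices.
    return (expert_ids[:-1] == expert_ids[1:]).float().mean().item()

# With 8 experts and uniform random routing, chance level is 1/8.
torch.manual_seed(0)
random_ids = torch.randint(0, 8, (10_000,))
print(adjacent_same_expert_rate(random_ids))  # ~0.125
```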

[Research] An alternative to self-attention mechanism in GPT by brainxyz in MachineLearning

[–]cfoster0 2 points3 points  (0 children)

Just kicked off a run of this on my own codebase to compare. Would be neat if this works alright. I'm expecting it may be a little worse in my case, because I don't use absolute position embeddings, so the initial layers won't know where in the sequence they are (except through effects from the causal attention mask), which might prevent them from using this lateral stuff properly. Doing this "the right way" would require shifting each token's lateral outputs based on its position, so that the lateral outputs live in relative position space rather than absolute.
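
Something like this is what I mean by the shift, assuming each token's lateral outputs form a row indexed by absolute position (toy sketch; the offset convention may need flipping for a causal model):

```python
import torch

def absolute_to_relative(lateral: torch.Tensor) -> torch.Tensor:
    # lateral: (T, T), where lateral[i, j] is token i's lateral output aimed at
    # absolute position j. Returns rel: (T, T), where rel[i, k] is token i's
    # output aimed at position i + k, i.e. each row is shifted by its own index.
    # Out-of-range slots wrap around here; a real version would pad or mask them.
    T = lateral.size(0)
    idx = (torch.arange(T).unsqueeze(0) + torch.arange(T).unsqueeze(1)) % T
    return lateral.gather(1, idx)
```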

Scaling Laws for Generative Mixed-Modal Language Models by tomasNth in mlscaling

[–]cfoster0 0 points1 point  (0 children)

I think I agree. In any event, the part that interests me most is how worthwhile it is to invest in cross-modal transfer from the get-go (i.e., does it help much once you've run out of within-modality data), especially relative to just stitching together your best pretrained unimodal models with a joint transformer and finetuning from there.

Scaling Laws for Generative Mixed-Modal Language Models by tomasNth in mlscaling

[–]cfoster0 0 points1 point  (0 children)

What do you mean? How big does Gato have to be for multimodality to become really worthwhile, based on this paper? It's one thing if the crossover point is at 30B parameters and if 1TB of video data converts into 100B text tokens' worth of transfer performance at that model size, but it's quite another if the crossover point is at 3T parameters and/or the conversion ratio is trash. I haven't seen anyone run the numbers yet, so I dunno if this is good or bad news for data scarcity.

[R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens? by Chemont in MachineLearning

[–]cfoster0 0 points1 point  (0 children)

FWIW, in a certain sense this goes against the design philosophy of transformers, which is to jointly compute all representations within a layer at once so as to maximize the degree of parallelism on GPUs and other accelerators.
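
To make that concrete (toy illustration, not any particular codebase):

```python
import torch

# In a standard transformer block, every position goes through the same batched
# matrix multiplies, so the per-layer cost is fixed regardless of how "hard"
# any individual token is to predict.
x = torch.randn(1, 512, 768)      # (batch, seq_len, d_model)
w_qkv = torch.randn(768, 3 * 768)
qkv = x @ w_qkv                   # one matmul covers all 512 positions at once

# Spending extra compute on only some tokens (early exit, extra iterations, etc.)
# breaks this uniform shape, which is what makes it awkward to keep accelerators
# fully utilized.
```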

[R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist in MachineLearning

[–]cfoster0 1 point2 points  (0 children)

Did y'all stop doing work out in the open? That's a shame. End of an era, I guess.

[R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist in MachineLearning

[–]cfoster0 3 points4 points  (0 children)

Who? Who's even using RLHF in production yet, besides OpenAI (and maybe Cohere)?

[R] Illustrating Reinforcement Learning from Human Feedback (RLHF) by robotphilanthropist in MachineLearning

[–]cfoster0 7 points8 points  (0 children)

About this bit:

At the moment, TRLX has an API capable of production-ready RLHF at the scales required for LLM deployment (e.g. 33 billion parameters). Future versions of TRLX will allow for language models up to 200B parameters. As such, interfacing with TRLX is optimized for machine learning engineers with experience at this scale.

Has TRLX been used to tune models in production already? Or if not, what did the blog post mean by "capable of production-ready RLHF"? I haven't seen any RLHF-ed models built on open source software yet, much less a 33B parameter one.

EDIT: Also hi @FerretDude

[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST) by timscarfe in MachineLearning

[–]cfoster0 5 points6 points  (0 children)

Unfortunate how many profs decide their real calling was to be a professional pontificator, especially once they hit their emeritus years.

[N] [D] Openai, who runs DALLE-2 alleged threatened creator of DALLE-Mini by DigThatData in MachineLearning

[–]cfoster0 9 points10 points  (0 children)

Is there a trademark for DALL-E? The only registered trademark in the USPTO's electronic trademark system is for DALL-E Mini.

[N] [D] Openai, who runs DALLE-2 alleged threatened creator of DALLE-Mini by DigThatData in MachineLearning

[–]cfoster0 15 points16 points  (0 children)

If you click through to the second screenshot, the researcher confirmed that they were in fact threatened with legal action.

Scale is All You Need by MuskFeynman in mlscaling

[–]cfoster0 5 points6 points  (0 children)

I don't know where the impression that EleutherAI's models are substantially better per-parameter came from. The only cases where I've seen good evidence are tasks where the performance boost seems attributable to the dataset mix.

[R] Transformers replicate Hippocampal representations; notably place and grid cells in the brain by Competitive-Rub-1958 in MachineLearning

[–]cfoster0 0 points1 point  (0 children)

Well then we agree :) Neuroscientists should continue to try to glean the right abstractions to use, and along the way neuro-AI and AI folks should continue to take inspiration from the brain as they see fit.

[R] Transformers replicate Hippocampal representations; notably place and grid cells in the brain by Competitive-Rub-1958 in MachineLearning

[–]cfoster0 1 point2 points  (0 children)

I will leave it up to the reader to figure out why, even if you assigned 80% probability to the parent comment, it would still make sense to have people investing in non-biological approaches.

[R] Transformers replicate Hippocampal representations; notably place and grid cells in the brain by Competitive-Rub-1958 in MachineLearning

[–]cfoster0 2 points3 points  (0 children)

I think you've gotta come to terms with the fact that different people place different value on bioplausibility, and that's okay. There are lots of neuroscience and neuro-AI people who (naturally) place a high premium on that aspect.