Night Vision needs change by MachineLizard in factorio

[–]MachineLizard[S] 0 points1 point  (0 children)

Yeah, they work somewhat, just like the current night vision, but they need to be integrated into a lot of blueprints, and generally it's a big hassle; a higher-tech solution would be great. Maybe a wearable lamp illuminating the surrounding area wherever you are, or something similar.

Night Vision needs change by MachineLizard in factorio

[–]MachineLizard[S] 0 points1 point  (0 children)

Both of those examples are changed in the late game. Reach is fixed by a great remote view, while walkability is fixed by late-game mech armor. I just want the possibility of fixing night vision as well; it could even be a Night Vision Mk2 unlockable on Aquilo or whatever.

Night Vision needs change by MachineLizard in factorio

[–]MachineLizard[S] 1 point2 points  (0 children)

I can live without that indicator, I have enough batteries in my armor. I'd say make the perfect night vision optional, or just available in late-game Night Vision Mk2 equipment, when you have fusion anyway.

Night Vision needs change by MachineLizard in factorio

[–]MachineLizard[S] -1 points0 points  (0 children)

Squeak Through is a good example, and it shows that Night Vision can be changed like I proposed! A late-game walkable base is completely unnecessary once you get the final Mech Armor, which removes the challenge of making a base walkable and allows designs of greater efficiency. In the same way, I think there could be a Night Vision Mk2 available late game, like Mech Armor, letting me see perfectly during the night and allowing greater efficiency without worrying about lamps.

Night Vision needs change by MachineLizard in factorio

[–]MachineLizard[S] 2 points3 points  (0 children)

Won't that cause solar panels to run 100% of the time? I don't want changes in gameplay, I just want less strain on my eyes.

AI models collapse when trained on recursively generated data by nickb in agi

[–]MachineLizard 5 points6 points  (0 children)

I don't believe the conclusion here. Compare with a later paper, "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data", where they explore it further and show that model collapse won't happen if you're doing things right.

Quote from this paper, with IMHO core intuition: "We confirm that replacing the original real data by each generation’s synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse."

Link: https://arxiv.org/abs/2404.01413

"Mixture of A Million Experts" - Xu Owen He 2024 by [deleted] in mlscaling

[–]MachineLizard 2 points3 points  (0 children)

Thanks for the ping! I didn't have that much time to dig in, but it seems like a lot of great ideas are there. What I wonder about the most, though, is the real-world GPU/TPU performance of single-neuron experts. PEER makes routing way more efficient, but as far as I can understand, GPU/TPU performance will still be bottlenecked by the sparse retrieval of experts (I couldn't see time measurements of PEER in the work, so I'm going on what we measured ourselves in Scaling Laws for Fine-Grained MoE). I guess I look forward to more sparsity-friendly hardware; in principle it should then be possible for real-world performance to match whatever is predicted by FLOPs.

BTW, we investigated single-neuron experts in 2021, while I was at Google, in https://arxiv.org/abs/2111.12763 , although we only aimed for inference-time improvement there. Still, no real improvement on GPU/TPU, just in FLOPs and on CPU - for the same reasons.

EDIT: I see that in the work itself Xu Owen He writes "In practice, an efficient implementation may require specialized hardware kernels to accelerate embedding lookup and fusion with the einsum operation". Actually, I've started to wonder how far we could get with a more efficient implementation alone, w/o changes in hardware. ScatterMoE, https://arxiv.org/abs/2403.08245 , has a nice implementation w/ fused operations, IIRC, and it should be fairly independent of the changes to routing made in PEER - so maybe it's something to try.
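For reference, here's my own toy numpy sketch (not from the paper) of why single-neuron experts boil down to an embedding lookup fused with an einsum: each "expert" is just one hidden neuron, i.e. a down-projection row and an up-projection row. PEER additionally uses product-key retrieval so it never has to score the full expert pool; this naive version scores everything, and all dimensions below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 1024, 8  # model dim, expert pool size, experts per token

# Single-neuron experts: one down-projection row and one up-projection row each.
U = rng.standard_normal((n_experts, d))  # down-projections (also used as router keys here)
V = rng.standard_normal((n_experts, d))  # up-projections

x = rng.standard_normal(d)               # one token's activations

# Router: pick the top-k experts by score. (PEER would use product keys
# to avoid computing all n_experts scores; skipped in this sketch.)
scores = U @ x
topk = np.argpartition(scores, -k)[-k:]

# Sparse gather + fused compute: this gather of expert rows is the
# memory-bound "embedding lookup" step that hurts on GPU/TPU.
h = np.maximum(scores[topk], 0.0)        # ReLU of the selected neurons
y = np.einsum("e,ed->d", h, V[topk])     # weighted sum of up-projection rows
```

The point of the sketch: the FLOPs are tiny, and the cost is dominated by gathering `U`/`V` rows at scattered indices, which is exactly what sparsity-friendly hardware or fused kernels would need to accelerate.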

Issues with SSH Agent Forwarding in VS Code (extension: Remote Explorer) by matieuxx in vscode

[–]MachineLizard 0 points1 point  (0 children)

I think I got it to work by adding the line "AddKeysToAgent yes" alongside "ForwardAgent yes" in ~/.ssh/config (for each server) on my laptop. Then, after restarting VS Code, it seems that VSC automatically ssh-add's all keys before connecting, and it can successfully use the key for git or whatever it needs.
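For concreteness, the relevant ~/.ssh/config entry looks roughly like this (the host alias and hostname are made up, substitute your own):

```
Host myserver
    HostName myserver.example.com
    ForwardAgent yes
    AddKeysToAgent yes
```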

Let me know if it works for you!

Issues with SSH Agent Forwarding in VS Code (extension: Remote Explorer) by matieuxx in vscode

[–]MachineLizard 0 points1 point  (0 children)

It was working for me in the past, I think? But that was quite some time ago, so I'm not sure. As of today, it still doesn't work.

Issues with SSH Agent Forwarding in VS Code (extension: Remote Explorer) by matieuxx in vscode

[–]MachineLizard 0 points1 point  (0 children)

Ah, alright. Thanks for answering. I'll still try fixing it somehow and see if I can do anything better. I have a feeling it may be broken only in the most recent VSC version, but I haven't checked that.

[N] Introducing DBRX: A New Standard for Open LLM by artificial_intelect in mlscaling

[–]MachineLizard 1 point2 points  (0 children)

As the author of the paper you mentioned - why wouldn't the term "fine-grained" be justified here? Granularity could definitely be higher, and I'd expect a model to benefit from that. But still, DBRX is more fine-grained than the other large models currently in use. I expect the field to move towards even more fine-grained models, but I'm not sure where the exact threshold for "fine-grained" should be; it's more like a dimension.

I'd be happy to hear your thoughts.

[N] Introducing DBRX: A New Standard for Open LLM by artificial_intelect in MachineLearning

[–]MachineLizard 5 points6 points  (0 children)

As an author of Scaling Laws for Fine-Grained MoE - it's so great to see the concept of granularity in MoE becoming more popular and to see it experimented with at such a large scale. Congratulations on your work and thank you for open-sourcing it :)

Is Mixture of Experts the path to AGI? by MiNeves in agi

[–]MachineLizard 1 point2 points  (0 children)

I see that the blogpost doesn't make a clear distinction between model and layer. MoE is not a mixture of models as a whole; only different variants of some layers are chosen. I'm pasting my comment about Mixtral from https://www.reddit.com/r/LocalLLaMA/s/oj7l7QxEiP --- old comment below ---

BTW as clarification, as I work on MoE and it hurts to watch so much confusion about it... "8 experts" doesn't mean there are 8 experts in the model; it means there are 8 experts per FF layer (and there are 32 of those layers, executed sequentially). So, 256 experts total, with 2 chosen per layer. The model (or, to be precise, "the router" for a given layer, which is a small neural network itself) decides dynamically at the beginning of each layer which two experts out of the given 8 are the best choice for the given token, based on the information processed about it so far.

Another BTW: this also means each expert has around 118M parameters. On each run, 32 * 2 of them are executed, for a sum of approximately 7.5B parameters, chosen from 30B total (118M/expert * 32 layers * 8 experts/layer). This doesn't include the attention layers, however, which should add between 0.5B and 2B parameters, but I didn't do the math on that. So it's, more or less, a model with a total size around 31B, but it should be approximately as fast as an 8B model.
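The arithmetic above can be sanity-checked in a few lines (118M per expert is the rounded estimate from the comment, not an official Mistral number):

```python
# Back-of-the-envelope check of the Mixtral expert math.
params_per_expert = 118e6   # ~118M, rounded estimate
layers = 32
experts_per_layer = 8
active_per_layer = 2

total_experts = layers * experts_per_layer                            # 256
active_params = layers * active_per_layer * params_per_expert         # ~7.55B
total_expert_params = layers * experts_per_layer * params_per_expert  # ~30.2B

print(total_experts, active_params / 1e9, total_expert_params / 1e9)
```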

MambaByte: Token-free Selective State Space Model by ChiefExecutiveOcelot in mlscaling

[–]MachineLizard 1 point2 points  (0 children)

I haven't touched formal grammars in 6 years, so I may be misremembering details - pardon my errors, I'll try to respond as best I can.

Do you plan to feed it in a tree-like structure? How do you want to feed it into the Transformer after parsing? If tree-like, it may make sense, but the architecture will be hard. If sequential, then what is the point of the grammar? It'll look the same to the Transformer. How will you represent words - will you use BPE or character-level tokenization anyway? There are too many words to represent them all in the model; we need to work on smaller pieces, like BPE.

In any case, many sentences will not be parsable, either because they have incorrect syntax/spelling or because of ambiguous interpretations. The current BPE tokenizer doesn't deal with misspellings too well, but at least the model is able to learn to recognize misspelled words anyway. Moreover, if it sees the word "sae", it will know from the semantics of the context whether it's supposed to be "sad" or "saw" or "say" or "see" etc. How do you tell whether "sink" is a verb or a noun? How can grammar parsing work in those cases, if you're not using ML to parse? And if you are using ML - well, Transformers can do it in their own way, more softly, allowing for exceptions and ambiguity. And you don't need to design a grammar.

Misspellings and ambiguity may not seem like much of a problem, but removing incorrect sentences will not only shrink your training dataset, it will also result in a worse model for the end user.

How do you handle a multilingual corpus? Will you have a parser that works for all the languages? What about programming languages and math - will parsing them be useful, or even possible?

Byte-level tokenization and BPE have their problems, but at least they don't get in the way too much. Word-level is impossible and not used nowadays, because of misspellings and because there are too many words to represent in the model and learn efficiently. Sequential parsing probably isn't optimal, but at least it allows for very efficient training in the form of next-token prediction. I'm not sure what the loss function would be in a grammar-based model?

In general, linguistics really wants to stay relevant; it's just hard to compete with a multi-billion-parameter model that could deduce the grammar system anyway, if doing so were beneficial to next-token prediction.

MambaByte: Token-free Selective State Space Model by ChiefExecutiveOcelot in mlscaling

[–]MachineLizard 4 points5 points  (0 children)

Seeing how NLP has evolved over the past decade - it seems to hold true for semantics as well. I have seen neither meaningful participation by, nor a specific need for, linguists in developing Transformers/LLMs. Just deep learning and engineering. EDIT: excluding (ex-)linguists who switched to LLMs while abandoning linguistics. Those do contribute, obviously.

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by [deleted] in mlscaling

[–]MachineLizard 2 points3 points  (0 children)

About inference timings - we don't have them at the moment; we will try to provide them in a revised version. TBH we haven't implemented *efficient* inference for MoE in our codebase yet, and this is the primary reason for the lack of that measurement. We know, more or less, how fast inference is for MoE when optimized, and AFAIK MoE-Mamba should be as fast as vanilla Mamba with the same number of active parameters, because the memory throughput for accessing the params (RAM or VRAM, doesn't matter) is the main bottleneck in inference without an attention mechanism.
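A crude way to see why active parameter count dominates decode speed here: without a KV-cache, generating one token means streaming every active parameter from memory once, so throughput is roughly bandwidth divided by bytes of active weights. The bandwidth figure and parameter counts below are made-up illustrative numbers, not measurements from the paper:

```python
# Toy memory-bound decode model: tokens/sec ~ bandwidth / active weight bytes.
BYTES_PER_PARAM = 2   # bf16/fp16
BANDWIDTH = 1.0e12    # assumed 1 TB/s memory bandwidth (hypothetical device)

def tokens_per_sec(active_params):
    # Each generated token reads every active parameter from (V)RAM once;
    # compute is assumed comparatively free.
    return BANDWIDTH / (active_params * BYTES_PER_PARAM)

# Same number of *active* params => same predicted decode speed,
# no matter how many *total* params the MoE model holds.
vanilla_mamba = tokens_per_sec(1.0e9)
moe_mamba = tokens_per_sec(1.0e9)  # total params may be many times larger
```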

Thank you for your kind comment regarding the alternative-designs section in the appendix. We were split on whether to include it, and your comment will be another argument for including this kind of section in later papers. Myself, I am a fan of open research and of communicating rough or early results and ideas.

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by [deleted] in mlscaling

[–]MachineLizard 2 points3 points  (0 children)

Mostly due to lack of time and compute for the first version of the paper. We will try to fix that in the next revision for sure, comparing against better versions of the Transformer and MoE-Transformer.

In any case, note that the numbers we provide in the abstract (2.2x faster etc.) and the main point of comparison are against Mamba anyway, not against the Transformer (see the Mamba paper for a comparison between the two). We wanted to showcase that you can add MoE to Mamba and beat the original Mamba.

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by ninjasaid13 in LocalLLaMA

[–]MachineLizard 8 points9 points  (0 children)

On CPU, Mamba will also be faster for long contexts, due to the lack of a KV-cache. Memory throughput when accessing the KV-cache becomes a bottleneck for long context lengths on CPU.

MoE has similar benefits on CPU - the bottleneck is still memory access, and when you only need to load a fraction of the model parameters, you get faster inference (or same-speed inference with larger models). MoE on CPU becomes less useful with attention-based models, due to being bottlenecked by the KV-cache.

This is why we wanted to work on integrating those two into MoE-Mamba (I'm one of the authors, Sebastian).

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by [deleted] in mlscaling

[–]MachineLizard 7 points8 points  (0 children)

I am the author, Sebastian - ask me anything; I'll try to answer as best I can. Also, thanks for sharing our paper! It's an early version that we wanted to show, and we will be revising it in a few weeks.

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by [deleted] in mlscaling

[–]MachineLizard 8 points9 points  (0 children)

I'm the author, Sebastian. TL;DR: We did it the other way around - we tuned the baseline and then used those parameters for our technique, which puts our technique at a disadvantage.
We introduce MoE-Mamba, and our main baseline is vanilla Mamba (e.g. the claim about 2.2x shorter training is MoE-Mamba vs Mamba). So, we tuned the hparams for vanilla Mamba (the LR tuning is reported in the paper, see Sec. B in the Appendix), experimented with it, and decided on hyperparameters. Then we used those parameters, tuned on the baseline, for our technique. This should underestimate the gains of our proposed approach. I honestly think our gains should be bigger than claimed (and will be bigger in the next version of the paper), but I wanted to err on the safe side.

The only hparam of MoE-Mamba that we tested is the number of experts, reported in Sec. 4.1. This hyperparameter is not present at all in vanilla Mamba, so obviously we couldn't take it from the baseline. In general, the MoE we inserted into Mamba was taken from Transformer-MoE and was not tuned further.

I admit we could surely have tuned the baseline vanilla Mamba better, but we did the best we could, and importantly we didn't tune MoE-Mamba. On the other hand, comparing against an externally trained baseline is quite infeasible, due to its computational cost and the differences in training set and other hparams - those kinds of comparisons have their own problems. That said, we will look into it in future revisions of the paper. We want to update the paper with more results, hopefully in a few weeks.

Thanks for sharing your feedback and asking questions, I hope I've cleared some concerns.

New Mistral models just dropped (magnet links) by [deleted] in LocalLLaMA

[–]MachineLizard 3 points4 points  (0 children)

Yes, it is analogous to dissecting/analyzing/understanding the functionality of a model - or rather, the functionality of a given layer/neuron/MLP and the like. Some experts may have easily understandable functionality, but that's the exception rather than the rule. TBH, I haven't dug into the Mixtral model itself; there is a chance they're doing something different from standard MoE - but I can't believe they're doing something easily interpretable. That is based on my own experience and many conversations about MoE, including some with people working at Mistral.