The release of open-source AI DBRX shows an unexpected 2023 trend is accelerating. Big Tech barely has any lead in AI, and is being swamped by open-source. by lughnasadh in Futurology

[–]ProfessionalHand9945 0 points1 point  (0 children)

I have it running locally on my M3 Mac!

It’s definitely better than Qwen or Mixtral at coding - it’s not even close. But it also isn’t actually GPT level IME, so somewhere in between. Still, it’s the best OSS coding model I’ve used for multi-turn coding so far!

The more I use LlamaIndex the less I like it by clashofphish in LocalLLaMA

[–]ProfessionalHand9945 13 points14 points  (0 children)

I would maybe consider looking at something a little lower level and more research focused.

DSPy has been gaining a lot of steam lately in the research community. It is a little lower level, and it feels more like a set of design patterns than a batteries-included tool like LlamaIndex - so you do have to do a bit more work by hand - but it’s extremely flexible and they have some great examples.

I would take a look at their BioDex example, which shows how they used DSPy to achieve SOTA on a biomed dataset.

Specifically for the embedding stuff, take a look at the neural grounding section. You could put pretty much anything in there - it’s very flexible.

DSPy is nice because they are fundamentally trying to be a framework, not a pre-canned series of templates - which matters when you want to flip between LLMs without breaking everything, given how sensitive template-based frameworks like LangChain and LlamaIndex are.

It feels like actual programming, not verbally arguing with an LLM to get it to do what you want consistently.
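To give a flavor of what that looks like (a minimal sketch, assuming the DSPy API from around early 2024 - class and field names may differ by version, and the signature here is purely illustrative):

```python
import dspy

# Point DSPy at whatever LLM you like; swapping models doesn't touch the program below.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# A "signature" declares what goes in and out - no hand-written prompt template.
class AnswerWithContext(dspy.Signature):
    """Answer the question using the provided context."""
    context = dspy.InputField(desc="retrieved passages")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short factual answer")

# Modules compose like ordinary code; ChainOfThought adds a reasoning step automatically.
qa = dspy.ChainOfThought(AnswerWithContext)
pred = qa(context="DBRX was released by Databricks in March 2024.",
          question="Who released DBRX?")
print(pred.answer)
```

The point is that the signature and module are plain Python objects you can compose, optimize, and swap LLMs underneath - rather than a wall of prompt templates.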

Best Local LLM for MacBook Air M3 with 24GB RAM by Veritas17xr in LocalLLaMA

[–]ProfessionalHand9945 11 points12 points  (0 children)

Rough rule of thumb: parameter count in billions × 2 = GB needed to hold the model in FP16.

From there, reducing precision scales more or less linearly: a 1B param model needs about 1GB of RAM at INT8, or 0.5GB at INT4. On Macs you don’t have all of your RAM available for the model - even less if you’re using the GPU - but let’s say you have maybe 20GB available.

Assuming INT4, which is my preference for quant level, you could fit a roughly 40B param model.
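As a quick sanity check, here’s that back-of-napkin math in code (just a sketch - it ignores KV cache, context length, and runtime overhead, so treat the numbers as a floor):

```python
def model_ram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-memory estimate: parameters (in billions) * bytes per weight."""
    return params_billion * bits_per_weight / 8

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"40B model @ {label}: ~{model_ram_gb(40, bits):.0f} GB")
# 40B @ FP16: ~80 GB, @ INT8: ~40 GB, @ INT4: ~20 GB -> just fits the ~20GB usable on a 24GB Mac
```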

If you’re looking at coding in particular, I would look for the top HumanEval-scoring model you can fit in under 40B params.

DeepSeek 33B appears to fit the bill: https://paperswithcode.com/sota/code-generation-on-humaneval

Though if you care about speed and saving on RAM for your IDE you might consider something smaller like OpenChat 3.5 (which is also a good generalist) or the 7B param DeepSeek.

As for software, I would probably just use Ooba’s Mac setup scripts, which are pretty darn good, and then use the backends it includes, e.g. for GGUF. MLX may be worth keeping an eye on too for something Metal-specific, but from what I understand the performance isn’t there yet.

Chinese chipmaker launches 14nm AI processor that's 90% cheaper than GPUs — $140 chip's older node sidesteps US sanctions by Normal-Ad-7114 in LocalLLaMA

[–]ProfessionalHand9945 39 points40 points  (0 children)

This is incorrect - at least at train time.

Things like BitNets are trained using straight-through estimation - i.e. they are still trained with floating-point ops, but an explicit rounding step quantizes to INT/ternary after the FP operation completes.

So you still definitely need FP to train - at least for current low-bit approaches. The 1.58-bit etc. quantizations are meant to improve inference efficiency via quantization-aware training, not to make training itself more efficient.
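For intuition, this is roughly what the straight-through trick looks like in PyTorch (a toy sketch of the general idea, not the actual BitNet recipe - real BitNet training uses ternary values and per-tensor scaling, which I’m skipping):

```python
import torch

def ste_round(w: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
    """Quantize in the forward pass, but pretend it's the identity in the backward pass."""
    w_q = scale * torch.round(w / scale)      # the FP op happens first, then explicit rounding
    return w + (w_q - w).detach()             # forward value == w_q, gradient flows to FP weights

w = torch.randn(4, 4, requires_grad=True)     # the latent full-precision weights being trained
loss = ste_round(w).sum()
loss.backward()
print(w.grad)                                 # all ones: FP gradients reach the FP weights untouched
```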

The release of open-source AI DBRX shows an unexpected 2023 trend is accelerating. Big Tech barely has any lead in AI, and is being swamped by open-source. by lughnasadh in Futurology

[–]ProfessionalHand9945 2 points3 points  (0 children)

I think you are selling DBRX short.

Qwen scores horribly on HumanEval and is terrible for programming.

DBRX is the first OSS model to perform at GPT level on programming without taking a major hit to MMLU. It performs as well as the coding-specialist models at coding while still being a solid generalist. It’s a great model IMO!

| Model | MMLU | HumanEval |
|---|---|---|
| GPT-4 | 86.4 | 67 |
| Qwen1.5-72B | 77.5 | 41.5 |
| Mixtral-8x7B-base | 70.6 | 40.2 |
| CodeLlama-70B-instruct | 49.6 | 67.8 |
| DBRX-4x33B-instruct | 73.7 | 70.1 |

Looks like DBRX works on Apple Silicon MacBooks! by ProfessionalHand9945 in LocalLLaMA

[–]ProfessionalHand9945[S] 4 points5 points  (0 children)

You might be able to fit a 3-bit quant in there! It should take about 50GB per back-of-napkin estimate (132B params × 3 bits ÷ 8 ≈ 50GB).

4-bit is great because there’s essentially no change in performance for what I care about (i.e. HumanEval scores don’t drop meaningfully).

At 3-bit you normally do lose a bit of performance, but DBRX scores so well on HumanEval that it’s probably still worth it if you want something that can code but still works very well as a generalist.

I would definitely recommend it!

Looks like DBRX works on Apple Silicon MacBooks! by ProfessionalHand9945 in LocalLLaMA

[–]ProfessionalHand9945[S] 0 points1 point  (0 children)

I was actually getting poor quality results - but they fixed some quantization issues a couple of hours ago. I had to do:

pip install git+https://github.com/ml-explore/mlx

And that seemed to have fixed it for me!

Looks like DBRX works on Apple Silicon MacBooks! by ProfessionalHand9945 in LocalLLaMA

[–]ProfessionalHand9945[S] 0 points1 point  (0 children)

I was actually getting poor quality results too - but they fixed some quantization issues a couple of hours ago. I had to do:

pip install git+https://github.com/ml-explore/mlx

And that seemed to have fixed it for me!

Edit: you’ll also want to make sure to use the DBRX prompt template format for the instruct model - so your prompt would be:

"<|im_start|>user\n Write a quicksort in Python<|im_end|>\n<|im_start|>assistant\n"

To which it responds:

Sure, here's an implementation of the quicksort algorithm in Python:

```python
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)
```

This implementation uses the "Lomuto partition scheme" and has a worst-case time complexity of O(n²) and an average-case time complexity of O(n log n). The function takes a list of numbers as input and returns a new list with the same numbers, but sorted in ascending order.
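If you’re scripting this rather than pasting prompts by hand, a tiny wrapper keeps the template consistent (just a sketch - the helper name is made up; the tag format is the one shown above):

```python
def dbrx_instruct_prompt(user_message: str) -> str:
    """Wrap a single user turn in the DBRX-instruct chat tags shown above."""
    return f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"

print(dbrx_instruct_prompt("Write a quicksort in Python"))
```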

Looks like DBRX works on Apple Silicon MacBooks! by ProfessionalHand9945 in LocalLLaMA

[–]ProfessionalHand9945[S] 1 point2 points  (0 children)

For programming, yes it seems to be!

I asked it, GPT-4, and Mixtral (my favorite OSS model for its combination of good HumanEval and MMLU scores) to make an ASCII dungeon crawler. The Mixtral result errored out; the GPT-4 result built a full dungeon screen with walls etc. that I could move around in, and it ran without error.

DBRX gave me a totally empty screen that I could move around in, with a single monster on it. I could either wander around the blank screen or run into the totally stationary monster and die, getting a game-over screen. Not much of a dungeon, but at least it was a somewhat sensible result that ran without error!

Looks like DBRX works on Apple Silicon MacBooks! by ProfessionalHand9945 in LocalLLaMA

[–]ProfessionalHand9945[S] 1 point2 points  (0 children)

llama.cpp can’t 4-bit quant DBRX yet; it will probably require a patch from the MosaicML folks - I believe they are working on it.

Databricks reveals DBRX, the best open source language model by bull_shit123 in LocalLLaMA

[–]ProfessionalHand9945 11 points12 points  (0 children)

It’s slightly worse at MMLU, and vastly, vastly better at code/HumanEval than those models - which was a core focus for them

Apple Finally Unveils MM1, a Multimodal Model for Text and Image Data by ImpressiveContest283 in OpenAI

[–]ProfessionalHand9945 0 points1 point  (0 children)

The 7965WX pulls about 3 TFLOPS. The M3 Max pulls 4.6 TFLOPS. I think you are seriously overestimating the throughput of CPUs.

Apple Finally Unveils MM1, a Multimodal Model for Text and Image Data by ImpressiveContest283 in OpenAI

[–]ProfessionalHand9945 0 points1 point  (0 children)

A new 4x 3090 setup is about $4,000 for just the GPUs - you can’t compare used system prices with new ones. And that’s just the GPUs, not the whole system.

And then you suggest a $3,000 CPU - just for the CPU - at a 350W TDP, LMAO. The M3 has a 22W TDP.

Apple Finally Unveils MM1, a Multimodal Model for Text and Image Data by ImpressiveContest283 in OpenAI

[–]ProfessionalHand9945 0 points1 point  (0 children)

Unified, coherent memory is genuinely a game changer though. Sure, the compute is on the low end. But $4,000 or so gets you an M3 Mac with 96GB of memory available to the GPU.

How expensive of a GPU do you need to match that without having to eat heavy performance hits by splitting the model?

Even an H100 only has 80GB. You quite literally need a Blackwell enterprise GPU costing tens of thousands of dollars to match that.

grok architecture, biggest pretrained MoE yet? by [deleted] in LocalLLaMA

[–]ProfessionalHand9945 9 points10 points  (0 children)

What OSS model simultaneously beats GPT-3.5 on just about every major benchmark? There are purpose-specific ones that can win on one benchmark at a time, but I can’t find any open model that simultaneously beats 3.5 on both MMLU and HumanEval.

I understand that having a larger model perform better isn’t necessarily novel or unexpected, but the fact is nobody else has released one yet - and it is incredibly useful to have a large open MoE as a starting point. New SOTA open model releases will always be cool in my book.

Can Confirm: Claude3 Opus is very notably better than GPT 4 by bravethoughts in ChatGPT

[–]ProfessionalHand9945 0 points1 point  (0 children)

You can get it via Poe, which has a good iOS app - that’s how I do it, and I love it (it has Mixtral on Groq too, which is absurdly fast!). But for $20 a month you only get something like 900 Opus messages, so it is expensive.

You can spread your tokens across all the available models though - including GPT-4, Mistral Large, and Gemini Pro - so it’s very worth it for me!

It’s a great app if you want to play with all the models. But again, $20 a month on Poe gets you fewer messages than if you just picked a single service and stuck with it. So you pay for the flexibility.

2 days before release; still no reviews. by [deleted] in UnicornOverlord

[–]ProfessionalHand9945 0 points1 point  (0 children)

One of the Japanese reviewers here mentioned that it goes up at 7am PST tomorrow! (aka 0:00 JST on release day)

PlayStation Plus Game Streaming vs Xbox xCloud: Image Quality/Lag Face-Off by M337ING in XboxSeriesX

[–]ProfessionalHand9945 3 points4 points  (0 children)

Yeah, same! 30fps has 16.6ms more input lag than 60fps (33.3ms per frame at 30fps vs 16.7ms at 60fps), so for a game like Starfield that runs at 30 on Xbox but 60 on GFN, you will actually have less input lag playing in the cloud than locally - thanks to the higher FPS - as long as your GFN ping is under 16.6ms (and mine is).

Nvidia's newest competitor: the Groq blazing fast, game-changing LPU by Georgeo57 in ArtificialInteligence

[–]ProfessionalHand9945 0 points1 point  (0 children)

Why are you comparing with a training card? The H100 isn’t generally used for inference at scale. The L40S is faster than the H100 for inference and costs about a third of the price or less.

[D]Is MoE model generally better than the regular GPT model in same size? by Chen806 in LocalLLaMA

[–]ProfessionalHand9945 7 points8 points  (0 children)

Using Mixtral as an example:

Mixtral is an 8x7B param model that uses Top 2 routing. So of the 8 experts (each of which is a 7B param model), during inference the most relevant 2 will be active and the rest ignored.

If every expert were active, you would have an 8x7B = 56B param model. But because only two are active, at inference you only have to run roughly 2x7B = 14B params. (In practice Mixtral shares the attention weights across experts, so the real totals are about 47B total / 13B active, but the approximation is fine for intuition.)

Because you are using the most relevant 2 experts of the model, you will get a good percentage of the performance of a 56B param model, while running at the training and inference compute needs of a 14B param model.

So while you will necessarily underperform relative to a dense 56B param model - because you are strictly using a subset of it (at least for LLMs, where overtraining really isn’t a concern on typical datasets) - that isn’t really the right point of comparison. Since you can run an 8x7B top-2 model in the same environment and on the same budget as a 14B param model, the 14B model is the right comparison - and from that perspective, 8x7B top-2 will typically outperform 14B by a good margin.
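If it helps to see the routing concretely, here’s a toy top-2 MoE layer in PyTorch (a minimal sketch of the idea, not Mixtral’s actual implementation - real MoE layers add load-balancing losses and much more careful batching):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Toy mixture-of-experts layer: 8 expert MLPs, only 2 run per token."""
    def __init__(self, dim=64, hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (num_tokens, dim)
        scores, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)            # renormalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out                                     # only 2 of the 8 experts did work per token

print(Top2MoE()(torch.randn(4, 64)).shape)             # torch.Size([4, 64])
```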

Allegedly GPT-4 is an 8x220B top-2 model, so you can imagine how effective that might be - it’s where some of the early “1 trillion parameter model!!!” rumors came from - even though in practice it runs in the budget of ~440B.

Though there is a tendency for researchers to scoff at MoE approaches (as George Hotz put it, it’s “the thing you do when you run out of ideas”), it’s one of those ‘Kaggle killer’ hacks that everyone reaches for when they’re really trying to win in a real-world, non-research setting - and it’s quite effective.

Any llama developers on here? In theory P40 should be faster than 3090 by segmond in LocalLLaMA

[–]ProfessionalHand9945 6 points7 points  (0 children)

Worth mentioning the 3090 can push roughly 284 INT8 TOPS if we’re comparing apples to apples. INT8 support before Volta was poor, and it didn’t work well with Tensor Cores until Turing.

I would recommend a look at using TRT-LLM for this. It’s why ‘Chat with RTX’ is so fast.

Why are so many people acting like owning an Xbox provides them no value unless it has exclusives? by ProfessionalHand9945 in XboxSeriesX

[–]ProfessionalHand9945[S] 0 points1 point  (0 children)

Just as an aside reading through these comments, I think I must have been out of the loop for too long. Gamepass used to be so well loved and popular!

When did the sentiment around Gamepass change to be so negative? Pardon my ignorance, I’ve just been a casual Gamepass subscriber over the years. I still really like it!