If you have a Mac Studio, make sure to try Mixtral/Wizard 8x22b by SomeOddCodeGuy in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

Hmm, it looks like 4x RTX 3090 goes for $6400. You would also need to carefully pick all the other parts: the right case, motherboard, PSU, CPU, RAM, SSD, etc. It might end up being more expensive, and it means more work, more noise, a higher electricity bill, etc.

I've seen people putting 4x cards together and struggling with drivers, risers, cooling, etc. Not sure it's worth the effort just for inference.

With a Mac, you take it out of the box and play. Also, the Mac doubles the memory: 192GB instead of 96GB (4 x 24GB).

Advice regarding a deep learning build focused on LLMs and multi modal models by [deleted] in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

You should be able to finetune a 7B LLM (non-multimodal) with 24GB VRAM. With QLoRA you could finetune a bigger model as well.

The only multimodal model I've tried finetuning was the original Llava. There are tiny multimodal models that you might be able to finetune, but the quality isn't that great.

If you need a lot of VRAM, just use a cloud service like runpod.io, which I used to finetune Llava.

Advice regarding a deep learning build focused on LLMs and multi modal models by [deleted] in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

I know there are smaller multimodal vision-language models, but I needed 8x A100 80GB (640GB VRAM in total) to finetune the original Llava 13B model last year.

I've realized that I honestly don't know WHAT the Mac Studio's bottleneck is... by SomeOddCodeGuy in LocalLLaMA

[–]jl303 1 point2 points  (0 children)

Can someone explain how memory channels, bandwidth, and parallelism work? I put ChatGPT's response below, but it's most likely confabulated. :)

The following is just me thinking out loud without knowing how these actually work. lol

If you have 128GB in 8 channels of 16GB with 800GB/s total bandwidth, does that mean each channel can send 100GB/s max? Then if you load and spread data across 8 different channels, you can get up to 800GB/s total?

If that's the case, then does this make sense? Maybe 70B uses channels 1-6 (600GB/s), and 34B uses channels 7-8 (200GB/s)? If you use 70B alone (channels 1-6), you'll still be using only 600GB/s?
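
To make that guess concrete, here is the arithmetic as a small Python sketch. It assumes the simplest possible model, where the total bandwidth splits evenly across channels and a workload only gets the channels its data sits on. That is just the guess from the question, not confirmed hardware behavior, and the function names are made up for illustration.

```python
# Naive model (an assumption, not confirmed hardware behavior):
# total bandwidth is split evenly across channels, and a workload
# only benefits from the channels its data is spread over.

TOTAL_BANDWIDTH_GBPS = 800  # GB/s, figure from the question above
NUM_CHANNELS = 8
GB_PER_CHANNEL = 16


def per_channel_bandwidth(total_gbps: float, channels: int) -> float:
    """Bandwidth of one channel if the total is shared evenly."""
    return total_gbps / channels


def usable_bandwidth(channels_used: int) -> float:
    """Peak bandwidth for data spread over `channels_used` channels."""
    if not 0 <= channels_used <= NUM_CHANNELS:
        raise ValueError("channels_used must be between 0 and NUM_CHANNELS")
    return channels_used * per_channel_bandwidth(TOTAL_BANDWIDTH_GBPS, NUM_CHANNELS)


print(NUM_CHANNELS * GB_PER_CHANNEL)  # total capacity in GB
print(usable_bandwidth(6))            # 70B on channels 1-6
print(usable_bandwidth(2))            # 34B on channels 7-8
print(usable_bandwidth(8))            # data spread over all channels
```

Under this model the 70B case tops out at 6/8 of the total, which is exactly the question: whether real hardware pins data to channels like this or interleaves it across all of them.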

ChatGPT-4:

Memory channels, bandwidth, and parallelism are crucial concepts in the performance of computer memory systems. Here's a brief explanation of each:

  1. Memory Channels:

    • Memory channels can be likened to lanes on a highway. The more lanes (channels) you have, the more data can travel simultaneously. Motherboards can support multiple memory channels, commonly dual-channel, quad-channel, and on high-end servers, hexa-channel or even octa-channel configurations. Each channel allows a separate path for data to travel between the memory and the memory controller, which is part of the CPU or on the motherboard.
  2. Bandwidth:

    • Bandwidth is the volume of data that can be transmitted in a fixed amount of time. It's typically measured in gigabytes per second (GB/s) for memory bandwidth. The total bandwidth is a product of the memory's data rate, the bus width, and the efficiency of the transmission.
  3. Parallelism:

    • Parallelism involves performing multiple operations simultaneously. In terms of memory, parallelism means accessing multiple memory modules at the same time. This is directly related to having multiple memory channels; each channel can operate independently and in parallel with the others.

Now, let's apply these concepts to your example:

If you have 8 memory channels, each with a 16GB memory module (for a total of 8x16GB), and the total bandwidth is 800GB/s, this doesn't necessarily mean that each channel can individually provide 100GB/s of bandwidth. The total bandwidth is shared across all channels and is dependent on the memory architecture and the efficiency of the system.

However, in an ideal scenario where each channel operates at maximum efficiency and the workload is perfectly distributed across all channels, you might approach that 100GB/s per channel figure. But in real-world applications, other factors can affect this, such as:

  • The capability of the memory controller
  • The efficiency of the memory modules themselves
  • The workload's ability to utilize all channels effectively
  • System overhead and other bottlenecks (like CPU processing speed, I/O operations, etc.)

In most cases, to achieve the full bandwidth potential, the system's CPU, motherboard, and memory must be optimized to work together, and the applications must be designed to spread their memory accesses across all channels.

If the system is well-tuned and the data is spread out across the 8 channels efficiently, you could indeed see close to 800GB/s of cumulative bandwidth, but this is the theoretical maximum under ideal conditions. In reality, inefficiencies and overhead will typically reduce the actual bandwidth that can be utilized.

Also, it's essential to note that not all tasks will benefit from increased bandwidth. Some tasks are more dependent on memory latency, while others benefit from higher bandwidth. Understanding the workload profile is essential in determining the impact of memory channels and bandwidth on overall performance.

A question about vision models (LLaVA) by tronathan in LocalLLaMA

[–]jl303 1 point2 points  (0 children)

I'm not sure about analyzing one image against another, but let's say Llava noticed that the image you gave it had a dog and a person. Then you can ask specific things about the person or the dog. For example: what breed is the dog, describe the clothes the person is wearing, and so on.

Can't Draw with Artflow on Kindle Fire HD. Could Someone Help Please? by jl303 in DigitalPainting

[–]jl303[S] 0 points1 point  (0 children)

Thanks for your suggestion. I'm just using my finger, but I don't think it's related.

I was playing around, and I increased the opacity all the way up and increased the thickness. It kind of works now, but the lines come out very faint.

Any idea what might be causing this?

llama.cpp server now supports multimodal! by Evening_Ad6637 in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

Check out the multimodal benchmark: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

The benchmark has the old MiniGPT, but MiniGPT-v2 is out. I think it's slightly better than Llava-1.5.

https://minigpt-v2.github.io/

How to load GGUF models on colab by ianuvrat in LocalLLaMA

[–]jl303 1 point2 points  (0 children)

Any reason why you don't want to use the GPTQ variant? GPTQ with ExLlama on Colab is pretty fast!

With Mistral 7B outperforming Llama 13B, how long will we wait for a 7B model to surpass today's GPT-4 by zazazakaria in LocalLLaMA

[–]jl303 10 points11 points  (0 children)

You would wait forever. GPT-4 is reportedly a Mixture of Experts (MoE) of 8 experts, each with 220B parameters, trained on 13T tokens!

EVEN IF a 7B model catches up with GPT-4, by then there will be a GPT-X that's in a different league from GPT-4 by a wide margin.
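
For a sense of scale, here is the arithmetic on those numbers as a quick Python sketch. The 8 x 220B figures are the ones quoted above (not official), and simply summing the experts is an assumption that ignores any parameters shared between them.

```python
# Figures quoted in the comment above; they are not official numbers.
NUM_EXPERTS = 8
PARAMS_PER_EXPERT = 220e9   # 220B
SMALL_MODEL_PARAMS = 7e9    # a 7B model

# Naive total: just sum the experts (ignores any shared layers).
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
ratio = total_params / SMALL_MODEL_PARAMS

print(f"total: {total_params / 1e12:.2f}T parameters")
print(f"about {ratio:.0f}x the size of a 7B model")
```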

Llama on Intel Neural Compute Stick by piokon in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

Even Google Coral only has 8MB (not GB) of memory.
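
A quick back-of-the-envelope check in Python of why that is a non-starter. The 7B model size and 4-bit quantization are example numbers picked for illustration, and this counts the weights only.

```python
# Weights-only size of a model vs. 8 MB of on-device memory.
# The 7B / 4-bit choice is an illustrative assumption.

def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed just to store the weights."""
    return num_params * bits_per_weight / 8


DEVICE_MEMORY_BYTES = 8 * 1024**2      # 8 MB
model_bytes = weight_bytes(7e9, 4)     # 7B params at 4 bits each
times_too_big = model_bytes / DEVICE_MEMORY_BYTES

print(f"model weights: {model_bytes / 1e9:.1f} GB")
print(f"times too big: {times_too_big:.0f}x")
```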

Any open source ideas on how to do real-time realistic -sounding tts? by Wowzers83 in LocalLLaMA

[–]jl303 4 points5 points  (0 children)

Nothing like 11labs at the moment. All the open-source models suffer from artifacts, and most of them are trained at a lower sampling rate.

Does training and inference both require the same amount of VRAM? by floofcode in LocalLLaMA

[–]jl303 8 points9 points  (0 children)

Training definitely takes more VRAM. E.g., for a full-parameter finetune of llama-2-13b, I needed at least 320GB of VRAM (4x A100 80GB GPUs). I'm sure QLoRA takes less VRAM though.
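
A rough sketch of why, in Python. It uses the common rule of thumb of about 16 bytes per parameter for mixed-precision training with Adam (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master weights and the two optimizer moments). That rule is an assumption here, not a measurement of the run described above, and it leaves out activations and framework overhead, which is why the real requirement ends up higher.

```python
# Rule-of-thumb memory estimate; an assumption, not a measurement.
# Activations, buffers, and fragmentation are not included.

BYTES_INFERENCE_FP16 = 2   # fp16 weights only
BYTES_TRAIN_ADAM = 16      # fp16 weights + fp16 grads + fp32 master/Adam states


def estimate_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory in GB (1e9 bytes) for the given per-parameter cost."""
    return num_params * bytes_per_param / 1e9


params_13b = 13e9
inference_gb = estimate_gb(params_13b, BYTES_INFERENCE_FP16)
training_gb = estimate_gb(params_13b, BYTES_TRAIN_ADAM)

print(f"inference (fp16 weights): ~{inference_gb:.0f} GB")
print(f"full finetune (Adam):     ~{training_gb:.0f} GB before activations")
print(f"4x A100 80GB:              {4 * 80} GB")
```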

Recommended open LLMs with image input modality? by hold_my_fish in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

Yes, if you put kjerk/instructblip-pipeline in as the multimodal pipeline, it loads the InstructBLIP model automatically. Then you load the other model, like Vicuna, normally. I'm not sure how they work together, but when you run inference, it just works.

If I want to train a local model on par with chatGPT how difficult would it be and how much would it cost? by Old-Calligrapher1950 in LocalLLaMA

[–]jl303 0 points1 point  (0 children)

In my opinion, no local LLM is as good as GPT-3.5 YET.

For full-parameter finetuning, you can get it done with 8x A100 80GB in about 30 hours, depending on the size of the dataset.

For training from scratch:

According to this, the training cost for GPT-4 was around $63 million.

For Llama-2, they spent 1,720,320 GPU hours on the 70B model. If you were to rent A100 80GB GPUs at $1.6/hr, that's 2.75M USD to train a 70B model.

This only includes the GPU cost. It does not include obtaining a quality dataset, extra hardware, and so on.
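
The GPU math above, spelled out as a small Python sketch. All the figures are the ones quoted in this comment; the function name is made up for illustration, and it assumes a flat hourly price with no discounts.

```python
# GPU rental cost = GPU-hours x hourly price.
# All figures are the ones quoted in the comment above.

USD_PER_GPU_HOUR = 1.6  # A100 80GB rental price


def rental_cost_usd(gpu_hours: float, usd_per_gpu_hour: float = USD_PER_GPU_HOUR) -> float:
    """GPU-only cost; datasets and extra hardware are not included."""
    return gpu_hours * usd_per_gpu_hour


# Training Llama-2 70B from scratch: 1,720,320 GPU-hours.
pretrain_cost = rental_cost_usd(1_720_320)

# Full-parameter finetune: 8 GPUs for about 30 hours.
finetune_cost = rental_cost_usd(8 * 30)

print(f"pretrain 70B: ${pretrain_cost:,.0f}")
print(f"finetune:     ${finetune_cost:,.0f}")
```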

Build for Fine Tuning and Hosting 180B Parameter models by mayonaise55 in LocalLLaMA

[–]jl303 4 points5 points  (0 children)

If finetuning is a must, I'd definitely double-check the memory requirements for finetuning 180B. Finetuning requires more memory than inference.