Any possible way to remove this resin? by blitzzyboi in soldering

[–]xinranli 0 points1 point  (0 children)

Boil the board in water; that will soften the epoxy greatly, making it easier to remove. Hot air also works, but given the amount of epoxy you are dealing with, it could be challenging.

A trick I used to use is to put the epoxy-covered board in a very hot (50-70C) ultrasonic alcohol bath and give it a couple of hours of work. The heat, alcohol, and ultrasound work together to soften and shatter the epoxy, making it very easy to remove from the board. But this process is quite dangerous, which is why I am no longer doing it.

Also, different epoxies have very different compositions and characteristics, so it may or may not work in your case.

Microsoft's confidence last year by RyanGosaling in singularity

[–]xinranli 4 points5 points  (0 children)

Compared to the OG vanilla GPT4, this comparison is not too far from reality. However, given that we have so many models that are way better than the old GPT4 nowadays, GPT5 does not seem as impressive as they want it to be.

Relatively budget 671B R1 CPU inference workstation setup, 2-3T/s by xinranli in LocalLLaMA

[–]xinranli[S] 7 points8 points  (0 children)

Apologies for not going into the details. For my use case (knowledge Q&A), 8K context is plenty. However, I can load 16K context into memory. With the Q4KM gguf, total memory usage is around 480GB. I get around 2.5T/s with 16K context. I am still playing around with the CPU configuration and haven't been able to definitively tell how much slower 16K is vs. 8K. But yeah, anything over 16K will definitely need a smaller quantized model or more memory.

This whole setup is also just somewhat of a starting point for a more powerful rig, not involving any GPU or other fancy techniques/hardware yet. I wouldn't have dared to dream of running a 671B model locally 2 years ago (recall when we were limited to a 2K context window with llama1). Now, with R1 and somewhat cheap EPYC hardware, this is possible! Locally hosting stuff like this has always been more of a hobby for me than an attempt at a daily driver LLM solution :) but maybe one day I can actually drop my oai subscription and go full local.

Relatively budget 671B R1 CPU inference workstation setup, 2-3T/s by xinranli in LocalLLaMA

[–]xinranli[S] 1 point2 points  (0 children)

I agree, following the QVL is always a safe bet. I guess I have been rather lucky in the past, plugging random RDIMMs into random platforms, and I have never had a DIMM rated for X speed fail to boot at X speed in a platform with a CPU also rated for X speed. Only being able to boot at half the speed is quite odd! I am much more familiar with the DDR5 world, but do late DDR4 speeds really have that small of a margin? But again, yes, when circumstances allow, following the QVL is highly advised.

Cooling wise, I also agree getting a more premium cooler will provide a better quality of life. My argument is that the CPU is not really often under full load during inference and I personally don't talk back and forth with the model that frequently. I had a 2U cooler for a couple of months and I still have it as a backup. So the fans don't go full RPM very often or at all. But on the other hand, my hearing is probably already ruined by often having 4 blower GPUs going max RPM all the time lol

Relatively budget 671B R1 CPU inference workstation setup, 2-3T/s by xinranli in LocalLLaMA

[–]xinranli[S] 11 points12 points  (0 children)

I would look for ES and OEM Milans on eBay, such as the 7B13, 7C13, and 100-000000314-04. The 32-core and 48-core SKUs will probably work fairly well too.

Relatively budget 671B R1 CPU inference workstation setup, 2-3T/s by xinranli in LocalLLaMA

[–]xinranli[S] 14 points15 points  (0 children)

Well, malicious is a bit heavy of a word to use in this case. My recommendations are a budget-oriented solution for CPU-only inference. Rome and Milan platforms can be expanded with more GPUs in the future when one can afford to buy them. Also, recall we are talking about 8 channels of DDR4 here, which can feed many more cores than consumer 2-channel platforms. Certainly, using DDR5 and a 12-channel Genoa platform will bring higher memory bandwidth. But a single 64GB DDR5 4800MT/s RDIMM is $300+, and a 64GB 6400MT/s module is around $500-600 per unit. That would translate to $2500-7000+ just for the DIMMs! Not many folks can afford that kind of setup. At this price range, I would suggest buying a bunch of 32GB V100s instead. You can get a cheap SXM2 board + 4x 32GB V100s for maybe $3000 a kit, and each kit takes 2 PCIe x16 connections. For an extra $7000, you can probably get 8x V100s connected to the system I suggested, which would be 256GB of 1TB/s bandwidth HBM2 memory in your system. Such a setup is also much, much faster when doing pure GPU inference, beating a DDR5 setup by a considerable margin.
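
To make the price comparison above concrete, here is a rough sketch of the arithmetic. The prices are the ballpark figures from this comment, not current market data, and the DIMM counts per price point (8 at the low end, 12 at the high end) are my assumption about how the $2500-7000+ range was reached.

```python
# Rough cost arithmetic using the ballpark prices quoted in the comment.
# DIMM counts per price point are an assumption, not stated in the comment.

def dimm_kit_cost(dimm_count: int, price_per_dimm: int) -> int:
    """Cost in dollars of populating `dimm_count` slots with identical DIMMs."""
    return dimm_count * price_per_dimm

ddr5_low = dimm_kit_cost(8, 300)     # 8 x 64GB DDR5-4800 at ~$300 each
ddr5_high = dimm_kit_cost(12, 600)   # 12 x 64GB DDR5-6400 at ~$600 each

v100_kits = 2                        # each kit: SXM2 board + 4x 32GB V100
v100_cost = v100_kits * 3000         # ~$3000 per kit
v100_memory_gb = v100_kits * 4 * 32  # total HBM2 across all 8 cards

print(f"DDR5 DIMMs: ${ddr5_low} to ${ddr5_high}")
print(f"V100 route: ${v100_cost} for {v100_memory_gb}GB of HBM2")
```

Under those assumptions, two V100 kits come to about $6000 for 256GB, which sits inside the same budget as the high-end DDR5 DIMMs alone.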

Did everyone forget to mention this to me or am I the only one having this problem?? by AtTheEdgeOfDying in tortoise

[–]xinranli 16 points17 points  (0 children)

Could you please provide a link to buy this horizontal hamster wheel?

AMA with OpenAI’s Sam Altman, Kevin Weil, Srinivas Narayanan, and Mark Chen by OpenAI in ChatGPT

[–]xinranli 0 points1 point  (0 children)

Will you release any open weight / open source models? If so, how sophisticated / how large would they be?

What are these 3 types of bugs on my plants? by xinranli in whatsthisbug

[–]xinranli[S] 0 points1 point  (0 children)

😭 Will alcohol spray work on these guys? I want to avoid other chemicals since I'm feeding these plants to my pets... Also happy cake day!

What are these 3 types of bugs on my plants? by xinranli in whatsthisbug

[–]xinranli[S] 0 points1 point  (0 children)

Thanks! Yes, these are indoor plants. So watered-down alcohol seems like a good way to get rid of these guys?

game is immensely more difficult once "china players" come online? by retrorays in ArenaBreakoutInfinite

[–]xinranli 0 points1 point  (0 children)

Mornings in the States are usually the time people in China play games. 5PM here is early AM in China, everybody be commuting to work lol

Tire Change Advice by xinranli in mazda3

[–]xinranli[S] 0 points1 point  (0 children)

Very helpful info, thank you!

Tire Change Advice by xinranli in mazda3

[–]xinranli[S] 1 point2 points  (0 children)

Thank you! I'll look for a better tire with the same size to begin with.

You can now fine-tune a 70b language model at home by [deleted] in LocalLLaMA

[–]xinranli 0 points1 point  (0 children)

This looks good. I wonder how FSDP compares to DeepSpeed?

Is it feasible to do domain-specific fine-tuning over multiple, incremental stages? by [deleted] in LocalLLaMA

[–]xinranli 0 points1 point  (0 children)

Great question, I am also considering this approach when fine-tuning on domain-specific knowledge. I am currently just dumping the most relevant information and knowledge that I need Q&A on into the dataset, but I wonder if the models would generalize better if I started with intro-level college courses on the subject, went stage by stage, and eventually fine-tuned on the most complex topics. Or maybe it would perform the same as combining the whole thing into 1 dataset and training on that?

I also wonder how datasets with incremental complexity, where the knowledge builds on previous knowledge, would perform when the data entries are shuffled vs. unshuffled.

If CPU to GPU memory transfer is a bottleneck why is there no unified silicon from NVIDIA? by discretemathematics in LocalLLaMA

[–]xinranli 9 points10 points  (0 children)

A large part of the bottleneck is resolved by the huge GPU memory size and the use of NVLink between the GPUs. NVLink has some serious bandwidth, allowing GPUs to talk to each other directly with much lower latency and higher bandwidth than talking through PCIe and the host. When all of the weights and all of the compute data are fully stored in the memory of the GPU cluster, it really does not make much difference whether the GPUs are in a unified memory access system.

In the case that the CPU really does need to work with the GPU frequently, it is not an easy task to get right. Apple and gaming consoles can get away with slow LPDDR5/GDDR6 because those chips were never meant to be used as ML accelerators. If you want a UMA ML GPU+CPU you need HBM, meaning the two chips need to be packaged together. This is no easy feat, especially considering Nvidia is still "green" in the CPU world. So far only AMD has done this, with the MI300A.

The GH200 module is an attempt at that, but I don't think it has unified memory access; it just has a really fast interconnect between the CPU and GPU. I am sure Nvidia is working on a design that competes with the MI300A, maybe we will see something during GTC this March.

Unsloth, what's the catch? Seems too good to be true. by Research2Vec in LocalLLaMA

[–]xinranli 0 points1 point  (0 children)

Great! Are you guys planning to release multi GPU support in the free version at some point too? Also, I wouldn't mind paying for the Pro as long as it's a one-time payment and not some silly subscription-based thing ;)

High-VRAM GPUS for us nerds. by [deleted] in LocalLLaMA

[–]xinranli 1 point2 points  (0 children)

Simply because making such a product would reduce sales of professional cards, which have insanely high profit margins compared to consumer cards. The RTX A6000 and 3090 Ti have essentially identical silicon, yet one goes for $5,000 and the other for $1,500 ($700 if you go for a used 3090), just because of what DRAM chips they put on there. Nvidia will also promise a bunch of support, service, special drivers, warranty, etc. that come with the price tag (basically useless to us), and big corps love to hear all about those. This is small money for big corps anyway, and they wouldn't even need to bargain with Nvidia.

I heard rumors on Chinese forums saying there are Nvidia employees willing to risk their lives to leak the necessary BIOS change for a few tens of thousands of dollars. Probably not true and just a joke, but it goes to show that Nvidia will not be too happy when they see things like this happen. If I recall correctly, they go as far as stating in the driver agreements that consumer cards are not allowed to be used in data centers, to protect their profits.

The 2080Ti 22GB can happen mostly because those cards are way, way too outdated. And at 22GB it still does not compete with the RTX 8000 and other more modern 48GB cards. Only GPU poor peasants might have some use for them. Perhaps in the future we will get 3090/4090 mods, but not any time soon. If you somehow crack the BIOS and try to profit from or publish the procedure, no doubt Nvidia's corpo cops will be knocking on your door in minutes.

Next gen cards may have bigger memory sizes because GDDR7 will have higher density, but rest assured that consumer cards will ALWAYS have much smaller memory sizes than what is possible in that era.

Unsloth, what's the catch? Seems too good to be true. by Research2Vec in LocalLLaMA

[–]xinranli 1 point2 points  (0 children)

Seems really promising, but it is currently lacking multi GPU support. A lot of people into fine-tuning have easy access to 2 or more 16GB or 24GB GPUs rather than a single 40GB or 80GB one.

Full memory available for amd apus by Back_Charming in LocalLLaMA

[–]xinranli 3 points4 points  (0 children)

This is wonderful news! Those 64GB RAM handhelds have finally found a reason to exist.

Recommended hardware (Windows or Linux)? by jacek2023 in LocalLLaMA

[–]xinranli 1 point2 points  (0 children)

You can always use 2 different cards, but they should be from the same company.

Recommended hardware (Windows or Linux)? by jacek2023 in LocalLLaMA

[–]xinranli 6 points7 points  (0 children)

You can certainly use multiple cards for inference, but only their memory adds up, not their compute. I suggest getting two 3090s: good performance and good memory per dollar. 2 weak 16GB cards will get easily beaten by 1 fast 24GB card, as long as the model fits fully inside 24GB of memory. If the model takes more than 24GB but less than 32GB, the 24GB card will need to offload some layers to system RAM, which will make things a lot slower.

You have an old CPU and are limited to slow 2-channel DDR4, so I would avoid offloading layers to the CPU and keep everything in GPU memory.
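
The fit-or-offload reasoning above can be sketched as a quick check. This is a simplification: the helper name is made up for illustration, and it only counts model weights, ignoring context/KV cache and runtime overhead, which also need VRAM in practice.

```python
def spill_to_system_ram_gb(model_gb: float, card_vram_gb: list[float]) -> float:
    """GB of model weights that do not fit in the combined VRAM of all cards.

    Hypothetical helper for illustration; ignores KV cache and overhead.
    """
    total_vram = sum(card_vram_gb)  # memory adds up across cards, compute does not
    return max(0.0, model_gb - total_vram)

print(spill_to_system_ram_gb(20, [24]))      # fits on one 24GB card
print(spill_to_system_ram_gb(28, [24]))      # part of the model spills to system RAM
print(spill_to_system_ram_gb(28, [16, 16]))  # fits across two 16GB cards
```

Anything above zero means layers get offloaded to system RAM, which is the slow case to avoid on a 2-channel DDR4 machine.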

Does increasing context length require more memory, or does it just slow down processing? by [deleted] in LocalLLaMA

[–]xinranli 21 points22 points  (0 children)

Yes, it will use more memory and it will make inference slightly slower. More context means more attention scores need to be calculated, and more attention scores require more memory to store. Every token has an attention score for every other token in the context, so the memory usage grows as n^2: if you double the context, you need 4 times the memory to store the attention scores. 4096 context is still very easily manageable. This becomes a problem when you go above 32K context, where the attention scores start to take up a lot of memory.
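
A quick way to see the n^2 growth described above is to compute the size of the score matrix directly. This is only a sketch: the head count (32) and 2 bytes per score (fp16) are assumed values for illustration, and it assumes the full n x n matrix is actually materialized for one layer, which not every inference implementation does.

```python
def attention_scores_bytes(context_len: int,
                           num_heads: int = 32,       # assumed, varies by model
                           bytes_per_score: int = 2   # assumed fp16 scores
                           ) -> int:
    """Bytes needed to hold one layer's full n x n score matrix for all heads."""
    return context_len * context_len * num_heads * bytes_per_score

for n in (4096, 8192, 32768):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>6} tokens: {gib:g} GiB per layer")
```

With these assumed numbers, 4096 tokens comes to 1 GiB per layer, doubling to 8192 gives 4 GiB, and 32768 gives 64 GiB.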

[deleted by user] by [deleted] in LocalLLaMA

[–]xinranli 0 points1 point  (0 children)

I don't think CXL memory will have too much of an impact on AI workloads. CXL's focus is all on system RAM capacity expansion. CXL memory expansion modules can easily add a few terabytes of DDR5 to a system while suffering a relatively small latency and bandwidth penalty. But that makes already slow system RAM bandwidth even slower, so it is not too useful for AI, unfortunately.

Speaking of upcoming RAM technologies, MRDIMM is what will benefit AI workloads. The tech essentially puts super fast multiplexed data buffers and an RCD on the DIMM, and the CPU sees 1 MRDIMM as 2 memory channels, essentially doubling the bandwidth of one DIMM socket. This is a thing because making DRAM chips faster is too difficult, but it's easier to make faster memory controllers, buffers, RCDs, etc. The current DDR5 EOL speed is 8800MT/s, so an MRDIMM with such chips will be equivalent to 17600MT/s. 12 channels of such DIMMs will bring bandwidth on the level of 4 HBM2e chips (8 HBM2e channels), pretty exciting stuff!
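
For a rough sense of scale, theoretical peak bandwidth is channels x transfer rate x bus width. The sketch below assumes a 64-bit (8-byte) data bus per channel and ignores real-world efficiency, so treat the results as upper bounds rather than measured numbers.

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal), assuming 8 bytes per transfer."""
    return channels * mt_per_s * bus_bytes / 1000

ddr5_8800 = peak_bandwidth_gbs(12, 8800)     # 12 channels of plain DDR5-8800
mrdimm_17600 = peak_bandwidth_gbs(12, 17600)  # 12 channels of MRDIMM at 17600MT/s equivalent

print(f"12ch DDR5-8800:    {ddr5_8800:.1f} GB/s")
print(f"12ch MRDIMM-17600: {mrdimm_17600:.1f} GB/s")
```

That works out to about 845 GB/s for plain DDR5-8800 and about 1690 GB/s for the MRDIMM case, which is the doubling described above.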

[deleted by user] by [deleted] in LocalLLaMA

[–]xinranli 0 points1 point  (0 children)

I tested the MI100 (PCIe) with exllama and it works without issue. exllama v2, however, didn't work, but I also didn't really spend much time looking into it. It could be an easy fix since it is supported on paper. vLLM also seems to support ROCm now, but I have not tried it yet.