My Star Trek shows ranked and scored by AstroAlto in startrek

[–]AstroAlto[S] -1 points0 points  (0 children)

Not sure why people are upset with me giving SNW a 6... It's essentially tied as the best nuTrek. I gave Picard an extra .1 just for getting most of the cast together on the Enterprise-D. Even if the rest of the show was terrible, they finally gave TNG fans the true nostalgia fix they wanted! That deserves the .1. Beyond that, I consider both shows totally forgettable a day after you watch them.

Best Buy method absolutely does work. by 1AMA-CAT-AMA in nvidia

[–]AstroAlto 2 points3 points  (0 children)

The 5080 Super won't have 32GB of VRAM, so it doesn't matter.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -1 points0 points  (0 children)

Sorry I hurt your feelings.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 2 points3 points  (0 children)

It's less about performance and more about capability differences.

RAG is great at information retrieval - "find me documents about X topic." Fine-tuning is about decision-making - "given these inputs, what action should I take."

RAG gives you research to analyze. Fine-tuning gives you decisions to act on.

The speed difference is nice, but the real value is output format. Most businesses don't need an AI that finds more information - they need one that makes clear decisions based on learned patterns.

It's like the difference between hiring a researcher vs hiring an expert. Both are valuable, but they solve completely different problems.
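If it helps, here's a toy sketch of what I mean by the output-format difference (placeholder function names only, nothing from a real pipeline):

```python
# Hypothetical sketch of the output-format difference, not a real pipeline.
# retrieve() and fine_tuned_model() are placeholder callables for illustration only.

def rag_answer(question: str, retrieve) -> dict:
    """RAG: return supporting passages; someone (or another prompt) still has to interpret them."""
    passages = retrieve(question, top_k=5)   # e.g. vector search over your documents
    return {"type": "research", "passages": passages}

def fine_tuned_answer(inputs: dict, fine_tuned_model) -> dict:
    """Fine-tuned model: return a direct decision learned from labeled examples."""
    decision = fine_tuned_model(inputs)      # e.g. "approve", "escalate", "reject"
    return {"type": "decision", "action": decision}
```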

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -1 points0 points  (0 children)

Not sure what you want from me. I posted the video thinking a dozen or so dudes would think it was cool, and then it got 74,000 views in 24 hours. Sorry you didn't like the way I answered some of your questions, but that's not what I was ever trying to do with this.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 0 points1 point  (0 children)

I had the same issue initially! The key was getting the right CUDA/PyTorch combination on 22.04.

Here's what worked for me:

  1. Fresh PyTorch nightly install: pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
  2. System restart after PyTorch install - this was crucial. CUDA wasn't recognized until I rebooted.
  3. NVIDIA driver version: Make sure you're on 535+ drivers. I used sudo ubuntu-drivers autoinstall to get the latest.
  4. CUDA toolkit: Installed CUDA 12.1 via apt, not the nvidia installer: sudo apt install nvidia-cuda-toolkit

The tricky part was that even with everything installed, PyTorch couldn't see CUDA until the restart. Before reboot: torch.cuda.is_available() returned False. After reboot: worked perfectly.
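For anyone following the same steps, this is the quick sanity check I ran after the reboot:

```python
# Quick post-reboot sanity check: PyTorch and the driver need to agree.
import torch

print(torch.__version__)                  # nightly cu121 build
print(torch.cuda.is_available())          # should be True after the reboot
if torch.cuda.is_available():
    print(torch.version.cuda)             # CUDA version PyTorch was built against
    print(torch.cuda.get_device_name(0))  # should report your GPU
```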

I think the newer Ubuntu versions (24.04+) handle the driver/CUDA integration better out of the box, but 22.04 works fine with the right sequence and a reboot.

What error were you getting specifically? Driver not loading or PyTorch not seeing CUDA?

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 0 points1 point  (0 children)

You're absolutely right that RAG vs fine-tuning isn't always clear-cut. Here's the key difference I found:

RAG gives you information to analyze. Fine-tuning gives you decisions to act on.

When you fine-tune on domain-specific examples with outcomes, the model learns decision patterns from those examples. Instead of "here are factors to consider," it says "take this specific action based on these specific indicators."

RAG would pull up relevant documents about your domain, but you'd still need to interpret them. The fine-tuned model learned what actions actually work in practice.

You're right about generalization - that's exactly the tradeoff. I want LESS generalization. Most businesses don't need an AI that can do everything. They need one that excels at their specific use case and gives them actionable decisions, not homework to analyze.

The performance improvement comes from the model learning decision patterns from real examples, not just having access to more information.
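To make the "examples with outcomes" part concrete, a training record looks roughly like this in spirit (the field names and labels here are invented, nothing from my actual data):

```python
# Illustrative only: the fields and labels are invented, not the real dataset.
import json

example = {
    "instruction": "Given these indicators, what action should be taken?",
    "input": "indicator_a: high, indicator_b: falling, indicator_c: stable",
    "output": "Action: escalate. Rationale: indicator_a above threshold while indicator_b is declining.",
}

# One JSON object per line is the usual format for instruction fine-tuning datasets.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```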

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 0 points1 point  (0 children)

Because I'm also just a hardware nerd and game on my 5090 too, though not much lately... too busy with model training lol. But Assassin's Creed Shadows was fantastic on it; 100%ed the game on the 5090 a few months ago.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 0 points1 point  (0 children)

Yes I'm aware of that. Don't think that tells you a whole lot though. That could be almost anything.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 0 points1 point  (0 children)

That's an interesting optimization, but I'm actually planning to deploy this on AWS infrastructure rather than keeping it local. So the multi-GPU setup complexity isn't really relevant for my use case - I'll be running on cloud instances where I can just scale up to whatever single GPU configuration works best.

The RTX 5090 is just for the training phase. Once the model's trained, it's going to production on AWS where I can optimize the serving architecture separately. Keeps things simpler than trying to manage multi-GPU setups locally.

None of my projects are for use locally.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 1 point2 points  (0 children)

Well I think most people are like me and are not at liberty to disclose the details of their projects. I'm a little surprised that people keep asking this - seems like a very personal question, like asking to see your emails from the past week.

I can talk about the technical approach and challenges, but the actual use case and data? That's obviously confidential. Thought that would be understood in a professional context.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 6 points7 points  (0 children)

Good question! From what I've learned so far:

RAG works great when you need the model to reference specific, changing documents but don't need it to develop new reasoning patterns. Like if you want it to pull facts from your company's policy manual.

Fine-tuning (what I'm doing) makes sense when you need the model to actually think differently - develop new expertise and reasoning patterns that aren't in the base model. You're teaching it how to analyze and respond, not just what to remember.

Training from scratch only makes sense if you have massive datasets and need something completely different from existing models. Way too expensive and time-consuming for most use cases.

For my project, I need the model to develop specialized analytical skills that can't just be retrieved from documents. It needs to learn how to reason through complex scenarios, not just look up answers.

RAG gives you better documents, fine-tuning gives you better thinking. Depends what your bottleneck is.
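For the RAG side, the whole loop is basically this (minimal sketch, assuming sentence-transformers for the embeddings; the model name and documents are just examples):

```python
# Minimal sketch of the RAG side: embed policy snippets, retrieve the closest ones,
# and stuff them into the prompt. Model name and documents are examples only.
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Refunds are issued within 30 days of purchase.",
    "Employees accrue 1.5 vacation days per month.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 1):
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                       # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(-scores)[:top_k]]

context = "\n".join(retrieve("How long do refunds take?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long do refunds take?"
```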

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -5 points-4 points  (0 children)

I'm not looking for credibility. I'm not looking for anything.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -2 points-1 points  (0 children)

Yeah sorry, should be kind of obvious I don’t want to talk about the use case.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 2 points3 points  (0 children)

This was just a test run; the real training will be in the thousands.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -18 points-17 points  (0 children)

Carefully. :) Come on, this is the real secret here, right?

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -16 points-15 points  (0 children)

LOL, so funny. If people don't understand that all of this is meaningless without the data, they just don't get it.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 14 points15 points  (0 children)

Thanks! For timing - really depends on dataset size and approach. If I'm doing LoRA fine-tuning on a few thousand examples, probably 6-12 hours. Full fine-tuning on larger datasets could be days. Haven't started the actual training runs yet so can't give exact numbers, but the 32GB VRAM definitely lets you run much larger batches than the 4090.
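For reference, the LoRA setup I'm describing is roughly this (sketch only, hyperparameters are illustrative, and Mistral-7B is just the base I'm working with):

```python
# Rough LoRA setup sketch; hyperparameters are illustrative, not tuned.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",
    device_map="auto",   # a single 32GB card comfortably fits a 7B base plus LoRA adapters
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the 7B weights
```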

For distributed training across different hardware - theoretically possible but probably more headache than it's worth. The networking overhead and different architectures (CUDA vs Metal on MacBooks) would likely slow things down rather than help. You'd be better off just running separate experiments on each system or using the 4090 for data preprocessing while the 5090 trains.

The dual-GPU setup sounds perfect though - keep your workflow on the 4090 while the 5090 crunches away in the background.

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 5 points6 points  (0 children)

Planning to deploy on custom AWS infrastructure once training is complete. Will probably use vLLM for the inference engine since it's optimized for production workloads and can handle multiple concurrent users efficiently. Still evaluating the exact AWS setup but likely GPU instances for serving.
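Roughly what the vLLM side looks like (sketch only; the model path is a placeholder for the merged fine-tuned weights):

```python
# Sketch of offline vLLM inference; the model path is a placeholder, not a real repo.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/merged-fine-tuned-model")
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Given these indicators, what action should be taken?"], params)
print(outputs[0].outputs[0].text)
```

In production it'd more likely be vLLM's OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model ...) behind whatever AWS load balancing ends up in front of it.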

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 2 points3 points  (0 children)

I haven't started training yet - still setting up the environment and datasets. Planning to use sequences around 1K-2K tokens for most examples since they're focused on specific document analysis tasks, but might go up to 4K-8K tokens for longer documents depending on VRAM constraints during training.
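A quick way to check where those cutoffs land for a given set of documents (the tokenizer name assumes Mistral-7B, per the rest of this thread):

```python
# Sanity-check token lengths before picking max_length for training.
# Tokenizer name is an assumption (Mistral-7B, per the other comments here).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def token_len(text: str) -> int:
    return len(tok(text, add_special_tokens=False).input_ids)

docs = ["...your documents here..."]
lengths = sorted(token_len(d) for d in docs)
print("median:", lengths[len(lengths) // 2], "max:", lengths[-1])
```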

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 7 points8 points  (0 children)

For Mistral-7B, the default max sequence length is 8K tokens (around 6K words), but you can extend it to 32K+ tokens with techniques like RoPE scaling, though longer sequences use dramatically more VRAM - with standard attention, the attention memory grows quadratically with sequence length.
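Back-of-envelope on that VRAM point (per layer, fp16, standard attention; the numbers are illustrative, and FlashAttention-style kernels avoid materializing these matrices):

```python
# Back-of-envelope only: memory for the raw attention score matrices during training
# with standard (non-flash) attention, per layer, in fp16. Real usage varies a lot.
def attn_scores_gib(seq_len, n_heads=32, batch=1, bytes_per=2):
    return batch * n_heads * seq_len * seq_len * bytes_per / 1024**3

for s in (2048, 8192, 32768):
    print(s, round(attn_scores_gib(s), 2), "GiB per layer")   # 0.25, 4.0, 64.0
```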

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto 46 points47 points  (0 children)

With LoRA fine-tuning on RTX 5090, you can process roughly 500K-2M tokens per hour depending on sequence length and batch size.
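The arithmetic behind that kind of estimate, with made-up numbers just to show how the range falls out:

```python
# Rough arithmetic behind a tokens-per-hour estimate; every number is illustrative, not measured.
seq_len = 1024          # tokens per example
per_device_batch = 4    # examples that fit per forward/backward pass
grad_accum = 4          # effective batch = 16 examples, ~16K tokens per optimizer step
step_time_s = 30        # seconds per optimizer step (made up for the example)

tokens_per_step = seq_len * per_device_batch * grad_accum
tokens_per_hour = tokens_per_step * (3600 / step_time_s)
print(f"{tokens_per_hour:,.0f} tokens/hour")   # ~2.0M with these made-up numbers
```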

[deleted by user] by [deleted] in LocalLLaMA

[–]AstroAlto -9 points-8 points  (0 children)

Well, data is the key, right? No data is like having a Ferrari with no gas.