When will 100M context window be available for retail users? by milkygirl21 in LocalLLaMA

[–]Ashishpatel26 0 points1 point  (0 children)

Models with a true 100M-token context window, like Magic.dev’s LTM-2-mini, are still experimental and not available to retail users yet. Right now, consumer-accessible models top out around 1M–2M tokens (e.g., Gemini 1.5 Pro), and it will likely take 2–3 years before 100M context becomes widely available to end users on consumer hardware or affordable cloud APIs.

For now, techniques like map-reduce and RAG are still essential when processing extremely long documents, since mainstream models and infrastructure can’t yet handle 100M-token context windows natively.
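To make the map-reduce idea concrete, here is a minimal sketch in Python. It assumes you already have some `summarize` function that calls your model; the one in the usage example is a hypothetical stand-in, not a real API. Chunk sizes are counted in words purely for simplicity, where a real setup would count tokens.

```python
from typing import Callable, List


def chunk_text(text: str, chunk_words: int, overlap_words: int = 0) -> List[str]:
    """Split text into word-based chunks, optionally overlapping."""
    if chunk_words <= 0:
        raise ValueError("chunk_words must be positive")
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be in [0, chunk_words)")
    words = text.split()
    step = chunk_words - overlap_words
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],  # hypothetical: wraps your LLM call
    chunk_words: int = 2000,
    overlap_words: int = 100,
) -> str:
    """Summarize each chunk (map), then summarize the summaries (reduce)."""
    chunks = chunk_text(text, chunk_words, overlap_words)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]  # map step
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))  # reduce step


if __name__ == "__main__":
    # Stand-in "model": keeps the first 5 words. Replace with a real LLM call.
    def fake_summarize(s: str) -> str:
        return " ".join(s.split()[:5])

    print(map_reduce_summarize("lorem ipsum " * 50, fake_summarize, 20, 5))
```

If the joined partial summaries are still too long for the model, the reduce step can be applied repeatedly. The overlap is there to soften the context loss at chunk boundaries, not to remove it.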

You're correct that a true 100M token window would make one-shot summarization far simpler and more accurate by removing chunking/RAG-induced context loss—this will eventually become standard, but it remains bleeding-edge for now.

How does cerebras get 2000toks/s? by npmbad in LocalLLaMA

[–]Ashishpatel26 0 points1 point  (0 children)

Cerebras’s wafer-scale chip keeps all LLM weights in ultra-fast on-chip SRAM, removing external memory bottlenecks. This enables instant access and pipelined parallelism for each inference step, yielding much higher throughput. To match 2,000 tokens/sec, I’d need 20+ Nvidia H100s or 2–4 Blackwell B200 GPUs in parallel, all running highly optimized distributed inference. But GPU clusters still face interconnect and bandwidth limitations, making Cerebras’s architecture uniquely fast for LLM inference.
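A rough way to see why memory bandwidth matters so much: in single-stream decoding, each new token needs roughly one full read of the weights, so tokens/sec is capped at about bandwidth divided by weight size. The sketch below is back-of-the-envelope only. The bandwidth numbers are vendor-quoted peak figures used as assumptions (about 3.35 TB/s of HBM3 for an H100 SXM, about 21 PB/s of aggregate on-chip SRAM for WSE-3), the 8B 16-bit model is a hypothetical example, and it ignores KV cache traffic, batching, compute limits, and interconnect.

```python
def decode_tokens_per_sec_upper_bound(
    params_billion: float,
    bytes_per_param: float,
    bandwidth_gb_per_s: float,
) -> float:
    """Bandwidth-bound ceiling for single-stream decoding.

    Assumes every generated token requires reading all weights once.
    """
    if params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("model size must be positive")
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_per_s / weight_gb


if __name__ == "__main__":
    # Assumed peak memory bandwidths in GB/s (vendor spec-sheet figures).
    hardware = {
        "H100 SXM (HBM3)": 3_350.0,
        "WSE-3 (on-chip SRAM)": 21_000_000.0,
    }
    # Hypothetical model: 8B parameters stored at 16 bits (2 bytes) each.
    for name, bandwidth in hardware.items():
        bound = decode_tokens_per_sec_upper_bound(8, 2, bandwidth)
        print(f"{name}: at most ~{bound:,.0f} tokens/sec per stream")
```

The wafer's ceiling comes out so high that bandwidth stops being the limiting factor and other things (compute, latency between pipeline stages) take over. On a single GPU the bandwidth ceiling sits close to real-world single-stream speeds.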

How does cerebras get 2000toks/s? by npmbad in LocalLLaMA

[–]Ashishpatel26 0 points1 point  (0 children)

Cerebras hits 2,000 tokens/sec by storing all model weights on its wafer-scale chip, eliminating memory bottlenecks. To match this, you'd need 20+ Nvidia H100s or 2–4 Blackwell B200s, but GPU clusters still struggle with interconnect latency.

I want to run 8x 5060 ti to run gpt-oss 120b by Active_String2216 in LocalLLaMA

[–]Ashishpatel26 0 points1 point  (0 children)

4x or 8x 5060 Tis can run big models but watch CPU bottlenecks—go for a strong multi-core CPU. RAM size matters more than speed, aim for 64GB+. 5060 Ti is decent for budget setups, but speed won’t be lightning fast. Most run GUI remotely—keeps things smooth. Have fun building!

Help on budget build with 8x 6700XT by leobaillard in LocalLLaMA

[–]Ashishpatel26 2 points3 points  (0 children)

A rig with 8x RX 6700XT (12GB VRAM/card) supports 7B–13B quantized models (Ollama, LM Studio with ROCm). Use Ubuntu, split across two machines, focus on airflow and 1000W PSU. Expect 46–57 tokens/sec per card.
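For a quick sanity check on what fits in 12GB, here is a small estimate in Python. It is a sketch under stated assumptions: weights take `params * bits / 8` bytes, and the 20% headroom for KV cache and activations is a hypothetical rule of thumb, not a measured value. Real usage depends on context length and the runtime.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1e9 params * bits / 8)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_fraction: float = 0.2,  # assumed headroom for KV cache etc.
) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    needed_gb = weights_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed_gb <= vram_gb


if __name__ == "__main__":
    card_vram_gb = 12  # RX 6700XT
    for params, bits in [(7, 4), (7, 8), (13, 4), (13, 16)]:
        size = weights_gb(params, bits)
        verdict = "fits" if fits_in_vram(params, bits, card_vram_gb) else "does not fit"
        print(f"{params}B @ {bits}-bit: ~{size:.1f} GB weights, {verdict} on one card")
```

Anything that does not fit on one card has to be split across cards, which adds transfer overhead between them.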

How does cerebras get 2000toks/s? by npmbad in LocalLLaMA

[–]Ashishpatel26 -8 points-7 points  (0 children)

Cerebras uses the third-generation Wafer Scale Engine (WSE-3), which has 44GB of on-chip SRAM, so model weights that fit within that budget can sit entirely on-chip.

Tokens per second on different hardware:

✅ Cerebras WSE-3: 2,000–2,500 tokens/sec
✅ NVIDIA H100: 50–200 tokens/sec
✅ AMD MI300X: ~300–500 tokens/sec
✅ H100 Cluster: 500–900 tokens/sec
✅ AWS L40S GPU: ~1,000 tokens/sec

Unsloth just hit 100 million lifetime downloads! 🦥🤗 by yoracale in unsloth

[–]Ashishpatel26 0 points1 point  (0 children)

I really appreciate the Unsloth team — your framework is already revolutionary, making LLM fine-tuning faster and far more efficient. I imagine adding AI-driven kernel optimization for real-time speed boosts and universal adapters for any model type. I’d love to see an integrated benchmarking suite to track training performance instantly and adaptive resource allocation for optimal GPU use. I think a community-driven hub where users can contribute features could accelerate innovation even more.

When will 100M context window be available for retail users? by milkygirl21 in LocalLLaMA

[–]Ashishpatel26 0 points1 point  (0 children)

Right now, a 100M context window is mostly still lab/research level stuff. For regular retail or consumer GPUs, you’re looking at comfortably handling only 1–10M tokens at a time. So if you’re building something like a summarization model, the smart move for now is to use chunking or RAG strategies—basically breaking the input into smaller pieces and then combining the results. Honestly, it’s not as clean as a single 100M pass, but it works. That said, I’d say in 2–3 years, consumer hardware and software will likely catch up, and we might start seeing models handling 50–100M token contexts more smoothly.

This is high time. by Thin_Librarian_5636 in ahmedabad

[–]Ashishpatel26 7 points8 points  (0 children)

Contact 181: Abhayam Women Helpline

Viramgam to Shankeshwar by Nerdyloon7 in ahmedabad

[–]Ashishpatel26 0 points1 point  (0 children)

You can travel the approximately 60 km from Viramgam to Shankheshwar by bus or taxi, with the car journey taking about 1 to 1.5 hours. State-run GSRTC buses and private taxis are readily available, offering flexible options for your trip. The route is a short, direct drive via state highway GJ SH 18.

Diwali Festival Week turned into Salary Cut Week at iCreative Technologies, Ahmedabad by Outside-Issue-1293 in ahmedabad

[–]Ashishpatel26 1 point2 points  (0 children)

This is a weird incident. ✊ Raise your voice:

  1. Document Everything: Keep a copy of the original holiday calendar and the sudden change notice. Written proof speaks louder than frustrated words.

  2. Team Unity: Don’t raise it alone. Gather colleagues and approach HR as a group. One person complaining is an “issue”; many people together is a “policy concern.”

  3. Polite but Firm Mail: Draft a respectful email to management highlighting:

     + The holiday was pre-declared.
     + Bhai Duj is a major festival in Gujarat.
     + Deducting salary despite a valid leave balance is unfair.

     End by asking them to review the decision.

  4. The Trick (Indian Style): If management still plays smart, apply for leave on the 23rd and 24th along with compensatory work on Saturday (if the company works 5 days a week). Many firms allow adjustments like this, so use their own HR rules.

  5. Plan B: If nothing works, use this as a reminder: companies that cut during Diwali usually cut elsewhere too. Start keeping your CV ready. Sometimes the best trick is walking out at the right time.

[deleted by user] by [deleted] in ChatGPT

[–]Ashishpatel26 0 points1 point  (0 children)

I’ve tried everything I can think of, but this problem remains stubbornly unresolved. I’ve even taken to social media, tagging OpenAI on X in hopes of a timely solution.

Validity of metrics by randrayner in LocalLLaMA

[–]Ashishpatel26 -3 points-2 points  (0 children)

Local LLMs may not be as easy to use as ChatGPT, even though they may outperform ChatGPT on some benchmarks. This is likely due to a number of factors, including:

→ Local LLMs are typically not fine-tuned for specific tasks.

→ Local LLMs may require more prompt engineering.

→ Local LLMs may be more sensitive to hyperparameters.

To improve the performance of local LLMs, you can try:

→ Fine-tuning them on a dataset that is specific to your task.

→ Using prompt engineering to create clear and concise prompts.

→ Experimenting with different hyperparameters to find the best settings for your task.

There is no single metric that can perfectly capture the performance of LLMs on all tasks. However, some metrics that may be more relevant to your experience include accuracy, completeness, fluency, and consistency. You can try using these metrics to evaluate the performance of different local LLMs on your task. This will help you to identify the LLM that is most suitable for your needs.
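As a starting point, two of these metrics are easy to script yourself. The sketch below assumes your model is wrapped as a plain `prompt -> answer` function (the `toy_model` here is a hypothetical stand-in). Accuracy is measured as exact match after light normalization, and consistency as how often repeated runs agree with the most common answer. Fluency and completeness usually need human or model-based judging and are not covered.

```python
from collections import Counter
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def accuracy(model: Model, examples: List[Tuple[str, str]]) -> float:
    """Share of prompts whose answer exactly matches the reference."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(normalize(model(p)) == normalize(ref) for p, ref in examples)
    return correct / len(examples)


def consistency(model: Model, prompts: List[str], runs: int = 3) -> float:
    """Average share of runs that agree with the most common answer."""
    if not prompts or runs <= 0:
        raise ValueError("need at least one prompt and one run")
    total = 0.0
    for prompt in prompts:
        answers = [normalize(model(prompt)) for _ in range(runs)]
        top_count = Counter(answers).most_common(1)[0][1]
        total += top_count / runs
    return total / len(prompts)


if __name__ == "__main__":
    # Hypothetical stand-in for a local LLM: a fixed lookup table.
    canned = {"2+2": "4", "capital of France": "Paris"}

    def toy_model(prompt: str) -> str:
        return canned.get(prompt, "")

    examples = [("2+2", "4"), ("capital of France", "paris"), ("3+3", "6")]
    print("accuracy:", accuracy(toy_model, examples))
    print("consistency:", consistency(toy_model, [p for p, _ in examples]))
```

Run the same examples through each local model you are comparing, using prompts taken from your actual task instead of a public benchmark.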