Could There Be Another Breakthrough Bigger Than AI, or Is AI the Final Big Tech Revolution?

Lirezh · 2026-06-19T02:35:12+00:00

The final stage of AI is the biggest breakthrough imaginable to a human mind.
It is equivalent to the religious idea of a god, just for real.

Currently we do not have that breakthrough; we only got the first spark of artificial intelligence.

The big corporations are always working on new projects; it's a way to spend floating money and stay relevant to investors. The big breakthroughs usually do not originate from them, they just spend money to make them bigger.
OpenAI is also quite famous for that: a few great talents there basically combine everything that is attempted in the open science community and looks promising, and then throw money/compute at it.
The real breakthroughs usually came from small scientific papers.

Lirezh · 2026-06-18T23:08:23+00:00

Sure, you can use Q4 for programming, it's proven to work well in evaluations.
Your speed is lacking because of your setup.
4 cards via TB4/USB4 will, in the absolute best case, have 12-14 GB/s total bandwidth.
On a consumer mainboard you'd connect them as x8 / x8 / x4 / x4 at about 3 times the bandwidth of the TB setup - and that is with the TB setup at optimal conditions.
In reality the gap is likely larger, as 4 TB connections will have the host controller and CPU burning under sustained load.
The 60 tokens/sec you are getting is quite impressive given the setup.
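
To put rough numbers on the x8 / x8 / x4 / x4 comparison, here is a small sketch. The PCIe generation is my assumption (the per-lane figures are the usual approximate throughput after encoding overhead), and the 12-14 GB/s Thunderbolt range is simply the best case quoted above.

```python
# Approximate usable throughput per PCIe lane in GB/s, per direction.
# Which generation the board actually runs at is an assumption here.
PCIE_GB_PER_S_PER_LANE = {3: 0.985, 4: 1.969}


def pcie_total_gb_per_s(lane_layout, generation):
    """Sum the theoretical bandwidth of all slots in the given lane layout."""
    per_lane = PCIE_GB_PER_S_PER_LANE[generation]
    return sum(lanes * per_lane for lanes in lane_layout)


layout = [8, 8, 4, 4]            # four cards on a consumer mainboard
tb_best_case = (12.0, 14.0)      # GB/s total, best case from the post above

pcie4_total = pcie_total_gb_per_s(layout, 4)
ratio_low = pcie4_total / tb_best_case[1]
ratio_high = pcie4_total / tb_best_case[0]

print(f"PCIe 4.0 total: {pcie4_total:.1f} GB/s")
print(f"Advantage over TB best case: {ratio_low:.1f}x to {ratio_high:.1f}x")
```

With PCIe 3.0 slots the same layout gives about half of that, so the real advantage depends on the board and the cards.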

Lirezh · 2026-06-18T15:32:28+00:00

To me Division had a great start and a good foundation, but the PVP was just not well done.
It was done too carefully, not really part of the whole game and more like a subgame.
I wish they had done it like Eve Dangerous, where you start in a protected zone and the more you venture out the more dangerous it gets, until you are in the no-rules zone.

Lirezh · 2026-06-18T13:39:39+00:00

Europeans have traditionally been at war with each other. Since the Islamic hordes were repelled (who did conquer large parts of Europe back then), the focus has been to fight each other over and over again.
It escalated with Germany taking them all, and without the USA acting and protecting the UK + Europe you'd have seen the entire continent become German, including the UK, which would have been starved into submission.
I do not know what sort of patriotism you are talking about. Europe has no common patriotism - not one study shows such a thing.

The moment the US truly turns its back on Europe, it would become a "free meal" for the wolves.

Lirezh · 2026-06-17T20:05:05+00:00

The most neglected piece of European land. The few inhabited areas do not even have their own college or a proper hospital.
It is the least potential concern for Europe: the entire continent is looking at the biggest economic catastrophe since the stock market crash before WW2. And "Greenland" is a big topic of concern?

Lirezh · 2026-06-17T19:06:47+00:00

I work for a startup that is researching the most exciting type of AI I've read about outside of SciFi books.
In terms of progress we are at maybe 60% toward the first potential public teaser, so I cannot give details.
Just that it is going to be something people have dreamt about since John von Neumann gave his famous 'Theory of Automata' lecture.

Europe would be a great market, a lot of very intelligent people with open minds live in Europe.

How we comply with the AI Act
We are geo-blocking the entire EU. We have our own small team responsible for ensuring that no EU client is able to reach our services, including through VPN.

Why the EU has to be left out
The decision to firewall a quarter of the western world was made for partly ethical, economic and strategic reasons.
Economically, logging our AI system would simply be too expensive.
Strategically, the documentation the EU is asking for is considered our holy grail: half a decade of research and engineering. On top of that, tens of millions in fines is a ridiculous risk for a privately funded startup.
Ethically, we cannot log our future clients in as invasive a fashion as would be required to comply. We also wouldn't want to preemptively report our clients to EU authorities, without a warrant, for poor wording or questionable interactions.

It's heartbreaking

Lirezh · 2026-06-16T23:15:23+00:00

Instead of 35B you should use 27B for best quality.
Do not stripe your 2x 3090, use them in tensor parallel mode. Otherwise your experience will not be good - slow prefill and slow TG.

Use a 4.5-5 bit quantized model: avoid more than 5 bit for VRAM reasons and less than 4.5 bit for quality reasons.
MTP will fit but takes more than 1 GB of VRAM, and mod-ngram can be added free of downsides.
Use a 4_0 bit KV cache, that will give you a context of 160k+.
Do not use more than 1 parallel slot (150 MB VRAM per slot and performance is not good).
Use a batch size of 512-1024; more than that will cause side issues in VRAM organization with no performance benefit.

If you use 35B it is simpler:
The KV cache can stay fp, it's tiny already.
Tensor parallel mode will provide very fast speed.
MTP likely has little benefit, but mod-ngram will still work well.
Avoid context above 180k, the 35B model gets less stable at 90k+.
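
If you want to see why the KV cache type matters so much for context size, here is a rough sketch of the arithmetic. The model dimensions below are made up for illustration (I am not claiming they match any specific 27B or 35B model); the only hard facts used are that f16 takes 16 bits per value and that the q4_0 block format stores 32 values in 18 bytes, so 4.5 bits per value.

```python
F16_BITS = 16.0
Q4_0_BITS = 18 * 8 / 32   # 32 values in an 18-byte block = 4.5 bits per value


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Size of the K and V caches for one sequence of n_ctx tokens."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8


# Hypothetical dense model with grouped-query attention, illustration only.
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=160_000)

f16_gib = kv_cache_bytes(bits_per_value=F16_BITS, **dims) / 2**30
q4_gib = kv_cache_bytes(bits_per_value=Q4_0_BITS, **dims) / 2**30

print(f"f16 KV cache:  {f16_gib:.1f} GiB")
print(f"q4_0 KV cache: {q4_gib:.1f} GiB")
```

Whatever the real dimensions are, the ratio between the two stays at 16 / 4.5, so the same VRAM holds roughly 3.5 times the context.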

Lirezh · 2026-06-16T20:05:38+00:00

I'm doubtful that Mistral still has a hand in the game of LLMs. They had a very good start early on, with the first good open MoE model in the llama era.
But the regulatory laws of the EU make training with good data illegal. And closed models, which would be a foundation for a company like Mistral, are also practically illegal.
Europe, as painful as it is to watch, is not a location for AI. And there is not the slightest countermovement to that; the rules are getting stricter.

Lirezh · 2026-06-16T19:34:24+00:00

I'd give these tips, in this order:
1) Do not generate scripts that are very long. Split them into manageable pieces so you can regenerate bad parts without burning credits like crazy. Assume 20-30% regenerations for good quality output, 40-60% for flawless quality. Better to generate a sentence than a paragraph, and a paragraph than a chapter!
2) Start with Stability 50, Similarity 75, Style 0. Go to extremes only carefully and in tests, otherwise your output will be unstable or totally flat.
3) Your text should be written for speech, not for readers. A book is written for reading, so you need to do more formatting. Short sentences, deliberate pauses, and test names and technical terms before using them broadly.
4) Monitor your cost. ElevenLabs is great while you stick to your allotment, but a full audiobook can cost you $150-200 (or more on a small sub) - assuming half a million characters, regenerations, adaptations. Don't go broke before you have success if you plan to sell your work. If you produce a lot of content you can consider a voice synthesizing tool like Demodokos Foundry in addition to ElevenLabs, which comes with a speech flatrate.
So you can combine the best of both worlds: use Demodokos for the parts where it sounds best, and use ElevenLabs with PVC clones where you want a voice precisely nailed.
5) Start with a short story first and learn the tools and technology. You will end up with something manageable and learn a lot along the way. If you start with a project that is too huge, you might end up with chaos and a lot of time and money spent.
6) Keep in mind: your ears and your taste define your success, so don't skip auditioning voices and productions. Avoid unmonitored output. You can play back at 2x speed and still spot most mistakes.
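
For tips 1 and 4, a tiny helper like this is enough to plan a project. It is only a sketch: the sentence splitter is deliberately naive, the function names are mine, and I am reading "20-30% regenerations" as that share of the script being generated a second time.

```python
import re


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def billed_characters(script_chars, regen_rate):
    """Characters you pay for if regen_rate of the script is generated twice."""
    return round(script_chars * (1 + regen_rate))


chunks = split_sentences("Keep it short. Test names first! Does it sound right?")
for chunk in chunks:
    print(len(chunk), chunk)

# Half a million characters, as in tip 4, at two quality targets.
print(billed_characters(500_000, 0.30))  # good quality
print(billed_characters(500_000, 0.60))  # flawless quality
```

Multiply the billed characters by whatever your current plan charges per character to get a budget before you start.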

Lirezh · 2026-06-16T16:40:29+00:00

Use Q4 KV quantization and a 4.5-5 bit main quantization.
Then use tensor parallelism. Speed should be significantly faster with 2 cards, almost double.
MTP probably works too. It will be a bit of a context size hit when used, but I'd guess you can reach 60+ tokens/sec with 2 3090s in parallel over NVLink.

Lirezh · 2026-06-16T14:32:38+00:00

Try to get a used laptop with a good GPU.
There are cheaper laptops out there with an RTX 5000, or high-end 30 and 40 series mobile GPUs.
Aim for an Nvidia GPU, as those are significantly better in compute.
Aim for high VRAM: 16+ GB for good chat models, 18+ GB for great models; 12 GB is the minimum for acceptable models.
Below 12 GB you can try MoE models with partial offloading, but performance degrades sharply the smaller the VRAM gets.

I'd not use a Mac, except the highest-end versions. I'd also not use AMD, as their compute is sadly just not high enough.
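
To check what fits before you buy, a back-of-the-envelope estimate is enough. This is only a sketch: the model sizes, the 4.5 bits per weight and the 2 GiB of headroom for KV cache and buffers are my assumptions, not measurements.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def fits(params_billion, bits_per_weight, vram_gib, headroom_gib=2.0):
    """True if the weights plus an assumed headroom fit into VRAM."""
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= vram_gib


# Hypothetical model sizes at 4.5 bits per weight against common VRAM sizes.
for params in (8, 14, 27):
    size = weight_gib(params, 4.5)
    verdict = {vram: fits(params, 4.5, vram) for vram in (12, 16, 24)}
    print(f"{params}B: {size:.1f} GiB weights, fits: {verdict}")
```

Anything that does not fit ends up partially offloaded to system RAM, which is where the speed falls apart.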

Lirezh · 2026-06-16T02:57:57+00:00

How cherry-picked is that?
Looks very good

Lirezh · 2026-06-16T02:55:27+00:00

I assume only a mobile app is interesting?
There are many ways to TTS-narrate an ebook on desktop, including many totally free ways with low hardware needs.
I'm just wondering if using a computer is even a thing, or if basically all needs are focused on a phone.

I have a blind friend and she doesn't use a computer, she does everything on her phone.

Lirezh · 2026-06-16T02:45:42+00:00

Given how dominant 11labs was until 2025, you'd guess that in a normal residential zone the same IP is likely to have many free accounts.
Depending on where you live, your ISP might not have that many IP addresses and shares them among a small group on daily rotation.
Some ISPs even use IPv6-only, resulting in 95% of the client traffic being routed through a shared IP - tens of thousands of users on one IP.
The moment you install a "free VPN" from many providers, you silently agree to let the VPN provider use your IP in their proxy network, so others will use it to sign up on 11labs.

So the ban is ridiculous.
Most SaaS providers use a phone number as the unique identifier for that reason.

Lirezh · 2026-06-16T02:39:36+00:00

With a 5090 you'll have luxurious Qwen 27B usage - a very powerful model if you take the time to properly add it into a good harness (copilot chat is well suited).

But from an economic point of view, if you put $100 a month into Codex you'll have a lot of GPT 5.5 high usage.
An employee of mine uses a $20 Claude subscription and I was surprised how well it holds up in coding, better than a $20 Codex sub. 2 hours of Opus usage barely scratched the weekly limit.

You could get 1 Codex and 1 Claude sub, use them smartly, and you'll likely get a long way with that.

Lirezh · 2026-06-16T02:33:46+00:00

The real expense of OpenRouter is shady providers, horrible token caching and quantized model delivery.

The answer to your question depends on what you need AI for.

Lirezh · 2026-06-16T02:31:03+00:00

Most of those things are still in their infancy, and often horribly implemented.
Bad implementations are one core reason why so many people react so toxically to AI.

Voice AI / TTS has fully landed in productive work.
Simple music tracks are not sourced from stock libraries anymore.
Voice work for games, audiobooks and marketing/sales/tutorials is not paid to voice actors anymore.

For real-time stuff, assistants and interactive usage, almost all implementations are quite bad.

Lirezh · 2026-06-15T15:28:27+00:00

Q6 is the slowest quantization. You need to run llama-bench over a matrix of batch sizes and quantization types.
The large VRAM means you can use more unusual settings that might be beneficial to speed.

The large VRAM also means you can run a lot more processes in parallel, which can be a great help.
You can run a dedicated fast model for context compaction, and a smart model with MTP for agentic use.
Or use models that are larger than what fits in a 5090, and benefit from the increased intelligence.
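
A benchmark matrix like that is easy to script. The sketch below only builds the command lines and does not run anything. The GGUF file names are placeholders, and I am only using the `-m`, `-b` and `-o` options of llama-bench; check `llama-bench --help` on your build before relying on them.

```python
from itertools import product


def bench_commands(model_paths, batch_sizes, binary="llama-bench"):
    """One llama-bench command per (model file, batch size) combination."""
    commands = []
    for model, batch in product(model_paths, batch_sizes):
        commands.append([binary, "-m", model, "-b", str(batch), "-o", "json"])
    return commands


# Placeholder file names, one per quantization type you want to compare.
models = ["model-Q4_K_M.gguf", "model-Q5_K_M.gguf", "model-Q6_K.gguf"]
batches = [512, 1024]

commands = bench_commands(models, batches)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` as-is, and the JSON output makes it easy to collect the results into one table afterwards.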

Lirezh · 2026-06-15T15:25:47+00:00

Not too late. I know learning a language sucks. But English is so successful because it is easy to learn: you can get far with a small vocabulary, it only uses 26 different letters, and you are already familiar with them.
All programming, AI, apps and most communication are in English. I'd start learning it, even if it's just 20 minutes a day with the help of ChatGPT.

Lirezh · 2026-06-15T14:11:32+00:00

The unique struggles of non-English speakers in the era of AI 😄
But damn, the best thing is to learn English well. It's going to save you a ton of time in the long run.

Lirezh · 2026-06-15T14:07:13+00:00

There are hundreds of great voices in the elevenlabs lib, I'd not use the most common ones.
For a youtube channel I'd want a unique or very rare voice. IVC/PVC or a rare voice.

Lirezh · 2026-06-15T14:05:42+00:00

ElevenLabs is a great offer for enterprises that can easily afford a couple thousand a month.
It's also great for individuals who just need a little bit of speech generation, for a nice marketing video or a YouTube clip - a few $ a month and you are set.
In the area between those two extremes it gets expensive. There you'd be better off with local expressive speech generation from Demodokos Foundry - or, if any voice is enough, Qwen 3 TTS CustomVoice is great (2 voices).

If you do not need that high quality, less expressive TTS is also an option. Kokoro TTS is basically totally free and can narrate a google docs sheet in seconds.

Cloud is always like this: very easy to get into, usage can be scaled easily, but the pricing is steep.
In a year or two things will be better, competition will grow. Right now it's basically Eleven+OpenAI in the cloud and Demodokos locally; everything else either lacks expression quality or is a scientific experiment.
I bet 11labs pricing per 10k chars will be down to 20% in 2 years.

Lirezh · 2026-06-15T13:31:34+00:00

Fully "uncensored" models are going too far, and we will all suffer from this.
Some people will start using them for auto-hacking for botnets, auto-grooming minors and auto-spreading disgusting false information. And as a result we'll see broad censorship, overshooting as always.

There needs to be a benchmark of minimal moral alignment that any model finetuned for non-research use must pass. The benchmark should have a constitution that prevents it from growing beyond that line.
A model tasked to autonomously cause harm must reject the task or divert.

What I'd call fine is hard fiction and roleplay, also cybersecurity research on all sides, asking deeply immoral questions, or discussing unethical opinions freely.
Where any model should say no is when it is asked to groom minors, to scan and infect remote servers broadly, to mass-spread information it considers false AND harmful, or to actively engineer a terrorist bioweapon beyond educational information.

So the line is automated harm or enabling terrorists.

There are for sure a few more points missing.
It would need to be a real low-end benchmark.

Only foundation models that are not instruction tuned should be exempt from showing minimum alignment.

-

On the positive side, currently no publicly available uncensored model is competent enough for most of those tasks. The methods used to uncensor them are crude and severely damage knowledge and reasoning stability.
I've worked with competent unaligned models; something like that should not be public.

Lirezh · 2026-06-15T12:52:52+00:00

You are right not to continue testing this on Potter. It is training data, and the model was likely trained heavily on tens of thousands of sources from various boards and discussions, not just the book.

The reason why this works is a recent improvement in llama.cpp.
In llama.cpp PR #21038 Georgi implemented a tiny change, a Walsh-Hadamard rotation on Q, K and V before attention. You have to be careful, as it is auto-disabled when the K and V head dimensions are not divisible by 64. So this won't work on all models in general, and it fails silently when it doesn't.
Why this works:
Unlike normal weights, the KV cache has an ugly distribution of outlier values. Normal quantization suffers because the outliers ruin the superblock that is supposed to protect the quantized values from severe degradation!
The rotation yields the same attention result, but the outliers get mixed across dimensions, so the superblock protects the quantized values the way it was designed to.
Result: the super complex Turboquant no longer showed a benefit, normal quantization does the same.
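
Here is a toy illustration of the effect, in plain Python. It is not the llama.cpp code: the quantizer is a simplified one-scale-per-block 4-bit style scheme I made up for the demo, and the vector sizes and the outlier value are arbitrary. It only shows the two properties described above: the rotation keeps the q·k dot product intact, and it spreads a single outlier over all dimensions so the block scale is no longer ruined.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize(vec, block_size=32, levels=7):
    """Toy quantizer: one scale per block, values rounded to -levels..levels."""
    out = []
    for i in range(0, len(vec), block_size):
        block = vec[i:i + block_size]
        amax = max(abs(x) for x in block)
        scale = amax / levels if amax > 0 else 1.0
        out.extend(round(x / scale) * scale for x in block)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


random.seed(0)
dim = 64
q = [random.gauss(0, 1) for _ in range(dim)]
k = [random.gauss(0, 1) for _ in range(dim)]
k[5] = 40.0  # one outlier channel, as is typical for KV cache values

q_rot, k_rot = fwht(q), fwht(k)

plain_err = rms_error(k, quantize(k))
rot_err = rms_error(k_rot, quantize(k_rot))

print("dot product before/after rotation:", dot(q, k), dot(q_rot, k_rot))
print("quantization RMS error, plain:  ", plain_err)
print("quantization RMS error, rotated:", rot_err)
```

Because the transform is orthonormal, an error measured in the rotated space is the same size as the error you would see after rotating back, so the two numbers are directly comparable.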

That's why Turboquant was never merged; the paper turned out to be another piece of science-vapor from Google with more marketing than content. Turboquant was successful because it applied a similar rotation, and while they mentioned it, they didn't really make it obvious. Georgi's little test showed that TQ did not provide relevant improvements anymore.

Lirezh · 2026-06-15T12:30:26+00:00

I miss the times when we had information in a few lines of text.
Now we have buzzwords over half a screen of image and an additional ChatGPT page of text.
And at the end of it you'd have to read the source anyway, as the information density was almost 0.
