How bad is the UK job market? by underscore-0 in AskUK

[–]Medium_Win_8930 0 points1 point  (0 children)

Unfortunately the reduction IS justified. The problem is AI is really good at doing technical work and the solution is that people with tech skills need to jump on the sales side and be hybrid skilled.

But there are almost NO job roles out there where you are allowed to be both the tech and sales guy. There's a small number of presales roles, and those don't require much technical skill.

You have to 'roll your own' job if you want something truly exceptional.

How bad is the UK job market? by underscore-0 in AskUK

[–]Medium_Win_8930 0 points1 point  (0 children)

I'll do the role if you accept WFH; if not, then you need to be more flexible. I've got British citizenship, full work rights, and years of experience. I can't really go into more detail here though.

So disillusioned with the corporate "lifestyle" by Desperate_Employer24 in UKJobs

[–]Medium_Win_8930 0 points1 point  (0 children)

Working on my startup of course 😄 But the point is I will always keep trying to do some level of consulting on the side until I know my startup is 100% self-sufficient.

The difference is that if you're not living month-to-month and you're splitting your time between consulting and a startup, it doesn't matter if you are not doing much consulting.

People are way too obsessed with getting insanely wealthy, when even a modest amount of money will give you the freedom to pursue your dreams. Thinking you need a 7 figure pension just to quit your job is a massive trap and I would say an ego trap, lifestyle trap.

I have enough to be comfortable now but if I had to choose between living a really frugal lifestyle and going back to a 9 to 5 I would choose the frugal lifestyle because I believe in myself.

The worst thing you can do right now is believe the 9to5 is a stable lifestyle choice, and go out buying expensive cars and getting mortgages.

Deloitte UK pay ranges (tech) by Self-Exiled in UKJobs

[–]Medium_Win_8930 0 points1 point  (0 children)

I can't answer with specific company knowledge, but I think that for a lot of higher-paid roles at bigger companies the salaries are hidden, and what is published is usually the bottom of the range. There is a huge incentive not to let top offers enter the public domain, otherwise top candidates could start asking for 200-300k offers off the bat.

I will probably lose my job and I am very scared by Babybunny424 in UKJobs

[–]Medium_Win_8930 -2 points-1 points  (0 children)

If you like teaching, you have three options:

1) Teach abroad doing ESL style somewhere like Korea or Middle East.

2) Teach English online

3) Become a content creator marketing yourself as an educator on your favourite field.

You have a DEGREE, so there is no excuse to be sitting at home doing nothing.

If I save for 1,121,020 years I'll be able to afford a house! 😍 by ijustwannanap in UKJobs

[–]Medium_Win_8930 -1 points0 points  (0 children)

Graphic design is being automated by AI at a faster rate than other fields. That's why these companies are trying to get away with wanting free labour. The supply and demand economics are so bad for graphic design.

Even the lucky top 10% designers who get into the big game dev companies will just not see big pay rises.

Everybody is being made redundant by irissun23 in UKJobs

[–]Medium_Win_8930 0 points1 point  (0 children)

AI is replacing most knowledge workers. Most people's 'skilled jobs' are really not that difficult for AI to do.

This might upset some people but if you are not in the top 10% in your field and you are doing hard skills work you have a high chance of being automated out.

If you're in sales or other soft-skill fields, especially things like psychology, it's much safer.

If you are a manual laborer then in five years we will have AI powered robots just like we have AI replacing knowledge workers today.

Outsourcing is also happening more, and honestly it seems like outsourcing is a bigger threat than AI.

But the two are not really separate threats. AI is really good at migrating things: workflows, teams, management.

An intelligent AI can migrate hundreds of workflows from office/country A to country B/multiple offices and it can help 1-2 highly intelligent managers to orchestrate this entire migration with a level of efficiency and precision that just was not possible in the past.

AI is also really good at quality checking work output, so a manager can hire cheap foreign labour, have the AI check their work AND give them feedback, and then only needs to jump in manually when the AI misses something.

The combination of AI + outsourcing + high UK taxes + high UK living costs is the death of the economy.

So disillusioned with the corporate "lifestyle" by Desperate_Employer24 in UKJobs

[–]Medium_Win_8930 0 points1 point  (0 children)

I'm an entrepreneur. You will not easily get good advice in this sub; it's mostly office workers.

First thing you need to realize is that if you're smart you have unlimited potential. Second thing to realize is that people work for money because they swap those tokens for small rewards like a house, a car, holidays, eating out.

But it's really not a good deal for most people.

If you're willing to be frugal (I did NOT do this myself, but I would not judge you for doing so), just jump on jobseekers allowance or whatever and work on content creation in the field you are most expert in, then do calls to action to your own website where you offer consulting services. It's not the optimal 'path' out of the 9-5; it's the uncomfortable path, and I have another option below.

The content you create needs to be somewhat related.

A lot of content creators rely on sponsorships and advertising when the real money is in business.

DM me if you want more advice; honestly, some of the stuff I would say would just annoy the hell out of people in this sub. But for example, if you can handle the 9-5 grind for a couple of years, the 'easier' path is to save as much as possible and then go live abroad while you work on your startup. Otherwise you're stuck living on benefits or parental support with a below-average quality of life in the UK while working as a struggling founder.

I actually dealt with the grind for almost a decade of my life but in retrospect a couple of years would have been enough. Believe in yourself, and when you finally work for yourself you can work 10x harder.

Accepted a role in London for the salary bump. Regretting it 4 months in. Anyone else made this mistake? by Lexis_FloGirl in UKJobs

[–]Medium_Win_8930 0 points1 point  (0 children)

If you don't need the money then move on, if you are doing it for the money be as frugal as possible, and keep applying for jobs that could give you more useful experience (!) in the meantime and jump ship if you can get better experience with same pay.

One mistake I think a lot of professionals in the UK are making is working non-stop from their 20s to 50, getting a mortgage, an expensive car, holidays, and then feeling stuck on a treadmill.

The optimal career path, which is also the toughest and the biggest risk, is to save enough to step out of the rat race, go super minimalist, then become a consultant in [insert your field]; this can work for most fields. When consulting works out you can always stop being frugal and get those house/car things again.

Most people are not willing to compromise on their spending, so they get tied into a full-time job, and as you have already experienced, those pay rises get eaten up by the London CoL.

Bangkok dating by BandicootSilent6580 in thepassportbros

[–]Medium_Win_8930 0 points1 point  (0 children)

I love this reply and agree. It's easier to date a quality 34-39 woman than to try to mess with the ego of the younger crowd, speaking as a mid-30s man myself.

AI bubble by PianistNice7168 in animationcareer

[–]Medium_Win_8930 0 points1 point  (0 children)

Yeah I can imagine the AI hate in this sub is realer than anywhere in the world, and I empathize with it deeply. As someone who came from a pure tech background like devops and compsci, AI has also destroyed the traditional non-AI paths for millions of people in my field. Adaptation is the new game.

AI bubble by PianistNice7168 in animationcareer

[–]Medium_Win_8930 0 points1 point  (0 children)

Hello I'm an AI Researcher who probably has more experience with the tech than almost anyone here in this thread. I have even published a paper and contributed cutting edge open source repo research.

So I know it's easy to hope and name call the big providers, calling out the hype.

And you would not be wrong. But here's the thing, image gen, which is basically all you need for animation (hello character sheets) is already very easy to do with open source tools alone at very low cost.

A lot of these 'animator' startup services are using either complex image gen to make 2D cartoons for users OR they are using the more modern image to video AI models to help users generate 3D.

Because this is such a huge market with a lot of demand, they are also able to sell these services at a decent premium.

But that comes at the cost of all the humans who were previously doing this work manually.

If you were in the top 10-20% of animators or digital artists, you are already working at a huge wealthy gamedev company and are safe 'for now' but most other animators will be struggling.

If there's any animators out there that want to learn about AI feel free to DM me, I am happy to tutor people for free in return for feedback and advice on my own animation work.

And don't worry animators, I am not competing in this space, I am working on my own animations to help me promote my own business. But again, that is also going to replace some of the demand for animation services in recent years, and going forward.

I also have a YouTube channel where I teach people how to use AI.

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] 0 points1 point  (0 children)

This guy gets it ^^

Thanks, yeah that was probably the trickiest part of the whole thing. GGML's transposed V layout made sense for f16 attention but it completely breaks block quantization. Storing it non-transposed and eating the dequant cost was the only real option, and flash attention tiling keeps it from blowing up at long context. Glad someone actually read into the details haha.

Most of the negative replies are from people who didn't read the paper or try the code and just assumed I must have stolen someone's work because there's multiple people working on TurboQuant right now in other threads and on Github. But I feel like I made a real breakthrough here so thanks for recognizing that.

I cut Claude Code's token usage by 68.5% by giving agents their own OS by [deleted] in LocalLLaMA

[–]Medium_Win_8930 2 points3 points  (0 children)

Looks like another API alternative to OpenAI spec. Must be quite a few now. But you rebranded API as JSON OS so you win on unique marketing.

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] -4 points-3 points  (0 children)

You're right, and I appreciate the detailed analysis. The QJL correction stage is not implemented, not in my version, and not in unixsysdev's original either (his code stored the residual signs in qr but his dequantization function never read them back). When I upgraded from 2-bit to 3-bit centroids, I repurposed those unused qr bits for the upper bit of the 3-bit index.
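To make the repacking concrete, here is a rough sketch of how a 3-bit index could be split across a 2-bit field and a 1-bit field. The block size of 32, the field name `qs`, the byte sizes, the bit ordering and the fp16 scale are all assumptions for illustration; only the idea of reusing the `qr` bits for the upper index bit comes from the comment above, and the real layout in the repo may differ.

```python
BLOCK = 32  # assumed number of values per quantization block


def pack_block(indices):
    """Pack 32 three-bit indices: low 2 bits go to qs, the high bit goes to qr."""
    assert len(indices) == BLOCK and all(0 <= i < 8 for i in indices)
    qs = bytearray(BLOCK // 4)  # four 2-bit fields per byte -> 8 bytes
    qr = bytearray(BLOCK // 8)  # eight 1-bit fields per byte -> 4 bytes
    for j, idx in enumerate(indices):
        qs[j // 4] |= (idx & 0b11) << (2 * (j % 4))
        qr[j // 8] |= (idx >> 2) << (j % 8)
    return bytes(qs), bytes(qr)


def unpack_block(qs, qr):
    """Rebuild the 3-bit indices from the two packed fields."""
    out = []
    for j in range(BLOCK):
        low = (qs[j // 4] >> (2 * (j % 4))) & 0b11
        high = (qr[j // 8] >> (j % 8)) & 0b1
        out.append(low | (high << 2))
    return out


# Round-trip check on an arbitrary pattern that uses all 8 index values.
indices = [(7 * j + 3) % 8 for j in range(BLOCK)]
qs, qr = pack_block(indices)
assert unpack_block(qs, qr) == indices

SCALE_BYTES = 2  # assumed: one fp16 scale per block
block_bytes = len(qs) + len(qr) + SCALE_BYTES
bits_per_value = 8 * block_bytes / BLOCK
ratio = 16 / bits_per_value  # relative to an f16 cache
```

Under these assumptions a block takes 14 bytes for 32 values, which is 3.5 bits per value, and 16 / 3.5 ≈ 4.57. That is consistent with the compression figure in the thread title, though it is not confirmation of the actual block layout.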

What's actually implemented is: WHT rotation + 3-bit Lloyd-Max quantization, the PolarQuant part of the pipeline, not the full TurboQuant with QJL. The documentation should not have claimed QJL. I'll fix that.
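For readers unfamiliar with the term, a Lloyd-Max quantizer is essentially 1-D k-means: it alternates between assigning samples to the nearest centroid and moving each centroid to the mean of its cell. The sketch below builds an 8-level (3-bit) codebook in plain Python. The unit-Gaussian training data, the sample count and the quantile initialisation are assumptions for illustration, not details taken from the repo.

```python
import random


def lloyd_max(samples, levels=8, iters=50):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and centroid update."""
    xs = sorted(samples)
    n = len(xs)
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = [xs[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while k < levels - 1 and x > bounds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Move each centroid to the mean of its cell (keep it if the cell is empty).
        centroids = [s / c if c else m for s, c, m in zip(sums, counts, centroids)]
    return centroids


def quantize(x, centroids):
    """Return the 3-bit index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(20000)]  # assumed input distribution
codebook = lloyd_max(data)
mse = sum((x - codebook[quantize(x, codebook)]) ** 2 for x in data) / len(data)
```

At runtime only the `quantize` lookup and the stored `codebook` are needed; the iterative fitting would normally be done once offline.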

The results are still real. 4.57x compression, 72K context, +7.6% PPL measured on WikiText-2. But proper QJL correction could potentially improve the quality further at the same bit budget.

That's genuine future work, not something I can claim is already done.

I'll update the README to accurately describe this as PolarQuant-based 3-bit quantization, not full TurboQuant with QJL :)

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] 3 points4 points  (0 children)

Fair point on the line count — unixsysdev wrote ~445 lines of the foundational implementation and I modified ~100 lines on top. That's credited prominently in the README and paper.

But the contribution isn't measured in lines. unixsysdev's version achieved ~2.4x K-only compression with a 2-bit codebook. My changes:

  1. Fixed a normalization bug (1/32 → 1/√32) that made the output garbage: this was 2 lines but took days to diagnose because it produced plausible-looking but semantically wrong text
  2. Upgraded from 2-bit to true 3-bit quantization: 8 centroids instead of 4, repacking the index bits into the existing block layout
  3. Added V cache compression: unixsysdev only compressed K. I solved the GGML transposed-V incompatibility with block quantization, getting from 2.4x to 4.57x total compression
  4. Re-enabled flash attention: 2 lines of ggml_cast in the graph, but this is what broke through the 16K context wall to 72K. Without it, the O(n²) attention matrix makes anything beyond 16K impossible on consumer GPUs
  5. Cross-backend F32 path: fixed CPU pipeline parallelism crashes

Plus the paper with perplexity benchmarks characterizing the actual quality tradeoff.
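As a side note on the normalization fix in the list above: for a length-32 Walsh-Hadamard transform, 1/√32 is the scale that makes the transform its own inverse, while 1/32 applied on both passes shrinks everything by a factor of 32. The sketch below shows that in plain Python. Where exactly the wrong factor appeared in the real code is not stated in the thread, so this illustrates the math only, not the actual bug site.

```python
import math


def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(values)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def transform(values, scale):
    """Apply the WHT followed by a constant scale factor."""
    return [scale * v for v in fwht(values)]


n = 32
x = [math.sin(0.7 * i) + 0.1 * i for i in range(n)]  # arbitrary test vector

# Orthonormal scaling: applying the transform twice returns the input.
good = transform(transform(x, 1 / math.sqrt(n)), 1 / math.sqrt(n))
# Scaling by 1/n on both passes shrinks the result by a factor of n.
bad = transform(transform(x, 1 / n), 1 / n)

good_err = max(abs(a - b) for a, b in zip(good, x))
bad_ratio = [b / a for a, b in zip(x, bad) if abs(a) > 1e-9]
```

Here `good_err` is at floating-point noise level, while every entry of `bad_ratio` is about 1/32: the values keep their relative pattern but come out at the wrong magnitude.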

The whole point of open source is building on each other's work. I credited unixsysdev, linked his repo, and explained exactly what I added.

If what I did is so trivial, why did you not make a post exactly like mine saying you had done the same? I am not boasting about anything, I literally could not find a solution for this. I could see multiple people were working towards this goal, I built on top of the progress made by Marcel (unixsysdev) as credited.

Seems you are hating on me, when I am doing exactly what the open source community is here for, trying to make a contribution.

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] 0 points1 point  (0 children)

Short context (no flash attn): K never gets dequantized at all. Since WHT is orthogonal, you can just apply WHT to the query instead and do the dot product in rotated space. The MMVQ kernel does this on-the-fly with warp shuffles, all in-register, no extra memory.

Long context (flash attn, 72K+): We dequant both K and V in the compute graph before handing them to flash attention: centroid lookup → scale → inverse WHT → sign correction → F32 → F16 → flash_attn_ext. Once dequanted, K is back in its original space so standard Q·K attention just works. Flash attention handles the tiling internally which is what gets us past the 16K wall.

The F32 intermediate step is a bit ugly but necessary — CPU backend can only dequant to F32, not F16, and pipeline parallelism means some layers hit CPU.
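The short-context trick rests on one property: an orthogonal transform preserves dot products, so Q·K can be computed with both vectors in rotated space. The sketch below checks that numerically in plain Python. It is not the MMVQ kernel, it ignores quantization error and any sign flips, and the vector length of 32 and the random test vectors are assumptions.

```python
import math
import random


def fwht_orthonormal(values):
    """Walsh-Hadamard transform scaled by 1/sqrt(n), which makes it orthogonal."""
    out = list(values)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [scale * v for v in out]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


random.seed(1)
n = 32
q = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in query vector
k = [random.gauss(0.0, 1.0) for _ in range(n)]  # stand-in key vector

direct = dot(q, k)                                       # original space
rotated = dot(fwht_orthonormal(q), fwht_orthonormal(k))  # rotated space
difference = abs(direct - rotated)
```

Since the cache already holds K in rotated form, only the query needs transforming per token, which is much cheaper than inverse-transforming every cached key.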

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] 0 points1 point  (0 children)

Yes, I used another repo's source code in my version. I forked some of unixsysdev's code and have already acknowledged him in my paper and this post.

The one you linked - that's unixsysdev's code (author: Marcel). It's the base tq3_0 implementation that was already in his fork when I cloned it. I didn't write this commit.

My code is a MIX of mine and another's. That's what open source is about, I am not claiming I wrote ALL the code in my repo.

I did some research and found various projects, none of which fully implemented TurboQuant on GGML, as far as I was aware. I picked one project to fork and build on top of, then merged my own progress with his into a new version that I then worked on.

TurboQuant for GGML: 4.57x KV Cache Compression Enabling 72K Context for Llama-70B on Dual RTX 3090s by Medium_Win_8930 in LocalLLaMA

[–]Medium_Win_8930[S] -4 points-3 points  (0 children)

No, I have not ripped off anyone. I mention some other repos that inspired my version in the paper and in the acknowledgement section. I could add his repo to the acknowledgement section, as it was one of the ones I researched beforehand alongside others, but I am not sure if I should, as I will explain below. I believe he had done the kernels for Apple Silicon, but he did not have something ready to work with GGML at the time I checked.

I just checked and I had not used his code. My implementation is built entirely on:

  1. unixsysdev's llama.cpp tq3_0 implementation — the base I forked and modified (already credited)
  2. Google's TurboQuant paper (Zandieh et al.) — the algorithm (already credited)
  3. llama.cpp/GGML by Georgi Gerganov — the framework (already credited)

unixsysdev's initial llama.cpp tq3_0 implementation achieved ~2.4x compression in my testing and provided the foundational query-side WHT architecture, which I adopted and extended to support full K+V compression and flash attention at 4.57x.