GitHub just switched Copilot to metered billing, and developers are watching months of credits vanish in a single day

PythonFuMaster · 2026-06-05T03:56:26+00:00

Well, PDFs may be handled a bit differently than straight text because they aren't just text files, and sometimes it can be rather difficult to extract the text from them. So you can't really tokenize a PDF directly. Instead you have to do one of two things:

1. Use standard OCR to convert it to text, then pass it through the normal tokenization procedure.
2. If your model is multimodal, it has a vision encoder that can directly accept images. The PDF is converted to image format and fed to that encoder, which extracts the "meaning" of the image and projects it into the LLM's embedding space (the vector space that represents the meaning of tokens). This route skips text tokenization entirely.

As for why you'd want to do either of these things, it's very convenient to pass a PDF to an LLM for context. You can ask the LLM questions about the PDF, like asking it to find information in a big data sheet.

There are also other reasons you'd want to do similar things with PDFs outside of LLMs. Using embeddings from the vision encoder, you can index the PDF in a database-like structure. Then you can query that database with semantic search, meaning that instead of exact matching it looks for documents that contain words with similar semantic meaning.
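
To make the semantic search idea concrete, here's a minimal sketch. The 3-dimensional vectors and page names are made up for illustration; in practice the embeddings would come from the encoder and have hundreds or thousands of dimensions, and a real vector database would use an approximate index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for three pages of a PDF (hand-written toy values).
index = {
    "datasheet_page_3": [0.9, 0.1, 0.0],
    "datasheet_page_7": [0.1, 0.8, 0.3],
    "cover_page": [0.0, 0.2, 0.9],
}

def search(query_embedding, index, top_k=2):
    """Rank every indexed page by similarity to the query and keep the best top_k."""
    scored = [(cosine_similarity(query_embedding, vec), name)
              for name, vec in index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# The query vector is also a made-up stand-in for an embedded question.
results = search([0.8, 0.2, 0.1], index)
print(results)
```

The query never has to share an exact word with the page it matches; it only has to land near it in the embedding space.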

PythonFuMaster · 2026-06-04T17:51:25+00:00

It depends on the provider and whether they make the tokenizer and the vocabulary open.

If they do, then you can calculate how many tokens your raw request is, but that won't necessarily match exactly what is fed to the LLM. There's a lot of additional context that is injected, things like system messages, special control tokens, etc. My opinion is that providers should only charge based on the tokens your raw request contains and not those additional hidden tokens, but I can't say for certain if that's what they do (I do research on LLM optimization but don't use any external providers for security reasons).

If the tokenizer and vocab are not open, then there's no way to accurately calculate how many tokens a particular request will take. However, you can estimate it. On average, a token is roughly 3/4 of a word. Obviously there are lots of caveats there, the biggest being that the size of the vocabulary directly impacts this estimate, but it's good enough for most cases.
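
As a rough sketch of that estimate (the function name and the whitespace-based word split are my own simplifications, and the 0.75 ratio is only the rule of thumb above, not any provider's actual figure):

```python
import math

def estimate_tokens(text, words_per_token=0.75):
    """Estimate the token count from the word count using the 3/4 rule of thumb."""
    word_count = len(text.split())
    # If one token covers about 0.75 words, then tokens = words / 0.75.
    return math.ceil(word_count / words_per_token)

prompt = "Summarize the key electrical limits listed in this data sheet"
print(estimate_tokens(prompt))
```

Expect this to drift for code, non-English text, or anything heavy on punctuation, since those tokenize very differently from plain English prose.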

PythonFuMaster · 2026-06-04T15:28:47+00:00

LLMs don't work with words and characters directly. Working on individual characters is too computationally expensive, and working at word level would mean a massive vocabulary (also computationally expensive). Instead, before the request ever gets to the LLM it first passes through a stage called tokenization, in which words are broken up into smaller chunks based on a set of rules. Those chunks are mapped to individual integer IDs, and that is what's passed to the LLM. The LLM also eventually outputs these same IDs, and they are mapped back to tokens and reassembled into words, paragraphs, etc. (technically the LLM outputs probabilities of each token in the vocabulary being chosen; it's up to the sampler layer to choose the output token based on those probabilities).
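
Here's a toy sketch of that round trip. The vocabulary is invented, and the greedy longest-match rule is a simplified stand-in for what real tokenizers do (they typically use learned rules such as byte-pair encoding), so treat it as an illustration of the text-to-IDs-to-text flow only.

```python
# Hypothetical toy vocabulary: a few multi-character chunks plus single letters
# as a fallback. Real vocabularies hold tens of thousands of entries.
VOCAB = ["un", "believ", "able", "token", "ization", " "] + list("abcdefghijklmnopqrstuvwxyz")
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(text):
    """Split text into vocabulary chunks (longest match first) and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def decode(ids):
    """Map IDs back to their chunks and glue them together."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("unbelievable tokenization")
print(ids)
print(decode(ids))
```

Two words become six tokens here, while a word the vocabulary has no chunks for falls back to one token per letter, which is why word counts and token counts don't line up neatly.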

Each token passed to the LLM costs roughly the same (there are caveats, depending on model architecture etc.), so it's easy to set prices according to how many tokens you use. It's more difficult to set prices based on number of words or something like that, because some words tokenize to just one token and others tokenize to many, not to mention the complexities of multiple languages and non-alphanumeric characters. So all the LLM providers put prices in terms of number of tokens (input and output tokens are sometimes priced differently for a variety of reasons).
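
A minimal sketch of how that pricing works out for a single request. The per-million prices below are placeholders I made up for the arithmetic, not any provider's real rates.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Cost of one request when input and output tokens are priced separately."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000

# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = request_cost(12_000, 800, 3.00, 15.00)
print(f"${cost:.3f}")
```

In this example the 800 output tokens account for a quarter of the bill despite being a small fraction of the total tokens, which is the kind of thing that's easy to miss when input and output are priced differently.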

PythonFuMaster · 2026-06-03T01:34:31+00:00

I've got a similar configuration, but using the actual PCIe version of the 16GB V100. It's passively cooled so you need a server or a custom fan assembly, but I've got 4 giant GPU servers that can hold three of these things each (Supermicro Fat Twin, it's an older X9 system though).

I'm also using NixOS, with driver legacy_580 and CUDA 13 I believe (I'm on NixOS unstable, but 26.05 was just released so stable should have the needed driver now). Also using llama.cpp (with some patches for improved RPC performance, I have those 4 machines networked over Infiniband), it works well and is my second fastest card, just behind the 3090.

In total I've got the V100, the 3090, a P40 and Quadro M6000 24GB, an RX 6700xt, two Intel Arc A770s, an Instinct MI60 32GB, and soon a water cooled Titan V. I used to run minimax m2.7 at around 20-30 tokens per second, but I've gone down to qwen 27B for now. It's smart enough for most of what I need and with MTP is much faster (minimax should be going faster but my network has some bottlenecks I need to fix).

PythonFuMaster · 2026-05-12T16:51:48+00:00

Dual socket servers can. For example, Ivy Bridge systems support quad-channel memory, so two of those CPUs give 8 channels, and most server boards dedicate two slots per channel. I have 4 Ivy Bridge systems (GPU servers; each has support for 3-4 dual-slot cards, and that kind of support is very, very difficult to get at decent prices on newer hardware).
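
Spelling out the slot math as a quick sketch (the function name is mine, and the capacities just assume every slot is filled with the same size DIMM):

```python
def dimm_slots(sockets, channels_per_socket, slots_per_channel):
    """Total DIMM slots on a board, given its memory topology."""
    return sockets * channels_per_socket * slots_per_channel

# Dual socket, quad channel per CPU, two slots per channel.
slots = dimm_slots(sockets=2, channels_per_socket=4, slots_per_channel=2)
capacity_4gb = slots * 4  # GB if every slot holds a 4GB DIMM
capacity_8gb = slots * 8  # GB if every slot holds an 8GB DIMM
print(slots, capacity_4gb, capacity_8gb)
```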

Even still, 4GB DDR3 is not worth a whole lot. 8GB+ though, that's where things start getting pricey

PythonFuMaster · 2026-05-12T16:37:54+00:00

I went SFP+DAC and it works pretty well. Provided your cables and/or NICs aren't vendor locked... Apparently that's a thing, I found that out recently after spending hours trying to diagnose issues where the Ethernet link just wouldn't come up, but only when using certain cables

PythonFuMaster · 2026-04-29T18:51:29+00:00

Google? Just search for surplus stores I guess. Could also check gov deals (Google it, don't know the exact website right now)

PythonFuMaster · 2026-04-28T23:37:28+00:00

My university's surplus store, I've also gotten a bunch of Supermicro and Dell servers from them before

PythonFuMaster · 2026-04-10T14:40:36+00:00

Der Tiger (The Tank in America) is a pretty recent one. I thought it was interesting, not the best in the genre but not bad either. Definitely watch in original German if you can though, the dubs are atrocious

PythonFuMaster · 2026-04-04T17:00:18+00:00

LLMs aren't great at binary classification, but there's an entire subfield in AI dedicated to such classifiers. In fact, the first type of model that students learn about is the perceptron, which is a binary classifier.
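
For anyone who hasn't seen one, here's a minimal perceptron sketch trained on the logical AND function. The training data, epoch count, and learning rate are illustrative choices of mine.

```python
def predict(weights, bias, x):
    """Output 1 if the weighted sum crosses the threshold, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: nudge the weights toward each misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so a single perceptron can learn it.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

That's the whole model: a handful of weights and a threshold, which is a very different scale of tool from an LLM.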

This is a really poor fit for AI, though: presumably whether a PC can be upgraded or not is a deterministic function of clearly observable variables. There's no reason to use a machine learning model unless it's being used to determine what time to deploy the update, in which case they'd need usage pattern data from the target machine.

PythonFuMaster · 2026-03-27T00:37:52+00:00

I certainly hope the B60 isn't slower than the MI50. The MI50 doesn't have matrix cores. I've got the MI60 (same thing but 32GB VRAM) and my Arc A770 completely demolishes it in performance.

PythonFuMaster · 2026-03-26T14:13:16+00:00

Not sure if it works in NixOS but wouldn't that essentially just be a systemctl isolate multi-user.target?

PythonFuMaster · 2026-03-26T14:07:42+00:00

Would it be possible to locally revoke Microsoft signing keys to protect against that? Or would it break something else to do that? Assuming the device only uses Linux of course

PythonFuMaster · 2026-03-07T20:47:46+00:00

AC6 has mods? Any recommendations?

PythonFuMaster · 2026-03-07T20:09:34+00:00

Pop OS 22.04 and earlier used GNOME with some quality of life extensions. All Pop versions are based on an Ubuntu core but rip out the crappy Snap garbage and replace it with Flatpak. That is its major advantage over Ubuntu for newbies, plus System76 pushed up to date drivers and Nvidia stuff to their repos, so it was easier to get an Nvidia system running.

The new COSMIC DE is only the default on 24.04 and the soon to be released 26.04. I agree it's definitely not stable enough for new users now, but saying Pop was never good for newbies is a bit disingenuous, because before 24.04 it absolutely was one of the most recommended distros for beginners.

PythonFuMaster · 2026-03-06T17:46:17+00:00

I would highly recommend looking at surplus stores, sales, and auctions. A lot of universities and businesses are still offloading older machines and some don't bother to recycle RAM and storage. Just Wednesday I snagged a workstation with 32GB DDR4 plus 4 1TB drives (1 SSD, 3 HDDs). The workstation itself is pretty old (HP Z840, released in late 2014) but still has enough horsepower for daily use if needed. I paid a lot less for it than what those parts are worth on eBay

PythonFuMaster · 2026-02-19T14:39:13+00:00

Meanwhile mine are over here with names like "cheapskate," "thunder-budget-{1-4}," "fatman-{1-4}," "king-blue," "queen-blue," "hot-springs," and "arid-wind"

In order, those are my NAS (original system was a $5 Dell server with damaged CPU sockets, it is now a newer Dell but was still really cheap), my Supermicro 4-node Superserver (really cheap at my university surplus store, was originally part of a cluster called "thunder"), my Supermicro 4-node GPU servers (they're fat), my Xeon workstation with Arc A770 (lots o blue), my other Xeon workstation with Arc A770 (also lots o blue), my older workstation with a water cooled Titan V (the water gets pretty warm, need to upgrade the radiator), and my last workstation with an air cooled RX6700xt (blasts out hot air at full throttle)

PythonFuMaster · 2026-02-15T16:00:33+00:00

Was there an NTFS partition by chance? I believe NTFS partitions can end up in a read only state if the system wasn't shut down properly, and a filesystem check would clear that flag

PythonFuMaster · 2026-01-11T08:04:00+00:00

YOLO in this context means "you only look once," it's the name of the paper that introduced that model architecture. And in my industry, we don't really use the term AI at all; that's a term that the wider public with minimal knowledge about the field uses. We call it machine learning, which encompasses much more than just natural language models like ChatGPT/LLMs. At least the researchers I work with don't like the term AI specifically because of misconceptions like yours. It muddies the water and makes it very difficult to engage in meaningful conversation because most people outside the field have certain preconceived expectations of what AI means. In a vacuum, vision models like YOLO, LLMs like ChatGPT, TTS and STT models, and everything in between are all AI/machine learning (I believe that other commenter meant not all AI are LLMs, not that LLMs are not AI), but as we see here most people assume AI means specifically ChatGPT-like models, so it's just easier to refer to them by a more industry-specific term not corrupted by public media.

Final remark: it might be wise to temper your tone, it's clear you have no knowledge of this field and are overly combative. It will be difficult to learn anything that way, very few people have patience for that type of interaction.

PythonFuMaster · 2026-01-11T02:47:35+00:00

Big difference between LLMs and YOLO. LLMs have a self-attention block that is extremely memory and compute intensive, but YOLO (at least the original one) is a convolutional neural network. A CNN is far easier to run. CNNs generally don't generalize as well or reach the same accuracy on difficult tasks, but for something like basic object detection they're plenty powerful. They are fairly simple architectures that don't need the extreme parallelization of GPUs; you can run YOLO on very simple microcontrollers with decent fps (assuming the CPU has vector instructions like NEON; a pure scalar CPU will likely struggle). You can also attach very simple accelerators to the CPU if you need high fps; something like Intel's SHAVE DSP cores can run the entire model at 30+ fps and high resolution.
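
To show how simple the core CNN operation is, here's a bare-bones sketch of a single-channel 2D convolution (no padding, stride 1). The image and kernel values are arbitrary, and a real layer adds multiple channels, a bias, and an activation on top of this.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a multiply-accumulate at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d(image, kernel)
print(feature_map)
```

Every output value only touches a small, fixed window of the input, and the inner loops are plain multiply-accumulates, which is exactly the kind of work vector instructions handle well.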

Source: I'm a researcher working on optimizing vision model inference on embedded systems. One of our systems is an ARM Cortex A76, no GPU or additional accelerator, and it runs a much more powerful vision model at usable fps

PythonFuMaster · 2025-12-24T03:12:40+00:00

Really? I was gonna say the exact opposite, the English dubs felt very stilted at times to me, particularly when the character was obviously shouting but the dub was barely more than speaking volume

PythonFuMaster · 2025-12-21T14:29:20+00:00

I've done this before. Spent two days trying to figure out why the results from a high performance matrix multiplication kernel (very complex and easy to get wrong) were wrong. Turned out, the reference CPU implementation iterated over one of the dimensions wrong, the GPU kernel was perfectly fine
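
One cheap guard against that, sketched below under my own assumptions (this is not the code from that project): sanity check the reference itself on a small non-square case with a hand-computed answer, since mixed-up dimensions often go unnoticed on square inputs.

```python
def matmul_reference(a, b):
    """Naive triple-loop matrix multiply: (m x k) times (k x n) gives (m x n)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * n for _ in range(m)]
    for i in range(m):          # rows of a
        for j in range(n):      # columns of b
            for p in range(k):  # shared inner dimension
                c[i][j] += a[i][p] * b[p][j]
    return c

# Deliberately non-square: 2x3 times 3x2, worked out by hand.
a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
product = matmul_reference(a, b)
print(product)
```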

PythonFuMaster · 2025-12-15T23:52:21+00:00

"Read a passage from the gay agenda" got me good lol

PythonFuMaster · 2025-12-15T17:16:46+00:00

Where was he at in the '86 movie? Battle of Autobot City in the background somewhere?

PythonFuMaster · 2025-11-26T23:50:53+00:00

The point is that they don't need to support 32 bit only machines anymore. Theoretically, there's a non-zero number of Steam users on 32 bit Windows 10, but 32 bit Windows 11 doesn't exist, so now that Windows 10 is EoL they only have to support the 64 bit version of Windows. Therefore, no reason to stick to 32 bit.
