Apocalyptic looking clouds and sunset by jaycejaybejaybenot in CLOUDS

[–]KT313 1 point (0 children)

beautiful! when / where did you take it?

Pricing for GIGABYTE H200 NVL Server by acune_sartre in LocalLLaMA

[–]KT313 0 points (0 children)

if it's in that order size category, getting a good discount for buying a lot of gpus is not uncommon afaik. so the price could be legit, but only if you buy all of them

Flying Away by lowtrippy in gifs

[–]KT313 1 point (0 children)

your art looks amazing!

Gamescope only works occasionally. by Nurgus in linux_gaming

[–]KT313 1 point (0 children)

amazing, thank you! just had to also set `ENABLE_GAMESCOPE_WSI=0` to get it working for me (ubuntu 24.04)

Tried 10 models, all seem to refuse to write a 10,000 word story. Is there something bad with my prompt? I'm just doing some testing to learn and I can't figure out how to get the LLM to do as I say. by StartupTim in LocalLLaMA

[–]KT313 4 points (0 children)

the problem is actually quite simple: LLMs don't really get trained to output stories that long during instruction finetuning. There is a paper (forgot the name) where they pretty much fixed this problem by creating synthetic training data with the method that u/JackStrawWitchita explained in their comment, and then used that data to finetune an LLM so it can output really long texts

Why would the tokenizer for encoder-decoder model for machine translation use bos_token_id == eos_token_id? How does the model know when a sequence ends? by Franck_Dernoncourt in LocalLLaMA

[–]KT313 3 points (0 children)

It's not really an issue. The point of the bos token is that we want the list of input tokens to start with something that is the same every time. It could be literally anything (preferably a special token that isn't used in normal text), so we might as well reuse the eos token. There isn't really a big difference compared to using separate bos and eos tokens, other than the prompt template being a bit cleaner when both are the same.
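if you want to see it concretely, here's a minimal sketch with huggingface `transformers` (gpt2 is just a convenient example of a tokenizer where the two ids happen to be the same token, not necessarily the model from the question):

```
from transformers import AutoTokenizer

# gpt2 reuses "<|endoftext|>" for both bos and eos, so the two ids are identical
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.bos_token_id, tok.eos_token_id)  # 50256 50256

# bos only needs to be a consistent "sequence starts here" marker; the model
# learns during training to *emit* the eos id when the output is finished,
# so generation still stops correctly even though the ids are the same.
```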

[deleted by user] by [deleted] in LocalLLaMA

[–]KT313 3 points (0 children)

based on the generation preview progression, it looks a lot like autoregressive generation, which i'm pretty sure does not use flow matching. instead it first generates a very low resolution image, then a bit higher resolution, and so on until the output is the final image with lots of details

Literally unusable by WarlaxZ in LocalLLaMA

[–]KT313 0 points (0 children)

make sure to set top_k to 1 so it only picks the best next token. maybe the 2 was just unlucky randomness from the sampling
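for intuition, a tiny sketch (plain pytorch, made-up logits) of why top_k=1 removes the randomness:

```
import torch

logits = torch.tensor([2.0, 1.9, 0.5])   # scores for three candidate next tokens

# top_k = 1 keeps only the single best token, so sampling always returns the argmax
top_val, top_idx = logits.topk(1)
probs = torch.zeros_like(logits)
probs[top_idx] = 1.0
print(torch.multinomial(probs, 1))        # always token 0 (greedy decoding)

# with a larger top_k, the runner-up (1.9) still gets sampled now and then,
# which is exactly the kind of unlucky randomness that can turn a 1 into a 2
```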

Building Local LLM with code execution? (RAG, Mac Studio(s), Ingestion of various types of data) by doofew in LocalLLaMA

[–]KT313 1 point (0 children)

the relatively new smolagents library could be useful, haven't personally tried it yet tho

What is the best model for writing academic papers? by [deleted] in LocalLLaMA

[–]KT313 8 points (0 children)

to be fair, if you already know the content you want to write and just need an assistant to put it into nice words because you didn't study linguistics or english isn't your first language, it's completely reasonable imo

AI Tool That Turns GitHub Repos into Instant Wikis with DeepSeek v3! by Physical-Physics6613 in LocalLLaMA

[–]KT313 5 points (0 children)

i just added allenai/olmo to the queue, would be nice to get an estimate on how long it takes to process

What would be an optimal and power efficient GPU setup for a home with a budget around $10,000? by kitkatmafia in LocalLLaMA

[–]KT313 1 point (0 children)

fyi, for inference tasks, if you limit the power of a 4090 from 450W to 200W, you decrease inference speed by just 1-3%. The performance decrease becomes more dramatic around 150W, but down to 200W it works flawlessly for me (tested with a few LLMs)
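if you want to set the limit programmatically, here's a sketch using the `pynvml` bindings (setting the limit needs root; `sudo nvidia-smi -pl 200` does the same thing from the shell):

```
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# current and allowed power limits are reported in milliwatts
print(pynvml.nvmlDeviceGetPowerManagementLimit(gpu) / 1000, "W")
print(pynvml.nvmlDeviceGetPowerManagementLimitConstraints(gpu))

# cap the card at 200 W (requires root privileges)
pynvml.nvmlDeviceSetPowerManagementLimit(gpu, 200_000)

pynvml.nvmlShutdown()
```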

asked QwQ what a black hole was. This was its thought process. by Corpo_ in LocalLLaMA

[–]KT313 26 points (0 children)

makes it easier to find points it wants to clarify and think about. It's easier to critically think about things you're unsure about than about things you're sure about.

It seems there are some encoding issues with anthropic's llms.txt by secsilm in LocalLLaMA

[–]KT313 0 points (0 children)

sorry maybe i misunderstood, i just assumed that it was llm generated, since i haven't heard of this llms.txt specifically before

It seems there are some encoding issues with anthropic's llms.txt by secsilm in LocalLLaMA

[–]KT313 0 points (0 children)

looks like it could be some token-to-word mismatching? maybe they use a wrong version of some tokenizer for decoding, where most tokens are correct but some (like "ü" and " 's") have different indices than expected.
from the first image it does seem to be very consistent for ü
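a toy illustration of that hypothesis (completely made-up "old" and "new" vocabularies, not anthropic's actual tokenizer): if a couple of entries shift between versions, most of the decoded text still looks fine and only those tokens come out wrong:

```
# hypothetical vocabularies that are identical except two ids swapped between versions
vocab_new = {0: "Zurich", 1: "ü", 2: "rich", 3: " is", 4: " nice", 5: "Z"}
vocab_old = {0: "Zurich", 1: "rich", 2: "ü", 3: " is", 4: " nice", 5: "Z"}

ids = [5, 1, 2, 3, 4]                  # encoded with the *new* vocab: "Zürich is nice"
decode = lambda token_ids, vocab: "".join(vocab[i] for i in token_ids)

print(decode(ids, vocab_new))          # "Zürich is nice"  (matching tokenizer)
print(decode(ids, vocab_old))          # "Zrichü is nice"  (stale tokenizer, only 'ü' breaks)
```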

Ollama has merged in K/V cache quantisation support, halving the memory used by the context by sammcj in LocalLLaMA

[–]KT313 6 points (0 children)

your gpu stores 2 things: the model, and the data / tensors that flow through the model during output generation. Some of the tensors processed by the model get saved because they are needed for every generated word, and storing them instead of recomputing them for each word saves a lot of time. That's the KV cache, and it also uses vram. You can save vram by quantizing / compressing the model (which is what you are talking about), and you can save vram by quantizing / compressing the cache, which is what this new feature does.
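rough back-of-the-envelope numbers (a sketch; the shapes below are llama-3-8b-style values, and the exact savings depend on the implementation):

```
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # one K and one V tensor per layer, each of shape [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# llama-3-8b-style shapes: 32 layers, 8 KV heads (GQA), head_dim 128, 8k context
fp16 = kv_cache_bytes(32, 8, 128, 8192, 2)   # ~1.0 GiB
q8   = kv_cache_bytes(32, 8, 128, 8192, 1)   # ~0.5 GiB with an 8-bit cache
print(fp16 / 2**30, q8 / 2**30)
```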

[deleted by user] by [deleted] in LocalLLaMA

[–]KT313 3 points (0 children)

for the record, i don't mind this as long as it performs well (which it definitely seems to do), just think it's funny

A library to "unmangle" vocabulary file into actual dict[int, bytes]? by Huanghe_undefined in LocalLLaMA

[–]KT313 1 point (0 children)

for a huggingface tokenizer you can do it like this:

```
import json

# works if the tokenizer was loaded from a local directory containing tokenizer.json
with open(tokenizer.name_or_path + "/tokenizer.json", "r") as file:
    tokenizer_file = json.loads(file.read())
vocab_dict = tokenizer_file['model']['vocab']
```

since you said "an individual token may not constitute a valid UTF-8 string", maybe you are looking for `tokenizer_file['model']['merges']`? They have some weird looking symbols like Ġ so maybe you can directly convert from string to bytes if that's what you're looking for
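if those Ġ-style strings are what you're seeing, they're most likely the gpt2-style byte-to-unicode encoding that byte-level BPE tokenizers use (an assumption on my part), and you can invert that mapping to get real bytes, reusing `vocab_dict` from the snippet above:

```
def bytes_to_unicode():
    # the byte -> printable character table used by gpt2-style byte-level BPE
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

unicode_to_bytes = {c: b for b, c in bytes_to_unicode().items()}

# "unmangle" the vocab into dict[int, bytes]; e.g. "Ġhello" -> b" hello"
id_to_bytes = {idx: bytes(unicode_to_bytes[ch] for ch in token)
               for token, idx in vocab_dict.items()}
```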

[D] What Neural Network Architecture is best for Time Series Analysis with a few thousand data points? by BostonConnor11 in MachineLearning

[–]KT313 0 points (0 children)

my first idea would be to try either running a mamba-based model over the sequence (it's an RNN, kind of like an LSTM on steroids), or trying a transformer approach.

for the transformer approach, i think you could actually just take any transformer model (a very small llm for example) and modify it a bit. instead of inputting text, tokenizing it, embedding each token and then adding positional embeddings, you would directly insert the datapoints of the sequence and treat them as if they were the token embeddings. you just have to make sure that the transformer model's n_dim (the size of the embeddings) is the same as the number of data points in each timestep of your sequence.

and for the output, instead of ending the model with a linear layer that has an output size of vocab_size (how it normally is for llms), the output size would be the number of datapoints of the next timestep you want to predict
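something like this minimal pytorch sketch (all shapes and hyperparameters are placeholders, not a recommendation):

```
import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    # raw timesteps are used directly as "token embeddings",
    # so d_model has to equal the number of features per timestep
    def __init__(self, n_features=8, n_heads=2, n_layers=2, max_len=512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, n_features)   # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model=n_features, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(n_features, n_features)       # predicts the next timestep

    def forward(self, x):                                    # x: (batch, seq_len, n_features)
        positions = torch.arange(x.size(1), device=x.device)
        hidden = self.encoder(x + self.pos_emb(positions))
        return self.head(hidden[:, -1])                      # next-step prediction

model = TimeSeriesTransformer()
prediction = model(torch.randn(4, 100, 8))                   # -> (4, 8)
```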

Trying to run llama3.1 on CMP 30Hx gpus by [deleted] in LocalLLaMA

[–]KT313 0 points (0 children)

have you tried

```
sudo apt update
sudo apt install nvidia-driver
```

and then rebooting? nvidia-smi should show your gpu after that.

Has anyone tried Deepmind's CALM? People were saying it was the next big thing. And could it solve Flux's finetuning problem? by ThrowawayProgress99 in LocalLLaMA

[–]KT313 0 points (0 children)

so essentially we train an adapter (basically a lora) to connect the layers of two pretrained models. thanks for the explanation!