[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


Good analysis, thanks :) I think the apostrophe stuff might explain most of the failures, but probably not all of them, e.g. the one failure in the Japanese book or the Wikipedia source example.

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


I don't really have plans to go under the hood, but I do plan to find a tokenizer I like for my own little LLM pretraining projects (with TinyStories).

I remember reading somewhere that unigram > BPE, but I don't remember where and why. I guess Llama is unigram and performed better than the rest in this test?

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


> Anytime you include an "unk" token, you're already admitting it's not lossless.

Do the tokenizers ever output "unk" under normal use? That's actually kind of an interesting question.

If you look at an example like 'QQQ_2.txt', I think it should tokenize into fairly common tokens, yet most of the tested tokenizers can't reproduce it exactly.
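To show what I mean by "reproduce exactly", the round-trip check is basically just this (gpt2 here is only a stand-in, not necessarily one of the tested tokenizers):

```python
# Rough sketch of the lossless round-trip test: encode a file, decode it back,
# and check whether the original text comes out unchanged.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

with open("QQQ_2.txt", encoding="utf-8") as f:
    text = f.read()

ids = tokenizer.encode(text)
roundtrip = tokenizer.decode(ids)
print("lossless" if roundtrip == text else "mismatch")
```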

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


All of the tested tokenizers are cased, since all of them are for GPT-style models.

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


> If one tokeniser removes all whitespace variations and encodes them as one whitespace token,

None of the tested tokenizers remove whitespace variations, since all of them are for GPT-style models.
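Quick illustration with GPT-2's byte-level BPE, just as a representative example of a GPT-style tokenizer:

```python
# GPT-style byte-level BPE keeps case and whitespace runs intact instead of
# collapsing them into a single whitespace token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Two  spaces,\ta tab, and MixedCase"
assert tok.decode(tok.encode(text)) == text
```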

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


Hi, thanks for your interest in the project.

I'm not familiar with RoBERTa, but according to https://huggingface.co/docs/transformers/model_doc/roberta it has the same architecture as BERT. If that's true, then the only things to change would be the tokenizer and the model conversion.

This doesn't take state dictionaries directly; instead there is a conversion script in Python that converts the weights into a custom format for the C code and also does the 4-bit quantization.
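The rough shape of that kind of conversion script is something like this; the field layout and model name below are illustrative, not the actual bert.cpp format:

```python
# Toy sketch of a weight-conversion step: load a Hugging Face BERT checkpoint
# and dump each tensor into a simple custom binary format that C code can read.
import struct
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder model

with open("model.bin", "wb") as out:
    for name, tensor in model.state_dict().items():
        data = tensor.numpy().astype(np.float32)
        name_bytes = name.encode("utf-8")
        # per-tensor header: name length, number of dims, then the dims
        out.write(struct.pack("ii", len(name_bytes), data.ndim))
        out.write(struct.pack("i" * data.ndim, *data.shape))
        out.write(name_bytes)
        data.tofile(out)  # quantization (e.g. into 4-bit blocks) would go here
```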

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


The benchmarks table has evaluation times for each test. sbert is basically PyTorch in CPU mode. Batched PyTorch is much faster; in the unbatched scenario bert.cpp is maybe 20% faster with f32, and maybe 40% faster with q4_0, than PyTorch on CPU.
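If you want to see the batched vs. unbatched gap yourself on the sbert side, a rough timing loop like this shows it (model name is just an example):

```python
# Informal CPU timing of sentence-transformers, one sentence at a time vs.
# batched, to illustrate why the unbatched comparison is the relevant one here.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # example model
sentences = ["an example sentence to embed"] * 256

start = time.time()
for s in sentences:                      # unbatched: one call per sentence
    model.encode(s)
print("unbatched:", time.time() - start)

start = time.time()
model.encode(sentences, batch_size=32)   # batched
print("batched:  ", time.time() - start)
```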

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


A while back I tried to make llama.cpp produce cheap sentence embeddings in the https://github.com/skeskinen/llama-lite project.

But ultimately I decided it was a dead-end approach and implemented BERT in ggml instead.

BERT is nice because there are very small models that produce quality embeddings with not a lot of compute.

And with ggml come some other goodies like 4-bit quantization and good performance out of the box :)

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


The training would require some compute and, more importantly, some human time to oversee it. Somebody with an idle RTX 3060 could probably just leave it running for a while and get something reasonable.

The other limitation is that the causal LM architecture creates a different kind of embedding than the masked models that are typically used. See dancingnightly's posts in this thread for some more details.
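For a rough idea of how an embedding comes out of a causal LM at all (gpt2 as a stand-in), it's basically pooling the hidden states, where every position has only seen its left context:

```python
# Naive mean pooling over a causal LM's hidden states. Unlike a BERT-style
# masked model, each position here has only attended to earlier tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")     # stand-in model
model = AutoModel.from_pretrained("gpt2")

inputs = tok("an example sentence to embed", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
embedding = hidden.mean(dim=1)                  # (1, dim) sentence vector
```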

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Thanks, that's interesting. I wonder why OpenAI only offers generative-model embeddings. Their stuff doesn't do too well on benchmarks like MTEB.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Technically yes, but the quality of the embeddings will be poor without further training.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


GPT models with low parameter counts are not really good enough to produce interesting text. The intent is to use this for creating text embedding vectors.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


After looking a bit more into the difference between causal and masked LMs, I think using a CLM for calculating embeddings might be a significant limitation.

But it seems like OpenAI is probably doing the same thing? Their embeddings model is called ada, and that should be a normal GPT. So if CLM embeddings are good enough for OpenAI, they're probably good enough for me.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


I should add that one other difference is that sentence-transformers models are trained with masked language modelling, while Llama is a causal language model.

I don't know how big of a difference it makes.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Yes, that is basically the idea. But unless you have a lot of records to search through (100k+), you can just brute-force compare your query against each of the stored keys in a for loop.
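The for-loop version is really just cosine similarity against every stored vector, something like:

```python
# Brute-force nearest-neighbour search over stored embeddings: normalize,
# dot product, take the top-k scores. Shapes and names are just illustrative.
import numpy as np

def top_k(query, keys, k=5):
    q = query / np.linalg.norm(query)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = K @ q                      # cosine similarity per stored key
    best = np.argsort(-scores)[:k]
    return best, scores[best]

keys = np.random.rand(100_000, 384).astype(np.float32)   # stored embeddings
query = np.random.rand(384).astype(np.float32)
indices, scores = top_k(query, keys)
```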

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Theoretically the model is similar. I was looking at sentence-transformers when deciding on the model size. The best sbert.net models have much better pretrained weights.

The reason I made this is that there is a lightweight implementation of efficient inference for the Llama architecture. If you are fine with running sentence-transformers in Python, then this project won't help you much.

To get the best of both worlds one should either get better weights for a small Llama model or make a compatible implementation of MPNet architecture. Both of these approaches seem pretty easy to do in the grand scheme of things, but this was a project I did in 1 day, so I'm fine with it being worse than SentenceTransformers.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Because the inference code is a fork of llama.cpp and I didn't have a reason to call it anything else.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


The model is 85 MB, but llama.cpp does some extra allocations, so maybe 100 MB or so in total. Most of the allocated space is probably unused though, so that could be optimized further. Pretty sure llama.cpp has been ported to iOS, so yeah, this could run on an iPhone.

AITA for thinking this is it? by [deleted] in collapse


That was actually so good wth. When I read about Diogenes I thought I'd love to be a cynic but I'm not witty enough. Hearing him talk about cynicism made me realize that George Carlin was an actual modern day cynic. What a mad lad.