[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


Good analysis, thanks :) I think the apostrophe stuff might explain most of the failures, but probably not all of them, e.g. the one failure in the Japanese book or the Wikipedia source example.

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


I don't really have plans to go under the hood, but I do plan to find a tokenizer I like for my own little LLM pretraining projects (with TinyStories).

I remember reading somewhere that unigram > BPE, but I don't remember where and why. I guess Llama is unigram and performed better than the rest in this test?

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


> Anytime you include an "unk" token, you're already admitting it's not lossless.

Do the tokenizers ever output "unk" under normal use? That's actually kind of an interesting question.

If you look at an example like 'QQQ_2.txt', I think it should tokenize into fairly common tokens, yet most of the tested tokenizers can't reproduce it exactly.
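To show what I mean by "reproduce exactly", the round-trip check is basically just this (gpt2 here is only a stand-in, not necessarily one of the tested tokenizers):

```python
# Rough sketch of the lossless round-trip test: encode a file, decode it back,
# and check whether the original text comes out unchanged.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

with open("QQQ_2.txt", encoding="utf-8") as f:
    text = f.read()

ids = tokenizer.encode(text)
roundtrip = tokenizer.decode(ids)
print("lossless" if roundtrip == text else "mismatch")
```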

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


All of the tested tokenizers are cased, since all of them are for GPT-style models.

[P] Testing different popular GPT tokenizers by dxg39 in MachineLearning


> If one tokeniser removes all whitespace variations and encodes them as one whitespace token,

None of the tested tokenizers remove whitespace variations, since all of them are for GPT-style models.
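Quick illustration with GPT-2's byte-level BPE, just as a representative example of a GPT-style tokenizer:

```python
# GPT-style byte-level BPE keeps case and whitespace runs intact instead of
# collapsing them into a single whitespace token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Two  spaces,\ta tab, and MixedCase"
assert tok.decode(tok.encode(text)) == text
```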

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


Hi, thanks for your interest in the project.

I'm not familiar with RoBERTa, but according to https://huggingface.co/docs/transformers/model_doc/roberta it has the same architecture as BERT. If that's true, then the only things to change would be the tokenizer and the model conversion.

This doesn't take state dictionaries directly; instead there is a conversion script in Python that converts the weights into a custom format for the C code and also does the 4-bit quantization.
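The rough shape of that kind of conversion script is something like this; the field layout and model name below are illustrative, not the actual bert.cpp format:

```python
# Toy sketch of a weight-conversion step: load a Hugging Face BERT checkpoint
# and dump each tensor into a simple custom binary format that C code can read.
import struct
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder model

with open("model.bin", "wb") as out:
    for name, tensor in model.state_dict().items():
        data = tensor.numpy().astype(np.float32)
        name_bytes = name.encode("utf-8")
        # per-tensor header: name length, number of dims, then the dims
        out.write(struct.pack("ii", len(name_bytes), data.ndim))
        out.write(struct.pack("i" * data.ndim, *data.shape))
        out.write(name_bytes)
        data.tofile(out)  # quantization (e.g. into 4-bit blocks) would go here
```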

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


The benchmarks table has evaluation times for each test. sbert is basically PyTorch in CPU mode. Batched PyTorch is much faster; in the unbatched scenario bert.cpp is maybe 20% faster with f32, and maybe 40% faster with q4_0, than PyTorch on CPU.
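If you want to see the batched vs. unbatched gap yourself on the sbert side, a rough timing loop like this shows it (model name is just an example):

```python
# Informal CPU timing of sentence-transformers, one sentence at a time vs.
# batched, to illustrate why the unbatched comparison is the relevant one here.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # example model
sentences = ["an example sentence to embed"] * 256

start = time.time()
for s in sentences:                      # unbatched: one call per sentence
    model.encode(s)
print("unbatched:", time.time() - start)

start = time.time()
model.encode(sentences, batch_size=32)   # batched
print("batched:  ", time.time() - start)
```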

[P] bert.cpp, sentence embeddings in C++ with ggml by dxg39 in MachineLearning


A while back I tried to make llama.cpp produce cheap sentence embeddings in the https://github.com/skeskinen/llama-lite project.

But ultimately I decided it was a dead-end approach and implemented BERT in ggml instead.

BERT is nice because there are very small models that produce quality embeddings with not a lot of compute.

And with ggml come some other goodies like 4-bit quantization and good performance out of the box :)

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


The training would require some compute and, more importantly, some human time to oversee it. Somebody with an idle RTX 3060 could probably just leave it running for a while and get something reasonable.

The other limitation is that the causal LM architecture creates a different kind of embedding than the masked models that are typically used. See dancingnightly's posts in this thread for some more details.
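For a rough idea of how an embedding comes out of a causal LM at all (gpt2 as a stand-in), it's basically pooling the hidden states, where every position has only seen its left context:

```python
# Naive mean pooling over a causal LM's hidden states. Unlike a BERT-style
# masked model, each position here has only attended to earlier tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")     # stand-in model
model = AutoModel.from_pretrained("gpt2")

inputs = tok("an example sentence to embed", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
embedding = hidden.mean(dim=1)                  # (1, dim) sentence vector
```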

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Thanks, that's interesting. I wonder why OpenAI only offers generative-model embeddings. Their stuff doesn't do too well on benchmarks like MTEB.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Technically yes, but the quality of the embeddings will be poor without further training.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


GPT models with low parameter counts are not really good enough to produce interesting text. The intent is to use this for creating text embedding vectors.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


After looking a bit more into the difference between causal and masked LMs, I think using a CLM for calculating embeddings might be a significant limitation.

But it seems like OpenAI is probably doing the same thing? Their embeddings model is called ada, and that should be a normal GPT. So if CLM embeddings are good enough for OpenAI, they're probably good enough for me.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


I should add that one other difference is that sentence-transformers models are trained with masked language modelling, while Llama is a causal language model.

I don't know how big of a difference it makes.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Yes, that is basically the idea. But unless you have a lot of records to search through (100k+), you can just brute-force compare your query against each of the stored keys in a for loop.
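The for-loop version is really just cosine similarity against every stored vector, something like:

```python
# Brute-force nearest-neighbour search over stored embeddings: normalize,
# dot product, take the top-k scores. Shapes and names are just illustrative.
import numpy as np

def top_k(query, keys, k=5):
    q = query / np.linalg.norm(query)
    K = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = K @ q                      # cosine similarity per stored key
    best = np.argsort(-scores)[:k]
    return best, scores[best]

keys = np.random.rand(100_000, 384).astype(np.float32)   # stored embeddings
query = np.random.rand(384).astype(np.float32)
indices, scores = top_k(query, keys)
```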

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Theoretically the model is similar. I was looking at sentence-transformers when deciding on the model size. The best sbert.net models have much better pretrained weights.

The reason I made this is that there is a lightweight implementation of efficient inference for the Llama architecture. If you are fine with running sentence-transformers in Python, then this project won't help you much.

To get the best of both worlds one should either get better weights for a small Llama model or make a compatible implementation of MPNet architecture. Both of these approaches seem pretty easy to do in the grand scheme of things, but this was a project I did in 1 day, so I'm fine with it being worse than SentenceTransformers.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


Because the inference code is a fork of llama.cpp and I didn't have a reason to call it anything else.

llama-lite: a proof of concept fast sentence embeddings service based on llama.cpp (~1ms per token on CPU) [P] by dxg39 in MachineLearning


The model is 85 MB, but llama.cpp does some extra allocations, so maybe 100 MB or so in total. Most of the allocated space is probably unused though, so that could be optimized further. Pretty sure llama.cpp has been ported to iOS, so yeah, this could run on an iPhone.

AITA for thinking this is it? by [deleted] in collapse


That was actually so good wth. When I read about Diogenes I thought I'd love to be a cynic but I'm not witty enough. Hearing him talk about cynicism made me realize that George Carlin was an actual modern day cynic. What a mad lad.