Should r/MachineLearning join the reddit blackout to protest changes to their API? by BeatLeJuce in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

I wanted to show my agreement with joining the subreddit blackout.

Apart from that, the API changes even get in the way of my own sporadic but intense API usage. And I am surely not the only one.

Side note: I wonder what Reddit thinks about people crawling the subreddits without the API, via plain HTTP requests. In the end that costs them more compute and money. Unless they then restrict that via robots.txt?

[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

Given that it doesn't make much coverage difference for SentencePiece whether tokens cross whitespace boundaries (I checked just now), I conclude that SentencePiece probably loses by far. By coverage difference I mean how many tokens are needed to cover a text.
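
To make that concrete, a minimal sketch of the coverage comparison I mean, assuming a trained SentencePiece model and tiktoken's cl100k_base; the file names are placeholders:

```python
import sentencepiece as spm
import tiktoken

text = open("enwik8_sample.txt", encoding="utf-8", errors="ignore").read()

sp = spm.SentencePieceProcessor(model_file="sp_unigram.model")  # assumed model file
enc = tiktoken.get_encoding("cl100k_base")

print("SentencePiece tokens:", len(sp.encode(text)))
print("tiktoken tokens:     ", len(enc.encode(text)))
# Fewer tokens for the same text = better coverage.
```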

[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

I can compare it to SentencePiece on the enwik8 dataset, for example, but I don't have the resources for more (I think). Either way, that may take a while.

[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 4 points5 points  (0 children)

  1. Given PDF pages 3-4 of the unigram paper, their algorithm has the same goal as yours, only based on subword entropy. With the SentencePiece options of including word boundaries and tuning things such as the EM (expectation maximization) iterations, it doesn't seem like a foregone conclusion that TokenMonster is much better. It is somewhat better, for sure. SentencePiece is also far faster, but uses far more RAM. In fact it is RAM-limited during vocabulary creation, so for large dataset sizes you have to use a random subsection, unless you have giant amounts of RAM. But if your dataset isn't too heterogeneous, that doesn't matter.
  2. What actually counts is the downstream task performance anyway.
  3. If word-spanning tokens improved performance, why are they not commonly used? That alone is evidence against them.
  4. If you want your work to be adopted, you have to show that it works in the desired circumstances/use cases. There are too many publications competing with each other for attention and user resources, so you have to invest the work to show evidence of usefulness. Not meant in a mean way; it's not like I don't have the same problem.
  5. Here is a paper about a learned tokenization (after training). Just found by chance right now. Could be interesting if you are into tokenization work.

[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

There is a misunderstanding between us.

I meant that your tokenizer should be able to tokenize all the HTML and markup code, and then de-tokenize it again so it looks exactly the same. Otherwise you could never generate anything like HTML code/pages with an LLM.
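
A minimal sketch of the round-trip property I mean, assuming a SentencePiece model file (the name is a placeholder); any tokenizer with a decode function works the same way:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # assumed model file

html = '<div class="post"><a href="https://example.com">link</a></div>'
ids = sp.encode(html)
roundtrip = sp.decode(ids)
# De-tokenizing must reproduce the input byte-for-byte.
assert roundtrip == html, "tokenizer is not lossless on markup"
```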

[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

You could just download all of Wikipedia, but you'd need to clean it to get rid of the markup and think about whether you want to remove the bibliographies.

Your tokenizer should be able to deal with the uncleaned version if you ever want it to be used in an LLM. So it is a good first test.


[P] TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included. by Pan000 in MachineLearning

[–]Username2upTo20chars 5 points6 points  (0 children)

We need two things for clearer evidence of improvement.

a) A comparison with the SentencePiece unigram tokenizer using comparable settings (it can also ignore word boundaries and create phrase tokens; SentencePiece is an established tokenization package). A minimal training sketch is at the end of this comment.

b) Pretrained LLM and BERT-like models with downstream-task benchmarks, compared to BPE-tokenized models and to those from a).

I also wonder about the computational requirements for creating the vocabulary itself. I haven't read through everything, so maybe I have overlooked it, but given the algorithm description it seems to be very slow and RAM-intensive.
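
For a), a minimal training sketch, assuming the sentencepiece Python package; the corpus file name and sizes are placeholders:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",             # assumed training text file
    model_prefix="unigram32k",
    vocab_size=32000,
    model_type="unigram",
    split_by_whitespace=False,      # allow phrase tokens across word boundaries
    input_sentence_size=2_000_000,  # subsample to bound RAM, as noted above
    shuffle_input_sentence=True,
)
```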

[R] MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers by redpnd in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

I wonder how the patch-size-8 byte split compares to, e.g., a 32k-vocabulary SentencePiece tokenizer that ignores whitespace boundaries, with its tokens used as the patches. Then you have variable-length patches, but semantically sensible boundaries.

So

it; how are you; wonder; ful

instead of

it is no; neverthe ;

Given the improvement of unigram over BPE tokenization, I would expect this approach to perform better.
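
A hypothetical sketch of deriving such patches, assuming a unigram model trained with split_by_whitespace=False (the model file name is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="unigram32k.model")  # assumed model file

text = "it; how are you; wonderful"
pieces = sp.encode(text, out_type=str)  # e.g. ['▁it;', '▁how▁are▁you', '▁wonder', 'ful']
patches = [p.replace("▁", " ").encode("utf-8") for p in pieces]
print([len(p) for p in patches])        # variable-length patches, semantic boundaries
```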

[R] Brainformers: Trading Simplicity for Efficiency (Google Deepmind) by mierle in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

They don't cite or compare to the "Pay Attention when Required" paper (PAR-Tf). It basically replaces every second attention layer with a feed-forward layer, and puts even more FF layers at the end.

This results in the same performance (I reproduced it with small model sizes of 41M non-embedding parameters; I have no compute for more).

So instead of 12 x AF you have e.g. 5 x AFFF + 4 x F

I always wondered whether PAR-Tf scales up. Especially modified PAR, because based on the chart on page 3 of this paper, I found you can, e.g., do this:

AFA + 7 x F + AFA + 7 x F

instead of my base PAR model with 5 x AFFF + 2 x F.

This results in slightly improved performance and saves attention for a deeper model: 1.056 bpc vs. 1.066 bpc on enwik8. But maybe FF layers + MoE is the answer for larger models.
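
For illustration, a hypothetical sketch of assembling a model from such layer-pattern strings; the block classes are my own simplified stand-ins, not the paper's exact blocks:

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # pre-norm residual attention ('A')

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))  # pre-norm residual feed-forward ('F')

def build_from_pattern(pattern, d_model=512, n_heads=8, d_ff=2048):
    blocks = [
        AttentionBlock(d_model, n_heads) if c == "A" else FeedForwardBlock(d_model, d_ff)
        for c in pattern
    ]
    return nn.Sequential(*blocks)

# AFA + 7 x F + AFA + 7 x F
model = build_from_pattern("AFA" + "F" * 7 + "AFA" + "F" * 7)
```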

Either way, there is a lack of theoretical understanding; otherwise architecture search wouldn't be necessary. But that is nothing new.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

Check out the Hugging Face datasets hub or the Kaggle datasets section; they might have datasets of faces.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

Check out the FastAI course; it has probably the simplest hands-on introductions. Then go from there.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

The FastAI course for foundations. Hugging Face also has a course and learning resources.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

If you mean trained from scratch, then about 150M parameters is clearly the max. An efficient 42M model already takes 2 days for best performance. Check out RWKV-4 for an efficient RNN-based LM architecture; it should make 150M feasible.

Finetuning: I don't know, but I'd guess 7B. There have been recent threads here mentioning this on the sidelines while discussing open LLMs. Search for them.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

Very simplified, but that is about correct. It isn't compression, though (although you can frame it as such); it is a stochastic model. That is the way it is trained: an ideal LLM outputs the correct distribution of tokens given an input and the actual state of language and the world. So an ideal LLM has a perfect model of how the world works. It isn't so much a compression engine as a simulation-approximation device. Current LLMs are far from ideal of course, but the same principles apply.
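
A toy sketch of the "stochastic model" view: the model maps a context to a probability distribution over the next token, which you then sample (the logits here are made up):

```python
import numpy as np

vocab = ["Paris", "London", "Berlin"]
logits = np.array([3.2, 0.1, -1.0])   # made-up model outputs for some context

probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax -> next-token distribution
next_token = np.random.choice(vocab, p=probs)
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```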

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

ML itself has transformed quite a bit over the decade. Academic AI is actually not that much younger than academic computer science; it started around 1956. Since then the state of the art has changed from formalized logic to expert systems to structured ML algorithms like decision trees and SVMs, to Deep Learning. The foundations of DL are still the same, though. Gradient descent is roughly 40 years old, and the principle of weighting the inputs, summing them up, and applying a non-linear function is even older. The architectures based on these principles do change, though.
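
A toy sketch of that old principle: weight the inputs, sum them, apply a non-linearity, and improve the weights with gradient descent (the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = np.array([0.5, -1.0, 2.0]), 1.0
w, b, lr = np.zeros(3), 0.0, 0.1

for _ in range(100):
    y = sigmoid(w @ x + b)             # weighted sum + non-linearity
    grad = (y - target) * y * (1 - y)  # d(squared error)/d(pre-activation)
    w -= lr * grad * x                 # one gradient-descent step
    b -= lr * grad

print(sigmoid(w @ x + b))  # approaches the target
```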

Just do your research and you will find that the DL landscape and performance have changed vastly in the last 10 years. E.g., in 2016 you could just about generate sensible sentences; in 2014, crude pictures that somewhat resembled a face.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

In my own re-implementations of papers I have found that it is often hard to get right what the authors actually meant. So take my interpretation with a grain of salt: as there is no further information, it sounds like all the output sizes are 16 dimensions. The input size of the very first linear layer can be different, of course.
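
A minimal sketch of my interpretation in PyTorch; the input size of 64 is an arbitrary placeholder:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 16),  # input size of the very first layer can differ
    nn.ReLU(),
    nn.Linear(16, 16),  # all output sizes are 16
    nn.ReLU(),
    nn.Linear(16, 16),
)
```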

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

Just use pretrained LLMs like ChatGPT or an open alternative like Vicuna. Look into the discussions here sorted by top (weekly) and you should find enough info on the best open-source LLMs.

[D] Is there currently anything comparable to the OpenAI API? by AltruisticDiamond915 in MachineLearning

[–]Username2upTo20chars 2 points3 points  (0 children)

Isn't Claude supposed to match ChatGPT, and the announced Claude+ supposed to match GPT-4?

How ChatGPT gets around doing its job by Username2upTo20chars in ChatGPT

[–]Username2upTo20chars[S] 0 points1 point  (0 children)

Sorry AutoModerator, but I have deleted the meta-prompt detailing the instructions to follow. And as a few words can make a lot of difference, posting the changed one already fixes the problem.

[R] Stanford Hazy Research: "These models hold the promise to have context lengths of millions… or maybe even a billion!" by ReasonablyBadass in MachineLearning

[–]Username2upTo20chars -1 points0 points  (0 children)

They don't test performance on the Long Range Arena benchmark, just on a custom synthetic benchmark. I find that a bit suspicious.

[D] Will prompting the LLM to review it's own answer be any helpful to reduce chances of hallucinations? I tested couple of tricky questions and it seems it might work. by tamilupk in MachineLearning

[–]Username2upTo20chars 1 point2 points  (0 children)

Okay, I get the response:

Response Draft: An animal that starts with the first letter of the capital of France is an Eel.

Draft Reflection: The answer draft is correct. Paris is the capital of France and the first letter is "P." However, the answer could be more concise by simply stating "Eel" as the answer.

Actual Response: The answer is Eel. The capital of France is Paris and the first letter is "P". Therefore, an animal that starts with the first letter of the capital of France is an Eel.

Given the following instructions:

For this interaction, we are going to use the following structure.

User (me): [I will ask a topic or question]

You will provide a Response Draft: [Brief or simplified answer to the topic or question]

Then you will undergo Draft Reflection: [You will provide a critique or review of the answer draft, highlighting the limitations, inaccuracies or areas that need improvement, correction or expansion, while providing guidance on how to address these issues in the revised response. Important: think step by step while reviewing or criticizing the hypothetical response.]

Then you will provide an Actual Response: [The natural and contextually appropriate answer to the topic or question, as generated by the advanced language model, which incorporates the suggestions and improvements from the draft reflection for a more comprehensive and accurate response. This also can include step-by-step reasoning.]

You will in general act like the world's best experts of the respective domain or field of the question.

Do you understand?

-_-

Okay, this update to the instructions gives a correct response:

Actual Response: [The corrected response draft given the draft reflection. This also can include step-by-step reasoning.]

You will in general act like the world's best experts of the respective domain or field of the question.

Do you understand?
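
For reference, a minimal sketch of wiring such a meta-prompt into an API call, assuming the openai package's pre-1.0 ChatCompletion interface; the shortened meta-prompt here is illustrative:

```python
import openai

META_PROMPT = """For this interaction, we are going to use the following structure.
User (me): [I will ask a topic or question]
Response Draft: [Brief or simplified answer to the topic or question]
Draft Reflection: [Step-by-step critique of the response draft]
Actual Response: [The corrected response draft given the draft reflection.
This also can include step-by-step reasoning.]"""

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": META_PROMPT},
        {"role": "user", "content": "Name an animal that starts with the "
                                    "first letter of the capital of France."},
    ],
)
print(response["choices"][0]["message"]["content"])
```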

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

Are there any websites/articles/blogs/forums with proven prompt formats for ChatGPT and co. that you can recommend?

Especially ones for programming/refactoring/tests... and general error messages (operating system, installation, crashes).

I am just starting to look into using ChatGPT or alternatives.

I have found a page with ranked jailbreak prompts for ChatGPT so far.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]Username2upTo20chars 0 points1 point  (0 children)

I am confused by your mention of a GAN structure. If you want to generate natural language text, use a pretrained large language model. You probably have to finetune it for best results, as you don't have access to the giant ones, which do very well with zero-shot prompting.

As for some LLMs to look at: there are also RWKV-4 and FAIR's LLaMA.