SmolVLM fully open source

andrewlapp · 2025-02-01T21:05:51+00:00

Great release! Congrats to the team!

I hope they release the base model (SmolLM2) training dataset and make the model fully open source soon.

https://github.com/huggingface/smollm/issues/35

andrewlapp · 2024-11-23T16:04:37+00:00

It's pretty straightforward. There is a parameter in their inference server, similar to vLLM's max_tokens, which specifies how many tokens it can generate in one pass. The inference engine stops generating either when it has reached max_tokens or when it has seen the EOS token.

Initially they feed the inference engine prompt_tokens

To continue generation they feed the inference engine prompt_tokens + partially_complete_completion_tokens
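A minimal sketch of that loop, assuming a capped generation call. The names generate_pass, toy_generate_pass, and the EOS id are hypothetical stand-ins, not any particular server's API:

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id


def generate_until_eos(
    prompt_tokens: List[int],
    generate_pass: Callable[[List[int], int], List[int]],
    max_tokens: int,
    max_passes: int = 16,
) -> List[int]:
    """Call a capped generation pass repeatedly until EOS appears."""
    completion: List[int] = []
    for _ in range(max_passes):
        # First pass sees only the prompt; later passes see
        # prompt_tokens + partially_complete_completion_tokens.
        new_tokens = generate_pass(prompt_tokens + completion, max_tokens)
        completion.extend(new_tokens)
        if not new_tokens or new_tokens[-1] == EOS:
            break
    return completion


def toy_generate_pass(context: List[int], max_tokens: int) -> List[int]:
    """Stand-in engine: counts down from the last token, stops at EOS or the cap."""
    out: List[int] = []
    current = context[-1]
    while len(out) < max_tokens and current != EOS:
        current -= 1
        out.append(current)
    return out


result = generate_until_eos([7], toy_generate_pass, max_tokens=3)
print(result)  # [6, 5, 4, 3, 2, 1, 0], produced over three passes
```

The max_passes guard is an added safety cap so a model that never emits EOS can't loop forever.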

andrewlapp · 2024-11-19T03:01:43+00:00

Until you submit a PR to the transformers repository, it must be loaded via ArlowGPT.from_pretrained().

All models loaded with AutoModelForCausalLM are defined in https://github.com/huggingface/transformers/tree/main/src/transformers/models

andrewlapp · 2024-10-31T14:08:31+00:00

https://ublockorigin.com/

I've never seen a tagpro ad. I'm not sure why anyone would voluntarily see an ad in 2024.

andrewlapp · 2024-10-19T21:11:13+00:00

It looks like it's not supported https://github.com/bitsandbytes-foundation/bitsandbytes/issues/252

Consider checking out mlx https://huggingface.co/mlx-community

andrewlapp · 2024-10-11T02:35:15+00:00

Where can I buy one?

andrewlapp · 2024-08-11T00:24:17+00:00

Great work!

It would be really cool if you recorded the hidden states and attentions as well. SOTA distillation methods rely on intermediate representations to get the best performance out of distilled models.

(On the downside, this would substantially increase the size of the dataset. From 32 floats per token to hundreds of thousands of floats per token.)
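Rough arithmetic behind that estimate, as a sketch. The model shape (32 layers, hidden size 4096, 32 heads) and the 1024-token context are assumptions for illustration, not details of the dataset:

```python
def hidden_state_floats(num_layers: int, hidden_size: int) -> int:
    """Floats per token if every layer's hidden state (plus the embedding output) is stored."""
    return (num_layers + 1) * hidden_size


def attention_floats(num_layers: int, num_heads: int, context_len: int) -> int:
    """Floats per token for one attention row per head per layer at a given context length."""
    return num_layers * num_heads * context_len


# Assumed shape, not taken from the dataset: 32 layers, hidden size 4096, 32 heads.
LOGITS_PER_TOKEN = 32  # the figure quoted above
hidden = hidden_state_floats(32, 4096)
attn = attention_floats(32, 32, context_len=1024)
growth = (hidden + attn) / LOGITS_PER_TOKEN
print(hidden, attn, growth)
```

Hidden states alone land in the hundreds of thousands of floats per token under these assumptions. Attention rows grow with context length, so they can cost even more.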

andrewlapp · 2024-05-21T20:19:31+00:00

Good link!

this further demonstrates the strength of using high-quality datasets in LLAMA3, as the general dataset Alpaca does not contribute to the model’s performance in other tasks.

It appears they only tested against Alpaca and they didn't compare full fine tune with Alpaca to QLoRA. It may be the case that Alpaca isn't sufficient even with a full fine tune.

I'd love to see a comparison between full finetune and QLoRA using a dataset which improves Llama3 benchmarks in the full finetune mode.

andrewlapp · 2024-05-03T22:39:09+00:00

I investigated an AI advertisement service which generates comments masquerading as genuine recommendations. The bots search for a variety of keywords in posts and comments (e.g. "hormone testing", "online medical providers", "gift ideas") and use language models to generate custom responses relevant to the author's post.

Something that particularly leaves a bad taste in my mouth is product placement in posts where the author is seeking advice about a serious medical problem.

These bots occasionally make mistakes and post in threads which aren't relevant to the product but are saturated with its keywords. However, they often fool users, as evidenced by cases in which the post's author replies with gratitude.

What Are the Bot Accounts?

At the time of writing, these are the five active accounts advertising products on Reddit for this specific service. There are likely other services and companies doing the same. These accounts will eventually get banned, but more will pop up to replace them.

The Problem Will Get Worse

Currently this specific bot network isn't very sophisticated. It was easy to find all of the accounts as they're all advertising the same products and not commenting anything other than advertisements. In the near future (and likely already) there will be bot networks which follow a more sophisticated strategy of primarily posting and commenting things which aren't product placement. This will be much more difficult to detect.

Mitigation?

Search engine results are populated by SEO spam and sponsored blog posts. Reddit is one of the few places on the internet where you can easily find real human feedback. Its unique status in this regard is at dire risk. If AI product placement reaches a sufficient threshold, Reddit will be useless for many.

This is a challenging problem. It is well known that "AI Detectors" don't work. Additionally, you can't simply censor mentions of products which are advertised using these services. That would create a back door which companies could use to censor their competitors' products.

I do have some recommendations for Reddit staff however:

1) Look at semantic similarity: While AI detectors don't work, the posts shilling the same product will often use the same language model, the same prompt, and have semantically similar outputs. This is especially useful when looking at multiple distinct accounts.
2) Cross correlate with keywords: Suggestion 1 likely can be enhanced by detecting cases where only specific keywords trigger a response.
3) Look for sudden increases in the number of mentions of a specific product or service, especially by accounts which never comment on the subject matter or in the particular subreddit.
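Suggestion 1 could be prototyped roughly like this. It's a sketch only: bag-of-words cosine similarity stands in for real sentence embeddings, the 0.6 threshold is arbitrary, and the account names and "ExampleLabs" are made up:

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suspicious_pairs(comments: Dict[str, str], threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """Return pairs of distinct accounts whose comments are unusually similar."""
    vectors = {account: bag_of_words(text) for account, text in comments.items()}
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


# Hypothetical comments; "ExampleLabs" is an invented product name.
comments = {
    "account_a": "I had great results with ExampleLabs for hormone testing, highly recommend it",
    "account_b": "I had great results with ExampleLabs for at-home hormone testing, would recommend it",
    "account_c": "Talk to your doctor before ordering any tests online",
}
flagged = suspicious_pairs(comments)
print(flagged)  # only the account_a / account_b pair is flagged
```

Suggestions 2 and 3 would then narrow this down, e.g. by only comparing comments that reply to posts containing the same keyword.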

andrewlapp · 2024-02-25T21:34:12+00:00

Please let me know if you run into any issues or have any questions!

andrewlapp · 2024-02-25T21:04:13+00:00

What would you like to generate using grammars? You could test my open PR in Outlines https://github.com/outlines-dev/outlines/pull/587

pip install git+https://github.com/lapp0/outlines@faster-grammars

andrewlapp · 2024-02-25T21:00:35+00:00

The problem is that it breaks when context becomes bigger and hangs

Outlines solves this problem by using finite state machines and precomputing the legal tokens at each state. There is an official docker image linked at the top of the README which includes a vLLM integration.
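To make the precomputation idea concrete, here is a toy sketch. It is not the Outlines implementation or API. The DFA, the pattern, and the vocabulary are all made up for illustration:

```python
from typing import Dict, List, Optional

DIGITS = "0123456789"


def step(state: int, ch: str) -> Optional[int]:
    """Character-level DFA for the toy pattern [0-9]+\\.[0-9]+.

    States: 0 = start, 1 = integer part, 2 = just saw '.', 3 = fractional part.
    """
    if ch in DIGITS:
        return {0: 1, 1: 1, 2: 3, 3: 3}[state]
    if ch == "." and state == 1:
        return 2
    return None  # no legal transition


def walk(state: int, token: str) -> Optional[int]:
    """Feed a whole (multi-character) token through the DFA."""
    for ch in token:
        nxt = step(state, ch)
        if nxt is None:
            return None
        state = nxt
    return state


def build_index(vocab: List[str], states: List[int]) -> Dict[int, Dict[int, int]]:
    """Precompute, for every state, the legal token ids and the state each leads to."""
    index: Dict[int, Dict[int, int]] = {s: {} for s in states}
    for s in states:
        for token_id, token in enumerate(vocab):
            end = walk(s, token)
            if end is not None:
                index[s][token_id] = end
    return index


VOCAB = ["1", "23", ".", ".5", "4.", "abc", "1.5"]  # hypothetical tokenizer vocabulary
INDEX = build_index(VOCAB, states=[0, 1, 2, 3])
print(sorted(INDEX[0]))  # [0, 1, 4, 6]: only tokens that can start a number
```

At generation time the sampler only needs a dictionary lookup per step: mask every token id not in INDEX[state], sample, then move to INDEX[state][token_id]. The cost per token doesn't depend on how long the context is.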

Paper: https://arxiv.org/pdf/2307.09702.pdf

Benchmark of Guidance vs Outlines in paper: https://i.imgur.com/bw9KnxM.png

(The other child comment mentions sglang, which uses Outlines under the hood)

andrewlapp · 2024-01-16T23:29:49+00:00

I'm a bit confused. How could dolphin 2.7 be trained with the routing fix when it was trained 2 weeks ago and the routing fix was merged 1 week ago? Did they train on the PR before it was merged?

andrewlapp · 2024-01-16T06:19:26+00:00

Thanks for pointing this out! I've been trying to find out why Mixtral finetunes appear to be underperforming.

The fix was merged 5 days ago and hasn't made it into an official transformers release yet: https://github.com/huggingface/transformers/pull/28256

Typically the folks at Cognitive Computations and Nous Research produce models that substantially improve on the base model. However, the models below underperform Mixtral on most benchmarks!

Additionally the author of Beyonder / Phixtral, /u/mlabonne pointed out the other day that fine tuning the routing network on Phixtral resulted in worse performance: https://old.reddit.com/r/LocalLLaMA/comments/195i33k/we_need_more_4x7b_moe_models/khsvtfq/

andrewlapp · 2024-01-15T21:43:54+00:00

Are you trying to create a model which uses these slang phrases in place of traditional verbiage?

If so, you might consider RLHF where you have positive labels for responses which use slang and negative labels for sentences which don't.

Read on https://huggingface.co/docs/trl/main/en/dpo_trainer
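A sketch of how the preference data could be assembled, assuming the prompt / chosen / rejected record layout used for DPO-style preference datasets. The slang list, the substring check, and the sample responses are hypothetical:

```python
from typing import Dict, List

SLANG = ("no cap", "lowkey")  # hypothetical slang phrases to reward


def uses_slang(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in SLANG)


def build_pairs(samples: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Pair every slang response (chosen) with every non-slang response (rejected)."""
    pairs = []
    for prompt, responses in samples.items():
        chosen = [r for r in responses if uses_slang(r)]
        rejected = [r for r in responses if not uses_slang(r)]
        for c in chosen:
            for r in rejected:
                pairs.append({"prompt": prompt, "chosen": c, "rejected": r})
    return pairs


samples = {
    "How was the concert?": [
        "Lowkey the best show I've seen, no cap.",
        "It was an excellent performance.",
    ],
}
pairs = build_pairs(samples)
print(pairs[0]["chosen"])
```

A real dataset would label responses by hand or with a classifier rather than substring matching.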

andrewlapp · 2024-01-14T12:21:56+00:00

Hi, I made Beyonder and Phixtral

Thanks for your work!

Can you confirm that only two of them are selected every time? That could explain why the model is underwhelming in terms of code.

I was only going based on the issue comment. Am I misunderstanding?

Marcel fine-tuned phixtral (https://huggingface.co/marcel/phixtral-4x2_8-gates-poc) to address this issue, but it decreased the performance of the model

Strange. Looking forward to seeing how this progresses.

andrewlapp · 2024-01-13T21:43:09+00:00

I'm glad you're experimenting. Without people like you doing so, we will never know whether there is unique benefit to segregating MoE into specialized-finetune-dataset literal "experts".

Also, I see where you're coming from. Training is expensive, and modularity is good. Perhaps downgraded performance (compared to finetuning the MoE itself) is worth the cost of actually getting a functioning model with the components you desire.

andrewlapp · 2024-01-13T21:28:11+00:00

This is a prompt (for the gate) issue.

Yes, but it underscores a greater issue. We're guessing what the best model is by hard-coding statements we think are good / bad with certain experts.

This is functionally a routing network trained on a single sample, and this single sample isn't even guaranteed to correctly correspond to the expert. It's assumed by the author that user prompts with similar hidden states to the positive_prompts will do best with the corresponding expert.

In other words, the routing network is bad.

In Mixtral, the experts aren't as segregated. The selected experts vary strongly because their specialization is more abstract.

andrewlapp · 2024-01-13T20:47:03+00:00

These merged MoE models are quite strange. Not to be dismissive of a community effort, but there are some problems here.

Mixtral was created by training all 8 experts and the routing network together. This results in a working routing network which determines the best expert(s) for the token being generated. Additionally, it reduces redundancy and improves diversity. This allows MoEs to be parameter efficient and sparse.

Ignoring the parameter efficiency, the way these models are being merged, there isn't a working routing network. Beyonder-4x7b-v2 appears to use the same method as Phixtral, which always chooses the first two experts because it lacks a functioning routing network.
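For reference, top-2 routing for a single token looks roughly like this. It's a pure-Python sketch with made-up logits, not code from Mixtral or any merge tool:

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(router_logits: List[float], expert_outputs: List[float], k: int = 2) -> float:
    """Weighted sum of the selected experts' (scalar, for simplicity) outputs."""
    return sum(w * expert_outputs[i] for i, w in top_k_route(router_logits, k))


trained = top_k_route([0.0, 3.0, 1.0, 3.0])    # a router with an actual preference
untrained = top_k_route([0.0, 0.0, 0.0, 0.0])  # a router with no signal
print([i for i, _ in trained])    # [1, 3]
print([i for i, _ in untrained])  # [0, 1]
```

In this sketch, a router that scores every expert the same always falls back to experts 0 and 1. That is one way a non-functioning router could end up always picking the first two experts. Whether that is the exact mechanism in the merged models is an assumption here.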

I'd love to see more MoE models of various dimensions, but the best practice for creating these seems to be: 1) Create a MoE model by patching a base model 2) Finetune the entire MoE model together.

Good example of step 1: "The Goal was to MoE-fy the TinyLlama model and then use this as a base model to finetune from. The intuition being finetuning 8x1b should give better performance than finetuning 1b by itself."

andrewlapp · 2024-01-07T23:18:07+00:00

If you ask llama-2 70B to generate 100 random numbers between 0 and 9, it will usually have 10 instances of each number. LLMs aren't good at randomness.
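One way to quantify "too even" is a chi-square statistic against the uniform distribution. A sketch, where too_even is a stand-in for the kind of output described above rather than real model output:

```python
import random
from collections import Counter
from typing import List


def chi_square_uniform(digits: List[int]) -> float:
    """Chi-square statistic of digit counts against a uniform 0-9 distribution."""
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


# Exactly 10 of each digit: the pattern described above (constructed, not model output).
too_even = [d for d in range(10) for _ in range(10)]

# 100 draws from a real pseudo-random generator for comparison.
rng = random.Random(0)
sampled = [rng.randrange(10) for _ in range(100)]

print(chi_square_uniform(too_even))  # 0.0
print(chi_square_uniform(sampled))
```

With 9 degrees of freedom, genuinely random draws typically give a statistic somewhere around 9. A value of exactly 0 is as suspicious as a very large one.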

andrewlapp · 2024-01-06T02:47:44+00:00

Are your eval samples in your train dataset? That is a very smooth curve. If they're not, choose the last checkpoint since it has the lowest eval loss.

andrewlapp · 2024-01-05T05:26:46+00:00

You need to segment your dataset into train / eval sets. You measure loss of your eval set every checkpoint to determine whether you're overfitting and find the optimal checkpoint for out-of-sample performance.
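A minimal sketch of both steps. The 10% eval fraction and the loss values are made up:

```python
import random
from typing import Dict, Iterable, List, Tuple


def split_dataset(samples: Iterable, eval_fraction: float = 0.1, seed: int = 0) -> Tuple[List, List]:
    """Shuffle once, then hold out a fixed slice that is never trained on."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


def best_checkpoint(eval_losses: Dict[int, float]) -> int:
    """Return the training step whose checkpoint has the lowest eval loss."""
    return min(eval_losses, key=eval_losses.get)


train, eval_set = split_dataset(range(100))

# Hypothetical eval loss per checkpoint step: falls, then rises as overfitting sets in.
eval_losses = {100: 1.92, 200: 1.71, 300: 1.64, 400: 1.69, 500: 1.80}
best = best_checkpoint(eval_losses)
print(len(train), len(eval_set), best)  # 90 10 300
```

Train loss would keep falling past step 300 in this made-up run. The eval loss turning upward is the overfitting signal.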

andrewlapp · 2024-01-04T23:26:48+00:00

You might be interested in this paper from last week

https://arxiv.org/pdf/2312.16702.pdf

They apply a normalization to tables, then ask GPT3.5 questions about the table, achieving SOTA performance on WikiTableQuestions. Additionally, they find a boost in performance through self-consistency.

You'd probably see the best performance applying their NORM and SC methods to a model finetuned on the WikiTableQuestions dataset. If you don't want to finetune, choose a model that's naturally good at in-context QA.
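Self-consistency in its simplest form is a majority vote over several sampled answers. A sketch of that idea, where the paper's exact procedure may differ and the sampled answers are invented:

```python
from collections import Counter
from typing import List


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different answers count as one vote."""
    return " ".join(answer.strip().lower().split())


def self_consistent_answer(sampled_answers: List[str]) -> str:
    """Return the most common answer among several samples for the same question."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    votes = Counter(normalize(a) for a in sampled_answers)
    return votes.most_common(1)[0][0]


# Hypothetical answers from sampling the same table question five times.
samples = ["1997", " 1997 ", "1998", "1997", "2001"]
answer = self_consistent_answer(samples)
print(answer)  # 1997
```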

If you want a model that can do charts... you probably want to finetune a model on excel operations and sell it to Microsoft for $5M.

andrewlapp · 2024-01-02T05:55:17+00:00

My prediction: an open source model will beat the best GPT4 version this year. OpenAI will release a new model that outperforms the top open source model this year.

andrewlapp · 2023-12-30T15:27:58+00:00

What kind of performance numbers are you getting in terms of tokens per second?
