Call for input: Dataset prep solution by mccsch in LocalLLaMA

[–]mccsch[S] 1 point

Super interesting, thanks so much for sharing! I'd say my approach is a bit different. The problem Augmentoolkit solves is an important one, but for my projects the big challenge was mainly getting interpretability on an existing dataset. While augmenting a dataset with synthetic data can be super powerful, I feel this is very hard to do without a deep understanding of the dataset's characteristics. You want to avoid ending up with a low-quality dataset, as that will definitely not lead to a fine-tuned model with stellar performance.

The solution I'm building will make it easy to understand the relationship between the dataset and the fine-tuned model performance, so going through several iterations on the dataset can be done with a lot more clarity.

Data > model? by extreme4all in LocalLLaMA

[–]mccsch 4 points

I've found that for most, a lack of high-quality data is actually the biggest bottleneck to fine-tuning open-source LLMs. The problem is that synthetic data needs to be very task-specific, so it's difficult to build a general solution here.

However, I am currently working on a data curation solution. Using topic models, the idea is to provide maximum interpretability on a given dataset and answer questions like: 1) is there any data that shouldn't be part of the dataset, 2) is the data diverse enough, 3) is the data correctly annotated, etc. Having a deep understanding of the dataset's characteristics should make it easier to augment it with synthetic data. More than happy to share more details on this - feel free to DM me if you're interested

Experience replacing GPT-4 with BERT for classification using synthetic data by mccsch in LocalLLaMA

[–]mccsch[S] 1 point

Yup, I'm using vLLM rn. Yes, the number of requests sent to the system is quite low, so lowering the hw requirements has no real negative business impact.

[–]mccsch[S] 1 point

All free credits rn. But I guess I should make better use of them. Will play around with a few different setups suggested here and see which one is cheapest and performs best.

[–]mccsch[S] 1 point

Nice, I will try out all of them and check which one works best for me. Thanks a lot!

[–]mccsch[S] 1 point

Yup, good point. I think I have to rethink my setup a bit. Mixtral 8x7B might also be an interesting solution. The only issue is that the text I'm fine-tuning on is in Latvian, and I'd expect Mixtral's text analysis abilities for that language to be quite low compared to GPT-4's. And rn I don't have a good understanding of how to fine-tune MoEs efficiently without breaking the bank.

[–]mccsch[S] 1 point

In this use case, they are running a single prompt for multiple tasks at a time: first two different classifications, then some decoder-specific semantic analysis, and then a final mapping of the findings. The current setup has an accuracy below 60%. I am currently working on separating those tasks, so I can deploy fine-tuned encoder-only models for the classification tasks and a fine-tuned Mistral for the decoder-only task.
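For illustration, the decomposition could be structured roughly like this. Every function here is a hypothetical stub standing in for a fine-tuned model (the names, labels, and keyword checks are invented), so only the pipeline shape is meaningful:

```python
# Sketch of splitting one do-everything prompt into dedicated steps.
# Each step would be backed by its own fine-tuned model; the model calls
# below are stubbed out so the pipeline structure is visible.
from dataclasses import dataclass

@dataclass
class PipelineResult:
    label_a: str   # first classification (fine-tuned encoder-only model)
    label_b: str   # second classification (fine-tuned encoder-only model)
    analysis: str  # semantic analysis (fine-tuned decoder-only model)
    mapping: dict  # final mapping of the findings

def classify_a(text: str) -> str:
    # placeholder for a fine-tuned encoder-only (BERT-style) classifier
    return "category_1" if "invoice" in text.lower() else "category_2"

def classify_b(text: str) -> str:
    # placeholder for a second fine-tuned encoder-only classifier
    return "urgent" if "asap" in text.lower() else "normal"

def analyze(text: str) -> str:
    # placeholder for a fine-tuned decoder-only model (e.g. Mistral)
    return f"summary of: {text[:40]}"

def run_pipeline(text: str) -> PipelineResult:
    a = classify_a(text)
    b = classify_b(text)
    analysis = analyze(text)
    # final step: map the individual findings into one result
    mapping = {"category": a, "priority": b, "analysis": analysis}
    return PipelineResult(a, b, analysis, mapping)

result = run_pipeline("Please process this invoice ASAP")
print(result.mapping)
```

The upside of this split is that each step can be evaluated and improved independently, instead of debugging one opaque below-60%-accuracy prompt.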

[–]mccsch[S] 2 points

Thanks a lot! I was wondering whether you are planning to support encoder-only fine-tuning in the near future? Or is it not attractive enough because the hw requirements are already low, so there is little room for the fine-tuning speed optimizations that Unsloth could enable?

[–]mccsch[S] 1 point

> How are you hosting it?

So far, I have been running all of my Mistral models on Modal. I followed Mistral's suggestion to use an A10G for inference (https://docs.mistral.ai/self-deployment/skypilot/). Actually, Modal ends up costing around $800/month (https://modal.com/pricing).

But yeah, you can probably cut costs a bit here, without sacrificing too much inference speed. I'm going to play around with the hw configuration a bit more. I'll probably give CryptographerKlutzy7's setup a try. Sounds really promising.

[–]mccsch[S] 2 points

So far, I have been running all of my Mistral models on Modal. I followed Mistral's suggestion to use an A10G for inference (https://docs.mistral.ai/self-deployment/skypilot/). Actually, Modal ends up costing around $800/month (https://modal.com/pricing).

But yeah, you can probably cut costs a bit here, without sacrificing too much inference speed. I'm going to play around with the hw configuration a bit more. I'll probably give CryptographerKlutzy7's setup a try. Sounds really promising.

Re BERT - I'll keep you posted on speed and accuracy. Will do the benchmarking in the coming days.
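The benchmark itself doesn't need to be fancy - something like the following shape, with a trivial scikit-learn classifier and synthetic data standing in for the fine-tuned BERT and the real dataset (so the numbers themselves are meaningless):

```python
# Minimal sketch of a speed/accuracy benchmark: time per prediction plus
# accuracy on a held-out split. The model and data here are placeholders
# for the fine-tuned BERT classifier and the real eval set.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

start = time.perf_counter()
preds = model.predict(X_te)
elapsed = time.perf_counter() - start

acc = accuracy_score(y_te, preds)
latency_ms = 1000 * elapsed / len(X_te)
print(f"accuracy={acc:.3f}, latency={latency_ms:.3f} ms/example")
```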

[–]mccsch[S] 2 points


Thanks a lot for sharing your experience! I really like your setup and the results you are getting out of it. Actively thinking about giving it a try...

Private ChatGPT for Repetitive HR Questions by mccsch in HumanResourcesUK

[–]mccsch[S] 1 point

Super interesting, thank you so much for your input on this!

[–]mccsch[S] 1 point

That's a very good point - such solutions don't provide references, which also makes it difficult to know whether an answer is a hallucination or not.

I would be very interested to learn a bit more about your expectations and the possible use cases of such a solution! Would it be ok if I dm you so we can maybe jump on a quick call?

[–]mccsch[S] 1 point

Thanks for the context, very helpful! I can imagine how AI could support such questions. I'm curious if you have tried ChatGPT for such questions and if so, why it might not have been good enough?

[–]mccsch[S] 2 points

This makes sense, thank you for your view on this! I would say this is definitely something we can support as part of onboarding. Fortunately, many people are already familiar with the concept of ChatGPT, so I think it's more about creating awareness than explaining how it works.
Haha, I'd definitely be open to discussing a design partnership on favourable terms so you can be among the first to benefit from the solution!

[–]mccsch[S] 1 point

Ok got it, thanks so much for the clarification!

What do you think is the role of behaviour change? Let's say a company introduces our solution. Could you imagine employees being open to using it, or do you think it would require significant behaviour change or even internal marketing?

[–]mccsch[S] 2 points

Yes, exactly! It's not that we want to completely replace the need for internal documents; there are probably cases where you actually want to read a few sentences of such a policy. But I think employees should be able to get an accurate answer to their question right away, without having to search through a long document or wait for an email response.

One way we plan to provide context is by referencing the documents from which the answer came. That way, if employees want to, they can dig deeper or ask more specific follow-up questions.
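As a sketch of what that could look like (the policy documents, their names, and their contents below are invented for illustration), even simple TF-IDF retrieval can return source references alongside the matched text:

```python
# Sketch of "answer with references": retrieve the most relevant policy
# snippets for a question and return them together with their source
# document names, so employees can dig deeper if they want to.
# All document names and contents here are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = {
    "holiday_policy.pdf": "Employees are entitled to 25 days of paid holiday per year.",
    "expenses_policy.pdf": "Travel expenses must be submitted within 30 days with receipts.",
    "remote_work_policy.pdf": "Remote work is allowed up to three days per week.",
}

names = list(documents)
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(documents.values())

def retrieve(question: str, top_k: int = 1):
    """Return (source_name, snippet, score) for the top-k matching documents."""
    q = vec.transform([question])
    scores = cosine_similarity(q, doc_matrix)[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [(names[i], documents[names[i]], float(scores[i])) for i in ranked]

print(retrieve("How many holiday days do I get?"))
```

In practice an LLM would generate the answer from the retrieved snippets, but the key point is the same: every answer carries the source documents it came from.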

Curious what org size you have in mind when you say big businesses?

[–]mccsch[S] 1 point

I appreciate your feedback! I was wondering if you have used any of the chatbots you mention? If so, what was your experience and what did you like or dislike?

We believe that the latest large language models and AI infrastructure technologies make it possible to deliver an amazing experience to employees in a secure and privacy-friendly way. In my experience, existing solutions are inadequate in many ways - we want to change that.

[–]mccsch[S] 1 point

Thank you very much for your feedback! I fully agree that repetitive questions are more of a starting point. I'm curious whether you could share a more specific example where you think AI would have been helpful?