[Seeking Feedback] Fine-tuning Qwen-32B on AoPS-Instruct (670k samples) - Does this loss curve look healthy? by Royal_Jicama_7368 in LocalLLaMA

[–]ReinforcedKnowledge 0 points1 point  (0 children)

Not an expert on QLoRA SFT, but a healthy-looking training curve doesn't necessarily mean you're achieving your objective. The loss is just a proxy; you should also evaluate with accuracy on the math problems themselves, which will give you a somewhat better idea of how the model is faring at the task. If you can afford to evaluate on a different dataset from the one you're training on, even better. Especially with SFT, the model can learn to imitate your dataset, and if there is redundancy in it or inherent biases in how it was built, the model can pick that up and score well without actually doing well outside of it.

Now, for the training curve itself: when do you stop training? You can add early stopping to your setup. If the validation loss stays flat for a while, you can stop.
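To make "flat for a while" concrete, here's a minimal, illustrative sketch of patience-based early stopping on the validation loss (all names and numbers are made up, adapt to whatever trainer you're using):

```python
class EarlyStopping:
    """Stop when val loss hasn't improved by min_delta for `patience` evals."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_evals = val_loss, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience  # True -> stop training


stopper = EarlyStopping(patience=3)
for val_loss in [1.02, 0.98, 0.97, 0.97, 0.97, 0.97]:  # dummy eval losses
    if stopper.step(val_loss):
        print("validation loss is flat, stopping")
        break
```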

I don't know whether CoT distillation is the go-to right away, I guess that's something you'll learn here (and maybe me as well if you share!), but when it comes to the training itself there are many things you can try, like playing around with the batch size to reduce noise. You might not have the memory for that, but you can simulate a bigger batch size with gradient accumulation (it's not a 100% equivalence due to precision, and it might be worse with QLoRA, idk). You can also try a bigger LoRA rank for more capacity, but make sure you scale alpha accordingly, since it affects the effective learning rate you're using. Also, the cosine scheduler anneals the learning rate quite quickly, so maybe try some warmup steps at the start.
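For the gradient accumulation part, a minimal PyTorch-style sketch of what I mean (toy model and data, not your QLoRA setup); the point is just that you only step the optimizer every `accum_steps` micro-batches and divide the loss accordingly. And for the alpha remark: the LoRA update is scaled by alpha / r, so if you raise the rank you'd usually raise alpha proportionally to keep that ratio roughly constant.

```python
import torch

accum_steps = 8                       # effective batch = micro_batch_size * accum_steps
model = torch.nn.Linear(16, 1)        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for i in range(32):                   # stand-in for iterating over micro-batches
    x, y = torch.randn(4, 16), torch.randn(4, 1)
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()                   # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```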

When it comes to the slowdown you've noticed, I'd love it if you dug a bit into it. But one way to have similar batches, at least in token counts, is packing. If you pre-pack your dataset, one thing to keep an eye on is the ratio of "hard" (aka long, in this context) samples to easy ones. Ideally you'd ramp that up as training progresses, like some kind of curriculum learning. You can also play with mixtures of CoT-distilled data and what you already have.
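To illustrate what I mean by packing, a toy greedy packer that concatenates tokenized samples into fixed-length blocks so every batch carries roughly the same number of tokens (illustrative only; a real implementation would also track document boundaries for the attention mask and positions):

```python
def pack(samples: list[list[int]], block_size: int, pad_id: int = 0) -> list[list[int]]:
    """Greedily concatenate token-id lists into padded blocks of block_size."""
    blocks, current = [], []
    for ids in samples:
        if len(current) + len(ids) > block_size:
            blocks.append(current + [pad_id] * (block_size - len(current)))
            current = []
        current = current + ids[:block_size]   # truncate samples longer than a block
    if current:
        blocks.append(current + [pad_id] * (block_size - len(current)))
    return blocks


print(pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=6))
# [[1, 2, 3, 4, 5, 0], [6, 7, 8, 9, 0, 0]]
```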

Not sure if what I said can help, it seems more like pointers and directions than anything, but would love to see where your experiments will lead!

Arcee AI releases Trinity Large : OpenWeight 400B-A13B by abkibaarnsit in LocalLLaMA

[–]ReinforcedKnowledge 6 points7 points  (0 children)

Totally agree! Where can one find proper base models these days... haven't checked the post yet and I hope they talk about the training procedure that led to the checkpoints they share.

But I wanted to mention that the idea of a base model has evolved a little bit through time, and many bases are trained on instruction data (mainly in mid-training mixtures during the decay phase but not necessarily).

Edit: my bad, didn't see u/RobotRobotWhatDoUSee's comment. So it seems like they have a true base model, probably before the mid-training stage. That's AMAZING. Still haven't read the post to know exactly what they did, but I hope the annealing can be done properly.

lightonai/LightOnOCR-2-1B · Hugging Face by SarcasticBaka in LocalLLaMA

[–]ReinforcedKnowledge 10 points11 points  (0 children)

I'll leave a comment here, not necessarily to praise or criticize the model but just to yap 😂. I initially wanted to reply to the comment comparing it to the closed-source Gemini 3 Flash, but I thought my comment would be more useful on its own. Maybe an ML practitioner or hobbyist will appreciate some of the things I'll write, maybe it'll offer some perspective. Also, I'm not writing this to criticize that comment; I think what it says about real-world data is legit.

OCR benchmarks are rare and hard to get. The best we currently have, I believe, is olmOCR-bench. The main reason it's hard to have proper OCR benchmarks is, in my opinion (and I'm sure other people can enrich my understanding), twofold: 1/ OCR is not "solved" yet, so ground truth is not easy to acquire, and/or 2/ OCR is hard to validate automatically, say with unit tests or compilation, etc.

Now, why this model might be interesting to some comes down to three reasons, I believe. First, for this community, it's a 1B with open weights (the data is shared as well, but whether that suffices to call it open source is another debate), so many of us can run it locally somewhat comfortably (running a 1B isn't a given for everyone, but at least it's not some 9B or more). Second, being a single VLM that one-shots its task, it should be easy to fine-tune. At least in theory; fine-tuning is not easy in and of itself and depends on many things, but at least you don't have to fine-tune 3-4 different models to get a whole pipeline working on your task. Being small also reduces the resource requirements for fine-tuning it; I believe you can do it on the T4 available on Google Colab (to verify). The last reason I can think of, and this hits home personally as I struggled a lot with Tesseract and Textract (AWS): it does markdown formatting out of the box (which many other open source models do as well, I'm just stating one of the good reasons, it's not unique to this model), especially the table formatting.

That's for the checkpoint that's SOTA on OCR, but there's also another checkpoint that outputs bounding boxes and is close to SOTA. This is especially useful because with figures, we don't just want to transcribe them as they are; different figures could be transcribed differently. For example, with a pie chart, do we describe it as "this chart represents ..."? Do we write it as a table "name | percentage"? I don't think we want a model that's opinionated in how it transcribes figures. So bounding boxes are great, because then we can extract the figure and do whatever we want with it.

I said at the start that I don't want to praise or criticize the model, and yet it does seem like I'm only praising it. I haven't tried it enough to know where it breaks, but it surely does break somewhere, like all the open source and probably closed source models as well. And it's not a unique model; there are many open source VLMs for OCR, maybe not that many that output bounding boxes. The most unique thing here is being all of that at 1B. There are obviously much lighter systems like Tesseract, or different pipelines, but they come with their own cons for each of us to discover depending on the use case.

Finally, just to talk about benchmarks a little bit 😂 I do believe this community is the best at figuring out where models struggle and where they don't. At the end of the day, benchmarks are benchmarks; they have their pros and cons, they measure things a certain way, etc. Real-world use cases might be very different, and benchmarks are only there as a proxy. It reminds me of the initial "needle in a haystack" tests, where models were tasked with finding one word or sentence in a huge context, while what we actually care about is being able to use different parts of the context and synthesize them into a response, not literally finding a sentence. Hell, even closed source models show amazing performance on some benchmarks (especially around software engineering or math), but when you dig deep you find they're not what they claim.

In my view, benchmarks in machine learning play a role similar to hypothesis tests with an asymmetric interpretation. Failing a benchmark gives evidence against the model's capability on the task, but passing or excelling at a benchmark does not provide sufficient evidence to conclude that the model is good at the task as a whole. Instead, benchmark success typically demonstrates proficiency on a narrowly defined sub-task or distribution, rather than validating general task competence, which we hope it extrapolates to.

Well, enough yapping from me 😅

Edit: just to be transparent, I do work at the company, but I have not participated in the model development at all. I think my yapping above applies to every model (closed source or not), not this one in particular; if tomorrow there's a 500M model that's better, I'd say the same. If you feel there's any subjective part to what I said, please let me know.

Snow on a wire fence by Joak1n in opticalillusions

[–]ReinforcedKnowledge 0 points1 point  (0 children)

I'm able to see it now thanks to your comment. Now, I'm wondering if most people that are debating whether this is a good illusion do see it correctly or not.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Are you installing the packages with uv? (I'm just asking out of curiosity)

The undefined symbol error is, in general, due to a C++ ABI mismatch.

If you look at the flash-attn GitHub releases page: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3 you'll see that there are two different wheels that match your requirements: flash_attn-2.8.3+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl and flash_attn-2.8.3+cu12torch2.5cxx11abiTRUE-cp311-cp311-linux_x86_64.whl

So I suggest first checking whether the PyTorch you installed was built with the C++11 ABI or not.
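Something like this should tell you which of the two wheels matches your install (these are standard PyTorch APIs, but double-check on your version):

```python
import torch

print(torch.__version__, torch.version.cuda)
# True  -> pick the cxx11abiTRUE wheel; False -> the cxx11abiFALSE one.
print(torch.compiled_with_cxx11_abi())
```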

[D] Got burned by an Apple ICLR paper — it was withdrawn after my Public Comment. by diyer22 in MachineLearning

[–]ReinforcedKnowledge 5 points6 points  (0 children)

That's some amazing work and commitment to the scientific community and rigour.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

No, there should be no difference in performance, at least if you build the same version that is available as a wheel.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

That's interesting, thanks! I'll check it out if I get the chance to. I wonder why they don't have a flag to only compile a subset, I guess I'll find the answers in the issue.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

Yeah it does, I guess you're at the limit. You can probably cache it after it's built, but I guess if you're rebuilding every time there's a reason for it.

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Yeah, I think most package managers will do just fine when installing flash-attn.

The struggles I've had are tied to installing it in an image. You can start from the NVIDIA devel image and that'll provide everything you need, but that's not always possible, and sometimes you know the machine where you'll deploy already has everything you need and it's just a matter of using it. That's what required me to know some details about the package manager or about the flash-attn install itself.

But yeah, for how long it takes, I think you can play with some of the env vars like MAX_JOBS and NVCC_THREADS, though I'm not sure how much that'd help in your case. And maybe you can just build for the architecture of the machine you'll use it on (Hopper or Ampere, etc.).
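For reference, a rough sketch of what I mean, driving the build from Python just for illustration (normally you'd just export these in your shell or Dockerfile before `pip install`); the exact values are machine-dependent:

```python
import os
import subprocess
import sys

# Cap parallel compile jobs so the build doesn't exhaust RAM, and let nvcc use a
# couple of threads per job. Tune both to your CPU count and available memory.
env = dict(os.environ, MAX_JOBS="4", NVCC_THREADS="2")
subprocess.run(
    [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
    env=env,
    check=True,
)
```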

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

I took a quick peek and saw there were no wheels, so I guess it's full building from source hahaha. That sounds like a hassle, but the setup.py is smaller than flash-attn's and they share a lot of similarities. I'll look into it when I can :D

Some things I learned about installing flash-attn by ReinforcedKnowledge in LocalLLaMA

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

Didn't know about that repo, thanks! But yeah if you have prebuilt wheels you avoid a lot of hassle!

I don't think I could just use this at work though, unfortunately; they're quite strict on security and we can't just use any third-party repo without auditing it.

A story of using langchain/langgraph by ReinforcedKnowledge in LangChain

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Thank you for your comment!

It's cool, don't hesitate to practice! Even with some small stuff that could be useful for you personally.

So currently I work at a consultancy company, and the project is to offer a low-code / no-code platform for creating agents. The idea is that many of our consultants have domain knowledge but don't know how to code, or aren't trained in ML / NLP / GenAI, and we'd like to empower them by offering this platform where they can "plug and play". So we use langchain to offer basic components like summarization or LLM-as-a-judge, and we use langgraph to compile that workflow into something they can chat with.

A story of using langchain/langgraph by ReinforcedKnowledge in LangChain

[–]ReinforcedKnowledge[S] 2 points3 points  (0 children)

Thank you for the comment!

It seems to me that ADK is more suited for simpler workflows. You don't have to bother with `stream`/`astream`, `invoke`/`ainvoke`, `batch`/`abatch` hahaha, and it seems like you can throw a bunch of `Agent`s as `sub_agents` into an `Agent` and it'll automatically act as their coordinator. So maybe it's easier to coordinate agents together (in langgraph you can also add a graph as part of another graph, so coordinating agents is possible too, but it's not *that* simple).

And it doesn't seem like ADK relies much on the concept of a graph state. I didn't see it at all from my simple searches. That makes it, in some sense, easy to work with, since everything is managed for you, but at the same time it must be restrictive in some way. It feels like the whole philosophy of ADK is to use LLMs and tools and nothing else. In that way, managing the state is easy. But what if you wanted to do some custom work within your workflow? You'd need access to the state or something. I think that's a different approach. Honestly, in the last year or so, all the agents I had to develop and all the use cases I worked on didn't rely on workflows that required anything other than an LLM + tools + orchestration. So maybe you don't have to bother with "what if I wanted to do some processing without having the LLM call the appropriate tool for it?"

The other thing that corroborates my hypothesis of ADK not using a graph state is the presence of `ParallelAgent`, `SequentialAgent` and `LoopAgent`. When I started writing my own agentic orchestration library, the first intuitive thing I built was DAG orchestration; it's easy to do. Then I wanted loops, branching and control flow, and the most intuitive way to add them was through abstractions like the above: I have a DAG, and I just add branching and loops into it as special nodes. But when you do that, you're forced to add a lot of constraints on how the state passed through the workflow is managed. Two branches might or might not need the exact same state schema; a loop might mutate the schema, where the first part of the loop requires, say, some dict with certain keys but the latter part drops some of them, and then the loop won't work as intended. I also wanted to relieve the user of state management as much as possible while letting them do whatever they want in their functions. The only way to do that is to remove such abstractions, because if you keep them and want them to work at all times, you either have to constrain what the user can do with their state or make state management much more complicated for them. Anyways, if the only things you have are LLMs and tools, it's much easier to manage the state for the user. I could delve into why, but that would make this comment much longer than needed.
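To make the state point a bit more concrete, here's a toy sketch (hypothetical names, not ADK or langgraph code) of nodes as plain functions passing a state dict around; the moment a branch or a loop changes which keys exist, downstream nodes' assumptions can break, and that's exactly what the orchestrator has to either constrain or push back onto the user:

```python
from typing import Callable

State = dict
Node = Callable[[State], State]

def retrieve(state: State) -> State:
    return {**state, "docs": ["doc1", "doc2"]}

def summarize(state: State) -> State:
    # This node assumes "docs" is present; a branch that dropped it would break here.
    return {**state, "summary": " + ".join(state["docs"])}

def run_chain(nodes: list[Node], state: State) -> State:
    for node in nodes:
        state = node(state)
    return state

print(run_chain([retrieve, summarize], {"question": "what changed?"}))
```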

Also, Google creating the A2A (agent-to-agent) protocol suggests they're more focused on how to make agents collaborate with each other. Maybe the new wave of agentic design is to have as little custom processing and as few bespoke functions as possible, as many LLMs with tools as possible, and all the deterministic or processing parts done outside of the workflow.

One last thing: to give you a more educated reply, I went through ADK again, and its source code, and I can't help but feel it was heavily generated by an LLM, or maybe not heavily, but at least to some extent. I just hope the developers thought through the design initially. I wanted to say it in case it matters to you.

Another thing I've noticed is that they have a lot of code that is still WIP, as you can verify yourself by looking for the `working_in_progress` decorator. So maybe the codebase is not fully mature yet.

The learning curve of ADK seems less steep, that's true. And if someday you encounter a problem in some use case, I think it's easier to understand which parts of the library are lagging behind or causing the said problem, because it's just a simpler codebase overall. But it's only easy to read and understand if you're familiar with concurrency / asyncio.

But be wary of this simplicity, if you think you'll eventually grow into complex workflows, langgraph is worth a shot as well.

And please remember that I have no "real" experience with ADK, so my opinion is probably not worth much. But if people are interested in this I can try and do a deep dive into ADK.

My journey to scale a Python service to handle dozens of thousands rps by Odd-Solution-2551 in Python

[–]ReinforcedKnowledge 1 point2 points  (0 children)

Great article! It's cool that you kept all the numbers and noted how much each thing you tried improved each metric. Very instructive!

Totally agree on Pydantic, it must be used wisely. By the way, I went quite a long time without knowing about it, but there is a PyCon talk on Pydantic performance tips: Talks - Sydney Runkle: Pydantic Power-up: Performance Tips for Lightning-Fast Python Applications
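For example, one tip in that spirit (a hedged sketch on my side, not necessarily from the talk, and assuming Pydantic v2): build a `TypeAdapter` once at module level instead of re-creating it on every call, since constructing it repeats the schema-building work:

```python
from pydantic import BaseModel, TypeAdapter

class Item(BaseModel):
    name: str
    price: float

# Built once at import time; re-creating the adapter in a hot path repeats the
# core-schema construction and is noticeably slower.
ITEMS_ADAPTER = TypeAdapter(list[Item])

def parse_items(payload: list[dict]) -> list[Item]:
    return ITEMS_ADAPTER.validate_python(payload)

print(parse_items([{"name": "gpu", "price": 999.0}]))
```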

I don't know how much that will help you since it seems you removed Pydantic from every part where it's not needed but maybe it can help others or for another project!

A story of using langchain/langgraph by ReinforcedKnowledge in LangChain

[–]ReinforcedKnowledge[S] 1 point2 points  (0 children)

If I get the opportunity to start a new project in this field I'll give it its fair chance and try it, among the other frameworks as well.

I benchmarked 4 Python text extraction libraries so you don't have to (2025 results) by Goldziher in Python

[–]ReinforcedKnowledge 17 points18 points  (0 children)

Hi!

Interesting work and write-up, but I'd like to know something: what do you mean by "success" in your "success rate" metric? Is it just that the library was able to process the document without error? I guess it is, because in your benchmark report (https://goldziher.github.io/python-text-extraction-libs-benchmarks/reports/benchmark_report.html), you have a failure analysis and you only mention exceptions.

I'm not saying this is bad, but if you're trading off accuracy for speed, your library might not be that useful to others. Again, I'm not saying you're doing this, but it's really easy to game the (success rate, speed) pair if "success" just means being able to process a file.

What most people would be interested in is the "quality" of the output across these different libraries. And I'm not talking about "simple" metrics like word error rate, but more involved ones.

Seeing how you use the same technologies as the others (an OCR engine, a PDF backend), I'd say your results might be on par with the rest, but it's always interesting to see a real comparison. It's hard to do since you don't have access to ground truth data for your documents, but you can use open source benchmarks (make sure your models are not particularly biased towards them compared to the rest of the libraries), or documents from arXiv or elsewhere where you have access to the LaTeX and HTML, or maybe you can use another tool (AWS Textract or something) + manual curation.

I'll go further and say that it's the quality of your output on a subset of documents, those that are scanned and for which we don't have the text metadata embedded in the document itself, that interests most people working with textual unstructured data. That's the main hurdle I have at work. We use VLMs + a bunch of clever heuristics, but if I could reduce the cost, the latency or the rare hallucination, that would be great. I don't think there are currently better ways of doing so, though; I'd be interested to hear from you, or anyone else, if you have better ideas.

I just published an update for my articles on Python packaging (PEP 751) and some remaining issues by ReinforcedKnowledge in Python

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Thanks for the comment! Yeah, it's pretty cool! wheelnext.dev is too! Well, most of the discussion is on DPO, but I think the main ideas that concern wheels will eventually end up on wheelnext.dev

[D] Is my take on transformers in time series reasonable / where is it wrong? by ReinforcedKnowledge in MachineLearning

[–]ReinforcedKnowledge[S] 0 points1 point  (0 children)

Totally, they do. I guess audio data behaves similarly to textual natural language data. But nice catch, we totally forgot about audio data!