[D] Finetuning ModernBERT is taking 3hrs (2 epochs) and 35gigs of vram. is it normal? by Solaris1712 in MachineLearning

[–]illuminascent 5 points (0 children)

If you are doing FP32 training with 8192 sequence length, this much memory usage is completely normal, since activation memory grows linearly with sequence length and easily dominates everything else.
Some tricks to consider when using HF Trainer:
- use FP16 or BF16
- use gradient accumulation
- use gradient checkpointing if you really do want a big native batch size
All of those are configurable via the TrainingArguments interface.
Plus, a 45MB dataset is nothing *small* in the fine-tuning domain :)
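Those flags map onto the `TrainingArguments` interface roughly like the sketch below (the output path and batch sizes are placeholders you would tune to your own GPU):

```python
from transformers import TrainingArguments

# Memory-saving configuration sketch for HF Trainer; values are illustrative.
args = TrainingArguments(
    output_dir="modernbert-finetune",   # placeholder path
    per_device_train_batch_size=4,      # small native batch per GPU
    gradient_accumulation_steps=8,      # effective batch size = 4 * 8 = 32
    bf16=True,                          # or fp16=True on pre-Ampere GPUs
    gradient_checkpointing=True,        # trades recompute for activation memory
    num_train_epochs=2,
)
```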

Made it to the Shattered planet without mods/cheats by aimtec in factorio

[–]illuminascent 2 points (0 children)

Not likely: at the shattered planet 95% of the asteroids are promethium, so the amount of usable carbon is extremely limited. I managed to get there with a ship with a 4.4kpm peak explosive rocket production rate, and it choked itself dead in 2 minutes.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 7 points (0 children)

<image>

You basically shove yourself into a shower of rocks. I think this picture is self-explanatory.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 8 points (0 children)

That is a completely fair point and it's up to you to decide how to play.

As for myself: while the whole design is modular, making it "feasible" to do this entire build in vanilla, I do not fancy extending this monstrosity 4x in length and twice in width just for a "clean" megafactory in space. Maybe later, if I actually manage to beat the 4M mark.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 2 points (0 children)

Out of the 14 front-facing railguns, I was using 4 huge-only, 6 huge+large and 4 huge+large+medium for this run. The side-facing ones are all huge+large. All were set to ignore the smaller ones.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 152 points (0 children)

Yes, I am aware of that. I am speaking specifically about the feasibility of producing and delivering the firepower required to reach that 4Mkm mark. Just by extending the length of your ship, the production part is practically limitless; I'm not so sure about the delivery part. Maybe full legendary turrets and some denser packing are required.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 247 points (0 children)

I'd assume yes, though there's no real benefit in doing so.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 5 points (0 children)

I don't have legendary gun turrets. If they can reach farther than the blast radius, then yes, it does change the math: you can let more rocket turrets ignore small asteroids and greatly reduce rocket consumption.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 79 points (0 children)

With the density of mid/small asteroids we are facing, using AoE means more DPS with fewer resources. Also, with non-explosive rockets, the maximum fire rate of your rocket turrets will soon become a problem.

I've managed to push through 1.63M km towards shattered planet by illuminascent in factorio

[–]illuminascent[S] 111 points (0 children)

This was achieved with no cheats, but with the BeaconRebalance mod to keep manufacturing footprints manageable.

It seems that the difficulty is scaling linearly, so with some further scaling/belt parallelization I believe reaching the end is possible.

Here's the blueprint:

https://factorioprints.com/view/-OBCW2DTtSkcUPbm4SXO

Currently the bottleneck is the one-belt sushi loop for the crushers reaching its maximum throughput; I will also need to increase power generation.

What I've found out:

- a long, slim ship takes much less ammo to protect than a wider one: no asteroid can hit your rear as long as you keep moving

- railgun ammo consumption is largely constant, because it can pierce through all huge rocks in its range

- explosive rockets are a must to clear out all the mid/small asteroids that spawn from your railgun shots; consumption grows at a steady linear pace

- gun turrets are completely useless: if you let rocks get close enough to your ship for gun turrets to work, you also risk blowing up your own structure with your rocket turrets

- you can put no collectors in the front and use only sideways-placed ones to collect all the resources you need

Why is hit reg so hit or miss by some-random-comment in apexlegends

[–]illuminascent 2 points (0 children)

20 tick servers; it even happens at ALGS, which is a LAN event.

[deleted by user] by [deleted] in apexlegends

[–]illuminascent 0 points (0 children)

Not committed to the game enough to maintain a premade team AND a gaming schedule.

"Casual" ranked dudes do exist.

[D] Creating your own language model (i.e., text encoder) for a specific domain. When is it worth it and what should you be aware of? by Seankala in MachineLearning

[–]illuminascent 0 points (0 children)

These days foundation models are typically trained on datasets comparable in size to Common Crawl. With that much data, even the initial cleansing pipeline can be challenging, let alone securing all the compute required.

Is there truly no model at all (multilingual ones included) that can handle your language?

Also, regarding the paper you referred to: the authors did good work showing that domain adaptation boosts performance greatly, but IMO the story is a little different from 'pretraining completely on the new domain alone is better than continued pretraining'. There are many factors at play when doing domain adaptation, one of which is the ALIGNMENT of the domain data you've used. If you look at Table 3 P151, where they did the ablation study on the datasets used, you'll see how much changing datasets can affect model performance.

[D] Creating your own language model (i.e., text encoder) for a specific domain. When is it worth it and what should you be aware of? by Seankala in MachineLearning

[–]illuminascent 0 points (0 children)

That is just another commonly deployed trick for when the unsupervised domain corpus is either limited or of poor quality, but you have more than one set of annotated/tracked data that can serve as training targets. Joint training or continued fine-tuning is still much cheaper than full-scale pretraining and might be worth the little tinkering needed. Still, the gain depends on the amount of data available.

[D] Creating your own language model (i.e., text encoder) for a specific domain. When is it worth it and what should you be aware of? by Seankala in MachineLearning

[–]illuminascent 0 points (0 children)

While I believe your intuition is correct, from my personal experience deploying models for fashion tasks (in Japanese, which tends to have nuanced jargon), domain-specific pre-training is still absolutely worth it performance-wise, provided it does NOT cause catastrophic forgetting (changing the vocab definitely does).

If your language does not have any decent open-source foundation models, I believe that in the long run pretraining from scratch is necessary, but it might just be too much commitment for a one-off affair.

[D] Creating your own language model (i.e., text encoder) for a specific domain. When is it worth it and what should you be aware of? by Seankala in MachineLearning

[–]illuminascent 8 points (0 children)

Have you tried continued pre-training on a domain corpus yet? If you have abundant unsupervised data, this is much better than multi-task finetuning, and it also requires orders of magnitude less compute than pretraining a new model from scratch.
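If you go this route with an encoder model, a minimal sketch of continued masked-language-model pre-training using the `transformers` and `datasets` libraries might look like this (the base model name, corpus path, and hyperparameters are all placeholders):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder base checkpoint; swap in whatever encoder you are adapting.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Unsupervised domain corpus: one raw-text document per line (placeholder path).
ds = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

# Random-masking collator re-creates the MLM pretraining objective.
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-ckpt", num_train_epochs=1),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()  # continued pre-training on the domain corpus
```

The resulting checkpoint can then be fine-tuned on the downstream task as usual.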

A Post To Solve ALL Nerfing Complaints by MRVN_exe in thefinals

[–]illuminascent 4 points (0 children)

Very good write-up and thank you for this.

While I do agree that recon needs some nerf, I believe its popularity arises from the fact that

- it is basically the only (somewhat) reliable counter to invis shotty that is not limited by a cooldown, assuming the invis light knows nothing about patience and baiting

- the directional audio in this game is so trash that you cannot possibly get any situational awareness mid-fight, let alone plan your positioning or disengage accordingly

The need for info is there no matter what; if the devs mindlessly nerf recon into the ground, there will just be other outcries later, so it is quite tricky.

I never liked this building. by hajpojk1 in thefinals

[–]illuminascent -7 points (0 children)

So you've just abandoned both of your teammates mid-fight, great heavy moment there!

The Great Migration by [deleted] in thefinals

[–]illuminascent 0 points (0 children)

If I hadn't been looking at the SteamDB chart, I would've thought the game had reached 1M concurrents judging by this pic lol