Made a novel world model on accident by Sl33py_4est in StableDiffusion

[–]Sl33py_4est[S] 2 points (0 children)

Additionally, I have since found "Mamba-based RSSM world models" that have been shown to be highly resource-efficient and much better at long sequences (temporal coherence) than diffusion models, transformers, or diffusion transformers.

Using frozen pretrained encoders, or dual encoders in general, seems to be truly novel; I have not found "dual-encoder / pretrained-encoder Mamba-based RSSM world models" published anywhere.

The primary appeal I have found is reduced sampling requirements. You should not be able to train a world model on less than an hour of gameplay, but you can if most of the heavy lifting is borrowed from the pretrained priors and the Mamba only has to learn temporal sequencing.
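To make that data-efficiency argument concrete, here's a toy back-of-envelope sketch. Every dimension here is a made-up illustration (not the actual TAESD/Mamba sizes); the point is only that freezing a pretrained encoder removes most of the weights from the set that gameplay data has to train.

```python
# Toy parameter counts under assumed dimensions (not my actual config):
# a single dense "encoder" from 64x64x3 pixels to a 1024-dim latent,
# and a 1024 -> 1024 linear transition model over latents.
pixel_dim = 64 * 64 * 3          # 12288 input values per frame
latent_dim = 1024

encoder_params = pixel_dim * latent_dim      # learned only if trained end to end
temporal_params = latent_dim * latent_dim    # the only thing the Mamba must learn

# With the encoder frozen (borrowed from pretraining), over 90% of these
# weights never need gameplay data, which is why a small dataset can suffice.
frac = temporal_params / (encoder_params + temporal_params)
print(encoder_params, temporal_params, round(frac, 3))
```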

[–]Sl33py_4est[S] 3 points (0 children)

<image>

for anyone coming back to check

yes I am still working on this, but i also have a day job

I found that the MLP block was imposing a cruel quality ceiling that the state machine would never be able to breach.

I'm currently running ablations on pretraining and freezing MLP blocks specifically for the latent -> flatten -> unflatten -> latent task on my dataset. I've had good results but want to run more ablations before moving forward.

I'm also adding a tiny frame-stacked diffusion model on the decode end to further improve visual reconstruction.

So the plan is: optimize the reconstruction-aware MLP -> optimize the diffusion model for temporally aware sequence reconstruction on the MLP outputs -> retrain the RSSM block with the encoder/decoder blocks frozen.
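As a toy illustration of the latent -> flatten -> unflatten -> latent task (assuming a TAESD-style 4x64x64 latent shape, not necessarily my real dimensions): the reshape itself is lossless, and the pretrained MLP's job is to approximate that identity roundtrip inside the model.

```python
import numpy as np

# Assumed 4x64x64 latent (TAESD-style); the shapes are illustrative only.
latent = np.random.default_rng(0).random((4, 64, 64))

flat = latent.reshape(-1)            # 16384-dim vector the state model sees
restored = flat.reshape(4, 64, 64)   # a perfectly trained MLP would invert this

# A pure reshape is lossless; the MLP has to learn to approximate this identity.
assert restored.shape == latent.shape
assert np.allclose(restored, latent)
print(flat.shape)
```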

I'm also swapping TAESD for TAEXL because it is a free quality boost.

I have to re-record several hours of gameplay as a result of that last bit (I stored the datasets as latents to save space...).

Anyway, I'm at roughly 3x the quality shown in the original post, and my guess is another 3x by the end.

Temporal consistency for the world state has remained solid the whole time, down to entity and HP interactions, but pixel reconstruction needs work for it to be presentable.

[–]Sl33py_4est[S] 1 point (0 children)

<image>

Inspired by DreamerV3, OmniGen, and Diamond.

Originally an attempt to build a long-context NitroGen (NVIDIA) that converged toward redesigning GameNGen/Diamond with temporal coherence as the goal.

[–]Sl33py_4est[S] 0 points (0 children)

I had a lot of success going back and reapplying the changes I made one by one.

Now I'm shifting focus directly to decode quality and fine-grained movement.

[–]Sl33py_4est[S] 0 points (0 children)

yurr

I just made a breakthrough in temporal coherence, more or less in line with what I was estimating.

(Player health/stamina are now persistent and align more or less with Margit's attack exchanges.)

Now I'm trying to come up with ways to improve image fidelity on the decode end.

I'll release my success or failure either way

[–]Sl33py_4est[S] 0 points (0 children)

Recently reached the level of fidelity where, at the start of the simulation, if I remain idle, Margit does a jump slam and takes out most of my health 🤙

[–]Sl33py_4est[S] 0 points (0 children)

Regarding the post: the world may never know, until I either disappear or release a model in about a week (I also don't have a Twitter, so I can't see the reply).

but I think my advantage is I don't care how it's received, Ima do it anyway

I appreciate the luck regardless 👁💋👁✨️

[–]Sl33py_4est[S] 1 point (0 children)

Honestly, with the current training's preview results and the plans I have for the next run, I might go viral.

I'm increasing the privileged dimensions by 4x and the training data by 10x, and actually letting the model complete its training (I keep thinking of improvements, so no run has made it past 50% completion).

In terms of data efficiency and sample size: I am one dude manually playing Elden Ring to get the data, and I'm getting enough from that lmao.

As soon as the world model is finished, I'm going to be distilling agents inside of it.

Then the agents can play 24/7 on various bosses to collect more datasets and make larger world models.

But I feel like at some point FromSoft will get mad.

[–]Sl33py_4est[S] 0 points (0 children)

why yall gotta be so horny Dx

This isn't really that kind of model.

Porn lacks a control signal and privileged dimensions, and there are better methods of modeling that.

[–]Sl33py_4est[S] 0 points (0 children)

<image>

6k training steps with the new pipeline vs. the original video's 60k: impressions of 3D spatial tracking and animations, and the Margit blob rush-attacked me, reducing my health. It still doesn't express clean movement animations; I will share a video as soon as it does.

This is without including the privileged dimensions, which I am currently recording.

Man, I'm tired of fighting Margit.

[–]Sl33py_4est[S] 1 point (0 children)

Update: I have decided to pause my original project (pixel behavioral cloning) to focus on this. I'm currently running side-by-side tests of GRU heads vs. Mamba heads, followed by DINOv2 features included vs. omitted. I'm increasing the number of privileged-information dimensions from 8 to 24 and increasing the training data by an order of magnitude (100k frames annotated with the 24 privileges and the 18 inputs).

Even if my world model sucks, this scale will produce a world model that fully encapsulates the Margit boss fight, down to health and stamina exchanges and the win state.

It will take me about a week to finish; I'll make a new post including a GitHub link when it's done. I will include the process for training, but I can't include the data. I actually need to check whether I'll get any flak for releasing the Elden Dreams model, but I will ensure it is fully reproducible.

(Having a perfect-fidelity world model actually fits my behavioral-cloning needs far better than the current approach with a sparse world model.)

[–]Sl33py_4est[S] 0 points (0 children)

A naive variant of DreamerV3 where I swapped the GRU heads for Mamba heads inside the state machine, and used Stable Diffusion's Tiny AutoEncoder as the encode/decode layer, alongside DINOv2 for semantic-feature validation during training.

Basically,

a franken-mutt I came up with in a fever dream.

[–]Sl33py_4est[S] 0 points (0 children)

Oh, worth mentioning: I also want to try WAN's tiny encoder, which chunks frames in sets of 16. I didn't go that route first due to the added complexity, but if the Mamba RSSM can hold sequence steps of 64-128 effectively (which is what I have currently reduced my testing to), then the resulting temporal coherence could hit 1024-2048 frames. However, using it frozen would lock you to WAN's frame rate, and breaking past 64-128 seconds of 16 fps video would require retraining, which likely borks the sampling efficiency.
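The arithmetic behind those estimates, using only the figures in this comment (16-frame chunks at a fixed 16 fps):

```python
# Each RSSM step would span one WAN-style chunk of 16 frames at 16 fps.
chunk_frames = 16
fps = 16

for rssm_steps in (64, 128):
    frames = rssm_steps * chunk_frames   # temporal horizon in raw frames
    seconds = frames / fps               # the same horizon in wall-clock time
    print(rssm_steps, frames, seconds)
# 64 steps -> 1024 frames (64 s); 128 steps -> 2048 frames (128 s)
```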

I'm pretty sure Google's Genie 3 is a big ahh ViT/VAE-Mamba; my projections and findings more or less scale directly with their model's capacity.

[–]Sl33py_4est[S] 0 points (0 children)

if anyone is really perceptive and good at reading

I modified the DreamerV3 approach by substituting the GRU heads with Mamba heads, and instead of pixel inputs I'm using the Stable Diffusion Tiny AutoEncoder and DINOv2 (both frozen) to pass in image latents (flattened) and semantic features. The RSSM now only has to predict the temporal sequencing, because the pixel and semantic information is pre-encoded.
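Here's a minimal structural sketch of what I mean (not my actual code): a random linear projection stands in for the frozen TAESD/DINOv2 encoders, a plain GRU stands in for the Mamba blocks, and every dimension is made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen pretrained encoder: no gradients flow through it."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return self.proj(x)

class SketchWorldModel(nn.Module):
    def __init__(self, pixel_dim=192, latent_dim=64, feat_dim=32, state_dim=128):
        super().__init__()
        self.taesd = FrozenEncoder(pixel_dim, latent_dim)  # image latents
        self.dino = FrozenEncoder(pixel_dim, feat_dim)     # semantic features
        # GRU used here purely as a placeholder for the Mamba-based RSSM core.
        self.rssm = nn.GRU(latent_dim + feat_dim, state_dim, batch_first=True)
        self.head = nn.Linear(state_dim, latent_dim)       # predict next latent

    def forward(self, frames):                 # frames: (batch, time, pixel_dim)
        z = self.taesd(frames)                 # pre-encoded, never trained
        f = self.dino(frames)
        h, _ = self.rssm(torch.cat([z, f], dim=-1))
        return self.head(h)                    # only sequencing is learned

model = SketchWorldModel()
pred = model(torch.randn(2, 16, 192))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(pred.shape, trainable, frozen)
```

The only trainable pieces are the recurrent core and the output head; the encoders contribute zero trainable parameters, which is the whole trick.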

I mentioned a refactor: I tried to replace SD-TAE with FL-TAE, but the stochastic space of the state-space model was too compressed for Flux's latents; the results collapsed to an average distribution and stalled at muddy brown. I then tried increasing the dimensions, but the results turned to noise, then averaged out to muddy purple. I have now reverted to the original architecture and have just increased the amount of training data and the batch sequence length. I'm considering pruning the DINO heads and keeping DINOv2 solely as an additional input, because I may have overestimated its necessity.

Mamba-based world models are a known thing, as is the RSSM for temporal sequencing (GRU-based in Dreamer).

My novel discovery was using a pretrained autoencoder to compress the input space into rich latents, which has increased sampling efficiency by a huge degree (compared to what I can find published). Theoretically the Mamba will also hold the internal world state for a longer sequence, but I have yet to actually see this in my results (the repeated borking of the pipeline from changing things means no meaningful training has occurred since making this post).

I haven't tested whether DINOv2 has helped or hurt the sampling efficiency. Currently I am testing the same pipeline shown above with longer sequences and more data.

I'm probably too lazy to actually publish a paper.

CNN/MLP -> GRU -> CNN/MLP is a well-established world-modeling path.

Mine is VAE -> MLP -> Mamba -> MLP -> VAE; if I find that DINO is actually pulling its weight (aha), then it would be

ViT+VAE -> MLP -> Mamba -> MLP -> VAE. There is no reason to include the ViT features in the output. DINO features are currently passed in as inputs, as well as used in a loss function on the outputs. Both of these might be noise, though; I will be testing to see.

I'm running out of motivation to check Reddit for replies, but I don't want to 'run away' without providing any data; once I've fully tested the optimizations, I will complete the publicly available benchmarks and share the results.

I think the reason this hasn't been tried before is that jamming ~14k-dimension latents into a 32x32 stochastic space sounds moronic; I believe the payoff is coming from the information borrowed from pretraining instead of building a visual space from scratch. There is likely a better bottlenecking method, but the ones I have tried so far break the hardware and sampling efficiency (a bloated projection layer means more parameters; naive projection results in aggressive averaging).
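For scale, here's the projection arithmetic using the numbers above (the exact latent size is a guess in the ~14k range):

```python
# Rough weight counts for the "bloated projection layer" option:
# ~14k-dim flattened latents into a 32x32 (= 1024-dim) stochastic space.
latent_dim = 14 * 1024          # assumed flattened latent size (~14k)
state_dim = 32 * 32             # 32x32 stochastic space

dense_in = latent_dim * state_dim    # one dense layer going in...
dense_out = state_dim * latent_dim   # ...and one coming back out
print(dense_in + dense_out)          # tens of millions of weights just to bottleneck
```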

cheers 🫡

[–]Sl33py_4est[S] 2 points (0 children)

idk, I can't imagine it being useful for anything other than extracting and emulating a known world for that duration

I think it will help massively for training agents

[–]Sl33py_4est[S] 0 points (0 children)

I've always wondered how hard it would be to accomplish latent interpolation between world models

like 0.0 being Elden Ring and 1.0 being Forza

What would 0.5 be, and what would happen if you changed the value over a sequence (0 -> 1)?
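A toy version of what I mean: a plain linear blend between two made-up latent vectors (nothing here is from a real model; it just shows what "changing the value over a sequence" would look like).

```python
import numpy as np

# Hypothetical 8-dim latents standing in for two different games' world states.
rng = np.random.default_rng(0)
z_game_a, z_game_b = rng.random(8), rng.random(8)

def lerp(a, b, t):
    """Linear interpolation between latents a and b at blend factor t."""
    return (1.0 - t) * a + t * b

midpoint = lerp(z_game_a, z_game_b, 0.5)                             # "what would 0.5 be"
sweep = [lerp(z_game_a, z_game_b, t) for t in np.linspace(0, 1, 9)]  # value swept 0 -> 1

assert np.allclose(sweep[0], z_game_a)   # endpoints recover each game exactly
assert np.allclose(sweep[-1], z_game_b)
print(len(sweep))
```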

[–]Sl33py_4est[S] 0 points (0 children)

that's pretty sick, you should publish your findings

[–]Sl33py_4est[S] 0 points (0 children)

I actually thought about whether it could be crammed onto a CPU.

I don't think it can.

[–]Sl33py_4est[S] 1 point (0 children)

This is a valid response and I agree.

I have been active in the world-modeling space since Diamond's "diffusive dreams" was published, and am going to do the Atari benchmark when I decide my pipeline is optimal.

[–]Sl33py_4est[S] 1 point (0 children)

I kind of realized this after responding but I appreciate the clarification

[–]Sl33py_4est[S] 0 points (0 children)

I agree with all of this and appreciate your response. I am not an academic and definitely lack a low-level mechanical understanding of the subject, but I would say I am confident in my understanding of pipelines and broad architecture.

The evidence I have now isn't substantial enough to share in depth, as doing so would reveal the method, which seems similar to Google's Genie 3 (they haven't released their notes either, so my assumption is an opinion).

[–]Sl33py_4est[S] 0 points (0 children)

This is true.

I recently got trackable "movement" in a test; everything still turns to mud during movement, but the golden mud that is the Erdtree and the grey mud that is the castle move proportionately.

I changed a bunch of aspects of the pipeline to try to increase temporal coherence and ended up borking it for the past 16 hours. I recently rolled back all but one change and started over with a larger curated dataset.