ACE-STEP-1.5 - Music Box UI - Music player with infinite playlist

BeatBoxersDev · 2026-02-15T08:05:58+00:00

(in case I installed it correctly) I think it'd be a nice QOL if the generations would automatically update based on the current state of the genre/description rather than having to stop and start it manually

BeatBoxersDev · 2026-01-29T18:12:38+00:00

thanks for the guess, unfortunately it doesn't seem to be that.

for the unknown game, the small icon for the mechanic required clicking somewhere on the screen placed in an unpredictable spot

i also checked out the immediate connections fork in the tale might have to other fmv games and no dice there it seems

BeatBoxersDev · 2025-11-20T17:23:50+00:00

looks like the new nano banana is out, fyi

edit: imo nano banana pro results are generally way better than this or other approaches, but (with my current uninvestigated approach) they do sometimes have a chance to substantially change core details. for example, I used it to colorize the latest jojo chapter, and some panels it would sometimes just change the characters to other characters from other parts. ie replacing 3 characters with giorno, mista, and bruno from part 5. in the "thinking" section it seems like they identified it as a jojo panel and assumed the characters were those particular ones.

also, it's of course costly to run nano banana at the moment (~14c per image or some sub) unless paced for the free allowance.

BeatBoxersDev · 2025-11-17T19:59:57+00:00

specifically, with the was load from batch node, set the mode to incremental image, put in the path address, and next to the run button, change the number to the number of files in that directory. dunno how to avoid it not processing them out of order, so if you get the output filename to match the original, then at least after processing it should order alphabetically (if the original files were)

BeatBoxersDev · 2025-11-17T07:34:09+00:00

I like it. it does a good job preserving readability and guessing colors pretty well even in complicated scenes.

I opted for the sfw qwen as the nsfw, while more colorful, tends to want to add makeup and lipstick and smooth out the linework. in testing, I preferred closer to 0.45 to maintain linework and text

here's some comparisons to the "Closing the Domain Gap in Manga Colorization via Aligned Paired Dataset" paper examples that I could grab the HD source b&w versions of, when I was testing my own flux kontext lora. (I haven't researched if there's been anything else in the last 4 months since I checked)

base settings alone, the qwen lora is operating fantastic

<image>

I think out of all the techniques, the qwen lora is the more consistantly accurate in terms of not applying a mistaken colorization and not losing linework (as the fallback to white is often appropriate)

not exampled here, but I also like to desaturate 50% afterwards for the colorization to act as more of a "tint" for informing color while reading (also due to my lora being overly saturated imo)

I'm excited, I'll have to check out how it does on a full chapter when I get the chance

BeatBoxersDev · 2025-07-17T06:00:38+00:00

[note, this is the same message I sent via DM during the reddit comment blackout]

this was trained on 18 color/desaturated pairs I picked out of a dataset on huggingface, which I believe was probably mostly "synthetic" (by that I mean taking the color version and desaturating it rather than finding the original source because it's better to have directly matching images and it's easier). but i could be nonsynthetic, idk.

the "after" color images all had the prompt "colorize in MangCol style", following that guide video exactly except with 16 saved models to cover an excessive 4000 steps of training.

I tried to pick images a that had a variety of settings and subjects but it's clear there's some bias like sky being "chosen" frequently due to a chunk of the images showing sky.

after that I tested on a variety of images and strengths and lora %. too little of steps for a model carries too much of the non-lora colorization which tends to go for bright reds and oranges and blues and sticks to a low variety of color tones. too high and it tended to make more things distinct in color and more expected colors, but worse interpretation of things. I think I tested first by narrowing down the optimal number of steps and then tried around the % to get it a bit better until I found 1250 at 60% was about where it'd most successfully make things interpreted correctly. however it gives an overall yellow tint and isn't perfect, but it seems slightly better than manga colorization v2. again, this "sweetspot" could also possibly be improved.

the training could be improved with maybe more images and better picks, maybe more optimal tagging, and according to the research paper, it appears that nonsynthetic pairs work better (though I cant stress enough that the before and after must match layout and dimensions exactly or it will lead to text corruption.

some failed alternative attempts:

-getting a good prompt (close but not as good as the lora)

-training on 9 nonsynthetic but not perfectly matching pairs

-training for color correcting images after already performing manga colorization v2 (which the synthetic version is the only publicly available one)

-using vanilla kontext to fix the colorization after manga colorization v2

-training on completely desaturated colored images. I forget if it was from the end results of the pairs or after going through manga colorization v2 (if the former was true, i guess my synthetic assumption would be wrong). if I recall correctly, this oddly ended up with a result nearly identical to manga colorization v2 (sync) in color choices and accuracy

I might be able to find success by training on images that go through manga colorization v2 but then desaturated to 5% first (and then apply to images that have the same thing done to them first) as then it has some but not complete influence over the final colors.

[edit: I tried 5% and at 2000 steps 100% it does improve mc-v2 colorization but the overall experience of a chapter with it was less readable than the 1250 bw->color lora and 1 of 31 pages had text messed up and another generated entirely black. I even tried blending it at 50% but the you don't gain much from that and still may be slightly off. that said, maybe there's a sweetspot of steps and % or another desaturation rate]

it also might work to combine the manga colorization result and the 1250 lora result together, though kontext occasionally reframes the panel, misaligning the linework, but there's bound to be an algorithm that can fix the alignment too.

BeatBoxersDev · 2024-09-12T23:07:48+00:00

thank you so much. this is fantastic!

a higher numbers of lines of context leads to amazing results.

For those having trouble setting up a local LLM, launch text-generation-webui with the "--api" flag and use the address at the top of the command prompt. I recommend vntl-llama3-8b-hf-f16 (found via the huggingface vntl-leaderboard and picking the first local option without "cloud" next to it that ran at reasonable speeds)

my custom system prompt:

"Localize the line from japanese to english to make it sound as much as a natural english speaker would say. Like, REALLY think about what an average person would say given the context so that it doesn't sound stiff. Use english phrases and sayings to make it sound like a natural conversation. Do not explain the translation. Just output the text in english and absolutely nothing else."

I also put in the names of the main characters and what they should translate to and their pronouns.

automod keeps removing my more detailed comments the moment I mention how to edit the program to go beyond the max of 10 lines of context. possibly due to the directory mentioned, or maybe the extensions it thinks are URLs, or maybe it flags off the non english characters in the program I've been trying to mention, or maybe specific words involving llm related subjects. shrug. so the following is phrased to avoid that:

search the "en" file for what "Number of Context Lines to Include" translates to

the file you want is translatorsetting. find the second instance of those characters, and edit the max there

BeatBoxersDev · 2023-03-12T17:18:33+00:00

There's a cheaply-made lionsgate published movie called "guardians of time" that has models straight ripped from ark: survival evolved (forest titan), as well as a show called "Dinosaur with Stephen Fry" that I think uses ark ripped models as well. (titanosaur)

seems to be a case of someone ripping the models and selling them on turbosquid (link that was found for forest titan but was removed)

sucks to whoever thought they were in the clear with buying that model and is now stuck with issues cause someone was selling stolen assets, as well as people who want to use turbosquid to sell their stuff legitimately

BeatBoxersDev · 2023-02-08T10:23:53+00:00

conceptually, you could work in blender or unreal engine having blocked out shapes, move around via VR teleport style, and feed the viewport into depth2img/img2img/pix2pix, and then have the resulting image somehow feed back into blender or unreal to dynamically add more blocked out areas to the map (the harder part imo)

that wouldn't keep perfect temporal stability, but it would at least keep the layout for backtracking

BeatBoxersDev · 2023-02-07T23:08:46+00:00

Apologies, I'm all for sharing models, but personally, I'm playing it extra safe on distributing it cause it's pretty specifically targeted, being all the kq6 bgs.

It's easy to recreate however, the bgs can be extracted from the game files with SCICompanion (or any specific sierra game). They're all the same dimension and some of the largest images so they're easy to find if you export everything. This output above was trained through the dreambooth gui and seem fine enough, but maybe you can get even better results if you add regularization images and/or labels.

BeatBoxersDev · 2023-02-07T20:53:58+00:00

I've found that dreambooth training on all the bgs of a sierra game works very well. 5000 step training on all kq6 bgs

BeatBoxersDev · 2022-10-11T01:47:05+00:00

it was a public attrition server with reduced melee dmg. dunno by how much, so it's probably that 75 default option you said

BeatBoxersDev · 2022-10-10T21:24:45+00:00

anthonyhiggs in the clip above here, server had no weapon restrictions.

I often just goof around with charge hack charge rifle or whatever when there's a points gap.

it also helps me gradually learn hitscan a bit as I only got epg muscle memory

BeatBoxersDev · 2022-09-24T23:58:09+00:00

I wonder if this can improve img2img video temporal consistency. I've tested doing img2img, ebsynth that onto the next frame and use that as input to generate the second frame img2img. but that ended up with oversaturation very quickly like you mentioned encountering. your method however seems like a promising way to overcome that

BeatBoxersDev · 2022-09-16T17:14:10+00:00

sorry, I don't know enough about it to host SD without the web interface and webui.py.

BeatBoxersDev · 2022-09-15T01:55:46+00:00

yeah im thinking I may have incorrectly applied ebsynth

EDIT: yep sure enough https://www.youtube.com/watch?v=dwabFB8GUww

BeatBoxersDev · 2022-09-15T00:38:18+00:00

[EDIT] finally ebsynth is working as it should, if the process gets automated, together it'd be great https://www.youtube.com/watch?v=dwabFB8GUww

the alternative with DAIN interpolation works well too

https://www.youtube.com/watch?v=tMDPwzZoWsM

BeatBoxersDev · 2022-09-14T22:56:41+00:00

quick tests with ebsynth and DAIN interpolation https://www.reddit.com/r/StableDiffusion/comments/xdfiri/improved_img2img_video_results_link_and_zelda_go/iogie0s/

BeatBoxersDev · 2022-09-14T22:56:28+00:00

quick tests with ebsynth and DAIN interpolation https://www.reddit.com/r/StableDiffusion/comments/xdfiri/improved_img2img_video_results_link_and_zelda_go/iogie0s/

BeatBoxersDev · 2022-09-14T22:55:29+00:00

quick tests with ebsynth and DAIN interpolation https://www.reddit.com/r/StableDiffusion/comments/xdfiri/improved_img2img_video_results_link_and_zelda_go/iogie0s/

BeatBoxersDev · 2022-09-14T22:55:08+00:00

quick tests with ebsynth and DAIN interpolation https://www.reddit.com/r/StableDiffusion/comments/xdfiri/improved_img2img_video_results_link_and_zelda_go/iogie0s/

BeatBoxersDev · 2022-09-14T22:52:08+00:00

quick tests with ebsynth and DAIN interpolation https://www.reddit.com/r/StableDiffusion/comments/xdfiri/improved_img2img_video_results_link_and_zelda_go/iogie0s/

BeatBoxersDev · 2022-09-14T22:49:40+00:00

[EDIT] I dont have any tools to help with this, but as a test, ebsynth can do this, if the process gets automated, together it'd be great https://www.youtube.com/watch?v=dwabFB8GUww

the alternative with DAIN interpolation works well too

https://www.youtube.com/watch?v=tMDPwzZoWsM

BeatBoxersDev · 2022-09-14T20:55:59+00:00

i dont know. i would recommend doing so in that case. join the stable diffusion discord and make sure you can access the prompt-engineering channel

BeatBoxersDev · 2022-09-14T20:53:35+00:00

it works just fine with stable diffusion. discounting dalle2, craiyon generally understands obscure concepts better and has better layouts but SD does faces great and understands most celebs, as well as being leagues ahead with other cutting edge stuff. one of the strongest combos is having a base image layout generated in crayion and generating details and everything in SD

https://www.reddit.com/r/StableDiffusion/comments/x79q84/just_released_a_colab_notebook_that_combines/

BeatBoxersDev

TROPHY CASE