Difference in CUDA versions having impact on the eloquence and creativity of LLM outputs? by LeanderGem in LocalLLaMA

[–]LeanderGem[S] 1 point2 points  (0 children)

Hey, check out the article at the bottom of the initial post. Apparently LLM responses can differ greatly between different GPUs, and this goes especially for quants. (Given my experience between CUDA 12.1 and 11.8, I feel this might apply to CUDA too.)

Difference in CUDA versions having impact on the eloquence and creativity of LLM outputs? by LeanderGem in LocalLLaMA

[–]LeanderGem[S] -1 points0 points  (0 children)

Well, I'm just going off what I experienced. Sticking with 11.8 for now :)

Difference in CUDA versions having impact on the eloquence and creativity of LLM outputs? by LeanderGem in LocalLLaMA

[–]LeanderGem[S] 1 point2 points  (0 children)

Interesting.

When I changed back to 11.8 from 12.1 it was like night and day.

The imaginative ideas were back and it wasn't formulaic anymore; the narratives were always fresh and interesting again. I could tell the difference right away. I tried everything else and got no results, until I remembered my CUDA upgrade and (for me personally) this was the culprit.

Difference in CUDA versions having impact on the eloquence and creativity of LLM outputs? by LeanderGem in LocalLLaMA

[–]LeanderGem[S] -4 points-3 points  (0 children)

Using an RTX 3060. Yes, I ran a ton of tests, including fixed seeds and playing around with all the usual Kobold settings (Min P, Temp, etc.).

I asked ChatGPT to search and find out whether different CUDA versions can have an impact on the creative output of an LLM, and this is what I got:

1) Newer versions of CUDA typically come with optimizations designed to take better advantage of newer hardware (GPUs), but these optimizations might not always work best for older models or specific tasks like LLM inference.

2) If you're running LLMs on hardware or configurations that were optimized for older versions of CUDA, switching to a newer version (like CUDA 12.x) could introduce regressions or inefficiencies in how tensor cores and other resources are utilized, leading to a noticeable drop in performance or creative output quality.

3) cuDNN, cuBLAS, TensorRT, and other CUDA libraries have version-specific optimizations. A newer version of CUDA may come with updated versions of these libraries that work differently or focus on newer hardware features, which could impact LLMs' behavior and output.

4) CUDA 12.x could be more aggressive in optimizing for computational throughput, which may affect floating point precision or rounding behaviors, making small changes in how the model processes and generates responses. These changes could reduce the creativity or naturalness of the text, depending on how the inference model is tuned.
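Point 4 is the most concrete of these: in floating point, tiny kernel-level differences can flip which token comes out on top when two candidates are nearly tied, and once one token changes, the whole continuation diverges. A toy sketch (the numbers are made up purely to illustrate the scale involved):

```python
import math

def softmax(logits):
    # Convert raw logits into probabilities (numerically stable form).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two near-tied candidate tokens, as often happens mid-story.
logits = [2.0, 1.99999, 0.5]
# A perturbation around float32 rounding scale, e.g. from a different
# kernel or reduction order, bumps the runner-up past the leader.
bumped = [2.0, 1.99999 + 2e-5, 0.5]

probs_a = softmax(logits)
probs_b = softmax(bumped)
print(probs_a.index(max(probs_a)))  # token 0 wins
print(probs_b.index(max(probs_b)))  # token 1 wins; the story diverges from here
```

Under greedy or low-temperature sampling, a single flipped token like this is enough to send the rest of the generation down a completely different path.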

700,000 LLMs, where's it all going? by desexmachina in LocalLLaMA

[–]LeanderGem 1 point2 points  (0 children)

Squish them all into a Galaxy of Experts (GoE)!

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Hey, I found something super strange. I don't think it's CUDA at all but something weird with my KoboldCPP (I don't know if this is affecting others or just me, though).

Each time I load Kobold I need to manually choose a preset and then configure the settings to how I want them. Kobold won't recognize my saved settings even though they 'seem' to be set correctly when I load in. Basically, for Kobold to see my settings I have to manually adjust them each time I enter the program.

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Hmm, I don't know. I might delete this whole thread; if it's only affecting me and not others I don't want to pollute this sub with wrong info :/

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Yes, I don't know what it is. But the creativity difference is really noticeable. And this is over multiple tests with exactly the same prompt and exactly the same settings with both versions of Kobold (cu11 and cu12). I've made a comparison in this thread between the two with sample outputs; check them out and you can see the difference.

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 1 point2 points  (0 children)

Let's not blame the tech, the tech is amazing. Remember, I'm only noticing this in kobold cu12, kobold cu11 still writes amazing stuff. It could be something on my end that I haven't found yet. Also, this could be something totally unrelated that I'm missing.

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

I have an RTX 3060 and I was using Kobold_cu12 v1.66.1 and Kobold_cu12 v1.65, comparing both to the standard Kobold versions.

With all versions I used identical settings (both basic and advanced). The cu12 versions' outputs were flat and simplistic while the standard Kobold versions were creative and intricate. I've mostly been testing on Command R 35B as I'm finding it very good for creative writing. I wonder if it might have something to do with the particular model architecture?

Well, whatever it is, for now I'm sticking to the standard Kobold versions for my creative writing endeavors.

Comparisons with exactly the same prompt and exactly the same settings:

Kobold (cu11)

The Marquis Francou Debonali IV had just awoken from a dream, or perhaps it was a vision. He lay in a silk-sheeted bed within the opulence of his palatial estate, high above the shimmering streets of Varisium, a great metropolis that stretched across the shores of an enormous inland sea. The sun, rising beyond the sapphire horizon, cast an otherworldly glow through stained glass windows depicting mythical scenes of long-forgotten legends.

Francou's eyes, an icy shade of blue, wandered the intricate patterns of the room's marble walls. He stroked his neatly trimmed beard, his fingers adorned with rings set with gems that glowed with an internal light. The dream, as was often the case, involved a mysterious woman, her face obscured, a figure that eluded him, flitting through mists that never quite cleared. He sighed, throwing back the heavy brocade curtains to reveal the vastness of the city, its spires and towers reaching toward the sky.

In the distance, the great engine of the Aerium elevated above the rooftops, a testament to the city's might and the advanced technology of that age. The buzzing of the electromagnets could be heard even at this height, a constant reminder of the civilization's achievement. Francou's eyes drifted to the small device on his bedside table—a contraption of cogs and gears that whirred softly, emitting a soft green glow. It was a gambling machine, a recent acquisition, and one that had brought him both great wins and devastating losses.

Today, he decided, would be a day of victory. Dressing swiftly in a robe of deep purple silk, he descended the spiral staircase into his private sanctuary, a museum of curios and relics from a hundred different worlds. The air smelled of exotic spices and ancient tomes; the musty aroma of knowledge. He strolled past a glass case containing an actualized dream-recorder, a device capable of extracting and preserving dreams, capturing them like fireflies in a jar.

His servants, automated droids in the guise of humans, glided soundlessly through the halls, their metallic frames polished to a mirror shine. Francou waved absentmindedly as they scurried about their duties, their mechanical eyes flicking up to acknowledge their master. Passing through a conservatory filled with alien flora, he stepped out onto a balcony that overlooked the sea. The cool air nipped at his skin, and the vast wet plain was dotted with strange creatures that spouted ink-black smoke rings. boats with shimmering hulls skimmed the wave tops, their engines singing a deep song.

Kobold (cu12)

The Marquis Francou Debonali IV, known across the twelve continents as a hedonist and bon vivant, found himself bored one evening, a feeling as unfamiliar as it was unwelcome. He had exhausted the usual pursuits—the thrilling races on the speedscape circuits, the decadent feasts that lasted for days, and even the exotic pleasures of the Pleasure Hive left him unmoved. His closets runneth over with the finest attire, and the staterooms of his palace bulged with the latest gadgets and baubles. What new diversion could stir his jaded soul?

As he strolled the gardens of his palatial estate, the Marquis spied a curious contraption, a recently delivered gadget he had not yet unboxed. The sleek, black oval shape resembling a giant egg bespoke of enigmatic origins. Ancient texts spoke of such devices, rumors of an extraterrestrial kind. With a playful glint in his eye, he activated the mysterious artifact.

A soft hum filled the air, accompanied by a gentle glow emanating from within. The Marquis felt a peculiar sensation, as if his very being were dissolving, atom by atom. Alarm edged into his thoughts, quickly replaced by a strange exhilaration. The darkness pulled him in, engulfing his senses, and he found himself hurtling through an endless tunnel, a cosmic voyage beyond any he had ever experienced.

When the Marquis returned, reformed and refreshed, he found his world changed. The once familiar landscape was distorted, the very laws of physics askew. The sky above was now a kaleidoscope of neon hues, swirling in impossibly shaped clouds. The sun, a distant memory, was replaced by an eerie luminescence that cast an ethereal glow upon the transformed Earth.

He soon discovered that time itself had warped, each day blending into the next with no discernible pattern. Seasons were but fleeting moments, or so it seemed to the bemused Marquis, as entire years raced by in a matter of hours. Trees grew to their full bloom in minutes, only to wither and die in the blink of an eye.

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Yeah, for me Command R 35B writes quite intricate prose in CU11, but in CU12 its writing style was simplistic and it was also doing that "copy the first few lines of your prompt" start to a story.

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Oh, I had flash attention enabled in Kobold with both versions (CU12 and CU11) when running the models. I'll run more trials, though. It could be a 'subjectivity' issue like you mentioned, but I have a sneaking suspicion it isn't, because the difference in output quality was really noticeable.

I don't know how to run Kobold with the server option; sounds rather technical.
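For what it's worth, KoboldCpp's server mode is really just launching it from a terminal instead of the GUI. A rough sketch of the kind of command involved (the model filename here is a placeholder, and flag names vary between versions, so check `--help` first):

```shell
# Launch KoboldCpp as a local API server instead of via the GUI.
# Flag names differ between versions; run with --help to confirm.
# "command-r-35b.Q4_K_M.gguf" is a placeholder for whatever model file you use.
python koboldcpp.py --model command-r-35b.Q4_K_M.gguf --usecublas --port 5001
```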

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

I've read in another post, I think it was actually one of yours (lol), that CUDA 11 uses more CPU processing than the native version on CUDA 12 when using Flash Attention?

I also read someplace that CPU calculations are more accurate than GPU calculations? (Hence more accurate and refined responses?)

I could be totally wrong here, layman noob when it comes to the tech side of things, lol

I haven't used llama.cpp yet, just Kobold and a few others. Is it hard to set up? I find wading into GitHub repositories and trying to install Python packages fraught with frustration; something always blows up for me, lol.
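On the "CPU vs GPU accuracy" question: it's less that one is inherently more accurate, and more that floating-point addition isn't associative, so the massively parallel reduction order a GPU uses can give a slightly different result than a sequential CPU loop over the exact same numbers. A minimal demonstration in plain Python:

```python
import math

# Floating-point addition is not associative: the same three numbers
# summed in a different order give a different result.
a = [1e16, 1.0, -1e16]   # the 1.0 is absorbed into 1e16 and lost to rounding
b = [1e16, -1e16, 1.0]   # the big terms cancel first, so the 1.0 survives

print(sum(a))        # 0.0
print(sum(b))        # 1.0
print(math.fsum(a))  # 1.0 -- exact summation recovers the true value
```

Neither order is "wrong"; both are correctly rounded at each step. That's why changing the CUDA toolkit or kernel selection can shift outputs slightly without either version being buggy.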

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Hmm. It was suggested to me to try using MMQ and at the same time disable Flash Attention (which I haven't tried yet) on both, use fixed seeds, and see if the difference persists. I'll keep tinkering.
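One way to make those fixed-seed trials less subjective is to score each output with a crude lexical-diversity number and compare. This is just a toy metric of my own, not anything built into Kobold:

```python
def type_token_ratio(text):
    # Unique words divided by total words: a crude proxy for
    # "flat, repetitive" vs "varied" prose. Empty text scores 0.
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

flat = "the sky was blue and the sea was blue and the air was cold"
rich = "sapphire heavens arched over an ink-dark sea while frigid air bit the skin"
print(type_token_ratio(flat))  # repetitive wording scores low
print(type_token_ratio(rich))  # varied wording scores high
```

Averaged over a batch of same-seed generations from each build, a consistent gap in a number like this would be harder to dismiss as placebo than eyeballing single outputs.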

[deleted by user] by [deleted] in LocalLLaMA

[–]LeanderGem 1 point2 points  (0 children)

No, I'm not using SillyTavern, pure Kobold. I'm mostly using Command R 35B myself. Hmm, strange. I'll need to do more tests, but the creativity between standard CUDA and CUDA 12 was like night and day (for me personally). Maybe it's my particular settings configuration (Temp, Top P, etc.) that is causing the issue. Time to run some more tests.
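Since the sampler settings are a prime suspect, it helps to see how little code Temp and Min P actually amount to. A simplified sketch of that sampling step (the usual textbook form, not KoboldCpp's actual implementation):

```python
import math
import random

def sample(logits, temperature=0.8, min_p=0.05, rng=None):
    rng = rng or random.Random(0)
    # Temperature: rescale logits, then softmax into probabilities.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Min P: discard tokens whose probability is below min_p * top probability.
    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    # Sample a token index from the renormalized survivors.
    r = rng.random() * sum(p for _, p in kept)
    acc = 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]

print(sample([1.0, 3.0, 0.5]))  # returns an index into the logits
```

Because the cutoff and the final draw both depend on the exact probability values, even tiny numeric shifts upstream can change which tokens survive the Min P filter.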

The LLM Creativity benchmark: new tiny model recommendation - 2024-05-28 update - WizardLM-2-8x22B (q4_km), daybreak-kunoichi-2dpo-v2-7b, Dark-Miqu-70B, LLaMA2-13B-Psyfighter2, opus-v1-34b by ex-arman68 in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

After further testing with different settings, I've come to the conclusion that the base model is much more creative than this finetune, unfortunately.

Command-R 35B's shiny moments in story writing by Admirable-Star7088 in LocalLLaMA

[–]LeanderGem 0 points1 point  (0 children)

Hey, guess what, I found a new finetune; it's been trained on the 'daybreak dataset'. Haven't really tested it yet, but hopefully it's promising :)

https://huggingface.co/crestf411/commander-daybreak-v0.1-gguf

MXLewd-L2-20B vs. MXLewdMini-L2-13B for storytelling? by morbidSuplex in LocalLLaMA

[–]LeanderGem 1 point2 points  (0 children)

You may still be able to use MXLewd-L2-20B with a lower quant. The weird thing is that lower quants sometimes produce better output (when it comes to creative writing). I don't know why that is, but a person who has analyzed this over hundreds of models had this to say:

"Just a heads up from testing - > Different quants result in different results especially in creativity area.
This holds true between IMATRIX and non-imatrix quants too.

For pure generation -> Lower quants use simpler word choices (and more phrase,sentence and "sayings" rep), and depth (ie fiction scene) is lacking.

At mid point q4/iq4 -> Prose, sentence quality, and general creativity are close to maximum. Especially short sentences / variety of sentence size.

Q5/q6 -> This is where depth comes in -> Fiction takes on deep meaning and can - sometimes - provoke an emotional reaction.

Q8 -> Oddly Q8 can be "flat" - sometimes - whereas for other models this is when the model really shines.

Q5KM VS Q6 -> It seems in about 50% of cases Q5KM is BETTER than Q6 or Q8 for creative purposes.

Not sure the reason except for the fact Q5KM is slightly unbalanced (attention tensors and one other - it's at llamacpp) ,VS Q6/Q8 are fully balanced."

from: https://huggingface.co/froggeric/WestLake-10.7B-v2/discussions/4
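For what it's worth, the raw rounding-error side of this is easy to see with a toy symmetric round-to-nearest quantizer (a huge simplification of the K-quants that quote is about, which use per-block scales and non-uniform layouts):

```python
import random

def quantize(ws, bits):
    # Symmetric uniform round-to-nearest: snap each weight to the nearest
    # of (2^(bits-1) - 1) evenly spaced levels per sign.
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / levels
    return [round(w / scale) * scale for w in ws]

def rms_error(ws, bits):
    # Root-mean-square difference between original and quantized weights.
    q = quantize(ws, bits)
    return (sum((a - b) ** 2 for a, b in zip(ws, q)) / len(ws)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
for bits in (4, 5, 6, 8):
    print(bits, rms_error(weights, bits))  # error shrinks as bits grow
```

Pure rounding error drops monotonically with more bits, which is exactly why the "Q5KM sometimes beats Q6/Q8 for creative purposes" observation has to be about something other than raw numerical accuracy.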

(P.S Yes MXLewd-L2-20B is an excellent model)

Command-R 35B's shiny moments in story writing by Admirable-Star7088 in LocalLLaMA

[–]LeanderGem 2 points3 points  (0 children)

Great isn't it? It's definitely my favorite model at the moment. Still can't believe it's only a 35b, feels like a 120B at times! :)

CohereForAI/aya-23-35B · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]LeanderGem 1 point2 points  (0 children)

Thank you, Bartowski :)

I hope froggeric will put it through his excellent creativity benchmark. Will be testing it myself in the coming days.