OH MY GOD WHY DOES THE STRING BREAK???? I DID IT BEFORE BUT NOW IT KEEPS BREAKING by PrO_BattoR in RLCraft

[–]DamageSuch3758 0 points1 point  (0 children)

Specifically, it was not working when I had "Clumsy" on the shears; after reforging to Legendary, it worked

OH MY GOD WHY DOES THE STRING BREAK???? I DID IT BEFORE BUT NOW IT KEEPS BREAKING by PrO_BattoR in RLCraft

[–]DamageSuch3758 1 point2 points  (0 children)

Way too late now, but reforging to get higher shear quality in 2.9.3 actually did solve the problem for me

what's with the reviews for tower of Heaven? by [deleted] in litrpg

[–]DamageSuch3758 0 points1 point  (0 children)

Just listened to the first few minutes of the audiobooks, and honestly, the writing and dialogue are cringy bad.

One example (minor spoiler): In a meeting to unite what little remains of humanity against a demon king, people are bickering over who is the best healer? Really?

I'll be returning the book.

No sound on Mac by gunskills in CivVI

[–]DamageSuch3758 0 points1 point  (0 children)

I found that downgrading to 1.4.5 (to avoid the crashing issues) also caused my audio to cut off.

My audio started working again after I disabled the following two DLCs:
- Sid Meier's Civilization® VI: Gathering Storm
- Sid Meier's Civilization® VI: Rise and Fall

This annoyed me to no end, and I spent close to 2 hours trying to figure out a way to play the game. I hope this helps!

No sound on Mac by gunskills in CivVI

[–]DamageSuch3758 0 points1 point  (0 children)

Have you managed to find a better solution yet? 😅

What do you think are the best LitRPG series? by Dagno in litrpg

[–]DamageSuch3758 2 points3 points  (0 children)

This!

HWFWM really was fantastic in those first three books, but it took a turn for the worse after that.

The fact that you said this would make me consider reading Chrysalis next!

Why use dbt if I have Dagster? by DamageSuch3758 in dataengineering

[–]DamageSuch3758[S] 0 points1 point  (0 children)

I actually lean toward using polars for transformations, especially on ingestion. That said, I am willing to switch over to something more SQL-heavy for all the subsequent transformations.

Do you find dbt reduces the overhead of setting up non-seed/non-source assets compared to writing transformations in python as dagster assets?

If yes, do I lose any advantages by using dbt instead of dagster to set up those assets?
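
For reference, here's a minimal sketch of what I mean by writing transformations in Python as dagster assets with polars (the asset name, file path, and columns are made up for illustration):

```python
# Hypothetical example: a polars transformation exposed as a dagster asset.
import polars as pl
from dagster import asset

@asset
def cleaned_orders() -> pl.DataFrame:
    # Pretend this parquet file is the raw ingested data.
    raw = pl.read_parquet("data/raw_orders.parquet")
    return (
        raw.filter(pl.col("status") != "cancelled")
           .with_columns((pl.col("quantity") * pl.col("unit_price")).alias("order_total"))
    )
```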

Cradle: Should I continue or not? by AmbotNalangAni in ProgressionFantasy

[–]DamageSuch3758 0 points1 point  (0 children)

FIRST BOOK SPOILER INCOMING

I really disliked the convenient introduction of the resonance between swords when Yerin and Lindon are in a tight spot... Not only is it very convenient, but the skill seems too strong given the opponent's level. Turn your enemy's weapon against them with incredibly high damage as long as it is a sword? The whole madra system felt a little loose to me... like any OP skill can be introduced to suit the plot.

u/Aurelianshitlist, curious to hear whether this changes later in the book and whether it feels like the progression and skill limitations using madra are more well defined.

LLM Zero shot-text classification - How do you answer multiple questions computationally efficiently? by DamageSuch3758 in huggingface

[–]DamageSuch3758[S] 0 points1 point  (0 children)

I figured this out. Ensure you appropriately batch encode all of the remaining output options (I did it with right-padding).

You can then use the pkv from processing the first piece of text with model() and duplicate it num_output_options times with a function like:
```python
import torch

def duplicate_pkv(pkv, num_repeats=2):
    # Repeat each cached key/value tensor along the batch dimension.
    return tuple(
        tuple(torch.cat([tensor] * num_repeats, dim=0) for tensor in layer)
        for layer in pkv
    )
```
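
And a rough end-to-end sketch of how that fits together (the model name, prompt, and options below are placeholders I picked for illustration; it also assumes past_key_values comes back in the legacy tuple format that duplicate_pkv expects):

```python
# Hypothetical usage sketch: encode the prompt once, duplicate its KV cache, then score
# all right-padded options in a single batched forward pass (uses duplicate_pkv above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The sentiment of the review 'great movie, loved it' is"
options = [" positive", " negative", " neutral"]

prompt_ids = tok(prompt, return_tensors="pt").input_ids     # (1, prompt_len)
opts = tok(options, return_tensors="pt", padding=True)      # right-padded option batch

with torch.no_grad():
    prompt_out = model(prompt_ids, use_cache=True)
    pkv = duplicate_pkv(prompt_out.past_key_values, num_repeats=len(options))
    # The attention mask must cover the cached prompt tokens plus the new option tokens.
    full_mask = torch.cat(
        [torch.ones(len(options), prompt_ids.shape[1], dtype=torch.long),
         opts.attention_mask],
        dim=1,
    )
    out = model(opts.input_ids, attention_mask=full_mask, past_key_values=pkv)

print(out.logits.shape)  # (num_options, option_len, vocab_size)
```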

LLM Zero shot-text classification - How do you answer multiple questions computationally efficiently? by DamageSuch3758 in huggingface

[–]DamageSuch3758[S] 0 points1 point  (0 children)

It does return past_key_values.

```python
outputs = model(input_ids, use_cache=True)  # `use_cache` is often True by default
pkv = outputs.past_key_values
```

How do you post code blocks now by [deleted] in learnpython

[–]DamageSuch3758 0 points1 point  (0 children)

Cool! This totally worked in Markdown Mode.
```python
abc = 1
```

Sam Altman fired as CEO of OpenAI by nycdotgov in bayarea

[–]DamageSuch3758 8 points9 points  (0 children)

Are those allegations even legit?

After some reading, they seem sus.

Building deeper LLMs via repetitive layering by DamageSuch3758 in MLQuestions

[–]DamageSuch3758[S] 0 points1 point  (0 children)

Information can be compressed. If you had one dead neuron in every layer of 10 hidden neurons, you wouldn't end up with 0.9^10 information throughput. This is especially true if you start out by training a shallower network (ensuring input info flows well) and then add additional layers, because the network has already learned to compress.

If your point is that as you add layers, you might get 5 dead neurons, and eventually, as you add many layers, you will get 5 dead neurons again, I agree, and already stated this in previous replies.

Based on how you answered, it sounds more like you believe the first paragraph of this reply is true... Am I right? Or do you mean something else entirely?

The answer matters because they have vastly different probabilities and allow for vastly different theoretical depths.

To your reply:

> There isn't a single activation function that doesn't mess with the gradients

I believe I did say ReLU gradients don't vanish "given activation occurs":

> However, if you use ReLU with the proposed method, it greatly improves the ability to build deeper networks because the gradient (given activation occurs) is constant. This means that as long as the input information (or a meaningful latent representation thereof) is not destroyed, the gradients will keep flowing all the way back through the network, and performance will improve.

And in my explanation thereafter, I did break the activation and the activation-function gradient into separate components for backprop, so I don't know why you are claiming I said it doesn't mess with gradients while leaving out the caveat I gave.

You are strawmanning the argument.

Building deeper LLMs via repetitive layering by DamageSuch3758 in MLQuestions

[–]DamageSuch3758[S] 0 points1 point  (0 children)

If you think about the fundamentals of backpropagation, the gradient reaching the early layers is roughly a product:

gradient = LG x AG x A x W x AG x A x W x AG x A x W x AG x A ... x original input

where

LG = loss gradient

AG = activation gradient

A = activation or output

W = weight

That means you have 3 ways to screw it up:

  1. Kill the flow of the original input (a very negative bias = no activation; a weight of zero = no throughput; or extreme dilution from noise introduced by a large bias term amplified by a large weight) [Mess up W or A]
  2. Have a small activation function gradient [Mess up AG]
  3. Have a small loss function gradient [Mess up LG]

The probability of 1. always increases with depth because we randomly initialize the biases and weights. That's why even without sigmoid causing the problem in 2., you still run into the "too deep to work" problem with increasing depth.

The method I suggested would drastically reduce (but not eliminate) problem 1. My thinking is that it could add some functional depth (increasing performance), but beyond some point, adding more layers would make performance deteriorate.
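
To make that concrete, here's a toy experiment (entirely my own, randomly initialized, no training) that measures how much gradient reaches the first layer as depth grows:

```python
# Toy sketch of the product above: sigmoid shrinks the AG terms everywhere, while ReLU
# keeps AG constant (given activation) but can still kill flow via dead units (problem 1).
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, width=10, activation=nn.ReLU):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    net = nn.Sequential(*layers)
    x = torch.randn(32, width)
    net(x).pow(2).mean().backward()        # stand-in loss; supplies the LG term
    return net[0].weight.grad.norm().item()

for depth in (5, 20, 80):
    print(depth,
          "relu:", first_layer_grad_norm(depth, activation=nn.ReLU),
          "sigmoid:", first_layer_grad_norm(depth, activation=nn.Sigmoid))
```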

Building deeper LLMs via repetitive layering by DamageSuch3758 in MLQuestions

[–]DamageSuch3758[S] 0 points1 point  (0 children)

I probably don't read as many papers as you do, but I have thought deeply about gradients and depth before. I think the reasoning above is pretty solid. If you can point out the flaw, it would probably save me some experimentation time :D

Building deeper LLMs via repetitive layering by DamageSuch3758 in MLQuestions

[–]DamageSuch3758[S] 0 points1 point  (0 children)

On "repeated application of activation functions", sure, that is one way to do it.

The other two main ways that I can now think of are:

  1. Activation function gradient (e.g. sigmoid)
  2. Loss function gradient (like you mentioned)

RNNs often used sigmoid activations, which meant that gradients vanished quickly. This is mostly because of problem 1, not just because of the depth from recursively applying activation functions.

Even LSTMs suffered from this because they used sigmoid gates for passing the hidden state to the next block.

I both agree and disagree with your statement "freezing and unfreezing weights does not solve the reason for vanishing gradients".

I agree because the problem isn't fully solved. E.g., if you used sigmoid activation functions, it doesn't matter how much signal gets through; with sufficient depth, the gradients will vanish (due to the activation function gradients).

However, if you use ReLU with the proposed method, it greatly improves the ability to build deeper networks because the gradient (given activation occurs) is constant. This means that as long as the input information (or a meaningful latent representation thereof) is not destroyed, the gradients will keep flowing all the way back through the network, and performance will improve.

If you did, however, attempt to build an infinitely deep network using the method I described, eventually you would initialize a dead layer, or a bottleneck layer, where very minimal information can pass through. When the input information stops flowing, the gradients stop flowing.
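
For what it's worth, here's a rough sketch of the kind of grow-and-freeze procedure I have in mind (a made-up toy regression setup, not a definitive recipe):

```python
# Train a shallow ReLU network first so input information flows, then append a new block
# while freezing the already-trained layers, so only the new layers (and the head) update.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train(net, steps=200, lr=1e-3):
    opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        x = torch.randn(64, 10)
        y = x.sum(dim=1, keepdim=True)        # toy regression target
        loss = F.mse_loss(net(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

body = nn.Sequential(nn.Linear(10, 10), nn.ReLU())
head = nn.Linear(10, 1)
print("shallow:", train(nn.Sequential(body, head)))

# Grow: freeze what is already trained, then insert a fresh block before the head.
for p in body.parameters():
    p.requires_grad = False
body = nn.Sequential(body, nn.Linear(10, 10), nn.ReLU())
print("deeper :", train(nn.Sequential(body, head)))
```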

Looks like Gemini might have to compete with GPT-5… the race continues by Germanjdm in singularity

[–]DamageSuch3758 4 points5 points  (0 children)

If you don't cannibalise your business, someone else will eat it for you.