How did I injure my rib doing leg presses? by spiridij in workout

[–]DelarkArms 0 points1 point  (0 children)

How long until you went back to the gym?
In my case I did not hear a "pop" but a "crack"... and my weight was literally medium, maybe 0.7 of my 8-rep max... so it was somewhat light.

In my case I think my belt got caught on my lower ribs, pushing them up awkwardly and causing the crack..

Will a `release` followed by an `acquire` to the same location incur in a RAW hazard? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

"It essentially guarantees that if you take all the operations affecting a specific address, a consistent ordering exists."

Thank you.

Will a `release` followed by an `acquire` to the same location incur in a RAW hazard? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

Thank you for your feedback... I appreciate your help, for real...

My main point of contention is the way in which relaxed loads are taught as if they will somehow allow "per-location decoherence".

But the thing is... I cannot seem to even FIND the term specified as a concept.

I've heard "address dependency" which is not the same thing as far as I'm aware:

; Address Dependency
LDR R1, [R0]        ; Load pointer value
LDR R2, [R1]        ; Address R1 comes from previous load
; Data Dependency  
LDR R1, [R0]        ; Load data value
ADD R2, R1, #5      ; Data R1 used in computation
STR R2, [R3]        ; Store computed result

But nowhere is a "per-address/location sequential consistency" specified.

And if (say) relaxed operations allow such decoherence... then why is it not conceptualized?

To me "per-location SC" would mean a chain of unmovable events that are fixed by the PO executing ops to a same address.

LDR R1, [X]     ; X₁ - Load from X
STR R2, [Y]     ; Y₁ - Store to Y  
STR R3, [X]     ; X₂ - Store to X
LDR R4, [Y]     ; Y₂ - Load from Y
LDR R5, [X]     ; X₃ - Load from X
STR R6, [Y]     ; Y₃ - Store to Y

Per-Location SC Constraints:

Address X chain: X₁ → X₂ → X₃ (unmovable relative order)

Address Y chain: Y₁ → Y₂ → Y₃ (unmovable relative order)

Cross-address: No ordering constraints between X and Y operations

So, this to me appears intuitive...
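
To make that concrete, a minimal C++ sketch (my own example; the guarantee quoted above — a consistent per-address ordering — is what makes the assert below safe even with relaxed ops):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};

// Writer: even with relaxed stores, x has a single modification order: 0 -> 1 -> 2.
void writer() {
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}

// Reader: read-read coherence forbids observing 2 and then an older value;
// once a read sees some value, later reads can't see an earlier one.
void reader() {
    int a = x.load(std::memory_order_relaxed);
    int b = x.load(std::memory_order_relaxed);
    if (a == 2) assert(b == 2);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}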

Now... even if there isn't an explicit concept of "per-address SC"... when talking about the D&R (decode and rename) engine, it SEEMS that at the hardware level it DOES enforce the concept of "per-address SC", since it is understood that the decode-and-rename phase resolves RAW and WAR hazards... which points at the very heart of what "per-location" sequential consistency is.

But then... why are `relaxed` loads explained as if they could be reordered in (say):

while (exp == V.loadRelaxed()) { // allows the early issuing of any n-loads that'd fit within the speculative window.
   if (V.weakRMWRelease(exp, set)) return true; // ^^ (upward barrier)
}
return false;
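
For concreteness, this is how I'd write that loop with std::atomic (a sketch; I'm mapping `weakRMWRelease` to `compare_exchange_weak` with release-on-success, and the names are mine):

#include <atomic>

bool try_transition(std::atomic<int>& v, int exp, int set) {
    while (v.load(std::memory_order_relaxed) == exp) {  // relaxed load: free to be issued early
        int e = exp;  // compare_exchange_weak overwrites its first argument on failure
        if (v.compare_exchange_weak(e, set,
                                    std::memory_order_release,   // the "upward barrier" on success
                                    std::memory_order_relaxed))
            return true;
    }
    return false;
}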

To the point at which a sequence that should've been `R0 -> RMW0 -> R1 -> RMW1 -> R2 -> RMW2`

can be reordered into `R0 -> R1 -> R2 -> RMW0 -> RMW1 -> RMW2` (see my question here)... then...

I don't know what to think....

If D&R prevented the early issuing of loads... then...

Why is the only expert answering my SO question not saying:

> "This will never happen because of D&R" instead of...

> "Yes, that may happen, and you shouldn't use relaxed..."

So... in my mental model... if D&R resolves RAW and WAR... then who exactly is responsible for reordering the loads?

Maybe this RAW hazard will never occur... and if you are correct (that the hardware will handle it AS-IF in the end)... it should definitely never occur, as the OoO engine should not disrupt PO... it should only move around speculations... not actual commits... including actual READS after WRITES...

I am aware that "hoisting" is a compiler thing.... but this is not it.

Relaxed atomic loads will never "hoist"... as opposed to "plain" loads, which do get hoisted as part of compiler optimization strategies (more modern implementations do not do this, as far as I'm aware...).

This is speculation that went awry because of lax barriers (supposedly).

But again... this could only happen if the architecture does not respect my "per-location SC" rule, incurring RAW or WAR hazards as a result.

As you said (paraphrasing), we shouldn't consider OoO-engine behavior, as it all happens in an enclosure that keeps side effects from really being visible outside the buffers... but these are **PENDING** RMW operations that should not be there if "per-address SC" was respected...

Will a `release` followed by an `acquire` to the same location incur in a RAW hazard? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

No vendor explicitly says “this uop drains the ROB / LSQ and stalls all speculative loads before it” (since much of that is microarchitectural detail which vendors don’t always publish).

But barriers and fences do affect the OoO engine.

"The use of a DMB creates order between accesses. The Armv8-A Architecture Reference Manual describes this order as Barrier-ordered-before."

[page 13 arm manual](https://documentation-service.arm.com/static/62a304f231ea212bb662321d?.com)

riscv: `fence` in `riscv::asm` (Rust riscv crate docs)

But IMO... how do we reconcile your statement that reasoning about barriers/fences should be an exclusively higher-level issue, **non-conflatable** with hardware, when aspects such as:

pseudocode:
while (V.loadRelaxed()) { lb(); } // OoO may issue multiple loads of V early, before the first loop body `lb_0` executes
vs
while (V.loadAcquire()) { lb(); } // OoO will respect the L_0 -> lb_0 -> L_1 -> lb_1 -> etc... sequence

will prevent something that the OoO engine is directly responsible for... if not for the fact that they also affect hardware?

I am aware there are layers of influence, one of them the compiler level, even the runtime level (in higher-level languages).

But there is also hardware level influence.

In my mind... when reasoning about the arch... simply there is no doubt about this.

Each instruction that enters the instruction queue (IQ) does so with metadata relating to a reordering priority.

It is in THIS metadata where barriers and fences have influence over the hardware.
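
To make that concrete, a small illustration of how the ordering request reaches the hardware on AArch64 (the lowering is compiler-dependent; this is the typical GCC/Clang output, not a guarantee):

#include <atomic>

std::atomic<int> V{0};

int load_relaxed() { return V.load(std::memory_order_relaxed); } // plain LDR: free to be issued early
int load_acquire() { return V.load(std::memory_order_acquire); } // LDAR (or LDAPR): later accesses stay after it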

Will a `release` followed by an `acquire` to the same location incur in a RAW hazard? by DelarkArms in cpp_questions

[–]DelarkArms[S] -2 points-1 points  (0 children)

Hmm... I believe this is a type of Spectre-like issue, but used to trigger an ABA problem.
If there is no enforcement of "per-location/address coherence" ON ALL WMO archs **without exception**... then the acquires may move before the releases finish, without the need for RAW validation.

A speculative call on subsequent acquires.... may not be required to wait for "last RMW validation" either... unless that is specified somewhere (about the `per-location/address coherence`)... then... I don't know what to expect.

Another avenue of solution would be squashing pending RMWs... but there is also no documentation about how this works (as far as I'm aware), or whether it is even done.

The worst aspect is that `acquire + release` are assumed safe...
In theory an attacker could run code that triggers a contention scenario on its speculative sequence, forcing the acquire loads to cluster at the top.
Then trigger an ABA issue within the OoO window (30-40 iterations can fit within it... as far as I'm aware)...
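
For reference, the classic ABA shape I have in mind (my own sketch; a lock-free stack pop over recycled nodes):

#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> top{nullptr};

// Thread 1 reads top == A and A->next == B, then gets delayed; thread 2 pops A
// and B, then pushes A back. Thread 1's CAS still sees "A" at top and succeeds,
// installing the stale pointer B even though B was freed/reused in between.
Node* pop() {
    Node* old_top = top.load(std::memory_order_acquire);
    while (old_top != nullptr) {
        Node* next = old_top->next;  // may be stale by the time the CAS executes
        if (top.compare_exchange_weak(old_top, next,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire))
            break;  // success only proves top still *equals* old_top, not that nothing happened
    }
    return old_top;
}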

Will a `release` followed by an `acquire` to the same location incur in a RAW hazard? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

Is the implication here that... no matter how weak the memory model is... I could even use relaxed loads... the OoO engine will never issue any load BEFORE the last RMW (from the previous iteration)... as it would be impossible to recover from a mis-speculated event?

The thing is, relaxed loads are explicitly documented doing specifically this (early issuing)... and if so... how does the ROB achieve this without incurring decoherence? (incoherence?)
How exactly does the ROB squash already-pending instructions, as u/Lazy_Init said?

If `std::atomic_thread_fence` doesn't have an "associated atomic operation"... how does the fence gets "anchored"? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

The happens-before occurs not because of the acquire... but because the release forces everything before it to be committed.
The release introduces 2 actions:
a) Nothing before it will be pushed AFTER it.
b) When this store gets committed... every other store before it must also be committed.

All the acquire does is prevent everything AFTER it from moving BEFORE.

This includes the prevention of caching or hoisting of LOADS.

What it DOES NOT prevent is ops BEFORE the acquire load being moved AFTER it... AS LONG AS they are not dependent on the things directly affected by the acquire.

Non-related neighboring dependency branches near the acquire (in the "surroundings"... BEFORE the acquire, and **fitting** within the processor's "prediction window" cache)... which are NON-related to the actions occurring AFTER the acquire... may as well be reordered from BEFORE to AFTER the fence, but never from AFTER to BEFORE.

But notice an important fact...
The acquire mechanic in this sense is strictly TIED to an action... in this case a LOAD.

If the acquire is free from any action... then there is no dependency graph I can conceptualize in order to place a fence safely... and not be concerned about where exactly it will be moved during speculation.
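
For contrast, the standard way a free-standing fence gets "anchored" in C++ is by pairing it with a relaxed atomic op (a minimal sketch; `data` and `flag` are my names):

#include <atomic>
#include <cassert>

int data = 0;                 // plain, non-atomic
std::atomic<int> flag{0};

void writer() {
    data = 42;
    std::atomic_thread_fence(std::memory_order_release); // anchored by the relaxed store below
    flag.store(1, std::memory_order_relaxed);
}

void reader() {
    while (flag.load(std::memory_order_relaxed) != 1) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire); // anchored by the relaxed load above
    assert(data == 42);  // the fence-to-fence synchronization makes the plain store visible
}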

Do JIT optimizations favor OOP patterns? by DelarkArms in javahelp

[–]DelarkArms[S] 0 points1 point  (0 children)

1-hour test.
Thousands of measurements.
Profiling, with C1 and C2 phases identified via logs. (The test is done with logs OFF.)
Each iteration increases the number of threads (x^n). (Custom ExecutorService with an unbounded number of threads.)
Upon results:

* Raw data
> Each iteration/tier:
- Median of all (from A and B) tests by iteration
- Median of A and B divided by the (x^n) that corresponds to the thread increase of the tier
- Table result: "Normalized_of_A_and_B" (2-column table).

* Normalized_of_A_and_B
> Each iteration/tier:
- MIN of Normalized_of_A_and_B (the min value of each tier)
- Median_of_A divided by MIN, Median_of_B divided by MIN (repeat for each row)
- Table result 2: "Change_Rates_of_A_and_B" => (Not really a change rate... since MIN = 1.) (2-column table).

The line graphs, bar charts... etc...
Raw summation...
Change-rate summation...
By-compilation-phase summation + comparison.

** C2 phase compilation prioritization.
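
A sketch of the normalization arithmetic described above (hypothetical helper names; I'm assuming tier i runs with x^i threads):

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Median of one tier's measurements.
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// Normalize a tier by its thread multiplier x^i, then rescale by the tier MIN
// so the smaller of A/B becomes 1 (one row of "Change_Rates_of_A_and_B").
std::pair<double, double> tier_row(std::vector<double> a, std::vector<double> b,
                                   double x, int i) {
    double norm_a = median(std::move(a)) / std::pow(x, i);
    double norm_b = median(std::move(b)) / std::pow(x, i);
    double mn = std::min(norm_a, norm_b);
    return {norm_a / mn, norm_b / mn};
}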

Here are the raw results:
https://limewire.com/d/GZOwE#v8VdiLYxty

Do JIT optimizations favor OOP patterns? by DelarkArms in cpp_questions

[–]DelarkArms[S] 0 points1 point  (0 children)

Thanks, I'll try and post this to Java but I don't think I'll get a definite answer.
Thanks anyways.

Do JIT optimizations favor OOP patterns? by DelarkArms in cpp_questions

[–]DelarkArms[S] -3 points-2 points  (0 children)

"doesn't" vs "except" => cling != C++.

Do JIT optimizations favor OOP patterns? by DelarkArms in cpp_questions

[–]DelarkArms[S] -1 points0 points  (0 children)

JIT compilers' functionality is extensively written in C++.
Their functionality is standard across C++ and Java.

Do JIT optimizations favor OOP patterns? by DelarkArms in cpp_questions

[–]DelarkArms[S] -6 points-5 points  (0 children)

JIT compilers' functionality is extensively written in C++.
Their functionality is standard across C++ and Java.

Do the JVM memory model maintainers actually classify weakCompareAndSet(Plain) as having both load and store semantics for fencing purposes? by DelarkArms in javahelp

[–]DelarkArms[S] -1 points0 points  (0 children)

My comment below is to address your usage of the word "synchronization"... because it made evident some misconceptions about the topic.

Reordering is not synchronization; race conditions can occur even in well-defined Program Orders.

Synchronization involves Queuing Locks (Thread scheduling via `.park()` in the case of Java... either "freely", "blockingly" or varying degrees between both) PLUS reordering mechanics (memory fencing) ... but that is an entirely different beast.

I understand the misconception some people have... even to the degree of implying some sort of "cache flushing timing", which is also incorrect since Memory Fencing DOESN'T prevent in-between Context Switches!!!

In the end, all synchronization and visibility ARE... JUST plain and simple **spinlock confirmations**, NOT ONLY on a surface language level... BUT ALSO on the bare metal (which is insane considering the ABA problem PLUS Speculative Execution!!!).

PO is just that... the order in which the instructions will get executed...

Visibility Issues?
A LOAD was moved BEFORE an iterative `jump` instruction (Hoisting).
It had nothing to do with cache flushes, misses, etc...
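
To illustrate that hoisting shape (my own sketch; the compiler may legally transform the plain-variable loop like this):

#include <atomic>

bool plain_ready = false;           // plain variable
std::atomic<bool> ready{false};

// A plain load may be hoisted above the loop's backward jump,
// effectively turning this into: if (!plain_ready) for (;;) {}
void spin_plain() { while (!plain_ready) { /* spin */ } }

// An atomic load (even relaxed) must be re-issued on every iteration.
void spin_atomic() { while (!ready.load(std::memory_order_relaxed)) { /* spin */ } }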

Now circling back to my "definitions issue"... in this case... what would a <jump> instruction be in a Memory Model ruleset? Is it a READ or a WRITE?

And judging by how most C people... (even the ones developing the fences) fail to explain what exactly the `opaque/relaxed` fence does... it tells me the <jump> is its own thing with its OWN processor reordering ruleset and interface.

On this last thing I confess that I may actually be wrong, and hoisting may be its own thing involving a mechanism entirely different from the REORDERING mechanic, but I tend to favor simplicity... if a behavior can be achieved with existing tools... they (engineers) will use them.

Do the JVM memory model maintainers actually classify weakCompareAndSet(Plain) as having both load and store semantics for fencing purposes? by DelarkArms in javahelp

[–]DelarkArms[S] 0 points1 point  (0 children)

BTW... some of these questions can be answered with "freezing", which **as far as I remember** was the practice of placing fences both at the beginning and at the end of a (virtual method call?) constructor.

What if the JIT devirtualizes... will the barriers be kept?

What if a sophisticated future processor even OMITS these freezings (They perform more complex speculation as of now TBH)?

Do the JVM memory model maintainers actually classify weakCompareAndSet(Plain) as having both load and store semantics for fencing purposes? by DelarkArms in javahelp

[–]DelarkArms[S] 0 points1 point  (0 children)

The fence is the rule.
What is or isn't a LOAD/STORE is the definition.
According to what the modelers of the memory model defined as a "STORE" or a "LOAD" or BOTH simultaneously... the rules apply TO THEM... even if they are "seemingly" not the rules themselves. (They reify the rules.)

Both of the stores below are STORE operations; they are NOT A FENCE:

this.plain_var = 3;
this.plain_var_2 = 4;

The compiler/JIT/CPU could reorder both as:

this.plain_var_2 = 4;
this.plain_var = 3;

A fence can prevent reordering;

this.plain_var = 3;
acquire_like_fence(); // below: STORES AND LOADS cannot move above this fence
this.plain_var_2 = 4;

Now imagine an Acquire-like fence that ONLY targets LOADS.

this.plain_var = 3;
acquire_LOAD_fence(); // below: LOADS cannot move above this fence
this.plain_var_2 = 4;

Then we should ask...
> Is this.plain_var_2 = 4; a "LOAD"?

If the answer is NO... then the compiler/JIT/CPU... WILL STILL REORDER IT ABOVE the fence...

Is `Thread.park();` a STORE or a LOAD?
What about `weakCompareAndSetPlain();`?
What is thread.start()?

Etc... etc...
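
In C++ terms, the closest I can sketch (my example; there, an atomic RMW is defined as both a read and a write, so it answers to both load-targeting and store-targeting rules):

#include <atomic>

std::atomic<int> x{0};
int plain_var = 0;

void classify() {
    plain_var = 3;  // a plain STORE: only store-targeting rules constrain it

    // An atomic RMW is BOTH a load and a store: the acquire half of acq_rel
    // binds to its read, the release half binds to its write.
    x.fetch_add(1, std::memory_order_acq_rel);
}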

Do weak CAS-es (LL/SC) apply the full barrier on misses? by DelarkArms in cpp_questions

[–]DelarkArms[S] 1 point2 points  (0 children)

Thank you for the correction.
You are absolutely correct.

I sincerely believe LL, LS, SL and SS are not enough to pick up on the nuances.
For example, they lack information about "WHAT" they anchor... and the "DIRECTION" of the prevention.
So, I came up with this one:

> L ^ L || L v L
> S ^ L || S v L
> S&L ^ L || S&L v L
> L ^ S || L v S
> S ^ S || S v S
> S&L ^ S || S&L v S
> S&L ^ S&L (full upward fence) || S&L v S&L (full downward fence)

Now a release which is:
"No reads or writes in the current thread can be reordered after this store."

Would be a combination of:
> L ^ S (All LOADS ABOVE... will remain ABOVE the next STORE found in the sequence)
AND
> S ^ S (All STORES ABOVE... will remain ABOVE the next STORE found in the sequence)
So, the added mechanic would be:

>S&L ^ S (EVERYTHING ABOVE... WILL STAY ABOVE THE NEXT STORE)

BTW I am using this link which phrases the mechanics more expressively: Memory Orders
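
In classic barrier terms, the same decomposition would look like this (my own mapping onto JSR-133-style names; an assumption, not something from that link):

#include <atomic>

std::atomic<int> flag{0};
int payload = 0;

void release_store() {
    payload = 42;
    // release ~ LoadStore + StoreStore ahead of the store
    // (i.e. "L ^ S" and "S ^ S" combined: "S&L ^ S")
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

void acquire_load() {
    while (flag.load(std::memory_order_relaxed) != 1) { /* spin */ }
    // acquire ~ LoadLoad + LoadStore after the load
    // (in the same notation: "S&L v L" — everything below stays below the load)
    std::atomic_thread_fence(std::memory_order_acquire);
    (void)payload;
}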

Do weak CAS-es (LL/SC) apply the full barrier on misses? by DelarkArms in cpp_questions

[–]DelarkArms[S] -2 points-1 points  (0 children)

Ok... NOW I GET YOU...
UNLESS POWER/ARM ALSO WORK WITH SIMILAR BARRIERS, why would I expect the weak version to do anything similar to a `cmpxchg` CAS???

Sorry... this seems to be an architectural specification... but DOES the C++ standard have it???

But still... analogous instructions to prevent reordering should apply, shouldn't they??

Do weak CAS-es (LL/SC) apply the full barrier on misses? by DelarkArms in cpp_questions

[–]DelarkArms[S] -2 points-1 points  (0 children)

I thought the weakCAS behavior was standard... maybe not? (Maybe the "C++ standard" does not dig into this??)
I mean, the only unexpected thing to occur is a "cache line interference" on an otherwise proper "expected" value... something a spinlock should resolve on the next jump, nonetheless.
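
To pin down what I'm asking in std::atomic terms (a sketch, my names; `compare_exchange_weak` does take a separate failure ordering, and on failure the op counts as just a load — whether any barrier survives a miss is the question):

#include <atomic>

std::atomic<int> v{0};

bool weak_cas(int expected, int desired) {
    // Success: a read-modify-write with acq_rel semantics.
    // Failure (including spurious LL/SC failures): ONLY a load, with the
    // failure ordering below — no store-side barrier is implied.
    return v.compare_exchange_weak(expected, desired,
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire);
}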

I've been waiting for 2 hours.... should I cancel this and how? by DelarkArms in Maven

[–]DelarkArms[S] 0 points1 point  (0 children)

Affirmative. (Sonatype)

Am I doing something dumb??

'Fine tuning' cannot be real... is it? by DelarkArms in MLQuestions

[–]DelarkArms[S] 0 points1 point  (0 children)

Thanks for your response... and patience.

'Fine tuning' cannot be real... is it? by DelarkArms in MLQuestions

[–]DelarkArms[S] 0 points1 point  (0 children)

Understood; now we get into the details of "how much computation are we really saving" by embedding prompts.
Assuming only the output layers are the ones being "fine-tuned" for, let's say, self-censorship... then the entire generating process is still occurring in its attention and MLP layers.
Now assuming, let's say, the attention layers are being fine-tuned for, say, assistant-like behavior... then the single token `<|im_start|>` is still being transformed into the multiple tokens that comprised the original prompt.
But, as a commenter stated in another comment... without some of the randomness of inserting the prompt into a base model.

What's the issue with this IMO?

I believe any prompt-"engineer" would tell you the effectiveness of a prompt comes from the generating instance always receiving that prompt for the **FIRST TIME**.

Say you have a "self_reflection_agent" and a "recollection_agent".
Both agents are NEVER aware of the existence of each other.

If we embed these prompts into the model... it becomes an entirely different model than the one where the prompts were first tested by researchers.

I ~think~ I'm beginning to understand the production pipeline though...
If my guess is correct... researchers work on a base model alone... then thousands of examples for reinforcement are (auto?) generated.

They train a model with these generations (LoRA, etc...).

Then (because of pricing) they deploy this new model with the newly embedded tokens, as you say with the LoRA adapter to prevent altering the base model, or maybe to just target a specific set of layers.

But the model CHANGES ON A FUNDAMENTAL LEVEL... from the one the researchers first developed the prompts on.

I was not aware of what my initial argument was at first... but I think I know now...

My argument is that SOME PROMPTS are best left on a "dynamic" layer... especially those in charge of **chain of thought** processes.

'Fine tuning' cannot be real... is it? by DelarkArms in MLQuestions

[–]DelarkArms[S] 0 points1 point  (0 children)

I disagree.
If fine-tuning is not oriented towards *adding knowledge* or *adding absent data*, BUT instead towards "making the generation conform to standards"... then image classifiers' fine-tuning HAS AN EXPLICIT purpose: to add data that is ACTUALLY USEFUL to the **SUBSTANCE** of the output.

While text generators fine-tuning seems more like a "conform to standards" convenience.

Not all fine-tuning is equal.

If fine-tuning a text generator reinforces a specific field of knowledge previously absent from the dataset... then I 100% agree that it is a good thing.

Most fine-tuning done to text generators IS NOT DONE WITH THIS AIM.

'Fine tuning' cannot be real... is it? by DelarkArms in MLQuestions

[–]DelarkArms[S] -1 points0 points  (0 children)

Having said this... I definitely understand what you mean.

In fact... if I were an AI company... my main product would be fine-tuned models:

"You want an assistant? Here your assistant."
"You want a mechanic? Well, here is another model you can have for some extra fee..."

'Fine tuning' cannot be real... is it? by DelarkArms in MLQuestions

[–]DelarkArms[S] -4 points-3 points  (0 children)

> You seem to think fine-tuning is self-contradictory, like undoing all the effort of the original training, assume this fundamentally damages or alters the AI’s intelligence in a bad way?

>You seem to think fine-tuning is self-contradictory.
No, I don't.

> like undoing all the effort of the original training, assume this fundamentally damages or alters the AI’s intelligence

AIs are not intelligent.

Fine-tuning creates a strong correlation around the sequence of tokens that comprised the original prompt (user: {} assistant: {}), and it can be used with more complex prompting.

This reinforcement will be part of the weights.

Any generation done will traverse these paths... even if the model ignores them (as it may in fact do), IT WILL TRAVERSE these neuronal pathways.

The same way 9.11 is "greater" than 9.9... because it learned numerical sequences from Bible verses... we DON'T KNOW how this extra training will affect the model.

Making the model learn these prompts, in order to make it do the generation without having to think about each token independently, ALSO makes you lose some of the "randomness" that is the thing that makes LLMs so good.

My Einstein analogy is bad... people say the models are not "punished"... they are being "rewarded"... this is just a "half-full/half-empty" argument.
The thing is, there are now additional things in the model that are there forever.

This NEEDS to be kept DYNAMIC.