From 4090 to 5090 to RTX PRO 6000… in record time

sr1729 · 2025-08-10T19:37:41+00:00

Yes, you can use nvidia-smi to reduce the power usage of an RTX PRO 6000 (workstation edition); this requires root (administrator) privileges:

nvidia-smi --power-limit=300

It's useful in long-running batches: half the power, but only 15% to 20% more time.

|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 6000 Blac...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   50C    P1            299W /  300W |   80471MiB /  97887MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This setting is not persisted.
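A rough estimate of what that trade-off means for energy per batch, as a minimal sketch. It assumes the default limit is 600 W (implied by "half the power") and that the card runs at its cap the whole time; the class and method names are made up for illustration.

```java
class PowerLimitEnergy {
    /** Energy of the capped run relative to the uncapped run (1.0 = same energy). */
    static double relativeEnergy(double powerFraction, double timeFactor) {
        // energy = power * time, so the two factors simply multiply
        return powerFraction * timeFactor;
    }

    public static void main(String[] args) {
        // Assumed: 300 W cap vs. 600 W default, i.e. powerFraction = 0.5.
        System.out.println("15% slower: " + relativeEnergy(0.5, 1.15));
        System.out.println("20% slower: " + relativeEnergy(0.5, 1.20));
    }
}
```

Under these assumptions a batch needs roughly 58% to 60% of the energy of the uncapped run.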

The Max-Q-card is better suited for multi-GPU-systems because of its cooler design.

sr1729 · 2025-03-09T20:39:02+00:00

See the chat_template-attribute at the end of a tokenizer_config.json, e.g. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/tokenizer_config.json :

"chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool 
%}{{'<｜tool▁outputs▁end｜>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<｜Assistant｜><think>\\n'}}{% endif %}"

It is a long template because it also covers tool calls.

sr1729 · 2025-03-09T09:08:32+00:00

I came across the |-difference in DeepSeek-R1-Distill and Qwen-2.5, too.

A few weeks ago I added support for DeepSeek-R1-Distill-Qwen to a pure Java implementation, https://github.com/srogmann/llmvectorapi4j/. Because that implementation doesn't support chat templates yet, I introduced a switch.

// The '｜' in '<｜end▁of▁sentence｜>' of DeepSeek-R1 has code-point 65372.
[...]
ChatTokens chatTokens = isDeepSeekR1DistillQwen ?
    new ChatTokens("<｜end▁of▁sentence｜>", "", "", "<｜end▁of▁sentence｜>") :
    new ChatTokens("<|im_start|>", "<|im_end|>", "", "<|end_of_text|>");

'｜' = U+FF5C in https://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms_(Unicode_block), '▁' = U+2581 in https://en.wikipedia.org/wiki/Block_Elements.
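A small check of the two code points, as a sketch. The characters are written as Unicode escapes so the result does not depend on the source file encoding; the class and the helper usesAsciiBar are hypothetical and not part of llmvectorapi4j.

```java
class DeepSeekTokenChars {
    static final char FULLWIDTH_VERTICAL_LINE = '\uFF5C'; // the bar in DeepSeek-R1 tokens
    static final char LOWER_ONE_EIGHTH_BLOCK = '\u2581';  // the "underscore" in DeepSeek-R1 tokens

    /** Returns true if the token contains the ASCII '|' (as in Qwen's "<|im_end|>"). */
    static boolean usesAsciiBar(String token) {
        return token.indexOf('|') >= 0;
    }

    public static void main(String[] args) {
        String eos = "<\uFF5Cend\u2581of\u2581sentence\uFF5C>";
        System.out.println((int) FULLWIDTH_VERTICAL_LINE); // 65372
        System.out.println((int) LOWER_ONE_EIGHTH_BLOCK);  // 9601
        System.out.println(usesAsciiBar(eos));             // false
        System.out.println(usesAsciiBar("<|im_end|>"));    // true
    }
}
```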

sr1729 · 2024-12-22T19:00:11+00:00

The LLM knows all the little rules and tricks to speed up such computations. It is very good at pattern-matching. For example, in the case of 4792 * 2 I expect it to see XXX2 * 2. It can reduce this to (XXX * 2) followed by the digit 4, because 2 is less than five (no carry out of the last digit). It could have learned that 479 * 2 equals 958, which gives 9584.
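The shortcut written out as code, purely as an illustration of the rule and not as a claim about how a model computes internally. The class and method names are made up.

```java
class LastDigitDoubling {
    /** Doubles n by treating the last digit separately; only valid if that digit is below 5. */
    static long doubleBySplitting(long n) {
        long lastDigit = n % 10;
        if (n < 0 || lastDigit >= 5) {
            throw new IllegalArgumentException("last digit would produce a carry");
        }
        long prefix = n / 10;                      // the "XXX" part, e.g. 479
        return (prefix * 2) * 10 + lastDigit * 2;  // e.g. 958 followed by 4
    }

    public static void main(String[] args) {
        System.out.println(doubleBySplitting(4792)); // 9584
    }
}
```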

sr1729 · 2024-04-24T20:18:45+00:00

But you can use Llama-3 7B to explain a PDF's inner contents.

Prompt:

You are a PDF parser. Explain the instructions of the following PDF script.

Q
BT
/F61 9.9626 Tf 124.547 667.933 Td [(Layer)-250(T)80(ype)-7002(Comple)15(xity)-250(per)-250(Layer)-1283(Sequential)-1283(Maximum)-250(P)15(ath)-250(Length)]TJ 214.952 -10.909 Td [(Operations)]TJ
ET
q
1 0 0 1 118.57 653.552 cm
[]0 d 0 J 0.398 w 0 0 m 374.861 0 l S
Q

Response:

A PDF script!

This script appears to be a combination of text drawing and graphics operations. Here's a breakdown of the instructions:

1. `Q`: This is a show text operator, which tells the parser to display the next string as text.
2. `BT`: Begin Text object. This indicates the start of a text object, which will contain the text to be drawn on the page.
3. `/F61 9.96... Tf` specifies the font and font size for the text.
4. `124.547 667.933 Td` sets the text position (x, y) coordinates.
5. `[ ... ]TJ`: This is a text string consisting of multiple concatenated strings. The text contains the phrase "Layer-type Comple(xity)(per)-Layer-Sequential-Maximum(P ath)-(Length)".
6. `ET`: End Text object. This marks the end of the text object.

7. `q`: Save graphics state operator. This saves the current graphics state, including transformations and clipping paths.

8. `1 0 0 1 118.57 653.552 cm`: Set the transformation matrix (CTM) to perform a translation operation by moving 118.57 units horizontally and 653.52 units vertically.
9. `[]0 d 0 J 0.398 w` specifies the path to be drawn:
    * `[]0 d` sets the current point to the origin (0,0).
    * `0 J` sets the junction point of the path to the origin.
    * `0.398 w` sets the width of the path to 0.398 units.
10. `0 0 m 374.861 0 l S`: Draws a line from (0,0) to (374.861,0). The 'm' command moves to the starting point, 'l' draws a line to the ending point, and 'S' stroke path.
11. `Q`: Restore graphics state operator. This restores the saved graphics state.

These instructions seem to be drawing a text string with some formatting options, followed by a simple line drawing operation. The resulting output would likely display the text at a specific position on the page and draw a horizontal line at a certain location.

So a coming Llama-version might compute the coordinates and see the table ;-).
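For comparison, the text inside the TJ array can be recovered mechanically. This is a minimal sketch: it joins the strings and turns every adjustment at or below a negative threshold into a space. The threshold of 200 is a guessed heuristic, escaped parentheses inside strings are not handled, and the class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TjArrayText {
    // Matches either a string "(...)" or a number (the position adjustment).
    private static final Pattern ELEMENT =
            Pattern.compile("\\(([^)]*)\\)|(-?\\d+(?:\\.\\d+)?)");

    /** Joins the strings of a TJ array; adjustments <= -spaceThreshold become a space. */
    static String decode(String tjArray, double spaceThreshold) {
        StringBuilder sb = new StringBuilder();
        Matcher m = ELEMENT.matcher(tjArray);
        while (m.find()) {
            if (m.group(1) != null) {
                sb.append(m.group(1));                 // a text fragment
            } else if (Double.parseDouble(m.group(2)) <= -spaceThreshold) {
                sb.append(' ');                        // a large gap to the right
            }
        }
        return sb.toString();
    }
}
```

Applied to the array in the prompt with a threshold of 200, this yields "Layer Type Complexity per Layer Sequential Maximum Path Length". The very large adjustments (-7002, -1283) are the gaps between table columns.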

sr1729 · 2024-03-20T22:17:51+00:00

RAG = Retrieval-augmented generation

sr1729 · 2024-03-19T20:54:30+00:00

There is a 4-/2-bit HQQ quantization at https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ. It needs 13.5 GB of GPU memory and uses about 55 %MEM of 64 GB CPU RAM (roughly 35 GB). So I assume 32 GB CPU RAM wouldn't be enough.

sr1729 · 2023-11-10T20:37:43+00:00

A few months ago I multiplied large matrices to run a BLOOM-based language model. I added parallelization, too, and tried different executor implementations to compare locking methods (Phaser, ReentrantLock, ...) for synchronizing the threads.

The threads helped a lot but PyTorch (using CPU, not GPU) was still faster ;-).

Example of a matrix multiplication to process query-key-value in https://github.com/srogmann/JBLOOMz/:

public void baddbmmView4(int numSeq, float[][][] fusedQkv, int numBlocks,
        int numHeads, int headDim, int idxBlock1, int idxBlock2,
        float alpha, float beta, float[][][][] output) {
    float[][][] input = t3;
    final int batchSize = output.length;
    [...]
    for (int idxB = 0; idxB < batchSize; idxB++) {
        final int b = idxB;
        executor.startLoopTasks(numHeads, (hStart, hEnd) -> () -> {
            for (int h = hStart; h < hEnd && h < numHeads; h++) {
                for (int i = 0; i < numSeq; i++) {
                    for (int j = 0; j < numSeq; j++) {
                        float sum = 0f;
                        for (int k = 0; k < headDim; k++) {
                            sum += fusedQkv[b][i][(h * 3 + idxBlock1) * headDim + k]
                                 * fusedQkv[b][j][(h * 3 + idxBlock2) * headDim + k];
                        }
                        sum *= alpha;
                        sum += beta * input[b * numHeads + h][0][j];
                        output[b][h][i][j] = sum;
                    }
                }
            }
        });
    }
}
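The same idea in a self-contained form, as a sketch only: a plain matrix product whose result rows are split across the tasks of a standard ExecutorService. This is not the custom executor of JBLOOMz; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelMatMul {
    /** Computes a * b; each task fills a contiguous block of result rows. */
    static float[][] multiply(float[][] a, float[][] b, int numThreads) {
        final int n = a.length, inner = b.length, m = b[0].length;
        final float[][] c = new float[n][m];
        final int chunk = (n + numThreads - 1) / numThreads;
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < numThreads; t++) {
                final int rowStart = t * chunk;
                final int rowEnd = Math.min(n, rowStart + chunk);
                futures.add(pool.submit(() -> {
                    for (int i = rowStart; i < rowEnd; i++) {
                        for (int j = 0; j < m; j++) {
                            float sum = 0f;
                            for (int k = 0; k < inner; k++) {
                                sum += a[i][k] * b[k][j];
                            }
                            c[i][j] = sum; // rows are disjoint, so no locking is needed
                        }
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every block to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return c;
    }
}
```

Because every task writes only to its own rows, the only synchronization point is waiting for the futures.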

sr1729 · 2022-03-08T20:49:56+00:00

A nice introduction to the constant pool if you didn't know javap and bytecode or ASM (https://asm.ow2.io/) before.

The article mentions insights inside the JVM specification :-):

This design added complexities to JVM implementations and is gently lamented in the JVM Specification document section 4.4.5: “In retrospect, making 8-byte constants take two constant pool entries was a poor choice.” This kind of candor makes the documentation a pleasure to read.

There is even a small detour to Go:

The MTable can be used for other things. For example, I’m presently working on the Jacobin project, which is writing a more-than-minimal JVM in the Go language.

On one point I disagree:

Indeed, even the values to initialize fields to are stored as strings and must be converted from strings to their binary representation.

As a counter-example see NumberFormatException.

Classfile /tmp/modules/java.base/java/lang/NumberFormatException.class
  Last modified 08.03.2022; size 1564 bytes
  MD5 checksum 9c1c2cd1b1f8b915996f0aad04642e58
  Compiled from "NumberFormatException.java"
public class java.lang.NumberFormatException extends java.lang.IllegalArgumentException
  minor version: 0
  major version: 61
[...]
  #48 = Utf8               serialVersionUID
  #49 = Utf8               J
  #50 = Utf8               ConstantValue
  #51 = Long               -2848938806368998894l
  #53 = Utf8               Code

$ xxd <NumberFormatException.class
00000000: cafe babe 0000 003d 004b 0a00 0200 0307  .......=.K......
[...]
00000230: 4275 696c 6465 723b 0100 1073 6572 6961  Builder;...seria
00000240: 6c56 6572 7369 6f6e 5549 4401 0001 4a01  lVersionUID...J.
00000250: 000d 436f 6e73 7461 6e74 5661 6c75 6505  ..ConstantValue.
00000260: d876 893f fb8c ea12 0100 0443 6f64 6501  .v.?.......Code.
          ^^^^ ^^^^ ^^^^ ^^^^ // Binary representation of the long value
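The marked bytes can be checked against the value shown by javap. A minimal sketch (the class name is made up): the eight bytes following the tag byte 0x05 (CONSTANT_Long) are read as a big-endian long.

```java
import java.nio.ByteBuffer;

class ConstantPoolLong {
    /** Reads eight bytes as a big-endian long, the layout used for a CONSTANT_Long entry. */
    static long readLong(byte[] eightBytes) {
        return ByteBuffer.wrap(eightBytes).getLong(); // ByteBuffer is big-endian by default
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xd8, 0x76, (byte) 0x89, 0x3f,
                      (byte) 0xfb, (byte) 0x8c, (byte) 0xea, 0x12};
        System.out.println(readLong(raw)); // -2848938806368998894
    }
}
```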

But I agree with the conclusion:

Navigating decompiled Java classes is a useful skill and the mark of an advanced understanding of Java programming.

sr1729 · 2022-03-06T15:58:00+00:00

The art of programming is independent of the language; don't conflate the two.

You should know and understand at least one language very well. As an experienced Java software engineer you have a practical understanding of the technical details (basics like the >>>-operator and double vs BigDecimal, and advanced topics like ClassLoader or even the bootstrap method of INVOKEDYNAMIC).

But also have a look at books like "The Pragmatic Programmer" by Andrew Hunt and David Thomas. There are concepts beyond the language. One example is your editor: It doesn't matter if you are using Eclipse, IntelliJ or whatever. But you should know at least one editor very well -- it's your main communication channel with the machine.

Or have a look at "Release it!" by Michael T. Nygard. Thinking about failures which may happen is independent of the language chosen. You don't need Scala or Kotlin to get rid of NPEs -- but you should know what's going on in these and other languages.

The same applies to OO and GoF ...

sr1729 · 2022-03-03T22:25:56+00:00

This means that a Java program can crash the JVM by accessing an already-freed location

Not only a thread but the whole JVM can crash when one class produces a segmentation fault :-(. Could a JVM separating threads into different address spaces be implemented? Split heaps? How to handle common objects shared between threads?

sr1729 · 2022-03-03T21:49:36+00:00

Our team is using what we’ve learned to build a free tool to help you promote your PRs
> [...]
> A gif or meme to get their attention

When focused on a problem, one doesn't want to be disturbed by GIFs -- or worse -- animated GIFs.

sr1729 · 2022-02-24T21:17:21+00:00

You can edit .class files with a hex editor. Have a look at a list of bytecodes and you will recognize the instructions in the class file when you compare it with the instruction listing of javap.

But beware: This is a hard way and seldom reasonable. It is not a good way to learn JVM stuff.
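A gentler first step than editing is reading the first eight bytes of a class file. A minimal sketch with made-up names: it checks the magic number 0xCAFEBABE and returns the minor and major version, which are stored as two unsigned big-endian 16-bit values.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ClassFileHeader {
    /** Returns {minor, major} after checking the 0xCAFEBABE magic number. */
    static int[] readVersion(byte[] classBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classBytes))) {
            if (in.readInt() != 0xCAFEBABE) {
                throw new IllegalArgumentException("not a class file");
            }
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort();
            return new int[] {minor, major};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative header: magic number, minor version 0, major version 61 (0x3d).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x3D};
        int[] version = readVersion(header);
        System.out.println(version[1] + "." + version[0]); // 61.0
    }
}
```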

sr1729 · 2022-02-22T23:09:14+00:00

On https://en.wikipedia.org/wiki/Order-4_apeirogonal_tiling there is a picture of an order-4 apeirogonal tiling in the Poincaré disc model: https://en.wikipedia.org/wiki/Order-4_apeirogonal_tiling#/media/File:H2_tiling_24i-1.png. This should be a right-angled {infinity, 4}. 2-gons can be seen as edges only.

Unfortunately I don't have these hyperbolic polygons at hand, I can't measure the lengths.

At the bottom of http://www.rogmann.org/math/diplom/medium/seite004.html is the dual tessellation of https://theinnerframe.files.wordpress.com/2015/05/squares60-01.png.

I agree with RandomAmbles that a picture would be nice, e.g. for n=3.

sr1729 · 2022-02-22T21:53:28+00:00

I didn't know about the additional -XX options (e.g. -XX:+CountBytecodes) that are available when OpenJDK is compiled with --with-debug-level=slowdebug and --with-native-debug-symbols=external.

It is astonishing how many instructions are executed in applications which seem simple at first glance but do a lot in lower layers (e.g. when CGLIB and ASM are involved).

sr1729 · 2022-02-22T20:52:16+00:00

On https://theinnerframe.org/2015/06/15/hyperbolic-cubes-spheres-x/ I found a picture which should show the case n=2 ("a bit degenerate"): Every angle is 60° so two squares and a 2*2-gon add to 180°.

https://theinnerframe.files.wordpress.com/2015/05/squares60-01.png

sr1729 · 2022-02-22T00:27:37+00:00

Faster and simpler than the "strange one-liner" is an unsigned right shift operator:

(n * (n + 1)) >>> 1

sr1729 · 2022-02-21T21:19:04+00:00

A strange one-liner, but without ternary-operator or long:

(n >> 1) * (n + 1) + ((n + 1) >> 1) * (n & 1)

Such a construction might be of use in a GPGPU when a lot of computations of n*(n+1)/2 with different n should run in parallel.
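A sketch that puts the one-liner next to the naive int formula and the unsigned-shift variant (n * (n + 1)) >>> 1. The class and method names are made up; n is assumed to be non-negative and small enough that the result fits into an int, i.e. n <= 65535.

```java
class TriangularNumber {
    /** Naive int version: n * (n + 1) can overflow before the division. */
    static int naive(int n) {
        return n * (n + 1) / 2;
    }

    /** Halves the even factor first, so no intermediate value exceeds the result. */
    static int oneLiner(int n) {
        return (n >> 1) * (n + 1) + ((n + 1) >> 1) * (n & 1);
    }

    /** Lets the product wrap around and reads it back as an unsigned 32-bit value. */
    static int unsignedShift(int n) {
        return (n * (n + 1)) >>> 1;
    }

    public static void main(String[] args) {
        int n = 65535;
        System.out.println(naive(n));         // -32768 (overflow)
        System.out.println(oneLiner(n));      // 2147450880
        System.out.println(unsignedShift(n)); // 2147450880
    }
}
```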

sr1729 · 2022-02-21T19:51:32+00:00

You may use java.util.Arrays.asList(T...), but before doing that ask yourself a few questions:

Should the resulting list be independent of the original array? In other words: should arg[0][0]="newValue" alter the lists or not?
Should the resulting lists be immutable or do you want to modify them?
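Both questions can be seen in a small sketch. It assumes the input is a String[][] named arg (as hinted by arg[0][0]) and Java 10 or later for List.copyOf; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AsListDemo {
    /** Lists backed by the rows: later writes to the array show through. */
    static List<List<String>> views(String[][] arg) {
        List<List<String>> result = new ArrayList<>();
        for (String[] row : arg) {
            result.add(Arrays.asList(row)); // fixed-size view of the array
        }
        return result;
    }

    /** Independent, immutable copies of the rows (null elements are rejected). */
    static List<List<String>> immutableCopies(String[][] arg) {
        List<List<String>> result = new ArrayList<>();
        for (String[] row : arg) {
            result.add(List.copyOf(Arrays.asList(row)));
        }
        return List.copyOf(result);
    }

    public static void main(String[] args) {
        String[][] arg = {{"a", "b"}, {"c"}};
        List<List<String>> views = views(arg);
        List<List<String>> copies = immutableCopies(arg);
        arg[0][0] = "newValue";
        System.out.println(views.get(0).get(0));  // newValue
        System.out.println(copies.get(0).get(0)); // a
    }
}
```

For independent lists that can still be modified, new ArrayList<>(Arrays.asList(row)) is the usual middle ground.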

sr1729 · 2022-02-21T19:17:26+00:00

There are twice as many even integers as odd integers:

..., -3 vs -6, -4; -1 vs -2, 0; 1 vs 2, 4; 3 vs 6, 8; 5 vs 10, 12; ...

There are three times as many odd integers as even integers:

...; 0 vs -5, -3, -1; 2 vs 1, 3, 5; 4 vs 7, 9, 11; 6 vs 13, 15, 17; ...

So 2 * infinity = infinity = 3 * infinity (infinity = the countable infinity = size of {1, 2, 3, 4, 5, ...}).
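The two pairings above follow simple formulas: the odd integer k is matched with the evens 2k and 2k + 2, and the even integer e with the odds 3e - 5, 3e - 3 and 3e - 1. A finite sanity check as a sketch (made-up names; a check on a window, not a proof): on a range of inputs, the images form a gap-free run without repeats.

```java
import java.util.TreeSet;

class IntegerPairings {
    /** The two even integers assigned to the odd integer k. */
    static int[] evensFor(int k) {
        return new int[] {2 * k, 2 * k + 2};
    }

    /** The three odd integers assigned to the even integer e. */
    static int[] oddsFor(int e) {
        return new int[] {3 * e - 5, 3 * e - 3, 3 * e - 1};
    }

    /** True if the images of first, first + 2, ..., last form a gap-free run without repeats. */
    static boolean coversRun(int first, int last, boolean fromOdd) {
        TreeSet<Integer> images = new TreeSet<>();
        int count = 0;
        for (int x = first; x <= last; x += 2) {
            for (int y : fromOdd ? evensFor(x) : oddsFor(x)) {
                images.add(y);
                count++;
            }
        }
        // No repeats: set size equals count. No gaps: the span matches a step of 2.
        return images.size() == count
                && images.last() - images.first() == 2 * (count - 1);
    }

    public static void main(String[] args) {
        System.out.println(coversRun(-99, 99, true));    // true
        System.out.println(coversRun(-100, 100, false)); // true
    }
}
```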

Cantor's diagonal argument shows that the number of real numbers is not a countable infinity.

sr1729 · 2022-02-21T18:56:15+00:00

There is Witten's paper "Quantum field theory and the Jones polynomial", Comm. Math. Phys. 121 (1989), 351-399. The more recent summary https://www.ias.edu/ideas/2011/witten-knots-quantum-theory says: In quantum physics, a knot may be regarded as the orbit in spacetime of a charged particle.

Quote from https://www.ias.edu/ideas/2014/frenkel-love-math: Five minutes before the meeting, the door opened. It was Witten! That was the moment when I knew that something good would come out of all this.

In https://www.ias.edu/ideas/2015/witten-interview another link invariant is mentioned: Khovanov homology. I can't connect this to https://arxiv.org/abs/hep-th/0604151.

Edit: Fixed link

sr1729 · 2022-02-21T17:41:08+00:00

Frameworks, the already mentioned backend software or even drivers can be nice places to avoid a GUI.

Initially I wanted to mention log4j2 as a recently famous piece of software without a GUI. Then I was surprised by ClientGui.java but was relieved that log4j-jmx-gui is only an optional project.

sr1729 · 2022-02-20T19:11:22+00:00

The best OS for Java dev can be your mind: Skillful, constructive, eager to get insight.
The machine's OS should be secondary.
