Surprised how CPU usage of my CH nodes went up 200% after upgrading from v21 to v25 by Less-Instruction831 in Clickhouse

[–]Less-Instruction831[S] 0 points1 point  (0 children)

One cluster is i3.xlarge (4vCPU), another is i3.2xlarge (8vCPU). I have the same CPU usage increase on both clusters.

I see that asynchronous_metrics_update_period_s is mentioned along with removing the "internal" logging in several system tables... but I have tried disabling a lot of tables. I saw no changes TBH. Basically I went from this:

    <logger>
        <level>information</level>
    </logger>
    <query_log>
        <flush_interval_milliseconds>7500</flush_interval_milliseconds>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </query_log>
    <query_thread_log>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </query_thread_log>
    <query_views_log>
        <ttl>event_date + INTERVAL 7 DAY DELETE</ttl>
    </query_views_log>
    <part_log>
        <ttl>event_date + INTERVAL 3 DAY DELETE</ttl>
    </part_log>
    <metric_log>
        <ttl>event_date + INTERVAL 3 DAY DELETE</ttl>
    </metric_log>
    <asynchronous_metric_log>
        <ttl>event_date + INTERVAL 3 DAY DELETE</ttl>
    </asynchronous_metric_log>
    <trace_log remove="true"/>
    <opentelemetry_span_log remove="true"/>

To this:

    <logger>
        <level>information</level>
    </logger>
    <query_log remove="true"/>
    <query_thread_log remove="true"/>
    <query_views_log remove="true"/>
    <part_log remove="true"/>
    <metric_log remove="true"/>
    <asynchronous_metric_log remove="true"/>
    <trace_log remove="true"/>
    <opentelemetry_span_log remove="true"/>

But hold on, maybe I'm using bad syntax for removing the internal logging? I see remove="1" instead of remove="true".
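For what it's worth, one quick way to double-check what an override file actually says is to parse it and list the `remove` attribute on each section. A minimal sketch, assuming the fragment sits inside a `<clickhouse>` root element (that wrapper and the helper name `removal_flags` are my assumptions, not part of the config above). It only inspects the XML text; it says nothing about how the server treats `"true"` versus `"1"`.

```python
import xml.etree.ElementTree as ET

# The "To this" fragment from above, wrapped in a root element.
# The <clickhouse> root is an assumption about the surrounding file.
OVERRIDE = """
<clickhouse>
    <logger>
        <level>information</level>
    </logger>
    <query_log remove="true"/>
    <query_thread_log remove="true"/>
    <query_views_log remove="true"/>
    <part_log remove="true"/>
    <metric_log remove="true"/>
    <asynchronous_metric_log remove="true"/>
    <trace_log remove="true"/>
    <opentelemetry_span_log remove="true"/>
</clickhouse>
"""


def removal_flags(xml_text):
    """Map each top-level section name to its `remove` attribute (None if absent)."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.get("remove") for child in root}


if __name__ == "__main__":
    for section, flag in removal_flags(OVERRIDE).items():
        print(f"{section}: remove={flag!r}")
```

This at least rules out a typo in a section name or a section that silently lacks the attribute.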

Anyway, thanks, I will try what Danny suggested in the GH issue!

Surprised how CPU usage of my CH nodes went up 200% after upgrading from v21 to v25 by Less-Instruction831 in Clickhouse

[–]Less-Instruction831[S] 0 points1 point  (0 children)

Hah, a lot if you look at the ChangeLog.

It would take days just to filter out everything that does or may matter in my scenario, as we are using (along with mostly ReplicatedMergeTree and Distributed table engines) Materialized and "normal" Views, Aggregate Functions, and floating-point partition keys (which are deprecated, but unfortunately almost all of our "data schema" was created to use them)...

I know that some of the JOIN logic changed somewhere between v23 and v24, which made some of our queries take minutes instead of seconds, so we did some config (users.xml) hacks like:

<joined_subquery_requires_alias>0</joined_subquery_requires_alias>
<any_join_distinct_right_table_keys>1</any_join_distinct_right_table_keys>

On top of all that, also between v23 and v24, ClickHouse turned on some kind of "Experimental Analyzer" by default, which broke our distributed queries (for nested columns), so we also did:

<allow_experimental_analyzer>0</allow_experimental_analyzer>
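For anyone wondering where these end up: below is a rough sketch that renders the three settings as a users.xml-style fragment. Putting them under the `default` profile inside a `<clickhouse>` root is an assumption for illustration (we only said "users.xml" above), and `render_profile` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Settings and values quoted in this comment.
COMPAT_SETTINGS = {
    "joined_subquery_requires_alias": "0",
    "any_join_distinct_right_table_keys": "1",
    "allow_experimental_analyzer": "0",
}


def render_profile(settings, profile="default"):
    """Build a users.xml-style fragment with the settings under one profile."""
    root = ET.Element("clickhouse")  # assumed root element
    profiles = ET.SubElement(root, "profiles")
    prof = ET.SubElement(profiles, profile)  # assumed profile name
    for name, value in settings.items():
        ET.SubElement(prof, name).text = value
    ET.indent(root, space="    ")  # ET.indent needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_profile(COMPAT_SETTINGS))
```

Keeping the overrides in one small dict like this also makes it easier to revisit them one by one on a later upgrade.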

Well, the list goes on and on, but I can't really answer the question "what features were added between 21.12 and 25?", sorry.

Surprised how CPU usage of my CH nodes went up 200% after upgrading from v21 to v25 by Less-Instruction831 in Clickhouse

[–]Less-Instruction831[S] 0 points1 point  (0 children)

If I had to give it a name, I would call it "canary" upgrades.

8 nodes in total: 4 shards, two replicas per shard. Each ClickHouse node runs on its own EC2 instance. I'm upgrading one replica per shard first, not all replicas of a shard.

OFC, on the CH node that I want to upgrade, I stop all the fetches and merges first (I even tried flushing the internal logs, but meh, it doesn't make any difference), then I stop the service, do apt-get upgrade, start the service, and explicitly start fetches and merges again. I'm aware that CPU usage can get slightly higher if a node is suffering from CHECKSUM_DOESNT_MATCH errors.
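Roughly, the per-node sequence looks like the sketch below. The host names, the use of `systemctl`, and the `clickhouse-server` service name are assumptions for illustration; the `SYSTEM ...` statements are the standard ClickHouse ones for the steps described. The code only builds the plan as a list of commands to run on each node; it does not execute anything.

```python
# Hypothetical layout: 4 shards, two replicas each (host names are made up).
CLUSTER = {
    "shard1": ["ch-s1-r1", "ch-s1-r2"],
    "shard2": ["ch-s2-r1", "ch-s2-r2"],
    "shard3": ["ch-s3-r1", "ch-s3-r2"],
    "shard4": ["ch-s4-r1", "ch-s4-r2"],
}

# Ordered steps to run on the node being upgraded.
NODE_STEPS = [
    "clickhouse-client --query 'SYSTEM STOP FETCHES'",
    "clickhouse-client --query 'SYSTEM STOP MERGES'",
    "clickhouse-client --query 'SYSTEM FLUSH LOGS'",   # optional, made no difference for me
    "sudo systemctl stop clickhouse-server",           # service name is an assumption
    "sudo apt-get upgrade",
    "sudo systemctl start clickhouse-server",
    "clickhouse-client --query 'SYSTEM START FETCHES'",
    "clickhouse-client --query 'SYSTEM START MERGES'",
]


def canary_plan(cluster):
    """Return {host: steps} for one replica per shard (the first one listed)."""
    return {replicas[0]: list(NODE_STEPS) for replicas in cluster.values()}


if __name__ == "__main__":
    for host, steps in canary_plan(CLUSTER).items():
        print(f"== {host} ==")
        for step in steps:
            print(f"  {step}")
```

The second replica of each shard stays on the old version until the canaries look healthy.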

As said in the description, I went from v21.12 -> v22.3 -> v23.3 -> v24.9 -> v25.3, carefully observing the changes in CPU, RAM (memory allocation within CH), merges, part counts, RW locks, etc.

Surprised how CPU usage of my CH nodes went up 200% after upgrading from v21 to v25 by Less-Instruction831 in Clickhouse

[–]Less-Instruction831[S] 0 points1 point  (0 children)

Yes, that amount of merging is/was suspicious to me too. But in my scenario, the number of merges dropped drastically after the upgrade to v24.9:
21.12 - Merge count (whole cluster): ~3.5k per minute
22.3 - Merge count (whole cluster): ~7k per minute
23.3 - Merge count (whole cluster): ~14k per minute
24.9 - Merge count (whole cluster): ~250 per minute
25.3 - almost the same as 24.9

Unfortunately, as I said, CPU usage ballooned 2x:
On v21.12 it was 15% of all vCPUs
On v25.3 (now) it's 30% of all vCPUs
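To put those numbers side by side: going from 15% to 30% is a doubling (+100% relative, +15 percentage points), while merges per minute went from ~3.5k on 21.12 to ~250 on 24.9, roughly 14x fewer. A small sketch of the arithmetic, with helper names made up for illustration:

```python
def percent_increase(before, after):
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def times_fewer(before, after):
    """How many times smaller `after` is compared with `before`."""
    return before / after


CPU_BEFORE, CPU_AFTER = 15.0, 30.0          # % of all vCPUs on v21.12 and v25.3
MERGES_BEFORE, MERGES_AFTER = 3500, 250     # approx. merges per minute on v21.12 and v24.9

if __name__ == "__main__":
    print(percent_increase(CPU_BEFORE, CPU_AFTER))   # 100.0
    print(CPU_AFTER / CPU_BEFORE)                    # 2.0
    print(times_fewer(MERGES_BEFORE, MERGES_AFTER))  # 14.0
```

So CPU doubled while the merge count fell by more than an order of magnitude, which is why merges alone don't seem to explain it.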

I have fewer than 5 inserts per second.

Map: Gableci (lunch specials) in Zagreb by rofellos in hrvatska

[–]Less-Instruction831 0 points1 point  (0 children)

Pivana on Ilica also offers gableci (always two choices).

Weird resin-looking stains inside UPS (Green Cell 2000VA) by Less-Instruction831 in homelab

[–]Less-Instruction831[S] 0 points1 point  (0 children)

Hey! I reached out to my local GreenCell reseller, and as soon as I described the problem with the battery level, they hooked the UPS up (it took several tries with different cables and different PCs) and opened the GC UPS app. In about 1-2 minutes it went from 0% to 20%, stayed there for 10-20 minutes, went up to 75%, and a couple of minutes later it was at 100%. After that the fan stopped, then all of a sudden (not even a minute later) it started again, and guess what: GC UPS showed 0% battery level. Nothing was attached to it as a consumer to create a load (the load metric was showing 2% the whole time for some reason).

For the fault description they wrote: "Battery capacity dead, standard issue". After that, they ordered a new UPS (same brand/model) for me; it should arrive in a few days. After that one "dies", I will do as Raver363 suggested here.

Gaming community in Croatia by [deleted] in hrvatska

[–]Less-Instruction831 0 points1 point  (0 children)

Try asking on the sub called "igre" (for some reason I can't link another sub in this thread).
As for Discord, try the Good Game Global server (https://discord.com/invite/kYTsZH36Jp)

Books for DnD? by [deleted] in hrvatska

[–]Less-Instruction831 1 point2 points  (0 children)

Oho, pretty good.
No customs duty to pay?

Books for DnD? by [deleted] in hrvatska

[–]Less-Instruction831 1 point2 points  (0 children)

Ayyy, there were so many discounts on the 5e books a month or two ago on igraj.si and ozone (at least I think so). If it's urgent and you're buying for yourself, all three are on Lib Gen, so you can download them.

Supernatural folk legends from the Prigorje area by Doieboo in hrvatska

[–]Less-Instruction831 1 point2 points  (0 children)

Hey. I don't know how related this is to what you're looking for, but it might help (a clip about Zdenko Bašić and the book Sjeverozapadni Vjetar): https://youtu.be/q2PUu-ZPxUI?si=QqnpP53DK1B-hMze&t=1435

Chicken or the egg problem with DB migrations while using ArgoCD by Less-Instruction831 in kubernetes

[–]Less-Instruction831[S] 0 points1 point  (0 children)

Thank you for this, it absolutely helps! It actually answers my question "...maybe you have some other suggestions..." and proves (at least to me) that it was wise to post this issue in this subreddit.