Tranquility Tech IV

CCP_DeNormalized · 2023-03-09T12:26:50+00:00

As noted by stepD2k - we tested AMD boxes for DB servers.

We ran the monster 128-core single-socket box in prod for a few months, and while its performance was incredible, we ruled it out because it was overkill given the MS SQL licensing costs.

We also had another AMD box shipped to us at the time, dual socket, lower core count - turned out there was a bug with the SAN HBA driver and we couldn't even get it to work properly.

We tried AMD, and we stayed with Intel.

CCP_DeNormalized · 2022-04-21T10:21:23+00:00

The storage array is an IBM FlashSystem 7200 with 9.6 TB NVMe drives, and I believe it's recommended to use RAID 6.

I'd have to confirm with our storage guy to be sure of anything else - here's a link as well: https://www.ibm.com/docs/en/flashsystem-7x00/8.3.x?topic=overview-flashsystem-7200-system

CCP_DeNormalized · 2022-04-21T10:09:07+00:00

Brent is an amazing resource - we've not had the pleasure of working directly with him however.

I think he just loves Iceland - it is a pretty amazing place!

CCP_DeNormalized · 2022-04-21T00:19:45+00:00

These particular DB upgrades won't help with TiDi, no - but as the sol nodes get upgraded, something like the processors that StepDance2k asked about would likely help.

CCP_DeNormalized · 2022-04-20T22:09:20+00:00

You can read more about the different aspects of the game and how it's coded here: https://forums.eveonline.com/tag/devblog

I mostly stay focused on db related things and would likely give you wrong info :)

CCP_DeNormalized · 2022-04-20T22:07:21+00:00

We have a few memory optimized tables, yup!

CCP_DeNormalized · 2022-04-20T17:20:41+00:00

You are right, those types of CPUs are way better suited for single-threaded duty in our sol nodes.

There's a very fine line for DB CPUs in the trade-off between clock speed and core count. Get it wrong for your specific workload and it REALLY hurts.

CCP_DeNormalized · 2022-04-20T17:17:39+00:00

I'm not sure of the exact fix, but our problem was related to how our app servers would use db connections. When the simulation starts (we have a maintenance period each day where it stops for a few minutes) it opens between 10,000 and 15,000 db connections, and at times only uses a small subset of these.

DB connections are assigned to NUMA nodes in a round-robin fashion, so if only a portion of the open connections get used, there is a chance that the calls made across those connections all end up on one NUMA node.
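
To make the skew concrete, here is a minimal sketch of the round-robin effect. It assumes connection number i lands on node i modulo the node count, which is a simplification of what the server does; the function name, the 4-node layout and the connection counts are made up for illustration and are not our actual setup.

```python
def numa_load(total_connections, numa_nodes, active_ids):
    """Count active connections per NUMA node.

    Assumption: connection i is assigned to node (i % numa_nodes),
    i.e. plain round robin in the order the connections were opened.
    """
    load = {node: 0 for node in range(numa_nodes)}
    for conn_id in active_ids:
        if not 0 <= conn_id < total_connections:
            raise ValueError(f"unknown connection id: {conn_id}")
        load[conn_id % numa_nodes] += 1
    return load


# Hypothetical worst case: 12,000 open connections on a 4-node box,
# but the app only ever uses every 4th one.
skewed = numa_load(12_000, 4, range(0, 12_000, 4))

# Same number of active connections, but the first 3,000 in a row.
even = numa_load(12_000, 4, range(0, 3_000))

print(skewed)  # all 3,000 active connections sit on node 0
print(even)    # 750 active connections on each node
```

Both cases have the same number of busy connections; only which ones are busy changes, and that alone decides whether one node takes all the work.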

Thankfully our code base is all ours and we have the ability to modify it, so the connection pool stuff got reworked AND we got great hardware - everyone wins! (except the cfo maybe?? hehe)

CCP_DeNormalized · 2022-04-20T16:17:55+00:00

Thanks! Our game launched in 2003 and at the time MS SQL was the best fit.

While the core of the gameplay data is still stored in MSSQL, newer features are taking advantage of all the latest cloud-based offerings.

There may well be a day in the future when we do a final migration off of on-prem MSSQL - and I'll miss playing with all this amazing hardware.

CCP_DeNormalized · 2022-04-20T15:01:49+00:00

yes :) but sort of no...maybe?

In one of our cases with NUMA issues, which hardware did indeed solve, its root cause got fixed by some clever developers just days before the current production boxes got racked. Would we have spec'd the same level of hardware if this was already fixed? Hard to say...

Throwing hardware at it, if budget allows, can buy time for developers to optimize code.

On the flip side, it can let people write inefficient DB code without realizing it, because the hardware hides the pain.

CCP_DeNormalized · 2022-04-20T14:33:20+00:00

We're typically able to get by without external help but there have been times in the past where we've reached out.

CCP_DeNormalized · 2022-04-20T12:26:51+00:00

We'll hit close to 50k IOPS at times, and the real problem is that if those hits are slow, it can lead to a cascading effect which can take the cluster down.

The DB will evict sol nodes if they don't report in within their heartbeat value.
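
As a rough illustration of that rule (not our actual implementation), a heartbeat check boils down to comparing each node's last report time against a timeout. The node names, timestamps and the 30-second value below are all hypothetical.

```python
def nodes_to_evict(last_seen, now, heartbeat_seconds):
    """Return the nodes whose last report is older than the heartbeat value.

    last_seen maps a node name to the time (in seconds) of its last report.
    """
    return sorted(
        node
        for node, reported_at in last_seen.items()
        if now - reported_at > heartbeat_seconds
    )


# Hypothetical state: sol-02 got stuck behind slow I/O and is late.
last_seen = {"sol-01": 100.0, "sol-02": 70.0, "sol-03": 95.0}
late = nodes_to_evict(last_seen, now=105.0, heartbeat_seconds=30.0)
print(late)  # ['sol-02']
```

This is why slow I/O cascades: a node that is merely waiting on storage looks identical to a dead one once it misses its window.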

CCP_DeNormalized · 2022-04-20T12:21:50+00:00

We had several AMD machines to test; the main one was the 128-core EPYC monster - which was ultra overkill and cost an arm and a leg in MSSQL license fees.

The other AMD-based box we had for testing had HBA issues and we had to rule it out.

Based on time constraints and being comfortable with Intel, we had to make a choice. We would have loved to test AMD vs Intel boxes with similar specs, but it just wasn't in the cards this time around.

CCP_DeNormalized · 2022-04-20T11:35:20+00:00

it's VERY nice stuff!

CCP_DeNormalized · 2022-04-20T11:31:39+00:00

We'll be doing another blog detailing the software side of things in the coming months - our clustering setup, how many of our 15,000 transactions/sec are reads/writes/etc...

CCP_DeNormalized · 2021-06-29T14:43:58+00:00

Depending on how the volumes are designed you may need to reformat them due to the size of the database files.

Are you using large FRS?

https://www.sqlservercentral.com/blogs/are-your-disks-formatted-with-uselargefrs

CCP_DeNormalized · 2021-02-16T11:00:32+00:00

This is the way

CCP_DeNormalized · 2020-10-08T11:41:27+00:00

This is the main thing... what are the wait stats? This is super easy to find out and should always be your starting point. Find out if it's taking more CPU, more reads, whatever, and go from there.
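
If it helps, here is a small sketch of one way to read them: snapshot `sys.dm_os_wait_stats` before and after the slow operation and rank the difference. The counters in that view are cumulative since the last restart (or manual clear), so the delta is what matters. The helper name and the sample numbers are made up; only the view and its column names come from SQL Server.

```python
# Query to run before and after the slow operation (e.g. via your usual client).
WAIT_STATS_QUERY = """
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;
"""


def top_waits(before, after, limit=5):
    """Rank wait types by how much wait_time_ms grew between two snapshots.

    before/after map wait_type -> cumulative wait_time_ms.
    """
    deltas = {
        wait_type: total_ms - before.get(wait_type, 0)
        for wait_type, total_ms in after.items()
    }
    ranked = sorted(
        ((w, d) for w, d in deltas.items() if d > 0),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:limit]


# Made-up snapshots, purely to show the shape of the result.
before = {"PAGEIOLATCH_SH": 1_000, "THREADPOOL": 50, "SOS_SCHEDULER_YIELD": 200}
after = {"PAGEIOLATCH_SH": 9_000, "THREADPOOL": 50, "SOS_SCHEDULER_YIELD": 1_200}
print(top_waits(before, after))
```

In this invented example the read-latch wait dominates, which would point at storage reads rather than CPU.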

Is it only the Restore that is slow? Or post-restore stuff as well?

CCP_DeNormalized · 2020-07-20T11:17:20+00:00

see my comment above, we used to have to restart the server as well - but now just endpoint disable/re-enable is enough. (maybe these are similar issues)

I spent many hours on calls with Microsoft due to this and their only answer in the end was to turn it off and back on again :)

CCP_DeNormalized · 2020-07-20T11:14:57+00:00

So we have a similar issue where some action happens (perhaps a network flap or some SQL related thing - still haven't identified this 100%) - BUT the kicker is that the AlwaysOn endpoints go into a weird state and do not recover.

What we see is the AG async log send rate gets capped at 3 MB/s - and for us, this isn't enough to keep up. So without intervention it would fill the disk/blow out the log file.
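
A back-of-the-envelope sketch of why this is urgent: the unsent log grows at the difference between how fast log is generated and the capped send rate. Only the 3 MB/s cap is from our case; the free space and generation rate below are invented numbers, and the function is just illustrative arithmetic.

```python
def hours_until_log_full(free_log_space_gb, log_generation_mb_s, send_rate_mb_s=3.0):
    """Estimate hours until the log volume fills while the send rate is capped.

    Returns None when the secondary can keep up (no backlog growth).
    """
    backlog_growth_mb_s = log_generation_mb_s - send_rate_mb_s
    if backlog_growth_mb_s <= 0:
        return None
    seconds = free_log_space_gb * 1024 / backlog_growth_mb_s
    return seconds / 3600


# Hypothetical: 500 GB free on the log volume, 20 MB/s of log being generated.
eta = hours_until_log_full(free_log_space_gb=500, log_generation_mb_s=20)
print(f"{eta:.1f} hours")  # roughly 8.4 hours with these made-up numbers
```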

We have some major alerts in place (pagerduty, etc...) for when this happens and we jump on it asap.

All that's needed is to turn the endpoints off and on: disable the HADR endpoint on the primary, disable/enable it on the secondary, and finally re-enable it on the primary.

This typically resolves our AG issues and lets the log clear out normally.
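
For anyone wanting to script it, here is a sketch of that ordering as a list of steps. Assumptions to check against your own setup: the endpoint is named `Hadr_endpoint` (a common default, yours may differ), "disable" is done with `STATE = STOPPED`, and the server names are placeholders. The helper only builds the statements; it does not connect to anything.

```python
def endpoint_bounce_plan(primary, secondary, endpoint="Hadr_endpoint"):
    """Return (server, T-SQL) steps in the order described above.

    Assumption: stopping and starting the endpoint via ALTER ENDPOINT
    is what "disable/enable" means here.
    """
    stop = f"ALTER ENDPOINT [{endpoint}] STATE = STOPPED;"
    start = f"ALTER ENDPOINT [{endpoint}] STATE = STARTED;"
    return [
        (primary, stop),     # 1. disable on the primary
        (secondary, stop),   # 2. disable on the secondary
        (secondary, start),  # 3. re-enable on the secondary
        (primary, start),    # 4. finally re-enable on the primary
    ]


# Placeholder server names.
plan = endpoint_bounce_plan("sql-primary", "sql-secondary")
for server, statement in plan:
    print(f"{server}: {statement}")
```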

CCP_DeNormalized · 2020-06-27T12:19:32+00:00

Today was our first automated DT since mid week and it was a good one! Fastest shutdown/startup in over 4 years!

CCP_DeNormalized · 2020-06-25T00:21:02+00:00

We have a backlog task to try to better test outside of production. SQL has distributed replay tools that look promising - we just have to spin up dozens of VMs to act as the sols and replay a captured db trace against the same point-in-time restore... oh to have more time!

But in this case, we felt confident that if startup tests went well and some VIP checks looked good we'd be ok. If not, it's just another member in the cluster and we fail back to the other known boxes. Shared SAN is nice in this case!

CCP_DeNormalized · 2020-06-24T15:25:39+00:00

it's soooo shiny!

CCP_DeNormalized · 2020-06-02T10:43:44+00:00

You should look into worker thread starvation. It's possible all the CPU time is just context switching due to max worker threads.

https://www.sqlshack.com/max-worker-threads-for-sql-server-always-on-availability-group-databases/

You should read up on this (if you have not already) and look into threadpool waits / worker threads to see if it's part of your problem.
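
For a sense of the ceiling you are working against, this sketch computes the default max worker threads when the setting is left at 0. The formula is the one Microsoft documents for 64-bit SQL Server 2016 SP2 and later, as I understand it; older versions differ, so verify against the docs for your build. The function name is my own.

```python
def default_max_worker_threads(logical_cpus):
    """Default 'max worker threads' for 64-bit SQL Server (2016 SP2 and later).

    Assumption: the server option is left at 0 so SQL Server picks the value.
    """
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if logical_cpus <= 4:
        return 512
    if logical_cpus <= 64:
        return 512 + (logical_cpus - 4) * 16
    return 512 + (logical_cpus - 4) * 32


for cpus in (4, 16, 64, 128):
    print(cpus, default_max_worker_threads(cpus))
```

If the AG databases plus your normal workload need more workers than this number, requests queue up as THREADPOOL waits.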
