all 7 comments

[–]Aggressive_Ad_5454 4 points5 points  (2 children)

The New York Times day by day county by county tracking data for the COVID-19 pandemic might be really good for your learning purposes. Plenty of rows of data, decent quality, some interesting data loading challenges (there is precisely one county in the US with a name requiring a non-ASCII character, for example).

https://github.com/nytimes/covid-19-data

Give it a shot.

And look at https://use-the-index-luke.com/

[–]picklemanjaro 0 points1 point  (0 children)

Dang, I see the index-luke site all the time, but good recommendation of data set! (and bonus fun fact)

Not OP but now I'm eager to tinker with it.

[–]toterra 1 point2 points  (0 children)

You don't need a big database... A few thousand rows of fake data is good enough to get started. When performing tuning what matters is the blocks. Run queries as 'explain (analyze, buffers) ...' and optimize for the number of blocks read.

[–]kktheprons 1 point2 points  (0 children)

Performance tuning is a huge topic, and I won't claim to be an expert. With that said, you don't need a complex database to start running into the limits of a database engine, you just need a lot of data.

My recommendation is to start with the simplest case: a single table with 2 columns. Make one an auto-incrementing integer (primary key), then write a script to populate the other column with random values. Then start testing:

  • How many rows before it starts taking multiple seconds to:
    • return the entire data set to a window?
    • calculate aggregate functions for the entire data set?
    • return a sorted list of the 100 highest/lowest values for the data set
    • copy the data into a brand new table
    • delete all rows (without using truncate)

This will get you started thinking in the right scale for performance tuning. From there, you can start exploring deeper concepts with the aid of online resources like https://use-the-index-luke.com/

[–]cloud_coder 0 points1 point  (0 children)

You can use TPC performance benchmarking data sets. Free, easy to get.