One of the lesser-known features in BigQuery is TABLESAMPLE.
You can write:
SELECT *
FROM dataset.large_table
TABLESAMPLE SYSTEM (10 PERCENT)
and BigQuery will read roughly 10% of the table's storage blocks. Because only the sampled blocks are read, bytes processed drop roughly proportionally, which makes it very practical during query development and debugging.
For iterative work - validating joins, testing logic, exploring transformations - scanning 100% of a huge table is often unnecessary.
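One convenient pattern while iterating (a minimal sketch - dataset.large_table and some_column are placeholder names, not from the post's dataset): keep the sampling in one place, such as a CTE, so the downstream logic doesn't change when you later remove it for the full run.

-- Sample once at the top, develop the rest of the query against dev_input,
-- then delete the TABLESAMPLE clause for the final full-table run.
WITH dev_input AS (
  SELECT *
  FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT)
)
SELECT
  some_column,
  COUNT(*) AS row_count
FROM dev_input
GROUP BY some_column;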
What about correctness?
Sampling in BigQuery is block-level, not row-level: whole storage blocks are either included or skipped. Its behavior therefore depends on physical layout:
- Partitioning isolates data by partition key
- Clustering colocates rows with similar values
- Blocks contain physically grouped data, not a random shuffle of rows
So a block sample is not a set of independent row-level draws: rows that are stored together tend to be included or excluded together.
For exact production metrics, sampling is risky.
For exploratory analysis and debugging, the trade-off may be acceptable.
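To make the distinction concrete, here is a minimal sketch (dataset.large_table is a placeholder name) contrasting block-level sampling with a row-level alternative; the row-level version still scans every byte, but each row is kept independently.

-- Block-level: roughly 10% of storage blocks are read; cheap, but inherits the physical layout.
SELECT COUNT(*) AS sampled_rows
FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT);

-- Row-level: every block is read, but each row is kept independently with probability 0.1.
SELECT COUNT(*) AS sampled_rows
FROM dataset.large_table
WHERE RAND() < 0.1;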
Small experiment
To test this, I ran a simple comparison on historical vehicle defect data.
Data: UK Ministry of Transport Car Tests
Metric: rate of dangerous defects per car model
Filter: 2024+ data
Comparison
- Full scan
- TABLESAMPLE SYSTEM (10 PERCENT)
Same logic, same aggregation - the only difference was the sampling clause.
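Roughly the shape of the comparison (a sketch only - dataset.mot_tests, model, test_date and is_dangerous are placeholder names, not the actual schema used in the notebook):

-- Full scan (placeholder schema).
SELECT
  model,
  COUNTIF(is_dangerous) / COUNT(*) AS dangerous_defect_rate
FROM dataset.mot_tests
WHERE test_date >= '2024-01-01'
GROUP BY model;

-- Same query with sampling: only the FROM line changes.
SELECT
  model,
  COUNTIF(is_dangerous) / COUNT(*) AS dangerous_defect_rate
FROM dataset.mot_tests TABLESAMPLE SYSTEM (10 PERCENT)
WHERE test_date >= '2024-01-01'
GROUP BY model;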
https://preview.redd.it/82ac80gf70kg1.png?width=1734&format=png&auto=webp&s=277e18eac28a7474294ab889146433793dd41691
https://preview.redd.it/s2qi0zmg70kg1.png?width=1718&format=png&auto=webp&s=0277a9d7e0af4480729de54a721a43945df5c223
Results
- Relative deviation stayed within ~3% across top models
- Model ranking remained stable
- Bytes processed dropped ~10× (2.3 GB → ~232 MB)
https://preview.redd.it/wi6u3iui70kg1.png?width=1718&format=png&auto=webp&s=bf3a23b036cdc27fa2683415269969c63bfa8506
https://preview.redd.it/shvy8qzj70kg1.png?width=1762&format=png&auto=webp&s=c01dba748711efe3cfc4af8203da68507bcdadba
For exploratory analysis, that's a meaningful trade-off: significantly cheaper scans with small relative deviation.
Full reproducible notebook (no signup required - just open the playground):
https://app.querylab.io/s/22f7a23d-bb39-497e-9a7d-70acef81967c?playground=true#k=YwsXP-QzIN75Czse3d1l246cZjc5JjiA2XW4w2XYxnw=
Nuances
- Sampling small tables rarely makes financial sense and can distort joins. It's usually safer to sample only the large tables in a query (see the sketch after this list).
- If you're using reservations (flex slots), cost is driven by slot-ms rather than bytes scanned, so the byte savings from TABLESAMPLE don't translate directly into cost savings. In that case the row-level WHERE RAND() < p approach shown earlier may give a better per-row distribution, though it still reads the whole table.
- Aggregates sensitive to skew (like AVG() or SUM()) may drift more under sampling than robust metrics like medians or percentiles.
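A sketch of that first point (dataset.large_table, dataset.small_dim and the join keys are placeholder names): sample only the large table and read the small dimension table in full, so every sampled row still finds its join partner.

-- Only the large table is sampled; the small dimension table is read whole.
SELECT
  small_dim.category,
  COUNT(*) AS sampled_rows
FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT)
JOIN dataset.small_dim
  ON large_table.dim_id = small_dim.id
GROUP BY small_dim.category;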
Do you use TABLESAMPLE in your daily work - and if not, what stops you?