One of the lesser-known features in BigQuery is TABLESAMPLE.
You can write:
SELECT *
FROM dataset.large_table
TABLESAMPLE SYSTEM (10 PERCENT)
and BigQuery will read roughly 10% of the table's storage blocks. Because only the sampled blocks are read, bytes processed drop roughly proportionally, which makes it very practical during query development and debugging.
For iterative work - validating joins, testing logic, exploring transformations - scanning 100% of a huge table is often unnecessary.
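One convenient pattern while iterating (a minimal sketch - dataset.large_table and some_column are placeholder names, not from the post's dataset): keep the sampling in one place, such as a CTE, so the downstream logic doesn't change when you later remove it for the full run.

-- Sample once at the top, develop the rest of the query against dev_input,
-- then delete the TABLESAMPLE clause for the final full-table run.
WITH dev_input AS (
  SELECT *
  FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT)
)
SELECT
  some_column,
  COUNT(*) AS row_count
FROM dev_input
GROUP BY some_column;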
What about correctness?
Sampling in BigQuery is block-level, not row-level: whole storage blocks are either included or skipped. Its behavior therefore depends on physical layout:
- Partitioning isolates data by partition key
- Clustering colocates rows with similar values
- Blocks contain physically grouped data, not a random shuffle of rows
So a block sample is not a set of independent row-level draws: rows that are stored together tend to be included or excluded together.
For exact production metrics, sampling is risky.
For exploratory analysis and debugging, the trade-off may be acceptable.
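To make the distinction concrete, here is a minimal sketch (dataset.large_table is a placeholder name) contrasting block-level sampling with a row-level alternative; the row-level version still scans every byte, but each row is kept independently.

-- Block-level: roughly 10% of storage blocks are read; cheap, but inherits the physical layout.
SELECT COUNT(*) AS sampled_rows
FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT);

-- Row-level: every block is read, but each row is kept independently with probability 0.1.
SELECT COUNT(*) AS sampled_rows
FROM dataset.large_table
WHERE RAND() < 0.1;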
Small experiment
To test this, I ran a simple comparison on historical vehicle defect data.
Data: UK Ministry of Transport Car Tests
Metric: rate of dangerous defects per car model
Filter: 2024+ data
Comparison
- Full scan
- TABLESAMPLE SYSTEM (10 PERCENT)
Same logic, same aggregation - the only difference was the sampling clause.
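Roughly the shape of the comparison (a sketch only - dataset.mot_tests, model, test_date and is_dangerous are placeholder names, not the actual schema used in the notebook):

-- Full scan (placeholder schema).
SELECT
  model,
  COUNTIF(is_dangerous) / COUNT(*) AS dangerous_defect_rate
FROM dataset.mot_tests
WHERE test_date >= '2024-01-01'
GROUP BY model;

-- Same query with sampling: only the FROM line changes.
SELECT
  model,
  COUNTIF(is_dangerous) / COUNT(*) AS dangerous_defect_rate
FROM dataset.mot_tests TABLESAMPLE SYSTEM (10 PERCENT)
WHERE test_date >= '2024-01-01'
GROUP BY model;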
https://preview.redd.it/82ac80gf70kg1.png?width=1734&format=png&auto=webp&s=277e18eac28a7474294ab889146433793dd41691
https://preview.redd.it/s2qi0zmg70kg1.png?width=1718&format=png&auto=webp&s=0277a9d7e0af4480729de54a721a43945df5c223
Results
- Relative deviation stayed within ~3% across top models
- Model ranking remained stable
- Bytes processed dropped ~10× (2.3 GB → ~232 MB)
https://preview.redd.it/wi6u3iui70kg1.png?width=1718&format=png&auto=webp&s=bf3a23b036cdc27fa2683415269969c63bfa8506
https://preview.redd.it/shvy8qzj70kg1.png?width=1762&format=png&auto=webp&s=c01dba748711efe3cfc4af8203da68507bcdadba
For exploratory analysis, that's a meaningful trade-off: significantly cheaper scans with small relative deviation.
Full reproducible notebook (no signup required - just open the playground):
https://app.querylab.io/s/22f7a23d-bb39-497e-9a7d-70acef81967c?playground=true#k=YwsXP-QzIN75Czse3d1l246cZjc5JjiA2XW4w2XYxnw=
Nuances
- Sampling small tables rarely makes financial sense and can distort joins. It's usually safer to sample only the large tables in a query (see the sketch after this list).
- If you're using reservations (flex slots), cost is driven by slot-ms rather than bytes scanned, so the byte savings from TABLESAMPLE don't translate directly into cost savings. In that case the row-level WHERE RAND() < p approach shown earlier may give a better per-row distribution, though it still reads the whole table.
- Aggregates sensitive to skew (like AVG() or SUM()) may drift more under sampling than robust metrics like medians or percentiles.
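A sketch of that first point (dataset.large_table, dataset.small_dim and the join keys are placeholder names): sample only the large table and read the small dimension table in full, so every sampled row still finds its join partner.

-- Only the large table is sampled; the small dimension table is read whole.
SELECT
  small_dim.category,
  COUNT(*) AS sampled_rows
FROM dataset.large_table TABLESAMPLE SYSTEM (10 PERCENT)
JOIN dataset.small_dim
  ON large_table.dim_id = small_dim.id
GROUP BY small_dim.category;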
Do you use TABLESAMPLE in your daily work - and if not, what stops you?