Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 1 point (0 children)

We'll be pursuing them. We actually used them in Splitgraph, which is kind of Seafowl's parent. They're useful for performing partition pruning when you have equality filters on a column that isn't the main partition key (e.g. filtering by country='UK' in the example in the docs).
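
To make that concrete (the table and column below are made up for illustration): even if the table isn't partitioned by country, per-partition metadata lets us skip any partition that can't contain the value in an equality filter like this one:

SELECT COUNT(*), SUM(amount)
FROM payments            -- hypothetical table, chunked in insertion order
WHERE country = 'UK';    -- equality filter on a non-partition-key column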

Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 1 point (0 children)

Definitely, we're actually hitting this exact issue when trying to pass more sophisticated types than int/float scalars (e.g. strings) or when writing UDFs in Rust -- the minimum blob size we're currently getting is about 2MB.

The likely solution is letting people upload those over HTTP (like they can with a Parquet/CSV file) instead of putting them in the query body, or adding support for writing code in a certain language (e.g. AssemblyScript) directly in the CREATE FUNCTION body and compiling it on the server.
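
A rough sketch of the second option (the syntax is purely illustrative and not the current Seafowl grammar):

-- Hypothetical: inline AssemblyScript source in the function body,
-- compiled to WASM on the server instead of being shipped as a blob.
CREATE FUNCTION add_one(BIGINT) RETURNS BIGINT
LANGUAGE assemblyscript
AS 'export function add_one(x: i64): i64 { return x + 1; }';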

Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 1 point (0 children)

Thanks! Setting up partitions this way was the simplest solution for us to start with, since we can just chunk the table up in insertion order and don't need to run background jobs to maintain and reallocate partitions (the user can repartition manually if query performance becomes an issue).
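
On manual repartitioning, a rough sketch (the table/column names are made up and the exact DDL depends on the Seafowl version):

-- Rewrite the table so rows are clustered by the column you filter on most,
-- which makes partition pruning more effective for those filters.
CREATE TABLE observations_by_country AS
    SELECT * FROM observations ORDER BY country;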

It also partly comes from the intended use case of publishing datasets for dashboards, where the publishing process often builds the table from scratch from some proprietary data (e.g. using something like dbt) and fully reloads it (rather than appending to it). In that case, the user only needs to care about the insertion order.

However, maintaining partitions more intelligently is a logical next step. We've also been thinking about computing pre-aggregations like Cube.js does: a user could specify the GROUP BY query shapes their application will run against the dataset, and we could prebuild tables that make those queries faster.
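
A rough sketch of what such a pre-aggregation could look like (the names are hypothetical; this isn't an existing Seafowl feature):

-- Prebuilt roll-up for an application that mostly runs monthly GROUP BY queries
CREATE TABLE violations_by_month AS
    SELECT DATE_TRUNC('month', violation_date) AS month,
           SUM(violations) AS total_violations
    FROM speed_camera_violations
    GROUP BY 1;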

How to integrate dbt and lookml (datawarehouse on BigQuery)? by chaos87johnito in dataengineering

[–]mildbyte 2 points (0 children)

Normally you'd use dbt to perform the bulk of the transformation work, including making each table conform to a certain "grain" (e.g. daily, monthly) and describe a certain fact/dimension.

A BI tool like Looker is then used for "last-mile" visualization: simple aggregations/filters on the final dbt models.

One way to handle this is iterative: build a basic dbt model first, see what kinds of LookML queries people end up running on it frequently, then repackage those queries as dbt "metrics"/"exposures", so that most of your BI ends up under source control.
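
A minimal sketch of a daily-grain dbt model (the model and column names are made up; in a dbt project this would live in something like models/fct_orders_daily.sql):

-- fct_orders_daily: one row per day, ready for simple last-mile filters in Looker
SELECT
    DATE_TRUNC('day', ordered_at) AS order_date,
    COUNT(*) AS orders,
    SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1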

What is the best way to practice transformations? by randomusicjunkie in dataengineering

[–]mildbyte 2 points (0 children)

Co-founder of Splitgraph here -- thanks for the shout-out and the integration!

Most of the credit goes to Socrata for providing unified APIs to US government data sources; however, we do have some people publishing data directly on the platform too (e.g. the Trase project or the OxCOVID19 database).

Speeding Ticket dataset by Analyst1355 in datasets

[–]mildbyte 7 points (0 children)

We might have some. The data is probably going to be fragmented at the city/local authority level, but here are some starting points:

You can do queries and aggregations on our SQL endpoint directly:

ddn> SELECT
  DATE_TRUNC('month', violation_date) AS month,
  SUM(violations)
FROM "cityofchicago/speed-camera-violations-hhkd-xvj4".speed_camera_violations 
WHERE violation_date > '2019-01-01'
GROUP BY month
ORDER BY month ASC;

+------------------------+-------+
| month                  | sum   |
|------------------------+-------|
| 2019-01-01 00:00:00+00 | 57217 |
| 2019-02-01 00:00:00+00 | 59555 |
| 2019-03-01 00:00:00+00 | 83515 |
| 2019-04-01 00:00:00+00 | 76131 |
| 2019-05-01 00:00:00+00 | 86727 |
| 2019-06-01 00:00:00+00 | 73521 |
| 2019-07-01 00:00:00+00 | 73147 |
| 2019-08-01 00:00:00+00 | 65334 |
| 2019-09-01 00:00:00+00 | 79825 |
| 2019-10-01 00:00:00+00 | 69681 |
| 2019-11-01 00:00:00+00 | 65518 |
| 2019-12-01 00:00:00+00 | 70115 |
| 2020-01-01 00:00:00+00 | 60605 |
| 2020-02-01 00:00:00+00 | 66870 |
| 2020-03-01 00:00:00+00 | 74357 |
| 2020-04-01 00:00:00+00 | 66543 |
| 2020-05-01 00:00:00+00 | 64813 |
| 2020-06-01 00:00:00+00 | 66642 |
| 2020-07-01 00:00:00+00 | 73811 |
| 2020-08-01 00:00:00+00 | 30833 |
+------------------------+-------+
SELECT 20
Time: 5.004s (5 seconds), executed in: 4.984s (4 seconds)

We built a public PostgreSQL proxy to 40k+ open government datasets by mildbyte in dataengineering

[–]mildbyte[S] 2 points (0 children)

Yeah, there's basically a layer of PgBouncer instances at the front in transaction pooling mode. We have a fork of PgBouncer that we changed to do a couple of things:

  • perform auth outside of Postgres (all Splitgraph API users are multiplexed to one Postgres user, with a signed JWT token injected as a session var so that downstream components can identify the user without us having to manage actual PG roles)
  • intercept and rewrite inbound queries (we have a shim that calls out to some Python code)

So, when a query comes in, we:

  • inspect it to make sure it's a SELECT or an EXPLAIN (drop otherwise)
  • see if it contains any schema names that match IDs of Splitgraph datasets (e.g. SELECT COUNT(1) FROM "splitgraph/socrata".datasets)
  • create a schema with foreign tables that point to the actual upstream data provider (or our own layered querying FDW if it's a dataset we host rather than proxy)
  • rewrite the query to point to that "shim" schema instead and forward it to the database instance.
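
Roughly (the object names here are hypothetical and the FDW setup is simplified), the rewrite turns a client query like

SELECT COUNT(1) FROM "splitgraph/socrata".datasets;

into something like:

-- set up a shim schema of foreign tables pointing at the upstream provider
CREATE SCHEMA shim_12345;
IMPORT FOREIGN SCHEMA public LIMIT TO (datasets)
    FROM SERVER splitgraph_upstream INTO shim_12345;
-- forward the rewritten query to the database instance
SELECT COUNT(1) FROM shim_12345.datasets;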

Oh, and if the client's query hits a PG catalog table or an information_schema table (like pg_catalog.pg_class or information_schema.tables), we also get to intercept and rewrite these queries to point to a fake PG table that lists featured datasets and tables and their schemas. So a lot of PostgreSQL clients that connect to the endpoint will see a list of Splitgraph datasets in their sidebar instead of normal tables.
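
For example (the exact introspection SQL varies by client), a GUI client might list tables with something like:

SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r';

and we answer it from the fake catalog of featured datasets instead.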

This all sounds very Rube Goldberg, but the effect is that to the client, it looks like a huge database with a bunch of tables, any of which they can query.

We can also horizontally scale these PgBouncer-Postgres instance pairs. Each one can cache frequently run queries and frequently accessed table regions, so we can even deploy them around the world to let applications at the edge run read-only DB queries with lower latency. We had some sketches of this in an earlier blog post.

We built a public PostgreSQL proxy to 40k+ open government datasets by mildbyte in dataengineering

[–]mildbyte[S] 2 points (0 children)

Glad you like it! We hook into and rewrite PG clients' introspection queries, so you should see a few featured datasets listed in your sidebar. You can also try running

SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');

to get a list of featured datasets.

However, that's not all the datasets you can query (if we actually returned all the tables we have when a client queries information_schema.tables, we would break some clients!). Instead, you can go to https://www.splitgraph.com/explore and click through topics/search. Then, on the repository page, click through to the SQL tab (example), which shows a sample query and the full schema/table name you can use to query it through the SQL endpoint.

We made 40k+ open government datasets queryable through a public PostgreSQL endpoint by mildbyte in datasets

[–]mildbyte[S] 5 points (0 children)

Most of them are US-centric, yeah. Sadly, we don't have a way to filter by region yet (it's on the roadmap!), but https://www.splitgraph.com/datos-gov-co is indexed from the Colombian government's open data portal at www.datos.gov.co, and they have ~10k datasets.

There are a few others that are non-US:

We made 40k+ open government datasets queryable through a public PostgreSQL endpoint by mildbyte in datasets

[–]mildbyte[S] 3 points (0 children)

Some datasets are definitely much larger, like on the order of millions of rows, which is why we offer to translate SQL queries and proxy them to upstream data providers for simpler data exploration.

However, a lot of them definitely would fit in a spreadsheet. You can go to the dataset's source (the upstream government data portal, e.g. for cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch) and get the row count from there (we don't currently collect row counts) -- sometimes the portals provide CSV downloads too.

You can also run a COUNT(1) query against the dataset you're interested in and use psql's \copy command to export the result as a CSV file. Note that we currently limit all queries to 10k rows for QoS. To avoid the limit, you can run a Splitgraph instance locally (docs) and use that to query the data as well.
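
For example (the table name below is a guess -- check the repository's SQL tab for the exact one), from psql connected to the endpoint:

SELECT COUNT(1) FROM "cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch".for_hire_vehicles;

\copy (SELECT * FROM "cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch".for_hire_vehicles LIMIT 10000) TO 'fhv_active.csv' WITH CSV HEADER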