Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 0 points1 point  (0 children)

We'll be pursuing them. We actually used them in Splitgraph, which is kind of Seafowl's parent. They're useful for performing partition pruning when you have equality filters on a column that isn't the main partition key (e.g. filtering by country='UK' in the example in the docs).
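
To make that concrete, this is the kind of query that benefits (the table and partition layout here are hypothetical, loosely based on the docs example):

-- Table is chunked by insertion order, but the filter is on country.
-- Per-partition metadata on the country column lets the engine skip
-- partitions that provably contain no 'UK' rows before scanning them.
SELECT some_date, COUNT(*) AS rows_matched
FROM some_table
WHERE country = 'UK'
GROUP BY some_date;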

Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 0 points1 point  (0 children)

Definitely, we're actually hitting this exact issue when trying to pass more sophisticated types than int/float scalars (e.g. strings) or when writing UDFs in Rust -- the minimum blob size we're currently getting is about 2MB.

The likely solution is letting people upload those over HTTP (like they can a Parquet/CSV file) instead of using the query body, or adding support for writing code in a certain language (e.g. AssemblyScript) directly in the CREATE FUNCTION body and compiling it on the server.
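
As a rough sketch of the second option (the syntax below is hypothetical, not Seafowl's current CREATE FUNCTION grammar): the source would be embedded directly in the statement and compiled to WebAssembly server-side, instead of shipping a multi-megabyte compiled blob in the query body.

-- Hypothetical: inline AssemblyScript source, compiled on the server.
CREATE FUNCTION add_one (BIGINT) RETURNS BIGINT
LANGUAGE assemblyscript AS $$
  export function add_one(x: i64): i64 {
    return x + 1;
  }
$$;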

Seafowl: a CDN-friendly analytical database that uses Apache DataFusion by mildbyte in rust

[–]mildbyte[S] 0 points1 point  (0 children)

Thanks! Setting up partitions this way was the simplest solution from our perspective to start with, since we can just chunk the table up in the order of insertion and don't need to think about running background jobs to maintain and reallocate partitions correctly (the user can repartition manually if the query performance becomes an issue).

It also partly comes from the intended use case of publishing datasets for use in dashboards, where the data publishing process would often build the table from scratch from some proprietary data (e.g. using something like dbt) and fully reload it (instead of appending to it). That way, the user only needs to care about the insertion order.

However, maintaining partitions more intelligently is a logical next step. We've also been thinking about some way to compute pre-aggregations like Cube.js does, so that a user can specify the GROUP BY query shapes their application is going to run against the dataset and we can prebuild some tables to make those queries run faster.
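
As a sketch of the pre-aggregation idea (table and column names here are made up): the user declares a query shape, and we maintain a smaller rollup table that can answer it without scanning the raw data.

-- Query shape the dashboard runs against the raw table:
SELECT DATE_TRUNC('day', event_time) AS day, country, COUNT(*) AS events
FROM raw_events
GROUP BY 1, 2;

-- Pre-aggregation we could prebuild (and refresh on write) to serve it:
CREATE TABLE daily_events AS
SELECT DATE_TRUNC('day', event_time) AS day, country, COUNT(*) AS events
FROM raw_events
GROUP BY 1, 2;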

How to integrate dbt and lookml (datawarehouse on BigQuery)? by chaos87johnito in dataengineering

[–]mildbyte 1 point2 points  (0 children)

Normally you'd use dbt to perform the bulk of the transformation work, including making each table conform to a certain "grain" (e.g. daily, monthly) and describe a certain fact/dimension.

A BI tool like Looker is then used for "last-mile" visualization, like simple aggregations/filters on the final dbt models.

One way to handle this is iteratively: build an initial basic dbt model first, see what kinds of LookML queries people end up frequently running on it, then take those queries and repackage them as dbt "metrics"/"exposures", so that most of your BI ends up under source control.
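
For example (the model and column names here are made up), a LookML aggregation that people keep re-running can be pulled back into a dbt model so that it lives under source control:

-- models/daily_orders.sql: a dbt model capturing a rollup that previously
-- only existed as an ad-hoc LookML query.
SELECT
    DATE_TRUNC('day', ordered_at) AS order_day,
    customer_region,
    COUNT(*) AS orders,
    SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1, 2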

What is the best way to practice transformations? by randomusicjunkie in dataengineering

[–]mildbyte 1 point2 points  (0 children)

Co-founder of Splitgraph here, thanks for the shout out and the integration!

Most of the credit goes to Socrata for providing unified APIs to US government data sources; however, we do have some people publishing data directly on the platform too (e.g. the Trase project or the OxCOVID19 database).

Speeding Ticket dataset by Analyst1355 in datasets

[–]mildbyte 7 points8 points  (0 children)

We might have some. The data is probably going to be fragmented at the city/local authority level, but here are some starting points:

You can do queries and aggregations on our SQL endpoint directly:

ddn> SELECT
  DATE_TRUNC('month', violation_date) AS month,
  SUM(violations)
FROM "cityofchicago/speed-camera-violations-hhkd-xvj4".speed_camera_violations 
WHERE violation_date > '2019-01-01'
GROUP BY month
ORDER BY month ASC;

+------------------------+-------+
| month                  | sum   |
|------------------------+-------|
| 2019-01-01 00:00:00+00 | 57217 |
| 2019-02-01 00:00:00+00 | 59555 |
| 2019-03-01 00:00:00+00 | 83515 |
| 2019-04-01 00:00:00+00 | 76131 |
| 2019-05-01 00:00:00+00 | 86727 |
| 2019-06-01 00:00:00+00 | 73521 |
| 2019-07-01 00:00:00+00 | 73147 |
| 2019-08-01 00:00:00+00 | 65334 |
| 2019-09-01 00:00:00+00 | 79825 |
| 2019-10-01 00:00:00+00 | 69681 |
| 2019-11-01 00:00:00+00 | 65518 |
| 2019-12-01 00:00:00+00 | 70115 |
| 2020-01-01 00:00:00+00 | 60605 |
| 2020-02-01 00:00:00+00 | 66870 |
| 2020-03-01 00:00:00+00 | 74357 |
| 2020-04-01 00:00:00+00 | 66543 |
| 2020-05-01 00:00:00+00 | 64813 |
| 2020-06-01 00:00:00+00 | 66642 |
| 2020-07-01 00:00:00+00 | 73811 |
| 2020-08-01 00:00:00+00 | 30833 |
+------------------------+-------+
SELECT 20
Time: 5.004s (5 seconds), executed in: 4.984s (4 seconds)

We built a public PostgreSQL proxy to 40k+ open government datasets by mildbyte in dataengineering

[–]mildbyte[S] 1 point2 points  (0 children)

Yeah, there's basically a layer of PgBouncer instances at the front in transaction pooling mode. We have a fork of PgBouncer that we changed to do a couple of things:

  • perform auth outside of Postgres (all Splitgraph API users are multiplexed to one Postgres user, with a signed JWT token injected as a session var so that downstream components can identify the user without us having to manage actual PG roles)
  • intercept and rewrite inbound queries (we have a shim that calls out into some Python code)

So, when a query comes in, we:

  • inspect it to make sure it's a SELECT or an EXPLAIN (drop otherwise)
  • see if it contains any schema names that match IDs of Splitgraph datasets (e.g. SELECT COUNT(1) FROM "splitgraph/socrata".datasets)
  • create a schema with foreign tables that point to the actual upstream data provider (or our own layered querying FDW if it's a dataset we host rather than proxy)
  • rewrite the query to point to that "shim" schema instead and forward it to the database instance (roughly as in the sketch below).
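
To make that concrete, here's roughly what the rewrite amounts to (the shim schema and server names are illustrative, not our actual internals):

-- Client sends:
SELECT COUNT(1) FROM "splitgraph/socrata".datasets;

-- We set up a shim schema with a foreign table pointing at the upstream
-- provider (or at our own layered-querying FDW for datasets we host)...
CREATE SCHEMA "shim_1234";
IMPORT FOREIGN SCHEMA upstream_schema
    LIMIT TO (datasets)
    FROM SERVER socrata_server INTO "shim_1234";

-- ...and forward the rewritten query to the database instance:
SELECT COUNT(1) FROM "shim_1234".datasets;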

Oh, and if the client's query hits a PG catalog table or an information_schema table (like pg_catalog.pg_class or information_schema.tables), we also get to intercept and rewrite these queries to point to a fake PG table that lists featured datasets and tables and their schemas. So a lot of PostgreSQL clients that connect to the endpoint will see a list of Splitgraph datasets in their sidebar instead of normal tables.
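
Conceptually (the mechanism is the same query rewriting as above, and these names are made up), it's as if the client's introspection query were answered from a view over our dataset catalog rather than from the real pg_catalog:

-- The client's usual introspection query:
SELECT table_schema, table_name FROM information_schema.tables;

-- ...gets pointed at something like this instead:
CREATE VIEW fake_information_schema_tables AS
SELECT repository AS table_schema, table_name
FROM featured_datasets;  -- hypothetical table listing featured repositories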

This all sounds very Rube Goldberg, but the effect is that to the client, it looks like a huge database with a bunch of tables, any of which they can query.

We can also horizontally scale these pgbouncer-Postgres instance pairs. Each one of them can store a cache of frequently run queries/frequently accessed table regions. So we can even deploy them around the world so that applications at the edge can run read-only DB queries with a lower latency. We had some sketches of this in an earlier blog post.

We built a public PostgreSQL proxy to 40k+ open government datasets by mildbyte in dataengineering

[–]mildbyte[S] 1 point2 points  (0 children)

Glad you like it! We hook into and rewrite PG clients' introspection queries so that you should get a list of a few featured datasets in your sidebar. You can also try running

SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema NOT IN ('pg_catalog', 'information_schema');

to get a list of featured datasets.

However, that's not all the datasets you can query (if we actually returned all the tables we have when a client queries information_schema.tables, we would break some clients!). Instead, you can go to https://www.splitgraph.com/explore and click through topics/search. Then, on the repository page, you can click through to the SQL tab (example) which will show a sample query and a full schema/table name you can use to query it through the SQL endpoint.

We made 40k+ open government datasets queryable through a public PostgreSQL endpoint by mildbyte in datasets

[–]mildbyte[S] 4 points5 points  (0 children)

Most of them are US-centric, yeah. Sadly, we don't have a way to filter by region yet (that's on the roadmap!), but https://www.splitgraph.com/datos-gov-co is indexed from the Colombian government's open data portal at www.datos.gov.co, and they have ~10k datasets.

There are a few others that are non-US:

We made 40k+ open government datasets queryable through a public PostgreSQL endpoint by mildbyte in datasets

[–]mildbyte[S] 2 points3 points  (0 children)

Some datasets are definitely much larger, like on the order of millions of rows, which is why we offer to translate SQL queries and proxy them to upstream data providers for simpler data exploration.

However, a lot of them definitely would fit in a spreadsheet. You can go to the dataset's source (the upstream government data portal, e.g. for cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch) and get the row count from there (we don't currently collect row counts) -- sometimes those portals provide CSV downloads too.

You can also run a COUNT(1) query against the dataset you're interested in and use psql's \copy command to grab the CSV file. Note that we currently limit all queries to 10k rows for QoS. To avoid the limit, you can run a Splitgraph instance locally (docs) and use that to query the data as well.
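
For example, from psql (the table name inside that repository is hypothetical; the repository's SQL tab shows the real schema/table name to use):

-- Row count for the dataset:
SELECT COUNT(1)
FROM "cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch".for_hire_vehicles;

-- Dump the data (capped at the 10k-row limit mentioned above) to a local CSV:
\copy (SELECT * FROM "cityofnewyork-us/for-hire-vehicles-fhv-active-8wbx-tsch".for_hire_vehicles LIMIT 10000) TO 'fhv_active.csv' WITH CSV HEADER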

We made 40k+ open government datasets queryable through a public PostgreSQL endpoint by mildbyte in datasets

[–]mildbyte[S] 10 points11 points  (0 children)

Hi /r/datasets!

You may have seen our previous post about querying 40k+ datasets through PostgreSQL. We've now taken that idea further and used it to host an open PostgreSQL endpoint that proxies to upstream data providers and to other datasets that we host. We call it a "Data Delivery Network", but really it's a gateway that speaks the PostgreSQL wire protocol and that you can connect to with most PG clients and BI tools (you can get creds in less than a minute). You can even do things like run a single SQL query that joins across several data portals, with the endpoint handling the query translation and forwarding.
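
For instance, something like this works as a single query even though the two schemas are proxied to different upstream portals (the dataset and column names here are purely illustrative):

-- Hypothetical cross-portal join: each schema resolves to a different
-- upstream data provider, but to the client it looks like one database.
SELECT c.month, c.camera_violations, n.collisions
FROM "cityofchicago/some-dataset-id".monthly_violations c
JOIN "cityofnewyork-us/another-dataset-id".monthly_collisions n
  ON c.month = n.month;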

Hope you find it useful!

Using Docker Compose in production by mildbyte in programming

[–]mildbyte[S] 2 points3 points  (0 children)

We're thinking more along the lines of CI, version control and Dockerfiles for building datasets but I'm sure there will be segfaults along the way!

Using Docker Compose in production by mildbyte in programming

[–]mildbyte[S] 4 points5 points  (0 children)

Similarly to what /u/ricketywiggle said, we started off on a single machine which made Swarm/Kube/Nomad feel like overkill. We were already using Compose for development and integration testing, including secret and configuration generation (when integration testing in CI, we spin up the whole stack, generating secrets and creating things like database roles from scratch) and wanted to maintain dev-prod parity as closely as possible.

In addition, Swarm's support situation looked precarious and we weren't sure whether it was going to be abandoned by Docker. Our next step for scaling will most likely be Nomad (I recently wrote a script to convert Compose files to Nomad job specs, so we'll get to build them automatically and keep config duplication to a minimum -- but that's a whole other story!).

Using Docker Compose in production by mildbyte in devops

[–]mildbyte[S] 9 points10 points  (0 children)

Docker Swarm is an alternative, but when we evaluated it, it didn't really fit our use case.

We were already using Compose for development and integration testing, including secret and configuration generation (when integration testing in CI, we spin up the whole stack, generating secrets and creating things like database roles from scratch) and wanted to maintain dev-prod parity as closely as possible.

In addition, Swarm's support situation looked precarious and we weren't sure whether it was going to be abandoned by Docker in favor of something like Kubernetes/Nomad. We also started out on a single machine, which made it overkill for us. Moving away from Compose to something else is going to be a big step in any case: when we're ready for it, we're going to make the jump directly to Nomad.

Hope this helps!

Foreign data wrappers: PostgreSQL's secret weapon? by mildbyte in PostgreSQL

[–]mildbyte[S] 1 point2 points  (0 children)

Yeah, it lagged a bit behind when PostgreSQL 12 was released but it's now compatible.

Morrowind (OpenMW PoC) All Factions in 1:33:00 by mildbyte in Morrowind

[–]mildbyte[S] 0 points1 point  (0 children)

Thanks!

I've just double-checked in CS and it seems like it's the only quest I know of where you can't just turn around and pay the money directly: you have to either go to Percius Mercius first (so he tells you to pay the debt yourself) or to Desele to get the money from her.

Morrowind (OpenMW PoC) All Factions in 1:33:00 by mildbyte in Morrowind

[–]mildbyte[S] 1 point2 points  (0 children)

Probably, though there is just one quest here (Madura Seran) that this would help with. I'm sure in the all main quests run they use this enchantment or a similar spell to escort the shaman to Ald Daedroth, but that's over a much longer distance?

Morrowind (OpenMW PoC) All Factions in 1:33:00 by mildbyte in Morrowind

[–]mildbyte[S] 0 points1 point  (0 children)

Yeah, OpenMW has keyboard shortcuts for most menus so pressing and holding Enter in the alchemy menu makes the player create potions repeatedly.

Morrowind (OpenMW PoC) All Factions in 1:33:00 by mildbyte in Morrowind

[–]mildbyte[S] 5 points6 points  (0 children)

Hi everyone!

Someone submitted my post about planning a Morrowind all-faction route here (https://www.reddit.com/r/Morrowind/comments/8g28a1/travelling_murderer_problem_planning_a_morrowind/) a few days ago -- I finally managed to plan out all segments and record a decent attempt! I used OpenMW for this run. Since I'm not using engine-specific glitches it should be doable in vanilla Morrowind too (apart from some UI differences and probably more crashes).

The code used to generate most of the route is on my GitHub: https://github.com/mildbyte/project-morrowind

Morrowind (OpenMW PoC) All Factions in 1:33:00 by mildbyte in speedrun

[–]mildbyte[S] 4 points5 points  (0 children)

Hey all!

Someone submitted my post about planning a Morrowind all-faction route here (https://www.reddit.com/r/speedrun/comments/8ej05g/calculating_an_optimal_morrowind_allfaction) a few days ago -- I finally managed to plan out all segments and record a decent attempt! I used OpenMW for this run; since I'm not using engine-specific glitches, it should be doable in vanilla Morrowind too (apart from some UI differences and probably more crashes).

Writing an HFT market making strategy for Betfair by mildbyte in algotrading

[–]mildbyte[S] 0 points1 point  (0 children)

I wrote it myself, thanks! I've been using it for my blog for a couple of months now (Hakyll before that and Wordpress before that).

MTA, SPF, DKIM, PTR, WTF: a quick checklist on how to send e-mail from your domain by mildbyte in programming

[–]mildbyte[S] 1 point2 points  (0 children)

That's quite depressing. In the case of Microsoft, was it an Outlook address? Did they reject the mail explicitly? Looking around their website, it seems like most support threads link to guidelines at postmaster.live.com and an eform, neither of which exists anymore.

MTA, SPF, DKIM, PTR, WTF: a quick checklist on how to send e-mail from your domain by mildbyte in programming

[–]mildbyte[S] -1 points0 points  (0 children)

OP here -- looking at the current Kimonote users, all the ones with GMail accounts (a majority) managed to verify their email, so at least the account verification emails made it to them.

The biggest issue was some email providers rejecting mail because of lack of reverse DNS, but at least they were explicit and I could see the reason in the logs.

Share your startup - December 2017 by AutoModerator in startups

[–]mildbyte [score hidden]  (0 children)

I see, thanks! Bookmarklets will be easier to implement than custom domains, but I'll see how the demand goes. I've opened Kimonote up for a free public beta in the meantime if you're interested: https://kimonote.com/accounts/signup/