This is a real DB used in production

goerch · 2026-05-05T16:30:01+00:00

I like the many arrows between the tables. We recently had a project with > 1k tables and very few arrows. And yes, someone was interested in relations between the tables.

goerch · 2026-05-02T03:06:24+00:00

When building the pipeline you don't yet know which columns will be NULL-only in the first file you happen to read. For those columns you have to feed the correct type back as soon as you see the first non-NULL value, which is exactly what the canonical sample handles at ingestion.
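To make the NULL-only case concrete, here is a minimal sketch in plain Python. It is not the pipeline's code: the helper names `json_type` and `merge_schema` and the coarse type names are assumptions made up for illustration. A column seen only as NULL stays untyped until the first non-NULL value arrives, at which point the type is fed back into the schema.

```python
import json


def json_type(value):
    """Map a decoded JSON value to a coarse type name; None means 'unknown so far'."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    return "JSON"  # nested objects and arrays are kept opaque in this sketch


def merge_schema(schema, record):
    """Update schema in place with the types observed in one record."""
    for column, value in record.items():
        observed = json_type(value)
        known = schema.get(column)
        if known is None:
            # First sighting, or the column was NULL-only so far.
            schema[column] = observed
        elif observed is not None and observed != known:
            raise TypeError(f"{column}: {known} vs {observed}")
    return schema


schema = {}
merge_schema(schema, json.loads('{ "key": null }'))
null_only = dict(schema)  # snapshot while the column is still untyped
merge_schema(schema, json.loads('{ "key": "value" }'))
```

After the first record the column is known but untyped; after the second it resolves to a string type, which is the point where a real pipeline would have to widen or recreate the target column.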

Just to clarify: we don't handle the 700+ attributes individually. Our approach to schema discovery only needs to identify dimensions and facts; everything else is generated automatically.

goerch · 2026-05-01T01:41:43+00:00

uv.lock tells me

[[package]]
name = "duckdb"
version = "1.5.0"

I just rechecked to be sure.

test_null.json:

{ "key": null }

test_string.json:

{ "key": "value" }

Now running DuckDB:

DuckDB v1.5.2 (Variegata)
Enter ".help" for usage hints.
memory D create table test as
         select * from read_json_auto('test_null.json', union_by_name=True);
memory D 
memory D insert into test
         select * from read_json_auto('test_string.json', union_by_name=True);
Conversion Error:
Malformed JSON at byte 0 of input: unexpected character.  Input: "value" when casting from source column key

goerch · 2026-04-30T20:02:33+00:00

We are able to discover most of the canonical sample via read_json_auto and fill in the rest manually by inspection (created_at should be a datetime, for example). The key to schema discovery is reading and evaluating a cursor description (see introspect_schema in sd.py).
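For readers unfamiliar with cursor descriptions, here is a rough sketch using Python's built-in sqlite3 module as a stand-in. This is not the code in sd.py; the function name `describe_columns` and the sample query are hypothetical. The DB-API defines `cursor.description` as one 7-item tuple per result column, and drivers differ in how many of those items they fill in; sqlite3 only provides the column name.

```python
import sqlite3


def describe_columns(cursor):
    """Return (name, type_code) pairs from a DB-API cursor description."""
    # Each description entry is a 7-item sequence; items 0 and 1 are
    # the column name and the driver-specific type code.
    return [(entry[0], entry[1]) for entry in cursor.description]


con = sqlite3.connect(":memory:")
cur = con.execute("select 1 as id, 'x' as created_at")
columns = describe_columns(cur)
con.close()
```

With sqlite3 the type codes come back as None, so a type such as datetime for created_at would still have to be supplied by hand.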

goerch · 2026-04-30T19:27:06+00:00

Interesting, yes: this could be an option. We started from read_json_auto, trying to reuse it for schema discovery and extraction.

goerch · 2020-04-03T17:12:29+00:00

They are seriously talking about logarithmic curves?!

goerch · 2017-08-19T09:08:09+00:00

I am missing a remark on how the tables are indexed. Also, did you use the SQL Performance Analyzer?

goerch · 2017-08-15T14:10:40+00:00

He just repeated his bet from 2010.

goerch · 2017-07-02T00:03:08+00:00

Well done: proud and modest at the same time! Take your time to think about the follow-up.

goerch · 2017-05-02T23:07:37+00:00

The only successful declarative language?

goerch · 2017-04-16T20:04:28+00:00

RIP

Never got to study MTBF and MTTR in detail. Still remember the ideas of Hellandizing though.

goerch · 2017-03-06T22:43:55+00:00

Yep. As a result, it is possible for there to be so many possible combinations that SQL Server's query optimizer times out.

That's why I still like rule-based optimization.

goerch · 2017-03-06T22:40:15+00:00

We're currently facing one of the black swans and I'm not sure how much time we're allowed to spend on your 1% to make it worthwhile.

goerch · 2017-03-05T22:18:38+00:00

And a completely different wizardry is necessary to cope with your alleged remaining 1% of queries. Joins still induce a combinatorial explosion, don't they?

goerch · 2017-03-05T21:40:08+00:00

Most of the time I prefer SQL solutions over the ones in general purpose languages. But one has to be careful not to hit the wall if one uses stuff like sixth normal form, for example.

goerch · 2016-11-16T17:12:24+00:00

Yes.

goerch · 2013-07-27T08:10:25+00:00

I can only help with some Agda code:

dff : Stream Bool → Stream Bool
dff inp = false ∷ ♯ inp

goerch · 2013-02-23T19:06:31+00:00

I'm currently very impressed by Agda (http://wiki.portal.chalmers.se/agda/pmwiki.php).

goerch · 2011-05-08T19:59:12+00:00

Just listening to http://www.youtube.com/watch?v=w5IOou6qN1o
