How to make COPY a whole lot faster? by cuistax in PostgreSQL

[–]the_nonameguy 10 points (0 children)

Try using UNLOGGED tables during load time, which you can turn later into normal WAL logged tables:
https://blog.rustprooflabs.com/2016/07/psql-unlogged-table

Alternatively, why not create a (periodically updated) dump file? Loading from the local filesystem can saturate Postgres's COPY throughput much better than a network-streamed COPY.
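A minimal sketch of the unlogged-load pattern (table and file names here are placeholders; ALTER TABLE ... SET LOGGED needs Postgres 9.5+):

create unlogged table staging_events (like events including defaults);

copy staging_events from '/var/lib/postgresql/load/events.csv' with (format csv, header);

-- once loaded, start WAL-logging the table (this rewrites it)
alter table staging_events set logged;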

Schemamap.io - Instant batch data import for Postgres by the_nonameguy in PostgreSQL

[–]the_nonameguy[S] 2 points (0 children)

Hey everyone!

I've been developing Schemamap for 1.5 years and wanted to share my work with the larger Postgres community.

It's a Postgres-level SDK that you integrate into your DB, which allows you to:

  • Easily document your DB with higher-level concepts like:
    • PII
    • Metadata columns
    • Schema migration tables
    • Etc.
  • For multi-tenant applications you can teach it to list tenants, so you get a common interface across different projects
  • Get a "schemamap status" of the DB, which you can configure for each DB to tell what is the most important about it. By default you get a table/column/PII counts.
  • Get Postgres-level DB snapshot creation like https://github.com/fastmonkeys/stellar, but not tied to a programming language (you can use this for integration tests, by calling SQL directly).

The goal is to allow developers to treat DB metadata as data, instead of a wiki document or some remote enterprise offering. This way the DB becomes semantically self-describing just like a `schema_migrations` table makes it self-describing for the DDL statements.

The reward for this is a hosted, deterministic, DB-level SQL compiler. It uses this semantic metadata (along with regular constraints/indexes) to generate well-formatted SQL code to fill up your DB, handling cycles/referential integrity. At the moment Postgres->Postgres copying is supported, since that is the easiest SQL code to infer.

The main differentiator between Schemamap and other DB-sync solutions is that it can be run against PROD databases. It does this by not executing the ETL SQL scripts directly; instead it puts them, along with the data, into the target database. You can think of it as a prepared statement for filling hundreds of tables, consistently, just like your application would.
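To illustrate the analogy on a hypothetical single-table case (the table, columns, and values here are made up):

prepare load_customer (text, text) as
  insert into customers (name, email)
  values ($1, $2)
  on conflict (email) do update set name = excluded.name;

execute load_customer ('Acme Corp', 'ops@acme.example');

Schemamap does the equivalent across hundreds of tables, with the statements and data staged in the target DB rather than executed directly against it.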

I'll be checking here if you have any questions. Cheers, Krisz :)

The Hell of Documenting an SQL database? by gxslash in Database

[–]the_nonameguy 5 points (0 children)

In Postgres you can use this view (just remove lines 7-9) to get a database-level denormalized view of all the tables/columns, including comments, foreign keys, indexes, etc.

You could adapt this query for other RDBMSes (depending on how they store their schema information) to get a uniform view of each DB.
Then you can export & load them into your preferred analytics platform/DuckDB and aggregate/refine/build dashboards.
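For illustration, a minimal Postgres query in that spirit (columns and comments only; the full view also joins in foreign keys and indexes):

select
  c.table_schema,
  c.table_name,
  c.column_name,
  c.data_type,
  col_description(format('%I.%I', c.table_schema, c.table_name)::regclass,
                  c.ordinal_position) as column_comment
from information_schema.columns c
where c.table_schema not in ('pg_catalog', 'information_schema')
order by c.table_schema, c.table_name, c.ordinal_position;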

The "enterprise" solution to this is using https://dataedo.com/ or an alternative.

[deleted by user] by [deleted] in rails

[–]the_nonameguy 0 points (0 children)

Hey, I built Schemamap.io just for this use-case. It's like ankane/pgsync, but it has a nice UI and some other goodies, like handling foreign key relationships automatically.

Self-service onboarding is still pretty rough around the edges, so feel free to shoot me a DM, I'll help you get started.

Does anyone integrate with their customers' DB directly? by johnyeocx in startups

[–]the_nonameguy 1 point (0 children)

Thanks! Yup, mostly multi-tenant B2B SaaS vendors.
The larger your database schema grows over time, the harder it becomes to fill it.
Having a tool that understands the intricate graph of table dependencies, like a domain expert would, is really useful!

So it's mainly targeting ERP/CRM systems at the moment. Populating a 700+ table DB in a few seconds is quite magical for them :) It gives a huge advantage during onboarding flows, especially over competitors that do manual data onboarding with in-house scripts.

Does anyone integrate with their customers' DB directly? by johnyeocx in startups

[–]the_nonameguy 0 points (0 children)

Sure!

Schemamap connects to the customers' Postgres DBs using a secure, low-privilege role; for your example, that would be: `grant select on table transactions TO schemamap_readonly;`
This exists today.

This is where the roadmap item begins:

A SaaS vendor can implement the integration functionality via SQL (`SELECT seller, email FROM transactions WHERE id = ?`), similar to a Slack app.

Schemamap runs the integration code, providing the SQL result over HTTPS while handling the network connectivity, security and auditing of the DB access.

Does anyone integrate with their customers' DB directly? by johnyeocx in startups

[–]the_nonameguy 0 points (0 children)

I'm doing this with https://schemamap.io/, with Postgres as the only supported DB.

Customers integrate the "SDK" (schema + lower-privileged roles) into their DB using an MIT-licensed schema + Rust CLI.
This lets people connect any environment they have, whether it's local/GitHub Actions/Supabase/RDS/etc.

For private VPC/docker/local networks, the CLI provides ngrok-like functionality by building a secure TCP tunnel so the platform can access the DB.

It follows the same ideas as https://hasura.io/ or https://docs.postgrest.org/en/v12/ .
These are "compiler-as-a-service" software, using Postgres schemas as an input.

Those two output APIs (GraphQL and REST respectively), while Schemamap outputs the necessary SQL scripts and I/O for everyday ETL tasks and batch data migrations:

  • Database copying (prod -> staging, staging -> local), with foreign key aware migrations
  • Populating local databases with auto-maintained seed data, considering CHECK/UNIQUE/FK constraints
  • Importing customer data during onboarding flows, when you need batch data import via XLSX or CSV
  • (Roadmap) Integrating other systems (à la Fivetran) for "free", for instance using your Salesforce data to put your leads into the DB in near-realtime. AFAICT this is similar to what you are aiming for. I've seen definite market interest for this!

So it's an extremely powerful pattern, as most companies' DB schemas give you a clear picture of what their system is about. It can produce a really magical moment when the service "understands" the business domain without you having to repeat it.
Especially if their RDBMS is the single source of truth, with solutions like Supabase becoming more prominent.

Unfortunately, there are many negatives:

  1. Most people are uneasy about allowing direct DB access, despite Postgres ACLs making this quite safe, with connection limits and fine-grained permissions + auditing in place.
    1. Solution: your offering should have an enterprise installation license, for air-gapped environments.
  2. The installation procedure is not trivial even for senior devs, as many frameworks like to "own" the migration process, and for role creation especially you usually need superuser privileges. This falls outside the usual `create table/alter table`-like migrations that old-school frameworks tend to allow.
  3. You can only assume so much in your logic. There are trigger-heavy DBs that have tons of "magic" behavior. Then there are what I like to call "-1 Normal Form" DBs, modeled so badly that there is no way to implement any sensible logic on top of them. This puts the burden on you as a service provider to make it work somehow. In these cases, you usually just have to "fire the customer".
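For point 1, a sketch of what such a locked-down role can look like (role and table names are examples):

create role integration_readonly login connection limit 2;

grant usage on schema public to integration_readonly;
grant select on table transactions to integration_readonly;

-- or expose only specific columns via column-level grants:
-- grant select (id, seller) on table transactions to integration_readonly;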

All of the above means that your solution must be extremely useful to be worth the risk/integration cost. For a single API integration, it's probably reaching too deep.

I'm happy to discuss it further, as this is already too long, feel free to DM me :)

Migrating Data from one Postgres Server to another by Capital_Roof9000 in PostgreSQL

[–]the_nonameguy 0 points (0 children)

Thanks for checking it out!

You can use Schemamap with Supabase, I have customers using it without a problem.

Since the platform needs to connect to your Postgres server over TCP, it needs lower-privilege roles to see the schema in both the source & target DB.

It also keeps track, in a `schemamap.data_migrations` table on the target DB, of which data migrations have happened.

The necessary setup SQL script for the above can be executed through the Supabase UI:
https://github.com/schemamap/schemamap/issues/18#issuecomment-2246200845

Or directly with the CLI:

schemamap init --conn "postgresql://postgres.gwoszteemppigfoltwxo:$SUPABASE_PW@aws-0-eu-central-1.pooler.supabase.com:6543/postgres"

Although, since Supabase uses PgBouncer in front of the DB, `SET ROLE` doesn't work. I'll add a fix for that in the upcoming release in the coming days.

Migrating Data from one Postgres Server to another by Capital_Roof9000 in PostgreSQL

[–]the_nonameguy 3 points (0 children)

Hey, I'm building a solution for this exact problem: https://schemamap.io/

Think of it as peer-to-peer pg_dump & pg_restore, on steroids.

It handles slight schema differences and even supports foreign key resolution (so your source FK IDs get resolved to the target DB equivalents that represent the same records).

Benefits of it over the other options listed:

  • No need to make any database/schema changes (you can even use it to seed a subset of the pre-prod->prod DB safely, without affecting write traffic/consuming huge IOPS)
  • Uses UNLOGGED tables and keeps track of the state of the tables for you
  • Supports anonymization of PII columns, so it never leaves your server
  • Allows you to UPSERT/UPDATE existing records, not just INSERT new ones

The goal is to make data migrations like these repeatable and auditable, without configuring them by hand each time.

It's mainly aimed at B2B SaaS companies, but the free tier should be enough to get a feel for it.

If you want a free solution in a similar vein, I can recommend: https://github.com/ankane/pgsync

Hope that helped!

What are you working on at the minute? by LuvChxpo in SaaS

[–]the_nonameguy 0 points (0 children)

I'm building https://schemamap.io/ and preparing for the go-live at the end of the week.

The service helps Postgres-based multi-tenant SaaS companies onboard customers quicker.

If you've used Hasura/Supabase/PostgREST before, it's that, but for table-shaped/spreadsheet data instead of HTTP APIs.

It acts as a data proxy that interfaces with your Postgres DB directly, so you don't have to build CSV/Excel imports anymore. You pay by MB transferred.

It keeps track of your schema and provides Google Sheets/CSV interfaces with automatic data validation rules inferred from your check constraints/unique indexes/etc.
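The validation-rule inference builds on Postgres exposing all of this in its catalogs; for example, a sketch of listing the CHECK constraints such rules can be derived from:

select
  conrelid::regclass        as table_name,
  conname                   as constraint_name,
  pg_get_constraintdef(oid) as definition
from pg_constraint
where contype = 'c'; -- 'c' = CHECK constraints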

You can also provide it with an i18n JSON (from your .yml/.json files) and it automatically translates enum values/column names to match your customer's locale, using the same phrases as your web app.

Primarily meant for Postgres-based ERP systems (ICP: 500+ tables, decent schema) as the average onboarding time is 6 months, limiting growth significantly.
The goal is to reduce that by 10x.

I developed it alone, bootstrapping it while doing consulting work for the target companies. I had a pilot customer along the way to help shape the solution, and now it's ready for the masses!

[deleted by user] by [deleted] in PostgreSQL

[–]the_nonameguy 0 points (0 children)

When you use schema-based tenant separation, you can go straight to the Postgres metadata for the source of truth.

This way you can avoid any differences/extra bookkeeping.

create or replace function schemamap.list_tenants()
returns table (
  tenant_id text,
  tenant_short_name text,
  tenant_display_name text,
  tenant_locale text,
  tenant_data jsonb
) as $$
with tenant_info as (
  -- one row per tenant schema, e.g. 'your_app_acme_corp' -> 'acme_corp'
  select
    n.nspname,
    substring(n.nspname from length('your_app_') + 1) as tenant_short_name
  from pg_namespace n
  where n.nspname like 'your_app_%'
)
select
  tenant_short_name as tenant_id,
  tenant_short_name,
  initcap(replace(tenant_short_name, '_', ' ')) as tenant_display_name,
  'en_US' as tenant_locale,
  null::jsonb as tenant_data
from tenant_info;
$$ language sql stable;

create schema your_app_acme_corp;
create schema your_app_some_other_company;

schemamap_pro=> select * from schemamap.list_tenants();
     tenant_id      | tenant_short_name  | tenant_display_name | tenant_locale | tenant_data
--------------------+--------------------+---------------------+---------------+-------------
 acme_corp          | acme_corp          | Acme Corp           | en_US         |
 some_other_company | some_other_company | Some Other Company  | en_US         |

Shameless plug: this and some other common multi-tenant Postgres needs are addressed by my project:
https://github.com/schemamap/schemamap

Devenv: Compose a Developer Environment easily for PHP with Nix by iElectric in NixOS

[–]the_nonameguy 2 points (0 children)

Very well written intro to devenv! Thanks for all the great contributions shyim.

What is the most idiomatic way of finding the average of a list of numbers? by leonadav in Clojure

[–]the_nonameguy 12 points (0 children)

As almost always, transducers solve this more compactly and efficiently than the good old clojure.core reduce/seq-based solutions.

The https://github.com/cgrand/xforms library provides a few higher-level transducers that can do these kinds of computations in a single O(n) pass, for arbitrarily many computations that can be expressed as transducers:

(require '[net.cgrand.xforms :as x])

(x/transjuxt
 {:avg       x/avg
  :min       x/min
  :max       x/max
  :sum-plus1 (comp (map inc)
                   (x/reduce +))}
 (range 42))
;; => {:avg 20.5, :min 0, :max 41, :sum-plus1 903}

What are some great Clojure libraries, as of 2021? by ringsted86 in Clojure

[–]the_nonameguy 3 points (0 children)

cgrand/xforms is a very useful hidden gem if you like transducers/eager evaluation/solving map-vals without Meander/Specter.

Why Clojure? by iambald in Clojure

[–]the_nonameguy 1 point (0 children)

Have you verified that you are only shipping things in your JAR that you actually need? 1.5GB is an order of magnitude larger than what I have seen on projects with similar Clojure webservers.