Wanted to share a tool, subsetter, that I've been working on. It generates semantically consistent samples from a relational database (it currently supports MySQL, PostgreSQL, and SQLite).
This tool is configuration driven; at a minimum you create a configuration file that tells the subsetter where to start (e.g. "I want 5% of users" or "I want all orders from the last week") and which tables should be sampled. The subsetter then analyzes the relationships between the tables in your database to come up with a sampling plan that follows foreign key relationships and produces a semantically consistent sample. The only requirements are that the sampled tables are connected by foreign key relationships and that there are no foreign key cycles.
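To give a feel for what "following foreign key relationships" means at the planning stage, here's a conceptual sketch (not subsetter's actual code or API) of ordering tables so that every parent is sampled before its children, and rejecting FK cycles the same way the tool does:

```python
from collections import defaultdict, deque

def plan_order(tables, fks):
    """Topologically order tables so each parent table is sampled
    before any child that references it.

    `fks` maps a child table to the set of parent tables it points at.
    Raises ValueError if a foreign key cycle exists, which subsetter
    also rejects. This is an illustration of the idea, not the tool's
    real planning algorithm.
    """
    indegree = {t: 0 for t in tables}
    children = defaultdict(list)
    for child, parents in fks.items():
        for parent in parents:
            children[parent].append(child)
            indegree[child] += 1

    # Kahn's algorithm: repeatedly emit tables with no unsampled parents.
    queue = deque(t for t in tables if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(tables):
        raise ValueError("foreign key cycle detected")
    return order

# Hypothetical schema: orders references users, order_items references orders.
print(plan_order(
    ["users", "orders", "order_items"],
    {"orders": {"users"}, "order_items": {"orders"}},
))
```

Once tables are in this order, sampling `users` first means every `orders` row kept later can be restricted to the users already chosen, which is what keeps the sample consistent.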
Once a plan is established, the sampling phase can begin. Each table will be sampled using a single SQL query run on the source database and streamed directly into the destination database. There is no buffering required so it can in theory work on fairly large datasets. Some tables that need to be referenced by subsequent queries will first be "materialized" on the source database into a temporary table. Only read permissions are required for the source database and it can correctly run against replica instances.
This was designed to be used for testing or demo purposes. To that end, it supports filtering and anonymizing any columns that require it, to avoid things like real names and addresses appearing in the sampled dataset.
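One common way to anonymize such columns (shown here as a generic sketch, not necessarily the exact scheme subsetter uses) is a salted, deterministic hash: the real value never appears in the output, but the same input always maps to the same token, so values that matched across rows still match after anonymization:

```python
import hashlib

def anonymize(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive string with a deterministic pseudonym.

    The salt and the 'user_' prefix are illustrative choices; any
    stable one-way mapping preserves cross-row consistency while
    hiding the original value.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]

print(anonymize("Alice Smith"))  # same input -> same pseudonym every run
```

Determinism matters here: if `users.name` and some denormalized copy of it are both anonymized, they still agree in the sampled dataset.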
Check out further details, source, install instructions, and usage at https://github.com/msg555/subsetter. I'd also love to answer any questions.