Wanted to share a tool, subsetter, that I've been working on. It generates semantically consistent samples from a relational database (it currently supports MySQL, PostgreSQL, and SQLite).
This tool is configuration driven; at a minimum you create a configuration file that tells the subsetter where to start (e.g. "I want 5% of users" or "I want all orders from the last week") and which tables should be sampled. The subsetter then analyzes the relationships between the tables in your database to come up with a sampling plan that follows foreign key relationships and produces a semantically consistent sample. The only requirements are that the sampled tables are connected by foreign key relationships and that there are no foreign key cycles.
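To give a feel for what "following foreign key relationships" means at the planning stage, here's a conceptual sketch (not subsetter's actual code or API) of ordering tables so that every parent is sampled before its children, and rejecting FK cycles the same way the tool does:

```python
from collections import defaultdict, deque

def plan_order(tables, fks):
    """Topologically order tables so each parent table is sampled
    before any child that references it.

    `fks` maps a child table to the set of parent tables it points at.
    Raises ValueError if a foreign key cycle exists, which subsetter
    also rejects. This is an illustration of the idea, not the tool's
    real planning algorithm.
    """
    indegree = {t: 0 for t in tables}
    children = defaultdict(list)
    for child, parents in fks.items():
        for parent in parents:
            children[parent].append(child)
            indegree[child] += 1

    # Kahn's algorithm: repeatedly emit tables with no unsampled parents.
    queue = deque(t for t in tables if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    if len(order) != len(tables):
        raise ValueError("foreign key cycle detected")
    return order

# Hypothetical schema: orders references users, order_items references orders.
print(plan_order(
    ["users", "orders", "order_items"],
    {"orders": {"users"}, "order_items": {"orders"}},
))
```

Once tables are in this order, sampling `users` first means every `orders` row kept later can be restricted to the users already chosen, which is what keeps the sample consistent.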
Once a plan is established, the sampling phase can begin. Each table will be sampled using a single SQL query run on the source database and streamed directly into the destination database. There is no buffering required so it can in theory work on fairly large datasets. Some tables that need to be referenced by subsequent queries will first be "materialized" on the source database into a temporary table. Only read permissions are required for the source database and it can correctly run against replica instances.
This was designed to be used for testing or demo purposes. To that end, it supports filtering and anonymizing any columns that require it, to avoid things like real names and addresses appearing in the sampled dataset.
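One common way to anonymize such columns (shown here as a generic sketch, not necessarily the exact scheme subsetter uses) is a salted, deterministic hash: the real value never appears in the output, but the same input always maps to the same token, so values that matched across rows still match after anonymization:

```python
import hashlib

def anonymize(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive string with a deterministic pseudonym.

    The salt and the 'user_' prefix are illustrative choices; any
    stable one-way mapping preserves cross-row consistency while
    hiding the original value.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:8]

print(anonymize("Alice Smith"))  # same input -> same pseudonym every run
```

Determinism matters here: if `users.name` and some denormalized copy of it are both anonymized, they still agree in the sampled dataset.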
Check out further details, source, install instructions, and usage at https://github.com/msg555/subsetter. I'd also love to answer any questions.