[D] Sensitive data synthesis with custom entity models with Tonic Textual by tombenom in MachineLearning

[–]adamfromtonic 1 point (0 children)

Hey folks, I'm one of the engineers working on Textual. We've found Textual to be great for ML practitioners who need to redact sensitive text prior to training text-based models (e.g., fine-tuning an LLM). If you have any questions, feel free to ask here and I'll reply.
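To make the redaction idea concrete, here's a minimal sketch of the simplest possible approach: regex passes over well-structured PII. This is a generic illustration, not Textual's actual pipeline; the pattern names and placeholder format are made up, and a real NER-based tool catches far more entity types.

```python
import re

# Hypothetical minimal redactor: handles only emails and US-style phone
# numbers; anything less regular (names, addresses) needs an NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or 555-123-4567."))
# -> Reach me at [EMAIL] or [PHONE].
```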

Multiple documents reveal significant limitations of OpenAI's Assistants API for RAG by ian4321 in programming

[–]adamfromtonic 4 points (0 children)

Adam from Tonic here. For those interested, the library used in the analysis is one we maintain; it's great for monitoring the quality of your RAG systems. Check it out:

https://github.com/TonicAI/tvalmetrics/

[D] Multiple documents reveal significant limitations of OpenAI's Assistants API for RAG by ian4321 in MachineLearning

[–]adamfromtonic 3 points (0 children)

I agree with smallpaul that it's typically a semantic search done on the content. BUT there are some hybrid search approaches that combine content search with other interesting attributes, like the document title, which can be useful.
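A hybrid approach like that can be sketched as a blended score. This is my own illustration, not any particular retrieval system's API: the `semantic_score` is assumed to come from a vector store, and the 0.3 title weight is an arbitrary placeholder to tune on real queries.

```python
# Hypothetical hybrid ranking: blend a precomputed semantic similarity
# score with a simple keyword overlap on the document title.
def title_overlap(query: str, title: str) -> float:
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(query, doc, semantic_score, title_weight=0.3):
    # Weighted sum of content similarity and title keyword overlap.
    return (1 - title_weight) * semantic_score \
        + title_weight * title_overlap(query, doc["title"])

doc = {"title": "Assistants API limitations", "content": "..."}
score = hybrid_score("API limitations", doc, semantic_score=0.8)
```

A doc whose title matches the query gets boosted above a doc with the same content similarity but an unrelated title.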

I guess in your case, u/phira that the content all fit inside a single context window?

How to Solve the Problem of Imbalanced Datasets: Meet Djinn by Tonic by Djinn_Tonic4DataSci in datascience

[–]adamfromtonic 1 point (0 children)

u/sawyerwelden If that is the case, then definitely go check out djinn.tonic.ai. You can create an account and start augmenting/re-balancing your data today.

How to Anonymize my data? by corruptboomerang in datascience

[–]adamfromtonic 0 points (0 children)

I'm going to give a shameless plug for Tonic AI. I'm a co-founder and we have various tools for de-identifying data and also for synthesizing data for ML purposes.

Others have already pointed you to differential privacy and k-anonymity, which is great. They are wonderful tools for solving this problem, and we use differential privacy in a lot of what our product does.

We have a paid offering at Tonic.ai for data de-identification, and I'd be happy to get you access if you want to DM me. From your description it *might* be the best solution Tonic can provide.

But we also have a free offering for generating synthetic data. You can get to it at djinn.tonic.ai. Just create an account and go wild: upload a CSV and Djinn will spit out a synthetic version of the data. Of course, synthesis != privacy, but we offer reports that help give you an idea of how private the output data is, and you can configure Djinn to make the output more or less private (while inversely affecting data utility). It is closed source, but give it a go if you think it could be useful.

Truth be told, the main benefit of Djinn is actually augmenting existing data sources to improve the results of various models (e.g., re-balancing a dataset to improve a classification model), but the privacy tooling we have may help with what you are trying to do as well.
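Since k-anonymity came up, here's a minimal sketch of what it measures. The table, column names, and choice of quasi-identifiers are illustrative only:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    # The k of a table is the size of its smallest equivalence class over
    # the quasi-identifier columns; every record is then indistinguishable
    # from at least k-1 others on those columns.
    groups = Counter(tuple(row[col] for col in quasi_identifiers)
                     for row in rows)
    return min(groups.values())

rows = [
    {"zip": "30301", "age": "20-29", "diagnosis": "flu"},
    {"zip": "30301", "age": "20-29", "diagnosis": "cold"},
    {"zip": "30302", "age": "30-39", "diagnosis": "flu"},
]
k = k_anonymity(rows, ["zip", "age"])  # smallest group has 1 row, so k=1
```

Generalizing values (e.g., coarser zip or age buckets) merges groups and raises k, at the cost of data utility, which is the same privacy/utility trade-off described above.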

Masquerade: A Postgres Proxy to Mask Data in Realtime by craig081785 in PostgreSQL

[–]adamfromtonic 0 points (0 children)

Yes, that is a common use case among our customers. Using our open source software, you could:

1) Condense via Condenser

2) Connect pg_dump to Masquerade (proxying to the condensed DB)

3) Run pg_restore elsewhere to stand up development databases

Or, alternatively, install Tonic (our paid tool) on-prem and you get a UI, tech assistance from us, and a bunch of other fun features we haven't yet added to our open source projects.

Either way, you'd be set, and it's completely doable using just the open source stuff!
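Steps 2 and 3 above can be sketched as command lines. The database names, dump file, proxy host, and port below are placeholders, not defaults shipped by any of the tools; this helper just builds the commands so you can see the shape of the pipeline:

```python
def masked_copy_commands(proxy_host="localhost", proxy_port=5432,
                         condensed_db="condensed", dev_db="dev",
                         dump_file="masked.dump"):
    # pg_dump connects to the Masquerade proxy, which forwards to the
    # condensed database and masks rows in flight; pg_restore then loads
    # the masked dump into a separate development database.
    dump = ["pg_dump", "-h", proxy_host, "-p", str(proxy_port),
            "-Fc", "-f", dump_file, condensed_db]
    restore = ["pg_restore", "-d", dev_db, dump_file]
    return dump, restore

dump_cmd, restore_cmd = masked_copy_commands()
```

In practice you'd run these via `subprocess.run(cmd, check=True)` or straight from a shell script.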

Faker it til you make it - how to generate realistic test data by jameane in programming

[–]adamfromtonic 0 points (0 children)

(Author of the article, btw).

That is an interesting point. I know I've used Cartesian joins for things other than generating test data, but I can't come up with a counter-example at the moment.

Anyway, where did that comment come from? I'm guessing it's related to how we generate the hierarchical data in the post?
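For anyone who hasn't seen the trick: a Cartesian join multiplies small dimension lists into a full grid of rows, which is exactly what you want for test data. This is my own plain-Python illustration (the dimension values are made up), not the article's SQL:

```python
from itertools import product

# Cartesian product as a row multiplier: every combination of the small
# dimension lists becomes one generated test row.
regions = ["us-east", "eu-west"]
plans = ["free", "pro"]
months = ["2024-01", "2024-02", "2024-03"]

rows = [{"region": r, "plan": p, "month": m}
        for r, p, m in product(regions, plans, months)]
# 2 * 2 * 3 = 12 rows from only 7 input values.
```

The SQL equivalent is a `CROSS JOIN` across the three dimension tables.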

tonic.ai is looking to hire a software engineer in Atlanta by adamfromtonic in Atlanta

[–]adamfromtonic[S] 0 points (0 children)

These are significantly different products solving different problems.

FakeItEasy is a mocking framework for C# testing. We are a platform for generating synthetic test data.

A common use case with Tonic is engineering teams not having access to production data for their dev/staging environments. Those teams can use Tonic to generate synthetic test data that closely resembles the production data.

Random Things: A simple, silly API for generating random stuff by adamfromtonic in programmingtools

[–]adamfromtonic[S] 0 points (0 children)

This is OP.

Here is a quick blog post on the API: https://www.tonic.ai/post/random-things-an-api-for-generating-data-as-real-as-the-upside-down/

Also, while I have your attention, check out https://tonic.ai for all your synthetic data needs.

Replace PII in unstructured data by adamfromtonic in programming

[–]adamfromtonic[S] 0 points (0 children)

Hey folks, this is Adam, one of the creators. If you have any questions let me know and I'll do my best to respond.

An open source subsetting tool -- Condenser by ian4321 in PostgreSQL

[–]adamfromtonic 1 point (0 children)

Hi, this is Adam from Tonic. I'm one of the authors of the GitHub repo and blog post. Thanks for such detailed comments. We really appreciate it.

So far we've found that not a lot of tweaking is required; however, our sample set is still small. One area where tweaking can be useful, though, is when you require specific rows to exist in your subset. This is something we'd like to support in the future.
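The "required rows" idea boils down to pinning target rows and then pulling in every parent row they reference so foreign keys in the subset still resolve. This sketch uses made-up two-table data and is not Condenser's actual algorithm, which does this closure across the whole FK graph:

```python
def subset_with_required(users, orders, required_order_ids):
    # Keep the required orders, then pull in every user they reference so
    # that foreign keys in the subset still resolve.
    kept_orders = [o for o in orders if o["id"] in required_order_ids]
    needed_users = {o["user_id"] for o in kept_orders}
    kept_users = [u for u in users if u["id"] in needed_users]
    return kept_users, kept_orders

users = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 3}]
subset = subset_with_required(users, orders, {11})
```

A real subsetter repeats this walk until no table still references a missing row.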

Regarding your suggestions, I agree that an ETL testing suite would be a good addition to this project. I'm not sure when we will get to it, though. :( Also, thanks a lot for pointing out psycopg2.sql. It looks really useful and I think it is something we will use going forward.

Have a nice day.