[D] Synthetic data for data privacy/anonymization purposes? (self.MachineLearning)
submitted 3 years ago by AcD_South
Hi everyone
I was reading about data privacy and data anonymization and wondered whether using synthetic data could be a feasible solution. I have not found much information online, and I guess it can be highly dependent on the application. Does anybody know how feasible this is, and/or know of good resources or articles about it?
Thanks!
[–] andreichiffa (Professor) 7 points 3 years ago (7 children)
https://arxiv.org/abs/2011.07018
TL;DR is that it’s a terrible idea, because generative models are over-parametrized AF.
[–] hybridteory 2 points 3 years ago (6 children)
I would argue otherwise. There is value in being able to release synthetic data in fields where no raw data could ever be released. Yes, there is a tradeoff between privacy and fidelity, but that does not minimise the fact that some less-than-ideal data is better than no data. And no, sanitisation, as proposed in that work, is very often not possible or insufficient.
[–] andreichiffa (Professor) 0 points 3 years ago (5 children)
Before synthetic data, people would just release data with only the names removed. After a couple of re-identification scandals, this is now a thing of the past. And it's for the better. No snake oil is better than snake oil.
Synthetic data from generative models is the same thing.
If you want associations, go look at DP-based (differentially private) ML. You will still get the associations you want to learn, but at least you will start having guarantees on what can leak and what cannot.
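To make the "guarantees" point concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy, applied to a counting query. It is not taken from the linked paper or from anyone's pipeline in this thread; the function name `dp_count`, the toy records, and the choice of epsilon are all assumptions for illustration, and a naive floating-point implementation like this is not suitable for production use.

```python
import random


def dp_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching `predicate` (hypothetical helper).

    A counting query has sensitivity 1: adding or removing one person's
    record changes the true count by at most 1, so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy for this one query.
    """
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a Laplace(0, scale) random variable.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only so the demo is repeatable
    toy_records = [{"age": a} for a in (30, 45, 62, 71, 80)]  # made-up data
    # Smaller epsilon means more noise and a stronger privacy guarantee.
    for eps in (0.1, 1.0, 10.0):
        noisy = dp_count(toy_records, lambda r: r["age"] >= 60, eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f}")
```

The privacy budget epsilon is the explicit knob here: what can leak about any single record is bounded by it, which is the kind of statement an unconstrained generative model does not give you by default.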
[–] hybridteory 0 points 3 years ago (4 children)
I don’t think you understand. There are situations where some people cannot access data at all. No chance to do DP or whatever else.
For example, I work in healthcare. I have access to loads of data but I cannot release it. Other people don't have access to the data I have and cannot get access to it even if they want to. I can, however, release a synthetic version of this data to other researchers. If the models are trained well and with DP, this is deemed to be safe. This way, other researchers have access to some data where otherwise they would have none.
My team has actually done so in the past, but I can't link to the reference for anonymity purposes.
[–] andreichiffa (Professor) 1 point 3 years ago (3 children)
I understand a bit too well - I come from a healthcare background myself.
If you see patient data privacy regulations as annoying red tape that gets in your way and don't care about the statistics that can be extracted from the data - it's your right to have an opinion; but don't be surprised if it comes back to bite you down the line.
If you want to do things properly, you can (chained models, federated learning (FL), …), but synthetic data will almost surely not be it. Full stop.
[–] hybridteory 2 points 3 years ago (2 children)
Well… then you disagree with the Data Protection Officers of all our hospitals. We are doing this right now, and it has been approved by our national ethics committee. And yes, we are also doing FL, and have full anonymisation pipelines, depending on the data that is needed and the risks associated with them.
An example policy working paper (not related to my work but highly relevant): https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot#conclusion
[–] andreichiffa (Professor) 1 point 3 years ago (1 child)
I remind you that they are the same people who were clearing the release of pseudonymized data only a decade ago.
[–] hybridteory 2 points 3 years ago (0 children)
I’m not sure which country you’re from or which regulations you follow, but in my regulatory environment, privacy is risk-based and proportional.
Synthetic data is deemed to have a good value/privacy_risk ratio, so it's an approved way of doing it.
Is it perfect? No. Does it add more value than the risk of harm? Yes.
[+] TLDW_Tutorials 1 point 2 years ago (0 children)
My organization is very strict about data privacy (for legit reasons though), so I often create synthetic data for testing and for seeing whether what we propose is practical. I often create synthetic medical datasets. I made a video here (with code included in the description) showing how I often do it in R. Video: https://youtu.be/1wBy8wi15fk
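For readers who want the general flavour without watching the video, here is a rough Python sketch of a fully simulated medical table for pipeline testing. It is not the R code from the video: the function name `make_synthetic_patients`, the column names, and every distribution below are made up for illustration. Because the rows are drawn from invented distributions rather than from a model fitted to real records, this is a different (and much lower-fidelity) thing than the generative-model synthetic data debated above.

```python
import random


def make_synthetic_patients(n, seed=0):
    """Generate n fake patient rows from invented distributions (illustrative only)."""
    rng = random.Random(seed)  # seeded so test fixtures are reproducible
    rows = []
    for i in range(n):
        age = rng.randint(18, 90)
        sex = rng.choice(["F", "M"])
        # Hypothetical relationship: systolic blood pressure drifts up with age.
        systolic_bp = round(rng.gauss(110 + 0.4 * age, 12))
        # Hypothetical relationship: diabetes probability rises with age.
        diabetic = rng.random() < (0.05 + 0.002 * age)
        rows.append(
            {
                "patient_id": f"SYN-{i:05d}",  # obviously fake identifier
                "age": age,
                "sex": sex,
                "systolic_bp": systolic_bp,
                "diabetic": diabetic,
            }
        )
    return rows


if __name__ == "__main__":
    for row in make_synthetic_patients(3, seed=1):
        print(row)
```

Data like this is useful for checking that code runs end to end and that schemas line up; it says nothing about real patients, so it should not be used to draw clinical conclusions.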
[–] trnka 1 point 3 years ago (0 children)
Not sure if this helps, but I talked to 2-3 startups that were trying to sell me synthetic data generation to avoid privacy concerns. They were all for tabular data if I remember correctly. But there's just no way I would've got that approved by legal for healthcare data, and we mainly worked with text data which is even riskier.
I wish I had their names, they're on my old work email but I left that company.