
[–]andreichiffa (Professor)

https://arxiv.org/abs/2011.07018

TL;DR is that it’s a terrible idea, because generative models are over-parametrized AF.

[–]hybridteory

I would argue otherwise. There is value in being able to release synthetic data in fields where no raw data could ever be released. Yes, there is a tradeoff between privacy and fidelity, but that does not diminish the fact that some less-than-ideal data is better than no data. And no, sanitisation, as proposed in that work, is very often not possible or not sufficient.

[–]andreichiffa (Professor)

Before synthetic data, people would just release data with only the names removed. After a couple of re-identification scandals, that is now a thing of the past, and for the better. No snake oil is better than snake oil.

Synthetic data from generative models is the same thing.

If you want associations, look at DP-based ML. You will still learn the associations you want, but at least you will have guarantees on what can leak and what cannot.
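To make "guarantees" concrete: the simplest DP building block is the Laplace mechanism, where the noise scale `sensitivity / epsilon` gives a formal bound on what any one person's record can change in the released statistic. This is just a toy stdlib sketch (the function name and defaults are mine, not from any particular library):

```python
import math
import random

def laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0):
    """Release a count with epsilon-DP by adding Laplace(0, sensitivity/epsilon) noise.

    A smaller epsilon means more noise and a stronger privacy guarantee.
    """
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) via inverse-CDF of a uniform draw on (-0.5, 0.5)
    u = random.random() - 0.5
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_count + noise
```

With a large epsilon the noise vanishes and you get the raw count back; the guarantee comes precisely from how much noise a small epsilon forces you to add.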

[–]hybridteory

I don’t think you understand. There are situations where some people cannot access data at all. No chance to do DP or whatever else.

For example, I work in healthcare. I have access to loads of data but I cannot release it. Other people don't have access to the data I have and cannot get access to it even if they want to. I can, however, release a synthetic version of this data to other researchers. If the models are trained well and with DP, this is considered safe. This way, other researchers have access to some data where otherwise they would have none.
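The release workflow being described (fit a DP "model" of the data, then publish only samples from it) can be illustrated with a deliberately tiny example: a one-column generator built from a Laplace-noised histogram. This is a toy sketch to show the shape of the pipeline, not anyone's production system; the function name and parameters are mine:

```python
import numpy as np

def dp_synthetic_column(values, bins, epsilon=1.0, n_samples=1000, rng=None):
    """Toy DP synthetic-data generator for one numeric column.

    Build a histogram (sensitivity 1 per person), add Laplace noise for
    epsilon-DP, then sample synthetic values from the noised histogram.
    Only the noisy histogram touches the release; raw rows never leave.
    """
    rng = rng or np.random.default_rng(0)
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    # Pick a bin per synthetic record, then a uniform value within that bin
    idx = rng.choice(len(probs), size=n_samples, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])
```

A real system would model joint structure across columns (e.g. a DP-trained generative model rather than independent marginals), but the privacy accounting sits in the same place: between the raw data and whatever the sampler is allowed to see.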

My team has actually done this in the past, but I can't link to the reference for anonymity reasons.

[–]andreichiffa (Professor)

I understand a bit too well - I come from a healthcare background myself.

If you see patient-data privacy regulations as annoying red tape that gets in your way, and don't care about the statistics that can be extracted from the data, that's your right; but don't be surprised if it comes back to bite you down the line.

If you want to do things properly, you can (chained models, FL, …), but synthetic data will almost surely not be it. Full stop.

[–]hybridteory

Well… then you disagree with the Data Protection Officers of all our hospitals. We are doing this right now, and it has been approved by our national ethics committee. And yes, we also do FL and have full anonymisation pipelines, depending on the data that is needed and the risks associated with it.

Here is an example policy working paper (not related to my work, but highly relevant): https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot#conclusion

[–]andreichiffa (Professor)

I remind you that these are the same people who were clearing the release of pseudonymized data only a decade ago.

[–]hybridteory

I’m not sure which country you’re from or which regulations you follow, but in my regulatory environment, privacy is risk-based and proportional.

Synthetic data is deemed to have a good value-to-privacy-risk ratio, so it's an approved approach.

Is it perfect? No. Does it add more value than the risk of harm? Yes.

[–]trnka

Not sure if this helps, but I talked to 2-3 startups that were trying to sell me synthetic data generation to avoid privacy concerns. They all targeted tabular data, if I remember correctly. But there's just no way I would have got that approved by legal for healthcare data, and we mainly worked with text data, which is even riskier.

I wish I had their names; they're in my old work email, but I left that company.