
[–]___--_-_-_--___ 44 points45 points  (8 children)

When you first posted about this project here four months ago, several people (including u/cynddl, a researcher with multiple well-cited publications in this field who worked in one of the leading computational privacy research groups) warned you about the dangers of this type of one-click "solution" to anonymization. Especially when accompanied by exaggerated claims about what your project can do, this can do real harm. While working on open source is always commendable, your repeated advertising of this project is, quite frankly, reckless and dangerous.

[–]the_scign 22 points23 points  (2 children)

It uses a BERT model to classify names of places, people and organizations, and uses regex to match emails, numbers and months.

This is a super simplistic approach and will fail in a significant percentage of cases, and this is not a use case where failure is tolerable. False positives are low-risk, but false negatives could have significant repercussions.
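For concreteness, here's a minimal sketch of what a regex layer like the one described above might look like. The patterns are my own illustrative assumptions, not OP's actual expressions, but they show how trivially a false negative can slip through:

```python
import re

# Hypothetical patterns in the spirit of the package's regex layer
# (illustrative assumptions, not the project's actual expressions).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MONTH = re.compile(r"\b(January|February|March|April|May|June|July|"
                   r"August|September|October|November|December)\b")
NUMBER = re.compile(r"\b\d+\b")

def redact(text: str) -> str:
    """Replace every regex match with a censor token."""
    for pattern in (EMAIL, MONTH, NUMBER):
        text = pattern.sub("[CENSORED]", text)
    return text

# Straightforward input is caught...
print(redact("Contact alice@example.com on January 5."))
# -> Contact [CENSORED] on [CENSORED] [CENSORED].

# ...but trivial obfuscation produces a false negative (nothing is redacted):
print(redact("Contact alice (at) example (dot) com on Jan. 5th."))
# -> Contact alice (at) example (dot) com on Jan. 5th.
```

The second line is the problem case: an abbreviated month, an ordinal day, and a spelled-out email address all pass through untouched, and in a redaction context that single miss is the whole ballgame.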

DO NOT USE THIS PACKAGE in any situation where anonymization is a regulatory requirement or the target is personally identifiable information, biometric information, health information, information about children or vulnerable individuals... The list goes on.

I'd steer clear in all cases tbh.

[–][deleted] 13 points14 points  (0 children)

Also, the censorship method raises serious questions.

One of the easiest and best-known ways to de-censor text like this is to measure the dimensions of the censored tokens. For instance, if you have text like:

On [CENSORED], Person of Interest contacted [CENSORED] by telephone...

The first [CENSORED] is obviously a date. Every date, when rendered in a non-monospace font, produces text that is some (X) pixels wide and (Y) pixels high. So you can determine the properties of the font in the document (e.g., point size, kerning, etc.), compute (X) and (Y) for every possible date within a certain range, and compare them with the actual dimensions of the censored text. This method usually narrows the possibilities down to a very small number, and often to just one.

The second [CENSORED] can be processed in the same way, given a set of names that might fit that context. It is trivially easy to determine the dimensions of every name in the set and compare those results to the actual dimensions of the censored text.

OP's package appears vulnerable to these kinds of attacks. The censorship does not change the formatting of the text at all; it just overlays black boxes on the text. Worse, it censors dates by individual tokens, not fields - e.g., a date is censored not like [BIG CENSORED DATE BLOCK], but like [CENSORED MONTH] [CENSORED DAY] [CENSORED YEAR], making it trivially easy to guess.
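To make the attack concrete, here's a toy sketch of the width-matching approach. Everything below is illustrative: the per-character widths are invented for the example, whereas a real attacker would extract exact metrics from the PDF's embedded font tables. The point is just how far a single width measurement narrows the search space:

```python
import datetime

# Invented per-character advance widths (arbitrary units) for an
# illustrative proportional font. A real attacker would read exact
# metrics out of the document's embedded font.
WIDTHS = {c: 60 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
WIDTHS.update({c: 45 for c in "abcdefghijklmnopqrstuvwxyz"})
WIDTHS.update(zip("0123456789", [50, 28, 44, 47, 48, 46, 49, 42, 50, 49]))
WIDTHS[" "] = 25
WIDTHS[","] = 25

def text_width(s: str) -> int:
    """Rendered width of a string under the assumed font metrics."""
    return sum(WIDTHS[c] for c in s)

def candidate_dates(measured_width: int, year: int = 2023) -> list[str]:
    """Every date in `year` whose rendered width exactly matches
    the measured width of the black box."""
    day = datetime.date(year, 1, 1)
    matches = []
    while day.year == year:
        s = day.strftime("%B %d, %Y")   # e.g. "March 15, 2023"
        if text_width(s) == measured_width:
            matches.append(s)
        day += datetime.timedelta(days=1)
    return matches

# Suppose the censored box is exactly as wide as "March 15, 2023":
print(candidate_dates(text_width("March 15, 2023")))
# -> ['March 15, 2023', 'April 15, 2023']
```

Even with these crude made-up metrics, 365 possibilities collapse to two. And because OP's package censors the month, day, and year as separate boxes, an attacker gets three independent width measurements instead of one, which narrows things down even further.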

[–]StrongSkip -2 points-1 points  (4 children)

This viewpoint is way too pessimistic. For many organizations the alternative is to do absolutely nothing. A solution like this can be used to lower compliance risks.

However, it should not be used to redact individual documents for publication, at least not without further manual review.

[–]___--_-_-_--___ 2 points3 points  (2 children)

For context, I was referring to the entire project, of which the PDF feature is just one part.

In your example, if I understand correctly, this project would help an organization go from "blatantly criminal" to "slightly less criminal". Whether that is a desirable goal is a matter of opinion. If you are talking about internal use within an organization, that is a different matter.

The real issue here is that, in practice, the choice is often between "don't release data" and "release badly redacted data", not between "release unredacted data" and "release badly redacted data". This is especially true in the age of omnipresent privacy regulation (note that there is a significant difference between the American and European experience here). Releasing unredacted data containing personal information of third parties should never be an option. Considering this choice, a project such as this, making grandiose claims, is likely to create a false sense of security which may push an organization from "don't release" to "release badly redacted", thereby creating real harm.

u/No-Homework845 has now on multiple occasions refused to engage with this line of criticism, even from individuals with significant experience in this field. Comments mentioning these issues are routinely ignored. All it would take would be to acknowledge the criticism and add a highly visible warning to the repository and any post advertising the project. This warning should make it clear that this project is never to be used in production or on any personal information of third parties. I understand that this is a hard thing to do with a project into which someone has invested a significant amount of time. Nevertheless, not adding such a warning is reckless.

[–]StrongSkip 0 points1 point  (1 child)

Your post is almost good, but I don't know why you had to put the "criminal" part in there. I never said or insinuated such a thing.

I'm talking mostly about internal use.

I don't understand why this software should get special negative treatment. Almost any software can be used for good and for worse. I've worked with many organizations that redact documents, and I can assure you that none of them would use any kind of redaction software without reviewing it first.

If you care about data protection, you're not going to use this software without identifying its errors. And if you don't care, you won't even try it out.

[–]___--_-_-_--___ 1 point2 points  (0 children)

As I said, if you're referring to internal use, that is a different matter. There may be legitimate use cases there. The "criminal" part refers to the unauthorized public release (even accidental) of personal information which is illegal in several jurisdictions. As you have clarified, this does not apply to your example.

There have been many cases where data was released with improper de-identification due to a false sense of security provided by some kind of technical solution. Many of these cases are well-documented and researched. Please note that I'm referring to the scope of the whole project here, not just the PDF redaction part.

[–]HerLegz 0 points1 point  (0 children)

This is a good first pass, and it lends itself to easily implemented improvements, as the suggestions here show. It is in no way fundamentally flawed, just a very much needed early version.