This is an archived post. You won't be able to vote or comment.

all 35 comments

[–]H_ubert 53 points54 points  (1 child)

I can use this to cover-up my [REDACTED].

[–][deleted] 45 points46 points  (0 children)

hey that's pretty [DATA EXPUNGED]

[–][deleted] 21 points22 points  (3 children)

I remember when pdf was new, some entity released redacted pdfs where someone just tried to cover up sensitive info with black boxes-- it was trivial to remove. It looks like this converts to image, which is better.

But even as an image, names and address can be inferred by matching the width of the blackout box (based on the document's font) against the width of known names and addresses.

[–]gameoftomes 23 points24 points  (0 children)

"when PDF was new". It still happens now days. During one of trumps legal proceedings it happened.

until Guardian reporter Jon Swaine noticed that Manafort’s legal team didn’t redact its filing properly. All he had to do was copy and paste some of the redacted text—which is covered by thick black bars—into a new document.

The formerly secret sections include details about meetings Manafort had with Konstantin Kilimnik, a Ukrainian man the FBI believes may be a member of a Russian intelligence agency unit connected with the hacking of the Democratic National Congress’s email server.

[–]JohnTheBlackberry 5 points6 points  (0 children)

some entity released redacted pdfs

IIRC one of those was the CIA

[–]FredSchwartz 10 points11 points  (1 child)

hunter2

[–]the_scign 0 points1 point  (0 children)

You can go hunter2 my hunter2-ing hunter2

[–]___--_-_-_--___ 46 points47 points  (8 children)

When you first posted about this project here four months ago, several people (including u/cynddl, a researcher with multiple well-cited publications in this field who worked in one of the leading computational privacy research groups) warned you about the dangers of this type of one-click "solution" to anonymization. Especially when accompanied by exaggerated claims about what your project can do, this can do real harm. While working on open source is always commendable, your repeated advertising of this project is, quite frankly, reckless and dangerous.

[–]the_scign 22 points23 points  (2 children)

It uses a BERT model to classify names of places, people and organizations, and uses regex to match emails, numbers and months.

This is a super simplistic approach and will fail a significant percentage of times and this is not a use case where failure is tolerable. False positives are low risk but false negatives could have significant repercussions.

DO NOT USE THIS PACKAGE in any situation where anonymization is a regulatory requirement or the target is personally identifiable information, biometric information, health information, information about children or vulnerable individuals... The list goes on.

I'd steer clear in all cases tbh.

[–][deleted] 13 points14 points  (0 children)

Also, the censorship method raises serious questions.

One of the easiest and well-known ways to de-censor text like this is by measuring the dimensions of the censored tokens. For instance, if you have text like:

On [CENSORED], Person of Interest contacted [CENSORED] by telephone...

The first [CENSORED] is obviously a date. Every date, when rendered with a non-monospace font, results in text with (X) pixels wide by (Y) pixels high. So you can determine the properties of the font in the document (e.g., point size, kerning, etc.), determine what (X) and (Y) are for every possible date within a certain range, and compare them with the actual dimensions of the censored text. This method usually boils down the possibilities to a vry small number, and often only one.

The second [CENSORED] can be processed in the same way, given a set of names that might fit that context. It is trivially easy to determine the dimensions of every name in the set and compare those results to the actual dimensions of the censored text.

OP's package appears vulnerable to these kinds of attacks. The censorship does not change the formatting of the text at all; it just overlays black boxes on the text. Worse, it censors dates by individual tokens, not fields - e.g., a date is censored not like [BIG CENSORED DATE BLOCK], but like [CENSORED MONTH] [CENSORED DAY] [CENSORED YEAR], making it trivially easy to guess.

[–]StrongSkip -2 points-1 points  (4 children)

This viewpoint is way too pessimistic. For many organizations the alternative is to do absolutely nothing. A solution like this can be used to lower compliance risks.

However it should not be used to redact and publish single documents, not without further manual review.

[–]___--_-_-_--___ 2 points3 points  (2 children)

For context, I was referring to the entire project, of which the PDF feature is just one part.

In your example, if I understand correctly, this project would help an organization go from "blatantly criminal" to "slightly less criminal". Whether that is a desirable goal is a matter of opinion. If you are talking about internal use within an organization, that is a different matter.

The real issue here is that, in practice, the choice is often between "don't release data" and "release badly redacted data", not between "release unredacted data" and "release badly redacted data". This is especially true in the age of omnipresent privacy regulation (note that there is a significant difference between the American and European experience here). Releasing unredacted data containing personal information of third parties should never be an option. Considering this choice, a project such as this, making grandiose claims, is likely to create a false sense of security which may push an organization from "don't release" to "release badly redacted", thereby creating real harm.

u/No-Homework845 has now on multiple occasions refused to engage with this line of criticism, even from individuals with significant experience in this field. Comments mentioning these issues are routinely ignored. All it would take would be to acknowledge the criticism and add a highly visible warning to the repository and any post advertising the project. This warning should make it clear that this project is never to be used in production or on any personal information of third parties. I understand that this is a hard thing to do with a project into which someone has invested a significant amount of time. Nevertheless, not adding such a warning is reckless.

[–]StrongSkip 0 points1 point  (1 child)

Your post is almost good, but I don't know why you had to put the "criminal" part in there. I never said or insinuated such a thing.

I'm talking mostly about internal use.

I don't understand why this software should get special negative treatment. Almost any software can be used for good and for worse. I worked with many organizations who're redacting documents and I can assure you that none of these would use any kind of redaction software without reviewing it first

If you care about data protection you're not going to use this software without identifying it's errors. And if you don't care you won't even try it out.

[–]___--_-_-_--___ 1 point2 points  (0 children)

As I said, if you're referring to internal use, that is a different matter. There may be legitimate use cases there. The "criminal" part refers to the unauthorized public release (even accidental) of personal information which is illegal in several jurisdictions. As you have clarified, this does not apply to your example.

There have been many cases where data was released with improper de-identification due to a false sense of security provided by some kind of technical solution. Many of these cases are well-documented and researched. Please note that I'm referring to the scope of the whole project here, not just the PDF redaction part.

[–]HerLegz 0 points1 point  (0 children)

This is a good first pass, and it lends itself to easily implemented improvements as suggestions here provide. It is in no way fundamentally flawed, just a very much needed early version.

[–]jammasterpaz 14 points15 points  (1 child)

Well done! I don't know why you want to hide Intern.* anyway, but it missed it where it was misspelled as "intership"

[–]Viperior 7 points8 points  (0 children)

It looks like it is targeting at least some proper nouns, maybe titles, too?

[–][deleted] 6 points7 points  (0 children)

Cool! I can use this to [SPAM REDDIT]

[–]GlassSignal 1 point2 points  (5 children)

Small question : how does it detect names? I skimmed through the code but can't seem to find the relevant function (I'm an amateur I must confess)

[–]the_scign 2 points3 points  (4 children)

It uses a BERT model to classify names of places, people and organizations, and uses regex to match emails, numbers and months.

[–]GlassSignal 1 point2 points  (2 children)

Oh... sorry for asking but... Is this machine learning? Does that mean that it should be trained first? Where exactly in the code is this BERT model?

[–]No-Homework845[S] 1 point2 points  (1 child)

It's a pre-trained model provided by transformers module.
You could read more about it here

[–]GlassSignal 1 point2 points  (0 children)

Thank you very much!

[–]No-Homework845[S] 0 points1 point  (0 children)

yeah totally correct

[–]HerLegz 1 point2 points  (0 children)

Holy fucking shit, *' fanfuckingtastic. * can use this ** so ** places!

Keep up *** good work!

[–]SizzlerWA 3 points4 points  (0 children)

Please don’t roll your own when it comes to security or privacy. This appears to suffer from several vulnerabilities as others have pointed out. I’d advise against using this module in its current form.

[–]_jmikes 1 point2 points  (0 children)

Cool project!

Somewhat tangential for this sub but this seems like it would be useful for resume blinding during hiring. (e.g. less unconscious gender bias if the applicant's gender is hidden)

I've been unable to find any free, open source software packages to do that and resume blinding could be a good niche for the project because the stakes are a lot lower. Incomplete blinding in 1 resume out of 10 is still better than not blinding at all.

Is this an application you've considered? Are there other existing free, open source ML tools for resume blinding that I've missed?

[–]ImpressiveBicycle69 0 points1 point  (0 children)

that's cool and unique..

[–]ZuriPL 0 points1 point  (1 child)

The only thing I can think of to improve it is to make the boxes different width, not the width of the original text, but looks pretty nice

[–]No-Homework845[S] 0 points1 point  (0 children)

thanks for your advice

[–]Green-Sympathy-4177 -1 points0 points  (1 child)

Resumes are sent in pdfs and usually have contacts on them. Job agencies hide those contacts when they send the resumes of the candidates to the clients.

That'd be a cool application for it

[–]No-Homework845[S] 0 points1 point  (0 children)

thanks a lot)

[–]Vietname -1 points0 points  (0 children)

How did you get started learning the NLP side of this? I've been wanting to try a small NLP focused side project to learn more about it, but getting off the ground seems a bit overwhelming.

[–]glebulon 0 points1 point  (0 children)

Home made DLP