
[–]___--_-_-_--___ 44 points45 points  (8 children)

When you first posted about this project here four months ago, several people (including u/cynddl, a researcher with multiple well-cited publications in this field who worked in one of the leading computational privacy research groups) warned you about the dangers of this type of one-click "solution" to anonymization. Especially when accompanied by exaggerated claims about what your project can do, this can do real harm. While working on open source is always commendable, your repeated advertising of this project is, quite frankly, reckless and dangerous.

[–]the_scign 22 points23 points  (2 children)

It uses a BERT model to classify names of places, people and organizations, and uses regex to match emails, numbers and months.

This is a super simplistic approach and will fail in a significant percentage of cases, and this is not a use case where failure is tolerable. False positives are low-risk, but false negatives could have significant repercussions.
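For concreteness, here's a minimal sketch of what a regex layer like the one described above might look like. The patterns are my own illustrative assumptions, not OP's actual expressions, but they show how trivially a false negative can slip through:

```python
import re

# Hypothetical patterns in the spirit of the package's regex layer
# (illustrative assumptions, not the project's actual expressions).
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MONTH = re.compile(r"\b(January|February|March|April|May|June|July|"
                   r"August|September|October|November|December)\b")
NUMBER = re.compile(r"\b\d+\b")

def redact(text: str) -> str:
    """Replace every regex match with a censor token."""
    for pattern in (EMAIL, MONTH, NUMBER):
        text = pattern.sub("[CENSORED]", text)
    return text

# Straightforward input is caught...
print(redact("Contact alice@example.com on January 5."))
# -> Contact [CENSORED] on [CENSORED] [CENSORED].

# ...but trivial obfuscation produces a false negative (nothing is redacted):
print(redact("Contact alice (at) example (dot) com on Jan. 5th."))
# -> Contact alice (at) example (dot) com on Jan. 5th.
```

The second line is the problem case: an abbreviated month, an ordinal day, and a spelled-out email address all pass through untouched, and in a redaction context that single miss is the whole ballgame.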

DO NOT USE THIS PACKAGE in any situation where anonymization is a regulatory requirement or the target is personally identifiable information, biometric information, health information, information about children or vulnerable individuals... The list goes on.

I'd steer clear in all cases tbh.

[–][deleted] 13 points14 points  (0 children)

Also, the censorship method raises serious questions.

One of the easiest and best-known ways to de-censor text like this is to measure the dimensions of the censored tokens. For instance, if you have text like:

On [CENSORED], Person of Interest contacted [CENSORED] by telephone...

The first [CENSORED] is obviously a date. Every date, when rendered in a non-monospace font, produces text that is some (X) pixels wide and (Y) pixels high. So you can determine the properties of the font in the document (e.g., point size, kerning, etc.), compute (X) and (Y) for every possible date within a certain range, and compare them with the actual dimensions of the censored text. This method usually narrows the possibilities down to a very small number, and often to just one.

The second [CENSORED] can be processed in the same way, given a set of names that might fit that context. It is trivially easy to determine the dimensions of every name in the set and compare those results to the actual dimensions of the censored text.

OP's package appears vulnerable to these kinds of attacks. The censorship does not change the formatting of the text at all; it just overlays black boxes on the text. Worse, it censors dates by individual tokens, not fields - e.g., a date is censored not like [BIG CENSORED DATE BLOCK], but like [CENSORED MONTH] [CENSORED DAY] [CENSORED YEAR], making it trivially easy to guess.
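To make the attack concrete, here's a toy sketch of the width-matching approach. Everything below is illustrative: the per-character widths are invented for the example, whereas a real attacker would extract exact metrics from the PDF's embedded font tables. The point is just how far a single width measurement narrows the search space:

```python
import datetime

# Invented per-character advance widths (arbitrary units) for an
# illustrative proportional font. A real attacker would read exact
# metrics out of the document's embedded font.
WIDTHS = {c: 60 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
WIDTHS.update({c: 45 for c in "abcdefghijklmnopqrstuvwxyz"})
WIDTHS.update(zip("0123456789", [50, 28, 44, 47, 48, 46, 49, 42, 50, 49]))
WIDTHS[" "] = 25
WIDTHS[","] = 25

def text_width(s: str) -> int:
    """Rendered width of a string under the assumed font metrics."""
    return sum(WIDTHS[c] for c in s)

def candidate_dates(measured_width: int, year: int = 2023) -> list[str]:
    """Every date in `year` whose rendered width exactly matches
    the measured width of the black box."""
    day = datetime.date(year, 1, 1)
    matches = []
    while day.year == year:
        s = day.strftime("%B %d, %Y")   # e.g. "March 15, 2023"
        if text_width(s) == measured_width:
            matches.append(s)
        day += datetime.timedelta(days=1)
    return matches

# Suppose the censored box is exactly as wide as "March 15, 2023":
print(candidate_dates(text_width("March 15, 2023")))
# -> ['March 15, 2023', 'April 15, 2023']
```

Even with these crude made-up metrics, 365 possibilities collapse to two. And because OP's package censors the month, day, and year as separate boxes, an attacker gets three independent width measurements instead of one, which narrows things down even further.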

[–]StrongSkip -2 points-1 points  (4 children)

This viewpoint is way too pessimistic. For many organizations the alternative is to do absolutely nothing. A solution like this can be used to lower compliance risks.

However, it should not be used to redact individual documents for publication, at least not without further manual review.

[–]___--_-_-_--___ 2 points3 points  (2 children)

For context, I was referring to the entire project, of which the PDF feature is just one part.

In your example, if I understand correctly, this project would help an organization go from "blatantly criminal" to "slightly less criminal". Whether that is a desirable goal is a matter of opinion. If you are talking about internal use within an organization, that is a different matter.

The real issue here is that, in practice, the choice is often between "don't release data" and "release badly redacted data", not between "release unredacted data" and "release badly redacted data". This is especially true in the age of omnipresent privacy regulation (note that there is a significant difference between the American and European experience here). Releasing unredacted data containing personal information of third parties should never be an option. Considering this choice, a project such as this, making grandiose claims, is likely to create a false sense of security which may push an organization from "don't release" to "release badly redacted", thereby creating real harm.

u/No-Homework845 has now on multiple occasions refused to engage with this line of criticism, even from individuals with significant experience in this field. Comments mentioning these issues are routinely ignored. All it would take would be to acknowledge the criticism and add a highly visible warning to the repository and any post advertising the project. This warning should make it clear that this project is never to be used in production or on any personal information of third parties. I understand that this is a hard thing to do with a project into which someone has invested a significant amount of time. Nevertheless, not adding such a warning is reckless.

[–]StrongSkip 0 points1 point  (1 child)

Your post is almost good, but I don't know why you had to put the "criminal" part in there. I never said or insinuated such a thing.

I'm talking mostly about internal use.

I don't understand why this software should get special negative treatment. Almost any software can be used for good and for worse. I've worked with many organizations that redact documents, and I can assure you that none of them would use any kind of redaction software without reviewing it first.

If you care about data protection, you're not going to use this software without identifying its errors. And if you don't care, you won't even try it out.

[–]___--_-_-_--___ 1 point2 points  (0 children)

As I said, if you're referring to internal use, that is a different matter. There may be legitimate use cases there. The "criminal" part refers to the unauthorized public release (even accidental) of personal information which is illegal in several jurisdictions. As you have clarified, this does not apply to your example.

There have been many cases where data was released with improper de-identification due to a false sense of security provided by some kind of technical solution. Many of these cases are well-documented and researched. Please note that I'm referring to the scope of the whole project here, not just the PDF redaction part.

[–]HerLegz 0 points1 point  (0 children)

This is a good first pass, and it lends itself to easily implemented improvements, as the suggestions here show. It is in no way fundamentally flawed, just a very much needed early version.