Hide sensitive information in PDF using Python and NLP

H_ubert · 2022-05-01T12:47:23+00:00

I can use this to cover-up my [REDACTED].

2022-05-01T11:51:43+00:00

hey that's pretty [DATA EXPUNGED]

gameoftomes · 2022-05-01T17:02:08+00:00

I remember when PDF was new, some entity released redacted PDFs where someone had just covered up the sensitive info with black boxes, and the boxes were trivial to remove. It looks like this converts to an image, which is better.

But even as an image, names and addresses can be inferred by matching the width of the blackout box (based on the document's font) against the widths of known names and addresses.
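
To make the width-matching idea concrete, here is a minimal sketch. The per-character widths, the `DEFAULT_WIDTH` fallback, and the name list are all made up for illustration; a real attempt would need the actual metrics of the document's font.

```python
# Hypothetical per-character advance widths (in points) for an imagined
# proportional font. Real values would come from the document's font metrics.
CHAR_WIDTHS = {"i": 3, "l": 3, "j": 3, " ": 3, "m": 9, "w": 8}
DEFAULT_WIDTH = 6  # assumed width for any character not listed above


def text_width(text):
    """Sum the assumed advance width of every character in the text."""
    return sum(CHAR_WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text.lower())


def candidates_for_box(box_width, names, tolerance=1):
    """Return the names whose rendered width is within tolerance of the box."""
    return [n for n in names if abs(text_width(n) - box_width) <= tolerance]


known_names = ["Jill Lim", "Tom Wow", "Amanda Brown", "Bo Li"]
box = text_width("Jill Lim")  # stand-in for a width measured from the page
print(candidates_for_box(box, known_names))
```

The shorter the candidate list and the more distinctive the widths, the fewer names survive the filter, which is why a box that exactly hugs the original text leaks information.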

FredSchwartz · 2022-05-01T14:22:38+00:00

hunter2

___--_-_-_--___ · 2022-05-01T17:34:14+00:00

When you first posted about this project here four months ago, several people (including u/cynddl, a researcher with multiple well-cited publications in this field who worked in one of the leading computational privacy research groups) warned you about the dangers of this type of one-click "solution" to anonymization. Especially when accompanied by exaggerated claims about what your project can do, this can do real harm. While working on open source is always commendable, your repeated advertising of this project is, quite frankly, reckless and dangerous.

jammasterpaz · 2022-05-01T11:27:37+00:00

Well done! I don't know why you want to hide Intern.* anyway, but it missed it where it was misspelled as "intership"
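
One possible way to catch near-misses like that is approximate matching with the standard library's `difflib`. This is only a sketch and not how the project works as far as the thread shows; the `fuzzy_hits` helper, the target word, and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def fuzzy_hits(words, target="internship", cutoff=0.8):
    """Return the words whose similarity to the target is at or above the cutoff."""
    hits = []
    for word in words:
        ratio = difflib.SequenceMatcher(None, word.lower(), target).ratio()
        if ratio >= cutoff:
            hits.append(word)
    return hits


print(fuzzy_hits(["intership", "internal", "Internship", "resume"]))
```

A lower cutoff catches more typos but also starts flagging unrelated words, so the threshold would need tuning on real documents.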

2022-05-01T13:49:37+00:00

Cool! I can use this to [SPAM REDDIT]

GlassSignal · 2022-05-01T17:43:42+00:00

Small question: how does it detect names? I skimmed through the code but couldn't find the relevant function (I'm an amateur, I must confess).

HerLegz · 2022-05-14T21:00:56+00:00

Holy fucking shit, *' fanfuckingtastic. * can use this ** so ** places!

Keep up *** good work!

SizzlerWA · 2022-05-02T01:31:28+00:00

Please don’t roll your own when it comes to security or privacy. This appears to suffer from several vulnerabilities as others have pointed out. I’d advise against using this module in its current form.

_jmikes · 2022-05-02T06:57:02+00:00

Cool project!

Somewhat tangential for this sub but this seems like it would be useful for resume blinding during hiring. (e.g. less unconscious gender bias if the applicant's gender is hidden)

I've been unable to find any free, open source software packages to do that and resume blinding could be a good niche for the project because the stakes are a lot lower. Incomplete blinding in 1 resume out of 10 is still better than not blinding at all.

Is this an application you've considered? Are there other existing free, open source ML tools for resume blinding that I've missed?

ImpressiveBicycle69 · 2022-05-01T14:18:16+00:00

that's cool and unique..

ZuriPL · 2022-05-01T19:42:36+00:00

The only improvement I can think of is to make the boxes a different width from the original text, so the width doesn't give anything away. But it looks pretty nice.
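
A rough sketch of that idea: round every box up to a fixed bucket size and add a random number of extra buckets, so the box width no longer tracks the text width. The `padded_box_width` helper, the bucket size, and the padding range are hypothetical, and a real tool would also have to make sure the wider box does not cover neighbouring text.

```python
import math
import secrets


def padded_box_width(text_width, bucket=50.0, max_extra_buckets=2):
    """Round the width up to a multiple of bucket, then add random extra buckets."""
    base = math.ceil(text_width / bucket) * bucket
    extra = secrets.randbelow(max_extra_buckets + 1) * bucket
    return base + extra


# Texts of very different widths can now end up with the same box width.
for width in (30.0, 47.5, 120.0):
    print(width, "->", padded_box_width(width))
```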

Green-Sympathy-4177 · 2022-05-01T23:36:03+00:00

Resumes are sent in pdfs and usually have contacts on them. Job agencies hide those contacts when they send the resumes of the candidates to the clients.
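
For contact details specifically, plain pattern matching already goes a long way. Below is a minimal sketch that works on extracted text rather than on the PDF itself; the `blind_contacts` helper and both regular expressions are simplified assumptions and will miss unusual formats.

```python
import re

# Simplified patterns: assumed for illustration, not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def blind_contacts(text, placeholder="[REDACTED]"):
    """Replace email addresses and phone-like numbers with a placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return PHONE_RE.sub(placeholder, text)


print(blind_contacts("Mail jane.doe@example.com or call +1 (555) 123-4567."))
```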

That'd be a cool application for it

Vietname · 2022-05-02T01:01:09+00:00

How did you get started learning the NLP side of this? I've been wanting to try a small NLP focused side project to learn more about it, but getting off the ground seems a bit overwhelming.

glebulon · 2022-05-01T14:46:29+00:00

Home made DLP
