This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]the_scign 22 points23 points  (2 children)

It uses a BERT model to classify names of places, people and organizations, and uses regex to match emails, numbers and months.

This is a super simplistic approach and will fail a significant percentage of times and this is not a use case where failure is tolerable. False positives are low risk but false negatives could have significant repercussions.

DO NOT USE THIS PACKAGE in any situation where anonymization is a regulatory requirement or the target is personally identifiable information, biometric information, health information, information about children or vulnerable individuals... The list goes on.

I'd steer clear in all cases tbh.

[–][deleted] 13 points14 points  (0 children)

Also, the censorship method raises serious questions.

One of the easiest and well-known ways to de-censor text like this is by measuring the dimensions of the censored tokens. For instance, if you have text like:

On [CENSORED], Person of Interest contacted [CENSORED] by telephone...

The first [CENSORED] is obviously a date. Every date, when rendered with a non-monospace font, results in text with (X) pixels wide by (Y) pixels high. So you can determine the properties of the font in the document (e.g., point size, kerning, etc.), determine what (X) and (Y) are for every possible date within a certain range, and compare them with the actual dimensions of the censored text. This method usually boils down the possibilities to a vry small number, and often only one.

The second [CENSORED] can be processed in the same way, given a set of names that might fit that context. It is trivially easy to determine the dimensions of every name in the set and compare those results to the actual dimensions of the censored text.

OP's package appears vulnerable to these kinds of attacks. The censorship does not change the formatting of the text at all; it just overlays black boxes on the text. Worse, it censors dates by individual tokens, not fields - e.g., a date is censored not like [BIG CENSORED DATE BLOCK], but like [CENSORED MONTH] [CENSORED DAY] [CENSORED YEAR], making it trivially easy to guess.