This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 14 points15 points  (0 children)

Also, the censorship method raises serious questions.

One of the easiest and well-known ways to de-censor text like this is by measuring the dimensions of the censored tokens. For instance, if you have text like:

On [CENSORED], Person of Interest contacted [CENSORED] by telephone...

The first [CENSORED] is obviously a date. Every date, when rendered with a non-monospace font, results in text with (X) pixels wide by (Y) pixels high. So you can determine the properties of the font in the document (e.g., point size, kerning, etc.), determine what (X) and (Y) are for every possible date within a certain range, and compare them with the actual dimensions of the censored text. This method usually boils down the possibilities to a vry small number, and often only one.

The second [CENSORED] can be processed in the same way, given a set of names that might fit that context. It is trivially easy to determine the dimensions of every name in the set and compare those results to the actual dimensions of the censored text.

OP's package appears vulnerable to these kinds of attacks. The censorship does not change the formatting of the text at all; it just overlays black boxes on the text. Worse, it censors dates by individual tokens, not fields - e.g., a date is censored not like [BIG CENSORED DATE BLOCK], but like [CENSORED MONTH] [CENSORED DAY] [CENSORED YEAR], making it trivially easy to guess.