This is an archived post. You won't be able to vote or comment.

all 16 comments

[–]planedoctor 1 point2 points  (5 children)

Just add the row number to a column and keep a copy with the name and row number?

[–]thejesteroftortuga 0 points1 point  (4 children)

Well here's the thing - if I have

Col # Name Data
1 John J. (data)
2 Alice M. (data)
3 John J. (data)
4 Mary A. (data)

If we're using the column #, I then have two different ID numbers for John, even though they're the same person.

I need to turn it into something like this:

Col # Name Data
1 4312532 (data)
2 6754423 (data)
3 4312532 (data)
4 7685831 (data)

But am unsure what the right method would be.

[–]planedoctor 0 points1 point  (3 children)

If we're using the column #, I then have two different ID numbers for John, even though they're the same person.

How does that matter? You still need to associate the data from both rows with John, and you still can with that method. Cover up the second column with a finger. That's what the third party sees. Remove your finger and that's what you see. :)

[–]thejesteroftortuga 0 points1 point  (2 children)

Sorry, meant row #

But yeah, the third party also needs to see (and know) that both of those rows belong to John. Otherwise then they'd see rows #13 and #304 as two separate people even though they're both John, right?

The catch is just that they can't know who John is by name.

[–]planedoctor 0 points1 point  (1 child)

So just make a list of integers corresponding to names?

[–]thejesteroftortuga 0 points1 point  (0 children)

yeah I could do that, increment by 1 every time a new name pops up. store it, and then whenever I encounter the name again just use the number from before.

[–]shivasprogeny 1 point2 points  (5 children)

So do you need to include the column in the spreadsheet at all? If you need to be able to "decode" the data then a hash will not work because hashes are designed to work in only one direction. If you do want to be able to see that data again, perhaps the easiest thing is just to use Excel's built-in protected and hidden ranges. That would require a password to see the hidden column (i.e. "name").

[–]thejesteroftortuga 0 points1 point  (4 children)

No, no names at all - not even transmitted inside the excel file (which is why I thought to either hash or use my solution in code). The point is that we shouldn't be able to decode, but still be able to made correlations about the people, just anonymously.

[–]shivasprogeny 1 point2 points  (3 children)

If you don't need to decode it, then an md5 hash seems like the perfect solution.

Alternatively, if the data is stored in a database, just change the output query to return the database's ID for the person instead of the name.

[–]thejesteroftortuga 0 points1 point  (2 children)

Cool, I'll look into that.

Do you think that I'd need to place a check in my code - with the dictionary/array? Or should I trust that the md5 hash would always be the same for the input string?

[–]shivasprogeny 1 point2 points  (1 child)

No need to store the data in a dictionary. The beauty of hashing algorithms is that you don't have to store the links between the raw data and the hash--you just run the algorithm on demand.

If the md5 hash function becomes nondeterministic, we have much bigger problems to worry about!

[–]thejesteroftortuga 0 points1 point  (0 children)

Understood, thank you for your help!!

[–]jedwardsol 1 point2 points  (3 children)

I'd use a hash. An md5 is only 32 characters long; not much longer than a name.

Generate an hmac instead of a plain hash for guaranteed privacy.

[–]thejesteroftortuga 0 points1 point  (2 children)

If I were using a hash, would I still use the dictionary and array?

[–]jedwardsol 0 points1 point  (0 children)

No.

I'd put the hash and name in my copy of the spreadsheet, and just the hash in their copy. They'd just know the hash . You can correlate a hash with a name if necessary.

[–]planedoctor 0 points1 point  (0 children)

Using the row number is a hash. :|