
HPC Cluster Usage Analysis

Anonymisation

October 6th, 2015 | by

One thing is already certain: the data contains a lot of unique identifiers (called TIM IDs). To be allowed to save the data, I must anonymise it first.

Three approaches come to mind:

  1. Hash each ID with a Hash-5 function to get an anonymised unique identifier. The problem here is that conventional hashes are usually not very friendly to the human eye, which would make it harder to identify clusters and to verify results in the later clustering process.
  2. Make a table in which each user ID is mapped to a unique but anonymised identifier (for example 00001, 00002, 00003, 00004, …). The problem I see with this is that I have an additional table to store and to look up instead of a simple function. A function should also be faster to use than a table lookup.
  3. Write my own hash function, which has the advantages of using a non-injective function while still producing readable output. So far I have only a vague idea of how to do this, so I will look into it further.
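To make the trade-off between the first two options concrete, here is a minimal sketch. It rests on assumptions: SHA-256 from Python's standard library stands in for whatever hash function would really be used, and the names `hash_id` and `SequentialTable` as well as the sample IDs are made up for illustration.

```python
import hashlib


def hash_id(user_id):
    """Option 1: pseudonym from a standard hash (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


class SequentialTable:
    """Option 2: lookup table that hands out zero-padded counters."""

    def __init__(self, width=5):
        self._width = width
        self._table = {}  # maps original ID -> anonymised identifier

    def pseudonym(self, user_id):
        # Assign the next free counter the first time an ID is seen.
        if user_id not in self._table:
            self._table[user_id] = str(len(self._table) + 1).zfill(self._width)
        return self._table[user_id]


if __name__ == "__main__":
    table = SequentialTable()
    for uid in ["ab123456", "cd654321", "ab123456"]:  # hypothetical IDs
        print(uid, hash_id(uid)[:16] + "...", table.pseudonym(uid))
```

The hash needs no stored state but yields a 64-character hex string that is hard to tell apart by eye, while the table gives short identifiers at the price of having to be kept alongside the data.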

Update:

I have now decided on an approach: I will build a lookup table and keep it in the same place as the source data, so that it sits in the same security environment as the logs. For readability, I will assign each ID a name made up of a random colour, a random name and a number, for example “blue_joe_0127”.
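A lookup table of this kind could be sketched roughly as follows. This is only an illustration under assumptions: the word lists, the four-digit number range and the class name `PseudonymTable` are hypothetical, and the table lives in memory here rather than next to the logs.

```python
import random

# Hypothetical word lists; a real table would likely use longer ones.
COLOURS = ["blue", "red", "green", "amber"]
NAMES = ["joe", "ann", "max", "eva"]


class PseudonymTable:
    """Maps each ID to a stable, unique name such as 'blue_joe_0127'."""

    def __init__(self, rng=None):
        # SystemRandom draws from the OS entropy source; a seeded
        # random.Random can be passed in for reproducible tests.
        self._rng = rng or random.SystemRandom()
        self._by_id = {}
        self._used = set()

    def pseudonym(self, user_id):
        # Return the existing name so the mapping stays stable.
        if user_id in self._by_id:
            return self._by_id[user_id]
        # Draw random names until an unused one is found.
        while True:
            candidate = "{}_{}_{:04d}".format(
                self._rng.choice(COLOURS),
                self._rng.choice(NAMES),
                self._rng.randrange(10000),
            )
            if candidate not in self._used:
                break
        self._used.add(candidate)
        self._by_id[user_id] = candidate
        return candidate
```

With these list sizes there are 4 × 4 × 10000 = 160000 possible names; the retry loop assumes the number of IDs stays well below that limit.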
