Category: 'Allgemein' (General)
Data collection
After talking to many people in my institute, I found two sources of information. One lies within my own department, namely the group that runs and manages our clusters. The other is the HPC group in the CSE department, which mainly conducts research on the clusters and also provides know-how for external users.
Colleagues from both groups sent me example files of the data they collect, along with additional questions they are interested in. I will now look into both and choose which information could be useful and which could not.
Anonymisation
One thing is already certain: a lot of unique identifiers (called TIM IDs) are stored. Before I am allowed to save the data, it must be anonymised.
Three ways come to mind:
- Hash each ID with a Hash-5 function to get an anonymised unique identifier. The problem here is that conventional hashes are usually not very friendly to the human eye, which would make it more difficult to identify clusters and to verify results in the later clustering process.
- Make a table in which each user ID is connected to a unique but anonymised identifier (for example 00001, 00002, 00003, 00004, …). The problem I see with this is that I have an additional table to store and to look up instead of a simple function. A function should be faster to use than a table.
- Write my own hash function, which would combine the advantage of a non-injective function (it cannot be uniquely reversed) with readable output. So far I have only a rough idea of how to do this, so I will look further into it.
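To make the trade-off between the first two options more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: HMAC-SHA-256 stands in for whatever hash function would actually be used (it is not the one named above), and `SECRET_KEY`, `hash_id` and `sequential_ids` are hypothetical names, as is the sample ID.

```python
import hashlib
import hmac

# Hypothetical secret; a real key would have to be stored as securely as the logs.
SECRET_KEY = b"replace-with-a-real-secret"


def hash_id(tim_id: str) -> str:
    """Option 1: keyed SHA-256 digest of the ID, returned as a hex string."""
    return hmac.new(SECRET_KEY, tim_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sequential_ids(tim_ids):
    """Option 2: map each distinct ID to a zero-padded counter,
    in order of first appearance. The returned dict is the lookup table."""
    table = {}
    for tim_id in tim_ids:
        if tim_id not in table:
            table[tim_id] = f"{len(table) + 1:05d}"
    return table


if __name__ == "__main__":
    print(hash_id("ab123456"))                       # 64 hex characters, hard to read
    print(sequential_ids(["ab123456", "cd789012"]))  # short, readable identifiers
```

The hash needs no stored table but yields long hexadecimal strings, while the counter is easy to read but only works as long as the table is kept.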
Update:
I have now decided on an approach: I will build a lookup table and keep it in the same place where I get my data from, so that it sits in the same security environment as the logs. For readability, I will assign each ID a name consisting of a random colour, a random name and a number, for example "blue_joe_0127".
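A lookup table of this kind could be sketched roughly as follows. This is not a finished implementation: the word lists, the class name `PseudonymTable` and the four-digit number range are assumptions chosen for illustration, and storing the table next to the logs is left out.

```python
import secrets

# Hypothetical word lists; real ones would be longer.
COLOURS = ["blue", "red", "green", "amber"]
NAMES = ["joe", "ann", "max", "eva"]


class PseudonymTable:
    """In-memory lookup table from original IDs to readable pseudonyms."""

    def __init__(self):
        self._by_id = {}    # original ID -> pseudonym
        self._used = set()  # pseudonyms already handed out

    def pseudonym(self, tim_id: str) -> str:
        # The same ID always gets the same pseudonym.
        if tim_id in self._by_id:
            return self._by_id[tim_id]
        # Draw random combinations until an unused one is found.
        # Note: this would loop forever once all combinations are taken.
        while True:
            candidate = "{}_{}_{:04d}".format(
                secrets.choice(COLOURS),
                secrets.choice(NAMES),
                secrets.randbelow(10000),
            )
            if candidate not in self._used:
                break
        self._used.add(candidate)
        self._by_id[tim_id] = candidate
        return candidate


if __name__ == "__main__":
    table = PseudonymTable()
    print(table.pseudonym("ab123456"))  # random, so the output differs per run
    print(table.pseudonym("ab123456"))  # same pseudonym as the line above
```

Because the pseudonyms are drawn at random rather than derived from the ID, they cannot be traced back to a user without access to the table itself.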