One key feature of the Blackbox project is keeping the collected data anonymous. This avoids all sorts of ethical and legal (data protection) complications. But how do you anonymise source code? There is a trade-off between how far you go to anonymise the data, and how useful the data remains afterwards. In this post, I’ll explain various options, from the most anonymous/least useful to the least anonymous/most useful.
Here’s our original source code, which I’ll use as an example while discussing anonymisation:
import java.util.List;
/**
 * A test class
 * @author Neil Brown
 */
class Foo
{
    private int x = 0;
    private String s = "Neil rules!";
    // An accessor for the string held by the class
    public String getValue()
    {
        return s;
    }
}
Anonymise Everything
The ultimate anonymisation that I can think of (besides just blanking the file!) is to anonymise everything. It is possible to replace all non-keyword identifiers in the source, along with every comment and string literal, so that the original source code becomes:
import package1;
/* Comment1 */
class Class1
{
    private int field1 = 0;
    private String field2 = "StringLiteral1";
    // Comment2
    public String method1()
    {
        return field2;
    }
}
One problem is that you quickly run into various technical issues if you start altering code: if you rename a type, you need to make sure you rename that type everywhere in the code (including other classes), and the same for fields and methods (especially public ones).
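To illustrate the consistency problem, here is a minimal sketch (not the Blackbox implementation) of a table that hands out numbered placeholders and always returns the same placeholder for the same original name. The class `PlaceholderNamer`, the method `placeholderFor` and the "kind" labels are all hypothetical names chosen for this example. The sketch also ignores scoping entirely, so two unrelated fields that happen to share a name would receive the same placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper: hands out placeholders such as "field1", "field2"
 * and returns the same placeholder whenever it sees the same original name.
 */
final class PlaceholderNamer {
    // One counter per kind of identifier ("Class", "field", "method", ...).
    private final Map<String, Integer> counters = new LinkedHashMap<>();
    // Remembers the placeholder already given to each (kind, name) pair.
    private final Map<String, String> assigned = new LinkedHashMap<>();

    String placeholderFor(String kind, String originalName) {
        String key = kind + ":" + originalName;
        return assigned.computeIfAbsent(key, k -> {
            // First time we see this name: take the next number for its kind.
            int next = counters.merge(kind, 1, Integer::sum);
            return kind + next;
        });
    }
}
```

Asking for the placeholder of the same field twice gives the same answer both times, which is the property you need if a rename is to be applied everywhere the field is used. A real tool would also have to share this table across every class in the project.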
Thinking about the later analysis, lots of information has been lost here: how are the users naming their types? Are they following a convention? What packages did they import? Are all their variable names one letter long? Do they name their accessors with a “get” prefix? How useful/informative are their comments? Although this is fairly cast-iron anonymous, it’s also much less useful than the original source for research purposes.
(Also, on a technical note, this destroys line and column information, so you would have to adjust for this when recording the position of compile errors and edits, and so on.)
Anonymise All Comments
If we back off from anonymising everything, another approach is to anonymise comments, but leave the rest of the code untouched. This approach can’t guarantee complete anonymity: users can insert identifying information in the names they use (e.g. calling their class “JoesClass”, or their variable “bill”) or in String literals (e.g. my “Neil rules!” literal above). However, short of total anonymity, this seems like a sensible best-effort attempt at anonymisation. With just the comments anonymised, our code above would look like this:
import java.util.List;
/* Comment1 */
class Foo
{
    private int x = 0;
    private String s = "Neil rules!";
    // Comment2
    public String getValue()
    {
        return s;
    }
}
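As a rough sketch of how this comment-only replacement could be done (an illustration under my own assumptions, not the Blackbox code), the hypothetical `CommentReplacer` below walks the source once, copies string and character literals untouched so that a "//" inside a literal is not mistaken for a comment, and swaps each comment for a numbered placeholder. It does not handle Java text blocks, and an unterminated block comment swallows the rest of the file.

```java
/** Hypothetical helper: replaces each comment with "CommentN". */
final class CommentReplacer {
    static String replaceComments(String src) {
        StringBuilder out = new StringBuilder();
        int n = src.length();
        int i = 0;
        int count = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // Copy a string or char literal verbatim, skipping escapes,
                // and give up at end of line if it is never closed.
                int j = i + 1;
                while (j < n && src.charAt(j) != c && src.charAt(j) != '\n') {
                    j += (src.charAt(j) == '\\' && j + 1 < n) ? 2 : 1;
                }
                int end = Math.min(n, j + 1);
                out.append(src, i, end);
                i = end;
            } else if (src.startsWith("/*", i)) {
                // Block comment: replace everything up to the closing "*/".
                int close = src.indexOf("*/", i + 2);
                out.append("/* Comment").append(++count).append(" */");
                i = (close < 0) ? n : close + 2;
            } else if (src.startsWith("//", i)) {
                // Line comment: replace up to (not including) the newline.
                int eol = src.indexOf('\n', i);
                out.append("// Comment").append(++count);
                i = (eol < 0) ? n : eol;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Note that a multi-line comment collapses to a single line here, so line numbers shift in exactly the way described earlier.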
Anonymise The Header
The above version is more useful, and we’ve at least removed my name from the header comment. In fact, the comment before the class is the main place where people write their name. So rather than get rid of all comments and lose that interesting comment about the purpose of the method, we could keep comments that occur within the class, and just remove the header comment from before the class.
While we’re at it, there’s no need to completely replace the comment text with a placeholder (and as mentioned earlier, changing the number of lines is particularly irritating for other purposes), so we could just replace each letter/digit with a placeholder character, giving:
import java.util.List;
/**
 * # #### #####
 * @###### #### #####
 */
class Foo
{
    private int x = 0;
    private String s = "Neil rules!";
    // An accessor for the string held by the class
    public String getValue()
    {
        return s;
    }
}
That doesn’t leave anything revealing in the header comment, and I believe it still provides anonymisation that, for huge amounts of code, is second only to the complete anonymisation described at the start of the post. (In fact, we could ease off even further: spot the special “@XXX” Javadoc tags and retain the tag name.) This way, our rule becomes: any comment before the start of the class is anonymised, and anything after the class declaration is sent as-is.
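The character-for-character masking is simple enough to sketch. The following is only an illustration: the class `CommentMasker`, its method names and the list of tags in `KNOWN_TAGS` are my own assumptions, not what Blackbox actually does. Because every letter or digit becomes exactly one '#', the number of lines and the column positions stay the same.

```java
import java.util.Set;

/** Hypothetical helper: masks letters and digits while keeping layout. */
final class CommentMasker {
    // Assumed whitelist of tag names that are safe to leave readable.
    private static final Set<String> KNOWN_TAGS =
        Set.of("author", "version", "param", "return", "throws", "see", "since");

    // Surrogate halves are masked too, so letters outside the BMP do not leak.
    private static boolean shouldMask(char c) {
        return Character.isLetterOrDigit(c) || Character.isSurrogate(c);
    }

    /** Replaces every letter or digit with '#', leaving everything else. */
    static String mask(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out.append(shouldMask(c) ? '#' : c);
        }
        return out.toString();
    }

    /** Same as mask, but keeps whitelisted tag names such as "@author". */
    static String maskKeepingTags(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Only treat '@' as a tag when it does not follow a letter or
            // digit, so e-mail addresses are still masked in full.
            boolean atWordStart = i == 0 || !shouldMask(text.charAt(i - 1));
            if (c == '@' && atWordStart) {
                int end = i + 1;
                while (end < text.length() && Character.isLetter(text.charAt(end))) {
                    end++;
                }
                String word = text.substring(i + 1, end);
                out.append('@').append(KNOWN_TAGS.contains(word) ? word : mask(word));
                i = end;
            } else {
                out.append(shouldMask(c) ? '#' : c);
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch, `CommentMasker.mask(" * A test class")` gives `" * # #### #####"`, matching the header shown above. A whitelist is used for the tags rather than keeping anything that follows an "@", since otherwise something like a social media handle would slip through.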
Badly Formed Code
There is one more complication. My code example above is well-formed. But we are recording code during editing, when code will have all sorts of errors. For example:
- What if the user’s first comment is not terminated and accidentally includes the whole file in the comment?
- What if the user forgets the word “class” and starts the class improperly?
- What if the user’s first comment is not started properly, and thus the comment appears like a stream of unexpected identifiers?
- What if the user gets the package or import syntax wrong, and their package/import declaration is no longer recognisable as such?
At the moment, the answers (in the same order) would be:
- The entire file is then anonymised, as it is taken as being inside the comment before the class.
- Any code not recognisable as an import or package declaration, or a comment, is assumed to be the start of the “real code”. So a broken class declaration will not be anonymised (and we don’t want it to be — we want to see the errors students are making).
- The intended-comment would be sent non-anonymised. It’s hard to avoid this — if the code has syntax errors, how can we tell the difference between what was meant to be a comment, and what was meant to be a class declaration?
- In this case, the broken package or import line will be thought to be the beginning of the class. So any comment appearing after a broken import/package (even if it’s before a well-formed class declaration) is treated as part of the “real code”, and the comment will get sent as-is.
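The behaviour in these four cases falls out of a fairly simple scan, which can be sketched as follows. This is an illustration under my own assumptions rather than the real Blackbox code: the class `HeaderAnonymiser`, the `DECLARATION` pattern (a deliberately strict idea of what a "recognisable" package or import line looks like) and the helper `mask` are all hypothetical. Whitespace, comments and well-formed package/import lines count as header; the first thing that is none of those is taken as the start of the real code, and everything from there on is passed through untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: masks comments that appear before the real code. */
final class HeaderAnonymiser {
    // Assumed shape of a recognisable declaration: keyword, dotted name, ';'.
    private static final Pattern DECLARATION = Pattern.compile(
        "(?:package|import(?:\\s+static)?)\\s+[A-Za-z_$][\\w$]*"
        + "(?:\\s*\\.\\s*[A-Za-z_$][\\w$]*)*(?:\\s*\\.\\s*\\*)?\\s*;");

    static String anonymiseHeader(String src) {
        StringBuilder out = new StringBuilder(src.length());
        Matcher decl = DECLARATION.matcher(src);
        int n = src.length();
        int i = 0;
        while (i < n) {
            int end;
            boolean comment = false;
            if (Character.isWhitespace(src.charAt(i))) {
                end = i + 1;
            } else if (src.startsWith("/*", i)) {
                // An unterminated comment runs to the end of the file.
                int close = src.indexOf("*/", i + 2);
                end = (close < 0) ? n : close + 2;
                comment = true;
            } else if (src.startsWith("//", i)) {
                int eol = src.indexOf('\n', i);
                end = (eol < 0) ? n : eol;
                comment = true;
            } else if (decl.region(i, n).lookingAt()) {
                end = decl.end();
            } else {
                break; // anything else is the start of the "real code"
            }
            String piece = src.substring(i, end);
            out.append(comment ? mask(piece) : piece);
            i = end;
        }
        out.append(src, i, n); // the rest is sent as-is
        return out.toString();
    }

    /** Replaces each letter or digit with '#', keeping the layout. */
    private static String mask(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int k = 0; k < text.length(); k++) {
            char c = text.charAt(k);
            sb.append(Character.isLetterOrDigit(c) ? '#' : c);
        }
        return sb.toString();
    }
}
```

Feeding it a file whose first comment is never closed masks the whole file, while a file with a broken import line comes back unchanged, which matches the first and last answers above.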
Conclusion
Anonymising code, while keeping it useful, is tricky. When things like the Data Protection Act talk about anonymous data, they probably have in mind “Don’t ask for identifying details”, but with source code people insert identifying details without tagging them as such (not everyone uses the @author tag, especially beginners). So we are trying to make our best judgement of what is fair and feasible. It’s worth remembering that our data collection is opt-in, and that the code we are sent will not be public, and will only be accessible by a small group of researchers — but we are still doing our best to anonymise it to rule out any possible complications from storing participants’ source code.
Very interesting! I’ve had to anonymize large collections of student code in the past, and I took a slightly different approach by using the class roster. For each student submission, I’d get their full name from the roster and replace every instance of their first and last names in the whole file with first_ and last_.
This approach ran into a lot of problems very quickly: some students used nicknames, some included their middle name, some left comments referencing when a fellow student helped them. Some short names kept appearing in normal contexts throughout the program, and some names were misspelled! However, it did preserve the header comments, which might be helpful for those researchers studying commenting behavior. The whole process took a ridiculous amount of time, since I had to keep adding new values to the roster so that the anonymization program could check for them and replace them properly. Overall, I think your process sounds more reasonable. =)
By: Kelly R. on December 6, 2012
at 8:25 pm