hashhashmapcode-snippets

How can I make hashing algorithm for code snippets?


I would like to make an algorithm, which can detect code snippets when iterating through source codes, based on some hashing algorithms.

So for example, here is that short code snippet from a fragment shader:

vec3 Fresnel(vec3 F0, float cosTheta) {
    float foo1;
    float foo2;
    return F0 + (vec3(1, 1, 1) - F0) * pow(1-cosTheta, 5);
}

I would like to hash this few lines and after that based on the hash value, I would like to check if other code snippets are the same as the one above. Of course the name of the variables and the order of the arguments are not important, like the order of foo1 and foo2.

I know that a hashing algorithm like this is possible, but I don`t know where to start.


Solution

  • There are two aspects to this - the first one is to work out how to identify the code snippers you want to index. That might be quite complex, and you may or may not want overlapping/nested code snippets (e.g. for the return line, or the expression in it, to be one code snippet, while the overall Fresnel function is another. Anyway, I'll leave that aspect with you and address the hashing part.

    Once you have some way to iterate over your code snippets, you can simply invoke a hash function on the source code for that snippet: because the code snippet is a string of text, this is easily done in most languages - e.g. in C++ you could do std::hash<std::string_view>{}(my_code_snippet_string_view).

    "Of course the name of the variables and the order of the arguments are not important, like the order of foo1 and foo2." shows you'll want to put the code through some standardisation/normalisation process first (e.g. reformatting it, replacing arbitrary contiguous whitespace with a single space, removing comments, replacing identifiers with incrementing placeholders __1, __2, maybe sorting statements that don't have a necessary ordering etc.) so that functionally equivalent code will produce the same hash value. That will be highly language dependent, and to do it well you may need to use some manner of parser. It will be hard to do this perfectly, but you may get useful enough results for your purposes without too much work. (In C++, at the extreme you could use clang-tools to convert the code to an AST, do various transforms on it, but you'd need not just the code snippet but all the earlier code to make it a valid translation unit).