security hash cryptography message-digest

When is it safe to use a broken hash function?

It is trivial to use a secure hash function like SHA-256, and continuing to use MD5 for security is reckless behavior. However, there are some complexities to hash function vulnerabilities that I would like to better understand.

Collisions have been generated for MD4 and MD5. According to NIST, MD5 is not a secure hash function. It only takes 2³⁹ operations to generate a collision and should never be used for passwords. However SHA-1 is vulnerable to a similar collision attack in which a collision can be found in 2⁶⁹ operations, whereas brute force is 2⁸⁰. No one has generated a SHA-1 collision and NIST still lists SHA-1 as a secure message digest function.

So when is it safe to use a broken hash function? Even though a function is broken it can still be "big enough". According to Schneier a hash function vulnerable to a collision attack can still be used as an HMAC. I believe this is because the security of an HMAC is dependent on its secret key and a collision cannot be found until this key is obtained. Once you have the key used in an HMAC it's already broken, so it's a moot point. What hash function vulnerabilities would undermine the security of an HMAC?

Let's take this property a bit further. Does it then become safe to use a very weak message digest like MD4 for passwords if a salt is prepended to the password? Keep in mind the MD4 and MD5 attacks are prefixing attacks, and if a salt is prepended then an attacker cannot control the prefix of the message. If the salt is truly a secret, and isn't known to the attacker, then does it matter if it's appended to the password? Is it safe to assume that an attacker cannot generate a collision until the entire message has been obtained?

Do you know of other cases where a broken hash function can be used in a security context without introducing a vulnerability?

(Please post supporting evidence because it is awesome!)

Solution

Actually collisions are easier than what you list on both MD5 and SHA-1. MD5 collisions can be found in time equivalent to 2^26.5 operation (where one "operation" is the computation of MD5 over a short message). See this page for some details and an implementation of the attack (I wrote that code; it finds a collision within an average of 14 seconds on a 2.4 GHz Core2 x86 in 64-bit mode).

Similarly, the best known attack on SHA-1 is in about 2⁶¹ operations, not 2⁶⁹. It is still theoretical (no actual collision was produced yet) but it is within the realm of the feasible.

As for implications on security: hash functions are usually said to have three properties:

No preimage: given y, it should not be feasible to find x such that h(x) = y.
No second preimage: given x₁, it should not be feasible to find x₂ (distinct from x₁) such that h(x₁) = h(x₂).
No collision: it should not be feasible to find any x₁ and x₂ (distinct from each other) such that h(x₁) = h(x₂).

For a hash function with a n-bit output, there are generic attacks (which work regardless of the details of the hash function) in 2ⁿ operations for the two first properties, and 2^n/2 operations for the third. If, for a given hash function, an attack is found, which, by exploiting special details of how the hash function operates, finds a preimage, a second preimage or a collision faster than the corresponding generic attack, then the hash function is said to be "broken".

However, not all usages of hash functions rely on all three properties. For instance, digital signatures begin by hashing the data which is to be signed, and then the hash value is used in the rest of the algorithm. This relies on the resistance to preimages and second preimages, but digital signatures are not, per se, impacted by collisions. Collisions may be a problem in some specific signature scenarios, where the attacker gets to choose the data that is to be signed by the victim (basically, the attacker computes a collision, has one message signed by the victim, and the signature becomes valid for the other message as well). This can be counteracted by prepending some random bytes to the signed message before computing the signature (the attack and the solution where demonstrated in the context of X.509 certificates).

HMAC security relies on an other property that the hash function must fulfill; namely, that the "compression function" (the elementary brick on which the hash function is built) acts as a Pseudo-Random Function (PRF). Details on what a PRF is are quite technical, but, roughly speaking, a PRF should be indistinguishable from a Random Oracle. A random oracle is modeled as a black box which contains a gnome, some dice and a big book. On some input data, the gnome select a random output (with the dice) and writes down in the book the input message and the output which was randomly selected. The gnome uses the book to check whether he already saw the same input message: if so, then the gnome returns the same output than previously. By construction, you can know nothing about the output of a random oracle on a given message until you try it.

The random oracle model allows the HMAC security proof to be quantified in invocations of the PRF. Basically, the proof states that HMAC cannot be broken without invoking the PRF a huge number of times, and by "huge" I mean computationally infeasible.

Unfortunately, we do not have random oracles, so in practice we must use hash functions. There is no proof that hash functions really exist, with the PRF property; right now, we only have candidates, i.e. functions for which we cannot prove (yet) that their compression functions are not PRF.

If the compression function is a PRF then the hash function is automatically resistant to collisions. That's part of the magic of PRF. Therefore, if we can find collisions for a hash function, then we know that the internal compression function is not a PRF. This does not turn the collisions into an attack on HMAC. Being able to generate collisions at will does not help in breaking HMAC. However, those collisions demonstrate that the security proof associated with HMAC does not apply. The guarantee is void. That's just the same than a laptop computer: opening the case does not necessarily break the machine, but afterwards you are on your own.

In the Kim-Biryukov-Preneel-Hong article, some attacks on HMAC are presented, in particular a forgery attack on HMAC-MD4. The attack exploits the shortcomings of MD4 (its "weaknesses") which make it a non-PRF. Variants of the same weaknesses were used to generate collisions on MD4 (MD4 is thoroughly broken; some attacks generate collisions faster than the computation of the hash function itself !). So the collisions do not imply the HMAC attack, but both attacks feed on the same source. Note, though, that the forgery attack has cost 2⁵⁸, which is quite high (no actual forgery was produced, the result is still theoretical) but substantially lower than the resistance level expected from HMAC (with a robust hash function with an n-bit output, HMAC should resist up to 2ⁿ work factor; n = 128 for MD4).

So, while collisions do not per se imply weaknesses on HMAC, they are bad news. In practice, collisions are a problem for very few setups. But knowing whether collisions impact a given usage of hash functions is tricky enough, that it is quite unwise to keep on using a hash function for which collisions were demonstrated.

For SHA-1, the attack is still theoretical, and SHA-1 is widely deployed. The situation has been described like this: "The alarm is on, but there is no visible fire or smoke. It is time to walk towards the exits -- but not to run."

For more information on the subject, begin by reading the chapter 9 of the Handbook of Applied Cryptography, by Menezes, van Oorschot and Vanstone, a must-read for the apprentice cryptographer (not to be confused with "Applied Cryptography" by B. Schneier, which is a well-written introduction but nowhere as thorough as the "Handbook").