I'm genuinely curious how the decision to use MD5 gets made. Who says, "hey, maybe we should use MD5." And then who responds, "that sounds like a great idea Bob." Seriously. I've known for years that MD5 is insufficient for hashing passwords and I'm just some random guy. This kind of thing really baffles me.
Yahoo has been a company for a long time. I imagine your conversation happened round about 1999 when using MD5 wasn't insane. And then they were just slow to upgrade.
It's still bad, I'm just saying the conversation about what hash algo to use didn't happen yesterday.
I'd like to believe that. However, I was recently asked to test a new website for an organization I volunteer for, and discovered their "forgot password" flow emailed me my plaintext password. I wrote an explanation of why this was bad, and how it could be fixed, to a non-technical friend of mine who works there; he passed my email to the (Bay Area based!) consulting shop that did their website. The shop sent this response:
"We do not store passwords as a plain text in database. We have functionality which encrypts and decrypts passwords. We have only ecnrypted passwords in the database.
Almost all other servers use one-way encryption. In this case, passwords cannot be decrypted from hashing."
Again, this is a Bay Area based shop. For code written in 2016.
I was shocked to receive this, but it (among other things) leads me to suspect that there are a lot of people out there, in positions of power, who aren't just ignorant, but who actively cling to password-storage anti-patterns.
But it's not as if we haven't had a pretty much continuous stream of major data leaks for the past 5 years. Surely Yahoo engineers occasionally open a newspaper...
From everything I've read, the engineers did. The problem was that the security team had to go head-to-head with the budget team. And unfortunately, the budget team won - since the upper levels didn't feel that the IT security salaries were a necessary expenditure. And beyond that, there was concern that making people actually change their passwords regularly and requiring anything like security in said passwords was going to discourage users from using Yahoo and send them over to GMail.
> The problem was that the security team had to go head-to-head with the budget team. //
Wouldn't engineers at such a big corp whistle-blow such incompetent decision making?
Apparently [1] they had a $1.37B net income in 2013. Given that salted bcrypt (which is built on the Blowfish cipher) was pretty much a de facto standard by that point (I think that's what Wordpress were doing, hardly revolutionary security work) it seems the relative cost for Yahoo was approximately zero.
All I can imagine is that those in control were asked to leave the system open for government snooping? Why else would engineers working there not [anonymously] bring this to press attention - "hey, Yahoo security amounts to a piece of sticky tape holding a bank-vault shut".
It's not that hard to implement something at the start. It's more work to retrofit it on top of an existing system in a way that doesn't reduce the total security.
But would it require users to change their password?
The way I would have implemented it, but would be keen to know how secure it is, is that you start with the md5 of the password (md5(password)). You then bcrypt or scrypt that md5 (bcrypt(md5(password))) and replace the md5 in your database with the bcrypt hash.
When a user logs in, all you need to do is to calculate the md5 first then check that md5 against the bcrypt hash you have stored.
I am not a crypto expert but intuitively it doesn't look like I would have weakened the security that way. You can't really attack bcrypt(md5(password)) much more than bcrypt(password). Can you?
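For what it's worth, here is a minimal sketch of that bcrypt(md5(password)) idea in Python. Big assumptions: bcrypt isn't in the standard library, so PBKDF2-HMAC-SHA256 stands in for it here, and the names (strong_hash, wrap_legacy_md5, verify) and the iteration count are made up for illustration, not anyone's actual implementation.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # hypothetical work factor; tune for your hardware


def strong_hash(data: bytes, salt: bytes) -> bytes:
    # Stand-in for bcrypt/scrypt: PBKDF2-HMAC-SHA256 from the standard library.
    return hashlib.pbkdf2_hmac("sha256", data, salt, ITERATIONS)


def wrap_legacy_md5(md5_hex: str) -> tuple[bytes, bytes]:
    """Upgrade a stored MD5 hex digest without knowing the password."""
    salt = os.urandom(16)
    return salt, strong_hash(md5_hex.encode("ascii"), salt)


def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """Recompute strong_hash(md5(password)) and compare in constant time."""
    md5_hex = hashlib.md5(password.encode("utf-8")).hexdigest()
    candidate = strong_hash(md5_hex.encode("ascii"), salt)
    return hmac.compare_digest(candidate, stored)


# One-off migration of a single legacy row, then two login attempts.
legacy_md5 = hashlib.md5(b"hunter2").hexdigest()
salt, stored = wrap_legacy_md5(legacy_md5)
print(verify("hunter2", salt, stored))  # True
print(verify("hunter3", salt, stored))  # False
```

The sketch feeds the hex form of the MD5 digest into the outer hash rather than the raw bytes. That is a deliberate choice on my part: as far as I know, some bcrypt implementations stop at the first NUL byte, and raw digests can contain one.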
The method I've used is to add a column for the new stronghash, then update the old column to stronghash(<oldhash>), where <oldhash> is dumbhash(password). On login you check against stronghash(dumbhash(password)); while you have the plaintext password in memory, you also generate plain stronghash(<password>), update the row with the new hash (simple and interoperable, not dependent on dumbhash), and drop the stronghash(<oldhash>). After a <longtime> limit (chosen to balance the maintenance overhead of the additional column and behavior against limiting exposure to the minority of users who haven't logged in for <longtime>), you drop the stronghash(<oldhash>) for everyone and do a "we sent you a reset email" for anyone who tries to log in but has no <stronghash> password hash.
I am not sure I agree. Your way will leave all the non active users exposed in the case of a leak. They may not be active on your website but are likely active on another website using the same password.
The problem is in collisions. Md5(password) can yield the same result for many different values of password so simply bcrypting that result means that you start with a restricted possibility space. So less secure. Punts the question to how much less secure. Seems to me it would still be worth it to do and then all new passwords going forward are done correctly.
Agree, but a collision even for md5 is a relatively rare event. When brute-forcing the bcrypt hash, this would reduce the attempts you would need to try against a given hash, but only by a very small factor. With a reasonable work factor, I would assume it would still make a brute force attack impractical at scale.
I didn't do the test, but I'd expect that there wouldn't be more than a handful of collisions for the md5 of the 100m most common passwords.
[edit] I actually just ran the test on this 10m password list and found no collisions.
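In case anyone wants to repeat that kind of check, this is roughly how I'd do it in Python. It's only a sketch: the function name is made up and the sample list is a tiny placeholder, not the 10m list mentioned above.

```python
import hashlib
from typing import Iterable, Optional


def first_md5_collision(passwords: Iterable[str]) -> Optional[tuple[str, str]]:
    """Return the first pair of distinct passwords sharing an MD5 digest, or None."""
    seen: dict[bytes, str] = {}
    for pw in passwords:
        digest = hashlib.md5(pw.encode("utf-8")).digest()
        other = seen.get(digest)
        if other is not None and other != pw:
            return other, pw  # two different inputs, same digest
        seen[digest] = pw
    return None


# Tiny illustrative list; a real run would stream lines from a wordlist file.
sample = ["123456", "password", "qwerty", "letmein", "password"]
print(first_md5_collision(sample))  # None (a repeated entry is not a collision)
```

A repeated entry in the list hashes to the same digest, of course, so the check only reports a hit when the two inputs actually differ.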
That being said md5 does generate collisions. I was playing with the IMDB movie database that you can download. They use a combination of the title and the year as a primary key. I tried using an md5 instead to save space (but giving a reproducible ID instead of an identity column), and got many collisions. No collision with SHA256.
Wait, what? No MD5 collisions at all were publicly known until Xiaoyun Wang disclosed one in 2004 using a new cryptographic technique she invented (explained in Wang and Yu's "How to Break MD5 and Other Hash Functions").
MD5 has a 128-bit output so collisions that occur by chance should require about 2⁶⁴ inputs (18 exa-inputs). Surely your database didn't contain over 2⁶⁴ different movie records.
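To put numbers on that, here is a quick sketch of the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n random inputs and a b-bit hash. The 10 million figure is just my guess at a generous size for a movie table, not a number from the comment above.

```python
import math


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday approximation for at least one collision among n random inputs."""
    exponent = -n * (n - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is astronomically small.
    return -math.expm1(exponent)


for n in (10_000_000, 2**40, 2**64):
    print(f"{n:>26,} inputs -> p ~ {collision_probability(n):.3g}")
```

At 2^64 inputs the approximation gives a probability of about 0.39, which is where the "about 2^64 inputs" rule of thumb comes from. For a table of ten million rows the chance is vanishingly small.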
Could you take a look at what you were doing again? Your description doesn't really make sense mathematically.
You likely goofed something up. No one has demonstrated two strings that are conceivably used as passwords that users type in -- and that includes the tuple {movie title:year} -- that have MD5 collisions.
Oh, of course md5 has collisions. It's relatively easy (not computationally easy, but there are known methods) to find two random strings that hash to the same value, it's just very difficult to find a string that hashes to the value of a specific other string.
Not "relatively easy" by chance: it should require 2⁶⁴ entries in your database to see a single collision happen at random! It's only "relatively easy" following cryptographic research in the early 2000s that exploits structure in MD5 to produce collisions deliberately.
Yes, collisions are easier than preimages, but they still shouldn't occur by chance in real applications!
Unfortunately, this isn't an accurate description of the nature of the collision problem with MD5, which involves carefully crafted inputs using a sophisticated cryptographic attack -- not arbitrary user inputs that don't intend to collide with each other. See my and danielweber's comments about this down-thread.
(Yes, susceptibility to collisions was recognized as a problem with MD5 leading to a reason not to use it, but the collisions in question were constructed, not encountered accidentally. There isn't any evidence to date that the probability of a collision given two randomly chosen inputs is higher than the expected 1/2¹²⁸. You could test this yourself by hashing 2⁴⁰ random strings under MD5: you won't see a collision among the outputs!)
>Md5(password) can yield the same result for many different values of password //
Not "many different" using the normal constraints of text/numbers/typographical-marks and with maximum password lengths of 32 or so (I'll bet Yahoo's was shorter than that in 2013).
Yes, because MD5 digests are much shorter than 32 characters, even if it's just ASCII, so by the pigeonhole principle there must be. If you're asking whether there are _known_ collisions between two messages of fewer than 32 printable ASCII characters -- the answer is likely yes, but they are not known to me and likely not publicly known at all yet.
And nobody ever seemed to say "hey, maybe we should be using something more secure". Yahoo's been around for how many decades, and the fact they were still using MD5 in 2013 is just shameful. Yeah if it was some legacy code from 1993 you can probably excuse it, but I just can't believe after 20 years nobody thought it was a problem.
I'm not really a software developer but I really can't imagine it being a huge change. Instead of md5(pass) you could probably just change that to secure_hash(md5(pass), salt), add another column in the database for the salt, and rehash all the passwords. Customers wouldn't notice. Rehashing the databases would take a while, but otherwise that's really not a huge amount of work.
Well, you can only rehash if you have the plaintext password. So you have to wait until they login again, or force a password reset for everyone. In the former case you're stuck with a bunch of md5 passwords hanging around for any account that's not very active, and for the latter you'll lose some percentage of active accounts whose reset process is for some reason no longer functional. You could mix-and-match the two methods (start with the former, force the latter on any stragglers after, say, a few weeks) to minimize the damage, but that's more work and a number that someone somewhere in the organization finds very important is still probably gonna go down.
(I've never had to do this myself, so these are just the most obvious options I came up with. Possibly there are others.)
> You can only rehash if you have the plaintext password
There are techniques to rehash, even without the plain-text password, and without the user having to login to trigger a rehash.
Drupal 7 used such a technique for upgrades from Drupal 6, migrating from MD5 to a salted sha512 hash, but it's not an uncommon technique.
The old passwords are stored as MD5 hashes in the databases. The MD5 hash is processed through the same techniques as new passwords: a salt and the new sha512 hash. Provide a way to identify whether the origin was a password, or an MD5 hash.
Either way, you end up with a hash. You can identify whether the origin was a password, or an MD5 hash, but you can neither determine the origin MD5 hash, nor the origin password, as the new hash is secure. So even if the original MD5 hash was insecure, the new hash is secure.
When someone attempts to login, you still need to determine which password-validation to use: hash = sha512(salt + password), or hash = sha512(salt + MD5(password)), but the security level is the same.
> "Passing the password through MD5 reduces the complexity to 128 bits, you can't get that back."
Assuming that the new hash is secure (and sha512 is generally agreed to be secure), then, given a specific sha512 hash, the original MD5 hash can only be determined by brute force or precomputed tables, which is computationally very expensive. Even though entropy is reduced, it's still a significant amount of work to determine the original MD5 hash (significant in this instance being longer than the heat death of the universe, given current extrapolations of computing performance).
Attacks against MD5 are based around knowing the original MD5 hash. In this instance, the original MD5 hash is unknown, so there is no mathematical shortcut to finding a collision.
In this case an attacker isn't looking for a collision (which would mean creating two passwords with the same hash, and what hash that is doesn't matter).
The attacker needs a password with a specific hash, and the best reported attack for that is around 2^128.
> Passing the password through MD5 reduces the complexity to 128 bits
No, this is not the problem with MD5. You are not going to find two user-memorizable-and-typeable passwords with an MD5 collision.
If you are bringing a password with more than 128 bits of complexity to the party, any password storage scheme better than plaintext will have your password safe.
I have been in this situation, and you're correct.
Somewhere in the organization, a product team is going to throw a fit about usability and churn over the decision to reset user passwords en masse, or to force users to change them when they first log in. This isn't a slight against product managers, but one of the clearest indications of a company's overall security culture "health" is how the security, engineering and product teams choose to compromise and "pick their battles." Risk accepting vulnerabilities has a legitimate place when you have to balance product development and usability, but so does pushing back on egregious issues.
I don't have privileged insight into Yahoo's organization, but in this case it's pretty clear the security team should have either been more diligent in conveying the ramifications or less kneecapped by the surrounding org units, depending on the circumstance. More importantly, Yahoo should have "migrated" their passwords in the manner a parallel comment explains in this thread. This is what Facebook and other companies did after maturing their security programs (see "Facebook Onion" on how Facebook transitioned away from MD5).
Also good to note - there is evidence Yahoo's security culture improved over the years. The decision to go with MD5 almost certainly happened in the 90s, and when Tumblr suffered a breach all users were forced to reset their passwords. The capability and awareness was clearly there.
The password "foo" may hash to "12345". If an attacker were to discover that the hash is "12345", they would look for a password that hashes to "12345", which could, hypothetically, be the password "bar". They don't know the original password "foo", they've simply discovered an alternative, which happens to match the algorithm enough to unlock access.
In general, rainbow tables are used for identifying and attacking common passwords, but that doesn't mean that the algorithm is insecure.
Insecure algorithms can be attacked through collisions, which don't necessarily give you the original password, they just provide an alternative password which is accepted by the algorithm. The distinction matters when it comes to password reuse, because if Site A uses MD5, but Site B uses sha512, finding a collision that grants access on Site A doesn't necessarily give you a password that will grant access on Site B.
Having worked with the kind of monolithic legacy codebases they likely have: the code has gone through hundreds of developers who don't work for the company anymore and who left behind a bunch of spaghetti code, which means it takes a huge effort to make sure that none of their other services break when they implement such changes. Also, management HATES when dev teams do this because it isn't "new stuff" that's immediately visible to their bosses or the end user.
If anything goes wrong with the password update, users get angry and lose faith in the services, people get stressed, a few people maybe get fired, etc etc. On the other hand, if you let it stay old and crappy, everything stays just peachy, and nobody is the wiser that the entire system is a house of cards. Until the day someone hacks the database of course... which happened, so it's "now" a problem.
They're not going to begin to take security seriously even after this incident. They'll do what they need to right now but there's no auditing and their users don't normally care about this sort of thing, therefore the management won't care either.
There are likely to be a lot of identity systems using the password in the database, all of which have been coded to look for an MD5 hash, not a salted hash. This means code in a number of applications have to be updated at the same time.
The typical way around this is to create your new destination column (e.g. sha256 with salt), and progressively have applications reference this column rather than the MD5 unsalted column.
It's a huge amount of work, and if the applications were made in the 1990s, the code is likely legacy. If Yahoo are doing regular code security reviews, this will likely have been put in the pile of "we need to fix, but it's too costly to do".
That's the right question to ask. The answer is no, because new security vulnerabilities are disclosed every hour.
A large organisation will implement layered security (otherwise known as layers of the onion) to prevent this type of attack. This means: more secure passwords to access the password database, fewer people with access, rotation of access passwords, auditing of backup storage and encryption, etc etc. Clearly Yahoo's layers of security were all broken to allow this type of theft.
Really? Moving from doing md5(password) to bcrypt(password,salt)? I see organisations make things hard and legacy code-base, yadda, yadda but surely if Yahoo couldn't do this then they couldn't manage scratching their own butt; it really seems like quite a small change in the scheme of things. Like one senior engineer, one afternoon of work (then testing, etc., OK, sure) ... ?
I'm going to go out on a limb and guess you've never worked as a software engineer in a large organisation.
Given MD5 hashes are currently stored, how do you propose users' passwords get converted to SHA256/512? Should Yahoo brute force the passwords, and then store them in the new algorithm? Or should they wait for the user to log on, verify their password, and store it in the new hash algorithm (given some users rarely log on, this could take over 12 months to complete 80% of users)?
Hashing the hash isn't a good idea, you're reducing the domain of your secure_hash function to the range of md5. The way to do it is to have a "password hash algo version" column and when the user puts in their password, you verify against the hash[algo](password) and rehash with the later version, changing the algo column for that user.
You need to do both. If you only do the latter, then stale accounts which never log in again will never have their passwords upgraded to the more secure hash. Hashing the hash allows you to replace the md5 hashes immediately, and then you can perform the upgrade if/when the user logs in again.
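A rough sketch of "doing both" in Python, to make the flow concrete. Everything here is hypothetical: the UserRow shape, the algo labels and the function names are invented for illustration, and PBKDF2 from the standard library stands in for whatever slow hash you'd really pick.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass

ITERATIONS = 100_000  # hypothetical work factor


def kdf(data: bytes, salt: bytes) -> bytes:
    # Stand-in for bcrypt/scrypt.
    return hashlib.pbkdf2_hmac("sha256", data, salt, ITERATIONS)


def md5_hex(password: str) -> bytes:
    return hashlib.md5(password.encode("utf-8")).hexdigest().encode("ascii")


@dataclass
class UserRow:
    algo: str      # "md5", "kdf(md5)" or "kdf"
    salt: bytes
    digest: bytes


def wrap_stale_row(row: UserRow) -> None:
    """Step 1: batch job, no plaintext needed. md5 -> kdf(md5)."""
    if row.algo == "md5":
        row.salt = os.urandom(16)
        row.digest = kdf(row.digest, row.salt)
        row.algo = "kdf(md5)"


def login(row: UserRow, password: str) -> bool:
    """Step 2: verify according to the row's version, then upgrade to kdf(password)."""
    if row.algo == "kdf":
        candidate = kdf(password.encode("utf-8"), row.salt)
    elif row.algo == "kdf(md5)":
        candidate = kdf(md5_hex(password), row.salt)
    else:  # unmigrated "md5" row
        candidate = md5_hex(password)
    if not hmac.compare_digest(candidate, row.digest):
        return False
    if row.algo != "kdf":
        # The plaintext is in memory right now, so drop the md5 layer.
        row.salt = os.urandom(16)
        row.digest = kdf(password.encode("utf-8"), row.salt)
        row.algo = "kdf"
    return True


row = UserRow("md5", b"", md5_hex("correct horse"))
wrap_stale_row(row)
print(row.algo)                     # kdf(md5)
print(login(row, "correct horse"))  # True
print(row.algo)                     # kdf
```

Stale accounts stay at "kdf(md5)" indefinitely, which is still far better than bare md5, and active accounts shed the md5 layer the next time they log in.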
You're assuming engineering is just sitting on their thumbs, reviewing their code once a week, thinking of ways to optimize it.
In reality, they're constantly under pressure to develop new features, fix reported bugs, move on to the next project, keep the site from falling over, etc etc.
And the ones who choose NOT to work hard aren't sitting around reviewing old code either.
For an IdP at the scale of Yahoo, they can adopt something as complicated as supporting versioned passwords and migrating credentials to the latest secure algorithm upon successful login. You have the clear text password at that point. You can store metadata such as the version (or algorithms) used to hash the credential.
It's easy as hell. Even PHP, so often flamed for "bad security" these days supports EASY functions for this (and polyfills are available, if you're running PHP < 5.5, which you shouldn't do anyway):
- password_hash, which creates a salted hash (the returned value consists of a type/strength spec, the hash, and the salt)
- password_verify, which verifies a password with a hash in a timing-safe manner
- password_needs_rehash, which tells you if you should update the hash in the database
password_hash and password_needs_rehash take a parameter for the hash function (currently only bcrypt is supported, quite likely to keep people from using md5/sha1), and for the cost (which controls how much work each hash takes; for bcrypt, each increment doubles the number of rounds).
I believe any reasonable programming language these days has such functions.
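For illustration, here's a rough Python analogue of those three functions. To be clear, this is NOT PHP's API or its storage format: the names just mirror the PHP ones, the "algo$cost$salt$hash" layout is something I made up, PBKDF2 stands in for bcrypt, and "rehash when the stored cost is lower than the current one" is my own simplification.

```python
import base64
import hashlib
import hmac
import os

ALGO = "pbkdf2-sha256"
DEFAULT_COST = 100_000  # hypothetical iteration count


def password_hash(password: str, cost: int = DEFAULT_COST) -> str:
    """Return a self-describing string: algo$cost$salt$digest."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, cost)
    return "$".join([
        ALGO,
        str(cost),
        base64.b64encode(salt).decode("ascii"),
        base64.b64encode(digest).decode("ascii"),
    ])


def password_verify(password: str, encoded: str) -> bool:
    """Re-derive the digest with the stored parameters; constant-time compare."""
    algo, cost, salt_b64, digest_b64 = encoded.split("$")
    if algo != ALGO:
        return False
    salt = base64.b64decode(salt_b64)
    expected = base64.b64decode(digest_b64)
    actual = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, int(cost))
    return hmac.compare_digest(actual, expected)


def password_needs_rehash(encoded: str, cost: int = DEFAULT_COST) -> bool:
    """True if the stored hash uses another algorithm or a lower cost."""
    algo, stored_cost, _, _ = encoded.split("$")
    return algo != ALGO or int(stored_cost) < cost


old = password_hash("s3cret", cost=1_000)
print(password_verify("s3cret", old))                   # True
print(password_needs_rehash(old))                       # True
print(password_needs_rehash(password_hash("s3cret")))   # False
```

Because the parameters travel with the hash, raising the cost later doesn't need a schema change: verify as usual, and if password_needs_rehash says so, store a fresh password_hash while the plaintext is still in memory.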
What I am NOT so sure about is how the various LDAP server implementations, which many people use for SSO and "normal" account management (because it's easier to connect new software to LDAP than to migrate existing user DBs into LDAP), handle password storage. I mean, keeping the credentials on a separate LDAP server keeps them out of the web application's database, but what if someone breaches both servers, or the LDAP daemon is running on the same host as the webserver?
It gets/got made ~10-15 years ago. (I don't understand the "no salt" thing, though. That was common practice even ~20 years ago on Linux machines, so I'm mildly surprised that it wasn't implemented in this case.)
> I'm genuinely curious how the decision to use MD5 gets made.
You assume a formal decision was made? I think a manager just went "make them secure" and history was made. That's how it usually seems to happen if it's not a user-facing thing.
I think the organization as a whole is just indifferent. Does this breach really matter to Yahoo's bottom line? They were already sold to Verizon. Most of the active users probably won't read this news. It's sad to say, but I think Yahoo as a whole just doesn't care about their users.
No, sorry. They're borderline criminally negligent. When you have 1bn passwords stored in raw md5, a decade after the first rainbow tables were published, then you don't deserve anyone's business or your freedom.
But it's already a godsend compared to what many banks do, storing passwords in plaintext, sending reset passwords via plaintext email, requiring 4-8 character passwords that can only contain digits and a limited set of characters, etc.
I'd be more than happy if any bank would follow Yahoo!'s password standards.
As a data point: when I was a teenage code monkey in 2004 writing PHP I already understood that unsalted MD5 is unsafe.
According to Wikipedia:
* 2004 it became possible to find MD5 collisions at a rate of one per hour on a cluster
* 2005 it became possible to do this within "a few hours" on a consumer laptop
* 2006 it became possible to do this within one minute
* nowadays it's possible to do this "within seconds"
Plus, as others have mentioned, it's now possible to find collisions instantly by using widely available rainbow tables, e.g. https://md5db.net/decrypt
The MD5 collision attack is usually done by researchers like this: they want to generate 2 files with the same MD5 hash (they can put anything they want in these files).
This kind of attack doesn't affect passwords. The user picked one file (i.e. the password), you don't know it, you can't change it, you can't choose it.
The existence of crafted collisions -- being able to create a pair of M1 and M2 such that MD5(M1) = MD5(M2) -- is primarily relevant to situations where MD5 is being used as a signature algorithm, such as in certificate issuance. In these applications, being able to generate a pair of documents with the same hash is catastrophic.
Being able to generate a pair of passwords that are treated as equal, on the other hand, is useless from a security perspective. It's a neat party trick, but it's not dangerous.
Now, if there were a preimage attack -- being able to take MD5(M1) and come up with a M2 such that MD5(M2) = MD5(M1) -- that'd be a much bigger deal, and it'd break MD5 password hashing wide open. But nobody's done that yet.
I'm a total greenhorn when it comes to cryptography, but the difference between these two situations was totally lost on me until I read this comment. When I see, "It's easy to create MD5 collisions," my first thought is, "If you give me a hash, it's easy to find a string that results in an identical hash." If I'm understanding this right, that would be a "preimage attack," and would be bad for all the reasons being discussed in this thread.
However, it seems like "It's easy to create MD5 collisions," at least as it is true today, actually means something different: That, given a string, it's easy to find a second string that shares the same hash. If that's the case, I have two questions:
* I am totally lost as to how these are different scenarios. There's no difference I can see between "Here's string A" and "here's the hash of string A," if the goal is to find a "string B" that shares the hash. Are these "crafted collisions" generated by modifying string A and string B, until a collision pops out?
* If that's the case... what's everyone freaking out about? Why were people saying MD5 is unsafe 20 years ago, if even now, we can't achieve a preimage attack that can get you into an account based on the valid password's hash? Yahoo could have printed these hashes out and hung them up on posters in the mall and no one would have been able to get into accounts from it. There are dozens of comments lamenting how stupid this was, but... it seems like there's no actual problem?
> However, it seems like "It's easy to create MD5 collisions," at least as it is true today, actually means something different: That, given a string, it's easy to find a second string that shares the same hash.
Very early MD5 collision attacks were even weaker, actually: given nothing, it was possible to find a pair of arbitrary garbage strings which had the same hash as each other. It wasn't until later that it became possible to pick what the strings would "look like".
> Are these "crafted collisions" generated by modifying string A and string B, until a collision pops out?
Generally speaking, yes.
> If that's the case... what's everyone freaking out about?
The issue with using MD5 as a password hash function actually has nothing to do with collisions. That's a red herring. :) The real problem is that using any fast and/or unsalted hash function for passwords is unsafe!
A fast hash function is unsafe because it makes it easy to generate a bunch of potential passwords, calculate their hashes, and look for a match.
An unsalted hash function is unsafe because it makes it possible to build a "rainbow table" of all possible passwords and their hashes, and look up password hashes in that table.
As used in this situation, MD5 is both fast and unsalted.
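A toy Python sketch of both problems, in case it helps. The word list is a placeholder, the function names are made up, and a plain dict stands in for a real rainbow table (which is a more compact structure, but serves the same purpose).

```python
import hashlib
import os

COMMON = ["123456", "password", "qwerty", "letmein", "dragon"]

# Built once, reusable against every unsalted MD5 dump.
LOOKUP = {hashlib.md5(p.encode()).hexdigest(): p for p in COMMON}


def crack_unsalted(md5_hex: str):
    """Unsalted: cracking is a single dictionary lookup."""
    return LOOKUP.get(md5_hex)


def salted_md5(password: str, salt: bytes) -> str:
    return hashlib.md5(salt + password.encode()).hexdigest()


def crack_salted(target: str, salt: bytes):
    """Salted but fast: no precomputation, yet guessing per user is still cheap."""
    for p in COMMON:
        if salted_md5(p, salt) == target:
            return p
    return None


leaked = hashlib.md5(b"letmein").hexdigest()
print(crack_unsalted(leaked))             # letmein

salt = os.urandom(16)
salted = salted_md5("letmein", salt)
print(crack_unsalted(salted))             # None: the precomputed table is useless
print(crack_salted(salted, salt))         # letmein: a fast hash still falls to guessing
```

So the salt kills the precomputed table, but it takes a deliberately slow function to make the per-user guessing loop expensive.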
Most people here don't seem to understand the difference between collision and preimage attacks. So they're overreacting to the fact Yahoo used MD5.
Storing unsalted passwords, however, would be a huge mistake, if Yahoo did so as someone here claimed.
There are precomputed lookup tables for the unsalted hashes of many, many passwords (both MD5 and more secure hashes) and cracking unsalted passwords is simply a database lookup.
Ah ha! There's the weakness I was missing, thank you so much for responding. I hadn't even thought of it that way---I knew salts shook up the resulting hashes, but an actual benefit of it is that it makes it pretty much impossible to do any "homework" (rainbow tables) ahead of time.
> Sure, SHA1, scrypt or bcrypt with salt were already common back then, but it's an entirely different story than if they had used it today.
Not an excuse, this is Yahoo, not a PHP shop in India doing some low budget contracting. They should have a top of the line security team enforcing the most recent secure practices. Furthermore I got no email from Yahoo telling me that my account may have been hacked. Both incompetent and irresponsible at the same time.
By the way I did some PHP dev back in 2011. bcrypt hashing was already common practice. How can you come up with that argument in good faith?
> Furthermore I got no email from Yahoo telling me that my account may have been hacked
Then your account was most likely not on the list of accounts compromised.
> By the way I did some PHP dev back in 2011
Well Yahoo is a tad bit older than that, by about 17 years. This is not an excuse, but really comparing your 2011 coding to 1994.... Go ahead and boot up your old 486. I'll get back to you when this page loads up in an hour. :)
Yahoo's code base is old and huge, like billions of lines huge. Yahoo's engineers have modernized it at a massively rapid pace. I'm not sure of current state, but when I left Yahoo finance was written in something like 10 languages including serving pages in C, cause that's all they had back then.
Current tech is NodeJSish and others. They have their own hardened versions. But still migrating millions of lines of C to something other than C isn't a walk in the park.