The Strong Password Dilemma

Excerpt from Chapter 6 of

Authentication: From Passwords to Public Keys

ISBN 0-201-61599-1

Copyright © 2002 Addison-Wesley.

This paper was also published as an article in CSI’s Computer Security Journal, Summer 2002 (see Note 14).

Strong Password Policies

(I cheat and make all my computer accounts use the same password.)

- Donald A. Norman, The Design of Everyday Things

Since passwords were introduced in the 1960s, the notion of a “good” password has evolved in response to attacks against them. At first, there were no rules about passwords except that they should be remembered and kept secret. As attacks increased in sophistication, so did the rules for choosing good passwords. Each new rule had its justification and, when seen in context, each one made sense. People rarely had trouble with any particular rule: the problem was with their combined effect.

The opening quotation illustrates one well-known assumption about proper password usage: it’s “cheating” to use the same password for more than one thing. This is because passwords may be intercepted or guessed. If people routinely use a single password for everything, then attackers reap a huge benefit by intercepting a single password. So, our first rule for choosing passwords might be:

1. Each password you choose must be new and different.

An early and important source of password rules was the Department of Defense (DOD) Password Management Guideline (see Note 1) Published in 1985, the Guideline codified the state of the practice for passwords at that time. In addition to various technical recommendations for password implementation and management, the Guideline provided recommendations for how individuals should select and handle passwords. In particular, these recommendations yielded the following password rule:

2. Passwords must be memorized. If a password is written down, it must be locked up.

Password selection rules in the DOD Guideline were based on a simple rationale: attackers can find a password by trying all the possibilities. The DOD’s specific guidelines were formulated to prevent a successful attack based on systematic, trial-and-error guessing. The Guideline presented a simple model of a guessing attack that established parameters for password length and duration. This yielded two more password rules:

3. Passwords must be at least six characters long, and probably longer, depending on the size of the password’s character set.

4. Passwords must be replaced periodically.

The DOD Guideline included a worked example based on the goal of reducing the risk of a guessed password to one chance in a million over a one-year period. This produced the recommendation to change passwords at least once a year. Passwords must be nine characters long if they only consist of single-case letters, and may be only eight characters long if they also contain digits. Shorter passwords would decrease the risk of guessing to less than one in a million, but that still provided good security for most applications. The DOD Guideline didn’t actually mandate eight-character passwords or the one-in-a-million level of risk; these decisions were left to the individual sites and systems.

In fact, the chances of guessing were significantly greater than one in a million, even with eight- and nine-character passwords. This is because people tend to choose words for passwords-after all, they are told to choose a word, not a secret numeric code or some other arbitrary value. And there are indeed a finite number of words that people tend to choose. Dictionary attacks exploit this tendency. By the late 1980s, dictionary attacks caused so much worry that another password rule evolved:

5. Passwords must contain a mixture of letters (both upper- and lowercase), digits, and punctuation characters.

Now that we have these five rules in place, it is time to click on this link. The evolving rules, and the corresponding increases in password complexity, have now left the users behind. None but the most compulsive can comply with such rules week after week, month after month. Ultimately, we can summarize classical password selection rules as follows:

The password must be impossible to remember and never written down.

The point isn’t that these rules are wrong. Every one of these rules has its proper role, but the rules must be applied in the light of practical human behavior and peoples’ motivations. Most people use computers because they help perform practical business tasks or provide entertainment. There’s nothing productive or entertaining about memorizing obscure passwords.

Passwords and Usability

Traditional password systems contain many design features intended to make trial-and-error attacks as hard as possible. Unfortunately, these features also make password systems hard to use. In fact, they violate most of the accepted usability standards for computer systems. Of the eight “Golden Rules” suggested by Ben Shneiderman for user interface design, password interactions break six of them (see Table 1). People can’t take shortcuts: the system won’t match the first few letters typed and fill in the rest. Most systems only report success or failure: they don’t say how close the password guess was, or even distinguish between a mistyped user name and a mistyped password. Many systems keep track of incorrect guesses and take some irreversible action (like locking the person’s account) if too many bad guesses take place. To complete the challenge, people rarely have a chance to see the password they type: they can’t detect repeated letters or accidental misspellings.

Table 1: Password Systems Are Not User Friendly
 Golden Rules of User Interface Design
(See Note 2)
 True for Passwords?
 1. Strive for consistency  YES
 2. Frequent users can use shortcuts  NO
 3. Provide informative feedback  NO
 4. Dialogs should yield closure  YES
 5. Prevent errors and provide simple error handling  NO
 6. Easy reversal of any action  NO
 7. Put the user in charge  NO
 8. Reduce short-term memory load  NO

To appreciate another truly fundamental problem with passwords, consider what happens when changing a password. Imagine that a user named Tim needs to change his password, and he wishes to follow all of the rules. While it’s possible that he might have a particular password in mind to use the next time the occasion arises, many (perhaps most) people don’t think about passwords until they actually need to choose one. For example, Windows NT can force its users to immediately change a password during the logon process, usually because the existing password has become “too old.” If Tim hasn’t thought of another good password ahead of time, he must think of one, fix it permanently in his mind, and type it in twice without ever seeing it written.

This presents a significant mental challenge, especially if Tim tries to follow the classic password selection rules. He has to remember and apply the rules about length, reuse, and content. Then he must remember the password he chose. This is made especially hard since the system won’t display the password he chose: Tim must memorize it without the extra help of seeing its visual representation.

Human short-term memory can, on average, remember between five and nine things of a particular kind: letters, digits, words, or other well-recognized categories. The DOD Guideline spoke of eight- or nine-character passwords, which lie on the optimistic end of peoples’ ability to memorize. Moreover, Tim’s short-term memory will retain this new password for perhaps only a half minute, so he must immediately work at memorizing it. Studies show that if Tim is interrupted before he fully memorizes the password, then it will fall out of his working memory and be lost. If Tim was in a hurry when the system demanded a new password, he must sacrifice either the concentration he had on his critical task or the recollection of his new password. Or, he can violate a rule and write the password down on a piece of paper (see Note 3).

Passwords were originally words because it’s much easier for people to remember words than arbitrary strings of characters. Tim might not remember the password “rgbmrhuea,” but he can easily remember the same letters when they spell out “hamburger.” Tim more easily remembers a word as his password because it represents a single item in his memory. If Tim chooses an equally long sequence of arbitrary characters to be his password, he must mentally transform that sequence into a single item for him to remember. This is hard for people to do reliably. While there are techniques for improving one’s memory, they are difficult to learn and require constant practice to retain. Strong passwords simply aren’t practical if they require specialized training to use correctly. Later in this chapter we examine a few simple and practical memory techniques for producing memorable passwords. The techniques do not necessarily provide the strongest possible secrets, but they are within the reach of most peoples’ abilities (see Note 4).


Dictionary Attacks and Password Strength

Note to purists: This section doesn’t really appear in Chapter 6 of Authentication. It was added to explain the notion of the average attack space and provide enough context to fully appreciate the weakness of passwords. The material in this section came from Chapters 2 and 3.

In general, strong authentication techniques require a person to prove ownership of a hard-to-guess secret to the target computer. Traditionally, a user would transmit the password during the login operation, and the computer would verify that the password matched its internal records. More sophisticated systems require a cryptographic transformation that the user can only perform successfully if in possession of the appropriate secret data. Traditional challenge response authentication systems use symmetrically shared secrets for this, while systems based on public key cryptography will use the transform to verify that the user possesses the appropriate private key. In all cases, successful authentication depends on the user’s possession of a particular piece of secret information. In this discussion, that secret information is called the base secret.

A simple way to compare different authentication techniques is to look at the number of trial-and-error attempts they impose on an attacker. For example, an attacker faced with a four-digit combination lock has 10 times as hard of a job as one faced with a three-digit lock. In order to compare how well these locks resist trial-and-error attacks and to compare their strength against the strength of others, we can estimate the number of guesses, on average, the attacker must make to find the base secret. We call this metric the average attack space.

Many experts like to perform such comparisons by computing the length of time required, on average, to guess the base secret’s value. The problem with such estimates is that they are perishable. As time goes on, computers get faster, guessing rates increase, and the time to guess a base secret will decrease. The average attack space leaves out the time factor, allowing a comparison of the underlying mechanisms instead of comparing the computing hardware used in attacks.

Each item counted in an average attack space represents a single operation with a finite, somewhat predictable duration, like hashing a single password or performing a single attempt to log on. When we look for significant safety margins, like factors of thousands, millions, or more, we can ignore the time difference between two fixed operations like that.

If all possible values of a base secret are equally likely to occur, then a trial-and-error attack must, on average, try half of those possible values. Thus, an average attack space reflects the need to search half of the possible base secrets, not all of them.

In practice, people’s password choices are often biased in some way. If so, the average attack space should reflect the set of passwords people are likely to choose from. In the case of a four-digit luggage lock, we might want to represent the number of choices that reflect days of the year, since people find it easy to remember significant personal dates, and dates are easily encoded in four digits. This reduces the number of four-digit combinations an attacker must try from 10,000 to 366.

When we try to measure the number of likely combinations, we should also take into account the likelihood that people chose one of those combinations to use on their luggage. The average attack space, then, doesn’t estimate how many guesses it might take to guess a particular password or other secret. Instead, it estimates the likelihood that we can guess some base secret, if we pick it randomly from the user community.

Biases in password selection are the basis of dictionary attacks, and practical estimates of password strength must take dictionary attacks into account. In the classic dictionary attack, the attacker has intercepted some information that was derived cryptographically from the victim’s password. This may be a hashed version of the password that was stored in the host computer’s user database (i.e. /etc/passwd on classic Unix systems or the SAM database on Windows NT systems) or it may be a set of encrypted responses produced by a challenge response authentication protocol. The attacker reproduces the computation that should have produced the intercepted information, using successive words from the dictionary as candidates. If one of the candidates produces a matching result, the corresponding candidate matches the user’s password closely enough to be used to masquerade as that user. This whole process occurs off-line with respect to the user and computing system being targeted, so the potential victims can’t easily detect that the attack is taking place. Moreover, the speed of the search is limited primarily by the computing power being used and the size of the dictionary. In some cases, an attacker can precompile a dictionary of hashed passwords and use this dictionary to search user databases for passwords; while this approach is much more efficient, it can’t be applied in every situation.

We can compute an estimate of password strength by looking at the practical properties of off-line dictionary attacks. In particular, we look at dictionary sizes and at statistics regarding the success rates of dictionary attacks. In this case, the success rate would reflect the number of passwords subjected to the dictionary attack and the number that were actually cracked that way. The 1988 Internet Worm provides us with an early, well-documented password cracking incident.

The Internet Worm tried to crack passwords by working through a whole series of word lists. First, it built a customized dictionary of words containing the user name, the person’s name (both taken from the Unix password file), and five permutations of them. If those failed, it used an internal dictionary of 432 common, Internet-oriented jargon words. If those failed, it used the Unix on-line dictionary of 24,474 words. The worm also checked for the “null” password. Some sites reported as many as 50% of their passwords were successfully cracked using this strategy (see Note 5).

Adding these all up, the worm searched a password space of 24,914 passwords. To compute the average attack space, we use the password space as the divisor, and we use the likelihood of finding a password from the space as the dividend. We use the constant value two to reflect the goal of searching until we find a password with a 50-50 chance, and we scale that by the 50% likelihood that the password being attacked does in fact appear in the dictionary. This yields the following computation:

24,914 / (2 x 0.5) = 24,914, or 215 average attack space

Since the most significant off-line trial-and-error attacks today are directed against cryptographic systems, and such systems measure sizes in terms of powers of two (or bits), we will represent average attack spaces as powers of two. When assessing average attack spaces, keep in mind that today’s computing technology can easily perform an off-line trial-and-error attack involving 240 attempts. The successful attack on the Data Encryption Standard (DES) by Deep Crack (see Note 6) involved 254 attempts, on average, to attack its 56-bit key (we lose one bit when we take the property of complementation into account).

We can also use the average attack space to compute how long a successful attack might take, on average. If we know the guess rate (guesses per second) we simply divide the average attack space by the guess rate to find the average attack time. For example, if a Pentium P100 is able to perform 65,000 guesses per second, then the P100 can perform the Internet Worm’s dictionary attack in a half-second, on average.

The Worm’s 50% likelihood figure plays an important role in computing the average attack space: while users are not forced to choose passwords from dictionaries, they are statistically likely to do so. However, the 50% estimate is based solely on anecdotal evidence from the Internet Worm incident. We can develop a more convincing statistic by looking at other measurements of successful password cracking. The first truly comprehensive study of this was performed in 1990 by Daniel V. Klein (see Note 7).

To perform his study, Klein collected encrypted password files from numerous Unix systems, courtesy of friends and colleagues in the United States and the United Kingdom. This collection yielded approximately 15,000 different user account entries, each with its own password. Klein then constructed a set of password dictionaries and a set of mechanisms to systematically permute the dictionary into likely variations. To test his tool, Klein started by looking for “Joe accounts,” that is, accounts in which the user name was used as its password, and quickly cracked 368 passwords (2.7% of the collection).

Klein’s word selection strategies produced a basic dictionary of over 60,000 items. The list included names of people, places, fictional references, mythical references, specialized terms, biblical terms, words from Shakespeare, Yiddish, mnemonics, and so on. After applying strategies to permute the words in typical ways (capitalization, obvious substitutions, and transpositions) he produced a password space containing over 3.3 million possibilities (see Note 8). After systematically searching this space, Klein managed to crack 24.2% of all passwords in the collection of accounts. This yields the following average attack space:

3,300,000 / (2 x .242) = 223 average attack space

Klein’s results suggest that the reported Internet Worm experience underestimates the average attack space of Unix passwords by about 28. Still, a 223 attack space is not a serious impediment to a reasonably well-equipped attacker, especially when attacking an encrypted password file. The guess rate of a Pentium P100 can search that average attack space in less than two minutes.

The likelihood statistic tells us an important story because it shows how often people pick easy-to-crack passwords. Table 2 summarizes the results of several instances in which someone subjected a collection of passwords to a dictionary attack or other systematic search. Spafford’s study at Purdue took place from 1991 to 1992, and produced a variety statistics regarding people’s password choices. Of particular interest here, the study tested the passwords against a few dictionaries and simple word lists, and found 20% of the passwords in those lists. Spafford also detected “Joe accounts” 3.9% of the time, a higher rate than Klein found (see Note 9).

Table 2: Percentage of Passwords Found by Systematic Searches
 Report  When  Passwords Searched  Percentage Found
 Internet Worm (note 5)  1988  thousands  ~50%
 Study by Klein (note 7)  1990  15,000  24.2%
 Study by Spafford (note9)  1992  13,787  20%
 CERT Incident IN-98-03 (note10)  1998  186,126  25.6%
 Study by Yan et al. (note11)  2000  195  35%

The CERT statistic shown in Table 2 is based on a password cracking incident uncovered at an Internet site in 1998. The cracker had collected 186,126 user records, and had successfully guessed 47,642 of the passwords (see Note 10).

In 2000, a team of researchers at Cambridge University performed password usage experiments designed in accordance with the experimental standards of applied psychology. While the focus of the experiment was on techniques to strengthen passwords, it also examined 195 hashed passwords chosen by students in the experiment’s control group and in the general user population: 35% of their passwords were cracked (see Note 11). Although the statistics from the Internet Worm may be based on a lot of conjecture, the other statistics show that crackable passwords are indeed prevalent. If anything, the prevalence of weak passwords is increasing as more and more people use computers.

The average attack space lets us estimate the strength of a password system as affected by the threat of dictionary attacks and by people’s measured behavior at choosing passwords. As shown in Table 3, we can also use the average attack space compare password strength against other mechanisms such public keys. In fact, we can compute average attack spaces for any trial-and-error attack, although the specific attacks shown here are divided into two types: off-line and interactive.

Off-line attacks involve trial-and-error by a computation, as seen in the dictionary attacks. Interactive attacks involve direct trial-and-error with the device that will recognize a correct guess. Properly designed systems can defeat interactive attacks, or at least limit their effectiveness, by responding slowly to incorrect guesses, by sounding an alarm when numerous incorrect guesses are made, and by “locking out” the target of the attack if too many incorrect guesses are made.

Table 3: Average Attack Spaces
 Example  Style of Attack  Average Attack Space
 Trial-and-error attack on 1024-bit public keys  Off-line  286
 Trial-and-error attack on 56-bit DES encryption keys Off-line  254
 Dictionary attack on eight-character Unix passwords Off-line  223
 Trial-and-error attack on four-digit PINs  Interactive 213

For an example of an interactive attack, recall the four-digit luggage lock. Its average attack space was reduced when we considered the possibility that people choose combinations that are dates instead of choosing purely random combinations. Even though a trial-and-error attack on such a lock is obviously feasible, it obviously reflects a different type of vulnerability than that of a password attacked with off-line cryptographic computations. The principal benefit of considering the different average attack spaces together is that they all provide insight into the likelihood with which an individual attack might succeed.

Forcing Functions and Mouse Pads

If strong security depends on strong passwords, then one strategy to achieve good security is to implement mechanisms that enforce the use of strong passwords. The mechanisms either generate appropriate passwords automatically or they critique the passwords selected by users. For example, NIST published a standard for automatic password generators. Mechanisms to enforce restrictions on the size and composition of passwords are very common in state-of-the-art operating systems, including Microsoft Windows NT and 2000 as well as major versions of Unix. While these approaches can have some value, they also have limitations. In terms of the user interface, the mechanisms generally work as forcing functions that try to control user password choices (see Note 12).

Unfortunately, forcing functions do not necessarily solve the problem that motivated their implementation. The book Why Things Bite Back, by Edward Tenner, examines unintended consequences of various technological mechanisms. In particular, the book identifies several different patterns by which technology takes revenge on humanity when applied to a difficult problem. A common pattern, for example, is for the technological fix to simply “rearrange” things so that the original problem remains but in a different guise (see Note 13).

Forcing functions are prone to rearrangements. In the case of strong password enforcement, we set up intractable forces for collision. We can implement software that requires complicated, hard-to-remember passwords, but we can’t change individuals’ memorization skills. When people require computers to get work done, they will rearrange the problem themselves to reconcile the limits of their memory with the mandates of the password selection mechanism.

Coincidentally, mouse pads are shaped like miniature doormats. Just as some people hide house keys under doormats, some hide passwords under mouse pads (Figure 2). The author occasionally performs “mouse pad surveys” at companies using computer systems. The surveys look under mouse pads and superficially among other papers near workstations for written passwords. A significant number are found, at both high-tech and low-tech companies.

Figure 2: Passwords hidden under a mouse padThis handwritten list of passwords was found under a mouse pad in a professional office. Each time the system demanded a new password, the user chose a new one, wrote it on the slip of paper, and stuck it back underneath the mouse pad. As many as one out of three people do something like this.
 Password list hidden under a mouse padAuthentication © 2002, used by permission

People rarely include little notes with their passwords to explain why they chose to hide the password instead of memorize it. In some cases, several people might be sharing the password and the written copy is the simplest way to keep all users informed. Although many sites discourage such sharing, it often takes place, notably between senior managers and their administrative assistants. More often, people write down passwords because they have so much trouble remembering them. When asked about written passwords, poor memory is the typical excuse.

An interesting relationship noted in these surveys is that people hide written passwords near their workstations more often when the system requires users to periodically change them. In the author’s experience, the likelihood of finding written passwords near a workstation subjected to periodic password changes ranged from 16% to 39%, varying from site to site. At the same sites, however, the likelihood ranged from 4% to 9% for workstations connected to systems that did not enforce periodic password changes. In some cases, over a third of a system’s users rearranged the password problem to adapt to their inability to constantly memorize new passwords.

These surveys also suggest an obvious attack: the attacker can simply search around workstations in an office area for written passwords. This strategy appeared in the motion picture WarGames, in a scene in which a character found the password for the high school computer by looking in a desk. Interestingly, the password was clearly the latest entry in a list of words where the earlier entries were all crossed off. Most likely, the school was required to change its password periodically (for “security” reasons) and the users kept this list so they wouldn’t forget the latest password.

Using the statistics from mouse pad searches, we can estimate the average attack space for the corresponding attack. Table 4 compares the results with other average attack spaces. In the best case, the likelihood is 4%, or one in 25, so the attacker must, on average, search 12 or 13 desks to find a password. That yields an average attack space of 24. The worst case is 39%, which is less than one in three. Thus, the attacker must, on average, search one or two desks to find a written password.

Table 4: Mouse Pad Searches and Average Attack Spaces
 Example  Style of Attack  Average Attack Space
 Trial-and-error attack on 56-bit DES encryption keys Off-line  254
 Dictionary attack on eight-character Unix passwords Off-line  223
 Trial-and-error attack on four-digit PINs  Interactive 213
 Best-case result of a mouse pad search  Interactive  24
 Worst-case result of a mouse pad search  Interactive  21

The mouse pad problem shows that we can’t always increase the average attack space simply by making passwords more complicated. If we overwhelm people’s memories, we make certain attack risks worse, not better. The reason we want to discourage single-word passwords is that they’re vulnerable to off-line dictionary attacks. Table 4 shows that such attacks involve a 223 attack space. We don’t increase the average attack space if forgettable passwords move to the bottom of people’s mouse pads.

Reference Notes

If you are following the notes to see if they contain more technical details, don’t bother. The notes only provide sources for the information in the text. If you are interested in general in the sources, it’s best to postpone looking at the notes until you’ve read the entire paper. Then just read all of the notes.

1. See the DOD Password Management Guideline, produced by the NCSC (CSC-STD-002-85, Fort Meade, MD: National Computer Security Center, 12 April 1985).

2. See Chapter 2 of Designing the User Interface: Strategies for Effective Human­Computer Interaction by Ben Shneiderman (Reading, MA: Addison-Wesley, 1998). For a point of view more focused on usability and security, see the papers by Alma Whitten and J. D. Tygar: “Usability of Security: A Case Study” (CMU-CS-98-155, Pittsburgh, Pennsylvania: Carnegie Mellon University Computer Science Department, 18 December 1998), and “Why Johnny Can’t Encrypt,” (Proceedings of the 8th USENIX Security Symposium, USENIX Association, 1999).

3. See Chapter 10 of Shneiderman’s Designing the User Interface, and George Miller’s article “The Magical Number Seven­ Plus or Minus Two” (Psychological Science vol. 63, 1956).

4. In Chapter 3 of The Design of Everyday Things (New York: Doubleday Currency, 1988), Donald Norman talks about the problem of memory and the impracticality of techniques to improve memory.

5. The Internet Worm’s dictionary attack is described in Eugene Spafford’s article “Crisis and Aftermath” (Communications of the ACM vol. 32, no. 6, June 1989).

6. The best description of attacks on DES is in Cracking DES: Secrets of Encryption Research, Wiretap Politics, and Chip Design, by the Electronic Frontier Foundation (Sebastopol, CA: O’Reilly & Associates, 1998).

7. Daniel Klein’s paper “A Survey of, and Improvements to, Password Security” (Unix Security Workshop II, USENIX Association, 1990) describes his experiments.

8. The size given for Klein’s password space is an estimate based on the description in his paper. Klein does not report the actual password space size.

9. Spafford’s password study was described in his report “Observing Reusable Password Choices” (3rd USENIX Security Symposium, USENIX Association, 1992).

10. The 1998 incident was reported as CERT Incident IN-98-03, “Password Cracking Activity.”

11. The study by Yan, Blackwell, Anderson, and Grant is titled “The Memorability and Security of Passwords-Some Empirical Results” (research paper, Cambridge University Computer Laboratory, 2001).

12. Forcing functions are discussed in Chapter 5 of Norman’s Design of Everyday Things.

13. Edward Tenner was inspired to write Why Things Bite Back: Technology and the Revenge of Unintended Consequences (New York: Alfred A. Knopf, 1996) after noticing how much more paper gets used in a modern “paperless” office. Tenner summarized his taxonomy of revenge effects in Chapter 1.

14. The CSI Journal actually published the article twice. The first time, in Spring 2002, the printer eliminated all exponents, so that 2128 became 2128. The Summer 2002 version contains the correct text.

Follow

Get every new post delivered to your Inbox.

Join 46 other followers