How to Fend off Shoulder Surfing

Volker Roth (a), Kai Richter (b)

(a) OGM Laboratory LLC, 6825 Pine St, Omaha, NE 68106, USA
(b) Zentrum für Graphische Datenverarbeitung e.V., Fraunhoferstr. 5, 64283 Darmstadt, Germany

Email address: volker.roth@acm.org (Volker Roth).

Preprint submitted to Elsevier Science, May 21, 2006

Abstract

Magnetic stripe cards are in common use for electronic payments and cash withdrawal. Reported incidents document that criminals easily pickpocket cards or skim them by swiping them through additional card readers. Personal identification numbers (PINs) are obtained by shoulder surfing, through the use of mirrors or concealed miniature cameras. Together, the PIN and the card are generally sufficient to give the criminal full access to the victim’s account. In this paper, we present alternative PIN entry methods which we refer to as cognitive trapdoor games. These methods make it significantly harder for a criminal to obtain PINs even if he fully observes the entire input and output of a PIN entry procedure. We also introduce the idea of probabilistic cognitive trapdoor games, which offer resilience to shoulder surfing even if the criminal records a PIN entry procedure with a camera. We studied the security as well as the usability of our methods. The results support the hypothesis that our primary mechanism strikes a balance between security and usability that is of practical value. In this article, we give a detailed account of our mechanisms and their evaluation.

Key words: Security engineering, usability and security, secure PIN entry, human computer interaction, shoulder surfing

1 Introduction

Personal Identification Numbers (“PIN”) are used as a means of authenticating oneself when withdrawing money from Automatic Teller Machines (“ATM”), authorizing Point of Sale (POS) transactions, unlocking cell phones and Personal Digital Assistants (“PDA”), gaining access to secure areas, or disarming anti-burglar alarms, to name a few examples. Typically, a user proves himself to a machine by entering a four-digit PIN using a PIN pad with three by four keys, and an automatic process verifies whether the entered PIN is correct. However, anyone who has the PIN pad in his field of view may observe the PIN that a prover enters and use that information to impersonate the legitimate prover. This particular attack is widely known as shoulder surfing. [1]

As an added security mechanism against such involuntary PIN disclosure, many authentication systems require not only something that the legitimate prover knows but also something that he has, such as a magnetic stripe card with certain information stored on it. However, fraudsters steal or skim valid cards with increasing sophistication (Weinstock, 1987; Brader, 1998; Wood, 2003; Summers and Toyne, 2003; Colville, 2003), causing significant damage to customers and the banking industry. The means by which fraudsters obtain the corresponding PINs have likewise grown in sophistication. In several recent cases, miniature camera devices were concealed at ATMs and radioed video images of PIN entry sequences to nearby receivers (Wood, 2003; Summers and Toyne, 2003; Colville, 2003).

We investigated whether the method by which PINs are entered can be designed in a way that is resilient to human shoulder surfers even if all input and output is in plain sight, perhaps even if all input and output is recorded by a concealed camera. At the same time, the method should be efficient and easily usable. Our contribution is a novel design which consciously leverages the fact that certain cognitive capabilities of humans are very limited, particularly humans’ ability to store and retain information in their short term memory (Miller, 1956; Anderson, 2000; Vogel and Machizawa, 2004).
We refer to our principal design as an interactive cognitive trapdoor game between a verifier and a prover where all input and output is in plain sight of an observer, and authenticating oneself amounts to winning the game. The game is designed so that winning it is well within the bounds of humans’ cognitive capacity if the correct PIN is known. If, however, the PIN is not known then winning the game requires cognitive capacity beyond what is typically found in humans.

A simple example may serve as an illustration of the general idea. Assume that the prover wishes to enter a single digit using a PIN pad with the typical fixed layout of keys and digits. Further assume that the verifier has the ability to set the background color of each individual key to either black or white. The verifier randomly partitions the set {0, 1, ..., 9} of possible PIN digits into two equally sized sets A and B. The digits in set A are displayed on white background and the digits in set B are displayed on black background. If the prover’s digit is in set A then he or she enters white, and black otherwise. After playing the game for a few rounds, the verifier can uniquely determine the digit by intersecting the sets indicated by the prover. The observer, on the other hand, does not know the digit, and in order to calculate the set intersection he or she has to quickly memorize or record at least one set, its color, and the prover’s response in each round. The game is repeated until all digits are entered.

[1] See e.g., “Word Watch column: Shoulder Surfing.” The Atlantic Monthly. February, 1992.

In the remainder of the article, we elaborate on the design and its security. We describe multiple variations of it, some of which are especially suitable for people with certain handicaps such as blindness. We also present and discuss the results of several user studies we conducted with the goal of assessing the security and the usability of our most prominent variants.
The outcome of the studies supports the hypothesis that our primary method offers resilience against shoulder surfing while still being reasonably usable, and thus has considerable practical value where shoulder surfing is a concern. Certain modifications of our design that we describe (the introduction of ambiguity into provers’ answers) provide limited resilience even against a single recording by a concealed camera. However, that has been noted before by Baker (1995) and thus does not constitute a novel result.

2 Background

In this section we summarize background material that we assume as known in the remainder of our article, namely: a description of our threat model, the psychological foundation of our mechanism design, as well as mathematical tools that we applied in our usability evaluation. Readers who are familiar with these fields may safely skip the corresponding sections and continue reading the description of our mechanism design in §3.

2.1 Threat Model and Terminology

We model entry of a PIN as an interactive game between three parties: a machine verifier, a human prover, and a human observer. Each game consists of a number of rounds in which the verifier, in an abstract sense, poses questions and the prover inputs his or her answer. The objective of the prover is to authenticate himself to the verifier by his PIN; the objective of the verifier is to decide whether the prover knows the correct PIN. The observer observes all interactions between the prover and the verifier; his objective is to impersonate the prover in subsequent games with the same verifier.

The game also involves a setup phase with a trusted dealer who, using a confidential and authentic channel, distributes a token to the prover and a master secret to the verifier (alternatively, the verifier may query the dealer online during each game over an authentic and confidential channel).
The token, typically a magnetic stripe (Count Zero, 1992) or chip card, contains information that uniquely identifies the prover. It also contains information by which the verifier can verify whether the prover’s input is correct and matches his identity. We assume that the observer cannot verify the correctness of a given PIN unless he also knows the master secret (and we assume he does not). Additionally, we expect that the verifier keeps a record of how often a prover consecutively enters a wrong PIN. If the count reaches three, the verifier voids the prover’s authorization until the prover receives a new PIN from the trusted dealer. Let the observer possess (a perfect copy of) the prover’s token (obtained by theft or skimming). Most importantly, we assume that the computational and memory resources of the observer are bounded by the cognitive capacity of a human, particularly the short term memory (“STM”).

Compared to an actual implementation, we make idealized and simplified assumptions. For instance, we assume that the PINs are generated in a uniformly distributed fashion. In practice, the PIN distribution of, e.g., the Eurocheque Card system was shown to be skewed considerably (Möller, 1997; Kuhn, 1997), to a large degree because recommendations in the applicable standards (ISO, 2002) were not fully adhered to. Otherwise, our model corresponds closely to what is typically found in the ATM world.

2.2 Background on Cognitive Psychology

The cognitive capabilities of a human have interesting limitations. Recently, Vogel and Machizawa (2004) discovered a relationship between neural activity and memory capacity and found neurophysiological evidence that the human visual short term memory (“STM”) is limited to three to four symbols. In their measurements, they used a delay of one second between exposure and recall. Few subjects they tested had the capacity to hold five symbols in their STM.
This is even less than the findings of Miller (1956), who suggested that the capacity of the STM is limited to 7 ± 2 symbols. However, retention of items in the STM, and transfer to long term memory (“LTM”), appears to be critically dependent on the ability to rehearse the information in the STM (Peterson and Peterson, 1959). A constant stream of new information in short succession, as in the case of our mechanism design, is known to impede rehearsal and thus later recall.

An effect that increases the capacity of the STM is referred to as chunking (Murdock, 1961). Multiple items are grouped together and represented as a single item that occupies one “slot” in the STM. For instance, an American can probably remember the sequence 149217761941 easily because it can be grouped into three chunks of four digits. Each of the chunks represents a year in which a historic event occurred that is significant to Americans, and which can be represented as a single item. It is not entirely clear to us at this point whether chunking may have an impact on our mechanisms. Since all 30240 black and white five digit numeric patterns are equally likely, it appears that the opportunities for frequent chunking are marginal.

In summary, the limitations of humans’ STM are a promising starting point for devising cognitive trapdoor games, although there may be alternative approaches.

2.3 Background on Usability Testing

The concern of usability testing is to determine the probability that a change in user performance is caused by a particular change of condition (e.g., the improvement of a user interface) as opposed to random variance. Generally, usability testing is based on empirical observations that are analyzed by means of statistical methods (see e.g., Sachs (2002); Box et al. (1978) for an overview). A basic tool is the Mann-Whitney U-test (Mann and Whitney, 1947), which tests whether two independent samples are from the same population by comparing their relative ranks.
The individual observations must be comparable in the sense that one can determine which observation is “greater.” For samples from the same population one would expect that their ranks (i.e., the overlap between samples) are determined by random chance. The Mann-Whitney U-test tests the null hypothesis that the two samples are drawn from the same population. A table of its critical values for different significance levels is given by Milton (1964). The test is suitable to analyze ordinally scaled data such as data collected by Likert (1932) scales. Likert scales require subjects to rate a given statement based on an odd number of ordered alternatives (e.g., agreement or disagreement with the statement on a five to seven point rating scale). Likert scales are an ingredient of the System Usability Scale (“SUS”) described by Brooke (1996).

While the U-test is restricted to comparisons between two samples, difference hypotheses for more than two independent samples can be tested with the Kruskal-Wallis H-test (Kruskal and Wallis, 1952), which is an extension of the method of Mann and Whitney. If ordinally scaled data is collected in repeated measurements such as “before” and “after” comparisons then the test of Wilcoxon (1945) for matched pairs must be used. Similar to other rank sum tests, Wilcoxon’s test is reasonably robust even if the data is not equally or normally distributed.

The Student’s t-test and the analysis of variance (“ANOVA”) are methods which analyze the summed squared deviation from the mean of a normal distribution. Put simply, the Student’s t-test answers the question whether differences in the means of two samples are due to chance. ANOVA serves as a basis for several well-known statistical methods such as regression analysis and multivariate methods. Which test is applied depends on the design of the study and the type of data that is acquired during the study.
Typically, a combination of tools is used to analyze measurements for statistically significant properties.

3 Cognitive Trapdoor Games

The general principle we apply is to consecutively display the set of PIN digits to the prover as two partitions. The prover indicates the partition in which the current PIN digit is. After a few rounds, the verifier determines the correct PIN digit by intersecting the indicated partitions. The algorithm may be repeated for as many digits as the prover wishes to enter. The input and output methods determine the difficulty of the cognitive task that must be accomplished by the prover and the observer. In §3.1, we describe two designs of such a task: the immediate response design and the delayed response design. Our hypothesis is that in both designs the task is of limited cognitive complexity if the PIN is known, and of significant cognitive complexity otherwise. Hence the term cognitive trapdoor game. We discuss and compare the properties of our designs in §3.2.

Both designs achieve significantly better resilience against shoulder surfers without automatic recording devices than contemporary PIN entry methods (see §5.1 for experimental evidence). In §3.4, we describe a modification which additionally provides limited resilience even if shoulder surfers record all inputs and outputs with a camera, and we analyze its security. The designs we present in §3.1 are based on visual perception and tactile input. However, the principles of cognitive trapdoor games easily extend to other input and output modalities. In §3.5, we describe alternative designs which are particularly suited for handicapped people.

3.1 PIN Entry Using Key Pads

The immediate response design and the delayed response design for visual output and haptic input that we present in this section are conceptually similar.
Both designs require a display on which a key pad can be displayed, or a key pad for which some perceptible aspect of each key is controllable by the verifier, e.g., the color of the keys can be changed from black to white and vice versa. Virtually all ATMs provide a display suitable for our purpose. Additionally, the designs require two input keys, one of which denotes black and the other white. The pound (’#’) and asterisk (’∗’) keys typically found at the lower left and right edges of an ATM’s key pad are suitable. In principle, only a software upgrade would be necessary to implement our designs on such devices. Additionally, our designs can coexist with the contemporary method of entering one’s PIN by typing its digits into the key pad.

In the immediate response design, each PIN is entered as follows:

(1) The verifier produces the display of a key pad with the familiar fixed layout of keys where half of the keys are displayed with white digits on black background and the other half with black digits on white background. The distribution of black and white colors must have certain properties; we present an algorithm to compute suitable distributions in §3.3.

(2) The verifier prompts the prover for input. The prover responds by pressing the key denoting white (e.g., the pound key) if his PIN digit is shown on white background, and presses the key that denotes black (e.g., the asterisk key) otherwise. Assume that S is the set of five digits with the same color as the one that the prover selected.

(3) The verifier repeats steps 1 and 2 four times.

(4) The verifier intersects the sets $S_1, \dots, S_4$ selected by the prover. The set intersection contains the candidates for the PIN digit that the prover entered. Assume that D is the set intersection.

(5) The verifier repeats steps 1 to 4 four times, one time for each of the four digits $D_1, \dots, D_4$ that constitute the prover’s PIN.

Figure 1. (Four key pad panels with alternating color patterns, each annotated with the prover’s color input, followed by “next digit or clear”.) This figure illustrates the immediate response design. Assume that the prover wishes to enter digit ’3’. The verifier begins by presenting the leftmost color pattern. Digit ’3’ is displayed on white background, therefore the prover enters white. The verifier changes the color patterns; this time digit ’3’ is displayed on black background. Hence, the prover enters black. The procedure continues for two more rounds after which the verifier clears the display and calculates digit ’3’ by intersecting the white digits in the first color pattern with the black digits in the second pattern and so forth. The algorithm for calculating the color patterns is given in §3.3.

Overall, 16 input/output rounds have to be completed, four rounds per digit and four repetitions for the four digits of the prover’s PIN. If any of the set intersections contains either no digit or more than one digit then an error occurred during input. In that case, the verifier notifies the prover of the error, increases the overall count of false attempts for the alleged prover, and offers to repeat the entire procedure unless three false attempts were counted. Otherwise, the verifier verifies that the digits $D_1, \dots, D_4$ constitute the correct PIN. Figure 1 illustrates steps 1 to 3.

Steps 1 and 2 must be repeated four times because four is the smallest number of repetitions which guarantees that the set intersection always yields a unique solution to finding one digit out of ten possible digits. More generally, if the verifier must identify any one of N digits or characters then the prover must respond $\lceil \log_2 N \rceil$ times with a binary answer. The principal observation here is that with each binary decision, the set of candidates can be halved.
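The verifier's side of steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the four `WHITE` patterns are hypothetical hand-picked examples (each colors five digits white and five black), and the names `prover_responses`, `verifier_decode`, and `rounds_needed` are assumptions introduced here.

```python
import math

ALL_DIGITS = set(range(10))

# Hypothetical example patterns: the digits shown on white background in
# rounds 1 to 4. Each round has five white and five black digits.
WHITE = [
    {1, 3, 5, 7, 9},
    {2, 3, 5, 6, 9},
    {4, 5, 6, 7, 8},
    {5, 6, 7, 8, 9},
]

def prover_responses(digit, white_sets):
    """Return the prover's answers: 'w' if the digit is white, else 'b'."""
    return ["w" if digit in white else "b" for white in white_sets]

def verifier_decode(white_sets, responses, alphabet=ALL_DIGITS):
    """Intersect the indicated sets; return the digit, or None on error."""
    candidates = set(alphabet)
    for white, answer in zip(white_sets, responses):
        selected = white if answer == "w" else alphabet - white
        candidates &= selected
    if len(candidates) != 1:
        return None  # empty or ambiguous intersection: input error
    return next(iter(candidates))

def rounds_needed(n_symbols):
    """Smallest number of binary answers that can single out one symbol."""
    return math.ceil(math.log2(n_symbols))
```

With these example patterns, `prover_responses(3, WHITE)` yields white, white, black, black, and the intersection shrinks from ten candidates to {3, 5, 9}, then {3, 9}, then {3}. A response sequence that matches no digit produces an empty intersection, which corresponds to the error case described above.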
The immediate response design owes its name to the fact that subsequent to each output of the verifier (step 1), the prover has to input his response (step 2). In the delayed response design, the verifier repeats step 1, the output, four times with a delay of 0.5 seconds between consecutive outputs. Subsequent to the fourth output, which is shown for 0.5 seconds as well, the verifier clears the display and prompts the prover to enter the colors that his or her PIN digit had in the previous four outputs. The prover then enters the colors consecutively. Hence, the prover’s responses are delayed until after all output that is required to determine a single PIN digit has completed, which gives the delayed response design its name.

Figure 2. (Four key pad panels, each shown for 0.5 seconds, followed by a cleared display on which the prover enters w, b, w, b.) This figure illustrates the delayed response design. The input is the same as in Figure 1. This time, however, the verifier changes the color patterns every 0.5 seconds, rather than after each response of the prover. Only after all four patterns have been displayed is the prover prompted to enter the color sequence of the digit he or she wishes to enter. The algorithm for calculating the color patterns is given in §3.3.

The design rationale for the delayed response design is to limit the time for which the output is exposed to the prover and also to any observer. For comparison, in the immediate response design the output is displayed until the prover inputs his response. Hence, a slow prover is more vulnerable to shoulder surfing than a fast prover. Figure 2 illustrates four input rounds (entering one digit) of the delayed response design.
Assuming that $O_y$ denotes the verifier’s output in round y, and $I_y^x$ denotes the prover’s input of the color that digit x had in round y, and further assuming that the prover’s PIN is 1234, the input/output sequences of the two designs can be summarized as given below:

\[
\begin{array}{lll}
 & \text{first digit} & \text{second digit} \\
\text{immediate response:} & O_1, I_1^1, O_2, I_2^1, O_3, I_3^1, O_4, I_4^1 & O_5, I_5^2, O_6, I_6^2, O_7, I_7^2, O_8, I_8^2 \;\cdots \\
\text{delayed response:} & O_1, O_2, O_3, O_4, I_1^1, I_2^1, I_3^1, I_4^1 & O_5, O_6, O_7, O_8, I_5^2, I_6^2, I_7^2, I_8^2 \;\cdots
\end{array}
\]

Obviously, more options exist to arrange inputs and outputs. The prover may respond immediately to each output of the verifier as in the immediate response design, but the verifier may run the first round for all PIN digits 1 to 4, followed by the second round for all digits and so forth until all four rounds are completed for all four PIN digits. We refer to that design as interleaved response; its round structure can be illustrated in the same fashion as above:

\[
\begin{array}{lll}
 & \text{first round, all digits} & \text{second round, all digits} \\
\text{interleaved response:} & O_1, I_1^1, O_2, I_2^2, O_3, I_3^3, O_4, I_4^4 & O_5, I_5^1, O_6, I_6^2, O_7, I_7^3, O_8, I_8^4 \;\cdots
\end{array}
\]

We analyze and discuss the properties, the psychological rationale, as well as the advantages and disadvantages of all these designs in §3.2 below. Furthermore, we were interested in how our designs would perform in practice and implemented several versions of them for the purpose of conducting security and usability studies. The results of these studies are described in §5.1 and §5.2.

3.2 Comparison and Analysis

For all designs of cognitive trapdoor games we have presented above, it holds that if the observer can perfectly record or memorize all input and output then he or she will be able to deduce the prover’s PIN in the same fashion as the verifier does. We describe modifications of our designs that provide limited resilience against automatic recordings in §3.4.
In this section, we assume that the observer has no automatic recording devices such as concealed cameras, although he may use manual tools such as pencil and paper. This means that the observer’s resources are constrained by humans’ cognitive capabilities as we summarized them in §2.2.

In the immediate response design, the prover must retrieve his or her PIN from LTM and must decide which color his or her current PIN digit has before responding. In the delayed response design, the prover must remember a sequence of four colors in his or her STM for a few seconds. In both cases, the prover can focus his or her gaze on the fixed position of the current PIN digit on the PIN pad, which eliminates the need to maintain awareness of the digit itself. The immediate response design is well within the cognitive capacity of a healthy human; the delayed response design is within practical bounds.

In the immediate response design, the observer must memorize at least six symbols in each round: five symbols of equal color (the new information presented in each round) and the response of the prover. Alternatively, the observer must perceive and manually record that information at the same speed at which it is presented. If, on the other hand, the observer does not memorize or record information but attempts to derive the PIN digits directly then he or she must additionally remember his or her current hypothesis of what the set of probable PIN digits is, and must mentally intersect the hypothesis with the set indicated by the prover.

In the delayed response design, the observer has no means to prune the set of possible PIN digits before the prover inputs his or her answers. This amounts to memorizing information worth at least 24 symbols, which exceeds the capacity of humans’ STM (Vogel and Machizawa, 2004; Miller, 1956) by a safe margin.
Also note that the estimate above is calculated for one PIN digit; the observer has to accomplish his attack four times in rapid succession, once for each PIN digit. Additionally, the six symbols that must be memorized or otherwise processed per round are not available for rehearsal, the process by which information in the STM is encoded in LTM (Peterson and Peterson, 1959). The symbols are rapidly replaced by new information that must be memorized or processed as well. A continuous stream of new information that must be processed, as generated by our designs, is known to impede rehearsal (Anderson, 2000). Therefore, we are reasonably confident that the greater resources of the LTM cannot be brought to bear on the observer’s task, particularly not in the delayed response design where the exposure time for each round is limited to 0.5 seconds.

In the interleaved response design, the observer would have to memorize information generated in five rounds before any pruning can take place, which amounts to memorizing 30 symbols. Before the first PIN digit can be derived unambiguously, information from 13 rounds or 91 symbols must be memorized. On the other hand, the prover cannot focus his gaze on one PIN digit for four consecutive rounds, but has to cycle through his or her PIN digits four times.

In order to verify our hypothesis that the asymmetry of the cognitive overhead of the prover’s and the observer’s task fulfills our requirements for a cognitive trapdoor game, we conducted two studies. First, we measured subjects’ ability to record and guess PIN digits in recorded PIN entry procedures. Second, we studied subjects’ ability to enter PINs using our designs. Additionally, we measured how well subjects accepted our designs. The results of our studies are presented in §5.1 and §5.2 respectively.
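The round counts used in this comparison can be reproduced by enumerating the round structures of §3.1. The sketch below is only an illustration under our reading of the text: the event encoding and the function names `round_structure` and `outputs_seen` are assumptions, and the code counts how many verifier outputs an observer has been exposed to by the time the prover gives the k-th response for a given PIN digit.

```python
def round_structure(design, n_digits=4, n_rounds=4):
    """List the events ("O", y) and ("I", y, x) of one PIN entry.

    y is the running output index and x the PIN digit position.
    """
    events = []
    if design == "interleaved":
        y = 0
        for _ in range(n_rounds):
            for x in range(1, n_digits + 1):
                y += 1
                events += [("O", y), ("I", y, x)]
        return events
    for x in range(1, n_digits + 1):
        ys = [(x - 1) * n_rounds + r for r in range(1, n_rounds + 1)]
        if design == "immediate":
            for y in ys:
                events += [("O", y), ("I", y, x)]
        elif design == "delayed":
            events += [("O", y) for y in ys]      # all outputs first
            events += [("I", y, x) for y in ys]   # then all responses
        else:
            raise ValueError(f"unknown design: {design}")
    return events

def outputs_seen(events, digit, k):
    """Number of outputs displayed up to the k-th response for `digit`."""
    outputs = inputs = 0
    for event in events:
        if event[0] == "O":
            outputs += 1
        elif event[2] == digit:
            inputs += 1
            if inputs == k:
                return outputs
    raise ValueError("fewer than k responses for this digit")
```

For the interleaved design, the second response for the first digit arrives only after five outputs, and the fourth after 13 outputs, which matches the five and 13 rounds quoted above. In the delayed design, four outputs pass before the first response, so no pruning is possible while the patterns are displayed.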
3.3 Randomizing the Color Patterns

For our PIN entry methods to be secure, the color patterns must be random or at least pseudo random in a fashion that cannot be predicted by observers. Additionally, in each round the number of white digits should be equal to the number of black digits. This can be justified as follows: let $p_0$ be the probability that the digit is white and let $p_1 = 1 - p_0$ be the probability that it is black. The average amount of information that can be gathered in each round (in other words, the observer’s uncertainty about the entered PIN digit) equals the entropy $H(p_0, p_1) = -\sum_i p_i \log_2 p_i$ per round. It is well known that the entropy is maximal for an equal distribution (Shannon, 1948), which is the case if the number of colors is two and both colors are assigned the same number of digits.

The question then is how the display of color patterns can be computed so that the aforementioned two criteria are met. The answer that is perhaps the simplest to give and prove correct is based on symmetrical balanced trees as shown in Figure 3. The height of a balanced tree (the maximum number of edges from its root to a leaf) is $\lceil \log_2 n \rceil$ for a tree with n leaf nodes. Each leaf node is randomly associated with a unique digit. For instance, a tree that represents the digits {0, 1, ..., 9} has height $\lceil \log_2 10 \rceil = 4$. All nodes, internal as well as leaf nodes, are assigned color labels such that a left sibling is black and a right sibling is white. Our algorithm for coloring digits can now be formulated simply as follows: in round r, each digit is assigned the color of its parent node at level r of the tree. Figure 3 illustrates the algorithm for rounds one and two. The colors of digits which do not have a leaf node at the last level are chosen randomly such that the number of black and white digits is equal. The symmetry of the tree ensures that the number of white and black digits is equal even if the number of digits is not a power of two.
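The two criteria (balanced colors in every round, and a unique color sequence per digit) can be illustrated with a short Python sketch. Note that this is not the tree algorithm described above but a simpler stand-in we assume for illustration: each digit receives a random binary codeword, and codewords are drawn in complementary pairs so that every round colors exactly half of the digits white. The names `entropy` and `color_patterns` are hypothetical.

```python
import math
import random

def entropy(p0):
    """Binary entropy H(p0, 1 - p0) in bits."""
    return -sum(p * math.log2(p) for p in (p0, 1 - p0) if p > 0)

def color_patterns(digits=range(10), rounds=4, rng=random):
    """Return one set of white digits per round.

    Assumption (not from the article): digits get random codewords in
    complementary pairs, so each bit position is set in exactly half
    of the codewords and all codewords are distinct.
    """
    digits = list(digits)
    half = len(digits) // 2
    if len(digits) % 2 or half > 2 ** (rounds - 1):
        raise ValueError("need an even number of digits that fits the rounds")
    mask = (1 << rounds) - 1
    base = rng.sample(range(1 << (rounds - 1)), half)  # top bit is 0
    codes = base + [word ^ mask for word in base]      # complements
    rng.shuffle(codes)                                 # random assignment
    code_of = dict(zip(digits, codes))
    return [
        {d for d, code in code_of.items() if (code >> r) & 1}
        for r in range(rounds)
    ]
```

A balanced split gives `entropy(0.5) == 1.0` bit per round, the maximum for two colors, whereas an uneven split such as three white and seven black digits yields less. Because the ten codewords are distinct, four rounds always identify the digit uniquely.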
We omit the proof because it is simple. The basic idea of the proof is to show that if a subtree with root v does not have an equal number of black and white leaf nodes then its symmetrical node v' has as many black leaf nodes as v has white leaf nodes and vice versa. The security follows from the condition that leaf nodes are randomly associated with a digit. Hence, each digit can be represented by a unique path in the tree, yet the association between paths and digits (the sequence of colors that must be entered by the prover) is random.

Figure 3. (Two balanced trees whose leaves are labeled with the digits 0 to 9.) This figure illustrates how the colors of digits are determined in each round. The algorithm is based on a balanced tree which has a depth that is logarithmic in the number of digits, e.g., $\lceil \log_2 10 \rceil = 4$ for ten digits. As rounds progress from round one to round four, the digits (the leaves of the tree) inherit the color of their parent nodes at the corresponding tree level. The upper tree shows the color distribution at level one, and the lower tree shows the color distribution at level two. Nodes at the respective levels are circled for clarity.

A less abstract algorithm, which we present without proof, can be described as follows: each digit is randomly associated with a card in a deck of cards. In each round, the verifier randomly colors the digits represented by cards in the upper half of the deck black and otherwise white, or vice versa. After each round, the verifier performs a perfect riffle in-shuffle. [2] The resulting patterns also fulfill the criteria we required above.

3.4 Resilience Against Camera Recording

Criminals increasingly employ concealed miniature cameras to observe and record the PINs entered by victims (Wood, 2003; Summers and Toyne, 2003; Brader, 1998). The designs we have presented above are effective against human shoulder surfers, as we found in our evaluation (see §5.1 for empirical evidence).
However, if an observer records all input and output then he can compute the prover’s PIN in the same fashion the verifier computes it, namely by intersecting the sets of digits that the prover indicated.

[2] Eric W. Weisstein. “Riffle Shuffle.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/RiffleShuffle.html

Figure 4. (A two-level balanced tree whose four leaves share the digits 0 to 9.) This figure shows a tree with a reduced number of levels which yields a reduced number of rounds for entering PIN digits. The reduction results in some degree of uncertainty about the prover’s PIN even if the sequence of colors that the prover enters is perfectly recorded by an observer. In the given example, ten digits are identified by four possible color sequences, which means that each input sequence leaves 2.5 candidate digits on average.

It turns out, though, that a simple modification of our design can provide limited resilience even against automatic recordings of all input and output (Baker (1995) describes an earlier approach which benefits from the same effect). The key is to input less information than is necessary to uniquely identify the entered PIN. In other words, subsequent to the prover’s input the intersection of sets yields not only one candidate PIN but multiple ones which are all equally likely. We refer to the set of candidate PINs as the shadow set. The verifier can efficiently verify candidates in the shadow set and authorize transactions if one of the candidates is the correct PIN. The observer, on the other hand, lacks knowledge of the master secret that is part of the verification process and thus cannot do better than to try PINs from the shadow set at random. Given a limited number of allowed false entries, typically three, this yields a certain success probability which depends on the size of the shadow set. At the same time, the probability of success when guessing PINs blindly (without knowing a shadow set) increases.
Therefore, the size of the shadow set becomes a tradeoff between false acceptance rate, efficiency of verification, and the security of the design against camera recording. Before we analyze and quantify that tradeoff in detail below, we illustrate the modified design by an example. Consider the balanced tree in Figure 4. We reduced the depth of the tree from four to two levels. Consequently, each leaf node is associated with multiple digits. This reduces the number of rounds necessary to enter a PIN digit and, as a positive side effect, improves the overall usability of our design. However, only four different color input sequences are available to identify ten different digits. This means that on average 2.5 candidate digits are identified per two rounds of input, with a minimum of two and a maximum of three candidates depending on the digit that the prover enters. Assuming that the PIN consists of four digits, the size of the shadow set is therefore between $2^4 = 16$ and $3^4 = 81$ with an average of $2.5^4 \approx 39$. In more mathematical terms, given a shadow set size of s shadows the prover has to perform $t = \log_2(N/s) = \log_2(N) - \log_2(s)$ rounds of input, where N is the overall number of PIN numbers, which is 10,000 in a typical setting with four decimal digits per PIN. Assume that the observer steals the prover’s magnetic stripe card and attempts to authenticate himself or herself to the verifier by random input. Given n attempts, his or her probability D of success would be:

\[ D = \sum_{k=1}^{n} \frac{1}{2^t} \cdot \left(1 - \frac{1}{2^t}\right)^{k-1} \qquad (1) \]

Figure 5. The decision tree to compute the probability of successful impersonation of an oracle by an adversary.

Formula (1) therefore provides a lower bound on the probability with which the observer succeeds in impersonating the prover. We derive additional bounds below. Assume the observer has one camera recording and derives the shadow set from it.
He or she then attempts to authenticate him- or herself by entering PINs randomly chosen from the shadow set. The observer succeeds if he or she guesses:
A: the correct PIN from the shadow set
B: a wrong PIN but the correct PIN is a shadow of the wrong PIN
Event B is an unfortunate side effect of the shadow sets. Note that the observer cannot guess a correct and a wrong PIN simultaneously and therefore in all attempts $A \cap B$ is the empty event $\emptyset$, the probability of which is zero. Let $A^c$ be the complementary event of A and therefore $\Pr[A] = 1 - \Pr[A^c]$. The probability that the observer succeeds in the k’th attempt can now be computed from the decision tree shown in Figure 5. The events at each node of the tree are mutually independent conditional on their parent node and thus the observer’s probability AB to succeed in n or fewer attempts (counted from k = 0, ..., n − 1) is the sum of the probabilities of the leaves in the tree, or more precisely:

\[ AB = \sum_{k=0}^{n-1} \Bigl( \bigl( \Pr[A_k \cap B_k^c] + \Pr[A_k^c \cap B_k] \bigr) \cdot \prod_{i=0}^{k-1} \Pr[A_i^c \cap B_i^c] \Bigr) \qquad (2) \]

Formula (2) provides another lower bound on the probability that the observer succeeds in impersonating the prover. The probabilities of individual events can be calculated based on the observation that, conditional on choosing a PIN $x_k$ from the shadow set, $A_k$ and $B_k$ are independent experiments. Therefore it holds that:

\[ \Pr[A_k \cap B_k^c] = \Pr[A_k] \cdot \Pr[B_k^c], \qquad \Pr[A_k] = \frac{1}{s-k} \]
\[ \Pr[A_k^c \cap B_k] = \Pr[A_k^c] \cdot \Pr[B_k], \qquad \Pr[B_k] = \frac{1}{N-s+1} \]

Unfortunately, the introduction of shadows also increases the probability that the observer succeeds in impersonating the prover by randomly guessing a PIN without

Figure 6. Both displays show the same graphs (AB(x), CB(x), and D(x) plotted against the number of shadows). The left display is plotted in linear scale, the right display is plotted in logarithmic scale.
Function AB(x) is the probability to guess a PIN with a shadow set of size x with 10,000 PIN numbers and three attempts (see Formula (2)). Function CB(x) is the same for the probability to guess correctly without knowing the shadow set (see Formula (3)). Function D(x) gives the probability to guess correctly by entering random sequences in our design with reduced rounds (see Formula (1)).

knowing a shadow set. Let $C_k$ be the event that the adversary guesses the correct PIN from the entire set of PINs in the k’th attempt. The probability of event $C_k$ is:

\[ \Pr[C_k] = \frac{1}{N-k} \]

Even if the observer guesses a wrong PIN, the correct PIN may still be a shadow of the wrong PIN. By similar considerations as summarized above, we can derive the success probability CB by substituting $C_k$ for $A_k$ in formula (2). This yields our third and final lower bound:

\[ CB = \sum_{k=0}^{n-1} \Bigl( \bigl( \Pr[C_k \cap B_k^c] + \Pr[C_k^c \cap B_k] \bigr) \cdot \prod_{i=0}^{k-1} \Pr[C_i^c \cap B_i^c] \Bigr) \qquad (3) \]

In Figure 6 we show graphs of the lower bounds we derived (formulas (1), (2) and (3)) for different numbers of shadows and a PIN space of size 10,000. Note that the right half of the figure displays plots in logarithmic scale whereas the left part shows an excerpt plotted in linear scale. The size of the shadow set is plotted on the abscissa, probabilities are plotted on the ordinate. Where the size of the shadow set is approximately 100, the graph of (1) breaks even with the graph of (2) (approaching from above) at a probability of approximately 0.03. We conclude that, unless better attacks become known, a shadow set of size 100 is the maximal size that is reasonable and yields approximately a 3% chance that an observer impersonates a prover with or without knowledge of a shadow set. The values of (3) remain significantly smaller than the values of (1) and (2) until the size of the shadow set approaches the size of the PIN space, at which point the graphs of (2) and (3) merge and approach 1.
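The three bounds can be evaluated numerically with a short sketch. It reads formulas (2) and (3) as the sum over the leaves of the decision tree, that is, the success probability of attempt k multiplied by the probability that all earlier attempts failed, and it uses $1/2^t = s/N$ in formula (1). The function names and the guard against shadow sets smaller than the number of attempts are assumptions made for this illustration.

```python
def d_bound(shadows, pin_space=10_000, attempts=3):
    """Formula (1): success by random input with reduced rounds,
    where 1 / 2**t equals shadows / pin_space."""
    p = shadows / pin_space
    return sum(p * (1 - p) ** (k - 1) for k in range(1, attempts + 1))


def _tree_bound(guess_prob, shadows, pin_space, attempts):
    """Sum over the decision-tree leaves: success in attempt k times the
    probability that every earlier attempt failed."""
    b = 1 / (pin_space - shadows + 1)      # Pr[B_k] as given in the text
    total, all_failed = 0.0, 1.0
    for k in range(attempts):
        a = guess_prob(k)                  # Pr[A_k] or Pr[C_k]
        total += all_failed * (a * (1 - b) + (1 - a) * b)
        all_failed *= (1 - a) * (1 - b)
    return total


def ab_bound(shadows, pin_space=10_000, attempts=3):
    """Formula (2): guessing from a known shadow set, Pr[A_k] = 1 / (s - k)."""
    if shadows < attempts:
        raise ValueError("sketch assumes at least as many shadows as attempts")
    return _tree_bound(lambda k: 1 / (shadows - k), shadows, pin_space, attempts)


def cb_bound(shadows, pin_space=10_000, attempts=3):
    """Formula (3): guessing from the whole PIN space, Pr[C_k] = 1 / (N - k)."""
    return _tree_bound(lambda k: 1 / (pin_space - k), shadows, pin_space, attempts)
```

For 100 shadows, 10,000 PINs, and three attempts, `d_bound(100)` and `ab_bound(100)` both evaluate to roughly 0.03, while `cb_bound(100)` stays well below one percent, which agrees with the reading of Figure 6 given in the text.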
At the same time, the overall number of rounds for which the prover has to play the cognitive trapdoor game is theoretically about halved ($\log_2(10{,}000) = \log_2(100) + \log_2(100)$), which considerably improves the usability of our PIN entry method. In practice, our choice of the number of rounds is somewhat limited by the fact that the number of decimal digits is not a power of two. Due to the probabilistic nature of this recording resilience modification we refer to such designs as probabilistic cognitive trapdoor games.

One caveat remains, though. In a typical scenario, the verifier resets its counter of false attempts once a PIN is entered correctly. Hence, the observer may probe one or two PINs taken from the shadow set. If these attempts fail then he waits until the prover has again (correctly) entered his genuine PIN. At that point, the verifier resets its false attempts counter and the observer can probe one or two more PINs from the shadow set. This strategy may be continued until the observer has identified the genuine PIN. In order to avoid the attack, the verifier must display the recorded number of false attempts before the game, so that the prover is alerted. A consequential denial of service condition due to intentional entry of false PINs can be avoided by amending the identifying information stored on the prover’s token with a salt. Hence, the token cannot be forged from obvious identifying information (such as account numbers that are also printed on balance sheets) but must be stolen first (in which case invalidation of access is in the best interest of the prover).

3.5 Alternative I/O Modalities

In previous sections, we laid out in detail how the PIN pad metaphor can be applied to design the PIN entry procedure in a fashion that fends off shoulder surfing. The concepts we applied are not limited to that particular metaphor, nor are they limited, e.g., to output that must be perceived visually.
Consider an output device which consists of a board with eight palpable pins arranged in two arcs so that the little, ring, middle, and index finger of each hand can be conveniently placed on top of the pins, and the thumbs come to rest on two keys which are used for input. Assume that the verifier can raise or lower the pins in a palpable fashion. The device can loosely be compared to a simplified Braille display (footnote 3: Braille is a tactile reading and writing system for the blind based on dots raised above the surface, named after its inventor Louis Braille, 1809–1852). The prover’s PIN consists of a sequence of five fingers. For ease of description, we number the fingers of both hands excluding the thumbs from zero to seven. Assume the prover’s PIN sequence consists of the following fingers: left middle, left ring, right index, right middle, left index. Then we can represent the PIN as the five digit base eight number $21453_8$. Each digit is entered as follows: (1) The verifier raises four pins and lowers all others. (2) If the pin that corresponds to the current PIN digit is raised then the prover presses the key under his or her left thumb, and the key under his or her right

Figure 7. This figure illustrates the input and output mechanisms based on palpable pins which can be raised (black) or lowered (white) under the control of the verifier. Only the thumbs are exceptions; thumbs denote keys which must be pressed to indicate whether a particular pin is raised (left thumb, black) or lowered (right thumb, white). The prover’s PIN number is represented by a sequence of fingers. In each round, the prover presses the thumb key which denotes the state (raised or lowered) of the current PIN finger, e.g., the middle finger.
In the example above, the pin under the middle finger is first raised, then lowered, then raised again, hence the prover would press the keys under his or her thumbs in the following sequence: left, right, left.

thumb otherwise. Assume that S is the set of digits represented by the pins in the same state as indicated by the prover. (3) Steps 1 and 2 are repeated three times. (4) The verifier calculates the entered digit by intersecting the sets S1, S2, S3. It is easy to see that we can devise variants of that algorithm analogous to the variants of the visual designs we described in §3.1. However, the tactile design has an advantage over the visual design: output is implicitly hidden by the fingertips of the prover, which means that a human observer learns no information about the PIN that is entered. We must be aware, though, that attack and defense is an arms race: it is not inconceivable that sophisticated observers eventually develop other concealed measurement devices which allow them to capture the state of the pins, e.g., based on microphones. The tactile design is particularly suited for blind provers who are unable to notice active shoulder surfing. Other provers with disabilities may profit from such a scheme as well, e.g., people in wheelchairs who have difficulty shielding their input effectively from the view of observers.

4 Related Work

The problem of how PIN numbers can be entered in the face of shoulder surfing has inspired a considerable body of related work. A common approach of which several variants were proposed is based on a key pad with randomized layout of keys (Hirsch, 1982, 1984; Cairns, 1990; Thrower, 1989; Rehm, 1985; Hoover, 2001; Collins, 1990; McIntyre et al., 2003; Baker, 1995). The prover must locate and press the keys on the key pad that are labeled with his or her PIN digits. Of course, that provides added security only if the observer cannot observe the labels on the keys that the prover presses.
It appears that the cognitive task of the prover is even more difficult than that of the observer since the prover has to find the appropriate keys whereas the observer may focus his or her attention simply on those keys that the prover finally presses. These mechanisms bear little if any resemblance to our designs. However, Baker (1995) describes a password entry method whereby the verifier randomly arranges alphanumerical characters in a grid. Provers enter the characters of their password by selecting the row or column that contains the password character. After each selection, the grid is randomized again. In his description of the mechanism, Baker already noted that uncertainty about the entered character provides some resilience against camera recording.

A second prominent category of mechanisms requires that the prover mentally calculates and enters the results of an arithmetic function which takes the secret PIN and a verifier supplied challenge as its input. For instance, the function could be a per digit multiplication modulo ten (Wilfong, 1998, 1999) or the calculation of vector products between two vectors, one of which is a challenge of the verifier and the other of which represents the secret PIN of the prover (Hopper and Blum, 2000, 2001; Li and Teng, 1999; Matsumoto, 1996). Since there is a certain probability that an observer guesses the result of the vector product, multiple rounds are executed to diminish the chances that an observer successfully impersonates the prover. These mechanisms are particularly interesting from a theoretical point of view since the security assertions that can be made about them (e.g., that it is hard for the observer to calculate the secret of the prover even if multiple sessions are recorded) are well founded in mathematics and theoretical computer science.
In the case of Hopper and Blum (2001) the authors concluded that the mechanisms are prohibitively onerous in practice, as would be the case of the mechanisms described by Li and Teng (1999) which require that provers memorize and operate on three keys with different functions each of which has 20–40 bits worth of information. However, Matsumoto (1996) developed ground breaking, and in our view excellent, designs to cope with the difficulties of provers to perform the necessary calculations. He devised precomputation of the results of the vector products for several challenges, arranged in columns which are indexed by representatives of the prover’s secret vectors (which yields a tabular display of numbers). The prover’s task is thereby reduced to a lookup of the answer in the cross section of the line that resembles the current challenge and the columns that represent the secrets. Additional designs of his are based on a map of train stations or charts describing Janken games (better known as scissors, paper, stone). Although the underlying mechanisms in our and Matsumoto’s designs are different their graphical presentations exploit similar traits. An advantage that our designs have is that the input and output is somewhat simpler and more strongly exploits humans’ cognitive limitations to the prover’s advantage. As part of future work, we would like to study the usability of Matsumoto’s designs and to compare the findings with the usability of our designs. Another common approach is to apply alphanumerical character association and substitution problems (Johnson and Weber, 1997; A. James Smith, 2001; Patarin and Ugon, 1998; Anvekar, 2003; Matsumoto and Imai, 1991; Swi, 2004). For instance, the prover must substitute characters in a challenge (that the verifier provides) with associated characters in an answer alphabet (Matsumoto and Imai, 1991). The difficulty for the observer is reconstructing the mapping between the challenge alphabet and the answer alphabet. 
In other cases, e.g., as described by Anvekar (2003), the association between PIN digits and a unique code is displayed and the prover must enter the code in place of the PIN digit. The method described by Collins (1990) appears to be a multi-dimensional variant and application of the Playfair cipher invented by Sir Charles Wheatstone in the 19th century. The basis of the method is a multidimensional matrix of random elements which must be secretly shared by the prover and the verifier. The verifier challenges the prover with two elements not found in the same row or column of the matrix, and the prover answers by entering those elements that complete the rectangle or parallelepiped whose opposite corners were defined by the challenge. In the mechanism described by Johnson and Weber (1997), the prover must substitute values of an environment variable into variable expressions in his or her secret password, and the prover’s time to enter his or her secret may be limited.

A more diverse category of related designs is based on interactions with random matrices of alphanumeric characters or geometric arrangements of elements. For instance, in the case of A. James Smith (2003) the secret consists of a sequence of characters associated with a pattern of positions in a master matrix. The remainder of the matrix is filled with random characters. The verifier presents various matrices to the prover, one of which is the correct matrix while the others are decoys. The prover selects the correct matrix and directly recreates the pattern in it to authenticate himself or herself. In the case of Cottrell (1995), the secret consists of a geometric arrangement of elements which may consist of, e.g., colored alphanumeric characters. In order to authenticate himself or herself, the prover must recreate the correct combination of geometric arrangement of elements. An earlier design which includes elements from Cottrell (1995) is described by Martino et al. (1994).
Said mechanism resembles a jigsaw puzzle. By operating two or more controls, the prover manipulates several elements of the puzzle at once until a secret subset of elements is in predefined positions. The mechanism described by Narayanaswami (2004) uses a combination of images and positions as the secret. The verifier flashes images at distinct positions of a touch-sensitive display. The prover authenticates himself or herself by tapping on the locations where the images that make up his or her password flashed. Finally, Romanoff disclosed a mechanism by which the prover is challenged with a randomized grid of digits where each digit appears multiple times. The prover’s secret consists of a sequence of positions in the grid; he or she authenticates him- or herself by entering the digits which are located at these positions. Multiple occurrence of the same digit provides some resilience against observers and perhaps camera recordings, as in the case of Baker (1995).

A detailed comparison of our designs with the aforementioned related work is beyond the scope of this paper. However, it is probably fair to conclude that despite occasional superficial similarities, our approach has significant unique traits. It is difficult to judge the tradeoff between security and usability that said related work can achieve since few of the authors provided a detailed study of their mechanisms. As future work, we consider filling that gap by conducting comparative usability studies of our designs and those proposed in related work.

5 Security and Usability Study

We conducted three studies with the objective of assessing the security and usability of our immediate response design (“IOC”) versus the delayed response design (“DOC”) versus the regular PIN entry method (“REG”). The first study put subjects into the role of the shoulder surfer, the second study put them into the role of the prover. We presented these studies in (Roth et al., 2004).
For reference, we give a brief summary of the results below. Based on the outcome of these studies, we refined the user interface design of our implementations and conducted a third study that focused on the usability and user acceptance of our PIN entry methods. We describe the materials and methods we used (see §5.2) and the outcome of the third study (see §5.3). In §5.4, we interpret the outcomes of our studies.

5.1 Summary of Earlier Studies

We implemented all three methods of PIN entry in software and deployed it on a touch screen kiosk system. The software required four rounds of input per PIN digit, and it logged all user input with a time stamp for subsequent analysis. For each method, we filmed ten entry procedures of randomly chosen PINs with a digital camera. The field of view was chosen so that the entire PIN pad was visible as well as the fingers of the first author, who entered the PINs. Care was taken not to unnecessarily obstruct the display during PIN entry. With this approach, we intended to provide optimal conditions for an attack. Additionally, we produced three separate example films for the purpose of explaining all input methods to the subjects with whom we conducted the study.

We recruited 8 students of the local university as subjects for our first study. We first briefed the subjects with the example films and a written explanation of the principles of the methods and informed them that their task would be to determine the PINs being entered. Subsequent to the briefings, the three films were projected onto a screen in front of the group. Breaks were offered after each PIN entry sequence in order to allow the subjects to write down their guesses and for reflection and discussion of their strategies. In order to prevent fatigue, we also offered breaks between the films, which totaled a length of approximately 10 minutes.

Correct digits | 4 | 3 | 2 | 1
REG | 100 | 0 | 0 | 0
IOC | 0 | 0 | 5 | 8.75
DOC | 0 | 0 | 5 | 7.5

Figure 8.
The guessing rate in percent of total number of digits.

All eight participants were able to complete the study and there was no visual or understanding problem in following the contents of the films. No participant was able to guess even one of the PIN numbers entered with one of our methods while all participants correctly guessed all PIN numbers in the REG condition. However, some participants succeeded in guessing one or two digits of some of the PIN numbers in the IOC and DOC condition (see Figure 8 for a summary of results). The isolated successes appear to be a result of the strategies employed by the participants. In four cases, the subject focused on a randomly chosen digit and compared the input to the pattern of that digit. Another strategy of subjects was to capture the distribution of black and white buttons as a pattern that they sketched on paper. Some participants used prepared stencils as an aid to mark black and white buttons. However, no strategy was particularly successful.

For our second (usability) study, we recruited 34 participants with academic education aged between 20 and 30 years. We chose a demographically homogeneous group of participants in order to limit the required number of subjects, and to maximize the impact of the condition factor on variance. Each participant was randomly assigned one input method of which he or she had to complete 10 input cycles. All participants had to perform their input on the same kiosk system that we used in our first study. As dependent variables, we measured user condition (pre- and post-test), user acceptance, time used for entry, and error rate. User condition was assessed using the short scale of the BMS (Plath and Richter, 1984) which indicates the work load in the sub-scales physiological fatigue, concentration, motivation, and emotional state. The usability was measured with a subset of questions taken from the SUS questionnaire (Brooke, 1996).
In summary, we found that subjects learned the new methods in three to four trials. In the last three trials we found no significant difference in the error rate of all three conditions. The new methods were rated significantly less usable (which in itself is not a surprise) but were perceived to be significantly more secure than the regular method. The user acceptance was high but failed to reach significance.

5.2 Usability Study

The results of our first two studies encouraged us to conduct a third study, for which we improved the user interface of all methods based on user feedback that we had received (figure 9 shows the improved system). Most notably, we reduced the number of rounds per PIN digit from four rounds to three rounds. We added a progress display that indicates the current PIN digit position (the green dot in figure 9) and the number of rounds that are completed for the current digit (in figure 9, two out of three rounds are completed). By pressing the “delete” button, users can step back and correct input. We also introduced an intermediate state in between changing the color patterns by briefly coloring the silhouette of the key pad layout grey. The intermediate state facilitates detection of changing color patterns for digits that retain their color in subsequent rounds. Lastly, we changed the input device from touch screen to an external key pad which more closely resembles the predominant input device used in ATMs. More precisely, our new test environment consisted of a 12” Apple G4 iBook equipped with an external Trustmaster USB key pad, the layout of which we modified to that of a typical local ATM.

Figure 9. The improved user interface.

We recruited nine female and 13 male subjects aged 22 years to 61 years for our third study. All subjects were briefed about the purpose of the study and the functioning of all three input methods (REG, IOC, and DOC).
We asked each subject to choose a four digit PIN number he or she could remember easily as the PIN to be used in subsequent trials (most subjects chose month and year of their birth date). For each method, the subject had to enter his or her PIN six consecutive times. Our software randomized the order in which the REG, IOC, and DOC methods were tested. Subsequent to completion of each method, subjects had to fill out an electronic version of the SUS questionnaire (Brooke, 1996). The SUS questionnaire generally consists of ten questions that must be rated on a five point Likert (1932) scale. We adapted (and translated into German) eight relevant questions (see table 1), and increased the Likert scale to seven points in order to produce greater

Question (and translation) | REG∗IOC | REG∗DOC | IOC∗DOC
Ich würde das System gerne häufiger verwenden. (I would like to use this system more frequently.) | t = 1.4, p = .17 | t = 3.1, p < .01 | t = 1.8, p = .08
Ich finde das System unnötig komplex. (I think, the system is unnecessarily complex.) | t = 3.5, p < .01 | t = 5.2, p < .01 | t = 1.7, p = .09
Ich finde das System war leicht zu bedienen. (I think the system was easy to use.) | t = 2.1, p = .04 | t = 5.4, p < .01 | t = 3.4, p < .01
Ich hatte das Gefühl, bei der Bedienung die Kontrolle über das System zu haben. (I had control over the system at all times.) | t = 1.1, p = .3 | t = 4.3, p < .01 | t = 3.4, p < .01
Ich finde das System war umständlich zu bedienen. (I think, the system was complicated to use.) | t = 2.5, p = .02 | t = 6.3, p < .01 | t = 4, p < .01
Mir hat die Darstellung des Systems sehr gut gefallen. (I like the design of the system.) | t = 0.3, p = .8 | t = 2.7, p < .01 | t = 2.5, p = .01
Ich hatte das Gefühl, dass das System viel zu schnell ablief. (The system was too fast for me.) | t = 0.4, p = .7 | t = 5.5, p < .01 | t = 5.3, p < .01
Die Benutzung hat mir Spass gemacht. (I had fun using the system.) | t = −1.5, p = .1 | t = −0.6, p = .5 | t = 0.9, p = .4
Table 1. Results of pairwise comparison of SUS ratings using a two-sided t-test (DF=59). Significant differences between conditions are typeset in boldface.

variance. That yields a summed SUS score from 0 to 8 · 7 = 56 with 56 being the best result. In order to measure user attitude, we asked three additional questions in conjunction with the SUS questionnaire (see table 2).

Question | Mean | Mean std err
Um meine Sicherheit zu erhöhen, nehme ich auch Mehraufwand in Kauf. (In order to increase my security I am willing to accept additional effort.) | 4.66 | 0.25
Aktuelle PIN-Eingabe-Verfahren sind ausreichend sicher. (Current PIN entry methods are sufficiently secure.) | 1.40 | 0.24
An manchen Orten finde ich mich beobachtet, wenn ich meine PIN eingebe. (At some places I feel observed while entering my PIN.) | 4.47 | 0.29

Table 2. Mean ratings for three acceptance questions (1: disagree, 7: agree).

5.3 Results

Figure 10. Error rates depending on repetition (left) and value (right), plotted for the REG, IOC, and DOC conditions.

Error rate. False input (e.g., pressing the “white” button when the current digit is not white) and pressing the “delete” button one or multiple times in direct succession were each counted as one error. All conditions showed a slight learning effect. The error probability varied significantly depending on condition (Kruskal-Wallis: χ²(2) = 338.48, p < .01). Pairwise comparison revealed a significantly higher error rate for the DOC condition (ē = 0.2) while there was no significant difference between the REG (ē = 0.025) and IOC condition (ē = 0.023; Wilcoxon: REG∗IOC: Z = 0.22, p = .83; REG∗DOC: Z = −9.52, p < 0.01; DOC∗IOC: Z = −16.42, p < 0.01). The error rate in the IOC and DOC condition was also influenced by fatigue, as can be seen in the left graph of figure 10. The error probability was also influenced by the digit position.
Errors were particularly frequent for the second digit in the REG and IOC condition (see figure 10, right graph).

Duration. We found that subjects entered their PINs in the REG condition about 15 times faster than in the IOC condition, and about 22 times faster than in the DOC condition. Below, we summarize the duration by condition in milliseconds:

Condition | Mean | Mean std err | t-test
REG | 1,130 | 106 | t(REG∗IOC) = −19.58, p < 0.01
IOC | 17,626 | 640 | t(IOC∗DOC) = −8.48, p < 0.01
DOC | 24,734 | 788 | t(DOC∗REG) = −28.13, p < 0.01

We found a learning effect that completed after three repetitions; figure 11 shows the average duration per PIN entry for each repetition. In other words, subjects quickly acquired the skills necessary to operate the new methods. We also found significant effects for age (REG: r = 0.15, p < 0.01; IOC: r = 0.12, p < 0.01; DOC: r = 0.05, p = 0.03) and for gender (REG: r = 0.11, p = 0.04; IOC: r = 0.08, p = 0.02; DOC: r = 0.06, p = 0.01) based on pairwise correlation analysis in all conditions.

Figure 11. Duration of PIN entry by repetition (REG, IOC, and DOC conditions).

Usability. A Kruskal-Wallis rank-sum test of the SUS ranks revealed a significant effect of condition on usability rating (χ²(2) = 21.29, p < .01). By pairwise comparison of the conditions we found that the DOC method was rated significantly less usable than the other conditions (ā = 25.9) while there was no significant difference in ratings between REG (ā = 42.7) and IOC (ā = 37.7; Wilcoxon rank sums for REG∗IOC: Z = 1.47, p = 0.14; for IOC∗DOC: Z = −3.25, p < 0.01; for DOC∗REG: Z = 4.35, p < 0.01). We found neither age nor gender effects in the SUS ratings. Table 1 gives the results by condition and SUS question.

Attitude. Subjects did not consider current PIN entry methods as secure.
They also concurred with the statements “in order to increase my security I am willing to accept additional effort” and “at some places I feel observed while entering my PIN” (see table 2). We found no effect of condition or demographics on user attitude.

5.4 Interpretation

Our initial study of shoulder surfing attempts indicated a clear security advantage of our PIN entry methods when compared to the regular method. Subjects with no particular training in shoulder surfing observed all PINs in the REG condition without errors whereas in the IOC and DOC condition subjects guessed only one or two digits correctly in a few cases. Of course, one cannot generalize that result: determined adversaries would perhaps invest a certain amount of training to improve their shoulder surfing skills when faced with our methods. It remains to be investigated to what degree training may improve guessing probability. However, it is probably fair to say that our mechanisms raise the bar for shoulder surfers substantially.

The security benefits, however, come at the price of longer duration for PIN entry paired with a higher level of required attention, particularly in the DOC condition. This was to be expected; the question was by what magnitude the usability of the IOC and DOC methods differs from that of the REG method. We were content to find that in our current study, subjects’ usability rating of the IOC method was comparable to the rating of the regular method. That is an improvement over our earlier study with the previous version of the implementation. The REG and IOC methods also exhibited similar characteristics with regard to age, gender, and error. Unfortunately, the DOC method did not profit to the same degree from the revisions we made to the test environment. All conditions showed a learning effect. Subjects acquired the skills necessary to operate our mechanisms within three repetitions.
In summary, we conclude that the IOC method may indeed be of high practical value, whereas the DOC method appears to be too demanding for an actual application.

6 Conclusions

Towards a PIN entry method that is robust against shoulder surfing, we proposed two variants of an interactive challenge-response protocol (the immediate and delayed choice variants) to which we refer as cognitive trapdoor games. The essential feature of such a game is that it is easily won if the PIN is known, and hard to win otherwise. The cognitive capabilities of a human are generally not sufficient to derive the genuine PIN through observation of the entire game’s input and output.

As a defense against automatic recording, for instance by miniature cameras, we proposed a modification which maintains a certain level of uncertainty about the genuine PIN even if automatic recording devices are deployed. Due to its probabilistic nature, we refer to this variant as a probabilistic cognitive trapdoor game. Additionally, we presented a tactile variant based on Braille-type displays which can be operated, for instance, by blind people with perfect secrecy against shoulder surfers.

In order to assess the security and usability of our visual PIN entry methods, we conducted three user studies. We already reported the results of the first two studies in (Roth et al., 2004). In this article, we report results of our third study, which focused on the usability of a revised version of our software and its user interface. The results of these studies support the hypothesis that our immediate choice method provides resilience against shoulder surfing while still being reasonably usable, which is of significant value when entering PINs in a public environment. Among the variants, the immediate choice method has shown considerable advantages over the delayed choice method with regard to usability, acceptance, entry times, and error rates.
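To make the trapdoor property concrete, the following is a simplified sketch of a challenge-response game for a single PIN digit. It is not our exact protocol: the random split into two equal halves, the binary response, and the rule of repeating challenges until one digit remains are assumptions made for illustration, as are all function names. The sketch shows that a prover who knows the digit answers each challenge trivially, and that the verifier (or anyone holding a complete recording) recovers the digit by intersecting the chosen halves, a computation that a human onlooker would have to carry out from memory.

```python
import random

DIGITS = frozenset(range(10))


def make_challenge(rng):
    """Split the ten digits into two random halves (e.g. shown in two colors)."""
    shuffled = rng.sample(sorted(DIGITS), 10)
    return frozenset(shuffled[:5]), frozenset(shuffled[5:])


def respond(secret_digit, challenge):
    """The prover reveals only which half (0 or 1) contains the secret digit."""
    return 0 if secret_digit in challenge[0] else 1


def consistent_digits(transcript):
    """Return the digits consistent with every (challenge, response) pair."""
    candidates = set(DIGITS)
    for challenge, response in transcript:
        candidates &= challenge[response]
    return candidates


def play_digit(secret_digit, rng, max_rounds=50):
    """Issue challenges until one digit remains; return it with the transcript."""
    transcript = []
    for _ in range(max_rounds):
        challenge = make_challenge(rng)
        transcript.append((challenge, respond(secret_digit, challenge)))
        remaining = consistent_digits(transcript)
        if len(remaining) == 1:
            return remaining.pop(), transcript
    raise RuntimeError("digit not determined within max_rounds")


if __name__ == "__main__":
    rng = random.Random()
    pin = [2, 5, 8, 0]  # hypothetical PIN, for illustration only
    recovered = [play_digit(d, rng)[0] for d in pin]
    print("verifier recovered:", recovered)
```

Because a full transcript determines the digit, this basic form offers no protection against a camera, which is the reason for the probabilistic variant mentioned above.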
Although the time required to enter a PIN with the immediate choice method is longer than the time required to enter a PIN with the regular method, the usability rating of the immediate choice method was not significantly different from the rating of the regular method. It appears that the additional effort, when compared to the regular PIN entry method, is offset by the subjective and objective security advantages that users gain from that method, which supports Sasse’s notion of users’ cost versus benefit calculation (Sasse, 2003). We conclude that the immediate choice method is of practical value where shoulder surfing is a concern.

Our next objective is to conduct usability studies of our methods on a larger scale, ideally within the scope of a field test. Any guidance on that subject is greatly appreciated.

Acknowledgments

This article is a significantly revised and extended version of (Roth et al., 2004), which we presented at the 11th ACM Conference on Computer and Communications Security. The described methods are patent pending. We would like to thank Abraham Bernstein and other (anonymous) reviewers very much for their detailed and supportive comments, which helped and guided us in improving our original manuscript. We would also like to thank everyone who participated in our usability studies for their time and support.

References

http://www.swiveltechnologies.com, July 2004.

A. James Smith, Jr. Method and apparatus for securing passwords and personal identification numbers. US Patent #6,253,328, United States Patent and Trademark Office, 4901 Gulf Shore Boulevard Dr. North, Apt. 1903, Naples, FL 34103, June 2001.

A. James Smith, Jr. Method and apparatus for securing a list of passwords and personal identification numbers. US Patent #6,571,336, United States Patent and Trademark Office, 4901 Gulf Shore Boulevard Dr. North, Apt. 1903, Naples, FL 34103, May 2003.

John R. Anderson. Cognitive Psychology and its Implications. Worth Publishers, 5th edition, 2000. ISBN 0-7167-3678-0.

Dinesh Kashinath Anvekar. Method for non-disclosing password entry. US Patent #6,658,574, United States Patent and Trademark Office, December 2003. Assignee: International Business Machines Corporation.

Daniel G. Baker. Nondisclosing password entry system. US Patent #5,428,349, United States Patent and Trademark Office, 6982 SW 184th, Aloha, OR 97007, June 1995.

George E. P. Box, William G. Hunter, and J. Stuart Hunter. Statistics for experimenters. Wiley-Interscience, 1st edition, 1978.

Mark Brader. Shoulder-surfing automated. Risks Digest 19.70, April 1998.

J. Brooke. SUS: A quick and dirty usability scale. In P. Jordan, B. Thomas, B. Weerdmeester, and I. McClelland, editors, Usability evaluation in industry, pages 189–194. Taylor and Francis, London, UK, 1996.

John P. Cairns. System for cryptographing and identification. US Patent #4,962,530, United States Patent and Trademark Office, Wilmington, DE, October 1990.

Earl R. Collins. Computer access security code system. US Patent #4,926,481, United States Patent and Trademark Office, La Canada, CA, May 1990.

John Colville. ATM scam netted $620,000 Australian. Risks Digest 22.85, August 2003.

Stephen R. Cottrell. Method to provide security for a computer and a device therefor. US Patent #5,465,084, United States Patent and Trademark Office, November 1995.

Count Zero. Card-o-rama: Magnetic stripe technology and beyond. Phrack, (37), 1992.

Steven B. Hirsch. Secure keyboard input terminal. US Patent #4,333,090, United States Patent and Trademark Office, 305 Peck Dr., Beverly Hills, CA 90212, June 1982.

Steven B. Hirsch. Secure input system. US Patent #4,479,112, United States Patent and Trademark Office, 305 Peck Dr., Beverly Hills, CA 90212, October 1984.

Douglas Hoover. Method and apparatus for secure entry of access codes in a computer environment. US Patent #6,209,102, United States Patent and Trademark Office, March 2001. Assignee: Arcot Systems, Inc.

Nicholas J. Hopper and Manuel Blum. A secure human-computer authentication scheme. Technical Report CMU-CS-00-139, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2000.

Nicholas J. Hopper and Manuel Blum. Secure human identification protocols. In C. Boyd, editor, ASIACRYPT, volume 2249 of Lecture Notes in Computer Science, pages 52–66. Springer Verlag, 2001.

ISO. Banking – Personal Identification Number (PIN) management and security – Part 1: Basic principles and requirements for online PIN handling in ATM and POS systems. International Organization for Standardization, May 2002. TC 68/SC 6.

William J. Johnson and Owen W. Weber. Method and system for variable password access. US Patent #5,682,475, United States Patent and Trademark Office, October 1997. Assignee: International Business Machines Corporation.

W. H. Kruskal and W. A. Wallis. Use of ranks in one-criterion variance analysis. J. Amer. Statist. Ass., (48):907–911, 1952.

Markus Kuhn. Probability theory for pickpockets – ec-PIN guessing. Available at http://www.cl.cam.ac.uk/~mgk25/, 1997.

Xiang-Yang Li and Shang-Hua Teng. Practical human-machine identification over insecure channels. Journal of Combinatorial Optimization, 3(4), 1999.

Rensis Likert. A technique for the measurement of attitudes. McGraw-Hill, New York, USA, 1932.

H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., (18):50–60, 1947.

Michael J. Martino, Geoffrey L. Meissner, and Robert C. Paulsen Jr. Identity verification system resistant to compromise by observation of its use. US Patent #5,276,314, United States Patent and Trademark Office, January 1994. Assignee: International Business Machines Corporation.

T. Matsumoto and H. Imai. Human identification through insecure channel. In D. W. Davies, editor, EUROCRYPT, volume 547 of Lecture Notes in Computer Science, pages 409–421. Springer Verlag, 1991.

Tsutomu Matsumoto. Human-computer cryptography: an attempt. In Proceedings of the 3rd ACM conference on Computer and communications security, pages 68–75. ACM Press, 1996. ISBN 0-89791-829-0. doi: http://doi.acm.org/10.1145/238168.238190.

Keith Eric McIntyre, John Foxe Sheets, Dominique Andre Jean Gougeon, Curtis W. Watson, Keven Paul Morlang, and Dave Faoro. Method for secure pin entry on touch screen display. US Patent #6,549,194, United States Patent and Trademark Office, April 2003.

G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97, 1956.

R. C. Milton. An extended table of critical values for the Mann-Whitney (Wilcoxon) two-sample statistic. J. Amer. Statist. Ass., pages 925–934, 1964.

Bodo Möller. Schwächen des ec-PIN-Verfahrens. Available at http://www.informatik.tu-darmstadt.de/TI/Mitarbeiter/moeller, February 1997. Manuscript.

B. B. Murdock. The retention of individual items. Journal of Experimental Psychology, 62:618–625, 1961.

Chandrasekhar Narayanaswami. Password protection using spatial and temporal variation in a high-resolution touch sensitive display. US Patent #6,720,860, United States Patent and Trademark Office, April 2004. Assignee: International Business Machines Corporation (Armonk, NY).

Jacques Patarin and Michel Ugon. Process for entry of a confidential piece of information and associated terminal. US Patent #5,815,083, United States Patent and Trademark Office, September 1998.

L. R. Peterson and M. J. Peterson. Short-term retention of individual verbal items. Journal of Experimental Psychology, (58):193–198, 1959.

Hans-Eberhard Plath and Peter Richter. Ermüdungs-Monotonie-Sättigung-Stress (BMS). Technical report, Psychodiagnostisches Zentrum, Dresden, Germany, 1984.

Werner J. Rehm. Security means. US Patent #4,502,048, United States Patent and Trademark Office, 22 Lomatta St., The Gap, Queensland, 4061, AU, February 1985.

Volker Roth, Kai Richter, and Rene Freidinger. A PIN entry method robust against shoulder surfing. In Proc. 11th ACM Conference on Computer and Communications Security, Washington, DC, USA, October 2004.

Lothar Sachs. Angewandte Statistik. Springer-Verlag, Berlin, Germany, 10th edition, 2002.

M. A. Sasse. Computer security: Anatomy of a usability disaster, and a plan for recovery. Ft. Lauderdale, USA, April 2003.

Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, 1948.

Chris Summers and Sarah Toyne. Gangs preying on cash machines. BBC News Online, October 2003.

Keith R. Thrower. Access control apparatus. US Patent #4,857,914, United States Patent and Trademark Office, Old Cedar, 12 Wychcotes, Caversham, Reading, RG4 7DA, GB2, August 1989.

Edward K. Vogel and Maro G. Machizawa. Neural activity predicts individual differences in visual working memory capacity. Nature, 428:748–751, April 2004.

Chuck Weinstock. ATM fraud. Risks Digest 4.86, May 1987.

F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, (1):80–83, 1945.

Gordon Thomas Wilfong. Method and apparatus for secure PIN entry. US Patent #5,754,652, United States Patent and Trademark Office, May 1998. Assignee: Lucent Technologies, Inc. (Murray Hill, NJ).

Gordon Thomas Wilfong. Method and apparatus for secure PIN entry. US Patent #5,940,511, United States Patent and Trademark Office, May 1999. Assignee: Lucent Technologies, Inc. (Murray Hill, NJ).

Danny Wood. Spain uncovers hi-tech cashpoint fraud. BBC News Online, January 2003.