1. Introduction These are my lecture notes for the class. They also include suggested problems from the textbook (and possibly from other sources). The notes are far from complete, and should not be taken as an exhaustive list of what you are expected to know. Instead, they highlight the pieces that seemed most important to me. I mention that this document doesn’t contain very much on applied subjects, real examples and how to use this material. This isn’t because that part of the course is unimportant; it is just not the core technical material that exams tend to focus on. Finally, this document is not carefully edited and has not been read by anybody else. When my notes disagree with the textbook, trust the textbook! More importantly, Do not rely solely on these notes! Of course, I (and your classmates) also appreciate emails that point out errors or typos. 2. Lecture 1: Chapter 1 (Jan. 12) 2.1. Summary. • Welcome and Administrative Details. • Review: Axioms of Probability and Elementary Calculations 2.2. Lecture. 2.2.1. Administrative Details. • If you don’t know anybody in this class, please sit in the front-right and say hello to your neighbours! • In this course, we review basic probability and move on to some applications. By default, this will be focused on Markov chain models in engineering and economics. However, there is some flexibility. Please let me know (preferably by email) if there are particular topics or applications that you would like to cover. • Textbook: Probability, Markov Chains, Queues and Simulations by William J. Stewart. • Website: go to aix1.uottawa.ca/∼asmi28. The syllabus is there, and this is my primary means of communicating with the class. • Office hours: 585 KED, office 201G. Monday and Wednesday from 1 PM to 2:30 PM. • Evaluation will be based on 5-6 homework sets (25%), a midterm on March 2 (25%) and a final exam (50%). • The first homework set is posted on the website, and is due in 2.5 weeks. All homework is due by the start of class on its due date. Like the first few weeks of class, the material in this set is meant to be a review. • I strongly suggest that you start on this homework set quite soon! The homework, as well as the first week of class, will be devoted to review. If you are having trouble with the homework, and would like to have a longer review period, let me know what you are finding difficult by this Sunday. If enough students are interested in this, we will have additional review sessions. • You are encouraged to ask for worked solutions to relevant questions that are not on the homework; I am happy to give them and add them to this document. 1 • This is not an introductory math class. We expect answers that are readable by human beings, including those not in this course! For example, you should: – Use sentences. – Not use the equality sign to mean ‘the next step in my argument is...’ – Define your notation if there might be some confusion. • An advertisement: the probability group is running a reading course this term on Fridays (details will be on my website). If you, or a friend, are interested in diving a little deeper into probability, please attend! 2.2.2. Sets and Sample Spaces. Recall the basic definitions: • Sample space: a set. Often the set of all possible outcomes of an experiment. Ex. When rolling a die, the associated sample space might be Ω = {1, 2, 3, 4, 5, 6}. • Event: a subset of a sample space. Ex. The event that a die roll is even is S = {2, 4, 6} ⊂ Ω. • When discussing events, we need to understand set operations and the algebra of set operations. • Recall the basic set operations, and try them all with sets A = {1, 2, 3}, B = {2, 4, 6}: – Set union (denoted A ∪ B). – Set intersection (denoted A ∩ B). – Set difference (denoted A − B). – Set complement (denoted Ac or sometimes A0 ). • There are many, many ‘rules’ for the algebra of set operations. If you forget some, you can normally recover them from Venn diagrams. We recall a few important rules: – A ∩ Ac = ∅. – A ∪ Ac = Ω. – A ∩ (B ∩ C) = (A ∩ B) ∩ C. – A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). – A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). – (A ∩ B)c = Ac ∪ B c . – (A ∪ B)c = Ac ∩ B c . • Recall that events A, B are mutually exclusive if A ∩ B = ∅, while A1 , . . . , An are exhaustive if ∪ni=1 Ai = Ω. 2.2.3. Axioms of Probability. A probability P is a map from some subsets of a sample space Ω to [0, 1] that satisfies: • P[Ω] = 1. • P For any countable sequence {Ai }i∈N of pairwise mutually exclusive events, P[∪i∈N Ai ] = i∈N P[Ai ]. These have many consequences, which you should recall. Some of the most important are: • P[Ac ] = 1 − P[A]. • For any A, B, we have P[A ∪ B] = P[A] + P[B] − P[A ∩ B]. 2.2.4. Conditional Probability and Independence. We define the conditional probability of A given B by P[A ∩ B] P[A|B] = . P[B] 2 This is the definition, but it basically does what we expect: Example 1. Let Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, let P[S] = |S| for any S ⊂ Ω, let A = 10 {1, 2, 3, 5, 7, 9, 10} and B = {2, 4, 6, 8, 10}. Then P[{2, 10}] P[A|B] = P[{2, 4, 6, 8, 10}] 2 = . 5 Recall that two events A, B are independent if P[A ∩ B] = P[A]P[B]. Example 2 (Dice and Sums). We roll two fair dice. Let E1 be the event that the sum is 7, and E2 be the event that the sum is 10. Let A be the event that the first die rolled comes up 4. We note: P[A ∩ E1 ] = P[E1 |A]P[A] 11 = 66 = P[A]P[E1 ]. P[A ∩ E2 ] = P[E2 |A]P[A] 11 = 66 > P[A]P[E2 ] 1 1 = . 6 12 Thus, A, E1 are independent but A, E2 are not. 2.2.5. Law of Total Probability. Recall that a collection of sets {Bi }ni=1 is called a partition of Ω if: • Bi , Bj are mutually exclusive. • P[∪ni=1 Bi ] = 1. Recall that, if {Bi }ni=1 are a partition and A is any set, n X P[A] = P[A|Bi ]P[Bi ]. i=1 Who cares? We have: Example 3. 100 people at a company have the opportunity to get a flu shot on Oct. 10. Say that 80 do. It is known that the probability of getting the flu within a year given that you have a flu shot is 3%; the probability of getting a flu without a flu shot is 6%. A worker is chosen at random on Oct. 9 of the following year; what is the probability that the worker had the flu in the intervening time? Let F denote the event that the selected worker had the flu and S denote the event that the worker had a flu shot. Then P[F ] = P[F |S]P[S] + P[F |S c ]P[S c ] = (0.03)(0.8) + (0.06)(0.2) 3 = 0.036. Note, the answer is between 3% and 6% - it is nice to remember this sort of ‘sanity check’ when doing calculations. 2.2.6. Bayes’ Rule. Recall that, if {Bi }ni=1 are a partition and A is any set, P[A|B1 ]P[B1 ] . P[B1 |A] = Pn i=1 P[A|Bi ]P[Bi ] Who cares? We have: Example 4. Consider the same situation as the above subsection. A worker with the flu is chosen at random; what is the probability that the worker had a flu shot? We calculate: P[F |S]P[S] P[F |S]P[S] + P[F |S c ]P[S c ] 0.024 = 0.036 2 = . 3 This is surprising the first time you see it: even though flu shots make you less likely to get the flu, most people who have the flu also had a flu shot. I do hope that this isn’t surprising to anyone in this class! P[S|F ] = 2.2.7. Harder Example. Lets go over some interesting examples that every student of probability should see at some point! How many people have seen these? Example 5 (Hat-Check Problem). n people go to dinner and leave there hats at the front of the restaurant. Upon leaving, each person grabs a hat at random. What is the probability that anybody gets the right hat? For 1 ≤ i ≤ n, let Ai denote the event that the i’th person has received their own hat. We then note that 1 P[A1 ] = . n Furthermore, P[A1 ∩ A2 ] = P[A1 ]P[A2 |A1 ] 1 1 = , nn−1 and more generally, P[A1 ∩ A2 ∩ . . . ∩ Aj ] = j Y i=1 = 4 1 n−i+1 (n − j)! n! Thus, by the principle of inclusion-exclusion, n X X n P[∪i=1 Ai ] = P[Ai ] − P[Ai ∩ Aj ] + i=1 = 1≤i<j≤n X X P[Ai ∩ Aj ∩ Ak ] − . . . 1≤i<j<k≤n (−1)|I|+1 P[∩i∈I Ai ] I⊂{1,2,...,n} = = n X i=1 n X (−1)i+1 (n − i)! n! i!(n − i)! n! 1 (−1)i+1 . i! i=1 This is the answer, but it is traditional to go a little further. Recall from calculus that ∞ X xi x e = . i! i=0 Thus, for n big, 1 − P[∪ni=1 Ai ] ≈ e−1 ≈ 0.368. 5 3. Lecture 2: Chapter 2; Start of Chapter 3 (Jan. 15) 3.1. Summary. • A few administrative details. • Review of combinatorics. • Definitions related to random variables. . That 3.2. Lecture. For a finite sample space Ω = {1, 2, . . . , n}, we often have P[A] = |A| n is, all outcomes are equally likely. In this case, calculating the probability of an event is equivalent to counting the number of ways the event can occur. Ex. card games, dice games, drawing balls from urns. 3.2.1. Permutations. We call an ordering of items in a set a permutation. The number of permutations of a set S is |S|! = |S|(|S| − 1)(|S| − 2) . . . (2)(1). Example 6. Let S = {1, 2, 3, 4}. Then some permutations of S are 1234, 1324, 4321, etc. There are |S|! = (4)(3)(2)(1) = 24 permutations of S. We sometimes consider permutations for a set with repeated elements. Say that a set S has ni copies of the i’th item, for 1 ≤ i ≤ k. Then the number of permutations of S is (n1 +...+nk )! . n1 !n2 !...nk ! Remark 3.1. Our way of talking about ‘sets with repeated elements’ is very far from what you might find in other math textbooks. This is not relevant for this course, but you should be aware of this if you do outside reading. Example 7. Let S = {1, 1, 2, 2, 2}. Then some distinct permutations of S are 11222, 12122. 5! There are 2!3! = 10 permutations of S. There are many, many objects that ‘look like’ permutations, and almost all of them show up in probability. I will briefly survey the textbook’s notation for them, but please be aware that: • Other sources may give the same objects slightly different names, and the names may be inconsistent. You have to understand what is going on. • The list in this textbook is not a complete list of permutation-like objects that show up! It is helpful, though not essential, to learn some principle. 3.2.2. Permutations with Replacement. Say that we have k boxes, and we can choose one of ni objects from each box. The total number of ways to make the k choices is n1 n2 . . . nk . Example 8. Say I have 12 pairs of pants and 25 shirts. Then I have (25)(12) = 300 possible outfits. 3.2.3. Permutations without Replacement. I have a box with n distinguishable objects, choose k objects from the box, and put them in order. The number of ways to do this is n! P (n, k) = . (n − k)! Example 9. Consider length-8 passwords built from characters a, b, . . . , y, z, 0, 1, . . . , 8, 9 without any repeated characters. There are 36! ≈ 1.2 × 1012 such passwords. 28! 6 3.2.4. Combinations without Replacement. I have n distinguishable objects and wish to put 0 ≤ k ≤ n of these objects into a box. The number of ways to do this is n n! P (n, k) = C(n, k) = = . k k!(n − k)! k! This number is called a binomial coefficient. Hopefully we all remember this from the binomial distribution! Example 10. I need to bring 3 textbooks with me on a trip, from the 12 that I should be 12! reading. The number of ways to do this is 3!9! = 220. Assume that I have m boxes and wish to place ki objects into each box, for 1 ≤ i ≤ m. Pm Assume that i=1 ki = n. The number of ways to do this is n n! = . k1 . . . km k1 ! . . . km ! These are called the multinomial coefficients. 3.2.5. Combinations with Replacement. I have a box with n distinguishable objects, and choose k of them with replacement, and ignore the order in which the objects were chosen. The number of ways to do this is (n+k−1)! . (n−1)!k! Example 11. I roll a die 3 times, recording the results. The number of possible sets of 8! results (again, ignoring order) is 5!3! = 56. 3.2.6. Remembering Which is Which. It is often hard to remember what is going on with permutations and combinations. For some people, putting the following into a table helps. Permutations: • With replacements: nk . n! • Without replacements: (n−k)! . Combinations: • With replacements: (n+k−1)! . (n−1)!k! n! • Without replacements: k!(n−k)! . 3.2.7. Principles. When looking at permutations and combinations without replacement, there is a nice unifying way to find all of the relevant formulas. Fix a set S of size n, possibly with duplicates. First, we add an extra label to duplicate objects, so that all of the objects are distinguishable. Then, put all objects in order; there are n! ways to do this. Finally, fix a particular permutation, and divide by the number of ways to ‘reorder’ this permutation that we don’t care about. Example 12. Let S = {A, A, B, B, B}. We wish to look at the number of ways to order these 5 items. In our steps: (1) We consider instead the set S 0 = {A1 , A2 , B1 , B2 , B3 }. (2) Note that there are 5! = 120 permutations of S 0 . (3) One permutation of S 0 is A1 A2 B1 B2 B3 . This is indistinguishable from A2 A1 B1 B2 B3 , A1 A2 B3 B2 B1 , etc. In total, there are 2! ways to rearrange the two copies of A and 5! 3! ways to rearrange the three copies of B. Thus, we end up with 2!3! permutations of S. 7 This principle is useful for two reasons: (1) It is eventually easier to use this than to memorize lots of formulas. (2) This can be used to do lots of calculations that can’t be done with the formulas in the textbook! 3.2.8. Harder Example. Lets go over some interesting examples that every student of probability should see at some point! How many people have seen these? Example 13 (Birthday Problem). Assume that people are born uniformly at random on the 365 days of the year. I have 32 people in a group. What is the chance that no two have the same birthday? If the year had a large number of days n, approximately how many people would need to be in a group for there to be a 90 percent chance of having two people with the same birthday? We tackle the general problem first. Assume that there are n days in a year, and for 1 ≤ j ≤ n let Aj,n be the event that the first j people in our group all have distinct birthdays. We note that j P[Aj+1,n ] = P[Aj,n ] 1 − , n and so j−1 Y i P[Aj,n ] = 1− . n i=1 Thus, for the first problem, P[A32,365 ] = 31 Y i=1 i 1− 365 = 0.2358. For the second problem, fix 0 < c < ∞. Then √ c n P[Ac√n,n ] = Y i 1− n i=1 Pc√n − i=1 ≈e c ≈ e− 2 . i n Since we want to get a probability of 0.1, we choose c 0.1 = e− 2 c = − log(0.1) 2 c = 4.61. √ IMPORTANT QUESTION: Why did we choose j = c n? How could we find out that this was the right thing to do? We now begin our review of random variables 8 3.2.9. Types of Random Variables. We begin by fixing a sample space Ω and a probability function P. At its most general, Definition 3.2 (Random Variable). A random variable X is a function from Ω to some other set. Remark 3.3. Those of you who have seen more mathematical theory will know that this isn’t quite right - random variables can’t quite be any function. However, this is close enough for our purposes. In this course, we generally have: Definition 3.4 (Random Variable). A random variable X is a function from Ω to the set R of real numbers. All of the random variables in this course will come in one of two flavours. For a random variable X : Ω → R, denote the range of X by X(Ω) = {X(ω) : ω ∈ Ω}. We then have: Definition 3.5 (Discrete Random Variable). If X(Ω) is either finite or countable, then X is called a discrete random variable. Definition 3.6 (Continuous Random Variable). If X(Ω) is an interval in R, then X is called a continuous random variable. Example 14 (Discrete, Continuous, and Other Random Variables). Let Ω = [0, 10] and define functions X1 (ω) = ω, X2 (ω) = bωc. Define X3 (ω) to be X1 (ω) for ω < 5 and X2 (ω) otherwise. Then X1 is continuous, X2 is discrete and X3 is neither. To relate random variables back to probability, we make the following important definition: P[X = x] = P[{ω ∈ Ω : X(ω) = x}]. Note that the RHS is already defined, and the LHS didn’t have any clear meaning! When actually doing probability calculations, we almost always deal with random variables and ignore the underlying sample space Ω. You certainly did this, possibly without thinking about it, in your last probability course. Example 15. Let Ω be the collection of possible 5-card hands from a deck of 52 cards. Define X1 to be 1 if the majority of cards in the hand are red, and 0 otherwise. Define X2 to be the sum of the 5 cards, using blackjack conventions. Define X3 to be the numerical ranking, from 1 to 52, of the highest card in the hand, using poker conventions. Define X4 to be the number of 5-card hands that this hand would beat in poker. 52! = We note that Ω is complicated and very large (we know from chapter 2 that it has 5!47! 2598960 elements), though we can write it down in a fairly compact way. X1 is much simpler than Ω; it takes only 2 values, and it is easy to check that P[X1 = 1] = P[X1 = 0] = 12 . This is main reason to use random variables: they are easier to deal with than the entire sample space. X2 and X3 are slightly more complicated than X1 ; both are easy to calculate by looking at a hand and have much smaller ranges (6 to 50 or 1 to 52) than the size of Ω. Finally, the range of X4 has the same size as Ω; we don’t gain very much in this case. 9 Remark 3.7. We often define random variables without defining Ω. For example, I might say ‘let the random variable X be the result of a fair dice roll.’ This is all perfectly OK, but a bit disconcerting - random variables are functions, but we seem to be omitting the domain of the function. I will be doing this throughout the rest of this course. 10 4. Lecture 3: End of Chapter 3, Start of Chapter 4 (Jan. 19) 4.1. Summary. • A few administrative details. • Basics of random variables and distributions. We do not cover this in a great deal of detail; please read over the rest of chapter 3 and make sure that it is familiar to you. • Introduction to joint distributions. This marks the beginning of material that will be new to many of you. 4.2. Lecture. Today, we continue our review of random variables. 4.2.1. Probability Distribution Functions. Recall: Definition 4.1 (Probability Mass Function). The probability mass function (or PMF) of a discrete random variable X is defined by pX (x) = P[X = x]. This has the important formula P[X ∈ A] = X fX (x). x∈A Also recall the related: Definition 4.2 (Cumulative Distribution Function). The cumulative distribution function (or CDF) of any random variable X is defined by FX (x) = P[X ≤ x]. This has the important related formula P[a < X ≤ b] = FX (b) − FX (a). Remark 4.3. Note that we use P[X ≤ x], not P[X < x]. Finally, Definition 4.4 (Probability Density Function). The probability density function (or PDF) of a continuous random variable X is defined by d pX (x) = FX (x), dx when it exists. Example 16. Question: X has PDF pX (x) = αx−2 for 2 ≤ x ≤ 10. What is α? What is the CDF? Answer: We note Z 10 1 1 1= αx−2 dx = α( − ), 2 10 2 so α = 25 . We then have 5 FX (x) = 2 Z 2 x 5 1 1 y −2 dy = ( − ) 2 2 x for 2 ≤ x ≤ 10. 11 4.2.2. Conditioned Random Variables. We can condition random variables in the same way that we condition events. Recall, for events A and B, P[A|B] = P[A ∩ B] . P[B] Similarly, if X is a discrete random variable, we can write P[X = x|B] = P[{X = x} ∩ B] . P[B] This leads to the definition of a conditioned PMF for discrete random variables: P[{X = x} ∩ B] pX|B (x) = . P[B] Example 17. Consider rolling two fair dice, let X be their sum, and let B be the event that the first die rolls a 4. We then have pX|B (x) = P[{X = x} ∩ B] = 6P[{X = x} ∩ B]. P[B] If 5 ≤ x ≤ 10, P[{X = x} ∩ B] = 1 ; 36 otherwise,P[{X = x} ∩ B] = 0. Thus, pX|B (x) = 1 6 for 5 ≤ x ≤ 10. Example 18. Consider rolling two fair dice. Let Y, Z be their results, let X = Y + Z be their sum, and let B be the event that Y is even. Then 1 P[B] = . 2 We note that, if X ∈ {3, 4} and B holds, we must have Y = 2. Thus, for x ∈ {3, 4}, pX|B (x) = P[{X = x} ∩ B] P[X = x|Y = 2]P[Y = 2|B]P[B] 11 1 = = = . P[B] P[B] 63 18 Similarly, if X ∈ {5, 6} and B holds, we must have Y ∈ {2, 4}. Thus, for x ∈ {5, 6}, pX|B (x) = P[{X = x} ∩ B] P[X = x|Y ∈ {2, 4}]P[Y ∈ {2, 4}|B]P[B] 12 1 = = = . P[B] P[B] 63 9 Similar calculations can be done for 7 ≤ x ≤ 12. If the event B is of the form B = ∪x∈I {X = x} for some set I, the condition PMF has a particularly nice formula: pX|B (x) = pX (x) P[B] for all x ∈ I, and pX|B (x) = 0 for x ∈ / I. This formula even works for continuous random variables. 12 Example 19. Let X be a continuous random variable with pdf given by fX (x) = e−x for x ≥ 0, and let B = {X > 1}. We wish to calculate fX|B . We first calculate Z ∞ e−x dx = e−1 . P[B] = x=1 We then note that B is of the nice form required by the expression above, and so pX (x) fX|B (x) = = e1−x . P[B] As an aside, this distribution is pretty special: it is basically the same as the one we started, but shifted. This is a special case of the memoryless property, which you probably saw in a previous probability course, and which is going to be very important in this course. 4.2.3. Distribution functions, events and formulas. Since distribution functions are written as the probabilities of events, all of the formulas we know for events (Venn diagrams, Bayes’ rule, etc) have related formulas for random variables. For example, let {Bi }ni=1 be a partition of a sample space Ω, and let X be a random variable on Ω. Then n X pX (x) = pX|Bi (x)P[Bi ]. i=1 This is similar to our rule P[A] = n X P[A|Bi ]P[Bi ] i=1 for events, and indeed you can prove one from the other. 4.2.4. Joint Distributions. We often want to think about two (or more) random variables defined on the same sample space. Since PDFs are easier to think about than the functions themselves, we introduce joint PDFs to deal with this. Remark 4.5. We have already been dealing with joint distributions, without naming them. Consider rolling two dice, and let X1 , X2 be their results. We certainly know how to deal with these together! Even more, we are pretty comfortable with random variables Y = X1 + X2 and Z = max(X1 , X2 ). We are just formalizing this. 4.2.5. Joint CDF. Definition 4.6 (Joint CDF). The joint CDF of two random variables X, Y is FX,Y (x, y) = P[X ≤ x, Y ≤ y]. In general, the joint CDF of n random variables X1 , X2 , . . . , Xn is FX1 ,...,Xn (x1 , . . . , xn ) = P[X1 ≤ x1 , . . . , Xn ≤ xn ]. For X, Y discrete, this is computed just like any event: Example 20. Let X1 , X2 be chosen uniformly from {1, 2, 3}, let Y = X1 + X2 and let Z = max(X1 , X2 ). We note that 2 ≤ Y ≤ 6, 1 ≤ Z ≤ 3 and look at FY,Z (y, z). 1 pY,Z (2, 1) = 9 13 pY,Z (2, 2) = 0 pY,Z (2, 3) = 0 pY,Z (3, 1) = 0 2 pY,Z (3, 2) = 9 pY,Z (3, 3) = 0 pY,Z (4, 1) = 0 1 pY,Z (4, 2) = 9 2 pY,Z (4, 3) = 9 pY,Z (5, 1) = 0 pY,Z (5, 2) = 0 2 pY,Z (5, 3) = 9 pY,Z (6, 1) = 0 pY,Z (6, 2) = 0 1 pY,Z (6, 3) = 9 This is tedious but easy! If X, Y are continuous, we need to remember our higher-dimensional calculus to do general calculations; more on that soon. If FX,Y is given to us, we have some formulas that are similar to the one dimensional case. Recall: P[a < X ≤ b] = FX (b) − FX (a). In two dimensions, P[a < X ≤ b, c < Y ≤ d] = FX,Y (b, d) − FX,Y (a, d) − FX,Y (b, c) + FX,Y (a, c). This equation, and similar equations, can be found by looking at pictures. We can also find the CDF for individual random variables, called the marginal CDF : lim FX,Y (x, y) = FX (x) y→∞ lim FX,Y (x, y) = FY (y). x→∞ We next have a critical definition: Definition 4.7 (Independent Random Variables). A pair of random variables X, Y are independent if FX,Y (x, y) = FX (x)FY (y) 14 for all x, y,. In general, many random variables X1 , X2 , . . . , Xn are independent if n Y FX1 ,...,Xn (x1 , . . . , xn ) = FXi (xi ) i=1 for all x1 , . . . , xn . Note that, if X1 , . . . , Xn are independent, than Xi , Xj are independent... but the opposite direction is not true! Example 21. Fix n ≥ 2. Let X1 , . . . , Xn−1 be independently chosen from {0, 1} and let Xn = 1 if an odd number of X1 , . . . , Xn−1 are odd, and Xn = 0 otherwise. We then have that {Xi }i∈I are independent for any I ⊂ {1, 2, . . . , n} with |I| ≤ n − 1, but {X1 , . . . , Xn } are not independent! NOTE: More complicated variations on this example show up in computer Pnscience and information theory. We have effectively split 1 ‘bit’ of information (whether i=1 Xi is odd or not) across n people, in such a way that no n − 1 of them working together can recover anything about it. 4.2.6. Joint PMF. Definition 4.8 (Discrete PMF). For a pair of random variables X, Y , we define pX,Y (x, y) = P[X = x, Y = y]. Similarly, for many random variables X1 , . . . , Xn , we have pX1 ,...,Xn (x1 , . . . , xn ) = P[X1 = x1 , . . . , Xn = xn ]. Just like P[X ∈ A] = X pX (x), x∈A we have for joint PMFs the formula X P[(X, Y ) ∈ A] = pX,Y (x, y). (x,y)∈A Just like we can relate the joint CDF to the marginal CDF, we can relate the joint PMF to the marginal PMF: X pX (x) = pX,Y (x, y) y pY (y) = X x 15 pX,Y (x, y). 5. Lecture 4: End of Chapter 4, Start of Chapter 5 (Jan. 22) 5.1. Summary. • A few administrative details. • Remainder of discussion of joint distributions. • Review of expectations. • Properties and formulas associated with expectations. 5.2. Lecture. Today, we continue our review of random variables and start a discussion of expectations. 5.2.1. Joint PDF. Recall the relationship between the CDF and PDF for a single continuous random variables: Z x fX (t)dt. FX (x) = −∞ We have a similar definition for the joint PDF: Definition 5.1 (Joint PDF). Let X, Y be two continuous random variables with joint CDF FX,Y . Then the joint pdf fX,Y is a function that satisfies Z y Z x FX,Y (x, y) = fX,Y (s, t)dsdt. −∞ −∞ Remark 5.2. There is some fuzziness here - it isn’t at all clear that there is exactly one such function. In fact, there generally isn’t! In this class, we generally give you one PDF, and you will do calculations using this formula. For a joint PDF fX,Y , we have the useful formula: Z Z fX,Y (s, t)dtds. P[X ∈ A, Y ∈ B] = s∈A t∈B Note that using this in general requires vector calculus! As a quick review: Example 22. Let X, Y have joint PDF fX,Y (x, y) = c(x + y) for 0 ≤ x ≤ y ≤ 5. We will calculate P[X + Y > 1]. First, Z 5Z y Z 3c 5 2 125c 1= c(x + y)dxdy = y dy = . 2 0 2 0 0 Thus, c = 2 . 125 We then calculate 1 2 Z P[X + Y > 1] = 1 − 0 1 =1− 125 374 = . 375 16 1−x Z x Z 1 2 x=0 2 (x + y)dydx 125 (1 − 4x2 )dx Definition 5.3 (Uniform Random Variable). A continuous (respectively discrete) random variable is uniform if its pdf (respectively pmf ) takes on exactly one nonzero value. Remark 5.4. As this isn’t a vector calculus course, I’m going to avoid doing too many ‘hard’ multivariable integral questions. However, you need to know some for this course, and to use the material in this course you will need to know more. I suggest that you read section 4.4, including the section on uniform random variables (which I am largely skipping). If any of it looks intimidating, do a practice problem or two. Just as we could calculate the marginal PMF from the joint PMF in the discrete case, we can calculate the marginal PDF from the joint PDF in the continuous case. Recall, for PMFs, we had: pX (x) = X pX,Y (x, y) y pY (y) = X pX,Y (x, y). x For PDFs, we have: Z fX,Y (x, y)dy fX (x) = y Z fY (y) = fX,Y (x, y)dx. x Just as we defined two discrete random variables X, Y to be independent if pX,Y (x, y) = pX (x)pY (y), we define two continuous random variables X, Y to be independent if fX,Y (x, y) = fX (x)fY (y). 5.2.2. Conditional Distribution Functions for Discrete Random Variables. Last class, we defined conditional distribution functions for single random variables. Recall, for a discrete random variable X and an event B, we had P[{X = x} ∩ B] pX|B (x) = . P[B] Similarly, for a pair of discrete random variables X, Y , we write P[X = x, Y = y] pX,Y (x, y) pX|Y (x|y) = . P[Y = y] pY (y) Remark 5.5. This is a definition, not a calculation! However, it lines up with many calculations for conditional events that we have already seen. For example, we can check that: pX|Y (x, y)pY (y) = pX,Y (x, y), just like P[A|B]P[B] = P[A ∩ B]. 17 Similarly, X and Y are independent if pX|Y (x, y) = pX (x) for all y, just like A and B are independent if P[A|B] = P[A]. Lets do some calculations: Example 23. We consider X, Y with joint PMF: pX,Y (x, y) = c(x + y) for x, y ∈ {1, 2, 3}. Lets calculate fX|Y (x, y). The natural thing to do is: 1= 3 X 3 X c(x + y) x=1 y=1 =c 3 X (3x + 6) x=1 = c(18 + 18), so c = 1 . 36 Next, we have pY (y) = 3 X pX,Y (x, y) x=1 3 = 1 X (x + y) 36 x=1 = 1 (6 + 3y). 36 Thus, pX,Y (x, y) pY (y) x+y . = 6 + 3y NOTE: I wasted time there. We didn’t need to calculate c - it goes away in the end. It is generally a good thing to be lazy in math class! pX|Y (x, y) = 5.2.3. Conditional Distribution Functions for Continuous Random Variables. We try to do the same thing for continuous random variables. There are some immediate issues; we originally had P[X = x, Y = y] pX|Y (x, y) = , P[Y = y] but the denominator is 0 for continuous random variables. However, we also had a formula: pX,Y (x, y) pX|Y (x, y) = , pY (y) 18 which suggests defining: fX,Y (x, y) . fY (y) fX|Y (x|y) = That definition at least works. It also turns out to make sense. As before, we get many analogues to existing formulas. We list some important formulas. First, we certainly have: Z ∞ Z ∞ fX,Y (x, y)dy = fY (y)fX|Y (x|y)dy. fX (x) = −∞ −∞ From this, we get a continuous version of Bayes’ rule: fY (y)fX|Y (x|y) . f (y)fX|Y (x|y)dy −∞ Y fY |X = R ∞ Example 24. Let X, Y have joint pdf fX,Y (x, y) = 6x2 y, on 0 ≤ x, y ≤ 1. We wish to calculate P[X ≤ 0.2|Y = 0.5]. To calculate the conditional distribution of X, we need the marginal distribution of Y : Z 1 6x2 ydx fY (y) = 0 = 2y. Thus, fX,Y (x, y) fY (y) 6x2 y = 2y = 3x2 . fX|Y (x|y) = Thus, Z 0.2 P[X ≤ 0.2|Y = 0.5] = fX|Y (x, 0.5)dx 0 Z 0.2 = 3x2 dx 0 −3 =5 . NOTE: We could have observed that fX,Y (x, y) = f1 (x)f2 (y), which automatically implies that X, Y are independent. Thus, we didn’t need to actually do much of this calculation at all! 5.2.4. Convolutions and Sums. Let X, Y be two discrete random variables with PMFs pX (x), pY (y). Let Z = X + Y . What is pZ (z)? Well, pZ (z) = P[Z = z] X = P[X = x, Y = z − x] x 19 = X pX (x)pY (z − x). x This is called the convolution formula. Basically the same thing holds for continuous random variables: Z ∞ fX (x)fY (z − x)dx. fZ (z) = −∞ We will be using this quite a bit later in the course; I mention it here mostly to follow with the textbook. 5.2.5. Expectations. You certainly know the expectation from previous probability classes, where it was the most frequently-used among many notions of ‘centrality,’ such as the median or the mode. In this class, it will really be the only such measure that we use. Recall the definition: Definition 5.6 (Expectation). For a discrete random variable X, the expectation is: X xpX (x). E[X] = x For a continuous random variable Y , the expectation is: Z ∞ E[Y ] = yfY (y)dy. −∞ We know how to do integrals, but there are many nice formulas that you might not know. For example: Lemma 5.7 (Integration by Parts Formula). Let X ≥ 0 be a discrete random variable. Then ∞ X P[X ≥ x]. E[X] = x=1 Remark 5.8. Yes, this is called ‘integration by parts’ in reference to the formula Z Z udv = uv − vdu from calculus. Those of you who like calculus can play around with this and figure out why, but it is beyond the scope of this course. Since functions of random variables are still random variables (yes, I really did mean that sentence!), we know how to take expectations of functions of random variables. For example, if X is a discrete random variable, and h is a function, then we can write: X E[h(X)] = h(x)fX (x). x Similarly for continuous random variables. However, there are a bunch of ‘special’ functions of random variables whose expectations have names. Definition 5.9 (Moments). For k ∈ N, the expected values E[X k ] are called the moments of X. When k = 1, the moment is called the mean or average. We also have 20 Definition 5.10 (Central Moments). For k ∈ N, the expected values E[(X − E[X])k ] are called the central moments of X. When k = 2, ths is called the variance. Remark 5.11. The mean and variance are the really important ones here! Remark 5.12. You’ve all seen expectations, so no examples here! 5.2.6. Expectations and Joint Distributions. We can’t really take expectations of a pair of random variables X, Y . When we talk about expectations in the context of joint distributions, we are really talking about the expectation of some function Z = h(X, Y ). But this Z is just a random variable, so we take the expectation the same way we always do: XX E[Z] = h(x, y)pX (x)pY (y) x Z y ∞ E[Z] = h(x, y)fX (x)fY (y). −∞ Example 25. Let X, Y have joint PDF fX,Y (x, y) = 3 2 (x + xy + y 2 ) 44 for 0 ≤ x, y ≤ 2. Let Z = XY . Then Z 2Z 2 3 (xy)(x2 + xy + y 2 )dxdy E[Z] = 44 0 0 Z 2Z 2 3 = (x3 y + x2 y 2 + xy 3 )dxdy 44 0 0 Z 2 3 8 = (4y + y 2 + 2y 3 )dy 44 0 3 3 64 = (8 + + 8) 44 9 52 = . 33 NOTE: This is plausible - the answer is between 0 and 4. More important than this definition, expectations of joint distributions have many important properties. The most important, often referred to as linearity of expectation, is that E[X1 + . . . + Xn ] = E[X1 ] + . . . + E[Xn ]. This holds regardless of independence. If X, Y are independent, we also have E[XY ] = E[X]E[Y ]. This implies an important formula for variance: V ar[X1 + . . . + Xn ] = V ar[X1 ] + . . . + V ar[Xn ]. Generally, if X, Y are not independent, we define 21 Definition 5.13 (Covariance and Correlation). The covariance of two random variables X, Y is given by Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])] and their correlation is given by Cov[X, Y ] Corr[X, Y ] = p . V ar[X]V ar[Y ] There are many important formulas here as well: V ar[X + Y ] = V ar[X] + 2Cov[X, Y ] + V ar[Y ] Cov[cX, Y ] = cCov[X, Y ] Cov[X + c, Y ] = Cov[X, Y ] All of these are possible to derive in one or two lines from the definitions you know. However, they show up often enough that they are probably worth memorizing. 22 6. Lecture 5: End of Chapter 5 (Jan. 26) 6.1. Summary. • A few administrative details. • A few last properties of expectations. • Introduction to generating functions. 6.2. Lecture. 6.2.1. Homework Note. Several people have asked me what I mean by a ‘probabilistic’ proof of an identity. This is very poorly defined, but (roughly speaking) I want a proof that doesn’t involve manipulating formulas. Here is a simple proof in that line, for the formula: X n−3 n i−1 n−i−1 = . 5 2 2 i=3 To prove this, I give a procedure for choosing 5 objects out of n. First, I choose the middle object. Then, if the middle object is in position i, I choose 2 objects from the first i − 1 and 2 objects from the last n − i − 1. The number of ways to do this procedure is clearly n−3 X i−1 n−i−1 . 2 2 i=3 However, this procedure also gives me every way to choose 5 objects out of n exactly once, so the number of ways to do this procedure must also equal n . 5 Thus, the two must be equal. 6.2.2. Introduction. Today, we finish discussing expectations for collections of random variables and get started on generating functions. We then begin to review various special distributions. 6.2.3. Conditioning and Expectations. Let X, Y be two random variables with joint PMF (respectively PDF) pX,Y (respectively fX,Y ). We define the conditional expectation of X given Y in the obvious way: X E[X|Y = y] = xpX|Y (x|y) Zx ∞ E[X|Y = y] = xfX|Y (x|y)dx. −∞ Note that we have seen this before! After all, P[·|Y = y] is an honest-to-goodness probability distribution. We are just taking expectations with respect to it. So, again, there is in fact no new content here! However, it does lead to two really important formulas. Both of them work for both discrete and continuous random variables. The first is: E[X] = E[E[X|Y ]]. For now, we just give a silly application: 23 Example 26. The average height of a Canadian male is 175.1 cm and the average height of a Canadian female is 162.3 cm. 49.7% of Canadians are men. What is the average height of Canadians? Choose a Canadian at random. Let X be their height and let Y be their gender. Then E[X] = E[E[X|Y ]] = (175.1)(0.497) + (162.3)(0.503) = 168.7. Another important formula is: V ar[X] = E[V ar[X|Y ]] + V ar[E[X|Y ]]. We give a simple application: Example 27 (Pooled Samples). I am interested in insuring a group of students. I pick a student at random, and let X be the amount that I have to pay out and let Y = 1 if the student is on the varsity football team and 0 otherwise. Due to earlier research, we know: E[X|Y = 0] = 248 E[X|Y = 1] = 880 V ar[X|Y = 0] = 110 V ar[X|Y = 1] = 110 P[Y = 0] = 0.92. We then calculate: E[V ar[X|Y ]] = (0.92)(110) + (0.08)(110) = 110. Also, V ar[E[X|Y ]] = E[E[X|Y ]2 ] − E[E[X|Y ]]2 = (0.92)(248)2 + (0.08)(880)2 − ((0.92)(248) + (0.08)(880))2 = 29398. We conclude: V ar[X] = E[V ar[X|Y ]] + V ar[E[X|Y ]] = 29508. NOTE: We needed quite a lot of information to be able to do this sort of calculation. ANOTHER NOTE: This formula gives us a beautiful, simple, inequality: V ar[X] ≥ E[V ar[X|Y ]] ≥ min V ar[X|Y = y]. y Before continuing: I strongly encourage you to read the list of formulas and inequalities on pp.100 of the textbook. All of them should be ‘obvious,’ and are a nice source of ‘tricky’ exam questions. 24 6.2.4. Generating Functions. Generating functions are some particularly important expectations. First: Definition 6.1 (Probability Generating Function). Let X ≥ 0 be a discrete random variable with PMF pX . The probability generating function is given by ∞ X X GX (z) = E[z ] = pi z i . i=0 This function has many important properties. The most important, which will certainly be used, is: Theorem 28. The distribution of X is uniquely determined by its generating function. That is, if GX (z) = GX (y), then we necessarily have pX = pY . It is also very useful for computations: Theorem 29. We have G0X (1) = E[X]. More generally, for any 1 ≤ k, (k) GX (1) = E[X(X − 1) . . . (X − k + 1)]. These expectations are known as factorial moments, though I won’t use that word much. There is also a continuous analogue to all of this: Definition 6.2 (Moment Generating Function). Let X ≥ 0 be a random variable. For θ ∈ R, the moment generating function is given by MX (θ) = E[eθX ]. Remark 6.3. The probability and generating functions are very close... but are parameterized slightly differently. This is annoying and makes them a little harder to memorize. I’m sorry, but there’s not much that I can do about it. Remark 6.4. The moment generating function uniquely determines the PMF of a discrete random variable. The moment generating function ‘almost’ uniquely determines the PDF of a continuous random variable as well, when it exists. We won’t really discuss what this ‘almost’ entails. Theorem 30. We have, for k ≥ 1, (k) MX (θ) = E[X k ]. Although the formulas are nice, we mostly care about the MGF because it has some very nice formulas. The most interesting is probably: Theorem 31. Let X1 , . . . , Xn be a collection of independent random variables. LEt Y = X1 + . . . + Xn . Then n Y MY (θ) = MXi (θ). i=1 We also have, McY (θ) = MY (cθ). 25 Lets use this! Example 32. Let X1 , . . . , Xn be i.i.d. random variables with MGF 2 MXi (θ) = eθ . Ok,Pthese are normal random variables - but we’re not talking about that yet. Let X = n √1 i=1 Xi . Then n MX (θ) = M √1 Pn i=1 n Xi (θ) θ = MPni=1 Xi ( √ ) n n Y θ = MXi ( √ ) n i=1 = n Y θ2 en i=1 2 = eθ = MX1 (θ). D Thus, by our earlier theorem, we have S = X1 . NOTE: Even though we are just checking equality... this actually seems quite difficult to do without using a generating function! We have two more generating funtions! The most general is: Definition 6.5 (Laplace Transform). Let X be a random variable (nonnegative or not!). For any s ∈ C, the Laplace transform is defined by LX (s) = E[e−sX ]. Finally, we have Definition 6.6 (Characteristic Function). Let X be a random variable (nonnegative or not!). For any s ∈ R, the characteristic function is defined by where i = √ CX (θ) = E[eiθX ], −1. Remark 6.7. Although it isn’t the easiest to understand, the characteristic function is probably the most important of the generating functions. It has two particularly important properties: • It generally exists (we will see soon that many distributions don’t have MGFs). • It has a nice inversion formula, useful in many areas. Those of you who have seen Fourier analysis will know this. Example 33 (Calculations with Generating Functions). Let X have moment-generating function 1 MX (z) = . 2−z 26 What is E[X n ]? We calculate 1 (2 − z)2 2 MX00 (z) = (2 − z)3 ...... k! (k) MX (z) = , (2 − z)k+1 MX0 (z) = and so (k) MX (0) = and so E[X n ] = k! 2k+1 , n! . 2n+1 6.2.5. Maximums and Minimums. Remember: If X1 , . . . , Xn are independent, and Y = max(X1 , . . . , Xn ), then n Y FY (y) = FXi (y). i=1 Section 5.5 of the textbook has more on this. 27 7. Lecture 6: R Programming, Start of Chapter 6 (Jan. 29) 7.1. Summary. • A few administrative details. • Homework 1 is due by the start of class today! Homework 2 is on the website, and is due on February 12 by the start of class. • Introduction to R programming. 7.2. Lecture. Today, we get a lightning introduction to the R programming language and to simulation, and then start Chapter 6. 7.2.1. Introduction to R. R is a programming language that is popular for doing work in probability and statistics. Its main advantages are: • It has a large and active user base, who can (and will!) answer questions. • It has large libraries for doing almost any calculations you will see in any probability and statistics course. • It is similar enough to ‘modern’ languages like C to be familiar to most users, and code from those languages can be integrated into it. • You can do command-line programming. • The big one: it is free. The R environment can be obtained at http://www.r-project.org/ under the downloads page. Please do that immediately, so that you have some time to practice! There are many great R resources for general programming. Some include: • The official R tutorial (available at http://www.r-project.org/ ). • There are three things that you can type into the console to get information about a function such as ‘sum’: ?sum, ??sum, and sum • StackOverflow and CrossValidated (otherwise known as StatsOverflow) have an enormous number of basic questions about R. Everything that comes up in this course will likely have a question on that website. I will very briefly go over some R syntax in class, including: • Creating variables (e.g. x = 5), vectors (e.g. y = c(1,2,3)) and matrices (e.g. z = mat(1:10, nrow=5)). • Operations (+,-,*,/, &,|). • Flow control (if, for). • Functions (foo←function(x){bar}). • Simulating random variables (rbinom(nSamples,n,p), rnorm(nSamples, mean,variance), in general rBLAH). • A small number of important special functions (mean, sum, sample(nItems,nSamples)). • A small number of ways to plot data (the most common is plot(x) or plot(x,y) to make scatter plots, and hist(x) to create histograms of univariate data). We will use R mostly to do ‘simulations.’ That is, we will give you a story about a bunch of distributions, and you will use R to generate made-up data to try to understand the story. If you haven’t done this before, it isn’t clear what this is for, or why you can do this on a computer. We’ll go through a worked example that might clear up some of this. 28 Example 34 (Birthday Problem, Redux). We know that, in fact, births aren’t distributed evenly throughout the year. For example, in a typical week in Canada, roughly 20 percent more people will be born on a given weekday than on the weekend. Say that we have a large number (more than 365) people lined up in a hallway, and we go through the line asking people their birthdays. How many people do we need to go through before we find a repeated birthday? We try to understand this problem by simulating the process. We make a function that simulates the result of a single line, with the weekday to weekend birth ratio given by a free parameter. For simplicity, assume that a year has exactly 52 weeks. birthday <- function(p) { Weeks = sample(52,366, replace=TRUE, rep(1,52)) Days = sample(7,366,replace=TRUE, c(1, 1,1+p,1+p,1+p,1+p,1+p)) BirthDate = 7*(Weeks-1) + Days i = 2 for(i in 2:365) { for(j in 1:(i-1)) { if( BirthDate[i] == BirthDate[j]) { return(i) } } } } To simulate this process 100 times, we could write: SimulatedData = rep(0,100) for(i in 1:100) { SimulatedData[i] = birthday(0.2) } And then we might plot using hist(SimulatedData). So, this lets us simulate data. What do we actually use it for? In practice, we often use it to do Monte Carlo integration. This is a big and very popular subject, but it is easy to say the basics: Example 35 (Monte Carlo Integration). Fix some random variable X with PMF pX . We want to calculate some expectation E[f (X)] with respect to pX ; for example, we might be interested in P[X < 1] = E[1X<1 ] or E[X 2 ]. Sometimes, these integrals are hard to calculate. Instead, we might simulate X1 , . . . , Xn and estimate n 1X f (Xi ). E[f (X)] ≈ n i=1 29 Since we all know the central limit theorem, we know that n C 1X f (Xi ) + ± √ . E[f (X)] ≈ n i=1 n Example 36 (Finishing the Birthday Problem). How does this relate to the birthday problem? We let X be the number of people called in before we found a repeated birthday, and want to find E[X], P[X < 20] and some number T so that P[X < T ] ≈ 0.5. I don’t know much about pX in this case, so these seem hard to find! Lets instead use our simulated data from above. We calculate these three terms in R: mean(SimulatedData) mean(SimulatedData < 20) sort(SimulatedData)[floor(length(SimulatedData))] Easy! Ok, that is just about it for a brief introduction to R. R will show up mostly in assignments, where there might be a little bit of extra information that could be useful. You will generally be responsible for figuring out the rest by yourself, but please feel free to write to me, come to office hours, etc. 30 8. Lecture 7: Chapter 6 (Feb. 2) 8.1. Summary. • A few administrative details. • Some more discrete random variables 8.2. Lecture. We begin chapter 6 of the textbook: discrete distribution functions. Most textbook questions about chapters 6 and 7 follow essentially the same format: • We give you the parameters for some (named) distribution function, • You plug these values into the PMF or PDF. You should be very comfortable with this type of question, we’ll do some in class. Another (slightly less common) class of questions have the format: • We tell you how to construct a random variable, • You prove that this random variable is really one our named random variables in disguise. We already did this for the normal distribution when we started using generating functions, and there are more examples in homework 2. Another (still less common) class of questions involves doing ‘composite’ calculations. These look like the first type of calculation, but they involve several different types of random variables. Again, an example of this is in the homework, and I’ll highlight an example next lecture. If you are comfortable with these three types of questions, you will be largely set for exams. The homework also has some ‘tricky’ questions, but these would only show up on exams if they came with many hints. 8.2.1. Uniform Distribution. One of the more popular distributions is the uniform distribution. A discrete (respectively continuous) uniform distribution is one with PMF (respectively PDF) that is uniform on its set. There isn’t much to say about them - calculations here are fairly easy! 8.2.2. Discrete Distribution Functions. Today, we’ll focus on discrete distributions that come from independent trials: • Bernoulli, Binomial and Multinomial. • Geometric and Negative Binomial. There are a lot of names here. I suspect that you’ve heard all of them, but have trouble remembering which is which. As such, we start with the underlying process, then recall which name refers to which questions. Note that all of our calculations are ‘really’ about sampling with replacement, and the formulas reflect that. The hypergeometric distribution deals with a related process, but this time without replacement. Definition 8.1 (Bernoulli Process). A Bernoulli Process is a sequence of i.i.d. random variables {Xi }i∈N so that P[Xi = 1] = 1 − P[Xi = 0] = p ∈ [0, 1]. Example 37. The canonical example, when p = 12 , is an infinite sequence of coin flips, with Xi = 1 if the i’th flip comes up heads. Note that this coin-flipping is really sampling with replacement from a set of two objects! 31 Dice rolls give examples for other values of p - e.g. p = if the i’th die comes up 1, 2, 3 or 4. 2 3 can be obtained by setting Xi = 1 Remark 8.2. We’ve snuck a definition in here. We know what it means for any finite collection of random variables to be independent, but not what it means for an infinite sequence to be independent. We define an infinite sequence like this to be independent if any finite subset of them is independent. 8.2.3. Binomial and Multinomial Distribution. Let {Xi }i∈N be a Bernoulli process. A binomial random variable with parameters (p, n) has the distribution of: X= n X Xi . i=1 We list some formulas for this. You should derive some of them yourself - none are terribly difficult, all are fair game for exams. n x pX (x) = p (1 − p)n−x x E[X] = np V ar[X] = np(1 − p). For practice, we calculate the probability generating function: Example 38. Recall that GX (z) = E[z X ] n X n x = p (1 − p)n−x z x . x x=0 Ok, this seems to be a problem. What should we do instead? One answer: remember the formula GX (z) = GX1 +...Xn (z) n Y = GXi (z) i=1 = (1 − p)z 0 + pz 1 )n = (1 − p) + pz)n . So, pretty easy in the end. We don’t discuss the multinomial distribution in class, as it is similar to the binomial distribution and should be review. Please refer to section 6.7 of the textbook for review. 32 8.2.4. Geometric and Negative Binomial Distributions. The binomial distribution counts the number of successes up to some specific time n. The negative binomial distribution counts the number of trials needed to get a specific number of successes k. That is, the binomial distribution involves fixing the amount of time and counting successes; the negative binomial distribution involves fixing the number of successes and counting time. More formally, let {Xi }i∈N and define n X T = min{n : Xi = k}. i=1 Then T has negative binomial distribution with parameters (p, k). The geometric random variable is a special name for a negative binomial distribution with k = 1. In this class, we will be talking about geometric random variables (and their continuous cousins, the exponential random variables) more than the negative binomial distribution itself. Some formulas for the negative binomial distribution include: n−1 k pX (x) = p (1 − p)n−k k−1 k E[X] = p k(1 − p) V ar[X] = . p2 Remark 8.3. Again, k = 1 is the important case here. One reason is that the exponential distribution has the memoryless property, which we will get back to in the near future. Example 39. At my local bus stop, the number of minutes that I must wait for a bus follows a geometric distribution with mean 12. I have early meetings every day this week, and want to make sure that with 95% probability, I am on the bus by 7 AM. What time should I show up to the bus station? 1 Since the mean is 12, we have p = 12 . Let X1 , . . . , X5 be the amounts of time that I spend waiting for the bus, and assume that I get to the bus station m minutes before 7 AM. Then I want: 0.95 ≈ P[max(X1 , . . . , X5 ) ≤ m] = P[X1 ≤ m]5 1 = (1 − (1 − )m )5 12 Thus, 0.99 = (1 − (1 − 1 m ) 12 log(0.01) m= log( 11 ) 12 ≈ 53. 0.01 = (1 − 33 1 m ) ) 12 Is this plausible? For the model: it is clear that buses don’t arrive according to a discrete distribution... but counting by the minute is surely ‘plausible.’ For the end result: the average waiting time is 12 minutes, so you need to be about 36 minutes early to be 95-percent sure on even a single day. Going from 36 to 53 minutes early doesn’t seem completely implausible. 8.2.5. Poisson Distribution. The Poisson distribution is very important in applied probability, and in particular shows up quite often in queuing theory. Roughly speaking, a Poisson distribution shows up when you are counting the number of successes of a very large number of experiments, each of which has a very low success rate. For example, it might be used to model the number of callers to a Microsoft help desk within a given 1-minute period. In this case, the number of experiments is the number of people using Microsoft products (a very large number!), but the probability of success is the probability that any given person will call given any particular minute (this is small). There is a natural question here: why use a Poisson random variable, rather than a binomial? There are many reasonable answers, but for now I’ll give the simplest: there are many random variables besides the binomial that ‘look like’ the Poisson distribution, and the Poisson is easiest to work with. More mathematically, the Poisson is ‘universal’ - if you are running many trials, and all of them are likely to fail, and they aren’t too dependent, the number of successes will be (approximately) Poisson. Definition 8.4 (Poisson Distribution). A Poisson random variable X associated with rate r and time interval T has PMF λx pX (x) = e−λ , x! where λ = rT . Example 40. Assume that a help desk receives, on average, 7 calls per hour. Let X be the number of calls that it receives between 9 AM and 9:10 AM, and assume that X has Poisson distribution. Then its parameter is λ = (7)( 16 ) ≈ 1.17. The probability that the help desk receives no calls at all during this period is 7 P[X = 0] = e− 6 ≈ 0.311. We mentioned that the Poisson distribution is often used as an approximation to the binomial. More formally, Definition 8.5. Let X be a Binomial random variable with parameters n, p. Then the Poisson approximation to X is a Poisson random variable with parameter λ = np. Example 41. In a certain widget factor, one percent of widgets are defective. I look at 200 widgets, chosen at random. Using the Poisson approximation, what is the probability that at least 4 of them are defective? Let X be Poisson with mean λ = (200)(0.01) = 2. Then P[X ≥ 4] = 1 − P[X = 0] − P[X = 1] − P[X = 2] − P[X = 3] ≈ 0.133. 34 Also, E[X] = 2 V ar[X] = 2. There is more on the Poisson approximation in the homework. Lets do two more complicated calculations to prepare - what I called ‘composite’ questions last class. Example 42. I send somebody to look for defective widgets at the same widget factory. While I want the person to sample 200 widgets, they are unreliable; instead, they sample Θ widgets, where Θ is a geometric random variable with mean 200. Let X be the number of defective widgets that they find, and calculate both E[X] and V ar[X] under the Poisson approximation. First, E[X] = E[E[X|Θ]] Θ = E[ ] 100 = 2. NOTE: This is the same expectation as we would have if exactly 200 had been sampled! However, V ar[X] = V ar[E[X|Θ]] + E[V ar[X|Θ]] Θ Θ = V ar[ ] + E[ ] 100 100 1 = 4(1 − )+2 200 ≈ 6. NOTE: So, the variance is quite a lot larger in this situation! Another ‘composite’ example. Example 43. I buy 12 lightbulbs and plug them in. Assume each lightbulb’s lifetime, in days, follows a geometric distribution with mean 50. Also assume that the lifetimes are independent. I will buy new lightbulbs once 7 have failed. What is the probability that I have bought new lightbulbs after 75 days? P Let X1 , . . . , Xn be the lifetimes of the lightbulbs, let Yi = 1Xi >75 , let Y = 12 i=1 Yi and let Z = 1 if I have bought new lightbulbs and 0 otherwise. We note that 1 E[Yi ] = P[Xi > 180] = (1 − )75 ≈ 0.220. 50 Since Y is Binomial(12, 0.133), we have P[Z = 1] = P[Y ≤ 5] ≈ 0.97. 35 9. Lecture 8: Start of Chapter 7 (Feb. 5) 9.1. Summary. • A few administrative details. • Some more continuous random variables. 9.2. Lecture. We begin chapter 7, a collection of continuous distribution functions. As with the discrete distribution functions, there are really three types of popular questions: • We give parameters for a continuous PDF; you plug them into a formula. • We construct several random variables, and ask you to prove some relationship between them (e.g. that a sum of i.i.d. normal random variables is normal). This often involves generating functions; it sometimes involves conditional distributions. • We give parameters for several continuous PDFs; you solve several type-1 questions in a row. As with the discrete random variables, we’ll do all three types in class, and there are examples of all three types in the homework. You should, of course, do more examples yourself! 9.2.1. Exponential Distribution. The exponential distribution is a close cousin to the geometric distribution. Definition 9.1 (Exponential Distribution). The PDF of an exponential distribution with parameter λ is fX (x) = λe−λx for x ≥ 0. We give some related formulas: FX (x) = 1 − e−λx 1 E[X] = λ 1 V ar[X] = 2 . λ The exponential distribution is very important to this course for two closely-related reasons. The first is the memoryless property: Theorem 44 (Memoryless Property). Let X be a random variable with exponential distribution and let s, t be two constants. Then P[X > s + t|X > t] = P[X > s]. Proof. P[X > s + t] P[X > t] −λ(s+t) e = e−λt −λs =e = P[X > s]. P[X > s + t|X > t] = 36 Remark 9.2. The exponential distribution is the only continuous distribution with this property. The geometric distribution, which we have already seen, is the only discrete distribution with this property. Remark 9.3. This is very surprising! Lets look at another random variable, with CDF FX (x) = 1 − e−x 2 for x ≥ 0. Then 2 e−(s+t) P[X > s + t|X > t] = −(t)2 e −s2 −2ts =e = e−2ts P[X > s]. Example 45. I am at a bus stop with 5 types of buses. The wait for each bus is exponentially distributed, with mean waiting times of 12,14,19,22 and 40 minutes. What is the distribution of the waiting time for the first bus? Let Xi be the time until a bus of type i arrives and let X = min(X1 , . . . , X5 ). Note that we have many options for calculating the distribution of X. However, the CDF seems to be by far the easiest, so we choose that. NOTE: Choosing what to calculate is by far the hardest part of this question - it shouldn’t be swept under the rug! Then P[X < s] = P[min(X1 , . . . , X5 ) ≤ s] = 1 − P[max(X1 , . . . , X5 ) > s] t Y =1− P[Xi > s] i=1 s s s s s − 12 − 14 − 19 − 22 − 40 =1−e e e = 1 − e−0.278s . e e NOTE: This is the CDF of an exponential distribution. Please look at pp. 140 of the textbook for more discussion of this example, and more practice on manipulating the exponential distribution. The other important property of the exponential distribution is its relationship to the Poisson distribution. The relationship is: Theorem 46. Let {Xi }i∈N be an infinite sequence of i.i.d. exponential random variables with mean λ and fix T > 0. Then Y = max{n : n X i=1 has Poisson distribution with mean T . λ That is a lot to process! Lets make it concrete. 37 Xi ≤ T } Example 47. Say we know that the time between calls at a help desk follow an exponential distribution, with mean λ. Then this theorem says that the number of calls to the help desk over a time period of length T has Poisson distribution with mean Tλ . Note that this does more-or-less go in the opposite direction, though the above theorem doesn’t imply this (and indeed we would have to be more careful to make this rigorous): Example 48. Say we know that the number of calls to a help desk follow a Poisson distribution, with rate λ. Then the time between calls has an exponential distribution with mean 1 . λ We do some calculations: Example 49. Assume that the waiting time for a bus is exponentially distributed with mean 12 minutes. Let A be the event that no buses show up for the first 20 minutes and let B the event that more than 2 buses show up in the first 20 minutes. Calculate P[A], P[B]. The first is straightforward: 20 P[A] = e− 12 ≈ 0.189. For the second, we need our theorem. Let X be a Poisson random variable with mean Then 20 . 12 P[B] = 1 − P[X ≤ 2] = 1 − P[X = 0] − P[X = 1] − P[X = 2] ≈ 0.234. NOTE: The ppois function in R is very helpful for calculating these very quickly! In this case, the above was evaluated with 1-ppois(2, 20/12) 9.2.2. Normal Distribution. This is probably the most important distribution in probability and statistics, and was likely the emphasis of your first course on the subject. It will be less critical in this course, but is still interesting. The main interest in the normal distribution is that it has a very important ‘universal’ property: if you add up a lot of little random variables, and none of them are large, and they aren’t too dependent, the sum ‘looks like’ a normal distribution. This is the content of the famous central limit theorem. Recall: Definition 9.4 (Normal Distribution). The normal distribution with mean µ and variance σ 2 has PDF (x−µ)2 1 fX (x) = √ e− 2σ2 σ 2π for −∞ < x < ∞. NOTE: It isn’t so obvious that this is actually a distribution function. Checking the normalizing constant is a difficult calculus exercise. Since this function is so difficult to integrate, most probability calculations involving the normal distribution are done via table look-ups. In order to use these tables, you need: Definition 9.5 (Standard Normal Distribution). The normal distribution with µ = 0, σ 2 = 1 is known as the standard normal distribution. Note that: 38 Theorem 50. If X has a normal distribution with mean µ and variance σ 2 , then X −µ Z= σ has standard normal distribution. Proof. This is a short exercise, either directly or with generating functions. So, check! Almost all lookup tables are written in terms of the standard normal distribution, so we will do an example of that here. Note that, if you use R, it will do the transformation for you, and so this is a little outdated. Example 51. Assume that the test scores in this class are normally distributed, with mean µ = 85 and variance σ 2 = 16. Let X be the score of a student chosen at random. What is P[X > 80]? X − 85 80 − 85 P[X > 80] = P[ > ] 4 4 = P[Z > −1.25] ≈ 0.894. We could look this up in a table. Alternatively, we could use R: the command is pnorm(-1.25) Note that we needed a negative sign! Alternatively, if we are using R, we can do the entire question at once: pnorm(-80, -85, 4) Just like we had a Poisson approximation to the binomial distribution, there is a normal approximation to the binomial distribution: Definition 9.6 (Normal Approximation). Let X1 , . . . , Xn be n i.i.d. Bernoulli random Pn variables with E[Xi ] = p, let S = i=1 Xi , let S − np S∗ = p , np(1 − p) and let Z be a standard normal random variable. The normal approximation to S is P[a ≤ S ∗ ≤ b] ≈ P[a ≤ Z ≤ b]. Remark 9.7. When do we use a normal approximation, and when do we use a Poisson approximation? Roughly, we use the normal approximation when n is very large and np is also very large. We use the Poisson approximation when n is very large but p is not large. This isn’t the focus of the current course, so I won’t say much more about it here. 9.2.3. Gamma Distribution. We introduce the Gamma distribution now as a place-holder, and will see how it shows up ‘in the real world’ next lecture. To give a small preview, we often use a Poisson distribution to model the number of ‘successes’ of a process that occurs over time - for example, the number of buses that arrive at a bus stop. We saw two classes ago that the time between successes of a Poisson process had exponential distribution. We will see that the time between successes of a more general type of process has Gamma distribution. 39 Before defining the Gamma distribution, we need: Definition 9.8 (Gamma Function). The gamma function is: Z ∞ Γ(α) = y α−1 e−y dy. 0 This has a number of nice properties: Γ(1) = 1 √ Γ(0.5) = π Γ(α) = (α − 1)Γ(α − 1), α>1 Γ(n) = (n − 1)!, n ∈ N. The first is easy and the third and fourth come from integration by parts. The second is a little tricky! Definition 9.9 (Gamma Distribution). The distribution with parameters (α, β) is: fX (x) = x 1 α−1 − β x e β α Γ(α) for x > 0. This is quite a confusing density! A few comments: • β is called the ‘scale’ parameter. Changing it doesn’t change the shape of the distribution, it just scales it vertically and horizontally. • α is called the ‘shape’ parameter. It radically changes the shape of the distribution, but is hard to understand otherwise. • α = 1 gives the exponential distribution, which we’ve seen. • α = n2 , β = 2 gives the χ2 distribution with n degrees of freedom, which is important in statistics. • The distribution seems impossible to integrate in general. However, when α is an integer, there is a magical formula for the CDF: FX (x) = 1 − α X ( βx )i i=0 i! x e− β . This is exactly a Poisson distribution. • The distribution has: E[X] = αβ V ar[X] = αβ 2 . Remark 9.10. I don’t expect you to memorize these identities, besides the expectation and variance. However, it is helpful to know that they exist. 40 9.2.4. Weibull Distribution. The Weibull distribution is another generalization of the exponential distribution. Like the Gamma distribution, the Weibull distribution is quite flexible. Unlike the Gamma distribution, it is fairly easy to do calculations involving the Weibull distribution. The main interest of the Weibull distribution is that, like the Poisson distribution and the normal distribution, the Weibull distribution is ‘universal.’ Recall that the number of successes of many independent, low-probability trials is (approximately) Poisson, while the sum of many independent, small random variables is (approximately) normal. There is a similar result for the Weibull: the maximum of many small, independent random variables is approximately Weibull. Definition 9.11 (Weibull Distribution). The Weibull distribution with parameters β, η has PDF: β−1 x β β x fX (x) = e−( η ) , η η for x ≥ 0 and β, η > 0. Remark 9.12. You will sometimes see the Weibull distribution written with a different number of parameters. I follow the textbook’s notation. The Weibull has some nice properties: • When β = 1, this is the exponential distribution. • For once, we can write down the CDF: x β FX (x) = 1 − e−( η ) . This is nice enough that it is worth remembering! It is just like the exponential random variable! • The moments are: E[X] = ηΓ(1 + β −1 ) V ar[X] = η 2 Γ(1 + 2β −1 ) − Γ(1 + β −1 )2 . A typical question might be: Example 52. X has Weibull distribution with parameters η = 2 and β = 4. What is P[X > 1]? We calculate: P[X > 1] = 1 − P[X ≤ 1] 1 4 = e−( 2 ) ≈ 0.939. As I mentioned, the Weibull distribution has some ‘universality’ property. This universality is a little more complicated than the central limit theorem, and I won’t give a precise statement - it is the subject of extreme value theory. However, we can easily check that Weibull distributions are stable: 41 Example 53. Assume X1 , . . . , Xn are independent Weibull random variables with parameters η, β. Then P[min(X1 , . . . , Xn ) ≥ s] = P[X1 ≥ s]n s β = e−n( η ) =e − s −1 ηn−β β Note that this is, itself, the CDF of a Weibull random variable. 42 . 10. Lecture 9: More of Chapter 7 (Feb. 9) 10.1. Summary. • A few administrative details. • Remember: Homework 2 is due at the start of next class! • Today, we discuss reliability modeling and introduce phase-type distributions. 10.2. Lecture. 10.2.1. Reliability Modeling. The Weibull distribution is often used in reliability modeling, and so the textbook introduces the two together. However, in principle, they are different topics: the Weibull distribution is just some special distribution, while the ideas related to realiability modeling can be used with any distribution. For now, we introduce reliability modeling, then relate it back to the Weibull distribution. There is no new mathematics in the reliability modeling that we’re discussing; the new content is in subject matter rather than mathematics. The point of reliability modeling is to study random variables that represent the lifetime of some object; frequently, this is either the lifetime of a human or the lifetime of a complicated manufactured system. The mathematical content consists of some formulas and definitions that provide some insight into these applications. The first are quite simple: Definition 10.1 (Reliability). The reliability of a random variable is RX (t) = 1 − FX (t). The survivor function is exactly the same function: SX (t) = 1 − FX (t). In the context of survival times, these are often more convenient to discuss than the CDF. The prototypical example is: Example 54. We consider a truck with 8 wheels. Let Xi be the time that the i’th wheel lasts, and let X = min(X1 , . . . , X8 ). Assume that Xi is Weibull with parameters µ, β. Then P[X > s] = RX (s) 8 Y = RXi (s) i=1 s β = e−8( η ) − =e s −1 η8−β β . Remark 10.2. Note that we can make some heuristic comments here! Note that: • If we take the minimum of n Weibull random variables with the same parameters −1 β, η, the new random variable has parameters β, ηn−β . • Recall that a Weibull distribution has expectation E[X] = ηΓ(1 + β −1 ). 43 • Combining these, we see that the expected value of the minimum of n Weibull variables looks like −1 E[X] = n−β ηΓ(1 + β −1 ). Thus, if β 1 is small, the expectation drops very quickly; if β 1 is large, the expectation is insensitive to n. This gives a bit of interpretability to β, which looks quite mysterious: the larger the value of β, the less the lifetime ‘fluctuates.’ It even gives an exact tradeoff between the number of systems n and the ‘reliability’ β, if you happen to have control over both. Another typical example is: Example 55. I have 8 batteries with me, and my flashlight will fail once all 8 have broken. Let Xi be the failure time of the i’th battery, and let X = max(X1 , . . . , X8 ). Assume Xi is exponentially distributed with mean µ. Then FX (s) = 8 Y FXi (s) i=1 s = (1 − e− µ )8 . Remark 10.3. It turns out that this maximum is harder to deal with than the minimum, and there is only a ‘nice’ formula for the exponential distribution. When the number of components is n, the maximum satisfies: 1 1 E[max(X1 , . . . , Xn )] = E[X1 ](1 + + . . . + ) 2 n ≈ E[Xi ] log(n). Although this formula was computed for the exponential distribution, it turns out that the general principle that the maximum ‘looks like log(n)’ holds for many distributions. The next definition is less familiar: Definition 10.4 (Hazard Rate). For a continuous random variable X, the hazard rate is given by: fX (t) hX (t) = . RX (t) As the textbook points out, this is a measure of the chance that a device is ‘about to fail’ if it has survived until time t. More formally, the following calculation holds for ‘nice’ PDF’s: P[t < X ≤ t + h] lim P[X ≤ t + h|X > t] = lim h→0 h→0 P[t < X] R t+h fX (s)ds fX (t) = lim t = . h→0 RX (t) RX (t) Example 56. If X has exponential distribution with mean λ1 , hX (t) = λe−λt = λ. e−λt 44 The fact that the hazard rate is constant is essentially a restatement of the memoryless property - the probability that a device is ‘about to fail’ given that it has survived until time t doesn’t depend on t. The book has an interesting calculation that shows how hazard rates apply to minimums of random variables: Example 57 (7.16 from textbook). We consider a collection of i.i.d. continuous random variables X1 , . . . , Xn and let X = max(X1 , . . . , Xn ). In the jargon of reliability modeling, we are considering n objects ‘in parallel,’ just as with the battery example above. We recall that FX (t) = FX1 (t)n . Since FX1 (t) FX1 (t)n →n 1, this tells us something rather obvious: the more objects you put in parallel, the longer the circuit will last. Let’s see what the hazard function looks like: fX (t) hX (t) = 1 − FX (t) nFX1 (t)n−1 fX1 (t) = 1 − FX1 (t)n ! fX1 (t) nFX1 (t)n−1 = Pn−1 i 1 − FX1 (t) i=0 FX1 (t) ! nFX (t)n−1 = hX1 (t) Pn−11 . i F (t) X 1 i=0 As t goes to infinity, all of the terms in the parentheses go to 1. Thus, for t large, hX (t) lim = 1. t→∞ hX1 (t) This tells us: conditional on having survived until a very large time t, the hazard rate for n objects in parallel is very close to the hazard rate for a single object, regardless of the details of the hazard function (though, if you look at the calculation carefully, it is always at least a little smaller). Why is this obvious? We can consider: what is the probability that two or more objects have survived for a large time t, conditional on at least one having survived? The answer is that this probability certainly goes to 0. Thus, for very large times t, conditioning on at least one object having survived is very close to conditioning on exactly one object having survived. Finally, we give one last definition and a few easy formulae: Definition 10.5 (Cumulative Hazard). The cumulative hazard of a random variable is given by Z t HX (t) = hx (s)ds. 0 We relate some of the new objects we have: d hX (t) = − log(RX (t)) dt 45 RX (t) = e− Rt 0 hX (s)ds = e−HX (t) . This last formula is very important for the standard textbook problems, many of which give you a hazard function and ask you to calculate probabilities. For example, Example 58 (7.5.8 from textbook). The hazard rate of a random variable X is given by 1 t h(t) = e 4 , 5 t > 0. What are the cumulative hazard function and reliability function for X? What is P[X > 2]? We directly calculate Z t Z 1 t s 4 t H(t) = h(s)ds = e 4 ds = e 4 . 5 0 5 0 Thus, 4 t R(t) = e−H(t) = e− 5 e 4 , and so P[X > 2] = R(2) ≈ 0.267. 10.2.2. Phase-Type Distributions: The Goal. Today, we will be talking about phase-type distributions. These are important for a few reasons. The first is that they give our first introduction to Markov chains and Queues, which are central to this course. However, that is not the emphasis in the textbook. Instead, phase type distributions are motivated as a computational tool. That is, one reason to study the phase type distribution is as a way to approximate complicated integrals. This is quite different from anything that we’ve talked about in this course, which has mostly focused on distributions as models. Since the textbook doesn’t spend much time discussing this viewpoint in this chapter, it is worth taking a detour to talk about computation and probability. We’ll talk about this in the context of an important problem in applied probability that is (largely) out of the scope of this course. Let’s say you are trying to price an option - for example, you want to charge a client a fee F0 , in exchange for which the client has a right to buy a stock for a certain amount eS0 at a certain time T in the future. In this case, your client would gain money if eST > eS0 + F ; otherwise, you would gain money. The ‘classic’ approach to this problem is, roughly, to assume St+1 − St are i.i.d. normal random variables. If you make this careful and do appropriate transformation, this leads to the Black-Scholes formula. In the real world, however, St+1 − St is not normal. It may even depend a great deal on t - for example, there might be a lawsuit pending, and there might be uncertainty as to when the lawsuit is to be resolved. One approach is to try to build a very complicated model, like St+1 − St = A max(B, C, eD log(E)) + 46 1 , 1+Q where each of the random variables A, B, . . . might themselves be complicated - some might correspond to court cases, or patents, or weather, or anything else. While this model might be realistic, it is very hard to calculate things like P[St+1 −St < 1] - that is some huge integral! Phase-type distributions are built to get around this issue. The central observation is that it is very easy to do complicated calculations with the exponential distribution. Thus, if we could replace each ‘part’ of the above expression with a ‘bunch’ of exponential random variables, we would be able to combine them, and thus actually be able to do the calculation. It is generally not possible to replace a complicated distribution with a bunch of exponential random variables. However, it turns out to be possible to get an arbitrarily good approximation using them. Thus, we replace each ‘part’ of a complicated distribution with a phase-type distribution that approximates it, and then we do calculations with those phase-type distributions. The study of phase-type distributions started before we had modern computers - one of the main theorems in the subject dates back to 1955. This leads to a natural question: is this approach still useful, now that we have fast computers? The answer turns out to be yes. Even though we can do more complicated integrals now, it turns out that doing high-dimensional integrals is very, very hard, even on a computer. The study of how to approximate complicated distributions with simpler and more tractable distributions is still very much alive, and there are good reasons to believe that it will remain an important subject. 10.2.3. Erlang Distribution. So, we’re going to build a big collection of random variables out of exponential distribution, with the goal of being able to approximate complicated distributions. To do this, we need two things that we didn’t need in the rest of chapters 6 and 7: • Some way(s) to check that the random variables we’re constructing are ‘really’ different, and • Some way to keep track of all of our indices. We will measure the first by looking at the relationships between the moments of our distributions, and at their Laplace transforms. We will eventually do the second by building a bunch of matrices and vectors, but initially we will do this using a bunch of pictures. The pictures aren’t necessary, but they are quite useful for most of us - and they become even more helpful as we continue our study of Markov chains in the rest of the course. The first picture is: This picture represents the following process: somebody enters a queue on the left, waits in the circle for a random amount of time that is exponential with mean λ, and then leaves 47 on the right. The associated phase-type random variable is the amount of time that they wait, which is of course exponential with mean λ. So far, we have one parameter’s worth of random variables, and we can get: E[X] = λ V ar[X] = λ2 p V ar[X] cv (X) ≡ =1 E[X] 1 , L(s) = 1 + λs Wehre cv (X) is callted the coefficient of variation. Once we’ve drawn this, the obvious next picture is: Here, we have somebody comes in from the left, waits an amount of time X1 ∼ exp(λ), then waits an amount of time X2 ∼ exp(λ), then exits to the right. So, the phase-type distribution is given by: X = X 1 + X2 . This is called a Erlang-2 distribution. We can calculate for this: E[X] = 2λ V ar[X] = 2λ2 1 cv (X) = √ 2 L(s) = 1 . (1 + λs)2 So, we get something new! In particular, we get a second value for the coefficient of variation. Once we have two things in a line, the obvious thing to do is put many things in a line: Here, we fix r, let X1 , . . .P , Xr be a sequence of r i.i.d. exponential random variables, each with mean λ. We set X = ri=1 Xi . The distribution of X is called Erlang-r. We can calculate for this: E[X] = rλ V ar[X] = rλ2 48 1 cv (X) = √ r 1 . (1 + λs)r So, we get many new things... but we certainly don’t get everything. Our coefficients of variation all look like √1r for r an integer. Similarly, all of the Laplace transforms look like powers of monic polynomials - we certainly don’t get all Laplace transforms, or anything like all of them. So, we continue to define more general phase-type distributions. L(s) = Remark 10.6. The textbook has explicit CDF’s for the Erlang-r distributions. You should be able to use these formulas, but we won’t discuss how to derive them. 49 11. Lecture 10: Remainder of Chapter 7 (Feb. 12) 11.1. Summary. • A few administrative details. • Remember: Homework 2 is due at the start of class! • Today, we finish our discussion of phase-type distributions. 11.1.1. Hypoexponential Distribution. So far, our diagrams are sequences of circles, each with the same number in them. We’re going to make the diagrams more complicated soon, but first we just change the number inside each circle: We do the obvious thing: let Xi ∼ exp(λi ), and let X = hypoexponential distribution. We can calculate for this: r X E[X] = λi Pr i=1 Xi . This is called a i=1 V ar[X] = r X λ2i i=1 Pr λ2 cv (X) = Pri=1 i 2 ( i=1 λi ) 1 . L(s) = Qr i=1 (1 + λi s) So, we get a little more than before. In particular, we could always get any expectation, but now we seem to be able to get just about any possible coefficient of variation that is less than 1... or at least there aren’t any such values that are obviously impossible to obtain. 2 11.1.2. Hyperexponential Distributions. For the hyperexponential distribution, we use graphs that aren’t lines! It actually isn’t so clear what that should mean, so first we describe something a bit easier: Definition (Mixture). Consider a collection of PDF’s f1 , f2 , . . . , fn and fix 0 ≤ α1 , . . . , αn P11.1 n satisfying i=1 αi = 1. Then n X f (s) = αi fi (s) i=1 is also a PDF. It is referred to as a mixture of the densities f1 , . . . , fn . It is sometimes just called a mixture distribution as shorthand. Remark 11.2. Mixture distributions are generally easiest to understand algorithmically. To sample X from a mixture distribution (e.g. on a computer), do the following: • Sample a number 1 ≤ i ≤ n according to the PMF fY (i) = αi . 50 • Sample X from fi . The hyperexponential distributions are just mixtures of exponential distributions. The associated diagram is: To read this diagram: we go in on the left. With probability α1 , we take the top branch and wait X1 ∼ exp(λ1 ) before exiting to the right; with probability α2 = 1 − α1 , we take the top branch and wait X2 ∼ exp(λ2 ) before exiting to the right. In general, the diagram has r branches (draw on board); we take branch i with probability αi and then wait an amount of time Xi ∼ exp(λi ). We make similar calculations to before: E[X] = r X α i λi i=1 r X E[X 2 ] = 2 αi λ2i i=1 P 2 ri=1 αi λ2i cv (X) = Pr −1 ( i=1 αi λi )2 r X αi L(s) = . 1 + λi s i=1 2 It isn’t obvious on sight, but Cv (X)2 ≥ 1. This is new ground! It also isn’t obvious - see pp. 166 of the textbook for proof. The obvious next step looks like: 51 That’s basically true, though there is another named distribution that is a little less general than this, called the Coxian or extended Erlang distribution: This can be written in the form of the previous figure (see figure 7.14 of the textbook). Continuing our standard calculations, it is easy to check: E[X] = r X i=1 λi i−1 Y αj . j=1 Unfortunately, the others are more annoying to write down. If you are interested, none of them are ‘hard’ to write down, and many of these calculations are on pp. 167 of the textbook. However, they are all fairly tedious, and they are all complicated enough to be difficult to interpret. In a sense, that is a good thing: remember that we’re trying to approximate general random variables using phase-type random variables, so these calculations had better get messy! 11.1.3. General Phase-Type Distributions. Finally, we write down general phase-type distributions. The pictures associated with these general distributions have: • Some collection of arrows entering from the left. • Some collection of circles with parameters inside, and with arrows between them. • Some collection of arrows exiting to the right. 52 Note that, unlike the pictures we’ve seen so far, we do not insist that all of the arrows between circles move to the right. Here’s a typical example: To interpret this sort of diagram, we run the following process: • At time 0, pick one of the arrows entering from the left, according to some distribution on those arrows. • Every time you enter a circle with the number λ in it, wait time exp(λ). Then choose one of the outgoing arrows, according to the numbers on those arrows. • When you finally exit on the right-hand side, record the final time. At this point, we’ll move from the ‘picture’ representation of phase-type distributions to a more algebraic representation. The parameters of a phase-type distribution are: • The number n of ‘circles,’ or states, in the diagram. • The inverse-means µ1i of the holding time within each state. • A distribution {σi }ni=1 of entrance probabilities. The initial entry is given by sampling from this distribution. This is denoted by the length-n P vector σ. • The routing probabilities {rij }1≤i,j≤n , which satisfy j rij ≤ 1 for all 1 ≤ i ≤ n. The number rij gives the probability of going next to state j after waiting in state i. This is denoted by the n by n matrix R. P • The terminal probabilities {ηi }ni=1 , which satisfy ηi + j rij = 1 and maxi ηi > 0. The number ηi gives the probability of immediately exiting to the right after leaving state i. This is denoted by the length-n vector η. There is an equivalent description, which basically merges {rij }1≤i,j≤n and {ηi }ni=1 into one object. Heuristically, we just add a new state, called the graveyard state, all the way at the right. Rather than counting time until we ‘exit’ the diagram, we count the time until we get to the graveyard state. This doesn’t change any of the actual distributions, it just slightly changes the notation and diagrams! To compare our ‘original’ diagrams to a diagram with a graveyard: 53 Writing the graveyard as a ∆ is a common convention, which I will follow. We also often merge some information, defining ri∆ = ηi , rδi = 0, and qij = µi rij , X qii = − qij . j 6= i, j6=i We write the associated matrix as Q. This changes the algebraic representation a little bit: • The number n of states is as before. • The distribution {σi }ni=1 of entrance probabilities is as before. • The routing and terminal probabilities {rij }1≤i,j≤n and {ηi }ni=1 , as well as the inverse mean holding times {µi }ni=1 , get smooshed into one object, the transition matrix Q. Example 59. We draw the diagram for n = 5, σi = i, rij = 21 1|i−j|=2 for i 6= 5 and r5j = 14 1|i−j|=2 , ηi = 21 1i=5 . This looks like a 5-sided star, with one arrow to the right: 54 We follow the textbook’s example by defining the n by n matrix S and the n by 1 matrix S 0 via: S S0 Q= 0 0 Define e to the be the column vector of all 1’s. There turn out to be reasonably nice formulas for these general diagrams! Let X be the amount of time it takes to get to the graveyard state: FX (x) = 1 − σeSx e E[X j ] = (−1)j j!σS −j e. So, doing these calculations involves inverting matrices. This can be tedious, and there are more places to make mistakes... but it isn’t so difficult. Here is a standard (simple) question: Example 60. Consider a network with σ = (0.2, 0.2, 0.3, 0.3) and Q= −2 1 0 0 1 1 −3 1 0 1 0 1 −2.5 1 0.5 1 0 0 −2 1 0 0 0 0 0 . We then have S −1 −0.65 −0.25 = −0.1 −0.05 −0.3 −0.25 −0.325 −0.5 −0.25 −0.125 −0.2 −0.5 −0.05 −0.1 −0.25 −0.525 . 55 NOTE: You need to be able to invert a small matrix, since a small example of this is likely to be on an exam. However, when doing problems at home, feel free to use the R command Solve(S) We then have E[X] = −σS −1 e = 0.95. NOTE: Again, matrix multiplication can be done in R, though you will need to be able to do this on exams. 11.1.4. Fitting Phase-Type Distributions. Section 7.6.7 of the textbook discusses matching phase-type distributions to data. This isn’t the emphasis of this course, so I just mention that it exists! The coverage in the textbook is short; just enough to say that fitting is possible, but without any discussion as to when it is a good idea. 56 12. Lecture 11: Chapter 8 (Feb. 23) 12.1. Summary. • A few administrative details. • Today, we discuss theorems. 12.2. Lecture. This isn’t a very theoretical course, but today will be devoted to theory. That means approximating things, taking limits, and so on. 12.2.1. Markov’s Inequality. We know that, if a ≤ X ≤ b, a ≤ E[X] ≤ b as well. This says that, if we know about the values X can take, we know something about its expectation. Markov’s inequality provides a more interesting bound in the other direction: if we know something about the expected value of X, we learn something about the values that X is likely to take: Theorem 61 (Markov’s Inequality). Let X be a random variable and h a nondecreasing, nonnegative function. Then P[X ≥ t] ≤ E[h(X)] . h(t) How do we use this? Here is a silly application: Example 62. Let X be the height, in inches, of a randomly chosen Canadian. This is obviously nonnegative, and E[X] ≈ 69. Then 69 ≈ 0.48. 144 This is obviously right, but obviously not very useful. This is because we aren’t using very much information; the expectation doesn’t tell us a vast amount about possible values far from the expectation. P[X > 144] ≤ We’ve seen that Markov’s inequality can be quite lose. Is it always this bad? The answer is no: the expectation simply doesn’t say very much. Example 63. Fix r ≥ 1, and let X be 1 with probability 1 r and 0 otherwise. Then 1 1 E[X] = r + 0(1 − ) = 1. r r We then note that 1 E[X] = . r r Thus, Markov’s inequality is actually an equality in this case. In math jargon, we say that the inequality is tight. This means that you can’t improve the inequality without adding some assumptions. P[X ≥ r] = There are many special cases of Markov’s inequality (that is, choices of the function h) that have their own names. The first is 57 Theorem 64 (Chebyshev’s Inequality). Let X be a random variable with mean µ and variance σ 2 . Then σ2 P[|X − µ| > s] ≤ 2 . s This is just a special case of Markov’s inequality. Heuristically, this is much better because 2 X grows much more quickly than X when X is large. Thus, if E[X 2 ] is not much larger than E[X]2 , X must be small most of the time. Let’s look at heights: Example 65. Human height X has mean 69 inches and variance 16. Thus, we have 16 ≈ 0.00284. P[X > 144] ≤ P[|X − 69| > 75] ≤ 5625 This is still an overestimate - it is certainly not true that 2 in a thousand people are over 12 feet tall. But it is much more reasonable than the previous estimate! Again, Chebyshev’s inequality is sharp by itself. The last ‘famous’ application of Markov’s inequality involves the choice h(x) = eθx for some θ > 0. Theorem 66 (Chernoff Bound). Let X be a random variable with moment generating function MX (θ) = E[eθX ]. Then P[X > s] ≤ e−θs MX (θ). This one is a little harder to understand, since most of us don’t have a good intuitive understanding of how big a moment generating function should be. The basic idea, though, is similar to the idea behind using Chebyshev’s inequality over the standard Markov’s inequality: just as X 2 grows more quickly than X, eθX grows more quickly than X 2 . Thus, we can expect Chernoff’s inequality to give us better bounds for s ‘moderately large.’ Example 67. Assume that human height X has E[eX ] ≤ 8 × 1046 (NOTE: we should be a little bit sceptical about using this estimate. It is based on the tallest recorded human, but that is probably not really good enough to write down this sort of bound. We’ll continue for now, as doing inference about the tails of distributions is far outside the scope of this course). Then P[X > 144] ≤ e−144 × 8 × 1046 ≈ 2 × 10−16 . This is much, much smaller than what we had in the previous examples, even though 1046 is a huge number. (NOTE: It is hard to evaluate this number. There have been about 1012 people in the world, and nobody has ever been confirmed at a height of close to 12 feet. Again, this is meant just as probability, not statistics). 12.2.2. Law of Large Numbers. We know that, if you flip a coin many, many, times, the percentage of heads will eventually be close to 21 . The law of large numbers is a way to turn this into math. Here is a first way to do so: Theorem 68 (Weak Law of Large Numbers). P Let {Xi }i∈N be a sequence of i.i.d. random variables with E[X 2 ] < ∞ and define Sn = n1 ni=1 Xi . Then for any > 0, lim P[|Sn − E[X]| > ] = 0. n→∞ 58 Proof. Fix > 0. By Chebyshev’s inequality, V ar[Sn ] 2 V ar[X1 ] = → 0. 2 n P[|Sn − E[X]| > ] ≤ Remark 12.1. We don’t quite need E[X 2 ] < ∞ for the conclusion to be true. Ok, so this is called the weak law of large numbers. To say where this word comes from, we talk a little bit about what we want the law of large numbers to say, and what the weak law of large numbers actually says. We’ll turn this into math soon, but the problem is easy to understand. When considering coin-flipping, the WLLN says something like: fix = 0.1; then for n very large, the probability that the proportion of heads is more than 0.6 or less than 0.4 is very small. We want to be able to say that the sequence Sn actually converges to 12 , but the WLLN doesn’t tell us this. For example, it can’t exclude the possibility that Sn careens back and forth between 0 and 1, just spending an increasing percentage of its time near 0.5. Here’s an illustration of what we would like Sn to look like as n increases, next to a picture that the WLLN doesn’t exclude: Ok, the second picture is obviously crazy - it doesn’t happen. So, there should be a better LLN. Let’s make this careful: Definition 12.2 (Weak Convergence of Random Variables). Let {Xi }ni=1 be a sequence of random variables and let c ∈ R. We say that {Xi }ni=1 converges to c weakly if, for all > 0, lim P[|Xn − c| > ] = 0. n→∞ Thus, the WLLN can be restated as: Theorem 69 (WLLN, Redux). PnLet {Xi }i∈N be a sequence of i.i.d. random variables with 1 2 E[X ] < ∞ and define Sn = n i=1 Xi . Then {Sn }n∈N converges weakly to E[X]. We contrast weak convergence with another form of convergence: Definition 12.3 (Almost Sure Convergence). Let {Xi }ni=1 be a sequence of random variables and let c ∈ R. We say that {Xi }ni=1 converges to c almost surely if P[ lim Xn = c] = 1. n→∞ Remark 12.4. The phrase ‘almost surely’ is some math jargon that we don’t investigate in this course. The definition above will be good enough for us, even if the phrase is a little mysterious. 59 Remark 12.5. Note: this really does make sense, based on our naive notion of a limit! The Xn ’s are a sequence of functions from a state space Ω, and so we can check if limn→∞ Xn (ω) = c for ‘almost all’ ω ∈ Ω. How are these definitions related? Unsurprisingly, weak convergence is less desirable than almost sure convergence. In particular, the latter implies the former: Theorem 70. If {Xn }n∈N converges to c almost surely, it also converges to c weakly. But the former does not imply the latter! Example 71 (Weak but not A.S. Convergence). Define the sequence of independent random variables {Xn }n∈N by having Xn = 21 with probability 1 − n2 , equal to 0 with probability n1 , and equal to 1 with probability n1 . It is easy to check that, for any > 0, 2 1 lim P[|Xn − | > ] = lim = 0, n→∞ n→∞ 2 n so the sequence converges weakly to 0. On the other hand, if it converged a.s. to 12 , there would have to be some step τ so that Xt = 12 for all t > τ . It turns out that this happens with probability 0! There are some related constructions. Define Pn {τn }n∈N to be a sequence of geometric random n variables, with means 2 , define bn = i=1 τi , and define B = {bi }i∈N . We then set Yn = 1n∈B . It is easy to check that Yn converges to 0 weakly, but not almost surely. These examples are exactly about the ‘bad picture’ I was trying to avoid. So it should be no surprise that: Theorem 72 (Strong Law of Large Numbers).P Let {Xi }i∈N be a sequence of i.i.d. random variables with E[X 2 ] < ∞ and define Sn = n1 ni=1 Xi . Then {Sn }n∈N converges to E[X] almost surely. 12.2.3. Central Limit Theorem. This is probably the most famous theorem in statistics. The idea is: the law of large numbers tell us that sample averages converge to ‘true’ means; the central limit theorem tells us how quickly this happens. We’ll see that taking that claim too seriously can be a little misleading, but this is the goal. Theorem 73 (Central Limit Theorem). Let {Xi }i∈N be a sequence of iid random variables with E[X12 ] < ∞ and denote by Φ(x) the CDF of the standard Normal random variable. Then, for all x, Pn (Xi − E[Xi ]) p lim P[ i=1 ≤ x] = Φ(x). n→∞ nV ar[X1 ] Lots of comments on this! Remark 12.6. The formulation in the textbook is a little bit dangerous. It mentions that you don’t need the Xi to be i.i.d. for something that looks like the CLT to hold, but doesn’t really say what replaces this requirement. It is easy to check that you need some condition i - for example, the conclusion can’t hold for Xi ∼ exp(22 ). The short answer is that, for n large, the sum can’t be dominated by a small number of terms. To see a slightly more specific answer, see the Lindeberg CLT. A huge amount of probability theory is devoted to proving things that look like the CLT, so there are very, very many more specific answers. 60 Remark 12.7. The LLN said that |Sn − E[X1 ]| → 0. The CLT refines this, suggesting that |Sn − E[X1 ]| ≈ √σn → 0. Note that the CLT implies the weak LLN, but does not imply the strong LLN. There are stronger theorems that incorporate both. Remark 12.8. The CLT is a limit theorem - it doesn’t say anything for finite values of n, and it doesn’t guarantee P convergence for all x simultaneously. For example, you cannot use the CLT to calculate P[ ni=1 Xi > 10]; you can’t even get a rigorous estimate! Nonetheless, it is popular to pretend that the CLT is really an equality for n ‘reasonably large.’ This is often fairly safe. If you are interested in justifying this sort of thing, a first step is the Berry-Esseen theorem. Remark 12.9. We won’t quite prove the CLT in this class, but we’ll give an outline of the ‘standard’Pproof. Let MX (θ) be the MGF for X1 , and let MSn (θ) be the MGF for Sn = n √ 1 i=1 (Xi − E[Xi ]). By our rules for MGF’s, we have: V ar[X]n θ MSn (θ) = (MX ( p ))n V ar[X]n 3 θ 1 θ2 = (MX (0) + p + O(n− 2 ))n MX0 (0) + 2 V ar[X]n V ar[X]n 3 1 θ2 + O(n− 2 ))n = (1 + 2n θ2 →e2. But this last object is exactly the MGF of the normal distribution! So, our MGF’s converge to the right thing. The above calculation really does work; so why isn’t it a proof of the CLT? We know that, if two random variables have the same MGF, they must have the same distribution. Here, we have only shown that the MGF’s of a sequence of random variables converge to an MGF. To prove the CLT, we need to prove that convergence of the sequence of MGFs implies convergence of the sequence of random variables. As we have seen already with the LLN, the convergence of random variables can be a bit subtle, but everything turns out to work in this case. Remark 12.10. As you have probably seen, not every random variable has E[X 2 ] < ∞, and so the CLT doesn’t apply to every random variable. It turns out that, for all 21 ≤ α ≤ 1, there are stable distributions F so that i.i.d. sequences {Xi }i∈N ∼ F satisfy n X −α n Xi ∼ X1 . i=1 1 , 2 For α = this stable distribution is the normal distribution. For α = 1, the stable distribution is called the Cauchy distribution; it has PDF 1 fX (x) = . 2 π(x + 1) The standard application of the CLT in statistics classes is as follows: Example 74. I flip a fair coin 500 times, and record the number of heads X. What is (approximately) P[X > 270]? 61 Denote by Z a standard normal random variable. Then X − 250 20 P[X > 270] = P[ q >q ] 500 4 500 4 X − 250 = P[ q > 1.79] 500 4 ≈ P[Z > 1.79] ≈ 0.037. 62 13. Lecture 12: Midterm Exam Review (Feb. 26) 13.1. Summary. • A few administrative details. • Midterm is next class! Today is devoted to review. 13.2. Lecture. Today is devoted to review for the midterm next class, on March 2. We don’t have time to go over every question that might show up. Instead, I will describe the different types of questions that might show up. Midterm overview: • The midterm covers chapters 1 to 8 of the textbook (i.e. everything we have covered up to and including last class). There are no programming questions. It is closed book, with a ‘cheat sheet.’ The cheat sheet is on my website. • The midterm has 2 ‘long-answer’ questions and 8 multiple choice questions. The longanswer questions are computational, are involved but not at all tricky, and focus on the later parts of the course (e.g. chapters 6/7). The 8 multiple choice questions cover the 8 chapters evenly. All are fairly short; most are straightforward, but a few are ‘tricky.’ • My feeling is that (some of) the homework is more difficult than the textbook questions. The computational questions on the midterm are meant to be of similar difficulty to the textbook questions; the tricky questions require little work but might require a first step that isn’t obvious. • In terms of material covered: I suggest that you make sure that you are very comfortable with the solutions to all three homework sets and to the examples in this lecture. I also suggest that you do several practice problems from the textbook and check your answers. Doing these things should get you a decent mark. The last few marks (e.g. 1-3 parts of questions) might be a little trickier, though again some version of the tricks have been mentioned in class or in homework at some point. • Some general study advice: – Look over the cheat sheet, and make sure that it makes sense to you. There is one typo for the expected value of Weibull distribution. I may not have time to correct this before the exam, and will not be able to correct it if you are not writing the exam in class. – It is probably not worthwhile to memorize formulas related to distributions. It is probably worthwhile to memorize formulas related to fundamental objects (e.g. linearity of expectation, product formula for generating functions) and conditional distributions (e.g. Bayes’ rule, law of total variance, etc). The difference is that it is hard to think about conditional distributions unless you largely have the formulas memorized. – For the computational questions, doing (lots of) practice should be enough. The one ‘funny’ thing that pops up would be what I call ‘backwards’ versions of questions. For example, a ‘standard’ question is: X is Poisson with E[X] = 3; what is P[X = 0]? The answer is to note that a Poisson random variable with mean 3 has λ = 3, and so P[X = 0] = e−3 ≈ 0.05. A ‘backwards’ version of the question might be: X is Poisson and P[X = 0] = 0.05; what is E[X]? In this case, we write 0.05 = e−λ , and so λ ≈ 3, and so E[X] = λ ≈ 3. These questions 63 are both referring to the same random variable, and involve doing essentially the same calculations. You just need to be aware that these questions are fair game. – For the tricky questions, my main advice is to not be afraid! For most of these questions, there is a calculation you would like to do but are unable to do. The right choice is almost always to write down the calculation that you want to do, and then make a simple observation. Here is a prototypical example, from Ch. 1. Let {Bi }4i=1 be a partition of the sample space Ω, and assume that P[A|Bi ] = 10i . Is it possible that P[A] ≥ 12 ? P We want to write P[A] = 4i=1 P[A|Bi ]P[Bi ], but we don’t know P[Bi ]. Fortunately, we don’t need it to answer the question - we are averaging a bunch of numbers that are less than 12 , so the average must be less than 12 . This principle should get you through all of the tricky questions! – There may be 1-2 questions that are really checking to see if you know a definition (e.g. what a mixture of distributions is), an idea we’ve been using (e.g. the diagrams for phase-type distributions), or a technique (e.g. using a Gaussian or Poisson approximation). These are, of course, things that aren’t on the cheat sheet. Now we’ll do practice problems from each chapter. Since you all probably know the first few chapters well, I start at the end: 13.2.1. Chapter 8. There are really only three types of questions here: • Plug something into Markov’s inequality. E.g. E[X 2 ] = 12; is it possible that P[X > 12 200] > 0.5? No, since P[|X| > 200] = P[X 2 ≥ 40, 000] ≤ 40000 0.5. • Do you understand types of convergence? • Use the normal approximation suggested by the central limit theorem; then do a ‘chapter 7’-type question. 13.2.2. Chapter 7. There are a small number of significant question types: • We give you a parameter and ask you to calculate an expectation or probability. Eg. X has exponential distribution with mean 2; calculate E[X 3 ] and P[X < 1]. These questions basically involve plugging into a formula; you need to practice, but there isn’t much to say. • A backwards version of this question: we give you an expectation or a probability and ask you to calculate a parameter. This is only possible either for one-parameter families of distribution, or if we give some of the parameters. E.g. X has normal distribution with mean 0, and P[−9.1 ≤ X ≤ 9.1] = 0.95. What is V ar[X]? To answer this, look up in a table that p P[−1.96 ≤ Z ≤ 1.96] = 0.95 for the standard 9.1 normal distribution; thus, 1.96 = V ar[X]. For other distributions, you might have a formula for the probability in the setup of the question rather than using a table lookup. • Compound questions, for which one essentially must do two of these questions together. • Qualitative questions. For example, can a hypoexponential distribution ever have larger variance than an exponential distribution with the same mean? To answer this question, let X be hypoexponential with means λ1 , . . . , λr and let Y be exponential 64 with mean λ. Then V ar[Y ] = λ2 = E[Y ]2 , r r X X 2 V ar[X] = λi < ( λi )2 = E[X]2 . i=1 i=1 NOTE: I consider this to be a fairly tricky question! The first of these question types is the most common. For these questions, the hardest part is often recognizing what distribution to use. For this reason, I suggest reading over many questions, even those that you will not complete. Example 75. X has exponential distribution with mean 1. What is P[X > 4]? P[X > 4] = e−4 ≈ 0.018. It is worthwhile to pay special attention to calculations involving the phase-type distributions. They are new for most of you, the calculations are more difficult, and they are very closely related to what we will study later in the term. These three things also mean that they are likely to show up on exams! You are also expected to be familiar with the way we have been representing phase-type distributions, including both the diagrams and the various matrices and vectors. Let’s look at a ‘typical’ question: Example 76. When people call a helpline at a bank, they must wait an initial amount of time that is exponential with mean 10 minutes to get through the initial menu. Twenty percent of callers must then talk to a representative; this takes an amount of time that is exponential with mean 40 minutes. Ten percent of callers who talked to a representative must then talk to a manager; this takes an amount of time that is exponential with mean 80 minutes. Let S be the total amount of time taken. Draw the phase-type diagram associated with S and also calculate E[S]. Let X1 , X2 and X3 be exponential distributions with means 10, 40 and 80 respectively. Let Y2 and Y3 be Bernoulli random variables with means 0.2 and 0.1 respectively. To calculate E[S], we have E[S] = E[X1 + Y2 X2 + Y2 Y3 X3 ] = E[X1 ] + E[X2 ]E[Y2 ] + E[Y2 ]E[Y3 ]E[X3 ] = 10 + (0.2)(40) + (0.2)(0.1)(80) = 19.6. Diagram goes here in class. NOTE: Calculating variance is also fair game! I omit it only for time. 13.3. Chapter 6. Although there are many distributions here, the questions associated with chapter 6 are almost identical to those associated with chapter 7. My only tips are: • Try to remember which is which among binomial, negative binomial, hypergeometric, and Poisson. • The Poisson approximation is introduced in chapter 6. 65 Example 77. X has Poisson distribution, and P[X = 0] = 0.1. What is E[X]? Looking at the PMF, we have 0.1 = P[X = 0] = e−λ , so λ ≈ 2.3. Thus, E[X] ≈ 2.3. 13.3.1. Chapter 5. There are a surprisingly large number of reasonable questions for chapter 5. Some straightforward questions: • A distribution and a function are given, and you are asked to calculate the expectation or conditional expectation. This is ‘just’ calculus; it might be difficult, and you should practice, but there is no probability content. • As above, but with a generating function. Some harder computational questions involve conditioning. You should review question 3 in homework set 2 for a simple version of this. Here is another: Example 78. Say N is exponential with mean 5, and X is binomial with p = What is V ar[X]? Use the formula 1 2 and n = N . V ar[X] = E[V ar[X|N ]] + V ar[E[X|N ]] N N = E[ ] + V ar[ ] 4 2 5 25 = + . 4 2 This chapter also involves some questions that aren’t very computational: • Proving relationships using generating functions. See question 2 of homework 2. • Proving inequalities. Sometimes these are very simple (e.g. min(X) ≤ E[X] ≤ max(X), or V ar[X] ≥ E[V ar[X|Y ]]). Sometimes they require more thought (e.g. question 1 of homework 2). Any on an exam will be straightforward - so, if you find yourself doing a long calculation, know that there is a way to avoid it! • Reading coefficients off of generating functions (see homework set 3). 13.3.2. Chapter 4. It is possible to ask some very difficult questions based on material from chapter 4. However, the basic questions involve giving you a few pieces of information and asking you to use some of the very many formulas that show up in this chapter. The following is about as difficult as such a ‘standard’ question gets: Example 79. fX,Y (x, y) = (2x+5y) , 0 ≤ x ≤ 5, 0 ≤ y ≤ 2. Find fX|Y (x|y). 100 To do this question, the obvious first step is to write: fX|Y (x|y) = fX,Y (x, y) . fY (y) You don’t know fY (y), so you calculate it: Z 5 2x + 5y 1 1+y fY (y) = dx = (25 + 25y) = . 100 100 4 0 66 Plugging in, fX|Y (x|y) = 2x + 5y . 25(1 + y) There are really three things to make sure you know how to do for questions in this chapter: • Be familiar with the formulas (so that, e.g., the previous example is second nature to you). • Be familiar with calculus (I won’t ask anything mean here). • Be able to read ‘word problems’ and recognize distributions. These will often be either distributions from chapter 6/7 or counting problems from chapter 2. Any ‘tricky’ question in chapter 4 will be of the type discussed at the start of class - some piece of information needed to complete a calculation will be missing, but enough will be there that you can answer the question. 13.3.3. Chapter 3. There are very few questions related to chapter 3: • Standard definition questions (e.g. a CDF is given to you in chart form, and you are asked to find P[X ≤ 12]). Include such an example P in class. P • Tricky definition questions (e.g. recognizing that x pX (x) = 1, or that 0 ≤ x∈A pX (x) ≤ 1). For example, We might say: P[A] = 0.7, P[B] = 0.5. Is it possible that A, B are mutually exclusive? • Questions that are ‘really’ chapter-1 questions (e.g. using Bayes’ rule). 13.3.4. Chapter 2. The biggest difficulty in chapter 2 is recognizing what situation you are in! The following example is one that many people find hard: Example 80. A box contains 5 white balls and 3 black balls. Two different balls are chosen at random. What is the probability that they are the same colour? P[same colour] = P[W W ] + P[BB] 5 3 = 2 8 2 + 2 8 2 ≈ 0.464. 13.3.5. Chapter 1. The questions here are essentially the same as those in chapter 3. Here is a typical question: Example 81. Let X1 , X2 , X3 be independent Bernoulli random variables with success probabilities 0.9, 0.8, 0.5. Let X = X1 + X2 + X3 and Y = (X1 , X2 , X3 ). What is P[X = 1]? P[X = 1] = P[{Y = (1, 0, 0)} ∪ {Y = (0, 1, 0)} ∪ {Y = (0, 0, 1)}] = P[X1 = 1, X2 = 0, X3 = 0] + P[X1 = 0, X2 = 1, X3 = 0] + P[X1 = 0, X2 = 0, X3 = 1] = P[X1 = 1]P[X2 = 0]P[X3 = 0] + P[X1 = 0]P[X2 = 1]P[X3 = 0] + P[X1 = 0]P[X2 = 0]P[X3 = 1] = (0.9)(0.2)(0.5) + (0.1)(0.8)(0.5) + (0.1)(0.2)(0.5) = 0.14. Note that we are following a fairly standard machine here, one which you follow for essentially all ‘standard’ questions. The machine looks like: 67 (1) Write the event you’re interested in as a complicated intersection/union of other events. , and more complicated ones (2) Get rid of conditional probabilities via P[A|B] = P[A∩B] P[B] via Bayes’ rule. (3) Get rid of unions via P[A ∪ B] = P[A] + P[B] − P[A ∩ B]. (4) Get rid of intersections by independence or mutual exclusion assumptions. (5) If everything is elementary, you’re done! If you have intersections of events that aren’t independent or mutually exclusive, go back to step 1. Another typical question is: Example 82. Box 1 has a fair die in it, and box 2 has a die with 6’s printed on all faces. A die is picked at random from one of the boxes and rolled; it lands with a 6 face up. What is the probability that it was picked from the second box? P[B = 2|F = 6] = = P[F = 6|B = 2]P[B = 2] P[F = 6|B = 2]P[B = 2] + P[F = 6|B = 1]P[B = 1] 1 2 1 2 + 1 12 = 6 ≈ 0.86. 7 Tricky questions involve being unable to use this machine. I like the following example from the textbook: Example 83. We have 1 P[A|B] = P[B] = P[A ∪ B] = . 2 Are A, B independent? This looks scary - it isn’t clear what the first step should be. It is possible to answer this question by thinking ahead and deciding what you want to calculate, but that might be too much to expect on an exam. Let’s see if we can come up with an algorithm for this type of question. Like all tricky questions, we first write down what we want: P[A ∩ B]? =?P[A]P[B]. One of these terms, P[B] = 12 , we already know. Lets look at the others. First, P[A ∩ B] =??? We don’t have many formulas here. Some prospects are: P[A ∩ B] = P[A] + P[B] − P[A ∪ B] P[A ∩ B] = P[B|A]P[A] P[A ∩ B] = P[A|B]P[B]. The first two look hopeless, but the last looks good! We have 1 P[A ∩ B] = . 4 68 Finally, we try to calculate P[A]. I only know two formulas with P[A] by itself: P[A ∩ B] P[B|A] = P[A] P[A ∪ B] = P[A] + P[B] − P[A ∩ B] The first we can’t use, but the second involves only terms we’ve calculated. We find: 1 1 1 = P[A] + − 2 2 4 1 P[A] = . 4 Thus, we have: 1 P[A ∩ B] = 4 11 P[A]P[B] = 6= P[A ∩ B]. 42 69 14. Lecture 13: Midterm Exam (Mar. 2) Good luck! 70 15. Lecture 14: Start of Chapter 9: Markov Chains (Mar. 5) 15.1. Summary. • A few administrative details. • Brief midterm recap. • We start studying Markov chains. 15.2. Lecture. 15.2.1. Midterm Recap. The midterm was graded out of 36. The mean was 25.27, the median was 29. In terms of individual questions, the number of people getting the multiple-choice questions wrong were: 1:6 2 : 16 3 : 10 4:7 5 : 13 6:5 7:9 8 : 6. This struck me as reasonably uniform. Question 2 seemed to be the most difficult. This was expected (although a nearly-identical question was in Homework 3). Question 5 was the next-most difficult; it was intended to be less straightforward. All other questions were, basically, done well. The midterm basically worked as I had hoped: roughly 75 percent of the questions were straightforward and roughly 25 percent were somewhat tricky. The result was an exam that most students did fairly well at. Since it worked fairly well, I intend to have a similar structure for the final exam and to run review in a fairly similar way. If that worked for you: great! If not, I hope to hear from you on what we can do differently. 15.2.2. Introduction. Today, we begin studying Markov chains, one of the main new subjects in this course. The first 2-3 weeks of this will be fairly abstract, so we take a moment to say what they are about. Concretely, most Markov chains we encounter will look like the drawings of phase-type distributions we’ve been making. There will be a bunch of states with arrows between them, and we imagine following the arrows in some random way. Rather than merely studying the exit time, however, we’ll study many other properties of this process and consider variants and extensions of this idea. In general, Markov chains describe a system that evolves over time, with the important ‘memoryless’ property that the next state depends only on the current state, never the distant past. They are described mathematically by square matrices with nonnegative entries (or sequences of matrices, or matrix-like objects), and there is a lot of overlap between the theory of Markov chains, applied linear algebra, and partial differential equations. Quite 71 often, the ‘same’ object will be interpreted through all of these lenses. We give a few places that Markov chains pop up in the sciences and computation: (1) Markov chains as models. • Statistical physics and material science: we would like to model how molecules bounce around in a box, or otherwise interact. The states of the Markov chain include all information about all of the particles being tracked - for example, a list of their coordinates. The ‘randomness’ here corresponds to details of the system that aren’t included in the model - e.g. the exact orientation of molecules. I don’t think it is obvious a priori why these models should have the memoryless property, but it turns out to be a good idea. These models are extremely popular and can be used to make detailed calculations. • Queues: Consider the number of people in a queue - for example, waiting on the phone for technical support. In many situations, the evolution of this number turns out to be quite closely modeled by a Markov chain. • Language processing: the sequences of characters, or pairs of characters, or words, etc. can be modeled as a Markov chain. This idea, and sophisticated variants, have been very useful ways to understand speach and text. Unlike the previous examples, these models are ludicrous - it is easy to tell apart real text and a simple Markov chain model. Nonetheless, as you will see in homework, the simple models are very useful. They are also easier to work with than many otherwise-comparable models. (2) Markov chains as computational tools. • The most famous use of Markov chains in computation is as a way to combine rankings into one global ranking. This is the basis of the famous PageRank algorithm developed by/for Google. The idea is to build a Markov chain out of linked websites, and analyze the chain’s behaviour. The same approach has been used successfully in many other settings. • In many situations, it turns out to be much easier to design a Markov chain to sample from a distribution than it is to sample from the distribution directly. The prototypical example here is again statistical physics: the easiest way to simulate the ‘typical’ arrangement of a bunch of molecules is just to simulate the system for a very long time. The same idea turns out to be useful in many surprising places, and Markov chains are an important tool for doing simulations. We now get started with a sequence of definitions: 15.2.3. Definitions. Definition 15.1 (Stochastic Process). Fix some set T , known as an index set. A stochastic process is a collection of random variables {Xt }t∈T . If T ⊂ N, {Xt }t∈T is known as a discretetime stochastic process. If T = (a, b) ⊂ R, it is known as a continuous-time stochastic process. The values (or range) of the random variables are often called states. NOTE: Unlike previous random variables, we may not identify states with elements of R. That is fine! Definition 15.2 (Stationary Process). A stochastic process {Xt }t∈T is stationary if, for all n ∈ N, all x1 , . . . , xn R, all t1 , . . . , tn ∈ T , and all α ∈ R so that t1 + α, . . . , tn + α ∈ T , we 72 have P[Xt1 ≤ x1 , . . . , Xtn ≤ xn ] = P[Xt1 +α ≤ x1 , . . . , Xtn +α ≤ xn ]. Stationary processes ‘look the same’ at all time points. This doesn’t mean that they are iid: Example 84. Let X0 be Bernoulli with success probability 12 . Then set Xt+1 = 1 − Xt . We have that {Xt }t∈N is stationary, but there is only one ‘bit’ of randomness. Remark 15.3. Despite what the book claims here, we are certainly interested in nonstationary processes. We don’t yet have the notation to describe homogeneous and inhomogeneous stochastic processes. Roughly, the former correspond to processes that evolve according to a fixed diagram, the latter according to a different diagram at every time. Definition 15.4 (Discrete-Time Markov Chain (DTMC)). A DTMC is a stochastic process {Xt }t∈T with index set T = N that satisfies P[Xt+1 = xt+1 |Xt = xt , . . . , X1 = x1 ] = P[Xt+1 = xt+1 |Xt = xt ] for all t ∈ N and all x1 , . . . , xt+1 . We will generally assume that our Markov chains have domains that are subsets of N and write: Definition 15.5 (Transition Matrix). Define pij (n) = P[Xn+1 = j|Xn = i] to be the single-step transition probabilities of a Markov chain {Xt }t∈N . Denote by P (n) the matrix with entries pij (n), and call this the transition probability matrix. NOTE: This may not really be a ‘square matrix,’ as we allow it to have infinitely many rows and columns. We will spend quite a bit of time thinking about this sort of infinity. We notice that: • For all i, j, n, P0 ≤ pij (n) ≤ 1. • For all i, n, j pij (n) = 1. Any matrix with these properties is called a Markov matrix or transition matrix, and defines the transition probabilities of some Markov chain. Definition 15.6 (Time-Homogeneous Markov Chain). A Markov chain is time-homogeneous if its transition matrix P (n) does not depend on n. Otherwise, it is called time-homogeneous. Example 85 (Simple Markov Chain Calculations). Consider the Markov chain with transition matrix 0.7 0.2 0.1 P = 0.5 0.1 0.4 0 0.5 0.5 The associated picture is: 73 Let’s consider a Markov chain {Xt }t∈N started at X1 = 1, evolving according to the transition matrix P . We can read off immediately: P[X2 = 1|X1 = 1] = P [1, 1] = 0.7. We can also easily calculate the probabilities of any sample path as follows: P[(X1 , X2 , X3 , X4 ) = (1, 2, 3, 3)] = P[X4 = 3|(X1 , X2 , X3 ) = (1, 2, 3)]P[X3 = 3|(X1 , X2 ) = (1, 2)]P[X2 = 2|X1 = 1] = P[X4 = 3|X3 = 3]P[X3 = 3|X2 = 2]P[X2 = 2|X1 = 1] = P [3, 3]P [2, 3]P [1, 2] = (0.5)(0.4)(0.2) = 0.04. Lets go further. What about just calculating P[X3 = 1]? We make the following calculation, which will be ubiquitous: P[X3 = 1] = P[X3 = 1, X2 = 1|X1 = 1] + P[X3 = 1, X2 = 2|X1 = 1] + P[X3 = 1, X2 = 3|X1 = 1] = P[X3 = 1|X2 = 1, X1 = 1]P[X2 = 1|X1 = 1] + P[X3 = 1|X2 = 2, X1 = 1]P[X2 = 2|X1 = 1] + P[X3 = 1|X2 = 3, X1 = 1]P[X2 = 3|X1 = 1] = P[X3 = 1|X2 = 1]P[X2 = 1|X1 = 1] + P[X3 = 1|X2 = 2]P[X2 = 2|X1 = 1] + P[X3 = 1|X2 = 3]P[X2 = 3|X1 = 1] = P [1, 1]P [1, 1] + P [2, 1]P [1, 2] + P [3, 1]P [1, 3] = (0.7)(0.7) + (0.2)(0.5) + (0)(0.1) = 0.59. In general, we will often find ourselves writing for s < t < u: X P[Xu = i|Xs = j] = P[Xu = i|Xt = k]P[Xt = k|Xs = j]. k What about P[X4 = 1] or P[X5 = 1]? We could obviously continue as above, working out these probabilities one index at a time. However, we will see that we can be much more efficient. We won’t see nearly as many inhomogeneous Markov chains, but here is one: Example 86. Define P (n) = . 1− 1 n 74 1 n 1 n 1− 1 n Assume X1 = 1. Then P[(X1 , X2 , X3 , X4 , X5 ) = (1, 2, 1, 1, 1)] = p12 (1)p21 (2)p11 (3)p11 (4) 1 1 2 3 = (1)( )( )( ) = . 2 3 4 4 Example 87 (Finite-Memory Processes). Let {Xt }t∈N be a stochastic process with range Ω. Assume that, for some k ∈ N, it satisfies: P[Xt+1 = xt+1 |Xt = xt , . . . , X1 = x1 ] = P[Xt+1 = xt+1 |Xt = xt , . . . , Xt−k+1 = xt−k+1 ]. If k = 1, this would be a Markov chain. For k > 1, it is not quite a Markov chain. These processes are often called k-dependant processes, and they can easily be converted into Markov chains in the following way. Define: Yt = (Xt , Xt+1 , . . . , Xt+k−1 ). Then {Yt }t∈N really is a Markov chain. This is a very general trick, and it allows us to use our Markov chain techniques to study things that don’t (initially) look like Markov chains. Note: this is a simple but important trick, and it is easy to put on an exam. 15.2.4. Sojourn Times and Embedded Chains. So far, our Markov chains have been discretetime, even though the phase-type distributions involved waiting times that were continuous. We give some definitions that will be used later on to bridge some of that gap: Definition 15.7 (Sojourn Time). Consider a Markov chain with Xt = i, Xt−1 6= i. Then define the sojourn time or holding time of state i to be the random variable: R = R(i, t) ≡ max{s ∈ N : Xt+s−1 = i}. We note that, if Xt is time-homogenous, we can write P[R = s] = P[Xt = i, . . . , Xt+s−1 = i; Xt+s 6= i] = ps−1 ii (1 − pii ). This is the PDF of a geometric random variable with parameter p = pii . And then: Definition 15.8 (Embedded Chain). Let {Xt }t∈N be a time-homogeneous DTMC. Then define τ0 = 0, and τs+1 = min{t > τs : Xt 6= Xτs }. Finally, let Y s = X τs . This is the embedded chain for {Xt }t∈N . Some observations: • If pii < ∞ for all i, this definition makes sense. Otherwise, the embedded chain might not be defined for all indices. • If the embedded chain is defined, it is a Markov chain. Actually, we can do better. Let P be the transition matrix for Xt . Then the transition matrix for Yt is given by: pˆii = 0 pˆij = pij . 1 − pii 75 16. Lecture 15: More Chapter 9: Linear Algebra Review and Classification of States (Mar. 9) 16.1. Summary. • A few administrative details. • We continue building the theory of Markov chains. 16.2. Lecture. Recall that a time-homogeneous discrete-time Markov chain is a stochastic process {Xt }t≥0 , whose law is determined completely by the distribution of X0 and the transition ‘matrix’ P . Recall that the embedded chain has transition matrix Pˆ given by: pij pˆij = 1 − pii pˆii = 0. Remark 16.1. Implicitly, we will generally talk about Markov chains that are discrete-time and which have a finite number of states {1, 2, . . . , n}. The textbook refers to these as finite Markov chains. Example 88 (Simple Markov Chain Calculations, Redux). Recall our previous Markov chain: 0.7 0.2 0.1 P = 0.5 0.1 0.4 0 0.5 0.5 The transition matrix for the embedded chain is: 0 32 31 P = 59 0 94 0 1 0 We can do all of our favorite calculations here. 16.2.1. Stationary Distributions and the Chapman-Kolmogorov Equations. So far, we can calculate the probability that a Markov chain follows a given path for a short period of time, and also calculate the probability that it is in a specific state at a specific time. However, for most of our applications, we are interested in the long-run behaviour of the Markov chain. For example, we might want to know: • How much time, on average, does the chain spend in state i? • Does the chain return to state i infinitely often? • Does the chain ever get stuck in some state? These questions are hard to answer based on the types of calculations that we know how to do, and so we begin to develop a theory to answer them. Let P be the transition matrix for a time-homogeneous DTMC Xt . We recall that X P[X3 = i|X1 = j] = P[X3 = i|X2 = k, X1 = j]P[X2 = k|Xi = j] k = X P[X3 = i|X2 = k]P[X2 = k|Xi = j] k 76 = X pki pjk . k But this formula looks pretty familiar. It is the j, i entry of P 2 ! By induction on t, it is easy to check that P[Xt = i|X1 = j] = e†i P t−1 ej , (16.1) where ej is the vector with a ‘1’ in entry j and a ‘0’ everywhere else. We will often write pij = e†j P m ei (m) for the m-step transition probabilities of a Markov chain. We note that the associated matrix, P m , really is the transition matrix for a Markov chain. Indeed, it is the transition matrix for the Markov chain Yt ≡ Xmt . Equation (16.1) tells us that lim P[Xt = i] = lim e†i P t−1 ej . t→∞ t→∞ (16.2) What does this look like? We can calculate P t directly, but it is a bit of a nightmare for large t. Instead, we’ll review the way that you were taught to calculate powers of matrices in linear algebra class, and then we’ll notice that we don’t quite need to calculate everything. Example 89 (Review of Powers). We consider the matrix 0.7 0.2 0.1 P = 0.5 0.1 0.4 0 0.5 0.5 , and try to calculate limt→∞ P t if that exists. We have a machine for doing this calculation: (1) First, we find the eigenvectors and eigenvalues of P . The usual way to do this is: (a) Calculate the characteristic polynomial of P , given by f (λ) = det(P − λId). (b) Find the roots λ1 , λ2 , λ3 of f . (c) Find vi satisfying the equation P vi = λi vi for 1 ≤ i ≤ 3. This step changes a little if there is multiplicity in the roots of f , but we ignore that for now. For now, we just use R. Recall that our commands are: P = matrix(c(0.7,0.2,0.1,0.5,0.1,0.4,0,0.5,0.5), nrow = 3) eigen(P) R will spit out: $values [1] 1.0000000 0.5405125 -0.2405125 $vectors [,1] [,2] [,3] [1,] -0.7407611 -0.7992229 0.4337809 [2,] -0.4444566 0.2549321 -0.8159527 [3,] -0.5037175 0.5442907 0.3821718 This means that the eigenvalues are λ1 = 1,λ2 ≈ 0.5405 and λ3 ≈ −0.24. The eigenvectors are v1 = (0.741, 0.444, 0.504) and so on. 77 (2) Next, we write P = QDQ−1 , where Q = [v1 v2 v3 ] and λ1 0 0 D = 0 λ2 0 0 0 λ3 (3) Our formula for P now has a magical property, which was the whole reason for doing this: P t = (QDQ−1 )t = QDt Q−1 . Unlike P , which is complicated, it is easy to take powers of D. They are: t λ1 0 0 Dt = 0 λt2 0 0 0 λt3 . (4) Finally, we take limits: lim P t = Q lim Dt Q−1 . t→∞ t→∞ But these limits are either 0 or 1 if they exist! In our case, we have 1 0 0 0 D = 0 0.54 0 0 −0.24 , so 1 0 0 lim Dt = 0 0 0 t→∞ 0 0 0 . We will see that this isn’t an accident - any ‘reasonable’ Markov chain will have this as the limiting matrix, even in high dimensions. (5) We conclude: lim P[Xt = i] = lim e†i P t−1 ej t→∞ t→∞ v1 [i] =P . j v1 [j] In this case, lim P[Xt = 1] = 0.439 t→∞ lim P[Xt = 2] = 0.263 t→∞ lim P[Xt = 3] = 0.298. t→∞ 78 So, we have seen: • It is straightforward to calculate the probability P[(X1 = x1 , . . . , Xt = xt )] of a path. • It is annoying, but possible, to calculate the probability P[Xt = x] directly for t small. • For t large, we can use linear algebra to calculate the probability P[Xt = x]. 16.2.2. Chapman-Kolmogorov Equations. We recall: Definition 16.2 (m-step transition matrix). For a time-homogeneous DTMC, define (m) pij = P[Xn+m = j|Xn = i]. We denote the associated matrix by P (m) = P m . This is analogous to the usual single-step (1) transition matrix pij = pij . We then have the following expression, called the Chapman-Kolmogorov Equations: X (`) (m−`) (m) pij = pik pkj . k This holds for all 0 < ` < m. In matrix form, this looks much nicer: P m = P ` P m−` . We can use these formulae to calculate transition probabilities. Assume that our Markov chain starts out according to the distribution π = π (1) . That is, we write P[X1 = i] = π (1) (i). Then define π (m) (i) = P[Xm = i]. We then have, for time-homogeneous DTMC’s, that π (m) = π (1) P m−1 . For time-inhomogeneous DTMC’s, we have the slightly uglier formula: π (m) = π (1) P (1)P (2) . . . P (m − 1). We then have, as before, lim π (n) = π (1) lim P n . n→∞ n→∞ Remark 16.3. This limit may not exist! We’ve already seen a few hints as to when it exists - it depends on the eigenvalues and eigenvectors of P . We’ve also seen that, at least in one example, it actually doesn’t depend on π (1) . This should make some sense after it has been pointed out: you are jumping around a plot, and eventually you ‘forget’ where you started. We’ll make all of this more precise soon. Remark 16.4. The book stops keeping track of time-inhomogeneous Markov chains about here, restricting their attention to time-homogeneous Markov chains for the most part. I’ll do the same, and we’ll assume that chains are time-homogeneous unless otherwise stated. 79 16.2.3. Classification of States. The Markov chains we saw when studying phase-type distributions all had the following property: we started on the left-hand side of the graph, bounced around for a while, and then entered a final graveyard state and stayed there forever. We’ll introduce some vocabulary based on this, and discuss the different types of behaviours that Markov chains can have. First, a picture: We make some simple observations about a Markov chain starting in state 1: (1) The Markov chain will leave state 1 immediately. States that can only be occupied for a fixed, finite number of steps are called ephemeral. Note: this word is not used very often. (2) The Markov chain will bounce between states 2 and 3 for some period of time, and there is no deterministic upper bound on the number of steps that the chain can spend in states 2 and 3. However, it will eventually leave them, and will never come back. States that can be occupied for any number of steps, but which are occupied in total for only finitely many steps, are called transient. Note: any ephemeral state is also transient, but the reverse is not true. (3) The Markov chain may eventually reach states 4 and 5. If it does, it will bounce between them forever, reaching both infinitely many times. States that may be reached infinitely many times are called recurrent. Furthermore, the time in between visits to state 4 is always exactly 2. States for which these return times are always (multiples of ) some number k 6= 1 are called periodic with period k. More generally, recurrent states with return times that have finite mean are called positive recurrent; other recurrent states are called null recurrent. Note: since our stochastic process is a time-homogeneous Markov chain, the return time I’ve just described is an honest-togoodness random variable with a specific distribution. At some point, you should make sure that you understand why this is true, though we will also discuss it more soon. Note: in Markov chains with finite state spaces, all recurrent chains are positive recurrent. For infinite Markov chains, the two ideas may not coincide. (4) If the Markov chain ever reachest state 6, it stays there forever. Such states are called absorbing states. This is the easiest state to understand: a state i is absorbing if and only if pii = 1. All of these definitions are really ways to understand return times, so let us begin by making that notion more precise: 80 Definition 16.5 (Return Time). Assume X0 = i and define the return time τ by τ = min{t > 0 : X0 = i}. Following the textbook, we define (n) fii = P[τ = n]. Note: In general, we have (n) (n) P[τ = n] = fii ≤ pii = P[Xn = i]. The RHS is the probability that Xn = i; the LHS is the probability that Xn = i and Xs 6= i for all 0 < s < n. We now try to calculate these return time probabilities, using similar tricks to our compu(n) (k) tation of pij with the Chapman-Kolmogorov equations. We break up the probability pii in terms of the first return time to i. This looks like: (0) pii = 1 (1) (1) (1) (0) (2) (2) (1) (1) (0) (3) (3) (2) (1) (1) (2) pii = fii = fii pii pii = fii + fii pii + fii (0) pii = fii + fii pii + fii pii + fii , and in general (k) pii = k X (j) (k−j) fii pii . j=1 We can use all of these equations together to compute (k) fii (k) pii = − k−1 X (j) (k−j) fii pii . j=1 We then define fii = ∞ X (k) fii . k=1 (k) (k) Remark 16.6. Warning! pii and fii are very similar, but pii and fii are very different! Definition 16.7 (Recurrent State). A state i is recurrent if fii = 1. Remark 16.8. Recall that fii = ∞ X k=1 (k) fii = ∞ X P[τ = k] = P[τ < ∞]. k=1 Thus, i is recurrent if and only if the probability of return to that state is one. We now relate the condition fii = 1 to other functions we understand. We note that, if P[τ < ∞] = 1, we must have that in fact Xt returns to i infinitely often. 81 Remark 16.9. This is a special property of time-homogeneous Markov chains! The idea is that, whenever the chain returns to i, you imagine that you are starting the chain all over again. This ‘restarting,’ or ‘memoryless,’ property is exactly what makes Markov chains special. Thus, we have ∞ X E[ 1Xt =i ] = E[|{t : Xt = i}|] = ∞. t=0 But we can rewrite this! ∞ = E[ ∞ X 1Xt =i ] = t=0 ∞ X (t) pii . t=0 We have: Theorem 90. If a state i is recurrent, then P∞ t=0 (t) pii = ∞. We have the related definition: Definition 16.10 (Transient State). A state i is transient if it is not recurrent. We note that the pdf of the number of returns to i that a Markov chain ever has is (n−1) f (n) = (1 − fii )fjj . 1 . fii In particular, the mean number of This is exactly a geometric distribution with mean returns is finite! This lets us upgrade our previous theorem: P (t) Theorem 91. A state i is recurrent if and only if ∞ t=0 pii = ∞. Remark 16.11. We now have two different but equivalent ways of checking if a state is recurrent. 82 17. Lecture 16: More Chapter 9: Classification and Special Matrices. (Mar. 12) 17.1. Summary. • A few administrative details. • More on Markov chains 17.2. Lecture. 17.2.1. General note: I will not spend quite as much time repeating notation associated with Markov chains as I have been. Please ask if there are any questions as to whether a theorem applies, or what a piece of notation means. In general, however, there is quite a bit of notation associated with Markov chains and you will have to memorize most of it for the lectures (and questions) to make sense. 17.2.2. More classification. Last class, we discussed recurrent and transient states. We now introduce a slightly strange distinction between types of transient states. Define: Definition 17.1 (Mean Recurrence Time). For a recurrent state i, the mean recurrence time is ∞ X (n) nfii . Mii ≡ n=1 and Definition 17.2 (Null and Positive Recurrence). A recurrent state is: • Null recurrent if Mii = ∞. • Positive recurrent if Mii < ∞. All of the recurrent states we have seen so far were positive recurrent. Are there any null recurrent states? Before we check, I mention the related idea of passage times. For a chain Xt started at X0 = i, define (temporarily) τ = inf{t > 0 : Xt = j}, and let (n) fij = P[τ = n]. By the same sorts of calculations we did last class, we have n X (n) (`) (n−`) pij = fij pjj `=1 (n) fij = (n) pij − n−1 X (`) (n−`) fij pjj . `=1 We recall that P is the matrix with entries pij , and define F to be the matrix with entries P (n) fij . When fij ≡ ∞ n=1 fij = 1, define the mean first passage time to be Mij = ∞ X n=1 83 (n) nfij . The above tells us that Mij satisfies the equation X Mij = 1 + pik Mkj . k6=j This is quite similar to the Chapman-Kolmogorov equation in spirit: both say that we can decompose the first time that something happens in terms of intermediate events. Section 9.4 of the textbook gives a discussion on how to change this into linear algebra. I just mention that we can rewrite the equation as linear algebra as M = E + P (M − Diag(M )), † where E = ee and Diag(M )ij = Mij 1i=j (that is, the matrix that has the same diagonal entries as M and 0’s everywhere else). We now find an example of a chain with states that are null recurrent but not positive recurrent. Example 92. Define a transition matrix by pi,i+1 = pi,i = 1 2 for i ∈ Z. We want to show that 0 is null recurrent. First, recurrence. We have: 1 (2n) −2n 2n p0,0 = 2 ≈√ , n πn where the “≈” here is based on Stirling’s formula (which would certainly be provided in an exam). Thus, X (n) 1 X 1 √ = ∞. p0,0 ≈ √ π n n n By our theorem giving equivalent conditions for a state to be recurrent, this implies that 0 is recurrent for this Markov chain. On the other hand, I claim that M00 = ∞. To see this, let Mij be the expected time to get from state i to state j for the first time. By symmetry, we have all of the following relationships: 1 1 M0,0 = M0,1 + M−1,0 2 2 1 1 M0,1 = M−1,1 + 2 2 M−1,1 = 2M0,1 . Putting together the last two, we have M−1,1 = M−1,1 + 1. This can’t be true for any finite value of M−1,1 , so M−1,1 = ∞. Plugging this into the first expression, we have M0,0 = ∞ as well. The following theorem tells us that there are no ‘really simple’ null-recurrent states: Theorem 93. Let Xt be a finite time-homogeneous DTMC. Then • No state is null-recurrent. 84 • Not all states are transient. Proof. We sketch the arguments. For the first: if a state j is recurrent, then for all states i (n ) there exists ni < ∞, pi > 0 so that pij i > pi . Thus, the mean recurrence time is at most maxi ni maxi p1i . For the second: if all states were transient, the Markov chain would only spend a finite amount of time in each state. But then the Markov chain would have nowhere to go... We give one last classification of states: (n) Definition 17.3 (Periodic). Let I = {n : pii > 0} and let p be the greatest common divisor of I. Then i is called periodic with period p. If p = 1, it is also called aperiodic. Remark 17.4. Please read section 9.4. It is fairly important, and we have spent a lot of time on it, but we have skipped several details. Example 94 (Based on 9.4.3 of textbook). Consider the matrix 0.1 0.4 0.5 P = 0.5 0.2 0.3 0.4 0.4 0.2 . We calculate the mean first passage times to state 1. By our recurrence, we have: M21 = 1 + (0.2)M21 + (0.3)M31 M31 = 1 + (0.4)M21 + (0.2)M31 . Rewriting, we have the linear equations: 0.8M21 − 0.3M31 = 1 −0.4M21 + 0.8M31 = 1. I get M21 ≈ 2.12 M31 ≈ 2.31. We can then calculate M11 = 1 + 0.4M21 + 0.5M31 ≈ 3. 17.2.3. Irreducibility. So far, we have discussed classifications of individual states. Here, we discuss classifications of collections of states. (n) Definition 17.5 (Closed). A collection of states S is called closed if pij = 0 for all i ∈ S, j∈ / S and n ≥ 1. If |S| = 1, the single state is called absorbing. A state that is not closed is called open. We then have Definition 17.6 (Irreducibility). If a Markov chain has a proper closed subset of states, it is called reducible. Otherwise, it is called irreducible. 85 We have some equivalent definitions: (n) Definition 17.7 (Reachability). A state j is reachable or accessible from i if supn pij > 0. We denote this i 7→ j. A chain is reducible if and only if all states are reachable from all other states. Definition 17.8 (Communicating). We say i, j are communicating if i 7→ j and j 7→ i. This can be written i ↔ j. An important abstract relationship: Definition 17.9 (Equivalence Relation). Fix a set S. A subset A ⊂ S 2 is an equivalence relation on S if (x, y) ∈ A ⇒ (y, x) ∈ A (x, x) ∈ A (x, y) ∈ A, (y, z) ∈ A ⇒ (x, z) ∈ A. Remark 17.10. Let A be an equivalence relationship. Then there exists a partition P = P1 , . . . , Pk of S that satisfies x, y ∈ Pi ⇔ (x, y) ∈ A. Remark 17.11. Let S be the collection of states in a Markov chain. We define A ⊂ S 2 by (i, j) ∈ A ⇔ (i ↔ j). This is an equivalence relation. We briefly give some related, less important, definitions: Definition 17.12 (Return states, etc). A state that communicates with itself is a return state. The set of states that communicate with i is denote by C(i), the class of i. Remark 17.13. Section 9.5 has some further results on infinite Markov chains, and these are probably worth reading. 17.2.4. (1) (2) In this Special Matrices. We recall that we have seen two special matrices already: The transition matrix P with entries pij . The reachability matrix F with entries fij . section, we define some related matrices. The first is: Definition 17.14 (Potential Matrix). The potential matrix, with entries rij , is R= ∞ X P n. n=0 Remark 17.15. We have ∞ X rij = E[ 1Xn =j |X0 = i]. n=0 In particular, it is often the case that many entries of R are infinite, while other states can be 0. We will start out by studying all of the entries that ‘might’ not be 0 or infinity, then come back and classify the remaining entries. 86 The rows and columns corresponding to transient elements of the state space are called the fundamental matrix and are denoted by S. These are the only elements of R that may have entries that are not 0 or infinity. We often order the states of a Markov chain so that states i, j with rij < ∞ are first. That is, we will write our transition matrices P in the form: T U P = 0 V where T represents all transitions between transient states, V represents all transitions between recurrent states, and U represents all transitions from transient to recurrent states. Remark 17.16. This ordering doesn’t quite make sense if the Markov chain is infinite. In that case, we keep the definition of T, U, V despite the fact that we can’t really ‘order’ the states in a nice way and the fact that we can’t write T U P = 0 V In this notation, we have that R= ∞ X n=0 ∞ X T n U (n) P = 0 Vn n n=0 where U (n) is a complicated collection of matrices that we will ignore for now. Thus, we can write the fundamental matrix S, which is the upper left-hand part of R, as: S= X T n. n There is a nice formula for S: S = (Id − T )−1 . Remark 17.17. It isn’t so obvious that this formula makes sense. However, it is ‘really’ the same thing as the following familiar Taylor series from calculus: ∞ X xn = n=0 1 . 1−x The textbook has more details. Let’s use this idea: Example 95. We consider a matrix P of the form T U P = 0 V with 87 0.2 0.4 0.4 T = 0.3 0.3 0.3 0.1 0.4 0.4 Then, using Gaussian elimination or the R command solve 03.75 5 5 (Id − T )−1 = 4.5 5.5 4.5 2.375 4.5 5.5 We now look at the value of rij when we don’t have i, j transient: • i recurrent: we have rij = ∞ if j ∈ C(i) and rij = 0 otherwise. (n) • i transient, j recurrent: we have rij = ∞ if maxn>0 pij > 0 and rij = 0 otherwise. There are no real calculations to be done for these elements of the matrix. We note that the entry sij of S gives the mean number of times that a Markov chain enters state j when started from state i. Section 9.6 of the textbook gives similar formulas for several related quantitites, such as the variance of this number. We don’t go over this in class, but please look at this section. I expect you to be aware that the formulas are available. We note that the textbook also discusses the relationship between the first-passage matrix F and the potential matrix R. Let H denote the entries of F that correspond to transient pairs of vertices (just like S denotes the entries of R that correspond to transient pairs of vertices). The most important relationship is: H = (S − Id)(Diag(S))−1 , where for any matrix M the entries of Diag(M ) are Diag(M )ij = Mij 1i=j . Section 9.6 of the textbook has many more details. They are worth understanding if you want to understand the theory, but I will not ask about any of the details in the exam or expect you to remember them in class. 88 18. Lecture 17: More Chapter 9: Random Walks and Distributions (Mar. 16) 18.1. Summary. • A few administrative details. • More on Markov chains 18.2. Lecture. So far, we have focused on finite time-homogeneous DTMC’s, but almost all of our theorems have also applied to infinite time-homogeneous DTMC’s. Today, we begin to look at the infinite chains a little more seriously. 89 18.2.1. Random Walk Problems. We will look at ‘nice’ infinite Markov chains. The following two theorems allow us to distinguish between positive recurrent, null recurrent and transient chains. We give no proofs, or even arguments: Theorem 96 (9.7.1 of Textbook). Let P be the transition matrix of an irreducible Markov chain. Then all of the states of the chain are positive recurrent if and only if the system of linear equations z = zP P has a solution with j zj = 1. Furthermore, if such a solution exists, it is the unique solution to these equations and satisfies zj > 0 for all j. Theorem 97 (9.7.2 of Textbook). Let P be the transition matrix of an irreducible Markov chain and let P 0 be the matrix obtained by deleting the first row and column of P . Then all of the states of the chain are recurrent if and only if all solutions y of the system of linear equations y = P 0y satisfy at least one of (1) y ≡ 0, or (2) there exists i so that yi ∈ / [0, 1]. These theorems apply to all Markov chains, but we will begin by applying them to a special class of chains: Definition 18.1 (Birth and Death Chain). A DTMC is called a birth-and-death chain if the transition probabilities pij satisfy pij = 0 for all |i − j| > 1. We will study one particular birth-and-death chains in this section; the textbook has another. Example 98 (Reflecting Random Walk). We define a family of birth-and-death chains with parameter 0 < p < 1 as follows: 0 1 0 0 0 ... 0 p 0 0 ... 1−p 1−p 0 p 0 ... 0 P = 0 1−p 0 p ... 0 0 0 0 1−p 0 ... 0 0 0 0 1 − p ... That is, at any state besides i, we add 1 with probability p and subtract 1 with probability 1 − p. We would like to understand what this chain looks like. It is clear that there are basically three things that can happen: (1) p > 21 : In this case, we ‘normally’ add 1. The law of large numbers (and a bit of thought) is enough to tell us that limt→∞ Xt = ∞. 90 (2) p < 12 : In this case, we ‘normally’ subtract 1 and expect to ‘bounce off of the origin’ quite frequently. None of our old theorems tell us exactly what to expect the Markov chain to look like, but we might think that the chain spends a lot of time close to 0. (3) p = 21 : In this case, we wander back and forth, sometimes ‘bouncing off of the origin.’ The central limit theorem tells us what happens without this bouncing... but the longterm behaviour is quite mysterious! Of course, we saw some of the answers last class. We look for solutions to the linear equations suggested by our theorems. For the first, we have 1 z1 = z0 1−p zi = pzi−1 + (1 − p)zi+1 , i > 1. i p z0 for any z0 , but don’t prove it (see the We claim that this has the solution zi = 1−p textbook for details). Summing, we have ∞ X 1−p 1 zi = z0 ,p< 1 − 2p 2 i=0 ∞ X i=0 1 zi = ∞ p ≥ . 2 P for p < 12 , we get a solution with i zi = 1; for p ≥ 21 , our calculations Choosing z0 = 1−2p 1−p show that there is no such solution. Conclusion: All states are positive recurrent if and only if p < 12 . We still need to see if the states are null recurrent or transient when p ≥ 21 . We define 0 p 0 0 0 0 p 0 0 1−p 0 1 − p 0 p 0 P0 = 0 1−p 0 p 0 0 0 0 1−p 0 0 0 0 0 1−p 0 Then our solutions to y = P y look like ... ... ... ... ... ... y1 = py2 yi = (1 − p)yi−1 + pyi+1 , i > 1. A rather more complicated calculation shows that, when p = 21 , the only solution is yi = iy1 . Thus, by our earlier theorem, we have that all states are null recurrent in this case. When p > 21 , we have a solution j 1−p yi = 1 − . p Thus, by the same theorem, we that all states are transient in this case. 91 Conclusion: (1) p > 12 : All states are transient. (2) p < 12 : All states are positive recurrent. (3) p = 12 : All states are null recurrent. Section 9.7 of the textbook has another lovely example, called the Gambler’s ruin, that is worth reading. 18.2.2. Limits and Stationary Distributions. We saw already, two classes ago, how to calculate lim P[Xn = i], n→∞ at least for ‘nice’ finite Markov chains. In this section, we look at this problem in a more systematic way. First, some definitions: πi (n) ≡ P[Xn = i]. In matrix notation, we write: π(n) = π(0)P n . We then have Definition 18.2 (Stationary Distribution). Let P be the transition matrix of a time-homogeneous DTMC and let z be a solution to z = zP that satisfies P j zj = 1. The z is called a stationary distribution of the Markov chain. Remark 18.3. The theorem from the start of class said that there is at most one solution z if P is also irreducible. If P is not irreducible, there may be many different solutions. Regardless of whether or not P is irreducible, we have already seen examples showing that there may be no solutions at all. Remark 18.4. If z is a solution and we set π(0) = z, then π(1) = π(0)P = π(0), and in general π(n) = π(0). Thus, π(i) doesn’t move - it is ‘stationary.’ Remark 18.5. This means that, at least sometimes, the stationary distribution is the limiting distribution... and when this occurs, we don’t have to diagonalize the whole matrix. It is enough to solve a single linear equation. This is much easier! Depending on how you learned to solve linear equations, note that taking transposes might be helpful: z† = P †z†. 92 Example 99. Set 0.3 0.2 0.5 P = 0.5 0.1 0.4 0 0.5 0.5 Then we can use R to solve z = zP with the commands P = matrix(c(0.3,0.2,0.5,0.5,0.1,0.4,0,0.5,0.5),nrow=3) v = eigen(P)$vectors[,1] v/sum(v) The result is z ≈ (0.22, 0.31, 0.47). You should get some practice solving these equations by hand, using Gaussian elimination/row reduction. We now have a related definition: Definition 18.6 (Limiting Distribution). Let P be the transition matrix of a time-homogeneous DTMC. If lim P n n→∞ exists, then π ≡ lim π(n) n→∞ exists and is called a limiting distribution of the Markov chain. Remark 18.7. This is a little misleading. The limiting distribution is not necessarily a distribution at all! Consider the Markov chain Xt = t. This has no stationary distribution. However, the vector π(i) = 0 for all i is a ‘limiting distribution’ according to our definition, despite the fact that it isn’t a distribution function. Theorem 100. When all states of a Markov chain are positive recurrent and aperiodic, then there exists a unique limiting distribution which is also a stationary distribution. Example 101. We have already seen chains with unique limiting distributions and nonunique limiting distributions, even with state space {0, 1}. (1) Let {Xs }s∈N be an i.i.d. Bernoulli(0.5) sequence. This is also a Markov chain; it has unique limiting (and stationary) distribution π(0) = π(1) = 12 . (2) Let Xs+1 = 1 − Xs . This has no limiting distribution, since P n doesn’t have a limit. However, the distribution π(0) = π(1) = 12 is a stationary distribution. (3) Let Xs+1 = Xs . This has infinitely many limiting distributions, which are also stationary distributions. For any q ∈ [0, 1], you can check that πq (1) = q, πq (0) = 1 − q is a stationary measure for the chain. We skip ahead a little, to pp. 240 of the textbook, and give some examples showing the relationship between stationary distributions and limiting distributions, as well as uniqueness of each: Example 102 (Irreducible chains that are null recurrent or transient). Our previous theorem tells us that any irreducible chain for which all chains are null recurrent or transient cannot have a stationary distribution. If these chains have a limiting distribution, it must be identically 0. 93 Example 103 (Irreducible chains that are positive recurrent). These chains always have at least one stationary distribution, given by 1 . π(i) = Mii As we can see by looking at examples we already know, this may not be unique, and may not be a limiting distribution, if the chain is periodic. The last example gives us all ‘nice’ DTMCs; the other classes mostly indicate that you should really be analyzing several different chains: Example 104 (Irreducible chains that are aperiodic). These chains always have a unique limiting distribution. This limiting distribution π either satisfies: • π ≡ 0, in which case all states are transient or null recurrent. • π(i) > 0 for all i, in which case all states are positive recurrent and the limiting distribution is a stationary distribution. This class of chains is very important. Pages 242-243 of the textbook give some formulas associated with this stationary distribution. The formulas are consequences of things we have already written down, but are worth looking at. Example 105 (Irreducible chains that are periodic). These chains cannot have limiting distributions, since P n does not have a limit. If the states are all positive recurrent, they will have stationary distributions. However, they will not have a unique stationary distribution. The textbook discusses these chains in greater detail. I will not emphasize them, however, and mention only the observation that if all states are periodic with period k, then P k is an aperiodic irreducible chain. This is often the chain that you ‘really’ want to analyze. 18.2.3. Reversibility. All of our categorizations so far have been related to ‘big’ problems that stochastic processes might have: • Discrete-time, time-homogeneous Markov chains are a very nice class of Markov chains, where we can start to develop some nice formulas. • Recurrence/transience, periodicity, reducibility all detect macroscopic problems (e.g. states that you can ignore after some short initial period, chains that ‘look like’ disjoint unions of smaller chains, etc). Reversibility is rather more subtle. The idea is as follows. We consider a Markov chain {Xs }s∈Z , and then define the ‘time-reversed’ chain Ys = X−s . We would like to be able to say if {Ys }s∈Z is the ‘same’ as {Xs }s∈Z . This property of being the ‘same’ is known as reversibility. Some observations and pointers: • Our notion of ‘the same’ should be at the level of distribution. It should also ignore the ‘starting’ point X0 , since that starting point is sort of arbitrary. • This is relevant to modelling: we expect any system from physics to be reversible. • This turns out to be mathematically relevant as well - reversible Markov chains have some very nice properties. However, that is much less obvious. Assume that our chain has a stationary distribution π, and that X0 ∼ π. Then: P[Xs = j, Xs+1 = i] P[Xs = j|Xs+1 = i] = P[Xs+1 = i] 94 pji π(j) π(i) The textbook defines this number to be rij , and the associated matrix to be R. This is not the potential matrix, which was also written as R and which we studied in section 9.6. We have: = Definition 18.8 (Reversibility). A time-homogeneous DTMC is reversible if R = P. This is equivalent to π(i)pij = π(j)pji , sometimes called the detailed balance equation. We give some results related to reversibility: Theorem 106. All birth-and-death chains are reversible. Theorem 107. All irreducible, positive-recurrent chains with transition matrices P that satisfy P = P † are reversible, and have stationary distribution that is uniform. Example 108. We find the reversal of a Markov chain. Let 0.7 0.2 0.1 P = 0.5 0.1 0.4 0.2 0.5 0.3 We can calculate its stationary distribution by solving at the equation v = P t v, which gives us π ≈ (0.54, 0.24, 0.22). Recall that the time reversal is rij = 0.7 0.22 0.08 R = 0.45 0.1 0.45 0.25 0.25 0.5 pji π(j) . π(i) Thus, There are many reasons to be interested in reversibility. One of my favorites, from statistics and computing, is the Metropolis-Hastings algorithm: Example 109 (Metropolis-Hastings Algorithm). Let P be the transition matrix of a reversible Markov chain with stationary distribution π > 0. Let µ > 0 be a distribution of interest. Define π(j)pji aij = min 1, . π(i)pij Then define the adjust transition matrix Q by qij = pij aij , i 6= j X qii = 1 − qij , otherwise. j Then Q is a reversible Markov chain with stationary distribution µ. 95 Remark 18.9. This example turns out to be very useful. We will see why in the homework set. 96 19. Lecture 18: Closing Chapter 9: Continuous-Time and Special Processes (Mar. 19) 19.1. Summary. • A few administrative details. • Yet more on Markov chains 19.2. Lecture. Today we discuss continuous time Markov chains. These are slightly complicated objects in generality, but we will end up only looking at continuous-time Markov chains that are ‘really’ discrete-time Markov chains in a light disguise. We give first the following non-rigorous definition: Definition 19.1 (Almost-a-definition for continuous time Markov chains). The processes we studied when defining phase-type distributions are continuous-time, time-homogeneous, finite Markov chains. All of the continuous-time Markov chains we study will be of this form. In particular, the exponential distribution will be quite important. This isn’t obvious from the following, more precise, definition: Definition 19.2 (Continuous-Time Markov Chain). A continuous time, discrete space Markov chain is a stochastic process {Xt }t∈R that satisfies P[X(tn+1 ) = xn+1 |X(tn ) = xn , . . . , X(t1 ) = x1 ] = P[X(tn+1 ) = xn+1 |X(tn ) = xn ] for all t1 < t2 < . . . < tn+1 and all x1 , . . . , xn+1 . We then have Definition 19.3 (Time-Homogeneous Continuous-Time Markov Chain). A continuous time, discrete space Markov chain is time-homogeneous if P[X(s + t) = j|X(s) = i] ≡ pij (s, s + t) does not depend on s. When a chain is time-homogeneous, which will be the case for all chains of interest, we will write pij (τ ) = pij (s, s + τ ). We note that these transition probabilities still satisfy X pij (τ ) = 1 j for all i and all τ . Although we have transition probabilities, they are less fundamental to the study of continuoustime Markov chains. Instead, we consider: Definition 19.4 (Instantaneous Transition Rate). For states i 6= j, define pij (h) − pij (0) qij = lim . h→0 h For state i, then define X qii = − qij . j6=i 97 Definition 19.5 (Infinitesimal Generator). The infinitesimal generator of a Markov chain is the matrix Q with entries qij . Remark 19.6. It isn’t so clear that pij (t) should have a derivative at 0 (e.g. that qij actually exists). In fact, we already know that this derivative can fail to exist: i.i.d. sequences satisfy our definition for continuous-time Markov chains, and derivatives generally don’t exist for this type of Markov chain. We’ll ignore this difficulty and assume that qij exists and is finite unless we explicitly say otherwise. Note that the textbook glosses over this problem completely, but it seems important to bring up: we have essentially looked at all possible DTMCs, but we are not looking at all possible continuous-time Markov chains (CTMCs). We’ll now see why exponential distributions show up everywhere for Markov chains with infinitesimal generators. Fix i, assume X0 = i and define the holding time Ti = inf{t > 0 : Xt 6= i}. Theorem 110. Ti has exponential distribution. Proof. Fix a, b > 0. By the Markov property, P[Ti > a + b|Ti > a] = P[Ti > a + b|Xa = i] = P[Ti > b]. But this says exactly that Ti has the memoryless property. The exponential distribution is the only continuous distribution with the memoryless property, and so Ti must have exponential distribution. We can do better. Not only does Ti have exponential distribution; we can even decompose it in terms of other exponential distributions! To be explicit, for all j, define Tij to be an exponential distribution with mean q1ij . Then Ti has the same distribution as inf j (Tij ) and P[XTi = j] = P[Tij = inf k Tik ]. Thus, we have a correspondence between the continuous-time Markov chains we study here and discrete-time Markov chains: Definition 19.7 (Skeleton Chain). Let τ0 = 0. Then define τi+1 = inf{t > τi : Xt 6= Xτi }. The skeleton of Xt is the discrete-time Markov chain Yi = Xτi . We have: Definition 19.8 (Correspondence). Fix a state space {1, 2, . . . , n} and let Q be an infinitesimal transition matrix. Then the skeleton of the associated chain has transition matrix qij pij = P k6=i qik for j 6= i and pii = 0. Remark 19.9. This correspondence means that we can do almost every calculation we might be interested in by just looking at the associated discrete-time chain, and then making adjustments to take into account the holding times. 98 Warning: it is very tempting to just do all of the calculations for the discrete-time chain, and apply them without any adjustments to the continuous-time chain. Somebody will do this on an exam, but it doesn’t work. To see this, consider the family of infinitesimal generators −C C QC = 1 −1 , where 0 < C < ∞. We calculate the the skeleton of this chain has transition matrix 0 1 PC = 1 0 for any value of C. Thus, any answers you get based only on the skeleton of the chain will not depend on C. This can be deeply misleading. For example, we will calculate soon that for any starting point X0 , C , t→∞ 1+C which depends a great deal on C! However, the skeleton chain has unique stationary distribution π(1) = π(2) = 12 . Again, somebody will make this mistake on the exam. Don’t be that person! (REVISIT this example in the section on stationary distributions!) lim P[Xt = 2] = We now try to figure out how to use the correspondence properly. For qualitative features of the Markov chain, we really don’t need to adjust very much. For example, we say that a continuous-time chain is irreducible if and only if the skeleton is irreducible. The same carrying-over happens for all of our other classifications, with the exception of periodicity: we always say that a continuous-time chain is aperiodic. We do get continuous-time versions of all of our formulas. We won’t go over all of these in class, but we will illustrate some of them today and when doing problems next class. You should be comfortable using other analogous formulas. We give one example to start. The continuous-time version of the Chapman-Kolmogorov equation is: X pij (τ ) = pik (s)pkj (τ − s) k for all i, j and all 0 < s < τ . Again, in matrix form, this is P (τ ) = P (s)P (τ − s). However, since we have continuous-time chains, we can take derivatives of this expression: d P (t) = QP (t) = P (t)Q. dt This has a rather nice solution: Qt P (t) = e ≡ ∞ X Qn tn n=0 99 n! . Remark 19.10. This last formula is similar to the much more trivial solution P (t) = P (1)t we had for discrete-time Markov chains. 19.2.1. Stationary, limiting and reversible distributions. Stationary and limiting distributions could be defined as in the finite case. However, as we have seen, there is some subtlety here: we don’t want to define the stationary or limiting distribution as coming from the skeleton of the chain. The good news is that the theory of limiting and stationary distributions are very similar to the discrete-time setting. The textbook has all of the details, but I highlight the one important difference and two most important similarities: • Although the skeleton of a chain might be periodic, no continuous-time chains are periodic. • Other than the fact that all continuous-time chains are aperiodic, all of the existence/uniqueness theorems concerning discrete-time chains apply as written. • Even the formula for a stationary distribution in terms of mean recurrence time still holds! Although we can define stationary and limiting distributions exactly as we did for DTMC’s, it is convenient to use some closely related definitions: Definition 19.11 (Stationary Distribution). Consider a continuous-time Markov chain (CTMC) with generator Q. If the equation zQ = 0 has a solution with P i zi = 1, then we call z a stationary distribution of Q. Remark 19.12. This is just another way of saying our old definition. Recall that, in discrete time, we wanted z to solve z = zP. However, if z is a stationary distribution for a CTMC, we have for all t > 0: zP (t) = z ∞ n n X t Q n=0 n! ∞ X tn+1 Qn = zId + zQ (n + 1)! n=0 = z. Thus, our new definition agrees with our old one. Since there are no periodic continuous-time Markov chains, the distinction between limiting and stationary distributions is less interesting. The following theorem, which is identical to one that we already saw for discrete-time Markov chains, gives us conditions under which the limiting and stationary distributions are the same: Theorem 111. For any finite, irreducible, continuous-time Markov chain, there exists a unique limiting distribution and it is equal to the unique stationary distribution. We revisit our old example: 100 Example 112. Recall the CTMC with generator: −C C QC = 1 −1 We then note that any stationary distribution satisfies: −Cz1 + z2 = 0 Cz1 − z2 = 0 and z1 + z2 = 1. Putting these together, we have z1 + Cz1 = 1, so (z1 , z2 ) = ( C 1 , ). 1+C 1+C This is what we had before. 19.2.2. Reversibility. Reversibility is similar in continuous time: Definition 19.13 (Reversible CTMC). A CTMC is reversible if πi qij = πj qji . We have Theorem 113. A CTMC is reversible if and only if its skeleton is reversible. 19.2.3. Not-Quite-Markov Processes. We’ve finished our general introduction to Markov chains, bringing us up to section 9.11 of the textbook. The next two sections of Chapter 9 discuss things that aren’t quite Markov chains, despite the chapter heading. These sections are actually building towards our last major topic: queuing theory. Just like we began with stochastic processes and then came up with definitions of nicer and nicer stochastic processes before getting to reversible, irreducible, aperiodic Markov chains, we begin with some very general classes of stochastic processes and will eventually get to some very nice Markov-chain-like objects called queues. Our first definition is quite general, and seems a little silly at first. We generalize our notion of a skeleton: Definition 19.14 (Skeleton Process). Let {Xt }t∈R be a continuous-time stochastic process. Let τ0 = 0. Then define the transition times τi+1 = inf{t > τi : Xt 6= Xτi }. The skeleton of Xt is the discrete-time stochastic process Yi = Xτi . We then have: Definition 19.15 (Semi-Markov Process). A continuous-time stochastic process {Xt }t∈N is called a semi-Markov process if its skeleton is a DTMC. 101 Example 114. To construct a big class of semi-Markov process, fix any transition matrix P for a DTMC and any CDF for a nonnegative random variable F . Let {Yt }t∈N be a DTMC evolving according to P , and let {τi }i∈N be an i.i.d. sequence of random variables distributed according to F . Define Mt = max{i ≥ 0 : i X τ` < t}. `=1 Then Xt = YM t is a semi-Markov process. This doesn’t represent all semi-Markov processes, but it gives a way to build many of them. Definition 19.16. A stochastic process {Nt }t≥0 is a counting process if • N (0) ≥ 0. • N (s) ∈ Z. • N (s) ≥ N (t) whenever s ≥ t. We then look at some special semi-Markov processes: Definition 19.17. A counting process {Nt }t≥0 with skeleton {Mn }n∈N and transition times {τi }i∈N is a renewal process if the sequence of transition times are i.i.d. Remark 19.18. This is the definition from the textbook, and so I follow it in these notes. However, you will see slightly different definitions in other books. Under this definition, renewal processes may or may not be a semi-Markov process. However, essentially all of the renewal processes that we care about will be semi-Markov. Example 115 (Nice renewal processes). random variables and define N (t) = • Let X1 , X2 , . . . be a sequence of Bernoulli(p) X 1Xi =1 . 0≤i<t We note that the transition times have geometric distribution with mean p1 . Thus, N (t) is a renewal process and also a semi-Markov process. It is called the binomial process. • Let {Xi }i∈N be any sequence of i.i.d. nonnegative random variables (e.g. a sequence of Gamma random variables). Then Mt = max{i ≥ 0 : i X X` < t} (19.1) `=1 is a renewal process, and also a semi-Markov process. Note: the binomial process is a special case of this construction, with Xi being geometric random variables with mean p1 . Note: Although the textbook doesn’t say so explicitly, it is only interested in renewal processes of this form. Be aware of the fact that many of its calculations don’t work for all of the processes encompassed by its definition. 102 Example 116 (Bad renewal process). Consider a process {Xn }n∈N that satisfies: X0 = 0 1 P[X1 = 1] = P[X1 = 2] = 2 ∀n > 1, Xn = Xn−1 + X1 . Then let Nt = Xbtc . This is a counting process (it always increases) and a renewal process (the time between changes is always exactly 1, which is certainly an i.i.d. sequence). However, it is not a semi-Markov process: the value of Xn does not become independent of X1 . In many books, this would not be considered a renewal process at all (though it is still a counting process). The most important renewal process is: Definition 19.19 (Poisson Process). Let X1 , X2 , . . . be i.i.d. random variables with exponential distribution and mean λ1 . Then N (t) = max{i ≥ 0 : i X X` < t} `=1 is called a Poisson process. Remark 19.20. The Poisson process is very nice! For example, • It is actually a Markov chain, not just a semi-Markov process. P • We can do lots of special calculations with it! For example, let Sn = ni=1 Xi . Then Sn has an Erlang distribution, and (λt)n P[N (t) = n] = P[Sn ≤ t] − P[Sn+1 ≤ t] = e−λt . n! We could try to do the same calculations for other renewal processes, but we generally don’t have an explicit formula for the CDF of Sn . Note: that last step is really the only obstacle - we aren’t using anything fancy about the renewal process being a Markov chain. 103 20. Lecture 19: Review of Chapter 9 (Mar. 23) 20.1. Summary. • A few administrative details. • Last bit of chapter 9: renewal processes. • We review the types of questions that can be asked about this material. 20.2. Lecture. Recall: renewal processes look like: Example 117 (Nice Renewal Processes). Let {Xi }i∈N be any sequence of i.i.d. nonnegative random variables. Then i X Mt = max{i ≥ 0 : X` < t} `=1 is a renewal process. For general renewal functions, we often content ourselves with moments rather than full CDF’s: Definition 20.1. We call M (t) = E[N (t)] the renewal function of a renewal process {N (t)}t≥0 . For nice enough renewal processes, the renewal function actually defines the entire process, in the same way that the transition matrix defined a DTMC: Theorem 118. Let N (t) be a renewal process of the sort defined in equation (19.1). Define Z ∞ ˜ (s) = e−st M (t)dt M Z0 ∞ F˜X (s) = e−st FX (t)dt. 0 The fundamental renewal equation is: ˜ (s) M ˜ (s) 1+M ˜ ˜ (s) = FX (s) . M 1 + F˜X (s) F˜X (s) = Since the renewal processes of this form are characterized completely by FX , this implies that ˜ (s). they are also characterized completely by M Remark 20.2. What are some consequences of the fact that a renewal process is ‘characterized completely’ by its renewal function? One is that somebody can ask a question by saying something like “let N (t) be the renewal process with renewal function M (t) = F oo,” just like we often began questions with ”let X be the normal random variable with mean Foo and variance Bar.” That is: when we describe a renewal process to you, we can just tell you the renewal function. 104 Remark 20.3. At first glance, this should be very surprising! After all, the random variables X define the renewal process, and they certainly aren’t characterized completely by their means!! When it exists, we define: Definition 20.4. The derivative m(t) = density. d M (t) dt of the renewal function is called the renewal Taking derivatives of the fundamental renewal equation gives: Z t m(t − s)fX (s)ds. m(t) = fX (t) + 0 20.2.1. Chapter 9 Review. (Most important:) most exam questions will be similar to the computational questions in the back of chapter 9 of the textbook. As we covered essentially all of chapter 9, you should make sure that you are comfortable with essentially all of these questions. The end of this section of the lecture notes has one example of most ‘types’ of questions. Today, I’ll go over the main subjects in chapter 9, discuss the types of questions that we should be able to answer about this material, and give some examples. I will begin by emphasizing things that are not sample questions in the textbook. This isn’t because they are more important; it is because we won’t have time to go over everything, and I expect you to read the textbook problems as well. The main subjects we covered in chapter 9 were: • A fairly small number of objects that we use in computation: the transition matrix P , the expected transition times Mij , potential and fundamental matrices R and S, the limiting and stationary distributions z, and the infinitesimal generator Q of a continuous-time chain. A large percentage of the Markov chain questions will be about computing these objects. • A fairly large number of definitions. From my point of view, these basically fall into three categories: (1) The basics: things like what a stochastic process is, what a Markov chain is, etc. (2) Things to compute: these are the things above, like the expected transition times. (3) Classification: these are the words like transient, irreducible, closed state, aperiodic. Several people have said that they find these definitions hard to remember. I have one main tip here. All of these definitions are really about describing a problem with how a Markov chain is behaving. For example, when a Markov chain is irreducible, it means that a whole bunch of states that are on the graph associated with the Markov chain won’t actually get visited after some initial period. So, to memorize these definitions, I suggest memorizing alongside them an example of the problems that they are meant to avoid. For example, when memorizing what ‘periodic’ means, one might try to associate: (n) – The (rather opaque) definition: a state i is periodic if the set {n : pii > 0} has 1 as its greatest common divisor. 105 – The picture: a Markov chain with d > 1 states arranged in a circle, and all arrows pointing clockwise. – The obstacle: a Markov chain that is periodic has no limiting distribution (in the previous obstacle, this is because P kd+` = P ` but P , P 2 , . . . , P k−1 are all distinct) but often has a unique stationary distribution (in this case, the uniform distribution). This is viewed as surprising - it is more ‘usual’ to have a unique limiting distribution that is equal to the unique stationary distribution. If trying to remember what ‘reducible’ means, one might try to associate: – The (again rather opaque) definition: a Markov chain is irreducible if, for (n) all chains i, j, we have supn pij > 0. Otherwise, it is reducible. – The picture: Two ‘nice’ Markov chains next to each other, with no arrows between them. – The obstacle: a Markov chain that is reducible will not have one stationary distribution (‘nice’ reducible chains will have infinitely many, but some might have 0). Associating the definition, example and obstacle has a few advantages. The definition will make more sense, and might be easier to remember. It will also let you quickly answer various tricky questions quite easily. For example: Example 119. Give an example of a Markov chain with at least two different stationary distributions and no limiting distribution. Provide the two stationary distributions. To answer this, we recall that periodic chains often have no limiting distributions, while reducible chains often have many stationary distributions. So we write down the simplest chain that is both: 0 1 0 0 1 0 0 0 P = 0 0 0 1 0 0 1 0 . Two stationary distributions are z1 = (0.5, 0.5, 0, 0) and Z2 = (0, 0, 0.5, 0.5). Example 120. Let 0.6 0.4 0 0 0.3 0.7 0 0 P = 0 0 0.5 0.5 0 0 0.8 0.2 . Create a new matrix P 0 that differs from P in exactly two elements in the first and third rows, so that P 0 has a unique stationary distribution. This question is trickier. The first thing you must discover is: why doesn’t P already have a unique stationary distribution? If you draw the picture for this transition matrix, you will quickly see that it is reducible. Reducible chains often don’t have unique stationary distributions, so the obvious fix is to make the chain reducible: 106 0.5 0.4 0.1 0 0.3 0.7 0 0 P0 = 0 0.1 0.4 0.5 0 0 0.8 0.2 . Remembering this can also make multiple-choice questions much easier. For example, you can answer a question like ‘which of the following Markov chains does not have a unique stationary distribution’ without having to do all of the linear algebra to actually compute stationary distributions. • Several theorems. I will generally not ask for you to state a theorem, but will certainly ask some related questions. The examples above related to stationary and limiting distributions were very closely related to some theorems that were stated in class, but didn’t actually require the theorems in order to be solved. • Some computational strategies and formulae. The most important of these is the Chapman-Kolmogorov equation. • Some miscellaneous bits. Most of these are not important to the course (e.g. MetropolisHastings chains; the details of the calculations we did for reflecting random walk). A few, such as the renewal equation, are very important. 20.2.2. Important Calculations. All of these questions are very similar to questions in the textbook. I have included the number of the associated textbook question. Example 121 (Path Probability (9.2.3)). Let P = 0 0 1 0 0 0 0.2 0 0 0 0.6 0 0 1 0 0 0 1 0 0 0.8 0 0.4 0 0 . What is the probability of being in states 1, 2, . . . , 5 after starting in state 1? We draw: 107 Example 122 (Classification of States (9.4.2)). Classify the states and subsets of the Markov chain drawn here: 108 109 Example 123 (Potential Matrix (9.6.1)). Compute the potential matrix of 0 0.3 0.7 0 0 0.4 0 0 0.6 0 0 0.8 0 P = 0.2 0 0 0 0.5 0 0.5 0 0 0 0 1 and use this to compute the expected number of visits: • To state 2, started at state 1. • To state 1, started at state 1. • To state 4, started at state 3. We begin by classifying states. Drawing the diagram for this chain, it is easy to check that states 1,2,3,4 are transient while state 5 is recurrent. Thus, 0 0.3 0.7 0 0.4 0 0 0.6 T = 0.2 0 0 0.8 0 0 0.5 0 We then have S = (Id − T )−1 . Calculating this, we have 1.62 0.81 S≈ 0.54 0.27 0.49 1.24 0.16 0.08 2.14 1.57 2.38 1.19 2 2 2 2 Note: it is striking that the last column has all 2’s. Where does this come from? Well, the only way to exit the group of transient states is to go through state 4. Also, every time you go to state 4, you exit the group of transient states with probability 0.5. Thus, regardless of where you start, the number of times through state 4 before exiting the group of transient is actually a geometric random variable with parameter 0.5. In particular, this number has mean 2. Asking for the expected number of trips through a state like this in a much larger Markov chain would make a (slightly mean) tricky question. Of course, now that I’ve pointed it out, I would feel less guilty about asking it... Continuing, we can read off from the matrix S that the three answers are: • 0.49. • 1.62. • 2. NOTE: I might ask you to compute the expected number of visits from i to j without telling you to compute the potential matrix. This is harder for many people, as you have to remember that you should calculate the potential matrix! Example 124 (Stationary Distributions via Return Times (9.8.4)). Consider the transition matrix: 110 0.6 0.3 0.1 P = 0.1 0.7 0.2 0.1 0.1 0.8 (1) Does this matrix have a unique stationary distribution? What about a unique limiting distribution? (2) Assume that you started in state 1. What is limt→∞ P[Xt = 3]? (3) What is the mean recurrence time of state 2? This question is really asking two things: (1) Can you calculate the stationary distribution? (2) Do you remember some of the theorems we have about limiting distributions? Lets do the computational part first, as almost everybody finds it easier. We solve z = Pz z1 + z2 + z3 = 1 to find z = (0.2, 0.35, 0.45). So, we have found a stationary distribution. We now use this to actually answer the question: (1) Yes. P is the transition matrix of an irreducible aperiodic finite DTMC, and so there is a unique stationary distribution that is also the limiting distribution. (2) Since the limiting distribution is equal to the stationary distribution, this probability is z3 = 0.45. 1 (3) For an irreducible aperiodic finite DTMC, zi = M1ii . Thus, M22 = 0.35 ≈ 2.86. Note: This question might look like a question about computation. However, for almost everybody, solving z = P z will be the easiest part. Make sure that you understand the rest of the solution! Example 125 (Reversed Chain (9.9.1)). For each transition matrix below, find the transition matrix of the reversed chain: 0.6 0.3 0.1 P1 = 0.1 0.7 0.2 0.1 0.1 0.8 0 0 1 P2 = 1 0 0 0 1 0 We first calculate the stationary measures π (1) , π (2) of P1 , P2 . They are: π (1) = (0.2, 0.35, 0.45) 1 1 1 π (2) = ( , , ). 3 3 3 pji πj We then use the formula rij = πi for the reversible chain and find: 0.6 0.175 0.225 0.13 P1 = 0.17 0.7 0.04 0.16 0.8 111 0 1 0 P2 = 0 0 1 1 0 0 Note: The second is interesting to draw. It changes a Markov chain that goes around a circle counter-clockwise into a Markov chain that goes around the same circle in th opposite direction. Example 126 (Embedded Chain (9.10.3)). Construct the embedded chain of the continuoustime Markov chain with infinitesimal generator −4 2 2 Q = 1 −7 6 3 3 −6 Recall that the embedded chain has transition matrix: qij , i 6= j pij = P k6=i qik pii = 0. Thus, 0 P = 1 7 1 2 1 2 0 1 2 1 2 6 7 0 Example 127 (Stationary Distributions for Continuous-Time Chains (9.10.7)). Calculate the stationary distribution of the continuous-time Markov chain with infinitesimal generator −4 2 2 Q = 1 −7 6 3 3 −6 We recall that the stationary distribution satisfies zQ = 0. We calculate this directly, and find z ≈ (0.35, 0.26, 0.38). Note: the following example was not in the lectures, but might be helpful for people who haven’t seen much discussion of infinities, infinite sums, and so on. Example 128. We have already in this course a few times. For example, we P seen infinities j know that for all 0 < p < 1, ∞ p(1 − p) = 1. However, they will be more important for j=0 Markov chains, so we take a moment to look at when infinities are ‘ok’ and when they’re trouble. There are basically three types of sums: ∞ X j=0 ∞ X 2−j = 2 j −1 = ∞ j=0 112 ∞ X j=0 ∞ X (−2)−j = 2 3 (−j)−1 = 1 − log(2) j=0 ∞ X (−1)j =??? j=0 The first two are infinite sums of positive terms. These are pretty safe: either they converge to a number of they go to infinity. If you are dealing with only sums of positive terms that you know converge, you can treat them almost like numbers. The next term involves an infinite sum that has some negative signs. However, if you remove the negative signs, the remaining sum still converges. These are absolutely convergent sums, and they are also fairly safe. However, they don’t play well with diverging sums. The last two terms involve infinite sums with some negative signs, for which the corresponding absolute sum diverges. These sums are quite dangerous, and don’t interact well with each other or other sums. In this class, we are going to be fairly careful to only look at infinite sums of the first two types. 113 21. Lecture 20: Start of Chapter 11: Introduction to Queues (Mar. 26) 21.1. Summary. • A few administrative details. • Introduction to Queues. 21.2. Lecture. Queueing theory studies the way that queues evolve over time. In the next few classes, we will only have the time to say a little bit about queueing theory - the main examples, a few definitions, and a few calculations - and so we won’t be analyzing particularly complicated systems. Once beyond this class, queueing theory is often more concerned with the optimization of queueing systems than the modelling of specific systems. That is: although we will mostly be doing computations about the properties of specific systems, this is just a first step towards the more interesting task of optimizing your system. The definitions in chapter 9 of the textbook were a little bit fuzzy, and I tried to clarify them. The definitions in chapter 11 are even fuzzier, but this extra fuzziness seems useful in describing the subject. I will still try to make it clear when a definition is ‘precise’ and when it is only a rough description. First, some jargon that gets used to describe queues. We imagine an infinite sequence of potential customers. As they arrive at the office, they form one or more queues in front of one or more servers. Each customer eventually gets to one of the servers, and then some amount of time passes as they are served. The random process that we study is generally the total number of people in line at a given time. This may or may not be a Markov chain. Next, a rough definition of the simplest possible queue. This queue describes, roughly, what happens if a bunch of people are lining up at the sole checkout counter in a small grocery store: Definition 21.1 (M/M/1 Queue (rough description)). An M/M/1 queue N (t) describes the evolution of the number of people waiting in a single line. It is assumed that the time in-between arrivals of people has exponential distribution. It is also assumed that only the front person in the line is being served at any time, and that the amount of time it takes for a person to be served is also exponential. This is a nice queueing model, but it certainly isn’t the only one. In a large grocery store, there are often several lines, and people jump between them. This will influence the number of people who are waiting. More complicated queues are possible, and indeed are quite common. In the literature, there is a standard notation for describing queues as a sequence of up to 6 letters or numbers in a row, seperated by slashes. I’ll give a rough version of the notation here, and say why the first example is called M/M/1: Definition 21.2 (Kendall’s Notation). • A: The first element of the list describes the inter-arrival time. The letter M in M/M/1 queue stands for Markov. This indicates the fact that the number of customers who have arrived by time t is a Markov chain. We already know, at least roughly, that only the exponential distribution can have this property. • B: the second element of the list describes the service time once somebody is at the front of the queue. The second M in M/M/1 queue also stands for Markov, and again indicates an exponential distribution. • C: The number of servers. This is self-explanatory! 114 • X: The system capacity. This is the maximum possible size of the queue - any extra customers who arrive are assumed to wander off (possibly to buy groceries elsewhere). When this isn’t mentioned (as in M/M/1 queue) it is assumed to be infinity. • Y: The size of the customer population. Again, when it isn’t mentioned it is assumed to be infinity. • Z: The queue scheduling discipline. This refers to the way that the line works. Most grocery lines are first-come, first-served (FCFS). That is, you start at the back of the line, move forward, and get served when you get to the front. Other possibilities include last-come, first-served (so you get served from the back of the line) or roundrobin (like FCFS, but you get booted back into line part-way through being served if you are taking too long). Even this notation obviously doesn’t cover everything (for example, it doesn’t cover the fact that people don’t switch lines perfectly when going to a grocery store), but it is a good start. With this overview, we’ll go back and look at some pieces of the M/M/1 queue more carefully, with breaks to discuss other general principles as they show up. First, the arrival process. Recall: Definition 21.3 (Poisson Process). Let X1 , X2 , . . . be an i.i.d. sequence of exponential random variables, and let N (t) = max{i ≥ 0 : i X X` < t}. `=1 Then N (t) is called a Poisson process. We know that N (t) is a counting process, a renewal process and a Markov process. It has some other rather nice properties. First, two more definitions: Definition 21.4 (Independent Increments). A counting process M (t) is said to have independent increments if M (b) − M (a) is independent of M (d) − M (c) whenever a ≤ b ≤ c ≤ d. Definition 21.5 (Stationary Increments). A counting process M (t) is said to have stationary increments if the distribution of M (b) − M (a) depends only on b − a. We have: • The increments of a Poisson process have Poisson distribution. That is, P[N (a + b) − N (a) = n] = e−λb (λb)n , n! where λ−1 = E[N (1)]. • Any continuous-time counting process N (t) that has independent, stationary increments must in fact be a copy of the Poisson process. • There are two other equivalent definitions in the textbook that I skip. The Poisson process has a number of special properties that make them especially nice in the context of queues. The first is the superposition/decomposition property: 115 Theorem 129 (Superposition Property). Let N1 (t), N2 (t), . . . , Nk (t) be independent Poisson processes. Then N (t) = k X Ni (t) i=1 is also a Poisson process. Remark 21.6. Here is an interpretation. If you have k lines at a grocery store, each of which gets filled up like a Poisson process, you can also model the total number of people waiting in all lines as a Poisson process. We also have the reverse: Theorem 130 (Decomposition Property). Let N (t) be a Poisson process with E[N (1)] = λ1 . Then for any k ∈ N and any λ1 , . . . , λk satisfying k 1 X 1 = , λ λ i=1 i there exist independent Poisson processes N1 (t), . . . , Nk (t) with E[Ni (t)] = N (t) = k X 1 λi and Ni (t). i=1 Remark 21.7. We don’t prove these two results. However, it is a nice exercise to do so (and would make a fair exam question). The main idea in both cases is to do calculations with generating functions to check that N (t) is Poisson with the right parameters for all t. Example 131 (Similar to 11.1.1 of textbook). Let N (t) be a Poisson process with mean E[N (1)] = 2. Calculate: (1) P[N (5) = 0]. (2) E[inf{t : N (t) = 3}]. Both of these are easy if we remember our usual representation of the Poisson process: (1) N (5) has Poisson distribution with mean E[N (5)] = 5E[N (1)] = 10. Thus, 100 −10 e ≈ 0.00005. 0! (2) This looks funnier, but will turn out to be almost as easy. Let X1 , X2 , . . . be the interarrival times of the Poisson process. Then inf{t : N (t) = 3} = X1 + X2 + X3 . But X1 , X2 , X3 are independent exponential distributions with mean 21 , so X1 + X2 + X3 has Erlang distribution. Thus, P[N (5) = 0] = E[inf{t : N (t) = 3}] = E[X1 + X2 + X3 ] = 6. Remark 21.8. You should also be able to calculate the variance Var[inf{t : N (t) = 3}]! It isn’t any harder if you remember the formulas for Erlang random variables. The next property is quite a bit more complicated. At first glance it might seem ‘obvious,’ but in fact it is quite special. We need some notation: 116 • Let an (t) be the probability that a person arriving at time t sees n people in line. Note this is a distribution on the number of people n (e.g. N ∪ {0}), not on the time t (e.g. R). Unfortunately, we don’t quite have the notation to define an (t) carefully. It can be written as n k X X an (t) = P[ Xi = t | t ∈ { Xi }k∈N ], i=1 i=1 but the event we’re conditioning on has probability 0, and we haven’t really talked about how to condition on general events with probability 0. • Let pn (t) be the probability that there are n people in line at time t. This is, of course, a Poisson distribution. We then have: Theorem 132 (PASTA Property). an (t) = pn (t). Remark 21.9. The PASTA property also holds for the M/M/1 queue in full generality. That is, if an (t) is the distribution of the length of the queue at time t, and pn (t) is the stationary distribution of the length of the queue, then an (t) = pn (t). Why is this special? (1) The PASTA property doesn’t hold for any arrival process that isn’t Poisson, and doesn’t hold for any queue that doesn’t start M/M. For example, assume that people arrive in a line exactly once per minute and that the service time is exactly 30 seconds. Then p0 (t) = p1 (t) = 12 , but a0 (t) = 1. (2) Some very similar-sounding facts are false for the Poisson process. For example, assume that the arrival of buses follows a Poisson process with mean inter-arrival time of 20 minutes. The same calculation that gave us the PASTA property tells us that the expected time until the next bus is also 20 minutes, and also that the expected time since the last bus is 20 minutes. Thus, when you arrive at the bus station, you expect to arrive during an inter-arrival period of length 40 minutes. In particular, if people arrive at a rate of exactly 1 per minute, you expect to be boarding with 40 other people, even though the average number of people boarding a bus is only 20 people! This can be a little surprising at first glance. If 20 people board the bus on average, how can it be that the expected number of people boarding the bus with me is 40? The answer is that more people arrive during the longer inter-arrival times than the shorter ones. For the Poisson process, this means that you exactly double the expected inter-arrival time. For other arrival processes, something more complicated will happen in general. 117 22. Lecture 21: More Chapter 11: Properties and Birth-Death Chains (Mar. 30) 22.1. Summary. • A few administrative details. • Studying specific queues. 22.2. Lecture. Recall, last class we saw: • A lot of jargon related to queues (arrival process, service times, Kendall’s notation, etc.) • The simplest queue: M/M/1. • Some special properties of the Poisson process. • One special property of the M/M/1 queue. Today, we’ll continue to do calculations related to the M/M/1 queue. Remark 22.1. most of the calculations we do today are about the stationary distribution of the M/M/1 queue, or about the dynamics if the M/M/1 queue is started at its stationary measure. Let’s remember what this means. When you run a Markov chain {Xt }t≥0 , you have to choose its starting point X0 . In most of our discussion, we have been choosing this starting point deterministically - we often assume that the chain starts in ‘state 1,’ and we might assume that queues start with no people in line. However, we are also allowed to treat X0 as a random variable with some distribution. If the initial distribution of X0 is equal to a stationary distribution of the Markov chain, we say that we have started at stationarity, or started at the stationary distribution, or one of several similar phrases. It is important that you recognize this. Most of the calculations we do today assume that the queue is started at stationarity, and will not be accurate for the queue at some finite time t when it is started at length N (0) = 0. Before doing calculations, the textbook takes a break at this point to draw three types of pictures of queueing processes. I won’t draw them on the board, as they are quite intricate and it is worthwhile to look at them for more than a few seconds. Please look at the diagrams on pp. 397-398 of the textbook to understand what they are plotting. They aren’t necessary to do any of the problems, but I suspect that you will find them helpful, in much the same way that we used a diagram to understand example 131 last class. The textbook then takes another break to describe what sorts of questions are generally asked in queueing theory. This may be interesting, but we will skip it. The next important result holds for queues in a great deal of generality, not just for M/M/1 queues. 22.2.1. Little’s Law. For a queue process N (t), let a(t) denote the total number of arrivals by time t and let R t d(t) be the total number of departures by time t, so that N (t) = a(t) − d(t). Define g(t) = 0 N (s)ds. Next, define: λt = a(t) t 118 g(t) a(t) g(t) Lt = t Rt = and also λ = lim λt t→∞ R = lim Rt t→∞ L = lim Lt t→∞ when the limits exist. We then have: Theorem 133 (Little’s Law). L = λR. Remark 22.2. It is just a matter of algebra that Lt = λt Rt , so this shouldn’t be so surprising! Remark 22.3. This theorem is saying that the average number of people in the queue, L, is equal to the average arrival rate, λ, multiplied by the average amount of time it takes to get service, R. Remark 22.4. Again, Little’s law is about limits. Its conclusions may not be useful for finite time intervals. Example 134 (Similar to 11.1.5). At a given grocery store checkout, the average interarrival time is 4 minutes and the average amount of time spent in line is 8 minutes. What is the average number of people in line? L = λR 60 8 = ( )( ) 4 60 = 2. Remark 22.5. Questions about Little’s law are generally phrased as in this example. 22.2.2. M/M/1 queues as birth-and-death processes. Recall the definition of a (continuoustime) birth-and-death chain: Definition 22.6 (birth-and-death chain). A continuous-time Markov chain is called a birthand-death chain if: • Its state space is the nonnegative integers {0, 1, 2, . . .}. • The matrix qij satisfies qij = 0 whenever |i − j| > 1. Remark 22.7. The M/M/1 queue is a continuous-time Markov chain, with off-diagonal entries of the infinitesimal generator given by qi,i+1 = λ qi,i−1 = µ1i>0 qi,j = 0, |i − j| > 1 or i, j < 0. 119 Thus, it is a birth-and-death chain. Indeed, it is just about the simplest possible birth-anddeath chain. Remark 22.8. We have already (almost) seen this process before! Note that the transition matrix of the embedded chain is: λ pi,i+1 = λ+µ µ pi,i−1 = 1i>0 λ+µ pi,j = 0, |i − j| > 1 or i, j < 0 or i = j. λ Writing p = λ+µ , this is exactly the random walk with reflection that we studied in chapter 9. So, we will continue studying it now, but most of the answers we get should be no surprise. We have essentially already calculated the stationary distribution pn . Let ρ = µλ . We have: • When ρ < 1, pn = ρn (1 − ρ). • When ρ > 1, all states are transient and so there is no stationary distribution. This is obvious from our description of the process: on average, people are arriving faster than they are processed, so the size of the queue goes to infinity. • When ρ = 1, we know from earlier classes that all states of the embedded chain are null-recurrent. This means that, again, there is no stationary distribution. This is much less obvious than what happens when ρ > 1. Here, the average rate of arrival is equal to the average service rate, and so the queue has no reason to grow or shrink. However, the random fluctuations in the length of the queue tend to grow as time passes. If you are interested in the subject, this is worth simulating in R: I don’t think it is obvious from the math that there won’t be a stationary distribution, but it should be very easy to convince yourself of this fact by looking at the simulation. Remark 22.9. Notice that pn is the PMF of a geometric random variable. If you are studying for exams, this means that the stationary distribution of an M/M/1 queue is yet another way to state easy problems about geometric random variables. For example, the following. Example 135. Consider a line at a grocery store, which we model as an M/M/1 queue. Assume that people arrive at an average rate of 15 per hour, and that the average service time is 2 minutes. What is the probability of finding 2 or more people in the line, under the stationary distribution of this process? Note: we know that the stationary distribution is really just a geometric random variable. So, the only part of this question that is new is the work required to find the parameter! We have λ = 15 customer arrivals per hour and µ = 60 = 30 customers served per hour, 2 λ 1 and so ρ = µ = 2 . Let X be the number of people in line. Then 1 1 1 − = . 2 4 4 We now consider some ‘performance measures’ for these queues. The most obvious is the expected number of people in the queue. The most important is often the expected response time - that is, the expected amount of time that a customer will wait. I’ll only give a few of P[X ≥ 2] = 1 − P[X = 0] − P[X = 1] = 1 − 120 these for now; we will do a more comprehensive review when looking at the M/M/c queue next class. We first calculate the expected queue length L = E[N ]. When ρ ≥ 1, it is infinity. When ρ < 1, we use a little trick for calculating infinite sums: L= ∞ X npn n=0 = ∞ X n(1 − ρ)ρn n=0 = (1 − ρ)ρ ∞ X nρn−1 n=0 ∞ d X n ρ = (1 − ρ)ρ dρ n=0 = (1 − ρ)ρ = d 1 dρ 1 − ρ ρ . 1−ρ Writing this in terms of µ and λ, λ . µ−λ This trick of taking derivatives of infinite sums doesn’t always give the right answer, but is safe in this situation. You can learn about when this trick works in an introduction to analysis. The same trick gives the variance of N : ρ . V ar[N ] = (1 − ρ)2 Note that, for ρ very close to 1, these numbers are enormous! The average response time might seem harder to calculate. However Little’s law makes it easy. Recall that the average occupancy E[N ] is equal to the average arrival rate λ multiplied by the average response time E[R], so E[N ] = E[N ] 1 = . λ µ−λ Again, this goes to infinity as µ − λ goes to 0. The textbook calculates several other performance measures for queues, such as the average queue length conditional on the queue length being nonzero. We skip most of them, but highlight two definitions: E[R] = Definition 22.10. The throughput of a queue is the rate at which customers are processed. For the M/M/1 queue, this is min(λ, µ). The utilization of a queue is the percentage of the time that the server is being used. For the M/M/1 queue, this is min(1, µλ ). 121 Remark 22.11. We highlight two statistics that seem most important. The average response 1 time, µ−λ , measures how annoying it is to wait in a given line. The utilization, µλ , measures how efficiently the server in the queue is being used. There is obviously a big tradeoff here having low waiting times on average means having long idle times for the server. This isn’t surprising, but it is nice when the probability model picks up something that we all know from the real world. We can do better than using Little’s law to calculate the expected response time E[R] under the stationary measure. We can actually calculate the whole distribution. The calculation is a little involved, so we just give the conclusion: P[R > t] = e−(µ−λ)t . 1 Thus, R actually has exponential distribution with mean µ−λ . We do a standard problem on the M/M/1 queue before moving on to general birth-anddeath chains: Example 136 (Similar to 11.2.4 of textbook). Computer help desks are being set up on campus. It is known that every student being served by a help desk generates on average 1 query every 500 hours. It takes a worker at a help desk, on average, 12 minutes to respond to a query. If we want the average response time to be less than 20 minutes, how many students can each help desk serve? The first step is recognizing what equation we want to use. We are interested in the average response time, so we should use the formula: E[R] = 1 . µ−λ We want E[R] ≤ 20 and we know that µ is 60 = 5 queries per hour. What is λ, and how 12 is it related to the number of students being served? Let n be the number of students served. n Then the number of queries per hour is λ = 500 . Putting together these pieces of information, we need 1 1 ≥ E[R] = n . 3 5 − 500 Solving, 5− n ≥3 500 so n ≤ 500(5 − 3) = 1000. Remark 22.12. There are obviously many, many ways to ‘dress up’ this type of question. The calculation will always be easy in the end, but you must be able to figure out what to plug into the formula, and this is where the mistakes will be. Remark 22.13. Notice that n depends relatively little on our rather strict requirement that the service time should be less than 20 minutes. This is typical: waiting times are quite short until the efficiency ρ is very close to 1, and then they explode. 122 We now move on to general birth-and death chains. In keeping with our notation for the M/M/1 queue, the off-diagonal entries of the infinitesimal generators of these chains is written qi,i+1 = λi qi,i−1 = µi 1i>0 qi,j = 0, |i − j| > 1 or i, j < 0. So, the M/M/1 queue is a special case, with λi = λ, µi = µ for all i. The mathematics of these chains is very similar to the mathematics of the M/M/1 queue. The applications are a little less obvious: why would the arrival and service rate depend on the number of customers in the queue? At least one plausible reason is that people are less likely to join a queue that is extremely long, and so λi might decrease as i increases. We will see in the next section an example where µi changes as well. Back to the mathematics, we look again for the stationary distribution pn . From the definition of the stationary distribution, this must satisfy 0 = −(λn + µn )pn + λn−1 pn−1 + µn+1 pn+1 , 0 = −λ0 p0 + µ1 p1 . n≥1 Thus, λ0 p0 , µ1 λn + µn λn−1 = pn − pn−1 , µn+1 µn−1 p1 = pn+1 n ≥ 1. How do we solve this? A few observations: P • Since we are looking for a distribution, we know that n pn = 1. Thus, we have a ‘free parameter’ - we can find pn as a function of p0 , and then find p0 at the end. This makes life much easier! • We can get the first few values directly. We find: p0 = p0 λ0 p1 = p0 µ1 λ1 + µ1 λ0 λ1 + µ1 λ0 λ0 λ1 λ0 p1 − p 0 = p0 − p0 = p0 p2 = µ2 µ0 µ2 µ1 µ0 µ2 µ1 λ2 λ1 λ0 p3 = p0 µ3 µ2 µ1 λ3 λ2 λ1 λ0 p4 = p0 . µ4 µ3 µ2 µ1 It isn’t clear how to get a formula out of these calculations, but it certainly leads to a natural guess: 123 pn = p0 n Y λi−1 i=1 µi . The most natural tool for going from a guess to a proof is induction. Let’s do this carefully: Theorem 137. For all n ≥ 1, pn = p0 n Y λi−1 i=1 µi . (22.1) Proof. We know that equation (22.1) holds for n = 1, 2, 3, 4. To use the inductive framework, we assume that it holds for n ≤ m and show that it also holds for n = m + 1. To do this, we write: λm + µm λm−1 pm − pm−1 µm+1 µm−1 m m−1 λm + µm Y λi−1 λm−1 Y λi−1 = p0 − p0 µm+1 i=1 µi µm−1 i=1 µi m−1 Y λm + µm λm−1 λm−1 = p0 − µ µ µm−1 m+1 m i=1 pm+1 = = p0 m Y λi−1 i=1 µi . We used the assumption that equation (22.1) holds for n ≤ m in the second line of this calculation; the rest is just algebra. Q Following the definition of the textbook, let ρi = λµi−1 and let ζn = ni=1 ρi . We then have: i P • If Pn ζn = ∞, there is no stationary distribution. • If n ζn = ζ < ∞, there is a stationary distribution. It is given by ζn pn = . ζ Note that this agrees with our earlier result for the M/M/1 queue. It also gives us the following ‘reasonable’ way to define queues that remain finite: Theorem 138. Consider a birth-and-death chain that satisfies: • µi = µ for all i. • λi+1 ≤ λi for all i. Then the associated chain has a stationary distribution if and only if limi→∞ λi < µ. Remark 22.14. The limit in the above theorem always exists. The textbook gives a finer analysis of the limits of these birth-and-death chains. For example, it distinguishes between chains with transient states and chains with null-recurrent states. 124 23. Lecture 22: More Chapter 11: Multiserver Systems (Apr. 2) 23.1. Summary. • A few administrative details. • Studying more queues. • Note: this is Easter weekend. 23.2. Lecture. Today, we go from the M/M/1 queue to the M/M/c queue. Recall from Kendall’s notation that this means: • We have an infinite pool of potential customers, arriving in a single line. The interarrival times are a series of iid exponential random variables. • We have a collection of c servers. Once service has started, the service time is an exponential random variable. Remark 23.1. There is some asymmetry between the number of lines and the number of servers. Imagine that we were to make up a type of queue with c lines as well as c servers, and assume that the arrival process for each line is a Poisson process. By the superposition property, this is equivalent to a queueing process in which all customers arrive in a single line, as long as you assume that customers can move from a non-empty line to an empty line when they want to. This last assumption is generally true. On the other hand, an M/M/c queue is not equivalent to any M/M/1 queue. You can check this with the mathematics, but there is a very simple ‘intuitive’ reason for this that we understand from grocery stores: one customer can hop between lines before being served, but one customer cannot be served by more than one server. To be a little more concrete, assume that customers arrive very rarely (say once a month) and the service time is, on average, two minutes. This service time cannot be decreased by adding more servers - it will always be two minutes. An M/M/c queue is still a birth-and-death process. Assume that customers arrive at rate λ and that each server processes customers at rate µ. Using the notation for general birthand-death processes from last class, the M/M/c queue has parameters: λn = λ, µi = iµ, µi = cµ, for all n, 1 ≤ i ≤ c, i > c. Using the formulas for the stationary distribution of a general birth-and-death process, any stationary distribution for an M/M/c queue must satisfy: λn 1 pn = p0 n , 1 ≤ n ≤ c, µ n! λn 1 c−n pn = p0 n c , n > c. µ c! This is rather complicated. The formula for the solution to this stationary distribution is given in the textbook on pp. 420 and 421. I don’t write it here, as I don’t think it is more informative than the above. I do point out that: • If λ < cµ, a stationary distribution exists. • If λ ≥ cµ, no stationary distribution exists. 125 We now move on to performance measures. Unfortunately, the performance measures we calculated for the M/M/1 queue are quite difficult to calculate directly! To calculate this, we introduce several related quantities. Since we’ve seen so many performance measures, we’ll also review the old ones: • N : the number of customers in the system at steady state. Also define L = E[N ]. • Nq : the number of customers waiting in the queue at steady state. For M/M/c, we have |N − Nq | ≤ c. Also define Lq = E[Nq ]. ] . That is, the response time • R: the response time. From Little’s law, E[R] = E[N λ is the average number of people in the queue divided by the arrival rate. We write W = E[R]. • Wq : the average waiting time. Rewriting Little’s law, Lq = λWq . • L0q : the effective queue length, defined by L0q = E[Nq |Nq 6= 0]. Some of these are more interesting than others. Regardless of their practical use, it is very nice to know all of these definitions and relationships when doing computations. In particular, it will be easier to calculate Lq than L for the M/M/1 queue, so we start there: ∞ X (n − c)pn Lq = = n=c ∞ X n=c n cn−c c! n (cρ) p0 − ∞ X n=c c cn−c c! (cρ)n p0 . We can see why this is easier: pn has two formulas, one for n ≥ c and one for n ≤ c. Both formulas show up when calculating L, but only one is needed when calculating Lq . Carrying through this calculation, we find (ρc)c+1 p0 . Lq = c!(1 − ρ)2 c The textbook points out that we already have some simple formulas that allow us to find all of the other simple performance measures once we know Lq . It is worth emphasizing this: we only had to do the ‘hard work’ of actually computing an expectation once. The relations are: • Lq = λWq (this is essentially Little’s law), so we can find Wq . • W = Wq + µ1 . This observation doesn’t have a name, and is very simple but quite useful. The observation is that the mean waiting time (W ) is equal to the mean about of time that you wait before meeting a server (Wq ) plus the mean amount of time that you wait after meeting a server ( µ1 ). This relationship more-or-less holds for other queuing models as well, and is helpful to remember. This lets us find W . • L = λW (this is Little’s law), so we can find L. The formulas you get out of this are: Wq = W = λc µ µc (c − 1)!(cµ − λ)2 λc µ µc (c − 1)!(cµ − 126 p0 p0 λ)2 + 1 µ L= λc λµ µc (c − 1)!(cµ − p0 λ)2 + λ . µ We now do some useful calculations: Example 139 (Erlang-C formula). We model a supermarket as an M/M/c queue, with λ = 8, µ = 3 and c = 5. What is the probability that there is an empty check-out when you arrive? Assume that the queue is started at the stationary distribution. We could do this while plugging in the numbers, but it will be easier to do the algebra and then plug in at the end. Write A for the event that the check-out is empty when you arrive. We have P[A] = p0 + p1 + . . . + pc−1 = 1 − pc − pc+1 − . . . ∞ X cc ρ n = 1 − p0 c! n=c (cρ)c . (1 − ρ)c! This is called the Erlang-C formula. Plugging in, we have in this case = 1 − p0 P[A] ≈ 0.838. Next we try to understand: does being allowed to switch lines make a difference to the efficiency of a queue? Example 140 ((A tiny bit of) the Power of 2 Choices - Similar to example 11.11 of textbook). Fix k > 1. We consider the following two collections of queues: (1) There are k independent M/M/1 queues. Each has arrival rate λk and service rate µ. (2) There is 1 M/M/k queue with arrival rate λ and service rate µ. Note that, in both situations, customers arrive at total rate λ, and there are k servers with rate µ. The only difference is that, in the former situation, there are k separate lines, while in the latter, customers are allowed to switch lines at any time. Let N1 , N2 denote the number of customers in these two queues at stationarity. Similarly λ let R1 , R2 be the average response times. Let ρ = kµ . Then ρ E[N1 ] = k 1−ρ 1 k E[R1 ] = E[N1 ] = . λ kµ − λ (kρ)k λµ E[N2 ] = kρ + p0 (k − 1)!(k − kρ)2 1 E[R2 ] = E[N2 ]. λ When c = 2, we have 2ρ 1 E[N2 ] = 1−ρ1+ρ 127 E[R2 ] = 1 1 . µ 1 − ρ2 What difference does this make? We have 1 E[N1 ] 1+ρ 1 E[R2 ] = E[R2 ]. 1+ρ So, being able to swap lines always makes queues more efficient - sometimes almost twice as efficient. E[N2 ] = Example 141 (Similar to 11.4.3 of textbook). Consider a help desk that receives, on average, 12 calls per hour. It takes a help desk employee 10 minutes to respond to a call, on average. How many employees are needed if the average response time must be less than 2 hours? How many extra employees are needed if the average response time must be less than 12 minutes? We wish to use the formula λc µ µc 1 (c − 1)!(cµ − µ and solve the inequality W ≤ 2 for c. This is going to be fairly terrible to do by hand - this is a complicated formula. Instead, we plot in R, using: RespTime<-function(c,mu,lambda){ rho = lambda/(c*mu) pinv = ((c*rho)^c)/((1-rho)*factorial(c)) for(n in 0:(c-1)) { pinv = pinv + ((c*rho)^n)/factorial(n) } W = (1/pinv)*(mu*(lambda/mu)^c)/ ( factorial(c-1)*(c*mu - lambda)^2) + 1/mu return(W) } W = p0 λ)2 + res = 3:10 for(i in 3:10) { res[i-2] = RespTime(i,6, 12) } plot(res) which(res < 2) which(res < 1/5) We find that 3 servers are required to get a ‘finite’ expected service time, 5 are required to get a response time below 2 hours, and 6 are enough to get a service time below 12 minutes. NOTE: On an exam, we would not make you calculate W for so many values of c. We would ask a simpler question for example, we might ask if 25 servers would be enough for a desired response time. We might also pre-compute some of the values, such as p0 , since the calculations are fairly messy. 128 Having studied the M/M/1 queue and the M/M/c queue, the obvious next step is the M/M/∞ queue. That is, we consider a line as the number of servers goes to infinity. This queue can be understood quite well without using the Markov chain formalism that we have developed so far. For example: • Since there are infinitely many servers, nobody ever waits in line. • Since nobody ever waits in line, the mean time spent per customer must be µ1 . • From these two observations and Little’s law, the mean number of customers present in the queue must be µλ . However, the Markov chain formalism can still be somewhat helpful. An M/M/∞ queue is a birth-and-death chain with n∈N n ∈ N. λn = λ, µn = µn, When is there a stationary distribution? Recall that, for M/M/c queues, we had a stationary distribution whenever λ < cµ. By analogy, we expect a stationary distribution to exist whenever λ < ∞µ - that is, we expect there to always be a stationary distribution. This turns out to be correct. A stationary distribution must satisfy: pn = λn p0 . µn n! We have ∞ X λ λn = e− µ < ∞ n µ n! n=0 for any choice of µ, λ. Thus, a stationary distribution always exists, and we have pn = λn − µλ e . µn n! We note that this is the PMF of a Poisson distribution with parameter was probably not very obvious without using the Markov chain formalism. λ . µ This last fact Example 142 (Similar to 11.4.4 of textbook). At a certain restaurant, customers arrive at an average rate of 5 per hour. They stay, on average, for 30 minutes. The restaurant has 5 tables. Calculate the average number of customers present at any time using the M/M/5 model. Then calculate the same number, using the M/M/∞ approximation to the M/M/100 model. For the M/M/5 model, we use the equation: L = E[N ] = = ( µλ )c λµ (c − 1)!(cµ − ( 25 )100 10 99!(200 − ≈ 2.53. p0 5)2 129 + 5 2 p0 λ)2 + λ µ For the M/M/∞ model, we calculate directly that λ L = = 2.5. µ Remark 23.2. The M/M/∞ approximation is often a good approximation to the M/M/c model when c µλ . Often c = 3 µλ is good enough; as you can see in this example, even c = 2 µλ is acceptable. 130 24. Lecture 23: End of Chapter 11: More Special Systems (Apr. 9) 24.1. Summary. • A few administrative details. • Last bits of queueing theory. 24.2. Lecture. We finally add a fourth letter to our queueing notation: we look at the M/M/1/K queue. Recall that the fourth letter indicates the maximum possible length of a line, and that the M/M/1 queue could also be written as the M/M/1/∞ queue. In the notation of birth-and-death chains, the M/M/1/K queue has state space {0, 1, 2, . . . , K} rather than {0, 1, 2, . . .} and has transition rates λi λK µi µ0 i ∈ {0, 1, . . . , K − 1} = λ, =0 = µ, = 0. i ∈ {1, 2, . . . , K} We now look at the stationary distribution. We recall that the M/M/1 queue had a stationary distribution if and only if ρ = µλ < 1. We can tell immediately that the M/M/1/K queue will always have a stationary distribution, as follows. Recall from chapter 9 that a finite, irreducible continuous-time Markov chain always has a unique stationary distribution. The M/M/1/K is such a Markov chain, and so must have a unique stationary distribution. The only remaining task is to calculate it. We use the general formula for the stationary distribution of a birth-and-death chain to write pn = λn p0 , µn n ∈ {0, 1, . . . , K}. When ρ 6= 1, we have: p−1 0 = K X ρn = n=0 1 − ρK+1 . 1−ρ When ρ = 1, we have p−1 0 = K X ρn = K + 1. n=0 Thus, we find for n ∈ {0, 1, . . . , K} that 1 , K +1 1 − ρK+1 pn = ρn , 1−ρ pn = 131 ρ=1 ρ 6= 1. = ρ, regardless of the value of ρ. Let X be the number of Remark 24.1. we have pn+1 pn customers in a steady-state M/M/1/K queue with parameter ρ, and let Y be the number of customers in a steady-state M/M/1/K queue with parameter ρ−1 . This implies that P[X = n] = P[Y = K − n]. So, there is some symmetry between queues with ρ > 1 and queues with ρ < 1. Remark 24.2. We have already seen that the M/M/∞ queue can be a good approximation to the M/M/c queue when c is very large. This makes some sense: if it is rare for more than 3 check-out lines to be in use simultaneously, it hardly makes a difference if 50, or 500, or infinitely many are availble. Indeed, M/M/∞ queues generally have about µλ queues in use, and the approximation became good when c µλ . The same reasoning suggests that the M/M/1 = M/M/1/∞ queue could be a good approximation to the M/M/1/K queue when K is large: if there are rarely more than 10 people in line, it hardly makes a difference if the store has the capacity for a line of length 200, or 2000, or infinitely long. This idea suggests that the approximation should be good when K is much larger than the typical line length. This is basically correct. The one warning is that the M/M/1/K queue always has a stationary distribution for any value of ρ, but the M/M/1 queue requires ρ < 1. Thus, we can only expect the approximation to be good when ρ < 1. Most performance measures are similar for the M/M/1/K queue as they were for the M/M/1 queue. The expected system size is L= K X npn . n=0 When ρ 6= 1, this is L= ρ(1 − (K + 1)ρK + KρK+1 ) . (1 − ρ)(1 − ρK+1 ) When ρ = 1, this is L= K X n=0 n K = . K +1 2 To find the average system response time W , we would like to use Little’s law. However, Little’s law must be modified for M/M/1/K queues. Rather than the nominal rate of customer arrivals λ, we must use: Definition 24.3 (Effective rate of customer arrivals). The effective rate of customer arrivals for an M/M/1/K queue is λ0 = λ(1 − pK ). Remark 24.4. Recall that pK is the probability that the queue is full, under the stationary distribution. Thus, λ0 is the rate at which customers arrive and actually enter the queue. 132 Thus, Little’s law for the M/M/1/K queue says: L = λ0 W. Remark 24.5. For the M/M/1 queue, we effectively have “K = ∞,” and so λ0 = λ. Thus, this is a generalization of our previous version of Little’s law. You only need to memorize this one! Similarly, we must modify our definitions for throughput and utilization. Definition 24.6 (Throughput). Recall that the throughput of a system is the rate at which customers are served. We define the throughput to be X = λ(1 − pK ), which is exactly λ0 . Definition 24.7 (Utilization). The utilization of a server is the percentage of the time that the server is being used. This is given by λ U = 1 − p0 = (1 − pK ). µ Remark 24.8. For the M/M/1 queue, we defined X=λ λ U= . µ Thus, these definitions are again generalizations of the old ones. You don’t have to remember the old ones separately. Example 143 (Similar to 11.5.1). A barbershop has a single barber and 5 chairs. Assume it takes on average 15 minutes to give a haircut, that 3 people arrive per hour, and that people will leave the shop if they arrive when all chairs are full. • What percentage of arriving customers will eventually get a haircut? • What is the effective arrival rate? • What percentage of the time is the barber occupied? • What is the average response time? We begin this type of question by doing some initial calculations: λ = 3, µ = 4 and ρ = 0.75. Now for the specific questions: • Let pn be the stationary distribution of the queue. We are interested in (1 − ρ)ρ5 1 − p5 = 1 − 1 − ρ6 0.25(0.75)5 =1− 1 − (0.75)6 ≈ 0.928. • The effective arrival rate is given by λ0 = λ(1 − p5 ) ≈ 2.78. 133 • The occupation rate is 1−ρ 1 − ρ6 0.25 =1− 1 − (0.75)6 ≈ 0.696. 1 − p0 = 1 − • Using Little’s law, the average response time is 1 W = 0L λ 1 0.75(1 − 6(0.755 ) + 5(0.756 )) = 2.78 (0.25)(1 − 0.756 ) ≈ 0.612. 24.2.1. M/M/c/K Queues. We bring together our two generalizations of M/M/1 queues, and study M/M/c/K queues. We will assume that 1 ≤ c ≤ K. These queues are birth-and-death chains, with rates on {0, 1, . . . , K} given by λn = λ, λn = 0, 0 ≤ n ≤ K − 1, n ≥ K, and µ0 = 0, µn = nµ, µn = cµ, 1 ≤ n ≤ c, c ≤ n ≤ K. As with the M/M/1/K queue, the M/M/c/K queue always has a stationary distribution. Solving the equations as before, we have 1 λn p0 , 0 ≤ n ≤ c, n! µn 1 λn pn = n−c p0 , c ≤ n ≤ K, c c! µn where again p0 has a rather complicated formula that is in the textbook. As with the M/M/c queue, it is difficult to compute the average length of the queue L = E[N ] directly. Instead, one begins by computing Lq , then using the (modified!) version of Little’s formula to compute Wq , then recalling W = Wq + µ1 to calculate W , and finally using the (modified!) version of Little’s formula to compute L. pn = Example 144 (Similar to 11.6.2 from textbook). We consider the same barbershop. This time, we assume that there are 2 barbers. Assume it takes on average 15 minutes to give a haircut, that 3 people arrive per hour, and that people will leave the shop if they arrive when all chairs are full. 134 • What percentage of arriving customers will eventually get a haircut? • What is the effective arrival rate? • What percentage of the time is at least one barber occupied? We have again λ = 3, µ = 4. We also calculate p−1 0 c−1 K X 1 λn X 1 λn = + n! µn n=c cn−c c! µn n=0 ≈ 2.19. • This is λn p0 cn−c c! µn ≈ 0.993. 1 1 − p5 = 1 − So, as expected, more people get served if you have more servers. • The effective arrival rate is λ0 = λ(1 − p5 ) ≈ 2.98. • The occupation rate is 1 − p0 ≈ 0.54. So, much lower than when there was only one server. 24.2.2. M/M/c//M queues. We add a 5th letter to our notation, studying what the textbook calls a M/M/c//M queue. This is shorthand for the M/M/c/∞/M queue. It has the potential to be slightly confusing notation: the last M is a number, the total population of customers that can ever show up, but the first two M’s both stand for Markov. I’ll try to use M/M/c/∞/N instead. Note that this is equivalent to an M/M/c/N/N queue - if there are only N potential customers, there are certainly not more than N in the queue at any time. The prototypical example of an M/M/c/∞/N queue is a collection of N machines being used, which sometimes break, and a collection of c teams who repair the machines. Modelling this queue as a birth-and-death process, the rates are given by λn = λ(N − n), 0 ≤ n ≤ N, and µ0 = 0, µn = nµ, µn = cµ, 1 ≤ n ≤ c, c ≤ n ≤ N. Solving the steady-state equations, we find N! λn p0 , 0 ≤ n ≤ c, n!(N − n)! µn N! n! λn pn = p0 , c ≤ n ≤ N, n!(N − n)! cn−c c! µn pn = 135 where again the formula for p0 is fairly complicated and is in the textbook. The textbook gives some other, terrible, formulas for measures of efficiency. These are not worth memorizing. We do point out that the effective arrival rate has a nice formula. The arrival rate is λ(N − n) when n people are currently in the queue, and so 0 λ = N X λ(N − n)pn . n=0 But this is λ0 = N λ N X pn − λ n=0 N X npn n=0 = λN − λL. We can use this to calculate W by Little’s Law. Example 145. A certain office has 80 computers. It is known that, on average, 1 computer will break per day, and that each computer breaks down at a rate of once per 50 days. What is the average response time? 1 We have λ0 = 1, λ = 50 , N = 80. Thus, by the formula immediately before this example, L=N− λ0 = 80 − 50 = 30. λ By Little’s law, L 30 = = 30. 0 λ 1 1 Note: It is clear that λ = 50 , since λ measures the rate at which each individual computer 0 breaks. How do we find that λ = 1? We claim that the number of computers that will break per day is in fact the effective arrival rate! To see this, note that the effective arrival rate is always equal to the rate at which customers are added to the queue. Since all arriving customers are added to the queue in the M/M/c/∞/N model, the effective arrival rate is equal to the average arrival rate. In particular, even though the formula for effective arrival rate is different from the case of the M/M/1 queue, the effective arrival rate still describes the number of customers arriving per unit time. This is different from the case of the M/M/1/K queue, where the effective arrival rate was equal to the ‘nominal’ arrival rate λ multiplied by the probability that an arriving customer is actually added to the queue (i.e. the probability that the queue is not full). W = 136 25. Lecture 24: Final Exam Review (Apr. 13) As with the midterm review, we go backwards, from chapter 11 to chapter 1 (skipping chapter 10). This will have to be a little faster, unfortunately: we’ll only have times to sketch types of solutions and large classes of collections. You should also review the notes for the midterm review! I emphasize: As with the midterm review, this review will focus more on ‘tricky’ questions than the actual exam. This is because I expect you to do the ‘less-tricky’ practice problems from the textbook, and to check your answers. 25.1. Exam Basics. • The exam will be on April 20, 2015 from 7 PM to 10 PM. The exam will take place in DMS 1130. • Simple calculators will be allowed. The calculators allowed during final examinations of the Faculty of Science are: the Texas TI-30X, TI-30XA, TI-30SLR scientific and non programmable. Note that I don’t set this policy - I am just warning you to be careful! More information is available at http://www.uottawa.ca/cgi-bin/cgiwrap/regist/print.cgi?E/academic/info/regist/crs/ • The exam will be closed-book. There will be a cheat sheet attached to the exam, just like there was a cheat sheet attached to the midterm. This cheat sheet is available on my website. There may be lengthy formulas on the exam that are not on the cheat sheet, if necessary (for example, we might have the formula for the probability p0 that an M/M/c/K queue is empty). • The exam will cover what we have discussed in the classroom, as well as things that I have suggested you read from the textbook and the material in the homework. It will not cover any programming material. • In the textbook, this corresponds roughly to chapters 1-9 and 11. The noteable differences are: – Chapters 1,3,4,5: I will probably ask at least one question that is ‘trickier’ than those in the textbook. The midterm review has examples of ‘tricky’ questions. – Chapter 2: Not emphasized. – Chapter 7: There will be no questions about the Gamma distribution, I will not refer to Coxian distributions by that name, and there will be no questions about ‘fitting’ phase-type distributions to data. – Chapter 8: There may be questions about the normal or exponential approximation to the binomial, even though it isn’t in the textbook. There will not be questions on the difference between stronger and weaker forms of convergence. – Chapter 9: There will be no questions about reversibility. – Chapter 11: There will be no questions requiring you to use a graphical representation of a queue (though it may be helpful!). There will be no questions about Kendall’s notation itself, though you must recognize the types of queues that we discussed in class (i.e. you must know what an M/M/c/K queue is). There will be no questions about what the textbook calls ‘transient behaviour’ - e.g. section 11.2.3 for the M/M/1 queue, and analogous sections for the other queues. There will be no questions about state-dependent queues. – Overall: There will be no questions about programming. 137 • The exam will have 4 long-answer questions and 10 multiple-choice questions. The questions will be similar in spirit and difficulty to the midterm. The multiple-choice questions will be fairly uniform over the course. The long-answer questions will be mostly (though not exclusively) on material covered since the midterm. 25.2. Source of Review Material. You should study from: • The examples and questions in the textbook. • The questions from previous exams/midterms in this course. • The review sections of the lecture notes (there is a review from the midterm, a minireview of chapter 9, and this review for the final exam). • The homework problems and solutions. Now, types of problems: 25.3. Chapter 11. There are quite a few worked examples in the lecture notes for chapter 11, and I expect you to do well on the exam if you can do all of those questions and understand the remarks. That being said, here are the types of questions I see: • Detailed questions about the Poisson distribution. We spent a good deal of time discussing the Poisson distribution, and I can ask: – Straightforward computations (e.g. calculate P[N (4) − N (2) = 5]). – Use of special properties (e.g. any continuous-time counting process N (t) that has independent, stationary increments must in fact be a copy of the Poisson process; the superposition property). – Relationship between Poisson and Exponential distributions (e.g. inf{t : N (t) = 5} has distribution X1 + . . . + X5 , where Xi are iid exponential). All of these ‘special properties’ questions are really 2-part questions. The first part is recognizing that there is a special property to use. The second part is a ‘standard’ question from the first half of the course. For example, if you recognize that a continuous-time counting process with independent, stationary increments is a Poisson process in the first part of the question, you will then have to do some calculation with that Poisson process in the second half. • Many straightforward ‘plug-in’ questions are possible. This includes the renewal equation, Little’s law, calculations involving the stationary distribution of a queue, and performance statistics such as expected response time and utilization. Most of the work in these questions is understanding what parameters go where in which equations and remembering the definitions - the ‘actual math’ is generally pretty easy. • We know some facts about birth-and-death chains in general. You should be familiar with these facts and formulas. • There may be less-straightforward calculations involving Little’s law L = λW and other important performance statistics. See, for example, the convoluted calculation of L and W for the M/M/c queue. • There may be ‘tricky’ questions that revolve around the existence of a stationary distribution. Any such question can be answered using our general condition for the existence of a stationary distribution for birth-and-death chains, but most of the time your intuition should be a very good guide. 138 There won’t be questions directly about definitions (e.g. Kendall’s notation, what a counting process is, etc), though I expect you to more-or-less understand them and to be able to use them. See the examples in class notes and the textbook for typical non-trick questions. 25.4. Chapter 9. This chapter was very long! There are some worked examples in the lecture notes, but fewer than chapter 11. Here are some of the main question types, in order roughly corresponding to the lecture notes: • Initial explicit calculations. For example, I give you a transition matrix P and say X1 = 1; I then ask you for P[X3 = 3]. The ‘standard’ answer is to write something like: X P[X3 = 3|X1 = 1] = P[X3 = 3|X2 = k]P[X2 = k|X1 = 1] k and then to grind out the sum. You should also be able to calculate P[X4 = 3] by doing a longer, but very similar, calculation! • Special properties: we already discussed the fact that the Poisson distribution has some special properties, and that these properties let you do some calculations that look difficult at first glance. The same is true for Markov chains. For example, we saw that sojourn times have geometric or exponential distribution, we had some nice properties for the embedded chain of a Markov chain, and so on. As with the ‘special property’ questions for the Poisson distribution, these questions often effectively have two parts: recognizing the property, then doing the calculation. These questions can be a little tricky. • Explicit calculations of transition probabilities using special formulas. For example, we saw the Chapman-Kolmogorov equations, we used linear algebra to compute things like P[X200 = 2|X1 = 1], and so on. These questions are generally identical to the first type of question (where we were trying to calculate things like P[X3 = 2|X1 = 1]), but one needs to do a slightly more involved calculation. I promise that any question of this sort will have minimal computational burden (there will be no diagonalizing of 10 by 10 matrices). As discussed in earlier classes, there is some room for trickiness here. Here is a question that is trickier than any question on the exam, but might have a similar flavour to one of them. Define 0.1 0.4 0.3 0.2 0 0 0 0.5 0.3 0.0 0.2 0 0 0 0.2 0.2 0.4 0.2 0 0 0 0 0 0 1 0 0 P = 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 and let Xt be a Markov chain evolving according to the transition matrix P . What is P[X200 = 7|X1 = 1]? At first glance, this question looks terrible! Answering this the way we have answered questions in class involves finding all of the eigenvalues for a 7 by 7 matrix. So we should do something clever. Notice that the chain has two pieces: there is a 139 mess involving states {1, 2, 3}, and then a line going from state 4 to 5 to 6 to 7. Let τ = inf{t : Xt = 4}. Draw picture on board! By this observation, P[X200 = 7|X1 = 1] = P[τ ≤ 197]. • • • • But then notice that the transition probability from any of state {1, 2, 3} to 4 is always 1 0.2. Thus, τ is a geometric random variable with mean 0.2 = 5, and so we can calculate P[τ ≤ 197] explicitly! In this case, the number is basically 0. There was a very similar (though slightly easier) phenomenon in example 123 from the lecture notes. To prepare for this type of question, you should understand that example as well, and how it relates to this example. Classification of states: we have a large number of words to describe states of Markov chains, and the Markov chain as a whole. You must memorize them all! As I suggested in the short review of chapter 9, it is often easier to remember the classifications if you remember ‘prototypical examples.’ I am guaranteeing that there will be a classification question on the exam. First important statistic: mean recurrence time. Remember that we have many formulas for this. One is a nice set of matrix equations. Another, which we used more often in class, was to set up simple recurrences. A third was to recall that the stationary distribution often satisfies πi = M1ii . The first is quite general but hard to use. The second is quite general, though it requires a bit more setup. The third is by far the easiest to use when it applies - but it doesn’t always apply! You should at least be familiar with the second and third methods. P Second important statistic: mean occupation time (that is, ∞ t=0 P[Xt = i]). Remember that we calculate this by taking the ‘transient part’ T of the potential matrix, then looking at the entries of (Id − T )−1 . Random walk problems: the calculations here are quite difficult, and you won’t be expected to solve the complicated recurrences on an exam. From my point of view, the point of reviewing this section is that you should be able to make use of related formulas that are given to you (as you did in a homework assignment) to do calculations and to classify the Markov chain. Let’s make this concrete. I might give you a transition matrix P and say that the only solutions z to the equation zP = z satisfies: z0 zn+1 = n+1 for n ∈ {0, 1, . . .}. Does the associated matrix have a stationary distribution? To answer this, we note that the stationary distribution would have to satisfy 1= ∞ X n=0 zn = z0 ∞ X n=0 1 = ∞. n+1 This can’t be the case, so the chain has no stationary distribution. On an exam, the non-convergence would likely be even more obvious. • Stationary distributions: You should be able to calculate stationary and limiting distributions. We have three approaches: (1) Diagonalize the entire matrix and take powers. This is the first thing we did. It is never a good idea for calculating a stationary distribution, but can be a 140 good way to calculate a limiting distribution if there is not a unique stationary distribution. (2) Solve zP = z or zQ = 0. These are great ways to find the stationary distribution(s), and can be used to calculate the limiting distribution if there are equal. (3) Use zi = M1ii when it applies. In addition to calculating the stationary distribution, you should be very familiar with the theorems (and especially the standard examples we discussed in class) that explain when stationary and limiting distributions exist, when they are unique, and when they are equal. This was one of the main topics of the chapter 9 review lecture, so I won’t give more examples here. • Continuous-time Markov chains: almost everything here is very similar to discretetime Markov chains. I emphasize: – The only ‘new’ formula is for the embedded chain. – Many of the ‘old’ formulas look different. For example, the stationary distribution satisfies zQ = 0 rather than zP = z. – Some, but not all, calculations can be done using the embedded chain. Understand the calculations for which the embedded chain is enough (e.g. classifying states of a finite Markov chain) and those for which it is not (e.g. computing a stationary distribution). 25.5. Chapter 8. We saw: • Markov’s inequality and variants (most questions involve just plugging in). • Law of large numbers. • Central limit theorem and the idea of a ‘normal approximation’ to a sum of random variables. 25.6. Chapter 7. We saw: • Many explicit families of continuous random variables. Most questions look like ‘X is exponential with inverse-mean λ = 4, what is P[X < 2]?’ Some questions try to disguise themselves a little, e.g. by saying V ar[X] = 12 and asking for E[X]. • We saw phase-type distributions. This is very similar to the above, but with more complicated formulas. • We saw reliability theory. 25.7. Chapter 6. Like chapter 7, but with discrete random variables. 25.8. Chapter 5. Lots of computations that should be second-nature at this point: • Expectations and conditional expectations (this is ‘just calculus’). • Generating functions show up, both as items to calculate and as tools for proving that distributions are the same. • On the midterm, this chapter had the most emphasis on ‘tricky’ questions that were about inequalities rather than equalities. For example, noting that if X ∈ [a, b], then E[X] ≥ a. In general, as with the midterm, most early-course tricky questions are about trying to do a calculation with some information missing and then subbing in an obvious inequality at that stage. 141 25.9. Chapter 4. Computational questions are similar to chapter 5. No generating functions. 25.10. Chapter 3. Questions in this chapter are really about remembering definitions. 25.11. Chapter 2. Not emphasized on the exam, except insofar as it is used in other chapters. 25.12. Chapter 1. Lots of computation that should be straightforward at this point. As was the case in your first courses in probability, much of the trickiness comes from trying to apply Bayes rule - remember that it requires you to come up with a partition, and not all questions come with an ‘obvious’ partition. After chapter 5, this is the most likely place to see ‘tricky’ questions from the early material. 26. Final Exam: April 20, 2015 Good luck! 27. Miscellaneous Last updated on April 9, 2015. 142
© Copyright 2025