Lab Manual for Anth/Biol 5221 c Alan Rogers and Jon Seger September 22, 2014 Contents 4 Simulating Genetic Drift and Mutation 4.1 The easy way to sample from an urn . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Iterating over a list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Simulating drift and mutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 4 4 5 Simulating Gene Genealogies 8 5.1 Exponential and Poisson random variates . . . . . . . . . . . . . . . . . . . . . . . . 8 5.2 The Coalescent Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 6 Using Simulation to Test a Statistical Hypothesis 6.1 A statistic and its sampling distribution . . . . . . 6.2 Using sampling distributions to test hypotheses . . 6.3 Sampling distributions from computer simulations 6.4 Confidence intervals . . . . . . . . . . . . . . . . . 7 Simulating Selection and Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 12 13 14 17 19 8 Using HapMap 22 8.1 Choosing a population and chromosome at random . . . . . . . . . . . . . . . . . . . 22 8.2 Downloading data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 8.3 Using the pgen module to work with HapMap data . . . . . . . . . . . . . . . . . . . 23 9 Heterozygosity of HapMap SNPs 27 9.1 Storing values in a list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 9.2 Making a scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 10 Simulating Selection and Drift at Two Loci 30 10.1 Sampling from the multinomial distribution . . . . . . . . . . . . . . . . . . . . . . . 30 10.2 An incomplete program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 11 The Site Frequency Spectrum of HapMap SNPs 35 11.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 11.2 The site frequency spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 1 CONTENTS 2 11.3 Theoretical spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 12 Linkage Disequilibrium in the Human Genome 12.1 Correlating diploid genotypes . . . . . . . . . . . 12.2 Calculating variances and covariances . . . . . . 12.3 Smoothing data . . . . . . . . . . . . . . . . . . . 12.4 An incomplete program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 40 41 43 44 13 Linkage Disequilibrium Near the Human Lactase Gene 47 13.1 An incomplete program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 A Downloading HapMap data A.1 Interacting with the HapMap server . A.2 Uncompress the data files . . . . . . . A.3 The dummy data file . . . . . . . . . . A.4 Put the data files into the folder where A.5 Downloading all the HapMap files . . . . . . . . . . . . . . . . . . . . . . . you plan to . . . . . . . . . . . . . use . . . . . . . . . . . . . . them . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 50 52 53 53 53 B More Python B.1 Basic Python . . . . . . . B.2 The random module . . . B.3 The pgen module . . . . . B.4 Helping Python find input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 54 54 55 57 . . . . . . . . . files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Project 4 Simulating Genetic Drift and Mutation In real populations, allele frequencies change because a finite number of gametes are sampled from the parental generation to create the “gene pool” of the offspring generation. On average, the allele frequency of the offspring generation is the same as that of the parental generation, but in any particular case it will be slightly higher or lower. These random changes in allele frequency are called genetic drift. The conventional way to model genetic drift involves the “urn” metaphor. We think of the parental gene pool as an “urn” from which gametes (gene copies) are drawn at random, and we think of the offspring gene pool as an initially empty urn to be populated by the gametes drawn from the parents. Probability theory tells us about the distributions of outcomes to expect, given the population size (N ) and the allele frequency (p) among the parents. But what if we want to observe a particular instance of this process, or more to the point, a sequence of such instances, representing the long-term evolution of the population? Before computers, this was very difficult. Now it’s easy and enlightening. 4.1 The easy way to sample from an urn Each time you draw a ball, it is red with a probability (p) that equals the relative frequency of red balls within the urn. This is just like tossing a coin, whose probability of “heads” is equal to p. You already know how to do that because we did it in a previous project. Thus, you have all the tools you need to write a computer simulation of genetic drift. In this project, however, we will show you an easier way. In that previous project, you tossed coins using Python’s random function. We could use that here again, but there’s an easier way. On the class website there is a file called pgen.py which contains several functions that we have written for you. One of these is called bnldev. As explained on on page 55, bnldev generates random values (deviates) from the binomial distribution. Before using it, you must import the bnldev function into your program. After that, the command bnldev(N,p) will generate a binomial random deviate with parameters N (the number of trials) 3 PROJECT 4. SIMULATING GENETIC DRIFT AND MUTATION 4 and p (the probability of “success” on each trial). Here is an example: from pgen import bnldev for i in range(5): print bnldev(10000, 0.5) These three lines of code do five repetitions of Kerrich’s coin experiment. In each repetition, bnldev(10000, 0.5) simulates the entire process of tossing a fair coin 10000 times. Try this on your computer. You will be amazed at how fast it runs. 4.2 Iterating over a list Before we get started in earnest, you need to learn a Python trick. Below, you will need to run a simulation repeatedly, each time with a different value of the mutation rate, u. We want to show you two ways of doing that. Instead of doing an entire simulation, each pass through the loop in my examples will simply print the current value of u. Here is the first approach: u = [0.0001, 0.001, 0.01] for i in range(len(u)): print u[i] Consider how this works. The range command generates the following list: [0,1,2]. Then the for command iterates through this list. It is not necessary, however, to go through this extra step. The for command will iterate through any list, whether it was produced by range or not. Here is a re-write that avoids the range and len commands: for u in [0.0001, 0.001, 0.01]: print u The second approach is simpler, and not only because it leaves out range and len. It also allows one to refer to the current mutation rate simply as u rather than using the more cumbersome u[i]. Our bias in favor of the second method is pretty obvious, but you can use whichever method you please. 4.3 Simulating drift and mutation On his own page 56, Gillespie presents a program that simulates genetic drift. Here is an improved version, which uses bnldev: PROJECT 4. SIMULATING GENETIC DRIFT AND MUTATION 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 5 from pgen import bnldev twoN = 100 # population size (number of gene copies) maxgen = 100000 # maximum generation count p = 0.5 # initial frequency of allele A H = 2*p*(1-p) # initial heterozygosity g = 0 # count generations # Loop continues until heterozygosity is exhausted or we reach # the maximum number of generations. print "%4s %8s %8s" % ("g", "p", "H") while H > 0 and g < maxgen: # Draw balls from urn. x is a binomial random # variate representing the number of copies of A # in the next generation. x = bnldev(twoN, p) p = float(x)/twoN H = 2*p*(1-p) g += 1 print "%4d %8.3f %8.3f" % (g, p, H) You will find this code on the lab page of the course web site in a file called drift.py. Download it (right-click and select “save as”) and run it a few times. It should run to completion in 100 generations or so. The middle column of output is p, which should wobble around and end up either at 0 or 1. Meanwhile, H declines fairly steadily from 0.5 to 0. Make sure you understand the code. In line 17, x is the number of copies of some allele (say A1 ) in the new generation. Using x, the program calculates new values of p and H. Finally, it increments the count of generations and prints a line of output. This program simulates drift in its purest form—no other evolutionary force is involved. In this exercise, you will add mutation into this simulation. In the end, you will need to loop over several values of u, the mutation rate. For the moment, however, start small by defining a single value. Put the line u = 0.01 just after line 5 in your program. (Refer to the listing above for the line numbering.) You have now defined a mutation rate. We will assume that the rate of mutation from A1 to A2 is the same as the rate from A2 to A1 . How should these assumptions be incorporated into the simulation? The mathematical details depend on the order of events during the life cycle. Let us suppose that the life cycle has the following steps: 1. Each individual produces many many gametes. Because there are so many gametes, the frequency of A1 among them is essentially identical to that among parents. In other words, there is no genetic drift at this stage. A fraction u of the copies of A1 become A2 , and a PROJECT 4. SIMULATING GENETIC DRIFT AND MUTATION 6 fraction u of the A2 become A1 . As a result, the frequency of A1 becomes p0 = p(1 − u) + (1 − p)u. Among the gametes, a fraction p were allele A1 . Of these, a fraction 1 − u failed to mutate and are therefore still A1 . Thus, p(1 − u) is the fraction of the gametes that carried A1 and did not change, whereas (1 − p)u is the fraction that were A2 before mutation but are A1 after. 2. From these gametes, a smaller number (2N ) of gene copies is chosen to become adults in the next generation. This is where drift happens. In the new generation, the number of copies of A1 among the adults is a binomial random variable with parameters 2N and p0 . Exercise We are finally in a position to do something interesting. When the mutation rate is very low, heterozygosity behaves as though there were no mutation at all—it declines to 0. Presumably, this is also true when the mutation rate is very nearly 0. This is not true, however, for larger values of u. The question is, under what circumstances is the effect of mutation large enough to matter? In other words, what is the smallest value of u for which H is appreciably greater than 0? The answer will certainly depend on u, because u measures the force of mutation. It may also depend on N , if the answer turns out to depend on the force of mutation relative to that of drift. To answer these questions, the first step is to put mutation into your drift program: 1. Modify your drift program so that mutation occurs in each generation. This will involve adding a line of code in between lines 13 and 14 in the listing above. Run your modified program a few times. It should still run to fixation, but it should take more generations. 2. Modify your program so that it runs the simulation repeatedly, once for each of the several mutation rates. Use at least four rates, which span the range from 0.0001 to 0.1, using the method discussed above. Think carefully about how you want to divide this interval up. Do you want to space the rates evenly on an arithmetic scale? On a logarithmic scale?1 Does it matter? You will need to add a for statement just after line 5. This new loop will enclose lines 6–22 of the original program, and those original lines will need to be indented. Remove the print statement (line 22). The new program should have a single print statement, which should execute at the end of each pass through the outer loop. It should print the current values of the following quantities: 2N u 4N u H g twice population size mutation rate per gene copy per generation twice the number of mutations per generation in the population as a whole heterozygosity count of generations 1 On an arithmetic scale, the difference between each pair of adjacent mutation rates is the same. For example: 1,2,3, and 4 are equally spaced on an arithmetic scale. On a logarithmic scale, the ratios are the same. For example: 1, 10, 100, and 1000. PROJECT 4. SIMULATING GENETIC DRIFT AND MUTATION 7 Format it so that the numbers line up in columns. In the output, pay close attention to the printed value of gen. It tells you how many generations elapsed before either fixation of one allele or the other. If neither allele was fixed, then H will be greater than 0, and gen will equal maxgen. 3. Use this program in an effort to answer the question posed above. Search for the smallest value of u (and 4N u) for which H is appreciably greater than 0. Let’s call these the critical values of u and 4N u. To make your answers more precise, play around with the range of u values in your simulation. But do not get carried away. We are not expecting high precision. Just try for an answer that is correct within roughly a factor of 5. 4. Having found the critical values under a specific assumption about 2N , do it again with a larger value of 2N —one that is 10 or 100 times as large. As population size increases, do the two critical values go up, go down, or stay about the same? Is the pattern simpler in terms of u or 2N or 4N u? 5. Write a sentence or two summarizing the pattern in your results. Attempt to answer the question posed above. Turn in your program and a sample of its output, along with a short prose summary. Project 5 Simulating Gene Genealogies This project introduces a procedure—a sort of recipe—for simulating genetic drift. In computing, recipes are called “algorithms.” The one below is called the “coalescent algorithm.” It is based on the theory about gene genealogies that you have been learning in lecture and in the text. We will give you a simple version of the algorithm and ask you to modify it. You used a different method to simulate genetic drift in Project 4. There, you kept track of each copy of every allele in the entire population. In the new approach described here, we keep track only of the copies within our sample. This is one of several reasons why the coalescent approach is so powerful. This approach also differs from the traditional one in going backwards through time. As we travel forward in time, genetic drift reduces variation within populations and increases that between them. Thus, we focus on these variables in forward-time simulations. In backwards time, drift implies that individuals share ancestors. The smaller the population, the more recent these ancestors are likely to be. Thus, backward-time simulations focus on shared ancestors. Yet in spite of this difference in focus, both kinds of simulation describe the same biology. 5.1 Exponential and Poisson random variates In this project you will generate random variates from two probability distributions, the exponential and the Poisson. (To refresh your memory about these, see Just Enough Probability.) The random module of the Python Standard Library contains a function called expovariate, which generates exponential random variates. Python has no built-in generator for Poisson random variates, however, so we provide one called poidev in our module pgen.py. You will need to download pgen.py from the class web site. To make these functions available to Python, start your program with from random import expovariate from pgen import poidev The expovariate function (described on page 55) returns a value drawn at random from the exponential distribution. The single parameter of this distribution is called the hazard or the rate. In this project, we will be interested the hazard of a coalescent event. In class we discussed this hazard in the context of the standard model for neutral genes and constant population size. For example, 8 PROJECT 5. SIMULATING GENE GENEALOGIES 9 in a sample of two gene copies, the hazard of a coalescent event is h = 1/2N per generation, where 2N is the number of gene copies in the population. We can use expovariate to simulate the time since the last common ancestor (LCA) of two random gene copies. For example, >>> from random import expovariate >>> twoN = 1000.0 >>> expovariate(1.0/twoN) 380.65658779723157 Here, we handed expovariate a hazard of 1/2N , and it returned a value of 380.66 generations. Think of this as the number of generations since the LCA of a pair of gene copies drawn at random from a population in which 2N = 1000. Do this a few times yourself in Python’s interactive window to get a feel for how such intervals vary in length. To sample random values from the Poisson distribution, we use the poidev function, which is described on page 56. We will use this distribution to model the number of mutations that occur within a given gene genealogy. In this context, the mean is uL, the product of the mutation rate u and the total branch length L. (The total branch length is the sum of the lengths of all branches within the genealogy.) The command poidev(u*L) generates a random number that represents the number of mutations within such a genealogy. Here are a couple of examples: >>> >>> >>> >>> 7 >>> 5 from pgen import poidev u = 1.0/1000 L = 4000 poidev(u*L) poidev(u*L) The two returned values—7 and 5—represent the numbers of mutations within two random genealogies within which L = 4000 and u = 1/1000. Do this a few times yourself. Combining the exponential and Poisson distributions We usually do not know the lengths of branches in a gene genealogy. What we do know are the values of genetic statistics such as S, the number of segregating sites. In class we discussed the model of infinite sites, which implies that S equals the number of mutations that occurred along the genealogy. How can we simulate that? Let us go back to the simplest case: the genealogy of a pair of neutral genes in a population consisting of 2N gene copies and with mutation rate u. We proceed in two steps. In the first step, we use expovariate(1.0/twoN) to get a random value of t, the depth of the genealogy. Since there are two lines of descent, the total branch length is L = 2t, and the expected number of mutations is 2ut. Next, we call poidev(2*u*t) to get the number of mutations along a random genealogy. Here’s an example: >>> from pgen import poidev >>> from random import expovariate >>> u = 1.0/1000 PROJECT 5. SIMULATING GENE GENEALOGIES 10 >>> twoN = 4000.0 >>> poidev(2*u*expovariate(1.0/twoN)) 15 The result—15—represents the number of mutations between a random pair of genes drawn from a population in which 2N = 4000 and u = 1/1000. Do this yourself a few times yourself. Usually, we work with much larger samples, and we cannot get t directly from expovariate. In this project, you will use the coalescent algorithm to find the total branch length (L) of a random gene genealogy. Then you will use poidev to add mutations, just as in the preceding paragraphs. 5.2 The Coalescent Algorithm The coalescent algorithm begins in the current generation with a sample of K gene copies. As we trace the ancestry of this sample backwards in time, it will occasionally happen that two of the ancestral genes are copies of a single gene in the previous generation. This is called a coalescent event. With each coalescent event, the number of ancestral gene copies is reduced by one until eventually only a single gene copy remains. This is the LCA of all the genes in the sample. The time interval between two adjacent coalescent events is called a coalescent interval. As you learned in class, the length of each interval is an exponential random variable. In the computer program, we can use this fact to step from interval to interval. We need not concern ourselves with individual generations. As you learned in class, the hazard of a coalescent event is h = x(x − 1)/4N per generation, where 2N is the number of gene copies in the population, and x is the number (during some previous generation) of gene copies with descendants in the modern sample. Below, we use this fact together with expovariate to obtain simulated lengths of coalescent intervals. The coalescent algorithm is very simple to code. Here’s a program called coal_depth.py, which uses it: 1 2 3 4 5 6 7 8 9 10 11 12 13 from random import expovariate K = 30 # sample size (number of gene copies) twoN = 1000 # population size (number of gene copies) tree_depth = 0.0 # age of last common ancestor in generations # Each pass through loop deals with one coalescent interval. while K > 1: h = K*(K-1)/(2.0*twoN) # hazard of a coalescent event t = expovariate(h) # time until next coalescent event tree_depth += t print "%2d %7.2f %7.2f REMOVE ME" % (K, t, tree_depth) K -= 1 PROJECT 5. SIMULATING GENE GENEALOGIES 14 15 11 print "Tree depth:", tree_depth, "generations" Download this program from the lab page of the class web site, and run it a few times. At the end of each run, it reports the depth of a random gene tree. In this code, K represents the number of gene copies. We set it initially to 30, the number of gene copies in the modern sample. This value is reduced by one with each coalescent interval. The heart of the algorithm is the while loop in lines 8–13. Each pass through that loop deals with one coalescent interval. Within the loop, the program calculates the coalescent hazard and hands it to expovariate, which provides the length (t) of the current coalescent interval. This version of the program does nothing but add these values together to obtain the time since the LCA of the sample. Exercise 1. The code above generates simulated values of the tree’s depth—the number of generations from the present back to the LCA. Modify the code so that instead of depth, it simulates the total branch length (L). To see how to do this, consider an interval during which there were 18 distinct gene copies in the tree. If that interval lasted 45 generations, then it would contribute 45 to the depth of the tree, as shown in line 11 of the code above. However, this interval would contribute 45 × 18 to L. Modify the code accordingly. 2. Once your program is producing random values of L, add a line near the top that sets u = 0.001. Then, at the very end, use poidev as explained above to get the number of mutations in the genealogy. (This number is a Poisson-distributed random variable with mean uL.) At this stage, the program should produce a single line of output showing the number of mutations on a random tree, given the population size and mutation rate. 3. A few years ago, Jorde et al published data in which the number of segregating sites was S = 82 for a mitochondrial sample from 77 Asians. This is a good estimate of the number of mutations in the mitochondrial genealogy of these subjects. In these data, the mutation rate is thought to be about u = 0.001 per sequence. Let us entertain the hypothesis that the Asian female population was historically very small: say 2N = 5000. Use these values to set u, twoN, and K within your program, and then run it 20 times, keeping track of the results. 4. Use these results together with the observed data (S = 82) to evaluate the idea that the simulation model describes reality. How many of the simulated values are smaller than 82? How many are larger? Does the observed value represent an unusual outcome, or is it pretty typical of the simulated values? In the last step above, we are not asking for any detailed quantitative analysis. Just compare the observed to the simulated values and make a subjective judgement. We’ll show you in the next project how to put such conclusions on a firmer basis. Your project report should include (1) your program, (2) the 20 simulated values of S, and (3) a few sentences summarizing the conclusion that you reached in the last step. Project 6 Using Simulation to Test a Statistical Hypothesis In Project 5 you used a coalescent simulation to evaluate a hypothesis about 2N and u, two parameters that play central roles within population genetics. Your evaluation was subjective because we did not ask for any formal statistical test. In this project, you’ll do it right. But first, you need to learn some basic principles. We begin with two definitions. 6.1 A statistic and its sampling distribution All data have noise. They are noisy because of measurement error, because of sampling error, and because of unpredictable factors involved in the phenomena we study. This is why we need statistical methods in the first place. Things that we calculate from data are called statistics. Familiar examples include the mean and variance of a sample. Less familiar examples include estimates of the site frequency spectrum and the mismatch distribution. When a gene genealogy is estimated from data, even that is a statistic. Because they are based on noisy data, statistics are noisy too. In this context, “noise” refers to random variation. Scientists deal with such randomness using probability theory, and you should be familiar with that before reading further. (See Just Enough Probability.) Scientific hypotheses are often tested by comparing a statistic to its sampling distribution. To get this idea across, we re-visit a simple problem: that of tossing a coin. As you learned in Just Enough Probability, John Kerrich tossed a coin 10,000 times and observed 5070 heads. This number is a statistic. How could we use it to test a hypothesis? Suppose we suspected that Kerrich’s coin was slightly unfair, that the probability (p) of heads was 0.45 on each toss. To test this idea, we would need to know whether the value that Kerrich observed is unusual in similar experiments for which p really does equal 0.45. Suppose that we somehow managed to manufacture such a coin and then used it in 100,000 repetitions of Kerrich’s experiment. Suppose in addition that in none of these was the number of heads as large as Kerrich’s value, 5070. Then Kerrich’s result would clearly rank as unusual, and this might lead us to doubt the original hypothesis that p = 0.45. This is the reasoning that usually underlies tests of statistical hypotheses. The difficulty is that usually (as here) these repetitions are impossible to carry out. 12 PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS Probability Density f (x) . ................... .... .... ... ... . . ... .. ... . . ... . . ... . . . ... . . ... . . ... . ... .. . ... . . . ... . . ... .. . .. ... . . ... . ... . . .... . . . .... . . . . .... .. . . ...... . . . . . . ......... . . . .. .... .............. .. ...... ............ .................. .............................. .. ............ .............. . . ....... . . . . . ............................ x0 x1 x 13 Cumulative Distribution ............... ..................... ......... ....... . . . . . . ..... ... ... ... . . . . ... ... ... ... . . . ... ... ... ... . . ... ... .... ... . . . ..... ...... ....... ......... . . . . . . . . . . . . . . . . . ................... x0 0.9 F (x) 0.1 x1 x Figure 6.1: Two ways to graph a sampling distribution. Areas under the density function correspond to probabilities. For example, the left shaded region comprises a fraction 0.1 of the total area, because 0.1 is the probability that X < x0 . On the right, the vertical axis shows F (x), the probability that X < x. For example, F (x0 ) = 0.1 and F (x1 ) = 0.9. There is no way to manufacture a coin for which p is exactly 0.45. Even if that were possible, who would volunteer to toss that coin 100, 000 × 10, 000 = 1 billion times? Nonetheless, let’s pretend that we had really done 100,000 repetitions of Kerrich’s experiment under conditions that guarantee that p = 0.45. We could calculate the number of heads from each repetition and then summarize these values using a frequency distribution. Since the number of repetitions was so large, this frequency distribution would approximate a special sort of probability distribution called a sampling distribution. The sampling distribution of a statistic is its probability distribution, but only in a restricted sense: it is the probability distribution implied by a particular hypothesis. Different hypotheses imply different sampling distributions, even for the same statistic. In our example, we assumed that p = 0.45. A different hypothesis would imply a different sampling distribution. In short, a sampling distribution is the probability distribution of a statistic, as implied by a particular hypothesis. 6.2 Using sampling distributions to test hypotheses Let us postpone the question of how one gets a sampling distribution. For the moment, our focus is on using one to test a hypothesis. It is sometimes useful to distinguish between a random variable and the particular values it may take. In this section, we use upper case for the former and lower case for the latter. Thus, X represents a random variable and has a probability distribution, but x is just a number. Figure 6.1 shows two ways to graph the sampling distribution of a hypothetical statistic, X. In the left panel, the vertical axis shows the probability density function, f . Areas under the curve correspond to probabilities in the sampling distribution of X. The shape of this particular density function may look familiar, but that has no relevance here. Sampling distributions can have any PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS 14 shape. In the density function in the left panel, the two “tails” of the distribution are shaded. These tails comprise only a small fraction of the total area under the curve. For this reason, the observed value of our statistic is unlikely to fall within either tail. If it does, then either (a) we have observed an unusual chance event, or (b) our hypothesis and its implied sampling distribution are incorrect. When this happens, we say that the hypothesis has been rejected. This does not mean it is wrong, for correct hypotheses may be rejected too. The probability of such an error is called α and is equal to the shaded area of the graph. To minimize these errors, we usually choose small values of α, such as 0.05 or 0.01. In figure 6.1, we have set α to a very large value (0.2) so that the shading is easy to see. The right panel of Figure 6.1 shows another way to graph the same sampling distribution. The vertical axis shows F (x), the probability of observing a value less than or equal to x. This is called the cumulative distribution function. For example, in the figure’s left panel, a fraction 0.1 of the area under the curve lies to the left of x0 and a fraction 0.9 lies to the left of x1 . Consequently, the right panel shows that F (x0 ) = 0.1 and that F (x1 ) = 0.9. We explained above how f can be used to test a statistical hypothesis: if the observed value falls within the shaded tails, then we reject. This idea can be re-expressed in terms of F because F and f contain the same information. In terms of F , we reject if F (xobs ) ≤ α/2, or F (xobs ) ≥ 1 − α/2, (6.1) where xobs is the observed value of the statistic, and α is the level of significance.1 This is just another way of saying that the observed value fell into one of the tails. In testing a statistical hypothesis, one can work either with the density function (f ) or with the cumulative distribution function (F ), but the latter approach is often easier. We don’t need to know the whole sampling distribution—just a single value, F (xobs ). As we shall see, this value is easy to estimate by computer simulation. 6.3 Sampling distributions from computer simulations Long ago there was only one way to figure out the sampling distribution implied by a hypothesis— it required a sharp pencil and a mathematical turn of mind. That approach doesn’t always work however, even for gifted mathematicians. Fortunately there is now another way: we can estimate sampling distributions from computer simulations. In an informal way, you have done this already. In Project 5, you generated several simulated values of a genetic statistic (S) under a particular hypothesis about u and 2N . You then compared these simulated values to the one (Sobs = 82) that Jorde obtained from real data. Near the end of that assignment, we asked: “How many of the simulated values are smaller than 82? How many are larger?” These are questions about F (Sobs ). We then asked: “Does the observed value represent an 1 Consider first the left shaded tail. This tail contains half the shaded area and thus has probability α/2. This means that F (x0 ) = α/2. If xobs falls within this tail, then it must be true that F (xobs ) ≤ α/2. We can reject any hypothesis for which this condition is true. Applying the same reasoning to the right shaded tail, we reject if F (xobs ) ≥ 1 − α/2. PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS Simulation 1 2 3 4 5 x 4439 4474 4484 4489 4495 Simulation 6 7 8 9 10 15 x 4508 4521 4527 4558 4566 Table 6.1: The number (x) of heads in ten draws from the binomial distribution with N = 10, 000 and p = 0.45. unusual outcome, or is it pretty typical of the simulated values?” In answering that question, you used the same logic that we explained above. The approach is more formal this time, but the idea is the same. To see how this works in detail, let us return to our hypothesis about Kerrich’s coin. Our statistic (xobs ) is the number of heads in 10,000 tosses, which equalled 5070 in Kerrich’s data. We are interested in F (xobs ) under the hypothesis that p = 0.45. The conditions of Kerrich’s experiment imply that the number of heads is drawn from a binomial distribution. This distribution has two parameters: the number (N ) of trials and the probability (p) of “heads” on each trial. Our hypothesis assumes that p = 0.45, and we can take N = 10, 000, because that is how many times Kerrich tossed the coin. We need to calculate F (5070) from a binomial distribution with these parameter values. We could do this calculation by summing across the binomial distribution function, but we’ll do it here by computer simulation. The procedure is simple: over and over, we simulate Kerrich’s experiment under the conditions of our hypothesis. Each simulation generates a value of x, the number of heads. Table 6.1 shows the result of 10 such simulations. In every one, the simulated value of x is smaller than xobs = 5070, the value that Kerrich observed. Already we begin to suspect that our hypothesis is inconsistent with the data. Let us relate this to the probability, F (xobs ), discussed above. As you know, probabilities are estimated by relative frequencies. We can estimate F (xobs ) as the relative frequency of observations such that x ≤ xobs . Every one of the 10 observations in Table 6.1 satisfies this condition. Thus, we estimate F (xobs ) as 10/10 = 1. Our sample is small, so this may not be a very good estimate. But let us take it at face value for the moment. To test a hypothesis, we must first specify α, so let us set α = 0.05. With this significance level, equation 6.1 says that we can reject if F (xobs ) ≤ 0.025 or ≥ 0.975. Our estimate is F (xobs ) = 1, which is clearly greater than 0.975. Thus, we reject the hypothesis—or we would do so if we trusted our estimate of F (xobs ). A convincing answer will require many more simulations, a potentially tedious project. Here is a program called cointest.py, which removes the tedium: 1 2 3 4 5 from pgen import bnldev xobs = 5070 nreps = 100 # Observed value of statistic. # Number of repetitions to do. PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS 6 7 8 9 10 11 12 13 14 15 16 p = 0.45 F = 0.0 16 # As assumed by hypothesis. # Will hold Pr[X <= xobs] for i in xrange(nreps): x = bnldev(10000, p) # Number of heads in one experiment if x <= xobs: # Count number of x’s that are <= xobs F += 1 F /= float(nreps) # Turn count into a fraction. print "F[%d] = %6.3f for hypothesis: p=%5.3f" % (xobs, F, p) In this code, each execution of line 10 repeats Kerrich’s experiment, using the function bnldev. (This function is described in Project 4 and documented more fully on p. 55 of appendix B.3.1.) Lines 11–12 count the number of such repetitions for which the outcome is less than or equal to the observed value 5070, and line 14 converts the resulting count into a fraction. Download this code from the class web site and run it. You should get something that looks like this: F[5070] = 1.000 for hypothesis: p=0.45 The critical piece here is the value 1.000. This is the fraction of simulations for which the simulated value was less than or equal to Kerrich’s value, 5070. It estimates F (xobs ). Even in this larger sample, every simulation resulted in fewer heads than Kerrich saw. If our hypothesis about p is correct, then Kerrich saw something remarkable. This is probably not what happened. It seems more likely that our hypothesis about p is way off the mark. This estimate of F (xobs ) is a lot more reliable than the previous one, because it is based on 100 simulations rather than 10. It would be more accurate still if we had done 10,000. We did not start with 10,000, because it’s wise to keep the number small until you get the code running. Try increasing nreps to 10,000. Does it makes a difference? Exercise 1. Revise your coalescent code from Project 5 so that it tests the same hypothesis (u = 0.001 and 2N = 5000) that we considered there, using the same data (Sobs = 82, K = 77). To do this, use cointest.py as your model. Simply replace line 10 with the simulation that you wrote in Project 5. Like cointest.py, your program should produce a single line of output. Show us the program and its output, and write a sentence or two saying whether and why the hypothesis can be rejected. The tricky parts of this exercise are getting the indentation right and making sure that each variable is initialized where it needs to be. Before proceeding, make sure you understand what this program does. In principle, it is exactly what you did in Project 5. The program does the same simulation many times and allows you to tell whether the observed value, Sobs = 82, is unusually large, unusually small, or typical, under a particular evolutionary hypothesis. PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS 6.4 17 Confidence intervals In the preceding section, we tested a single hypothesis about the value of p. It would be more interesting to examine a range of p values in order to see which can be rejected and which cannot. Here is a modified program called coin_ci.py, which does this: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 from pgen import bnldev xobs = 5070 # Observed value of statistic. nreps = 1000 # Number of repetitions to do. for p in [0.49, 0.495, 0.5, 0.505, 0.51, 0.515, 0.52, 0.525]: F = 0.0 for i in xrange(nreps): x = bnldev(10000, p) if x <= xobs: F += 1 F /= float(nreps) print "F[%d] = %6.3f for hypothesis: p=%5.3f" % (xobs, F, p) Notice how little this differs from cointest.py. Line 6 has been re-written, and lines 7–16 have been shifted to the right. We also bumped nreps up to 1000. Apart from these minor changes, the two programs are identical. These changes create an additional loop, which does a separate test for each of the values of p listed on line 6. Make sure you understand why the values of xobs and nreps are set outside this loop, and why that of F is set inside it. Download this program and get it running. Your output should look similar to this: F[5070] F[5070] F[5070] F[5070] F[5070] F[5070] F[5070] F[5070] = = = = = = = = 1.000 0.995 0.921 0.675 0.259 0.067 0.004 0.001 for for for for for for for for hypothesis: hypothesis: hypothesis: hypothesis: hypothesis: hypothesis: hypothesis: hypothesis: p=0.490 p=0.495 p=0.500 p=0.505 p=0.510 p=0.515 p=0.520 p=0.525 Each line of this output tests a different hypothesis. We can reject the first two (at α = 0.05) because F (xobs ) ≥ 0.975. We can reject the last two because F (xobs ) ≤ 0.025. The hypotheses we cannot reject include all values within the range 0.5 ≤ p ≤ 0.515. These values—the ones we cannot reject—are called the confidence interval of p. There is a little ambiguity here. We can reject p = 0.495 but not p = 0.5, so the lower boundary of the confidence interval must lie somewhere between these values. Exactly where, we cannot PROJECT 6. USING SIMULATION TO TEST A STATISTICAL HYPOTHESIS 18 say. There is a similar ambiguity involving the upper boundary. It is safe however to say that the confidence interval is enclosed by the slightly larger interval 0.495 < p < 0.52. The limits (0.495 and 0.52) of this larger interval are clearly outside the confidence interval, for we were able to reject them. This is why we use the strict inequality symbol (<). To be conservative, it is good practice to use this larger interval when reporting a confidence interval. Exercise 2. Revise your coalescent code using coin_ci.py as a model, and estimate the confidence interval of 2N . Show us your code, the output, and a brief explanation. Your explanation does not have to be long. Just explain what you have learned about the parameter 2N . Which hypotheses about 2N can we reject, and which are still on the table? At this point, we hope you can see (a) that testing a statistical hypothesis just amounts to asking whether your observation is unusual, and (b) that a confidence interval is just a summary of hypothesis tests. The methods of statistics are complex, but the underlying ideas are simple. Project 7 Simulating Selection and Drift In Project 4, you incorporated the effect of mutation into a model of genetic drift. Here, you will modify that code once again to incorporate the effect of selection and then use the resulting code in an experiment. In his section 3.1, Gillespie lays out the algebra of natural selection. At the top of page 62, he presents a classical formula for the frequency (p0 ) of allele A1 among offspring, given the genotypic fitnesses and the allele frequency of the parents: p0 = p2 w11 + p(1 − p)w12 w ¯ (7.1) Here, w11 is the fitness of genotype A1 A1 , w12 is the fitness of A1 A2 , and w ¯ is the mean fitness. This formula gives the allele frequency expected among offspring in the absence of genetic drift. In a finite population, however, there will also be variation around this expected value. We can model this process using a modified version of the urn model of genetic drift. In the standard urn model, we are equally likely to choose any ball in the urn. If p is the fraction of balls that are black, then p is also the probability that the ball we choose will be black. Now, however, there is a bias: one allele has higher fitness than the other. In the urn metaphor, balls of one color (say black) are more likely to be drawn. A ball drawn from this biased urn is black with probability p0 , as given by equation 7.1. You will use this idea below in order to simulate the combined effects of selection and drift. As you will see, the procedure is almost identical to the one you used in Project 4, where you modeled the combined effects of mutation and drift. Exercise 1. Begin either with your code from Project 4, or else with a fresh copy of file drift.py from the class web site. Next, modify this code so that it performs the simulation many times. You did this with the coalescent simulation in Project 6, so you already know how. You need something along the following lines: 1 2 twoN = 40 nreps = 100 # population size # number of replicates 19 PROJECT 7. SIMULATING SELECTION AND DRIFT 3 4 5 6 20 for rep in xrange(nreps): gen = 0 p = 1.0/twoN <then the "while loop" from the drift simulation> Note that line 6 is not real code. You should replace it with code from your own program. With this set up, each run of the program will perform 100 replicate simulations. So far, the program does not simulate selection. It just does 100 runs of the drift simulation. Later on, you will want to make nreps much larger. Notice that p = 1/2N at the beginning of each replicate. With this setup, we can interpret each replicate as a model of a newly arisen mutant allele. Make sure you understand why some variables (twoN and nreps) are set before the for loop, and the others are set within the loop but before any other code. Beginners often get this sort of thing wrong, so make sure you understand what is happening here. What would happen if twoN were set inside the for loop?1 What would happen if nreps were set inside the for loop?2 What would happen if gen and p were set before the for loop rather than within it?3 Once you understand the code, take out all the old print statements. At the very end—after all replicates have run—print the fraction of replicates in which A1 was fixed (p = 1) rather than lost (p = 0). 2. Now modify the program to implement the biased urn model that we discussed above. At the top of your program, define the following constants: 2N = 40, s = 0.1, h = 0.5, w11 = 1, w12 = 1 − hs, and w22 = 1 − s. This says that A1 is favored and that there is no dominance. (Note the minor difference between this fitness scheme and the one in the text.) Next, modify the value of p just before the call to bnldev, and then use this modified p in the call to bnldev. You did this in Project 4 in order to model mutation. Now you will do it again (this time using Eqn 7.1) in order to model selection. (Incidentally, you may be tempted to define a new variable called p’. This will not work, because the “’” character is illegal within Python names. Use some other name instead, or simply redefine the variable p.) 3. So far, you’ve got a program that runs the same simulation nreps times. Now you want to repeat that entire process for each of the following values of 2N : 500, 1000, 2000 and 4000. This will involve enclosing the existing code in a for loop, as we illustrated in coin_ci.py (see p. 17 of Project 6). 4. Once you have the program working, it is time to collect some data. Set nreps to 10000, and run the simulation with each of the following values of s: 0, 0.001, and 0.02. You will 1 The code would still run correctly, but not quite as fast. This is because twoN is a constant whose value never needs to change. 2 You must set nreps before you first use it, in the xrange statement. Otherwise the code will break. Once xrange executes, it doesn’t matter whether you set nreps again, although doing so is a waste of time. 3 On the second pass through the loop, things would break because gen and p would begin the loop with the values they had at the end of the first pass. PROJECT 7. SIMULATING SELECTION AND DRIFT 21 discover that this takes awhile for the case in which s = 0. Use the results to make a graph with log10 2N on the horizontal axis and “fraction fixed” on the vertical. Draw three lines: one connecting the points with s = 0, one for s = 0.001, and one for s = 0.02. Summarize the graph in words. How does the probability of fixation respond to population size? Write a paragraph interpreting your results using the theory discussed in Gillespie and in class. Hint: To save yourself some work, you may want to ask Python to calculate the log10 values rather than using your calculator. To do this, put from math import log10 at the top of your program. Then log10(100) will give log10 100 (which equals 2). Project 8 Using HapMap Several large databases of human genetic variation have become available during the past decade, and they have revolutionized the field. This project will introduce you to one of them, the HapMap. The HapMap project has genotyped millions of single-nucleotide polymorphisms (SNPs) throughout the human genome within samples from several human populations. You will learn how to download and manipulate these data, and will then use them to explore the relationship between observed and expected heterozygosity in HapMap SNPs. 8.1 Choosing a population and chromosome at random The original HapMap provided data on the following populations: Label YRI CEU CHB JPT JPT+CHB Sample 90 Yorubans from Nigeria 90 Utahns of western and northern European ancestry 45 Han Chinese from Beijing 44 Japanese from Tokyo combined Chinese and Japanese samples Although recent releases of HapMap have added additional populations, we will focus on these. The data from these populations are made available as plain text files, one for each chromosome within each population. You will download one of these files onto your computer, but first you must figure out which one. For this project, you will choose a chromosome and population at random. To do so, cut and paste the following program into the Python interpreter, and run it once. from random import choice, randrange print "Your chromosome is number", randrange(1,23) print "Your population is", choice(["YRI", "JPT+CHB", "CEU"]) It will tell you the number of your chromosome, and the name of your population. This program uses two new functions—choice and randrange—from Python’s random module. We’ll be using 22 PROJECT 8. USING HAPMAP 23 these at several points in this project, so you should experiment with them until you understand how they work. They are described in section B.2, page 54. 8.2 Downloading data Now that you know your chromosome and population, you are ready to visit the HapMap web site. This is a task that you’ll need to do for several projects, so we’ve put the instructions into a self-contained appendix, which you’ll find on page 50. Please download two data files: the one for your own chromosome and population, and also the one for chromosome 22, population JPT. From here on, we’ll assume that you have already downloaded the appropriate data file for this project. We recommend that you put all relevant files into the Documents folder. Before proceeding, make sure that the data file and pgen.py are both in the same folder (or directory), and that Python can execute the command from pgen import * without generating an error. 8.3 Using the pgen module to work with HapMap data The pgen module provides several tools for use with HapMap data files. These are described in section B.3.2 (page 56). Here, we merely illustrate their use. At the top of your program, you will need the line: from pgen import * The portion of the program that uses HapMap data should begin like this: pop = "JPT" chromosome = 22 hds = hapmap_dataset(hapmap_fname(chromosome, pop)) Here, the call to hapmap fname tries to find the name of an appropriate HapMap data set—one for chromosome 22 and population JPT. If it succeeds, it hands that name to hapmap dataset, which reads the file and stores it in a useful form (more on this in a moment). In the last line, the assignment statement creates a variable called hds, which points to this data structure. We use the name hds to help us remember that it refers to an object of type “hapmap dataset,” but you can use any name you please. But what exactly is an object of type “hapmap dataset?” Well, it is a lot like a Python list. Like a list, it has a length: >>> len(hds) 11235 And like a list, it also has a sequence of data values that we can access like this: >>> snp = hds[3] PROJECT 8. USING HAPMAP 24 Now snp refers to the data value at position 3 within the data set. This data value is an object of another specially-created type, called “hapmap snp.” An object of type hapmap snp has a variety of types of data. Having defined snp, we can type >>> snp.chromosome ’22’ >>> snp.position 14603201 >>> snp.id ’rs1041770’ These commands report (1) the chromosome on which this SNP lies, (2) the position of the SNP in base pairs, measured from the end of the chromosome, and (3) the “rs-number,” a name that uniquely identifies this SNP. You might think that the rs-number was redundant, once we know the SNP’s position. However, the position values change a little with each new build of the data base, whereas the rs-numbers are invariant from build to build.1 Thus, the rs-number is essential if you want to find a SNP that is discussed in the literature. hapmap snp also contains all the genetic data for the SNP. For example, >>> snp.alleles [’G’, ’T’] returns a list containing the two alleles that are present at this locus. In the original data file, the genotype of each individual would have been represented as GG, GT, or TT. In hapmap snp, these genotypes are recoded as numbers: >>> snp.gtype [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] Each integer counts the number of copies of alleles[0] in one individual. In these data, allele 0 is G, so the integers 0, 1, and 2 correspond to genotypes TT, TG, and GG. Since the genotypes are now integers, they have a mean and a variance, which you can access like this: >>> snp.mean 0.066666666666666666 >>> snp.variance 0.06222222222222222 Objects of type hapmap snp behave like lists. The command snp[3] returns the genotypic value at position 3, len(snp) returns the number of genotypes in the data, and sum(snp) returns the sum of the genotypic values. In these examples, snp is the name of a variable that points to an object of type hapmap snp. But we could have used any other name just as well. For example, 1 Occasionally, rs-numbers do change. This happens when it is discovered that two numbers actually refer to the same SNP. Then the higher number is retired, and lower number becomes the official designation. PROJECT 8. USING HAPMAP 25 >>> another_snp = hds[23] >>> another_snp.alleles [’A’, ’C’] 8.3.1 Some examples This section shows how pgen’s HapMap facilities can be used to solve problems. Read them carefully, and try them out in Python’s interactive interpreter. They will all be useful later on. Working with gtype In a miniature sample of 8 genotypes, gtype might look like this: >>> snp.gtype [1, 0, 0, 1, 1, 2, 0, 1] Each number counts the copies of allele 0 in an individual genotype. The sum of these values—in this case 6—is the number of copies of this allele in the data set. Since each genotype has two genes, the total number of genes is 2 × 8 = 16. The frequency of allele 0 is therefore p = 6/16. On the other hand, the average of the data values is x ¯ = 6/8. The allele frequency is exactly half the mean and is thus amazingly easy to calculate: >>> p = 0.5*snp.mean Just remember that this is the frequency of allele 0—the one that is listed first in snp.alleles. Counting genes and genotypes As explained above, the number of copies of allele 0 is just the sum of genotypic values: >>> n0 = sum(snp) What about the number of heterozygotes? The variable snp.gtype is a Python list. As such, it has a built-in method called count that will count the number of copies of any given value within the list. In our data, every heterozygote is represented by the integer “1.” We can count the number of heterozygotes with a single command: >>> snp.gtype.count(1) Here, count is a built-in method that would work on any Python list. Looping over SNPs There are several ways to loop over the SNPs in a hapmap dataset. If you want to include all the SNPs, the code is very simple: >>> for snp in hds: ... print snp.mean This works because objects of type hapmap snp behave like lists. We will use this approach in Project 11. In this current project however, it would be a disaster. As you will soon see, today’s project involves making a graph. If you tried to graph all the SNPs, the graph would be incomprehensible. Consequently, we will look only at a limited sample. The easy way to do this is as follows: PROJECT 8. USING HAPMAP 26 >>> for i in xrange(20): ... snp = choice(hds) ... print snp.mean This loop includes 20 SNPs chosen at random from among all those in the data. The code makes use of the choice function, which we introduced on p. 22 and discuss more fully on p. 55. Exercise Create a hapmap dataset for chromosome 22 of population JPT. From this dataset, select the SNP whose index is 83, i.e. hds[83]. Use this SNP to answer the following questions: (1) What is the nucleotide state (A, T, G, or C) of allele 0? (2) How many copies of this allele are present? (3) What is its relative frequency? (4) What is the expected frequency of heterozygotes at Hardy-Weinberg equilibrium? (5) What is the observed frequency of hetetozygotes? (6) What is the position (in base pairs) of this SNP on the chromosome? Project 9 Heterozygosity of HapMap SNPs Heterozygosity is perhaps the most fundamental measure of genetic variation. In this project, we’ll study the heterozygosity of SNPs within the HapMap data base. For each SNP, we’ll calculate not only the observed heterozygosity, but also the value expected at Hardy-Weinberg equilibrium. We’ll be interested to see how well the two numbers agree. The exercise will involve modifying the incomplete code below. You can download it from the lab website, where it is called haphetinc.py. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 from pgen import * from random import choice chromosome = 22 pop = ’JPT’ hds = hapmap_dataset(hapmap_fname(chromosome,pop)) for i in range(20): snp = choice(hds) ohet = 0.0 ehet = 0.0 # REPLACE ME # REPLACE ME print "%5.3f %5.3f" % (ohet, ehet) The guts of this little program is the loop in lines 8–14. Each time through the loop, we choose a SNP at random from the data set. Then we set the observed (ohet) and expected (ehet) heterozygosities, and print these values. Only, as you can see, the present code just sets these values to 0. Exercise 1. Change lines 11-12 so that they calculate the observed and expected heterozygosities of each SNP. 27 PROJECT 9. HETEROZYGOSITY OF HAPMAP SNPS 28 So far, your program merely lists the expected and observed heterozygosity values. It is hard to get a good sense of these values, just by staring at the numbers. You’ll appreciate the pattern better after making a scatter plot. Before you can do this, you must first get the data values into lists. 9.1 Storing values in a list As you calculate values of ohet and ehet, you will need to store them in a list. Before you can do so, the list must first exist. Thus, you will need to create an empty list before the main loop in your program. For example, you might use a command like obs_vals = 3*[None] to create a list with room for three observed values. Here, the keyword “None” is Python’s way of representing an unknown value. In subsequent commands, you replace these with real values. Here is how it works: >>> obs_val = 2*[None] >>> obs_val [None, None] >>> obs_val[0] = 3.0 >>> obs_val[1] = 111 >>> obs_val [3.0, 111] Below, you will create two lists (one for ohet and one for ehet) to store the values you calculate within the loop in the program above. Your lists should be defined before the loop and should be long enough to hold all the values you will calculate within the loop. It is also possible to start with an empty list: >>> obs_val = [] >>> obs_val [] Then you can add values using the append method, which works with any list: >>> obs_val.append(3.0) >>> obs_val.append(111) >>> obs_val [3.0, 111] Either way, you end up with the same answer. The first method is faster with large jobs, but the second is easier. Feel free to use either approach. 9.2 Making a scatter plot There are several Python packages for making high-quality graphics. If you are working on your own machine at home, you may want to download either matplotlib or gnuplot.py. Unfortunately, PROJECT 9. HETEROZYGOSITY OF HAPMAP SNPS 29 none of these packages is available in the Mac Lab at Marriott Library. As a partial solution to this problem, we have implemented a crude graphics function within pgen.py. It constructs a scatter plot out of ordinary text characters—a style of graphics that has not been common since the 1970s. The function charplot is described on page 57. Here is a listing that illustrates its use: from pgen import charplot x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] y = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] charplot(x, y) When this program runs, it makes a scatter plot of the (xi , yi ) values, with x on the horizontal axis and y on the vertical. Try it yourself. The function accepts several optional arguments, which you can use if you wish to control the number of tick marks or the dimensions of the plot. See page 57 for details. Exercise 2. Instead of printing the heterozygosity values, put them into two lists: one for the expected values and the other for the observed values. At the end of the loop, make a scatter plot, with expected heterozygosity on the horizontal axis and the observed value on the vertical axis. 3. Use your program to produce a graph. Experiment to find the best number of SNPs. With a ruler and pencil, draw a line on the graph representing the function y(x) = x. (To help you draw this line, you may want to append a dummy pair of data values such as (0.7, 0.7) before calling charplot.) If observed values always equaled expected ones, all points would fall on this line. Write a short paragraph describing the pattern you see. Is there good agreement between observed and expected values, or do you see a discrepancy? When expected heterozygosity is a small number, you should find that observed heterozygosity hardly ever exceeds the expected value. Why should this be? What to hand in Your answers to step 1, the final version of your program, the graph, and a paragraph discussing the results. Project 10 Simulating Selection and Drift at Two Loci This project is motivated by the current widespread interest in using linkage disequilibrium (LD) to detect selective sweeps. The goal is to find out how linkage disequilibrium (as measured by r2 ) behaves when one locus is undergoing a selective sweep and the other is neutral. You will build a simulation model involving selection, recombination, and drift at two loci. We assume that selection acts at one locus, where allele A is favored over allele a. At a linked locus, alleles B and b are neutral. The simulation starts with pA = 1/2N and stops when pA ≥ 1/2. At that point, it prints out pA , pB , and r. By doing a number of replicate simulations, you will get a sense of how r behaves when pA is near 1/2. On the class website, you will find a program called twolocinc.py, which does most of this job. Your task is to fill in the pieces that we have left out, and then to run the program a few times to collect data. But first, we need to tell you about some nuts and bolts. 10.1 Sampling from the multinomial distribution Thus far, our simulations have involved a single locus with two alleles. To model drift in such a system, we used the metaphor of an urn containing balls of two colors. Now we need to keep track of four types of chromosome (or gamete): AB, Ab, aB, and ab. The urn metaphor still works, but now the urn contains balls of four types. In a slight modification of Gillespie’s notation, we write the relative frequencies of the four gamete types as: Gamete AB Ab aB ab Rel. freq. x0 x1 x2 x3 Our initial subscript is 0 (rather than 1) for consistency with subscripting in Python. Suppose the urn contains four types of balls in these frequencies. You draw 2N balls at random from the urn, replacing each one and then shaking the urn before selecting the next ball. Among 30 PROJECT 10. SIMULATING SELECTION AND DRIFT AT TWO LOCI 31 the balls that you selected, we represent the number of each type as follows: Gamete AB Ab aB ab Count n0 n1 n2 n3 In other words, n0 is the number of balls of type AB that were drawn from the urn, and so on. Each time you repeat this experiment, the values in the array [n0 , n1 , n2 , n3 ] will vary at random. The probability distribution that describes this variation is called the multinomial. In the special case when there are only two types of balls, there is no difference between the multinomial and binomial distributions. The multinomial distribution is more general because it applies regardless of the number of types of balls. We are not going to tell you much about the multinomial. All your program needs to do is draw samples from it, and we have provided a function called mnldev that simplifies this task. This function is described on page 55. To use it, you need to specify the number of balls to draw, and the relative frequency of each type within the urn. Here’s an illustration: 1 2 3 4 5 >>> >>> [2, >>> [1, from pgen import mnldev mnldev(5, [0.2, 0.3, 0.5]) 0, 3] mnldev(5, [0.2, 0.3, 0.5]) 1, 3] In these calls to mnldev, the first argument is the integer 5. This says that we want to draw 5 balls. The second argument is a list containing 3 relative frequencies. For each call, mnldev returns a list of 3 integers that sums to 5. These returned values are samples from the multinomial distribution. They represent the numbers of balls drawn, with one number for each of the 3 types. In your application, you will want to draw twoN balls from an urn whose relative frequencies are stored in a Python list called x. The returned counts should be stored in a list called n. Thus, the call to mnldev will look like n = mnldev(twoN, x). Ordinarily, a list of relative frequencies must be normalized so that it sums to 1.0. Making sure that this is true often involves an extra step in computer code. You can omit this step for the list that you hand to mnldev, because it does the normalizing automatically. The call mnldev(5, [2, 3, 5]) is exactly the same as mnldev(5, [0.2, 0.3, 0.5]). We’ll explain below how to use this trick to simplify your code. 10.2 An incomplete program On the course website you’ll find a program called twolocinc.py, which looks like this: 1 2 3 4 5 # twolocinc.py from pgen import mnldev from math import sqrt nreps = 1000 PROJECT 10. SIMULATING SELECTION AND DRIFT AT TWO LOCI 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 twoN = 500 s = 0.02 c = 0.001 w = [1.0+s, 1.0+s, 1.0, 1.0] 32 # selective advantage of allele A # recombination rate # fitnesses of AB, Ab, aB, & ab print "2N=%d s=%f c=%f" % (twoN, s, c), print "Fitnesses:", w print "%5s %6s %6s %6s" % ("rep", "pA", "pB", "rsq") for rep in xrange(0, nreps): x = [1.0/twoN, 0, 0.5-1.0/twoN, 0.5] # freqs of AB, Ab, aB, & ab while True: # Insert code here to adjust x for recombination # Insert code here to adjust x for gametic selection n = mnldev(twoN, x) # sample from multinomial x = [float(z)/twoN for z in n] pA = x[0]+x[1] pB = x[0]+x[2] if pA==0 or pA>=0.5 or pB==0 or pB==1: break if pA >= 0.5 and (0 < pB < 1): # Insert code here to calculate rsq. print "%5d %6.3f %6.3f %6s" % (rep, pA, pB, "***") This program will run, but it is incomplete. It does neither recombination nor selection and does not calculate the value of r2 . These are the parts you need to add. But first, get it running. In its overall structure, this program is a lot like the one you wrote in Project 7. The top section defines parameters and prints them. The rest consists of two nested loops. The inner loop models the entire history of a single mutant allele; each pass through that loop corresponds to a generation. This entire process is repeated many times—once for each mutant allele—under the control of the outer loop. This new program differs from that of Project 7 in that there are four gamete types rather than just two alleles. This becomes apparent on line 15, which initializes a list containing the relative frequencies of the four gamete types. In the one-locus model, you would have written p = 1.0/twoN at this point. The new code needs to make some assumption about the state of the population just before the initial mutation. The one made by line 15 is completely arbitrary: just before the mutation, allele A didn’t exist, and allele B had frequency 0.5, so the four gamete frequencies were x = [0, 0, 1/2, 1/2]. Then one of the aB gametes mutated to become an AB. The resulting list of gamete frequencies is x = [1/2N, 0, 1/2 − 1/2N, 1/2], as indicated on line 15. Each pass through the inner loop (lines 17–24) corresponds to a single generation. Based on current gamete frequencies, we sample from the multinomial distribution. This yields a list of counts, which represent the number of copies of each gamete type in the next generation. After converting these back into relative frequencies, we check to see if it is time to stop. That is all there PROJECT 10. SIMULATING SELECTION AND DRIFT AT TWO LOCI 33 is to it. If this seems familiar to you, that is probably because the program you wrote in Project 7 worked the same way. Note that lines 26–28 execute only after we drop out of the inner loop. The if statement there makes sure that nothing prints unless both loci are polymorphic. We are not interested in monomorphic cases. Make sure this program runs before you begin changing it. Here is the output of one run on my computer: 2N=500 rep 39 50 907 s=0.020000 c=0.001000 Fitnesses: [1.02, 1.02, 1.0, 1.0] pA pB rsq 0.510 0.826 *** 0.502 0.502 *** 0.506 0.988 *** Only three of these replicates generated any output. The program prints fitnesses and a recombination rate, but these are misleading. As it stands, there is no recombination and no selection. Your job is to rectify that. Exercise 1. Recombination. We’re going to model recombination using a trick that you have already used twice before. In Project 4, you incorporated mutation into a simulation of drift by adjusting allele frequencies before calling bnldev. Then in Project 7 you used the same trick in order to model selection. Now you will use that trick once again to model recombination. There is only one real difference. In those earlier projects, there were only two alleles so you only had to adjust the value of one allele frequency. Now you must adjust the values of all four gamete frequencies and then call mnldev instead of bnldev. In his Eqn. 4.1 (p. 102), Gillespie shows how recombination changes the frequency of one gamete type. There are analogous formulas for all four types: x00 = x0 (1 − c) + cpA pB x01 = x1 (1 − c) + cpA (1 − pB ) x02 = x2 (1 − c) + c(1 − pA )pB x03 = x3 (1 − c) + c(1 − pA )(1 − pB ) where c is the recombination rate, pA is the frequency of A, and pB is the frequency of B. By the time the program reaches line 17 in the code above, it knows the value of c and all the values in the list x. From these, it is easy to calculate pA = x0 + x1 and pB = x0 + x2 . This gives you everything you need to calculate the values on the left side of the equations above—the frequencies of the gamete types after recombination. Add code that does this just after line 17 above. However you decide to do this, make sure your results end up in the list named x. PROJECT 10. SIMULATING SELECTION AND DRIFT AT TWO LOCI 34 2. Selection. Before calling mnldev, we need to adjust the gamete frequencies once again—this time for the effect of selection. The idea is the same as before. We have a theoretical formula that converts pre-selection frequencies into post-selection frequencies. As input to this process, we use the values in the Python list x, which have already been adjusted for recombination. For simplicity, assume that selection acts at the gamete stage rather than the diploid stage. (This makes our formulas a little simpler than Gillespie’s.) After selection, the new gamete frequencies are x0i = xi wi /w ¯ P where w ¯ = xi wi is the mean fitness. You have all the xi and wi values, so it is easy to calculate w. ¯ Given w, ¯ it is easy to calculate x0i for all four gamete types. But before you start, take another look at the formula. In it, w ¯ plays the role of a normalizing constant. It is there so that the new gamete frequencies will sum to 1. As discussed above, mnldev will do this normalizing for us, so we don’t have to. It is sufficient (and a little faster) to set x0i = xi wi , without bothering to calculate w. ¯ With this simplification, the list that we hand to mnldev contains values that are proportional to gamete frequencies, and that is good enough. Add code that does this just after line 18. As before, make sure that your results end up in the list named x. 3. Linkage disequilibrium. You have one more change to make. One conventional measure of linkage disequilibrium is the correlation between loci, x0 x3 − x1 x2 r=p pA (1 − pA )pB (1 − pB ) Note that this r has nothing to do with the recombination rate. That is why we used a different symbol (c) for that. The numerator here equals D, a measure of LD that was discussed in lecture and in the text. Add code just after line 27 to calculate r, and modify line 28 so that it prints r2 , the square of this result. 4. Collect data. Set 2N = 5000 and c = 0.001. Run the program once with s = 0, once with s = 1/500, and once with s = 1/50. (These correspond to 2N s = 0, 2N s = 10, and 2N s = 100.) Adjust nreps so that you get about 50 lines of output for each run. This will require a lot of replicates—probably about 150,000—for s = 0. For each value of s, make a histogram of the frequency distribution of the values of r2 . (You can do this by hand, or however you like. If you end up with hundreds of lines of output for some value of s, just use the first 50.) How easy would it be to distinguish strong from weak selection, based on the value of r2 ? Project 11 The Site Frequency Spectrum of HapMap SNPs This project will use the HapMap data, which you first encountered in Project 8. (Please do that project before this one.) The current project concerns the site frequency spectrum, which was discussed in lecture and in chapters 1 and 6 of Lecture Notes on Gene Genealogies. As those chapters explained, the spectrum has a very simple shape—at least in expectation—under the conditions of the standard neutral model. In this project you will estimate the site frequency spectrum from an entire human chromosome and compare this empirical distribution to its theoretical counterpart. The goal is to see how well the HapMap SNPs conform to the assumptions of the standard neutral model. 11.1 Preliminaries Counting copies of the minor allele As you know, each object of type hapmap snp describes one HapMap SNP, including the genotypes of all individuals sampled. On page 25, we explained how to count the copies of allele 0. Here we extend that code to count the copies of the minor allele—the rarer of the two. Consider the following snippet: K = 2*len(snp) x = sum(snp) if x > K/2: x = K - x # gene copies in sample # copies of allele 0 Here, snp is an object of type hapmap snp. The sum statement (explained on p. 25) counts copies of allele 0, and the last two lines convert this into a count of the minor allele. The code works because the number of copies of the minor allele cannot exceed K/2. Exercise 1. Write a program that counts the copies of the minor allele in the SNP at position 94 of the HapMap dataset for chromosome 22, population JPT. 35 PROJECT 11. THE SITE FREQUENCY SPECTRUM OF HAPMAP SNPS 36 Counting occurrences of integers In a previous project, you saw how to use Python to simulate the roll of a die. Now that you know a little more Python, we can boil that code down to from random import randrange rolls = 7*[0] for i in range(1000): spots = randrange(1,7) rolls[spots] += 1 print rolls This code uses Python’s randrange function, which is described on page 55. The variable “spots” represents the random outcome of one roll of a die. We increment the value of rolls[spots] in order to keep track of the number of times we have seen each value. Below, you will use this same algorithm to calculate the site frequency spectrum of a human chromosome. But first try this warm-up exercise. Exercise 2. Write a program that tabulates the number of times each integer appears in the list below. Use the algorithm illustrated in the dice example above. data = [0, 4, 3, 1, 4, 4, 3, 0, 0, 0, 1, 0, 3, 4, 4, 5, 4] 11.2 The site frequency spectrum The rest of this exercise will involve modifying the incomplete code shown below. Download it from the lab website, where it is called hapspecinc.py. # hapmapspec.py # Calculate site frequency spectrum from pgen import * chromosome = 22 pop = ’JPT’ # CHANGE ME # CHANGE ME hds = hapmap_dataset(hapmap_fname(chromosome,pop)) K = 2*len(hds[0].gtype) # number of gene copies in sample # ADD CODE HERE TO CALCULATE AND PRINT FOLDED SPECTRUM As it stands, the code will run, but it does nothing useful. PROJECT 11. THE SITE FREQUENCY SPECTRUM OF HAPMAP SNPS 37 You will eventually need to change the chromosome and pop values to reflect your own chromosome and population. But you may want to do this last. Chromosomes are numbered in order of decreasing size, and chromosome 1 is much larger than chromosome 22. If you are working with one of the big chromosomes, it will take you much longer to debug your program. So download a copy of the data file for population JPT and chromosome 22, and use that until you get your program working. At the very end, you can switch to your own chromosome and population. For debugging, you may want to use the dummy data set described on page 53. Exercise 3. Add code that loops over all SNPs (not just a sample), calculating the number of copies of the minor allele. While you are debugging this code, it will help to print the values. There will be lots of them, so hit Ctrl-C to interrupt the process once you see it working. Then remove the print statement. 4. The only thing left to do is tabulate the values that you are already calculating. To do so, use the algorithm illustrated by the dice example, which you practiced just above. At the very end of your program, insert code to print the spectrum in a table like this: i spec[i] 1 753 2 493 3 459 4 456 The first row tells us that there were 753 SNPs whose minor allele was present only once in the data, 493 whose minor allele was present in two copies, and so on. I have printed only the first 4 rows of the table. 11.3 Theoretical spectrum So far in this exercise, you have calculated the observed site frequency spectrum. But what have you really learned about evolution? Probably not much, because you don’t know what to make of the counts in the spectrum. You need something to compare them to. In this final portion of the project you will calculate the spectrum that is expected under the standard neutral model: random mating, selective neutrality, and so on. In the spectrum you have just calculated, let us refer to the ith entry as si . It is is the number of sites at which the minor allele is present in i copies within the data. There is a well-known formula for the expected value of si under the standard neutral model. (See sections 6.2.4 and 6.2.5 of Lecture Notes on Gene Genealogies.) E[si ] = θ/i if i = K − i θ/i + θ/(K − i) otherwise PROJECT 11. THE SITE FREQUENCY SPECTRUM OF HAPMAP SNPS 38 where θ = 4N u, 2N is the number of gene copies in the population, and u is the mutation rate. To calculate this value, you need an estimate of θ. In the lecture notes, we suggested basing this estimate on S, the number of segregating sites. The appropriate estimate is θˆ = S/a where a= K−1 X 1/i i=1 and K is the number of gene copies in the sample. Let us implement these formulas in Python. There are two easy ways to calculate the number of segregating sites in your data. Suppose that your observed spectrum is a Python list named ospec. Then S = sum(ospec}, or alternatively, S = len(hds). These two approaches should give the same answer. (Check this, just to make sure that your code is correct.) There are also several ways to calculate a. Here is one: a = 0.0 for i in range(1,K): a += 1.0/float(i) Given S and a, it is a simple matter to calculate θˆ and E[si ], using the formulas above. Here is one way to do it: espec = (K/2+1)*[0] for i in range(1,len(espec)): if i == K-i: espec[i] = theta/i else: espec[i] = theta/i + theta/(K-i) This generates a list called espec, such that espec[i] = E[si ]. Finally, you will want to compare the observed and expected values contained in ospec and espec. pgen.py contains a function called oehist (for observed-expected histogram), which does this for you. To use oehist, add the following line to the end of your program: oehist(ospec, espec) By default, this prints a line for the 0th position of the two lists, which isn’t used. Just ignore this line in the output. Exercise 5. Add code to your program that calculates the E[si ] as explained above, placing these values in a list called espec. At this point you should have two lists, one for the observed spectrum and the other for the expected spectrum. Then print and graph these lists using oehist. 6. Write a short paragraph commenting on the observed and expected spectra. Do they seem approximately the same, or do they differ? If they differ, summarize how they differ. What can you conclude from this experiment? PROJECT 11. THE SITE FREQUENCY SPECTRUM OF HAPMAP SNPS 39 7. Write a paragraph or two describing the pattern in the data and if possible suggesting an explanation. Pay particular attention to the differences (if any) between the observed and expected spectra. What to turn in The programs from steps 1, 2, and 5 along with a representative output from each; the paragraph from step 6; and the (optional, ungraded) paragraph on hypotheses. Project 12 Linkage Disequilibrium in the Human Genome In Project 10 we modeled selection and drift at two linked loci. The focus there was on linkage disequilibrium (LD), which was measured using the statistic r, a form of correlation coefficient. It is easy to estimate r, provided that we can tell which nucleotides occur together on individual chromosomes. Unfortunately, we are often ignorant about this. We are ignorant, in other words, about “gametic phase.” This makes it hard to measure LD with any data set that—like the HapMap— consists of unphased genotypes. There are several solutions. The simplest is to estimate LD using a second correlation coefficient (discussed below), which is easy to estimate from diploid genotypes at two loci. This is useful, because it turns out that this second correlation is a good estimate of the first1 . To distinguish between the two correlation coefficients, let us refer to the first as rH (since it correlates Haploid gametes), and to the second as rD (since it correlates Diploid genotypes). In this project, we will estimate rD , and interpret its value as an estimate of rH . In this project we use these estimates to study how LD varies across large regions of chromosome. Each of you will study a different chromosome. You’ll compare the pattern of LD in three populations. This project will require working with correlations, so that is where we begin. 12.1 Correlating diploid genotypes In raw HapMap data, each genotype is a character string. The pgen module, however, recodes these as integers, as explained on page 24. For example, genotypes TT, CT, and CC might be recoded as 0, 1, and 2. The table below shows genotypes at two neighboring SNP loci on chromosome 22 in the Utah HapMap population (CEU). It was made by the program sketched below. 1 Rogers, A.R. & C. Huff. 2009. Genetics 182(3):839–844. 40 PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 41 locus A [rows] (C/T) at 22181701 locus B [cols] (A/G) + 793 0 1 2 0 [ 90 41 2] 1 [ 0 21 5] 2 [ 0 0 3] rAB = 0.612, r^2 = 0.375, n = 162 Locus A is at position 22181701, and locus B is 793 base pairs to the right of it. Locus A has two nucleotides, C and T, whereas locus B has two others, A and G. The 3 × 3 table shows the distribution of genotypes at the two loci, with labels 0, 1 and 2 to indicate the genotypic values at the two loci. If you study the table, you’ll see that the two loci are not independent. Individuals in row 0 tend to fall in column 0; those in row 1 tend to fall in column 1; and all those in row 2 fall in column 2. There is thus a positive relationship between the genotypic values of the two loci. This reflects a similar association in the underlying genotypes: individuals with genotype TT at locus A tend to have genotype GG at locus B, and so on. The last line of the report above summarizes the strength of this association. The diploid correlation coefficient is rD = 0.612, and its square is 2 = 0.375. Now this is not the same as the statistic r rD H that we used in Project 10 to measure LD. Yet as discussed above, it is a good estimate of rH . This justifies our use of it in what follows as a measure of LD. The correlation between two variables is based on their covariance, and the concept of covariance is related to that of variance. So let us begin there. As explained in section 2.2 of JEPr, the variance VX of X is its average squared deviation from the mean: ¯ 2 ]. VX = E[(X − X) Similarly, the covariance (JEPr section 2.3) is the average product of the deviations of X and Y from their respective means: ¯ Cov(X, Y ) = E[(X − X)(Y − Y¯ )]. Cov(X, Y ) is positive when large values of X often pair with large Y , negative when large X often pairs with small Y , and zero when √ neither variable predicts the other. The maximum possible (absolute) value of Cov(X, Y ) is VX VY , so the natural way to define a dimensionless correlation coefficient is as p rD = Cov(X, Y )/ VX VY , which can range from −1 to +1. 12.2 Calculating variances and covariances The convenient ways to compute variances and covariances are also closely related. You may recall from JEPr that VX = E(X 2 ) − E(X)2 and that Cov(X, Y ) = E(XY ) − E(X)E(Y ). Thus, we can calculate the variance from the averages of X and of X 2 . The covariance is only slightly more complicated: we need the averages of X, of Y , and of the “cross-product” XY . In JEPr, PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 42 these averages were across probability distributions, because we were discussing the variances and covariances of random variables. We are now concerned with data, so we average across the values in our data. Apart from that, the procedure is the same. You will not need to calculate variances in today’s main assignment, for they are calculated automatically when you create an object of type hapmap dataset. You will however need to calculate covariances. The first step in today’s assignment is to figure out how. It will help to think first about variances, because that calculation is so similar. Consider the listing below. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 x = [5, 25, 36, 37, 41, 50, 60, 73, 75, 99] y = [15, 16, 21, 44, 49, 62, 71, 73, 78, 94] # Return the variance of the values in xvec def var(xvec): mx = mxx = 0.0 for x in xvec: mx += x mxx += x*x n = float(len(xvec)) mx /= n mxx /= n return mxx - mx*mx print "Var(x):", var(x) print "Var(y):", var(y) This code is available on the class web site, where it is called var.py. Download it, run it, and see what it does. You should get two lines of output, which report the variances of the two data sets. How would this code need to change if we wanted to calculate a covariance rather than a variance? For one thing, we would want to step through the two data lists simultaneously in order to accumulate the sums of X, Y , and XY . There are several ways to do this, the simplest of which is involves a facility called zip, which you have not yet seen. Let us pause for a moment to discuss this new facility. zip(xv, yv) is a Python facility that steps through the elements of several sequences (lists, tuples, strings, or whatever). For example, >>> xv = [’x1’, ’x2’] >>> yv = [’y1’, ’y2’] >>> for x, y in zip(xv, yv): ... print x, y ... x1 y1 x2 y2 PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 43 Each time through the loop, x and y are automatically set equal to corresponding elements from the lists xv and yv. Here we have “zipped” a pair of lists, but one can also zip three or more. To use this facility in calculating a covariance, you would begin with a framework like this: def cov(xvec, yvec): mx = my = mxy = 0.0 for x, y in zip(xvec, yvec): Exercise 1. Write a new function called get cov, which calculates the covariance between two lists, and use it to print the covariance as the last line in the program. 12.3 Smoothing data In this project, you will be comparing LD between tens of thousands of pairs of loci. Without some way of simplifying the output, you would drown in data. In the end, you will plot the results rather than just staring at numbers. This will help, but it is not enough—the noise in the LD estimates would still obscure the pattern. We can get rid of much of this noise by smoothing the data. This is the purpose of the scatsmooth function, which is available within pgen.py and is described on page 57. There are many ways to smooth data, and scatsmooth implements perhaps the simplest. It divides the X axis into bins of equal width, and calculates the mean of Y within each bin. For example, suppose that we have data in two Python lists called x and y. To smooth them, we could type >>> bin_x, bin_y, bin_n = scatsmooth(x, y, 5, 0, 40) As you can see, scatsmooth takes five arguments: (1) x, a list of horizontal coordinate values; (2) y, a list of vertical coordinate values; (3) the number of bins (5 in this case); (4) the low end of the first bin; and (5) the high end of the last bin. In this example, we have asked scatsmooth to divide the interval from 0 to 40 into 5 bins. If you leave off the last two arguments, scatsmooth will calculate them from the data. scatsmooth returns three values, each of which is a list. The first (bin_x in this example) contains the midpoints of the X-axis values of the bins. The second (bin_y) contains the mean Y -axis values. The third returned list (bin_n) contains the numbers of observations within the bins. In your own code, you can name the returned values whatever you like; you don’t need to call them bin_x, bin_y, and bin_n. We can now manipulate the three returned lists any way we please. For example, here is a listing that prints their values. >>> for xx, yy, nn in zip(bin_x, bin_y, bin_n): ... print "%8.3f %8.3f %4d" % (xx, yy, nn) ... 4.000 4.000 5 PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 12.000 20.000 28.000 36.000 19.000 39.000 59.000 74.000 44 10 10 10 5 Experiment with scatsmooth until you understand how it works. 12.4 An incomplete program On the website, you will find a program called rscaninc.py, which is an incomplete version of the program you will need for this project. Here is the listing: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 from pgen import * from random import randrange nreps = 500 chromosome = 22 pop = ’JPT’ window = 200 # DEFINE GET_COV AND GET_RSQ FUNCTIONS HERE hds = hapmap_dataset(hapmap_fname(chromosome,pop)) distvec = [] rsqvec = [] for rep in xrange(nreps): i = randrange(len(hds)) # index of random snp # scan right for j in xrange(i+1, len(hds)): kilobases = abs(hds[j].position - hds[i].position)*0.001 if kilobases > window: break distvec.append(kilobases) rsqvec.append( get_rsq(hds[i], hds[j]) ) print "Chromosome %d pop %s; %d focal SNPs, %d values of rsq" % \ (chromosome, pop, nreps, len(rsqvec)) # YOUR CODE GOES HERE You will need to insert code of your own just after lines 9 and 28. You do not need to modify anything else. Nonetheless, let us step through the existing code to see what it does. PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 45 Lines 1–7 import modules and define variables for later use. The main loop (lines 15–23) looks at a large number (nreps) of randomly-selected SNPs, which we will call “focal SNPs.” The inner loop compares this focal SNP with each neighboring SNP within an adjoining region whose size (in kb) is given by the variable window. You need not worry about how lines 19–21 work. All you need to know is this: by the time we reach line 22, i is the index of a random SNP, and j is the index of a neighboring SNP. The goal is to find out (a) the distance between these two SNP loci 2 . The first of these data values is calculated for you: it is in the in kb, and (b) their LD value, rD variable kilobases. The second is obtained from a call to get rsq, which you can see on line 23. At 2 ) are appended to their the bottom of the loop (lines 22–23), both data values (kilobases and rD respective lists. In line 25 we have dropped out of the main loop, and both data lists are complete. The program prints a line of output and stops. Exercise In this exercise, the goal is to examine the relationship between LD and the distance that separates loci on a chromosome. You will study the LD-distance relationship in three different human populations. 2. We assume that you already know which chromosome you are working with. If not, refer to section 8.1 of Project 11. 3. This week you will need HapMap data files for three populations, CEU, YRI, and JPT. Please download them as explained in appendix A. For debugging, you may want to use the dummy data set described on page 53. 4. Download rscaninc.py from the class web site and save it as rscan.py. Modify it so that it specifies the right chromosome. 5. Paste your get cov function definition into the program just after line 9. Immediately below, 2. define a function called get_rsq, which returns rD 6. At the end of the program, add code that uses scatsmooth to smooth the data over the interval from 0 to window, using 20 bins. Treat distvec as your X-axis variable and rsqvec as your Y -axis variable. Then print the smoothed data in a table. The table should contain a 2 , and one for n (the numbers of observations row for each bin, one column for dist, one for rD within bins). Make sure your program runs before proceeding. 7. Set nreps to 500 and window to 200. Run the program once with pop set equal to each of the following values: CEU, YRI, and JPT. These values refer to the European, African, and Japanese populations. If you have been using the dummy data set, you will also need to reset chromosome. 2 on the vertical. You 8. Use the data to make a graph with dist on the horizontal axis and rD should end up with three curves—one for each population—on a single graph. You may do PROJECT 12. LINKAGE DISEQUILIBRIUM IN THE HUMAN GENOME 46 this by hand, with Excel, or however you please. Label it so that we can tell which curve refers to which population. 9. Write a paragraph or two describing the pattern in the data and (if possible) suggesting an explanation. Pay particular attention to the differences (if any) between populations. What to turn in (1) your code, (2) the output from the three runs, (3) the graph, and (4) your prose describing and interpreting the results. Project 13 Linkage Disequilibrium Near the Human Lactase Gene Most mammals are able to digest milk only during infancy. Later in life, their bodies stop producing the enzyme lactase and are thus unable to digest milk sugar (lactose). Some humans however continue to make lactase throughout life, a condition called lactase persistance. (It’s opposite is lactose intolerance.) This condition is common in European (and some African) populations and seems to result from a single point-mutation in the promoter region of the lactase gene. Todd Bersaglieri and his colleagues (Am. J. Hum. Genet., 74:1111–1120, 2004) argue that the European mutant arose a few thousand years ago and has been strongly favored by natural selection. Their evidence for this comes from the amount of linkage disequilibrium (LD) in this genomic region. In this project, you’ll see for yourself. You will estimate LD in the region around the putative advantageous mutation and then compare this estimate to the LD on chromosome 2 as a whole. 13.1 An incomplete program As usual, we provide code that does much of the work. You will find this code on the class web site in a file called lactaseinc.py. Here is what it looks like: 1 2 3 4 5 6 7 8 9 10 11 12 from pgen import * from random import randrange reach = 5000 chromosome = 99 pop = "CEU" focal_position = 136325116 # position of SNP rs4988235 # REPLACE THIS WITH YOUR OWN get_rsq FUNCTION def get_rsq(snp_x, snp_y): return 0.0 47 PROJECT 13. LINKAGE DISEQUILIBRIUM NEAR THE HUMAN LACTASE GENE 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 48 # Find average rsq between one SNP (hds[mid]) and all other SNPs that # are nearby on the chromosome. Return None if no such SNPs are # found. Average will include all SNPs between x-reach and x+reach, # where x is the position of the SNP at hds[mid] def window_rsq(hds, mid, reach): midsnp = hds[mid] lo = hds.find_position(midsnp.position - reach) hi = hds.find_position(midsnp.position + reach) if lo == hi: # make sure hi > lo return None rsqsum = n = 0 for i in xrange(lo, hi+1): if i == mid: continue rsq = get_rsq(hds[i], midsnp) if rsq != None: n += 1 rsqsum += rsq return rsqsum / float(n) hds = hapmap_dataset(hapmap_fname(chromosome, pop)) focndx = hds.find_position(focal_position) print "looking for SNP at pos %d; nearest is at %d" \ % (focal_position, hds[focndx].position) if focndx == len(hds)-1: print "Error: couldn’t find focal_position", focal_position exit(1) rsq_mean_obs = window_rsq(hds, focndx, reach) print "Mean rsq w/i %d kb of %s: %f" \ % (round(reach/1000.0), hds[focndx].id, rsq_mean_obs) # Insert code here to estimate the probability that a random region # in this chromosome would have this much LD. Lines 4–7 define several parameters. The parameter reach determines determines the size of the window within which you will estimate LD. In this initial version of the program, the window reaches from 5000 bases to the left of the focal SNP to 5000 bases to the right. (That is why it is called “reach.”) The chromosome number is set to 99, so that we can use the dummy data file during debugging. (See page 53.) In the end, you’ll want to set chromosome = 2 in order to analyze the real data. Line 7 identifies the SNP that is thought to be under selection. This SNP PROJECT 13. LINKAGE DISEQUILIBRIUM NEAR THE HUMAN LACTASE GENE 49 was identified in the Bersaglieri paper mentioned above. It’s label is “rs4988235,” and in our data it lies at position specified on line 7 of the code above. (This number may change in future versions of HapMap, but the label should remain the same.) Lines 9–11 define the stub of a function that is supposed to calculate r2 —except that it doesn’t in this incomplete code. Replace this function definition with your own version from Project 12. The function window_rsq is new. It takes three arguments, hds (the data set), mid (an index into that data set), and reach (discussed above). The function compares a central SNP (hds[mid]) to each SNP in the region around it. If no other SNPs are found in this region, the function returns None. Otherwise it returns the average value of r2 between the SNPs in this region and the central SNP. The calls to find_position will seem mysterious, since you have not yet seen this function used. If you are curious, see page 56. In line 35, we are done defining things and ready to do real work. In rapid succession, we define the dataset (hds), find the index (focndx) of the “focal SNP” (the one that is supposedly under selection), and calculate the mean value (rsq_mean_obs) of r2 within the region around this focal SNP. Exercise 1. Download the relevant files for the current project. You will need: (a) pgen.py, (b) the incomplete Python program lactaseinc.py, (c) the HapMap data file for chromosome 2 in the European population (CEU), and (d) the dummy data file for “chromosome 99.” For instructions on downloading HapMap data files, see appendix A. For instructions on getting the dummy data file, see page 53. 2. Replace lines 10–11 with your own working copy of the get rsq function. You’ll need the cov function too. At this point, your program should calculate the mean value of r2 in the region surrounding the putatitive selected SNP. 3. Add code to the end of the program that calculates (a) the mean r2 value within 100 randomly selected regions on chromosome 2, (b) the fraction of these regions whose mean r2 is greater than or equal to the “observed value,” i.e. the value for the region surrounding the putative selected SNP. Hint: To calculate r2 in randomly selected regions, choose random SNPs as we did in Project 12, and then hand each one to function window_rsq. To calculate the fraction of these values that is greater than or equal to the observed value, proceed as we did in Project 6. 4. Now it is time to analyze real data, so set chromosome equal to 2. Then run the program with reach set to each of the following values: 20,000, 200,000, and 1,000,000. These runs will take longer, so be patient. 5. Write a paragraph discussing the results. Is the LD around the lactase gene unusual on chromosome 2? If so, how far does the region of unusual LD seem to reach? Appendix A Downloading HapMap data A.1 Interacting with the HapMap server Point your web browser at www.hapmap.org. Under the heading “Project Data,” click on the link to “Bulk Data Download.” 50 APPENDIX A. DOWNLOADING HAPMAP DATA 51 On the next page, click on “Genotypes” under the heading “Bulk data.” Then continue with: 2008-03 −→ forward −→ non-redundant. APPENDIX A. DOWNLOADING HAPMAP DATA 52 This will bring you to a list of files like the one on the left. In these file names, the string “chr22 JPT” means “chromosome 22” in population “JPT.” Scroll down the list to the file you need. Then right-click and select “Save link as” in order to download the file onto your own computer. A.2 Uncompress the data files At this point, the file will be on the hard disk of your own computer. Your next task is to find it. Check the folders “Desktop,” and “Documents.” On Linux, check your home directory. Once you find the file, you will need to uncompress it. HapMap delivers it in a compressed format in order to speed the download process. (That is the significance of the “.gz” in the file name.) If the file shows up as an icon on your Desktop or file browser, try clicking on the icon. This will probably bring up a dialog box that will allow you to uncompress the file. On the Mac, be careful not to click on this file repeatedly. If you do, each click will generate an additional copy of the uncompressed file. These additional files will have names that end in “-2,” “-3,” and so on. You will need to delete these extraneous files before you can proceed. If you are unable to uncompress the data file by clicking on it, here is an alternative method: Mac Open a command-line interpreter, which is called a “terminal” on the Mac operating system. Then use the ls and cd commands to navigate to the directory where the .gz file resides, and type gunzip filename.gz at the prompt. Linux The same, except that the command-line interpreter is sometimes called a shell window. Windows The gunzip facility does not come with Windows, but you can download it for free from www.gzip.org. After that, the procedure is the same as for Mac or Linux, except that APPENDIX A. DOWNLOADING HAPMAP DATA 53 the command-line interpreter is called a command window, and you use dir instead of ls. If you prefer a point-and-click interface, look into the commercial program WinZip. If you decide to download all the HapMap data files, you can uncompress them all with a single command: gunzip *.gz This is much faster than pointing and clicking on the files one at a time. A.3 The dummy data file As you are debugging your programs, you must execute it after each edit. This can be a slow process unless the program executes very fast. Unfortunately, even the smaller HapMap data files are pretty large, and parsing them repeatedly can slow down the process of programming. To solve this problem, we have created a dummy HapMap data file called genotypes_chr99_ CEU_r23a_nr.b36_fwd.txt. In contains only a small number of SNPs, and your program can parse it almost instantly. We encourage you to use this file rather than any real data until you get your code running. To do so, place the file in the same folder with your other programs and specify chromosome = 99 pop = ’CEU’ The dummy data file is under Data on the lab page of the class web site. A.4 Put the data files into the folder where you plan to use them Once the data file is on your machine, you need to figure out where (in which folder) to put it. If you work on your own machine, it’s probably best to create a separate folder. On the Marriott Macs, it makes more sense just to work on the desktop. Whatever you decide, make sure that the folder exists and that it contains your HapMap data file along with a copy of pgen.py. Then launch Idle, get into the interactive window, and try importing pgen. To do so, just type “from pgen import *”. If this works, nothing will print. If you get an error message, then Idle was unable to find the file pgen.py. To fix this problem, see section B.4 on page 57. A.5 Downloading all the HapMap files On your home machine, you may want to download all of the HapMap files, for your own personal use. You would not want to do this by pointing and clicking on each file, one after the other. It is much easier to use the program ftp, and connect to ftp.hapmap.org. If you want to do this, we’ll be happy to show you how. Appendix B More Python In writing this lab manual, we have introduced various Python constructs that were not covered in JEPy. We collect them here for convenient reference. B.1 Basic Python zip(xv, yv) is a Python facility that steps through the elements of several sequences (lists, tuples, strings, or whatever). For example, >>> xv = [’x1’, ’x2’] >>> yv = [’y1’, ’y2’] >>> for x, y in zip(xv, yv): ... print x, y ... x1 y1 x2 y2 Each time through the loop, x and y are automatically set equal to corresponding elements from the lists xv and yv. This function is often used with a pair of lists, but it will also work with three or more. (This paragraph is a copy of one on page 42.) B.2 The random module This module is part of Python’s standard library and is documented in full on the Python website: http://docs.python.org/3.1/library. We have made use of the functions listed below. Before using them, be sure to import the random module with a line of code such as from random import * After that command executes, you will have access to all the facilities of the random module. We list only a few of these here—the ones used in the lab projects but not covered by JEPy. 54 APPENDIX B. MORE PYTHON 55 choice(seq) returns a random element from the sequence seq, which may be a list, a tuple, a string, or any Python object with similar properties. For example, choice("abc") returns "a", "b", or "c" with equal probability. expovariate(h) returns a random value drawn from the exponential distribution with hazard h, or mean 1/h. randrange(a, b) returns a random integer from a, a + 1, . . . , b − 1. The first argument is optional and defaults to 0 if omitted. Thus, randrange(3) returns either 0, 1, or 2 with equal probability, but randrange(1,3) returns either 1 or 2. B.3 The pgen module This module is available on the class website: http://www.anthro.utah.edu/~rogers/ant5221/ lab as a file called pgen.py. Download that file, and place it in the same folder (or directory) as the Python program you are writing. At the top of your own Python program, import the pgen module as follows: from pgen import * B.3.1 Random numbers bnldev(n,p) returns a random deviate drawn from the binomial distribution. When the function is called, n is a integer, the number of trials, and p is a float, the probability of “success” on each trial. Consider for example the experiment of tossing an unfair coin for which the probability of “heads” is 0.6 on each toss. To simulate the experiment of tossing this coin 30 times, we could use the command x = bnldev(30,0.6). After the command executes, x will hold an integer between 0 and 30, which represents the number of “heads.” mnldev(n, p) returns a random deviate drawn from the multinomial distribution. This distribution generalizes the binomial. It might be used, for example, to describe the process of sampling balls with replacement from an urn containing balls of several colors. The number of balls drawn is called the number of trials. The probability that a random ball is (say) red is equal to the relative frequency of such balls within the urn. When mnldev is called, n (an integer) is the number of trials, and p is a list or tuple whose ith entry (pi ) is the probability that outcome i is observed on any given trial. The values in p only need to be proportional to the corresponding probabilities. They do not need to sum to unity; they only need to be positive numbers. For example, suppose that the urn contains 300 red balls, 100 black ones, and 50 white ones. To simulate the experiment of drawing 10 balls (with replacement), we could use the command x = mnldev(10, [300, 100, 50]). After the command executes, x will point to a list that contains the number of red balls drawn, then the number of black ones, and then the number of white. APPENDIX B. MORE PYTHON 56 poidev(m) returns a random deviate drawn from the Poisson distribution with mean m. For example, the command x = poidev(10) sets x to a value drawn from the Poisson distribution with mean 10. B.3.2 For HapMap The pgen module contains several facilities for working with HapMap genotype data. For a gentle introduction to these methods, see section 8.3. hapmap fname(chromosome, population) returns the name of the file (on your own computer) containing data for a given chromosome and population. It can find this file only if it resides in Python’s current working directory (see section B.4). hapmap dataset(filename) creates an object of type hapmap dataset. Such objects include the following data: filename the name of the original HapMap data file snps a list of objects of type hapmap snp, each containing data from a single snp. Objects of type hapmap dataset are created as follows: chromosome = 22 pop = "CEU" hds = hapmap_dataset(hapmap_fname(chromosome, pop)) Now hds is an object of type hapmap dataset. Its instance variables can be accessed like this: >>> hds.filename ’/home/rogers/hapmap/hapmap-r23/genotypes_chr22_JPT_r23a_nr.b36_fwd.txt’ The variable hds behaves like a list or tuple of SNPs. len(hds) returns the number of SNPs in the data set, and hds[3] returns a pointer to the SNP at position 3 within the data set. The first SNP is hds[0], and the last is hds[len(hds)-1]. Objects of type hapmap dataset provide the following methods: find position(pos) returns the index of the SNP whose position on chromosome (measured in base pairs) is closest to (but not necessarily identical to) pos. The argument pos should be a positive integer. If pos >= len(self), the function returns len(self)-1. find id(id) returns the index of SNP whose identifying string (“rs number”) equals id. On entry, id should be a string, formatted as in HapMap data files. For example: “rs4284202.” If id is not present in the data, the function returns len(self). APPENDIX B. MORE PYTHON 57 hapmap snp An object of this type represents all the data for a single SNP. Each such object includes the following data values: id the “rs number” (a unique identifier) of this SNP alleles a list of the alleles present at locus chromosome the label of the chromosome on which this SNP resides position an integer; the position of the SNP on the chromosome gtype a list of genotype data. Each item in the list is an integer, the number of copies of alleles[0] in an individual genotype. sampleSize number of values in gtype. mean the mean of gtype variance the variance of gtype Suppose that snp is an object of type hapmap snp. Then its data values can be accessed by syntax such as snp.id, snp.alleles, and so on. In addition, snp behaves like a list or tuple of genotypes. For example, len(snp) returns the number of genotypes (as does snp.sampleSize), and snp[4] returns the genotype at position 4. B.3.3 Plotting charplot(x, y, nticks, outputheight, outputwidth) prints scatterplots on terminals without graphics capabilities. On entry, x is a list of x-axis values and y a list of y-axis values. The other arguments are optional and specify the number of tick marks per axis, and the height and width of the output in character units. The specified number of tick marks is advisory only. The program will do its best to use tick marks that are as close as possible to the number requested without being ugly. scatsmooth(x, y, n, minx, maxx) smooths a set of (x,y) data by dividing the range of x values into n equally-spaced bins, and calculating the average y-value within each bin. Function returns (bin_x, bin_y, bin_n). For bin i, bin_x[i] is the midpoint of the x values, bin_y[i] is the mean of y, and bin_n[i] is number of observations. All parameters except x and y are optional. If n is omitted, the default value is 10. If minx and maxx are omitted, they default to min(x) and max(x). See page 43. B.4 Helping Python find input files In these projects, Python will need to read several kinds of input files: HapMap data files, pgen.py, and the programs that you write. If it fails to find these files, nothing works. Here is an interaction that illustrates the problem: >>> from pgen import * Traceback (most recent call last): APPENDIX B. MORE PYTHON 58 File "<stdin>", line 1, in <module> ImportError: No module named pgen Here, I tried to import the pgen module, and Python let me know that it could not find it. To avoid this problem, you must tell Python where your data files are. Python searches for input files in a list of folders called the “path.” To help it find these files, do one of two things: either put your own input files into one of the folders in Python’s path, or else modify that path so that it includes the folder where your files reside. To examine the path, use Idle’s “Path Browser,” which you will find under the “File” menu. B.4.1 Putting your own files into Python’s path On Macs, Python searches the Documents folder by default. If you put all your files there, Python will have no trouble finding them. This is the easiest approach and is the one to use when you are working in the lab at Marriott. B.4.2 Putting your files into some other folder If you work on your own computer, you may not want to clutter up Documents with the files from this class. On my own machine, I keep the material for this class in a separate folder called “pgen.” You might want to do something similar. If you launch Idle from an icon, it will not initially know about the folder that contains your files. But if you open a .py file within this folder and execute it, Idle automatically adds this folder to the path. After that, you can execute commands like import pgen without difficulty. Those of us who work with the command line interface have it even easier. If you launch Idle from the command line, it automatically adds the current working directory to its search path and thus has no difficulty finding files. B.4.3 Manipulating Python’s current working directory Python looks for input in a folder called the “current working directory,” or CWD. This works well if you launched Python from the command line, for then the CWD is just the directory within which you typed the “python” command. But if you launched Python by clicking on an icon, the CWD is unlikely to be set to a useful value. You can, however, manipulate it by typing commands into Python’s interpreter. To do so, you must first import the os module: >>> import os You can then check the CWD like this: >>> os.getcwd() ’/home/rogers’ APPENDIX B. MORE PYTHON 59 This tells me that the CWD is the directory /home/rogers. If my input file is in that directory, Python will find it. But suppose it is in a subdirectory of /home/rogers called pgen. I can change the CWD by typing >>> os.chdir("pgen") This changes the CWD to /home/rogers/pgen, provided that this directory exists on my computer. To check that this worked, use the os.getcwd command once again: >>> os.getcwd() ’/home/rogers/pgen’ In the preceding example, the os.chdir command was very simple, because we were moving into a subdirectory of the directory we started in. If you need to make a more drastic move, specify the entire path name. For example, the command >>> os.chdir("/home/rogers/pgen") would would move us to /home/rogers/pgen no matter where we started from.
© Copyright 2025