1 Supplementary Materials and methods Sample preparation DNA was extracted from blood sample by using Qiagen QIAamp DNA blood mini/midi kit (Qiagen, San Diego, CA, USA). DNA concentration and quality was determined by measuring light absorbance at 260 nm and 280 nm. Samples with A260/A280 ratio between 1.7 and 2.0 were included for genotyping. RNA for expression assays were extracted using TRIzol reagent (Life Technologies, Rockville, MD, USA) and converted to cDNA using an oligo (dT) 15 primer and Superscript III (Invitrogen, Carlsbad, CA, USA). SNP selection The association region for fine mapping was delimited by recombination hotspots flanking the candidate genes ADD3 and XPNPEP1, which had accommodated the haplotypes possibly tagged by rs17095355 and other linked association markers. The region spans 350kb from chr10:111600kb to chr10:111950kb, and accommodates 1580 SNPs (NCBI dbSNP build37). We did regional imputation based on the 13 GWAS SNPs genotype data in the fine mapping region (MACH1.0, based on HapMap Phase II data)[1]. In GWAS, the 13 SNPs were associated with BA as tagged by rs17095355, but had no independent effect in disease association. The imputation did not produce higher disease association signals. 2 Next, we ran a dense-mapping strategy to assay the genetic architecture in the region. Multiple allelic SNPs or SNP with minor allele frequency (MAF) <1% in Asian were excluded, and 624 out of 1580 SNPs were left. Tag SNPs was selected after wise by using r2>0.99 based on the HapMap Release #22 genotype data of Asian individuals. For the completely tagged SNPs, functional SNPs were prioritized according to the genomic coordinates ftp://hgdownload.cse.ucsc.edu/goldenPath/). (UCSC regulatory track, hg18, As a result, 107 SNPs in total were selected for genotyping. SNP genotyping SNPs were genotyped on Sequenom platform in 2 batches using the same reaction lot. To assess genotype reproducibility, we included 28 randomly duplicated samples on the 7 Polymerase Chain Reaction (PCR) plates. Two SNPs rs17095355 and rs2501577 were genotyped by Sanger sequencing. The sequencing result was read by two technicians independently. Phenotype permutation strategy Correction for the inflated type I error in multiple-testing using the Bonferroni method[2] is too conservative for our dataset, because these SNPs are highly correlated with each other. Alternatively, we applied the phenotype permutation test to empirically estimate the study-wide significance level of 3 our association results, by swapping the phenotype identity (105 times) of our samples, testing the “case/control” association and then examining the possibility of getting a better-than-the-observed (all variants) association statistics. Since swapping the phenotype labels preserved the linkage structure among variants, the possibility, or the empirical p (EMP2) value, fully addressed the correlation among variants, while powerfully deflating the type I error in multiple-comparisons. Another advantage of the phenotype permutation test is that sample substructures other than case/control segregation will be exposed by the random “case/control” association test, therefore the better-than-the-observed (single variant) empirical p value (EMP1) can be applied as a quality control method to test associations caused by case/control bias. We plotted EMP1 against the observed association p values of variants in a quantile-quantile (QQ) plot to examine whether there was deviation was caused by population stratification (Figure S2). The quality control procedure (Please refer to Figure S1) Samples with mixed-labeling errors, and those with low genotyping rate (<95%) were excluded from further analysis. SNPs with genotyping rate <95% (N=14), minor allele frequency (MAF) <1% (N=1), and SNPs that deviated from Hardy–Weinberg equilibrium (N=2; p≤0.0001) were excluded for downstream association tests. After QC, to refine the variants mapping, we imputed un-genotyped SNPs in 10q24.2 by IMPUTE2[3] (Table S2). 4 Imputation To explore the un-genotyped SNPs in the fine mapping region, we imputed the region encompassing the genotyped SNPs using IMPUTE2[3]. We used the 1000 Genomes Phase I reference data contains 1,094 individuals from Africa, Asia, Europe, and the Americas. SNPs with a quality score (i.e. info) >0.30 were considered eligible for downstream analysis (896 SNPs). Additionally, 41 missing genotypes of the typed SNPs (after genotyped SNPs quality control) using the implemented -pgs_miss function by IMPUTE 2.0. Disease association tests of the imputed genotypes were done using logistic regression test which accounted for the dosage uncertainty of imputation by PLINK [4]. Two hundred and fifty five SNPs were detected to be associated with disease (p<0.05 by phenotype permutation test; Figure 1b, Table S2). Model selection The rationale of the model selection strategy was developed from Macleod and Xu’s method[5]: for the generalized linear-regression model (GLM), subsets of genotypes (variables) were randomly sampled to predict the BA phenotype (response), subsequently the fitness of each prediction was evaluated by Akaike Information Criterion (AIC)±2 value (a parsimonious model was prioritized), and the best model was chosen. Since exhaustive sampling is not feasible at the current computational 5 platform, given the great number of variants, we modified the approach by using a Markov Chain Monte Carlo (MCMC) method with a simple Metropolis-Hasting algorithm 105 times (after 5000 burnin) for each fixed number of variants (ranging from 2 to 10). The algorithm was conducted in R[6] and the script is attached as supplementary material. To reduce noise of random fitting, associated variants detected by the phenotype permutation procedure were included in the model selection procedure. Each SNP had been genotyped with equal sample size to minimize bias of power in the procedure. Haplotypes were estimated by an expectation-maximization (EM) algorithm from the genotypes. The haplotype-specific case-control association test was performed by PLINK using the logistic regression method and incorporating the S2 covariate to eliminate possible confounding effect. Phenotype permutations (105 times) were performed to evaluate the empirical haplotype-specific significance[4]. Independent effect of the tagging SNPs was interrogated by haplotype-based conditional tests implemented in PLINK[4, 7]. Real-time quantitative PCR The cDNA products equivalent to 10ng of total RNA were used for quantitative real-time PCR (RTPCR) using TaqMan gene expression assays from Applied Biosystems. The assays for ADD3 (Hs00249890_m1) and XPNPEP1 (Hs00958026_m1) both target all isoforms of these genes. Real-time quantitative PCR was performed in triplicate (96-well plates) on an ABI 7900 6 (Applied Biosystems) machine using standard thermal cycling conditions (10min at 95 oC, 40 cycles for 15s at 95oC, 1min at 6oC). A standard curve was constructed for each PCR run with plasmid cDNA of ADD3 and XPNPEP1 gene. The amount of target gene per sample was interpolated according to the standard curves. All analyses were performed in a blinded fashion with the laboratory operators unaware of genotyping data. ADD3 immunohistochemistry staining ADD3 immunohistochemistry was performed to examine the expression pattern in BA livers (Adducin γ Antibody (H-60), Catalog Number: SC-25733, Santa Cruz, CA, U.S.A). Formalin-ﬁxed liver tissues were processed and 5-mm thick paraffin sections were used for ADD3 immunohistochemistry. Briefly, deparafﬁnized liver sections were treated with 3% hydrogen peroxide followed by antigen retrieval. The sections were blocked with 10% sheep serum, and incubated with the primary antibody, which was then linked to the biotinylated secondary antibody followed by the avidin–biotin complex (ABC Vectastain kit; Vector Laboratories, Burlingame, CA, USA), and diaminobenzidine, which gave a brown reaction product. The sections were then counterstained with hematoxylin. Bioinformatics investigation of SNP functionality The fine mapping genomic region including the genetic region of XPNPEP1 and ADD3, and 7 upstream/downstream sequence of the genes, therefore was enriched with cis regulatory elements (http://genome.ucsc.edu/). Because the fine mapping region can have undergone positive genetic selection, conservation-based functionality prediction from multiple species sequence alignment was less prioritized in this case. Alternatively, we applied experimental predictions for bioinformatics investigation of SNP loci functionality. We used Encode Regulation super-track (including 5 sub-tracks; italic fonts indicate track names, same as below)[8], regulatory potential prediction (regPotential7x)[9], and microRNA target scan (TS miRNA sites)[10] for the analysis (Figure S6). The related genomic regions were downloaded as .bed files from UCSC website (http://genome.ucsc.edu/), except for the layered H3k4me3, layered H3k4me1, and layered H3k27ac regulatory tracks in the Encode Regulation super-track, for which we downloaded the .wiggle files (data points at a resolution of 25 base pair per point). Signal intensity in the .wiggle files, reflected as height in related tracks of Figure S6. To reduce false discoveries, only loci with signal intensity greater than 30 in 2 or more tracks (layered H3k4me1, layered H3k27ac, and DNase clusters) were considered as positive for Enhancer prediction. Promoter prediction was mainly based on layered H3k4me1, the Encode promoter-associated histone mark prediction (signal intensity greater than 30). Besides, we also checked expression quantitative trait loci based on liver array data[11], but no positive loci for either XPNPEP1 or ADD3 was found. Positive predictions are listed in Table S2. 8 Assay the signature of selection: the fixation index (Fst) HapMap Phase II samples include 60 Utah residents with ancestry from northern and western Europe (CEU), 45 Han Chinese in Beijing (CHB), 45 Japanese in Tokyo (JPT), and 60 Yoruba in Ibadan, Nigeria (YRI) were included. We pooled the data of CHB and JPT as one geographical group and denoted it as ASN(Asian)[12]. A total of 2881972 SNPs Fst values genotyped in all three populations, excluding sex chromosome SNPs, were analyzed. Fst was calculated by Cheng, et.al[13]. The fractions of highly differentiated SNPs (F) were used to test the enrichment status of genetic regions: the ADD3 locus, the whole genome gene loci, and the sliding windows. Each gene locus was defined as the Refseq gene region extending 30 kilo base pairs both upstream and downstream. Each sliding window include 200 SNPs, slide along each chromosome with one SNP forward for each step. Haplotype blocks were estimated using Gabriel’s[14] algorithm using default parameters in Haploview[15], the computing was completed using PLINK[4]; only blocks with more than 10 SNPs were considered for downstream analysis. The analytical steps are illustrated in Figure S4. The definition of ancestral allele is based on chimpanzee/human sequence alignment from National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/)[16]. 9 REFERENCES [1] Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genetic epidemiology 2010;34:816-834. [2] Dunn OJ. Multiple Comparisons Among Means. Journal of the American Statistical Association, 1961;56:9. [3] Howie B, Marchini J, Stephens M. Genotype imputation with thousands of genomes. G3 2011;1:457-470. [4] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 2007;81:559-575. Xu AIMaC. bestglm: Best Subset GLM. R package version 0.33. 2011. [5] [6] Team TRDC. R: A Language and Environment for Statistical Computing. 2010. [7] Purcell S, Daly MJ, Sham PC. WHAP: haplotype-based association analysis. Bioinformatics 2007;23:255-256. [8] Consortium EP. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS biology 2011;9:e1001046. [9] King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome research 2005;15:1051-1060. [10] Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 2005;120:15-20. [11] Schadt EE, Molony C, Chudin E, Hao K, Yang X, Lum PY, et al. Mapping the genetic architecture of gene expression in human liver. PLoS biology 2008;6:e107. [12] International HapMap C. The International HapMap Project. Nature 2003;426:789-796. [13] Cheng F, Chen W, Richards E, Deng L, Zeng C. SNP@Evolution: a hierarchical database of positive selection on the human genome. BMC evolutionary biology 2009;9:221. [14] Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, et al. The structure of haplotype blocks in the human genome. Science 2002;296:2225-2229. [15] Barrett JC, Fry B, Maller J, Daly MJ. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 2005;21:263-265. [16] Carlson CS, Thomas DJ, Eberle MA, Swanson JE, Livingston RJ, Rieder MJ, et al. Genomic regions exhibiting positive selection identified from dense genotype data. Genome research 2005;15:1553-1565. Subset sampling n-script N <- 100000 dev <- c() size <- n steps <- c() S <- 1:72 S1 <- sample(S,size) S2 <- setdiff(S,S1) glmfit <- glm(y~x[,S1],family="binomial") istep <- 0 while(istep < N){ i1 <- sample(S1,1) i2 <- sample(S2,1) S1.new <- c(setdiff(S1,i1),i2) glmfit1 <- glm(y~x[,S1.new],family="binomial") if(runif(1) < min(1,exp((glmfit$dev-glmfit1$dev)/2))){ S1 <- S1.new S2 <- c(setdiff(S2,i2),i1) glmfit <- glmfit1 } istep <- istep+1 if(istep%%100==1){print(istep)} steps <- cbind(steps,sort(S1)) dev <- c(dev,glmfit$dev) } bestdevs <- sort((unique(dev)))[1:10] bestsubsets <- rep(0,10); for(i in 1:10){bestsubsets[i] <which(dev==bestdevs[i])[1]} bestMarkerSets <- c() for(i in 1:10){bestMarkerSets <rbind(bestMarkerSets,markerNames[steps[,bestsubsets[i]]])} data.frame(bestdevs,bestMarkerSets,t(steps[,bestsubsets]))