F - 8th International Biocuration Conference

The 8th International Biocuration Conference, April 23-26, 2015, Beijing, China
Challenges and Practices of Big Data in Life Science
“Genomics study driven by biological questions”
Yixue Li
Institute of Biochemistry & Cell Biology
Shanghai Institutes for Biological Sciences
Chinese Academy of Sciences
yxli@sibs.ac.cn
“For such a large number of problems
there will be some animal of choice, or a
few such animals, on which it can be
most conveniently studied”
August Krogh et al., Am J Physiol. 90(2) pp. 243-251(1929)
August Krogh: a Danish physiologist and Nobel laureate,
who received the Nobel Prize in 1920.
“Science is a gamble, then you need to win” is a key!
Dr. Takashi Gojobori, 2014, Jeju, Korea
Dog genomics study for deciphering
mechanisms of adaptation to
high-altitude hypoxia
Tibetans
• average elevation of 4,000 meters
• oxygen level is about 60% of that at sea level
• cold climate and limited resources
• sustained increase in cerebral blood flow
• lower hemoglobin concentration
• less susceptibility to chronic mountain sickness
Andeans
Ethiopians
• hemoglobin concentration is higher
• temporary and reversible acclimatization
• increased oxygen level in hemoglobin
• around 3,000 to 3,500 metres
• elevated hemoglobin levels do not increase the oxygen content of hemoglobin
Beall. PNAS. 2002
Erythrocytosis is a common symptom of chronic
mountain sickness and leads to high blood
viscosity and cardiovascular disorders, whereas
a decrease in hemoglobin level may provide a
protective mechanism for people living at
high altitude.
Pre-genome scan result
• HIF (hypoxia-inducible factor) pathway
– Tibetans
• Metabolic pathways
– Yak
– Tibetan antelope
Simonson et al.
Science. 2010
Although many studies have focused on
wildlife and human highlanders, only a
few have been performed on
domesticated animals that migrated
to the plateau with humans.
Beall et al. PNAS, 2010
Qiu et al. Nat. Genet. 2012
Results about EPAS1/HIF2α
• 31 SNPs were found in the intron region of the EPAS1 gene,
which encodes a transcription factor also called HIF2α.
These EPAS1 SNPs were in high linkage disequilibrium and
correlated significantly with hemoglobin concentration in
the Tibetan population (196 Tibetans and 84 Han individuals
from HapMap3; Beall et al. 2010).
• Because all of the SNPs found are located in the
intron region of EPAS1, the detailed functional
association between genotype and phenotype
remains unclear. We still want to know what type of
selection on the hypoxia-inducible factor (HIF) pathway
underlies human high-altitude adaptation.
Focus on domesticated animals that migrated
to the plateau with humans/Tibetans.
Tibetans vs. Tibetan Mastiff
Increased blood flow (Tibetans) vs. ? (Tibetan Mastiff)
Genome-wide association study with Illumina genotyping chips vs. whole-genome sequencing
Samples and Data
• We sampled six dog breeds from continuous
altitudes along the “Ancient Tea Horse Road”
in southwestern China.
• Each dog was sampled from a different
village to avoid potential kinship.
• The sex ratio was kept at 1:1 for each breed.
• In total, 60 dogs from six dog breeds were
sequenced.
Breed (abbreviation)           History  Sample size  Location                        Altitude
Tibetan Mastiff (TM)           Ancient  10           Cuomei, Tibet, China (n = 4)    5,100 m
                                                     Yushu, Qinghai, China (n = 4)   4,200 m
                                                     Diqing, Yunnan, China (n = 2)   3,300 m
Diqing indigenous dog (DQ)     Ancient  10           Diqing, Yunnan, China           3,300 m
Lijiang indigenous dog (LJ)    Ancient  10           Lijiang, Yunnan, China          2,400 m
Kunming dog (KM)               Modern   10           Kunming, Yunnan, China          1,800 m
German Shepherd (GS)           Modern   10           Kunming, Yunnan, China          1,800 m
Yingjiang indigenous dog (YJ)  Ancient  10           Yingjiang, Yunnan, China        800 m
From raw reads to SNPs
[Workflow figure: per-individual and per-breed analysis. Raw reads → Read QC (QC report) → Read mapping (Mapping report) → SNP calling (Depth report) → SNP filtering (SNP report) → SNP annotation (Annotation report)]
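The SNP filtering step in the workflow above can be illustrated with a minimal sketch. It is not the pipeline used in the study: it assumes the calls are available as VCF-style tab-separated text, and the function names (`parse_info`, `filter_snps`) and thresholds (`min_qual`, `min_depth`) are hypothetical choices made for illustration.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=14;AF=0.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def filter_snps(vcf_lines, min_qual=30.0, min_depth=10):
    """Keep biallelic SNP records that pass quality and depth thresholds.

    The thresholds are illustrative assumptions, not values from the study.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
        # Biallelic SNP: single-base REF and one single-base ALT allele.
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        if qual == "." or float(qual) < min_qual:
            continue
        depth = int(parse_info(info).get("DP", 0))
        if depth < min_depth:
            continue
        kept.append((chrom, int(pos), ref, alt))
    return kept


# Toy records, made up for this example.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr10\t1000\t.\tG\tA\t50\t.\tDP=20",   # passes
    "chr10\t2000\t.\tG\tA\t10\t.\tDP=20",   # low quality
    "chr10\t3000\t.\tG\tAT\t50\t.\tDP=20",  # not a SNP
    "chr10\t4000\t.\tC\tT\t50\t.\tDP=3",    # low depth
]
print(filter_snps(example))
```

In practice the per-breed SNP report would summarize how many records each filter removes; the sketch only returns the surviving sites.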
Workflow for population genetics data analysis
[Workflow figure. Labels: diversity; population polymorphism; allele frequency; demographic event; LD; PCA; population structure; tree; evolutionary relationship; ancestry; FST; selective sweep mapping; diversity reduction; LD increase; selective target]
Whole-genome FST mapping
• We performed a whole-genome FST scan and focused on regions with extreme
FST values (Z(FST) > 5).
• 28 unique autosomal regions containing 141 candidate genes were identified.
• Five of these genes (EPAS1, MSRB3, HBB, CDK2 and GNB1) belong to
the GO categories ‘response to oxygen levels’ and ‘response to oxidative stress’.
FST: a measure of genetic differentiation among populations
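The Z(FST) > 5 cutoff can be read as standardizing each window's FST against the genome-wide distribution. The sketch below shows that idea under stated assumptions: the slides do not say which FST estimator or window size was used, so `fst_biallelic` uses Wright's textbook definition for two equally weighted populations, and the window names and values are made up.

```python
from statistics import mean, pstdev


def fst_biallelic(p1, p2):
    """Wright's FST at one biallelic site for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    if h_total == 0:
        return 0.0  # site is monomorphic in both populations
    return (h_total - h_within) / h_total


def z_transform(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all windows have the same FST")
    return [(v - mu) / sigma for v in values]


def extreme_windows(window_fst, z_cutoff=5.0):
    """Return the names of windows whose Z(FST) exceeds the cutoff."""
    names = list(window_fst)
    z_scores = z_transform([window_fst[n] for n in names])
    return [n for n, z in zip(names, z_scores) if z > z_cutoff]


# Toy genome: 99 background windows and one strongly differentiated window.
windows = {f"win_{i}": 0.1 for i in range(99)}
windows["win_99"] = 0.9
print(extreme_windows(windows))
```

Because the Z score depends on the genome-wide mean and spread, the same raw FST value can be extreme in one comparison and unremarkable in another.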
HIF pathways
• The region with the strongest differentiation contains EPAS1, a gene that
encodes hypoxia-inducible factor (HIF) 2α.
• Network analysis indicated that the other candidate hypoxia-response
genes we identified would all be regulated by the HIF signaling pathway,
suggesting an essential role of EPAS1 in the adaptation of high-altitude dogs.
• Interestingly, EPAS1 was also identified as a selective target in Tibetan
people.
Amino acid conservation
• Among the four non-synonymous mutations, one (G305S) occurred
in the PAS domain, which is essential for the activity of EPAS1.
• G305S also affects a highly conserved amino acid position, which is
invariant among all the vertebrates we examined.
Structural and functional effects of G305S
• G305S occurred in a beta sheet, which could affect the
thermodynamic stability of the domain.
• Prediction of functional effects supports that only G305S is
deleterious, while the other three are tolerated.
Physiological association
• We conducted association testing for the variant G305S and hematologic
parameters in DQ, the high-altitude breed where enough homozygotes (n
= 40) and heterozygotes (n = 29) could be collected.
• Although no evident relationship with hemoglobin concentration was
found, the homozygotes with two mutant alleles (AA) showed decreased
vascular resistance compared with the heterozygotes (GA).
Zhen W. et al., Genome Research, 2013
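A comparison of a hematologic parameter between two genotype groups, as in the association test above, could look like the following sketch. The slides do not state which statistical test was used; Welch's t statistic is swapped in here as one common choice, and the measurements are made-up numbers in arbitrary units, not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances
    se2 = va / na + vb / nb                        # squared standard error
    t = (mean(group_a) - mean(group_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical vascular-resistance values for each genotype group.
homozygotes_aa = [1.0, 2.0, 3.0]
heterozygotes_ga = [2.0, 3.0, 4.0]

t_stat, dof = welch_t(homozygotes_aa, heterozygotes_ga)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A negative t statistic here means the first group (AA) has the lower mean; turning it into a p-value requires the t distribution, which is left out to keep the sketch dependency-free.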
Camel genomics study for
prevention mechanisms of
type 2 diabetes
• Camels store energy as fat in their humps and abdomen,
enabling them to survive long periods without
food and water.
• Body temperature may vary from 34 to 41 °C
throughout the day.
• Blood glucose levels in camels (6-8 mmol·l⁻¹)
are twice as high as those in other ruminants.
• Camels tolerate a high dietary intake of salt, consuming
eight times more than cattle and sheep.
• The Camelidae are the only mammals that
can produce heavy-chain antibodies (HCAbs).
Kaske, M., Elmahdi, B., Engelhardt, W. & Sallmann, H. P. Insulin responsiveness of sheep, ponies,
miniature pigs and camels: results of hyperinsulinemic clamps using porcine insulin. J. Comp.
Physiol. B 171, 549–556 (2001).
GENOME ANNOTATION PIPELINE
[Pipeline figure. Scaffolds → RepeatMasker (repeat elements) → repeat-masked sequences → ncRNA prediction: tRNA (tRNAscan), rRNA (SILVA), miRNA (miRBase) → repeat- and ncRNA-masked sequences → protein-coding gene prediction: ab initio (Augustus, GenScan), EST (dromedary), homology (genBlastA), EvidenceModeler → protein-coding genes → annotation: InterProScan (domain/family), KAAS, GO, KEGG]
Genome data visualization
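Genome
Comparative genomics
Interactive display
Dynamic display
Integrated display
Population genomics
Functional genomics
(transcriptome,
proteome,
metabolome)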
THE ACCELERATED EVOLUTION OF PATHWAYS IN CAMELS
We estimated the dN/dS ratios for camel genes and their closest cattle orthologs, taking the
human ortholog as an outgroup. Genes evolving significantly faster in camels than
in cattle were identified and mapped to KEGG pathways.
RAPIDLY EVOLVING GENES
[Tree figure: Human, Cattle, Camel]
• Rapidly evolving genes, as measured by an increased dN/dS ratio, may be under adaptive
selection or relaxed purifying selection.
• In total, 2,730 genes evolve significantly faster in camel than in cattle, taking human
orthologs as outgroups.
• These rapidly evolving genes are enriched in metabolic pathways and signaling pathways
regulating metabolic processes, and some of them are also cancer related.
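To make the dN/dS comparison concrete, here is a minimal sketch. The slides do not name the estimation method or the significance test, so this uses a Nei-Gojobori-style calculation with the Jukes-Cantor correction as a stand-in, assumes that per-lineage substitution and site counts are already available (for example, inferred with the human ortholog as outgroup), and omits the significance test. All counts are made up.

```python
from math import log


def jukes_cantor(p):
    """Jukes-Cantor corrected distance from a proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """dN/dS from counts of differences and sites (Nei-Gojobori style)."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn / ds


def faster_in_camel(camel_counts, cattle_counts):
    """True if the camel lineage has the higher dN/dS (no significance test)."""
    return dn_ds(*camel_counts) > dn_ds(*cattle_counts)


# Hypothetical counts: (nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites)
camel = (30, 700, 20, 300)
cattle = (10, 700, 20, 300)
print(round(dn_ds(*camel), 3), round(dn_ds(*cattle), 3))
print(faster_in_camel(camel, cattle))
```

A higher ratio alone does not distinguish adaptive selection from relaxed purifying selection, which is why the slide lists both as possible explanations.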
INSULIN SIGNALING PATHWAY
• Physiological experiments demonstrated that the high level of blood
glucose in camels may be caused by their strong capacity for insulin
resistance.
• Our research shows that a significantly large number of rapidly
evolving genes in camels are involved in the insulin signaling pathway,
which may change their sensitivity to insulin.
• Is there a unique CYP2-CYP4
metabolic module that helps camels
tolerate hyperglycemia at the
population level?
Copy number variation of P450 Family
between camel and other mammals
A total of 60 members of the P450 family were found in the camel genome
and were carefully annotated. A remarkable gene number variation between
camels and other mammals was found in the subfamilies CYP2J, CYP4A and
CYP4F: there were 11 copies of CYP2J in camels, more than in cattle (4)
and humans (1). In contrast, there was only one copy of CYP4A and two
copies of CYP4F in camels, fewer than in cattle (3 and 7, respectively)
and humans (2 and 6, respectively). Phylogenetic analysis of the CYP2 and
CYP4 families supported the expansion of CYP2J and the contraction of
CYP4A/F in the camel lineage.
Cytochrome P450 family
Family  Subfamily  Cattle  Horse  Human  Camel
CYP2    (all)      23      31     20     27
        CYP2E      1       1      1      2
        CYP2J      4       1      1      11
CYP4    (all)      13      15     12     7
        CYP4A      3       3      2      1
        CYP4F      7       7      6      2
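The copy numbers reported above for the three highlighted subfamilies can be compared across species with a short sketch. The classification rule (more or fewer copies than every other listed species) is a crude assumption made for illustration; the study itself relied on phylogenetic analysis to support expansion and contraction.

```python
SPECIES = ("Cattle", "Horse", "Human", "Camel")

# Copy numbers per subfamily, in the order given by SPECIES (from the table).
COPIES = {
    "CYP2J": (4, 1, 1, 11),
    "CYP4A": (3, 3, 2, 1),
    "CYP4F": (7, 7, 6, 2),
}


def classify(counts, focal="Camel"):
    """Label the focal species relative to all other listed species."""
    by_species = dict(zip(SPECIES, counts))
    focal_n = by_species.pop(focal)
    others = list(by_species.values())
    if focal_n > max(others):
        return "expanded"
    if focal_n < min(others):
        return "contracted"
    return "within range"


for subfamily, counts in COPIES.items():
    print(subfamily, classify(counts))
```

With these numbers the rule reproduces the pattern described on the slide: CYP2J expanded in camels, CYP4A and CYP4F contracted.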
EETs
• 19(S)-HETE was demonstrated to be a potent vasodilator of renal
preglomerular vessels that stimulates water re-absorption.
• The activity of CYP2J is regulated by a high-salt diet, and its suppression can lead
to high blood pressure. Camels are known to be able to take in a large amount
of salt, but they do not seem to develop hypertension, perhaps because they
have more copies of CYP2J genes.
Does the naturally adapted CYP2J-CYP4A duplex system enable
camels to avoid metabolic syndrome?
Zhen W. et al., Nature Communications, 2012
Rapidly evolving genes found in camel genome
[Summary figure. Labels: P450 gene family; CYP2/CYP4; 20-HETE/EETs; EETs promote tumor metastases and enhance angiogenesis in and around primary tumors; diabetes/high blood glucose; mutations; oncogenes (negative selection); tumor suppressor genes (positive selection); genes involved in insulin signaling pathways (negative selection)]
Thanks for your attention!