VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN HOANG SON
Detecting bad SNPs from Illumina BeadChips using Jeffreys distance
MASTER THESIS OF INFORMATION
TECHNOLOGY
Hanoi - 2012
Sector: Information Technology Major: Computer Science
Code : 60 48 01
Supervised by: Dr Le Sy Vinh
Table of Contents
1 Introduction
1.1 Biological Background
1.2 SNP Genotyping
1.3 Quality Control and Quality Assurance
2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall
2.3 Illuminus
2.4 GenoSNP
2.5 Discussion
3 Method
3.1 Kullback-Leibler divergence
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
3.2.2 The matched bound approximation
3.3 Estimate conflict degree between three callers
4 Experimental Results
4.1 Data description
4.2 Parameter estimation
4.3 Evaluation
Chapter 1
Introduction
very long strings over an alphabet of four letters (or bases) {A, C, G, T}, which stand for the four types of nucleotides. A complete genome contains nearly three billion characters in total.
in a DNA sequence. The positions where SNPs occur are called SNP sites. Note that there are usually only two possible nucleotide variants at any SNP site. The nucleotide variants at a SNP site are called alleles. We use 0 and 1 to denote the two possible alleles.
A pair of alleles, one from each chromosome copy of a diploid organism, at a SNP site is called a SNP genotype. Since there are two different alleles at a SNP site, a SNP genotype takes one of three types, denoted 00, 01 and 11 (equivalent to aa, aA and AA). 00 and 11 are called homozygous genotypes, while 01 is called a heterozygous genotype.
Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc.
The Illumina whole-genome BeadChip is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as input and captures genotypes across n SNP sites. After several chemical and pre-processing steps, it outputs a matrix $G = \{g_{ij},\ i = 1, \dots, m;\ j = 1, \dots, n\}$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities that indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ represent alleles 0 and 1 of SNP genotype $g_{ij}$ respectively.
BeadChips, namely GenCall (in the GenomeStudio software), Illuminus and GenoSNP.
Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is a critical phase that significantly affects the accuracy of the computational model. QC is the process of monitoring and controlling the quality of data as it is being generated, whereas QA reviews the quality of the product afterwards. This thesis concerns the QC procedure for the genotype calling problem.
Cleaning genotype data typically consists of two separate phases: filtering samples and filtering SNPs. We focus on the latter in this thesis. Low-quality SNPs are called bad SNPs and should be filtered out.
A typical QC process for genotyping considers three variables: the SNP's missing rate, Hardy-Weinberg equilibrium (HWE) and minor allele frequency (MAF).
• Missing rate or missing proportion (MSP) at a SNP indicates how many samples failed to be called at that SNP:
$$\mathrm{MSP} = \frac{\text{number of samples with no call}}{\text{total number of samples}}$$
A higher missing rate indicates poorer genotype calling performance.
• At a biallelic locus that is in Hardy-Weinberg equilibrium with minor allele frequency q, the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1 - q)^2$, $2q(1 - q)$ and $q^2$. A significant deviation from HWE typically implies gross genotyping error. A test probability, or p-value, is used to assess this deviation.
• MAF is the proportion of the minor (less frequent) allele among all alleles counted:
$$\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$$
A low MAF means that one of the three clusters contains few samples. As a consequence, SNPs with low MAF are more prone to error, since most clustering-based calling methods do not work well in such cases.
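The three QC variables above can be sketched for a single SNP as follows. This is a minimal illustration, not the thesis implementation: the encoding of calls as the strings "00", "01", "11" with `None` for a no call, and all function names, are assumptions made for the example. The HWE function only returns expected genotype counts; an actual test would compare them with the observed counts to obtain a p-value.

```python
def missing_rate(calls):
    """Fraction of samples with no call (None) at one SNP."""
    return sum(c is None for c in calls) / len(calls)


def allele_one_frequency(calls):
    """Frequency of allele 1 among all called alleles (two per sample)."""
    called = [c for c in calls if c is not None]
    if not called:
        raise ValueError("no called genotypes at this SNP")
    return sum(c.count("1") for c in called) / (2 * len(called))


def minor_allele_frequency(calls):
    """Frequency of the rarer of the two alleles."""
    p1 = allele_one_frequency(calls)
    return min(p1, 1.0 - p1)


def hwe_expected_counts(calls):
    """Expected genotype counts under Hardy-Weinberg equilibrium."""
    n = sum(c is not None for c in calls)
    p1 = allele_one_frequency(calls)
    p0 = 1.0 - p1
    return {"00": n * p0 ** 2, "01": n * 2 * p0 * p1, "11": n * p1 ** 2}


# Toy SNP: 12 samples, two of which could not be called.
calls = ["00"] * 6 + ["01"] * 3 + ["11"] * 1 + [None] * 2
print(missing_rate(calls), minor_allele_frequency(calls))
print(hwe_expected_counts(calls))
```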
Chapter 2
Related Work
Given a matrix $G = \{g_{ij},\ i = 1, \dots, m;\ j = 1, \dots, n\}$ with $g_{ij} = (x_{ij}, y_{ij})$ representing the genotype intensities of m samples at n SNP loci, the task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to a no-call cluster if this is impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.
The simplest method for the SNP genotyping problem compares the two allelic intensities $(x_{ij}, y_{ij})$. For example, for a data point $(x_{ij}, y_{ij})$:
• If $x_{ij} \gg y_{ij}$ then it should belong to genotype class 00 (homozygous for allele 0)
• If $x_{ij} \ll y_{ij}$ then it should belong to genotype class 11 (homozygous for allele 1)
• Or if $x_{ij} \approx y_{ij}$ then this SNP call should be 01 (heterozygous)
However, this naive method does not handle ambiguous cases well, and its performance is poor on very large amounts of data.
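The naive rule can be written in a few lines. The sketch below is illustrative only: the text does not say how large the gap between the two intensities must be, so the `ratio` threshold (and its default of 2) is a hypothetical parameter chosen for the example.

```python
def naive_call(x, y, ratio=2.0):
    """Call a genotype from the two allele intensities of one data point.

    `ratio` is a hypothetical threshold: one intensity must exceed the
    other by this factor to be called homozygous.
    """
    if x <= 0 and y <= 0:
        return None          # no signal at all: no call
    if x > ratio * y:
        return "00"          # x much larger than y: homozygous for allele 0
    if y > ratio * x:
        return "11"          # y much larger than x: homozygous for allele 1
    return "01"              # comparable intensities: heterozygous


print(naive_call(10.0, 1.0), naive_call(1.0, 10.0), naive_call(5.0, 6.0))
```

Points close to the threshold are exactly the ambiguous cases mentioned above: a small change in either intensity flips the call.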
GenCall is the canonical caller, developed for Illumina microarrays from the very beginning. This method uses neural networks and heuristics to divide samples into three clusters. A GenCall score is generated as a confidence measure for each call. In general, this score represents the quality of genotyping and may vary between loci or chromosomes; however, any call with a score below 0.2 is conventionally considered a failure due to lack of confidence (and is assigned as no call).
Illuminus is one of the best callers for Illumina's 2.5M BeadChip. This method uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). Posterior probabilities are calculated to classify the samples into the appropriate genotype states. Perturbation analysis is then applied to ensure the stability of the calling results. This process calls each SNP twice, on the original data $(x_{ij}, y_{ij})$ and on the perturbed data $(x_{ij} + \epsilon, y_{ij} + \epsilon)$, and the concordance between the two results is used to determine whether the SNP is valid.
GenoSNP also uses the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it fits the intensity data to the model differently. Instead of clustering the probe intensities across individuals at each SNP, as the two methods above do, GenoSNP builds the model within a single individual, based on the log scale of the normalized intensities (log11(x + 1), log11(y + 1)). This approach overcomes a problem of the other methods, whose accuracy depends heavily on the number of control samples, which in reality cannot always be afforded. Moreover, this method is well suited to studies where data are typed on different chips. After the calling algorithm has run, a measure of confidence is also available for each call: the posterior probability that the call comes from the class assigned. This confidence measure is also used to filter out poor-quality SNPs or samples.
All of these callers generally work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, can be expected to improve the performance of genotype calling. This idea motivated our work: by considering the results of the three callers and cross-referencing them, we can find the problematic SNPs that adversely affect the calling process.
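One simple way to cross-reference three callers is sketched below. It assumes a unanimous-agreement rule, which is only one possible definition of consensus calling and is not stated in the text; the function names and the encoding of a no call as `None` are likewise assumptions for the example.

```python
def consensus_call(calls):
    """Return the genotype if all callers made the same call, else None."""
    if None not in calls and len(set(calls)) == 1:
        return calls[0]
    return None


def discordance_rate(snp_calls):
    """Fraction of samples at one SNP on which the callers do not all agree.

    `snp_calls` holds one (illuminus, genosnp, gencall) tuple per sample.
    """
    disagreements = sum(consensus_call(c) is None for c in snp_calls)
    return disagreements / len(snp_calls)


snp_calls = [
    ("00", "00", "00"),
    ("00", "01", "00"),   # callers disagree
    ("11", "11", "11"),
    ("01", None, "01"),   # one caller made no call
]
print(discordance_rate(snp_calls))
```

The method developed in the next chapter goes further than counting disagreements: it compares the fitted clusters themselves.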
Chapter 3
Method
For bad SNPs, the three genotype classes clustered by Illuminus, GenoSNP and GenCall are not stable: they differ from one caller to another. By quantifying these differences across the whole dataset with an information-theoretic distance, we can identify the SNPs with the largest conflicts and mark them as bad SNPs.
The Kullback-Leibler divergence, or relative entropy, $D(P \| Q)$ between two distributions $P$ and $Q$ with probability mass functions $p(x)$ and $q(x)$ respectively is computed as follows:
$$D(P \| Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \tag{3.1}$$
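For two γ-dimensional normal distributions $f(\mu_f, \Sigma_f)$ and $g(\mu_g, \Sigma_g)$, we have:
$$D(f \| g) = \frac{1}{2} \left( \log \frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}[\Sigma_g^{-1} \Sigma_f] - \gamma + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) \right) \tag{3.2}$$
in which γ = 2 for our genotyping problem. No such closed-form expression exists for the KL divergence between the Student distributions of interest, hence an approximation is required.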
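For the bivariate case (γ = 2) the closed form for two normal distributions can be evaluated without any linear algebra library. The sketch below is a minimal illustration under that assumption; the function name `kl_gauss2` and the representation of covariance matrices as nested 2x2 tuples are choices made for the example.

```python
import math


def kl_gauss2(mu_f, cov_f, mu_g, cov_g):
    """D(f || g) for two bivariate normals; covariances are 2x2 nested tuples."""
    (p, q), (r, s) = cov_f
    (a, b), (c, d) = cov_g
    det_f = p * s - q * r
    det_g = a * d - b * c
    # Inverse of the 2x2 matrix cov_g.
    inv = ((d / det_g, -b / det_g), (-c / det_g, a / det_g))
    # Trace of inv(cov_g) @ cov_f.
    trace = inv[0][0] * p + inv[0][1] * r + inv[1][0] * q + inv[1][1] * s
    # Quadratic term (mu_f - mu_g)^T inv(cov_g) (mu_f - mu_g).
    dx, dy = mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]
    maha = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return 0.5 * (math.log(det_g / det_f) + trace - 2.0 + maha)


identity = ((1.0, 0.0), (0.0, 1.0))
# Same covariance, means one unit apart: only the quadratic term remains.
print(kl_gauss2((0.0, 0.0), identity, (1.0, 0.0), identity))
```

Note that the divergence is not symmetric: swapping the two distributions generally changes the value when the covariances differ.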
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
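We have:
$$S(x; \mu, \Sigma, \upsilon) = \int_0^{+\infty} N(x; \mu, \Sigma/u)\, G\!\left(u; \tfrac{\upsilon}{2}, \tfrac{\upsilon}{2}\right) du \tag{3.3}$$
where $N(x; \mu, \Sigma/u)$ is a normal distribution of x with mean $\mu$ and covariance matrix $\Sigma/u$, while $G(u; \upsilon/2, \upsilon/2)$ is a Gamma distribution of u with the given shape and rate parameters respectively.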
By simplifying the above equation, a Student distribution can be approximated by a finite mixture of P normal distributions:
$$S(x; \mu, \Sigma, \upsilon) \approx \frac{1}{P} \sum_{i=1}^{P} N(x; \mu, \Sigma/u_i) \tag{3.4}$$
where $\{u_i\}_{i=1}^{P}$ are randomly drawn from the Gamma distribution $G(u; \upsilon/2, \upsilon/2)$. The value of P significantly affects the accuracy of the approximation and should be chosen according to the degrees of freedom $\upsilon$.
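The quality of the finite mixture can be checked numerically. The sketch below does so in one dimension with µ = 0 and Σ = 1, a simplification of the bivariate case used in the thesis, and with a large P chosen only to make the comparison stable. All names are illustrative. Python's `random.gammavariate` takes a shape and a scale, so a rate of υ/2 is passed as a scale of 2/υ.

```python
import math
import random


def student_pdf(x, nu):
    """Exact density of a standard univariate Student distribution."""
    coef = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coef * (1 + x * x / nu) ** (-(nu + 1) / 2)


def draw_scales(nu, P, rng):
    """Draw u_1..u_P from Gamma(shape=nu/2, rate=nu/2)."""
    return [rng.gammavariate(nu / 2, 2 / nu) for _ in range(P)]


def mixture_pdf(x, scales):
    """Equal-weight mixture of N(0, 1/u_i) densities evaluated at x."""
    total = 0.0
    for u in scales:
        total += math.sqrt(u / (2 * math.pi)) * math.exp(-0.5 * u * x * x)
    return total / len(scales)


rng = random.Random(0)
scales = draw_scales(nu=4, P=20000, rng=rng)
for x in (0.0, 0.5, 2.0):
    print(x, student_pdf(x, 4), mixture_pdf(x, scales))
```

With a small P the mixture depends strongly on the particular draws, which is one reason to average over several Monte Carlo repetitions.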
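We use a method of Goldberger to approximate $D(f \| g)$, where $f_i$ and $g_k$ denote the normal components of the mixtures approximating f and g.
• A matching function is defined as
$$m : \{1, \dots, P\} \to \{1, \dots, P\}, \qquad j = m(i) = \arg\min_k D(f_i \| g_k)$$
• Goldberger's approximate formula:
$$D(f \| g) \approx D_{\mathrm{goldberger}}(f \| g) = \frac{1}{P} \sum_{i=1}^{P} D(f_i \| g_{m(i)}) \tag{3.5}$$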
Thanks to Equation 3.2, we can easily compute every term on the right-hand side of Equation 3.5, because for all i, j ∈ {1, ..., P}, both fi and gj are normally distributed. We thus have a way to approximate the relative entropy between two Student distributions.
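The matching step and the averaged sum can be sketched as below. To keep the example short it uses one-dimensional normal components, each given as a (mean, variance) pair, instead of the bivariate components of the thesis; the equal component weights 1/P follow the mixture above. The function names are illustrative.

```python
import math


def kl_normal_1d(f, g):
    """D(f || g) for two univariate normals given as (mean, variance) pairs."""
    (mu_f, var_f), (mu_g, var_g) = f, g
    return 0.5 * (math.log(var_g / var_f) + var_f / var_g - 1.0
                  + (mu_f - mu_g) ** 2 / var_g)


def matched_bound(f_components, g_components, kl=kl_normal_1d):
    """Equal-weight matched approximation of D(f || g) between two mixtures.

    Each component f_i is matched to the component g_k closest to it in
    KL divergence, and the matched divergences are averaged.
    """
    total = 0.0
    for f_i in f_components:
        total += min(kl(f_i, g_k) for g_k in g_components)
    return total / len(f_components)


f_mix = [(0.0, 1.0), (0.0, 2.0)]
g_mix = [(1.0, 1.0), (1.0, 2.0)]
print(matched_bound(f_mix, g_mix))
```

Passing a bivariate KL function as `kl` gives the form needed for the genotype clusters.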
Each caller clusters the samples into three genotype states encoded as 00, 01 and 11 (excluding the no-call cluster) using a different statistical model. Thus we have nine clusters: i00, i01, i11 for Illuminus, g00, g01, g11 for GenoSNP and a00, a01, a11 for GenCall.
After that, the difference between the clustering results of each pair of callers is calculated as described above, yielding three values (one per pair of callers). We use the Jeffreys distance because it is symmetric:
$$jd(P, Q) = \frac{1}{2} \left( D(P \| Q) + D(Q \| P) \right)$$
Pseudo-code for estimating the Jeffreys distance between two distributions is given in Algorithm 1. Let ig, ia and ag denote the differences between the results of Illuminus vs GenoSNP, Illuminus vs GenCall and GenCall vs GenoSNP respectively, given as:
ig ← min{jd(i00, g00), jd(i01, g01), jd(i11, g11)}
ia ← min{jd(i00, a00), jd(i01, a01), jd(i11, a11)}
ag ← min{jd(a00, g00), jd(a01, g01), jd(a11, g11)} (3.6)
By applying an appropriate metric to these three values (the minimum, maximum or average of {ia, ig, ag}), we obtain the degree of difference between the three callers' results at a SNP locus. The choice of metric function is discussed later. Algorithm 2 then assigns bad SNPs.
Algorithm 1 Estimate Jeffreys distance between two Student distributions f and g
Require:
• f(mf, Σf) and g(mg, Σg)
• number of normal components P, number of Monte Carlo iterations MC, common degree of freedom df
• ran_gamma(shape, scale): the random generator function for the Gamma distribution
• D(f, g): function returning the KL divergence between two Gaussian distributions, in that order, based on Equation 3.2
Function: jd(f, g)
    jd ← 0
    for count = 1 → MC do
        for i = 1 → P do
            u[i] ← ran_gamma(df/2, 2/df)
        end for
        jd_goldberger ← 0
        for j = 1 → P do
            map_fg ← argmin over m ∈ (1, ..., P) of D((mf, Σf/u[j]), (mg, Σg/u[m]))
            map_gf ← argmin over m ∈ (1, ..., P) of D((mg, Σg/u[j]), (mf, Σf/u[m]))
            jd_goldberger ← jd_goldberger + D((mf, Σf/u[j]), (mg, Σg/u[map_fg])) + D((mg, Σg/u[j]), (mf, Σf/u[map_gf]))
        end for
        jd ← jd + jd_goldberger / (2P)
    end for
    return jd / MC
EndFunction
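A runnable sketch of Algorithm 1 for the bivariate case follows. It is an interpretation rather than the original program: the normalisation (dividing by 2P inside the loop and by the number of Monte Carlo iterations at the end) is an assumption chosen to agree with the Jeffreys distance and matched-bound formulas, and the names `kl_gauss2`, `scale_cov` and `jeffreys_distance` are illustrative. As in the pseudo-code, the same Gamma draws are used for both distributions.

```python
import math
import random


def kl_gauss2(mu_f, cov_f, mu_g, cov_g):
    """D(f || g) for two bivariate normals (Equation 3.2 with gamma = 2)."""
    (p, q), (r, s) = cov_f
    (a, b), (c, d) = cov_g
    det_f, det_g = p * s - q * r, a * d - b * c
    inv = ((d / det_g, -b / det_g), (-c / det_g, a / det_g))
    trace = inv[0][0] * p + inv[0][1] * r + inv[1][0] * q + inv[1][1] * s
    dx, dy = mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]
    maha = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return 0.5 * (math.log(det_g / det_f) + trace - 2.0 + maha)


def scale_cov(cov, u):
    """Return the covariance matrix cov / u."""
    return tuple(tuple(v / u for v in row) for row in cov)


def jeffreys_distance(mu_f, cov_f, mu_g, cov_g, df=4, P=10, mc=20, seed=0):
    """Monte Carlo estimate of the Jeffreys distance between two Student
    distributions, each approximated by P normal components."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(mc):
        # random.gammavariate takes (shape, scale).
        u = [rng.gammavariate(df / 2, 2 / df) for _ in range(P)]
        f_covs = [scale_cov(cov_f, ui) for ui in u]
        g_covs = [scale_cov(cov_g, ui) for ui in u]
        # Matched divergences in both directions.
        fg = sum(min(kl_gauss2(mu_f, cf, mu_g, cg) for cg in g_covs) for cf in f_covs)
        gf = sum(min(kl_gauss2(mu_g, cg, mu_f, cf) for cf in f_covs) for cg in g_covs)
        total += (fg + gf) / (2 * P)
    return total / mc


identity = ((1.0, 0.0), (0.0, 1.0))
print(jeffreys_distance((0.0, 0.0), identity, (3.0, 3.0), identity))
```

The defaults (df = 4, P = 10, 20 iterations) mirror the parameter values reported later in the text.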
Algorithm 2 Finding bad SNPs
for all SNPs do
Get the clusters (i00, i01, i11); (g00, g01, g11); (a00, a01, a11)
for all Clusters do
Calculate sample mean and sample covariance matrix for each t-distributed cluster
end for
Calculate (ig, ia, ag) by Equation 3.6
conflict ← metric(ig, ia, ag)
if conflict > thres then
Assign this as bad SNP
end if
end for
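Algorithm 2 can be sketched as follows for a single SNP. The distance function `jd` is passed in as a parameter so that the example stays self-contained; any implementation of the Jeffreys distance taking two (mean, covariance) pairs can be supplied. The function names, the dictionary layout of the clusters and the default threshold of 0.1 (the value reported in the experiments) are assumptions of this sketch.

```python
GENOTYPES = ("00", "01", "11")


def mean_and_cov(points):
    """Sample mean and sample covariance matrix of 2-D intensity points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def pair_score(clusters_a, clusters_b, jd):
    """Equation 3.6: minimum distance over the three genotype clusters."""
    return min(jd(clusters_a[g], clusters_b[g]) for g in GENOTYPES)


def is_bad_snp(illuminus, genosnp, gencall, jd, metric=min, thres=0.1):
    """Flag a SNP whose conflict score exceeds the threshold.

    Each caller argument maps a genotype label to a (mean, covariance) pair.
    """
    ig = pair_score(illuminus, genosnp, jd)
    ia = pair_score(illuminus, gencall, jd)
    ag = pair_score(gencall, genosnp, jd)
    conflict = metric((ig, ia, ag))
    return conflict > thres
```

With `metric=min`, a SNP is flagged only when every pair of callers disagrees on all three clusters; `max` or `statistics.mean` can be passed instead to try the other two candidates.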
The SNP calls of the three callers for 4473 Kenyan individuals at 28410 SNP loci are combined and used as input data in VCF format. Each call carries the confidence scores of the callers: the GenCall score and, for each of the other two callers, a set of three probabilities representing how likely the call is to belong to each of the three genotype clusters. The recommended cut-off for GenoSNP and Illuminus is 0.95 in both cases, while it is 0.2 for GenCall. Every call whose confidence is below the cut-off is considered a no call.
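Applying the cut-offs can be sketched as below. The cut-off values come from the text; the dictionary keys, function names and the representation of posterior probabilities as a dictionary are assumptions made for the example, and no VCF parsing is shown.

```python
# Recommended cut-offs quoted in the text.
CUTOFFS = {"gencall": 0.2, "genosnp": 0.95, "illuminus": 0.95}


def apply_cutoff(caller, genotype, confidence):
    """Keep the genotype unless its confidence is below the caller's cut-off."""
    return genotype if confidence >= CUTOFFS[caller] else None


def call_from_posteriors(caller, posteriors):
    """Pick the most probable genotype from three posterior probabilities,
    then apply the caller's cut-off."""
    genotype = max(posteriors, key=posteriors.get)
    return apply_cutoff(caller, genotype, posteriors[genotype])


print(apply_cutoff("gencall", "01", 0.15))
print(call_from_posteriors("illuminus", {"00": 0.97, "01": 0.02, "11": 0.01}))
```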
This section works out the most suitable parameters and metric for our program. The degrees-of-freedom parameter υ is fixed to 4. Ten normal components per t-distribution are sufficient for accuracy, and a Monte Carlo loop of 20 iterations is executed.
For the metric option, we test three candidates: the minimum, maximum and average of the three values. Across various experiments, the minimum proved to be the best choice of metric function.
For the reasons discussed above and the histogram ??, we conclude that a SNP whose minimum conflict score among the three is greater than 0.1 will be filtered out. In total, 2360 out of 28410 SNPs (∼ 8.3%) are removed by this protocol.
Our results are compared with a baseline of several traditional QC criteria, namely missing rate, minor allele frequency and Hardy-Weinberg equilibrium.
Table 3.1: Result of missing rate filter (callers vs. missing rate thresholds)
Table 3.2: Result of MAF filter (callers vs. MAF thresholds)
Table 3.3: Result of HWE exact test filter (callers vs. HWE thresholds)
Table 3.4: Result of synthesis criteria (number of SNPs per caller)
Tables 3.1, 3.2 and 3.3 illustrate the different criteria used to help experts evaluate SNP quality. Each cell of the tables includes two numbers: the first shows how many SNPs break the corresponding threshold and need to be reconsidered manually; the second is the number of those SNPs that are also filtered by our protocol. For instance, in Table 3.1, with the missing call rate threshold set to 2%, 14761 SNP calling results are removed for Illuminus. Among them, 1943 SNPs are also filtered out by our program. This means that our protocol helps reduce the number of SNPs that have to be checked in this case by 1943. Besides, we also detect 417 other potentially bad SNPs that are missed by the missing rate criterion alone. Visual checks show that these SNPs are indeed problematic.
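The two numbers in each table cell, and the count of SNPs found only by the protocol, are plain set operations. The sketch below uses made-up SNP identifiers and a hypothetical function name purely to show the bookkeeping.

```python
def compare_filters(baseline_flagged, protocol_flagged):
    """Return (flagged by baseline, flagged by both, flagged by protocol only)."""
    baseline = set(baseline_flagged)
    protocol = set(protocol_flagged)
    return len(baseline), len(baseline & protocol), len(protocol - baseline)


# Toy identifiers, not real data.
print(compare_filters({"rs1", "rs2", "rs3", "rs4"}, {"rs3", "rs4", "rs5"}))
```

In the example from the text, the second and third numbers are 1943 and 417, which together account for the 2360 SNPs removed by the protocol.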
Similar observations hold for the other two tables. However, the numbers in these tables increase with the threshold because lower bounds are used. In all three tables, Illuminus is the caller with the highest call rate in this case since it