" 12 SNP Genotyping 1.3 Quality Control and Qualisy 4 Assurance 2 Related Work 2.1 Naive method for SNI' genotyping 22 GenCall 24 GenoSÑP.... processing steps, it outputs a matrix G = {
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN HOANG SON

DETECTING BAD SNPS FROM ILLUMINA BEADCHIPS USING JEFFREYS DISTANCE

MASTER THESIS OF INFORMATION TECHNOLOGY

Sector: Information Technology
Code: 60 48 01
Supervised by: Dr Le Sy Vinh

Hanoi - 2012
Table of Contents
Overview
1 Introduction
1.1 Biological Background
1.2 SNP Genotyping
1.3 Quality Control and Quality Assurance
2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall
2.3 Illuminus
2.4 GenoSNP
2.5 Discussion
3 Method
3.1 Kullback-Leibler divergence
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
3.2.2 The matched bound approximation
3.3 Estimate conflict degree between three callers
4 Experimental Results
4.1 Data description
4.2 Parameter estimation
4.3 Evaluation
Conclusion
Chapter 1
Introduction
Human Genome. The human genome is encoded and represented as a database of very long strings over a four-letter alphabet of bases {A, C, G, T}, which stand for the four types of nucleotides. A complete genome contains nearly three billion characters in total.
Single Nucleotide Polymorphisms. A SNP is defined as a single base change in a DNA sequence. The positions where SNPs occur are called SNP sites. Note that there are usually only two possible nucleotide variants at any SNP site. The nucleotide variants at SNP sites are called alleles. We use 0 and 1 to denote the two possible alleles.
A pair of alleles, one from each chromosome copy in a diploid organism, at a SNP site is called a SNP genotype. Since there are two different alleles at a SNP site, there are three different types of SNP genotype, denoted 00, 01 and 11 (equivalent to aa, aA and AA). 00 and 11 are called homozygous genotypes, while 01 is called the heterozygous genotype.
Illumina BeadChips. To date, there are various types of microarrays, such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc. The Illumina whole-genome BeadChip is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as the input and captures genotypes across n SNP sites. After several chemical and pre-processing steps, it outputs a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities which indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ represent alleles 0 and 1 of SNP genotype $g_{ij}$, respectively.
Methods. We only consider the three most popular methods for Illumina BeadChips, namely GenCall (in the GenomeStudio software), Illuminus and GenoSNP.
1.2 Quality Control and Quality Assurance
Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is considered a critical phase that has a significant effect on the accuracy of the computational model. QC is defined as a process of monitoring and controlling the quality of data as it is being generated, whereas QA is used to review the product quality after that. The content of this thesis is related to the QC procedure of the genotype calling problem.
Cleaning genotype data frequently consists of two separate phases: sample filtering and SNP filtering. We focus only on the latter in this thesis. Low-quality SNPs are called bad SNPs and should be filtered out.
In a typical QC process for the genotyping problem, three variables are considered: the SNP's missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency).
• Missing rate or missing proportion (MSP) at a SNP indicates how many samples failed to be called at this SNP:
$$\mathrm{MSP} = \frac{\text{number of no calls}}{\text{total number of samples}}$$
A higher missing rate indicates poorer genotype calling performance.
• In a biallelic locus which is in Hardy-Weinberg equilibrium and whose minor allele frequency is $q$, the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1 - q)^2$, $2q(1 - q)$ and $q^2$. A significant deviation from HWE typically implies gross genotyping error. A probability test value, or p-value, is used to estimate this difference.
• MAF is the ratio of minor (less frequent) alleles counted in the whole set of alleles:
$$\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$$
A low MAF means that one of the three clusters contains few samples. As a consequence, SNPs with low MAF are more prone to error, since most calling methods that are based on clustering do not work well in these cases.
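The three QC variables above can be illustrated with a minimal sketch. It assumes genotype calls are stored as the strings "00", "01", "11" with `None` for a no call, and it uses a chi-square test with one degree of freedom for HWE; the function names and the choice of test are assumptions made for illustration, not the procedure used in this thesis.

```python
import math

def missing_rate(calls):
    """Fraction of samples with no call (None) at one SNP."""
    return sum(c is None for c in calls) / len(calls)

def minor_allele_frequency(calls):
    """MAF over called genotypes; each genotype carries two alleles."""
    called = [c for c in calls if c is not None]
    ones = sum(c.count("1") for c in called)
    freq_one = ones / (2 * len(called))
    return min(freq_one, 1.0 - freq_one)

def hwe_p_value(calls):
    """Chi-square (1 degree of freedom) test of Hardy-Weinberg proportions."""
    called = [c for c in calls if c is not None]
    n = len(called)
    q = sum(c.count("1") for c in called) / (2 * n)  # frequency of allele 1
    expected = {
        "00": n * (1 - q) ** 2,
        "01": n * 2 * q * (1 - q),
        "11": n * q ** 2,
    }
    chi2 = sum((called.count(g) - e) ** 2 / e
               for g, e in expected.items() if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical SNP typed on 100 samples, two of which could not be called.
calls = ["00"] * 50 + ["01"] * 40 + ["11"] * 8 + [None] * 2
print(missing_rate(calls), minor_allele_frequency(calls), hwe_p_value(calls))
```

In this made-up example the called genotypes follow Hardy-Weinberg proportions almost exactly, so the p-value is close to 1; a SNP with only homozygous calls at intermediate allele frequency would instead give a p-value near 0.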
Chapter 2
Related Work
Given a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, with $g_{ij} = (x_{ij}, y_{ij})$ representing the genotype intensities of m samples at n SNP loci, the task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to the no-call cluster if impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.
2.1 Naive method for SNP genotyping
The simplest method for the SNP genotyping problem uses the relation between the two allelic intensities $(x_{ij}, y_{ij})$. For example, with a data point $(x_{ij}, y_{ij})$:
• If $x_{ij} \gg y_{ij}$ then this SNP genotype could be labelled as 00 (homozygous for allele 0).
• If $x_{ij} \ll y_{ij}$ then it should belong to genotype class 11 (homozygous for allele 1).
• Or if $x_{ij} \approx y_{ij}$ then this SNP call should be 01 (heterozygous).
However, this naive method does not work well on ambiguous cases, and its performance is poor on the tremendous amount of data involved.
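The three rules above can be sketched as follows. The cut-off `ratio`, which decides when one intensity counts as "much greater" than the other, is a hypothetical parameter introduced only for this illustration; the text does not specify one.

```python
def naive_call(x, y, ratio=2.0):
    """Label one intensity pair by comparing the two allelic intensities.

    `ratio` is a hypothetical cut-off for "much greater than".
    """
    if x <= 0 and y <= 0:
        return None  # no signal on either allele: no call
    if x >= ratio * y:
        return "00"  # allele 0 dominates
    if y >= ratio * x:
        return "11"  # allele 1 dominates
    return "01"      # comparable intensities: heterozygous

# Hypothetical intensity pairs.
points = [(1.8, 0.1), (0.9, 1.0), (0.05, 1.6), (0.0, 0.0)]
print([naive_call(x, y) for x, y in points])
```

A fixed cut-off like this is exactly what makes the method fragile: points near the boundary flip labels with small intensity changes, which is the ambiguity the clustering-based callers below try to handle.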
2.2 GenCall
GenCall is the canonical caller developed for Illumina microarrays from the very beginning. This method uses neural networks and heuristic methods to divide samples into three clusters. A GenCall score is generated as a degree of confidence for each call. In general, this score represents the quality of genotyping and may vary between different loci or chromosomes; however, any call with a score less than 0.2 is generally considered a failure due to the lack of confidence (and is assigned as a no call in this case).
2.3 Illuminus
Illuminus is one of the best callers for the 2.5M BeadChip of Illumina. This method uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). The posterior probabilities are calculated to classify the samples into the appropriate genotype states. Perturbation analysis is then applied to ensure the stability of the calling results. This process makes calls for each SNP twice, on the original data $(x_{ij}, y_{ij})$ and on the perturbed data $(x_{ij} + \epsilon, y_{ij} + \epsilon)$ respectively; the concordance between the two results is then estimated to determine whether this SNP is valid or not.
2.4 GenoSNP
GenoSNP also utilizes the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it uses a different method to fit the intensity data to the model. Instead of clustering the probe intensities across individuals at each SNP, as in the two methods stated above, GenoSNP develops the model within a single individual based on the log scale of the normalized intensities, $\log_2(x + 1)$ and $\log_2(y + 1)$. This novel approach is able to overcome a problem of the other methods, namely that their accuracy depends highly on the number of control samples, which sometimes cannot be afforded in practice. Moreover, this method is well suited to studies where data is typed by different chips. After running the calling algorithm, there is also a measure of confidence for each call, which is the posterior probability that the call comes from the assigned class. This degree of confidence is also used to filter out poorer SNPs or samples.
2.5 Comment
All of these callers generally work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, is expected to increase the performance of genotype calling. This very interesting idea has become the motivation for our work. That is, by considering the results of the three callers and cross-referencing them, we could find the problematic SNPs that have a bad effect on the calling process.
Chapter 3
Method
The three genotype classes clustered by Illuminus, GenoSNP and GenCall for bad SNPs are not stable: they differ when compared to each other. By estimating these differences over the whole dataset using an information-theoretic distance, we could determine the SNPs with the largest conflicts and assign them as bad SNPs.
3.1 Kullback-Leibler divergence
The Kullback-Leibler divergence or relative entropy $D(P\|Q)$ between two distributions with probability mass functions $f(x)$ and $g(x)$ respectively, could be computed as follows:
$$D(f\|g) = \sum_{x \in \mathcal{X}} f(x) \log \frac{f(x)}{g(x)} \qquad (3.1)$$
For two d-dimensional normal distributions $f(\mu_f, \Sigma_f)$ and $g(\mu_g, \Sigma_g)$, we have:
$$D(f\|g) = \frac{1}{2}\left[ \log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}\left(\Sigma_g^{-1}\Sigma_f\right) + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) - d \right] \qquad (3.2)$$
in which $d = 2$ for our genotyping problem. No such closed-form expression exists for the KL divergence between the distributions of interest (Student distributions), hence we resort to an approximation.
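As a worked illustration of Equation 3.2, consider two bivariate normal distributions with made-up parameters (they are not taken from the thesis data) and natural logarithms. Evaluating the divergence in both directions shows that it is not symmetric, which is why a symmetrized distance is used later in this chapter.

```latex
% Assumed example: \mu_f = (0, 0)^T, \Sigma_f = I_2, \mu_g = (1, 0)^T, \Sigma_g = 2 I_2, d = 2.
\begin{align*}
\log\frac{|\Sigma_g|}{|\Sigma_f|} &= \log\frac{4}{1} \approx 1.386, &
\mathrm{Tr}\left(\Sigma_g^{-1}\Sigma_f\right) &= \mathrm{Tr}\left(\tfrac{1}{2} I_2\right) = 1, &
(\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g) &= \tfrac{1}{2}, \\
D(f\|g) &= \tfrac{1}{2}\left(1.386 + 1 + 0.5 - 2\right) \approx 0.443, \\
D(g\|f) &= \tfrac{1}{2}\left(-1.386 + 4 + 1 - 2\right) \approx 0.807 .
\end{align*}
```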
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
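We have:
$$S(x; \mu, \Sigma, \nu) = \int_0^{\infty} \mathcal{N}(x; \mu, \Sigma/u)\, \mathcal{G}(u; \nu/2, \nu/2)\, du \qquad (3.3)$$
where $\mathcal{N}(x; \mu, \Sigma/u)$ is a Normal distribution of $x$ with mean $\mu$ and covariance matrix $\Sigma/u$, while $\mathcal{G}(u; \nu/2, \nu/2)$ is a Gamma distribution of $u$ with two given parameters, shape and rate respectively.
By simplifying the above equation, a Student distribution could be approximated by a finite mixture of $P$ Normals:
$$S(x; \mu, \Sigma, \nu) \approx \frac{1}{P} \sum_{i=1}^{P} \mathcal{N}(x; \mu, \Sigma/u_i)$$
where $\{u_i\}_{i=1}^{P}$ are randomly drawn from the Gamma distribution $\mathcal{G}(u; \nu/2, \nu/2)$. The value of $P$ significantly affects the accuracy of the approximation and should be chosen according to the degrees of freedom $\nu$.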
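The construction of the finite mixture can be sketched in a few lines. The function name and the example parameters are assumptions for illustration only. Note that Python's `random.gammavariate` is parameterized by shape and scale, so a rate of ν/2 corresponds to a scale of 2/ν.

```python
import random

def student_as_normal_mixture(mean, cov, dof, n_components, rng):
    """Return equally weighted Normal components (mean, cov / u_i)
    with u_i ~ Gamma(shape=dof/2, rate=dof/2)."""
    components = []
    for _ in range(n_components):
        # random.gammavariate takes (shape, scale); scale = 1 / rate.
        u = rng.gammavariate(dof / 2.0, 2.0 / dof)
        scaled_cov = [[entry / u for entry in row] for row in cov]
        components.append((list(mean), scaled_cov))
    return components

# Hypothetical cluster parameters for one genotype class.
rng = random.Random(0)
mixture = student_as_normal_mixture(
    mean=[0.2, 1.1],
    cov=[[0.04, 0.0], [0.0, 0.09]],
    dof=5,
    n_components=1000,
    rng=rng,
)
print(len(mixture), mixture[0])
```

Every component shares the same mean and differs only in how much its covariance is inflated or shrunk. Small values of u produce the wide components that account for the heavy tails of the Student distribution.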
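3.2.2 The matched bound approximation
We use a method of Goldberger to approximately estimate $D(f\|g)$.
• A matching function is defined as
$$m : \{1, \dots, P\} \to \{1, \dots, P\}$$
that maps the $i$-th component of $f$ to the $j$-th component of $g$ such that
$$j = m(i) = \arg\min_{j'} D(f_i \| g_{j'})$$
• Goldberger's approximate formula, for two mixtures with equally weighted components:
$$D(f\|g) \approx D_{\mathrm{Goldberger}}(f\|g) = \frac{1}{P} \sum_{i=1}^{P} D(f_i \| g_{m(i)}) \qquad (3.5)$$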
Thanks to Equation 3.2, we could easily compute each of the operands on the right-hand side of Equation 3.5, because $\forall i, j \in \{1, \dots, P\}$, both $f_i$ and $g_j$ here are normally distributed. Thus we now have a solution to the problem of approximating the relative entropy between two Student distributions.
3.3 Estimate conflict degree between three callers
Each caller clusters the samples into three genotype states encoded as 00, 01, 11 (excluding the no-call cluster) using a different statistical model. Thus we have nine clusters as follows: $i_{00}, i_{01}, i_{11}$ for Illuminus, $g_{00}, g_{01}, g_{11}$ for GenoSNP and $a_{00}, a_{01}, a_{11}$ for GenCall.
After that, the difference in the clustering results for each pair of callers is calculated based on the work above; three figures are generated (corresponding to the three pairs of callers). In fact, we use the Jeffreys distance thanks to its symmetry:
$$\mathrm{jd}(P, Q) = \frac{1}{2}\left( D(P\|Q) + D(Q\|P) \right)$$
The pseudo-code illustrating the algorithm to estimate the Jeffreys distance between two distributions is given in Algorithm 1. Assume that $ig$, $ia$, $ag$ denote the differences between the results of Illuminus vs GenoSNP, Illuminus vs GenCall and GenCall vs GenoSNP respectively, given as:
$$ig \leftarrow \min\{\mathrm{jd}(i_{00}, g_{00}), \mathrm{jd}(i_{01}, g_{01}), \mathrm{jd}(i_{11}, g_{11})\}$$
$$ia \leftarrow \min\{\mathrm{jd}(i_{00}, a_{00}), \mathrm{jd}(i_{01}, a_{01}), \mathrm{jd}(i_{11}, a_{11})\}$$
$$ag \leftarrow \min\{\mathrm{jd}(a_{00}, g_{00}), \mathrm{jd}(a_{01}, g_{01}), \mathrm{jd}(a_{11}, g_{11})\} \qquad (3.6)$$
By using an appropriate metric on these three (among the minimum, maximum or average of $\{ig, ia, ag\}$), we ultimately obtain the degree of difference between all three callers' results at a SNP locus. The choice of metric function will be discussed later.
Algorithm 2 assigns bad SNPs.
Algorithm 1 Estimate Jeffreys distance between two Student distributions f and g
Require:
• f(m_f, Σ_f) and g(m_g, Σ_g)
• number of Normal components P, number of Monte Carlo iterations MC, common degrees of freedom df
• ran_gamma(shape, scale): the random generating function for the Gamma distribution
• D(f, g): function returning the KL divergence between two Gaussian distributions, in that order, based on Equation 3.2
Function: jd(f, g)
    jd ← 0
    for count = 1 → MC do
        for i = 1 → P do
            u[i] ← ran_gamma(df/2, 2/df)
        end for
        jd_goldberger ← 0
        for j = 1 → P do
            map_fg ← argmin_k D((m_f, Σ_f/u[j]), (m_g, Σ_g/u[k]))
            map_gf ← argmin_k D((m_g, Σ_g/u[j]), (m_f, Σ_f/u[k]))
            jd_goldberger ← jd_goldberger + D((m_f, Σ_f/u[j]), (m_g, Σ_g/u[map_fg])) + D((m_g, Σ_g/u[j]), (m_f, Σ_f/u[map_gf]))
        end for
        jd ← jd + (1/(2P)) * jd_goldberger
    end for
    return (1/MC) * jd
EndFunction
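A runnable sketch of Algorithm 1 for the bivariate case (d = 2) is given below. It is one possible reading of the pseudo-code, not the author's implementation: the function names, the default values of `n_components` and `n_iterations`, and the example cluster parameters are all assumptions, and the Gamma draws use shape df/2 and scale 2/df as implied by the mixture representation of the Student distribution.

```python
import math
import random

def kl_normal_2d(mean_f, cov_f, mean_g, cov_g):
    """KL divergence D(f||g) between two bivariate Normals (Equation 3.2, d = 2)."""
    (a, b), (c, d) = cov_g
    det_g = a * d - b * c
    det_f = cov_f[0][0] * cov_f[1][1] - cov_f[0][1] * cov_f[1][0]
    inv_g = [[d / det_g, -b / det_g], [-c / det_g, a / det_g]]
    trace = sum(inv_g[i][k] * cov_f[k][i] for i in range(2) for k in range(2))
    diff = [mean_f[0] - mean_g[0], mean_f[1] - mean_g[1]]
    maha = sum(diff[i] * inv_g[i][k] * diff[k] for i in range(2) for k in range(2))
    return 0.5 * (math.log(det_g / det_f) + trace + maha - 2)

def scale(cov, u):
    """Covariance of one mixture component: cov / u."""
    return [[entry / u for entry in row] for row in cov]

def jeffreys_distance(f, g, dof, n_components=20, n_iterations=50, rng=None):
    """Monte Carlo estimate of the Jeffreys distance between two bivariate
    Student distributions f = (mean, cov) and g = (mean, cov)."""
    rng = rng or random.Random()
    (mean_f, cov_f), (mean_g, cov_g) = f, g
    total = 0.0
    for _ in range(n_iterations):
        # Shared Gamma draws define the Normal components of both mixtures.
        u = [rng.gammavariate(dof / 2.0, 2.0 / dof) for _ in range(n_components)]
        comps_f = [(mean_f, scale(cov_f, ui)) for ui in u]
        comps_g = [(mean_g, scale(cov_g, ui)) for ui in u]
        matched = 0.0
        for j in range(n_components):
            # Matched-bound step: pair each component with its closest one.
            matched += min(kl_normal_2d(*comps_f[j], *comps_g[k])
                           for k in range(n_components))
            matched += min(kl_normal_2d(*comps_g[j], *comps_f[k])
                           for k in range(n_components))
        total += matched / (2 * n_components)
    return total / n_iterations

# Hypothetical cluster estimates (mean, covariance) from two callers.
f = ((1.0, 0.2), [[0.02, 0.0], [0.0, 0.03]])
g = ((0.6, 0.7), [[0.03, 0.01], [0.01, 0.02]])
print(jeffreys_distance(f, g, dof=5, rng=random.Random(1)))
```

Passing a seeded `random.Random` makes the estimate reproducible. Each iteration costs on the order of P² Gaussian KL evaluations, so P and MC trade accuracy against running time.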
Algorithm 2 Finding bad SNPs
for all SNPs do
    Assign samples to the appropriate labeled clusters:
    (i00, i01, i11); (g00, g01, g11); (a00, a01, a11)
    for all clusters do
        Calculate the sample mean and sample covariance matrix for each t-distributed cluster.
    end for
    Calculate (ig, ia, ag) by Equation 3.6
    conflict ← metric(ig, ia, ag)
    if conflict > thres then
        Assign this SNP as a bad SNP
    end if
end for
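The control flow of Algorithm 2 can be sketched as below. The distance between clusters is passed in as a function `jd`; in the method described here that would be the Jeffreys distance of Algorithm 1, but the example uses a simple stand-in (`mean_gap`) so that the sketch stays short. The data layout, the names, the threshold, and the fallback of 0.0 when two callers share no genotype cluster are all assumptions made for illustration.

```python
from itertools import combinations

def mean_and_cov(points):
    """Sample mean and (unbiased) sample covariance of 2-D intensity pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), [[sxx, sxy], [sxy, syy]]

def snp_conflict(clusters, jd, metric=max):
    """clusters: {caller: {genotype: [(x, y), ...]}} for one SNP.
    Returns metric over the pairwise caller scores (ig, ia, ag)."""
    pair_scores = []
    for c1, c2 in combinations(sorted(clusters), 2):
        shared = [gt for gt in ("00", "01", "11")
                  if gt in clusters[c1] and gt in clusters[c2]]
        # Assumption: a pair of callers with no shared cluster scores 0.0.
        pair_scores.append(min(
            (jd(mean_and_cov(clusters[c1][gt]), mean_and_cov(clusters[c2][gt]))
             for gt in shared),
            default=0.0,
        ))
    return metric(pair_scores)

def find_bad_snps(snps, jd, thres, metric=max):
    """Return the names of SNPs whose conflict degree exceeds thres."""
    return [name for name, clusters in snps.items()
            if snp_conflict(clusters, jd, metric) > thres]

def mean_gap(f, g):
    """Stand-in distance for the example: squared gap between cluster means."""
    (mf, _), (mg, _) = f, g
    return (mf[0] - mg[0]) ** 2 + (mf[1] - mg[1]) ** 2

# Hypothetical clusterings: the callers agree on snp_a, one disagrees on snp_b.
same = {"00": [(1.0, 0.1), (1.2, 0.2)], "11": [(0.1, 1.0), (0.2, 1.2)]}
shifted = {"00": [(0.6, 0.6), (0.8, 0.7)], "11": [(0.5, 0.5), (0.7, 0.4)]}
snps = {
    "snp_a": {"illuminus": same, "genosnp": same, "gencall": same},
    "snp_b": {"illuminus": same, "genosnp": shifted, "gencall": same},
}
print(find_bad_snps(snps, mean_gap, thres=0.05))
```

Because each pairwise score is a minimum over genotype clusters, a pair of callers only registers a conflict when all of their shared clusters differ; the choice of `metric` then decides whether one disagreeing pair is enough to flag the SNP.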