Thesis: Detecting Bad SNPs from Illumina BeadChips Using Jeffreys Distance

VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY

—

NGUYEN HOANG SON

DETECTING BAD SNPS FROM ILLUMINA

BEADCHIPS USING JEFFREYS DISTANCE

MASTER THESIS OF INFORMATION

TECHNOLOGY

Hanoi - 2012


Sector: Information Technology

Code : 60 48 01

Supervised by: Dr Le Sy Vinh


Table of Contents

Overview

1 Introduction

1.1 Biological Background

1.2 SNP Genotyping

1.3 Quality Control and Quality Assurance

2 Related Work

2.1 Naive method for SNP genotyping

2.2 GenCall

2.3 Illuminus

2.4 GenoSNP

2.5 Discussion

3 Method

3.1 Kullback-Leibler divergence

3.2 Approximate relative entropy between two Student distributions

3.2.1 Approximate Student distribution

3.2.2 The matched bound approximation

3.3 Estimate conflict degree between three callers

4 Experimental Results

4.1 Data description

4.2 Parameter estimation

4.3 Evaluation

Conclusion

Chapter 1

Introduction

Human Genome. The human genome is encoded and represented as a database of very long strings over a four-letter alphabet of bases {A, C, G, T}, which stand for the four types of nucleotides. A complete genome contains nearly three billion characters in total.

Single Nucleotide Polymorphisms. A SNP is defined as a single base change in a DNA sequence. The positions where SNPs occur are called SNP sites. Note that there are usually only two possible nucleotide variants at any SNP site. The nucleotide variants at SNP sites are called alleles. We use 0 and 1 to denote the two possible alleles.

A pair of alleles at a SNP site, one from each chromosome copy in a diploid organism, is called a SNP genotype. Since there are two different alleles at a SNP site, there are three different SNP genotypes, denoted 00, 01, and 11 (equivalent to aa, aA, and AA). 00 and 11 are called homozygous genotypes, while 01 is called the heterozygous genotype.

Illumina BeadChips. To date, there are various types of microarrays such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc. The Illumina whole-genome BeadChip is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as the input and captures genotypes across n SNP sites. After several chemical and pre-processing steps, it outputs a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities which indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ represent alleles 0 and 1 of SNP genotype $g_{ij}$, respectively.

Methods. We only consider the three most popular methods for Illumina BeadChips, namely GenCall (in the GenomeStudio software), Illuminus and GenoSNP.

Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is considered a critical phase that has a significant effect on the accuracy of the computational model. QC is defined as a process of monitoring and controlling the quality of data as it is being generated, whereas QA is used to review the product quality afterwards. The content of this thesis is related to the QC procedure of the genotype calling problem.

Cleaning genotype data frequently consists of two separate phases: sample filtering and SNP filtering. We focus only on the latter in this thesis. Low-quality SNPs are called bad SNPs and should be filtered out.

In a typical QC process for the genotyping problem, three variables are considered: the SNP's missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency).

* Missing rate or missing proportion (MSP) at a SNP indicates how many samples failed to be called at this SNP:

$$\mathrm{MSP} = \frac{\text{number of no calls}}{\text{total number of samples}}$$

A higher missing rate indicates poorer genotype calling performance.

* In a biallelic locus which is in Hardy-Weinberg equilibrium and whose minor allele frequency is $q$, the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1 - q)^2$, $2q(1 - q)$ and $q^2$. A significant deviation from HWE typically implies gross genotyping error. A probability test value, or p-value, is used to estimate this difference.


* MAF is the ratio of minor (less frequent) alleles counted in the whole set of alleles:

$$\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$$

A low MAF means that one of the three clusters contains few samples. As a consequence, SNPs with low MAF are more prone to error, since most calling methods based on clustering do not work well in these cases.
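The three QC variables above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are encoded as the strings "00", "01", "11" with `None` for a no call, the function names are hypothetical, and the HWE test shown is a plain chi-square test with one degree of freedom, which is one common choice and not necessarily the test used in this thesis.

```python
import math

def missing_rate(calls):
    """Fraction of samples with no call (None) at one SNP."""
    return sum(c is None for c in calls) / len(calls)

def minor_allele_frequency(calls):
    """MAF computed over the called genotypes only."""
    called = [c for c in calls if c is not None]
    ones = sum(c.count("1") for c in called)      # copies of allele 1
    freq_one = ones / (2 * len(called))
    return min(freq_one, 1.0 - freq_one)

def hwe_p_value(calls):
    """Chi-square test (1 d.o.f.) of Hardy-Weinberg proportions."""
    called = [c for c in calls if c is not None]
    n = len(called)
    observed = [called.count(g) for g in ("00", "01", "11")]
    q = (observed[1] + 2 * observed[2]) / (2 * n)  # frequency of allele 1
    expected = [n * (1 - q) ** 2, n * 2 * q * (1 - q), n * q ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.o.f.
    return math.erfc(math.sqrt(chi2 / 2))

# Toy SNP: 100 called samples in exact HWE proportions plus 5 no calls.
calls = ["00"] * 49 + ["01"] * 42 + ["11"] * 9 + [None] * 5
print(missing_rate(calls), minor_allele_frequency(calls), hwe_p_value(calls))
```

A SNP whose called genotypes are all heterozygous would, by contrast, give a p-value very close to zero, which is the kind of deviation the HWE filter is meant to catch.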

Chapter 2

Related Work

We are given a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$, with $g_{ij} = (x_{ij}, y_{ij})$ representing the genotype intensities of m samples at n SNP loci. The task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to the no-call cluster if this is impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.

2.1 Naive method for SNP genotyping

The simplest method for the SNP genotyping problem uses the relation between the two allelic intensities $(x_{ij}, y_{ij})$. For example, for a data point $(x_{ij}, y_{ij})$:

* If $x_{ij} \gg y_{ij}$ then this SNP genotype could be labelled as 00 (homozygous for allele 0); the remaining genotypes follow analogously.

However, this naive method does not work well in ambiguous cases, and its performance is poor on very large amounts of data.
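A minimal sketch of such a naive caller is given below. The function name `naive_call` and the cut-off `ratio` are assumptions made for illustration; the text does not state how much larger one intensity must be than the other.

```python
def naive_call(x, y, ratio=3.0):
    """Label one intensity pair by comparing the two allelic intensities.

    `ratio` is a hypothetical cut-off, not a value taken from the text.
    """
    if x <= 0 and y <= 0:
        return None      # no signal at all: no call
    if x >= ratio * y:
        return "00"      # allele 0 dominates
    if y >= ratio * x:
        return "11"      # allele 1 dominates
    return "01"          # comparable intensities

points = [(9.0, 1.0), (1.0, 8.0), (4.0, 5.0), (0.0, 0.0)]
print([naive_call(x, y) for x, y in points])
```

The weakness mentioned above is visible here: two nearly identical points such as (3.1, 1.0) and (2.9, 1.0) fall on different sides of the fixed cut-off and receive different labels.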

2.2 GenCall

GenCall is the canonical caller developed for Illumina microarrays from the very beginning. This method uses neural networks and heuristic methods to divide samples into three clusters. A GenCall score is generated as a degree of confidence for each call. In general, this score represents the quality of genotyping and may vary between different loci or chromosomes. However, any call with a score less than 0.2 is considered a failure due to the lack of confidence (it is assigned as a no call in this case).

2.3 Illuminus

Illuminus is one of the best callers for the 2.5M BeadChip of Illumina. This method uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). The posterior probabilities are calculated to classify the samples into the appropriate genotype states. Perturbation analysis is then applied to ensure the stability of the calling results in Illuminus. This process calls each SNP twice, on the original data $(x_{ij}, y_{ij})$ and on the perturbed data $(x_{ij} + \epsilon, y_{ij} + \epsilon)$ respectively. The concordance between the two results is then estimated to determine whether this SNP is valid or not.
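The idea of perturbation analysis can be sketched as follows. This is not the Illuminus implementation: `toy_call` is a hypothetical stand-in for the mixture-model caller, and drawing the perturbation from a zero-mean Gaussian with standard deviation `eps_sd` is an assumption, since the text only says that a small value is added to both intensities.

```python
import random

def toy_call(x, y, ratio=3.0):
    """Hypothetical stand-in caller based on an intensity ratio."""
    if x <= 0 and y <= 0:
        return None
    if x >= ratio * y:
        return "00"
    if y >= ratio * x:
        return "11"
    return "01"

def perturbation_concordance(points, caller, eps_sd=0.05, seed=0):
    """Call each point twice (original and perturbed) and return the
    fraction of points whose two calls agree."""
    rng = random.Random(seed)
    original = [caller(x, y) for x, y in points]
    perturbed = [
        caller(x + rng.gauss(0.0, eps_sd), y + rng.gauss(0.0, eps_sd))
        for x, y in points
    ]
    agree = sum(a == b for a, b in zip(original, perturbed))
    return agree / len(points)

# Well-separated clusters are stable under a small perturbation.
stable = [(9.0, 1.0), (1.0, 9.0), (5.0, 5.0)]
print(perturbation_concordance(stable, toy_call))
```

A SNP whose points sit close to a decision boundary would show a concordance below 1, which is the signal used to flag it as unstable.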

2.4 GenoSNP

GenoSNP also utilizes the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it uses a different method to fit the intensity data to the model. Instead of clustering the probe intensities across individuals at each SNP, as in the two methods stated above, GenoSNP builds the model within a single individual based on the log scale of the normalized intensities $(\log_2(x + 1), \log_2(y + 1))$. This novel approach overcomes a problem of the other methods, namely that their accuracy depends heavily on the number of control samples, which in reality sometimes cannot be afforded. Moreover, this method is well suited to studies where the data is typed by different chips. After running the calling algorithm, there is also a measure of confidence for each call, which is the posterior probability that the call comes from the assigned class. This degree of confidence is also used to filter out the poorer SNPs or samples.


2.5 Comment

All of these callers generally work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, can be expected to increase the performance of genotype calling. This very interesting idea has become the motivation for our work. That is, by considering the results of the three callers and making cross references, we can find the problematic SNPs that have a bad effect on the calling process.

Chapter 3

Method

For bad SNPs, the three genotype classes clustered by Illuminus, GenoSNP and GenCall are not stable: they differ when compared to each other. By estimating these differences over the whole dataset using an information-theoretic distance, we can determine the SNPs with the largest conflicts and mark them as bad SNPs.

3.1 Kullback-Leibler divergence

The Kullback-Leibler divergence or relative entropy $D(P\|Q)$ between two distributions with probability mass functions $f(x)$ and $g(x)$ respectively can be computed as follows:

$$D(f\|g) = \sum_{x} f(x) \log \frac{f(x)}{g(x)} \qquad (3.1)$$

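For two d-dimensional normal distributions $f = \mathcal{N}(\mu_f, \Sigma_f)$ and $g = \mathcal{N}(\mu_g, \Sigma_g)$, we have:

$$D(f\|g) = \frac{1}{2}\left( \log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}\left(\Sigma_g^{-1}\Sigma_f\right) + (\mu_f - \mu_g)^{T}\,\Sigma_g^{-1}\,(\mu_f - \mu_g) - d \right) \qquad (3.2)$$

in which $d = 2$ for our genotyping problem. No such closed-form expression exists for the KL divergence between the distributions of interest here (Student distributions), hence we settle for an approximation.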
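Equation 3.2 with $d = 2$ can be evaluated directly. The sketch below uses only the Python standard library and writes the 2x2 inverse and determinant by hand; the function names are illustrative, not taken from the thesis.

```python
import math

def inv2(s):
    """Inverse and determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = s
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det

def kl_gauss2(mu_f, sig_f, mu_g, sig_g):
    """D(f || g) for two bivariate normals, following Equation 3.2."""
    inv_g, det_g = inv2(sig_g)
    _, det_f = inv2(sig_f)
    # Trace of inv(sig_g) @ sig_f
    trace = sum(inv_g[i][k] * sig_f[k][i] for i in range(2) for k in range(2))
    diff = [mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]]
    # Mahalanobis term (mu_f - mu_g)^T inv(sig_g) (mu_f - mu_g)
    maha = sum(diff[i] * inv_g[i][k] * diff[k] for i in range(2) for k in range(2))
    return 0.5 * (math.log(det_g / det_f) + trace + maha - 2)

I2 = [[1.0, 0.0], [0.0, 1.0]]
wide = [[4.0, 0.0], [0.0, 4.0]]
print(kl_gauss2([0.0, 0.0], I2, [1.0, 0.0], I2))    # unit shift, equal covariances
print(kl_gauss2([0.0, 0.0], I2, [0.0, 0.0], wide))  # narrow vs wide
print(kl_gauss2([0.0, 0.0], wide, [0.0, 0.0], I2))  # wide vs narrow
```

The last two calls return different values, which illustrates that the KL divergence is not symmetric in its arguments.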

3.2 Approximate relative entropy between two Student distributions

3.2.1 Approximate Student distribution

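We have:

$$S(x; \mu, \Sigma, \nu) = \int_{0}^{\infty} \mathcal{N}(x; \mu, \Sigma/u)\, G\!\left(u; \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) du \qquad (3.3)$$

where $\mathcal{N}(x; \mu, \Sigma/u)$ is a Normal distribution of $x$ with mean $\mu$ and covariance matrix $\Sigma/u$, while $G(u; \tfrac{\nu}{2}, \tfrac{\nu}{2})$ is a Gamma distribution of $u$ whose two parameters are the shape and the rate respectively.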

By simplifying the above equation, a Student distribution can be approximated by a finite mixture of P Normals:

$$S(x; \mu, \Sigma, \nu) \approx \frac{1}{P} \sum_{i=1}^{P} \mathcal{N}(x; \mu, \Sigma/u_i) \qquad (3.4)$$

where $\{u_i\}_{i=1}^{P}$ are randomly drawn from the Gamma distribution $G(u; \tfrac{\nu}{2}, \tfrac{\nu}{2})$. The value of P significantly affects the accuracy of the approximation and should be chosen according to the degree of freedom $\nu$.
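The mixture approximation can be checked numerically in the bivariate case. In the sketch below, the function names and the test values of $\mu$, $\Sigma$, $\nu$ and P are assumptions chosen for illustration. Note that Python's `random.gammavariate(alpha, beta)` takes a shape and a scale, so the rate $\nu/2$ is passed as the scale $2/\nu$.

```python
import math
import random

def quad_form(x, mu, sigma):
    """Return (x - mu)^T inv(sigma) (x - mu) and det(sigma) for 2x2 sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det, det

def normal_pdf2(x, mu, sigma):
    q, det = quad_form(x, mu, sigma)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def student_pdf2(x, mu, sigma, nu):
    """Exact bivariate Student density with nu degrees of freedom."""
    q, det = quad_form(x, mu, sigma)
    return (1.0 + q / nu) ** (-(nu + 2.0) / 2.0) / (2.0 * math.pi * math.sqrt(det))

def student_as_mixture(mu, sigma, nu, P, rng):
    """P equally weighted components N(mu, sigma / u_i), u_i ~ Gamma(nu/2, rate nu/2)."""
    comps = []
    for _ in range(P):
        u = rng.gammavariate(nu / 2.0, 2.0 / nu)   # shape, scale = 1 / rate
        comps.append((mu, [[s / u for s in row] for row in sigma]))
    return comps

rng = random.Random(0)
mu, sigma, nu = [0.0, 0.0], [[1.0, 0.3], [0.3, 2.0]], 5.0
comps = student_as_mixture(mu, sigma, nu, 20000, rng)
x = [0.5, -1.0]
approx = sum(normal_pdf2(x, m, s) for m, s in comps) / len(comps)
exact = student_pdf2(x, mu, sigma, nu)
print(approx, exact)
```

With a large P the two printed densities agree closely; with a small P the agreement degrades, which is why the choice of P matters.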

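3.2.2 The matched bound approximation

We use a method of Goldberger to approximately estimate $D(f\|g)$.

* A matching function is defined as

$$m : \{1, \dots, P\} \to \{1, \dots, P\}$$

that maps the $i$-th component of $f$ to the $j$-th component of $g$ such that

$$j = m(i) = \arg\min_{j'} D(f_i \| g_{j'})$$

* Goldberger's approximate formula:

$$D(f\|g) \approx D_{\mathrm{Goldberger}}(f\|g) = \frac{1}{P} \sum_{i=1}^{P} D\left(f_i \,\|\, g_{m(i)}\right) \qquad (3.5)$$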

Thanks to Equation 3.2, we can easily compute every operand on the right-hand side of Equation 3.5, because for all $i, j \in \{1, \dots, P\}$, both $f_i$ and $g_j$ are normally distributed. We therefore now have a way to tackle the problem of approximating the relative entropy between two Student distributions.
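The matched bound approximation for two equally weighted mixtures of bivariate normals can be sketched as below. Goldberger's general formula also involves the mixture weights; the sketch assumes that every component has weight 1/P in both mixtures, in which case the weight terms cancel. The function names are illustrative.

```python
import math

def kl_gauss2(mu_f, sig_f, mu_g, sig_g):
    """D(f || g) for two bivariate normals (Equation 3.2 with d = 2)."""
    (a, b), (c, d) = sig_g
    det_g = a * d - b * c
    inv_g = [[d / det_g, -b / det_g], [-c / det_g, a / det_g]]
    det_f = sig_f[0][0] * sig_f[1][1] - sig_f[0][1] * sig_f[1][0]
    trace = sum(inv_g[i][k] * sig_f[k][i] for i in range(2) for k in range(2))
    diff = [mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]]
    maha = sum(diff[i] * inv_g[i][k] * diff[k] for i in range(2) for k in range(2))
    return 0.5 * (math.log(det_g / det_f) + trace + maha - 2)

def matched_bound_kl(f_comps, g_comps):
    """Average, over components f_i, of the divergence to the closest g_j.

    Taking the minimum over j is the same as evaluating D(f_i || g_m(i))
    with m(i) = argmin_j D(f_i || g_j).
    """
    total = 0.0
    for mu_f, sig_f in f_comps:
        total += min(kl_gauss2(mu_f, sig_f, mu_g, sig_g) for mu_g, sig_g in g_comps)
    return total / len(f_comps)

I2 = [[1.0, 0.0], [0.0, 1.0]]
two_I = [[2.0, 0.0], [0.0, 2.0]]
f = [([0.0, 0.0], I2), ([0.0, 0.0], two_I)]
g = [([1.0, 0.0], I2), ([1.0, 0.0], two_I)]
print(matched_bound_kl(f, g))
```

In this toy example both components of f are matched to the wider component of g, because the wider covariance penalises the mean shift less.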

3.3 Estimate conflict degree between three callers

Each caller clusters the samples into three genotype states encoded as 00, 01, 11 (excluding the no-call cluster) using a different statistical model. Thus we have nine clusters as follows: $i_{00}, i_{01}, i_{11}$ for Illuminus, $g_{00}, g_{01}, g_{11}$ for GenoSNP and $a_{00}, a_{01}, a_{11}$ for the GenCall result.

After that, the difference in the clustering results for each pair of callers is calculated based on the work above, so that three figures are generated (corresponding to the three pairs of callers). In fact, we use the Jeffreys distance thanks to its symmetry:

$$jd(P, Q) = \frac{1}{2}\left( D(P\|Q) + D(Q\|P) \right)$$

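The pseudo-code illustrating the algorithm to estimate the Jeffreys distance between two distributions is given in Algorithm 1. Let $ig$, $ia$ and $ag$ denote the differences between the results of Illuminus vs GenoSNP, Illuminus vs GenCall and GenCall vs GenoSNP respectively, given as:

$$\begin{aligned}
ig &\leftarrow \min\{\, jd(i_{00}, g_{00}),\ jd(i_{01}, g_{01}),\ jd(i_{11}, g_{11}) \,\} \\
ia &\leftarrow \min\{\, jd(i_{00}, a_{00}),\ jd(i_{01}, a_{01}),\ jd(i_{11}, a_{11}) \,\} \\
ag &\leftarrow \min\{\, jd(a_{00}, g_{00}),\ jd(a_{01}, g_{01}),\ jd(a_{11}, g_{11}) \,\}
\end{aligned} \qquad (3.6)$$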

By applying an appropriate metric to these three values (the minimum, maximum or average of $\{ia, ig, ag\}$), we ultimately obtain the degree of difference between all three callers' results at a SNP locus. The choice of the metric function will be discussed later.

Algorithm 2 is then used to assign bad SNPs.

Algorithm 1 Estimate Jeffreys distance between two Student distributions f and g

Require:

* number of Normal components P, number of Monte Carlo iterations MC, common degree of freedom df

* ran_gamma(shape, scale): the random generating function for the Gamma distribution

* D(f, g): function returning the KL divergence between two Gaussian distributions in that order, based on Equation 3.2

Function: jd(f, g)

    jd ← 0

    for count = 1 → MC do

        for i = 1 → P do

            u[i] ← ran_gamma(df/2, 2/df)

        end for

        (here f_j = N(μ_f, Σ_f / u[j]) and g_k = N(μ_g, Σ_g / u[k]))

        jd_goldberger ← 0

        for j = 1 → P do

            map_fg ← argmin over k of D(f_j, g_k)

            map_gf ← argmin over k of D(g_j, f_k)

            jd_goldberger ← jd_goldberger + (1/P) * [D(f_j, g_map_fg) + D(g_j, f_map_gf)]

        end for

        jd ← jd + (1/2) * jd_goldberger

    end for

    return (1/MC) * jd

EndFunction


Algorithm 2 Finding bad SNPs

for all SNPs do

    Assign samples to the appropriate labeled clusters:

    (i00, i01, i11); (g00, g01, g11); (a00, a01, a11)

    for all clusters do

        Calculate the sample mean and sample covariance matrix for each t-distributed cluster

    end for

    Calculate (ig, ia, ag) by Equation 3.6

    conflict ← metric(ig, ia, ag)

    if conflict > thres then

        Assign this SNP as a bad SNP

    end if

end for
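The conflict computation of Equation 3.6 and the thresholding step of Algorithm 2 can be sketched as follows. To keep the example short, each cluster is described by a univariate normal (mean, standard deviation) and the Jeffreys distance uses the closed-form KL divergence for that case. This is a deliberate simplification: in the method above the clusters are bivariate and t-distributed, and `jd` would be estimated by Algorithm 1. The names `conflict_degree` and `is_bad_snp` and the threshold values are hypothetical.

```python
import math

def kl_normal(p, q):
    """D(p || q) between univariate normals given as (mean, sd)."""
    (mp, sp), (mq, sq) = p, q
    return math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5

def jd(p, q):
    """Jeffreys distance: the symmetrised KL divergence."""
    return 0.5 * (kl_normal(p, q) + kl_normal(q, p))

LABELS = ("00", "01", "11")

def conflict_degree(illuminus, genosnp, gencall, metric=max):
    """Equation 3.6 followed by a metric over (ig, ia, ag)."""
    ig = min(jd(illuminus[l], genosnp[l]) for l in LABELS)
    ia = min(jd(illuminus[l], gencall[l]) for l in LABELS)
    ag = min(jd(gencall[l], genosnp[l]) for l in LABELS)
    return metric((ig, ia, ag))

def is_bad_snp(illuminus, genosnp, gencall, thres, metric=max):
    return conflict_degree(illuminus, genosnp, gencall, metric) > thres

# Toy SNP: Illuminus and GenoSNP agree; GenCall's clusters are all shifted by 1.
illuminus = {"00": (0.0, 1.0), "01": (5.0, 1.0), "11": (10.0, 1.0)}
genosnp = dict(illuminus)
gencall = {k: (m + 1.0, s) for k, (m, s) in illuminus.items()}
print(conflict_degree(illuminus, genosnp, gencall))
print(is_bad_snp(illuminus, genosnp, gencall, thres=0.25))
```

Switching `metric` from `max` to `min` makes the toy SNP look conflict-free, because one pair of callers agrees perfectly. This shows how strongly the choice of metric influences which SNPs are flagged.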

Title	Detecting Bad SNPs From Illumina BeadChips Using Jeffreys Distance
Author	Nguyen Hoang Son
Supervisor	Dr. Le Sy Vinh
Institution	Vietnam National University, Hanoi University of Engineering and Technology
Major	Information Technology
Type	Thesis
Year of publication	2012
City	Hanoi
