Detecting bad SNPs from Illumina BeadChips using Jeffreys distance


VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY



NGUYEN HOANG SON

Detecting bad SNPs from Illumina BeadChips using Jeffreys distance

MASTER THESIS OF INFORMATION TECHNOLOGY

Hanoi - 2012


Sector: Information Technology

Major: Computer Science

Code: 60 48 01

Supervised by: Dr Le Sy Vinh


Table of Contents

1 Introduction
1.1 Biological Background
1.2 SNP Genotyping
1.3 Quality Control and Quality Assurance

2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall
2.3 Illuminus
2.4 GenoSNP
2.5 Discussion

3 Method
3.1 Kullback-Leibler divergence
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
3.2.2 The matched bound approximation
3.3 Estimate conflict degree between three callers

4 Experimental Results
4.1 Data description
4.2 Parameter estimation
4.3 Evaluation

Chapter 1

Introduction

DNA sequences are very long strings over the four-letter alphabet {A, C, G, T}, whose letters (bases) stand for the four types of nucleotides. A complete human genome contains nearly three billion characters in total.

A single nucleotide polymorphism (SNP) is a variation of a single nucleotide in a DNA sequence. The positions where SNPs occur are called SNP sites. Note that there are usually only two possible nucleotide variants at any SNP site. The nucleotide variants at SNP sites are called alleles. We use 0 and 1 to denote the two possible alleles.

A pair of alleles, one from each chromosome copy in a diploid organism, at a SNP site is called a SNP genotype. Since there are two different alleles at a SNP site, there are three possible SNP genotypes, denoted 00, 01 and 11 (equivalent to aa, aA and AA). 00 and 11 are called homozygous genotypes, while 01 is called the heterozygous genotype.

Several technologies are available for SNP genotyping, such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc.

The Illumina whole-genome BeadChip is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as input and captures genotypes across n SNP sites. After several chemical and pre-processing steps, it outputs a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$; $j = 1, \dots, n$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities that indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ correspond to alleles 0 and 1 of SNP genotype $g_{ij}$ respectively.

Several genotype callers have been developed for Illumina BeadChips, namely GenCall (in the GenomeStudio software), Illuminus and GenoSNP.

Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is considered a critical phase that significantly affects the accuracy of the computational model. QC is defined as the process of monitoring and controlling the quality of data as it is being generated, whereas QA reviews the product quality afterwards. This thesis is concerned with the QC procedure for the genotype calling problem.

Cleaning genotype data frequently consists of two separate phases: sample filtering and SNP filtering. We focus only on the latter in this work, in which low-quality SNPs are called bad SNPs and should be filtered out.

In a typical QC process for the genotyping problem, three variables are considered: the SNP's missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency).

• Missing rate or missing proportion (MSP) at a SNP indicates how many samples failed to be called at this SNP:

$$\mathrm{MSP} = \frac{\text{number of samples with no call}}{\text{total number of samples}}$$

A higher missing rate indicates poorer genotype calling performance.

• At a biallelic locus that is in Hardy-Weinberg equilibrium and has minor allele frequency $q$, the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1 - q)^2$, $2q(1 - q)$ and $q^2$. A significant deviation from HWE typically implies gross genotyping error. A probability test value, or p-value, is used to estimate this difference.

• MAF is the ratio of minor (less frequent) alleles counted in the whole set of alleles:

$$\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$$

A low MAF means that one of the three clusters contains few samples.

As a consequence, SNPs with low MAF are more prone to error, since most clustering-based calling methods do not work well in these cases.
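The three criteria above can be illustrated for a single SNP with a short sketch. The function name `snp_qc_summary` and the encoding of calls as the strings `"00"`, `"01"`, `"11"` or `None` (no call) are assumptions made for this example, not part of the thesis software. The sketch only returns the expected genotype counts under HWE; the exact test that produces the p-value is not shown.

```python
from collections import Counter

def snp_qc_summary(calls):
    """Summarise one SNP from a list of calls: '00', '01', '11' or None (no call).

    Assumes at least one sample was called.
    """
    n_total = len(calls)
    called = [c for c in calls if c is not None]
    n_called = len(called)
    counts = Counter(called)

    # Missing rate: share of samples that received no call at this SNP.
    missing_rate = (n_total - n_called) / n_total

    # Each called sample contributes two alleles.
    n_allele0 = 2 * counts["00"] + counts["01"]
    n_allele1 = 2 * counts["11"] + counts["01"]
    maf = min(n_allele0, n_allele1) / (2 * n_called)

    # Expected genotype counts under HWE, with q the frequency of allele 1.
    q = n_allele1 / (2 * n_called)
    expected = {
        "00": (1 - q) ** 2 * n_called,
        "01": 2 * q * (1 - q) * n_called,
        "11": q ** 2 * n_called,
    }
    return missing_rate, maf, expected
```

Comparing `expected` with the observed genotype counts is the starting point of an HWE test; a large discrepancy hints at genotyping error.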

Chapter 2

Related Work

Given a matrix $G = \{g_{ij}\}$, $i = 1, \dots, m$; $j = 1, \dots, n$, with $g_{ij} = (x_{ij}, y_{ij})$ representing the genotype intensities of m samples at n SNP loci, the task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to a no-call cluster if this is impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.

The simplest method for the SNP genotyping problem uses the relation between the two allelic intensities $(x_{ij}, y_{ij})$. For example, for a data point $(x_{ij}, y_{ij})$:

• If $x_{ij} \gg y_{ij}$ then it should belong to genotype class 00 (homozygous of allele 0)

• If $x_{ij} \ll y_{ij}$ then it should belong to genotype class 11 (homozygous of allele 1)

• Or if $x_{ij} \approx y_{ij}$ then this SNP call should be 01 (heterozygous)

However, this naive method does not work well in ambiguous cases, and its performance is poor on very large amounts of data.
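The naive rule can be written down in a few lines. This is a minimal sketch in which "much larger" is made concrete by a hypothetical factor `ratio`; the thesis does not specify such a cut-off, and the value 2.0 is an arbitrary choice for illustration.

```python
def naive_call(x, y, ratio=2.0):
    """Call a genotype by comparing the two allele intensities.

    `ratio` is a hypothetical cut-off: one intensity must exceed the other
    by this factor for the call to be homozygous.
    """
    if x <= 0 and y <= 0:
        return None      # no signal at all: no call
    if x > ratio * y:
        return "00"      # allele 0 dominates
    if y > ratio * x:
        return "11"      # allele 1 dominates
    return "01"          # comparable intensities: heterozygous
```

A point such as (500, 260) sits close to the boundary between 00 and 01, which is exactly the kind of ambiguous case the rule handles poorly.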

GenCall is the canonical caller developed for Illumina microarrays from the very beginning. This method uses neural networks and heuristic methods to divide samples into three clusters. A GenCall score is generated as the degree of confidence for each call. In general, this score represents the quality of genotyping and may vary between different loci or chromosomes; however, any call with a score less than 0.2 is commonly considered a failure due to the lack of confidence (and is assigned as no call in this case).

Illuminus is one of the best callers for the Illumina 2.5M BeadChip. This method uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). The posterior probabilities are calculated to classify the samples into the appropriate genotype states. Perturbation analysis is then applied to ensure the stability of the calling results. This process calls each SNP twice, on the original data $(x_{ij}, y_{ij})$ and on the perturbed data $(x_{ij} + \epsilon, y_{ij} + \epsilon)$ respectively; the concordance between the two results is then estimated to determine whether this SNP is valid or not.

GenoSNP also uses the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it fits the intensity data to the model in a different way. Instead of clustering the probe intensities across individuals at each SNP, as the two methods above do, GenoSNP builds the model within a single individual, based on the log scale of the normalized intensities (log11(x + 1), log11(y + 1)). This approach overcomes a problem of the other methods, whose accuracy depends heavily on the number of control samples, which sometimes cannot be afforded in practice. Moreover, this method is well suited to studies where the data are typed by different chips. After running the calling algorithm, there is also a measure of confidence for each call, namely the posterior probability that the call comes from the assigned class. This degree of confidence is also used to filter out poorer SNPs or samples.


In general, all of these callers work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, is expected to increase the performance of genotype calling. This idea motivates our work: by considering the results of the three callers and cross-referencing them, we can find the problematic SNPs that adversely affect the calling process.

Chapter 3

Method

For bad SNPs, the three genotype classes clustered by Illuminus, GenoSNP and GenCall are not stable: they differ from one caller to another. By estimating these differences over the whole dataset using an information-theoretic distance, we can determine the SNPs with the largest conflicts and assign them as bad SNPs.

The Kullback-Leibler divergence or relative entropy $D(P\|Q)$ between two distributions $P$ and $Q$ with probability mass functions $p(x)$ and $q(x)$ respectively is computed as follows:

$$D(P\|Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \qquad (3.1)$$

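For two $\gamma$-dimensional normal distributions $f(\mu_f, \Sigma_f)$ and $g(\mu_g, \Sigma_g)$, we have:

$$D(f\|g) = \frac{1}{2}\left(\log\frac{|\Sigma_g|}{|\Sigma_f|} + \mathrm{Tr}[\Sigma_g^{-1}\Sigma_f] - \gamma + (\mu_f - \mu_g)^T \Sigma_g^{-1} (\mu_f - \mu_g)\right) \qquad (3.2)$$

in which $\gamma = 2$ for our genotyping problem. No such closed-form expression exists for the KL divergence between the distributions of interest here, namely Student distributions, hence an approximation is needed.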
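For the bivariate case, the closed form for two normal distributions can be evaluated without any linear algebra library. The sketch below is an illustration only; the function name `kl_gauss_2d` and the representation of a covariance matrix as nested tuples `((a, b), (b, c))` are assumptions of this example.

```python
import math

def kl_gauss_2d(mu_f, cov_f, mu_g, cov_g):
    """KL divergence D(f || g) between two bivariate normal distributions.

    Means are (m0, m1); covariances are symmetric, given as ((a, b), (b, c)).
    """
    (af, bf), (_, cf) = cov_f
    (ag, bg), (_, cg) = cov_g
    det_f = af * cf - bf * bf
    det_g = ag * cg - bg * bg

    # Entries of the inverse of the 2x2 covariance of g.
    i00, i01, i11 = cg / det_g, -bg / det_g, ag / det_g

    # Tr[inv(Sigma_g) Sigma_f]
    trace = i00 * af + 2 * i01 * bf + i11 * cf

    # (mu_f - mu_g)^T inv(Sigma_g) (mu_f - mu_g)
    d0, d1 = mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]
    maha = i00 * d0 * d0 + 2 * i01 * d0 * d1 + i11 * d1 * d1

    # The constant 2 is the dimension (gamma = 2).
    return 0.5 * (math.log(det_g / det_f) + trace - 2 + maha)
```

Note that the divergence is not symmetric: swapping the two distributions generally gives a different value, which is why a symmetrised distance is used later.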

3.2 Approximate relative entropy between two Student distributions

3.2.1 Approximate Student distribution

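We have:

$$S(x; \mu, \Sigma, \upsilon) = \int_0^{+\infty} N(x; \mu, \Sigma/u)\, G\!\left(u; \frac{\upsilon}{2}, \frac{\upsilon}{2}\right) du \qquad (3.3)$$

where $N(x; \mu, \Sigma/u)$ is a Normal distribution of $x$ with mean $\mu$ and covariance matrix $\Sigma/u$, while $G(u; \upsilon/2, \upsilon/2)$ is a Gamma distribution of $u$ whose two parameters are the shape and the rate respectively.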

By simplifying the above equation, a Student distribution can be approximated by a finite mixture of P Normal distributions:

$$S(x; \mu, \Sigma, \upsilon) \approx \frac{1}{P}\sum_{i=1}^{P} N(x; \mu, \Sigma/u_i) \qquad (3.4)$$

where $\{u_i\}_{i=1}^{P}$ are randomly drawn from the Gamma distribution $G(u; \upsilon/2, \upsilon/2)$. The value of P significantly affects the accuracy of the approximation and should be chosen according to the degree of freedom $\upsilon$.
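The quality of the finite-mixture approximation can be checked numerically. The sketch below is a simplified illustration that assumes the univariate, standardised case (mean 0, scale 1) instead of the bivariate setting of the thesis; the function names are hypothetical. Note that `random.gammavariate` takes a shape and a scale, so a rate of υ/2 becomes a scale of 2/υ.

```python
import math
import random

def student_pdf(x, nu):
    """Exact density of a standard univariate Student distribution."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def draw_scales(nu, P, rng):
    """Draw u_1..u_P from Gamma(shape=nu/2, rate=nu/2), i.e. scale 2/nu."""
    return [rng.gammavariate(nu / 2, 2 / nu) for _ in range(P)]

def mixture_pdf(x, scales):
    """Equal-weight average of the Normal densities N(x; 0, 1/u_i)."""
    total = 0.0
    for u in scales:
        total += math.sqrt(u / (2 * math.pi)) * math.exp(-0.5 * u * x * x)
    return total / len(scales)
```

With a large number of components the two densities agree closely; with only a handful of components the agreement depends on the particular draws, which is one reason to repeat the draw several times and average.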

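3.2.2 The matched bound approximation

We use a method of Goldberger to approximately estimate $D(f\|g)$, where $f$ and $g$ are each approximated by an equal-weight mixture of P Normal components $f_i$ and $g_k$ as above.

• A matching function is defined as

$$m : \{1, \dots, P\} \to \{1, \dots, P\}, \qquad j = m(i) = \arg\min_{k} D(f_i\|g_k)$$

• Goldberger's approximate formula:

$$D(f\|g) \approx D_{goldberger}(f\|g) = \frac{1}{P}\sum_{i=1}^{P} D(f_i\|g_{m(i)}) \qquad (3.5)$$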

Thanks to Equation 3.2, we can easily compute every operand on the right-hand side of Equation 3.5, because for all $i, j \in \{1, \dots, P\}$, both $f_i$ and $g_j$ are normally distributed. We thus have a way to approximate the relative entropy between two Student distributions.
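The matching step can be illustrated on a toy example. To keep the sketch short it assumes univariate Normal components with equal weights, whereas the thesis works with bivariate components; `kl_gauss_1d` and `matched_bound` are hypothetical names.

```python
import math

def kl_gauss_1d(mu_f, var_f, mu_g, var_g):
    """KL divergence D(f || g) between two univariate normal distributions."""
    return 0.5 * (math.log(var_g / var_f) + (var_f + (mu_f - mu_g) ** 2) / var_g - 1)

def matched_bound(f_comps, g_comps):
    """Equal-weight matched-bound approximation of D(f || g).

    Each argument is a list of (mean, variance) pairs describing one mixture.
    Every component of f is matched to the component of g closest to it in
    KL divergence, and the matched divergences are averaged.
    """
    total = 0.0
    for mu_f, var_f in f_comps:
        total += min(kl_gauss_1d(mu_f, var_f, mu_g, var_g) for mu_g, var_g in g_comps)
    return total / len(f_comps)
```

For example, with `f = [(0, 1), (0, 2)]` and `g = [(1, 1), (1, 2)]`, both components of f are matched to the wider component of g, because its larger variance makes the shift in the mean less costly.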

Each caller clusters the samples into three genotype states encoded as 00, 01 and 11 (excluding the no-call cluster) using a different statistical model. Thus we have nine clusters, as follows: i00, i01, i11 for Illuminus, g00, g01, g11 for GenoSNP and a00, a01, a11 for GenCall.

The difference between the clustering results of each pair of callers is then calculated using the approximation above, yielding three values (one per pair of callers). We use the Jeffreys distance because it is symmetric:

$$jd(P, Q) = \frac{1}{2}\left(D(P\|Q) + D(Q\|P)\right)$$

The pseudo-code for estimating the Jeffreys distance between two distributions is given in Algorithm 1. Let ig, ia and ag denote the differences between the results of Illuminus vs GenoSNP, Illuminus vs GenCall and GenCall vs GenoSNP respectively, given as:

ig ← min{jd(i00, g00), jd(i01, g01), jd(i11, g11)}

ia ← min{jd(i00, a00), jd(i01, a01), jd(i11, a11)}

ag ← min{jd(a00, g00), jd(a01, g01), jd(a11, g11)} (3.6)

By applying an appropriate metric to these three values (the minimum, maximum or average of {ia, ig, ag}), we obtain the degree of difference between the three callers' results at a SNP locus. The choice of metric function is discussed later.

Algorithm 2 then assigns bad SNPs.

Algorithm 1 Estimate the Jeffreys distance between two Student distributions f and g

Require:

• f(mf, Σf) and g(mg, Σg): mean and covariance matrix of each distribution

• number of Normal components P, number of Monte Carlo iterations MC, common degree of freedom df

• ran_gamma(shape, scale): the random generator function for the Gamma distribution

• D(f, g): function returning the KL divergence between two Gaussian distributions, in that order, based on Equation 3.2

Function: jd(f, g)

    jd ← 0

    for count = 1 → MC do

        for i = 1 → P do

            u[i] ← ran_gamma(df/2, 2/df)

        end for

        jd_goldberger ← 0

        for j = 1 → P do

            map_fg ← argmin over m ∈ {1, ..., P} of D((mf, Σf/u[j]), (mg, Σg/u[m]))

            map_gf ← argmin over m ∈ {1, ..., P} of D((mg, Σg/u[j]), (mf, Σf/u[m]))

            jd_goldberger ← jd_goldberger + D((mf, Σf/u[j]), (mg, Σg/u[map_fg])) + D((mg, Σg/u[j]), (mf, Σf/u[map_gf]))

        end for

        jd ← jd + jd_goldberger / (2P)

    end for

    return jd / MC

EndFunction
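Algorithm 1 can be turned into a short runnable sketch using only the Python standard library. This is an illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, covariance matrices are nested tuples `((a, b), (b, c))`, and the normalisation (dividing by 2P inside the loop and by MC at the end) follows the Jeffreys and Goldberger formulas of this chapter. The defaults P = 10, MC = 20 and df = 4 are the values reported in the experiments.

```python
import math
import random

def kl_gauss_2d(mu_f, cov_f, mu_g, cov_g):
    """D(f || g) for bivariate normals; cov = ((a, b), (b, c))."""
    (af, bf), (_, cf) = cov_f
    (ag, bg), (_, cg) = cov_g
    det_f, det_g = af * cf - bf * bf, ag * cg - bg * bg
    i00, i01, i11 = cg / det_g, -bg / det_g, ag / det_g   # inverse of cov_g
    trace = i00 * af + 2 * i01 * bf + i11 * cf
    d0, d1 = mu_f[0] - mu_g[0], mu_f[1] - mu_g[1]
    maha = i00 * d0 * d0 + 2 * i01 * d0 * d1 + i11 * d1 * d1
    return 0.5 * (math.log(det_g / det_f) + trace - 2 + maha)

def scale_cov(cov, u):
    """Return the covariance matrix cov / u."""
    return tuple(tuple(v / u for v in row) for row in cov)

def jeffreys_distance(mu_f, cov_f, mu_g, cov_g, P=10, MC=20, df=4, rng=None):
    """Monte Carlo estimate of the Jeffreys distance between two Student clusters."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(MC):
        # Gamma(shape=df/2, rate=df/2) is Gamma(shape=df/2, scale=2/df).
        u = [rng.gammavariate(df / 2, 2 / df) for _ in range(P)]
        f_covs = [scale_cov(cov_f, ui) for ui in u]
        g_covs = [scale_cov(cov_g, ui) for ui in u]
        acc = 0.0
        for j in range(P):
            # Matched component in each direction: the closest one in KL.
            acc += min(kl_gauss_2d(mu_f, f_covs[j], mu_g, cg) for cg in g_covs)
            acc += min(kl_gauss_2d(mu_g, g_covs[j], mu_f, cf) for cf in f_covs)
        total += acc / (2 * P)
    return total / MC
```

Passing a seeded `random.Random` instance as `rng` makes the estimate reproducible, which is convenient when comparing SNPs against a fixed threshold.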


Algorithm 2 Finding bad SNPs

for all SNPs do

    Get the nine clusters (i00, i01, i11); (g00, g01, g11); (a00, a01, a11)

    for all Clusters do

        Calculate sample mean and sample covariance matrix for each t-distributed cluster

    end for

    Calculate (ig, ia, ag) by Equation 3.6

    conflict ← metric(ig, ia, ag)

    if conflict > thres then

        Assign this as bad SNP

    end if

end for
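The decision step of Algorithm 2 can be sketched as follows, assuming the Jeffreys distances between matching clusters have already been computed for each pair of callers. The function names, the dictionary layout and the SNP identifiers are hypothetical; the minimum metric and the threshold 0.1 are the choices reported in the experiments.

```python
def conflict_score(jd_ig, jd_ia, jd_ag, metric=min):
    """Combine per-class Jeffreys distances into one conflict score for a SNP.

    Each jd_xy maps a genotype class ('00', '01', '11') to the Jeffreys
    distance between the clusters of callers x and y for that class.
    """
    ig = min(jd_ig.values())   # Illuminus vs GenoSNP (Equation 3.6)
    ia = min(jd_ia.values())   # Illuminus vs GenCall
    ag = min(jd_ag.values())   # GenCall vs GenoSNP
    return metric((ig, ia, ag))

def find_bad_snps(snp_distances, thres=0.1, metric=min):
    """Return the identifiers of SNPs whose conflict score exceeds thres."""
    bad = []
    for snp_id, (jd_ig, jd_ia, jd_ag) in snp_distances.items():
        if conflict_score(jd_ig, jd_ia, jd_ag, metric) > thres:
            bad.append(snp_id)
    return bad
```

Any function of a three-element tuple can be passed as `metric`, for instance `max` or `lambda v: sum(v) / len(v)` for the average.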


The SNP calls of the three callers for 4473 Kenyan people at 28410 SNP loci are synthesised and used as input data in VCF format. Each call includes the confidence score of each caller: the GenCall score and, for each of the other two callers, a set of three probabilities representing how likely the call is to belong to each of the three genotype clusters. The recommended cut-off for GenoSNP and Illuminus is 0.95 for both, while it is 0.2 for GenCall. Every call whose confidence is below the cut-off is considered as no call.
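Applying the cut-offs can be sketched as below. The function names and the input layout (a triple of posterior probabilities, or a genotype with its GenCall score) are assumptions for illustration and do not reflect the actual VCF fields used in the thesis.

```python
GENOTYPES = ("00", "01", "11")

def call_from_posteriors(probs, cutoff=0.95):
    """Illuminus or GenoSNP: keep the most probable class if it is confident enough.

    `probs` holds the three posterior probabilities for 00, 01 and 11.
    """
    best = max(range(3), key=lambda k: probs[k])
    return GENOTYPES[best] if probs[best] >= cutoff else None

def call_from_gencall(genotype, score, cutoff=0.2):
    """GenCall: keep the reported genotype only if its score reaches the cut-off."""
    return genotype if score >= cutoff else None
```

A return value of `None` stands for a no call, which is what the missing rate criterion counts.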

This section works out the most suitable parameters and metric used in our program. The degree of freedom parameter υ is fixed to 4. Ten normal components per t-distribution are sufficient for accuracy, and a Monte Carlo loop of 20 iterations is executed.

For the metric option, we test three candidates: the minimum, maximum or average of the three values. Through various experiments, the minimum proved to be the best option for the metric function.

With all the reasons discussed before and the histogram ??, it is concluded that a SNP whose minimum conflict score among the three is greater than 0.1 will be filtered out. 2360 out of 28410 SNPs (∼ 8.3%) are removed by this protocol.

Our results are compared with the baseline of several traditional QC criteria, namely missing rate, minor allele frequency and Hardy-Weinberg equilibrium.

Table 3.1: Result of missing rate filter (callers by missing rate threshold)

Table 3.2: Result of MAF filter (callers by MAF threshold)

Table 3.3: Result of HWE exact test filter (callers by HWE threshold)

Table 3.4: Result of synthesis criteria (callers by number of SNPs)

Tables 3.1, 3.2 and 3.3 illustrate the different criteria used to help experts evaluate SNP quality. Each cell of the tables contains two numbers: the first shows how many SNPs break the corresponding threshold and need to be reconsidered manually; the second is the number of SNPs among them that are also filtered by our protocol. For instance, in Table 3.1, with the missing call rate threshold set to 2%, 14761 SNP calling results are removed for Illuminus. Among them, 1943 SNPs are also filtered out by our program. This means that our protocol reduces the number of SNPs that have to be checked in this case by 1943. Besides, we also detect 417 other potential bad SNPs that are missed by the missing rate criterion alone. Visual checks show that these SNPs are really problematic.

Similar observations hold for the other two tables. However, the numbers in these tables increase along with the threshold because lower bounds are used. In all three tables, Illuminus is the caller with the highest call rate in this case since it
