
(Master's thesis) Detecting bad SNPs from Illumina BeadChips using Jeffreys distance


DOCUMENT INFORMATION

Basic information

Title: Detecting Bad SNPs From Illumina BeadChips Using Jeffreys Distance
Author: Nguyen Hoang Son
Supervisor: Dr. Le Sy Vinh
University: Vietnam National University, Hanoi, University of Engineering and Technology
Major: Information Technology
Type: Master thesis
Year of publication: 2012
City: Hanoi
Pages: 45
File size: 1.38 MB


VIETNAM NATIONAL UNIVERSITY, HANOI

UNIVERSITY OF ENGINEERING AND TECHNOLOGY



NGUYEN HOANG SON

Detecting bad SNPs from Illumina BeadChips using Jeffreys distance

MASTER THESIS OF INFORMATION TECHNOLOGY


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UET or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed


Abstract

Current microarray technologies are able to assay thousands of samples over millions of SNPs simultaneously. Computational approaches have been developed to analyse the huge amount of data from microarray chips in order to understand sophisticated human genomes. The data from microarray chips might contain errors due to bad samples or bad SNPs. In this thesis, a novel method is proposed to detect bad SNPs from the probe intensity data of Illumina BeadChips. This approach measures the difference among the results determined by three software tools, Illuminus, GenoSNP and GenCall, to detect unstable SNPs. An experiment with SNP data from chromosome 20 of Kenyan people demonstrates the usefulness of our method. This approach reduces the number of SNPs that need to be checked manually. Furthermore, it is able to detect bad SNPs that have not been recognized by other criteria.


Acknowledgements

Apart from my own efforts, the success of any project depends largely on the encouragement and guidance of many others. First and foremost, I would like to thank my supervisor, Dr. Sy Vinh Le, for his valuable guidance and advice. This research project would not have been finished successfully without his continuous support and assistance. His enthusiastic supervision helped me throughout the research and the writing of this thesis.

I would also like to gratefully acknowledge the tremendous encouragement and insightful comments of Dr. Si Quang Le during the research for this thesis. His brilliant ideas helped me greatly to overcome numerous problems and difficulties. Words are inadequate to express my thanks for his help.

I would also like to convey my thanks to the Department of Computer Science for providing favourable conditions and laboratory facilities. I also wish to express my love and gratitude to my beloved family for their understanding and endless love throughout the duration of my studies.

My thanks and appreciation also go to my colleagues in developing the project and to the people who have willingly helped me with their abilities.


Table of Contents

1.1 Biological Background
1.2 SNP Genotyping
1.3 Quality Control and Quality Assurance
2 Related Work
2.1 Naive method for SNP genotyping
2.2 GenCall
2.3 Illuminus
2.4 GenoSNP
2.5 Discussion
3 Method
3.1 Kullback-Leibler divergence
3.2 Approximate relative entropy between two Student distributions
3.2.1 Approximate Student distribution
3.2.2 The matched bound approximation
3.3 Estimate conflict degree between three callers
4 Experimental Results
4.1 Data description
4.2 Parameter estimation
4.3 Evaluation


List of Figures

1.1 An example of a SNP detected by an alignment of two DNA sequences
1.2 BeadChip work flow
3.1 Approximate t-distribution by a mixture of Gaussians
3.2 An example of Matched Bound Approximate method with two mixtures of three components
4.1 Input file format
4.2 Histograms of three metrics in terms of conflict score
4.3 Erroneous loci filtered by applied two functions but minimum
4.4 HWE, MAF and missing rate histogram of three callers, with original and filtered data


List of Tables

4.1 Result of missing rate filter
4.2 Result of MAF filter
4.3 Result of HWE exact test filter
4.4 Result of synthesis criteria


Recently, the Genome-Wide Association Study (GWAS), also known as the Whole Genome Association Study (WGAS), has proved to be a successful strategy for identifying genetic variants associated with common diseases or complicated phenotypes. This method focuses on associations between Single Nucleotide Polymorphisms (SNPs) and traits. Because GWAS investigates the entire genome rather than testing just one or a few genetic regions, an extremely large amount of SNP genotype data needs to be called. As a consequence, different methods have been developed to deal with the problem of calling genotypes automatically and efficiently. In general, these programs try to translate the probe hybridization intensities output by microarray chips into SNP genotypes. In the ideal case, a call at a SNP locus generates three clusters of signals, so that the SNP genotype of a given sample can be determined according to which cluster it belongs to. In practice, however, ambiguous data and errors occur frequently for many reasons.

Quality Control (QC), as an indispensable step, is included in every study to remove as much of this erroneous data from the dataset as possible. Unfortunately, this kind of work requires significant time and effort because it can hardly be automated completely. It is also difficult to find the bad data among the good when the applied criteria are not clear for these cases. Although many statistical methods have been proposed, expert-guided evaluations are usually needed to determine the final results. On top of that, the statistical variables used in assessing SNP genotype quality have thresholds that depend on conditions, which is an obstacle to finding an automatic solution based on them. Therefore, a novel approach alongside the traditional QC process is necessary to make this important step as automatic as possible.

In our work, we developed a new method to detect bad SNPs. Our method takes advantage of three available genotype callers for the Illumina BeadChip, namely GenCall, GenoSNP and Illuminus. The general idea comes from the fact that data of bad SNPs tend to confuse the callers, so the calling results of different methods are usually not consistent. That is, by visually comparing the cluster clouds of the three callers SNP by SNP, we could recognize these bad loci. A distance from information theory (relative entropy) is applied to examine these dissimilarities so that bad SNPs can be detected automatically. We applied our method to real-life data and it gave satisfactory results. Although there is still a lot of work to do, this method is able to help experts confirm the filtered data, and it also suggests potential bad SNPs that are hard to find with the traditional QC process.

The thesis is organized in five chapters. An introduction to the definitions and background knowledge is given in Chapter 1. This chapter first gives a brief overview of bioinformatics by explaining some terminology from molecular biology, such as human genome, DNA, SNP and SNP genotyping. It also offers further information related to our work, about solutions to the genotype calling problem and the quality control process. These sections are necessary for a better understanding of the later content of the thesis.

Chapter 2 describes the available genotype callers in more detail. The general idea of a simple caller is given at the beginning as the most basic solution to the genotype calling problem. After that, the three callers used in our algorithm are explained thoroughly. The information provided in this chapter helps the reader take the first step toward our solution.

The proposed method is presented in the third chapter. We review the definition of relative entropy (also known as Kullback-Leibler divergence) and its application to the Normal distribution. The next section states an available method to calculate the relative entropy between Student distributions, which is required to deal with our problem. Finally, an estimation technique is proposed to measure the dissimilarities between the outputs of the three callers.

The experimental process of applying our protocol to real data is presented in Chapter 4. First, a description of the input data is given. After that, a trial-and-error experiment is used to find the most appropriate parameters for our program. Finally, the bad SNPs detected by our method are evaluated by comparison with other traditional QC criteria.

The final chapter gives the overall conclusion and further discussion about the thesis.


1.1 Biological Background

... of them is sufficient (the other can be determined by the pairing rule).

Human genome refers to the whole genetic makeup of the human species, that is, every DNA sequence stored on the 23 chromosome pairs and the small mitochondrial DNA. These sequences are encoded and represented as databases of very long strings over an alphabet of four letters (or bases) {A, C, G, T}, which stand for the four types of nucleotides respectively. A complete genome contains nearly three billion characters in total, which are inherited and passed down from generation to generation.

Single Nucleotide Polymorphisms: Human genomes from different individuals are about 99.9% identical. The remaining 0.1% difference accounts for the particular characteristics of individuals (Collins & McKusick, 2001). Single Nucleotide Polymorphisms (SNPs) are among these important biological markers of difference. Knowing about SNPs and SNP genotypes contributes significantly to genome-wide association studies (GWAS), helping scientists better understand the complicated human genomes and their inheritance mechanisms. As extremely important biological markers, these tiny changes can decisively influence various traits; those related to diseases have recently been among the hottest topics. A SNP is defined as a single base change in a DNA sequence. The positions where SNPs occur are called SNP sites. Note that there are usually only two possible nucleotide variants for any SNP site. The nucleotide variants at SNP sites are called alleles. Rare polymorphisms with more than two bases are out of consideration; therefore we can simply use 0 and 1 to denote the two possible alleles.

Figure 1.1: An example of a SNP detected by an alignment of two DNA sequences

For instance, Figure 1.1 shows two possible DNA sequences at the same locus in the human genome. Considering the green strands, we have two sequences that differ in one location (G versus A):

1 : AACGGATCCAC

2 : AACGAATCCAC

The SNP at this site can be written along with the sequence as follows:

AACG[G/A]ATCCAC

SNP Genotype: A pair of alleles, one from each chromosome copy in a diploid organism, at a SNP site is called a SNP genotype. Since there are two different alleles at a SNP site, there are three different types of SNP genotype, denoted 00, 01, and 11 (equivalent to the allele terminology aa, aA, and AA respectively), where 0 and 1 represent the two different nucleotide variants at this SNP site. 00 and 11 are called homozygous genotypes, while 01 is called the heterozygous genotype. Analyzing SNPs and SNP genotypes helps us better understand the complicated human genomes and their inheritance mechanisms. These nucleotide changes could relate to diseases, and studying these changes has recently become a hot topic. For example, in Figure 1.1, if the two DNA sequences above come from a common locus of two chromatids in a chromosome of one individual, then the SNP genotype for that person is 01.

1.2 SNP Genotyping

Illumina BeadChips: The need to assay a large number of individuals (samples) over millions of SNP sites simultaneously has led to innovation in the biological integrated chip industry. To date, many different manufacturers work in this field and produce various types of microarrays, such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, Invader, etc. A microarray BeadChip contains a number of well plates, and each well has thousands of beads. A known 25-mer oligonucleotide (a predefined short DNA sequence of 25 nucleotides from the human genome) is embedded in each bead. The fluorescent DNA fragments containing the genetic information of an individual match with their complementary sequences on the beads in the hybridization step. A computational system then scans and analyses the coloured signal emitted from the beads to output the allele intensities. The number of wells per plate and beads per well keeps increasing, allowing us to interrogate millions of SNP loci simultaneously.

The Illumina whole-genome BeadChip (Steemers et al., 2006; Peiffer et al., 2006) is one of the most popular microarray technologies used to study human SNPs. It takes m individuals (samples) as input and captures genotypes across n SNP sites. After several chemical and preprocessing steps, it outputs a matrix $G = \{g_{ij}\}_{i=1,\dots,m;\ j=1,\dots,n}$, where $g_{ij} = (x_{ij}, y_{ij})$ are the raw intensities which indicate the SNP genotype of sample i at SNP site j. The intensities $x_{ij}$ and $y_{ij}$ represent alleles 0 and 1 of SNP genotype $g_{ij}$, respectively.

Figure 1.2: BeadChip work flow

Methods: In the genotyping problem, if microarray chips are the hardware, then appropriate software is necessary to deal with the data. Computational methods to determine genotypes from raw intensity data have been developed along with each type of chip. There are several methods available (also known as callers) for the Illumina Bead Array. In this work, we only consider the three most popular methods, namely GenCall (in the GenomeStudio software), Illuminus and GenoSNP. GenCall (Illumina Inc., 2005) is the earliest tool and is still in use in the GenomeStudio tool set. Illuminus (Teo et al., 2007) was developed by Yik Teo et al. as a further tool focusing on the ability to process the large amounts of data available from advanced chips. GenoSNP (Giannoulatou et al., 2008) is a novel approach that can work even when few experimental samples are available.


1.3 Quality Control and Quality Assurance

Large-scale statistical studies are susceptible to errors. The Quality Control and Quality Assurance (QC/QA) process is considered a critical phase that has a significant effect on the accuracy of the computational model. A number of protocols have been proposed to ensure data quality and to avoid spurious conclusions and mistaken results. QC is defined as a process of monitoring and controlling the quality of data as it is being generated, whereas QA is used to review the product quality afterwards (Laurie et al., 2010). The content of this thesis relates to the QC procedure of the genotype calling problem.

Cleaning genotype data frequently consists of two separate phases: sample filtering and SNP filtering. We focus only on the latter in this thesis. Low-quality SNPs are called bad SNPs and should be filtered out. The most common and widely used protocol for evaluating SNP quality is based on measurements such as Hardy-Weinberg equilibrium, missing rate, MAF (minor allele frequency), Mendel errors, etc. Although all three callers apply their own advanced methods to filter out bad SNPs from the data set, it is highly possible that a number of faulty ones still remain in the dataset.

In a typical QC process for the genotyping problem, three variables are considered: the SNP's missing rate, HWE (Hardy-Weinberg equilibrium) and MAF (minor allele frequency). In practice, to improve the efficiency of the filtering process, several extra conditions are also included. For instance, Laurie et al. (2010) additionally suggest association tests based on duplicate discordance, Mendelian errors, sex differences and heterozygosity rate. Briefly, at a SNP locus:

• Missing rate or missing proportion (MSP) at a SNP indicates how many samples failed in this call:

$$\mathrm{MSP} = \frac{\text{number of no calls}}{\text{total number of samples}}$$

A higher missing rate indicates poorer genotype calling performance. Loci with MSP greater than 5% are usually considered problematic SNPs.

• In a biallelic locus that is in Hardy-Weinberg equilibrium with minor allele frequency $q$, the probabilities of the three possible genotypes 0/0, 0/1 and 1/1 are $(1-q)^2$, $2q(1-q)$ and $q^2$. These probabilities should be stable, meaning that the observed values are nearly the same as the expected values. A significant deviation in an HWE test typically implies gross genotyping error. A probability test value, or p-value, is used to estimate this difference.

• MAF is the ratio of minor (less frequent) alleles counted in the whole set of alleles:

$$\mathrm{MAF} = \frac{\text{number of minor alleles}}{\text{total number of alleles}}$$

Low MAF means that one of the three clusters contains few samples. As a consequence, SNPs with low MAF are more prone to error, since most calling methods based on clustering do not work well in these cases.
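As an illustration of the three metrics above, the following is a minimal sketch that computes MSP, MAF and an HWE p-value for a single SNP from its genotype calls. The function name `snp_qc_metrics` and the input encoding (labels "00", "01", "11", with `None` for a no call) are assumptions made for this example. The HWE p-value here comes from a chi-square goodness-of-fit test with one degree of freedom, which is a simple stand-in and not the exact test listed in the thesis tables.

```python
import math
from collections import Counter


def snp_qc_metrics(calls):
    """Compute MSP, MAF and an approximate HWE p-value for one SNP.

    `calls` holds one entry per sample: "00", "01", "11", or None for a no call.
    (Hypothetical encoding chosen for this sketch.)
    """
    n_total = len(calls)
    if n_total == 0:
        raise ValueError("at least one sample is required")

    counts = Counter(c for c in calls if c is not None)
    n00, n01, n11 = counts["00"], counts["01"], counts["11"]
    n_called = n00 + n01 + n11

    # MSP = number of no calls / total number of samples
    msp = (n_total - n_called) / n_total
    if n_called == 0:
        return {"msp": msp, "maf": None, "hwe_p": None}

    # Allele frequencies: every called sample contributes two alleles.
    freq1 = (n01 + 2 * n11) / (2 * n_called)
    freq0 = 1.0 - freq1
    maf = min(freq0, freq1)
    if maf == 0.0:
        # Monomorphic SNP: the chi-square test is undefined.
        return {"msp": msp, "maf": maf, "hwe_p": None}

    # Expected genotype counts under HWE: p^2, 2pq, q^2 times the sample size.
    expected = [n_called * freq0 ** 2,
                2 * n_called * freq0 * freq1,
                n_called * freq1 ** 2]
    observed = [n00, n01, n11]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Survival function of the chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2.0))
    return {"msp": msp, "maf": maf, "hwe_p": hwe_p}
```

A SNP whose called genotypes match the Hardy-Weinberg proportions gets a p-value close to 1, while a SNP where every sample is heterozygous gets a p-value close to 0.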

A proposed set of criteria for standard-quality SNP genotyping is an HWE p-value greater than 0.00033 and an average MSP below 3% with a maximum of no more than 10%, while MAF and quality score must exceed pre-defined values that vary from study to study (Group, 2007; Laurie et al., 2010). In practice, a combination of these variables is required to generate more dependable filtering results. However, as stated before, these thresholds are not fixed across experiments and are very hard to automate. They depend on many factors and give only a rough evaluation of the data quality. The final conclusions about bad SNPs have to be made manually in an expert-guided process to avoid false positive cases, which are potentially missed disease variants in the study. A typical marker (SNP in this case) QC process for GWAS data includes at least four steps, as follows (Anderson Carl A, ):

• Filter out SNPs with excessive missing rate

• Identify SNPs with extreme deviation from HWE

• Find every SNP whose missing rate differs between cases and controls

• Detect low-MAF markers

In this suggestion, the limit for missing rate is 5% (1% for low-frequency SNPs with MAF ≤ 5%). Every SNP with an HWE p-value less than 0.001 is examined by experts for quality. The final step is to remove SNPs showing very low MAF; a threshold of 1-2% is typically applied. The authors also emphasize that checking cluster plots manually is the best safeguard for quality evaluation.
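The thresholds quoted above can be turned into a small rule-based filter. The sketch below is only an illustration: the function name `qc_flags`, the flag labels and the default `maf_min=0.01` (the lower end of the 1-2% range) are assumptions, and the case/control missing-rate comparison is left out because the text gives no threshold for it.

```python
def qc_flags(msp, maf, hwe_p, maf_min=0.01):
    """Return the list of QC rules that a SNP fails (empty list means it passes).

    msp   : missing proportion of the SNP
    maf   : minor allele frequency
    hwe_p : p-value of the HWE test
    """
    flags = []

    # Missing-rate limit: 5% in general, 1% for low-frequency SNPs (MAF <= 5%).
    msp_limit = 0.01 if maf <= 0.05 else 0.05
    if msp > msp_limit:
        flags.append("missing_rate")

    # SNPs with HWE p-value below 0.001 are passed on for expert review.
    if hwe_p < 0.001:
        flags.append("hwe_review")

    # Very rare variants are removed; maf_min is a study-specific choice.
    if maf < maf_min:
        flags.append("low_maf")

    return flags
```

For example, a SNP with 3% missing calls passes when its MAF is 20%, but is flagged when its MAF is 4%, because the stricter 1% limit then applies.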


In this thesis, we propose a new method to measure the difference among the results produced by different callers. First, the next chapter provides information about the genotype calling problem and current callers. Then we introduce our method of utilizing the Jeffreys distance to find bad SNPs. After that, our experimental results are evaluated statistically and plotted visually to estimate the effectiveness of this approach. Lastly, we give our conclusion and further discussion about the work.


Chapter 2

Related Work

Given a matrix $G = \{g_{ij}\}_{i=1,\dots,m;\ j=1,\dots,n}$ with $g_{ij} = (x_{ij}, y_{ij})$ representing the genotype intensities of m samples at n SNP loci, the task is to assign every pair to its most suitable genotype label, 00, 01 or 11 (or to a no-call cluster if this is impossible). This problem is called SNP genotype calling or, in short, SNP genotyping.

2.1 Naive method for SNP genotyping

The simplest method for the SNP genotyping problem uses the relationship between the two allelic intensities $(x_{ij}, y_{ij})$. For example, for a data point $(x_{ij}, y_{ij})$:

• If $x_{ij} \gg y_{ij}$ then this SNP genotype can be labelled 00 (homozygous for allele 0).

• If $x_{ij} \ll y_{ij}$ then it belongs to genotype class 11 (homozygous for allele 1).

• If $x_{ij} \simeq y_{ij}$ then this SNP call should be 01 (heterozygous).
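The three rules above can be sketched as a small function. The ratio thresholds `hom_ratio` and `het_ratio` are hypothetical values chosen only to make "much greater than" and "approximately equal" concrete; points that satisfy neither condition are returned as a no call, which is exactly where a naive rule runs into ambiguity.

```python
def naive_call(x, y, hom_ratio=4.0, het_ratio=1.5):
    """Label one intensity pair (x, y) as "00", "01", "11", or None (no call).

    x, y      : intensities for allele 0 and allele 1
    hom_ratio : how many times larger one intensity must be for a homozygous call
    het_ratio : largest intensity ratio still treated as "approximately equal"
    """
    if x <= 0 and y <= 0:
        return None  # no signal at all

    high, low = max(x, y), min(x, y)
    if high <= het_ratio * low:
        return "01"  # x and y are about the same: heterozygous
    if x >= hom_ratio * y:
        return "00"  # allele 0 dominates
    if y >= hom_ratio * x:
        return "11"  # allele 1 dominates
    return None      # ambiguous region between the two thresholds
```

With these example thresholds, `naive_call(600, 250)` falls between the heterozygous and homozygous regions and yields no call.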

However, this naive method does not work well in ambiguous cases, and its performance is poor on the tremendous amounts of data involved. In practice, statistical approaches are used to overcome this problem. Illuminus, GenoSNP and GenCall are the three most popular callers for Illumina 2.5M BeadChips. They use different clustering algorithms to group samples into three clusters, 00, 01 and 11. Samples which do not belong to any cluster are called outliers.


2.2 GenCall

... to affine coordinates (Ritchie et al., 2011).

• Secondly, an algorithm known as GenTrain is utilized for the unsupervised clustering step. The transformed data of all samples at each SNP are modelled, then combined with several pieces of heuristic information in order to be analysed by a neural network. The centroids of the three clusters are estimated, forming three genotype groups with their shapes and positions. After the clustering process, a statistical score (called the GenTrain score) that combines a number of factors and penalty terms is assigned to each SNP locus in order to evaluate the resulting clusters in a way similar to a human expert’s visual inspection.

• The GenTrain score, together with the clustering information, is then used with a Bayesian model to establish the classification. Finally, a calling score for each call (the GenCall score) is generated as a degree of confidence for that call.

The GenCall score is not a probability value, but it is used to represent the quality of genotyping and might vary between different loci or chromosomes. However, any call with a score less than 0.2 is considered a failure due to lack of confidence (and is assigned as no call), while good calls are indicated by a score greater than 0.7. SNP genotypes whose GenCall score falls between 0.2 and 0.7 are qualified but need to be checked carefully from other points of view.

It is worth noting that GenCall runs top/bottom strand correlation studies, in which each SNP is genotyped twice, once for each strand, and the two results are then compared to ensure agreement. This mechanism also helps GenCall filter out some faulty calls from the final result.


2.3 Illuminus

Illuminus (Teo et al., 2007) is one of the most popular callers for Illumina 2.5M BeadChips. The working processes of genotype calling methods are quite similar, sharing several common phases, namely data normalization, clustering, calling and sometimes validation. In this second tool of interest, much effort is spent on preprocessing the data, adding further procedures to ease the clustering process. Assume that the normalized data generated by the Illumina BeadStudio software (also the input of the GenCall algorithm) are in the form $(x_{ij}, y_{ij})$, where $x_{ij}$ and $y_{ij}$ are the signal intensities for the two alleles 0 and 1 respectively, for a certain sample i at a certain SNP j. Illuminus then converts those figures into new measures, namely contrast and strength, defined as follows:

$$c_{ij} = \frac{x_{ij} - y_{ij}}{x_{ij} + y_{ij}}$$

$$s_{ij} = \log(x_{ij} + y_{ij})$$

Secondly, the clustering algorithm takes those values as input and, using the EM (Expectation Maximization) framework, tries to fit a bivariate mixture model with three t-distributed components (the three genotype clusters) and a Gaussian component for outliers (corresponding to the no-call cluster). The parameters that determine the shape and location of the components are re-estimated iteratively until they maximize the likelihood and become stable. Next, the posterior probabilities are calculated to classify the samples into the appropriate genotype states. For details of this technique, please refer to its publication (Teo et al., 2007).
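The contrast and strength transformation defined above is easy to sketch in code. The function name `contrast_strength` is made up for this example, and the natural logarithm is assumed because the text does not state the base.

```python
import math


def contrast_strength(x, y):
    """Map normalized intensities (x, y) to (contrast, strength).

    contrast = (x - y) / (x + y), which lies in [-1, 1] for non-negative inputs
    strength = log(x + y), using the natural logarithm (an assumption)
    """
    total = x + y
    if total <= 0:
        raise ValueError("x + y must be positive")
    return (x - y) / total, math.log(total)
```

A contrast near +1 suggests genotype 00, near -1 suggests 11 and near 0 suggests 01, so the three genotype clusters separate along the contrast axis while strength captures the overall signal level.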

Illuminus also provides a method to ensure the stability of calling results, called perturbation analysis. Every SNP is called twice, once on the original data $(x_{ij}, y_{ij})$ and once on perturbed data $(x_{ij} + \epsilon, y_{ij} + \epsilon)$, where $\epsilon$ denotes a small perturbation; the concordance between the two results is then estimated to determine whether this SNP is valid or not.
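The idea of perturbation analysis can be illustrated with a toy example. Everything in the sketch below is an assumption made for illustration: `toy_call` is a simple contrast-threshold caller standing in for the real clustering model, the perturbation is drawn as independent Gaussian noise with standard deviation `sigma` for each intensity, and the function names are hypothetical. It is not the procedure implemented in Illuminus.

```python
import random


def toy_call(x, y, cut=0.5):
    """Toy caller: threshold the contrast (x - y) / (x + y)."""
    total = x + y
    if total <= 0:
        return None
    contrast = (x - y) / total
    if contrast > cut:
        return "00"
    if contrast < -cut:
        return "11"
    return "01"


def perturbation_concordance(intensities, caller, sigma=1.0, seed=0):
    """Fraction of samples whose call is unchanged after adding noise.

    intensities : list of (x, y) pairs for one SNP, one pair per sample
    caller      : function (x, y) -> genotype label
    sigma       : standard deviation of the Gaussian perturbation (assumed)
    """
    rng = random.Random(seed)
    agree = 0
    for x, y in intensities:
        original = caller(x, y)
        # Perturb both intensities, keeping them non-negative.
        px = max(0.0, x + rng.gauss(0.0, sigma))
        py = max(0.0, y + rng.gauss(0.0, sigma))
        if caller(px, py) == original:
            agree += 1
    return agree / len(intensities)
```

A SNP with well-separated clusters keeps a concordance of 1.0 under small noise, whereas samples lying close to a cluster boundary may change label and lower the concordance.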

2.4 GenoSNP

GenoSNP (Giannoulatou et al., 2008) is another prominent caller that works with Illumina BeadChips. Firstly, the data transformation method of this tool differs from the other two. Starting with pairs of raw intensities $(x, y)$, it generates $(\log_{10}(x + 1), \log_{10}(y + 1))$. This method also utilizes the EM framework for a mixture model of Student-distributed components, as in Illuminus. However, it uses a different method to fit the intensity data to the model. Instead of clustering the probe intensities across individuals at each SNP, as in the two methods described above, GenoSNP builds the model within a single individual based on the log scale of the normalized figures. This novel approach is able to overcome a problem of the other methods, namely that their accuracy depends heavily on the number of control samples, which sometimes cannot be afforded in reality. Moreover, this method is well suited to studies where data are typed by different chips. It is worth noting that GenoSNP has two versions of the EM algorithm for the parameter re-estimation step. Besides the original version, there is another approach called Variational Bayes Expectation Maximization (VB-EM), which differs from the former in the E-step. VB-EM has been shown to be more robust to the uncertainty of model parameters (Giannoulatou et al., 2008), therefore improving the performance of the original algorithm.

After running the calling algorithm, a measure of confidence is also available for each call: the posterior probability that the call comes from the assigned class. This degree of confidence is also used to filter poorer SNP or sample data.

2.5 Discussion

Many comparisons have been made between these three callers (Giannoulatou et al., 2008; Teo et al., 2007; Ritchie et al., 2011). In general, each of these methods has its own advantages and disadvantages. For example, Illuminus is the caller with the highest call rate (equivalent to the lowest missing rate), but its accuracy is not the best. This drawback arises because noisy samples, or outliers, affect the clustering algorithm of Illuminus more severely than those of the other methods. GenCall, on the other hand, is valued for its high-quality calls, but its call rate is relatively low among the three. Finally, GenoSNP is better at balancing the trade-off between the two factors. In addition, it works well with datasets of few samples. However, for a sufficiently large dataset, the efficiency of this method falls behind the other two.

To conclude, all of these callers generally work quite well on arbitrary datasets, with call rate and accuracy usually exceeding 95%. A combination of all three callers, also known as consensus calling, is expected to increase the performance of genotype calling. This very interesting idea became the motivation for our work. That is, by considering the results of the three callers and cross-referencing them, we can find the problematic SNPs that have a bad effect on the calling process.
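To make the idea of cross-referencing three callers concrete, here is a minimal sketch of a majority-vote consensus call and a per-SNP discordance rate. This is a deliberately simple stand-in that only counts label disagreements; it is not the Jeffreys-distance conflict score developed in this thesis, and the function names and the list-based input format are assumptions for the example.

```python
from collections import Counter


def consensus_call(calls):
    """Majority vote over the labels given by the three callers for one sample.

    Returns the label chosen by at least two callers, or None when there is
    no majority (or when the majority itself is a no call).
    """
    label, votes = Counter(calls).most_common(1)[0]
    return label if votes >= 2 else None


def discordance_rate(gencall, genosnp, illuminus):
    """Fraction of samples at one SNP where the three callers do not all agree.

    Each argument is a list of labels ("00", "01", "11" or None), one per sample,
    in the same sample order.
    """
    triples = list(zip(gencall, genosnp, illuminus))
    if not triples:
        raise ValueError("at least one sample is required")
    disagreements = sum(1 for t in triples if len(set(t)) > 1)
    return disagreements / len(triples)
```

A SNP with a high discordance rate is a candidate for closer inspection, since inconsistent calls suggest that its intensity data confuse at least one caller.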

Date posted: 27/06/2022, 09:11
