Table of Contents
1 Introduction
1.1 Biological background
1.2 Some common types of mutation
1.3 SNP and SNP genotype
1.4 Microarray technology and Illumina BeadChips
1.5 Genotype callers
1.6 Quality control and quality assurance
1.6.1 Identify samples with discordant sex information
1.6.2 Identify samples that have high missing and heterozygosity rate
1.6.3 Identify duplicated or related samples
1.6.4 Identify samples that have different ancestries
2 Genotype callers
2.1 Illuminus
2.2 GenoSNP
2.3 GenCall
2.4 Comparing three callers
3 Maximum likelihood method for detecting bad samples
3.1 Create potential bad sample list
3.2 Estimate the fitness of data
3.3 Remove bad samples
4 Experimental result
4.1 Input file format
4.2 Experiment 1
4.3 Experiment 2
List of Figures
1.1 DNA structure
1.2 Human genome, chromosome and genes
1.3 The process of creating and genotyping of Illumina Infinium II [Inc06]
2.1 Mixture of two Gaussian distributions
2.2 x, y intensities vs strength and contrast [TIS+07]
3.1 The workflow of the method
4.1 VCF file format example
4.2 SNP rs2465126 before and after removing bad samples
4.3 SNP rs2488991 before and after removing bad samples
4.4 SNP rs6055460 before and after removing bad samples
List of Tables
1.1 An example of DNA substitution
1.2 An example of DNA insertions and deletions
1.3 An example of SNP
2.1 Comparison between callers [GYC+08a]
4.1 5 highest missing rate samples and their statistics in experiment 1
4.2 5 SNPs that have high positive changes after bad samples are removed
4.3 Number of bad samples with different thresholds in experiment 1
4.4 5 highest missing rate samples in experiment 2
4.5 5 SNPs that have high positive changes after bad samples are removed
4.6 Number of bad samples with different thresholds in experiment 2
A genome-wide association study (GWAS) is a study that scans the human genome to detect associations between single nucleotide polymorphisms (SNPs) and traits such as diseases. With the advancement of technology in recent years, some DNA microarrays are able to capture millions of SNPs from thousands of individuals (or samples). Generating a microarray requires many chemical and biological processes. Most of these processes are performed automatically by machines produced by a few large DNA microarray companies.
Creating the microarray is only the first part. The second part is analyzing the data contained in the microarray to obtain the genotype of each SNP for each individual. This part is called the SNP genotyping process. Nowadays, statistical approaches are the most common methods for this process thanks to their low cost and short running time. However, these methods are not perfect: they may generate faulty genotype data for some individuals or SNPs. Faulty genotype data can result from errors in the microarray creation process, from inaccuracy when transferring data from the microarray to the genotyping method, or from the methods themselves.
Faulty genotype data is useless for genotype analysis. Therefore, several criteria have been proposed to remove bad samples and bad SNPs. For instance, under one such criterion all samples whose proportion of undefined genotypes (missing rate) is higher than 3% are marked as bad samples and removed. However, these criteria can lead to a massive reduction in the number of samples or SNPs. Moreover, there is no mathematical verification that proves the removals are right. Hence, after the removals, visualizations such as a scatter plot of each SNP and some statistics are produced to verify the removals. This job is mostly done manually by experts and it is time-consuming. To conclude, the remaining problem in this step is to find a statistical approach to remove bad samples and bad SNPs. A good solution is one that gives reliable results and requires as little expert intervention as possible.
In this thesis, we propose a maximum likelihood method to detect bad samples. Our observation is that mixture model-based methods such as Illuminus have a very high call rate, but they are not always consistent because of noisy samples. Each noisy sample can affect the correlation matrix and the location parameter of a distribution by shifting the cluster away from its ideal position. This problem might result in faulty SNP genotype calls from Illuminus. Based on this observation, we introduce a new fitness function to deal with this problem. Our fitness function follows the idea of maximum likelihood-based (ML-based) methods and maximizes the fitness of a mixture of Student distributions. If the presence of a sample in the data reduces the fitness, the sample is marked as a bad sample and removed. Moreover, to take advantage of quality control criteria, we also use the missing rate to create a list of samples that have a high potential of being bad samples. By checking only the samples in this list, the processing time for detecting bad samples is massively reduced.
The rest of the thesis is organized as follows. Chapter 1 introduces some biological background about DNA, the human genome, SNPs, genotypes, and SNP genotyping. Chapter 2 gives a brief introduction to the three most popular algorithms that work with Illumina BeadChips (Illuminus, GenCall, and GenoSNP) and a short comparison of their performance. Chapter 3 presents our proposed method to detect bad samples from the genotype results of Illuminus. Chapter 4 shows how our method works with real data, reporting its results on two different data sets. Finally, we draw some conclusions in the last part of the thesis.
Chapter 1
Introduction
1.1 Biological background
In biology, the cell is the smallest unit of life. Organisms are called unicellular if they have only one cell (bacteria, for example). Most organisms are multicellular, that is, their body contains more than one cell. A single person contains approximately 10 trillion ($10^{13}$) cells. Each cell has its own role in our body. A cell knows its role and function through special instructions that reside in the cell's nucleus.
The instructions of a cell come from DNA (deoxyribonucleic acid). DNA is like a blueprint for our cells: it contains a set of plans for building them. Figure 1.1 shows the structure of DNA. DNA has a double helix structure built from two sugar-phosphate backbones, nucleotides (bases), and hydrogen bonds between paired nucleotides. There are 4 types of nucleotide: A stands for Adenine, C for Cytosine, G for Guanine, and T for Thymine. A can only form hydrogen bonds with T, and C can only bond with G. For this reason, when studying DNA, scientists only have to examine one of the two strands. In general, AT and CG are called base pairs.
A cell does not contain just one DNA molecule. In fact, each cell in our body contains a lot of DNA, packaged into units called chromosomes. Each organism has its own number of chromosomes. For instance, a dog has 78 chromosomes while a mosquito has only 6. Chromosomes always come in pairs, one from the father and one from the mother. That is the reason why children resemble both their mother and their father. The human genome consists of 23 pairs of chromosomes; one pair determines gender and the others are autosomal chromosome pairs.
Figure 1.1: DNA structure
Genes are parts of DNA that encode the information needed to build all the proteins in our body. Those proteins are very important because they keep our body functioning. The human body is estimated to contain approximately 25000 genes. Genes normally contain thousands of nucleotides. The relationship between genes and DNA can be understood as follows: DNA contains millions of characters (A, C, G, T); each group of three characters makes a word (three nucleotides encode one amino acid during protein synthesis, and such a group is also called a DNA triplet); many words make a sentence; and each sentence is a gene. See Figure 1.2 for more information about the relationship between the human genome, chromosomes, DNA, and genes.
Human genome studies show that 99.9% of our genome is identical to that of other people [CM01]; however, the appearance of each person is unique. For example, the eye color of a person could be blue, black, or brown. This uniqueness is due to the polymorphisms between our genome sequences, which are also called genetic polymorphisms. The polymorphisms may be the results of mutations such as insertions, deletions, and substitutions.
Figure 1.2: Human genome, chromosome and genes
1.2 Some common types of mutation
Table 1.1: An example of DNA substitution
sequence 1   A C G A T G C A A
sequence 2   A C G A A C G A A
Substitution: DNA substitution is the phenomenon in which one or more nucleotides are replaced by other nucleotides. Table 1.1 is an example of DNA substitution: when the two sequences are aligned, 3 bases in sequence 1 (T G C at positions 5 to 7) are replaced by three other bases in sequence 2 (A C G).
Table 1.2: An example of DNA insertions and deletions
sequence 1   A C G A T G C A A
sequence 2   A C G A - - - A A
Insertion and deletion: On the one hand, DNA deletion occurs when one or more nucleotides are removed from the DNA sequence. On the other hand, when some nucleotides are inserted into a DNA sequence, we have DNA insertion. When studying human DNA, these two types are often indistinguishable. Therefore, they are grouped together and called indel mutations. For instance, Table 1.2 can be described in two different ways: either three bases have been inserted into sequence 1, or three bases have been deleted from sequence 2.
Others: Besides the three types above, there are many other types of polymorphism, for example gene duplication (which creates multiple copies of a whole chromosome region and increases the number of genes located in this region), chromosomal inversion (which reverses the order of a whole chromosome region), and many other mutations between chromosomes.
1.3 SNP and SNP genotype
Alleles are the alternate forms of a gene present in an individual. Normally, an individual carries two alleles, one from the father and the other from the mother. The set of alleles of an individual is called its genotype. A genotype at a SNP site, called a SNP genotype, is a pair of alleles, one from each chromosome copy in a diploid organism. A SNP genotype is classified into three types, AA, AB, and BB, where A and B encode the two alleles. AA and BB genotypes are called homozygous genotypes while the AB genotype is called a heterozygous genotype.
Table 1.3: An example of SNP
sequence 1   A C G A T G C A A
sequence 2   A C G A G G C A A
sequence 3   A C G A G G C A A
sequence 4   A C G A C G C A A
sequence 5   A C G A A G C A A
sequence 6   A C G A T G C A A
Although most SNPs do not affect the physical appearance of individuals, some of them might result in genetic diseases. Therefore, the study of SNPs is currently one of the hottest research trends. For instance, if a family has several members who are affected by a specific disease, then the SNP related to this disease may have been passed from one family member to another. Moreover, SNP genotyping can also be used to identify the SNPs related to specific diseases such as heart disease by using a large number of samples from unrelated patients. These studies could lead to a better understanding of diseases and become an important step towards finding the best treatment for patients.
1.4 Microarray technology and Illumina BeadChips
Current DNA microarray technologies such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen, and Invader address the need for running genetic tests such as SNP detection in parallel [KF01, Syv05]. A DNA microarray can capture data from thousands of samples (individuals) over millions of SNP sites. Illumina BeadChips are one of the most popular microarray technologies and our method only works with their data.
The Illumina Infinium platform is a very high throughput SNP genotyping system that can detect over 2 million SNPs per DNA sample. Figure 1.3 illustrates the workflow of Illumina Infinium II (one of the Illumina BeadChips). Firstly, the DNA sequences are amplified by a Polymerase Chain Reaction (PCR) process. This step can increase the amount of DNA up to 1000 times. The DNA is then incubated before being fragmented on the next day. After that, the fragmented DNA is captured on a BeadArray using hybridization and extended with detectable labels. The BeadChip image is obtained by scanning the BeadChip with an Illumina BeadArray Reader or Illumina HiScan.
Figure 1.3: The process of creating and genotyping of Illumina Infinium II [Inc06]
Finally, the intensity values are extracted from the BeadChip image. For instance, if we have $m$ individuals (samples) and the BeadChip captures $n$ SNP sites, these intensity values form a matrix $G = \{g_{ij},\ i = 1, \dots, m;\ j = 1, \dots, n\}$ where $g_{ij} = (x_{ij}, y_{ij})$ holds the intensity values of sample $i$ at SNP site $j$. The intensities $x_{ij}$ and $y_{ij}$ correspond to alleles A and B, respectively, of the SNP genotype at $g_{ij}$.
1.5 Genotype callers
Methods that cluster each SNP genotype into one of three types based on the intensities are called genotype callers. The performance of a caller is often measured by the proportion of data points that it successfully assigns to a cluster (the call rate). The data points that are not assigned to any cluster are called outliers.
A naive caller method is described as follows:
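• If $x_{ij} \gg y_{ij}$ ($x_{ij} > y_{ij} + \alpha$), SNP genotype $g_{ij}$ is AA,
• If $x_{ij} \ll y_{ij}$ ($x_{ij} < y_{ij} - \alpha$), SNP genotype $g_{ij}$ is BB,
• If $x_{ij} \approx y_{ij}$ ($|x_{ij} - y_{ij}| \le \alpha$), SNP genotype $g_{ij}$ is AB,
where $\alpha$ is a given threshold.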
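The naive caller above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the BB rule is assumed to be the symmetric condition x < y - alpha so that the three cases do not overlap.

```python
def naive_call(x, y, alpha):
    """Call one SNP genotype from the intensities x (allele A) and y (allele B)."""
    if x > y + alpha:
        return "AA"
    if x < y - alpha:  # assumed symmetric counterpart of the AA rule
        return "BB"
    return "AB"        # remaining case: |x - y| <= alpha


def naive_call_matrix(intensities, alpha):
    """Apply naive_call to an m-by-n matrix of (x, y) pairs (samples by SNP sites)."""
    return [[naive_call(x, y, alpha) for (x, y) in row] for row in intensities]
```

For example, with alpha = 0.2 the pair (1.0, 0.1) is called AA and the pair (0.5, 0.6) is called AB.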
This naive method is not good enough because the output from Illumina BeadChips might contain noise due to bad samples or flaws in the scanning process. GenCall [Inc05], Illuminus [TIS+07], and GenoSNP [GYC+08a] are the three most popular callers for Illumina BeadChips. GenCall and Illuminus determine genotypes SNP site by SNP site across samples. At SNP site $j$, these methods analyze the intensities at this site, $\{(x_{1j}, y_{1j}), \dots, (x_{mj}, y_{mj})\}$, to group the $m$ samples into three clusters AA, AB, and BB. GenCall applies a neural network to predefine the centroids of the three clusters and then generates GenTrain scores to do the remaining tasks. Illuminus uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (the three genotype clusters). It also adds a Gaussian component for outliers (corresponding to the outlier cluster). GenoSNP shares with Illuminus the idea of using a mixture model of three Student distributions, but it uses a different scheme that determines genotypes sample by sample. More information about these algorithms is given in Chapter 2.
1.6 Quality control and quality assurance
Although the three callers above provide reasonable clusters, their results are not always consistent. The conflicts among these results might be due to bad samples (samples poorly prepared before being added to the chip) or bad SNPs (SNPs not well processed or scanned). To overcome this problem, quality control and quality assurance (QC and QA) processes are conducted by experts to detect both bad SNPs and bad samples in the results of callers. This process is manual and time-consuming.
In this thesis, we focus only on controlling the quality of samples. All samples that fail the quality test are marked as bad samples. These samples are removed from the genotype results of the callers and examined later to understand why their quality could not reach the thresholds.
The two most popular criteria for detecting bad samples are the missing rate and the heterozygosity rate. These rates are defined as:
• Missing rate (m_rate): the missing rate or failure rate of a sample is the proportion of missing genotypes.
• Heterozygosity rate (h_rate): the proportion of heterozygous genotypes for a given sample.
According to these definitions, the missing rate and heterozygosity rate of a sample can be calculated as:
\[ \mathit{m\_rate} = \frac{\text{number of outliers}}{\text{number of SNPs}} \tag{1.1} \]
\[ \mathit{h\_rate} = \frac{\text{number of heterozygous calls}}{\text{number of successful calls}} \tag{1.2} \]
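As a small illustration of equations (1.1) and (1.2), the sketch below computes both rates for one sample and lists the samples whose missing rate exceeds a threshold. The encoding of an outlier as None, the function names, and the dictionary layout are assumptions made for this example only.

```python
def sample_rates(calls):
    """Return (m_rate, h_rate) for one sample.

    calls: genotype calls of the sample across SNP sites, each "AA", "AB",
    "BB", or None for an outlier (the None encoding is an assumption).
    """
    n_snps = len(calls)
    if n_snps == 0:
        raise ValueError("a sample needs at least one SNP site")
    n_missing = sum(1 for c in calls if c is None)
    n_success = n_snps - n_missing
    n_het = sum(1 for c in calls if c == "AB")
    m_rate = n_missing / n_snps                       # equation (1.1)
    h_rate = n_het / n_success if n_success else 0.0  # equation (1.2)
    return m_rate, h_rate


def high_missing_samples(calls_by_sample, max_m_rate):
    """Names of the samples whose missing rate exceeds max_m_rate."""
    return [name for name, calls in calls_by_sample.items()
            if sample_rates(calls)[0] > max_m_rate]
```

A sample with the calls AA, AB, missing, BB, AB has a missing rate of 1/5 = 0.2 and a heterozygosity rate of 2/4 = 0.5.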
According to Anderson et al., a general quality control process for samples consists of at least 4 steps [APC+10]:
• Step 1: Identify samples with discordant sex information
• Step 2: Identify samples that have high missing and heterozygosity rate
• Step 3: Identify duplicated or related samples
• Step 4: Identify samples that have different ancestry
1.6.1 Identify samples with discordant sex information
This step should be the very first step of quality control if the genotype data includes X-chromosome data. Males have only one X chromosome; therefore, all genotype caller algorithms are expected to call male X-chromosome data as homozygous. Normally, male data points that have been called as heterozygous are marked as outliers, which means the homozygosity rate of males should be around 1. Moreover, females should have homozygosity rates of less than 0.2. Males that have been incorrectly recorded as females should have a higher homozygosity rate than real females, and females that have been recorded as males by mistake should have a lower homozygosity rate than real males.
For these reasons, when controlling the quality of genotype data, we can use the X-chromosome data to check whether the sex information has been recorded correctly. All samples marked as having discordant sex information should be removed from the genotype data.
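The check described above can be sketched as a simple comparison between the recorded sex and the sex suggested by the X-chromosome homozygosity rate. The function name and the return labels are hypothetical. The female cut-off of 0.2 comes from the text, while the male cut-off of 0.8 is an assumption, since the text only says the male rate should be around 1.

```python
def check_sex(recorded_sex, x_hom_rate, female_max=0.2, male_min=0.8):
    """Compare the recorded sex with the X-chromosome homozygosity rate.

    x_hom_rate is assumed to be computed beforehand. male_min=0.8 is an
    assumed cut-off; only female_max=0.2 is stated in the text.
    """
    if x_hom_rate >= male_min:
        inferred = "male"
    elif x_hom_rate < female_max:
        inferred = "female"
    else:
        return "ambiguous"  # between the two cut-offs: needs manual review
    return "concordant" if inferred == recorded_sex else "discordant"
```

Samples reported as "discordant" would be the ones removed in this step; how to handle "ambiguous" samples is left open here.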
1.6.2 Identify samples that have high missing and heterozygosity rate
These thresholds work quite well with genotype data that do not have too much noise. For data in which noise appears too frequently, using the missing rate and the heterozygosity rate could remove a large number of samples. The number of samples remaining for further analysis is therefore reduced massively.
1.6.3 Identify duplicated or related samples
For effective analysis, all collected samples should be unrelated to one another. In other words, any two samples should be more distantly related than second degree relatives (a first degree relative relationship is the one between children and parents or between two siblings; a second degree relative relationship is the one between grandchildren and grandparents, nephews and aunts, and so on). The reason is that close relatives distort the allele frequency data. As a result, they could bias further studies.
The easiest way to identify two closely related samples is to use the missing rate. A sample tends to have a missing rate similar to that of another sample from the same family. However, this method is too risky because two unrelated samples could also have a high probability of having the same missing rate.
To deal with that problem, the identity by state (IBS) metric is calculated for every pair of samples. This metric is calculated from the genotyped SNPs that two individuals have in common (sex chromosomes are not included in this calculation). Pairs of samples that are closely related to each other are expected to have high IBS values. If a sample is a duplicate of another, the IBS between them is 1. Using genome-wide IBS data, we can compute another metric called identity by descent (IBD). For duplicates or twins, the value of IBD is expected to be 1. When IBD = 0.5, the two corresponding samples are marked as first degree relatives. Two samples are marked as second degree relatives and third degree relatives when IBD = 0.25 and IBD = 0.125, respectively.
The thresholds above are just theoretical values. With real data, lower cut-offs must be used because of genotyping failures or other reasons. For instance, we could identify two samples as duplicates if the IBD between them is higher than 0.98. Generally, for every pair of samples with IBD > 0.185, one of the two samples must be removed.
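The removal rule above can be illustrated with a short sketch that walks over all pairs and drops one sample from each pair with IBD above 0.185. The text does not say which of the two samples to drop; removing the one with the higher missing rate is an assumption of this example, as are the function names and the dictionary layout.

```python
def relatedness_label(ibd):
    """Label a pair of samples using the practical cut-offs from the text."""
    if ibd > 0.98:
        return "duplicate or twin"
    if ibd > 0.185:
        return "too closely related"
    return "acceptable"


def samples_to_remove(pair_ibd, m_rate, threshold=0.185):
    """Choose one sample to remove from every pair with IBD above threshold.

    pair_ibd: dict mapping a pair (a, b) of sample names to its IBD value.
    m_rate: dict mapping a sample name to its missing rate.
    Dropping the sample with the higher missing rate is an assumed tie-break.
    """
    removed = set()
    for (a, b), ibd in sorted(pair_ibd.items()):
        if ibd <= threshold or a in removed or b in removed:
            continue  # pair is acceptable or already resolved by an earlier removal
        removed.add(a if m_rate[a] >= m_rate[b] else b)
    return removed
```

Skipping pairs that already contain a removed sample avoids discarding more samples than necessary.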
1.6.4 Identify samples that have different ancestries
When studying SNP data, two samples that come from two different ancestries tend to differ too much in their results. This could reduce the effectiveness of further analysis, particularly in population-based case-control studies. Therefore, the final step of quality control is to identify samples that come from different ancestries and remove them from the genotype data.
Principal Component Analysis (PCA) [PPP+06] is the most common method for detecting samples that have divergent ancestries. This method requires the pairwise IBD matrix calculated in the third step of quality control. The components are built so that the first one represents as much variation in the data as possible, followed by the second one, the third, and so on. Using PCA and the HapMap genotype data from Europe, Asia, and Africa, we can build three clusters. Each sample is then assigned to its most appropriate cluster. The samples whose cluster differs from the population of interest are removed.
Apart from the second step, the three remaining steps are only used for some special data and requirements. For this reason, the missing rate and the heterozygosity rate are widely used for all Illumina BeadChips genotype data. However, the thresholds for these two rates are chosen manually, and applying them may reduce the number of samples massively. For instance, in the first experiment shown later in Chapter 4, 1097 samples are removed when 0.2 is used as the missing rate threshold, while the total number of samples is only 3656. To deal with these problems, we should improve the second step to lower the number of removed samples.
Chapter 2
Genotype callers
As shown in Chapter 1, GenoSNP, Illuminus, and GenCall are the three most popular callers for Illumina BeadChips. In this chapter, we give some more information about these algorithms.
2.1 Illuminus
Illuminus was introduced by Teo et al. in 2007 and quickly became one of the most popular methods for the SNP genotyping problem. Its algorithm uses a mixture model of three bivariate Student t-distributions and a Gaussian component with zero location and large variance to model the three genotype clusters and the outliers.
Clustering using a mixture model [MB88] is an unsupervised clustering algorithm. The idea is to use multiple multivariate distributions to fit the given data, where each distribution represents a cluster. This method is itself an Expectation Maximization (EM) algorithm [MK97]: all the parameters that determine the shape and the location of a distribution are updated iteratively during the re-estimation steps. The mixture model algorithm can be divided into 6 small steps as follows:
• Step 1: initialize m distributions. For example, with a mixture model of m multivariate Gaussian distributions, this step creates m Gaussian distributions, randomly or from predefined values, with different mean values and covariance matrices. The mean values and the covariance matrices are called the parameters of this model.
• Step 2: calculate the prior probability priorProb_i of each of the m distributions.
• Step 3: calculate the probability prob_{i,j} of each point p_j under each distribution i.
• Step 4: assign each point p_j to the cluster i with the highest value of (priorProb_i * prob_{i,j}).
• Step 5: re-calculate the parameters of all m distributions after the changes. This step uses the EM algorithm to find the formulas that maximize the likelihood of the current model.
• Step 6: go back to step 2 until there is no change in any of the m distributions.
Figure 2.1: Mixture of two Gaussian distributions
Figure 2.1 is an example of a mixture of two Gaussian distributions with different mean values and covariance matrices. As can be seen, some parts of the two distributions overlap each other. The clustering algorithm does not care too much about the
Algorithm 1 Mixture model algorithm
Require: n points p_1, p_2, ..., p_n; number of clusters m
Ensure: m clusters with their points
  Initialize m empty clusters C = {c_1, ..., c_m}
  // Calculate the prior probabilities of the m distributions
  Calculate priorProb_i
  // Calculate the probabilities of p_j under the m distributions
  // Then assign each point to its most probable cluster
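To make Algorithm 1 concrete, the following sketch fits a mixture by repeatedly assigning each point to its most probable component and then re-estimating the parameters. It is a deliberately simplified illustration and not the Illuminus implementation: it assumes one-dimensional Gaussian components instead of bivariate t-distributions, uses hard assignments, and all names and default values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(points, means, variances, max_iter=100, min_var=1e-6):
    """Hard-assignment mixture fitting for 1-D Gaussian components.

    means, variances: initial parameters of the m components.
    Returns (labels, means, variances, priors).
    """
    m = len(means)
    means, variances = list(means), list(variances)
    priors = [1.0 / m] * m  # start from equal prior probabilities
    labels = None
    for _ in range(max_iter):
        # Score every point under every component and keep the best component.
        new_labels = [
            max(range(m),
                key=lambda i: priors[i] * gaussian_pdf(p, means[i], variances[i]))
            for p in points
        ]
        if new_labels == labels:  # stop when the assignment no longer changes
            break
        labels = new_labels
        # Re-estimate the parameters of each component from its members.
        for i in range(m):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if not members:
                continue  # an empty cluster keeps its previous parameters
            priors[i] = len(members) / len(points)
            means[i] = sum(members) / len(members)
            var = sum((p - means[i]) ** 2 for p in members) / len(members)
            variances[i] = max(var, min_var)  # guard against zero variance
    return labels, means, variances, priors
```

A full EM implementation would replace the hard assignment with soft responsibilities for each point; the hard version is used here because it mirrors the assignment step written in Algorithm 1.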
Back to Illuminus: in the first step, Illuminus transforms the x, y intensities of each sample into strength and contrast using the following formulas:
\[ \text{strength} = \log(x + y) \tag{2.1} \]
\[ \text{contrast} = \frac{x - y}{x + y} \tag{2.2} \]
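A minimal sketch of this transformation is shown below. It assumes the natural logarithm for strength and the normalized form (x - y) / (x + y) for contrast, which places pure A, balanced, and pure B signals at 1, 0, and -1; the function name is hypothetical.

```python
import math


def to_strength_contrast(x, y):
    """Map raw intensities (x, y) to (strength, contrast).

    Assumes a natural logarithm and contrast normalized by x + y.
    """
    total = x + y
    if total <= 0:
        raise ValueError("x + y must be positive")
    strength = math.log(total)
    contrast = (x - y) / total
    return strength, contrast
```

With non-negative intensities the contrast always lies between -1 and 1, regardless of the overall signal strength.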
In the best scenario, the centres of the three clusters are located at -1.0, 0.0, and 1.0 (see Figure 2.2). However, this scenario rarely happens for real data where the centres