1.6 Quality control and quality assurance
1.6.1 Identify samples with discordant sex information
1.6.2 Identify samples that have high missing and heterozygosity rate
1.6.3 Identify duplicated or related samples
1.6.4 Identify samples that have different ancestries
2 Genotype callers
2.4 Comparing three callers
3 Maximum likelihood method for detecting bad samples
3.1 Create potential bad sample list
3.2 Estimate the fitness of data
3.3 Remove bad samples
The process of creating and genotyping of Illumina Infinium II [Inc06]
Mixture of two Gaussian distributions
X, Y intensities and strength and contrast
The workflow of the method
VCF file format example
SNP 182465126 before and after removing bad samples
SNP 182488991 before and after removing bad samples
SNP 16055460 before and after removing bad samples

5 highest missing rate samples and their statistics in experiment 1
5 SNPs that have high positive changes after bad samples are removed
Number of bad samples with different thresholds in experiment 1
5 highest missing rate samples in experiment 2
5 SNPs that have high positive changes after bad samples are removed
Number of bad samples with different thresholds in experiment 2
Overview
A genome-wide association study (GWAS) is a project that uses the human genome to detect single nucleotide polymorphisms (SNPs) and their relationship with some disease traits. With the advancement of technology in recent years, some DNA microarrays have the ability to capture millions of SNPs from thousands of individuals (or samples). In order to generate a microarray, we have to run through many chemical and biological processes. Most of these processes are done automatically by machines produced by large DNA microarray companies.
Creating the microarray is only the first part; the second part is analyzing the data contained in the microarray to get the genotype information of each SNP from each individual. This part is also called the SNP genotyping process and, nowadays, statistical approaches are the most common methods for this process thanks to their low cost and short running time. However, these methods are not perfect: they may generate faulty genotype data for some individuals or SNPs. The faulty genotype data could be the result of errors in the microarray creation process, inaccuracy in transferring data from the microarray to the genotyping methods, or even the methods themselves.
Faulty genotype data is useless for genotype analysis. Therefore, several criteria have been proposed to remove bad samples and bad SNPs. For instance, under one such criterion, all samples whose proportion of undefined genotypes (missing rate) is higher than 3% are marked as bad samples and removed. However, these criteria could lead to a massive reduction in the number of samples or SNPs. Moreover, there is no actual mathematical verification that proves the removals are right. Hence, after the removals, some visualized graphs, such as the scatter plot of each SNP, and some statistics are calculated to verify the removal. This job is mostly done manually by experts and it is time consuming. To conclude, the problem remaining in this step is finding a statistical approach to remove bad samples and bad SNPs. A good solution for this problem is one that has reliable results and also requires as little intervention from experts as possible.
In this thesis, we propose a maximum likelihood method to detect bad samples. Our observation is that mixture model-based methods such as Illuminus have a very high call rate, but they are not always consistent because of the existence of noisy samples. Each noisy sample could affect the correlation matrix and the location parameter of a distribution by shifting the cluster away from the ideal position. This problem might result in faulty SNP genotype calls from Illuminus. Based on this observation, we introduce a new fitness function to deal with this problem. Our new fitness function follows the idea of the ML-based method (maximum likelihood based method) to maximize the fitness of a mixture of Student distributions. If the appearance of any sample in the data reduces the fitness, this sample is marked as a bad sample and it will be removed. Moreover, to take advantage of the quality control criteria, we also use the missing rate to create a list of samples that have a high potential of being bad samples. By checking only the samples in this list, the processing time for detecting bad samples is massively reduced.
The rest of the thesis is organized as follows. Firstly, some biological knowledge about DNA, the human genome, SNPs, genotypes, and SNP genotyping is introduced in chapter 1. Chapter 2 gives a brief introduction to the three most popular algorithms that work with Illumina BeadChips (Illuminus, GenCall, and GenoSNP) and a short comparison of their performance. After that, chapter 3 presents our proposed method to detect bad samples from the genotype results of Illuminus. Chapter 4 shows how our method works with real data; in this chapter, we show the results of our method when it was applied to two different databases. Finally, we draw some conclusions in the last part of the thesis.
The instructions of a cell come from DNA (deoxyribonucleic acid). DNA is like a blueprint for our cells: it contains a set of plans for building them. Figure 1.1 shows the structure of DNA. Scientists call this structure the double helix; it is built from two sugar-phosphate backbones, nucleotides (bases), and hydrogen bonds between pairs of nucleotides. There are 4 types of nucleotide: A stands for Adenine, C stands for Cytosine, G for Guanine, and T for Thymine. A can only form hydrogen bonds with T, and C can only connect to G, and vice versa. For this reason, when studying DNA, scientists only have to examine one half of the DNA. In general, AT and CG are called base pairs.
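The pairing rule above is why one strand is enough: the other strand can be derived from it. A minimal sketch in Python follows; the function name `complement_strand` is an illustrative choice, and the sketch ignores strand direction (it does not reverse the sequence).

```python
# Base pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the bases paired with each base of `strand`, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)


if __name__ == "__main__":
    print(complement_strand("ATCG"))
```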
A cell does not contain only one DNA molecule. In fact, each cell in our body contains a lot of DNA. However, in our cells, DNA is packaged into single units called chromosomes. Each organism has its own number of chromosomes. For instance, a dog has 78 chromosomes while a mosquito only has 6 chromosomes. Chromosomes always come in pairs, one from the father and the other from the mother. That is the reason why children look like both their mother and father. The human genome consists of 23 pairs of chromosomes; one of them determines gender and the others are autosomal chromosome pairs.

Figure 1.1: DNA structure (S = deoxyribose sugar, P = phosphate group)
Genes are parts of DNA; they encode the information to build all proteins in our body. Those proteins are very important because they keep our body functioning. The human body is said to contain approximately 25000 genes. Genes normally contain thousands of nucleotides. We could easily understand the relationship between genes and DNA as follows: DNA contains millions of characters (A, C, G, T); each group of three characters makes a word (three nucleotides encode one amino acid in the process of building a protein; they are also called a DNA triplet), many words make a sentence, and each sentence is a gene. See Figure 1.2 for more information about the relationship between the human genome, chromosomes, DNA, and genes.

Figure 1.2: Human genome, chromosome and genes

Human genome studies show that 99.9% of our genomes are identical to others [CM01]; however, the appearance of each person is unique. For example, the eye color of a person could be blue, black, or brown. The uniqueness of appearance is all thanks to the polymorphisms between our genome sequences, also called genetic polymorphisms. The polymorphisms may be the results of mutations such as insertions, deletions, and substitutions.

1.2 Some common types of mutation
Table 1.1: An example of DNA substitution
2 when we align them together (All of these bases have been highlighted in red)
Insertion and deletion: On the one hand, a DNA deletion occurs when one or more nucleotides are removed from the DNA sequence. On the other hand, when some nucleotides are inserted into a DNA sequence, we have a DNA insertion. When studying human DNA, these two types are often indistinguishable. Therefore, they are grouped together and called indel mutations. For instance, the example in Table 1.2 could be described in two different ways: either three bases have been inserted into sequence 1, or three bases have been deleted from the other sequence.
Single nucleotide polymorphism (SNP) is one of the most common genetic polymorphisms between genomes of members of a species; it occurs at only one nucleotide.
Alleles are the alternate forms of a gene present in an individual. Normally, alleles have two forms, one from the father and the other from the mother. The set of alleles of an individual is called the genotype. A genotype at a SNP site, called a SNP genotype, is a pair of alleles, one from each chromosome copy in a diploid organism. A SNP genotype is classified into three types: AA, AB and BB, where A and B encode the two alleles. The AA and BB genotypes are called homozygous genotypes while the AB genotype is called the heterozygous genotype.
Although most SNPs do not affect the physiology or appearance of individuals, some of them might result in genetic diseases. Therefore, the study of SNPs is currently one of the hottest research trends. For instance, if a family has several members who are affected by a specific disease, then the same SNP that is related to this disease may be passed from one family member to another. Moreover, SNP genotyping could also be used to identify the SNPs that are related to specific diseases, such as heart disease, by using a large number of samples from unrelated patients. These studies could lead to a better understanding of diseases and become one of the important steps to find the best treatment method for patients.

1.4 Microarray technology and Illumina BeadChips

Current DNA microarray technologies such as Affymetrix GeneChip, Illumina Infinium BeadChips, Perlegen and Invader are solutions to the need for running genetic tests such as SNP detection in parallel [KF01, Syv05]. A DNA microarray could capture sampling data of thousands of samples (individuals) over millions of SNP sites. Illumina BeadChips is one of the most popular microarray technologies and our method only works with its data.
The Illumina Infinium platform is said to be a very high-throughput SNP genotyping system that can help detect over 2 million SNPs in each DNA sample. Figure 1.3 illustrates the workflow of Illumina Infinium II (one of the Illumina BeadChips). Firstly, the DNA sequences are copied by the Polymerase Chain Reaction (PCR) process. This step could increase the amount of DNA up to 1000 times. Then they are incubated before being fragmented the next day. After that, the fragmented DNA is captured on a BeadArray by hybridization and extended with detectable labels. The BeadChip image is the result of using the Illumina BeadArray Reader or Illumina HiScan to scan the BeadChip.

Figure 1.3: The process of creating and genotyping of Illumina Infinium II [Inc06]

Finally, the intensity values are extracted from the BeadChip image. For instance, if we have m individuals (samples) and the BeadChip captured n SNP sites,
1.6 Quality control and quality assurance
by the rate of successfully clustered data points from the caller (call rate). The data points that are not assigned to any cluster are called outliers.
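A naive caller method is described as follows:
• If $x_{ij} \gg y_{ij}$ ($x_{ij} > y_{ij} + \alpha$), SNP genotype $g_{ij}$ is AA,
• If $x_{ij} \ll y_{ij}$ ($x_{ij} < y_{ij} - \alpha$), SNP genotype $g_{ij}$ is BB,
• If $x_{ij} \approx y_{ij}$ ($|x_{ij} - y_{ij}| \le \alpha$), SNP genotype $g_{ij}$ is AB,
where $\alpha$ is a given threshold.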
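The three rules above translate directly into a small function. The following Python sketch is only an illustration of the naive caller; the function name `naive_call` and the example intensities are assumptions, not part of any real caller.

```python
def naive_call(x: float, y: float, alpha: float) -> str:
    """Call a SNP genotype from the two allele intensities (x, y).

    alpha is the given threshold from the text; its value is up to the user.
    """
    if x > y + alpha:
        return "AA"  # x clearly larger than y
    if x < y - alpha:
        return "BB"  # y clearly larger than x
    return "AB"      # |x - y| <= alpha: intensities are comparable


if __name__ == "__main__":
    for x, y in [(1.0, 0.1), (0.1, 1.0), (0.5, 0.6)]:
        print(x, y, naive_call(x, y, alpha=0.3))
```

A fixed threshold treats every SNP site and every sample the same way, which is one reason such a rule is sensitive to noisy intensities.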
This naive method is not good enough because the output from Illumina BeadChips might contain noise due to bad samples or flaws in the scanning process. GenCall [Inc05], Illuminus [TIS+07], and GenoSNP [GYC+08a] are the three most popular callers for Illumina BeadChips. GenCall and Illuminus determine genotypes SNP site by SNP site across samples. At SNP site j, these methods analyze the intensities at this site, $\{(x_{1j}, y_{1j}), \dots, (x_{mj}, y_{mj})\}$, to group these m samples into three clusters AA, AB, and BB. GenCall applies a neural network to predefine the centroids of the three clusters and then generates GenTrain scores to do the remaining tasks. Illuminus uses the EM (Expectation Maximization) framework to fit a bivariate mixture model with three t-distributed components (as three genotype clusters). It also adds a Gaussian component for outliers (corresponding to the outlier cluster). GenoSNP shares with Illuminus the idea of using a mixture model of three Student distributions, but it uses a different scheme that determines genotypes sample by sample. More information about these algorithms is given in Chapter 2.
Although the three callers above provide reasonable clusters, their results are not always consistent. The conflicts among these results might be due to bad samples (samples poorly prepared before being added to the chip) or bad SNPs (SNPs not well processed or scanned). To overcome this problem, quality control and quality assurance (QC and QA) processes are conducted by experts to detect both bad SNPs and bad samples from the results of callers. This process is manual and time consuming.
In this thesis, we only focus on controlling the quality of samples. All the samples that fail the quality test will be marked as bad samples. These samples will be removed from the genotype result of callers and they will be examined later to understand the reason why their qualities could not reach the thresholds.
The two most popular criteria for detecting bad samples are the missing rate and the heterozygosity rate. These rates are defined as:
• Missing rate (m_rate): the missing rate or failure rate of a sample is the proportion of missing genotypes.
• Heterozygosity rate (h_rate): the proportion of heterozygous genotypes for a given sample.
According to these definitions, the missing rate and heterozygosity rate of a sample can be calculated as:
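$$m\_rate = \frac{\text{number of outliers}}{\text{total number of SNPs}}$$
$$h\_rate = \frac{\text{number of heterozygous calls}}{\text{number of non-missing calls}}$$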
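As a minimal sketch of how these two rates could be computed for one sample, the Python function below assumes that each genotype call is stored as "AA", "AB", "BB", or `None` for an outlier (missing call). The function name `sample_rates` and the choice of denominators (all SNPs for the missing rate, non-missing calls for the heterozygosity rate) are assumptions made for this illustration.

```python
from collections import Counter


def sample_rates(calls):
    """Return (m_rate, h_rate) for one sample.

    calls: list of genotype calls, each "AA", "AB", "BB", or None (missing).
    Assumed denominators: all SNPs for m_rate, non-missing calls for h_rate.
    """
    counts = Counter(calls)
    total = len(calls)
    missing = counts[None]
    called = total - missing
    m_rate = missing / total if total else 0.0
    h_rate = counts["AB"] / called if called else 0.0
    return m_rate, h_rate


if __name__ == "__main__":
    print(sample_rates(["AA", "AB", None, "BB", "AB"]))
```

A sample would then be flagged as bad when its `m_rate` (or `h_rate`) crosses a chosen threshold, such as the 3% missing rate mentioned in the overview.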
According to Anderson et al., a general quality control process for samples consists of at least 4 small steps [APC+10]:
Step 1: Identify samples with discordant sex information
Step 2: Identify samples that have high missing and heterozygosity rate
Step 3: Identify duplicated or related samples
Step 4: Identify samples that have different ancestry
1.6.1 Identify samples with discordant sex information
This step should be the very first step of quality control if the genotype data is X-chromosome data. Males only have one X chromosome; therefore, all genotype caller algorithms are expected to call male data as homozygous. Normally, all the data from males that have been called as heterozygous are marked as outliers. Samples from males that have been incorrectly addressed as females should have a higher homozygosity rate than real females, and females that have been addressed as males by mistake should have a lower homozygosity rate than real males.
For these reasons, when controlling the quality of genotype data, we could use the X-chromosome data to identify whether the sex information has been recorded correctly. All samples that have been marked as having discordant gender information should be removed from the genotype data.
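The check described above can be sketched as follows, assuming X-chromosome calls are stored as "AA", "AB", "BB", or `None`. The function names and the cutoff `male_cutoff=0.98` are hypothetical choices for illustration only; a real study would tune the cutoff on its own data.

```python
def infer_sex(x_calls, male_cutoff=0.98):
    """Guess sex from X-chromosome calls: near-complete homozygosity suggests male.

    male_cutoff is a hypothetical threshold, not a value taken from the text.
    Returns "M", "F", or None when no call is available.
    """
    called = [g for g in x_calls if g is not None]
    if not called:
        return None
    homozygosity = sum(g in ("AA", "BB") for g in called) / len(called)
    return "M" if homozygosity >= male_cutoff else "F"


def is_discordant(reported_sex, x_calls, male_cutoff=0.98):
    """True when the recorded sex disagrees with the sex inferred from the data."""
    inferred = infer_sex(x_calls, male_cutoff)
    return inferred is not None and inferred != reported_sex


if __name__ == "__main__":
    print(is_discordant("F", ["AA", "BB", "AA", None]))
```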
1.6.2 Identify samples that have high missing and heterozygosity rate
These thresholds work quite well with genotype data that do not have too much noise. For data in which noise appears too frequently, using the missing rate and heterozygosity rate could remove a large number of samples. Therefore, the number of samples remaining for further analysis is reduced massively.
1.6.3 Identify duplicated or related samples
For effective analysis, all collected samples should be unrelated to one another. In other words, the degree of relatedness between two samples should not be less than second degree (a first degree relative relationship is that between children and parents or between two siblings; a second degree relative relationship is that between grandchildren and grandparents, nephews and aunts, ...). The reason is that close relatives will not produce fair allele frequency data. As a result, this could create a bias for further studies.
The easiest way to identify two closely related samples is to use the missing rate. A sample tends to have a similar missing rate to another sample from his/her family. This method is too risky because two unrelated samples could have a high probability of having the same missing rate.
To deal with that problem, for every pair of samples, the identity by state (IBS) metric is calculated. This metric is calculated using the genotyped SNPs common to the two individuals (sex chromosomes are not included in this calculation). Pairs of samples that are closely related are expected to have high IBS values. If a sample is a duplicate of another, the IBS between them is 1. Using genome-wide IBS data, we could compute another metric called identity by descent (IBD). For duplicates or twins, the value of IBD is expected to be 1. When IBD = 0.5, the two corresponding samples are marked as first degree relatives. Two samples are marked as second degree relatives and third degree relatives when IBD = 0.25 and IBD = 0.125 respectively.
The thresholds above are just theoretical values. With real data, these values must be lower due to genotyping failures or other reasons. For instance, we could identify two duplicated samples if the IBD between them is higher than 0.98. Generally, for every pair of samples that has IBD > 0.185, one of the two samples must be removed.
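To make the pairwise metric concrete, the Python sketch below computes a simple IBS similarity and applies the IBD cutoff. It assumes genotypes are coded as the number of B alleles (0 for AA, 1 for AB, 2 for BB, `None` for missing) and defines IBS as one minus the mean allele difference; this coding, the function names, and the rule for which member of a pair to drop are all assumptions of the illustration. Estimating IBD from IBS is not shown; `pairwise_ibd` is taken as given.

```python
def ibs_similarity(g1, g2):
    """IBS similarity over SNPs genotyped in both samples (1.0 = identical calls)."""
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        return None
    return 1.0 - sum(abs(a - b) for a, b in shared) / (2.0 * len(shared))


def samples_to_remove(pairwise_ibd, threshold=0.185):
    """Pick one sample from every pair whose IBD exceeds the threshold.

    pairwise_ibd: dict mapping (sample1, sample2) -> estimated IBD value.
    The second member of each related pair is dropped (an arbitrary choice).
    """
    removed = set()
    for (s1, s2), ibd in sorted(pairwise_ibd.items()):
        if ibd > threshold and s1 not in removed and s2 not in removed:
            removed.add(s2)
    return removed


if __name__ == "__main__":
    print(ibs_similarity([0, 1, None, 2], [0, 2, 1, 2]))
    print(samples_to_remove({("a", "b"): 1.0, ("a", "c"): 0.1, ("b", "c"): 0.1}))
```

In practice one might prefer to drop the member of the pair with the higher missing rate rather than an arbitrary one.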
1.6.4 Identify samples that have different ancestries
When studying SNP data, two samples that come from two different ancestries tend to differ too much in their results. This could reduce the effectiveness of further analysis, particularly in population-based case-control studies. Therefore, the final step of quality control is to identify samples that come from different ancestries and remove them from the genotype data.
Principal Component Analysis (PCA) [PPP+06] is the most common method for detecting samples that have divergent ancestries. This method requires the pairwise IBD matrix that is calculated in the third step of quality control. The components are built so that the first one represents as much variation in the data as possible, followed by the second one, the third, and so on. Using PCA and the HapMap genotype data from Europe, Asia, and Africa, we could build three clusters. Then each sample is assigned to its appropriate cluster. The samples that do not fall in the cluster of the region of interest will be removed.
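The idea that the first component captures as much variation as possible can be illustrated with a small power-iteration sketch in plain Python. It works on a generic numeric matrix with one row per sample; the function name, the start vector, and the toy data are assumptions of this illustration, and it computes only the first component rather than a full PCA.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows: list of equal-length numeric lists, one per sample.
    Uses power iteration on the covariance matrix of the centered data.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    # Uneven start vector to reduce the chance of starting orthogonal
    # to the leading direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


if __name__ == "__main__":
    data = [[0, 0], [1, 1], [2, 2], [10, 10], [11, 11], [12, 12]]
    direction, scores = first_principal_component(data)
    print(direction, scores)
```

In the toy data the two groups of rows end up on opposite sides of zero along the first component, which is the kind of separation used to spot samples from a different ancestry.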
Apart from the second step, the three remaining steps are only used for some special data and requirements. For this reason, the missing rate and heterozygosity rate are widely used for all Illumina BeadChips genotype data. However, the thresholds for these two rates are chosen manually, and when using these rates the number of samples may be reduced massively. For instance, in the first experiment, shown later in the fourth chapter, 1097 samples are removed when 0.2 is used as the missing rate threshold, while the total number of samples is only 3656. To deal with those problems, we should improve the second step to lower the number of removed samples.
Clustering using a mixture model [MB83] is an unsupervised clustering algorithm. Its idea is to use multiple multivariate distributions to fit the given data, where each distribution represents a cluster. This method is itself an Expectation Maximization (EM) algorithm [MK97]: all the parameters that could change the shape and the location of a distribution are updated iteratively during the re-estimation steps. The mixture model algorithm can be divided into 6 small steps as follows:
• Step 1: initialize m distributions. For example, with a mixture model of m multivariate Gaussian distributions, this step creates m Gaussian distributions, randomly or predefined, with different mean values and covariance matrices. The mean values and the covariance matrices are called the parameters of this model.
• Step 4: assign $point_j$ to cluster k where
$$k = \arg\max_i \left( priorProb_i \times prob_{i,j} \right)$$
• Step 5: re-calculate the parameters of all m distributions after the change. This step uses the EM algorithm to find the formulas that maximize the likelihood of the current model.
• Step 6: go back to step 2 until there is no change in any of the m distributions.
Figure 2.1: Mixture of two Gaussian distributions
Figure 2.1 is an example of a mixture of two Gaussian distributions with different mean values and covariance matrices. As can be seen, some parts of the two distributions overlap each other. The clustering algorithm does not care too much about the
2.1 Illuminus
Algorithm 1 Mixture model algorithm
Require: n points p_1, p_2, ..., p_n; number of clusters m
Ensure: m clusters with their points
  ...
  end for
  // Calculate the prior probabilities of m distributions
  Calculate priorProb_i
  // Calculate the probabilities of p_j by m distributions
  // Then assign them to the highest possible cluster
  for all p_j do
    Calculate prob_{i,j} where i = 1..m
    k ← argmax_i(priorProb_i × prob_{i,j})
  end for
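The loop described by the steps and by Algorithm 1 can be sketched in Python as follows. To keep the example short it uses one-dimensional Gaussian components instead of multivariate ones, and it follows the hard assignment described in the text (each point goes to the cluster with the highest prior times density) rather than the soft responsibilities of a standard EM implementation. The function names, the initial parameters, and the variance floor `min_var` are assumptions of this sketch.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(points, means, variances, max_iter=100, min_var=1e-6):
    """Cluster 1-D points with m Gaussian components (hard assignments).

    means, variances: initial parameters of the m components (step 1).
    Returns (labels, means, variances, priors).
    """
    m = len(means)
    means, variances = list(means), list(variances)
    priors = [1.0 / m] * m
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster maximising prior * probability.
        new_labels = [
            max(range(m),
                key=lambda i: priors[i] * gaussian_pdf(p, means[i], variances[i]))
            for p in points
        ]
        if new_labels == labels:  # no change: stop iterating
            break
        labels = new_labels
        # Re-estimate the parameters of every component from its members.
        for i in range(m):
            members = [p for p, k in zip(points, labels) if k == i]
            priors[i] = len(members) / len(points)
            if members:
                means[i] = sum(members) / len(members)
                spread = sum((p - means[i]) ** 2 for p in members) / len(members)
                variances[i] = max(spread, min_var)  # avoid a zero variance
    return labels, means, variances, priors


if __name__ == "__main__":
    pts = [-1.1, -0.9, -1.0, 0.9, 1.0, 1.1]
    print(fit_mixture(pts, means=[-0.5, 0.5], variances=[1.0, 1.0]))
```

The result depends on the initial means and variances, which is why the text allows them to be either random or predefined.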
Back to Illuminus: in the first step, Illuminus transforms the x, y intensities of each sample to strength and contrast by the following formulas
In the best scenario, the centers of the three clusters are located at -1.0, 0.0, and 1.0 (see Figure 2.2). However, this scenario rarely happens for real data, where the centres