MICROARRAY DATA ANALYSIS TOOL (MAT)
A Thesis Presented to The Graduate Faculty of The University of Akron
ABSTRACT
Microarray technology has been widely used by biologists to probe the presence of genes in a sample of DNA or RNA. Using this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, allowing biologists to check the expression levels of thousands of genes at once. This thesis develops a software system that includes a database repository to store different microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently allows datasets in the GenepixPro format to be deposited, although it can be expanded to include datasets of other formats. The user interface of the repository allows users to conveniently upload data files and perform preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset. The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in this thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Zhong-Hui Duan, for giving me the opportunity to work on this project for my Master's thesis. I was motivated to choose this topic after I took the Introduction to Bioinformatics course. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project.
I am thankful to my committee members, Dr. Yingcai Xiao and Dr. Xuan-Hien Dang, for their guidance, invaluable suggestions, and time.
I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me pursue my Master's degree and change my career path. I could not have achieved this without their help.
I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report.
I would finally like to express my gratitude towards my parents and all my family members, who were always there for me, cheered me on in all situations, and took great interest in my venture.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER I INTRODUCTION
  1.1 Introduction to Bioinformatics
  1.2 Introduction to Microarray Technology
    1.2.1 Genepix Experimental Procedure
  1.3 Applications of Microarrays
  1.4 Need for Automated Analysis
  1.5 Knowledge Discovery in Data
    1.5.1 KDD Steps
  1.6 Classification
    1.6.1 General Approach
    1.6.2 Decision Trees
    1.6.3 k-Nearest Neighbor Classifiers
  1.7 Outline of the Current Study
II LITERATURE REVIEW
  2.1 Previous Work
  2.2 Existing Tools for Normalizing GPR Datasets
  2.3 Stanford Microarray Database (SMD)
  2.4 Microarray Tools
  2.5 Available Source for Microarray Data
III MATERIALS AND METHODS
  3.1 Database Design
    3.1.1 Schema Design
    3.1.2 Table Details
    3.1.3 Attributes
  3.2 Description of Genepix Data Format
    3.2.1 Features and Blocks
    3.2.2 Sample Dataset
    3.2.3 Transferring Genepix Dataset to Database
  3.3 Data Selection
    3.3.1 Creation of Training and Testing Dataset
  3.4 Preprocessing
    3.4.1 Preprocessing in MAT
  3.5 Normalization
  3.6 Feature Selection
    3.6.1 Student T-Test
    3.6.2 Implementation of T-Test in MAT
  3.7 Classification
    3.7.1 Classical kNN Method
    3.7.2 Weighted kNN Method
    3.7.3 Mean kNN Method
IV RESULTS AND DISCUSSIONS
  4.1 A Case Study
  4.2 Results
  4.3 Discussion
V CONCLUSIONS AND FUTURE WORK
  5.1 Conclusion
  5.2 Future Work
REFERENCES
APPENDICES
  APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2
  APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL
  APPENDIX C CLASSIFICATION ALGORITHMS
LIST OF TABLES
1.1 Confusion matrix for a 2-class problem
1.2 Software used
2.1 List of microarray tools
2.2 Available source for microarray data
3.1 Tables used in MAT
3.2 Attributes and their description
3.3 List of default choices for feature selection
4.1 Training and testing samples – Experiment 1
4.2 Accuracy of three classification methods for different N features
4.3 Training and testing samples – Experiment 2
4.4 Accuracy of three classification methods for different N features
LIST OF FIGURES
1.1 Schematic view of a typical microarray experiment
1.2 Genepix experimental procedure
1.3 Overview of KDD process
1.4 Mapping an input attribute set x into its class label y
1.5 A decision tree for the mammal classification problem
1.6 Classifying an unlabeled vertebrate
1.7 Schematic representation of k-NN classifier
1.8 System diagram
1.9 Application flow diagram
2.1 Sketch of the ProGene algorithm
3.1 Database schema
3.2 Genepix_version table design
3.3 Genepix_header table design
3.4 Genepix_sequence table design
3.5 Hypothetical arrays of blocks
3.6 Sample dataset
3.7 Creation of repository
3.8 Selection of datasets
3.9 Temporary table names for training and testing datasets
3.10 Flowchart – Creation of dataset
3.11 Replication of gene
3.12 Sample training dataset with median intensity values
3.13 Preprocessing in MAT
3.14 T-Test formulas
3.15 Calculated p-values for the genes
3.16 Pseudo code of kNN classical method
3.17 Pseudo code of kNN mean method
4.1 Training samples selected for the experiment
4.2 Testing samples selected for the experiment
4.3 Attribute selection and constraint specification for normalization
4.4 Training datasets
4.5 Testing datasets
4.6 Feature selection and normalization
CHAPTER I
INTRODUCTION
1.1 Introduction to Bioinformatics
The central dogma of molecular biology is that DNA (deoxyribonucleic acid) acts as a template to replicate itself, DNA is transcribed into RNA, and RNA is translated into protein. DNA is the genetic material; for years it has represented the answer to a question posed by researchers and scientists: "What is the basis of inheritance?" The information stored in DNA allows the organization of inanimate molecules into functioning, living cells and organisms that are able to regulate their internal chemical composition, growth, and reproduction [1]. This is what allows us to inherit our parents' features, for example their curly hair, their nose, and other traits. The units that govern those characteristics at the genetic level are called genes. The term bioinformatics refers to the use of computers to retrieve, process, analyze, and simulate biological information. Bioinformatics has enabled a large body of research and has proven itself in the diagnosis and classification of disease and in the discovery of many factors that lead to disease. Although bioinformatics began with sequence comparison, it now encompasses a wide range of activities in modern scientific research. It requires mathematical, biological, physical, and chemical knowledge, and its implementation may furthermore require knowledge of computer science and related fields.
1.2 Introduction to Microarray Technology
A DNA microarray is an orderly arrangement of tens to hundreds of thousands of DNA fragments (probes) of known sequence. It provides a platform for probe hybridization to radioactively or fluorescently labeled cDNAs (targets). The intensity of the radioactive or fluorescent signals generated by the hybridization reveals the level of the cDNAs in the biological samples under study. Figure 1.1 shows the major processes in a typical microarray experiment. Microarray technology has been widely used to investigate gene expression levels on a genome-wide scale [1, 2, 5, 10]. It can be used to identify the genetic changes associated with diseases, drug treatments, or stages in cellular processes such as apoptosis or the cycle of cell growth and division [10]. The scientific tasks involved in analyzing microarray gene expression data include the identification of co-expressed genes, discovery of sample or gene groups with similar expression patterns, study of gene activity patterns under various stress conditions, and identification of genes whose expression patterns are highly discriminative for differentiating the biological samples of interest.
Microarray platforms include Affymetrix GeneChips, which use presynthesized oligonucleotides as probes, and cDNA microarrays, which use full-length cDNAs as probes. The array experiment uses slides or blotting membranes. The spot sizes are typically less than 200 microns in diameter, and an array usually contains thousands of spots. The spotted samples are known as probes; the spots can be DNA, cDNA, or oligonucleotides [2]. They are used to determine complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An orderly arrangement of probes is important because the location of each spot on the array is used for the identification of a gene. A diagram of the microarray experiment is shown in Figure 1.1.
Figure 1.1 Schematic view of a typical microarray experiment
In the current study, we use microarray datasets that were generated through cDNA microarray experiments. The arrays were scanned using Genepix Pro. The following section explains the experimental procedure used in the creation of the datasets.
1.2.1 Genepix Experimental Procedure
Genepix Pro is an automatic microarray slide scanner that loads slides, scans them, performs analysis, and saves the results automatically. It can accommodate up to 36 slides. The autoloader accommodates microarrays on microscope slides labeled with up to four fluorescent dyes. These microarrays can contain from a few hundred to a few thousand spots, representing an entire genome.
Figure 1.2 Genepix experimental procedure [Copyright – Appendix A]
When the slide carrier is inserted into the scanner, sensors detect its position. The software helps the user select the slides to be scanned. A graphical representation of the slides is shown on the screen for user selection, which makes it easier for the user to identify each slide. For each slide, or for a group of slides, we can configure the settings for the experiment. We can also choose the automatic analysis option in the software. If an email address is specified in the settings, the results are sent to that address when the experiment is done.
The robotic arm takes the first slide from the slide carrier, scans the bar code on the slide, and positions the slide for scanning. Genepix can be configured with four lasers. A laser power wheel is used to adjust the laser strength for especially bright samples. The laser excitation beam is delivered to the surface of the microarray slide and scans rapidly across the axis of the slide. As the robotic arm moves the slide slowly, the fluorescent signals emitted from the sample are collected by a photomultiplier tube. Sensors detect any non-uniformity in the slide surface, and the robotic arm is used to adjust the focus of the scan. Each channel is scanned sequentially, and the developing images are displayed on the monitor. The multichannel TIFF images are saved automatically according to the file naming conventions specified by the user.
Once the scan has been completed, the robotic arm returns the slide to the carrier and repeats the process for the other slides selected from the tray. Genepix automatically finds the spots, calculates up to 108 measures, and saves the results as GPR files. If the experiment is conducted with a single channel, the number of measures will be about 50; otherwise, the number of measures will be between 50 and 108.
1.3 Applications of Microarrays
Now that we know the basic workings of microarrays, we can explore the different applications of microarray technology.
Gene discovery: Microarray technology helps in the identification of new genes. It helps us learn about their functioning and their expression levels under different conditions.
Disease diagnosis: Microarray technology helps us learn more about different diseases such as heart disease, mental illness, and infectious disease, and is especially useful in the study of cancer. Different types of cancer have traditionally been classified on the basis of the organs in which the tumors develop. With the help of microarray technology, it will be possible for researchers to further classify the types of cancer on the basis of the patterns of gene activity in the tumor cells. This will help the pharmaceutical community to develop more effective drugs, as the treatment strategies will be targeted directly to the specific type of cancer.
Drug discovery: Pharmacogenomics is the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients [2]. Comparative analysis of the genes from diseased and normal cells helps in identifying the biochemical constitution of the proteins synthesized by the diseased genes. Researchers can use this information to synthesize drugs that combat these proteins and reduce their effect.
Toxicological research: Microarray technology provides a robust platform for research on the impact of toxins on cells and their passing on to the progeny [2]. Toxicogenomics establishes correlations between responses to toxicants and the changes in the genetic profiles of the cells exposed to such toxicants [2].
1.4 Need for Automated Analysis
The intrinsic problems of a typical data set produced by microarrays are the small sample size and the high dimensionality of the data. A dataset created by Genepix Pro has various measures for thousands of genes, and there is no practical way of analyzing the samples manually. In this study, we propose a microarray analysis tool (MAT) that is able to appropriately incorporate new methods for classification and for finding new classes. The tool follows the knowledge discovery in data (KDD) steps, which are explained in detail in the following section.
1.5 Knowledge Discovery in Data
The term knowledge discovery in data (KDD) refers to the process of finding knowledge in data through the application of particular data mining methods. It involves the evaluation and possible interpretation of the discovered patterns, known as knowledge. The unifying goal of the KDD process is to extract useful information from large databases. An overview of the KDD process is shown in Figure 1.3.
Figure 1.3 Overview of KDD process
1.5.1 KDD Steps
Data selection draws on knowledge of the application domain and selects the datasets that are relevant to the problem to be solved. The preprocessing step removes unwanted data from the database and finds strategies to handle missing fields in the dataset. Transformation is the process of transforming data from one form to another; in this step, we find useful features to represent the data, depending on the goal of the task, and normalize the dataset. In the data mining step, we decide on the algorithms suitable for the study; the current study is mainly about classification, and hence we choose the classification algorithms to be implemented in this step. Interpretation and evaluation is the process of assessing the created model: the model is tested with the test samples, and the accuracy of the prediction is calculated.
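To relate these steps to MAT's workflow, the stages can be laid out as a simple Perl driver. The sketch below is illustrative only; the subroutine names and stub bodies are hypothetical placeholders, not the actual MAT code, and they merely mirror the order of the KDD steps described above.

#!/usr/bin/perl
use strict;
use warnings;

# A minimal sketch of how a MAT run lines up with the KDD stages above.
# Subroutine names and stub bodies are hypothetical placeholders.
my $samples    = select_data('genepix_repository');   # data selection
my $clean      = preprocess($samples);                 # remove unwanted data, handle missing fields
my $normalized = transform($clean);                    # feature transformation / normalization
my $model      = mine($normalized);                    # data mining step (classification)
evaluate($model);                                      # interpretation and evaluation

# Stub stages so the sketch runs end to end; each would be replaced by the
# real selection, preprocessing, normalization, classification and
# evaluation code.
sub select_data { print "selecting samples from $_[0]\n"; return [] }
sub preprocess  { print "preprocessing\n";                return $_[0] }
sub transform   { print "normalizing\n";                  return $_[0] }
sub mine        { print "building the kNN classifier\n";  return $_[0] }
sub evaluate    { print "computing accuracy on the test set\n" }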
1.6 Classification
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y [4].
The input data for the classification model is a collection of records. Each record is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute designated as the class label. A classification model can also serve as an explanatory tool to distinguish between objects of different classes.
1.6.1 General Approach
Several approaches can be taken to build classification models, including decision trees, neural networks, kNN classifiers, and others. Each approach has a learning algorithm that creates a model based on the given input attribute set. The model generated by the learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
The training set consists of records whose class labels are known. The classification model is built using the training set, and the model is then applied to the test data, whose class labels are unknown. The evaluation of the classification model is done using a confusion matrix.
Table 1.1 Confusion matrix for a 2-class problem [4]

                      Predicted Class = 1    Predicted Class = 0
  Actual Class = 1           f11                    f10
  Actual Class = 0           f01                    f00
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01). Accuracy is calculated using Eq. 1.1 and the error rate is calculated using Eq. 1.2.
\[ \text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \]   (Eq. 1.1)

\[ \text{Error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}} \]   (Eq. 1.2)
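To make Eq. 1.1 and Eq. 1.2 concrete, a short Perl sketch (illustrative only, not part of MAT) can compute both measures directly from the four confusion matrix counts; the counts in the usage line are made-up placeholders.

use strict;
use warnings;

# Accuracy (Eq. 1.1) and error rate (Eq. 1.2) from the entries of a
# two-class confusion matrix: f11 and f00 are correct predictions,
# f10 and f01 are the misclassified records.
sub accuracy_and_error {
    my ($f11, $f10, $f01, $f00) = @_;
    my $total = $f11 + $f10 + $f01 + $f00;
    die "empty confusion matrix\n" if $total == 0;
    my $accuracy = ($f11 + $f00) / $total;
    return ($accuracy, 1 - $accuracy);    # error rate = 1 - accuracy
}

# Usage with illustrative counts only.
my ($acc, $err) = accuracy_and_error(40, 5, 3, 52);
printf "accuracy = %.3f, error rate = %.3f\n", $acc, $err;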
1.6.2 Decision Trees

A decision tree has three types of nodes:
• A root node, which has no incoming edges and zero or more outgoing edges
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges
Figure 1.5 A decision tree for the mammal classification problem [4]
In the decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 1.5 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds. Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node, whose class label is then assigned to the test record.
Figure 1.6 Classifying an unlabeled vertebrate [4]
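As an illustration only, the traversal described above can be expressed as a pair of nested tests that mirror the tree in Figure 1.5; the attribute names are taken from the figure, and the record shown is a hypothetical unlabeled vertebrate in the spirit of Figure 1.6.

use strict;
use warnings;

# Classify a vertebrate with the decision tree of Figure 1.5: the root node
# tests Body Temperature; warm-blooded records are then tested on Gives Birth.
sub classify_vertebrate {
    my (%record) = @_;
    return 'Non-mammal' if $record{body_temperature} ne 'warm-blooded';
    return $record{gives_birth} eq 'yes' ? 'Mammal' : 'Non-mammal';
}

# A hypothetical unlabeled test record.
print classify_vertebrate(
    body_temperature => 'warm-blooded',
    gives_birth      => 'no',
), "\n";    # prints "Non-mammal"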
1.6.3 k-Nearest Neighbor Classifiers
The k-nearest neighbor (kNN) method is a simple machine learning algorithm used for classification based on the training samples in the feature space. In this method, the target object is classified by the majority vote of its neighbors, and the object is assigned to the class to which most of the neighbors belong (Figure 1.7). For the purpose of identifying neighbors, objects are represented by position vectors in a multidimensional feature space. The k training samples that are most similar to the attributes of the test sample are found; these are considered the nearest neighbors and are used to determine the class label of the test sample. The distance between samples x and y can be calculated using the Euclidean distance (Eq. 1.3), the Manhattan distance (Eq. 1.4), or other distance measures.
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]   (Eq. 1.3)

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]   (Eq. 1.4)

where x_i is the expression level of gene i in sample x, y_i is the expression level of gene i in sample y, and n is the number of genes whose expression values are measured.
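The following Perl sketch illustrates Eq. 1.3, Eq. 1.4, and the majority vote on a toy dataset. It is a simplified illustration with made-up expression values, not the MAT implementation of the classical, weighted, or mean kNN methods described in Chapter III.

use strict;
use warnings;
use List::Util qw(sum);

# Euclidean distance (Eq. 1.3) between two expression profiles.
sub euclidean {
    my ($x, $y) = @_;
    return sqrt( sum( map { ($x->[$_] - $y->[$_])**2 } 0 .. $#$x ) );
}

# Manhattan distance (Eq. 1.4).
sub manhattan {
    my ($x, $y) = @_;
    return sum( map { abs($x->[$_] - $y->[$_]) } 0 .. $#$x );
}

# Classical kNN idea: label the test sample by majority vote among the k
# nearest training samples.  Each training sample is [ \@expression, $label ].
sub knn_classify {
    my ($train, $test, $k) = @_;
    my @nearest = ( sort { $a->[0] <=> $b->[0] }
                    map { [ euclidean($_->[0], $test), $_->[1] ] } @$train )[ 0 .. $k - 1 ];
    my %votes;
    $votes{ $_->[1] }++ for @nearest;
    my ($winner) = sort { $votes{$b} <=> $votes{$a} } keys %votes;
    return $winner;
}

# Illustrative data only: three genes per sample, two classes.
my @train = (
    [ [ 1.0, 2.0, 1.5 ], 'tumor'  ],
    [ [ 0.9, 2.2, 1.4 ], 'tumor'  ],
    [ [ 3.0, 0.5, 2.8 ], 'normal' ],
);
print knn_classify( \@train, [ 1.1, 2.1, 1.6 ], 3 ), "\n";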
Figure 1.7 Schematic representation of k-NN classifier
1.7 Outline of the Current Study
The objective of this study is to create a database repository to store different microarray datasets and to create a microarray analysis tool (MAT) that can be used for the analysis of gene expression. The tool has been designed to follow the KDD steps. The database repository currently accepts Genepix datasets, although it can be expanded to include other formats. The analysis methods implemented include three different kNN methods: classical kNN, weighted kNN, and mean kNN. The system diagram, application flow diagram, and software used are shown below.
Figure 1.8 System diagram
[Figure 1.8 shows three components: the user interface (C++) with screens for the data mining process; the database (SQL Server 2005) with dynamic scripts for the creation of training and testing datasets; and text files holding the training and testing datasets, the input file for the t-test, and the .cls file identifying the type of samples.]
Figure 1.9 Application flow diagram
Table 1.2 Software used
User interface                  C++
Database                        SQL Server 2005
Feature selection algorithms    Perl
CHAPTER II
LITERATURE REVIEW

2.1 Previous Work
In one previous study, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays was applied to human acute leukemia as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrated the feasibility of cancer classification based solely on gene expression monitoring and suggested a strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Improving classification of microarray data using prototype-based feature selection
This study of improving accuracy in the machine-learning task of classification from microarray data was done by Blaise Hanczar [11]. One of the known issues specifically related to microarray data is the large number of genes versus the small number of available samples. The most important task is to identify the genes that are most relevant for classification. Classical feature selection methods are based on the notion of a prototype gene; each prototype represents a set of similar genes according to a given clustering method. In this work, experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions was presented [11]. To improve the accuracy of a machine-learning based classifier algorithm, dimension reduction methods play a key role. A somewhat original dimension reduction method, which experimentally increases the classification accuracy of a support vector machine based classifier, was developed. Although the performance gain is comparable to that of classical reduction methods, combining them outperforms both methods.
The dimension reduction technique follows two steps. The first is to identify equivalence classes inside the gene space with respect to a given criterion, be it the gene expression, the known gene function, or any biologically relevant criterion [11]. The second step is to create gene prototypes that are good representatives of these classes [11]. The classification task is performed using one or more prototype genes that have been computed by an aggregation of the genes that best represent each class. A sketch of the algorithm is shown below.
Figure 2.1 Sketch of the ProGene algorithm [11]

1   CM <- Select method of clustering (default: k-means)
2   NBCLUST <- Select the desired number of clusters
3   For each iteration of the cross validation
3.1     Define the train and test dataset
3.2     Do NBCLUST clusters of genes on the train set using CM
3.3     For each cluster Cu
3.3.1       Build prototype Pu <- mean of this cluster
3.4     Model <- training of SVM using prototype
3.5     accuracy <- prediction on the test set
4   Compute the average accuracy

Improved Gene Selection for classification of microarrays

This study of deriving methods for improving techniques for selecting informative genes from microarray data was done by J. Jaeger, R. Sengupta, and W. L. Ruzzo [12]. Genes of interest are typically selected by ranking genes according to a test statistic and then choosing the top k genes. A problem with this approach is that many of these genes are highly correlated. For classification purposes it would be ideal to have distinct but still highly informative genes. Three different pre-filter methods - two based on clustering and
one based on correlation - were proposed to retrieve groups of similar genes. For these groups, a test statistic was applied to finally select the genes of interest. This filtered set of genes can be used to significantly improve existing classifiers.
2.2 Existing Tools for Normalizing GPR Datasets
There are various tools available online for analyzing GPR datasets. A few among them are GProcessor, GPR Normalizer, and the Microbial Diagnostic Array Workstation (MDAW).
The normalization method built into GenepixPro is simple linear normalization. The label effect caused by a two-channel experiment cannot be resolved by simple linear normalization. GProcessor provides the option of user-defined normalization conditions. The most efficient nonlinear normalization method that can deal with the label effect is the lowess fit method, which was originally proposed by William S. Cleveland. Another method to analyze microarray data is the analysis of variance method. GProcessor uses these methods to perform the normalization.
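To illustrate what a simple linear normalization amounts to, the sketch below rescales the 635 nm channel of one array by a single global factor so that its median matches the 532 nm channel median. This is only a toy global scaling with made-up intensities, not the GProcessor or lowess procedures discussed above.

use strict;
use warnings;

# Toy linear normalization: scale channel 1 so its median matches channel 2.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int( @sorted / 2 );
    return @sorted % 2 ? $sorted[$mid] : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

sub linear_normalize {
    my ($ch1, $ch2) = @_;                        # array refs of spot intensities
    my $factor = median(@$ch2) / median(@$ch1);  # single global scaling factor
    return [ map { $_ * $factor } @$ch1 ];
}

# Illustrative intensities only.
my $f635   = [ 520, 1800, 950, 310 ];
my $f532   = [ 480, 1500, 900, 290 ];
my $scaled = linear_normalize( $f635, $f532 );
printf "%.1f ", $_ for @$scaled;
print "\n";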
GPR Normalizer performs the preprocessing of the raw data, which includes background correction and normalization of the raw intensities. The statistical analyses are done using the Bioconductor package (lemma).
The Microbial Diagnostic Array Workstation is a web server for diagnostic array data storage, sharing, and analysis. It is not platform dependent, and GPR datasets can be analyzed by uploading them directly.
2.3 Stanford Microarray Database (SMD)
The Stanford Microarray Database serves as a microarray database for researchers and collaborators. It allows public login for data viewing and analysis. In addition, SMD functions as a resource for the entire scientific community by allowing members to download datasets, perform analysis, download source code, and use the various available tools to explore and analyze the data. The number of publicly accessible arrays is increasing by about 1,000 per year. The data include experiments on twelve distinct organisms, including Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, Drosophila melanogaster, and Escherichia coli [9]. SMD provides users the option of selecting experimental data, assessing data quality, filtering by individual spot characteristics and by expression pattern, and analyzing data using clustering techniques [9]. SMD's software is open source.
SMD's database server is currently an eight-processor Sun V880 with 32 GB of RAM installed [9]. The software used includes the database management system (SMD uses Oracle Server Enterprise Edition version 9i, 9.2.0.1.0), system software (the machine on which SMD resides currently runs SunOS 5.9), and other software (Perl 5.0004_04 or later).
2.4 Microarray Tools
A list of available tools for working with microarray datasets is given in Table 2.1. Each tool performs different data mining tasks.
Table 2.1 List of microarray tools [9]

Software from other sources:

Array Designer [13] (Premier Biosoft International; Java): Tool assisting in primer design for microarray construction
ArrayMiner [14] (Optimal Design, sprl; Windows, MacOS): Set of analysis tools using advanced algorithms to reveal the true structure of gene expression data
ArrayViewer [15] (National Human Genome Research Institute; Java): Identification of statistically significant hybridization signals
BAGEL [16] (University of Connecticut; MacOS, Windows, Linux): Bayesian Analysis of Gene Expression Levels, a program for the statistical analysis of spotted microarray data
BASE [17] (Lund University; Web): Microarray database and analysis platform
Cluster 3.0 [18] (University of Tokyo, Japan; Unix, Linux, MacOS, Windows): An enhanced version of Mike Eisen's Cluster
Expression Profiler [19] (European Bioinformatics Institute (EBI); Web): Analysis and clustering of gene expression data
GEDA [20] (University of Pittsburgh and UPMC; Web): Gene expression data analysis and simulation tools, offering a variety of options for processing and analyzing results
GeneCluster [21] (Whitehead Institute/MIT Center for Genome Research; Java, Windows NT): Self-organizing maps
GenMAPP [22] (Conklin lab, Gladstone Institute & UCSF; Windows): Tools for visualizing data from gene expression experiments in the context of biological pathways
GeneSifter [23] (GeneSifter; Web): Microarray data analysis system that provides access to powerful statistical tools through a web interface, with integrated features for determining the biological significance of the data; works with any array format and is especially optimized for Affymetrix GeneChip users; free trial accounts available
GeneX [24] (National Center for Genome Resources; Windows, Linux, SunOS/Solaris): Gene Expression Database, an integrated toolset for data analysis and comparison
Ocimum Biosolutions (Windows, Macintosh, Unix, Linux, Solaris)
Partek Pattern Recognition [27] (Partek Incorporated; Linux, Unix, Windows): Extracting and visualizing patterns in large multivariate data
TIGR MultiExperiment Viewer [28] (University of Waterloo, Canada; Linux, Unix, Windows): Analysis and visualization of microarray data
2.5 Available Source for Microarray Data
A few sources for microarray data that are available online are tabulated below. These sources provide different types of microarray datasets produced by different biological experiments.
Table 2.2 Available source for microarray data
National Center for Biotechnology
Information
http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray Database http://genome-www5.stanford.edu/
University of Pittsburgh Microarray
Dataset Collection
http://bioinformatics.upmc.edu/Help /UPITTGED.html
Kent Ridge Bio-medical Data Set
Repository
http://sdmc.lit.org.sg/GEDatasets/Datasets.html
CHAPTER III
MATERIALS AND METHODS
3.1 Database Design
A database is a structured collection of records or data that is stored in a computer system. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model; other models, such as the hierarchical model and the network model, use a more explicit representation of relationships. A computer-based database relies upon software to organize the storage of data; this software is known as a database management system. In this section, we describe the database schema design of MAT and the details of its tables.
3.1.1 Schema Design
The schema of a database system is its structure described in a formal language supported by the database management system. In a relational database, the schema defines the tables, the fields in each table, and the relationships between fields and tables. The levels of a database schema can be divided into the conceptual schema, the logical schema, and the physical schema. A conceptual schema is a map of concepts and their relationships. A logical schema is a map of entities and their attributes and relations. A physical schema is a particular implementation of a logical schema. The diagram of the database schema for MAT is shown below.
Figure 3.1 Database schema
3.1.2 Table Details
The tables used in MAT and their usage are shown in Table 3.1 below.
Table 3.1 Tables used in MAT
Genepix_header Store the header information about the samples
Genepix_version Store the gpr dataset record sets apart from the header information Genepix_sequence Store the gpr file names that are transferred to the repository Training_data_bank Store the samples for creation of training data set
Testing_data_bank Store the samples for creation of testing data set
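Access to these tables from an analysis script can be sketched with Perl's DBI module. The connection string, the credentials, and the column name file_name below are hypothetical placeholders; the actual identifiers are defined by the MAT schema in Figure 3.1.

use strict;
use warnings;
use DBI;

# Hedged sketch: list the GPR files registered in the repository by reading
# the Genepix_sequence table.  The DSN, credentials and the column name
# "file_name" are hypothetical placeholders, not the real MAT identifiers.
my $dbh = DBI->connect(
    'dbi:ODBC:Driver={SQL Server};Server=localhost;Database=MAT',
    'mat_user', 'mat_password',
    { RaiseError => 1 },
);

my $files = $dbh->selectcol_arrayref('SELECT file_name FROM Genepix_sequence');
print "$_\n" for @$files;

$dbh->disconnect;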
Figure 3.3 Genepix_header table design
Figure 3.4 Genepix_sequence table design
3.1.3 Attributes
We are using datasets generated by the biological kit GenepixPro. Once a microarray image is analyzed using the kit, the results are saved in the GPR format. GPR files have approximately 50 different attributes, including feature intensities, background intensities, ratio types, sums of various sorts, threshold parameters, and several different variations of these attributes. To use all these data wisely, we need to know basic facts about the microarray attributes. The current database design supports Genepix formats; it can be further extended to other types of microarray datasets by adding tables to the schema and creating the relationships between the tables. A detailed description of the data types is given in the next section.
3.2 Description of Genepix Data Format
The attributes of the sample file, and the biological relevance of a few of them, are explained in Table 3.2.
Table 3.2 Attributes and their description

Block: A block is the unit that consists of a set number of rows and columns of arrayed spots. A block corresponds to a single pin in the array printer.

Name, ID: The name and identifier of the gene printed at a spot, as given in the GAL file. If a spot is not used in the experiment, the Name column will be assigned null values or it will contain the text "Blank". The names in the GAL file are created by the authors, and the ID is taken from a database; the GenBank ID is used as the unique identifier in the analysis experiment [6].

F635 SD, F532 SD: The standard deviation of the intensity values at wavelength 1 or 2 of all pixels that fall within the feature-indicator ring.

B635 Median, B532 Median: The median background intensity at wavelength 1 or 2, measured from the background region of the microarray. If the background intensity is higher, we can assume that a high voltage has been used [6].

% >B635 +1SD, % >B635 +2SD, % >B532 +1SD, % >B532 +2SD: The percentage of feature pixels at wavelength 1 or 2 that have intensity values greater than 1 or 2 SD above the median background intensity value [6].

F635 % Saturated, F532 % Saturated: The percentage of feature pixels at wavelength 1 or 2 that have the maximum 16-bit intensity value of 65,535.

Ratios SD: The standard deviation (SD) of the intensity ratios of all pixels within a feature. This parameter is derived from the ratio values computed on a pixel-by-pixel basis.

Rgn R2: The square of the correlation coefficient; it ranges between 0 and 1.

Sum of Medians, Sum of Means: The sum of the background-subtracted median or mean pixel intensity values for both wavelengths.

F635 Median – B635, F532 Median – B532, F635 Mean – B635, F532 Mean – B532: The median or mean pixel intensity at wavelength 1 or 2 for the feature, with the median background subtracted.

Flags: The software flags a few features based on the default conditions set in the system. The possible values are -100, -75, and -50; values greater than 0 are considered to be good features.
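To show how a few of these attributes are used downstream, the following hedged Perl sketch reads a tab-delimited GPR-style table and prints the background-subtracted median intensity (F635 Median minus B635 Median) for every spot that is not flagged as bad. It assumes the GPR header block has already been stripped and that the column names ID, F635 Median, B635 Median, and Flags appear exactly as listed above; both assumptions are simplifications of the real format.

use strict;
use warnings;

# Hedged sketch: compute background-subtracted median intensities from a
# tab-delimited GPR-style table.  Assumes the first line is the column
# header row (the GPR header block has already been removed) and that the
# columns "ID", "F635 Median", "B635 Median" and "Flags" are present.
my $file = shift @ARGV or die "usage: $0 <gpr_table.txt>\n";
open my $fh, '<', $file or die "cannot open $file: $!\n";

chomp( my $header = <$fh> );
my @cols = split /\t/, $header;
my %idx  = map { $cols[$_] => $_ } 0 .. $#cols;

while ( my $line = <$fh> ) {
    chomp $line;
    my @f = split /\t/, $line;
    next if $f[ $idx{'Flags'} ] < 0;                      # skip flagged (bad/absent) spots
    my $signal = $f[ $idx{'F635 Median'} ] - $f[ $idx{'B635 Median'} ];
    printf "%s\t%d\n", $f[ $idx{'ID'} ], $signal;
}
close $fh;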