MICROARRAY DATA ANALYSIS TOOL (MAT)
A Thesis Presented to The Graduate Faculty of The University of Akron
ABSTRACT
Microarray technology has been widely used by biologists to probe the presence of genes in a sample of DNA or RNA. Using this technology, oligonucleotide probes can be immobilized on a microarray chip in a massively parallel fashion, allowing biologists to check the expression levels of thousands of genes at once. This thesis develops a software system that includes a database repository to store different microarray datasets and a microarray data analysis tool for analyzing the stored data. The repository currently allows datasets in the GenepixPro format to be deposited, although it can be expanded to include datasets of other formats. The user interface of the repository allows users to conveniently upload data files and perform preferred data preprocessing and analysis. The analysis methods implemented include the traditional k-nearest neighbor (kNN) method and two new kNN methods developed in this study. Additional analysis methods can be added by future developers. The system was tested using a set of microRNA gene expression data. The design and implementation of the software tool are presented in the thesis along with the testing results from the microRNA dataset. The results indicate that the new weighted kNN method proposed in this study outperforms the traditional kNN method and the proposed mean method. We conclude that the system developed in this thesis effectively provides a structured microarray data repository, a flexible graphical user interface, and rational data mining methods.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Dr. Zhong-Hui Duan, for giving me the opportunity to work on this project for my Master's thesis. I was motivated to choose this topic after I took the Introduction to Bioinformatics course. I would like to thank her for her invaluable suggestions and steady guidance during the entire course of the project.
I am thankful to my committee members, Dr. Yingcai Xiao and Dr. Xuan-Hien Dang, for their guidance, invaluable suggestions, and time.
I would like to thank my friends Shanth Anand and Prashanth Puliyadi for helping me pursue my Master's degree and change my career path. I could not have achieved this without their help.
I would like to thank my friend Manik Dhawan for his guidance in writing and formatting this report.
I would finally like to express my gratitude towards my parents and all my family members, who were always there for me, cheered me on in all situations, and took great interest in my venture.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER I INTRODUCTION
  1.1 Introduction to Bioinformatics
  1.2 Introduction to Microarray Technology
    1.2.1 Genepix Experimental Procedure
  1.3 Applications of Microarrays
  1.4 Need for Automated Analysis
  1.5 Knowledge Discovery in Data
    1.5.1 KDD Steps
  1.6 Classification
    1.6.1 General Approach
    1.6.2 Decision Trees
    1.6.3 k-Nearest Neighbor Classifiers
  1.7 Outline of the Current Study
II LITERATURE REVIEW
  2.1 Previous Work
  2.2 Existing Tools for Normalizing GPR Datasets
  2.3 Stanford Microarray Database (SMD)
  2.4 Microarray Tools
  2.5 Available Source for Microarray Data
III MATERIALS AND METHODS
  3.1 Database Design
    3.1.1 Schema Design
    3.1.2 Table Details
    3.1.3 Attributes
  3.2 Description of Genepix Data Format
    3.2.1 Features and Blocks
    3.2.2 Sample Dataset
    3.2.3 Transferring Genepix Dataset to Database
  3.3 Data Selection
    3.3.1 Creation of Training and Testing Dataset
  3.4 Preprocessing
    3.4.1 Preprocessing in MAT
  3.5 Normalization
  3.6 Feature Selection
    3.6.1 Student T-Test
    3.6.2 Implementation of T-Test in MAT
  3.7 Classification
    3.7.1 Classical kNN Method
    3.7.2 Weighted kNN Method
    3.7.3 Mean kNN Method
IV RESULTS AND DISCUSSIONS
  4.1 A Case Study
  4.2 Results
  4.3 Discussion
V CONCLUSIONS AND FUTURE WORK
  5.1 Conclusion
  5.2 Future Work
REFERENCES
APPENDICES
  APPENDIX A COPYRIGHT PERMISSION FOR FIGURE 1.2
  APPENDIX B PERL SCRIPT FOR T-TEST - TTEST.PL
  APPENDIX C CLASSIFICATION ALGORITHMS
LIST OF TABLES
1.1 Confusion matrix for a 2-class problem
1.2 Software used
2.1 List of microarray tools
2.2 Available source for microarray data
3.1 Tables used in MAT
3.2 Attributes and their description
3.3 List of default choices for feature selection
4.1 Training and testing samples – Experiment 1
4.2 Accuracy of three classification methods for different N features
4.3 Training and testing samples – Experiment 2
4.4 Accuracy of three classification methods for different N features
LIST OF FIGURES
1.1 Schematic view of a typical microarray experiment
1.2 Genepix experimental procedure
1.3 Overview of KDD process
1.4 Mapping an input attribute set x into its class label y
1.5 A decision tree for the mammal classification problem
1.6 Classifying an unlabeled vertebrate
1.7 Schematic representation of k-NN classifier
1.8 System diagram
1.9 Application flow diagram
2.1 Sketch of the ProGene algorithm
3.1 Database schema
3.2 Genepix_version table design
3.3 Genepix_header table design
3.4 Genepix_sequence table design
3.5 Hypothetical arrays of blocks
3.6 Sample dataset
3.7 Creation of repository
3.8 Selection of datasets
3.9 Temporary table names for training and testing datasets
3.10 Flowchart – Creation of dataset
3.11 Replication of gene
3.12 Sample training dataset with median intensity values
3.13 Preprocessing in MAT
3.14 T-Test formulas
3.15 Calculated p-values for the genes
3.16 Pseudo code of kNN classical method
3.17 Pseudo code of kNN mean method
4.1 Training samples selected for the experiment
4.2 Testing samples selected for the experiment
4.3 Attribute selection and constraint specification for normalization
4.4 Training datasets
4.5 Testing datasets
4.6 Feature selection and normalization
CHAPTER I
INTRODUCTION
1.1 Introduction to Bioinformatics
The central dogma of molecular biology is that DNA (deoxyribonucleic acid) acts as a template to replicate itself, DNA is transcribed into RNA, and RNA is translated into protein. DNA is the genetic material; for years it has represented the answer to a question posed by researchers and scientists: "What is the basis of inheritance?" The information stored in DNA allows the organization of inanimate molecules into functioning, living cells and organisms that are able to regulate their internal chemical composition, growth, and reproduction [1]. This is what allows us to inherit our parents' features, for example their curly hair, their nose, and other traits. The units that govern those characteristics at the genetic level are called genes. The term bioinformatics refers to the use of computers to retrieve, process, analyze, and simulate biological information. Bioinformatics has enabled a large body of research and has proven itself in the diagnosis and classification of disease and in the discovery of many factors that lead to disease. Although bioinformatics began with sequence comparison, it now encompasses a wide range of activities in modern scientific research. It requires mathematical, biological, physical, and chemical knowledge, and its implementation may furthermore require knowledge of computer science and related fields.
1.2 Introduction to Microarray Technology
A DNA microarray is an orderly arrangement of tens to hundreds of thousands of DNA fragments (probes) of known sequence. It provides a platform for probe hybridization to radioactively or fluorescently labeled cDNAs (targets). The intensity of the radioactive or fluorescent signals generated by the hybridization reveals the level of the cDNAs in the biological samples under study. Figure 1.1 shows the major processes in a typical microarray experiment. Microarray technology has been widely used to investigate gene expression levels on a genome-wide scale [1, 2, 5, 10]. It can be used to identify the genetic changes associated with diseases, drug treatments, or stages in cellular processes such as apoptosis or the cycle of cell growth and division [10]. The scientific tasks involved in analyzing microarray gene expression data include the identification of co-expressed genes, discovery of sample or gene groups with similar expression patterns, study of gene activity patterns under various stress conditions, and identification of genes whose expression patterns are highly discriminative for differentiating the biological samples of interest.
Microarray platforms include Affymetrix GeneChips, which use presynthesized oligonucleotides as probes, and cDNA microarrays, which use full-length cDNAs as probes. The array experiment uses slides or blotting membranes. The spot sizes are typically less than 200 microns in diameter, and an array usually contains thousands of spots. The spotted samples are known as probes; the spots can be DNA, cDNA, or oligonucleotides [2]. They are used to determine complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An orderly arrangement of probes is important because the location of each spot on the array is used for the identification of a gene. A diagram of the microarray experiment is shown in Figure 1.1.
Figure 1.1 Schematic view of a typical microarray experiment
In the current study, we use microarray datasets that were generated through cDNA microarray experiments. The arrays were scanned using Genepix Pro. The following section explains the experimental procedure used in the creation of the datasets.
1.2.1 Genepix Experimental Procedure
Genepix Pro is an automatic microarray slide scanner that loads slides, scans them, performs analysis, and saves the results automatically. It can accommodate up to 36 slides. The autoloader accommodates microarrays on microscope slides labeled with up to four fluorescent dyes. These microarrays can contain from a few hundred to a few thousand spots, representing an entire genome.
Figure 1.2 Genepix experimental procedure [Copyright – Appendix A]
When the slide carrier is inserted into the scanner, sensors detect its position. The software helps the user select the slides to be scanned. A graphical representation of the slides is shown on the screen for user selection, which makes it easier for the user to identify each slide. For each slide, or for a group of slides, we can configure the settings for the experiment. We can also choose the automatic analysis option in the software. If an email address is specified in the settings, the results are sent to that address when the experiment is done.
The robotic arm takes the first slide from the slide carrier, scans the bar code on the slide, and positions the slide for scanning. Genepix can be configured with four lasers. A laser power wheel is used to adjust the laser strength for especially bright samples. The laser excitation beam is delivered to the surface of the microarray slide and scans rapidly across the axis of the slide. As the robotic arm moves the slide slowly, the fluorescent signals emitted from the sample are collected by a photomultiplier tube. Sensors detect any non-uniformity in the slide surface, and the robotic arm is used to adjust the focus of the scan. Each channel is scanned sequentially, and the developing images are displayed on the monitor. The multichannel TIFF images are saved automatically according to the file naming conventions specified by the user.
Once the scan has been completed, the robotic arm returns the slide to the carrier and repeats the process for the other slides selected from the tray. Genepix automatically finds the spots, calculates up to 108 measures, and saves the results as GPR files. If the experiment is conducted with a single channel, the number of measures will be about 50; otherwise, the number of measures will be between 50 and 108.
1.3 Applications of Microarrays
Now that we know the basic workings of microarrays, we can explore the different applications of microarray technology.
Gene discovery: Microarray technology helps in the identification of new genes. It helps us learn about their functioning and their expression levels under different conditions.
Disease diagnosis: Microarray technology helps us learn more about different diseases such as heart disease, mental illness, and infectious disease, and is especially useful in the study of cancer. Different types of cancer have traditionally been classified on the basis of the organs in which the tumors develop. With the help of microarray technology, it will be possible for researchers to further classify the types of cancer on the basis of the patterns of gene activity in the tumor cells. This will help the pharmaceutical community to develop more effective drugs, as the treatment strategies will be targeted directly to the specific type of cancer.
Drug discovery: Pharmacogenomics is the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients [2]. Comparative analysis of the genes from diseased and normal cells helps in identifying the biochemical constitution of the proteins synthesized by the diseased genes. Researchers can use this information to synthesize drugs that combat these proteins and reduce their effect.
Toxicological research: Microarray technology provides a robust platform for research on the impact of toxins on cells and their passing on to the progeny [2]. Toxicogenomics establishes correlations between responses to toxicants and the changes in the genetic profiles of the cells exposed to such toxicants [2].
1.4 Need for Automated Analysis
The intrinsic problems of a typical data set produced by microarrays are the small sample size and the high dimensionality of the data. A dataset created by Genepix Pro has various measures for thousands of genes, and there is no practical way of analyzing the samples manually. In this study, we propose a microarray analysis tool (MAT) that is able to appropriately incorporate new methods for classification and for finding new classes. The tool follows the knowledge discovery in data (KDD) steps, which are explained in detail in the following section.
1.5 Knowledge Discovery in Data
The term knowledge discovery in data (KDD) refers to the process of finding knowledge in data through the application of particular data mining methods. It involves the evaluation and possible interpretation of the discovered patterns, known as knowledge. The unifying goal of the KDD process is to extract useful information from large databases. An overview of the KDD process is shown in Figure 1.3.
Figure 1.3 Overview of KDD process
1.5.1 KDD Steps
Data selection draws on knowledge of the application domain and selects the datasets that are relevant to the problem to be solved. The preprocessing step removes unwanted data from the database and finds strategies to handle missing fields in the dataset. Transformation is the process of transforming data from one form to another; in this step, we find useful features to represent the data, depending on the goal of the task, and normalize the dataset. In the data mining step, we decide on the algorithms suitable for the study; the current study is mainly about classification, and hence we choose the classification algorithms to be implemented in this step. Interpretation and evaluation is the process of assessing the created model: the model is tested with the test samples, and the accuracy of the prediction is calculated.
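To relate these steps to MAT's workflow, the stages can be laid out as a simple Perl driver. The sketch below is illustrative only; the subroutine names and stub bodies are hypothetical placeholders, not the actual MAT code, and they merely mirror the order of the KDD steps described above.

#!/usr/bin/perl
use strict;
use warnings;

# A minimal sketch of how a MAT run lines up with the KDD stages above.
# Subroutine names and stub bodies are hypothetical placeholders.
my $samples    = select_data('genepix_repository');   # data selection
my $clean      = preprocess($samples);                 # remove unwanted data, handle missing fields
my $normalized = transform($clean);                    # feature transformation / normalization
my $model      = mine($normalized);                    # data mining step (classification)
evaluate($model);                                      # interpretation and evaluation

# Stub stages so the sketch runs end to end; each would be replaced by the
# real selection, preprocessing, normalization, classification and
# evaluation code.
sub select_data { print "selecting samples from $_[0]\n"; return [] }
sub preprocess  { print "preprocessing\n";                return $_[0] }
sub transform   { print "normalizing\n";                  return $_[0] }
sub mine        { print "building the kNN classifier\n";  return $_[0] }
sub evaluate    { print "computing accuracy on the test set\n" }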
1.6 Classification
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y [4].
The input data for the classification model is a collection of records. Each record is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute designated as the class label. A classification model can also serve as an explanatory tool to distinguish between objects of different classes.
1.6.1 General Approach
Several approaches can be taken to build classification models, including decision trees, neural networks, kNN classifiers, and others. Each approach has a learning algorithm that creates a model based on the given input attribute set. The model generated by the learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before.
The training set consists of records whose class labels are known. The classification model is built using the training set, and the model is then applied to the test data, whose class labels are unknown. The evaluation of the classification model is done using a confusion matrix.
Table 1.1 Confusion matrix for a 2-class problem [4]

                      Predicted Class = 1    Predicted Class = 0
  Actual Class = 1           f11                    f10
  Actual Class = 0           f01                    f00
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01). Accuracy is calculated using Eq. 1.1 and the error rate is calculated using Eq. 1.2.
\[ \text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \]   (Eq. 1.1)

\[ \text{Error rate} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}} \]   (Eq. 1.2)
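To make Eq. 1.1 and Eq. 1.2 concrete, a short Perl sketch (illustrative only, not part of MAT) can compute both measures directly from the four confusion matrix counts; the counts in the usage line are made-up placeholders.

use strict;
use warnings;

# Accuracy (Eq. 1.1) and error rate (Eq. 1.2) from the entries of a
# two-class confusion matrix: f11 and f00 are correct predictions,
# f10 and f01 are the misclassified records.
sub accuracy_and_error {
    my ($f11, $f10, $f01, $f00) = @_;
    my $total = $f11 + $f10 + $f01 + $f00;
    die "empty confusion matrix\n" if $total == 0;
    my $accuracy = ($f11 + $f00) / $total;
    return ($accuracy, 1 - $accuracy);    # error rate = 1 - accuracy
}

# Usage with illustrative counts only.
my ($acc, $err) = accuracy_and_error(40, 5, 3, 52);
printf "accuracy = %.3f, error rate = %.3f\n", $acc, $err;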
1.6.2 Decision Trees

A decision tree has three types of nodes:
• A root node, which has no incoming edges and zero or more outgoing edges
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges
Figure 1.5 A decision tree for the mammal classification problem [4]
In the decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the root and other internal nodes, contain attribute test conditions to separate records that have different characteristics. For example, the root node shown in Figure 1.5 uses the attribute Body Temperature to separate warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals, a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-blooded creatures, which are mostly birds. Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root node, we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This leads us either to another internal node, for which a new test condition is applied, or to a leaf node, whose class label is then assigned to the test record.
Figure 1.6 Classifying an unlabeled vertebrate [4]
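As an illustration only, the traversal described above can be expressed as a pair of nested tests that mirror the tree in Figure 1.5; the attribute names are taken from the figure, and the record shown is a hypothetical unlabeled vertebrate in the spirit of Figure 1.6.

use strict;
use warnings;

# Classify a vertebrate with the decision tree of Figure 1.5: the root node
# tests Body Temperature; warm-blooded records are then tested on Gives Birth.
sub classify_vertebrate {
    my (%record) = @_;
    return 'Non-mammal' if $record{body_temperature} ne 'warm-blooded';
    return $record{gives_birth} eq 'yes' ? 'Mammal' : 'Non-mammal';
}

# A hypothetical unlabeled test record.
print classify_vertebrate(
    body_temperature => 'warm-blooded',
    gives_birth      => 'no',
), "\n";    # prints "Non-mammal"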
1.6.3 k-Nearest Neighbor Classifiers
The k-nearest neighbor (kNN) method is a simple machine learning algorithm used for classification based on the training samples in the feature space. In this method, the target object is classified by the majority vote of its neighbors, and the object is assigned to the class to which most of the neighbors belong (Figure 1.7). For the purpose of identifying neighbors, objects are represented by position vectors in a multidimensional feature space. The k training samples that are most similar to the attributes of the test sample are found; these are considered the nearest neighbors and are used to determine the class label of the test sample. The distance between samples x and y can be calculated using the Euclidean distance (Eq. 1.3), the Manhattan distance (Eq. 1.4), or other distance measures.
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]   (Eq. 1.3)

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]   (Eq. 1.4)

where x_i is the expression level of gene i in sample x, y_i is the expression level of gene i in sample y, and n is the number of genes whose expression values are measured.
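The following Perl sketch illustrates Eq. 1.3, Eq. 1.4, and the majority vote on a toy dataset. It is a simplified illustration with made-up expression values, not the MAT implementation of the classical, weighted, or mean kNN methods described in Chapter III.

use strict;
use warnings;
use List::Util qw(sum);

# Euclidean distance (Eq. 1.3) between two expression profiles.
sub euclidean {
    my ($x, $y) = @_;
    return sqrt( sum( map { ($x->[$_] - $y->[$_])**2 } 0 .. $#$x ) );
}

# Manhattan distance (Eq. 1.4).
sub manhattan {
    my ($x, $y) = @_;
    return sum( map { abs($x->[$_] - $y->[$_]) } 0 .. $#$x );
}

# Classical kNN idea: label the test sample by majority vote among the k
# nearest training samples.  Each training sample is [ \@expression, $label ].
sub knn_classify {
    my ($train, $test, $k) = @_;
    my @nearest = ( sort { $a->[0] <=> $b->[0] }
                    map { [ euclidean($_->[0], $test), $_->[1] ] } @$train )[ 0 .. $k - 1 ];
    my %votes;
    $votes{ $_->[1] }++ for @nearest;
    my ($winner) = sort { $votes{$b} <=> $votes{$a} } keys %votes;
    return $winner;
}

# Illustrative data only: three genes per sample, two classes.
my @train = (
    [ [ 1.0, 2.0, 1.5 ], 'tumor'  ],
    [ [ 0.9, 2.2, 1.4 ], 'tumor'  ],
    [ [ 3.0, 0.5, 2.8 ], 'normal' ],
);
print knn_classify( \@train, [ 1.1, 2.1, 1.6 ], 3 ), "\n";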
Figure 1.7 Schematic representation of k-NN classifier
1.7 Outline of the Current Study
The objective of this study is to create a database repository to store different microarray datasets and to create a microarray analysis tool (MAT) that can be used for the analysis of gene expression. The tool has been designed to follow the KDD steps. The database repository currently accepts Genepix datasets, although it can be expanded to include other formats. The analysis methods implemented include three different kNN methods: classical kNN, weighted kNN, and mean kNN. The system diagram, application flow diagram, and software used are shown below.
Figure 1.8 System diagram
[Figure 1.8 shows three components: the user interface (C++) with screens for the data mining process; the database (SQL Server 2005) with dynamic scripts for the creation of training and testing datasets; and text files holding the training and testing datasets, the input file for the t-test, and the .cls file identifying the type of samples.]
Figure 1.9 Application flow diagram
Table 1.2 Software used
User interface                  C++
Database                        SQL Server 2005
Feature selection algorithms    Perl
CHAPTER II
LITERATURE REVIEW

2.1 Previous Work
In one previous study, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays was applied to human acute leukemia as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrated the feasibility of cancer classification based solely on gene expression monitoring and suggested a strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Improving classification of microarray data using prototype-based feature selection
This study of improving accuracy in the machine-learning task of classification from microarray data was done by Blaise Hanczar [11]. One of the known issues specifically related to microarray data is the large number of genes versus the small number of available samples. The most important task is to identify the genes that are most relevant for classification. Classical feature selection methods are based on the notion of a prototype gene; each prototype represents a set of similar genes according to a given clustering method. In this work, experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions was presented [11]. To improve the accuracy of a machine-learning based classifier algorithm, dimension reduction methods play a key role. A somewhat original dimension reduction method, which experimentally increases the classification accuracy of a support vector machine based classifier, was developed. Although the performance gain is comparable to that of classical reduction methods, combining them outperforms both methods.
The dimension reduction technique follows two steps. The first is to identify equivalence classes inside the gene space with respect to a given criterion, be it the gene expression, the known gene function, or any biologically relevant criterion [11]. The second step is to create gene prototypes that are good representatives of these classes [11]. The classification task is performed using one or more prototype genes that have been computed by an aggregation of the genes that best represent each class. A sketch of the algorithm is shown below.
Figure 2.1 Sketch of the ProGene algorithm [11]

1   CM <- Select method of clustering (default: k-means)
2   NBCLUST <- Select the desired number of clusters
3   For each iteration of the cross validation
3.1     Define the train and test dataset
3.2     Do NBCLUST clusters of genes on the train set using CM
3.3     For each cluster Cu
3.3.1       Build prototype Pu <- mean of this cluster
3.4     Model <- training of SVM using prototype
3.5     accuracy <- prediction on the test set
4   Compute the average accuracy

Improved Gene Selection for classification of microarrays

This study of deriving methods for improving techniques for selecting informative genes from microarray data was done by J. Jaeger, R. Sengupta, and W. L. Ruzzo [12]. Genes of interest are typically selected by ranking genes according to a test statistic and then choosing the top k genes. A problem with this approach is that many of these genes are highly correlated. For classification purposes it would be ideal to have distinct but still highly informative genes. Three different pre-filter methods - two based on clustering and
one based on correlation - were proposed to retrieve groups of similar genes. For these groups, a test statistic was applied to finally select the genes of interest. This filtered set of genes can be used to significantly improve existing classifiers.
2.2 Existing Tools for Normalizing GPR Datasets
There are various tools available online for analyzing GPR datasets. A few among them are GProcessor, GPR Normalizer, and the Microbial Diagnostic Array Workstation (MDAW).
The normalization method built into GenepixPro is simple linear normalization. The label effect caused by a two-channel experiment cannot be resolved by simple linear normalization. GProcessor provides the option of user-defined normalization conditions. The most efficient nonlinear normalization method that can deal with the label effect is the lowess fit method, which was originally proposed by William S. Cleveland. Another method to analyze microarray data is the analysis of variance method. GProcessor uses these methods to perform the normalization.
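To illustrate what a simple linear normalization amounts to, the sketch below rescales the 635 nm channel of one array by a single global factor so that its median matches the 532 nm channel median. This is only a toy global scaling with made-up intensities, not the GProcessor or lowess procedures discussed above.

use strict;
use warnings;

# Toy linear normalization: scale channel 1 so its median matches channel 2.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid = int( @sorted / 2 );
    return @sorted % 2 ? $sorted[$mid] : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
}

sub linear_normalize {
    my ($ch1, $ch2) = @_;                        # array refs of spot intensities
    my $factor = median(@$ch2) / median(@$ch1);  # single global scaling factor
    return [ map { $_ * $factor } @$ch1 ];
}

# Illustrative intensities only.
my $f635   = [ 520, 1800, 950, 310 ];
my $f532   = [ 480, 1500, 900, 290 ];
my $scaled = linear_normalize( $f635, $f532 );
printf "%.1f ", $_ for @$scaled;
print "\n";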
GPR Normalizer performs the preprocessing of the raw data, which includes background correction and normalization of the raw intensities. The statistical analyses are done using the Bioconductor package (lemma).
The Microbial Diagnostic Array Workstation is a web server for diagnostic array data storage, sharing, and analysis. It is not platform dependent, and GPR datasets can be analyzed by uploading them directly.
2.3 Stanford Microarray Database (SMD)
The Stanford Microarray Database serves as a microarray database for researchers and collaborators. It allows public login for data viewing and analysis. In addition, SMD functions as a resource for the entire scientific community by allowing members to download datasets, perform analysis, download source code, and use the various available tools to explore and analyze the data. The number of publicly accessible arrays is increasing by about 1,000 per year. The data include experiments on twelve distinct organisms, including Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, Drosophila melanogaster, and Escherichia coli [9]. SMD provides users the option of selecting experimental data, assessing data quality, filtering by individual spot characteristics and by expression pattern, and analyzing data using clustering techniques [9]. SMD's software is open source.
SMD's database server is currently an eight-processor Sun V880 with 32 GB of RAM installed [9]. The software used includes the database management system (SMD uses Oracle Server Enterprise Edition version 9i, 9.2.0.1.0), system software (the machine on which SMD resides currently runs SunOS 5.9), and other software (Perl 5.0004_04 or later).
2.4 Microarray Tools
A list of available tools for working with microarray datasets is given in Table 2.1. Each tool performs different data mining tasks.
Table 2.1 List of microarray tools [9]

Software from other sources:

Array Designer [13] (Premier Biosoft International; Java): Tool assisting in primer design for microarray construction
ArrayMiner [14] (Optimal Design, sprl; Windows, MacOS): Set of analysis tools using advanced algorithms to reveal the true structure of gene expression data
ArrayViewer [15] (National Human Genome Research Institute; Java): Identification of statistically significant hybridization signals
BAGEL [16] (University of Connecticut; MacOS, Windows, Linux): Bayesian Analysis of Gene Expression Levels, a program for the statistical analysis of spotted microarray data
BASE [17] (Lund University; Web): Microarray database and analysis platform
Cluster 3.0 [18] (University of Tokyo, Japan; Unix, Linux, MacOS, Windows): An enhanced version of Mike Eisen's Cluster
Expression Profiler [19] (European Bioinformatics Institute (EBI); Web): Analysis and clustering of gene expression data
GEDA [20] (University of Pittsburgh and UPMC; Web): Gene expression data analysis and simulation tools, offering a variety of options for processing and analyzing results
GeneCluster [21] (Whitehead Institute/MIT Center for Genome Research; Java, Windows NT): Self-organizing maps
GenMAPP [22] (Conklin lab, Gladstone Institute & UCSF; Windows): Tools for visualizing data from gene expression experiments in the context of biological pathways
GeneSifter [23] (GeneSifter; Web): Microarray data analysis system that provides access to powerful statistical tools through a web interface, with integrated features for determining the biological significance of the data; works with any array format and is especially optimized for Affymetrix GeneChip users; free trial accounts available
GeneX [24] (National Center for Genome Resources; Windows, Linux, SunOS/Solaris): Gene Expression Database, an integrated toolset for data analysis and comparison
Ocimum Biosolutions (Windows, Macintosh, Unix, Linux, Solaris)
Partek Pattern Recognition [27] (Partek Incorporated; Linux, Unix, Windows): Extracting and visualizing patterns in large multivariate data
TIGR MultiExperiment Viewer [28] (University of Waterloo, Canada; Linux, Unix, Windows): Analysis and visualization of microarray data
2.5 Available Source for Microarray Data
A few sources for microarray data that are available online are tabulated below. These sources provide different types of microarray datasets produced by different biological experiments.
Table 2.2 Available source for microarray data
National Center for Biotechnology
Information
http://www.ncbi.nlm.nih.gov/geo/
Stanford Microarray Database http://genome-www5.stanford.edu/
University of Pittsburgh Microarray
Dataset Collection
http://bioinformatics.upmc.edu/Help /UPITTGED.html
Kent Ridge Bio-medical Data Set
Repository
http://sdmc.lit.org.sg/GEDatasets/Datasets.html
CHAPTER III
MATERIALS AND METHODS
3.1 Database Design
A database is a structured collection of records or data that is stored in a computer system. The structure is achieved by organizing the data according to a database model. The model in most common use today is the relational model; other models, such as the hierarchical model and the network model, use a more explicit representation of relationships. A computer-based database relies upon software to organize the storage of data; this software is known as a database management system. In this section, we describe the database schema design of MAT and the details of its tables.
3.1.1 Schema Design
The schema of a database system is its structure described in a formal language supported by the database management system. In a relational database, the schema defines the tables, the fields in each table, and the relationships between fields and tables. The levels of a database schema can be divided into the conceptual schema, the logical schema, and the physical schema. A conceptual schema is a map of concepts and their relationships. A logical schema is a map of entities and their attributes and relations. A physical schema is a particular implementation of a logical schema. The diagram of the database schema for MAT is shown below.
Figure 3.1 Database schema
3.1.2 Table Details
The tables used in MAT and their usage are shown in Table 3.1 below.
Table 3.1 Tables used in MAT
Genepix_header Store the header information about the samples
Genepix_version Store the gpr dataset record sets apart from the header information Genepix_sequence Store the gpr file names that are transferred to the repository Training_data_bank Store the samples for creation of training data set
Testing_data_bank Store the samples for creation of testing data set
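Access to these tables from an analysis script can be sketched with Perl's DBI module. The connection string, the credentials, and the column name file_name below are hypothetical placeholders; the actual identifiers are defined by the MAT schema in Figure 3.1.

use strict;
use warnings;
use DBI;

# Hedged sketch: list the GPR files registered in the repository by reading
# the Genepix_sequence table.  The DSN, credentials and the column name
# "file_name" are hypothetical placeholders, not the real MAT identifiers.
my $dbh = DBI->connect(
    'dbi:ODBC:Driver={SQL Server};Server=localhost;Database=MAT',
    'mat_user', 'mat_password',
    { RaiseError => 1 },
);

my $files = $dbh->selectcol_arrayref('SELECT file_name FROM Genepix_sequence');
print "$_\n" for @$files;

$dbh->disconnect;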
Figure 3.3 Genepix_header table design
Figure 3.4 Genepix_sequence table design
3.1.3 Attributes
We are using datasets generated by the biological kit GenepixPro. Once a microarray image is analyzed using the kit, the results are saved in the GPR format. GPR files have approximately 50 different attributes, including feature intensities, background intensities, ratio types, sums of various sorts, threshold parameters, and several different variations of these attributes. To use all these data wisely, we need to know basic facts about the microarray attributes. The current database design supports Genepix formats; it can be further extended to other types of microarray datasets by adding tables to the schema and creating the relationships between the tables. A detailed description of the data types is given in the next section.
3.2 Description of Genepix Data Format
The attributes of the sample file, and the biological relevance of a few of them, are explained in Table 3.2.
Table 3.2 Attributes and their description

Block: A block is the unit that consists of a set number of rows and columns of arrayed spots. A block corresponds to a single pin in the array printer.

Name, ID: The name and identifier of the gene printed at a spot, as given in the GAL file. If a spot is not used in the experiment, the Name column will be assigned null values or it will contain the text "Blank". The names in the GAL file are created by the authors, and the ID is taken from a database; the GenBank ID is used as the unique identifier in the analysis experiment [6].

F635 SD, F532 SD: The standard deviation of the intensity values at wavelength 1 or 2 of all pixels that fall within the feature-indicator ring.

B635 Median, B532 Median: The median background intensity at wavelength 1 or 2, measured from the background region of the microarray. If the background intensity is higher, we can assume that a high voltage has been used [6].

% >B635 +1SD, % >B635 +2SD, % >B532 +1SD, % >B532 +2SD: The percentage of feature pixels at wavelength 1 or 2 that have intensity values greater than 1 or 2 SD above the median background intensity value [6].

F635 % Saturated, F532 % Saturated: The percentage of feature pixels at wavelength 1 or 2 that have the maximum 16-bit intensity value of 65,535.

Ratios SD: The standard deviation (SD) of the intensity ratios of all pixels within a feature. This parameter is derived from the ratio values computed on a pixel-by-pixel basis.

Rgn R2: The square of the correlation coefficient; it ranges between 0 and 1.

Sum of Medians, Sum of Means: The sum of the background-subtracted median or mean pixel intensity values for both wavelengths.

F635 Median – B635, F532 Median – B532, F635 Mean – B635, F532 Mean – B532: The median or mean pixel intensity at wavelength 1 or 2 for the feature, with the median background subtracted.

Flags: The software flags a few features based on the default conditions set in the system. The possible values are -100, -75, and -50; values greater than 0 are considered to be good features.
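To show how a few of these attributes are used downstream, the following hedged Perl sketch reads a tab-delimited GPR-style table and prints the background-subtracted median intensity (F635 Median minus B635 Median) for every spot that is not flagged as bad. It assumes the GPR header block has already been stripped and that the column names ID, F635 Median, B635 Median, and Flags appear exactly as listed above; both assumptions are simplifications of the real format.

use strict;
use warnings;

# Hedged sketch: compute background-subtracted median intensities from a
# tab-delimited GPR-style table.  Assumes the first line is the column
# header row (the GPR header block has already been removed) and that the
# columns "ID", "F635 Median", "B635 Median" and "Flags" are present.
my $file = shift @ARGV or die "usage: $0 <gpr_table.txt>\n";
open my $fh, '<', $file or die "cannot open $file: $!\n";

chomp( my $header = <$fh> );
my @cols = split /\t/, $header;
my %idx  = map { $cols[$_] => $_ } 0 .. $#cols;

while ( my $line = <$fh> ) {
    chomp $line;
    my @f = split /\t/, $line;
    next if $f[ $idx{'Flags'} ] < 0;                      # skip flagged (bad/absent) spots
    my $signal = $f[ $idx{'F635 Median'} ] - $f[ $idx{'B635 Median'} ];
    printf "%s\t%d\n", $f[ $idx{'ID'} ], $signal;
}
close $fh;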