THAI NGUYEN UNIVERSITY
UNIVERSITY OF AGRICULTURE AND FORESTRY
NGUYEN THI QUYNH LAM
IDENTIFYING THE EFFECT OF EXPOSURE TO DIOXIN AND FURAN
ON HUMAN HEALTH LEADING TO DIFFUSE LARGE B LYMPHOMA
THROUGH GENE-NETWORK CONSTRUCTION
BACHELOR THESIS
Study Mode: Full-time
Major: Environmental Science and Management
Faculty: International Programs Office
Batch: 2013 - 2017
Thai Nguyen, December 2017
DOCUMENTATION PAGE WITH ABSTRACT
Thai Nguyen University of Agriculture and Forestry
Degree Program: Bachelor of Environmental Science and Management
Student name: Nguyen Thi Quynh Lam
Student ID: DTN1353110372
Thesis Title: Identifying the effect of exposure to dioxin and furans on
human health leading to diffuse large B lymphoma through gene-network construction
Supervisor(s): Prof. Chun-Yu Chuang, Assoc. Prof. Tran Thi Thu Ha
Abstract:
Many studies have indicated that exposure to dioxins and dioxin-like compounds
(e.g., 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and polychlorinated
dibenzofurans (furans)) can induce several long-term health outcomes in humans and
animals. One of them is diffuse large B lymphoma (DLBCL), which is considered
the most common type of lymphoma. In order to identify the gene expression
altered by TCDD and furans that potentially underlies DLBCL development, a
bioinformatics meta-analysis was applied in this study. Ten datasets
containing gene expression information for DLBCL, TCDD and furans were
obtained from the Gene Expression Omnibus (GEO) and Array Express websites and
further analyzed using Cytoscape software and its plugins, ClueGO and CluePedia.
As a result, the most differentially expressed genes were used to construct
gene-networks of DLBCL, TCDD and furans, and hence a potential pathway showing
how dioxins could promote the progression of lymphoma. In addition, the analytical
results indicated that TCDD and furans may activate the receptor AhR, which
promotes the expression of the protein TWIST1 and enhances the progression of DLBCL.
The results of this study contribute to further research on dioxins and can be
regarded as an initial step toward future work on DLBCL diagnosis and
treatment.
Keywords: TCDD, Furans, DLBCL, bioinformatics, GEO, Array Express
Number of pages: 87
Date of Submission: 20/09/2017
Supervisor’s signature
ACKNOWLEDGEMENT
First of all, I would like to use this opportunity to express my deepest gratitude
and special thanks to Prof. Chun-Yu Chuang for her patience in guiding me, keeping
me on the correct path and showing me many wonderful things during my
internship at the Department of Biomedical Engineering and Environmental Science
at National Tsing Hua University.
I would like to express my deep thanks to Assoc. Prof. Tran Thi Thu Ha for
giving me the advice and guidance necessary to complete my thesis.
My sincere thanks also go to all the members of the Department
of Biomedical Engineering and Environmental Science for providing me with all the
materials and necessities for conducting the experiments for my research.
Finally, I would like to thank my family and my friends for encouraging and
advising me during the completion of this thesis.
Thai Nguyen, October 2017
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
PART I INTRODUCTION
1.1 Research rationale
1.2 Research objectives
PART II LITERATURE REVIEW
2.1 Persistent Organic Compounds (POPs)
2.2 Dioxins and dioxin-like compounds
2.3 Lymphoma and non-Hodgkin lymphoma
2.3.1 Diffuse large B lymphoma
2.3.2 SNPs of Diffuse Large B lymphoma
2.4 Gene-network components
2.4.1 Microarray data
2.4.2 Gene network database: Array Express and GEO
2.4.3 Statistical analysis
2.4.4 Hub-proteins
2.4.5 GO term
2.5 Gene Network construction tools
2.5.1 Network Analyst website
2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps
PART III METHODOLOGY
3.1 Data collection
3.2 Data processing
3.3 Network construction
PART IV RESULTS AND DISCUSSION
4.1 Results
4.1.1 Genetic datasets
4.1.2 Differentially expressed genes
4.1.3 Gene-network construction of DLBCL, TCDD and Furans
4.1.4 Protein-protein interaction network of DLBCL, TCDD and Furans
4.1.5 Potential pathway showing the relation between TCDD and Furans and Diffuse Large B lymphoma
4.2 Discussion
4.2.1 AhR-mediated key factor of dioxin-like compounds
4.2.2 Key factors of hypoxia response and the risk of MYC-TP53 interaction
4.2.3 Inhibition of cancer cell apoptosis and tumorigenesis factor in DLBCL
PART V CONCLUSION
REFERENCES
APPENDICES
Appendix 1 Differentially expressed genes of DLBCL versus normal cell
Appendix 2 Differentially expressed genes of the TCDD exposure group versus control group
Appendix 3 Differentially expressed genes of the FURANS exposure group versus control group
Appendix 4 Hub proteins of DLBCL network
Appendix 5 Hub proteins of TCDD network
Appendix 6 Hub proteins of FURANS network
LIST OF FIGURES
Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins
(PCDD) and dibenzofurans (Source: Pereira, 2004)
Figure 2.2: Representative structure of 2,3,7,8-tetrachlorodibenzo-p-dioxin
(TCDD) (Pereira, 2004)
Figure 2.3: A schematic representation of signal transduction after
TCDD/AHR interaction (Fracchiolla et al., 2016)
Figure 3.1: The flowchart of methodology
Figure 4.1: Gene Ontology network showing the relationship of DLBCL,
LIST OF TABLES
Table 4.1: Database of DLBCL
Table 4.2: Database of TCDD and Furans
Table 4.3: Differentially expressed genes, including up- and down-regulated
genes in Diffuse Large B lymphoma compared to normal cells
Table 4.4: Differentially expressed genes, including up- and down-regulated
genes activated by TCDD compared to control group
Table 4.5: Differentially expressed genes, including up- and down-regulated
genes activated by Furans compared to control group
Table 4.6: List of hub proteins contained in the DLBCL, TCDD and Furans
networks
LIST OF ABBREVIATIONS
ABC DLBCL    Activated B-cell like DLBCL
AhR          Aryl hydrocarbon receptor
ARNT         Aryl hydrocarbon receptor nuclear translocator
B-NHL        B cell non-Hodgkin lymphoma
CRE          cAMP response element
DEG          Differentially expressed gene
DLBCL        Diffuse large B cell lymphoma
DNA          Deoxyribonucleic acid
DNMT1        DNA methyltransferase 1
EGFR         Epidermal growth factor receptor
FDR          False discovery rate
GCB DLBCL    Germinal center B-cell like DLBCL
GEO          Gene Expression Omnibus
HAHs         Halogenated aromatic hydrocarbons
MAGE-ML      Microarray and Gene Expression Markup Language
MAGE-TAB     Microarray Gene Expression - Tabular format
MIAME        Minimum Information About a Microarray Experiment
PCDD/Fs      Polychlorinated dibenzo-p-dioxins/furans
PCDDs        Polychlorinated dibenzo-p-dioxins
POPs         Persistent organic compounds
ROS          Reactive oxygen species
SNPs         Single nucleotide polymorphisms
TCDD         2,3,7,8-tetrachlorodibenzo-p-dioxin
TNF          Tumor necrosis factor
XRE          Xenobiotic response element
PART I INTRODUCTION
1.1 Research rationale
Dioxins and dioxin-like compounds are of great concern these days due to
their persistent long-term impacts on humans and animals. TCDD and
furans are representative dioxins and dioxin-like compounds, which can
negatively affect human health even in small amounts through bio-magnification
in the food chain. The most significant impact of these chemicals is genetic
alteration through activation of the aryl hydrocarbon receptor (AhR), which passes
into the cell nucleus and hence induces genetic disease and carcinogenesis.
Diffuse large B lymphoma (DLBCL) is the most prevalent B cell non-Hodgkin
lymphoma, accounting for 40% of lymphoma diagnoses. The exact cause of
DLBCL is unknown; however, many proto-oncogenes and
abnormal genes causing lymphoma have been found in previous studies. It is
essential to identify the biological mechanisms activating those genes and whether
they are related to the impact of dioxins and dioxin-like compounds.
Bioinformatics, including sequence analysis, gene and protein expression analysis,
cellular organization analysis, structural bioinformatics, network and systems biology
and others, contributes substantially to various fields on a global scale.
The application of bioinformatics in biomedicine has received considerable attention
in many developed countries; by contrast, it is still uncommon in Vietnam.
Specifically, many studies have indicated that high-throughput sequencing and
DNA microarray technology play a significant role in identifying
genetic/transcriptomic alterations causing DLBCL and prognostic biomarkers for
lymphoma treatment. Therefore, the activation of those abnormal genes and the
influence of dioxins can be clarified by the application of bioinformatics. In order to
clarify the diagnosis of lymphoma, the study “Identifying the effect of exposure to TCDD
and Furans on human health leading to diffuse large B lymphoma through gene-network
construction” was conducted with the support of the Department of Biomedical
Engineering and Environmental Science of National Tsing Hua University in Taiwan.
1.2 Research objectives
The objectives of this research are:
- To investigate the differentially expressed genes of diffuse large
B lymphoma (DLBCL) tissues and of dioxin-exposed human cell lines, respectively;
- To construct the gene-network for exploring whether exposure to
dioxin can induce DLBCL;
- To identify the potential pathway linking dioxin exposure to
DLBCL.
PART II LITERATURE REVIEW
2.1 Persistent Organic Compounds (POPs)
Persistent organic compounds include a variety of lipophilic compounds that
are resistant to environmental degradation. Among the various kinds of POPs, for
example organochlorine (OC) pesticides and industrial chemicals or by-products, the
chlorine-containing category causes the most deleterious effects; consequently,
these compounds have been banned or strictly regulated in many countries. Despite
that regulation, POPs exposure persists in the general population due to the
consumption of fat derived from animals. The concentration of POPs tends to
increase with the level in the food web (biomagnification); as a result, the
POPs concentration accumulating in human bodies might be higher than that in the
external environment (Fisher et al., 1999). In addition, POPs accumulated in adipose
tissue over a lifetime are considered one route of chronic exposure, since they are
continuously released from adipose tissue to the circulation and to vital organs
with lipid content (La Merrill et al., 2013).
POPs have the following main properties. First, they are
lipophilic compounds that accumulate mainly in lipid-containing
tissues such as adipose tissue and move within the body bound to lipids (Lewis et al.,
2002). In addition, POPs are always present as chemical mixtures in the external
environment due to mixing in the environment and the food web and long-term
retention in fat tissues (Kortenkamp et al., 2008). Therefore, the distinct groups of
OC pesticides, polychlorinated biphenyls (PCBs) and dioxins each refer to the
chemical mixtures of the respective POPs subclass.
2.2 Dioxins and dioxin-like compounds
Polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are classified as
ubiquitous POPs. PCDDs and PCDFs are two of the three subclasses of the halogenated
aromatic hydrocarbons and are referred to as dioxins and dioxin-like compounds,
respectively (see Figure 2.1).
Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins
(PCDD) and dibenzofurans (PCDF)
(Source: Pereira, 2004)
They are widespread in nearly all parts of the environment, including
remote areas. Dioxins and dioxin-like compounds tend to be
persistent and lipophilic in the external environment, so they can
bio-accumulate through food chains and potentially cause adverse effects on
biota and human health. PCDD/Fs are two subclasses of the halogenated
aromatic hydrocarbons (HAHs), which are characterized by the basic aromatic structure of
a benzene ring, a hexagonal carbon structure with conjugated double bonds
connecting the carbons. The difference between dioxins and dioxin-like compounds
lies in the number of oxygen atoms bridging the benzene rings in their structure:
two in dioxins and one in furans.
These compounds share a similar spectrum of toxic effects through binding of dioxins and
dioxin-like compounds to a receptor protein, the Aryl Hydrocarbon Receptor (AHR). The
planar molecular shape facilitates binding to the receptor, and relative potency
depends to a large degree on persistence and on how well the molecule fits the
receptor. PCDD/Fs and one member of the PCDDs, tetrachlorodibenzo-p-dioxin (TCDD),
have a high affinity for AHR and fit that receptor very well. PCDD/Fs are derived
from several main sources, including (1) combustion, (2) metal smelting, refining and
processing, and (3) biological and photochemical processes (US National Research
Council, 2006).
PCDD/Fs have the potential to cause cancer, birth defects, reproductive disorders,
immunotoxicity, and other potential toxic end points, including liver diseases, thyroid
dysfunction, lipid disorders, neurotoxicity, cardiovascular disease, and metabolic
disorders such as diabetes (US National Research Council, 2006).
* 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD):
According to Pereira (2004), 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) is
structured as shown in Figure 2.2.
Figure 2.2: Representative structure of 2,3,7,8-tetrachlorodibenzo-p-dioxin
(TCDD)
(Source: Pereira, 2004)
2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) is one of the most toxic members
of the family of polychlorinated dibenzodioxins (PCDDs) and represents a nearly
ubiquitous environmental contaminant (Pesatori et al., 1993, 2009). TCDD is
a synthesis byproduct of chlorophenol or chlorophenoxy herbicide
manufacturing (Saracci et al., 1991). It can be formed in burning processes along with
other polychlorinated dibenzodioxins and dibenzofurans. In addition, it can be derived
from waste incineration, metal production, and fossil fuel or wood combustion (Deziel et
al., 2012). Dioxins are likely to bioaccumulate in the food chain due to
their long biological half-life and low water solubility; even a small amount of
dioxins can lead to a significant dioxin concentration in the food chain
(Paustenbach et al., 1992). TCDD has been shown to exert its effects by
binding to the dioxin receptor AhR, which has a high affinity for TCDD in many
mammalian species.
AhR is a basic helix-loop-helix/PAS transcription factor located in the cytoplasm,
where it forms a complex with various proteins and lipophilic compounds (Agostinis
et al., 2007). In the cytoplasm, it is associated with pp60, which can bind to the
epidermal growth factor receptor (EGFR) and induce mitogen-activated protein
signaling. In the nucleus, AhR forms a heterodimer with the intranuclear aryl
hydrocarbon receptor nuclear translocator (ARNT); the AhR-ARNT complex promotes
transcription through xenobiotic response elements (XRE) and interacts with several
important pathways, for example, Wnt/beta-catenin, estrogen receptors, retinoblastoma
protein, retinoic acid, NF-kB and the circadian rhythm regulators (Sorg, 2013). AhR
has been shown to be involved in multiple physiological regulations and effects, for
example, altered cell cycle regulation and proliferation. In fact, studies of
TCDD-exposed workers in Sweden and the US reported a similar relationship between
phenoxy herbicide exposure and cancer; in particular, prolonged TCDD exposure is
related to an increased relative risk of non-Hodgkin lymphoma (Hardell et al., 1996).
Besides, 45 million liters of TCDD-contaminated Agent Orange were sprayed over South
Vietnam and Cambodia to destroy vegetation from 1962 to 1971, and the resulting
cancer incidence still persists (Stellman et al., 2003).
Therefore, this study mainly focuses on the potential gene-network
and pathway to investigate how TCDD, the most toxic of the PCDDs, and
furans, a group of dioxin-like compounds, can induce non-Hodgkin lymphoma,
especially diffuse large B lymphoma (Figure 2.3).
Figure 2.3: A schematic representation of signal transduction after TCDD/AHR
interaction
(Source: Fracchiolla et al., 2016)
2.3 Lymphoma and non-Hodgkin lymphoma
Lymphoma is the common name for neoplasms of lymphoid
precursor cells. It was first reported in 1832 by Thomas Hodgkin, and hence
the disease was named Hodgkin’s lymphoma. After that, several kinds of
lymphoma were discovered; however, the disease is mainly divided into 2
subclasses: Hodgkin lymphoma and non-Hodgkin lymphoma. The majority of non-Hodgkin
lymphomas are B cell lymphomas, the rest being T-cell and NK-cell lymphomas.
Lymphoid neoplasms are a highly diverse group of diseases that reflects the diversity of
the immune system (Hussain and Harris, 1998). In Vietnam, the incidence of
non-Hodgkin lymphoma has increased during the last ten years, with 2700 cases recorded
each year (Nguyen, 2015).
2.3.1 Diffuse large B lymphoma
Diffuse large B lymphoma (DLBCL) is considered the most prevalent B cell
non-Hodgkin lymphoma (B-NHL) in adulthood, accounting for 40% of diagnoses.
There are three major subclasses of DLBCL, characterized on the basis of the
molecular heterogeneity of DLBCL: germinal center B-cell like DLBCL
(GCB DLBCL), activated B-cell like DLBCL (ABC DLBCL) and primary
mediastinal B-cell lymphoma. GCB DLBCL is derived from germinal center B cells
and expresses genes characteristic of germinal center B lymphocytes, while ABC
DLBCL expresses genes characteristic of plasma cells and is thought to arise
from B-cells activated for differentiation into plasma cells. Primary mediastinal B cell
lymphoma is thought to originate from rare B-cell populations that reside in the thymus
and has a distinct gene expression profile compared to GCB and ABC DLBCL (Rosenwald,
2003).
2.3.2 SNPs of Diffuse Large B lymphoma
Gene expression profiling and genome sequencing are applied in
order to increase our understanding of DLBCL subclasses and the molecular basis of
chemotherapy resistance, to support the identification of novel molecular DLBCL
subsets and targets for drug interventions, and hence to prevent and treat DLBCL
(Lossos et al., 2006).
The majority of DLBCL can arise from normal antigen-exposed B cells that are
at separate stages of differentiation and undergo clonal expansion in the germinal
centers (GCs) of peripheral lymphoid organs (Martelli et al., 2013). Besides, DLBCL
can evolve and progress through a range of multistep transformation processes.
Specifically, DLBCL can progress slowly or rapidly at different
stages, through clonal evolution or simultaneous and extensive DNA rearrangements
in subclones. Several diverse genetic abnormalities have been observed in relation to
its clinical and genetic (clonal) heterogeneity, including aberrant somatic
hypermutation, nonrandom chromosomal deletions, balanced reciprocal translocations
deregulating the expression of proto-oncogene products such as BCL6, BCL2, REL
or c-MYC, and dysregulated apoptosis or defective DNA repair (Morin et al., 2013).
Several gene mutations causing DLBCL have been identified in several studies.
For example, the primary or early oncogenic events are chromosomal translocations
involving oncogenes such as BCL6, BCL2, REL or c-MYC, whereas a group of genes including
BCL2, PRDM1, CARD11, MyD88, TNFAIP3, CREBBP, TP53, EZH2, MLL2,
MYOM2, PIM1, LYN, CD36, B2M, CD79B, MEF2B, ANKLE2, KDM2B, HNF1B,
NOTCH1/2, DTX1 and MYCCD58 tends to appear in the secondary or late oncogenic
events of clonally represented recurrent mutations or gene alterations (Morin et al.,
2013). In addition, alterations of DNA repair and DNA signaling genes affecting the
DNA repair pathway have been identified in DLBCL tumors, and they
tend to form intermediate cancer driver events in lymphomagenesis. Moreover,
mutation or translocation of BCL6, BCL2, REL or c-MYC can induce overexpression
of proto-oncogene products, whereas genetic lesions and mutations in TNFAIP3,
CARD11, CD79A/B, MYD88 or TRAF2 can activate canonical and non-canonical
NF-kB pathways (Zhang et al., 2015). Furthermore, the most frequent cancer driver
events in DLBCL involve epigenetic reprogramming, triggered by
mutations in genes such as TET1, MLL2, EZH2, MEF2B, EP300 and CREBBP
(Zhang et al., 2013). Therefore, alterations in the expression of proto-oncogene
products and tumor suppressors provide tumor cells with gene expression plasticity,
escape from apoptosis and enhanced growth through constitutive
survival and proliferative signals.
2.4 Gene-network components
2.4.1 Microarray data
DNA microarrays have been used to determine the expression level of a large
number of genes. Microarray platforms for gene expression include single-color and
two-color systems. Affymetrix GeneChip arrays are a widely used single-color platform
for microarray analysis, consisting of probes complementary to a region of
each mRNA transcript, usually at the 3’ end of the transcript. Each probe set consists
of 11 to 20 perfect match (PM) probes, which are typically 25 nucleotides
long, together with an equal number of mismatch (MM) probes, which are identical to
the PM probes except for a single nucleotide substitution in the center of the probe.
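As a small illustration of the PM/MM design described above, the sketch below derives a mismatch probe from a perfect match probe by changing only the central base. Replacing the base with its complement is an assumption made for this example, and the function name `mismatch_probe` and the example sequence are hypothetical; none of this comes from the datasets used in this study.

```python
# Complement table for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_probe: str) -> str:
    """Return an MM probe that differs from the PM probe only at the central base.

    Assumption: the substituted base is the complement of the original one.
    """
    pm_probe = pm_probe.upper()
    if len(pm_probe) % 2 == 0:
        raise ValueError("probe length must be odd to have a single central base")
    center = len(pm_probe) // 2  # index 12 for a 25-nucleotide probe
    swapped = COMPLEMENT[pm_probe[center]]
    return pm_probe[:center] + swapped + pm_probe[center + 1:]


# Hypothetical 25-nucleotide PM probe (not a real array sequence).
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Comparing the signal of each PM probe with that of its MM partner is what allows non-specific hybridization to be estimated for a probe set.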
DNA microarray techniques have been applied to predict DLBCL treatment
success and to explain disease heterogeneity in relation to five clinical features
(age, tumor stage, serum lactate dehydrogenase concentration, performance status,
and number of extranodal disease sites) (Gohlmann and Talloen, 2009). In fact, this
technique is the most widely used to profile the gene expression of an organism on a
whole-genome scale, and has spawned a series of microarray-based expression studies
of DLBCL aiming to refine prognosis using molecular-level information (Segal, 2005).
Besides, DNA microarrays were also used to analyze the changes in human B-cell
gene expression induced by dioxins (Kovalova et al., 2017).
In this study, gene expression profiles representing DLBCL and dioxins
(TCDD and furans) generated by DNA microarray techniques were used for
further analytical steps. The gene expression datasets were collected from two main
databases, Gene Expression Omnibus (GEO) and Array Express,
which are discussed in more detail in the following part.
2.4.2 Gene network database: Array Express and GEO
All of the datasets in this study were derived from the Array Express database and
the Gene Expression Omnibus (GEO) database. Array Express is a public database for
high-throughput functional genomics data, which consists of two distinct parts:
the Array Express Repository and the Array Express Data Warehouse. The Array
Express Repository is a MIAME-supportive public archive of
microarray data, whereas the Array Express Data Warehouse is a database of
gene expression profiles selected from the repository and consistently re-annotated.
The required samples or experiments can be found by experiment attributes, for
example, keywords, species, array platforms, authors, journals or accession numbers.
Gene names, gene properties or gene ontology terms are useful for querying and
visualizing gene expression profiles. The Array Express database is growing rapidly and
includes data from more than 50,000 hybridizations and 1,500,000 individual expression
profiles. MIAME (Minimum Information About a Microarray Experiment), Microarray
and Gene Expression Markup Language (MAGE-ML) and Microarray Gene
Expression - Tabular format (MAGE-TAB) are some of the community
standards supported by Array Express (Parkinson et al., 2007).
The GEO database, maintained by the National Center for Biotechnology Information
(NCBI), is a rich repository of gene expression data generated
by DNA microarray technology. The database is designed to hold both
unprocessed and processed data in a MIAME-supportive form. GEO stores about a billion
gene expression measurements covering a large number of biological phenomena,
derived from over 100 organisms and 1500 laboratories. Several
user-friendly web applications have been developed to increase the utility of these
data and to support effective exploration, query and visualization at the level of
both individual genes and entire studies (Barrett, 2004).
2.4.3 Statistical analysis
2.4.3.1 Meta-analysis
Meta-analysis is a family of statistical techniques for combining results
from several studies, using various kinds of statistics, for example, Fisher’s
statistic and the minimum and maximum statistics. This technique has been applied to
microarray analysis, in particular, to combine different studies for DEG
(differentially expressed gene) detection and to boost the
reliability of results from individual studies (Shen and Tseng, 2010). A
microarray meta-analysis is conducted in seven steps: (1)
identify suitable microarray studies, (2) extract the data from the studies, (3) prepare
the individual datasets, (4) annotate the individual datasets, (5) resolve the relationship
between probes and genes, (6) combine the study-specific estimates and (7) analyze,
present and interpret the results (Ramasamy et al., 2008). Meta-analysis is
beneficial for this study in identifying the DEGs of DLBCL tissues and
dioxin-exposed groups compared to normal tissues and control groups, respectively,
which are the main concern of the next part.
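To make the idea of combining studies concrete, here is a minimal sketch of Fisher's statistic for a single gene, assuming the studies are independent and each one reports a p-value for that gene. The function name `fisher_combined_p` and the example p-values are illustrative only and are not taken from the datasets of this study.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values for one gene with Fisher's method.

    Fisher's statistic is X = -2 * sum(ln p_i); under the null hypothesis it
    follows a chi-square distribution with 2k degrees of freedom, where k is
    the number of studies.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Closed-form upper tail of a chi-square distribution with even (2k)
    # degrees of freedom: exp(-x/2) * sum_{j<k} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Made-up p-values for one gene in three studies.
print(fisher_combined_p([0.01, 0.02, 0.03]))
```

A gene that is moderately significant in every study ends up with a small combined p-value, which is the behaviour that makes this kind of statistic useful when individual studies are small.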
2.4.3.2 False Discovery Rate (FDR)
The false discovery rate (FDR) is the expected fraction of false
rejections among the rejected hypotheses. This method is used in
microarray analysis to estimate the proportion of false positive findings among the
genes selected as differentially expressed (Gohlmann and Talloen,
2009). Although various procedures have been developed to control the FDR, the
method of Benjamini and Hochberg is considered the most popular and was
used in this study. The Benjamini and Hochberg adjustment is calculated with the
formula below:
$$p = \frac{p_i \times m}{\mathrm{order}(p_i)}, \quad i = 1, 2, 3, 4, \ldots, m$$
Where:
p is the P value adjusted by the Benjamini-Hochberg method
p_i is the P value of gene i
order(p_i) is the rank of p_i among all P values sorted in ascending order
m is the total number of genes in the dataset
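The Benjamini-Hochberg adjustment can be sketched in a few lines. This is an illustrative implementation under the usual definition of the procedure (each p-value is multiplied by m and divided by its ascending rank, and the adjusted values are then made monotone and capped at 1); the function name and the example p-values are assumptions for the sketch, not the code or data used in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original gene order."""
    m = len(p_values)
    # Indices of the genes sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted p-values are capped at 1
    # Walk from the largest p-value to the smallest so that a gene never
    # receives a larger adjusted value than a gene ranked after it.
    for rank in range(m, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * m / rank)
        adjusted[gene] = running_min
    return adjusted


# Made-up p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

Genes whose adjusted value falls below the chosen FDR level would then be retained as differentially expressed.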
2.4.3.3 Differential Expression Analysis
Differential gene expression analysis is applied in microarray analysis in order to
find the genes that are differentially expressed. In fact, mutation of a gene or a set of
genes is the main factor inducing abnormal or failed gene expression; for example,
mutations of the p53 tumor suppressor gene can cause cancer. Therefore,
microarray experiments are useful for identifying which genes are differentially expressed
in diseased cells versus normal cells. The comparison between various kinds of “disease”
and “normal” cells provides an opportunity to find multiple target genes
whose up- and down-regulation can be the result of the disease. After that,
drugs targeting specific mutated genes can be developed in order to reduce
their undesirable effects. In addition, differential gene expression has a significant
relationship with gene function and can provide rich information about gene and
protein interactions. Therefore, differentially expressed genes are used in the
reconstruction of gene networks, metabolic pathways and gene annotation (Zhang,
2006). In this study, DEGs are the main components for gene-network construction to
figure out whether dioxins can induce DLBCL.
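The notion of up- and down-regulation can be illustrated with a log2 fold change between disease and normal samples. The sketch below is a simplified example: the cut-off of 1.0 on the absolute log2 fold change, the gene names and the intensity values are all hypothetical, and a real analysis would add a statistical test with multiple-testing correction.

```python
import math
from statistics import mean


def log2_fold_change(disease, normal):
    """log2 ratio of mean expression in disease versus normal samples."""
    return math.log2(mean(disease) / mean(normal))


def classify(disease, normal, threshold=1.0):
    """Label a gene as 'up', 'down' or 'unchanged'.

    threshold is a hypothetical cut-off on the absolute log2 fold change.
    """
    lfc = log2_fold_change(disease, normal)
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Made-up intensities for three placeholder genes: (disease, normal).
toy = {
    "gene_A": ([800.0, 820.0, 780.0], [200.0, 190.0, 210.0]),
    "gene_B": ([100.0, 110.0, 90.0], [400.0, 410.0, 390.0]),
    "gene_C": ([300.0, 310.0, 290.0], [300.0, 295.0, 305.0]),
}
for name, (disease, normal) in toy.items():
    print(name, classify(disease, normal))
```

With a threshold of 1.0, a gene is called up- or down-regulated only when its mean expression at least doubles or halves.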
2.4.4 Hub-proteins
A gene-network consists of various nodes, which are connected by edges.
In molecular biology, nodes represent “genes” or “proteins” and
edges represent molecular interactions; as a result, a gene network represents the
interactions of genes or proteins leading to a variety of biological processes. The
nodes in each network are divided into two distinct types: (1)
highly-connected nodes, or hub-proteins, and (2) poorly-connected nodes, or non-hub
proteins. Hub-proteins are significantly more important than non-hubs since they
ensure the maintenance of the network. It has been indicated that in protein-protein
interaction networks, hubs tend to be essential due to the centrality-lethality rule,
which states that the functional importance of a node increases with its structural
importance in the network; as a result, hubs tend to be related to significant biological
pathways that may result in biological reactions in the human body (He and Zhang, 2006).
In this study, hub-proteins play an important role in observing the potential
pathway from dioxin exposure to DLBCL.
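A simple way to see what "highly connected" means is to count how many distinct partners each node has. The sketch below assumes that hubs are defined by node degree with a user-chosen cut-off; the cut-off value, the function names and the placeholder node names are hypothetical and do not reproduce the networks built in this study.

```python
from collections import Counter


def node_degrees(edges):
    """Count distinct interaction partners per node in an undirected network."""
    # Treat (a, b) and (b, a) as the same interaction and ignore self-loops.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique:
        for node in edge:
            degree[node] += 1
    return degree


def hub_nodes(edges, min_degree):
    """Return the nodes whose degree reaches the (hypothetical) hub cut-off."""
    degree = node_degrees(edges)
    return sorted(node for node, d in degree.items() if d >= min_degree)


# Toy edge list with placeholder protein names.
toy_edges = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P1", "P4"),
    ("P2", "P3"),
    ("P3", "P1"),  # duplicate of ("P1", "P3") in the other direction
]
print(node_degrees(toy_edges))
print(hub_nodes(toy_edges, min_degree=3))
```

Removing a node such as P1 in this toy network disconnects P4 entirely, which is the intuition behind the centrality-lethality rule.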
2.4.5 GO term
The Gene Ontology (GO) mainly consists of terms that are connected
in a hierarchical order. GO terms are associated with gene products and
classify proteins into three distinct groups: (1) molecular functions, (2) biological
processes and (3) cellular components, corresponding to their biological function
(Balakrishnan, 2013). In fact, these functions are summarized from published papers
and uploaded to the GO database, and hence researchers can access this information
through the process of annotation. In addition, the GO database provides the main
annotation source for the analysis of high-throughput datasets, for example,
transcriptomic and proteomic studies, and for the identification of the functions,
pathways or cellular components represented by these datasets (Pavlidis, 2004).
Furthermore, the GO database is used as a pathway-driven analysis tool to identify
risk, since it can be related to single nucleotide polymorphisms (SNPs), which are
useful to inform biomarker identification studies (Holmans, 2009).
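Because GO terms are arranged hierarchically, a gene annotated with a specific term is implicitly associated with all of its more general parent terms. The sketch below illustrates this with a toy hierarchy; the term names are placeholders rather than real GO identifiers, and the `ancestors` helper is a hypothetical function written for this example.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Toy hierarchy with placeholder term names (not real GO identifiers).
# Each key maps a term to its direct parents; a term may have several.
toy_parents = {
    "term_D": ["term_B", "term_C"],
    "term_B": ["term_A"],
    "term_C": ["term_A"],
    "term_A": [],
}

print(sorted(ancestors("term_D", toy_parents)))
```

Enrichment tools rely on this propagation when they group specific terms under broader functional categories.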
2.5 Gene Network construction tools
2.5.1 Network Analyst website
Network Analyst is one of the most basic and user-friendly tools; it combines
all the steps needed to analyze a network and presents the results through a
high-quality visualization system. The website is freely available and is designed
for efficient analysis of protein-protein interaction networks. Its data are derived
from gene expression experiments on various species, mainly human and mouse studies.
Network Analyst is organized around three main steps of network analysis:
identification of significant genes through data processing; network construction,
which covers mapping, building and refining the network; and network analysis and
visualization. Multiple options are provided within each main step (Xia et al.,
2014). In this study, Network Analyst is considered an adequate tool for finding the
most significant DEGs of DLBCL and dioxins, and the hub proteins, for gene-network
construction and potential pathway identification, respectively.
2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps
Cytoscape is open-source software that helps to integrate high-throughput
expression data and other molecular states into a unified conceptual framework. It is
particularly powerful when used in conjunction with databases of protein-protein,
protein-DNA and genetic interactions, which are available for humans and other
organisms (Shannon, 2003).
A large number of enrichment tools and algorithms have been developed to
support data interpretation. ClueGO is a Cytoscape plugin that represents the
biological interpretation of gene lists as functionally grouped terms in the form of
networks and charts. In particular, ClueGO uses the kappa statistic to link the terms
in the network, so that GO terms or pathways are functionally organized. ClueGO is
therefore one of the available Cytoscape plugins for analyzing the relations between
terms and functional groups in biological networks (Bindea et al., 2009).
CluePedia is the second Cytoscape plugin used in this study. It is a useful tool
for searching for new markers that are associated with pathways. With CluePedia,
various kinds of genes, proteins and miRNAs can be connected on the basis of
experimental information before being integrated into a ClueGO network. In addition,
new pathway associations can be revealed by gene, protein and miRNA enrichments. This
Cytoscape plugin is therefore convenient for users and offers powerful visualization
of networks connecting genes, miRNAs or proteins (Bindea et al., 2013).
Cytoscape software and the ClueGO/CluePedia plugins are applied to perform
gene-network reconstruction and to identify the potential pathway, corresponding to
the second and the third objectives of this study.
PART III METHODOLOGY
3.1 Data collection
First, all required microarray datasets were collected from two main websites,
the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/) and Array Express
(http://www.ebi.ac.uk/arrayexpress), which are large public resources of gene
expression data and provide users with flexible data mining tools
(https://academic.oup.com/nar/article/35/suppl_1/D760/1106106/NCBI-GEO-mining-tens-of-millions-of-expression).
To measure the human gene expression pattern of DLBCL and how it relates to
chemical exposure, several datasets were obtained from both websites using the
following keywords: Homo sapiens, DLBCL, TCDD and Furans. The array files containing
processed data were used in this study. In each DLBCL array file, the experimental
samples are normally obtained from various sources. In this study, two types of
samples, normal tissues and DLBCL tissues, were used, and none of them had been
treated with any chemical. All of these data were generated on the Affymetrix array
platform, and the files had to be available from September 2015 to the present. As a
result, a total of 10 microarray datasets were found to fit the scope of this study,
including E-GEOD-12195, 83632, GSE47355, GSE56313, E-GEOD-69844, E-GEOD-69845,
E-GEOD-69849, E-GEOD-69850, E-GEOD-69851.
3.2 Data processing
Data analysis was subsequently performed using Network Analyst, a web-based
tool for network analysis and interactive exploration. The datasets were combined
and divided into three distinct groups, based on the sample sources, in text files
(.txt): (1) DLBCL and normal tissues, (2) control and TCDD, and (3) control and
Furans. In the initial step, the text files were uploaded, the organism was specified
as Homo sapiens, and Official Gene Symbol was chosen as the ID type. ID conversion
was applied immediately after uploading in order to identify the organism type and
to report the number of genes that matched or did not match the chosen ID type. Then,
these files were submitted to the gene annotation step to ensure that the labels are
consistent across all uploaded datasets. After that, the data normalization step was
carried out in order to set an adequate normalization procedure. In this report, no
normalization procedure was set for the DLBCL and normal tissue data, while log2
normalization was applied to the control group and dioxin treatment data in order to
increase the variance at low intensities.
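As a minimal sketch of what a log2 normalization step does to expression values, the following example applies a log2 transform to a few intensities. It is not the Network Analyst implementation: the function name, the offset of 1 (added so that a zero intensity does not produce an undefined logarithm) and the input values are assumptions made for illustration only.

```python
import math


def log2_transform(intensities, offset=1.0):
    """Return log2(x + offset) for each raw intensity.

    The offset is an assumption of this sketch; it keeps log2 defined at 0.
    """
    return [math.log2(x + offset) for x in intensities]


# Hypothetical raw intensities spanning three orders of magnitude.
raw = [0.0, 1.0, 3.0, 1023.0]
print(log2_transform(raw))  # [0.0, 1.0, 2.0, 10.0]
```

The transform compresses the large values far more than the small ones, so differences among weakly expressed genes occupy a larger share of the transformed scale.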
The normalized data were then passed to the differential expression analysis
dialog in order to perform differential gene expression analysis on each individual
dataset, so that the DEGs between DLBCL and normal tissues, and between control and
chemical groups, could be detected. An analysis of variance (ANOVA) was conducted on
each individual dataset, and the p-values were adjusted using the Benjamini-Hochberg
false discovery rate (FDR), which makes it possible to decide whether or not a gene
is differentially expressed; the cut-off was set to 0.05 in the DE analysis dialog of
the Network Analyst website. After the data summarization step, the “directed merge”
method was applied in the meta-analysis step to the 4 datasets of DLBCL, 5 datasets
of TCDD and 1 dataset of Furans (Table 4.1.1, Table 4.1.2) in order to merge the
datasets into a single dataset for analysis.
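The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic textbook version written for illustration, not the code used by Network Analyst; the p-values are hypothetical, and treating "adjusted p-value ≤ 0.05" as significant is an assumption about how the cut-off is applied.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(p)
significant = [q <= 0.05 for q in adj]
print(adj)
print(significant)
```

In this toy input the third and fourth genes have raw p-values below 0.05 but adjusted values slightly above it, which shows why the adjusted value, and not the raw one, decides whether a gene is reported as a DEG.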
Finally, three distinct result tables containing the top-ranking DEGs and
relevant statistics (CombineLogFC, adjusted P value) for DLBCL, TCDD and Furans were
exported separately (Appendix 1, 2, 3).
3.3 Network construction
The DEGs obtained above for DLBCL, TCDD and Furans were screened by |fold
change| thresholds of 2.0, 1.2 and 1.2 (|Combine LogFC| ≥ 0.26), respectively, in
order to filter the top up-regulated and down-regulated genes, which were used in
further analysis steps: (1) Gene Ontology analysis and (2) gene network
reconstruction. The GO biological pathways of these expressed genes can easily be
found by submitting the DEG list of each group to the Cytoscape plug-in ClueGO. The
resulting gene networks show the biological pathways of the DEGs involved. In
addition, these DEGs were loaded into the ClueGO app in Cytoscape to reconstruct the
DLBCL, TCDD and Furans gene networks. The kappa score threshold can be adjusted on a
positive scale from 0 to 1 in order to restrict the network connectivity in a
customized way and to create the functional groups of genes (Bindea et al., 2009); a
kappa score of 0.4 was chosen in this study to create these sub-networks. The three
sub-networks of DLBCL, TCDD and Furans were then merged into a single network, thus
providing a potential pathway showing the effect of TCDD and Furans on human health
leading to DLBCL.
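To illustrate how a kappa score can link two terms, the sketch below computes Cohen's kappa from the gene membership of each term and keeps the term pairs that reach the 0.4 threshold used in this study. It is a simplified illustration and not ClueGO's own code; the gene universe, the term names and the gene sets are hypothetical.

```python
from itertools import combinations


def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement of two terms on gene membership.

    Assumes a non-empty universe; genes outside it are ignored.
    """
    u = set(universe)
    a = set(genes_a) & u
    b = set(genes_b) & u
    n = len(u)
    both = len(a & b)            # genes annotated to both terms
    neither = n - len(a | b)     # genes annotated to neither term
    observed = (both + neither) / n
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: kappa is undefined, report no link
    return (observed - expected) / (1.0 - expected)


# Hypothetical universe of ten genes and three toy terms.
universe = [f"G{i}" for i in range(10)]
terms = {
    "term_1": {"G0", "G1", "G2", "G3"},
    "term_2": {"G0", "G1", "G2", "G4"},
    "term_3": {"G7", "G8"},
}

THRESHOLD = 0.4  # kappa score level chosen in this study
links = [
    (s, t)
    for s, t in combinations(sorted(terms), 2)
    if kappa_score(terms[s], terms[t], universe) >= THRESHOLD
]
print(links)  # [('term_1', 'term_2')]
```

Raising the threshold towards 1 keeps only terms that share almost all of their genes, which yields a sparser network with smaller functional groups; lowering it has the opposite effect.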
To clarify the potential pathway between TCDD/Furans exposure and DLBCL, a
protein-protein interaction network was constructed in order to identify the hub
genes, which may have a vital function and be indirectly involved in many biological
processes (Raman et al., 2013). All filtered DEGs of DLBCL, TCDD and Furans were
submitted individually to the Network Analyst website, so that each group produced
its own protein-protein interaction network. The hub proteins of DLBCL and of the two
types of dioxin-related compounds with the highest values of (1) node degree and
(2) node betweenness were then characterized and summarized for the next step of
pathway analysis. The list of hub proteins and the additional target genes were
loaded into the CluePedia app in Cytoscape in order to show the potential pathway by
which TCDD and Furans lead to DLBCL. The directed edges chosen for pathway network
construction were of two types, gene activation and gene expression, in order to
build a pathway network showing how dioxin-related compounds can lead to DLBCL in
the human body.
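The two hub criteria named above, node degree and node betweenness, can be sketched for a small undirected network as follows. This is an illustrative implementation using Brandes' algorithm for unweighted graphs, not the procedure run by Network Analyst; the five-node network and its node names are hypothetical.

```python
from collections import deque


def degree(graph):
    """Number of direct interaction partners of each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalized betweenness (Brandes) for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                     # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected shortest path was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical interaction network given as an adjacency list.
network = {
    "HUB": ["A", "B", "C"],
    "A": ["HUB"],
    "B": ["HUB"],
    "C": ["HUB", "D"],
    "D": ["C"],
}
deg = degree(network)
btw = betweenness(network)
ranked = sorted(network, key=lambda v: (deg[v], btw[v]), reverse=True)
print(ranked[0], deg[ranked[0]], btw[ranked[0]])  # HUB 3 5.0
```

Degree counts only direct partners, whereas betweenness counts how many shortest paths between other nodes pass through a node, so a protein with few partners can still rank highly if it bridges otherwise separate parts of the network.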
In this research, all the steps undertaken are assembled in the following
flowchart (Figure 3.1) for better illustration.
Figure 3.1: The flowchart of methodology
[Flowchart: DATA COLLECTION (datasets available after September 2015); DATA PROCESSING in Network Analyst (ID conversion, annotation, normalization, DE analysis), yielding differentially expressed genes; NETWORK CONSTRUCTION (gene network presenting GO, protein-protein interaction network, potential pathway)]
Table 4.1: Database of DLBCL
Name | Data source (website) | Species | Sample source (type of tissues) | Normal tissues | DLBCL | Total samples | Array platform
E-GEOD-12195 | Array Express | Homo sapiens | Fresh frozen tissue, normal tonsil | | | |
 | | Homo sapiens | Fresh frozen tissue | | | |
GSE473 | | Homo sapiens | Lymph node tissues of DLBCL patients | | | |
GSE563 | | Homo sapiens | Lymph node tissues of DLBCL patients | | | |
Name | Data source (website) | Species | Sample source | Control | Treatment | Total | Array platform
E-GEOD-69844 | Array Express | Homo sapiens | HepaRG Hepatocytes | 45 | 7 | 52 | Affymetrix
E-GEOD-69845 | Array Express | Homo sapiens | MCF7 Breast Adenocarcinoma | | | |
 | | Homo sapiens | Ishikawa Endometrial Adenocarcinoma Cell Line | | | |
 | | Homo sapiens | HepG2 Human Hepatocyte Carcinoma Cell Line | | | |
 | | Homo sapiens | HepaRG Hepatocyte Carcinoma Cell Line | | | |
 | | Homo sapiens | Expression profiles of HepG2 cells treated with furans | | | |
4.1.2 Differentially expressed genes
Using Network Analyst and the Benjamini-Hochberg FDR statistical method, a total of 1288 DEGs were obtained after screening by |fold change| ≥ 1.2 (|Combine LogFC| ≥ 0.26), including 488 DEGs of DLBCL, 288 DEGs of TCDD and 512 DEGs of Furans. The numbers of up-regulated genes of DLBCL,
TCDD and Furans were 316, 268 and 217, respectively, and the numbers of down-regulated genes were 172, 20 and 295, respectively (Tables 4.3, 4.4, 4.5).
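The two equivalent cut-offs quoted above are related by a base-2 logarithm, since log2(1.2) ≈ 0.26. The following minimal sketch shows this conversion and how a result table could be split into up- and down-regulated genes. The gene names, the values in the table and the function name are hypothetical, and applying the FDR cut-off as "adjusted p-value ≤ 0.05" is an assumption of this sketch.

```python
import math


def split_degs(table, fold_change=1.2, fdr=0.05):
    """Split genes into up- and down-regulated lists.

    table maps gene -> (combined log2 fold change, adjusted p-value).
    """
    cutoff = math.log2(fold_change)  # fold change 1.2 -> about 0.26
    up = [g for g, (lfc, q) in table.items() if q <= fdr and lfc >= cutoff]
    down = [g for g, (lfc, q) in table.items() if q <= fdr and lfc <= -cutoff]
    return up, down


# Hypothetical result table used only for demonstration.
table = {
    "GENE_A": (0.90, 0.001),   # passes both cut-offs, up-regulated
    "GENE_B": (-0.40, 0.020),  # passes both cut-offs, down-regulated
    "GENE_C": (0.10, 0.001),   # significant but change too small
    "GENE_D": (1.50, 0.200),   # large change but not significant
}
up, down = split_degs(table)
print(round(math.log2(1.2), 2))  # 0.26
print(up, down)                  # ['GENE_A'] ['GENE_B']
```

A gene is counted only if it passes both the significance cut-off and the fold-change cut-off, which is why GENE_C and GENE_D are excluded in this toy example.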
Table 4.3: Differentially expressed genes, including up- and down-regulated genes,
in diffuse large B lymphoma compared to normal cells
DLBCL (488 DEGs)
Up-regulated genes
COMMD8, NDUFS3, MFSD1, VAMP8, HSBP1, HSD17B10, LSM1, RRM1, RSL24D1, C14orf2, PDZD11, POP5, PSMD14, APEX1, ACTR10, MRPL33, NDUFA8, DDX39A, TMEM147, IGBP1, DCTPP1, IMPDH2, RRM2, MRPL18, POLR2H, PSMD1, TIMM10, MRPL27, YWHAG, FIS1, DDX23, SNAPIN, BLOC1S2, RCC2, SAT1, PTRHD1, CDK2AP1, TAF7, BCL2A1, GLA, ALYREF, CD19, COX14, CD3D, SLAMF8, CMC2, S100A11, PRDX1, C19orf70, MRPS33, NDUFB5, TMEM126A, DNAJB11, PRR13, EMG1, NAE1, ADSL, EVI2A, MRPL15, MRFAP1, GZMA, EIF2A, GYG1, ISG15, GJA1, MLH1, PARP1, RACGAP1, SNRPB, MMADHC, CBX1, MRFAP1L1, ACADM, RPL36AL, CCDC12, STARD3NL, CETN3, HEXB, CEBPB, ISCU, LSM6, DNMT1, PSMC5, MAGED1, NAA20, TSPO, MORF4L1, PDLIM1, DDIT4, EIF4H, PSMB10, ITPA, MRPS28, PSMC4, BLOC1S1, CD52, UBE2A, ATP5L, RPS19BP1, SEC13, SEC11A, THOC7, NOC3L, DKFZP586I1420, ZNF121, MRPS35, DIABLO, OCIAD2, PSMD8, ARPC5L, MS4A1, MCM2, MRPL40, GCH1, PSMD10, DNTTIP2, OAT, NSMCE1, TBCB, C14orf119, ACP5, PPM1G,
POLR2K, TSG101, PEA15, MRPL49, NIT2, ATIC, PPP2CB, NCBP2, RABAC1, DRG1, NUP107, TCF4, SLC25A19, UFC1, CIB1, BIRC2, NDUFB10, RBBP8, SNX3, SMNDC1, HDHD2, ETF1, RAD23A, MYBL2, SRRM1, TIMMDC1, COX5B, LYRM1, IL18, ARHGAP17, IRF2BPL, NONO, TM2D2, MFAP1, ITGA3, KCTD12, NUPR1, HAT1, AP3S1, MANF, TMEM14B, CPSF4, PPIH, MIEN1, MTIF2, FAM50A, LRRC47, PAPSS1, GLO1, CCNG1, RPIA, ASNSD1, LYPLA1, WDR83OS, CUTA, DAZAP1, AP1S2, BTBD1, VPS25, BCL11A, MT1E, ZNHIT3, EIF3I, RPL11, S100A8, ANXA2, PPIL3, GLRX, ENOPH1, IER5, CISD1, HAUS1, DRAM1, DDX21, SNRPD3, UBE2L6, TMEM138, RPF2, DUT, GTF3C6, TSPAN13, ITM2A, PPP1R7, PIH1D1, GTF2B, CDK5RAP3, TMEM208, DBF4, GTF3A, RFC4, IER3, YTHDF2, FIBP, TIMM8B, MPLKIP, VPS28, LAGE3, CLIC1, HARS, IMP3, CS, CEBPZ, RFX5, DNAJB1, MRPL16, CSRP1, ORMDL2, PIGP, CDKN1A, NMI, FAM35A, TNFAIP3, PCMT1, EBPL, TUBB6, GBP1, PLOD1, TUBA1C, REEP5, EIF2S1, MRPL1, IMP4, SNRPA, MARCKSL1, DYNLT3, UBE2E2, SCAMP3, POLR3GL, CUEDC2
Down-regulated genes (172)
DUSP6, CYTH4, LCP2, SIRPB1, ITGB2, CORO1A, RAB7A, COX7A2L, MEFV, ANPEP, C5AR1, ZYX, DOCK5, STEAP4, GRK6, MSL2, PLXNC1, STK17B, PYGL, CD3E, KCNJ15, SCIMP, CAPNS1, GLIPR1, CPPED1, IST1, LILRA1, PRKAR1A, ARRB2, WDR1, ARHGAP26, DUSP1, WIPF1, MXD1, BSG, CELF2, GNAQ, ZFAND5, MBOAT7, GABARAP, MBNL1, AOAH, CTSS, DOK3, HIST1H1E, CYP4F3, PTBP3, NCF2, RNASET2, TCP11L2, MAPK1, PIP4K2A, STAT3, DOCK8, TLN1, TGFBR2, SELPLG, PGK1, FPR1, SDHA, SMCHD1, MOB3A, DDX17, TUBB1, GUK1, LYN, CD37, ETS1, CCNI, STK38, ATP6V1B2, CAP1, PDZK1IP1, HBB, EPB41, TREM1, PTAFR, GNAS, FFAR2, RPL18, IL7R, EIF4EBP2, SLC44A2, HLA-DPA1, LITAF, ITM2B, CXCR2, CYBB, CFL1, LCP1, ALAS2, PTPRC, CSF3R, ARHGDIB, AQP9, DAZAP2, SLC6A6, B2M, SMAP2, BCL2L1, SORL1, RAC2, FBXO7, PSAP, FCN1, ND5,
SLC25A37, TNFRSF10C, TMBIM6, CD74, HLA-E, SLC25A39, DCAF12, CX3CR1, RHOA, CD53, XPO6, TAGLN2, FCGR2A, MSN, LYZ, LAPTM5, MALAT1, TXNIP, ACTB
Table 4.4: Differentially expressed genes, including up- and down-regulated genes,
activated by TCDD compared to the control group