THAI NGUYEN UNIVERSITY
UNIVERSITY OF AGRICULTURE AND FORESTRY
NGUYEN THI QUYNH LAM
IDENTIFYING THE EFFECT OF EXPOSURE TO DIOXIN AND FURAN
ON HUMAN HEALTH LEADING TO DIFFUSE LARGE B LYMPHOMA
THROUGH GENE-NETWORK CONSTRUCTION
BACHELOR THESIS
Study Mode: Full-time
Major: Environmental Science and Management
Faculty: International Programs Office
Batch: 2013 - 2017
Thai Nguyen, December 2017
DOCUMENTATION PAGE WITH ABSTRACT
Thai Nguyen University of Agriculture and Forestry
Degree Program: Bachelor of Environmental Science and Management
Student name: Nguyen Thi Quynh Lam
Thesis Title: Identifying the effect of exposure to dioxin and furans on human health leading to diffuse large B lymphoma through gene-network construction
Supervisor(s): Prof. Chun-Yu Chuang, Assoc. Prof. Tran Thi Thu Ha
Abstract:
Many studies have indicated that long-term exposure to dioxins and dioxin-like compounds, e.g., 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and polychlorinated dibenzofurans (furans), can induce several adverse outcomes in humans and animals. One of them is diffuse large B lymphoma (DLBCL), which is considered the most common kind of lymphoma. To identify the gene expression changes induced by TCDD and furans that potentially underlie DLBCL development, a bioinformatics meta-analysis was applied in this study. Ten datasets containing gene expression information for DLBCL, TCDD and furans were obtained from the Gene Expression Omnibus (GEO) and Array Express websites and further analyzed using Cytoscape software and its plugins, ClueGO and CluePedia.
As a result, the most differentially expressed genes were identified and used to construct networks of DLBCL, TCDD and furans, from which a potential pathway was derived showing how dioxins could cause the progression of lymphoma. In addition, the gene-network analysis indicated that TCDD and furans may induce the receptor AhR, which promotes the appearance of the protein TWIST1 and enhances the progression of DLBCL. The results of this study contribute to further research on dioxins and may serve as an initial step for future work on DLBCL diagnosis and treatment.
Keywords: TCDD, Furans, DLBCL, bioinformatics, GEO, Array Express
Number of pages: 87
Date of Submission: 20/09/2017
Supervisor’s signature
ACKNOWLEDGEMENT
First of all, I would like to take this opportunity to express my deepest gratitude and special thanks to Prof. Chun-Yu Chuang for her patience in guiding me, keeping me on the correct path and showing me many wonderful things during my internship at the Department of Biomedical Engineering and Environmental Science at National Tsing Hua University.
I would like to express my deep thanks to Assoc. Prof. Tran Thi Thu Ha for giving me the necessary advice and guidance to complete my thesis.
My sincere thanks also go to all the members of the Department of Biomedical Engineering and Environmental Science for providing me with the materials and necessities for conducting my research.
Finally, I would like to thank my family and my friends for encouraging and advising me during the completion of this thesis.
Thai Nguyen, October 2017
Nguyen Thi Quynh Lam
TABLE OF CONTENTS
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
PART I INTRODUCTION
1.1 Research rationale
1.2 Research objectives
PART II LITERATURE REVIEW
2.1 Persistent Organic Pollutants (POPs)
2.2 Dioxins and dioxin-like compounds
2.3 Lymphoma and non-Hodgkin lymphoma
2.3.1 Diffuse large B lymphoma
2.3.2 SNPs of Diffuse Large B lymphoma
2.4 Gene-network components
2.4.1 Microarray data
2.4.2 Gene network database: Array Express and GEO
2.4.3 Statistical analysis
2.4.4 Hub proteins
2.4.5 GO term
2.5 Gene Network construction tools
2.5.1 Network Analyst website
2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps
PART III METHODOLOGY
3.1 Data collection
3.2 Data processing
3.3 Network construction
PART IV RESULTS AND DISCUSSION
4.1 Results
4.1.1 Genetic datasets
4.1.2 Differentially expressed genes
4.1.3 Gene-network construction of DLBCL, TCDD and Furans
4.1.4 Protein-protein interaction network of DLBCL, TCDD and Furans
4.1.5 Potential pathway showing the relation between TCDD and Furans and Diffuse Large B lymphoma
4.2 Discussion
4.2.1 AhR-mediated key factor of dioxin-like compounds
4.2.2 Key factors of hypoxia response and the risk of MYC-TP53 interaction
4.2.3 Inhibition of cancer cell apoptosis and tumorigenesis factor in DLBCL
PART V CONCLUSION
REFERENCES
APPENDICES
Appendix 1 Differentially expressed genes of DLBCL versus normal cell
Appendix 2 Differentially expressed genes of exposure to TCDD group versus control group
Appendix 3 Differentially expressed genes of exposure to FURANS group versus control group
Appendix 4 Hub proteins of DLBCL network
Appendix 5 Hub proteins of TCDD network
Appendix 6 Hub proteins of FURANS network
LIST OF FIGURES
Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins (PCDD) and dibenzofurans (Source: Pereira, 2004)
Figure 2.2: Representative structure of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) (Pereira, 2004)
Figure 2.3: A schematic representation of signal transduction after TCDD/AHR interaction (Fracchiolla et al., 2016)
Figure 3.1: The flowchart of methodology
Figure 4.1: Gene Ontology network showing the relationship of DLBCL,
LIST OF TABLES
Table 4.1: Database of DLBCL
Table 4.2: Database of TCDD and Furans
Table 4.3: Differentially expressed genes, including up- and down-regulated genes in Diffuse Large B lymphoma compared to normal cells
Table 4.4: Differentially expressed genes, including up- and down-regulated genes activated by TCDD compared to control group
Table 4.5: Differentially expressed genes, including up- and down-regulated genes activated by Furans compared to control group
Table 4.6: Lists of hub proteins contained in DLBCL, TCDD and Furans networks
LIST OF ABBREVIATIONS
ABC DLBCL    Activated B-cell like DLBCL
ARNT         Aryl hydrocarbon receptor nuclear translocator
GCB DLBCL    Germinal center B-cell like DLBCL
MAGE-ML      Microarray and Gene Expression Markup Language
MAGE-TAB     Microarray Gene Expression - Tabular format
MIAME        Minimum Information About Microarray Experiment
PCDD/Fs      Polychlorinated dibenzo-p-dioxins/furans
XRE          Xenobiotic response elements
PART I INTRODUCTION
1.1 Research rationale
Dioxins and dioxin-like compounds are of great concern these days because of their lasting impacts on humans and animals over the long term. TCDD and furans are representative dioxins and dioxin-like compounds, which can negatively affect human health even in small amounts through biomagnification in the food chain. The most significant impact of these chemicals is genetic alteration through activation of the aryl hydrocarbon receptor (AhR), which then passes into the cell nucleus and hence induces genetic disease and carcinogenesis.
Diffuse large B lymphoma (DLBCL) is the most prevalent B-cell non-Hodgkin lymphoma, accounting for 40% of lymphoma diagnoses. The exact cause of DLBCL is unknown; however, many proto-oncogenes and abnormal genes causing lymphoma have been found in previous studies. It is essential to identify the biological mechanisms that activate those genes and whether they are related to the impact of dioxins and dioxin-like compounds. Bioinformatics, including sequence analysis, gene and protein expression analysis, cellular organization analysis, structural bioinformatics, network and systems biology and others, contributes to various fields on a global scale.
The application of bioinformatics in biomedicine has received much attention in many developed countries; by contrast, it is still uncommon in Vietnam. Specifically, many studies have indicated that high-throughput sequencing and DNA microarray technology play a significant role in identifying genetic/transcriptomic alterations causing DLBCL and prognostic biomarkers for lymphoma treatment. Therefore, the activation of those abnormal genes and the influence of dioxins can be clarified by applying bioinformatics. In order to clarify the diagnosis of lymphoma, the study “Identifying the effect of exposure to TCDD and Furans on human health leading to diffuse large B lymphoma through gene-network construction” was conducted with the support of the Department of Biomedical Engineering and Environmental Science of National Tsing Hua University in Taiwan.
1.2 Research objectives
The objectives of this research are:
- To investigate the differentially expressed genes of diffuse large B lymphoma (DLBCL) tissues and of dioxin-exposed human cell lines, respectively;
- To construct the gene network for exploring whether exposure to dioxin can induce DLBCL;
- To identify the potential pathway linking exposure to dioxin with DLBCL.
PART II LITERATURE REVIEW
2.1 Persistent Organic Pollutants (POPs)
Persistent organic pollutants (POPs) include a variety of lipophilic compounds that resist environmental degradation. Among the various kinds of POPs, for example organochlorine (OC) pesticides and industrial chemicals or their by-products, those containing chlorine atoms tend to cause the most deleterious effects and, as a consequence, have been banned or strictly regulated in many countries. Despite that regulation, exposure of the general population to POPs persists because of the consumption of animal-derived fats. POP concentrations tend to increase with trophic level in food webs (biomagnification); as a result, the concentration of POPs accumulated in the human body may be higher than in the external environment (Fisher et al., 1999). In addition, POPs accumulated in adipose tissue over a lifetime are considered one route of chronic exposure, since they are continuously released from adipose tissue into the circulation and lipid-containing vital organs (La Merrill et al., 2013).
POPs have the following main properties. First, they are lipophilic compounds that accumulate mainly in lipid-containing tissues such as adipose tissue and move within the body bound to lipids (Lewis et al., 2002). In addition, POPs always occur as chemical mixtures in the external environment, owing to mixing in the environment and food web and long-term retention in fat tissues (Kortenkamp et al., 2008). Therefore, the distinct groups of OC pesticides, polychlorinated biphenyls (PCBs) and dioxins are each classified as chemical mixtures of the respective POP subclass.
2.2 Dioxins and dioxin-like compounds
Polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are classified as ubiquitous POPs. PCDDs and PCDFs are two of the three subclasses of halogenated aromatic hydrocarbons (HAHs) and are referred to as dioxins and dioxin-like compounds, respectively (see Figure 2.1).
Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins (PCDD) and dibenzofurans (PCDF)
(Source: Pereira, 2004)
They are widespread in almost every part of the environment, including remote areas. Dioxins and dioxin-like compounds tend to be persistent and lipophilic in the external environment, so they can bioaccumulate through food chains and potentially affect biota and human health. HAHs are characterized by the basic aromatic structure of a benzene ring, a hexagonal carbon structure with conjugated double bonds connecting the carbon atoms. The difference between dioxins and dioxin-like compounds lies in the number of oxygen atoms in their central ring structure, two and one respectively. They produce a similar spectrum of toxic effects through binding of dioxins and dioxin-like compounds to a receptor protein, the aryl hydrocarbon receptor (AHR). The planar shape of the molecule facilitates binding to the receptor, and the relative potency of a compound depends to a large degree on its persistence and how well it fits the receptor. PCDD/Fs, and in particular the PCDD congener tetrachlorodibenzo-p-dioxin (TCDD), have a high affinity for AHR and fit the receptor very well. PCDD/Fs are derived from several main sources, including (1) combustion, (2) metal smelting, refining and processing, and (3) biological and photochemical processes (US National Research Council, 2006).
PCDD/Fs have the potential to cause cancer, birth defects, reproductive disorders, immunotoxicity, and other toxic end points, including liver diseases, thyroid dysfunction, lipid disorders, neurotoxicity, cardiovascular disease, and metabolic disorders such as diabetes (US National Research Council, 2006).
2,3,7,8-Tetrachlorodibenzo-p-dioxin (TCDD) is one of the most toxic members of the family of polychlorinated dibenzodioxins (PCDDs) and is a nearly ubiquitous environmental contaminant (Pesatori et al., 1993, 2009). TCDD is a by-product of the synthesis of chlorophenols and chlorophenoxy herbicides (Saracci et al., 1991). It can be formed in burning processes along with other polychlorinated dibenzodioxins and dibenzofurans. In addition, it can be derived from waste incineration, metal production, and fossil fuel or wood combustion (Deziel et al., 2012). Dioxins tend to bioaccumulate in the food chain because of their long biological half-life and low water solubility; even a small amount of dioxins can lead to significant dioxin concentrations in the food chain (Paustenbach et al., 1992). TCDD has been shown to exert its effects by binding to the dioxin receptor AhR, which has an affinity for TCDD in many mammalian species.
AhR is a basic helix-loop-helix/PAS transcription factor located in the cytoplasm, where it forms a complex with various proteins and lipophilic compounds (Agostinis et al., 2007). In the cytoplasm, it is associated with pp60, which can bind to the epidermal growth factor receptor (EGFR) and induce mitogen-activated protein signaling. In the nucleus, AhR forms a heterodimer with the intranuclear aryl hydrocarbon receptor nuclear translocator (ARNT); the AhR-ARNT complex promotes transcription at xenobiotic response elements (XRE) and interacts with several important pathways, for example Wnt/beta-catenin, estrogen receptors, retinoblastoma protein, retinoic acid, NF-kB and the circadian rhythm regulators (Sorg, 2013). AhR has been shown to be involved in multiple physiological regulations and effects, for example altered cell cycle regulation and proliferation. Studies of workers exposed to TCDD in Sweden and the US made similar observations of a relationship between phenoxy herbicide exposure and cancer; in particular, prolonged TCDD exposure was related to an increased relative risk of non-Hodgkin lymphoma (Hardell et al., 1996). Besides, 45 million liters of Agent Orange contaminated with TCDD were sprayed over South Vietnam and Cambodia to destroy vegetation from 1962 to 1971, leading to cancer cases that are still occurring (Stellman et al., 2003).
Therefore, this study focuses on the potential gene network and pathway in order to investigate how TCDD, the most toxic PCDD, and furans, a group of dioxin-like compounds, can induce one of the common non-Hodgkin lymphomas, namely diffuse large B lymphoma (Figure 2.3).
Figure 2.3: A schematic representation of signal transduction after TCDD/AHR interaction
(Source: Fracchiolla et al., 2016)
2.3 Lymphoma and non-Hodgkin lymphoma
Lymphoma is the well-known name for neoplasms of lymphoid precursor cells. It was first reported in 1832 by Thomas Hodgkin, and the disease was hence named Hodgkin’s lymphoma. Several kinds of lymphoma were subsequently discovered; however, the disease is divided mainly into two classes: Hodgkin lymphoma and non-Hodgkin lymphoma. The majority of non-Hodgkin lymphomas are B-cell lymphomas, the remainder being T-cell and NK-cell lymphomas. Lymphoid neoplasms are a highly diverse group of diseases and reflect the diversity of the immune system (Hussain and Harris, 1998). In Vietnam, the incidence of non-Hodgkin lymphoma has increased during the last ten years, with 2700 cases recorded each year (Nguyen, 2015).
2.3.1 Diffuse large B lymphoma
Diffuse large B lymphoma (DLBCL) is the most prevalent B-cell non-Hodgkin lymphoma (B-NHL) in adulthood, accounting for 40% of diagnoses. There are three major subclasses of DLBCL, characterized on the basis of molecular heterogeneity: germinal center B-cell like DLBCL (GCB DLBCL), activated B-cell like DLBCL (ABC DLBCL) and primary mediastinal B-cell lymphoma. GCB DLBCL is derived from germinal center B cells and expresses genes characteristic of germinal center B lymphocytes, while ABC DLBCL expresses genes characteristic of plasma cells and is thought to arise from B cells activated for differentiation into plasma cells. Primary mediastinal B-cell lymphoma is thought to arise from rare B-cell populations that reside in the thymus and has a gene expression profile distinct from GCB and ABC DLBCL (Rosenwald, 2003).
2.3.2 SNPs of Diffuse Large B lymphoma
Gene expression profiling and genome sequencing are applied to increase our understanding of DLBCL subclasses and the molecular basis of chemotherapy resistance, and to support the identification of novel molecular DLBCL subsets and targets for drug intervention, and hence to prevent and treat DLBCL (Lossos et al., 2006).
The majority of DLBCL arise from normal antigen-exposed B cells that are at separate stages of differentiation and undergo clonal expansion in the germinal centers (GCs) of peripheral lymphoid organs (Martelli et al., 2013). Besides, DLBCL can develop and progress through a range of multistep transformation processes. Specifically, DLBCL can progress slowly or rapidly depending on the stage, through clonal evolution or through simultaneous and extensive DNA rearrangements in subclones. Several diverse genetic abnormalities have been observed, reflecting the clinical and genetic (clonal) heterogeneity of the disease, including aberrant somatic hypermutation, nonrandom chromosomal deletions, balanced reciprocal translocations deregulating the expression of proto-oncogene products such as BCL6, BCL2, REL or c-MYC, and dysregulated apoptosis or defective DNA repair (Morin et al., 2013).
Several gene mutations causing DLBCL have been identified in a number of studies. For example, the primary or early oncogenic events are chromosomal translocations involving oncogenes such as BCL6, BCL2, REL or c-MYC, whereas a group including BCL2, PRDM1, CARD11, MyD88, TNFAIP3, CREBBP, TP53, EZH2, MLL2, MYOM2, PIM1, LYN, CD36, B2M, CD79B, MEF2B, ANKLE2, KDM2B, HNF1B, NOTCH1/2, DTX1 and MYCCD58 tends to appear in the secondary or late oncogenic events of clonally represented recurrent mutations or gene alterations (Morin et al., 2013). In addition, alterations of DNA repair and DNA signaling genes affecting the DNA repair pathway have been identified in DLBCL tumors, and they tend to form intermediate cancer driver events in lymphomagenesis. Moreover, mutation or translocation of BCL6, BCL2, REL or c-MYC can induce overexpression of proto-oncogene products, whereas genetic lesions and mutations in TNFAIP3, CARD11, CD79A/B, MYD88 or TRAF2 can activate the canonical and non-canonical NF-kB pathways (Zhang et al., 2015). Furthermore, the most frequent cancer driver events in DLBCL involve epigenetic reprogramming triggered by mutations in genes such as TET1, MLL2, EZH2, MEF2B, EP300 and CREBBP (Zhang et al., 2013). Therefore, alterations in the expression of proto-oncogene products and tumor suppressors provide tumor cells with gene expression plasticity, escape from apoptosis and enhanced growth through constitutive survival and proliferative signals.
2.4 Gene-network components
2.4.1 Microarray data
DNA microarrays are used to determine the expression level of a large number of genes. Microarray platforms for gene expression include single-color and two-color systems. Affymetrix GeneChip arrays are a widely used single-color platform for microarray analysis; they consist of probes complementary to a region of each mRNA transcript, usually at the 3’ end of the transcript. Each probe set consists of 11 to 20 perfect match (PM) probes, typically 25 nucleotides long, together with an equal number of mismatch (MM) probes, which are identical to the PM probes except for a single nucleotide substitution in the center of the probe.
DNA microarray techniques have been applied to predict DLBCL treatment success and to explain disease heterogeneity in relation to five clinical features (age, tumor stage, serum lactate dehydrogenase concentration, performance status, number of extranodal disease sites) (Gohlmann and Talloen, 2009). This technique is most widely used to profile the gene expression of an organism on a whole-genome scale, and it has spawned a series of microarray-based expression studies of DLBCL aimed at refining prognosis using molecular-level information (Segal, 2005). Besides, DNA microarrays have also been used to analyze the changes in human B-cell gene expression induced by dioxins (Kovalova et al., 2017).
In this study, gene expression profiles representing DLBCL and dioxins (TCDD and furans), generated by DNA microarray techniques, were used for the subsequent analytical steps. The gene expression datasets were collected from two main databases, Gene Expression Omnibus (GEO) and Array Express, which are discussed in more detail in the following section.
2.4.2 Gene network database: Array Express and GEO
All of the datasets in this study were derived from the Array Express database and the Gene Expression Omnibus (GEO) database. Array Express is a public database for high-throughput functional genomics data, which consists of two distinct parts, the Array Express Repository and the Array Express Data Warehouse. The Array Express Repository is a MIAME-supportive public archive of microarray data, whereas the Array Express Data Warehouse is a database of gene expression profiles selected from the repository and consistently re-annotated. The required samples or experiments can be found by experiment attributes, for example keywords, species, array platforms, authors, journals or accession numbers. Gene names, gene properties or gene ontology terms can be used to visualize gene expression profiles. The Array Express database is growing rapidly and includes data from more than 50000 hybridizations and 1500000 individual expression profiles. MIAME (Minimum Information About Microarray Experiment), Microarray and Gene Expression Markup Language (MAGE-ML) and Microarray Gene Expression - Tabular format (MAGE-TAB) are some of the community standards supported by Array Express (Parkinson et al., 2007).
The GEO database, hosted by the National Center for Biotechnology Information (NCBI), is a rich resource of gene expression data generated by DNA microarray technology. The database is designed to hold both unprocessed and processed data in a MIAME-supportive format. GEO holds about a billion individual gene expression measurements covering a large number of biological phenomena, derived from over 100 organisms and 1500 laboratories. Several user-friendly web applications have been developed to improve the utility, effective exploration, query and visualization of these data for both individual and entire studies (Barrett, 2004).
2.4.3 Statistical analysis
2.4.3.1 Meta-analysis
Meta-analysis is a family of statistical techniques for combining results from several studies by means of various statistics, for example Fisher’s statistic or the minimum and maximum statistics. This technique has been applied to microarray analysis in particular, in order to combine different studies for the detection of differentially expressed genes (DEGs) and to boost the reliability of results from individual studies (Shen and Tseng, 2010). Microarray meta-analysis is carried out in seven steps: (1) identify suitable microarray studies, (2) extract the data from the studies, (3) prepare the individual datasets, (4) annotate the individual datasets, (5) resolve the relationship between probes and genes, (6) combine the study-specific estimates and (7) analyze, present and interpret the results (Ramasamy et al., 2008). Meta-analysis is beneficial for this study in identifying the DEGs of DLBCL tissues and dioxin-treated groups compared with normal tissues and control groups, respectively, which are the main concern of the following parts.
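As an illustration of step (6), the sketch below combines the p-values that one gene received in several independent studies using Fisher's statistic, one of the statistics named above. It is a minimal sketch rather than the procedure used by any particular tool: the function name `fisher_combined_p` and the example p-values are hypothetical, and the studies are assumed to be independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values for one gene with Fisher's method."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); under the null hypothesis
    # it follows a chi-square distribution with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for the same gene in three studies (illustration only).
combined = fisher_combined_p([0.04, 0.10, 0.03])
print(f"combined p-value: {combined:.4f}")
```

With a single study the function returns the input p-value unchanged, which is a convenient sanity check.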
2.4.3.2 False Discovery Rate (FDR)
The false discovery rate (FDR) is the expected fraction of false rejections among the hypotheses rejected. This measure is used in microarray analysis to estimate the proportion of false positive findings among the genes selected as differentially expressed (Gohlmann and Talloen, 2009). Although various procedures have been developed to control the FDR, the method of Benjamini and Hochberg is the most popular and was used in this study. The Benjamini and Hochberg adjustment is calculated with the formula below:
$$p = \frac{p_i \times m}{\mathrm{order}(p_i)}, \quad i = 1, 2, 3, \dots, m$$
Where:
p is the adjusted P value by the Benjamini-Hochberg method
p_i is the P value of gene i
order(p_i) is the rank of p_i when the P values are sorted in ascending order
m is the total number of genes in the dataset
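The adjustment can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name `benjamini_hochberg` is hypothetical, and the running minimum (which keeps the adjusted values monotone and no larger than 1) is the usual step-up convention rather than something stated in the text above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices sorted by ascending p-value; rank 1 is the smallest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes (illustration only).
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Genes whose adjusted value falls below a chosen FDR level would then be retained as differentially expressed.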
2.4.3.3 Differential Expression Analysis
Differential gene expression analysis is applied to microarray data in order to find the genes that are differentially expressed. Mutation in a gene or a set of genes is the main factor that induces abnormal or failed gene expression; for example, mutations affecting the p53 tumor suppressor gene can cause cancer. Therefore, microarray experiments are useful for identifying which genes are differentially expressed in disease cells versus normal cells. The comparison between various kinds of “disease” and “normal” cells provides an opportunity to find multiple target genes whose up- or down-regulation may be a result of the disease. Drug targets for specific mutated genes can then be developed in order to reduce their undesirable effects. In addition, differential gene expression is closely related to gene function and can provide information about gene and protein interactions. Therefore, differentially expressed genes are used in the reconstruction of gene networks, metabolic pathways and gene annotation (Zhang, 2006). In this study, DEGs are the main components for gene-network construction to determine whether dioxins can induce DLBCL.
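The notion of up- and down-regulation can be made concrete with a log2 fold change between the two groups. The sketch below is only an illustration: the function names, the expression values and the threshold of 1.0 (a two-fold change) are assumptions, the intensities are assumed to be on the original, non-log scale, and a real analysis would also attach a statistical test and an FDR adjustment to each gene.

```python
import math
from statistics import mean


def log2_fold_change(disease, normal):
    """log2 ratio of mean expression in disease versus normal samples."""
    return math.log2(mean(disease) / mean(normal))


def classify(log2_fc, threshold=1.0):
    """Label a gene by the direction of change (threshold is an assumption)."""
    if log2_fc >= threshold:
        return "up"
    if log2_fc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical intensities for one gene (illustration only).
fc = log2_fold_change(disease=[8, 8], normal=[2, 2])
print(fc, classify(fc))
```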
2.4.4 Hub proteins
A gene network consists of nodes connected by edges. In molecular biology, nodes represent genes or proteins and edges represent molecular interactions; as a result, a gene network represents the interactions of genes or proteins that lead to a variety of biological processes. The nodes in a network are divided into two types: (1) highly connected nodes, or hub proteins, and (2) poorly connected nodes, or non-hub proteins. Hub proteins are significantly more important than non-hubs since they ensure the maintenance of the network. In protein-protein interaction networks, hubs tend to be essential according to the centrality-lethality rule, which states that the functional importance of a node follows from its structural importance in the network; as a result, hubs tend to be involved in significant biological pathways that may result in biological reactions in the human body (He and Zhang, 2006). In this study, hub proteins play an important role in identifying the potential pathway from dioxin exposure to DLBCL.
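One simple way to separate hubs from non-hubs is to count the connections (degree) of each node. The following sketch assumes an undirected network given as a list of edges; the node names `P1` to `P5`, the edges and the degree cutoff are all invented for illustration and do not represent real interactions or the criterion used in this study.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct neighbours of every node in an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbours.items()}


def hub_nodes(edges, min_degree):
    """Return nodes with at least min_degree neighbours, sorted by name."""
    degrees = node_degrees(edges)
    return sorted(node for node, d in degrees.items() if d >= min_degree)


# Toy network with hypothetical proteins (illustration only).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
print(hub_nodes(edges, min_degree=3))
```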
2.4.5 GO term
The Gene Ontology (GO) mainly contains terms that are connected in a hierarchical order. GO terms are associated with gene products and classify proteins into three distinct groups according to their biological function: (1) molecular function, (2) biological process and (3) cellular component (Balakrishnan, 2013). These functions are summarized from published papers and uploaded to the GO database, and researchers can access this information through the process of annotation. In addition, the GO database provides the main annotation source for the analysis of high-throughput datasets, for example transcriptomic and proteomic studies, and for identifying the functions, pathways or cellular components represented by these datasets (Pavlidis, 2004). Furthermore, the GO database is used as a pathway-driven analysis tool for identifying risk, since it can be related to single nucleotide polymorphisms (SNPs) that are useful for informing biomarker identification studies (Holmans, 2009).
2.5 Gene Network construction tools
2.5.1 Network Analyst website
The Network Analyst website is one of the most basic and user-friendly tools; it combines all the steps necessary to analyze a network and presents the results through a high-quality visualization system. The website is available to anyone and is designed for efficient protein-protein interaction network analysis. The data in this website are generated from gene expression experiments on various species, mainly human and mouse studies.
Network Analyst is organized around three main steps of network analysis: identification of significant genes through data processing, network construction (mapping, building and refining the network), and network analysis and visualization. Multiple options are provided within each main step (Xia et al., 2014). In this study, Network Analyst is considered an adequate tool to find the most significant DEGs of DLBCL and dioxins, and the hub proteins, for gene-network construction and potential pathway identification, respectively.
2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps
Cytoscape is open-source software for integrating high-throughput expression data and other molecular states into a conceptual framework. Cytoscape is most powerful when used in conjunction with databases of protein-protein, protein-DNA and genetic interactions that are available for humans and other organisms (Shannon, 2003).
A large number of enrichment tools and algorithms have been developed for data interpretation, and ClueGO is a Cytoscape plugin used to represent the biological interpretation of functionally grouped terms in the form of networks and charts. In particular, the kappa statistic is used in ClueGO to link the terms in the network, so that GO terms or pathways are functionally organized. ClueGO is therefore a Cytoscape plugin for analyzing term relations and functional groups in biological networks (Bindea et al., 2009).
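To show how a kappa statistic can express the similarity of two terms, the sketch below computes Cohen's kappa from the genes annotated to each term. It is an assumption-laden illustration, not ClueGO's actual code: gene membership is treated as present or absent for every gene in a shared list, and the function name, gene identifiers and term contents are hypothetical.

```python
def kappa_score(genes_a, genes_b, all_genes):
    """Cohen's kappa between two terms based on shared gene membership."""
    universe = set(all_genes)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    n = len(universe)
    if n == 0:
        raise ValueError("the gene universe is empty")
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    # Observed agreement: genes in both terms or in neither term.
    observed = (both + neither) / n
    # Agreement expected by chance from the size of each term.
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both terms cover all or none of the genes
    return (observed - expected) / (1.0 - expected)


# Hypothetical gene universe and two hypothetical terms (illustration only).
universe = [f"g{i}" for i in range(1, 11)]
term_x = {"g1", "g2", "g3", "g4"}
term_y = {"g1", "g2", "g3", "g5"}
print(round(kappa_score(term_x, term_y, universe), 3))
```

Terms with a high score share most of their genes and would be drawn as connected nodes; the cutoff for connecting them is a user choice.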
CluePedia is the second Cytoscape plugin used in this study. CluePedia is a useful tool for searching for new markers that are associated with pathways. With CluePedia, various kinds of genes, proteins and miRNAs can be connected based on experimental information before being integrated into a ClueGO network. In addition, new pathway associations can be revealed by gene, protein and miRNA enrichments. This Cytoscape plugin is therefore convenient for users and offers powerful visualization of gene, miRNA or protein connection networks (Bindea et al., 2013).
Cytoscape software and the ClueGO/CluePedia plugins are applied to perform gene-network reconstruction and to identify the potential pathway, corresponding to the second and third objectives of this study.
PART III METHODOLOGY
3.1 Data collection
First, all required microarray datasets were collected from two main websites, the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/) and Array Express (http://www.ebi.ac.uk/arrayexpress), which are large public resources of gene expression data and provide users with flexible data mining tools (https://academic.oup.com/nar/article/35/suppl_1/D760/1106106/NCBI-GEO-mining-tens-of-millions-of-expression).
To measure human gene expression in DLBCL and how it relates to chemical exposure, several datasets were obtained from both websites using the following keywords: Homo sapiens, DLBCL, TCDD and Furans; the array files containing processed data were used in this study. In DLBCL array files, the experimental samples are normally obtained from various sources. In this study, two types of samples, normal tissues and DLBCL tissues, were included, and all of them had to be untreated by any chemical. All of these data were generated on the Affymetrix array platform, and the files had to be available from September 2015 to the present. As a result, a total of 10 microarray datasets were found to fit the scope of this study, including E-GEOD-12195, E-GEOD-83632, GSE47355, GSE56313, E-GEOD-69844, E-GEOD-69845, E-GEOD-69849, E-GEOD-69850, E-GEOD-69851.
3.2 Data processing
Data analysis was subsequently performed using Network Analyst, a web-based tool for network analysis and interactive exploration. The datasets were combined and divided into three distinct groups based on the sample sources, stored as text files (.txt): (1) DLBCL and normal tissues, (2) control and TCDD, and (3) control and Furans. In the initial step, the text files were uploaded, the organism was defined as Homo sapiens, and Official Gene Symbol was chosen as the ID type. ID conversion was applied immediately after uploading in order to identify the organism and report the number of genes matched or unmatched with the chosen ID type. Then, these files were submitted to the gene annotation step to ensure that the labels are consistent across all uploaded datasets. After that, the data normalization step was carried out in order to set an adequate normalization procedure. In this report, no normalization procedure was set for the DLBCL and normal tissue data, while log2 normalization was applied to the control and dioxin-treatment data in order to increase the variance at low intensities.
The normalized data were then passed to the differential expression (DE) analysis dialog in order to perform differential gene expression analysis on each individual dataset, so that the number of DEGs between DLBCL and normal tissues, and between control and chemical-treated groups, could be detected. An analysis of variance (ANOVA) was conducted on each individual dataset, and the cut-off p value was adjusted using the Benjamini-Hochberg false discovery rate (FDR), which makes it possible to decide whether a gene is differentially expressed or not; the cut-off was set to 0.05 in the DE analysis dialog of the Network Analyst website. After the data summarization step, the 4 datasets of DLBCL, 5 datasets of TCDD and 1 dataset of Furans (Table 4.1.1, Table 4.1.2) were subjected to the “directed merge” method in the meta-analysis step in order to merge all datasets into a single dataset for analysis.
Finally, three distinct result tables containing the top-ranking DEGs and relevant statistics (CombineLogFC, adjusted P value) for DLBCL, TCDD and Furans were exported separately (Appendix 1, 2, 3).
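The Benjamini-Hochberg adjustment used in the DE analysis step can be illustrated with a minimal Python sketch. It is not the Network Analyst implementation; the gene names and raw p values are hypothetical, and the strict comparison against 0.05 is an assumption made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that the adjusted values
    # never increase as the rank decreases (monotonicity step)
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for five genes (illustrative only)
raw = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.039,
       "GENE_D": 0.041, "GENE_E": 0.60}
adj = benjamini_hochberg(list(raw.values()))

# Genes kept as differentially expressed at the 0.05 cut-off (assumed strict)
significant = [gene for gene, q in zip(raw, adj) if q < 0.05]
print(significant)
```

In this toy input, GENE_C has a raw p value below 0.05 but is no longer significant after adjustment, which is the effect the FDR correction is meant to have when many genes are tested at once.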
3.3 Network construction
The DEGs of DLBCL, TCDD and Furans obtained above were first screened by |fold change| thresholds of 2.0, 1.2 and 1.2 (|Combine LogFC| ≥ 0.26), respectively, in order to filter the top up-regulated and down-regulated genes, which were used in the further analysis steps: (1) gene ontology analysis and (2) gene network reconstruction. The GO biological pathways of these expressed genes can easily be found by submitting the DEG list of each group to the Cytoscape plugin ClueGO. The resulting gene networks show the biological pathways of the DEGs involved. In addition, these DEGs were loaded into the ClueGO app in Cytoscape, thereby reconstructing the DLBCL, TCDD and Furans gene networks. The kappa score threshold can be adjusted on a positive scale from 0 to 1 in order to restrict the network connectivity in a customized way and to create the functional groups of genes (Bindea et al., 2009); a kappa score of 0.4 was chosen in this study to create these sub-networks. The three sub-networks of DLBCL, TCDD and Furans were then merged into a single network, providing a potential pathway that shows the effect of TCDD and Furans on human health leading to DLBCL.
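To make the role of the kappa score threshold more concrete, the following Python sketch computes Cohen's kappa between two terms from their gene memberships and links them when the score reaches 0.4. It is only an approximation of the idea under stated assumptions: the term contents and gene identifiers (G1 to G10) are hypothetical, and ClueGO's actual implementation may differ in detail.

```python
def kappa_score(term_a, term_b, all_genes):
    """Cohen's kappa for the agreement between two terms' gene memberships."""
    n = len(all_genes)
    both = sum(1 for g in all_genes if g in term_a and g in term_b)
    neither = sum(1 for g in all_genes if g not in term_a and g not in term_b)
    observed = (both + neither) / n           # observed agreement
    pa = len(term_a & all_genes) / n          # share of genes in term A
    pb = len(term_b & all_genes) / n          # share of genes in term B
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if expected == 1.0:
        return 0.0  # degenerate case: no information to compare
    return (observed - expected) / (1 - expected)


# Hypothetical gene universe and term annotations (illustrative only)
genes = {f"G{i}" for i in range(1, 11)}
term_1 = {"G1", "G2", "G3", "G4"}
term_2 = {"G1", "G2", "G3", "G5"}
term_3 = {"G1", "G6", "G7", "G8"}

KAPPA_THRESHOLD = 0.4  # value chosen in this study
k12 = kappa_score(term_1, term_2, genes)
k13 = kappa_score(term_1, term_3, genes)
linked_12 = k12 >= KAPPA_THRESHOLD
linked_13 = k13 >= KAPPA_THRESHOLD
print(round(k12, 3), linked_12, round(k13, 3), linked_13)
```

Raising the threshold toward 1 keeps only terms that share most of their genes, so the network becomes sparser; lowering it toward 0 connects more terms into larger functional groups.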
To clarify the potential pathway between TCDD/Furans exposure and DLBCL, a protein-protein interaction network was constructed in order to identify the hub genes, which may have a vital function and be indirectly involved in many biological processes (Raman et al., 2013). All filtered DEGs of DLBCL, TCDD and Furans were submitted individually to the Network Analyst website in order to create a separate protein-protein interaction network for each. Then, the hub proteins of DLBCL and of the two types of dioxin-related compounds with the highest values of (1) node degree and (2) node betweenness were characterized and summarized for the next step of pathway analysis. The list of hub proteins and the additional target genes were loaded into the CluePedia app in Cytoscape in order to show the potential pathway from TCDD and Furans to DLBCL. Two distinct types of directed edges, gene activation and gene expression, were chosen to build the pathway network showing how dioxin-related compounds can lead to DLBCL in the human body.
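The two hub criteria named above, node degree and node betweenness, can be sketched in plain Python for a small undirected, unweighted network. The protein names and edges below are hypothetical, betweenness is computed with Brandes' algorithm and left unnormalized, and Network Analyst may scale or compute its values differently.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {node: len(neigh) for node, neigh in adj.items()}
    betweenness = {node: 0.0 for node in adj}

    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s
        delta = {n: 0.0 for n in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                betweenness[w] += delta[w]

    # Each unordered pair was counted from both endpoints
    for n in betweenness:
        betweenness[n] /= 2
    return degree, betweenness


# Hypothetical protein-protein interaction edges (illustrative only)
ppi_edges = [("HUB", "P1"), ("HUB", "P2"), ("HUB", "P3"), ("P3", "P4")]
degree, betweenness = degree_and_betweenness(ppi_edges)

# Rank candidate hub proteins by degree, then by betweenness
ranked = sorted(degree, key=lambda n: (degree[n], betweenness[n]), reverse=True)
print(ranked[0], degree[ranked[0]], betweenness[ranked[0]])
```

Degree counts direct interaction partners, while betweenness counts how many shortest paths between other proteins pass through a node, so a protein can rank high on one measure and low on the other.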
All steps undertaken in this research are assembled in the following flowchart (Figure 3.1) for better illustration.
Figure 3.1: The flowchart of methodology
[Flowchart: datasets available after September 2015; data processing in Network Analyst (ID conversion, annotation, normalization, DE analysis); differentially expressed genes; gene network presenting GO; protein-protein interaction network; potential pathway]
[Table of DLBCL datasets; recoverable columns: normal tissues, DLBCL, total samples, array platform]
Sample sources: fresh frozen tissue, normal tonsil; fresh frozen tissue
GSE473: Homo sapiens; lymph node tissues of DLBCL patients
GSE563: Homo sapiens; lymph node tissues of DLBCL patients
[Table of TCDD and Furans datasets; recoverable columns: control, treatment, total, array platform]
HepaRG Hepatocytes: control 45, treatment 7, total 52, Affymetrix
MCF7 Breast Adenocarcinoma
Ishikawa Endometrial Adenocarcinoma Cell Line
HepG2 Human Hepatocyte Carcinoma Cell Line
HepaRG Hepatocyte Carcinoma Cell Line
Expression Profiles of HepG2 cells treated with furans
4.1.2 Differentially expressed genes
Using Network Analyst and the Benjamini-Hochberg FDR method, the DEGs screened by |fold change| ≥ 1.2 (|Combine LogFC| ≥ 0.26) totaled 1288, comprising 488 DEGs of DLBCL, 288 DEGs of TCDD and 512 DEGs of Furans. The numbers of up-regulated genes of DLBCL, TCDD and Furans were 316, 268 and 217, respectively, and the numbers of down-regulated genes were 172, 20 and 295, respectively (Tables 4.3, 4.4, 4.5).
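The relation between the fold-change cut-off and the |Combine LogFC| cut-off can be checked with a short Python sketch: a fold change of 1.2 corresponds to log2(1.2) ≈ 0.26, and a fold change of 2.0 to exactly 1.0. The sketch assumes that the combined log fold change is on a log2 scale; the gene names and values are hypothetical.

```python
import math


def split_by_fold_change(log_fc, fold_change_cutoff):
    """Split genes into up- and down-regulated lists by |log2 fold change|."""
    threshold = math.log2(fold_change_cutoff)
    up = sorted(g for g, value in log_fc.items() if value >= threshold)
    down = sorted(g for g, value in log_fc.items() if value <= -threshold)
    return up, down


# Hypothetical combined log2 fold changes (illustrative only)
combined_log_fc = {"GENE_A": 0.91, "GENE_B": -0.35, "GENE_C": 0.10,
                   "GENE_D": -0.20, "GENE_E": 0.27}

up, down = split_by_fold_change(combined_log_fc, 1.2)
print(round(math.log2(1.2), 2), up, down)
```

Genes whose absolute log fold change falls below the threshold, such as GENE_C and GENE_D here, are excluded from both lists.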
Table 4.3: Differentially expressed genes, including up- and down-regulated genes, in diffuse large B lymphoma compared to normal cells
DLBCL (488 DEGs)
Up-regulated genes (316)
COMMD8, NDUFS3, MFSD1, VAMP8, HSBP1, HSD17B10, LSM1, RRM1, RSL24D1, C14orf2, PDZD11, POP5, PSMD14, APEX1, ACTR10, MRPL33, NDUFA8, DDX39A, TMEM147, IGBP1, DCTPP1, IMPDH2, RRM2, MRPL18, POLR2H, PSMD1, TIMM10, MRPL27, YWHAG, FIS1, DDX23, SNAPIN, BLOC1S2, RCC2, SAT1, PTRHD1, CDK2AP1, TAF7, BCL2A1, GLA, ALYREF, CD19, COX14, CD3D, SLAMF8, CMC2, S100A11, PRDX1, C19orf70, MRPS33, NDUFB5, TMEM126A, DNAJB11, PRR13, EMG1, NAE1, ADSL, EVI2A, MRPL15, MRFAP1, GZMA, EIF2A, GYG1, ISG15, GJA1, MLH1, PARP1, RACGAP1, SNRPB, MMADHC, CBX1, MRFAP1L1, ACADM, RPL36AL, CCDC12, STARD3NL, CETN3, HEXB, CEBPB, ISCU, LSM6, DNMT1, PSMC5, MAGED1, NAA20, TSPO, MORF4L1, PDLIM1, DDIT4, EIF4H, PSMB10, ITPA, MRPS28, PSMC4, BLOC1S1, CD52, UBE2A, ATP5L, RPS19BP1, SEC13, SEC11A, THOC7, NOC3L, DKFZP586I1420, ZNF121, MRPS35, DIABLO, OCIAD2, PSMD8, ARPC5L, MS4A1, MCM2, MRPL40, GCH1, PSMD10, DNTTIP2, OAT, NSMCE1, TBCB, C14orf119, ACP5, PPM1G,
POLR2K, TSG101, PEA15, MRPL49, NIT2, ATIC, PPP2CB, NCBP2, RABAC1, DRG1, NUP107, TCF4, SLC25A19, UFC1, CIB1, BIRC2, NDUFB10, RBBP8, SNX3, SMNDC1, HDHD2, ETF1, RAD23A, MYBL2, SRRM1, TIMMDC1, COX5B, LYRM1, IL18, ARHGAP17, IRF2BPL, NONO, TM2D2, MFAP1, ITGA3, KCTD12, NUPR1, HAT1, AP3S1, MANF, TMEM14B, CPSF4, PPIH, MIEN1, MTIF2, FAM50A, LRRC47, PAPSS1, GLO1, CCNG1, RPIA, ASNSD1, LYPLA1, WDR83OS, CUTA, DAZAP1, AP1S2, BTBD1, VPS25, BCL11A, MT1E, ZNHIT3, EIF3I, RPL11, S100A8, ANXA2, PPIL3, GLRX, ENOPH1, IER5, CISD1, HAUS1, DRAM1, DDX21, SNRPD3, UBE2L6, TMEM138, RPF2, DUT, GTF3C6, TSPAN13, ITM2A, PPP1R7, PIH1D1, GTF2B, CDK5RAP3, TMEM208, DBF4, GTF3A, RFC4, IER3, YTHDF2, FIBP, TIMM8B, MPLKIP, VPS28, LAGE3, CLIC1, HARS, IMP3, CS, CEBPZ, RFX5, DNAJB1, MRPL16, CSRP1, ORMDL2, PIGP, CDKN1A, NMI, FAM35A, TNFAIP3, PCMT1, EBPL, TUBB6, GBP1, PLOD1, TUBA1C, REEP5, EIF2S1, MRPL1, IMP4, SNRPA, MARCKSL1, DYNLT3, UBE2E2, SCAMP3, POLR3GL, CUEDC2
Down-regulated genes (172)
DUSP6, CYTH4, LCP2, SIRPB1, ITGB2, CORO1A, RAB7A, COX7A2L, MEFV, ANPEP, C5AR1, ZYX, DOCK5, STEAP4, GRK6, MSL2, PLXNC1, STK17B, PYGL, CD3E, KCNJ15, SCIMP, CAPNS1, GLIPR1, CPPED1, IST1, LILRA1, PRKAR1A, ARRB2, WDR1, ARHGAP26, DUSP1, WIPF1, MXD1, BSG, CELF2, GNAQ, ZFAND5, MBOAT7, GABARAP, MBNL1, AOAH, CTSS, DOK3, HIST1H1E, CYP4F3, PTBP3, NCF2, RNASET2, TCP11L2, MAPK1, PIP4K2A, STAT3, DOCK8, TLN1, TGFBR2, SELPLG, PGK1, FPR1, SDHA, SMCHD1, MOB3A, DDX17, TUBB1, GUK1, LYN, CD37, ETS1, CCNI, STK38, ATP6V1B2, CAP1, PDZK1IP1, HBB, EPB41, TREM1, PTAFR, GNAS, FFAR2, RPL18, IL7R, EIF4EBP2, SLC44A2, HLA-DPA1, LITAF, ITM2B, CXCR2, CYBB, CFL1, LCP1, ALAS2, PTPRC, CSF3R, ARHGDIB, AQP9, DAZAP2, SLC6A6, B2M, SMAP2, BCL2L1, SORL1, RAC2, FBXO7, PSAP, FCN1, ND5,
SLC25A37, TNFRSF10C, TMBIM6, CD74, HLA-E, SLC25A39, DCAF12, CX3CR1, RHOA, CD53, XPO6, TAGLN2, FCGR2A, MSN, LYZ, LAPTM5, MALAT1, TXNIP, ACTB
Table 4.4: Differentially expressed genes, including up- and down-regulated genes, activated by TCDD compared to control group