(Thesis) Identifying the effect of exposure to dioxin and furans on human health leading to diffuse large B lymphoma through gene-network construction


THAI NGUYEN UNIVERSITY

UNIVERSITY OF AGRICULTURE AND FORESTRY

NGUYEN THI QUYNH LAM

IDENTIFYING THE EFFECT OF EXPOSURE TO DIOXIN AND FURAN

ON HUMAN HEALTH LEADING TO DIFFUSE LARGE B LYMPHOMA

THROUGH GENE-NETWORK CONSTRUCTION

BACHELOR THESIS

Study Mode: Full-time
Major: Environmental Science and Management
Faculty: International Programs Office

Batch: 2013 - 2017

Thai Nguyen, December 2017


DOCUMENTATION PAGE WITH ABSTRACT

Thai Nguyen University of Agriculture and Forestry

Degree Program: Bachelor of Environmental Science and Management

Student name: Nguyen Thi Quynh Lam

Student ID: DTN1353110372

Thesis Title: Identifying the effect of exposure to dioxin and furans on human health leading to diffuse large B lymphoma through gene-network construction

Supervisor(s): Prof. Chun-Yu Chuang, Assoc. Prof. Tran Thi Thu Ha

Abstract:

Many studies have indicated that long-term exposure to dioxins and dioxin-like compounds, e.g., 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and polychlorinated dibenzofurans (furans), can induce several adverse outcomes in humans and animals. One of them is diffuse large B-cell lymphoma (DLBCL), which is considered the most common kind of lymphoma. In order to identify the gene expression changes induced by TCDD and furans that potentially underlie DLBCL development, a bioinformatics meta-analysis was applied in this study. Ten datasets containing gene expression information for DLBCL, TCDD and furans were obtained from the Gene Expression Omnibus (GEO) and ArrayExpress websites and further analyzed using Cytoscape software and its plugins, ClueGO and CluePedia. As a result, the most differentially expressed genes were used to construct gene networks of DLBCL, TCDD and furans, and hence a potential pathway showing how dioxins could promote the progression of lymphoma. In addition, the analysis suggested that TCDD and furans may activate the receptor AhR, which promotes the expression of the protein TWIST1 and enhances the progression of DLBCL. The results of this study contribute to further dioxin research and can be considered an initial step toward future work on DLBCL diagnosis and treatment.

Keywords: TCDD, Furans, DLBCL, bioinformatics, GEO, ArrayExpress

Number of pages: 87

Date of Submission: 20/09/2017

Supervisor’s signature


ACKNOWLEDGEMENT

First of all, I would like to use this opportunity to express my deepest gratitude and special thanks to Prof. Chun-Yu Chuang for her patience in guiding me, keeping me on the correct path and showing me many wonderful things during my internship at the Department of Biomedical Engineering and Environmental Science at National Tsing Hua University.

I would like to express my deep thanks to Assoc. Prof. Tran Thi Thu Ha for giving me the advice and guidance I needed to complete my thesis.

My sincere thanks also go to all the members of the Department of Biomedical Engineering and Environmental Science for providing me with all the materials and necessities for conducting the experiments for my research.

Finally, I would like to thank my family and my friends for encouraging and advising me during the completion of this thesis.

Thai Nguyen, October 2017


TABLE OF CONTENTS

ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
PART I INTRODUCTION
1.1 Research rationale
1.2 Research objectives
PART II LITERATURE REVIEW
2.1 Persistent Organic Pollutants (POPs)
2.2 Dioxins and dioxin-like compounds
2.3 Lymphoma and non-Hodgkin lymphoma
2.3.1 Diffuse large B lymphoma
2.3.2 SNPs of Diffuse Large B lymphoma
2.4 Gene-network components
2.4.1 Microarray data
2.4.2 Gene network database: ArrayExpress and GEO
2.4.3 Statistical analysis
2.4.4 Hub proteins
2.4.5 GO term
2.5 Gene Network construction tools
2.5.1 NetworkAnalyst website
2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps
PART III METHODOLOGY
3.1 Data collection
3.2 Data processing
3.3 Network construction
PART IV RESULTS AND DISCUSSION
4.1 Results
4.1.1 Genetic datasets
4.1.2 Differentially expressed genes
4.1.3 Gene-network construction of DLBCL, TCDD and Furans
4.1.4 Protein-protein interaction network of DLBCL, TCDD and Furans
4.1.5 Potential pathway showing the relation between TCDD and Furans and Diffuse Large B lymphoma
4.2 Discussion
4.2.1 AhR-mediated key factor of dioxin-like compounds
4.2.2 Key factors of hypoxia response and the risk of MYC-TP53 interaction
4.2.3 Inhibition of cancer cell apoptosis and tumorigenesis factor in DLBCL
PART V CONCLUSION
REFERENCES
APPENDICES
Appendix 1 Differentially expressed genes of DLBCL versus normal cell
Appendix 2 Differentially expressed genes of exposure to TCDD group versus control group
Appendix 3 Differentially expressed genes of exposure to FURANS group versus control group
Appendix 4 Hub proteins of DLBCL network
Appendix 5 Hub proteins of TCDD network
Appendix 6 Hub proteins of FURANS network

LIST OF FIGURES

Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins (PCDD) and dibenzofurans (Source: Pereira, 2004)
Figure 2.2: Representative structure of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) (Pereira, 2004)
Figure 2.3: A schematic representation of signal transduction after TCDD/AHR interaction (Fracchiolla et al., 2016)
Figure 3.1: The flowchart of methodology
Figure 4.1: Gene Ontology network showing the relationship of DLBCL,

LIST OF TABLES

Table 4.1: Database of DLBCL
Table 4.2: Database of TCDD and Furans
Table 4.3: Differentially expressed genes, including up- and down-regulated genes in Diffuse Large B lymphoma compared to normal cells
Table 4.4: Differentially expressed genes, including up- and down-regulated genes activated by TCDD compared to control group
Table 4.5: Differentially expressed genes, including up- and down-regulated genes activated by Furans compared to control group
Table 4.6: Lists of hub proteins contained in DLBCL, TCDD and Furans networks

LIST OF ABBREVIATIONS

ABC DLBCL    Activated B-cell-like DLBCL
AhR          Aryl hydrocarbon receptor
ARNT         Aryl hydrocarbon receptor nuclear translocator
B-NHL        B-cell non-Hodgkin lymphoma
CRE          cAMP response element
DEG          Differentially expressed gene
DLBCL        Diffuse large B-cell lymphoma
DNA          Deoxyribonucleic acid
DNMT1        DNA methyltransferase 1
EGFR         Epidermal growth factor receptor
FDR          False discovery rate
GCB DLBCL    Germinal center B-cell-like DLBCL
GEO          Gene Expression Omnibus
HAHs         Halogenated aromatic hydrocarbons
MAGE-ML      Microarray and Gene Expression Markup Language
MAGE-TAB     Microarray Gene Expression - Tabular format
MIAME        Minimum Information About a Microarray Experiment
PCDD/Fs      Polychlorinated dibenzo-p-dioxins/furans
PCDDs        Polychlorinated dibenzo-p-dioxins
POPs         Persistent organic pollutants
ROS          Reactive oxygen species
SNPs         Single nucleotide polymorphisms
TCDD         2,3,7,8-tetrachlorodibenzo-p-dioxin
TNF          Tumor necrosis factor
XRE          Xenobiotic response element

PART I INTRODUCTION

1.1 Research rationale

Dioxins and dioxin-like compounds are of great concern today because of their persistent, long-term impacts on humans and animals. TCDD and furans are representative dioxins and dioxin-like compounds that can harm human health even in small amounts through the food chain and biomagnification. The most significant impact of these chemicals is genetic alteration through activation of the aryl hydrocarbon receptor (AhR), which passes into the cell nucleus and can thereby induce genetic disease and carcinogenesis.

Diffuse large B-cell lymphoma (DLBCL) is the most prevalent B-cell non-Hodgkin lymphoma, accounting for 40% of lymphoma diagnoses. The exact cause of DLBCL is unknown; however, many proto-oncogenes and abnormal genes causing lymphoma have been found in previous studies. It is essential to identify the biological mechanisms that activate those genes and whether they are related to the impact of dioxins and dioxin-like compounds. Bioinformatics, including sequence analysis, gene and protein expression analysis, cellular organization analysis, structural bioinformatics, network and systems biology and others, contributes to many fields worldwide.

The application of bioinformatics in biomedicine has received much attention in many developed countries; by contrast, it is still uncommon in Vietnam. Specifically, many studies have indicated that high-throughput sequencing and DNA microarray technology play a significant role in identifying the genetic/transcriptomic alterations causing DLBCL and prognostic biomarkers for lymphoma treatment. Therefore, the activation of those abnormal genes and the influence of dioxins can be clarified by applying bioinformatics. In order to clarify the diagnosis of lymphoma, the study “Identifying the effect of exposure to TCDD and Furans on human health leading to diffuse large B lymphoma through gene-network construction” was conducted with the support of the Department of Biomedical Engineering and Environmental Science of National Tsing Hua University in Taiwan.

1.2 Research objectives

The objectives of this research are:

- To investigate the differentially expressed genes for diffuse large B-cell lymphoma (DLBCL) tissues and for dioxin-exposed human cell lines, respectively;

- To construct the gene network for exploring whether exposure to dioxin can induce DLBCL;

- To identify the potential pathway linking dioxin exposure to DLBCL.

PART II LITERATURE REVIEW

2.1 Persistent Organic Pollutants (POPs)

Persistent organic pollutants (POPs) include a variety of lipophilic compounds that are associated with environmental degradation. Among the various kinds of POPs, for example organochlorine (OC) pesticides and industrial chemicals or their by-products, the chlorine-containing compounds tend to cause the most deleterious effects; consequently, they have been banned or strictly regulated in many countries. Despite these regulations, POPs exposure persists in the general population because of the consumption of animal-derived fats. POPs concentrations tend to increase with trophic level in food webs (biomagnification); as a result, the POPs concentration accumulated in the human body may be higher than in the external environment (Fisher et al., 1999). In addition, POPs accumulated in adipose tissue over a lifetime are considered one route of chronic exposure, since they are continuously released from adipose tissue into the circulation and lipid-containing vital organs (La Merrill et al., 2013).

POPs have the following main properties. First, they are lipophilic compounds that accumulate mainly in lipid-containing tissues such as adipose tissue and move within the body bound to lipids (Lewis et al., 2002). In addition, POPs are always present as chemical mixtures in the external environment because of mixing in the environment and food web and long-term retention in fat tissue (Kortenkamp et al., 2008). Therefore, the distinct groups of OC pesticides, polychlorinated biphenyls (PCBs) and dioxins are classified with reference to the chemical mixtures of each POPs subclass.

2.2 Dioxins and dioxin-like compounds

Polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are classified as ubiquitous POPs. PCDD/Fs are two of the three subclasses of the halogenated aromatic hydrocarbons and are referred to as dioxins and dioxin-like compounds, respectively (see Figure 2.1).

Figure 2.1: General molecular structure of polychlorinated dibenzo-p-dioxins

(PCDD) and dibenzofurans (PCDF)

(Source: Pereira, 2004)

They are widespread in almost every part of the environment, and remote areas are no exception. Dioxins and dioxin-like compounds tend to be persistent and lipophilic in the external environment, so they can bioaccumulate through food chains and potentially affect biota and human health. PCDD/Fs are two subclasses of the halogenated aromatic hydrocarbons (HAHs), which are characterized by the basic aromatic structure of a benzene ring, a hexagonal carbon structure with conjugated double bonds connecting the carbons. Dioxins and dioxin-like compounds differ in the number of oxygen atoms bridging the two benzene rings: two in dioxins and one in furans. Both groups produce a similar spectrum of toxic effects through binding to a receptor protein, the aryl hydrocarbon receptor (AhR). The planar shape of the molecule facilitates binding to the receptor, and its relative potency depends to a large degree on its persistence and how well it fits the receptor. PCDD/Fs, and in particular one PCDD congener, tetrachlorodibenzo-p-dioxin (TCDD), have a high affinity for AhR and fit the receptor very well. PCDD/Fs are derived from four main sources, including (1) combustion, (2) metal smelting, refining and processing, and (3) biological and photochemical processes (US National Research Council, 2006). PCDD/Fs have the potential to cause cancer, birth defects, reproductive disorders, immunotoxicity and other toxic end points, including liver diseases, thyroid dysfunction, lipid disorders, neurotoxicity, cardiovascular disease, and metabolic disorders such as diabetes (US National Research Council, 2006).

* 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD):

According to Pereira (2004), 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) has the structure shown below (see Figure 2.2).

Figure 2.2: Representative structure of 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD)

(Source: Pereira, 2004)

2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) is one of the most toxic members of the polychlorinated dibenzodioxin (PCDD) family and is a nearly ubiquitous environmental contaminant (Pesatori et al., 1993, 2009). TCDD is a synthesis by-product of chlorophenol and chlorophenoxy herbicide manufacturing (Saracci et al., 1991). It can be formed in combustion processes along with other polychlorinated dibenzodioxins and dibenzofurans. In addition, it can derive from waste incineration, metal production, and fossil fuel or wood combustion (Deziel et al., 2012). Dioxins tend to bioaccumulate in the food chain because of their long biological half-life and low water solubility; even a small amount of dioxins can lead to significant dioxin concentrations in the food chain (Paustenbach et al., 1992). TCDD has been shown to exert its effects by binding to the dioxin receptor AhR, which has a high affinity for TCDD in many mammalian species.

AhR is a basic helix-loop-helix/PAS transcription factor located in the cytoplasm, where it forms a complex with various proteins and lipophilic compounds (Agostinis et al., 2007). In the cytoplasm, it is associated with pp60, which can bind to the epidermal growth factor receptor (EGFR) and induce mitogen-activated protein kinase signaling. In the nucleus, AhR forms a heterodimer with the aryl hydrocarbon receptor nuclear translocator (ARNT); the AhR-ARNT complex promotes transcription at xenobiotic response elements (XRE) and interacts with several important pathways, for example Wnt/beta-catenin, estrogen receptors, retinoblastoma protein, retinoic acid, NF-kB and the circadian rhythm regulators (Sorg, 2013). AhR has been shown to be involved in multiple physiological regulatory processes and effects, for example altered cell cycle regulation and proliferation. Studies of workers in Sweden and the US reported similar observations of a relationship between phenoxy herbicide exposure and cancer; in particular, prolonged TCDD exposure was related to an increased relative risk of non-Hodgkin lymphoma (Hardell et al., 1996). Besides, 45 million liters of TCDD-contaminated Agent Orange were sprayed over South Vietnam and Cambodia from 1962 to 1971 to destroy vegetation, and the resulting cancer incidence persists (Stellman et al., 2003).

Therefore, this study focuses on the potential gene network and pathway in order to investigate how TCDD, the most toxic PCDD, and furans, a group of dioxin-like compounds, can induce non-Hodgkin lymphoma, especially diffuse large B-cell lymphoma (Figure 2.3).

Figure 2.3: A schematic representation of signal transduction after TCDD/AHR

interaction

(Source: Fracchiolla et al., 2016)

2.3 Lymphoma and non-Hodgkin lymphoma

Lymphoma is the well-known name for neoplasms of lymphoid precursor cells. It was first reported in 1832 by Thomas Hodgkin, after whom Hodgkin's lymphoma was named. Several kinds of lymphoma were discovered later; however, the disease is mainly divided into two subclasses: Hodgkin lymphoma and non-Hodgkin lymphoma. The majority of non-Hodgkin lymphomas are B-cell lymphomas, the remainder being T-cell and NK-cell lymphomas. Lymphoid neoplasms are a highly diverse group of diseases that reflects the diversity of the immune system (Hussain and Harris, 1998). In Vietnam, the incidence of non-Hodgkin lymphoma has increased during the last ten years, with 2700 cases recorded each year (Nguyen, 2015).

2.3.1 Diffuse large B lymphoma

Diffuse large B-cell lymphoma (DLBCL) is the most prevalent B-cell non-Hodgkin lymphoma (B-NHL) in adults, accounting for 40% of diagnoses. Three major subclasses of DLBCL are distinguished on the basis of molecular heterogeneity: germinal center B-cell-like DLBCL (GCB DLBCL), activated B-cell-like DLBCL (ABC DLBCL) and primary mediastinal B-cell lymphoma. GCB DLBCL derives from germinal center B cells and expresses genes characteristic of germinal center B lymphocytes, while ABC DLBCL expresses genes characteristic of plasma cells and is thought to arise from B cells activated for differentiation into plasma cells. Primary mediastinal B-cell lymphoma is thought to arise from a rare B-cell population that resides in the thymus and has a gene expression profile distinct from those of GCB and ABC DLBCL (Rosenwald, 2003).

2.3.2 SNPs of Diffuse Large B lymphoma

Gene expression profiling and genome sequencing are applied to improve our understanding of DLBCL subclasses and the molecular basis of chemotherapy resistance, and to support the identification of novel molecular DLBCL subsets and targets for drug intervention, and hence the prevention and treatment of DLBCL (Lossos et al., 2006).

The majority of DLBCL cases arise from normal antigen-exposed B cells that are at different stages of differentiation and undergo clonal expansion in the germinal centers (GCs) of peripheral lymphoid organs (Martelli et al., 2013). Besides, DLBCL can develop and progress through a range of multistep transformation processes. Specifically, DLBCL can progress slowly or rapidly depending on the stage, through clonal evolution or simultaneous and extensive DNA rearrangements in subclones. Several diverse genetic abnormalities reflecting its clinical and genetic (clonal) heterogeneity have been observed, including aberrant somatic hypermutation, nonrandom chromosomal deletions, balanced reciprocal translocations deregulating the expression of proto-oncogene products such as BCL6, BCL2, REL or c-MYC, and dysregulated apoptosis or defective DNA repair (Morin et al., 2013).

Mutations in several genes causing DLBCL have been identified in several studies. For example, the primary or early oncogenic events are chromosomal translocations involving oncogenes such as BCL6, BCL2, REL or c-MYC, whereas a group of genes including BCL2, PRDM1, CARD11, MyD88, TNFAIP3, CREBBP, TP53, EZH2, MLL2, MYOM2, PIM1, LYN, CD36, B2M, CD79B, MEF2B, ANKLE2, KDM2B, HNF1B, NOTCH1/2, DTX1 and MYCCD58 tends to appear in the secondary or late oncogenic events of clonally represented recurrent mutations or gene alterations (Morin et al., 2013). In addition, alterations of DNA repair and DNA signaling genes affecting the DNA repair pathway have been identified in DLBCL tumors, and they tend to form intermediate cancer driver events in lymphomagenesis. Moreover, mutation or translocation of BCL6, BCL2, REL or c-MYC can induce overexpression of proto-oncogene products, whereas genetic lesions and mutations in TNFAIP3, CARD11, CD79A/B, MYD88 or TRAF2 can activate the canonical and non-canonical NF-kB pathways (Zhang et al., 2015). Furthermore, the most frequent cancer driver events in DLBCL involve epigenetic reprogramming triggered by mutations in genes such as TET1, MLL2, EZH2, MEF2B, EP300 and CREBBP (Zhang et al., 2013). Therefore, alterations in the expression of proto-oncogene products and tumor suppressors provide tumor cells with gene expression plasticity, escape from apoptosis and enhanced growth through constitutive survival and proliferative signals.

2.4 Gene-network components

2.4.1 Microarray data

DNA microarrays are used to determine the expression levels of a large number of genes. Microarray platforms for gene expression include single-color and two-color systems. Affymetrix GeneChip arrays are a widely used single-color platform for microarray analysis; they consist of probes complementary to a region of each mRNA transcript, usually at the 3’ end of the transcript. Each probe set consists of 11 to 20 perfect match (PM) probes, typically 25 nucleotides long, together with an equal number of mismatch (MM) probes, which are identical to the PM probes except for a single nucleotide substitution in the center of the probe.

DNA microarray techniques have been applied to predict DLBCL treatment success and explain disease heterogeneity alongside five clinical features (age, tumor stage, serum lactate dehydrogenase concentration, performance status, number of extranodal disease sites) (Gohlmann and Talloen, 2009). This technique is most widely used to profile the gene expression of an organism on a whole-genome scale, and it has spawned a series of microarray-based expression studies of DLBCL that refine prognosis using molecular-level information (Segal, 2005). Besides, DNA microarrays have also been used to analyze the changes in human B-cell gene expression induced by dioxins (Kovalova et al., 2017).

In this study, gene expression profiles representing DLBCL and dioxins (TCDD and furans) generated by DNA microarray techniques were used for the subsequent analytical steps. The gene expression datasets were collected from two main databases, Gene Expression Omnibus (GEO) and ArrayExpress, which are discussed in more detail in the following part.

2.4.2 Gene network database: ArrayExpress and GEO

All of the datasets in this study were derived from the ArrayExpress database and the Gene Expression Omnibus (GEO) database. ArrayExpress is a public database for high-throughput functional genomics data that consists of two distinct parts: the ArrayExpress Repository and the ArrayExpress Data Warehouse. The ArrayExpress Repository is a MIAME-supportive public archive of microarray data, whereas the ArrayExpress Data Warehouse is a database of gene expression profiles selected from the repository and consistently re-annotated. Samples or experiments can be found by experiment attributes, for example keywords, species, array platforms, authors, journals or accession numbers. Gene names, gene properties and gene ontology terms can be used to visualize gene expression profiles. The ArrayExpress database is growing rapidly and includes data from more than 50,000 hybridizations and 1,500,000 individual expression profiles. ArrayExpress supports community standards such as MIAME (Minimum Information About a Microarray Experiment), Microarray and Gene Expression Markup Language (MAGE-ML) and Microarray Gene Expression - Tabular format (MAGE-TAB) (Parkinson et al., 2007).

The GEO database, maintained by the National Center for Biotechnology Information (NCBI), is a large repository of gene expression data generated by DNA microarray technology. The database is designed to hold both unprocessed and processed data in a MIAME-supportive format. GEO holds about a billion gene expression measurements covering a large number of biological phenomena, derived from over 100 organisms and 1500 laboratories. Several user-friendly web applications have been developed to improve the utility, exploration, querying and visualization of these data for both individual and entire studies (Barrett, 2004).

2.4.3 Statistical analysis

2.4.3.1 Meta-analysis

Meta-analysis is a set of statistical techniques for combining results from several studies, using various kinds of statistics, for example Fisher’s statistic and the minimum and maximum statistics. This technique has been applied to microarray analysis in particular, in order to combine different studies for the detection of DEGs (differentially expressed genes) and to boost the reliability of results from individual studies (Shen and Tseng, 2010). A microarray meta-analysis is conducted in seven steps: (1) identify suitable microarray studies, (2) extract the data from the studies, (3) prepare the individual datasets, (4) annotate the individual datasets, (5) resolve the relationship between probes and genes, (6) combine the study-specific estimates and (7) analyze, present and interpret the results (Ramasamy et al., 2008). Meta-analysis is useful in this study for identifying the DEGs of DLBCL tissues and dioxin-exposed groups compared to normal tissues and control groups, respectively, which are the main concern of the next part.
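As an illustration of the combination step (6), Fisher's statistic can be computed for one gene from the p-values reported by several studies. The following is a minimal Python sketch, not the procedure used in this thesis: the function name `fisher_combined_pvalue` and the example p-values are hypothetical, and the p-values are assumed to be independent.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method.

    Returns (statistic, combined_p). Hypothetical helper for illustration.
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic: X = -2 * sum(ln p_j); under the null hypothesis
    # X follows a chi-square distribution with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom 2k the chi-square tail has a closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical p-values for the same gene from three separate studies.
stat, combined = fisher_combined_pvalue([0.04, 0.10, 0.03])
print(f"Fisher statistic = {stat:.3f}, combined p = {combined:.4f}")
```

With a single study the combined p-value equals the input p-value, which is a quick sanity check on the closed-form tail probability.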

2.4.3.2 False Discovery Rate (FDR)

The false discovery rate (FDR) is the expected fraction of false rejections among the rejected hypotheses. In microarray analysis it is used to estimate the proportion of false positive findings among the genes selected as differentially expressed (Gohlmann and Talloen, 2009). Although various procedures have been developed to control the FDR, the method of Benjamini and Hochberg is the most popular and was used in this study. The Benjamini-Hochberg adjusted P value is calculated with the formula below:

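$$p = \frac{p_i \times m}{\operatorname{order}(p_i)}, \quad i = 1, 2, 3, \ldots, m$$

Where:

$p$ is the P value adjusted by the Benjamini-Hochberg method

$p_i$ is the P value of gene $i$

$\operatorname{order}(p_i)$ is the rank of $p_i$ when all P values are sorted in ascending order

$m$ is the total number of genes in the dataset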
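The Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in this study: the function name `benjamini_hochberg` and the example P values are hypothetical. The sketch also enforces monotonicity (an adjusted value is never larger than that of a gene with a larger raw P value), which is the usual convention in statistical packages.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Gene indices sorted by ascending raw P value; rank 1 is the smallest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down to the smallest so that the
    # adjusted values are non-decreasing in the raw P value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw P = {p:.3f} -> adjusted P = {q:.3f}")
```

Genes whose adjusted P value falls below the chosen FDR level would then be reported as differentially expressed.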

2.4.3.3 Differential Expression Analysis

Differential gene expression analysis is applied in microarray analysis to find the genes that are differentially expressed. Mutation in a gene or a set of genes is a main factor that induces abnormal or failed gene expression; for example, abnormal expression of the p53 tumor suppressor gene can cause cancer. Therefore, microarray experiments are useful for identifying which genes are differentially expressed in diseased cells versus normal cells. Comparing various kinds of “disease” and “normal” cells provides an opportunity to find multiple target genes whose up- and down-regulation may result from the disease. Drugs targeting specific mutated genes can then be developed to reduce their undesirable effects. In addition, differential gene expression is closely related to gene function and can provide information about gene and protein interactions. Therefore, differentially expressed genes are used in the reconstruction of gene networks and metabolic pathways and in gene annotation (Zhang, 2006). In this study, DEGs are the main components for gene-network construction to determine whether dioxins can induce DLBCL.
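The comparison of "disease" and "normal" samples described above can be sketched as follows. This is a simplified Python illustration, not the pipeline used in this thesis: the gene names, expression values and cutoffs (`min_abs_lfc`, `min_abs_t`) are hypothetical, and the inputs are assumed to be log2-transformed intensities so that a difference of means is a log2 fold change.

```python
import math
from statistics import mean, variance


def log2_fold_change(case, control):
    """Difference of group means; inputs are assumed to be log2 intensities."""
    return mean(case) - mean(control)


def welch_t(case, control):
    """Welch's t statistic for two groups with possibly unequal variances."""
    diff = mean(case) - mean(control)
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    if se == 0.0:
        # No within-group variation: the statistic is unbounded or zero.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / se


def call_degs(expression, min_abs_lfc=1.0, min_abs_t=2.0):
    """Label genes as up- or down-regulated (illustrative cutoffs only)."""
    degs = {}
    for gene, (case, control) in expression.items():
        lfc = log2_fold_change(case, control)
        t = welch_t(case, control)
        if abs(lfc) >= min_abs_lfc and abs(t) >= min_abs_t:
            degs[gene] = ("up" if lfc > 0 else "down", lfc, t)
    return degs


# Hypothetical log2 expression values: (disease samples, normal samples).
toy = {
    "GENE_A": ([8.0, 8.2, 8.4], [5.0, 5.2, 5.4]),
    "GENE_B": ([6.0, 6.1, 5.9], [6.0, 5.9, 6.1]),
    "GENE_C": ([3.0, 3.1, 2.9], [5.0, 5.1, 4.9]),
}
for gene, (direction, lfc, t) in call_degs(toy).items():
    print(f"{gene}: {direction}-regulated, log2FC = {lfc:.2f}, t = {t:.1f}")
```

In practice the t statistic would be converted to a P value and adjusted for multiple testing before genes are called differentially expressed.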

2.4.4 Hub proteins

A gene network consists of nodes connected by edges. In molecular biology, nodes represent genes or proteins and edges represent molecular interactions; a gene network therefore represents the interactions of genes or proteins that lead to a variety of biological processes. The nodes in a network are divided into two types: (1) highly connected nodes, or hub proteins, and (2) poorly connected nodes, or non-hub proteins. Hub proteins are significantly more important than non-hubs since they maintain the network. In protein-protein interaction networks, hubs tend to be essential according to the centrality-lethality rule, under which the functional importance of a node is thought to arise from its structural importance in the network; as a result, hubs tend to relate to significant biological pathways that may result in biological reactions in the human body (He and Zhang, 2006). In this study, hub proteins play an important role in identifying the potential pathway from dioxin exposure to DLBCL.
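The distinction between hub and non-hub nodes can be illustrated by counting the interaction partners (degree) of each node. The sketch below is a minimal Python example under stated assumptions: the protein labels, the edge list and the degree cutoff `min_degree` are hypothetical, and node degree is used as the only measure of connectivity, whereas network tools may offer other centrality measures.

```python
from collections import defaultdict


def node_degrees(edges):
    """Count distinct interaction partners for every node in an edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def hub_nodes(edges, min_degree):
    """Return (node, degree) pairs with degree >= min_degree, largest first."""
    degrees = node_degrees(edges)
    hubs = [(node, deg) for node, deg in degrees.items() if deg >= min_degree]
    return sorted(hubs, key=lambda item: (-item[1], item[0]))


# Hypothetical protein-protein interactions between placeholder proteins.
ppi_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P4", "P5"),
]
print(hub_nodes(ppi_edges, min_degree=3))
```

Here only the placeholder protein P1, with three partners, passes the illustrative cutoff; the other nodes would be treated as non-hubs.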

2.4.5 GO term

The Gene Ontology (GO) consists of terms that are connected in a hierarchical order. GO terms are associated with gene products and classify proteins into three distinct groups according to their biological function: (1) molecular function, (2) biological process and (3) cellular component (Balakrishnan, 2013). These functions are summarized from published papers and uploaded to the GO database, so that researchers can access this information through the process of annotation. In addition, the GO database provides the main annotation source for the analysis of high-throughput datasets, for example transcriptomic and proteomic studies, and for identifying the functions, pathways or cellular components represented by these datasets (Pavlidis, 2004). Furthermore, the GO database is used as a pathway-driven analysis tool to identify risk, since it can be related to single nucleotide polymorphisms (SNPs) that are useful to inform biomarker identification studies (Holmans, 2009).

2.5 Gene Network construction tools

2.5.1 Network Analyst website

Network analyst website is one of the most basic and friendly tools and it

combines all necessary steps to analyze network and performs the results through a

high-quality visualization system This website is available for anyone and it is

designed for efficient Protein-protein interaction network performance The data in

this website is generated from several gene expression experiments of various species,

mainly from human and mouse studies

Network Analyst organizes network analysis into three main steps: identification of significant genes through data processing; network construction, for mapping, building and refining the network; and network analysis and visualization. Multiple options are provided within each main step (Xia et al., 2014). In this study, Network Analyst was considered an adequate tool for finding the most significant DEGs of DLBCL and dioxins, and the hub proteins, for gene-network construction and potential pathway identification, respectively.


2.5.2 Cytoscape software and plugins: ClueGO and CluePedia Apps

Cytoscape is an open-source software platform for integrating high-throughput expression data and other molecular states into a unified conceptual framework. It is most powerful when used in conjunction with databases of protein-protein, protein-DNA and genetic interactions that are available for humans and other organisms (Shannon, 2003).

A large number of enrichment tools and algorithms have been developed to support data interpretation. ClueGO is a Cytoscape plugin that presents the biological interpretation of gene lists and functionally grouped terms in the form of networks and charts. In particular, ClueGO uses the kappa statistic to link the terms in the network, so that GO terms or pathways are functionally organized. ClueGO is therefore one of the available Cytoscape plugins for analyzing term relationships and functional groups in biological networks (Bindea et al., 2009).
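As a rough illustration of how a kappa score can link two terms, the sketch below computes Cohen's kappa from the gene sets annotated to two terms within a shared gene universe. It is a minimal sketch under stated assumptions, not ClueGO's actual implementation: the function name `kappa_score`, the gene labels and the term sets are all hypothetical.

```python
def kappa_score(genes_a, genes_b, universe):
    """Cohen's kappa for the agreement of two terms over a gene universe."""
    genes_a, genes_b, universe = set(genes_a), set(genes_b), set(universe)
    n = len(universe)
    both = len(genes_a & genes_b & universe)          # annotated to both terms
    only_a = len((genes_a - genes_b) & universe)      # annotated to term A only
    only_b = len((genes_b - genes_a) & universe)      # annotated to term B only
    neither = n - both - only_a - only_b              # annotated to neither term
    observed = (both + neither) / n                   # observed agreement
    p_a = (both + only_a) / n
    p_b = (both + only_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)      # agreement expected by chance
    if expected == 1.0:
        return 0.0  # kappa is undefined here; returning 0.0 is an assumption of this sketch
    return (observed - expected) / (1 - expected)


# Hypothetical universe of ten genes and two hypothetical terms.
universe = [f"G{i}" for i in range(10)]
term_x = {"G0", "G1", "G2", "G3"}
term_y = {"G0", "G1", "G2", "G4"}

kappa = kappa_score(term_x, term_y, universe)
linked = kappa >= 0.4  # terms are connected when kappa reaches the chosen threshold
print(round(kappa, 3), linked)
```

Raising the threshold keeps only pairs of terms that share most of their genes, which yields fewer and tighter functional groups.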

CluePedia is the second Cytoscape plugin used in this study. CluePedia is a useful tool for searching for new markers associated with pathways. With CluePedia, various kinds of genes, proteins and miRNAs can be connected on the basis of experimental information before being integrated into a ClueGO network. In addition, new pathway associations can be revealed through gene, protein and miRNA enrichments. This Cytoscape plugin is therefore convenient for users and provides powerful visualization for presenting networks of connections among genes, miRNAs or proteins (Bindea et al., 2013).

Cytoscape and the ClueGO/CluePedia plugins were applied to perform gene-network reconstruction and to identify potential pathways, corresponding to the second and third objectives of this study.

PART III METHODOLOGY

3.1 Data collection

First, all required microarray datasets were collected from two main websites, the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/) and Array Express (http://www.ebi.ac.uk/arrayexpress), which are large public repositories of gene expression data and provide users with flexible data mining tools (https://academic.oup.com/nar/article/35/suppl_1/D760/1106106/NCBI-GEO-mining-tens-of-millions-of-expression).

To measure human gene expression in DLBCL and how it relates to chemical exposure, several datasets were obtained from both websites using the following keywords: Homo sapiens, DLBCL, TCDD and Furans, and the array files containing processed data were used in this study. Within each DLBCL array file, the experimental samples normally come from various sources. In this study, two types of samples were used, normal tissues and DLBCL tissues, and none of them had been treated with any chemical. All datasets were generated on the Affymetrix array platform, and the files had to be available from September 2015 to the present. As a result, a total of 10 microarray datasets were found to fit the scope of this study, including GEOD-12195, 83632, GSE47355, GSE56313, 69844, 69845, E-GEOD-69849, E-GEOD-69850, E-GEOD-69851.

3.2 Data processing

Data analysis was subsequently performed using Network Analyst, a web-based tool for network analysis and interactive exploration. The datasets were combined and divided, according to sample source, into three distinct groups of text files (.txt): (1) DLBCL and normal tissues, (2) control and TCDD, and (3) control and Furans. In the initial step, the text files were uploaded, the organism was defined as Homo sapiens, and Official Gene Symbol was chosen as the ID type. ID conversion was applied immediately after upload to identify the organism and to report the number of genes matched or unmatched to the chosen ID type. The files were then submitted to the gene annotation step to ensure that labels were consistent across all uploaded datasets. After that, the data normalization step was carried out to set an adequate normalization procedure. In this report, no normalization was applied to the DLBCL and normal tissue data, while log2 normalization was applied to the control and dioxin treatment data in order to increase the variance at low intensities.

The normalized data were then passed to the differential expression (DE) analysis dialog to perform differential gene expression analysis on each individual dataset, so that the DEGs between DLBCL and normal tissues, and between control and chemical groups, could be detected. An analysis of variance (ANOVA) was conducted on each dataset, and p values were adjusted using the Benjamini-Hochberg false discovery rate (FDR), which is used to decide whether a gene is differentially expressed; the cut-off was set to 0.05 in the DE analysis dialog of the Network Analyst website. After the data summarization step, the 4 DLBCL datasets, 5 TCDD datasets and 1 Furans dataset (Table 4.1.1, Table 4.1.2) were combined using the "direct merge" method in the meta-analysis step, merging the datasets of each group into a single dataset for analysis.
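The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic illustration of the procedure, not the code that Network Analyst runs; the function name `benjamini_hochberg` and the raw p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)

# A gene is called differentially expressed when its adjusted p value is <= 0.05.
significant = [p <= 0.05 for p in adj_p]
print([round(p, 5) for p in adj_p], significant)
```

In this toy input the third and fourth genes have raw p values below 0.05 but adjusted values slightly above it, which shows how the adjustment trims borderline calls when many genes are tested at once.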


Finally, three result tables containing the top-ranking DEGs and relevant statistics (CombineLogFC, adjusted P value) for DLBCL, TCDD and Furans were exported separately (Appendix 1, 2, 3).

3.3 Network construction

The DEGs of DLBCL, TCDD and Furans obtained above were screened by |fold change| ratios of 2.0, 1.2 and 1.2, respectively (a fold change of 1.2 corresponds to |Combine LogFC| ≥ 0.26), to filter the top up-regulated and down-regulated genes, which were used in the further analysis steps: (1) Gene Ontology analysis and (2) gene network reconstruction. The GO biological pathways of these genes were found by submitting the DEG list of each group to the Cytoscape plug-in ClueGO. The resulting gene networks show the biological pathways in which these DEGs are involved. The DEGs were also loaded into the ClueGO app in Cytoscape to reconstruct the DLBCL, TCDD and Furans gene networks. The kappa score threshold can be adjusted on a positive scale from 0 to 1 to restrict network connectivity in a customized way and to create functional groups of genes (Bindea et al., 2009); a kappa score of 0.4 was chosen in this study to create these sub-networks. All three sub-networks of DLBCL, TCDD and Furans were then merged into a single network, providing a potential pathway showing the effect of TCDD and Furans on human health leading to DLBCL.
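The relation between a linear fold-change cut-off and a log fold-change cut-off can be checked with a short sketch. It assumes that Combine LogFC is a base-2 logarithm, which is consistent with 1.2 mapping to about 0.26; the function name `passes_fold_change` and the gene records are hypothetical.

```python
import math


def passes_fold_change(log2_fc, min_fold_change):
    """True if |log2 FC| meets the threshold implied by a linear fold change."""
    return abs(log2_fc) >= math.log2(min_fold_change)


# Thresholds implied by the two linear cut-offs (base-2 logarithm assumed).
print(round(math.log2(1.2), 2))  # 0.26
print(math.log2(2.0))            # 1.0

# Hypothetical records: (gene label, combined log2 fold change).
toy_degs = [("gene_a", 0.31), ("gene_b", -0.27), ("gene_c", 0.10), ("gene_d", -1.40)]

kept_1_2 = [g for g, fc in toy_degs if passes_fold_change(fc, 1.2)]
kept_2_0 = [g for g, fc in toy_degs if passes_fold_change(fc, 2.0)]

# Split the genes that pass the 1.2 cut-off by direction of change.
up_regulated = [g for g, fc in toy_degs if fc > 0 and passes_fold_change(fc, 1.2)]
down_regulated = [g for g, fc in toy_degs if fc < 0 and passes_fold_change(fc, 1.2)]
print(kept_1_2, kept_2_0, up_regulated, down_regulated)
```

Under the same assumption, the stricter cut-off of 2.0 corresponds to |log2 FC| ≥ 1.0, so it retains far fewer genes than the 1.2 cut-off.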

To clarify the potential pathway between TCDD/Furans exposure and DLBCL, protein-protein interaction networks were constructed to identify the hub genes, which may have a vital function and be indirectly involved in many biological processes (Raman et al., 2013). The filtered DEGs of DLBCL, TCDD and Furans were submitted individually to the Network Analyst website so that each set produced its own protein-protein interaction network. The hub proteins of DLBCL and of the two dioxin-related compounds with the highest values of (1) node degree and (2) node betweenness were then characterized and summarized for the next step of pathway analysis. The list of hub proteins and the additional target genes was loaded into the CluePedia app in Cytoscape to show the potential pathway by which TCDD and Furans lead to DLBCL. Two types of directed edges, gene activation and gene expression, were chosen to build the pathway network showing how dioxin-related compounds can lead to DLBCL in the human body.
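The two hub criteria, node degree and node betweenness, can be illustrated on a toy interaction network. The sketch below uses Brandes' algorithm for an undirected, unweighted graph; it is not the Network Analyst implementation, and the function name `degree_and_betweenness` and the protein labels are hypothetical.

```python
from collections import deque


def degree_and_betweenness(edges):
    """Node degree and unnormalized betweenness for an undirected, unweighted graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    degree = {n: len(nb) for n, nb in adj.items()}
    betweenness = {n: 0.0 for n in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {n: [] for n in adj}
        sigma = {n: 0 for n in adj}
        dist = {n: -1 for n in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {n: 0.0 for n in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return degree, {n: b / 2 for n, b in betweenness.items()}


# Hypothetical protein-protein interactions.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P4", "P5")]
degree, betweenness = degree_and_betweenness(toy_edges)
hub = max(degree, key=lambda n: (degree[n], betweenness[n]))
print(hub, degree[hub], betweenness[hub])
```

Degree counts direct interaction partners, whereas betweenness counts the shortest paths that pass through a node, so a protein can rank highly on one criterion and modestly on the other.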

All steps undertaken in this research are summarized in the following flowchart (Figure 3.1).


Figure 3.1: The flowchart of methodology

[Flowchart: DATA COLLECTION (datasets available after September 2015); DATA PROCESSING in Network Analyst (ID conversion, annotation, normalization, DE analysis), yielding differentially expressed genes; NETWORK CONSTRUCTION (gene network presenting GO, protein-protein interaction network, potential pathway)]


Table 4.1: Database of DLBCL

Name | Data source (website) | Species | Sample source (type of tissues) | Normal tissues | DLBCL | Total samples | Array platform
E-GEOD-12195 | Array Express | Homo sapiens | Fresh frozen tissue, normal tonsil | | | |
 | | Homo sapiens | Fresh frozen tissue | | | |
GSE473 | | Homo sapiens | Lymph node tissues of DLBCL patients | | | |
GSE563 | | Homo sapiens | Lymph node tissues of DLBCL patients | | | |


Name | Data source (website) | Species | Sample source | Control | Treatment | Total | Array platform
E-GEOD-69844 | Array Express | Homo sapiens | HepaRG Hepatocytes | 45 | 7 | 52 | Affymetrix
E-GEOD-69845 | Array Express | Homo sapiens | MCF7 Breast Adenocarcinoma | | | |
 | | Homo sapiens | Ishikawa Endometrial Adenocarcinoma Cell Line | | | |
 | | Homo sapiens | HepG2 Human Hepatocyte Carcinoma Cell Line | | | |
 | | Homo sapiens | HepaRG Hepatocyte Carcinoma Cell Line | | | |
 | | Homo sapiens | Expression profiles of HepG2 cells treated with furans | | | |


4.1.2 Differentially expressed genes

Using Network Analyst and the Benjamini-Hochberg FDR method, a total of 1288 DEGs were identified after screening by |fold change| ≥ 1.2 (|Combine LogFC| ≥ 0.26): 488 DEGs for DLBCL, 288 for TCDD and 512 for Furans. The numbers of up-regulated genes for DLBCL, TCDD and Furans were 316, 268 and 217, respectively, and the numbers of down-regulated genes were 172, 20 and 295, respectively (Tables 4.3, 4.4, 4.5).


Table 4.3: Differentially expressed genes, including up- and down-regulated genes, in diffuse large B lymphoma compared to normal cells

DLBCL (488 DEGs)

Up-regulated genes

COMMD8, NDUFS3, MFSD1, VAMP8, HSBP1, HSD17B10, LSM1, RRM1, RSL24D1, C14orf2, PDZD11, POP5, PSMD14, APEX1, ACTR10, MRPL33, NDUFA8, DDX39A, TMEM147, IGBP1, DCTPP1, IMPDH2, RRM2, MRPL18, POLR2H, PSMD1, TIMM10, MRPL27, YWHAG, FIS1, DDX23, SNAPIN, BLOC1S2, RCC2, SAT1, PTRHD1, CDK2AP1, TAF7, BCL2A1, GLA, ALYREF, CD19, COX14, CD3D, SLAMF8, CMC2, S100A11, PRDX1, C19orf70, MRPS33, NDUFB5, TMEM126A, DNAJB11, PRR13, EMG1, NAE1, ADSL, EVI2A, MRPL15, MRFAP1, GZMA, EIF2A, GYG1, ISG15, GJA1, MLH1, PARP1, RACGAP1, SNRPB, MMADHC, CBX1, MRFAP1L1, ACADM, RPL36AL, CCDC12, STARD3NL, CETN3, HEXB, CEBPB, ISCU, LSM6, DNMT1, PSMC5, MAGED1, NAA20, TSPO, MORF4L1, PDLIM1, DDIT4, EIF4H, PSMB10, ITPA, MRPS28, PSMC4, BLOC1S1, CD52, UBE2A, ATP5L, RPS19BP1, SEC13, SEC11A, THOC7, NOC3L, DKFZP586I1420, ZNF121, MRPS35, DIABLO, OCIAD2, PSMD8, ARPC5L, MS4A1, MCM2, MRPL40, GCH1, PSMD10, DNTTIP2, OAT, NSMCE1, TBCB, C14orf119, ACP5, PPM1G, POLR2K, TSG101, PEA15, MRPL49, NIT2, ATIC, PPP2CB, NCBP2, RABAC1, DRG1, NUP107, TCF4, SLC25A19, UFC1, CIB1, BIRC2, NDUFB10, RBBP8, SNX3, SMNDC1, HDHD2, ETF1, RAD23A, MYBL2, SRRM1, TIMMDC1, COX5B, LYRM1, IL18, ARHGAP17, IRF2BPL, NONO, TM2D2, MFAP1, ITGA3, KCTD12, NUPR1, HAT1, AP3S1, MANF, TMEM14B, CPSF4, PPIH, MIEN1, MTIF2, FAM50A, LRRC47, PAPSS1, GLO1, CCNG1, RPIA, ASNSD1, LYPLA1, WDR83OS, CUTA, DAZAP1, AP1S2, BTBD1, VPS25, BCL11A, MT1E, ZNHIT3, EIF3I, RPL11, S100A8, ANXA2, PPIL3, GLRX, ENOPH1, IER5, CISD1, HAUS1, DRAM1, DDX21, SNRPD3, UBE2L6, TMEM138, RPF2, DUT, GTF3C6, TSPAN13, ITM2A, PPP1R7, PIH1D1, GTF2B, CDK5RAP3, TMEM208, DBF4, GTF3A, RFC4, IER3, YTHDF2, FIBP, TIMM8B, MPLKIP, VPS28, LAGE3, CLIC1, HARS, IMP3, CS, CEBPZ, RFX5, DNAJB1, MRPL16, CSRP1, ORMDL2, PIGP, CDKN1A, NMI, FAM35A, TNFAIP3, PCMT1, EBPL, TUBB6, GBP1, PLOD1, TUBA1C, REEP5, EIF2S1, MRPL1, IMP4, SNRPA, MARCKSL1, DYNLT3, UBE2E2, SCAMP3, POLR3GL, CUEDC2

Down-regulated genes (172)

DUSP6, CYTH4, LCP2, SIRPB1, ITGB2, CORO1A, RAB7A, COX7A2L, MEFV, ANPEP, C5AR1, ZYX, DOCK5, STEAP4, GRK6, MSL2, PLXNC1, STK17B, PYGL, CD3E, KCNJ15, SCIMP, CAPNS1, GLIPR1, CPPED1, IST1, LILRA1, PRKAR1A, ARRB2, WDR1, ARHGAP26, DUSP1, WIPF1, MXD1, BSG, CELF2, GNAQ, ZFAND5, MBOAT7, GABARAP, MBNL1, AOAH, CTSS, DOK3, HIST1H1E, CYP4F3, PTBP3, NCF2, RNASET2, TCP11L2, MAPK1, PIP4K2A, STAT3, DOCK8, TLN1, TGFBR2, SELPLG, PGK1, FPR1, SDHA, SMCHD1, MOB3A, DDX17, TUBB1, GUK1, LYN, CD37, ETS1, CCNI, STK38, ATP6V1B2, CAP1, PDZK1IP1, HBB, EPB41, TREM1, PTAFR, GNAS, FFAR2, RPL18, IL7R, EIF4EBP2, SLC44A2, HLA-DPA1, LITAF, ITM2B, CXCR2, CYBB, CFL1, LCP1, ALAS2, PTPRC, CSF3R, ARHGDIB, AQP9, DAZAP2, SLC6A6, B2M, SMAP2, BCL2L1, SORL1, RAC2, FBXO7, PSAP, FCN1, ND5, SLC25A37, TNFRSF10C, TMBIM6, CD74, HLA-E, SLC25A39, DCAF12, CX3CR1, RHOA, CD53, XPO6, TAGLN2, FCGR2A, MSN, LYZ, LAPTM5, MALAT1, TXNIP, ACTB

Table 4.4: Differentially expressed genes, including up- and down-regulated genes, activated by TCDD compared to control group

Title	Identifying the effect of exposure to dioxin and furans on human health leading to diffuse large B lymphoma through gene network construction
Author	Nguyen Thi Quynh Lam
Supervisors	Prof. Chun-Yu Chuang, Assoc. Prof. Tran Thi Thu Ha
Institution	Thai Nguyen University of Agriculture and Forestry
Major	Environmental Science and Management
Document type	Thesis
Year of publication	2017
City	Thai Nguyen

Format
Pages	87
File size	1.46 MB