Developing a Hadoop-based Distributed System for Metagenomic Binning

VIETNAM NATIONAL UNIVERSITY OF HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Major: Computer Engineering

Committee: Computer Engineering 1

(English Program)

Supervisor: Assoc. Prof. Dr. Tran Van Hoai
Reviewer: Assoc. Prof. Dr. Thoai Nam

—o0o—

Student 3: Nguyen Huu Trung Nhan (1752392)

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY          SOCIALIST REPUBLIC OF VIETNAM
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY               Independence - Freedom - Happiness

DEPARTMENT: Computer Science (KHMT)    Note: students must attach this sheet to the first page of the thesis report.

FULL NAME: ________ STUDENT ID: ________

FULL NAME: ________ STUDENT ID: ________

FULL NAME: ________ STUDENT ID: ________ MAJOR: ________ CLASS: ________

1. Thesis title:

Developing a Hadoop-based Distributed System for Metagenomic Binning

2. Tasks (requirements on content and initial data):

- Study metagenomic binning and the BiMeta algorithm

- Study Hadoop and set up a demonstration system

- Develop the BiMeta algorithm on Hadoop

- Build a user interface so that users can run BiMeta on the Hadoop computing system

- Evaluate the feasibility of using Hadoop for the metagenomic binning problem on datasets used by the community

3. Thesis assignment date: ...

4. Thesis completion date: ...

1) Assoc. Prof. Dr. Tran Van Hoai, supervising the entire thesis. 2) ________ 3) ________

The content and requirements of the thesis have been approved by the Department.

Date ..... month ..... year .....

HEAD OF DEPARTMENT                    MAIN SUPERVISOR

(signature and full name)             (signature and full name)

FOR THE FACULTY AND DEPARTMENT:

Preliminary reviewer: ________

Supervisor's evaluation sheet (Assoc. Prof. Dr. Tran Van Hoai):

- The students have grasped the basic knowledge of Hadoop and Spark needed to set up a demonstration system.

- The students have grasped the basic knowledge needed to port a sequential algorithm to a Hadoop-style distributed computing environment.

- The students have experimented with the datasets used in well-known publications to assess the capability of the system.

7. Main shortcomings of the thesis:

- Because the students' background in biology is still limited, the conversion of the algorithm to the Hadoop and Spark programming model has a few limitations; in particular, the computation time does not yet demonstrate the capability of the distributed system.

- Because a large computing infrastructure could not be accessed easily, the computational results are still limited.

- The students have only basic knowledge of metagenomic binning, so the functions (user interface) of the system are not yet creative.

8. Recommendation: Approved for defense / Needs additional work before defense / Not approved for defense

9. Three questions the students must answer before the Committee:

a. If the system were developed further to increase its computing capability, what specifically would be done?

10. Overall assessment (excellent, good, average): excellent. Mark: 9.0/10

Signature (full name): Tran Van Hoai

Reviewer's evaluation sheet (Assoc. Prof. Dr. Thoai Nam):

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY - SOCIALIST REPUBLIC OF VIETNAM

1. Students: Tran Duong Huy (student ID 1752242), Pham Nhat Phuong (student ID 1752042), Nguyen Huu Trung Nhan (student ID 1752392) - major: Computer Engineering.

2. Topic: Developing a Hadoop-based distributed system for metagenomic binning.

- The students have mastered the computing techniques and technologies of Hadoop and Spark well;

- The Metagenomic Binning problem was also understood by the students, who deployed a computational solution for it;

- Due to limitations of time and of the experimental environment, the evaluation of the solution to the Metagenomic Binning problem on a distributed environment using Hadoop & Spark is still simple, so the effectiveness of the solution, as well as result analyses that would guide improvements to its performance, have not yet been shown.

8. Recommendation: Approved for defense / Needs additional work before defense / Not approved for defense

9. Three questions the students must answer before the Committee:

10. Overall assessment (excellent, good, average): excellent. Mark: 8.0/10

Signature (full name): Thoai Nam

We commit that our topic "Developing a Hadoop-based distributed system for metagenomic binning" is our personal thesis proposal. We declare that this topic was conducted with our own effort and time, and under the guidance of our supervisor, Assoc. Prof. Dr. Tran Van Hoai.

All of the research results were produced by ourselves and were not copied from any other sources.

If there is any evidence of plagiarism, we will be responsible for all consequences.

Ho Chi Minh City, 2020
Tran Duong Huy, Pham Nhat Phuong, Nguyen Huu Trung Nhan

This thesis would not have been possible to complete without the help and support of many others. First and foremost, we would like to express our sincere gratitude to our supervisor, Assoc. Prof. Dr. Tran Van Hoai. His insight and expert knowledge in the field improved our research work, and he has also helped us organize our thinking in research and in writing the thesis.

We sincerely thank the teachers of the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology for their enthusiasm in imparting knowledge during our time at school. The knowledge accumulated throughout the learning process has helped us to complete this thesis.

Finally, we would like to wish the teachers and our supervisor good health and success in their noble careers.

Bioinformatics research is considered to be an area in which biological data is vast, extensive, and complex. Biological data are constantly evolving and often unlabeled, so supervised methods cannot be used. One of the most difficult problems in this area is detecting new strains of a virus, or at least grouping them together, which is an urgent need (for example, determining the proximity of SARS-CoV-2 to bat viruses). One way to address the problem is to create scalable clustering tools that can handle very large amounts of data. Genomics and next-generation sequencing technologies like Illumina and Roche 454 are producing data at enormous rates (on the order of 200 billion reads a week, covering tens of thousands of genes), which demands efficient computers. Our goal in this thesis, inspired by previous research, is to create a Hadoop-based tool for metagenomic binning.

Contents

1 Introduction 1

1.1 Overview of this thesis 1

1.2 Scope and Objectives 1

1.3 Thesis Outline 2

2 Background 3

2.1 Metagenomic 3

2.1.1 Background 3

2.1.2 Basic Concepts 4

2.1.3 Metagenomic Binning 5

2.2 Hadoop 8

2.2.1 Hadoop Components 8

2.2.2 HDFS - Hadoop Distributed File System 9

2.2.3 Hadoop MapReduce 12

2.2.4 YARN 14

2.3 Spark 18

2.3.1 What is Spark? 18

2.3.2 Introduction to Spark 19

2.3.3 How does Spark run on a cluster? 22

3 Related Work 28

3.1 A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads 28

3.1.1 Background 28

3.1.2 Method 28

3.1.3 Fundamentals of proposed method 29

3.1.4 Datasets 30

3.1.5 Result 30

3.1.6 Conclusion 30

3.2 Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons 31

3.2.1 Background 31

3.2.2 Method 31

3.2.3 Datasets and Hadoop cluster configuration 31

3.2.4 Result 32

4 Methodology 33

4.1 BiMetaReduce 33

4.1.1 Step 1: Load Fasta file 33

4.1.2 Step 2: Create Document 33

4.1.3 Step 3: Create Corpus 33

4.1.4 Step 4: Build Overlap Graph 35

4.1.5 Step 5 and 6: Find Connected Component and Clustering 35

4.2 System Design 37

4.2.1 Overview 37

4.2.2 Proposed System Architecture 37

4.2.3 Devices and Components 37

4.2.4 Web Application Design 39

5 Experimental and Evaluation 50

5.1 Datasets 50

5.2 Experiments 50


List of Figures

2.1 Reads and Sequence 5

2.2 Hadoop Components 8

2.3 A Basic Hadoop Cluster 9

2.4 Upload A File To HDFS 10

2.5 Read A File From HDFS 11

2.6 The first few lines of the file S8.fna 13

2.7 The pipeline of the phase Reading Fasta File 14

2.8 Map-Reduce Logical Data Flow for Reading Fasta File 16

2.9 Hadoop YARN - How YARN manages a running job 17

2.10 Spark’s toolkit 18

2.11 The architecture of a Spark Application 20

2.12 A narrow dependency 21

2.13 A wide dependency 21

2.14 A cluster driver and worker (no Spark Application yet) 22

2.15 Spark’s Cluster mode 23

2.16 Spark’s Client mode 24

2.17 Requesting resources for a driver 24

2.18 Launching the Spark Application 25

2.19 Application execution 25

2.20 Shutting down the application 26

3.1 Binning process of BiMeta 29

3.2 The Libra workflow 32

3.3 Scalability testing for Libra 32

4.1 Workflow of the BiMetaReduce program 35

4.2 Overview of the proposed system 38

4.3 Python & Django frameworks 39


4.4 HTML, CSS & Javascript 39

4.5 Model View Template Model 40

4.6 Use case diagram of web application 41

4.7 Login and Register Sequence Diagram 44

4.8 Homepage, "About" page and "Your Project" page Sequence Diagram 45

4.9 "System" page Sequence Diagram 46

4.10 Homepage 47

4.11 Homepage 2 47

4.12 Homepage 3 48

4.13 "System" page 48

4.14 "Your Project" page UI 49

4.15 "About" page UI 49

5.1 Sequential and MapReduce Runtime 52

5.2 Runtime of different settings 53

5.3 Runtime of setting 3 and 4 53

5.4 Runtime of setting 7 and 8 54

5.5 Fruchterman-Reingold graph of the R4 file 54

5.6 Kamada-Kawai graph of the R4 file 55

5.7 Circular graph of the R4 file 55


Chapter 1

Introduction

1.1 Overview of this thesis

Bioinformatics research is considered an area that involves large, expanding, and complex biological data sets. The biological data is increasingly large and often unlabeled, making it impossible to use supervised learning methods. A challenging problem that arises in this domain is detecting a new strain of a virus, or at least clustering it into a group of species, which is a very urgent need (such as identifying the proximity of SARS-CoV-2 to a group of bat viruses). An alternative approach to the problem is to build a scalable clustering tool that can work with very large datasets.

Genomics and next-generation sequencing technologies such as Illumina and Roche 454 produce data at enormous rates (on the order of 200 billion reads per week, covering tens of thousands of genes), which demands efficient computers. Inspired by previous studies, our goal in this thesis is to create a Hadoop-based tool for metagenomic binning.

1.2 Scope and Objectives

This work aims to develop a scalable clustering tool for working with large datasets. The objectives required for this thesis are:

• Learn about the fundamentals of genes and genomes

• Research about Genomic Technologies

• Research the Hadoop distributed processing framework for data processing and storage for big data applications, and related distributed computing environments

• Learn about metagenomic clustering problems and algorithms capable of distributed computing

• Develop distributed clustering tools and carry out experimental computation on actual microbiological datasets from NCBI's database

1.3 Thesis Outline

This thesis is organized as follows. Chapter 1 introduces the project. Chapter 2 presents the background of the research field. Chapter 3 reviews related work proposing different methods to solve the problem. Chapter 4 describes the proposed method and discusses it. Chapter 5 covers our experimental results and evaluation, and the last chapter concludes our work.

Chapter 2

Background

2.1 Metagenomic

2.1.1 Background

This leads to the question: what is metagenomics? Metagenomics is made up of two words: meta and genomics. Genomics concerns collections of DNA sequences, and meta indicates that we obtain them from many organisms together. Metagenomics is a molecular tool used to analyze DNA from environmental samples to study the microbial communities present without having pure cultures [4].

Sanger sequencing, also known as the "chain termination method", is a well-known method for determining the nucleotide sequence of DNA. Many years after the development of Sanger sequencing technology, the massively parallel sequencing technology known as next-generation sequencing (NGS) revolutionized the biological sciences. Next-generation sequencing technologies produce millions of reads faster and at lower cost. Recently, many projects have used Roche 454, the Illumina Genome Analyzer, and AB SOLiD [11].

The analysis of microbial communities is usually carried out by a process referred to as binning, in which reads from related organisms are grouped together. Binning methods can be broadly classified into two categories: supervised and unsupervised methods. Among the most important supervised methods we can recall Mega [6], Kraken [14], Clark [9], and MetaPhlan [10]. The other category of methods is unsupervised, including MetaProb [5], BiMeta [12], MetaCluster [13], and AbundanceBin [15].

Because of the lack of labeled datasets, unsupervised methods are a better approach for the binning problem. Among the existing studies, we found that BiMeta and MetaProb have significant results in metagenomic clustering. In this thesis, we develop a Hadoop-based tool for metagenomic binning, with BiMeta used as the main base model for clustering.


2.1.2 Basic Concepts

1 Definition Of Bioinformatics

The term "bioinformatics" was coined by Paulien Hogeweg and Ben Hesper in 1978. Bioinformatics is fundamentally informatics as applied to biology, or computer-aided analysis of biological data [2].

2 Goals Of Bioinformatics Analysis

The main goal of bioinformatics is to be able to predict biological processes in health and disease. To achieve this, an understanding of the underlying biological processes is necessary. Thus, an additional goal of bioinformatics is to develop such an understanding through the analysis and integration of the information obtained on genes and proteins. Furthermore, bioinformatics develops new tools and continuously improves the existing set of tools for diverse types of analyses [2].

3 DNA & Nucleobases

DNA is a double-stranded right-handed helix; the two strands are complementary because of complementary base pairing, and antiparallel because the two strands have opposite 5'-3' orientations.

DNA is composed of structural units called nucleotides (deoxyribonucleotides). Each nucleotide is composed of a pentose sugar (2'-deoxy-D-ribose); one of the four nitrogenous bases, adenine (A), thymine (T), guanine (G), or cytosine (C); and a phosphate. The pentose sugar has five carbon atoms, numbered 1' (1-prime) through 5' (5-prime). The base is attached to the 1' carbon atom of the sugar, and the phosphate is attached to the 5' carbon atom.

In double-stranded DNA, A pairs with T by two hydrogen bonds and G pairs with C by three hydrogen bonds; thus, GC-rich regions of DNA have more hydrogen bonds and consequently are more resistant to thermal denaturation [2]. Other nucleotide symbols commonly used are R (which denotes G or A) and its complement Y (which denotes C or T).

4 Sequencing & Read

Genome sequencing is the most direct method of detecting mutations, such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs). The development of the dideoxy method of DNA sequencing was a major step forward for molecular biology; it was published by Sanger and colleagues in 1977. About 20 years after the development of Sanger's dideoxy sequencing, Pal Nyren introduced the pyrosequencing technique, which paved the way for the development and commercialization of large-scale, high-throughput, massively parallel sequencing technology, popularly referred to as next-generation sequencing (NGS) technology [2].

Most recent projects have made use of next-generation sequencing technologies such as 454 pyrosequencing, the Illumina Genome Analyzer, and AB SOLiD. New sequencing technologies can generate millions of reads much faster and at lower cost. However, the length of the sequences produced by these technologies varies greatly: Illumina read lengths range from 50 to 300 bp, while the Roche 454 system can generate reads up to 700 bp. As a result, both long-read and short-read research tools are needed for metagenomic projects [12].

In next-generation sequencing, a read refers to the DNA sequence obtained from one fragment (a small section of DNA). Prior to sequencing, most next-generation sequencing methods fragment the genome, and each sequenced fragment produces a read. The length of the reads and the number of reads generated are determined by the fragment size and the technology used. Since DNA fragments usually overlap, the reads may be pieced together to recreate the genome. Some next-generation sequencing techniques that do not fragment the genome are known as long-read sequencing because they yield very long reads.

Figure 2.1: Reads and Sequence

2.1.3 Metagenomic Binning

1 Binning Method & Clustering Taxonomy

Metagenomic science, which is only a few years old, makes it possible to study microbes in their natural environment, the complex communities in which they usually live. Understanding microbial communities brings changes in many fields, such as biology, medicine, ecology, and biotechnology.

The analysis of microbial communities is usually carried out by a process referred to as binning, which groups reads into clusters that represent genomes from closely related organisms. Binning methods can be roughly classified into three main categories: supervised, semi-supervised, and unsupervised methods. Among these, unsupervised methods base classification on features derived from the reads themselves, which is particularly useful when reference databases are limited. There are many clustering algorithms, and their processes can be broadly classified into five categories: partitioning-based, hierarchical-based, density-based, grid-based, and model-based. Clustering can be supervised or unsupervised.

In supervised clustering, the expression pattern of the gene(s) is known and this knowledge is used to group genes into clusters. The most widely used method of unsupervised clustering is hierarchical clustering [2]. However, because the BiMeta binning algorithm is applied in this thesis, the unsupervised clustering used here is k-means clustering, which belongs to the partitioning-based category.


2 K-means Clustering

K-means is a centroid-based (distance-based) algorithm that computes distances to allocate each point to a cluster. Each cluster in K-means is associated with a centroid. There are five main steps in the K-means algorithm:

• Step 1: Choose the number of clusters k

• Step 2: Select k random points from the data as centroids

• Step 3: Assign all the points to the closest cluster centroid

• Step 4: Recompute the centroids of newly formed clusters

• Step 5: Repeat steps 3 and 4

There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:

• Centroids of newly formed clusters do not change

• Points remain in the same cluster

• The maximum number of iterations is reached

The K-means clustering algorithm is given in detail in Algorithm 1.

Algorithm 1 k-means clustering

Input: dataset, number of clusters k

Output: centroids of the k clusters, labels for the training data

The algorithm alternates between two steps. For a given cluster assignment, the total cluster variance is minimized with respect to $\{m_1, \dots, m_K\}$, yielding the means of the currently assigned clusters. Then, given the current set of means $\{m_1, \dots, m_K\}$, the variance is minimized by assigning each observation to the closest (current) cluster mean, that is,

$$C(i) = \underset{1 \le k \le K}{\arg\min}\; \lVert x_i - m_k \rVert^2 .$$

These two steps are repeated until the assignments no longer change.
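To make the procedure concrete, the following is a minimal NumPy sketch of the steps above; the function name, the random initialization, and the convergence check are illustrative choices of ours, not taken from BiMeta.

import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels) for data of shape (n, d)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: select k random points from the data as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to the closest centroid (argmin of squared distance)
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its newly formed cluster
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop when the centroids of the newly formed clusters do not change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels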

3 Metrics in binning problem

Three commonly used performance metrics, namely precision, recall, and F-measure, are used to evaluate the binning algorithm. Let m be the number of species in a metagenomic dataset, and let $A_{ij}$ be the number of reads from species j assigned to cluster i. A cluster i is taken to represent the species $j_0$ for which $A_{i j_0} = \max_j A_{ij}$. The precision and recall are defined as follows:

$$\text{precision} = \frac{\sum_i \max_j A_{ij}}{\sum_i \sum_j A_{ij}}, \qquad \text{recall} = \frac{\sum_j \max_i A_{ij}}{\sum_j \sum_i A_{ij} + \#\text{unassigned reads}} .$$

Precision shows the ratio of reads assigned to a cluster that belong to the same species, while recall presents the ratio of reads from the same species that are assigned to the same cluster.

If the number of clusters is much larger than the number of species, the majority of reads in each cluster probably belong to a single genome, so precision would be high; however, recall (sensitivity) would be low because some genomes are represented by multiple clusters. If the number of clusters is much smaller than the number of species, some clusters would contain reads from multiple genomes and precision would be low. Thus, precision increases while recall decreases with the number of predicted clusters. Besides, we also use the F-measure, which combines precision and recall:

$$\text{F-measure} = \frac{2}{\dfrac{1}{\text{precision}} + \dfrac{1}{\text{recall}}} .$$
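As a small illustration of these metrics, the sketch below computes precision, recall, and F-measure from a cluster-by-species count matrix A (A[i][j] is the number of reads from species j assigned to cluster i); the treatment of unassigned reads is our reading of the definition above and should be checked against BiMeta [12].

import numpy as np

def binning_metrics(A, n_unassigned=0):
    # A: array of shape (n_clusters, n_species)
    A = np.asarray(A, dtype=float)
    assigned = A.sum()
    # sum over clusters of the count of each cluster's dominant species
    precision = A.max(axis=1).sum() / assigned
    # sum over species of the count recovered in each species' best cluster, over all reads
    recall = A.max(axis=0).sum() / (assigned + n_unassigned)
    f_measure = 2.0 / (1.0 / precision + 1.0 / recall)
    return precision, recall, f_measure

# example: three clusters, two species
print(binning_metrics([[90, 5], [10, 80], [0, 15]]))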


2.2 Hadoop

In this section, we cover the fundamental components of Hadoop. Hadoop is open-source software (an open-source big data framework) that allows users to store and process big data in a distributed environment across a group of many computers using a simple programming model (the Map-Reduce programming model). It is designed to scale up from a single computer to thousands of computers, with each computer offering local computation and storage.

2.2.1 Hadoop Components

Figure 2.2: Hadoop Components

Basic components in Hadoop software are:

• For data storage: Hadoop Distributed File System (HDFS)

• For resource management (monitor memory, disk, cpu of a group of computers): YARN(this is an acronym of Yet Another Resource Negotiator)

• For data processing: Hadoop MapReduce (Hadoop MR)

Other big data processing frameworks that can work together with Hadoop's basic components inside the Hadoop software stack are Apache Spark, Pig, Hive, and others.

2.2.2 HDFS - Hadoop Distributed File System

The Hadoop Distributed File System is the file system of the Hadoop software that manages storage across the group of computers in a Hadoop Cluster (Figure 2.2). The Hadoop Distributed File System has all the basic operations (reading files, creating directories, moving files, deleting data, and listing directories) that a normal file system has.

Figure 2.3: A Basic Hadoop Cluster

A block (the minimum amount of data that can be read or written) in HDFS has a default size of 128 MB. Regarding blocks, HDFS has the property that "a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage" (for example, a 10 KB file stored with a block size of 128 MB uses 10 KB of disk space, not 128 MB). The block size of HDFS is large in order to reduce the cost of disk seek time, because disk seeks are generally expensive operations.

When discussing HDFS, the Master computer is referred to by another term, the Namenode; similarly, a Slave computer is referred to as a Datanode. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, because this information is reconstructed from the datanodes when the system starts.

Datanodes are the workers of the filesystem. They store and retrieve blocks when they are told to by the namenode, and they report back to the namenode periodically with lists of the blocks that they are storing.


Figure 2.4: Upload A File To HDFS

Let us take a closer look at a specific example related to metagenomics data in order to understand more deeply how a file is uploaded to the Hadoop Cluster (Figure 2.3). Assume that there is a metagenomics file called "S8.fna" and that the size of this file is 213 MB. The command hdfs dfs -put S8.fna hdfs:///user is used to put this "S8.fna" file on the Hadoop Cluster.

Looking more deeply into this command shows what is actually happening behind it (Figure 2.3).

Firstly, this command calls the inner function create() of the Distributed File System.

Secondly, the Distributed File System makes a Remote Procedure Call to the Namenode to create a new file in the filesystem's namespace, with no blocks associated with it. The Namenode performs various checks to make sure the file does not already exist and that the client has the right permissions to create the file. If these checks pass, the Namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an Input-Output Exception. The Distributed File System returns a File System Data Output Stream to start writing data to.

Thirdly, the data (specifically the S8.fna file) is now split into blocks, each at most 128 MB in size. Because the size of the S8.fna file is 213 MB, the file is split into two blocks: the first block is 128 MB and the second block is 85 MB. After the S8.fna file is split into blocks, these blocks are written to an internal queue called the data queue.

Fourthly, assuming that the replication factor for each block is 3, the first block in the data queue is streamed to the first data node; the first data node stores the first block and forwards it to the second data node in the pipeline; the second data node stores the first block and forwards it to the third data node in the pipeline; the third data node stores the first block. At this point, the first block has been replicated 3 times.

Fifthly, an acknowledgement is sent from the third data node to the second data node, then from the second data node to the first data node, and then from the first data node to the Master computer, in order to inform it that the first block has been successfully uploaded to the Hadoop Distributed File System. Steps 4 and 5 above repeat similarly for the second block.

Sixthly, after the entire file S8.fna has been uploaded to HDFS successfully, the inner function close() is called to remove all temporary data in the stream. Finally, a signal is sent to the Namenode to inform it that the file upload is complete. The block replication in step 4 above is one of the important properties of Hadoop: block replication helps Hadoop handle hardware failure. If any slave computer in the Hadoop cluster crashes, the data on that crashed computer can still be retrieved from other computers in the cluster, because that data was duplicated beforehand.
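As a concrete illustration of the upload just described, the sketch below estimates how many 128 MB blocks a local file will occupy and then runs the same hdfs dfs -put command from Python; the file name and the HDFS destination path are illustrative.

import math
import os
import subprocess

BLOCK_SIZE = 128 * 1024 * 1024          # default HDFS block size (128 MB)

local_file = "S8.fna"                    # illustrative local path
hdfs_dest = "hdfs:///user/S8.fna"        # illustrative HDFS destination

size = os.path.getsize(local_file)
n_blocks = math.ceil(size / BLOCK_SIZE)
print(f"{local_file}: {size / 2**20:.0f} MB -> {n_blocks} block(s)")
# e.g. a 213 MB file occupies 2 blocks (128 MB + 85 MB)

# equivalent to the `hdfs dfs -put S8.fna hdfs:///user` command in the text
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dest], check=True)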

Figure 2.5: Read A File From HDFS

Figure 2.5 demonstrates the operation of reading a file from HDFS. Consider again a specific example with metagenomics data in order to understand how a file is read from HDFS, assuming that the file S8.fna has already been uploaded. To read this S8.fna file, the command hdfs dfs -cat S8.fna is used. Taking a closer look at this command shows what actually happens behind it.

Step 1, the internal function open() is called on the Distributed File System.

Step 2, the Distributed File System calls the Namenode to determine the locations of the blocks of the file. For each block, the Namenode returns the address of a datanode that holds that block (the datanodes in the Hadoop Cluster are sorted according to their closeness to the computer that issued the hdfs dfs -cat S8.fna command).

Step 3, the internal function read() is called on the File System Data Input Stream.

Step 4, the File System Data Input Stream (which has the datanode addresses) connects to the closest datanode for Block 1 of the S8.fna file. The internal read() function is called repeatedly on the stream in order to stream the data back to the computer that issued the hdfs dfs -cat S8.fna command. When all the data in Block 1 has been read, the File System Data Input Stream closes the connection to that datanode.

Step 5, the File System Data Input Stream then finds the best (closest) datanode for Block 2 of the file S8.fna and connects to it. Again, the internal read() function is called repeatedly on the stream to stream the data of Block 2 back to the requesting computer. When all the data in Block 2 has been read, the connection to the datanode is closed.

Step 6, when all the data of S8.fna has been read, the internal function close() is called on the File System Data Input Stream.
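Analogously, reading the file back can be scripted around the same hdfs dfs -cat command; the HDFS path below is illustrative, and the block locations and datanode selection are handled entirely by HDFS.

import subprocess

proc = subprocess.run(
    ["hdfs", "dfs", "-cat", "hdfs:///user/S8.fna"],   # illustrative path
    capture_output=True, check=True,
)
for line in proc.stdout.decode().splitlines()[:4]:
    print(line)    # show the first few lines of the file (compare Figure 2.6)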

Hadoop MapReduce is explained in detail through a specific example below, which is also related to metagenomic data. Figure 2.6 shows the first few lines of the S8.fna file. In Figure 2.6, the orange sequence of characters is the DNA sequence (the read). The line located above the DNA sequence contains information about that sequence, and the highlighted (blue) part of this information is the Group Identification Number of the DNA sequence. DNA sequences that have the same Group Identification Number belong to the same species.

In the first phase of the metagenomic binning algorithm, this input S8.fna file is read. The pipeline of this phase is demonstrated in Figure 2.7. From the figure, it is easy to see that the output contains two arrays. One array is "reads", and each element in this array is a DNA sequence (a read) from the file S8.fna. The other array is "labels", and each of its elements corresponds to the element of "reads" at the same position; the "labels" array represents the group of species that each DNA sequence belongs to.

The Map-Reduce logical data flow for reading the file S8.fna is demonstrated in Figure 2.8. The Mapper in Hadoop follows the instructions in Program 1 and processes the input S8.fna file distributively.

Take a look back at Figure 2.5 (Read A File From HDFS): the file S8.fna is made up of two blocks (Block 1 and Block 2). Within a Hadoop Cluster, one slave computer will take care of Block 1 and another slave computer in the cluster will take care of Block 2.


Figure 2.6: The first few lines of the file S8.fna

Going back to Figure 2.8 (Map-Reduce Logical Data Flow for Reading Fasta File), the Hadoop Mapper on one slave computer within the Hadoop Cluster handles one block of data of the file S8.fna. This Hadoop Mapper processes its input data block based on the instructions of Program 1, and its output has the key-value format.

In this specific example, the key is the index number (order number) and the value is either the DNA sequence or the label. After the Hadoop Mapper has finished its work, the shuffling-sorting phase takes place in the Hadoop Cluster. During this phase, key-value pairs with the same key are grouped together; key-value pairs may be exchanged between slave computers in the cluster while this happens. The output after the shuffling-sorting phase guarantees that key-value pairs sharing the same key are grouped together, as in Figure 2.8. After the shuffling-sorting phase, the Hadoop Reducer takes that result as input and processes it based on the instructions of Program 2 in Figure 2.8. The values that have the same key are concatenated with each other, as demonstrated in Figure 2.8, in order to make sure that there is exactly one key-value pair per key in the result; that is, each key is unique in the result and has exactly one value.

The final output is then written back to the Hadoop Distributed File System so that all computers within the Hadoop Cluster can access the file. The engineers of the Apache Software Foundation (the organization behind Hadoop) designed an internal mechanism to send the result of the Map-Reduce process back to the Hadoop Distributed File System; the Hadoop framework therefore handles this step itself, so the programmer writing the Map-Reduce functions does not need to handle it in the code.

The basic terminal command to submit the job to the Hadoop Cluster is as follows:

hadoop jar $HADOOP_HOME/share/.../hadoop-streaming-*.jar \
    -input "S8.fna" \
    -output "outputDirectory" \
    -mapper "Program 1" \
    -reducer "Program 2"

Figure 2.7: The pipeline of the phase Reading Fasta File

Observing Figure 2.8 (Map-Reduce Logical Data Flow for Reading Fasta File), the result file has the key-value format. In the later steps of the BiMeta algorithm, in order to obtain the array "reads" and the array "labels", the instructions should look somewhat like the following:

import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the data according to the tab (or space) character
    key, value = line.split("\t", 1)

YARN is responsible for managing the resources (memory, CPU, task scheduling, and so on) of the Hadoop Cluster. It is an acronym for Yet Another Resource Negotiator. YARN provides its core services via two types of long-running daemons: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers running on all the nodes in the cluster to launch and monitor containers.

A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on). Therefore, the computer playing the role of the Master in the Hadoop Cluster runs the resource manager, and the computers playing the role of the Slaves run the node managers.

The operation of YARN is illustrated in detail in Figure 2.9. In the first move, an initialization signal is sent to the resource manager in order to run the job. In the second phase, the resource manager finds a node manager that is available to launch the application in a container. What the application process does once it is running in the container depends on the application (in fact, it depends on the instructions in Program 1 and Program 2 in Figure 2.8 above).

The application process could simply run a computation in the container it is running in and return the result to the Hadoop Distributed File System. In case the application needs more resources, for instance memory or processors, the container sends a signal back to the resource manager to request more containers on other slave computers within the Hadoop Cluster (step 3). If there is an available slave computer in the cluster, the container running the current application process sends a request to the node manager of that slave computer to start a new container there.

Finally, the node manager launches the new container to run the distributed computation. This is how slave computers in a Hadoop Cluster "help" each other when executing a job that requires more resources (memory, processors, and so on) than a single slave computer in the cluster can handle. Figure 2.9 demonstrates the general operation of YARN.


Figure 2.9: Hadoop YARN - How YARN manages a running job
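As a small aside on inspecting what YARN is doing, the ResourceManager also exposes a REST API; the sketch below lists running applications, where the host name and the default web port 8088 are assumptions that depend on the cluster configuration.

import json
import urllib.request

# ResourceManager web address (default web UI/REST port 8088); adjust for your cluster
RM_URL = "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING"

with urllib.request.urlopen(RM_URL) as resp:
    apps = json.load(resp).get("apps") or {}

for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"], app.get("allocatedMB"))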


2.3 Spark

2.3.1 What is Spark?

Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it easy to start with and to scale up to big data processing at incredibly large scale. Figure 2.10 illustrates all the components and libraries Spark offers to end users. These blocks are the foundation of Apache Spark's ecosystem of tools and libraries. Spark's libraries support a variety of different tasks, from graph analysis and machine learning to streaming and integration with a host of computing and storage systems.

Figure 2.10: Spark’s toolkit

In this section, we want to briefly cover the philosophy behind Spark and the development context.

Apache Spark’s Philosophy

Let us break down the description of Apache Spark, a unified computing engine and set of libraries for big data, into its key components:

Unified

The goal of Spark is to provide a unified platform for building big data applications. What does unified mean? Spark's unified nature makes data analytics tasks, which involve many different processing types and libraries, easier and more efficient to write. Spark provides consistent, composable APIs that can be used to build an application out of smaller pieces or out of existing libraries. Spark's APIs are also designed to enable high performance by optimizing across the different libraries and functions composed together in a user program.

Computing engine

Spark handles loading data from storage systems and performing computation on it, rather than treating permanent storage as the end in itself. Spark's focus on computation makes it different from earlier big data software platforms such as Apache Hadoop.

Libraries

Spark's final component is its libraries, which build on its design as a unified engine to provide a unified API for common data analysis tasks. Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX).

Context: The Big Data Problem

Computers became faster every year because processor speeds kept increasing; in other words, each new processor could run more instructions per second than the previous generation. Unfortunately, this trend of improving processor speeds stopped around 2005 due to hard limits in heat dissipation. As a result, applications needed to be modified to add parallelism in order to run faster, which set the stage for new programming models such as Apache Spark.

In a world where collecting data is inexpensive but processing it requires large, distributed, parallel computation, new programming models are needed. It is this world that Apache Spark was built for.

2.3.2 Introduction to Spark

This section presents an introduction to Spark, in which we walk through the core architecture of a cluster, a Spark Application, and Spark's structured APIs using DataFrames and SQL.

As many computer users likely experience at some point, there are some things that your machine is not powerful enough to perform, and data processing is one of the challenging areas. Single machines do not have enough power and resources to perform computations on huge amounts of information. A cluster, or group, of computers can pool its resources to overcome this challenge, but that alone is not enough: it also needs a framework to coordinate work across the machines. Spark can manage and coordinate the execution of tasks on data across a cluster of computers.

The cluster of machines that Spark uses to execute tasks is managed by a cluster manager such as Spark's standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which allocate resources to our application so that we can complete our work.

Spark Applications

Spark Applications consist of a driver process and a set of executor processes.

The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things:

• maintaining information about the Spark Application

• responding to a user’s program or input

• analyzing, distributing, and scheduling work across the executors (discussed momentarily)


The executors are responsible for actually carrying out the work that the driver assigns to them. This means that each executor is responsible for only two things:

• executing code assigned to it by the driver

• reporting the state of the computation on that executor back to the driver node

Figure 2.11 demonstrates how the cluster manager controls physical machines and allocates resources to Spark Applications. The cluster manager can be one of three core options: Spark's standalone cluster manager, YARN, or Mesos. This means that multiple Spark Applications can run on a cluster at the same time.

Figure 2.11: The architecture of a Spark Application

We control the Spark Application through a driver process called the SparkSession. The SparkSession is the entry point for executing user-defined manipulations across the cluster, and the correspondence between a SparkSession and a Spark Application is one-to-one.

A DataFrame is the most common structured API; it represents a tabular dataset. The list that defines the columns and the types within those columns is called the schema. In Spark, a DataFrame can span thousands of computers, either because the data is too large to store on one machine or because the computation would take too long on a single machine.

Partitions

To make every executor run in parallel, Spark breaks up data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame's partitions represent how the data is physically distributed across the cluster of machines during execution.
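A minimal PySpark sketch of these concepts (a SparkSession, a DataFrame with a schema, and its partitions); the application name, data, and column names are illustrative.

from pyspark.sql import SparkSession

# one SparkSession per Spark Application; it is the entry point that sends
# user-defined manipulations to the executors
spark = SparkSession.builder.appName("bimeta-example").getOrCreate()

# a DataFrame with an explicit schema (column names; types are inferred here)
df = spark.createDataFrame(
    [(0, "ACGTACGT"), (1, "TTGCAAGC")],
    schema=["read_id", "sequence"],
)

# partitions control how the rows are physically spread over the cluster
print("partitions:", df.rdd.getNumPartitions())
df = df.repartition(4)

spark.stop()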


2.3.2.4 Transformations

In Spark, the core data structures are immutable, meaning they cannot be changed after they are created. To "change" a DataFrame, Spark needs to be instructed how to modify it to do what you want. These instructions are called transformations. There are two types of transformations: narrow dependencies and wide dependencies.

A narrow dependency (or narrow transformation) is one for which each input partition contributes to only one output partition.

Figure 2.12: A narrow dependency

A wide dependency (or wide transformation) has input partitions contributing to many output partitions.

Figure 2.13: A wide dependency


2.3.2.5 Actions

Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations.
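Continuing the sketch above, the snippet below shows a narrow transformation (a filter), a wide transformation (a groupBy, which requires a shuffle), and an action (collect); nothing is actually computed until the action runs. The data and column names are again illustrative.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transformations-example").getOrCreate()

df = spark.createDataFrame(
    [(0, "ACGT", 3), (1, "TTGC", 3), (2, "ACGT", 5)],
    schema=["read_id", "kmer", "count"],
)

# narrow transformation: each input partition contributes to one output partition
frequent = df.filter(F.col("count") >= 3)

# wide transformation: a shuffle moves rows so that equal keys end up together
per_kmer = frequent.groupBy("kmer").sum("count")

# action: only now does Spark actually execute the plan built above
print(per_kmer.collect())

spark.stop()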

2.3.3 How does Spark run on a cluster?

In the previous section, we covered some basic information about Spark's architecture. In this section, we go into more detail about the architecture of a Spark application and the life cycle of an application, both inside and outside Spark.

When we start a Spark Application, we request resources from the cluster manager. Depending on our application configuration, this can include a place to run the Spark driver as well as resources for the executors. During execution, the cluster manager manages the machines that the application is running on.

Figure 2.14 shows a basic cluster setup. The machine on the left is the Cluster Manager Driver Node. The circles represent daemon processes running on and managing each worker node. This setup shows only the cluster manager's processes; no Spark application is running yet.

Figure 2.14: A cluster driver and worker (no Spark Application yet)

The first choice you will need to make when running your application is the execution mode.

Execution Modes

There are three modes: cluster, client, and local mode. The execution mode determines where your application's resources are physically located when you run it.


1 Cluster mode

Cluster mode is probably the most common way of running Spark Applications. In cluster mode, a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means that the cluster manager is responsible for maintaining all Spark Application-related processes.

Figure 2.15 shows that the cluster manager placed the driver on a worker node and the executors on other worker nodes. Rectangles with solid borders illustrate the Spark driver process; rectangles with dotted borders represent the executor processes.

Figure 2.15: Spark’s Cluster mode

2 Client mode

Client mode differs from cluster mode only in that the Spark driver remains on the client machine. In other words, the client machine maintains the driver process, and the cluster manager maintains the executor processes.

Figure 2.16 shows that the Spark application runs from a machine that is not part of the cluster, while the workers are located on machines in the cluster.

3 Local mode

Local mode runs the entire application on a single machine. This mode is a common choice for learning Spark or testing your applications, as sketched below.
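As a short sketch of how the execution mode is usually selected: local mode can be requested directly when building the SparkSession, while cluster and client mode are normally chosen at submission time with spark-submit (shown only as comments below); master URLs and file names are illustrative.

from pyspark.sql import SparkSession

# local mode: driver and executors all run inside a single JVM on this machine
spark = (
    SparkSession.builder
    .appName("local-mode-example")
    .master("local[*]")          # use all local cores
    .getOrCreate()
)
print(spark.range(10).count())
spark.stop()

# cluster or client mode is usually selected at submission time, e.g. (illustrative):
#   spark-submit --master yarn --deploy-mode cluster my_app.py
#   spark-submit --master yarn --deploy-mode client  my_app.py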

Overall, the life cycle of a Spark Application includes four steps: client request, launch, execution, and completion. In this section, we assume that a cluster is already running with four nodes: a cluster manager driver and three worker nodes.

This section also makes use of illustrations and follows the same notation introduced previously. Additionally, we now introduce lines that represent network communication. Red


References

[1] Illyoung Choi et al. "Libra: scalable k-mer–based tool for massive all-vs-all metagenome comparisons". In: GigaScience 8.2 (Dec. 2018), giy165. ISSN: 2047-217X. DOI: 10.1093/gigascience/giy165.

[2] Supratim Choudhuri. "Bioinformatics for Beginners". Ed. by Supratim Choudhuri. Oxford: Academic Press, 2014, pp. 209–218. ISBN: 978-0-12-410471-6. DOI: 10.1016/B978-0-12-410471-6.00009-8. URL: http://www.sciencedirect.com/science/article/pii/B9780124104716000098.

[3] Thomas M. J. Fruchterman and Edward M. Reingold. "Graph Drawing by Force-Directed Placement". In: Software: Practice and Experience 21.11 (Nov. 1991), pp. 1129–1164. ISSN: 0038-0644. DOI: 10.1002/spe.4380211102.

[4] Arpita Ghosh, Aditya Mehta, and Asif M. Khan. "Metagenomic Analysis and its Applications". In: Encyclopedia of Bioinformatics and Computational Biology. Ed. by Shoba Ranganathan et al. Oxford: Academic Press, 2019, pp. 184–193. ISBN: 978-0-12-811432-2. DOI: 10.1016/B978-0-12-809633-8.20178-7.

[5] Samuele Girotto, Cinzia Pizzi, and Matteo Comin. "MetaProb: accurate metagenomic reads binning based on probabilistic sequence signatures". In: Bioinformatics (2016). DOI: 10.1093/bioinformatics/btw466.

[7] T. Kamada and S. Kawai. "An Algorithm for Drawing General Undirected Graphs". In: Information Processing Letters 31.1 (Apr. 1989), pp. 7–15. ISSN: 0020-0190. DOI: 10.1016/0020-0190(89)90102-6.

[8] National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications. "The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet". Washington (DC): National Academies Press (US), 2007. URL: https://www.ncbi.nlm.nih.gov/books/NBK54011/.

[9] Rachid Ounit et al. "CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers". In: BMC Genomics (Jan. 2015). DOI: 10.1186/s12864-015-1419-2.

[10] Nicola Segata et al. "Metagenomic microbial community profiling using unique clade-specific marker genes". In: Nature Methods (2012). DOI: 10.1038/nmeth.2066.

[11] J. Shendure and H. Ji. "Next-generation DNA sequencing". In: Nature Biotechnology (Oct. 2008), pp. 1135–1145. DOI: 10.1038/nbt1486.

[12] Le Van Vinh et al. "A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads". In: Algorithms for Molecular Biology (2015). DOI: 10.1186/s13015-014-0030-4.

[13] Yi Wang et al. "MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample". In: Bioinformatics (2012). DOI: 10.1093/bioinformatics/bts397.

[14] D. E. Wood and S. L. Salzberg. "Kraken: ultrafast metagenomic sequence classification using exact alignments". In: Genome Biology (Nov. 2013). DOI: 10.1186/gb-2014-15-3-r46.

[15] Yu-Wei Wu and Yuzhen Ye. "A novel abundance-based algorithm for binning metagenomic sequences using l-tuples". In: Journal of Computational Biology (2011). DOI: 10.1089/cmb.2010.0245.
