FAULT MANAGEMENT IN INTER-CLOUD SYSTEM

In Partial Fulfillment of the Requirements of the Degree of
MASTER OF INFORMATION TECHNOLOGY MANAGEMENT
In Information Management

By Mr. Long Ngoc Hoang
ID: MITM03006

International University - Vietnam National University HCMC

May 2015
Acknowledgments

This thesis concludes my degree of Master of Information Technology Management, and is submitted to the School of Computer Science and Engineering at the International University, Vietnam National University - Ho Chi Minh City.

I would like to show my greatest gratitude to Dr. Ha Manh Tran for his guidance and helpful advice. His skilful and valuable comments and feedback helped me get back on the thesis work whenever I lost my focus on the thesis objectives due to the nature of my business.

Last but not least, I would like to thank my family for unconditional support and encouragement, and thank my friends for valuable feedback.
Plagiarism Statements

I would like to declare that, apart from the acknowledged references, this thesis either does not use language, ideas, or other original material from anyone, or has not been previously submitted to any other educational and research programs or institutions. I fully understand that any writings in this thesis contradicting the above statement will automatically lead to the rejection from the Master of Information Technology program at the International University - Vietnam National University Ho Chi Minh City.
Copyright Statement

This copy of the thesis has been supplied on condition that anyone who consults it is understood to recognize that its copyright rests with its author and that no quotation from the thesis and no information derived from it may be published without the author's prior consent.

© Long Ngoc Hoang - MITM03006 - 2015
Table of Contents

2.1 Fault
2.2 Event Correlation
2.2.1 Event Correlation Techniques
2.2.2 Existing Open Source Event Correlation Software
2.3 Related Cloud and Fault Management Software
2.4 Machine Learning
2.4.1 Feature Extraction from Logs
2.5 OpenStack
2.6 Hadoop
3 Proposal
4 Experiment
4.1 Use existing open source tools for monitoring and correlating logs
4.1.1 Setup OpenStack
4.1.2 Setup Ganglia
4.1.3 Setup Hadoop and OpenStack on Windows Azure
4.1.4 OpenStack Log Collection and Processing
4.1.5 OpenStack Database Tables
A Setup and Configuration
A.1 Setup OpenStack with Fuel
A.2 Setup and Configure Ganglia
A.3 Logstash Configuration
List of Figures

2-1 A taxonomy of faults [1]
2-2 A taxonomy for online failure prediction approaches [2]
2-3 Fault management on inter-cloud environment
2-4 OpenStack conceptual architecture [3]
2-5 OpenStack logical architecture [3]
2-6 Devstack's localrc for controller node (192.168.1.5)
2-7 Devstack's localrc for compute node (192.168.1.6)
2-8 nova-manage service list
2-9 Launch an instance from Horizon dashboard
2-10 MapReduce workflow
2-11 Hadoop ecosystem
2-12 HCatalog - table list
2-13 HCatalog - batting_data table
2-14 HCatalog - master_data table
2-15 Hive query
2-16 Hive query result
2-17 Hive query log
2-18 Pig query
2-19 Pig query result
2-20 Pig query log
3-1 Fault analyzer in the fault resolution system
3-2 Log management model
3-3 OpenStack Log Analysis Block Diagram [4]
3-4 Monitoring and Alerting for OpenStack [5]
4-1 Critical issue from cinder-scheduler service
4-2 Error from nova-compute service
4-3 OpenStack Nova log files on controller node
4-4 Error from savanna-api log
4-5 Node Overview
4-6 Summary Node Metric Last Hour
4-7 CPU Metrics
4-8 Disk Metrics
4-9 Load Metrics
4-10 Memory Metrics
4-11 Network and Process Metrics
4-12 Ganglia metrics on Graphite
4-13 Windows Azure Virtual Network for Hadoop and OpenStack clusters
4-14 Hadoop cluster on Windows Azure
4-15 OpenStack Juno on Windows Azure
4-16 Logstash Histogram
4-17 OpenStack Log Type Summary
4-18 Query and filter OpenStack Logs
4-19 OpenStack Nova log
4-20 Error status of an instance on OpenStack Dashboard
4-21 Information from nova.instance_faults and nova.instances tables
4-22 Exception details from nova.instance_faults table
5-1 Log Analysis Workflow
A-1 Fuel Server
A-2 Fuel UI
A-3 Successful Havana Deployment on Fuel
A-4 OpenStack Havana Services
List of Tables

2.1 Advantages and drawbacks of the presented event correlation approaches
2.2 OpenStack services
2.3 OpenStack Log Location
4.1 OpenStack Cinder Log Files
4.2 OpenStack Nova Log Files
4.3 OpenStack Horizon Log Files
4.4 OpenStack Keystone Log Files
4.5 OpenStack Glance Log Files
4.6 OpenStack Ceilometer Log Files
4.7 OpenStack Heat Log Files
4.8 OpenStack Savanna Log Files
Abstract

Nowadays, managing applications in an inter-cloud environment, especially monitoring faults, becomes challenging due to the increasing complexity and diversity of these systems. The inter-cloud environment, fostering the centralization of various services, needs a large number of system administrators and supporting systems to manage faults occurring in the inter-cloud systems and services. It is necessary to develop a supporting system that can manage and analyse faults.

This master thesis deals with the topic of fault management on inter-cloud systems. The thesis research investigates multiple studies of faults, techniques, and related fault management software. We set up an inter-cloud environment and propose various approaches for monitoring and analysing faults on inter-cloud systems. In particular, we study OpenStack, Hadoop components and their ecosystems to understand the complexity of the inter-cloud environment. We deploy and integrate several open source tools for monitoring and analysing faults in the inter-cloud environment.

Keywords: Fault Management, Inter-Cloud, Cloud Computing, OpenStack, Hadoop, Event Correlation
Chapter 1

Introduction
Communication networks and distributed systems today have become larger and more complex to adapt to the increasing demands of users. Managing services operating on these systems is even more challenging. Cloud computing has recently emerged as a new paradigm of provisioning infrastructure, platform, and software as services over the Internet. This paradigm combines distributed computing resources and virtualization technologies that outsource not only platform and software but also infrastructure to meet the demands of users. Cloud computing is attractive to business owners and scientists as it allows them to deploy many types of workloads on demand easily. As we live in the data age, cloud computing is also an enabler for big data processing. In the last few years, there has been a significant increase in the number of commercial and open source cloud platforms, for example, Amazon EC2 [6], Google App Engine [7], Microsoft Windows Azure [8], OpenStack [9], Eucalyptus [10], Nimbus [11], and OpenNebula [12].
In a similar context, Hadoop [13] - an Apache open source project and widely adopted MapReduce implementation - has evolved rapidly into a major technology movement. It has emerged as the best way to handle massive amounts of data, including not only structured data but also complex, unstructured data. The Apache Hadoop open-source software supports reliable, scalable, distributed computing for large datasets on clusters of computers using programming models. The software features the capability of scaling up from one to thousands of computers, detecting and handling failures at the application layer. Hadoop systems also support distributed computing services for processing large datasets on clusters of workstations. The Hadoop system, associated with the MapReduce programming methodology, has been applied to multiple application domains related to large data processing, such as indexing a large number of web pages, doing financial risk analysis, and studying customer behavior.
From the varieties of cloud and big data providers, consumers may have a lot of workloads running across their inter-cloud environment. Managing applications in an inter-cloud environment, especially monitoring faults, becomes challenging due to the increasing complexity and diversity of these systems. As a result, an inter-cloud environment fostering the centralization of various services needs a large number of system administrators and supporting systems to manage faults occurring in the inter-cloud systems and services. It is necessary to develop a supporting system that can manage and analyse faults.
In this thesis, we propose an approach for monitoring and analysing faults in the inter-cloud environment. The approach recruits open source technologies to facilitate monitoring and correlating service logs among cloud systems. The contribution is thus twofold:

1. Studying faults and existing techniques and tools of fault management on cloud systems. We also study OpenStack, Hadoop components and their ecosystems to understand the complexity of the inter-cloud environment.

2. Deploying and integrating several open source tools for monitoring and analysing faults. In particular, we collect and process service logs in the inter-cloud environment, including OpenStack and Hadoop components.
The rest of the thesis is structured as follows: the next chapter presents the literature review of faults and a survey of tools and techniques of fault management on single-cloud and inter-cloud environments. Chapter 3 furnishes the proposal of the thesis research; we propose approaches for monitoring and analysing faults in the inter-cloud environment with the system architecture and component communication. Chapter 4 provides experiments for monitoring and analysing faults on the inter-cloud systems. Chapter 5 concludes this thesis with a short discussion of the ongoing work. Last but not least, Appendix A provides the details of setup and configuration that have been used in the thesis.
Chapter 2

Literature Review
2.1 Fault

Faults in cloud computing and Hadoop have attracted several research activities. Avizienis et al. [1] have presented basic concepts and definitions associated with system dependability. According to the study, a failure or service failure is observed as a deviation from the correct state of the system. An error is the part of the total state of the system that may lead to service failure. The root cause of an error is a fault. All faults that may affect a system during its life are classified according to eight basic viewpoints, as shown in Figure 2-1. Thus, there would be 256 different combined fault classes if all combinations of the eight elementary fault classes were possible.
Figure 2-1: A taxonomy of faults [1]
According to this study [1], major techniques for handling faults can also be grouped into:

∙ fault prevention: means to prevent the occurrence or introduction of faults;

∙ fault tolerance: means to avoid service failures in the presence of faults;

∙ fault removal: means to reduce the number and severity of faults;

∙ fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
A topical survey of Salfner et al. [2] presents a variety of online failure prediction methods. A taxonomy has been developed for failure prediction which is based on runtime monitoring and a variety of models and methods that use the current state of a system and the past experience. As shown in Figure 2-2, the full taxonomy is split vertically into four major branches of the type of input data used, namely data from failure tracking, symptom monitoring, detected error reporting, and undetected error auditing. Each major branch is further divided vertically into principal approaches. Each principal approach is then horizontally divided into categories grouping the surveyed methods.
Figure 2-2: A taxonomy for online failure prediction approaches [2]
The study of Armbrust et al. has emphasized 10 obstacles for cloud computing [14] [15]. Several obstacles are related to fault management, such as service availability, performance unpredictability, and bugs in large-scale distributed systems.

The authors of the study [16] have proposed an approach for realizing generic fault tolerance mechanisms as independent modules, validating fault tolerance properties of each mechanism, and matching users' requirements with available fault tolerance modules to obtain a comprehensive solution with desired properties.
The study of Dudko et al. [17] has described the shortcomings of failure monitoring and prediction models in the context of Hadoop, and proposes a novel approach to predict performance failures in Hadoop clusters. The approach exploits the high-level data provided by Hadoop logs and the low-level data provided by hardware logs to predict failures more accurately.
Thanamani [18] has studied various failure prediction methods in distributed systems. The study focuses on the information of hardware errors from individual nodes to predict failures for high performance computing clusters, Linux computing clusters or single servers. Faults can be observed at three stages: monitoring of symptoms, detection of errors, or observation of failures.
The authors of the study [19] have proposed an auto-recovery system for the job tracker that can be applied to Hadoop applications. The auto-recovery mechanism is based on a checkpoint method, in which snapshots of the job tracker are stored on a distributed file system periodically and used later when the system detects any failure. The advantage of this system is the capability of continuing job execution during the recovery phase.
The study of Garduno et al. [20] has proposed the Theia visualization tool that analyzes application-level logs in a Hadoop cluster and generates visual signatures of job performance. Theia recruits heuristics to identify visual signatures of problems that allow users to distinguish application-level problems, e.g., software bugs or workload imbalance, from infrastructural problems, e.g., contention problems or hardware problems.
Tan et al. [21] have described a non-intrusive log-analysis approach that traces the causality of execution in a MapReduce system based on its behavior. The approach also proposes a novel way to characterize the behavior by decomposing performance along the dimensions of time, space, and volume. The approach can be useful for system administrators to debug performance problems in MapReduce systems.
The study of Xiaoen et al. [22] uses a fault-injection tool to identify the ability of OpenStack to maintain correct functionality under faults. The study focuses on the fault resilience during the processing of VM creation or deletion with three fault types: server crashes, transient server non-responsiveness, and network partitions. These studies deal with the same challenge: fault monitoring and diagnosis, or failure prediction, on a single cloud or Hadoop cluster environment.
Bahman et al. [23] [24] have created the Failure Trace Archive (FTA) - an online, public repository of failure traces collected from many distributed systems. In this work, the authors have designed the trace archive, the FTA dataset format, and a toolbox for analyzing and modeling the FTA traces. The analysis of failures in nine distributed systems is performed with basic statistics and probability distributions. According to the analysis, the Weibull, the Lognormal, and the Gamma distributions are often the best candidates for availability and unavailability distributions.
The authors of the studies [25], [26] and [27] have recruited the FTA traces to present the time and space correlation model of failure events. Several new traces have also been added to the FTA by these studies. However, since most FTA traces are from Grid systems, virtualization, which is another source of failure in Clouds, has not yet been considered by these FTA studies.
Soila et al. [28] have studied 10 months of MapReduce logs from the M45 dataset, which Yahoo made freely available to selected universities. The study recruits the maximum likelihood estimation method and an instance-based learning technique to describe the failure characterization and predict job completion time.
The study of Gang et al. [29] has proposed a MapReduce-based Apriori algorithm to mine event association rules in distributed system monitoring.
The authors of the study [30] have recruited a text analytics-based approach to examine over 9,575 message threads appearing in the forum of a large IaaS provider over a 3-year period. The goal is to understand properties of cloud support models in resolving problems. The study of Lee et al. [31] has presented Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format.
2.2 Event Correlation

2.2.1 Event Correlation Techniques
Event correlation is a way to gain higher-level knowledge from the information in the events. It is a conceptual interpretation procedure where new meaning is assigned to a set of events that happen within a predefined time interval. Nowadays, there are a lot of different correlation techniques, and combinations of those approaches. Obviously, trying to argue that one approach is generally better than another one is futile, as this depends highly on the problem we have. The study of [32] has presented some existing correlation techniques which can be applied in system log and network event analysis.
∙ Finite State Machine Based (FSM): The FSM approach to event correlation is introduced in the study of [33]. The authors argue that the fault identification process can be split into two steps, fault detection (i.e. noticing that there is a problem) and fault localization (i.e. finding out what the problem is), and that an FSM based correlation engine can help in the first step by modelling the monitored system. The model proposed in the study of [33] is an FSM based on the observable events generated by the monitored process (which is assumed to be an FSM as well). If an event arrives which leads to an invalid state, an error is reported.
∙ Rule Based Event Correlation: One of the earliest approaches to event correlation is Rule-based Reasoning (RBR). As explained in the study [34], a rule based event correlation engine is organized in three levels:

- Data level: working memory or global database, which contains information about the problem at hand.

- Knowledge level: a knowledge base (rule repository), which contains domain-specific expert knowledge.

- Control level: an inference engine, which determines how to apply the rules from the knowledge base to solve a given problem.

A benefit of rule-based systems, especially when used with simple if-then style languages, is the similarity to natural language. A statement such as "if event user-login-failed occurs 10 times within 5 minutes, then send an email to the operator" is perfectly understandable even to someone without computer programming experience (a minimal sketch of such a rule is given after this list). This also makes the decisions of a rule-based correlation engine comparatively easy to reproduce.

Unfortunately, rule-based approaches also have several drawbacks. Traditionally, the rule repository relies on the knowledge of an expert, who has domain-specific experience with the problems that are to be solved by the system, and a knowledge engineer, who knows how to represent that knowledge in the rule system [34]. Even if the rule creation is simple enough to be done directly by the expert, the knowledge still has to be entered into the system manually, which is time-consuming. While a lot of initial work may be acceptable, frequent changes in the network also make tedious maintenance of the rule repository necessary, which is counterproductive, as one of the stated goals for the correlation engine is to lessen the workload of operators. Another problem is the inability of rule-based systems to automatically learn from experience, meaning that the same calculations have to be made over and over again whenever the same set of events occurs.
∙ Case Based Reasoning: In Case-based Reasoning (CBR), each problem and the corresponding solution is considered as a case. The approach of CBR to solve a given problem is to find past problems from a case library that are similar to the problem at hand, and to try to apply a similar solution. In the end, the gained experience is stored as a new case in the library.
∙ Model Based Reasoning: The basic idea of Model-based Reasoning (MBR) is to represent the structure and the behaviour of the system under observation in a model, to allow reasoning about fault causes. The task therefore is "a process of reasoning from behaviour to structure, or more precisely, from misbehaviour to structural defect" [35]. As explained in the study of [35], this requires

- a description of the structure,

- a description of the behaviour,

- and a set of guidelines to investigate misbehaviour based on these two descriptions.
∙ Codebook Based Event Correlation: The codebook based event correlation approach is explained in the study of [36]. The authors propose the use of coding techniques. To allow the localization of problems, the dependencies between observable symptoms and underlying problems are examined and a suitable subset of the symptom events is selected (the codebook). The codebook must be sufficiently large to identify the problems (a codebook that is too large provides unnecessary redundancy, but a small codebook may omit information needed to distinguish between problems).
∙ Voting Approaches: Correlation by voting can be used to localize a fault. Usually, the votes (expressed by events from different nodes) cannot give exact information about the location of a fault, but they can indicate a direction. As pointed out in the study of [37], in this case it is necessary that the correlation engine knows the topology of the managed network, such that the correlation engine can calculate the number of votes for each element.
∙ Explicit Fault-localization: The study of [38] proposes to include the information about all possible fault locations with each alarm. As the authors explain, the process of fault localization is then simple: in the case that alarms are reliable and there is only a single fault in the network, fault localization is straightforward: the fault lies in the intersection of the sets of locations indicated by each alarm. Thus, intuitively, alarms that share a common intersection should be correlated.
∙ Dependency Graphs: The study of [39] examines the use of dependency graphs for event correlation. A dependency graph is a directed graph which models dependencies between the managed objects. In the case of a network, the nodes represent the network elements (e.g. hosts), and an edge from node A to node B indicates that failures in node A can cause failures in node B.
∙ Bayesian Network Based Event Correlation: A Bayesian network (sometimes also called a belief network) is a directed acyclic graph which models the probabilistic relations between network elements, represented by random variables. A more thorough explanation can be found in the study of [40].
∙ Neural Network Approaches: The idea behind Artificial Neural Networks (ANNs) is to reproduce the function of a human brain in an artificial model. As the human brain is particularly efficient at pattern recognition, the use of ANNs for tasks such as speech and image recognition, or event correlation, suggests itself.
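To make the if-then rule style concrete, the following Python fragment is a minimal, illustrative sketch of the threshold rule quoted in the rule-based approach above. The event name user-login-failed and the notify() action come from that example only and are purely hypothetical; this is not the implementation of any particular correlation engine.

```python
import time
from collections import defaultdict, deque

WINDOW = 5 * 60      # 5-minute sliding time window (seconds)
THRESHOLD = 10       # fire the rule after 10 matching events in the window

recent = defaultdict(deque)   # event name -> timestamps of recent occurrences

def notify(message):
    # Placeholder action; a real engine might send an email or open a ticket.
    print("ALERT:", message)

def on_event(timestamp, name):
    """Apply the single rule 'N occurrences of an event within WINDOW seconds'."""
    q = recent[name]
    q.append(timestamp)
    # Drop occurrences that have fallen out of the sliding window.
    while q and timestamp - q[0] > WINDOW:
        q.popleft()
    if name == "user-login-failed" and len(q) >= THRESHOLD:
        notify(f"{name} occurred {len(q)} times within {WINDOW} seconds")
        q.clear()   # reset so the rule does not fire again on every following event

# Example: replay a synthetic burst of failed-login events.
now = time.time()
for i in range(12):
    on_event(now + i, "user-login-failed")
```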
Table 2.1 lists some of the strengths and weaknesses of the different event correlation approaches.
Table 2.1: Advantages and drawbacks of the presented event correlation approaches

1. FSM - Pros: Simple, good as a basic model, easy to understand. Cons: Too simple for practical applications, no tolerance to noise.
2. RBR - Pros: Transparent behaviour, close to natural language, modularity. Cons: Time-consuming maintenance, not robust, does not learn from experience.
3. CBR - Pros: Automatic learning from experience, reasoning from past experience is natural, can be combined with ticketing systems. Cons: Automatic solution adaptation and reuse is difficult.
4. MBR - Pros: Relies on deep knowledge. Cons: Description of behaviour and structure may be difficult in practice.
5. Codebook - Pros: Fast, robust, adapts to topology changes. Cons: Reproducing the behaviour manually is tedious; no notion of time.
6. Voting - Pros: Great for use in a distributed fashion. Cons: Requires knowledge about topology.
7. Explicit fault localization - Pros: More efficient and extendable than the rule-based approach. Cons: Depends heavily on a-priori information.
8. Dependency graphs - Pros: Good for dealing with dynamic, complex managed systems. Cons: Assumption that there is only one problem at a time.
9. Bayesian networks - Pros: Good theoretical foundation. Cons: Probabilistic inference is NP-hard.
10. ANN - Pros: Powerful for problems that are suitable to be solved by the human brain. Cons: Behaviour difficult to understand; requires a lot of processing power.
2.2.2 Existing Open Source Event Correlation Software

In this section, a selection of open source event correlation applications is presented.
∙ Swatch [41] is a rule based log monitoring system, which can be configured with simple rules. It is written in Perl and licensed under the General Public License (GPL). Each rule contains a regular expression pattern to either ignore a matching log message, or take a specified action, like printing the message on the screen, sending an email, or executing an external program. Although the authors do not describe Swatch as event correlation software, Swatch also supports simple event correlation operations, such as the specification of a rate threshold, or of a time window for rules. More information about Swatch can be found in [42], as well as in its manual page.
∙ LogSurfer [43] is a log monitoring tool based on Swatch, but written in C (which makes it more suitable for large volumes of messages). LogSurfer operates the same way as Swatch, by matching log lines with regular expressions and executing corresponding actions, but introduces some new features. An interesting possibility is the dynamic creation (and deletion) of rules, which allows, for instance, the grouping (aggregation) of log messages (something which is not possible with Swatch).
∙ SEC [44] (Simple Event Correlator) is an event correlation tool written in Perl. Similar to Swatch and LogSurfer, SEC allows the specification of rules to match line-based input events (such as log messages) and execute corresponding actions. Besides regular expressions, custom Perl functions can also be used to match the input lines, or to evaluate conditions. An action can be the creation of a log message, writing the event to a file, executing an external program, etc. Additionally, SEC also allows the creation of synthetic events and dynamic contexts (internal state). A context can be used as an additional condition for rules. Together with the basic correlation operations provided by SEC, this allows the detection of composite events.
∙ OSSEC [45] is an open source Host-based Intrusion Detection System (HIDS), consisting of a core application, an agent for Windows systems, and a web based UI. According to the OSSEC website, the key features are file integrity checking, log monitoring, rootkit detection and active response. OSSEC supports a large number of operating systems and can analyze logs from various devices and applications, such as Cisco routers, Microsoft Exchange servers, OpenSSH or NMAP3. Among other options, possibilities for output include logging to syslog, storing events in a database, sending email, generating reports, and of course access via the web UI.
∙ OpenNMS is an open source network monitoring platform written in Java. According to the OpenNMS website, the main focuses of OpenNMS are service polling, data collection, and event and notification management.
∙ Esper is an open source component for building real-time ESP (Event Stream Processing) and CEP (Complex Event Processing) applications in Java (additionally, NEsper, written in C#, can be used with .NET). Although Esper is not primarily targeted at network event correlation, it is a CEP and ESP toolkit certainly worth mentioning.
2.3 Related Cloud and Fault Management Software

Figure 2-3 represents where the cloud and fault management software are located in the inter-cloud environment.

Figure 2-3: Fault management on inter-cloud environment
Ganglia [46] [47] is a scalable distributed monitoring system for high performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state.
Nagios [48] is an open source monitoring system. While Ganglia is more focused on gathering and tracking metrics, Nagios is more of an alerting mechanism.

Collectd [49] is a daemon which collects system performance statistics periodically and provides mechanisms to store the values in a variety of ways, for example in Round Robin Database (RRD) files.
Riemann [50] aggregates events from servers and applications with a powerful stream processing language. It can, for example, send an email for every exception raised by application code and track the latency distribution of a web application.
Splunk [51] is a commercial solution that turns machine data into valuable insights. Machine data is generated by web sites, applications, servers, networks, mobile devices, and the like. Splunk consumes machine data and allows users to search and visualize it to monitor and analyze everything from customer clickstreams and transactions to network activity and call records. Splunk Storm [52] is a cloud-based service of Splunk.
Apache Whirr [53] is a collection of scripts that has sprung out as a project of its own. The purpose of Whirr is to simplify controlling virtual nodes inside a cloud like Amazon Web Services. Whirr controls everything from launching, removing and maintaining instances that Hadoop can then utilize in a cluster.
Cloudera’s Distribution, including Apache Hadoop (CDH) [54] is an open source Hadoopdistribution The current version (4) contains a virtual machine with Hadoop MapReduceconfigured along with Apache Whirr This is to simplify launching and configuring HadoopMapReduce clusters inside a cloud
Windows Azure HDInsight Service [55] is the latest Microsoft commercial solution for cloud computing. It makes the HDFS/MapReduce software framework available on the Windows Azure platform. In particular, it simplifies the configuring and running of Hadoop jobs by providing JavaScript and Hive interactive consoles.
Amazon Elastic MapReduce (Amazon EMR) [56] is a commercial solution that enables businesses, data analysts, and developers to easily process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Savanna (Hadoop as a Service with OpenStack) [57] has just been initiated by Mirantis [58] as an incubator OpenStack project. The aim of this project is to enable users to provision and manage Hadoop clusters on OpenStack.
2.4 Machine Learning

Machine learning techniques can automatically handle large scale data processing, generalize experiences from learning patterns on a set of training data, and apply them to a new data set slightly different from the training data without retraining. Machine learning algorithms can help to automate the process of log analysis.
2.4.1 Feature Extraction from Logs
As logs can be viewed as textual data, some natural language processing techniques can be applied. Many studies extract N-gram frequencies from the logs [59]. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is a common feature when doing text categorization. For example, the study of [59] did language classification and subject classification of newsgroup articles based on calculating and comparing profiles of N-gram frequencies. There is also one related study in abnormal detection of logs: the study of [60] used N-gram frequencies of network logs from Apache servers as features to do intrusion detection. In the text mining area, it is usual to take a word rather than a character as a gram, and to use contiguous sequences of n items instead of any permutation of all items.
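As an illustration of these n-gram features, the following Python sketch counts contiguous word bigrams in log messages; the sample log lines are invented for the example and do not come from the thesis's datasets.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams over a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Made-up log messages used only to show the shape of the feature.
logs = [
    "nova-compute ERROR instance spawn failed",
    "nova-compute INFO instance spawned successfully",
]
for line in logs:
    tokens = line.lower().split()
    print(ngram_counts(tokens, 2))   # bigram frequency profile of each message
```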
According to the study of [61], TF-IDF is another famous feature in the natural language processing area. Not only is the frequency of one word in a specific document considered, but also the discriminative power of one word in a document collection is included. The study of [62] applied this measure to its message count features, and improved the detection accuracy of abnormal logs.
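One possible way to compute such TF-IDF features over log messages is sketched below with scikit-learn's TfidfVectorizer; the tiny corpus is fabricated for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fabricated log messages standing in for a real log corpus.
corpus = [
    "instance spawn failed no valid host",
    "instance spawned successfully",
    "volume attach failed timeout",
]
vectorizer = TfidfVectorizer()           # term frequency weighted by inverse document frequency
X = vectorizer.fit_transform(corpus)     # sparse matrix: one TF-IDF row per log message
print(sorted(vectorizer.vocabulary_))    # learned vocabulary (one column per term)
print(X.shape)
```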
Additionally, many studies use data mining techniques to mine patterns from the data as features. As described in the study of [63], the authors created a dictionary of event types by a sequential text clustering algorithm and identified individual sets of messages in a log file that belong to one process or failure through the PARIS algorithm (Principle Atom Recognition In Sets).
Part of the log analysis research extracts structural information from logs by explicit programming. These studies define specific important attributes for their logs and extract them as features. The study [62] pointed out that although logs are non-structured data, the source code generating these logs has static templates. Two kinds of attributes are representative, that is, the state variables, which should have a small constant number of distinct values, and the object identifiers, which should identify program objects and have multiple values. In the study of [62], the state ratio vectors and message count vectors were treated as features.
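The message count idea can be loosely approximated by counting, for each object identifier, how many messages of each type it produced. The sketch below is only one interpretation of that feature, not the implementation of [62]; the identifiers and message types are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parsed (identifier, message_type) pairs extracted from log templates.
parsed = [
    ("req-1", "SPAWN_START"), ("req-1", "SPAWN_OK"),
    ("req-2", "SPAWN_START"), ("req-2", "RETRY"),
    ("req-2", "RETRY"), ("req-2", "SPAWN_FAIL"),
]
message_types = sorted({t for _, t in parsed})
counts = defaultdict(Counter)
for ident, mtype in parsed:
    counts[ident][mtype] += 1

# One feature vector per identifier, in a fixed message-type order.
vectors = {ident: [c[t] for t in message_types] for ident, c in counts.items()}
print(message_types)
print(vectors)
```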
Machine Learning Techniques

Many machine learning techniques could be used in the automated detection of abnormal logs. Some of them have been investigated in this area, while the others have been studied in related fields, such as text mining and anomaly detection. These techniques are categorized into classification techniques, clustering techniques and statistical techniques.
Classification Techniques. In machine learning, classification [64] is a process to approximate a function mapping a vector into classes by looking at input-output examples of the function. It is a kind of supervised learning technique, which means it needs labelled data to supervise the learning process. The authors of the study [65] used Support Vector Machines (SVM) to classify system call sequences of solved problems, but the correlations between system behaviours and known problems cannot be applied to detect unknown problems. Bayesian networks [66], [40] are another commonly used classification technique in anomaly detection; for example, the paper of [67] proposed a kind of variant Bayesian network to do network intrusion detection.
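As a minimal sketch of this supervised setting, the scikit-learn fragment below trains an SVM on synthetic log feature vectors. The feature matrix and labels are made up; in practice the features would come from the extraction methods above and the labels from solved problem reports.

```python
from sklearn.svm import SVC

# Synthetic feature vectors (e.g. message counts) and labels: 0 = normal, 1 = known problem.
X_train = [[0, 2, 0], [1, 0, 3], [0, 1, 0], [2, 0, 4]]
y_train = [0, 1, 0, 1]

clf = SVC(kernel="rbf")          # a standard kernel SVM classifier
clf.fit(X_train, y_train)
print(clf.predict([[1, 0, 2]]))  # classify a new, unseen feature vector
```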
The advantages of supervised learning techniques are that they are efficient and fast, as they get the 'correct' answers during training; that is, the clear feedback helps them learn quickly. Besides, they are more powerful at detecting known problems since the patterns have been taught. But the disadvantages are obvious: it is hard to get enough reliable labelled data. First, it is hard to provide absolutely normal data (i.e. containing no problems), and if some erroneous data are labelled as normal data, the false positive error increases. Secondly, as even a human expert cannot detect every kind of problem, supervised learning will fail to detect a totally unknown problem. Finally, the supervised learning system can never be superior to human experts, since the goal is to imitate the experts [68].
Clustering Techniques. Since completely clean data (i.e. containing only normal information) are almost impossible to guarantee, many studies use unsupervised learning methods. The majority of them adopt clustering techniques. Clustering [64] is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarities in comparison to one another but are very dissimilar to objects in other clusters. A shortcoming of the widely used K-means algorithm is that the input parameter k (i.e. the number of clusters) needs to be set by people. A suitable k is crucial for the final results, but finding an optimal k is an NP-hard problem [71]. The authors of the study [72] proposed an improved algorithm called Y-means to overcome the parameter selection problem. Y-means automatically selects the value of k through splitting and merging clusters, and sets a threshold to separate normal clusters from abnormal clusters. It solves the shortcomings of K-means but still has the selection problem of the parameter threshold t. As for logs, the categories of logs are complicated and difficult to decide.
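Under the common assumption that normal instances lie close to their cluster centroid, a minimal K-means sketch can score anomalies by the distance to the assigned centroid, as below. The data, the choice of k, and the global threshold are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))   # synthetic "normal" feature vectors
outlier = np.array([[8.0, 8.0, 8.0]])                     # one planted anomaly
X = np.vstack([normal, outlier])

k = 3                                                     # choosing k is itself the hard part
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()                  # crude global threshold
print(np.where(dist > threshold)[0])                      # indices flagged as anomalous
```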
Self-Organizing Feature Maps (SOFM) [73] are another popular clustering method. SOFM is also called the Kohonen map or network, which was first described as an artificial neural network by the Finnish professor Teuvo Kohonen [73]. SOFM can discover the significant patterns in the data by capturing the topology of the original input data. When SOFM is used in the anomaly detection area, the researchers usually assume that the normal data instances are quite close to their closest cluster centroids, but the abnormal instances are far away from their closest cluster centroids [69]. The authors of the study [74] used SOFM for fault monitoring in signal systems. SOFM has also been leveraged in intrusion detection: the studies of [75] and [76] aimed to classify regular and irregular intrusive network traffic for a given host via SOFM. An abnormally high value of the distance of the winning neuron with respect to the input vector was treated as an irregular behaviour. The study of [77] is a more relevant piece of research on log analysis: it is a case study about using SOFM on server log data, but the training data are only numerical data (e.g. CPU load, user time) rather than text contents. The advantage of SOFM is that it is independent of the data distribution and cluster structure, but the shortcoming is that, if the data do not have constant variance, the global threshold will be too low for the data of larger variation and leave these anomalies undetected [68]. The authors of the study [78] proposed a technique to use local thresholds for each cluster. Also, SOFM is not efficient enough for large data processing, especially on-line processing; Zheng et al. [79] proposed a more efficient algorithm combined with a fast nearest-neighbor searching strategy.

The second kind of assumption is that the anomalies do not belong to any cluster after training [69]. The clustering techniques that do not force every data instance to belong to one cluster have been widely used under this assumption for anomaly detection. For example, DBSCAN [80] is a kind of density-based clustering technique based on connected regions with sufficiently high density. As it is based on the density of the data, DBSCAN can discover clusters with arbitrary shape and is anti-noise, so that it can find clusters that K-means cannot find. DBSCAN has the same disadvantage as SOFM, in that it is not capable of handling local density variation within a cluster; Ram et al. [81] proposed a density varied DBSCAN algorithm which overcomes this shortcoming. And because the definition of density on high-dimensional data is meaningless, DBSCAN is usually not suitable for high-dimensional data.
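A small sketch of this second assumption with scikit-learn's DBSCAN follows: points that end up in no cluster receive the label -1 and can be treated as anomalies. The data and the eps/min_samples values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(150, 2))   # one dense, "normal" region
stray = np.array([[4.0, 4.0], [-5.0, 3.0]])             # two planted stray points
X = np.vstack([dense, stray])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_
print(np.where(labels == -1)[0])   # indices of points assigned to no cluster (anomalies)
```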
Clustering is a suitable type of machine learning algorithm when enough completely clean labelled data cannot be guaranteed.
Statistical Techniques. Since log analysis can be treated as an anomaly detection problem, many anomaly detection techniques have been explored. Generally, the statistical anomaly detection techniques leverage a statistical model to formulate the problem, and the anomalies are assumed to be in the low probability regions of the model.

Principal component analysis (PCA) has been studied in many fault detection problems. PCA separates the high-dimensional data space into two orthogonal subspaces. Significant changes in the residual portions of a sample vector indicate that it is an anomaly. The authors of the study [82] used PCA for diagnosing network-wide traffic anomalies. In the log analysis problem, Xu et al. [62] used PCA to detect anomalous logs. PCA is an effective method when dealing with high-dimensional data, but the shortcoming of this method is that it needs a threshold to separate the normal data from the abnormal data. Selecting a suitable threshold is hard and needs prior knowledge.
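A rough sketch of the PCA residual idea is given below: the data are projected onto the top principal components and points with a large reconstruction error are flagged. The data and the threshold rule are synthetic; as noted above, choosing the threshold in practice requires prior knowledge.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X[0] += 15.0                       # plant one obvious anomaly

pca = PCA(n_components=3).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))     # reconstruction from the top components
residual = np.linalg.norm(X - X_hat, axis=1)        # size of the residual (anomaly score)

threshold = residual.mean() + 3 * residual.std()    # illustrative threshold only
print(np.where(residual > threshold)[0])
```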
2.5 OpenStack

Open source projects provide an important alternative for organizations that do not wish to use a commercially provided cloud. Among those open source cloud solutions, OpenStack [9] is proving to be the number one choice with the following strengths [83]:
∙ Very young project
∙ Lots of corporate backing
∙ Codebase is simplified (Python only)
∙ Only use the components that we need (lightweight)
∙ Excellent for large deployments
∙ Excellent APIs
OpenStack’s mission is to provide ubiquitous, free, open source software for poweringpublic and private clouds There are a number of software projects working towards thismission Some of these are official projects under the direct control of the OpenStackcommunity and some are related to OpenStack, providing useful additional capabilities,but are not part of OpenStack charter or release process In the recent release (Havana), theOpenStack has offered nine core services as in the following Table 2.2
Table 2.2: OpenStack services

1. Dashboard (Horizon): Enables users to interact with OpenStack services to launch an instance, assign IP addresses, set access controls, and so on.
2. Compute (Nova): Provisions and manages large networks of virtual machines on demand.
3. Networking (Neutron): Enables network connectivity as a service among interface devices managed by other OpenStack services, usually Compute. Enables users to create and attach interfaces to networks. Has a pluggable architecture that supports many popular networking vendors and technologies.
4. Object Storage (Swift): Stores and gets files. Does not mount directories like a file server.
5. Block Storage (Cinder): Provides persistent block storage to guest virtual machines.
6. Identity Service (Keystone): Provides authentication and authorization for the OpenStack services. Also provides a service catalog within a particular OpenStack cloud.
7. Image Service (Glance): Provides a registry of virtual machine images. Compute uses it to provision instances.
8. Telemetry Service (Ceilometer): Monitors and meters the OpenStack cloud for billing, benchmarking, scalability, and statistics purposes.
9. Orchestration Service (Heat): Orchestrates multiple composite cloud applications by using either the native HOT template format or the AWS CloudFormation template format, through both an OpenStack-native REST API and a CloudFormation-compatible Query API.
The relationship among these OpenStack services is described in Figure 2-4.
Figure 2-4: OpenStack conceptual architecture [3]
As in Figure 2-4, the Dashboard provides a web front-end for the other services. All services authenticate through a common Identity Service. Compute retrieves virtual disks ("images") and associated metadata from the Image Store ("Glance") to provision VMs. Networking provides virtual networking for Compute. Block Storage provides storage volumes for Compute. The Image Store can store the actual virtual disk files in the Object Store. Telemetry can monitor and meter Compute, Networking, and the Image Store. Orchestration provides a way to deploy applications running on the other services.
A logical architecture diagram, as in Figure 2-5, provides a more complex view of how each individual OpenStack service interacts with the others. However, as OpenStack supports a wide variety of technologies, it does not represent the only possible architecture.
Figure 2-5: OpenStack logical architecture [3]
There are many ways to install and deploy OpenStack for a production-sized system through software distributions such as Ubuntu, Red Hat, and openSUSE. The community has put a lot of effort into keeping the OpenStack documentation [84] up to date with the latest OpenStack release. For development purposes, Devstack [85] can be used to build a complete development environment. As we can see in Figure 2-6 and Figure 2-7, a multi-node lab can be set up with two Ubuntu Server 12.04 LTS 64-bit machines in two VirtualBox VMs with appropriate Devstack localrc settings. The setup is performed with the OpenStack Folsom release.

Figure 2-6: Devstack's localrc for controller node (192.168.1.5)
Figure 2-7: Devstack’s localrc for compute node (192.168.1.6)
Figure 2-8 shows the nova services on both the controller and compute nodes, listed by the nova-manage service list command.
Figure 2-8: nova-manage service list
As we can see in Figure 2-9, an instance is launched successfully from Horizon on the multi-node lab.
Figure 2-9: Launch an instance from Horizon dashboard
As an OpenStack cloud is composed of so many different services, there is a large number of log files. On Ubuntu, as in Table 2.3, most services use the convention of writing their log files to subdirectories of the /var/log/ directory.

Table 2.3: OpenStack Log Location (Service, Log Location)

Each service also writes messages at a configured log level; log statements are only recorded if they are more "severe" than the particular log level. For example, DEBUG will allow all log statements through, while ERROR will only log messages describing an exception or critical issue.
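Since OpenStack services are Python programs, the level-based filtering described above behaves like the standard Python logging module. The fragment below is a generic illustration of that behaviour, not an excerpt from any OpenStack configuration, and the logger name is chosen only for flavour.

```python
import logging

logging.basicConfig(level=logging.ERROR)   # only ERROR and CRITICAL pass through
log = logging.getLogger("nova.compute")    # hypothetical logger name for illustration

log.debug("instance state polled")         # suppressed at the ERROR level
log.error("instance spawn failed")         # recorded

log.setLevel(logging.DEBUG)                # lower the threshold: everything now passes
log.debug("now visible")
```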
2.6 Hadoop

The Apache Hadoop [13] software library is an open source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop has two master-worker systems. The first is the MapReduce framework, which processes large data on many machines. The other is the Hadoop Distributed File System (HDFS), specialized in handling large data streams. Each system consists of a single master and multiple workers. The two core components, Hadoop MapReduce and HDFS, are inspired respectively by the Google MapReduce [86] and Google File System [87] papers.

A Hadoop job consists of a group of Map and Reduce tasks performing data-intensive computation. As we can see in Figure 2-10, the Map task consists of a map phase and the Reduce task consists of a shuffle, sort, and reduce phase. In the map phase, raw data is read and converted into key/value pairs, and the Map() function is applied to each pair. In the shuffle phase, all key/value pairs are sorted and grouped by their keys. In the reduce phase, all values with the same key are processed within the same Reduce() function.
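To make the map, shuffle/sort, and reduce phases concrete, the following self-contained Python sketch runs the classic word count in that style. It only imitates the phases in memory (the sorted() call stands in for Hadoop's shuffle and sort) and is not Hadoop code; the input lines are invented for the example.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: turn raw text into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the values of every group sharing the same key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["fault management in inter cloud", "fault analysis in hadoop"]
    shuffled = sorted(mapper(lines))            # stands in for the shuffle and sort phase
    for word, total in reducer(shuffled):
        print(word, total)
```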