KNOWLEDGE DISCOVERY IN COMPUTER
NETWORK DATA: A SECURITY PERSPECTIVE
by Kendall E. Giles
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy
Baltimore, Maryland
October, 2006
© Kendall E. Giles 2006
All rights reserved
Abstract

From a security perspective, computer network data is analyzed largely for two purposes: to detect known structures, and to identify previously unknown structures. As an example of the former, it is considered standard procedure to filter network traffic for previously identified viruses in order to prevent infection and to reduce virus spread. As an example of the latter, security researchers may want to search datasets in order to identify and discover previously unknown relationships or structures in the data, such as intrusions into a network by an external hacker. However, among other limitations, traditional methods of network data analysis are insufficient when processing large volumes of network traffic, do not allow for the discovery of local structures, do not visualize high-dimensional data in meaningful ways, and do not allow user input during search iterations.
We present the development, analysis, and testing of a new framework for the analysis of network traffic data. In particular, and among others, the framework addresses the following questions: How is network traffic represented in high-dimensional space? (normalized graph Laplacians); How can we extract features from the data? (flow-based feature extraction); How to embed high-dimensional representations in low dimensions? (Laplacian Eigenmaps); How to intuitively visualize low-dimensional structures? (Fiedler Space projections); How to address scalability concerns? (proximity computation, partitioning, and eigenpair computation approximations); How can the user be involved in the search? (iterative denoising applied to network data); How might this framework be used on empirical network data? (application to computer network intrusion data, backscatter data, and computer network application data). As such, this work presents theoretical as well as practical contributions, and these results are discussed within the context of traditional methods and techniques. Thus, due to its theoretical and applied benefits of visualizing and classifying a variety of unsupervised, heterogeneous, high-dimensional computer network traffic datasets, we feel that Iterative Denoising can be a unifying technology for protection, detection, and response groups coordinating around a network security monitoring system.
Dr. Carey Priebe (Advisor)
Dr. Fabian Monrose
Dr. David Marchette
Dr. Donniell Fishkind
Acknowledgments

Foremost, I would like to thank the members of my committee, Dr. Carey Priebe, Dr. David Marchette, Dr. Fabian Monrose, and Dr. Donniell Fishkind, for their extreme patience and willingness to take a chance on helping me see this dissertation through to completion. I would also like to thank Dr. Michael Trosset for working with me and helping me to learn about many of the theoretical aspects contained herein.

I must also thank my wife for letting me realize this goal. And to my Mom and Dad, no amount of thanks can suffice.
Contents

1 Introduction
1.1 Challenges
1.2 Contributions
1.3 Structure of the Thesis
2 A Taxonomy of Computer Network Data Analysis Approaches
2.1 Filtering
2.2 Modeling
2.3 Machine Learning
2.4 Manifold Learning
3 The Iterative Denoising Framework
3.4.5 Partition
3.4.6 Interact
3.5 Application: Science News Corpus
3.5.1 An Iterative Denoising of Documents
3.5.2 A Detailed Analysis of Clustered Documents
3.5.3 Illustration by Comparison of Two Key Features
4 Iterative Denoising of KDD Cup Data
5 Iterative Denoising of Network Application Data
6 Conclusions
Bibliography
Vita
List of Tables

Categories of Network Traffic
Network Traffic Features
Records Used for Analysis
Normal Traffic Confusion Matrix, k=1
Attack Traffic Confusion Matrix, k=1
Combined Traffic Confusion Matrix, k=1
Attack Type Colors
Flow Features
Dataset Window Distribution
Hierarchical k-Means Confusion Matrix
Linear Iterative Denoising Confusion Matrix
Nonlinear Iterative Denoising Confusion Matrix
List of Figures

1.3 Curse of Dimensionality: As the Dimensionality Increases, the Range of the Feature Space that Must Be Searched Increases Dramatically
1.4 Pairs Plot of KDD Cup Attack Data, Showing the Combinatorial Problem with Visualizing Large Numbers of Features
2.1 A Taxonomy of Computer Network Data Analysis Approaches
2.2 The Data Modeling Approach
2.3 The Algorithmic Modeling Approach
3.1 Iterative Denoising Flowchart
3.2 Denoising Detail
3.3 An Iterative Denoising Tree on Science News Corpus
3.4 Node 1, Fiedler Space Embedding: Anthropology (yellow), Astronomy (black), Behavioral Sciences (pink), Earth Sciences (light gray), Life Sciences (orange), Math & CS (red), Medicine (green), Physics (blue)
3.5 Node 4, Fiedler Space Embedding (color key as in 3.4)
3.6 Node 3, Fiedler Space Embedding (color key as in 3.4)
3.7 Node 8, Fiedler Space Embedding (color key as in 3.4)
3.8 Node 9, Fiedler Space Embedding (color key as in 3.4)
3.9 Node 10, Fiedler Space Embedding (color key as in 3.4)
3.10 Four-Class Science News, Root Node: Astronomy (black), Physics (blue), Medicine (green), Math & CS (red)
3.11 Physics and Math/CS Node, With Corpus-Dependent Feature Extraction (Iterative Denoising): Astronomy (black), Physics (blue), Medicine (green), Math & CS (red)
3.12 Physics and Math/CS Node, Without Corpus-Dependent Feature Extraction (Hierarchical Clustering): Astronomy (black), Physics (blue), Medicine (green), Math & CS (red)
3.13 Node 4 Computed Without Corpus-Dependent Feature Extraction: Anthropology (yellow), Astronomy (black), Behavioral Sciences (pink), Earth Sciences (light gray), Life Sciences (orange), Math & CS (red), Medicine (green), Physics (blue)
4.1 Normal Traffic, Iterative Denoising Tree
4.2 Normal Traffic—Root Node, Fiedler Space
4.3 Normal Traffic—Node 2, Fiedler Space
4.4 Normal Traffic—Node 4, Fiedler Space
4.5 Normal Traffic—Iterative Denoising Tree, 3 Levels
4.6 Normal Traffic—Node 4, Denoised Partition 1, Fiedler Space
4.7 Normal Traffic—Node 4, Denoised Partition 3, Fiedler Space
4.8 Normal Traffic—Local Time-Series Structure
4.9 Attack Traffic—Iterative Denoising Tree
4.10 Attack Traffic—Root Node, Fiedler Space
4.11 Attack Traffic—Node 4, Fiedler Space
4.12 Attack Traffic—Iterative Denoising Tree, 3 Levels
4.13 Attack Traffic—Node 4, Denoised Partition 1, Fiedler Space
4.14 Attack Traffic—Node 4, Denoised Partition 4, Fiedler Space
4.15 Combined Traffic—Iterative Denoising Tree
4.16 Combined Traffic—Root Node, Fiedler Space
4.17 Combined Traffic—Iterative Denoising Tree, 3 Levels
4.18 Combined Traffic—Node 3, Denoised Partition 1, Fiedler Space
4.19 Combined Traffic—Node 3, Denoised Partition 4, Fiedler Space
5.1 Flows by Number of Packets
5.2 Flows by Number of Bytes
5.3 Flows by Duration
5.4 Variation Explained by Principal Components
5.5 Scatterplot of PC1 and PC2 by Class
5.6 Iterative Denoising Tree, Part 1
5.7 Iterative Denoising Tree, Part 2
5.8 Root Node of Fiedler Space Embedding, Showing in Particular Clear Separation of FTP Traffic (Yellow) and Multiple Groups of NNTP Traffic (Red)
5.9 Root Node Cluster Features of Multiple NNTP Groups, Showing Features Clustered by Application Behavior
Trang 12Chapter 1
Introduction
Across many fields, the acceleration of technology and human inquisitiveness has led to a vast (over)abundance of heterogeneous, high-dimensional data. This means that a user who wants to understand a large, complex set of data and find interesting information and relationships in that data needs a sufficiently flexible and powerful computational framework in hand to facilitate data processing and knowledge discovery. For example, imagine that a user has been presented a large collection of text documents and wants to examine and understand those documents from an analytical perspective. This broad desire can take many forms. The user might have an information retrieval task in mind, where it is desired to find a set of documents relevant to a specific query. Or the user might wish to understand relationships between multiple documents. The user might also wish to identify the topic of discussion in a collection of emails, or to cluster them according to relevant criteria. However, such tasks often have technical constraints. Increasingly the user must analyze large, unstructured datasets, meaning that the dataset may not include class labels for the documents, and that the number of documents to be analyzed is large and possibly complex.¹ So the user's task is to explore the data (the corpus of documents) and to extract meaningful, implicit, and previously unknown information from a large unstructured corpus.

¹ Complex data is defined to be heterogeneous, meaning that there may be local (as opposed to global) structures that characterize some of the data.
From this scenario we can identify several relevant issues and needs. First, if we consider a word or phrase in one document as one dimension, then the dimensionality of the search space, from a performance perspective, would be prohibitively expensive and difficult for operations on a corpus even on the order of tens of thousands of documents and tens of thousands of words per document. The computational performance of processing such high-dimensional data can be limiting. Moreover, visualizing and comprehending high-dimensional spaces can be difficult for the user, who typically understands data best in two or three dimensions. Second, in large, complicated datasets, an important finding for the user might be relationships found in local structures, where features of the data may have differing relationships in different parts of the data. Third, the lack of existing class labels limits the ability of a user to analyze the corpus without first applying some structure to the data. Certainly, then, a flexible framework and tool is needed that can tease out these local structures and relationships of possible interest, display useful information to the user, and address scalability and high-dimensionality concerns.
But while the analysis of documents is a common task that, as we have just seen, is complicated by several challenges, we also take a look at the analysis of computer network data to see if the challenges we identified above are repeated in a different data domain. In particular, we initially consider the challenges in the area of computer intrusion detection, followed by a discussion of the challenges in computer application detection.
Computer intrusion detection is an important subfield of network data analysis. Roughly speaking, the goal is to detect or identify unauthorized access or attacks against a computer network. Network traffic data is monitored by automated and human means in order to raise an alarm as soon as possible after an attack incident. Alarms can be set based on previously identified attacks, using rules, filters, or models that have been shaped to known attack structures. However, a common yet difficult problem is identifying attacks that have not been previously identified. For example, Nmap is a popular network administration tool that is also used by attackers for performing network scans. A defender could implement a filter to detect illicit scans that checks for SYN packets sent to multiple ports on a target host from a given source within a specified time window. However, the defender's filter would not then detect port scans from attackers using a scan rate longer than the defender's specified time window, nor would the filter catch multiple attackers working together, for example, each scanning just one port on a particular target. This filter also fails to address the fact that there are multiple ways to scan ports, such as FIN scanning and UDP scanning, in addition to the SYN scanning method that the filter was written to address. So, depending on specific filters to detect intrusions is no longer a sufficient strategy. There is a need for a system that can help a user distinguish previously unknown attacks from normal traffic.
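To make the limitation concrete, a minimal sketch of such a windowed SYN filter follows; the packet fields, threshold, and window length are illustrative assumptions, not values from any particular detection system.

```python
# A minimal sketch of the windowed SYN-scan filter described above.
# The packet representation, threshold, and window are hypothetical.
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # the defender's specified time window
PORT_THRESHOLD = 15   # distinct ports probed before an alarm is raised

# per-(source, target) history of (timestamp, dst_port) SYN observations
history = defaultdict(deque)

def observe_packet(ts, src_ip, dst_ip, dst_port, flags):
    """Return True if this SYN pushes (src, dst) over the scan threshold."""
    if "SYN" not in flags or "ACK" in flags:  # consider only initial SYNs
        return False
    q = history[(src_ip, dst_ip)]
    q.append((ts, dst_port))
    while q and ts - q[0][0] > WINDOW_SECONDS:  # expire old observations
        q.popleft()
    return len({port for _, port in q}) >= PORT_THRESHOLD
```

A scanner probing slower than one port per WINDOW_SECONDS/PORT_THRESHOLD, multiple attackers splitting the port list across sources, or FIN and UDP probes all pass through this filter untouched, which is exactly the weakness described above.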
The detection of applications consuming network resources is another important security and administrative area of concern. Towards defending against and combating malicious network activity, organizations and agencies frequently use network sensors or monitors at key points on the network in order to collect traffic for analysis (e.g., [Marchette, 2001]). The packets in computer network traffic are often aggregated into collections known as flows [Claffy et al., 1993], which can be described as sequences of packets, possibly containing multiple protocols, related to a single application transaction. A request for a webpage, for example, may contain multiple HTTP requests for elements within the page, along with multiple TCP acknowledgements, and possibly other protocol transactions. Because of the variety of applications and protocols on the Internet, it is critical for network administrators to understand the types of applications and protocols that are being used on their network. From a security perspective, without knowing the nature of the traffic on the network, information may be compromised, resources may be used inappropriately, and attackers may take advantage of unauthorized access. More generally, it is important to understand the mix of applications that consume network resources, to monitor and track different user populations, and to help resolve Quality of Service issues [Roughan et al., 2004, Moore and Zuev, 2005a, Moore and Papagiannaki, 2005]—insight important to network designers and administrators, in addition to security personnel.
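As a rough illustration of the flow abstraction, the sketch below groups packets by the common five-tuple convention with an idle timeout; the field names and timeout are illustrative assumptions, not the definition from [Claffy et al., 1993].

```python
# A minimal sketch of aggregating packets into flows keyed by the
# five-tuple (src ip, dst ip, src port, dst port, protocol).
# The Packet fields and the idle timeout are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

IDLE_TIMEOUT = 64.0  # seconds of silence before a flow is considered closed

@dataclass
class Packet:
    ts: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    length: int

def aggregate_flows(packets):
    """Return {five_tuple: [flow, ...]} where each flow is a packet list."""
    flows = defaultdict(list)
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        bucket = flows[key]
        # start a new flow after a long gap, or if none exists yet
        if not bucket or p.ts - bucket[-1][-1].ts > IDLE_TIMEOUT:
            bucket.append([])
        bucket[-1].append(p)
    return flows
```

Per-flow features such as packet counts, byte counts, and durations can then be computed from each packet list.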
A common approach is to identify traffic based on computer services using well-known ports [Numbers]. This global-filtering approach is commonly used in typical firewall and network monitoring applications, such as [SNORT], [Porras and Neumann, 1997], [Logg and Cottrell, 2003], [Fraleigh et al., 1995], and [Moore et al., 2001a]. While classifiers and clustering algorithms can target specific, known types of flows, a flexible and forward-thinking approach is to not rely on global filters, since these methods do not capture previously unknown flows, nor do they help the user understand the relationships between different flows. In addition, dependence on well-known ports is no longer exclusively reliable, mainly due to two trends. First is the use of protocols such as HTTP to wrap other protocols in order to bypass firewalls. Second is the use of non-standard port numbers for malicious traffic, such as backdoors.² Other features in addition to port can be used to filter for applications, such as packet size.

² A backdoor is a mechanism by which unauthorized access is made available on a computer system. With access, an attacker can, for example, relay email spam [Boxmeyer], or launch worms.
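For reference, the port-based approach amounts to little more than a table lookup; a toy version follows, with an abbreviated and purely illustrative port table.

```python
# A toy port-based traffic classifier of the kind used by global-filtering
# approaches; the table is abbreviated and illustrative.
WELL_KNOWN_PORTS = {20: "ftp-data", 21: "ftp", 22: "ssh", 23: "telnet",
                    25: "smtp", 53: "dns", 80: "http", 119: "nntp",
                    443: "https"}

def classify_by_port(src_port, dst_port):
    """Label a flow by whichever endpoint uses a well-known port."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"  # tunneled, backdoored, or simply unregistered traffic
```

Both failure modes above are visible immediately: a protocol wrapped in HTTP and a backdoor listening on port 80 are each labeled "http".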
These two examples, the analysis of computer intrusion data and the analysis of computer network application traffic, provide motivating problems and applications that frame and focus the remaining discussion. In summary, from a security perspective, computer network data is analyzed largely for two purposes: to detect known structures, and to identify previously unknown structures. As an example of the former, it is considered standard procedure to filter network traffic for previously identified viruses in order to prevent infection and to reduce virus spread. As an example of the latter, security researchers may want to search datasets in order to identify and discover previously unknown relationships or structures in the data, such as intrusions into a network by an external hacker. This motivates the following presentation of challenges in the state of the art of network traffic analysis and our contributions for addressing these challenges in this work.
1.1 Challenges
From our discussions above, we can distill three main challenges facing the computer network traffic analyst. In the following we describe: making distributional assumptions, and using class labels and global filters; addressing scalability and the curse of dimensionality; and providing an intuitive way to visualize high-dimensional data, extracted structures, and information.

Making distributional assumptions, and using class labels and global filters: In the data analysis domain, one goal is to identify interesting structures that may lead to insights and inferences about information in the data. The application of filters or models to the data in this case would likely cause a loss of valuable data and information, because one does not yet know the structures to approximate or search for. Also, large, high-dimensional datasets often have local structures that vary with location in the data—where "different relationships hold between variables in different parts of the measurement space" [Breiman et al., 1984]. These local structures may be of interest to the user, but they will likely be missed with the global filter/model-approximation approach. Thus, it is of importance to handle nonhomogeneous data.
We address this problem by taking a non-parametric, unsupervised approach to data analysis. We do not make distributional assumptions by trying to force our data into possibly inappropriate molds. Also, we assume the data is new and unstructured, in the sense that there are no previously-identified classes in the data. We do not have known examples to use for training, and so must utilize unsupervised classification approaches. Also, our analysis method, which we discuss below, is designed not to take the simplistic approach of applying a global filter to analyze the entire dataset. We do not assume that a large dataset is everywhere homogeneous, and so our methods are designed to extract local structures that may remain hidden from a global approach.
Addressing scalability and the curse of dimensionality: Another serious limitation is that traditional methods are not able to easily handle high-dimensional data. Ethereal [Analyzer], a popular protocol analyzer, may be useful when analyzing small packet captures, but the datasets of interest here are large: CAIDA datasets, for example, are many megabytes if not gigabytes in size [CAIDA]; the Abilene Observatory makes traffic data available from their 10 gigabits per second network [Observatory].
This latter issue is commonly known as the Curse of Dimensionality [Bellman, 1961]. This curse refers to a computational complexity phenomenon that affects many statistical and machine learning data analysis approaches, such as nearest-neighbor or k-means classifiers. The basic problem is that as the dimensionality of the data increases, the ability to make partitional distance calculations becomes increasingly difficult.
For example, Figure 1.1 shows 100 points uniformly distributed in one dimension. Any interval, say, of width = 1 contains a number of points. However, if we increase the number of dimensions by one, as in Figure 1.2, which shows 100 points uniformly distributed in two dimensions, then there are two-dimensional cells, with each side's width = 1, that do not contain points, and those that do are more sparse than in one dimension. In three dimensions, most cells will be empty.
As the number of dimensions increases, the points become increasingly sparse. In high dimensions, the edge distances between points become similar for most points. In other words, the ability to distinguish proximities for the purpose of classification becomes increasingly difficult as the dimensionality increases. For this reason, the performance of traditional clustering and classification approaches that use a proximity measure tends to the pathological in high dimensions.

Figure 1.3: Curse of Dimensionality: As the Dimensionality Increases, the Range of the Feature Space that Must Be Searched Increases Dramatically.
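A quick numerical sketch of this concentration effect, with arbitrary sample sizes and dimensions, draws uniform points and compares each point's nearest and farthest neighbor distances; as the ratio approaches 1, proximity-based partitioning loses its discriminating power.

```python
# A minimal sketch of distance concentration under the Curse of
# Dimensionality; sample size and dimensions are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n = 100  # points, as in the uniform-sampling figures discussed above

for d in (1, 2, 10, 100, 1000):
    x = rng.uniform(size=(n, d))
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    np.fill_diagonal(dist, np.nan)            # ignore self-distances
    ratio = np.nanmin(dist, axis=1) / np.nanmax(dist, axis=1)
    print(f"d={d:5d}  mean nearest/farthest distance ratio: {ratio.mean():.3f}")
```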
Providing an intuitive way to visualize high-dimensional data, extracted structures, and information: Viewing and comprehending high-dimensional data is beyond the abilities of most human users, whose intuition is typically limited to two or three dimensions. But in high numbers of dimensions, it is perhaps even more important and useful for users to have visualizations of their data in order to increase interpretability and understanding of possible events and structures of interest.
As a simple example of the problems humans face when trying to visualize nontrivial datasets, consider Figure 1.4. This figure shows a pairs plot of some computer intrusion detection attack data used in our analysis (to be detailed in Section 4.2). A scatterplot is a simple graph that shows the relationship between two variables by plotting their datapoints. For our computer intrusion data, there are eight features (such as the number of data bytes from source to destination, and the number of connections to the same host as the current connection in the past two seconds) and n = 8264 attacks. In this figure, the different colors represent different types of computer network attacks (the correspondence of attack type to color will be explained further in Section 4.2).
Figure 1.4: Pairs Plot of KDD Cup Attack Data, Showing the Combinatorial Problem with Visualizing Large Numbers of Features.
Pairs plots are useful for multi-variate data analysis with small numbers of features, as they allow a user to quickly scan the data for interesting patterns. With m variables, a pairs plot shows a scatterplot for each of the $\binom{m}{2} = m(m-1)/2$ combinations. Even with just m = 8 features, we have 28 unique plots (the upper triangle mirrors the lower)—this is not a feasible tool to use for analyzing high-dimensional data. Yet the state-of-the-art suggests this very method of multi-variate raw data presentation (e.g., Wright et al. [2006], to be reviewed further in Section 2). But from this simple pairs plot example, it seems reasonable to suggest that a human analyst may be overwhelmed by the number of scatterplots when trying to discern structure in data—our network application dataset, analyzed in Section 5, contains just 20 features, resulting in 190 scatterplots an analyst would have to review. Fortunately, our method, detailed in this work, provides a way for an analyst to easily visualize structures in such multi-variate data.
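The combinatorial growth is easy to verify; a short check of the panel counts quoted above (28 panels for 8 features, 190 for 20) follows.

```python
# Number of unique pairs-plot panels for m features: binom(m, 2) = m(m-1)/2.
from math import comb

for m in (8, 20, 100):
    print(f"m={m:3d} features -> {comb(m, 2):5d} unique scatterplots")
# m=8 -> 28 and m=20 -> 190, matching the counts in the text
```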
So only considering user-interface visualization and clustering difficulties with traditional analysis methods on computer network data, not to mention scalability issues with large datasets, the Curse of Dimensionality, and the problem of discovering local structures, there exists a need for a better way to analyze and visualize large, high-dimensional datasets.
1.2 Contributions
We note that the omnipresence of security threats to computer systems brings a certain urgency to finding and addressing these challenges. Simply, the continued threat of computer network attacks and other malicious activity motivates the need for vigilance regarding and analysis of computer network traffic. This work presents the development, analysis, and testing of a new framework for the analysis of computer network traffic data. Whereas much research focuses on the data mining part of knowledge discovery, which assumes the data has already been preprocessed and the appropriate features extracted, we instead address the entire knowledge discovery problem as a system process. We focus on a framework that allows the user to interactively process possibly noisy data, extract features, mine the data for information, and iterate as needed. As such, this work presents theoretical as well as practical contributions. We provide an approach for the analysis of network data, the classification of network flows, and the visualization of results. More importantly, we provide a method for the proper extraction and visualization of structures that may be of interest to a user. Finally, we demonstrate how our flexible analysis framework can easily be extended depending on particular classification/clustering needs. Using these implementations, we detail results from an analysis of three corpora of computer network traffic.
We address the above-mentioned problems, in particular the discovery of local structures in complex network traffic datasets and the visualization of high-dimensional classification results for a user, with our framework called Iterative Denoising. Iterative Denoising [Giles et al., 2006] is an unsupervised knowledge discovery methodology for heterogeneous, high-dimensional data. This framework allows a user to visualize and to discover potentially meaningful relationships and structures in large datasets. We detail how this methodology can be used in the analysis of computer network traffic, and demonstrate its performance on multiple heterogeneous computer network traffic datasets.
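Because the framework's visualizations rest on spectral embeddings (the normalized graph Laplacians, Laplacian Eigenmaps, and Fiedler Space projections outlined in the Abstract and detailed in Chapter 3), a miniature version of the core computation is sketched below; the Gaussian similarity kernel and its bandwidth are illustrative choices, not the settings used in this dissertation.

```python
# A minimal sketch of a Fiedler-space embedding: build a similarity graph
# over feature vectors, form the symmetric normalized graph Laplacian, and
# project onto the eigenvectors with the smallest nonzero eigenvalues.
# The kernel and bandwidth are illustrative, not the thesis settings.
import numpy as np

def fiedler_embedding(X, n_dims=2, sigma=1.0):
    """Embed the rows of X (an n x d feature matrix) into n_dims coordinates."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))     # Gaussian similarity weights
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt  # I - D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # skip the trivial constant eigenvector; the next is the Fiedler vector
    return eigvecs[:, 1:1 + n_dims]
```

Scatterplotting the two returned coordinates against each other, colored by any available labels, gives the kind of Fiedler Space view shown in the figures of Chapters 3 through 5.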
As specific contributions, using Iterative Denoising we demonstrate the discovery and identification of structures in computer network application data—the unsupervised identification of traffic flows by type. In addition, our approach discovers local structures that reflect specific application process behaviors. Also, our system takes a non-parametric approach to knowledge discovery, and so does not make assumptions that may be unfounded in the network security context of ever-changing network loads and traffic distributions. Our system provides a visual interface that allows the human user to quickly and easily see such structures in the data, and we provide unique visualizations of these structures in tree and structure-space formats.
In an operational context, the Iterative Denoising methodology presented here can serve as a potentially important Indications and Warnings component of Network Security Monitoring systems. A definition for Digital Indications and Warnings is given as: "the strategic monitoring of network traffic to assist in the detection and validation of intrusions" [Bejtlich, 2005]. Many of the current network monitoring and analysis tools simply report automated warnings, with many false positives, that require human investigation and validation. In this role, the human provides a context that automated systems cannot provide—humans consider the larger picture of events that may be happening and have happened in order to determine whether or not a warning is a legitimate incident. Indeed, because our methodology is able to process large, high-dimensional collections of network traffic data and is able to present classification results in a low-dimensional context suitable for human analysis, a system implemented with our methodology may be of great benefit to analysts, researchers, and Network Security Monitoring systems.
Moreover, this context lies at the intersection of three communities in a network security system—those who try to prevent intrusions, those who detect intrusions, and those who respond to intrusions [Bejtlich, 2005], and our methodology helps to unify those often-fragmented groups. For example, the security policies of the prevention team can be implemented in Iterative Denoising as the features to be extracted in our Extract Features step (see Section 3.4.2). The detection team can benchmark and tune methodology parameters for a particular monitoring environment (e.g., the parameters discussed in the Compute Proximities, Embed, and Partition sections; see Sections 3.3.2, 3.3.3, and 3.3.4, respectively). The response team can provide feedback to the prevention team on monitoring success, and can provide feedback to the detection team on validating new events of interest. In this way, in addition to its visualization benefits, our methodology could serve as a central component of an effective network security monitoring system.
1.3 Structure of the Thesis
In Chapter 2 we place our methodology in the context of the current research literature and implementation state-of-the-art. We provide a helpful taxonomy that serves as a logical categorization for various research contributions in the context of computer network traffic analysis. In Chapter 3 we present our methodology for the analysis of high-dimensional computer network traffic. Some of the features of the methodology are highlighted with the analysis of an interesting text corpus. In Chapter 4 we demonstrate how our methodology can be applied to the analysis of intrusion detection data. In Chapter 5 we demonstrate how our methodology can be applied to the analysis of computer network application data. Finally, Conclusions and Bibliography follow.
Chapter 2

A Taxonomy of Computer Network Data Analysis Approaches

In this chapter we categorize computer network data analysis approaches, from a security perspective, along two main axes: Intuition and Generality. Intuition refers to the degree to which a computer traffic analysis approach gives a user some understanding of the structures that may be contained within the data. For example, does the approach give the user a sense of distributional constraints behind a particular protocol or attack occurrence? Or does the approach simply string-match header fields in a dictionary? Generality refers to the degree to which a computer traffic analysis approach is able to scale to increasing traffic loads or is not limited to parameterized domains and distributional assumptions. For example, does the approach assume a traffic characteristic is distributed Gaussian or that virus propagation rates grow exponentially? Or does the approach not make parametric distributional assumptions?

Of course, as with any binning, some analysis approach assignments into this taxonomy can be argued. However, the goal is to establish the general context within which to discuss previous approaches and how our new approach fits in with the literature. Likewise, we note that just as there are many approaches to computer network data analysis, there are many application environments and situations in which a particular approach might be an appropriate solution. So in some respect, there may be other axes along which we can group the various approaches. However, from the standpoint of discussing analysis characteristics that are relevant to users who seek methods to help in their understanding of large quantities of increasingly complex network traffic, we feel that Intuition and Generality are very relevant features. Therefore, in the following, we discuss representative computer network data approaches and their relationship to our taxonomy.
2.1 Filtering

Filtering approaches encompass methods that match network traffic against pre-defined patterns, signatures, or rules. A discussion of how such filtering approaches can be thwarted by attackers is given in [Ptacek and Newsham, 1998]. Filtering approaches are low on the Intuition axis because such approaches do not typically provide any insight into the processes that generate the traffic or events of interest. For example, in [Kim et al., 2004], the authors present a flow-based abnormal traffic detector that labels flows as abnormal if certain characteristics, such as packet size or protocol, are above a certain threshold or match a particular pattern, using a series of template functions. While this system can report frequencies of abnormal flows, it does not have the ability to show the user, for example, how close a P2P flow may appear to a denial-of-service flow in some sense of distance.
Popular representatives of the filtering approach include the Snort and Bro implementations [Paxson, 1999, SNORT], which largely follow the model of passive protocol analysis by parsing sniffed IP packets, stream reconstruction, and signature analysis via pattern matching. Filtering is also used to describe initial characteristics of interesting phenomena, such as the binning of suspected denial-of-service attack flows by packet header characteristics using backscatter datasets [Moore et al., 2001b, 2006]. The authors in [Yegneswaran et al., 2003] perform a similar but broader analysis, by summarizing portscan packet arrival statistics (time, source IP, source port, destination IP, destination port, and TCP flags) from a collection of firewall logs rather than analyzing backscatter packet arrivals; their focus was on summarizing and characterizing intrusion activity in general.
However, while filtering can be used in anomaly-detection scenarios, such as in [Lane and Brodley, 1997], where the authors create profiles of typical user behavior to determine if new events are 'normal' or 'anomalous', this approach would not handle new users on the system without first creating a new user entry in the dictionary. Another approach that would best be classified as Filtering is that of [Karagiannis et al., 2005]. Here, the authors associate flows to IP addresses in order to combine host communication patterns (i.e., identifying cliques or communities of communicating hosts) with packet header attribute metrics (i.e., the typical source/destination IP/port features) in order to classify flows by application. They perform this flow classification using a collection of rules (heuristics) and a library of signatures (which they refer to as a 'library of graphlets'). In [Roughan et al., 2004], a collection of signatures for a variety of applications is developed offline through the statistical analysis of a training set of application network data, and then these signatures are applied for online application classification. In [Barford and Plonka, 2001, Barford et al., 2002] the authors advocate statistical filtering of network traffic, and apply wavelet filters for the detection of outage, flash crowd, attack, and measurement failure network traffic anomalies. Bit-stream filters are used to classify P2P applications by examining flow payloads in [Karagiannis et al., 2004], and the authors also use heuristics on packet headers for classification, such as searching for source-destination IP pairs that concurrently use both TCP and UDP protocols within a specific time window. Finally, detecting backdoors is detailed in [Zhang and Paxson, 2000], where the authors develop heuristics and filters to find patterns in the timing of keystroke packets in order to detect SSH, Telnet, and FTP backdoors, among others.
In contrast to the filtering approaches, the framework presented here does not utilize a collection of pre-defined templates and filters for classification. As previously mentioned, one can only create a filter once structures have been identified in the data to filter on. We focus on the analysis of datasets that contain unknown structures.
2.2 Modeling
Modeling approaches encompass methods that utilize mathematical or statistical modeling techniques to reach conclusions about computer network data. The data is assumed to be generated by some parameterized function, and so these approaches often concentrate on finding estimates for the parameters so that inferences can be made. These approaches are analogous to Breiman's data modeling culture [Breiman, 2001], and can be summarized by Figure 2.2. Here, network event or flow vectors x are input to a parameterized model so that the response variables y can be evaluated, often using goodness-of-fit or residual-analysis tests.
x ———> model ———> y

Figure 2.2: The Data Modeling Approach
These approaches are low on the Generality axis because they depend on parameter estimates based on model or distribution assumptions that may not reflect nature—e.g., Gaussian or Poisson distribution assumptions may reflect mathematical expediency more than accuracy. These approaches are high on the Intuition axis because the models present simplified descriptions of possibly complex and often unknown processes. With only a few parameters to specify, systems can then be implemented using the models for prediction and classification, for example.

This approach is popular with research trying to model the spread of worms or viruses on the Internet, as in [Staniford et al., 2002b, Kephart and White, 1993], due to the obvious analogy with similar earlier models made to model the spread of viruses or diseases in humans [Bailey, 1975]. As an example of network forensics performed to reconstruct the propagation history of a specific worm, [Kumar et al., 2005] used a worm's disassembled code and packets received from unused blocks of Internet space to create deterministic models that allow certain properties of the worm and its victims to be characterized. A Gaussian mixture model was used to detect packet feature outliers in [Lu and Traore, 2005] in order to detect network intrusions. Biological models are studied in [Forrest et al., 1996], where a method for anomaly detection is based on studying short-range correlations in a process' system calls, a work inspired by methods used by natural immune systems. Traffic flows are clustered in [McGregor et al., 2003] using a parametric approach—the features of a given flow are modeled as mixtures of Gaussian distributions, and they use Expectation-Maximization [Dempster et al., 1977] to estimate the distribution parameters. They also use Kiviat figures [Kolence and Kiviat, 1973] to plot the cluster feature values, a technique that can be used in this case since they only have a small number of features in their analysis.
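As a rough illustration of this parametric style of flow clustering (a generic sketch, not a reproduction of [McGregor et al., 2003]; the synthetic features and component count are arbitrary), a Gaussian mixture can be fit to flow feature vectors with EM as follows.

```python
# A minimal sketch of parametric flow clustering: fit a Gaussian mixture
# by Expectation-Maximization. Features and component count are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical flow features: [log packets, log bytes, log duration]
bulk = rng.normal([6.0, 12.0, 2.0], 0.5, size=(300, 3))        # bulk transfer
interactive = rng.normal([2.0, 6.0, 4.0], 0.5, size=(300, 3))  # interactive
flows = np.vstack([bulk, interactive])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(flows)  # EM estimates means, covariances, weights
print("estimated cluster means:\n", gmm.means_)
```

The appeal and the risk are the same thing: everything is summarized by a few estimated parameters, which is only as good as the Gaussian assumption itself.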
Traffic flows were also considered in [Moore and Zuev, 2005a]. Here, the authors apply a supervised, parametric Naive Bayes classifier, with variations of kernel density estimation and feature selection, to a hand-labeled set of network traffic flows. In [Staniford et al., 2002a], the authors create a Bayes belief network to tag a sensed packet with some measure of how anomalous the packet is believed to be. The goal is to be able to sense stealthy portscans from normal background traffic. Packets that are estimated to be anomalous are added to a graph of previously-tagged anomalous packets that form correlated subgraphs. The packet is added to the appropriate subgraph using a simulated annealing approach based on a set of heuristics to determine candidate edge weights. Heuristics are formed by training on a separate collection of network traffic. Other examples of approaches that assume the data follows particular distributions include [Ye et al., 2001, 2002].
In contrast to the modeling approaches, the framework we present makes no distributional assumptions, and so in this sense is considered a non-parametric approach.
2.3 Machine Learning
Machine Learning approaches encompass methods that utilize Breiman's algorithmic techniques to reach conclusions about computer network data, as shown in Figure 2.3. There are no assumptions about how nature works; these approaches only assume the data fed to the model are drawn i.i.d. from an unknown multivariate distribution. The criterion for how well an algorithmic model works is based on how well the model's outputs predict y. Perhaps making little comment on the processes of nature, if the algorithmic model can faithfully reproduce the outputs y, then it is considered successful. The goal of this approach is not to explain nature, but rather to provide a tool, such as a classifier, that can be used for prediction and analysis. In the case of backscatter, for example, a classifier could be used to detect different types of denial-of-service attacks, or provide an analysis platform for investigating (detecting) different properties of attacks.
x ———> nature ———> y

Figure 2.3: The Algorithmic Modeling Approach
These approaches are high on the Generality axis because they do not depend on parameter estimates based on model or distribution assumptions that may not reflect nature. These approaches are low on the Intuition axis because the models do not give the user a feel for the geometric relationship of the inputs to the outputs. As typical examples, a neural network was used in [Ryan et al., 1998] to build patterns of behavior for users on a system, based on the commands they typically use, in order to detect illegitimate use. Also, genetic algorithms are used for feature selection, and decision trees are used for classification in [Stein et al., 2005]. A supervised approach is taken in [Early et al., 2003], where training data is collected for each of a set of server applications, and a decision tree classifier is constructed. They also used a small number of features: TCP packet header flags, windowed mean interarrival time, and windowed mean packet length.
There are few works that actually use visualization as a part of their classification process. In one, Wright et al. [2006] use scatterplots, lineplots, and heatmaps to create templates or motifs of application packet arrival patterns. Since different applications in general have different connection patterns between client and server, the concept is that these connection patterns can be displayed such that a user can try to recognize patterns of previously-identified protocols, or unknown patterns in the case of anomalies. While this approach may itself seem like a filtering technique, where the human user matches the visual motif to a dictionary of templates, the authors do mention a way to "score" a time-series motif using a technique called "dynamic time warping" from the speech recognition community that can be used to compare and match time-series patterns, which may help to automate the classification process.
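Dynamic time warping itself is compact enough to sketch; the following is the standard textbook recurrence, not code from Wright et al. [2006], scoring the alignment cost between two packet-timing or packet-size sequences.

```python
# A minimal dynamic time warping (DTW) distance between two sequences,
# e.g., per-connection packet sizes or inter-arrival times.
# This is the standard O(len(a) * len(b)) recurrence.
def dtw_distance(a, b):
    """Return the cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, and deletion moves
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[len(a)][len(b)]

# lower scores mean two connections' timing motifs align more closely
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # -> 0.0: warping absorbs the repeat
```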
classi-It should be noted that these visual motifs, as scatterplots of the data features,can only effectively display two or three features at a time Moreover, it is not clearhow the procedure could possibly display multiple application connections in thesame “space” without the scatterplots becoming indistinguishable In comparison,while our Iterative Denoising procedure uses scatterplots to display objects, the di-
Trang 38mensions used are projections of all the original objects’ features together, ratherthan two at a time As such, we can analyze high-dimensional datasets effectively.Also, objects of different classes can be displayed in the same space, so that rela-tionships between classes can be investigated Finally, we present results that donot depend on using time as one of the object’s features.
But while machine learning methods may provide good classification accuracy for the problems targeted, and may provide high-Generality frameworks for a variety of datasets due to the lack of distribution parameters, as in the Modeling approaches, Machine Learning approaches do not give the user intuition about the nature of relationships in the data, since the emphasis is on getting a good output-match rather than on understanding the data. In contrast to the machine learning approaches, one important aspect of our approach is to help the user interpret and understand the data being analyzed. We do this in part by constructing low-dimensional projections into a Euclidean space that is intuitive for the user.
2.4 Manifold Learning
The last category of computer network data analysis approaches, in some sense, attempts to address the shortcomings of the other categories with respect to Generality and Intuition. Manifold Learning approaches are the result of the realization that traditional approaches are often not suitable for large, complex, and high-dimensional datasets. These approaches try to give the user an understanding of the variety and relationship of low-dimensional and local structures that may be hidden in high-dimensional data. For this reason, these approaches are high on the Intuition axis. Similarly, these approaches are high on the Generality axis because these approaches do not typically make distributional assumptions about the nature of the data—in fact, the assumption is that a complex dataset is likely to contain many different local distributions of data, and these are the very things that should be discovered rather than assumed away. Here, the data itself should drive the analysis to provide information to the user, rather than have the user destroy information by pressing the data into assumed distributional molds.
Examples in this category include [Lakhina et al., 2004a,b], where the authors use a dimensionality reduction method (principal components analysis) to reduce high-dimensional Abilene traffic measurements to a low-dimensional space for anomaly detection. In fact, these papers show that dimension reduction methods can be suitable for computer network traffic. In [Patwari et al., 2005], the authors present a visualization tool for displaying two-dimensional representations of a high-dimensional set of network traffic to allow monitoring of network changes over time. Finally, [Labib and Vemuri, 2004] uses principal component analysis to display and detect denial-of-service and network probe attacks in an intrusion detection dataset.
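To make the flavor of these PCA-based approaches concrete (a generic sketch under stated assumptions, not the specific procedure of [Lakhina et al., 2004a,b]), one can model normal variation with the top principal components of a traffic matrix and flag time bins whose residual off that subspace is large.

```python
# A minimal sketch of PCA-style traffic anomaly detection: project onto a
# low-dimensional "normal" subspace and flag large residuals. The synthetic
# data, subspace dimension, and threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
# hypothetical traffic matrix: rows = time bins, cols = per-link byte counts
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))  # correlated links
X[450] += 25.0  # inject one anomalous time bin

train = X[:400]                       # history assumed to be mostly normal
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
k = 3                                 # dimension of the "normal" subspace
P = Vt[:k].T @ Vt[:k]                 # projector onto the top-k components
residual = np.linalg.norm((X - mean) - (X - mean) @ P, axis=1)

threshold = residual[:400].mean() + 3 * residual[:400].std()
print("flagged time bins:", np.where(residual > threshold)[0])
```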
Lastly, we detail [Bishop and Tipping, 1998], which presents a method for creating hierarchical mixtures of latent variable models for data visualization. The authors show how latent or missing variables can be modeled, for example, by a mixture of Gaussian distributions, and maximum likelihood can be used to fit the model to the data. These latent variables are fixed in order to be the two variables used for visualizing the high-dimensional projections of the original data points. The parameters of the latent variables are estimated using the Expectation-Maximization algorithm. The mixtures, or clusters, form a partitioning of the data, and each cluster can then be further modeled, estimated, and projected. Similar to our approach, the authors do not use class labels to construct their tree. However, within each node, the authors use linear methods, whereas our approach focuses on nonlinear methods. In addition, their parametric density model approach requires certain distributional assumptions that we do not make, as our approach is nonparametric. Note that the tree structure in [Bishop and Tipping, 1998], and the tree structure our Iterative Denoising work is based on [Priebe et al., 2004a], are extensions of the more general method of unsupervised recursive partitioning [Hastie et al., 2001], [Everitt, 2005], [Duda et al., 2001], and [Ripley, 2005].
Our Iterative Denoising framework fits best in this quadrant, and is meant to provide high Intuition and Generality. We do not utilize global filters or templates, and instead emphasize unsupervised classification. We also do not make perhaps unwarranted distributional assumptions about our data. We want our method to be as data-domain agnostic as possible, and so we emphasize non-parametric approaches. Finally, we concentrate on helping the user understand their data by providing an