Studying and Developing a CDN Caching Algorithm Using Machine Learning



VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Major: Computer Science

Council: Computer Science

Instructor: Assoc. Prof. Thoai Nam

Reviewer: Dr. Nguyen Le Duy Lai

—o0o—

Student 1: Tran Trung Quan - 1752044

Student 2: Pham Trong Nhan - 1752394


We would like to express our deep and sincere gratitude to our advisor, Assoc. Prof. Thoai Nam, for his guidance, advice, enthusiasm, and encouragement in helping us research and implement this study.

We would like to extend our thanks to two postgraduate students from the High-Performance Computing Lab, Mr. Tran Ngoc Anh Tu and Mr. La Hoang Loc, for their assistance in directing and implementing this research.

We would like to thank the lecturers in the Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, for their enthusiastic transfer of knowledge during the years we studied at the university.

We are mindful that this project is still incomplete and involves inevitable mistakes. To improve it further, we would love to receive feedback from the lecturers.

Finally, we wish you health, prosperity, and success on your chosen paths.


Content Delivery Networks are not a new topic, but they remain a challenge due to the growth of digital content and network infrastructure. Vietnam currently does not have much in-depth research on this subject. Much of the knowledge involved in this study is not part of the university-level program, but we declare that this is our own research under the guidance of Assoc. Prof. Thoai Nam. The research content and results are legitimate and have never been published before. The data used for analysis and feedback has been collected by us from several different sources and is indicated in the reference section.

Besides, we have also used some reviews, evaluations, and figures from other authors and organizations. All have citations and annotations.

We are entirely responsible for the content of our research. The Ho Chi Minh City University of Technology is not involved in any copyright infringement caused by our research.


1.1 Overview

1.2 Goal

1.3 Scope

1.4 Thesis Structure

2 Knowledge Base

2.1 Content Delivery Network

2.1.1 Introduction

2.1.2 CDN Taxonomy

2.1.3 Content Placement

2.1.4 Request Routing

2.1.5 Open Issues in CDNs

2.2 A Light-weight Content Distribution Scheme for Cooperative Caching in Telco-CDNs [19]

2.2.1 Approach

2.2.2 Model of Light-weight Content Distribution for Cooperative caching

2.2.3 Color-based caching scheme’s Evaluation

2.3 Color-based Cooperative Cache and its Routing Scheme for Telco-CDNs [24]

2.3.1 Approach

2.3.2 Color Tag Based Cooperative Caching, Color Tags Management Algorithm and Routing Algorithm

2.3.3 Research Result

2.4 Emulation of the color-based caching scheme in Telco-CDNs with Mininet using real data [29]

2.4.1 Analyze CDN real log from SBD Inc - Workload types

2.4.2 Trace-based system analysis

2.4.3 Development CDN Emulation Tool

2.5 Simulated Annealing [32]

2.5.1 Local Search

2.5.2 Basic Simulated Annealing

2.5.3 Mathematical Modeling

2.5.4 Finding minimal traffic time of color-based cooperative caching in Telco-CDNs by using Simulated Annealing algorithm

2.6 Bayesian Optimization

2.6.1 Overview

2.6.2 Introduction to Bayesian Optimization

2.6.3 Model of the function: Gaussian Processes


2.6.4 Choice of kernels

2.6.5 Acquisition functions

3 Proposed Solution

3.1 Analysis: Current Problem with Normal Separator Rank Algorithm

3.1.1 Problems

3.2 Proposed Solution

3.2.1 Finding minimal traffic time of color-based cooperative caching in Telco-CDNs by using Bayesian Optimization

4 Evaluation

4.1 First Phase: Evaluation with Simulated Datasets

4.1.1 Setting

4.1.2 Experiment 1

4.1.7 Summary

4.2 Second Phase: Evaluation with Real Datasets

4.3 Conclusion

5 Summary

5.1 Achievements

5.1.1 Knowledge about CDN system and related problems

5.1.2 Using Bayesian Optimization to solve the optimization problems

5.1.3 Achievements in experiences, knowledge, and soft skills

5.2 Drawbacks

5.3 Future Improvements


List of Tables

2.1 Example of a record in system log

3.1 Sampling set of different increased rank number

3.2 Number of separator ranks in different sampling sets

4.1 Setting for evaluating with Simulated Datasets

4.2 Initial sampling separator ranks


List of Figures

2.1 CDN Composition

2.2 CDN Composition - Relationship

2.3 CDN Composition - Interaction Protocols

2.4 Content Distribution and Management

2.5 Surrogate placement

2.6 Content management subsystem

2.7 End-user to CDN interaction

2.8 Utilities for modeling objectives and constraints

2.9 Taxonomy of request-routing mechanisms

2.10 Example of contents cached in three servers according to their color tags and popularities [19]

2.11 Proposed LFU-LRU Hybrid Cache Architecture [19]

2.12 Request handling algorithm with proposed hybrid caching scheme [19]

2.13 Server colorization algorithm [19]

2.14 Popularity class and corresponding tags with four colors [19]

2.15 Unidirectional ring topology with 8 nodes with their corresponding colorations [19]

2.16 2D-mesh topology with two different colorations [19]

2.17 NTT mesh topology in Japan and its coloration [19]

2.18 Simulation parameters [19]

2.19 The number of contents in each popularity class [19]

2.20 Configurations of the computing host for GA [19]

2.21 Comparison of convergence [19]

2.22 Normalized traffic on a ring-based network with 8 nodes [19]

2.23 Normalized traffic on a 2D-mesh network with 25 nodes [19]

2.24 Normalized traffic on the NTT network with 55 nodes [19]

2.25 Cache hit ratio with a change in the popularity [19]

2.26 Separator ranks for each gamma parameter when 1000 contents are classified into five popularity classes [24]

2.27 Iterative calculation for separator ranks algorithm [24]

2.28 Color-based routing that finds the nearest cache server with the requesting color-tag [24]

2.29 Structure of the color-tag based cache server [24]

2.30 Color based routing algorithm [24]

2.31 NTT-like mesh topology and its coloration with four colors [24]

2.32 Traffic reduction under different routing and caching strategies [24]

2.33 Video Streaming workflow [29]

2.34 The proportion of number of requests and content size for Multipurpose Internet


2.35 Upper: General hit rate for 3 services, Lower: Sum latency for 3 services [29]

2.36 Components of the tool [29]

2.37 The tool’s workflow [29]

2.38 Request-response flow [29]

2.39 Number of content was requested in each interval in CDN

2.40 T min comparison between Simulated Annealing and Original algorithms

2.41 T est num run comparison between Simulated Annealing and Original algorithms

3.1 Test function of original algorithm with increased rank = 1

3.2 Test function of original algorithm with separator rank (Increased rank = 1)

3.3 Test function of original algorithm with increased rank = 4

4.1 The topology of France CDN network

4.2 Evaluation Bayesian Optimization (BO) method with original method (Sampling set: "all-rank", Increased rank = 1)

4.3 Evaluation different initial sampling points of Bayesian Optimization (Sampling set: "all-rank", Increased rank = 1)

4.4 Compare three acquisition functions (Sampling set: "all-rank", Increased rank = 1, Initial sampling points: 200)

4.5 Compare two sampling sets "all-rank" and "only-sorted" (Increased rank = 1, Initial sampling points: 200)

4.6 Compare different initial sampling points of "only-sorted" sampling set (Increased rank = 1)

4.7 Compare increased rank = 1 and increased rank = 4 (Sampling set: "all-rank", Initial sampling points: 200, Acquisition function: MPI)

4.8 Compare increased rank = 1 and increased rank = 4 (Sampling set: "only-sorted", Initial sampling points: 200, Acquisition function: MPI)

4.9 Evaluation minimum estimate traffic time through 24 intervals between Bayesian Optimization (BO) and Original method (Acquisition function: MPI, Search space: "all-rank", Initial sampling point = 200, Increase rank = 1)

4.10 Evaluation number of executing Test function between Bayesian Optimization (BO) and Original method through 24 intervals (Acquisition function: MPI, Search space: "all-rank", Initial sampling point = 200, Increase rank = 1)

4.11 Compare the difference percentage with the original method through 24 intervals of different sampling sets (Initial sampling points: 200, Acquisition function: MPI, Increase rank = 1)


Content Delivery Networks (CDNs) have emerged to overcome Internet congestion and overload by offering infrastructure and mechanisms to deliver content and services in a scalable manner. CDN applications can be found in many industries, such as research institutions, media advertising, data centers, Internet Service Providers (ISPs), e-commerce, network carriers, and other carrier businesses.

Although CDNs could reduce video traffic, their servers are usually located in different network locations. Even though several CDN providers place their cache servers on ISP networks, this method still does not reduce traffic considerably [2]. The reason is that the servers are only in limited locations and CDN providers have no global knowledge of the underlying network. Several ISPs are planning to build their own CDNs by placing cache servers on their networks, which are called Telco-CDNs [2], [3]. They reduce the traffic on the peering links as well as internal communication links by confining the video requests to their networks. However, because of the limited storage space and inefficient allocation of contents, the traffic reduction achieved is not sufficient. Moreover, the objective of the Telco-CDN is to improve the efficiency of both the overlay and the underlying network infrastructure, which is different from that of a conventional CDN.

To address this problem, recent studies aim to minimize traffic by increasing cache capacity. They use a cooperative caching strategy, which pools the storage of multiple caches by sharing cache server contents. Li et al. [2] proposed a content allocation algorithm based on a Genetic Algorithm (GA) that always gives a sub-optimal solution in terms of network traffic. Although such strategies could eliminate a large amount of traffic, the time it takes to compute the solution on a cluster is especially long. Such a long calculation leads to cache allocation inconsistencies, as access patterns vary by 20-60% per hour [4].

To overcome the limitations of the GA, Nakajima et al. [19], [24] proposed a light-weight color-based caching scheme that uses cooperative caching [2], [5], hybrid caching, and the duplication of popular content by grouping caches and servers using a novel color tag scheme. These color tags are efficiently distributed to the caches and servers through a lightweight color distribution scheme. Even though the scheme reduces the computation time while keeping an approximate solution compared with the GA in terms of network traffic, its computation time to find the colors' separator ranks is still long when the number of content categories increases. The color distribution scheme is nearly equivalent to a brute-force strategy. Therefore, some calculations are not necessary.

In this research, we propose a solution based on Bayesian Optimization to lower the calculation time of determining the colors' separator ranks. We then evaluate it on simulated and real datasets to determine the best-fitting parameters of the Bayesian Optimization approach for our situation, based on a comparison with the original separator rank finding algorithm.

1.2 Goal

The main objective of this research is to improve the computation time of the color separator rank finding algorithm using Bayesian Optimization and to evaluate it against the previous research and other solutions. To achieve this goal, we plan to carry out the following tasks:

• Study CDNs and their related properties and problems.

• Study the light-weight color-based caching scheme using cooperative caching and hybrid caching.

• Study Bayesian Optimization.

• Apply the Bayesian Optimization solution to solve the computation time problem.

• Evaluate our solution against previous research and other algorithms.

1.3 Scope

Although the color-based cooperative caching scheme is a new scheme for content distribution and delivery in the CDN system, it still has many shortcomings that need improvement. In our thesis, we concentrate only on re-implementing the content's color distribution algorithm so that it computes faster than the previous one. Additionally, we only expect the new method to give an approximate solution compared to the old one.

In general, we examine methods that require less computing time while remaining within a certain range of acceptable error in comparison to the original answer.

1.4 Thesis Structure

In Chapter 2, we study CDNs, color-based caching schemes, and the theoretical background required for this research. In Chapter 3, we discuss the details of our proposed Bayesian Optimization approach. In Chapter 4, we evaluate our solution to find the best setting for the Bayesian Optimization method and compare it with the original solution. In the final chapter, we conclude and outline the plan for the next stage.


Dealing with high traffic demands puts enormous pressure on a web server. As traffic increases, web servers gradually become completely overloaded, and the websites they host become temporarily inaccessible.

A Content Delivery Network is a collaborative collection of Internet network components that replicates content across a variety of Web servers in order to distribute content to end-users transparently and efficiently. CDNs provide network capacity services by optimizing bandwidth, enhancing accessibility, and ensuring consistency through content replication. The typical functionalities of a CDN include:

• Request redirection and content delivery services, which send a request to the nearest relevant CDN cache server using congestion bypass mechanisms.

• Content outsourcing and distribution services, which replicate and cache content from the origin server to distributed Web servers.

• Content negotiation services, which meet the specific needs of each individual user (or group of users).

• Management services, including network management, accounting management, and content utilization monitoring and reporting.

A CDN distributes content across the globe to a network of Web servers to deliver content to end-users in a secure and timely manner. Content is replicated either on demand, as users request it, or beforehand, by selecting the contents of the distributed Web servers in advance. The client is served from a nearby mirrored Web server. As a result, the user unknowingly ends up connecting to a nearby mirrored CDN server and retrieves files from that server.

Taxonomy of CDNs based on four key issues:

• CDN Composition


• Content distribution and management

Figure 2.1: CDN Composition

CDN Organization: there are two general approaches to building a CDN:

• In the overlay approach, application-specific servers and caches at various locations on the network control the delivery of specific content types (e.g., Web content, streaming media, and real-time video). Apart from offering basic network access and assured QoS for specific requests/traffic, essential network elements such as routers and switches do not play an active role in the distribution of content. CDN services replicate content on cache servers all over the world. As content requests are sent from end-users, they are forwarded to the closest CDN node.

• In the network approach, code is deployed in the network elements, including routers and switches, to identify particular application types and to forward requests based on predetermined policies. Examples of this method include systems that route content requests to local caches or shift traffic to specialized servers that are optimized to serve specific content types.

Servers: the servers used by a CDN are of two types, origin and replica servers:

• The origin server is where the definitive version of the content resides. It is updated by the content provider.

• A replica server stores a copy of the content but may act as an authoritative reference for client responses. The origin server communicates with the distributed replica servers to update the content stored in them. A cache server keeps copies (i.e., caches) of data at the edge of the network so that content requests do not need to reach the origin server.

Relationships: a CDN includes components such as clients, surrogates, origin servers, proxy caches, and other network elements. These components interact to replicate and cache data within a CDN. Replication involves making and storing copies of the content on multiple computers. Usually, it requires "pushing" content from the origin server to the replica servers.

Figure 2.2: CDN Composition - Relationship

The graphical representations of those relationships are:

1. Client-to-surrogate-to-origin server

2. Network element-to-caching proxy

3. Caching proxy arrays

4. Caching proxy meshes

Interaction Protocols: based on the interaction relationships mentioned above, we can identify the interaction protocols that are used for communication between CDN components. Such interactions can be broadly divided into two types: interaction between network elements and interaction between caches.

Figure 2.3: CDN Composition - Interaction Protocols.

Content/Service Types: CDN providers host third-party content for fast delivery of any digital content, including static content, dynamic content, streaming media, and different content services. There are three main types to consider:

• Static content refers to content that changes infrequently. It does not change based on the request of the customer. This includes static HTML pages, embedded images, executables, PDF documents, and audio and/or video files.

• Dynamic content refers to content that is personalized for the user or generated on demand by the execution of an application process. It varies based on the user's requests.

• Streaming media can be live or on-demand. Live content is transmitted "immediately" from the encoder to the media server, and then to the media device. Streaming servers use advanced protocols for service distribution over the IP network.

Content Distribution and Management

Figure 2.4: Content Distribution and Management

Content management in a CDN is strategically critical for effective content distribution and optimal efficiency. Content distribution requires:

• Selection and delivery of content based on the type and frequency of individual user requests.

• Placement of surrogates at geographical locations such that the edge servers are customer-centered.

• Content outsourcing, to determine which outsourcing approach to follow.

• Content management, which is primarily based on cache management techniques.

In this section, we focus on surrogate placement because it is an essential part of our color-based cooperative caching in CDNs.

Figure 2.5: Surrogate placement

The aim of ideal surrogate placement is to reduce the latency perceived by users when obtaining information and to decrease the total network bandwidth used for transmitting replicated content from servers to clients. Optimizing these metrics results in decreased maintenance and connectivity costs for the CDN provider.

For surrogate server placement, the CDN administrators also determine the optimal number of surrogate servers using a single-ISP or a multi-ISP approach [13]:

• In a single-ISP strategy, a CDN provider usually deploys at least 40 surrogate servers across the network edge to enable the distribution of content [14]. The strategy is to place one or two surrogates in each major city within the ISP's coverage. The ISP equips the surrogates with large caches. The downside of this strategy is that the surrogates may be placed far from the clients of the CDN provider.

• In the multi-ISP approach, the CDN provider places surrogate servers at as many ISP Points of Presence (POPs) worldwide as possible. Surrogates are placed close to customers and, as a result, content is delivered securely and on time from the ISP of the requesting customer. Apart from the expense and complexity of setup, the key downside of the multi-ISP strategy is that each surrogate server receives fewer (or no) content requests, which can result in idle resources and low CDN performance [15].

Various subsystems operate in a content distribution network, including content management and request routing [16]. The content management subsystem, as shown in Fig. 2.6, is responsible for choosing the content to be mirrored and the proxy server(s) that will host the replicated content in order to meet end-user quality of service (QoS) demands. The request router has a series of policies for routing end-user requests to either the load balancing server(s) or the QoS server(s).

The content management subsystem (CMS) is vital to QoS. In the context of CDNs, the CMS determines where to locate surrogate servers, what to replicate, and which surrogate servers should keep replicas of the content.


Figure 2.6: Content management subsystem.

In general terms, content placement algorithms are pull-based, push-based, or cooperative, depending on how surrogate servers get content from the origin server:

• Pull-based technique: caching is employed to increase content availability and reduce content access latency (reactive).

• Push-based technique: content is stored preemptively to meet an estimated demand (proactive).

• Cooperative technique: content placement algorithms retrieve missing content from the neighboring surrogates.

Figure 2.7: End-user to CDN interaction


Content placement differs depending on the direction of content flow between the origin server(s) and the surrogate servers:

• In push-based content placement, content providers estimate end-user requests or predict content access patterns and replicate content from the origin server(s) to surrogates prior to receiving end-user requests for content. Contingencies are in place to cater to unpredicted end-user requests and content access patterns.

• In pull-based content placement, end-user requests prompt the surrogates to fetch and store content from the origin server(s) or the adjacent surrogate server(s). Initially, all end-user requests result in a miss, since the surrogate needs to retrieve content from the origin or neighboring surrogate server(s). Gradually, as the repository on the surrogate server grows, end-user requests result in hits and are fulfilled immediately by the surrogate server. With cooperative schemes, surrogates can cooperate with each other to download content from a surrogate that is closer to the origin server; with non-cooperative schemes, they download directly from the origin server.
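The difference between the two placement directions can be illustrated with a minimal sketch. The `Surrogate` class, the content names, and the request sequence below are hypothetical and serve only to show the behavior described above: a pull-based surrogate starts with misses and then serves hits, while a surrogate that received pushed content serves hits from the first request.

```python
class Surrogate:
    """Toy surrogate server backed by an origin (illustrative assumption)."""

    def __init__(self, origin):
        self.origin = origin  # dict: content id -> content body
        self.store = {}       # local replica store, initially empty
        self.hits = 0
        self.misses = 0

    def push(self, content_ids):
        # Push-based placement: replicate content before any request arrives.
        for cid in content_ids:
            self.store[cid] = self.origin[cid]

    def get(self, content_id):
        if content_id in self.store:
            self.hits += 1
        else:
            # Pull-based placement: a miss triggers a fetch from the origin.
            self.misses += 1
            self.store[content_id] = self.origin[content_id]
        return self.store[content_id]


origin = {"a.mp4": b"video-a", "b.mp4": b"video-b"}
requests = ["a.mp4", "b.mp4", "a.mp4", "a.mp4", "b.mp4"]

pulled = Surrogate(origin)
for cid in requests:
    pulled.get(cid)
print(pulled.hits, pulled.misses)  # 3 2

pushed = Surrogate(origin)
pushed.push(["a.mp4", "b.mp4"])
for cid in requests:
    pushed.get(cid)
print(pushed.hits, pushed.misses)  # 5 0
```

The sketch ignores cache capacity and cooperation between surrogates; it only contrasts when the content reaches the surrogate.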

The efficiency of pull-based and push-based content placement relies significantly on the accuracy of the models used to predict and estimate end-user requests or content access patterns. However, cooperative push-based content placement algorithms yield higher performance than other content placement algorithms [17].

Content placement algorithms are sensitive to the content access patterns of correlated and referred videos, since they can significantly impact the popularity of content. Content popularity follows heavy-tail distributions, such as those parameterized by power law and exponential functions:

• Although the pull-based approach is simple and effective, it does not give a high hit ratio, as the data in the long tail generate misses.

• The cache hit rate also depends on the cache size and the eviction probability of the content.
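The effect of a heavy-tailed popularity on the hit ratio can be made concrete with a small calculation. The sketch below assumes a Zipf (power-law) popularity in which the content of rank r is requested with weight r^(-alpha); the catalog size of 1000, the exponent 0.8, and the cache sizes are hypothetical values chosen for illustration, not parameters from this thesis.

```python
def zipf_weights(n_contents, alpha):
    """Unnormalized Zipf popularity: rank r has weight r ** -alpha."""
    return [rank ** -alpha for rank in range(1, n_contents + 1)]


def top_k_hit_ratio(n_contents, alpha, cache_size):
    """Expected hit ratio when the cache holds the cache_size most popular contents."""
    weights = zipf_weights(n_contents, alpha)
    return sum(weights[:cache_size]) / sum(weights)


# Hit ratio for three hypothetical cache sizes over a 1000-content catalog.
ratios = {size: top_k_hit_ratio(1000, 0.8, size) for size in (10, 100, 500)}
for size, ratio in ratios.items():
    print(size, round(ratio, 3))
```

Under this assumption, the hit ratio grows with the cache size but with diminishing returns, because each additional cached content comes from further down the long tail.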

Figure 2.8: Utilities for modeling objectives and constraints


The goal is to decrease latency, i.e., to reduce traffic between the surrogates and the origin server(s). The latency reduction problem comprises, but is not limited to, objectives that reduce distance and/or traffic between end-users and surrogates. When distance or traffic between surrogates and origin servers is reduced, more content is served directly to end-users, which indirectly improves the latency perceived by the end-user. A distance metric is typically used to approximate the perceived latency of the end-user. However, network conditions, such as traffic volume and round-trip time (RTT), are also used for this purpose.

The goal of decreasing operational costs applies to computing costs, network connectivity costs, and transmission costs. The cost of bandwidth usage depends on the distribution of content to end-users, the retrieval of content from origin servers or other surrogates to accommodate end-user demands, the downloading of content, the replication of content, and/or the maintenance of consistent copies of content.

These objectives are conflicting but not mutually exclusive. For instance, the cost of network bandwidth is proportional to the network traffic needed to access or retrieve information. In content placement (CP) models, however, bandwidth usage must be taken into account to prevent CDN traffic from congesting the network backbone.

The underlying network status is rarely taken into account by CP models and algorithms. However, gaining fault tolerance should be a primary aim. In this case, the CP model must ensure content compatibility within QoS in a "lossy" network, where backups and capacity might not be available due to connection failures.

Request-routing mechanisms inform the client about the selection of replica servers created by the request-routing algorithms. Based on this information, the replica server, the origin server, and the client communicate to choose the best route, which optimizes the cost of the communication. Request-routing systems can be categorized according to a variety of criteria.

Figure 2.9: Taxonomy of request-routing mechanisms

Fig. 2.9 shows a classification of request-routing mechanisms. Three major mechanisms are widely used in real applications: DNS-based request routing, HTTP redirection, and URL rewriting.

DNS-based request routing

The DNS-based request routing system allows content delivery providers to map the symbolic name of a surrogate host to its numerical IP address using a modified DNS server. It is used for the selection and delivery of full-site content. In DNS-based request routing, a domain name has many IP addresses associated with it. When the end-user issues a content request, the DNS server of the service provider returns the IP addresses of the servers containing a replica of the requested object. The client's DNS resolver selects one of these servers. To make a decision, the resolver can issue probes to the servers and choose based on the response times to these probes. It can also gather historical information from clients on the basis of past access to these servers.

The benefit of this strategy is transparency, since the services are referred to by their DNS names and not their IP addresses. The DNS-based solution is highly common due to its simplicity and its independence from any actual replicated service. The downside of DNS-based request routing is that it increases network latency due to extended DNS lookup times.
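The probe-based choice made by the resolver can be sketched as follows. This is a minimal illustration under stated assumptions, not a real resolver: the function name `pick_replica`, the documentation-range IP addresses, and the probe round-trip times are all hypothetical.

```python
def pick_replica(probe_rtts_ms):
    """Return the IP address whose probe had the lowest round-trip time."""
    if not probe_rtts_ms:
        raise ValueError("the DNS answer contained no replica addresses")
    return min(probe_rtts_ms, key=probe_rtts_ms.get)


# Hypothetical probe results (in ms) for the addresses returned for one domain name.
probes = {"203.0.113.10": 48.0, "198.51.100.7": 12.5, "192.0.2.33": 30.1}
best = pick_replica(probes)
print(best)  # 198.51.100.7
```

A resolver that relies on historical information instead of live probes could use the same selection step with averaged past response times as input.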

HTTP Redirection

HTTP redirection propagates the replica server selection information in the HTTP headers. The HTTP protocol allows a Web server to respond to a client request with a special message that tells the client to re-submit the request to another server. HTTP redirection can be used for both full-site and partial-site content selection and delivery. Flexibility and simplicity are the key benefits of this approach. The most important drawback of HTTP redirection is the lack of transparency. In addition, the overhead of this method is significant, as it adds an extra round trip to the handling of each request as well as HTTP processing overhead.
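The "special message" is an HTTP redirect response carrying a Location header. The sketch below builds such a response as plain text, assuming a 302 status; the helper name `build_redirect`, the path, and the replica host name are hypothetical.

```python
from http import HTTPStatus


def build_redirect(path, replica_host):
    """Build a minimal HTTP/1.1 302 response pointing the client at a replica."""
    status = HTTPStatus.FOUND  # 302
    lines = [
        f"HTTP/1.1 {status.value} {status.phrase}",
        f"Location: http://{replica_host}{path}",
        "Content-Length: 0",
        "",  # blank line that ends the header section
        "",
    ]
    return "\r\n".join(lines)


response = build_redirect("/videos/intro.mp4", "replica1.example.net")
print(response.splitlines()[0])  # HTTP/1.1 302 Found
```

The client must then issue a second request to the address in the Location header, which is the extra round trip mentioned above.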

URL Rewriting

In the URL rewriting approach, the origin server redirects clients to various surrogate servers by rewriting the dynamically generated URLs of the pages. It is primarily used for the selection and delivery of partial-site content, where embedded objects are sent in response to client requests. URL rewriting can be either proactive or reactive. In the proactive method, the URLs for embedded objects on the main HTML page are formulated before the content is loaded onto the origin server. The reactive method rewrites the embedded URLs of the HTML page when the client request reaches the origin server.

The key benefit of URL rewriting is that clients are not tied to a single surrogate server, since the rewritten URLs contain DNS names that point to a group of surrogate servers. The drawbacks of this method are the delay in URL parsing and the potential bottleneck introduced by the in-path element. Another downside is that content whose references are rewritten to point to a nearby surrogate server rather than to the origin server is non-cacheable.
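The reactive variant can be sketched as a text transformation applied to the HTML page before it is returned. The example below is a simplified illustration that only handles double-quoted `src` and `href` attributes; the function name and all host names are hypothetical.

```python
import re


def rewrite_embedded_urls(html, origin_host, surrogate_host):
    """Point src/href attributes that reference origin_host at surrogate_host."""
    pattern = re.compile(
        r'(?P<attr>\b(?:src|href)=")https?://'
        + re.escape(origin_host)
        + r'(?P<path>/[^"]*)"'
    )
    return pattern.sub(
        lambda m: f'{m.group("attr")}http://{surrogate_host}{m.group("path")}"',
        html,
    )


page = (
    '<img src="http://origin.example.com/img/logo.png">'
    '<a href="http://other.example.org/">x</a>'
)
rewritten = rewrite_embedded_urls(page, "origin.example.com", "cdn.example.net")
print(rewritten)
```

Only the reference to the origin host is rewritten; the link to the unrelated host is left untouched. A production system would use an HTML parser rather than a regular expression.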

With the growth of CDNs, there are several challenges that need to be discussed. Several service providers are steadily improving their CDN systems to increase the effectiveness of content delivery.

Akamai Intelligent Platform

Akamai owns one of the largest CDNs, carrying 20% of Internet traffic nowadays. In such large delivery platforms, many technological challenges arise. First, defending websites from distributed Denial-of-Service attacks [8] motivates caching and filtering techniques that can handle large amounts of requests. Second, Akamai uses the principle of a cloud or elastic CDN, where storage capabilities are modified dynamically to satisfy the demand [9].

Google

The Google Global Cache (GGC) system comprises caches installed at ISP premises. GGC is targeted at serving YouTube content locally and reducing bandwidth costs [10]. GGC's importance inspired the analysis of the relationship between service providers and network operators, and particularly the design of pricing models for renting in-network caching capacity at operator sites. Security is another related challenge; previous work has suggested schemes for the confidentiality of content [11] that allow straightforward caching of encrypted flows, but caching encrypted content remains one of the challenging open topics.

Netflix Open Connect

Similarly to GGC, the Netflix CDN is partially deployed within ISPs [12], but its content catalog is much smaller and the file popularity is more predictable. Netflix was thus highly innovative in researching space-time request profiles, strategies for pre-loading caches overnight, and reducing daytime traffic. An open research challenge in this sense is to detect popularity shifts in advance and to decide online whether video files should be cached or not.

Amazon AWS

AWS provides Amazon CloudFront, a virtual CDN that delivers CDN facilities using the cloud storage platform. Amazon enables one to rent caching services dynamically per hour by adjusting the storage capacity. Along with other related technologies suggested by Akamai and others, this cloud or elastic cache architecture motivates further research into the new business paradigm and into dynamic cache location and dimensioning.

2.2 A Light-weight Content Distribution Scheme for Cooperative Caching in Telco-CDNs [19]

This circumstance leads to the concept of placing CDNs inside ISPs; several ISPs have considered building Telco-CDNs, or ISP-operated CDNs, which are CDNs managed by ISPs rather than by global CDN providers. In Telco-CDNs, cache servers are located directly in the ISPs' backbone networks, and the ISPs manage these cache servers with complete knowledge of the network's properties. Therefore, in such a scenario, ISPs can apply many advanced techniques, such as grouping several cache servers to combine capacity and improve the availability of cached contents in the network.

Nakajima et al. proposed a color-based caching algorithm for Telco-CDNs that merges cooperative caching and hybrid caching. Cooperative caching is a method in which several cache servers are aggregated to expand capacity. To utilize the cache servers effectively, two processes must be optimized: content distribution, which aims to maximize the effective disk capacity, and content duplication, which replicates popular contents on many servers. As a result, the hit rate of the cache servers improves.

The researchers proposed a light-weight approach to content distribution that colorizes both cache servers and contents, so that contents can be distributed with minimal computation. Each cache server in the network is assigned a specific color and caches a content only when the content's color tag matches its own color, thus minimizing redundant copies in the network. In addition, popular contents are given several colors to maximize hit rates at each server. These tags are updated continually as content popularity changes.

Figure 2.10: Example of contents cached in three servers according to their color tags and popularities [19]
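The matching rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the bitmask encoding (color i stored as bit i of an integer tag), the value of `NUM_COLORS`, and the function names are hypothetical and are not taken from [19].

```python
NUM_COLORS = 4  # assumed number of colors; color i is stored as bit i of a tag


def should_cache(server_color: int, content_tag: int) -> bool:
    """A server stores a content only if its own color bit is set in the tag."""
    return bool(content_tag & (1 << server_color))


def replica_count(content_tag: int) -> int:
    """Number of colors (1 bits) in the tag, i.e. how many color groups hold it."""
    return bin(content_tag).count("1")


servers = list(range(NUM_COLORS))   # one hypothetical server per color
popular_tag = 0b1111                # very popular content: tagged with every color
niche_tag = 0b0100                  # unpopular content: tagged with a single color

holders_popular = [s for s in servers if should_cache(s, popular_tag)]
holders_niche = [s for s in servers if should_cache(s, niche_tag)]
```

With this encoding, adding a color to a popular content is a single bit operation, and no server needs to know which other servers hold the content.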

Cooperative LFU-LRU Hybrid Caching Strategy

Although the color tags of the contents are updated regularly, colored caches alone cannot track rapid variations in content popularity. To follow such rapid shifts, each cache server divides its storage into two areas, one managed with LFU and one with LRU. The LFU area caches the contents whose color tags match the server's color, which provides the cooperative placement, while the LRU area caches recently accessed contents regardless of color so that sudden increases in popularity are absorbed. Since cache servers manage the LFU area with color tags that are calculated from globally gathered access logs, the placement of contents across the network improves hit ratios and reduces traffic.

Figure 2.11: Proposed LFU-LRU Hybrid Cache Architecture [19]

Figure 2.12: Request handling algorithm with proposed hybrid caching scheme [19]
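A minimal sketch of how a single server might handle requests with the two areas is given below. It is not the algorithm of Figure 2.12, whose details are not reproduced in this text; the admission rule (color-matching contents go to the LFU area, everything else to the LRU area), the class name `HybridCache`, and its parameters are assumptions made for illustration.

```python
from collections import OrderedDict


class HybridCache:
    """Toy model of one cache server with a colored LFU area and a small LRU area."""

    def __init__(self, server_color, lfu_capacity, lru_capacity):
        self.color_bit = 1 << server_color  # assumed bitmask encoding of colors
        self.lfu_capacity = lfu_capacity
        self.lru_capacity = lru_capacity
        self.lfu = {}              # content_id -> access count
        self.lru = OrderedDict()   # content_id -> None, least recently used first

    def request(self, content_id, content_tag):
        """Return True on a hit; on a miss, admit the content and return False."""
        if content_id in self.lfu:
            self.lfu[content_id] += 1
            return True
        if content_id in self.lru:
            self.lru.move_to_end(content_id)
            return True
        if content_tag & self.color_bit:
            self._admit_lfu(content_id)
        else:
            self._admit_lru(content_id)
        return False

    def _admit_lfu(self, content_id):
        if self.lfu_capacity == 0:
            return
        if len(self.lfu) >= self.lfu_capacity:
            victim = min(self.lfu, key=self.lfu.get)  # least frequently used
            del self.lfu[victim]
        self.lfu[content_id] = 1

    def _admit_lru(self, content_id):
        if self.lru_capacity == 0:
            return
        if len(self.lru) >= self.lru_capacity:
            self.lru.popitem(last=False)  # evict the least recently used entry
        self.lru[content_id] = None


cache = HybridCache(server_color=0, lfu_capacity=2, lru_capacity=1)
first = cache.request("a", 0b0001)    # miss, admitted to the colored LFU area
second = cache.request("a", 0b0001)   # hit in the LFU area
other = cache.request("x", 0b0010)    # miss, color differs, admitted to the LRU area
```

The small LRU area lets a server hold a suddenly popular content even when its color tag has not yet been recomputed by the origin server.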

Server Colorization

Using the Welsh-Powell algorithm, a well-known algorithm for the four-color problem, the researchers introduced a colorization algorithm for cache servers that uses exactly N colors and distributes all available colors equally across the network.

Figure 2.13: Server colorization algorithm [19]
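For reference, the plain Welsh-Powell heuristic can be sketched as follows. This is the textbook version, not the modified algorithm of Figure 2.13: it does not force exactly N colors and does not balance the colors. The adjacency-dictionary input format and the 8-node ring example are assumptions chosen for illustration.

```python
def welsh_powell(adjacency):
    """Greedy coloring: adjacency maps each node to the set of its neighbors."""
    # Visit nodes in order of decreasing degree.
    order = sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)
    colors = {}
    color = 0
    for node in order:
        if node in colors:
            continue
        colors[node] = color
        # Give the same color to every later node not adjacent to this color.
        for other in order:
            if other in colors:
                continue
            if all(colors.get(nb) != color for nb in adjacency[other]):
                colors[other] = color
        color += 1
    return colors


# Hypothetical 8-node ring, similar in shape to the ring topology evaluated later.
ring = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
coloring = welsh_powell(ring)
```

On this ring the heuristic needs only two colors, which shows why a modification is required when the scheme wants all N colors to appear equally often.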

Content Colorization

The origin server regularly collects access logs from the cache servers and colorizes the contents to assign them to popularity groups. First, the origin server sorts the contents by their most recent popularity and then colorizes them, starting with the most popular. New contents are initially tagged as low-popularity until the next periodic update; they can still be cached in the LRU area, which avoids lowering the hit ratio. The tags used for the popularity classes, and the number of 1 bits in each, are shown in Figure 2.14. The content colorization algorithm is presented in the follow-up research (Section 2.3).

Figure 2.14: Popularity class and corresponding tags with four colors [19]

The model is evaluated on a unidirectional ring, a 2D-mesh, and the mesh network of the NTT backbone in Japan. Figures 2.15, 2.16 and 2.17 show the adopted ring, 2D-mesh and NTT topologies and their colorings, respectively. Moreover, every node in all three networks has a client that generates content accesses.

Figure 2.15: Unidirectional ring topology with 8 nodes and their corresponding colorations [19]

Figure 2.16: 2D-mesh topology with two different colorations [19]

Figure 2.17: NTT mesh topology in Japan and its coloration [19]

The popularity of the contents is defined by the Gamma distribution. The total number of contents, the cache capacity, and the gamma parameters are shown in Figure 2.18.

Figure 2.18: Simulation parameters [19]
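One way to turn a Gamma distribution into a request workload is sketched below. The exact mapping used in [19] is not given in this text, so the sketch assumes that the popularity weight of the content at rank r is proportional to the Gamma density evaluated at r; the parameter values `k=0.5` and `theta=300.0` and all function names are hypothetical.

```python
import math
import random


def gamma_pdf(x, k, theta):
    """Density of the Gamma distribution with shape k and scale theta."""
    return x ** (k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)


def popularity_weights(num_contents, k, theta):
    """Assumed mapping: weight of rank r is proportional to gamma_pdf(r)."""
    raw = [gamma_pdf(rank, k, theta) for rank in range(1, num_contents + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def generate_requests(weights, num_requests, seed=0):
    """Draw content ranks (1-based) according to the popularity weights."""
    rng = random.Random(seed)
    ranks = range(1, len(weights) + 1)
    return rng.choices(ranks, weights=weights, k=num_requests)


weights = popularity_weights(1000, k=0.5, theta=300.0)  # hypothetical parameters
requests = generate_requests(weights, 10_000)
```

With a shape parameter below 1 the weights decrease monotonically with rank, so a smaller k concentrates more requests on the top-ranked contents.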

Figure 2.19 indicates the number of contents in each popularity class used in the evaluations. For example, in the NTT topology, the five most popular contents carry tags with all four bits set and are cached on all servers. The next 20 most popular contents carry tags with three 1 bits and are cached on 75% of all servers.

Figure 2.19: The number of contents in each popularity class [19]
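The relation between popularity class and number of 1 bits can be sketched as follows. The first two separator ranks (5 and 25) follow the NTT example in the text; the remaining separators, the rotation of the starting bit to spread contents over the colors, and the function name `assign_tags` are assumptions made for illustration.

```python
def assign_tags(ranked_contents, separators, num_colors=4):
    """Tag contents by popularity class.

    separators[i] is the last rank (1-based) of class i, and class i
    receives num_colors - i one-bits, down to zero for the least popular.
    """
    tags = {}
    cls = 0
    for rank, content in enumerate(ranked_contents, start=1):
        while cls < len(separators) and rank > separators[cls]:
            cls += 1
        ones = max(num_colors - cls, 0)
        start = rank % num_colors  # assumed rotation to balance the colors
        mask = 0
        for j in range(ones):
            mask |= 1 << ((start + j) % num_colors)
        tags[content] = mask
    return tags


ranked = [f"c{r}" for r in range(1, 101)]       # contents sorted by popularity
tags = assign_tags(ranked, separators=[5, 25, 50, 80])
share_on_servers = bin(tags["c6"]).count("1") / 4  # fraction of colors holding c6
```

A content with three of four bits set is stored by three of the four color groups, which corresponds to the 75% figure when colors are spread equally over the servers.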

The number of contents in each popularity class is obtained from the sub-optimal allocations computed by the Genetic Algorithm (GA), using the setup indicated in Figure 2.20. Popularity groups are set only after the normalized traffic generated by content accesses has converged (see the convergence comparison figure), which happens at generations 3000, 8000, and 10000 for the ring, 2D-mesh, and NTT topologies, respectively.

Figure 2.20: Configurations of the computing host for GA [19]

Figure 2.21: Comparison of convergence [19]

Traffic Reduction

The traffic reduction of the proposed color-based approach was compared against the no-cache, Perfect-LFU, and GA strategies. No-cache is simply a network without any cache server, included for comparison only. Figures 2.22, 2.23, and 2.24 show the decrease in traffic for the ring, the 2D-mesh, and the NTT topologies, respectively.

Figure 2.22: Normalized traffic on a ring-based network with 8 nodes [19]

Figure 2.23: Normalized traffic on a 2D-mesh network with 25 nodes [19]

Figure 2.24: Normalized traffic on the NTT network with 55 nodes [19]

Across these three figures, the decrease in traffic becomes more apparent as the network grows, because cache servers can eliminate more hops between clients and the root in large networks. In the 2D-mesh topology, coloration-1 achieves a significantly better reduction than coloration-2, as the likelihood of intermediate servers having different colors increases. This finding suggests that an effective server colorization method helps minimize traffic. In all topologies, the traffic reduction of the color-based scheme is close to the sub-optimal result obtained with GA and better than that of the other strategies.

Hit Ratio

To evaluate the hybrid caching technique, the researchers compared the hybrid cache with a cache that has only the colored area, without the LRU area. The hybrid cache splits its capacity into 90% for the colored area and 10% for the LRU area. To emulate a content update, five new contents with the highest popularity are added. The result is shown in Figure 2.25.

Figure 2.25: Cache hit ratio with a change in the popularity [19]

When the new contents are added, the hit ratio of the color-only cache drops by 13.9%, while the hybrid caching approach limits the degradation to just 2.3%. Also, before the new contents are inserted, the hybrid cache achieves a better traffic reduction, since the modified LRU area is able to cache mid-to-high popularity contents that do not match the server color.

2.3 Color-based Cooperative Cache and its Routing Scheme for Telco-CDNs [24]

Digital content, especially Video-on-Demand (VoD) services, has continued to play an important role over the years, and Nakajima et al., the authors of the research above, kept improving their model to cope with the growth in Internet traffic. In this study the researchers concentrate on two primary factors in traffic reduction: content distribution and duplication of popular contents. Content distribution increases the effective caching capacity by storing different contents on different cache servers, while duplication of popular contents boosts the hit rates of the servers. The researchers continue to develop their scheme, which groups contents and cache servers using the color tag scheme [19], and they improve the cooperative hybrid caching scheme so that it maintains hit rates when the access pattern changes rapidly. In addition, an effective routing scheme is proposed to further minimize traffic using the color information given by the color tag scheme. The proposed routing scheme forwards requests to the closest server whose color matches the requested content, and thus needs only a small additional routing table.

The results of the evaluation show that the colored cooperative caching scheme achieves a value similar to the sub-optimal result measured by the Genetic Algorithm. The suggested routing scheme also eliminates more than 30% of traffic relative to shortest-path routing, and the color distribution scheme can detect changes in access pattern bias within a few seconds of the expected time.

Color Tag Based Cooperative Caching

Nakajima et al. continue to improve their cooperative caching scheme with color tags, which is illustrated in Figure 2.10. To specify where contents are cached in the network, all cache servers and contents are first assigned color tags. Each cache server has a single color tag and stores a content when the server color matches any of the content's colors. Since cache servers store different contents depending on their colors, the effective caching capacity increases, which reduces traffic to the external network. The server colors are assigned in the manner of the four-color theorem, so that the colors are spread efficiently across the network.

The cooperative caching approach is complemented by the hybrid caching scheme proposed in the last research. Figure 2.11 illustrates the configuration of the hybrid caching scheme with its two caching regions. In summary, each cache server divides its storage into a large colored LFU area, which stores color-matching contents, and a small modified LRU area, which stores recently accessed contents independently of their color tags in order to follow rapid changes in access patterns. In practice, for example in VoD services, the VoD provider can set the fraction given to each area based on log analysis or heuristics to improve the hit rates.

Color Tags Management Algorithm

The contents' tags are updated regularly according to their popularity ranks. The origin server calculates the contents' popularities by fitting the collected access logs to a gamma distribution. The bias parameter k of the gamma distribution, together with the corresponding separator ranks (the last popularity index in each popularity class), determines the number of contents in each popularity class. For example, appropriate separator ranks for different k values, with 1000 contents classified into five popularity classes, are shown in Figure 2.26.

Figure 2.26: Separator ranks for each gamma parameter when 1000 contents are classified into five popularity classes [24]

The objective is to find the optimum separator ranks for a given access pattern, that is, the ranks which correspond to minimal traffic. The algorithm in Figure 2.27 is used to find this traffic-minimizing set of separator ranks. It searches by incrementally adjusting each separator rank until no better set of separator ranks can be found.

Figure 2.27: Iterative calculation for separator ranks algorithm [24]
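The iterative idea can be sketched as a simple coordinate-wise local search. This is not the algorithm of Figure 2.27 itself: the step size, the validity rule, and in particular the `traffic` cost function are assumptions. In the real scheme the cost would come from a traffic model of the network; here a toy quadratic function stands in for it.

```python
def is_valid(separators, num_contents):
    """Separator ranks must be non-decreasing and stay within [0, num_contents]."""
    bounds = [0] + list(separators) + [num_contents]
    return all(a <= b for a, b in zip(bounds, bounds[1:]))


def optimize_separators(separators, num_contents, traffic, step=1):
    """Move each separator up or down by `step` while the traffic estimate improves."""
    best = list(separators)
    best_cost = traffic(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for delta in (-step, step):
                candidate = list(best)
                candidate[i] += delta
                if not is_valid(candidate, num_contents):
                    continue
                cost = traffic(candidate)
                if cost < best_cost:
                    best, best_cost = candidate, cost
                    improved = True
    return best, best_cost


# Toy stand-in for a traffic model: distance to an arbitrary target set of ranks.
TARGET = [5, 25, 125, 400]


def toy_traffic(separators):
    return sum((s - t) ** 2 for s, t in zip(separators, TARGET))


best, cost = optimize_separators([10, 50, 100, 500], 1000, toy_traffic)
```

Because the search only accepts strict improvements over a finite set of rank combinations, it always terminates, although with a non-convex traffic model it may stop at a local optimum.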

Light-Weight Routing Algorithm Using Color Tags

The researchers also propose a routing algorithm that uses color information to help minimize traffic by forwarding requests to the closest color-matching server. The user sends a request for a content together with the content's color-tag information. When a cache server receives the request, it first reads the color information in the request URL and then finds the closest server that matches the color of the content. Cache servers therefore do not need a large routing table that maps content IDs to server IDs. Figure 2.28 shows the basic concept of color-based routing.

Figure 2.28: Color-based routing that finds the nearest cache server with the requested color tag [24]

Figure 2.29 shows a block diagram of a cache server that enables color-based routing based on the algorithm in Fig 2.30. The numbers in parentheses in Fig 2.29 correspond to those in the algorithm. Each cache server has request and response routing tables, an LFU/LRU hybrid cache area, a routing agent, and network interfaces to support the routing operation.

Figure 2.29: Structure of the color-tag based cache server [24]

Figure 2.30: Color based routing algorithm [24]

Because the routing algorithm needs only two small color-based routing tables, it does not incur a large routing overhead. In fact, the number of columns in the request routing table is at most the number of colors + 1. This is very small compared with a per-content table, because there are normally millions of contents.
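A request routing table of that size can be sketched as follows. This is an illustration, not the algorithm of Figure 2.30: the table layout (one entry per color plus one for the origin, each holding a hop distance and a next hop), the breadth-first search used to fill it, and the five-node example network are all assumptions.

```python
from collections import deque


def bfs_first_hops(graph, source):
    """Return {node: (distance, first_hop)} for every node reachable from source."""
    info = {source: (0, source)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        distance, hop = info[node]
        for neighbor in sorted(graph[node]):
            if neighbor not in info:
                first_hop = neighbor if node == source else hop
                info[neighbor] = (distance + 1, first_hop)
                queue.append(neighbor)
    return info


def build_request_table(graph, server_color, source, origin, num_colors):
    """One entry per color plus one for the origin: at most num_colors + 1 entries."""
    info = bfs_first_hops(graph, source)
    table = {"origin": info[origin]}
    for color in range(num_colors):
        matching = [n for n in info if server_color.get(n) == color]
        if matching:
            nearest = min(matching, key=lambda n: (info[n][0], n))
            table[color] = info[nearest]
    return table


def next_hop(table, content_tag, num_colors):
    """Forward toward the closest server whose color bit is set in the tag."""
    options = [table[c] for c in range(num_colors)
               if content_tag & (1 << c) and c in table]
    distance, hop = min(options) if options else table["origin"]
    return hop


# Hypothetical line network A - B - C - D - origin with two colors.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
         "D": {"C", "origin"}, "origin": {"D"}}
server_color = {"A": 0, "B": 1, "C": 0, "D": 1}
table_a = build_request_table(graph, server_color, "A", "origin", num_colors=2)
```

In this example server A serves a color-0 content itself, forwards a color-1 content to its neighbor B, and sends an untagged content toward the origin, all from a table with three entries.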

Evaluations are conducted using an NTT-like topology in Japan. Figure 2.31 shows the topology used in the evaluations. The server colors are set by the modified version of the Welsh-Powell algorithm so that the colors are distributed evenly among the servers. The evaluation assumes that content access requests are generated by clients connected to each node of the topology, and that the requests follow the Gamma distribution. The evaluation properties and the calculation host are the same as in the last research, shown in Fig 2.18 and 2.20 respectively. The traffic reduction and hybrid caching evaluations are also much the same as in the last research.

Figure 2.31: NTT-like mesh topology and its coloration with four colors [24]

Hybrid Caching Result

The Color-Based Routing (CBR) is compared against Shortest-Path Routing (SPR) based on Dijkstra's algorithm [25], the Hash-Based Routing (HBR) proposed in [26], and Nearest Replica Routing (NRR), which routes requests to the nearest server holding the cached content [27]. Figure 2.32 shows the normalized traffic under different routing and caching algorithms.

Figure 2.32: Traffic reduction under different routing and caching strategies [24]

The CBR (8-color) obtains the highest reduction, saving 31.9% of traffic relative to SPR (4-color) for the combined traffic of internal and external connections. The CBR schemes also achieve a greater reduction in traffic than the HBR schemes: 8-color CBR decreases traffic by 41.4% relative to HBR (mod4).

2.4 Emulation of the color-based caching scheme in Telco-CDNs with Mininet using real data [29]

The research is based on real log files from a large CDN solution vendor in Vietnam. The infrastructure in their work is dedicated to FPT Corporation, and it is also used as the base structure in our project.

Live streaming, Video on Demand (VoD), and web browser services are the key features of the OTT platform. Both the HLS and MPEG-DASH protocols, which are the most popular streaming formats, are used to deliver streaming video. CDN providers use both static content and dynamic packaging mechanisms, depending on the needs of their customers. Figure 2.33 presents how the video streaming process is handled by the CDN. It starts when the user requests a manifest file. The original video is segmented into multiple equal-sized chunks, which are sent to the user on request.

Figure 2.33: Video Streaming workflow [29]

This raises the problem of differing user requirements for video quality. To solve it, the CDN vendor offers two solutions and selects the suitable strategy according to each customer's requirements. The first solution is pre-packaging: a copy of the original content is stored statically in each format. The second caches only one source content, usually the original, and dynamically packages it into whatever format is needed.

0.136	The delay for a content which is requested by a user
…	… content from the server
[03/Dec/2018:00:00:00, +0700]	The time that the server receives the user request
/38f16b08fdbe06b13a7698f141672c7a	…

Table 2.1: Example of a record in system log

CDN log files

The CDN vendor uses monitoring tools to trace and optimize its system. The CDN provider uses the logging service of Nginx, one of the most popular web servers, which also supports monitoring of system status and network traffic. We can use these raw log files to analyze system performance and to evaluate the user experience in order to improve service quality.

Table 2.1 shows an example of a raw log record. When a request passes through the system network, edge servers monitor its status and record it. Hit statuses are reported by edge cache servers and regional cache servers:

1 “MISS”: the content has not been cached at any cache

2 “HIT”: the content has been cached at some edge caches

3 “HIT1”: the content has been cached at some regional caches

4 “-”: the content has been cached at local devices
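The four status values above can be turned into hit rates with a small helper. This is a sketch under stated assumptions: the tier names, the function name `hit_rates`, and the choice to keep every record (including “-”) in the denominator are illustrative and not taken from [29].

```python
from collections import Counter

# Mapping from the logged status to the tier that served the request.
TIER_BY_STATUS = {
    "MISS": "origin",    # not cached at any cache
    "HIT": "edge",       # served by an edge cache
    "HIT1": "regional",  # served by a regional cache
    "-": "local",        # already cached on the local device
}


def hit_rates(statuses):
    """Edge, regional, and overall hit rate over all records (an assumption)."""
    counts = Counter(TIER_BY_STATUS[s] for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return {"edge": 0.0, "regional": 0.0, "overall": 0.0}
    edge = counts["edge"] / total
    regional = counts["regional"] / total
    return {"edge": edge, "regional": regional, "overall": edge + regional}


sample = ["HIT", "HIT", "MISS", "HIT1", "-", "HIT", "HIT", "MISS"]
rates = hit_rates(sample)
```

Whether records with the “-” status should be counted depends on what the hit rate is meant to express, so the denominator is the first thing to adjust when adapting this sketch.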

Data analysis

Because each daily log file of the system is at least 10 GB, we use Apache Spark to process and analyze it. Noise and erroneous records must be discarded, so we pre-process the raw data first. We can then evaluate the CDN provider's current solution in terms of hit rate and latency. These metrics reflect the system status as workload intensity and content popularity change over time.

Records Classification

The log files contain data that belong to several services. Data from each service have different characteristics, to which caching algorithms are very sensitive. As mentioned above, the system serves Live Streaming, VoD and Website services, so we classify the records into these services.

For live-streaming records, the name contains patterns such as “live”, “tv” or the television channel name. The VoD records are the remaining chunk and manifest file requests: their names do not contain any pattern related to the live-streaming class and have an extension such as “.dash”, “.ts”, “.m3u8” or “.mpd”. The remaining records belong to the website services.
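The classification rule can be sketched as a short function. The patterns and extensions come from the text; the function name, the handling of query strings, and the example paths and channel name are hypothetical.

```python
LIVE_PATTERNS = ("live", "tv")
VOD_EXTENSIONS = (".dash", ".ts", ".m3u8", ".mpd")


def classify_record(path, channel_names=()):
    """Classify a requested path as 'live', 'vod', or 'web'."""
    name = path.split("?", 1)[0].lower()  # ignore any query string
    live_patterns = LIVE_PATTERNS + tuple(c.lower() for c in channel_names)
    if any(pattern in name for pattern in live_patterns):
        return "live"
    if name.endswith(VOD_EXTENSIONS):
        return "vod"
    return "web"


examples = {
    "/live/channel1/seg_001.ts": classify_record("/live/channel1/seg_001.ts"),
    "/movies/abc/master.m3u8": classify_record("/movies/abc/master.m3u8"),
    "/static/app.js": classify_record("/static/app.js"),
    "/sports1/index.mpd": classify_record("/sports1/index.mpd", ("sports1",)),
}
```

Substring patterns as short as “tv” can match unrelated names, so a production version would likely need stricter patterns than this sketch.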

The CDN provider uses two different packaging mechanisms, so contents are either dynamic-packaging or non-packaging contents. With dynamic-packaging content, the hit value of every request for that content is “HIT”. A regular content, in contrast, always has at least one “MISS” under the CDN mechanism. We can therefore list as packaging content every content whose requests contain no “MISS”. All other content is categorized as “non-packaging”.

Content Characteristics

Understanding the features of the requested data helps the enterprise choose an optimal solution. Figure 2.34 shows that the system mainly handles manifest and chunk files, which are used in VoD and live streaming services.

Figure 2.34: The proportion of the number of requests and content size for each Multipurpose Internet Mail Extensions (MIME) type [29]

The live streaming service has a huge number of requests; the requested chunks of each video are equally segmented, and the segment length is usually shorter than that of VoD. In contrast, the VoD class has the fewest user requests and its chunks are longer than live-streaming chunks. Although some downloadable files can be very large, most website files are front-end scripts, which are very small, so the median website file size is the smallest.

Overall Performance

Latency directly impacts the quality of user experience. In the upper part of Figure 2.35, the edge hit rate and the overall hit rate are approximately the same and much larger than the regional hit rate, which means that almost all content is cached by edge servers. As the number of user requests grows, so does the hit rate. The latency of the system over 7 days is depicted in the lower part of Figure 2.35 to evaluate the system's adaptive capacity as the number of incoming requests increases. The average latency peaks during peak intervals, and the upper limit of latency exceeds 40 s, resulting in extended stalls for 25% of user requests.

Figure 2.35: Upper: General hit rate for 3 services, Lower: Sum latency for 3 services [29]

In conclusion, they analyze a real case study of a CDN vendor and its CDN system. Because the considered workload, which only contains log information from one of the firm's clients for the previous seven days, is small and limited in terms of customer service information, they are unable to provide a more complete analysis and better recommendations for the enterprise.

To the best of our knowledge, CDNSim [30] is the sole public CDN simulation tool; it simulates a CDN following an event-driven mechanism. CDNSim is a discrete event network simulator based on the OMNeT++ library [31]. OMNeT++ is a popular C++ framework; however, its network hardware, protocol, and traffic management models may lack accuracy. Moreover, a discrete event simulation is a modeling approach for simulating a real system that is based on queuing theory. In other words, the events in CDNSim follow a statistical process such as the Poisson process, whereas the behavior of a real system does not follow any preset random process.
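To make the contrast concrete, the kind of synthetic arrival stream that a Poisson assumption produces can be sketched as follows. This is a generic illustration and not CDNSim code; the function name, the rate of 5 requests per second, and the 100-second window are hypothetical.

```python
import random


def poisson_arrivals(rate, duration, seed=0):
    """Arrival times of a Poisson process: exponential gaps with mean 1 / rate."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    while True:
        now += rng.expovariate(rate)  # time until the next request
        if now > duration:
            break
        times.append(now)
    return times


arrivals = poisson_arrivals(rate=5.0, duration=100.0)
```

A trace-driven emulation replaces this generator with the timestamps recorded in real log files, so bursts and daily patterns that a fixed-rate process cannot express are preserved.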

Given these limitations of CDNSim, their work builds a realistic CDN emulation tool that provides a flexible topology and network configuration with high fidelity.

Design

Mininet is a library that emulates the whole network stack. Basic network elements such as switches, Linux kernel hosts, and links are supported. Mininet also supports Open vSwitch (OvS) switches managed by a controller; packets in the network are routed according to this controller's flow table. The OvS method uses increasing CPU resources as the network's […] same job as genuine routers in the network, in order to decrease resource usage. A static routing table is generated at each router during the warm-up step. The components of the emulation tool are shown in Figure 2.36. To simplify the architecture, we minimize the link between each server and its nearby router, and each server runs on its own router. In more detail, the tool is made up of five parts:

Figure 2.36: Components of the tool [29]

1 Link: Mininet provides a configurable link class with high reliability. We can configure bandwidth, loss and delay for each link

2 Origin Server: The origin server serves all the available contents

3 Caching Server: A caching server contains replicas of some contents

4 Client: A client requests content from the nearest caching server

5 Router: Routers have responsibility for routing packets

Figure 2.37: The tool’s workflow [29]

The emulation tool's workflow is depicted in Figure 2.37. After reading the configuration file, the topology construction, content generation, and routing table building modules initialize the network's topology, contents, and host IP settings. The tool then runs the colorizing module if the caching algorithm is color-based. The emulation stage is the following phase. At this point, the program launches server modules on each server host (router node) in the network, as well as client modules that will submit

Title	Studying And Developing A Cdn Caching Algorithm Using Machine Learning
Authors	Tran Trung Quan, Pham Trong Nhan
Supervisor	Assoc. Prof. Thoai Nam
University	Ho Chi Minh City University of Technology
Major	Computer Science
Type	Graduation Thesis
Year of publication	2021
City	Ho Chi Minh City
