5.4 Distributed Frameworks for Analyzing Big Dynamic Graphs

Unlike static graphs, dynamic graphs change their topologies over time. In other words, they have different topologies at different time points. Each of these time-point-specific topologies is itself a static graph, called a snapshot. Thus, a dynamic graph can be regarded as a sequence of snapshots [55].

There are two types of dynamic graphs: (i) temporal (also called historical or time-varying) graphs, whose entire history of topological changes (i.e., the sequence of topologies at all time points) is already known to the user at analysis time, and (ii) streaming graphs, which are continuously updated while they are being analyzed [7].

Both types of graphs raise interesting questions that differ somewhat from those asked about static graphs, e.g., how rumors/diseases spread in dynamic networks [56], which edges/links are important for disease/rumor propagation [57], how to compute “time-respecting” shortest paths [56] or node-disjoint paths [58] in dynamic networks, how to detect evolving communities [59], how to classify vertices of dynamic networks [60], etc.

Figure 5.7 (a) A graph, (b) its adjacency matrix, and (c) in-degree calculation via SPMV.

In general, two types of analyses can be done on dynamic graphs: (i) point-in-time analysis, which analyzes a specific snapshot to answer questions about it (e.g., what are the highest-degree vertices in the Facebook friendship graph on Dec 30, 2017?), and (ii) time-range analysis, which analyzes the graphs in a time range to answer questions related to their dynamic nature (e.g., which non-hubs became hubs in the Facebook friendship graph between Jan 1, 2010, and Jan 1, 2017?). While traditional big static graph frameworks can still be used for point-in-time analysis, time-range analysis requires special-purpose frameworks.
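The two kinds of analysis can be contrasted with a minimal sketch. It assumes a dynamic graph is stored as a mapping from time labels to sets of undirected edges, and that a "hub" is any vertex whose degree reaches a caller-supplied threshold; the toy data, the function names, and the threshold-based hub definition are all illustrative assumptions rather than part of any framework discussed here.

```python
from collections import Counter

# Hypothetical toy data: each snapshot is a set of undirected edges.
snapshots = {
    "2010-01-01": {("a", "b"), ("b", "c")},
    "2017-01-01": {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")},
}

def degrees(edges):
    """Degree of every vertex in one snapshot."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def highest_degree_vertices(edges):
    """Point-in-time analysis: vertices with maximum degree in one snapshot."""
    deg = degrees(edges)
    top = max(deg.values())
    return sorted(v for v, d in deg.items() if d == top)

def became_hubs(first, last, threshold):
    """Time-range analysis: vertices below the (assumed) hub threshold in
    `first` that reach it in `last`."""
    before, after = degrees(first), degrees(last)
    return sorted(v for v, d in after.items()
                  if d >= threshold and before[v] < threshold)

top_2017 = highest_degree_vertices(snapshots["2017-01-01"])
new_hubs = became_hubs(snapshots["2010-01-01"], snapshots["2017-01-01"], 3)
```

The point-in-time query touches a single snapshot, whereas the time-range query must compare at least two, which is why it benefits from frameworks that store and index the history.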

Note that although temporal and streaming graphs raise similar research questions, answering them often requires distinct techniques. In general, a graph framework should be designed specifically for the type of dynamic graph it must store and analyze. In the next two subsections, we discuss some notable frameworks that handle each of these types of dynamic graphs.

5.4.1 Frameworks for Analyzing Temporal Graphs

To date, a few frameworks have been proposed in the literature for processing big temporal graphs. We discuss two of them in this section: DeltaGraph [55] and Chronos [61]. These frameworks focus on different problems related to temporal graphs, as discussed below.

5.4.1.1 DeltaGraph

DeltaGraph is a distributed graph database system designed to support snapshot queries on a big dynamic graph. Specifically, it supports two types of queries: (i) single-point snapshot queries, which retrieve the snapshot/graph at a given time point (e.g., “retrieve the subgraph induced by the vertices a, b, and c on Jan 2, 2018”), and (ii) multipoint snapshot queries, which retrieve all the snapshots within a range of time points (e.g., “retrieve all the subgraphs induced by the vertices a, b, and c between Jan 2, 2018 and Jun 13, 2018”). To process queries efficiently, it maintains an in-memory index in the form of a rooted directed acyclic graph (DAG). The root of this DAG is an empty graph, the lowest-level vertices represent snapshots, and each internal node typically represents the intersection graph of its children. An edge between a parent and a child in this DAG carries the difference (called a delta) between the graphs represented by the parent and the child. Thus, the graph corresponding to a node of this DAG can be formed by applying the deltas along a path from the root to that node. Each such edge (parent–child relation) is assigned a weight that captures the cost of reading its delta and applying it to the parent graph to form the child graph. Thus, a single-point snapshot query can be answered efficiently by computing the shortest path from the root of this DAG to the leaf that corresponds to the given time point. Similarly, a multipoint snapshot query can be answered by first finding the leaves that fall within the given time range and then computing a Steiner tree that connects those leaves to the root.
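The single-point query can be illustrated with a small sketch. It assumes the index is a dictionary mapping each DAG node to its outgoing edges, where every edge stores a weight and a delta expressed as a pair of added and removed edge sets; the node names, weights, and this delta encoding are hypothetical and are not DeltaGraph's actual storage format. The Steiner-tree step for multipoint queries is not covered.

```python
import heapq

# Hypothetical index: node -> list of (child, weight, added_edges, removed_edges).
# "mid" is the intersection of the two leaf snapshots "t1" and "t2".
index = {
    "root": [("mid", 2, {("a", "b")}, set())],
    "mid": [("t1", 1, {("b", "c")}, set()),
            ("t2", 1, {("a", "c")}, set())],
    "t1": [("t2", 3, {("a", "c")}, {("b", "c")})],
    "t2": [],
}

def cheapest_path(index, source, target):
    """Dijkstra over the index; returns the total cost and the deltas on the path.
    Assumes `target` is reachable from `source`."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for child, weight, added, removed in index[node]:
            nd = d + weight
            if nd < dist.get(child, float("inf")):
                dist[child] = nd
                prev[child] = (node, added, removed)
                heapq.heappush(heap, (nd, child))
    deltas = []
    node = target
    while node != source:
        node, added, removed = prev[node]
        deltas.append((added, removed))
    deltas.reverse()
    return dist[target], deltas

def materialize(index, leaf):
    """Rebuild a snapshot by applying deltas from the (empty) root downward."""
    cost, deltas = cheapest_path(index, "root", leaf)
    graph = set()
    for added, removed in deltas:
        graph = (graph - removed) | added
    return cost, graph

cost_t2, graph_t2 = materialize(index, "t2")
```

In this toy index, reaching "t2" through "mid" costs 3, whereas going through "t1" first would cost 6, so the cheaper chain of deltas is chosen.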

5.4.1.2 Chronos

Chronos has been designed to efficiently store and analyze big dynamic graphs. It partitions the vertices of the input dynamic graph and then assigns each partition, along with the entire history of vertex states in that partition, to a worker machine. Han et al. [61] observed that partitioning vertices in this fashion yields much better performance than partitioning snapshots. Partitioning this way allows Chronos to execute UDFs on each vertex across all the snapshots in a batch. This strategy also allows Chronos to perform incremental computations efficiently, i.e., to carry out a computation on a snapshot by reusing the results of the same computation on a previous snapshot (e.g., computing the SSSP distances in the first snapshot, using these results to update the SSSP distances in the second snapshot, then using those results to update the SSSP distances in the third snapshot, and so on). Chronos also allows one to store and use deltas of consecutive snapshots to achieve greater efficiency in incremental computation. It can be implemented both on a single powerful machine and in a distributed system.
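The idea of incremental computation across snapshots can be sketched for SSSP as follows. The sketch assumes directed edges with positive weights and deltas that only add edges (deletions need extra handling and are omitted); it runs on a single machine and does not reproduce Chronos's partitioning, batching, or storage layout. All function names are illustrative.

```python
import heapq

INF = float("inf")

def relax_from(adj, dist, heap):
    """Standard Dijkstra relaxation loop, started from whatever is in `heap`."""
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, INF):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

def sssp(adj, source):
    """From-scratch SSSP on one snapshot."""
    return relax_from(adj, {source: 0}, [(0, source)])

def incremental_sssp(adj, dist, new_edges):
    """Update distances after an insert-only delta, reusing the previous result.
    Mutates `adj` so that it becomes the next snapshot; returns new distances."""
    dist = dict(dist)
    for u, v, w in new_edges:
        adj.setdefault(u, []).append((v, w))
    heap = []
    for u, v, w in new_edges:
        # Only endpoints improved by a new edge seed the search.
        if dist.get(u, INF) + w < dist.get(v, INF):
            dist[v] = dist[u] + w
            heapq.heappush(heap, (dist[v], v))
    return relax_from(adj, dist, heap)

adj = {"s": [("a", 4)], "a": [("b", 1)]}                           # first snapshot
dist1 = sssp(adj, "s")
dist2 = incremental_sssp(adj, dist1, [("s", "b", 1), ("b", "a", 1)])  # second
dist3 = incremental_sssp(adj, dist2, [("b", "c", 2)])                 # third
```

Each update only explores vertices whose distance actually changes, which is the saving that incremental computation aims for when consecutive snapshots differ by small deltas.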

5.4.2 Frameworks for Analyzing Streaming Graphs

Although analysis of temporal graphs gives valuable insights, we are often interested in analyzing the most recent graph in real time. For example, Facebook may want to recommend friends and ads to users based on the up-to-date friendship and/or follower–followee network. Streaming graph frameworks are the best choice for such applications. So far, a number of frameworks have been specifically designed to store, update, and/or analyze streaming graphs. Some frameworks of this class are GraphTau [62], Kineograph [63], GraphInc [64], TIDE [65], DISTINGER [66], and BLADYG [67]. GraphTau is a framework built on Apache Spark that achieves efficient and fault-tolerant processing of streaming graphs via several optimization techniques. Kineograph implements two layers: the storage layer is responsible for storing and updating graphs in real time, and the computation layer is responsible for efficiently processing the graph most recently generated by the storage layer. GraphInc is a Pregel-based framework designed to handle real-time incremental analysis of streaming graphs. TIDE uses a user-defined function to compute the probability that an edge is present in the current graph and uses that probability to generate the most recent graph from one or more previous snapshots. DISTINGER is a distributed implementation of STINGER [68], a single-machine framework discussed in the next section. BLADYG is a block-centric framework adapted to handle streaming graphs in a distributed setting.
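The probabilistic idea described for TIDE can be illustrated with a rough sketch. It assumes that each edge is tagged with the time it was last observed and that the user-defined function maps an edge's age to an inclusion probability; the exponential decay, the half-life parameter, and all names below are illustrative assumptions and not TIDE's actual interface.

```python
import random

def decay_probability(age, half_life=2.0):
    """Hypothetical user-defined function: the probability of keeping an edge
    halves every `half_life` time units."""
    return 0.5 ** (age / half_life)

def sample_current_graph(edge_times, now, prob_fn, rng):
    """Build an approximate current graph by keeping each edge with the
    probability assigned by `prob_fn` to its age."""
    sampled = set()
    for edge, seen_at in edge_times.items():
        if rng.random() < prob_fn(now - seen_at):
            sampled.add(edge)
    return sampled

# Hypothetical edge history: edge -> time at which it was last observed.
edge_times = {("a", "b"): 10, ("b", "c"): 6, ("c", "d"): 1}
current = sample_current_graph(edge_times, now=10,
                               prob_fn=decay_probability,
                               rng=random.Random(0))
```

Edges observed at the current time are always kept (their probability is 1), while older edges are increasingly likely to be dropped.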

5.5 Single-Machine Frameworks for Analyzing Big Dynamic Graphs

Although distributed systems are obvious choices for developing frameworks for big dynamic graphs, a few single-machine frameworks have proven capable of storing and/or processing such graphs on powerful machines. Here we discuss three such frameworks, LLAMA, SLOTH [69], and STINGER [68], as representatives of this class.

5.5.1 LLAMA and SLOTH

LLAMA is a single-machine framework designed to efficiently perform whole-graph analysis on systems that receive incremental graph updates at a steady rate. LLAMA allows both in-memory and out-of-core execution of graph algorithms.

Since most graph algorithms do not change the graph itself, many frameworks use a compact immutable/read-only data structure (called compressed sparse row, or CSR for short) to represent graphs. While CSR is highly space efficient, it is not suitable for dynamic graphs (due to its immutability), nor does it provide a way to store graphs in persistent storage. Mutable data structures allow a framework to handle dynamic graphs. Storing snapshots in persistent storage, in turn, allows a platform to perform out-of-core execution and enables the user to process the same snapshot multiple times. The author of LLAMA set out to achieve both goals in order to store and process large dynamic graphs efficiently. To achieve persistence, the author designed an on-disk graph representation that closely mimics the CSR format. Mutability was achieved by adding multi-versioned array support to CSR. Specifically, LLAMA uses a large multi-versioned array (LAMA) data structure (called the vertex table) to store vertex information. For versioning of this array, LLAMA uses a software copy-on-write technique. It also uses an array of edge records (called an edge table) to store the information of the edges added/deleted in each snapshot. The vertex table contains pointers into the edge table, which are used to determine the adjacency list of each vertex. By using this augmented CSR representation, LLAMA decreases the reading cost while still allowing the programmer to process dynamic graphs. It also decreases the writing cost by using a buffer that temporarily stores newly added/deleted edges before updating the snapshot in storage. The author of LLAMA also built SLOTH, a sliding-window framework based on LLAMA, which reduces the space requirement by storing only the most recent k snapshots (k is the window size) and deleting the older ones.
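The interplay between a per-snapshot vertex table and per-snapshot edge tables can be sketched as follows. This is a heavily simplified illustration under stated assumptions: only edge additions are supported, the vertex table is a plain dictionary copied for every snapshot (LLAMA instead shares unchanged parts through copy-on-write), and nothing is written to disk. The class and field names are hypothetical.

```python
class MultiVersionGraph:
    """Insert-only sketch of a multi-versioned, CSR-like graph store."""

    def __init__(self):
        self.edge_tables = []    # per snapshot: neighbors added in that snapshot
        self.vertex_tables = []  # per snapshot: vertex -> (sid, start, length, older)

    def add_snapshot(self, new_edges):
        """Create a new snapshot containing the previous one plus `new_edges`."""
        sid = len(self.vertex_tables)
        prev = self.vertex_tables[-1] if self.vertex_tables else {}
        table = dict(prev)  # naive full copy; a real system would copy on write
        by_source = {}
        for u, v in new_edges:
            by_source.setdefault(u, []).append(v)
        edges = []
        for u, nbrs in by_source.items():
            # The new record points at this snapshot's fragment and chains to
            # the vertex's older record, if any.
            table[u] = (sid, len(edges), len(nbrs), prev.get(u))
            edges.extend(nbrs)
        self.edge_tables.append(edges)
        self.vertex_tables.append(table)
        return sid

    def neighbors(self, vertex, snapshot):
        """Adjacency list of `vertex` as of `snapshot`, newest fragment first."""
        record = self.vertex_tables[snapshot].get(vertex)
        result = []
        while record is not None:
            sid, start, length, older = record
            result.extend(self.edge_tables[sid][start:start + length])
            record = older
        return result

g = MultiVersionGraph()
s0 = g.add_snapshot([(0, 1), (0, 2), (1, 2)])
s1 = g.add_snapshot([(0, 3)])
```

Because each snapshot stores only its own new edges, older snapshots remain readable after an update: `g.neighbors(0, s0)` still reflects the first snapshot, while `g.neighbors(0, s1)` also includes the edge added later.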

5.5.2 STINGER

STINGER (Spatio-Temporal Interaction Networks and Graphs Extensible Representation) is a single-machine framework that supports storing and analyzing streaming graph updates that may arrive at a nonsteady rate and, as such, may not have temporal resolution. STINGER maintains a single in-memory representation of the most recent snapshot. For this purpose, it uses a modified version of the CSR data structure to store graphs. This data structure combines the ideas of both CSR and the adjacency list. Specifically, it stores the neighbors of each vertex as a list of arrays. When a neighbor is deleted, a negative value is placed in its cell, which creates a “hole” in that place. When a new neighbor is added, its record is placed in a hole; if no hole is present, a new array is created to hold it, and the new array is linked to the existing arrays of neighbors. Thus, STINGER is able to update the graph (edge insertions/deletions) efficiently. To achieve high parallelism, STINGER runs graph-reader processes (called by the graph algorithm running in this framework) in parallel with graph-updater processes. The authors of STINGER reasoned that in a big graph, the probability of conflict between graph readers and graph updaters will be very low. However, this may cause inconsistencies, since different passes of a reader process may read different information. STINGER tolerates such inconsistencies because it assumes that the current graph is already an approximate representation of the real-world network concerned, in which case these inconsistencies will not affect the result significantly.
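The hole-reusing list of arrays described above can be sketched in a few lines. The sketch assumes non-negative integer vertex IDs, uses -1 as the hole marker, and picks an arbitrary block size of 2 so that linking a new array is easy to see; it is single-threaded and does not model STINGER's concurrent readers and updaters. All names are illustrative.

```python
BLOCK_SIZE = 2  # hypothetical array length, kept tiny for illustration
HOLE = -1       # negative value marking a deleted or unused cell

class EdgeBlocks:
    """Per-vertex neighbor storage as a list of fixed-size arrays with holes."""

    def __init__(self):
        self.blocks = {}  # vertex -> list of fixed-size neighbor arrays

    def insert(self, u, v):
        """Add neighbor v to u, reusing the first hole if one exists."""
        blocks = self.blocks.setdefault(u, [])
        hole = None
        for block in blocks:
            for i, x in enumerate(block):
                if x == v:
                    return False  # edge already present
                if x == HOLE and hole is None:
                    hole = (block, i)
        if hole is None:
            # No hole anywhere: link a fresh array to the existing ones.
            block = [HOLE] * BLOCK_SIZE
            blocks.append(block)
            hole = (block, 0)
        hole[0][hole[1]] = v
        return True

    def delete(self, u, v):
        """Remove neighbor v from u by leaving a hole in its cell."""
        for block in self.blocks.get(u, []):
            for i, x in enumerate(block):
                if x == v:
                    block[i] = HOLE
                    return True
        return False

    def neighbors(self, u):
        return [x for block in self.blocks.get(u, []) for x in block if x != HOLE]

g = EdgeBlocks()
for v in (1, 2, 3):
    g.insert(0, v)   # the third insert links a second array
g.delete(0, 1)       # leaves a hole in the first array
g.insert(0, 4)       # fills that hole instead of growing the structure
```

Deletions never shift other entries and insertions rarely allocate, which is what keeps individual updates cheap in this layout.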
