In computer science and mathematics, graphs are abstract data structures that model structural relationships among objects. They are now widely used for data modeling in application domains for which identifying relationship patterns, rules, and anomalies is useful. These domains include the web graph, social networks, the Semantic Web, knowledge bases, protein-protein interaction networks, and bibliographical networks, among many others. The ever-increasing size of graph-structured data for these applications creates a critical need for scalable systems that can process large amounts of it efficiently.
The following glossary defines the graph terms used in this article:

Adjacency list: The representation of a graph as a collection of unordered lists, one for each vertex in the graph. Each list describes the set of neighbours of its vertex.

Diameter: The largest number of vertices that must be traversed to travel from one vertex to another, when paths that backtrack, detour, or loop are excluded from consideration.

Edge: The connection between any two vertices (nodes) in a graph.

Scale-free graph: A graph that has many nodes with few connections, and few nodes with many connections.

PageRank: A link-analysis algorithm that is used by the Google web search engine. The algorithm assigns a numerical weight to each element of a hyperlinked set of documents, such as the web graph, with the purpose of measuring its relative importance within the set.

Shortest path: The process of finding a path between two vertices (nodes) in a graph such that the number of its constituent edges is minimized.

Web graph: A graph that represents the pages of the World Wide Web and the direct links between them.

The web graph is a dramatic example of a large-scale graph. Google estimates that the total number of web pages exceeds 1 trillion; experimental graphs of the World Wide Web contain more than 20 billion nodes (pages) and many billions of edges (hyperlinks).
Graphs of social networks are another example. Facebook reportedly consists of more than a billion users (nodes) and many billions of friendship relationships (edges). The LinkedIn network contains almost 8 million nodes and 60 million edges.
Social network graphs are growing rapidly: Facebook went from roughly 1 million users to 1 billion. Several graph database systems, most notably Neo4j, support online transaction processing workloads on graph data (see Related topics). But Neo4j relies on data access methods for graphs without considering data locality, and the processing of graphs entails mostly random data access. For large graphs that cannot be stored in memory, random disk access becomes a performance bottleneck.
Furthermore, Neo4j is a centralized system that lacks the computational power of a distributed, parallel system.
Large-scale graphs must be partitioned over multiple machines to achieve scalable processing. Unlike Neo4j, MapReduce is not designed to support online query processing; it is optimized for analytics on large data volumes partitioned over hundreds of machines.
Apache Hadoop, an open source distributed-processing framework for large data sets that includes a MapReduce implementation, is popular in industry and academia. However, Hadoop and its associated technologies, such as Pig and Hive, were not designed mainly to support scalable processing of graph-structured data.
Some proposals to adapt the MapReduce framework or Hadoop for this purpose have been made, and this article starts by looking at two of them. The most robust available technologies for large-scale graph processing, however, are based on programming models other than MapReduce.
The remainder of the article describes and compares two such systems, Apache Giraph and GraphLab, in depth. At the conclusion of the article, I also briefly describe some other open source projects for graph data processing. I assume that readers of this article are familiar with graph concepts and terminology.
For any readers who might not be, I include a glossary of terms.

Surfer is an experimental large-scale graph-processing engine that provides two primitives for programmers: MapReduce and propagation. Propagation is an iterative computational pattern that transfers information along the edges from a vertex to its neighbours in the graph.
MapReduce is suitable for processing flat data structures, such as vertex-oriented tasks, while propagation is optimized for edge-oriented tasks on partitioned graphs.
Surfer resolves network-traffic bottlenecks with graph partitioning adapted to the characteristics of the Hadoop distributed environment.
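A single propagation step, transferring a value along every edge from a vertex to its neighbours, can be sketched in a few lines of plain Java. The names here are illustrative, not Surfer's actual API, which implements this as a distributed operator over partitioned graphs:

```java
import java.util.Arrays;

// One propagation step: every vertex pushes its current value to
// its neighbours, and each vertex aggregates what it receives
// (here, by summing). Names are illustrative, not Surfer's API.
public class PropagationStep {
    public static double[] propagate(double[] values, int[][] neighbours) {
        double[] received = new double[values.length];
        for (int v = 0; v < values.length; v++) {
            for (int nb : neighbours[v]) {
                received[nb] += values[v]; // transfer along the edge
            }
        }
        return received;
    }

    public static void main(String[] args) {
        // vertex 0 points to 1 and 2; vertex 1 points to 2
        int[][] neighbours = { {1, 2}, {2}, {} };
        System.out.println(Arrays.toString(propagate(new double[]{1, 2, 4}, neighbours)));
    }
}
```

Iterating such a step until values stabilize is exactly the edge-oriented workload for which propagation outperforms flat MapReduce passes.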
Another proposed MapReduce extension, GBASE, uses a graph storage method called block compression to store homogeneous regions of graphs efficiently. Then, it compresses all nonempty blocks through a standard compression mechanism such as GZip. Finally, it stores the compressed blocks together with some meta information into a graph database. The key feature of GBASE is that it unifies node-based and edge-based queries as query vectors and unifies different types of operations on the graph through matrix-vector multiplication on the adjacency and incidence matrices.
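GBASE's central idea, expressing a graph query as a matrix-vector multiplication over the adjacency matrix, can be illustrated with a short, self-contained sketch. The class and method names are hypothetical, and the real system runs this over compressed, block-partitioned matrices on Hadoop:

```java
import java.util.Arrays;

// Illustrative sketch of GBASE's unifying idea: a graph query is
// a matrix-vector multiplication over the adjacency matrix.
// Names are hypothetical, not GBASE's actual API.
public class MatrixVectorQuery {
    // adjacency[i][j] == 1 means an edge from vertex i to vertex j.
    public static double[] multiply(int[][] adjacency, double[] queryVector) {
        double[] result = new double[adjacency.length];
        for (int i = 0; i < adjacency.length; i++) {
            for (int j = 0; j < adjacency[i].length; j++) {
                result[i] += adjacency[i][j] * queryVector[j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A targeted query: the selector vector picks vertex 0, and
        // A * e0 yields column 0, the vertices with an edge to vertex 0.
        int[][] adjacency = { {0, 1, 1}, {0, 0, 1}, {1, 0, 0} };
        double[] selector = {1, 0, 0};
        System.out.println(Arrays.toString(multiply(adjacency, selector)));
    }
}
```

Global queries correspond to multiplying against the whole matrix; targeted queries touch only the blocks that the selector vector makes relevant.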
The queries are classified into global queries, which require traversal of the whole graph, and targeted queries, which usually must access only parts of the graph. Before GBASE runs the matrix-vector multiplication, it selects the grids that contain the blocks that are relevant to the input queries. In the MapReduce model, the framework groups all intermediate values that are associated with the same intermediate key and passes them to the Reduce function.
The Reduce function receives an intermediate key with its set of values and merges them together. Periodically, the buffered pairs are written to local disk and partitioned into regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the designated master program instance, which is responsible for forwarding the locations to the reduce workers. When a reduce worker is notified of the locations, it reads the buffered data from the local disks of the map workers.
The buffered data is then sorted by the intermediate keys so that all occurrences of the same key are grouped. The output of the Reduce function is appended to a final output file for this reduce partition.
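The Map, shuffle, and Reduce phases just described can be simulated on a single machine in a few lines of plain Java. This sketch illustrates the grouping semantics only, not Hadoop's distributed implementation:

```java
import java.util.*;

// Single-machine sketch of the MapReduce model: emit (key, value)
// pairs, group all values that share an intermediate key (the
// "shuffle"), then reduce each group. Hadoop performs the same
// steps across many machines, disks, and network transfers.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        // Shuffle: group all values associated with the same key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: merge each key's values (here, by summing).
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```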
MapReduce isolates the application developer from the details of running a distributed program, such as issues of data distribution, scheduling, and fault tolerance. From the graph-processing point of view, however, the basic MapReduce programming model is inadequate because most graph algorithms are iterative and traverse the graph in some way. Hence, the efficiency of graph computations depends heavily on interprocessor bandwidth, as graph structures are sent over the network iteration after iteration.
Processing large-scale graph data: A guide to current technology
For example, the basic MapReduce programming model does not directly support iterative data-analysis applications. To implement iterative programs, programmers might manually issue multiple MapReduce jobs and orchestrate their execution with a driver program. In practice, the manual orchestration of an iterative program in MapReduce has two key problems: even though much of the data might be unchanged from iteration to iteration, it must be reloaded and reprocessed at each iteration, wasting I/O, network bandwidth, and processor resources; and detecting the termination condition, such as reaching a fixpoint, might itself require an extra MapReduce job on each iteration.
In Related topics, find links to the full papers that propose these extensions. These two proposals promise only limited success.
Thus, a crucial need remains for distributed systems that can effectively support scalable processing of large-scale graph data on clusters of horizontally scalable commodity machines. The Giraph and GraphLab projects both propose to fill this gap. Google introduced the Pregel system as a scalable platform for implementing graph algorithms (see Related topics).
Apache Giraph launched as an open source project that clones the concepts of Pregel. Giraph can run as a typical Hadoop job that uses the Hadoop cluster infrastructure.
In Giraph, graph-processing programs are expressed as a sequence of iterations called supersteps. During a superstep, the framework starts a user-defined function for each vertex, conceptually in parallel. The user-defined function specifies the behaviour at a single vertex V and a single superstep S: it can read messages that were sent to V in superstep S-1, send messages to other vertices that will be received at superstep S+1, and modify the state of V and its outgoing edges. Messages are typically sent along outgoing edges, but you can send a message to any vertex with a known identifier.
Each superstep represents an atomic unit of parallel computation. Figure 1 illustrates the execution mechanism of the BSP programming model. In this programming model, all vertices are assigned an active status at superstep 1 of the executed program. All active vertices run the compute user function at each superstep. Each vertex can deactivate itself by voting to halt, turning to the inactive state at any superstep in which it does not receive a message.
A vertex can return to the active status if it receives a message in the execution of any subsequent superstep. This process, illustrated in Figure 2, continues until all vertices have no messages to send and become inactive. Hence, program execution ends when, at one stage, all vertices are inactive. Figure 3 illustrates an example of the messages communicated among a set of graph vertices to compute the maximum vertex value.
In Superstep 1 of Figure 3, each vertex sends its value to each neighbour vertex. In Superstep 2, each vertex compares its value with the value received from its neighbour vertex. If the received value is higher than the vertex's own value, the vertex updates its value with the higher value and sends the new value to its neighbour vertex.
If the received value is lower than the vertex's own value, the vertex keeps its current value and votes to halt. Hence, in Superstep 2, only the vertex with value 1 updates its value to the higher received value, 5, and sends its new value.
This process happens again in Superstep 3 for the vertex with the value 2, while in Superstep 4 all vertices vote to halt and the program ends.
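The superstep semantics just walked through can be reproduced with a small, self-contained BSP simulation in plain Java. This is a sketch of the model, not Giraph's actual Vertex API; vertex values and edges are plain arrays here:

```java
import java.util.*;

// Self-contained simulation of the BSP max-value example. Each
// active vertex compares its value with incoming messages, adopts
// a larger value, and sends its value to its neighbours; a vertex
// that cannot improve votes to halt. A halted vertex is reactivated
// by an incoming message. Execution ends when no messages remain.
public class MaxValueBsp {
    public static int[] run(int[] values, int[][] neighbours) {
        int n = values.length;
        boolean[] active = new boolean[n];
        Arrays.fill(active, true);
        List<List<Integer>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) inbox.add(new ArrayList<>());
        boolean first = true;            // in superstep 1 every vertex sends
        boolean messagesInFlight = true;
        while (messagesInFlight) {
            messagesInFlight = false;
            List<List<Integer>> nextInbox = new ArrayList<>();
            for (int i = 0; i < n; i++) nextInbox.add(new ArrayList<>());
            for (int v = 0; v < n; v++) {
                if (!inbox.get(v).isEmpty()) active[v] = true; // reactivate
                if (!active[v]) continue;
                boolean updated = first;
                for (int msg : inbox.get(v)) {
                    if (msg > values[v]) { values[v] = msg; updated = true; }
                }
                if (updated) {
                    for (int nb : neighbours[v]) {
                        nextInbox.get(nb).add(values[v]);
                        messagesInFlight = true;
                    }
                } else {
                    active[v] = false; // vote to halt
                }
            }
            inbox = nextInbox;
            first = false;
        }
        return values;
    }

    public static void main(String[] args) {
        // Undirected chain 5 - 1 - 2: vertex 1 is linked to both others.
        int[][] neighbours = { {1}, {0, 2}, {1} };
        System.out.println(Arrays.toString(run(new int[]{5, 1, 2}, neighbours)));
    }
}
```

On the chain 5 - 1 - 2, the middle vertex adopts 5 in the second superstep, the vertex holding 2 adopts it in the third, and all vertices vote to halt in the fourth, mirroring the walkthrough above.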
Like the Hadoop framework, Giraph is an efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, with the distribution-related details hidden behind an abstraction. On a machine that performs computation, it keeps vertices and edges in memory and uses network transfers only for messages.
During program execution, graph vertices are partitioned and assigned to workers. The default partition mechanism is hash-partitioning, but custom partitioning is also supported. The master node assigns partitions to workers, coordinates synchronization, requests checkpoints, and collects health statuses.
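The default hash-partitioning mentioned above amounts to a single modulo operation on the hash of the vertex identifier. A minimal sketch, with hypothetical names (Giraph's real partitioner classes differ):

```java
// Illustrative default partitioner: a vertex is assigned to a
// worker by hashing its identifier. The class and method names
// are hypothetical; Giraph's real partitioner classes differ.
public class HashPartition {
    public static int partitionFor(long vertexId, int numWorkers) {
        // floorMod keeps the result non-negative even when the
        // hash code is negative.
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    public static void main(String[] args) {
        for (long id = 0; id < 5; id++) {
            System.out.println("vertex " + id + " -> worker " + partitionFor(id, 3));
        }
    }
}
```

Hashing spreads vertices evenly but ignores graph structure; a custom partitioner can instead co-locate densely connected vertices to cut message traffic.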
Workers are responsible for vertices. A worker starts the compute function for the active vertices. It also sends and receives messages on behalf of its vertices. During execution, if a worker receives input that is not for its vertices, it passes it along. To implement a Giraph program, design your algorithm as a Vertex. You must define a VertexInputFormat for reading your graph. For example, to read from a text file with adjacency lists, the format might look like (vertex, neighbor1, neighbor2).
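The parsing work that such an input format performs for a line like "vertex, neighbor1, neighbor2" can be sketched on its own. Only the text handling is shown here; wiring it into Giraph's actual VertexInputFormat class hierarchy is omitted, and the class name is illustrative:

```java
import java.util.*;

// Sketch of the parsing an adjacency-list input format would do
// for lines of the form "vertex, neighbor1, neighbor2, ...".
// Only the text handling is shown; the Giraph class hierarchy
// that would consume these entries is omitted.
public class AdjacencyListParser {
    public static Map.Entry<Long, List<Long>> parseLine(String line) {
        String[] tokens = line.split(",");
        long vertexId = Long.parseLong(tokens[0].trim());
        List<Long> neighbours = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            neighbours.add(Long.parseLong(tokens[i].trim()));
        }
        return new AbstractMap.SimpleEntry<>(vertexId, neighbours);
    }

    public static void main(String[] args) {
        Map.Entry<Long, List<Long>> v = parseLine("1, 4, 7");
        System.out.println(v.getKey() + " -> " + v.getValue());
    }
}
```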
You also need to define a VertexOutputFormat to write back the result, for example, (vertex, pageRank). The Java code in Listing 1 is an example of using the compute function to implement the PageRank algorithm. Listing 2 shows an example of using the compute function to implement the shortest-path algorithm.
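As an illustration of what such a compute function does, here is a self-contained, single-machine sketch of the per-vertex PageRank update: sum the incoming rank messages, apply the damping factor, then send rank divided by out-degree to each neighbour. This is plain Java over arrays, not the Giraph Vertex API, and the damping factor 0.85 is the conventional choice:

```java
import java.util.*;

// Single-machine sketch of the PageRank computation that a
// compute function performs per vertex in each superstep.
// Incoming message sums are modeled with an array; in Giraph
// they arrive as messages from the previous superstep.
public class PageRankSketch {
    public static double[] run(int[][] outEdges, int supersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int step = 0; step < supersteps; step++) {
            double[] incoming = new double[n]; // sums of rank messages
            for (int v = 0; v < n; v++) {
                for (int target : outEdges[v]) {
                    incoming[target] += rank[v] / outEdges[v].length;
                }
            }
            for (int v = 0; v < n; v++) {
                rank[v] = 0.15 / n + 0.85 * incoming[v]; // damped update
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        // Edges: 0 -> 1, 1 -> 0 and 2, 2 -> 0.
        int[][] edges = { {1}, {0, 2}, {0} };
        System.out.println(Arrays.toString(run(edges, 30)));
    }
}
```

Because every vertex in this small example has at least one out-edge, the total rank stays at 1.0 across supersteps.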