Systems for Concurrent Real-Time Graph Analytics
The proliferation of big data and the analytics developed around it has brought the value of data to users and enterprises. Formally, big data is defined as a large volume of data that arrives at high velocity, is composed of a variety of data types, and requires specific technology and analytical methods, such as batch or stream analytics, to transform it into value and insight. The graph data model is envisioned to address the variety characteristic of big data by capturing and storing the relationships between data entities, and thus enriches big data to make it more amenable to value extraction. Meanwhile, modern storage drives such as solid-state drives (SSDs) and non-volatile memory (NVM) media, packaged behind new interfaces such as Non-Volatile Memory Express (NVMe), offer much better random and sequential IO throughput and latency than hard-disk drives (HDDs). Hence they are well suited to data arriving at high velocity, and to managing and analyzing data that does not fit in main memory, which addresses the sheer volume of data. Unfortunately, the current generation of graph analytics systems is specialized for one class of graph analytics at the cost of generality, and suffers from a number of system challenges. In this dissertation, we revisit the current system software stack to identify its bottlenecks, and propose a novel general-purpose system architecture that concurrently executes the diverse classes of real-time batch and stream graph analytics on a complex multi-stream evolving graph by defining new systems abstractions and design conventions. First, OmniGraph proposes a novel data model that handles the heterogeneity of the complex multi-stream evolving graph by partitioning edges based on their relationship types. 
Then GraphOne proposes a highly efficient graph data store for one such type of edge using main memory and NVMe SSDs. It offers new data visibility abstractions that unify the various classes of analytics irrespective of whether they require fine-grained or coarse-grained ingestion, and GraphView APIs for performing real-time batch and stream analytics from the same data store. To further optimize batch analytics at the scale of trillion-edge graph data, G-Store proposes a better data representation, a graph-specific caching policy, and a locality-inspired data layout on the multi-SSD volume. Finally, Falcon is a new kernel IO stack that achieves IO scalability on NVMe and multi-SSD volumes and can offer native IO performance to graph analytics engines. Additionally, these works have identified many systems abstractions and design conventions that are bottlenecks to achieving the desired performance for graph analytics, and have proposed new ones that not only improve systems understanding for graph analytics engines but are also applicable to many other use cases.