SPARTI: Scalable RDF Data Management Using Query-Centric Semantic Partitioning

SIGMOD Semantic Big Data |

Presentation (ppt)

Semantic data is an integral component for search engines that provide answers beyond simple keyword-based matches. Resource Description Framework (RDF) provides a standardized and flexible graph model for representing semantic data. The astronomical growth of RDF data raises the need for scalable RDF management strategies. Although cloud-based systems provide a rich platform for managing large-scale RDF data, the shared storage provided by these systems introduces several performance challenges, e.g., disk I/O and network shuffling overhead. This paper investigates SPARTI, a scalable RDF data management system. In SPARTI, the partitioning of the data is based on the join patterns found in the query workload. Initially, SPARTI vertically partitions the RDF data, and then incrementally updates the partitioning according to the workload, which improves the query performance of frequent join patterns. SPARTI utilizes a partitioning schema, termed SemVP, that enables the system to read a reduced set of rows instead of entire partitions. SPARTI proposes a budgeting mechanism with a cost model to determine the worthiness of partitioning. Using real and synthetic datasets, SPARTI is compared against a Spark-based state-of-the-art system and is shown to execute queries around half the time over all query shapes while maintaining around an order of magnitude enhancement in storage requirements.