Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores

Virendra Marathe, Alex Kogan, , Samer Al-Kiswany

14 May 2020

This paper presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior work in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach, which strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data loss window caused by a disaster to milliseconds. These goals pose a significant set of challenges related to consistency of the backup site’s state, failures, and scalability. Slogger employs a combination of asynchronous log replication, intra-data center synchronized clocks, pipelining, batching, and a novel watermark service to address these challenges. Furthermore, Slogger is designed to be deployable as an “add-on” module in an existing distributed data store with few modifications to the original code base. Our evaluation, conducted on Slogger extensions to a 32-sharded version of LogCabin, an open source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds, which is near the optimal value in our evaluation setup. Moreover, Slogger reduces the length of the data loss window by 50% compared to an incremental snapshotting technique, without imposing any performance penalty on the primary data store. Our experiments also demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.


Venue: 46th International Conference on Very Large Data Bases (VLDB)