Online Maintenance of Very Large Random Samples on Flash Storage

Suman Nath; Phillip B. Gibbons

VLDB Journal, vol. 19, issue 1, January 2010

Special Issue for VLDB 2008 Best Papers

Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash storage has many unique characteristics that make existing data management/analytics algorithms designed for magnetic disks perform poorly with flash storage. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage to store self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of an arbitrary size. Finally, we present an evaluation with flash storage that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that “semi-random” writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
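
To illustrate why a baseline such as reservoir sampling is awkward on flash, the following is a minimal in-memory sketch of the textbook reservoir sampling algorithm (Algorithm R). It is not the paper's B-File and not the flash-friendly variant the authors evaluate. The function name, the `rng` parameter, and the `overwrites` counter are assumptions introduced here for illustration. The point to notice is that every accepted item replaces a uniformly chosen slot, which on flash would be a random in-place update.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Textbook reservoir sampling (Algorithm R), kept entirely in RAM.

    Returns (sample, overwrites). `overwrites` is a hypothetical counter
    added for illustration: it counts how many times an existing slot
    was replaced in place.
    """
    rng = rng or random.Random()
    reservoir = []
    overwrites = 0
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir sequentially.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                # In-place replacement of a uniformly random slot:
                # the access pattern that is slow on flash.
                reservoir[j] = item
                overwrites += 1
    return reservoir, overwrites


if __name__ == "__main__":
    sample, overwrites = reservoir_sample(range(100_000), 1_000, random.Random(0))
    print(len(sample), "items kept;", overwrites, "random in-place overwrites")
```

If the reservoir were stored on flash instead of in a Python list, each of those overwrites would target an unpredictable location, which is the kind of random write the abstract describes as orders of magnitude slower than a sequential write.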

All articles published in this journal are protected by copyright, which covers the exclusive rights to reproduce and distribute the article (e.g., as offprints), as well as all translation rights. No material published in this journal may be reproduced photographically or stored on microfilm, in electronic data bases, video disks, etc., without first obtaining written permission from Very Large Data Bases Endowment Inc.