Impact of Disk Corruption on Open-Source DBMS

http://www.cs.wisc.edu/adsl/Publications/corrupt-mysql-icde10.pdf

Sriram Subramanian, Yupu Zhang, Rajiv Vaidyanathan, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Jeffrey F. Naughton
Department of Computer Science
University of Wisconsin, Madison
{srirams,yupu,vaidyana,haryadi,dusseau,remzi,naughton}@cs.wisc.edu

Abstract: Despite the best intentions of disk and RAID manufacturers, on-disk data can still become corrupted. In this paper, we examine the effects of corruption on database management systems. Through injecting faults into the MySQL DBMS, we find that in certain cases, corruption can greatly harm the system, leading to untimely crashes, data loss, or even incorrect results. Overall, of 145 injected faults, 110 lead to serious problems. More detailed observations point us to three deficiencies: MySQL does not have the capability to detect some corruptions due to lack of redundant information, does not isolate corrupted data from valid data, and has inconsistent reactions to similar corruption scenarios.

To detect and repair corruption, a DBMS is typically equipped with an offline checker. Unfortunately, the MySQL offline checker is not comprehensive in the checks it performs, misdiagnosing many corruption scenarios and missing others. Sometimes the checker itself crashes; more ominously, its incorrect checking can lead to incorrect repairs. Overall, we find that the checker does not behave correctly in 18 of 145 injected corruptions, and thus can leave the DBMS vulnerable to the problems described above.

I. INTRODUCTION

Disks corrupt data. Although it is well known that entire disks fail [29], [33], recent studies have shown that disks also can corrupt data that they store [4]; in a study of 1.5 million drives over three years, Bairavasundaram et al. found that nearly 1% of SATA drives exhibited corruption.
More reliable SCSI drives encounter fewer problems, but even within this expensive and carefully-engineered drive class, corruption still takes place.

Fortunately, RAID vendors have employed increasingly sophisticated corruption detection and recovery techniques in order to combat disk corruption [6], [8]. For example, by placing a checksum of data with every block, one can detect whether a block has become corrupted; once detected, the corruption can be repaired by accessing another copy of the block or parity information.

Unfortunately, these schemes are not always enough. Recent work has shown that even with sophisticated protection strategies, the “right” combination of a single fault and certain repair activities (e.g., a parity scrub) can still lead to data loss [19]. Thus, while these schemes reduce the chances of corruption, the possibility still exists; any higher-level client of storage that is serious about managing data reliably must consider the possibility that a disk will return data in a corrupted form.

In this paper, to better understand the possible problems caused by disk corruption, we first observe its impact on a database management system, the MySQL DBMS, through a series of fault injections. By carefully injecting corruptions into a running MySQL server, we can evaluate how MySQL deals with disk corruption. More specifically, we want to answer three questions: Can MySQL detect disk corruption properly? Can MySQL keep running and return valid data despite the presence of corruption? What can we infer about the framework for corruption handling in MySQL?

We find that corruption can be quite damaging, leading to system crashes, data loss, and even incorrect results. Overall, of 145 injected faults, 110 lead to serious problems. More detailed observations point us to three deficiencies in MySQL. First, MySQL ignores some corruptions; some are detectable but ignored and some are undetectable due to lack of redundant information in its data structures.
Second, MySQL does not isolate corruption from valid data; a corrupt record can make other valid records inaccessible. Finally, MySQL has widely inconsistent corruption handling, which leads us to the conclusion that MySQL does not employ a proper framework for corruption handling.

Since it is hard to detect all corruptions online, a DBMS requires some form of offline corruption detection and repair tools. In the world of file systems, offline tools such as the ubiquitous file system checker (fsck) are used in this capacity today [21]. Originally conceived to help file systems recover from untimely crashes, fsck has remained a useful tool to help the file system recover from unexpected corruption in file system metadata. By carefully combing through the on-disk image, fsck can find and fix many small problems.

One might think that a DBMS does not need such a tool, as concurrency control and recovery have been carefully developed to handle many similar problems [24]. However, while concurrency control and recovery avoid many corruption scenarios and are critical in recovering from others, in general they are not able to catch or repair corrupted metadata resulting from disk malfunctions. The evidence from the marketplace unfortunately confirms this reality; many tools exist to detect and repair these kinds of errors in commercial DBMSs, including SQL Server [22], Oracle [26], [27], [28], and DB2 [16], [17]. To an unfortunate extent, neither the problems these tools address nor the approaches they employ in their solutions have appeared in the research literature. A partial explanation for this might be that such tools require detailed knowledge of proprietary aspects of the DBMSs and how they store and manage their metadata.
However, the recent explosion in popularity of open source DBMSs such as PostgreSQL [37] and MySQL [39] has changed the landscape, and it is now possible for the research community at large to explore issues related to metadata corruption, and to do so in the meaningful context of substantial systems with large user communities.

Therefore, in this paper, we also study how effective offline checkers are at detecting corruption in an on-disk image of a database. Specifically, we examine the robustness of myisamchk, an offline checker for MySQL. We find that the offline checker is not comprehensive in the checks it performs, misdiagnosing many corruption scenarios and missing others. Sometimes the checker itself crashes. More ominously, its incorrect checking can lead to incorrect repairs. Overall, we find that myisamchk does not behave correctly in 18 of 145 injected corruptions, and thus can leave the DBMS vulnerable to the problems described above, including unexpected crashes, data loss, and incorrect results.

Thus, in this paper, we make two explicit contributions:
• We perform the first study of the effect of corruption on a running database (MySQL), and find that corruption can cause great harm (Section IV).
• We perform the first study of the ability of an existing offline checking tool (myisamchk) to detect corruption in on-disk structures, and find that it misses many significant cases (Section V).

Before describing each of our two main contributions, we first present the background and related work (Section II) and then our fault injection methodology (Section III). After the main body of the paper, we conclude with future directions (Section VI).

II. BACKGROUND & RELATED WORK

In this section, we provide a brief background on disk failures, with a focus on corruption.
We then discuss why RAID is not a complete solution in dealing with corruption. After that, we briefly discuss the state of the art of both online and offline corruption detection techniques within file systems as well as the current approach within DBMS.

A. Disk Corruption: Why It Happens?

We broadly define disk corruption as something that occurs when one reads a block of data from the disk and receives unexpected contents (e.g., the contents are not what were previously written to that location). Thus, the read “succeeds” (i.e., the disk does not return a failure code) but the data within the block is not as expected. For this reason, corruption is sometimes referred to as a silent error.

Disk corruption can occur for a multitude of reasons. One cause comes from the magnetic media: the classic problem of “bit rot”, which occurs when the magnetism of a single bit or a few bits is flipped. This type of problem can often (but not always) be detected and corrected with low-level ECC embedded in the drive.

Interesting errors also arise in the disk controllers due to their complexity; modern Seagate drives contain hundreds of thousands of lines of low-level firmware code that manage the operation of the disk [30]. This complexity can lead to a number of bugs which manifest as corruption.

One example of such a bug is a lost write (or phantom write), where a disk reports that a write has completed but in fact the data was never written to the disk [40]. The next time a client reads such a block, it will receive the old contents, and thus perceive the problem as a corruption. A misconfigured drive can also result in lost writes; for example, if a drive cache is set to write-back mode (instead of write-through), a write will be acknowledged when it is put into the disk cache but before it has been written to disk. If power is lost before the actual write to the media surface, the write is seemingly lost.

A similar problem is known as a misdirected write [44].
In this case, the controller writes the data to disk but to the wrong location. A misdirected write can thus lead to two perceived corruptions: one where the block should have been written, and one where the block was accidentally written. In either case, subsequent reads receive the “wrong” contents.

There are other causes of perceived disk corruption. For example, as data sits in the main memory of a host system, a bad DRAM could corrupt the data [20]; although it is written correctly to disk, the data is corrupt when written and will later be perceived as such. Similarly, buggy operating systems software [10], [12] could accidentally overwrite the in-memory data before writing it to the disk, again leading to a subsequently perceived corruption.

B. Disk Corruption: How Often It Occurs?

Until recently, there was very little data on how often corruption arose in modern storage systems. Although there was much anecdotal information [6], [40], [44], and a host of protection techniques that systems employ to handle such corruption [19], there was little hard data.

Recently, a study by Bairavasundaram et al. demonstrated that corruption does indeed occur across a broad range of modern drives [4]. In that study of 1.5 million disk drives deployed in the field, the authors found that more than 400,000 blocks had checksum mismatches over three years. They also found that nearline disks develop checksum mismatches an order of magnitude more often than enterprise class disk drives. Furthermore, checksum mismatches within the same disk show high spatial and temporal locality, and checksum mismatches across different disks in the same storage system are not independent. The data shows that corruption takes place, and systems must be prepared to handle it.

C. Doesn’t RAID Help?

The end-to-end argument states that failure recovery must be done at the highest level; protection mechanisms at lower levels may improve performance but fundamentally do not solve the desired problem [32].
In the world of storage systems, it would be ideal if RAID storage [29] guaranteed that data was not corrupt. Although we believe that RAID can indeed improve DBMS reliability, it is not a complete solution for the following reasons.

First, RAID is designed to tolerate the loss of a certain number of disks or blocks (e.g., RAID-5 tolerates one, and RAID-6 tolerates two), but not to identify corruption. For example, in RAID-5, if a block in a parity set is corrupt, the parity computation will be incorrect, but which block is corrupt cannot be identified with RAID alone.

Second, ironically, commercial RAID systems also corrupt data; a recent paper by Krioukov et al. demonstrates how most commercial RAID-5 designs, which should be able to tolerate the loss of any one disk or block, have flaws where a single block loss leads to data loss or silent corruption [19].

Finally, not all systems incorporate more than one disk. For example, consider a typical commodity system running with a single disk drive; in such systems, there is essentially no protection against most forms of corruption described above.

D. Doesn’t Checksumming Help?

Checksumming techniques have been used in numerous systems over the years to detect data corruption [6], [11], [36], [38], [40]. For example, Tandem systems have long employed checksums [6]. When a block is read from disk, so too is its stored checksum. A checksum is then computed over the data block and compared to the stored checksum; if the two do not match, the block is declared corrupt and recovered from a mirror copy. As with RAID, although checksumming can improve corruption detection, it is still not a complete solution, for three reasons.

First, memory is not perfect. For example, a bit-flip in memory before a checksum is computed could lead to a corrupt block being written to disk; the disk system will safely store the corrupted block. A recent, large-scale field study by Schroeder et al.
emphatically shows that bit-flips do occur [34].

Second, software is not perfect; large code bases are typically full of bugs [10], [45]. Some of those bugs may indeed corrupt data before it is written to disk, and again may thus survive despite checksum and RAID protections.

Lastly, Krioukov et al. also show that checksumming does not protect against complex failures such as torn writes, lost writes, and misdirected writes [19].

E. The File System Approach

Many high-end file systems claim that they have support for corruption handling. However, little is known about their robustness due to the proprietary nature of these systems. With open-source file systems, there is room for evaluation. For example, Prabhakaran et al. presented the details of corruption detection in several commodity file systems [30], including the Linux ext3 file system [42], ReiserFS [31], IBM JFS [7], and Windows NTFS.

In many cases, they found that these file systems are able to detect metadata corruption in the absence of checksums. Their approach is to store implicit redundant information to cross-check metadata consistency. For example, file systems such as ReiserFS [9] and XFS [41] store page-level information in each internal page of a B-Tree. Thus, a corrupt pointer that does not connect two pages in adjacent levels can be detected by checking the page-level information. These file systems show that some redundant information can be useful for online cross-checking without imposing significant overhead.

Although many corruptions are detected, Prabhakaran et al. also found that in some cases these file systems fail to check the integrity of their own metadata. They show that the undetected corruptions result in system crashes, the spreading of corruption, and unmountable file systems [30].

To remedy this problem, file system developers created offline tools to scan and repair file system metadata that was inconsistent. The classic repair tool is fsck [21].
Despite the presence of RAID and checksumming, fsck remains useful even today; high-end file systems also have their own offline checkers. Some new file systems have tried to make do without an offline checker, e.g., SGI’s XFS famously was said to have “no need to fsck, ever” [13], but soon introduced such a tool to handle corruptions that were observed in the field.

Unfortunately, building a robust checker is not straightforward. An analysis of the Linux ext2 checker by Gunawi et al. shows that some important repairs are missing, leaving some corruptions unattended, and some repairs are incorrect, making the file system more corrupt and sometimes unusable [15].

F. The DBMS Approach

In the DBMS world, there have been many reports of database corruptions [1], [2]. In many cases, the sources of the corruptions are hard to pinpoint, and hence are not reported. Nevertheless, the fact that error messages such as “Database page corruption on disk” appear in the error logs suggests that database systems read corrupt contents from the disk. But again, the research literature has not extensively addressed how running databases deal with such corruption.

The presence of offline tools to scan and repair database metadata is also less clear. Some tools exist [16], [17], [22], [26], [27], [28]. The existence of the tools certainly indicates that databases are corrupted in practice, despite the presence of RAID and checksums. However, due to the proprietary nature of these database systems and their on-disk formats, there is little published on the details of how these offline check and repair tools work.

Evaluations of open-source file systems have unearthed many weaknesses in the ways modern file systems deal with corruption [15], [30]. However, to the best of our knowledge, there has been no similar published study in the DBMS literature.
As open-source database systems such as PostgreSQL [37] and MySQL [39] have become both popular and important, we believe a new opportunity has arisen to both evaluate the state of the art of database checking and potentially to improve it. The open nature of these systems makes evaluation possible, and in this paper (Sections IV and V), we demonstrate how fault injection can be used to assess the resilience of MySQL (in particular) to various types of corruption.

III. METHODOLOGY

An integral part of ensuring the long-term availability of data is ensuring the reliability and availability of pointers and format information. Pointers are fundamental to the construction of nearly all data structures, while format information is critical for the correctness of reading the data and metadata. This observation is especially true for database management systems, which rely on pointers to access data correctly and efficiently, and on format information to determine how to parse both metadata and data. Unfortunately, as mentioned in the previous section, information stored on a disk can be corrupt. A robust DBMS should detect and repair such corruption of its metadata.

One difficulty with a pointer-corruption study is the potentially huge exploration space for corruption experiments. To deal with this problem, we utilize a fault injection technique called type-aware pointer corruption (TAC) [5]. TAC reduces the search space by systematically changing the value of only one pointer of each type in the DBMS, then exercising the DBMS and observing its behavior. We further narrow the large search space by corrupting the pointers to refer to each type of data structure, instead of to random values. For example, rather than corrupting a B-Tree pointer to point to a random page, we introduce types to the pages (e.g., grand-child, sibling, parent page), and then change the pointer to point to different types of pages.

TAC simulates field-level corruption.
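Field-level corruption of this kind amounts to overwriting a single pointer field in an on-disk image while leaving every other byte untouched. The following is a minimal sketch of that idea, not the authors' injection framework: the function name `corrupt_pointer`, the 4-byte big-endian encoding, and the field offset are all assumptions chosen for illustration and do not describe the MyISAM layout.

```python
def corrupt_pointer(image: bytearray, offset: int, new_page: int, width: int = 4) -> int:
    """Overwrite one pointer field in place and return the value it held before.

    Assumptions (hypothetical, not the MyISAM layout): the pointer is an
    unsigned big-endian integer of `width` bytes stored at `offset`.
    """
    if offset < 0 or offset + width > len(image):
        raise ValueError("pointer field lies outside the image")
    old_page = int.from_bytes(image[offset:offset + width], "big")
    image[offset:offset + width] = new_page.to_bytes(width, "big")
    return old_page


# A toy 16-byte "page" whose pointer field at offset 8 refers to page 7.
page = bytearray(16)
page[8:12] = (7).to_bytes(4, "big")

# Redirect the pointer to page 2; only bytes 8..11 of the image change.
previous = corrupt_pointer(page, offset=8, new_page=2)
```

Returning the previous value makes it easy to restore the image between experiments, so each injected fault can be studied in isolation.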
As mentioned in Section II-A, different problems can lead to different types of corruption. For example, a misdirected write can corrupt everything on a page, not just a particular field. This type of page-level corruption can be simulated as well by slightly extending our fault-injection framework. So far, we have only considered field-level corruption as it allows for detailed analysis of the system’s responses to different field corruptions.

To exercise the DBMS as thoroughly as possible, another challenge is to coerce the DBMS down its different code paths to observe how each path handles corruption. This requires that we run workloads exercising all relevant code paths in combination with the induced faults. In this paper, we only focus on read workloads. Specifically, we run three kinds of queries: single selection queries (e.g., WHERE field = X), range selection queries (e.g., WHERE field BETWEEN X AND Y), and full table scans. By running different queries, we can analyze how the injected corruptions affect different workloads.

Section IV presents the results of our online pointer and format corruptions for MySQL. Specifically, we inject corruptions when the server is running and observe if it detects and handles the corruptions. Unfortunately, some corruptions are not detected online. Thus, we then inject the same corruptions and analyze whether the MySQL offline checker is able to detect them (Section V). As we will see, even the offline checker fails to detect some of the corruptions, thus leaving the DBMS vulnerable to on-disk corruption.

[Figure 1: diagram of an index B-Tree showing the root, page A (Myself), its parent (page B), the target pages C to K (grand-child, uncle, cousin, nephew, and leaf pages to the left and right of page A), the keys key1 and key2, and the LP, MP, and RP target and corrupted pointers.]

Fig. 1. B-Tree pointer corruption. The graph above shows key-pages of an index B-Tree. Each box represents a key-page which contains a set of key-pointer pairs. A pointer is a page number. We corrupt three target pointers of a non-leaf page (page A): left-most (LP), middle (MP), and right-most (RP) pointers. For example, the LP pointer can be corrupted to point to the parent page (page B).

All experiments except where specified are performed on the MyISAM Storage Engine of MySQL version 5.0.67 running on the Linux 2.6.12 operating system. We have not tested MySQL with other storage engines. In total we have injected 145 corruption scenarios. Due to the sheer volume of experimental data, it is difficult to present all results for the reader’s inspection. We try to present the complete results of our analysis in tables (for those interested in all the data), and then provide qualitative summaries of the results that are presented within the tables.

IV. ONLINE CORRUPTION

Despite the presence of corruption, we expect a running DBMS to be highly reliable and available. More specifically, we expect a reliable DBMS to have a strong mechanism for detecting disk corruption such that corrupt metadata is not wrongly used by the DBMS. Moreover, to be highly available, a DBMS has to keep running and return as much valid data as possible to the users. To see how MySQL stands with respect to these issues, we pose three questions that relate to reliability, availability, and framework for corruption handling:

1) Can MySQL detect disk corruption properly?
2) Can MySQL keep running and return valid data despite the presence of corruption?
3) Based on our results, what can we infer about the framework for corruption handling in MySQL?

To answer these questions, we first present the results of our fault-injection experiments on a running MySQL server (Section IV-A). Then, we answer the questions by presenting our qualitative observations on the results (Section IV-B). Finally, we conclude this section and present some preliminary results for PostgreSQL (Section IV-C).

(a) Single selection query
      A   B   C   D   E   F   G   H   I   J   K   L   M
MP    ×   ×   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   √   √
LP    ×   ×   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   √   √
RP    ×   ×   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   √   √

(b) Range selection query
      A   B   C   D   E   F   G   H   I   J   K   L   M
MP    ×   ×   ≠   √   √   √   √   ≠   ≠   ≠   ≠   √   √
LP    ×   ×   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   √   √
RP    ×   ×   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   ≠   √   √

TABLE I
Online detection of B-Tree pointer corruption. The tables above report the results of our B-Tree pointer corruption. The results depend on the query that is executed. The first and the second tables show the results of a single and a range selection query respectively. The left-most column shows the pointers that we corrupt (i.e., MP, LP, and RP, as illustrated in Figure 1). The header row lists the new pages (i.e., pages A to M) that the corrupted pointer now points to. “√” marks that the corruption is detected; for example, when MP points to an out-of-bound page (M). “×” represents a server crash; for example, when a cycle is introduced when MP points to itself (page A). “≠” implies that the server returns the wrong results to the user; for example, when MP points to its grand-child (page C), records made inaccessible by this corruption are not returned to the user.

A. Results

In this section, we present the results of our online pointer and format corruptions. For pointer corruption, we corrupt the B-Tree, record, and overflow pointers. For format corruption, we corrupt the format information stored in the index and data files. For each corruption case, we describe the MySQL data structures that we corrupt, our findings and observations. In all cases, we find that the presence of corruption would lead to server crashes, data loss, or even incorrect results.

1) B-Tree Pointer Corruption: The first class of pointer corruption that we inject is B-Tree pointer corruption. For each database table, MySQL manages three files: an index file (.MYI), a format file (.FRM), and a data file (.MYD).
For each index defined on a table, MySQL stores a B-Tree in the index file in the form of key-pages (we also refer to a key-page as a page). A page is usually 1 KB. The index file has a header page (index file header) that has pointers to the root pages of all B-Trees in the index file. A key-page contains a header (page header), describing the key-page, and a set of key-value pairs where the value carries two pointers: a key-pointer (i.e., page number) which points to a child page and a record-pointer which points to the corresponding record stored in the data file. In this experiment, we corrupt the key-pointer by making it point to another page and observe how MySQL handles this class of corruption while it is running.

Figure 1 illustrates a 5-level B-Tree. We corrupt three distinct pointers: the left-most (LP), middle (MP), and right-most (RP) pointers of a non-leaf page (page A). To exercise corruption scenarios, we corrupt these pointers. To reduce the corruption space, we identify eleven categories of pages (pages A to K as shown in Figure 1). For example, we corrupt the left-most pointer (LP) to point to: the parent (page B), left cousin (page E), left nephew (page F), and so on. To be able to detect the corruptions, two keys that wrap the middle pointer (key1 and key2) can be utilized.

Table I summarizes our results. In addition to corrupting the three pointers to point to pages A to K, we also force them to point to pages belonging to the index file header (page L) and to out-of-bound pages (page M). To analyze how the corruptions affect different workloads, we also run two types of queries: single and range selection queries. In total, we have injected 39 B-Tree pointer corruption scenarios. Unfortunately, MySQL does not detect and handle many of these corruptions online; MySQL returns wrong results to users, or the server crashes.
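The count of 39 follows from pairing each of the three target pointers with each of the thirteen target pages (A to K, plus L and M). As an illustrative sketch only, the scenario matrix can be enumerated as below; the names `POINTERS`, `TARGET_PAGES`, and `btree_scenarios` are hypothetical and are not taken from the authors' tooling.

```python
from itertools import product

# The three pointers of page A that are corrupted (see Figure 1).
POINTERS = ("LP", "MP", "RP")

# Target pages: A..K are the page categories of Figure 1,
# L is an index-file-header page, M is an out-of-bound page.
TARGET_PAGES = tuple("ABCDEFGHIJKLM")


def btree_scenarios():
    """Return every (pointer, target page) pair as one corruption scenario."""
    return list(product(POINTERS, TARGET_PAGES))


scenarios = btree_scenarios()
print(len(scenarios))  # 3 pointers x 13 target pages = 39
```

Each scenario is then exercised under both query types, which is why Table I reports one 3-by-13 matrix per query.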
Below, we further explain the results.

Detected error (√): Out of the 39 scenarios, MySQL detects only 6 or 10 of them, depending on whether we execute a single or a range selection query. Most of the corruptions detected are those where pointers point to an out-of-bound page (M) or to a page belonging to an index file header (L). The former is easily detected because reading an out-of-bound page will result in a low-level read error. The latter is detected because MySQL always checks the key-page header, specifically the length of used key-value pairs in the page (which should always be greater than 4 bytes and less than 1 KB). Pages that belong to the index file header have different structures that always store “0xFE 0xFE” at the same byte offset. Thus, MySQL can detect that they are not valid key-pages.

      00    01    02    03    04    05    06    0D    Data   Out-of-bound
05    √•×   √•×   √•×   √•×   √•×   √•×   √•×   √•×   √♠     √♣
06    √•×   √•×   √•×   √•×   √•×   √•×   √•×   √•×   √♠     √♣
0D    √•×   √•×   √•×   √•×   √•×   √•×   √•×   √•×   √♠     √♣
0B    √•×   √•×   √•×   √•×   √•×   √•×   √•×   √•×   √♠     √♣
0C    √•×   √•×   √•×   √•×   √•×   √•×   √•×   √•×   √♠     √♣

TABLE II
Online detection of overflow pointer corruption. The table above reports the results of our overflow pointer corruption. “×” marks that the server hangs and “√” represents that the corruption is detected. When an overflow corruption is detected, depending on the type of the corruption, MySQL reacts differently: sometimes it does not return any valid data and marks the corresponding table as crashed (♠), sometimes returns partial valid data and kills the executed query (♣), and sometimes returns all valid data and skips the corrupt record without notifying the user of the corruption (•).

Wrong results (≠): In many cases, MySQL blindly trusts the corrupt pointers. As a result, incorrect results are returned to users. Specifically, users could get empty records or the wrong number of records (since portions of the tree are silently lost).
For example, this can happen when the middle pointer points to its grand-child page (e.g., MP points to page C). All the keys in the grand-child page are valid with respect to key1 and key2. Thus, the corruption is not easily detectable and some portions of the B-Tree (other pages reachable from the MP page) are not reachable anymore.

Server crashes (×): Finally, MySQL does not anticipate a cycle; when the three target pointers are corrupted to point to the page where they are stored (page A) or to the parent of that page (page B), the MySQL server does not detect the created loop. The server keeps calling the search routine on the same pages indefinitely. This routine only stops when the result is found or not found. Since the MySQL server does not track the previous pages that have been traversed, this loop causes an infinite traversal that eventually causes a stack overflow. The server crashes, and a lost connection error occurs.

2) Record Pointer Corruption: In our next set of experiments, we inject record pointer corruption. Record pointers are stored in key-pages in the index file. A record pointer of a key-value pair points to the actual record that holds the corresponding key. We have injected numerous corruptions. Here, we briefly describe the interesting results.

First, we created a table with fixed-size records with an auto-incremented key and corrupted a record pointer of a key-value pair in the index file such that it points to another record in the table. Thus, the key stored in a corrupt key-value pair does not match the key stored in the record that it points to.
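The mismatch just described is, in principle, detectable by comparing the key in each index entry with the key stored in the record it points to. The sketch below illustrates such a cross-check on a toy in-memory model; it is not MySQL's code, and `IndexEntry`, `record_number`, and `mismatched_entries` are hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    key: int            # key stored in the index key-value pair
    record_number: int  # record pointer (fixed-size records: a record number)


def mismatched_entries(index: List[IndexEntry], record_keys: List[int]) -> List[IndexEntry]:
    """Return index entries whose key differs from the key in the pointed-to record."""
    return [entry for entry in index if record_keys[entry.record_number] != entry.key]


# Toy table: records 0..9 hold the keys 0, 100, ..., 900.
record_keys = list(range(0, 1000, 100))
index = [IndexEntry(key, number) for number, key in enumerate(record_keys)]

# Inject the fault: the entry for key 500 now points to the record holding key 600.
index[5] = IndexEntry(key=500, record_number=6)

suspect = mismatched_entries(index, record_keys)
```

Here `suspect` contains only the corrupted entry, because every other index key still agrees with its record.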
For example, we take a key-value pair with a key of 500 and corrupt it by making the record pointer point to a record with a key of 600.

We ran a single selection query on the key (SELECT * WHERE key = 500), and the server behaves correctly; the server returns an empty result. We suspect that MySQL checks the record pointed to by the corrupt key-value pair and finds that it does not have the same key.

We observed a different behavior when we ran a range selection query (e.g., selecting records with keys between 450 and 550); MySQL only returns a subset of the records, specifically records with keys between 450 and 499. MySQL always trusts the key stored in the record; when the B-Tree traversal hits key 500, it finds that the key in the record is 600, which is larger than the end of the range query (550). Thus, the server stops traversing the B-Tree and only returns a subset of the records. This confirms that when a range selection query is executed, MySQL never checks whether the key in the record differs from the key in the key-value pair.

Another interesting result is when we corrupt the record pointer of a dynamic (variable-length) record. With dynamic records, the record pointer is a byte offset, which implies that it can point to any byte in the data file. In this case, MySQL always checks the record information (e.g., record length) in the record header. In the case of a corrupt pointer, the record length is not as expected. MySQL returns an error code to users without giving any result. The error states that the table has been marked as crashed and should be repaired.

3) Overflow Pointer Corruption: Next, we inject pointer corruptions into the data file. With fixed-size records, the data file does not store any pointers because a record can be fetched given its record number. With the variable-length record format, a record cannot always span contiguous bytes. Thus, a record can be put in one or more frames. When a record is deleted, all the frames that it occupies are marked deleted.
When a record is inserted, it can reuse unused frames. If the new record does not fit in a frame, multiple frames are allocated for the record. Thus, in each frame, MySQL stores a pointer (the overflow pointer) to the next frame and a signature header that describes the frame. Only frames with hexadecimal signatures 05, 06, 0B, 0C, and 0D have an overflow pointer. This overflow pointer cannot point to all types of frames; a valid pointer can only point to a frame with a signature between 07 and 0C. More details can be found elsewhere [25].

Table II shows the result of our overflow pointer corruption. We inject corruptions that make an overflow pointer invalid. For example, a starting frame of a small record (05) should not point to a deleted frame (00). Furthermore, because an overflow pointer is a byte offset (i.e., it can point to any byte in the data file), we also force an overflow pointer to point to data and to an out-of-bound offset.

Fig. 2. Server hangs. The figure illustrates a corruption scenario that causes the MySQL server to hang. (Figure not reproduced; its labels are, in order, frames with signatures 01, 03, 01, and 07 for records #0, #1, #2, and #1, a valid pointer, and a corrupt pointer that creates an infinite loop.)

We found that MySQL detects all overflow pointer errors (√). However, depending on the corruption, different results are returned and different error messages are thrown. For example, if an overflow pointer accidentally points to data, MySQL is very conservative: it does not return any valid data (even though it has fetched some), but rather emits an error message stating that the table has been marked as crashed and should be repaired (♠). However, if an overflow pointer points to an out-of-bound offset, the server kills the executed query and returns only the valid records that have been fetched so far (♣). Finally, if an overflow pointer points to an invalid frame, the server detects the error, skips the corrupt record, and continues scanning the next record (•). The users then get all valid records, even those that are located after the corrupted record.
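The signature rules and the three kinds of invalid pointer described above can be summarized in a small classifier. The sketch below is only an illustration under stated assumptions: the data file is modeled as a hypothetical dictionary mapping each frame's starting offset to its signature, and the function name and category labels are invented here, not taken from MySQL.

```python
# Signatures of frames that carry an overflow pointer (from the text above).
HAS_OVERFLOW_POINTER = {0x05, 0x06, 0x0B, 0x0C, 0x0D}
# A valid pointer targets a frame whose signature lies between 07 and 0C inclusive.
VALID_TARGET_SIGNATURES = range(0x07, 0x0D)


def classify_overflow_pointer(frame_starts, file_length, source_sig, target_offset):
    """Classify the overflow pointer of one frame.

    frame_starts: hypothetical model, {offset of a frame header: signature}.
    file_length:  size of the data file in bytes.
    """
    if source_sig not in HAS_OVERFLOW_POINTER:
        return "no_pointer"
    if not 0 <= target_offset < file_length:
        return "out_of_bounds"
    target_sig = frame_starts.get(target_offset)
    if target_sig is None:
        # The offset falls inside a frame body rather than on a frame header.
        return "points_to_data"
    if target_sig not in VALID_TARGET_SIGNATURES:
        return "invalid_frame"
    return "ok"
```

A single classification step of this kind would let every caller react to the same corruption in the same way, instead of each code path producing its own outcome.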
In this case, the server does not propagate the error message to users.

Moreover, a certain scenario of overflow pointer corruption makes the server enter an infinite loop (×). Specifically, this happens on a full-scan query when an overflow pointer points to an invalid frame that is located before the frame that holds the overflow pointer. Figure 2 illustrates the bug. MySQL scans the variable-length frames one-by-one, looking for any starting frame. When there is an invalid overflow pointer (e.g., the starting frame of record #1 points to the starting frame of record #0), the corruption is detected from the given signatures. But, rather than moving to the next valid frame (i.e., record #2), MySQL scans the wrong next frame (i.e., record #1, which is the frame next to the invalid frame). In this case, the server gets stuck in an infinite loop.

Beyond the corruption scenarios shown in the matrix in Table II, we also performed a more specific fault injection: an overflow pointer is corrupted to point to a “valid” frame that actually belongs to another record. But, in MySQL, a frame does not hold information about its owner. Thus, it is not straightforward for MySQL to detect this corruption online. As a result, the corrupt record is presented to users like a valid record, except part of the data belongs to another record.

4) Index Format Corruption: We now corrupt important format information that is stored in the index file header, shown in the left column of Table III. This format information is crucial for parsing both metadata (e.g., keys, key-pointers, etc.) and data (e.g., columns). Due to space constraints, we do not provide the descriptions of the fields; their descriptions can be found elsewhere [25]. For each field, we corrupt the value to zero (0), a value less than the actual one (<), a value larger than the actual one (>), and the maximum possible value (Max). Format information is used differently depending on the query workload.
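The four corruption values per field can be generated mechanically. The following sketch is an assumption about how one might do this, not a description of the injection tool used in the experiments: the text does not say which smaller or larger value was chosen, so the sketch arbitrarily uses the neighbors of the actual value, and `injection_values` and `width_bytes` are hypothetical names.

```python
def injection_values(actual, width_bytes):
    """Return the corrupt values to try for one unsigned header field.

    Keys mirror the column labels 0, <, >, and Max. A case is omitted
    when it would coincide with the actual value or with another case.
    """
    max_value = (1 << (8 * width_bytes)) - 1
    if not 0 <= actual <= max_value:
        raise ValueError("actual value does not fit in the field")
    values = {}
    if actual != 0:
        values["0"] = 0
    if actual > 1:
        values["<"] = actual - 1  # assumed choice: any smaller nonzero value would do
    if actual + 1 < max_value:
        values[">"] = actual + 1  # assumed choice: any larger non-maximum value would do
    if actual != max_value:
        values["Max"] = max_value
    return values
```

For a two-byte field holding 100, this yields the four cases 0, 99, 101, and 65535, each of which would then be exercised under every query workload.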
Thus, we ran three types of query: full-scan, single selection, and range selection.

Table III depicts how various types of format corruption are handled in an inconsistent manner; some corruptions are detected (√), some are not. When a corruption is not detected, MySQL sometimes returns incorrect results to the user (≠), sometimes returns valid results (.), leaving the corruption unnoticeable, and sometimes crashes (×) in some unanticipated scenarios.

5) Record Format Corruption: In our final online experiment, we corrupt dynamic-record length information stored in the data file. MySQL is able to detect the discrepancy between the length of a record and the total length of its frames. MySQL tracks the cumulative length of the frames that have been fetched with respect to a record. If the cumulative length is larger than the record length, MySQL stops the query and returns only valid records that have been fetched so far. However, if the cumulative length is less than the record length, the server emits a hard error message saying that the table has been marked as crashed and should be repaired.

B. Observations

We now answer the questions we posed earlier in the paper. In short, our results have shown that MySQL does not detect all kinds of corruption that can arise, the MySQL server is not highly available in the midst of corruptions, and finally MySQL does not have a consistent framework for corruption handling. Below, we describe these observations in more detail.

1) Incomplete Detection: We find that MySQL ignores many corruptions, which leads to incorrect results being returned, crashes, and data loss. After further analysis, we find two reasons for these problems: in some cases MySQL ignores detectable corruptions and in some other cases MySQL does not have the ability to detect certain corruptions.

Ignored detectable corruptions: There are cases where corruption can be detected from implicit redundant information stored in MySQL data structures.
Thus, with some additional work, some corruptions are actually detectable. However, detectability does not always lead to detection, as we see in these three examples:

First, in B-Tree pointer corruption (Section IV-A.1), when a pointer is corrupted such that it points to a page not reachable from the parent page (e.g., MP points to pages D through K in Figure 1), MySQL could detect this by checking the keys with respect to key1 and key2. However, since MySQL does not perform such a check, incorrect results are returned (“≠” in Table I).

Second, a record pointer corruption (Section IV-A.2) should be easily detectable; the MySQL server could compare the key stored in the index with the one stored in the record. But, rather than utilizing this redundant information, MySQL always trusts the keys stored in the records. As a result, incorrect results are returned.

Third, in the index format corruption (Section IV-A.4), when the data file length specified in the state header is corrupted to zero, MySQL returns no result to the user without any error message, blindly believing that the data file is empty although the number of records stored in the state header can give the correct information. A similar situation occurs when the data file length is corrupted to half of the actual value; MySQL only scans half of the data file. Another example is when the server crashes because the record length stored in the base header is corrupted to a maximum value. These corruptions can actually be caught simply by verifying the same information stored in the format file.

                      Full scan       | Single selection| Range selection
Format info           0   <   >   Max | 0   <   >   Max | 0   <   >   Max
State header
  header length       √   √   .   √   | √   √   .   √   | √   √   .   √
  keys                √   √   √   √   | √   √   √   √   | √   √   √   √
  number of records   ≠   .   .   .   | ≠   .   .   .   | ≠   .   .   .
  data file length    ≠   ≠   √   √   | ≠   ≠   .   .   | ≠   ≠   √   .
Base header
  record length       √   √   ×   ×   | √   √   ×   ×   | √   √   ×   ×
  pack rec. length    √   √   ×   ×   | ≠   ≠   ≠   ×   | ≠   ≠   ≠   ×
  rec ref. length     √   .   √   √   | ×   √   √   √   | ×   √   √   √
  key ref. length     .   .   .   .   | ≠   √   ≠   √   | ×   √   ×   √
  max key blk len     .   .   .   .   | ×   .   .   .   | ×   .   .   .
  fields              .   .   √   √   | .   .   √   √   | .   .   √   √
Key def.
  key segments        √   √   √   √   | √   √   √   √   | √   √   √   √
  block length        .   .   .   .   | ×   .   √   √   | ×   .   √   √
Key segment
  length              .   .   .   .   | ≠   √   .   ×   | √   √   √   ×
Record info
  length              .   .   √   √   | .   .   √   √   | .   .   √   √

TABLE III. Online detection of format corruption. The table above reports MySQL corruption handling of different format corruptions. We corrupt a format value to zero (0), a value less than its actual value (<), a value larger than its actual value (>), and the maximum possible value (Max). This format information is stored in the index file header. “×” represents a server crash, “≠” implies that the server returns wrong results to the users, “.” marks that the corruption is silently ignored, and “√” marks that the corruption is detected.

Undetectable corruptions: We find that several corruptions are hard to detect because MySQL does not store enough implicit redundant information in its data structures. We find many instances of this issue:

First, in the B-Tree pointer corruption (Section IV-A.1), it is hard to verify that a pointer properly connects two pages in adjacent levels because a page does not store its page level. For example, if a pointer is corrupted such that it points to one of its grand-children (e.g., MP points to page C in Figure 1), MySQL cannot detect this easily.

Second, it is hard to detect an invalid root pointer because the index file header does not store the height of the B-Tree and the root page does not store its page level.
Thus, a root pointer that points to a non-root page is considered valid, leading to silent data loss (i.e., some pages connected from the original root page are no longer reachable). If the index file header stored the height of the B-Tree and each key-page had page-level information, their values could be cross-checked.

Third, it is difficult to catch a page in a B-Tree that points to a page belonging to another B-Tree because a page does not store information about which B-Tree it belongs to. A table can have more than one index; thus, more than one B-Tree can be saved in the same index file. A page in a B-Tree should not be allowed to point to a page belonging to another B-Tree. However, since the page does not specify owner information, such a corruption scenario is not detected. As a result, users get incorrect results or the server kills the executed query with an error message.

Fourth, in the overflow pointer corruption (Section IV-A.3), it is also hard to catch a frame in a record that points to a frame belonging to another record because a frame does not hold information about its owner. Thus, when a corrupt overflow pointer points to a “valid” frame that actually belongs to another record, MySQL cannot easily detect this corruption online. As a result, the corrupt record is presented to users like a valid record, except part of the data belongs to another record.

Fifth, in the index format corruption (Section IV-A.4), it is challenging to verify true leaf and non-leaf pages. The page header has a one-bit field that specifies whether the page is a leaf page (bit = 0) or a non-leaf page (bit = 1). When we corrupt the bit, thus making a non-leaf page a leaf page and vice versa, the server sometimes hits an infinite loop, sometimes returns an empty result to users, and sometimes detects incorrect keys due to incorrect parsing. Detecting this corruption is challenging if not impossible.
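The cross-check suggested above, a tree height in the index file header plus a level in every page, can be sketched as follows. This is a hypothetical layout, not the MyISAM format: the `level` field, the `Page` class, and `check_levels` are assumptions made for illustration, with leaves at level 0 and the root at level height minus one.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    level: int  # hypothetical redundant field: 0 for leaf pages
    children: list = field(default_factory=list)  # child page ids


def check_levels(pages, root_id, tree_height):
    """Return a list of problems found; an empty list means the levels are consistent."""
    root = pages.get(root_id)
    if root is None:
        return [f"root pointer refers to missing page {root_id}"]
    problems = []
    if root.level != tree_height - 1:
        problems.append(
            f"root page {root_id} has level {root.level}, expected {tree_height - 1}"
        )
    visited = set()
    stack = [root_id]
    while stack:
        page_id = stack.pop()
        if page_id in visited:
            continue
        visited.add(page_id)
        page = pages[page_id]
        for child_id in page.children:
            child = pages.get(child_id)
            if child is None:
                problems.append(f"page {page_id} points to missing page {child_id}")
            elif child.level != page.level - 1:
                problems.append(
                    f"page {page_id} (level {page.level}) points to page "
                    f"{child_id} (level {child.level})"
                )
            else:
                stack.append(child_id)  # follow only pointers that passed the check
    return problems
```

Because every followed pointer must lead exactly one level down, this check also flags a pointer to a grand-child, to the page itself, or to a parent, so the traversal cannot loop. A leaf/non-leaf bit could be validated against the same field, since a leaf would always have level 0.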
If only redundant information such as page level were stored in the page header, such detection would be straightforward.

In summary, MySQL should use the information available in its data structures to cross-check its metadata consistency to the greatest extent possible. Furthermore, our findings also show that adding extra information might be useful for corruption detection or even recovery. The file system story in Section II shows that adding implicit redundancy can be done efficiently.

2) Reduced Availability: A system crash reduces availability. Thus, failure should be avoided in most systems. Unfortunately, our experiments have shown that MySQL crashes in many cases of corruption.

Reduced availability also happens when MySQL fails to return valid data to users. When a minimal corruption occurs, we might wish that MySQL would give us as many valid records as possible. For example, if there is only one corrupt record (e.g., due to a corrupt overflow pointer), we might wish valid records were still accessible. However, that is not always the case in MySQL. In the overflow pointer corruption (Section IV-A.3), when an overflow pointer accidentally points to data, MySQL does not return any valid records (“♠” in Table II). When an overflow pointer points to an out-of-bound offset, the server returns only the valid records that have been fetched so far (“♣” in Table II). Hence, due to this inconsistent handling, a small corruption in MySQL can make a large number of records inaccessible.

To improve availability, corruption should be detected and isolated. Detection is crucial; our findings have shown that corrupt metadata can lead to crashes. Worse, it might lead to the propagation of the corruption. This result emphasizes that catching corrupt metadata is a crucial factor in increasing availability.
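The idea of detecting and then isolating a corruption can be illustrated with a scan that sets a damaged record aside, keeps going, and still reports what it skipped. This is a minimal sketch under assumed conventions, not MySQL behavior: the record encoding (one length byte followed by the payload) and the names `parse_record` and `scan_with_isolation` are invented for the example.

```python
def parse_record(raw: bytes) -> bytes:
    """Parse a toy record: first byte declares the payload length."""
    if not raw:
        raise ValueError("empty record")
    declared = raw[0]
    payload = raw[1:]
    if declared != len(payload):
        raise ValueError(f"declared length {declared}, found {len(payload)}")
    return payload


def scan_with_isolation(records, parse):
    """Full scan that isolates corrupt records instead of aborting.

    Returns (valid parsed records, list of (position, error message)).
    """
    good, errors = [], []
    for position, raw in enumerate(records):
        try:
            good.append(parse(raw))
        except ValueError as exc:
            # Isolate: remember the problem, then continue with the next record.
            errors.append((position, str(exc)))
    return good, errors
```

Returning the error list alongside the valid records avoids both extremes discussed in this section: the query is not killed by one bad record, and the corruption is not silently hidden from the user.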
Furthermore, after corrupt metadata is detected, the corruption and also the operation on the metadata should be isolated; more specifically, the operation should be able to continue processing other valid metadata.

3) No Framework for Corruption Handling: Finally, we believe that MySQL might not have a framework for corruption handling. This conclusion is suggested by its inconsistent reactions in handling corruption. We define inconsistent handling as the case where similar failure scenarios are handled differently. From our results, we find five cases of inconsistent handling, one in each class of corruption we injected:

First, in the B-Tree pointer corruption (Section IV-A.1), when we corrupt the middle pointer to point to any page reachable from the left-uncle, MySQL detects the corruptions (“√” in Table I-b when MP points to D, E, F, or G). However, when the middle pointer is corrupted to point to any page reachable from the right-uncle, MySQL does not detect the corruptions and delivers the wrong results to the users (“≠” in Table I-b when MP points to H, I, J, or K). These two cases are similar but handled differently. It turns out that, for the first case, MySQL “coincidentally” detects the corruption; the error message actually comes from the detection of an out-of-bound key-pointer due to the abnormal behavior of the search routine after it follows the corrupted middle pointer.

Second, in the record pointer corruption (Section IV-A.2), MySQL reacts to a corrupt record pointer differently depending on the executed query.
In the case of a single selection query, users get a correct (empty) result; in the case of a range selection query, users get wrong (partial) results without any errors thrown; in the case of a dynamic-length record, a hard error is thrown and no result is returned (not even the valid records). This shows that MySQL corruption handling is sometimes soft and sometimes hard.

Third, in the overflow pointer corruption (Section IV-A.3), depending on the corrupt value, MySQL gives widely different reactions, ranging from marking the table as crashed (♠ in Table II) to killing the executed query (♣), and sometimes silently returning without any error code (•).

Fourth, in the index format corruption (Section IV-A.4), Table III clearly depicts how format corruptions are handled in an inconsistent manner, depending on the workload and on the corrupt value. For example, when the key reference length in the base header is corrupted, sometimes the corruption is detected (√), but sometimes it is not. When the corruption is not detected, MySQL sometimes returns incorrect results to the user (≠) and at times crashes (×).

Fifth, in the record format corruption (Section IV-A.5), when a query hits a corrupt dynamic-record length, depending on the corrupt value, MySQL sometimes stops the query and returns only valid records that have been fetched so far, but sometimes emits a hard error message saying that the table has been marked as crashed.

In summary, we believe that MySQL does not have a proper framework for corruption handling. When inconsistent handling is observed, it usually implies that the corruption handling code is diffused throughout the code base [14], [30]. Such diffusion usually results in unpredictable and often undesirable fault-handling strategies, which can make debugging frustrating for humans [30].

C. Summary

We have found that MySQL does not detect and handle corruptions well.
We believe that the observations we have made are not specific to MySQL; in addition to MySQL, we have applied our fault injection method to PostgreSQL version 8.3, another open source DBMS. Our initial experiment shows that PostgreSQL has problems similar to those of MySQL. For example, in PostgreSQL, pages in the index file store left and right sibling pointers. When the right sibling pointer of a page is corrupted so that it points to one of its left sibling pages, a SELECT query on the table based on an index scan causes the server to hang as it hits an infinite loop. Beyond the scenario described above, we have also injected 24 more corruptions into PostgreSQL and found that 12 of them highlight the problems observed in this section.

V. OFFLINE CORRUPTION

Online detection of hundreds of possible corruption scenarios is often not feasible. One primary reason is that full cross-checks must be performed to detect all scenarios. Thus, a DBMS offline checker should be the last tool that catches all corruptions in the database. When a corruption has been detected by an offline checker, a repair utility can be run, thus restoring the tables to a consistent condition. However, if the offline checker misses some corruption scenarios, one would not run the repair utility and corrupt data can leak into the running system, which may cause more corruptions.

In this section, we analyze the robustness of the MySQL offline checker, myisamchk, in dealing with the same corruption scenarios we have injected in the online case. This checker runs in two modes: check and repair. In the first mode, myisamchk attempts to find all corruptions in the database, while in the second, it tries to rebuild the tables and index files. Thus, we pose two questions:

1) Can myisamchk find all corruptions in the database?
2) Can myisamchk correctly repair the database?

To answer these questions, we first present the results of our fault-injection experiments on myisamchk (Section V-A) and then summarize our observations (Section V-B).

A. Results

1) Check Mode: We have injected the same B-Tree and overflow pointer corruptions described in Sections IV-A.1 and IV-A.3. All cases except one are detected by myisamchk; myisamchk crashes when a left-most key points to the same page where the key is stored. More detailed observation shows that in many cases of detected corruptions, the error messages thrown do not precisely describe the injected corruptions. This suggests that the checks performed do not capture the actual corruptions. Hence, perhaps it is not surprising to discover a corner-case bug.

The most interesting findings of our offline experiments arose when we injected format corruptions (as in Section IV-A.4). As depicted in Table IV, the offline checker blindly trusts some format information. As a result, the checker crashes (×) when such information is not as expected. This crash is unacceptable because a checker should not trust any value it retrieves from the disk; its basic purpose is to find corrupt metadata. Other than this, Table IV also shows that many corruption scenarios are left undetected.

Format info           0   <   >   Max
State header
  header length       √   √   .   √
  keys                √   √   √   √
  number of records   √   √   √   √
  data file length    √   √   √   √
Base header
  reclength           √   √   .   .
  pack reclength      .   .   √   ×
  rec reflength       √   √   √   √
  key reflength       √   √   √   √
  max key blk len     ×   .   .   .
  fields              .   .   √   √
Key def
  keysegs             √   √   √   √
  block length        √   .   √   ×
Key segment
  length              √   √   √   ×
Record info
  length              .   .   √   √

TABLE IV. Offline detection of format corruption. The table reports myisamchk corruption handling of different format corruptions. “×”, “.”, and “√” represent a crash, an ignored corruption, and detection, respectively.

2) Repair Mode: When we inject format corruptions, we also find that the repair performed by myisamchk could be problematic.
For example, when the record length (“reclength”) specified in the base header of the index file is corrupted, myisamchk throws an error message saying that it found wrong records in the data file and suggests a repair. When the repair is finished, however, all records in the table are discarded and the record length still remains corrupted.

After studying the code, we determined the reason. In MySQL, the record length is essential to parsing records from the data file. However, myisamchk assumes that this field is always correct. Thus, once the field gets corrupted, myisamchk will never locate the corruption. Then, during the repair, myisamchk cannot read any record from the data file because it uses the wrong record length, thus leaving no record after the repair. In fact, this erroneous repair could be avoided by a simple fix, which makes use of the redundant information inside the data file itself and from the format file.

B. Observations

In summary, our results show that the offline checker myisamchk is far from robust; it does not catch all corruptions and it does not always repair the database correctly. Our observations point to the same issues faced by the running MySQL (Section IV-B). Mainly, some detectable corruptions are ignored and some corruptions are not detectable due to the lack of redundant information. As a result, the checker itself can crash and, even worse, an erroneous repair could happen.

The fact that myisamchk does not perform a complete set of checks is not surprising given the minimal implementation of the checker (under 2000 lines of code). A more detailed study shows that the checker performs only 15 checks, shown in Table V. Many important checks are either omitted or overlooked. Redundant information in the format file (e.g., column and key definitions, file size, record count, etc.) is not used to verify the consistency of the index file. B-Tree checks are also not comprehensive. For example, key-value pair comparison is done only at the per-page level; key-value ordering across siblings and between parent and child is not checked. Thus, there is room for improvement in building a more robust MySQL checker.

  #   Checks performed
  4   Checking data file: check validity of deleted block links, deleted frames, overflow pointers, size of deleted blocks
  9   Checking keys: check delete links (range-check and alignment), compare key-value pairs (range-check and alignment), check record-pointer, page length, auto-increment key
  2   Checking file sizes: check length of index and data file
 15   Total

TABLE V. Checks performed by myisamchk. The table summarizes the 15 checks performed by the MySQL offline checker.

VI. CONCLUSION

In the world of storage systems, it would be ideal if RAID storage guaranteed that data was not corrupt. Unfortunately, no such guarantee is possible (though techniques can make the odds of perceived corruption lower). Thus, a DBMS must, at the highest level, be responsible for the correctness of its data. This notion is particularly true of the DBMS metadata, which no client of the DBMS can even access; if the DBMS does not safeguard its own metadata, no other component can.

In this paper, we have begun the exploration of the data corruption problem in database management systems. We have shown that the MySQL and PostgreSQL DBMSs do not tolerate such faults particularly well, and that the MySQL offline checker catches some but not all corruptions, thus leaving the system susceptible to corruption if it arises.

However, we believe our work is only the first step towards the “hardening” of database management systems against the problems of corruption.
Many problems remain, including:

Online checking: A running DBMS should likely perform internal integrity checks while it runs to protect against other forms of corruption, including those from bad memory [23] as well as from disk.

There is a large body of work regarding techniques for detecting and recovering from data corruption [35], including the use of in-memory redundancy with checksums and replicas, or the use of fault-tolerant data structures [3], where a single pointer fault cannot lose a large amount of data, unlike what we have seen in Section IV-A.1. Although none of these techniques is new, it would be interesting to find out why they are not deployed in practice. One reason might be a lack of studies quantifying how much performance overhead is imposed and how much reliability is gained when a certain redundancy or protection is added. This would be an important issue to look into further.

Aside from existing techniques, we believe a proper framework is needed for deploying the techniques. One possible solution is a centralized framework that focuses on corruption handling [14]. Without a centralized framework, handling hundreds of corruption scenarios has proven to be difficult, diffused, and inconsistent.

Robust offline checkers: The existence of repair tools again indicates that we need them in practice. However, as we have observed, the repair process of checkers (both for DBMSs and file systems) is typically ad hoc. Thus, the quest to build more robust checkers has recently begun. For example, Gunawi et al. utilize a declarative approach to write hundreds of checks and repairs in a clear and compact manner [15]. Others have used more formal frameworks as the foundation for corruption repair. For example, Khurshid et al. suggest the use of symbolic execution [18], and Wang et al. define the problem of corruption repair as a global optimization problem by using structural Hamming and edit distance [43].

Thus, further work is clearly required.
Only through a combined offline and online approach will a high-performance, robust, and truly corruption-robust DBMS be realized.

REFERENCES

[1] http://bugs.mysql.com/search.php?search for=corruption&status=All&cmd=display.
[2] http://search.postgresql.org/search?m=1&q=corruption.
[3] Yonatan Aumann and Michael A. Bender. Fault Tolerant Data Structures. In The 37th Annual Symposium on Foundations of Computer Science (FOCS ’96), Burlington, Vermont, October 1996.
[4] Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. An Analysis of Data Corruption in the Storage Stack. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), pages 223–238, San Jose, California, February 2008.
[5] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift. Systematically Benchmarking the Effects of Disk Pointer Corruption. In Proceedings of the International Conference on Dependable Systems and Networks (DSN ’08), Anchorage, Alaska, June 2008.
[6] Wendy Bartlett and Lisa Spainhower. Commercial Fault Tolerance: A Tale of Two Systems. IEEE Transactions on Dependable and Secure Computing, 1(1):87–96, January 2004.
[7] Steve Best. JFS Overview. www.ibm.com/developerworks/library/l-jfs.html, 2000.
[8] J. Brown and S. Yamaguchi. Oracle’s Hardware Assisted Resilient Data (H.A.R.D.). Oracle Technical Bulletin (Note 158367.1), 2002.
[9] Florian Buchholz. The structure of the Reiser file system. http://homes.cerias.purdue.edu/florian/reiser/reiserfs.php, January 2006.
[10] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An Empirical Study of Operating System Errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 73–88, Banff, Canada, October 2001.
[11] Michael H. Darden. Data Integrity: The Dell|EMC Distinction. http://www.dell.com, May 2002.
[12] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 57–72, Banff, Canada, October 2001.
[13] Rob Funk. fsck / xfs. http://lwn.net/Articles/226851/.
[14] Haryadi S. Gunawi, Vijayan Prabhakaran, Swetha Krishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving File System Reliability with I/O Shepherding. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pages 283–296, Stevenson, Washington, October 2007.
[15] Haryadi S. Gunawi, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. SQCK: A Declarative File System Checker. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI ’08), San Diego, California, December 2008.
[16] IBM. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.pd.doc/pd/c0020760.htm.
[17] IBM. http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/core/c0009137.htm.
[18] Sarfraz Khurshid, Ivan García, and Yuk Lai Suen. Repairing Structurally Complex Data. In 12th International SPIN Workshop on Model Checking of Software (SPIN ’05), San Francisco, CA, August 2005.
[19] Andrew Krioukov, Lakshmi N. Bairavasundaram, Garth R. Goodson, Kiran Srinivasan, Randy Thelen, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Parity Lost and Parity Regained. In Proceedings of the 6th USENIX Symposium on File and Storage Technologies (FAST ’08), pages 127–141, San Jose, California, February 2008.
[20] Xin Li, Michael C. Huang, and Kai Shen. An Empirical Study of Memory Hardware Errors in A Server Farm. In The 3rd Workshop on Hot Topics in System Dependability (HotDep ’07), Edinburgh, UK, June 2007.
[21] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. Fsck - The UNIX File System Check Program. Unix System Manager’s Manual - 4.3 BSD Virtual VAX-11 Version, April 1986.
[22] Microsoft. http://technet.microsoft.com/en-us/library/ms176064.aspx.
[23] Dejan Milojicic, Alan Messer, James Shau, Guangrui Fu, and Alberto Munoz. Increasing Relevance of Memory Hardware Errors: A Case for Recoverable Programming Models. In 9th ACM SIGOPS European Workshop ’Beyond the PC: New Challenges for the Operating System’, Kolding, Denmark, September 2000.
[24] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P. Schwarz. ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM Transactions on Database Systems, 17(1):94–162, March 1992.
[25] MySQL Team. MySQL Internals: MyISAM. http://forge.mysql.com/wiki/MySQL Internals MyISAM.
[26] Oracle. http://www.ordba.net/Tutorials/OracleUtilitiesDBVERIFY.htm.
[27] Oracle. http://www.oracleutilities.com/Packages/dbms repair.html.
[28] Oracle. http://www.oracle-base.com/articles/8i/DetectAndCorrectCorruption.php.
[29] David Patterson, Garth Gibson, and Randy Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM SIGMOD Conference on the Management of Data (SIGMOD ’88), pages 109–116, Chicago, Illinois, June 1988.
[30] Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. IRON File Systems. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05), pages 206–220, Brighton, United Kingdom, October 2005.
[31] Hans Reiser. ReiserFS. www.namesys.com, 2004.
[32] Jerome H. Saltzer, David P. Reed, and David D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277–288, November 1984.
[33] Bianca Schroeder and Garth Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST ’07), pages 1–16, San Jose, California, February 2007.
[34] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild: A Large-Scale Field Study. In Proceedings of the 2009 Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS/Performance ’09), Seattle, Washington, June 2009.
[35] Gopalan Sivathanu, Charles P. Wright, and Erez Zadok. Ensuring Data Integrity in Storage: Techniques and Applications. In The 1st International Workshop on Storage Security and Survivability (StorageSS ’05), Fairfax County, Virginia, November 2005.
[36] Christopher A. Stein, John H. Howard, and Margo I. Seltzer. Unifying File System Protection. In Proceedings of the USENIX Annual Technical Conference (USENIX ’01), Boston, Massachusetts, June 2001.
[37] Michael Stonebraker and Lawrence A. Rowe. The Design of POSTGRES. In IEEE Transactions on Knowledge and Data Engineering, pages 340–355, 1986.
[38] Sun Microsystems. ZFS: The last word in file systems. www.sun.com/2004-0914/feature/, 2006.
[39] Sun Microsystems. MySQL White Papers, 2008.
[40] Rajesh Sundaram. The Private Lives of Disk Drives. http://www.netapp.com/go/techontap/matl/sample/0206tot resiliency.html, February 2006.
[41] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS File System. In Proceedings of the USENIX Annual Technical Conference (USENIX ’96), San Diego, California, January 1996.
[42] Stephen C. Tweedie. EXT3, Journaling File System. http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html, July 2000.
[43] Hongyi Wang, Bingsheng He, Vijayan Prabhakaran, and Lidong Zhou. Crystal: The Power of Structure Against Corruptions. In The 5th Workshop on Hot Topics in System Dependability (HotDep ’09), Lisbon, Portugal, June 2009.
[44] Glenn Weinberg. The Solaris Dynamic File System. http://members.visi.net/thedave/sun/DynFS.pdf, 2004.
[45] Junfeng Yang, Can Sar, and Dawson Engler. EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI ’06), Seattle, Washington, November 2006.

ACKNOWLEDGMENT

We thank the anonymous reviewers for their tremendous feedback and comments, which have substantially improved the content and presentation of this paper. We also thank Swaminathan Sundararaman and Abhishek Rajimwale for their insightful comments.

This material is based upon work supported by the National Science Foundation under the following grants: CCF-0621487, CNS-0509474, CCR-0133456, as well as by generous donations from NetApp, Inc. and Sun Microsystems.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or other institutions.