Lazy Analytics: Let Other Queries Do the Work For You


Lazy Analytics: Let Other Queries Do the Work For You

William Jannen, Michael A. Bender, Martin Farach-Colton†, Rob Johnson,
Bradley C. Kuszmaul‡, and Donald E. Porter 
Stony Brook University, †Rutgers University, and ‡Massachusetts Institute of Technology

(This post is actually an assignment for an elective course. Although it is my first translation, reading along with the original text should make the gist clear. If anything diverges seriously from the original, corrections are welcome; thanks in advance.)

The original paper:

 

Abstract

 

We propose a class of query, called a derange query, that maps a function over a set of records and lazily aggregates the results. Derange queries defer work until it is either convenient or necessary, and, as a result, can reduce total I/O costs of the system.

 

Derange queries operate on a view of the data that is consistent with the point in time that they are issued, regardless of when the computation completes. They are most useful for performing calculations where the results are not needed until some future deadline. When necessary, derange queries can also execute immediately. Users can view partial results of in-progress queries at low cost.

 

1 Introduction

 

Queries on production databases have varying requirements for response time and data timeliness. Some transactions service end-user requests, and must minimize latency in order to minimize user-perceived delays. Other queries are not urgent, and hence can be scheduled opportunistically, but nonetheless need a specific point-in-time-consistent view of the data. Examples of the second class of queries include periodic reports and summary computations, such as issuing monthly bills, identifying patterns in online purchases, and monitoring trends in social media.

 

Long-running summary computations can starve other high-priority, latency-sensitive tasks, if both classes of operations are run on the same machine. To alleviate resource contention on production databases, it is common to maintain replicas or additional databases where summary computations are performed [2]. This may require additional physical resources, management effort, and/or licenses, and requires keeping multiple databases in sync.

 

We propose a new class of query for summary computations that can minimally impact other operations. A derange query maps a function over a range of records, and incrementally aggregates the result. Derange queries defer work until it is necessary (e.g., the result of the query is needed), or convenient (e.g., other necessary work has read the required data into memory). Thus, derange queries are most useful for calculations whose results are needed at some future deadline. However, once issued, derange queries can be scheduled immediately.

 

A key idea underlying the derange query model is to integrate background work with I/O scheduling. The goal of a derange query is to make maximum use of all I/Os in the system; when any query executes, we want to amortize the I/O cost of that query across as many active queries as possible. At the same time, we do not want background tasks to impact latency-sensitive operations negatively. The derange query model allows one to integrate these goals into one I/O scheduler.

 

Derange queries can be easily implemented as messages in a write-optimized dictionary (WOD), such as a Bε-tree [7], a log-structured merge tree (LSM-tree) [15], or an LSM-tree variant [5, 18, 19, 21]. As the name implies, WODs are popular for high-performance databases and file systems [4, 9, 10, 11, 12, 13, 14, 16, 17, 19, 20] because of their very high insertion performance—typically less than 1 I/O per insertion or deletion. WODs are so fast because they buffer and batch writes; the primary focus of write-optimization has been on improving the efficiency of writes through batching.

 

This paper identifies an opportunity to integrate write batching in a WOD with background queries that access the same data. There are several benefits to implementing a derange query as a message in a WOD:

 

Derange queries on overlapping input ranges can be transparently batched and processed together, requiring each input value to be read only once.

 

Repeated derange queries at multiple points in time on the same input range may complete by reading every version of the data exactly once.

 

I/O required to ingest new data can contribute to completing a derange query, and I/O required to process a derange query can accelerate ingesting new data.


 

Derange queries can significantly reduce the cost of summary computations on highly volatile data sets, and could make data analytics possible on high-performance production databases without harming update performance. In fact, the higher a data set's update rate, the faster a derange query would complete.

 

The remainder of this paper is organized as follows. Section 2 discusses WODs, and explains how the properties of Bε-trees apply to derange query design. Section 3 outlines the proposed derange query implementation using a concrete example. Section 4 reasons about derange query performance. Section 5 presents scenarios where derange queries are particularly beneficial. Section 6 summarizes related work, and Section 7 discusses opportunities for future exploration.

 

2 Write-Optimized Dictionaries

 

This section explains Bε-trees [7], an example of a write-optimized dictionary (WOD). Derange queries could be implemented in other WODs, including LSM-trees [15] and their variants [18, 19, 21]. However, our proposed implementation relies heavily on upsert operations, and Bε-trees have asymptotically superior upsert performance.

 

We limit our discussion to the features of Bε-trees that are most relevant to derange query design. Bender et al. [6] offer a more complete description of Bε-trees, including comparisons with other WODs.

 

2.1 Bε-Trees

 

A Bε-tree, like a B-tree, is a search tree for organizing persistent data. Internal nodes store pivot keys and child pointers, and leaf nodes store key-value pairs. What sets a Bε-tree apart from a standard B-tree is that internal Bε-tree nodes also allocate a buffer to store messages. The structure of a Bε-tree is illustrated in Figure 1.

 

Messages encode updates to key-value pairs. All messages are inserted at the Bε-tree root, and when the message buffer of the root or another non-leaf node fills, messages in the full buffer are flushed to one or more children. Flushing moves messages from a parent to a child's buffer; flushes may cascade down the tree; and messages are ultimately applied to key-value pairs at a leaf. The flushing process selects the child or children that would receive enough messages to amortize the cost of rewriting the parent and child buffers. Thus, messages make their way down a root-to-leaf path in batches, until they are eventually applied at a Bε-tree leaf.
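As a concrete illustration of the buffering and flushing scheme described above, here is a minimal Python sketch (not the authors' implementation; the Message type, the capacity policy, and the single-child flush heuristic are simplifying assumptions):

    import bisect
    from collections import namedtuple

    # A message encodes an update addressed to a single key (illustrative type).
    Message = namedtuple("Message", ["key", "op"])

    class InternalNode:
        def __init__(self, pivots, children, capacity):
            self.pivots = pivots        # sorted keys separating the children
            self.children = children    # len(pivots) + 1 children (nodes or leaves)
            self.buffer = []            # messages kept in arrival (temporal) order
            self.capacity = capacity

        def child_index(self, key):
            # Route a key to the child whose range [pivots[i-1], pivots[i]) holds it.
            return bisect.bisect_right(self.pivots, key)

        def insert(self, msg):
            self.buffer.append(msg)
            if len(self.buffer) > self.capacity:
                self.flush()

        def flush(self):
            # Group buffered messages by destination child, preserving order.
            groups = {}
            for msg in self.buffer:
                groups.setdefault(self.child_index(msg.key), []).append(msg)
            # Flush to the child that would receive the largest batch, so that
            # rewriting the parent and child buffers is amortized over many messages.
            target = max(groups, key=lambda i: len(groups[i]))
            for msg in groups[target]:
                self.children[target].insert(msg)   # may cascade down the tree
            self.buffer = [m for m in self.buffer
                           if self.child_index(m.key) != target]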

 

Upserts. Bε-trees can effectively implement blind operations—operations on a key-value pair without first reading it—using upsert messages.

 


 

Figure 1: A Bε-tree. Internal nodes store pivot keys and child pointers, and leaf nodes store key-value pairs. Internal nodes also allocate a buffer to store messages, which are flushed down the tree in batches.

 

 

An upsert message specifies a key, a function, and a set of function arguments. When a key-value pair is queried, all upsert messages along the pair’s root-to-leaf path are gathered, and their functions are applied in order.
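A sketch of this gather-and-apply step, using an illustrative Upsert message type (buffers are listed root to leaf; since messages flush downward, deeper buffers hold older messages, and the functions are applied oldest-first):

    from collections import namedtuple

    # An upsert carries a key, a function, and the function's arguments.
    Upsert = namedtuple("Upsert", ["key", "fn", "args"])

    def point_query(key, path_buffers, leaf):
        # path_buffers: message buffers on the root-to-leaf search path,
        # listed root first. leaf: dict of key-value pairs at the leaf.
        value = leaf.get(key)                    # base pair (None if absent)
        pending = []
        for buf in reversed(path_buffers):       # deepest buffer first = oldest
            pending.extend(m for m in buf if m.key == key)
        for msg in pending:                      # apply upsert functions in order
            value = msg.fn(value, *msg.args)
        return value

    # Example: two pending upserts increment a counter stored under "hits".
    inc = lambda v, d: (v or 0) + d
    buffers = [[Upsert("hits", inc, (1,))], [Upsert("hits", inc, (2,))]]
    print(point_query("hits", buffers, {}))      # -> 3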

 

Upserts can be used to compactly encode updates to ranges of bytes within an object, modifications to fields of structured data, or data-dependent computations. The flexibility of upsert messages is essential to the implementation of derange queries; as described in Section 3, upserts allow derange queries to incrementally and lazily aggregate the results of deferred work.

 

Temporal ordering. The relative position of messages within buffers of a Bε-tree preserves the temporal order of updates. At any point in time, multiple versions of a key-value pair may exist in the tree (e.g., an insert message overwrites an existing key-value pair), and multiple in-flight messages may contain updates to a given value (e.g., two upserts target the same key-value pair). Node flushing preserves the message ordering until messages are applied to key-value pairs at Bε-tree leaves.

 

Queries. All messages needed to answer a query reside in buffers on the root-to-leaf search path. Because non-leaf buffers may contain outstanding updates, all messages along the root-to-leaf path must be searched, and updates are applied in reverse chronological order.

 

Message targets. A single message may apply to one key-value pair, all key-value pairs (broadcast), or a range of key-value pairs (rangecast). A rangecast message [22] is addressed to a contiguous range of keys, specified by a beginning and ending key, inclusive.

 

Since broadcast and rangecast messages may apply to many key-value pairs, these messages may split during a node flush. When a message splits, the original message is discarded, and new messages with appropriate subranges are created in its place.
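A sketch of the split, under the simplifying assumption that child i covers the half-open key range [pivots[i-1], pivots[i]) with implicit ±infinity sentinels (illustrative types, not the paper's code):

    import bisect
    from collections import namedtuple

    # A rangecast is addressed to the inclusive key range [lo, hi].
    Rangecast = namedtuple("Rangecast", ["lo", "hi", "op"])

    def split_rangecast(msg, pivots):
        # Create one piece per child whose key range the message overlaps;
        # the caller discards the original message afterwards.
        first = bisect.bisect_right(pivots, msg.lo)
        last = bisect.bisect_right(pivots, msg.hi)
        pieces = []
        for i in range(first, last + 1):
            lo = msg.lo if i == first else pivots[i - 1]
            hi = msg.hi if i == last else pivots[i]   # internal cut points
            pieces.append((i, Rangecast(lo, hi, msg.op)))
        return pieces

    # Example: a rangecast over [5, 25] split at pivots 10 and 20 yields
    # pieces for children 0, 1, and 2.
    print(split_rangecast(Rangecast(5, 25, "touch"), [10, 20]))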


 



3 Derange Query Design

 

A derange query can be implemented as a rangecast upsert message. A derange query has the form

 

DERANGE(R, FILTER, MAP, FOLD, k)

 

where

 

1. R is an input range.

 

2. FILTER is a predicate to remove records that do not meet appropriate criteria.

3. MAP is a function to apply to each record in the input range that meets the filter criteria.

 

4. FOLD is a function to propagate the results.

 

5. k is a key specifying where results are accumulated.

 

The aggregation record associated with key k is incrementally updated as a derange query lazily completes. After each application of MAP to an input record, outputs are accumulated by inserting a message of the form:

UPSERT(k, FOLD, result_MAP)

 

Upsert messages offer a flexible means to propagate MAP results. Upserts can encode complex data-dependent operations as well as simple operations like incrementing a counter. Inserting small upsert messages into the root of a Bε-tree imposes little I/O overhead.
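Putting the pieces together, a derange message and its fold step might look as follows (a sketch with illustrative types; insert_at_root stands in for the tree's message-insertion entry point and is an assumption of this sketch):

    from collections import namedtuple

    Upsert = namedtuple("Upsert", ["key", "fn", "args"])
    # DERANGE(R, FILTER, MAP, FOLD, k), with R represented as [lo, hi].
    Derange = namedtuple("Derange", ["lo", "hi", "filter", "map", "fold", "k"])

    def apply_derange_at_leaf(leaf_items, q, insert_at_root):
        # Called when a derange message reaches a leaf: MAP each record that
        # falls in R and passes FILTER, then fold the result into the
        # aggregation record k by inserting a small upsert at the root.
        for key, record in leaf_items:
            if q.lo <= key <= q.hi and q.filter(record):
                insert_at_root(Upsert(q.k, q.fold, (q.map(record),)))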

 

3.1 Derange Query Example

 

To get a feel for how a derange query works, we will show how a fictional online retailer, called "Marketplace", could use derange queries for data analytics.

 

Suppose Marketplace manages its inventory using a product database with records of the form:

 

Item {
    productId : num
    warehouse : address
    quantity  : num
    value     : num
    price     : num
}

 

Every hour, Marketplace would like to calculate the cumulative value of all products in its New York warehouses in order to identify trends and make inventory decisions. Marketplace could perform these calculations with a derange query where:

 

R      = (−∞, ∞)
FILTER = return Item.warehouse == NY
MAP    = return Item.quantity * Item.value
FOLD   = totalValue += result
k      = InventoryAt || TIMESTAMP

 

Marketplace would start by initializing its aggregation record, k. In this example, the value of k is a simple integer, totalValue, initialized to 0.

The range R = (−∞, ∞) means that this query will examine every record in the database. But since the query should track only items in warehouses located in NY, the FILTER function is used to exclude records that do not match this criterion. Note that, if the primary index for the database were organized by geography, the range could select only records in NY warehouses and avoid reading irrelevant data; the FILTER function can select data based on criteria that are not part of the indexing schema.

 

When a derange query message reaches a leaf of the tree, the value of each record it observes is the value that existed when the derange query was first issued. At that point, the MAP function is called on all records that fall within R and satisfy the FILTER function. The output of each MAP function—here the total value of a single product in the warehouse’s inventory—is propagated to the aggregation record, k, using an upsert where the FOLD function updates k’s running total.
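Expressed with the illustrative Derange type from above, Marketplace's query might read as follows (MIN_KEY, MAX_KEY, and the record layout are hypothetical stand-ins; the paper gives only pseudocode):

    from collections import namedtuple
    from datetime import datetime

    Derange = namedtuple("Derange", ["lo", "hi", "filter", "map", "fold", "k"])
    ItemRecord = namedtuple("ItemRecord",
                            ["productId", "warehouse", "quantity", "value", "price"])

    MIN_KEY, MAX_KEY = float("-inf"), float("inf")   # stand-ins for R = (-inf, inf)

    def ny_filter(item):                  # FILTER: keep only NY warehouses
        return item.warehouse == "NY"

    def inventory_value(item):            # MAP: value of one product's stock
        return item.quantity * item.value

    def add_to_total(total, result):      # FOLD: totalValue += result
        return (total or 0) + result      # totalValue starts at 0

    query = Derange(lo=MIN_KEY, hi=MAX_KEY,
                    filter=ny_filter, map=inventory_value, fold=add_to_total,
                    k="InventoryAt|" + datetime.now().isoformat())

    # A record in a NY warehouse contributes quantity * value to the total.
    sample = ItemRecord(42, "NY", quantity=10, value=2.5, price=3.0)
    assert query.filter(sample) and query.map(sample) == 25.0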

 

This simple example demonstrates the utility that derange queries provide. Marketplace's inventory calculations are performed on views of the data at fixed timestamps, but query results are not needed right away. If a particular region of the tree remains unchanged between two derange queries, then a single I/O will satisfy both operations. However, even when the tree is updated frequently, all derange queries see a point-in-time-consistent view of the data, regardless of when the actual calculation is performed.

 

3.2 Query Completion

 

One challenge that arises when lazily executing independent, distributed computations is determining what fraction of the total work has completed. To solve this problem, we add a small amount of bookkeeping to the aggregation record: one required field, outstandingMessages, and one optional field, recordsProcessed.

 

The outstandingMessages field is a simple counter. A derange query message may apply to many records in the tree, and as explained in Subsection 2.1, a node flush may cause a rangecast message to split. Each time a derange query message splits, we issue an upsert message to the derange query's aggregation record to increment the outstandingMessages counter. To complete the bookkeeping, we issue an upsert message that decrements the counter when a derange query message reaches a Bε-tree leaf. The outstandingMessages counter is initialized to 1 in order to account for the initial derange query message inserted at the root of the Bε-tree.
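A sketch of this bookkeeping with the illustrative Upsert type (bump and insert_at_root are hypothetical helpers; the net adjustment of n_pieces − 1 per split is one way to keep the count exact when one message becomes several pieces at once):

    from collections import namedtuple

    Upsert = namedtuple("Upsert", ["key", "fn", "args"])

    def bump(record, field, delta):
        # FOLD step for the counters. A fresh aggregation record starts with
        # outstandingMessages = 1, accounting for the initial message at the root.
        record = dict(record) if record else {"outstandingMessages": 1,
                                              "recordsProcessed": 0}
        record[field] = record.get(field, 0) + delta
        return record

    def on_derange_split(q, n_pieces, insert_at_root):
        # One in-flight message became n_pieces; each piece will later
        # decrement the counter once when it reaches a leaf.
        insert_at_root(Upsert(q.k, bump, ("outstandingMessages", n_pieces - 1)))

    def on_derange_reaches_leaf(q, n_records, insert_at_root):
        insert_at_root(Upsert(q.k, bump, ("outstandingMessages", -1)))
        insert_at_root(Upsert(q.k, bump, ("recordsProcessed", n_records)))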

 

The recordsProcessed field counts the number of key-value pairs that have folded their MAP results into the aggregation record. Due to the laziness of flushing and the opacity of the internal Bε-tree structure, an application has no control over the progress of a derange query without manually triggering message flushes. By querying the recordsProcessed field, an application can reason about the meaningfulness of a partially completed result.
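For example, an application could poll the aggregation record with a low-cost point query and interpret the counters as follows (hypothetical usage of the bookkeeping fields sketched above):

    def report_progress(agg_record):
        # agg_record: current value of the aggregation record under key k,
        # e.g. as returned by a point query; None if nothing has folded yet.
        if agg_record is None:
            print("no progress recorded yet")
        elif agg_record["outstandingMessages"] == 0:
            print("derange query complete")
        else:
            print("in progress; records folded so far:",
                  agg_record["recordsProcessed"])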

 

4 Derange Query Cost

 

This section explains how derange queries improve the performance of summary computations in much the same way that WODs improve the performance of inserts and updates.

As explained in Subsection 2.1, a Bε-tree node is only dirtied when a substantial amount of new data is written—enough to amortize the cost of rewriting the parent and child nodes. For a tree with a node size of B, a branching factor of B^ε, and a buffer size of B − B^ε, the amount of new data written during each node flush is at least (B − B^ε)/B^ε ≈ B^(1−ε). We call this the batching factor. Batching is why inserts and upserts in a Bε-tree are B^(1−ε) times faster than in a B-tree.

Derange queries bring the benefits of batching to queries. A derange query spanning a range of L items touches O((log_B N)/ε + L/B) nodes during its execution. Derange query messages are flushed along with other messages in batches of size at least B^(1−ε). Hence the amortized I/O cost of a derange query spanning L items is O((log_B N)/(ε·B^(1−ε)) + L/B^(2−ε)). In contrast, a normal range query spanning L items requires O((log_B N)/ε + L/B) I/Os. The batching factor divides the cost; as a result, derange queries have the potential to provide as much speedup for queries as write-optimization provides for inserts.
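To put rough numbers on these formulas (taking the asymptotic bounds at face value and ignoring constants; the values of B, ε, N, and L below are illustrative choices, not measurements from the paper):

    import math

    B, eps, N, L = 1024, 0.5, 10**9, 10**6
    path = math.log(N, B) / eps                          # O(log_B N / eps) node path
    range_query_io = path + L / B                        # normal range query
    derange_io = path / B**(1 - eps) + L / B**(2 - eps)  # amortized derange cost
    print(range_query_io, derange_io, range_query_io / derange_io)
    # the speedup is on the order of the batching factor B^(1-eps) = 32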

 

5 Derange Query Opportunities

 

In this section, we discuss the types of environments where derange queries would be particularly useful.

 

Mixed workload environments. A typical web-scale database serves at least two kinds of queries: small random queries that must be answered quickly, and large analytic queries that might take several hours in the best case but can be delayed by many more hours without hurting their value to the business. An example might be a credit-card database where customer purchases create many high-priority inserts, and large queries are performed overnight to find new fraud patterns.

If most of the I/Os needed by the big query can be piggybacked onto the small queries, then both types of queries can be performed without increasing the cost of the database or slowing down the small queries.

 

Point-in-time computations. In the common case, instances of the same derange query, repeated at multiple points in time, would be satisfied by reading each version of the data exactly once. Thus, derange queries can be used to increase the granularity of reporting.

Queries on overlapping ranges. Derange queries can make it easy to batch otherwise unrelated queries. For example, consider a system that performs one summary computation every 24 hours, and another summary computation every 12 hours. Manually batching these computations would essentially require writing two versions of the 12-hour computation—one that runs on its own and another that runs as part of the 24-hour computation. With derange queries, developers need to write only one version of each computation, and the system will batch them automatically when possible.

 

6 Related Work

 

Amvrosiadis et al. observed that common file system maintenance tasks (e.g., backup, defragmentation, virus scanning) are frequently executed independently despite their largely overlapping working sets. The Duet [3] framework places hooks in the page cache to notify processes when requested data is available. This lets background tasks leverage the I/O performed by foreground work. Derange queries similarly leverage the internal work done by the Bε-tree when it flushes messages to apply updates, piggybacking on I/O.

 

In the MapReduce [8] programming model, users filter and sort input data, independently process the filtered data, and combine the computations' outputs into a final result. MapReduce makes these types of operations easy to program for distributed data sets. Derange queries provide a similar programming model, but can optionally defer execution. The motivating use cases of this paper have been single-node, high-performance production databases, but derange queries could also be extended to work on a distributed storage system.

 

LINQ [1] features deferred execution, which delays the evaluation of an expression until its value is required. However, from the time an expression tree is created to the time the query is executed, the database may change. A derange query defers execution until the message is applied, but the message always applies to the value of the data as of the time the message was inserted.

 

7 Future Work

 

Even when derange queries cannot be delayed arbitrarily, they can provide significant speedups. Part of our future work is to analyze and empirically evaluate the performance opportunities created by derange queries.

 

When executing a derange query with a fixed deadline, the ability to systematically execute portions of the query would be useful. Otherwise, a burst of deferred work might need to be scheduled at the query deadline, eroding the benefits of batching. Derange queries create opportunities for I/O scheduling and workload management.

 

Acknowledgments

 

We thank the anonymous reviewers and our shepherd, Cindy Rubio-González, for their insightful comments on earlier drafts of the work. This research was supported in part by NSF grants CNS-1409238, CNS-1408782, CNS-1408695, CNS-1405641, CNS-1149229, CNS-1161541, CNS-1228839, IIS-1247750, CCF-1314547, CNS-1526707, Sandia National Laboratories, and VMware.

 

References

 

[1] LINQ and deferred execution. https://blogs.msdn.microsoft.com/charlie/2007/12/10/linq-and-deferred-execution/, 2007. Viewed March 10, 2016.

 

[2] How to perform summary analytics on production databases? http://goo.gl/lQXW7V, 2016. Viewed March 8, 2016.

 

[3] G. Amvrosiadis, A. D. Brown, and A. Goel. Opportunistic storage maintenance. In SOSP, pages 457–473, 2015.

 

[4] Apache. HBase. http://hbase.apache.org, 2015. Last accessed May 16, 2015.

 

[5] M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In SPAA, pages 81–92, 2007.

[6] M. A. Bender, M. Farach-Colton, W. Jannen, R. Johnson, B. C. Kuszmaul, D. E. Porter, J. Yuan, and Y. Zhan. An introduction to Bε-trees and write-optimization. ;login: Magazine, 40(5):22–28, Oct 2015.

 

[7] G. S. Brodal and R. Fagerberg. Lower bounds for external memory dictionaries. In SODA, pages 546–554, 2003.

 

[8] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

 

[9] J. Esmet, M. A. Bender, M. Farach-Colton, and B. C. Kuszmaul. The TokuFS streaming file system. In HotStorage, page 14, 2012.

[10] Google, Inc. LevelDB: A fast and lightweight key/value database library by Google. http://github.com/leveldb/, 2015. Last accessed May 16, 2015.

 

[11] W. Jannen, J. Yuan, Y. Zhan, A. Akshintala, J. Esmet, Y. Jiao, A. Mittal, P. Pandey, P. Reddy, L. Walsh, M. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. BetrFS: A right-optimized write-optimized file system. In FAST, pages 301–315, 2015.

[12] W. Jannen, J. Yuan, Y. Zhan, A. Akshintala, J. Esmet, Y. Jiao, A. Mittal, P. Pandey, P. Reddy, L. Walsh, M. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. BetrFS: Write-optimization in a kernel file system. ACM TOS, 11(4), Nov. 2015.

 

[13] A. Lakshman and P. Malik. Cassandra - a decentralized structured storage system. OS Rev., 44(2):35–40, 2010.

 

[14] MongoDB. The MongoDB 2.6 Manual, 2014. http://docs.mongodb.org/manual/, Viewed May 27, 2014.

 

[15] P. O’Neil, E. Cheng, D. Gawlic, and E. O’Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.

 

[16] K. Ren and G. A. Gibson. TABLEFS: Enhancing metadata efficiency in the local file system. In USENIX ATC, pages 145–156, 2013.

[17] RocksDB. rocksdb.org, 2014. Viewed April 19, 2014.

[18] R. Sears and R. Ramakrishnan. bLSM: a general purpose log structured merge tree. In SIGMOD, pages 217–228, 2012.

 

[19] P. Shetty, R. P. Spillane, R. Malpani, B. Andrews, J. Seyster, and E. Zadok. Building workload-independent storage with VT-trees. In FAST, pages 17–30, 2013.

[20] Tokutek, Inc. TokuDB v6.5 for MySQL and MariaDB. http://www.tokutek.com/products/tokudb-for-mysql/, 2013. See https://web.archive.org/web/20121011120047/http://www.tokutek.com/products/tokudb-for-mysql/.

 

[21] X. Wu, Y. Xu, Z. Shao, and S. Jiang. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In USENIX ATC, pages 71–82, 2015.

 

[22] J. Yuan, Y. Zhan, W. Jannen, P. Pandey, A. Akshintala, K. Chandnani, P. Deo, Z. Kasheff, L. Walsh, M. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. Optimizing every operation in a write-optimized file system. In FAST, pages 1–14, 2016.

 


