Data Storage Reallocation Technology

Source: Internet. Published by: linux黑客工具. Editor: 程序博客网. Time: 2024/05/16 18:17

Fujitsu Laboratories today announced that it has developed new parallel distributed data processing technology that enables pools of big data as well as continuous inflows of new data to be efficiently processed and put to use within minutes.

The amount of large-volume, diverse data, such as sensor data and human location data, continues to grow, and various data processing technologies are being developed to enable these pools and streams of big data to be quickly analyzed and put to use. When the priority is on high-speed performance, methods that process the data in memory are used, but when dealing with very large volumes of data, disk-based methodologies are typically used because the volumes are too large to process in memory. When using disk-based techniques, however, if the objective is to immediately reflect newly received data in the analytical results, many disk accesses are necessary. As a result, analytical processing cannot keep pace with the volume of data flowing in.

To address this problem, Fujitsu has developed technology that slashes the number of disk accesses by approximately 90% compared to previous levels by dynamically reallocating data on disks to match trends in data accesses. Whereas producing analytic results of new data could take several hours in the past, with this new technique results are available in minutes. This development excels at both volume and velocity when processing big data, an objective that has been difficult to achieve until now.

This technology will be one of the technologies underpinning human-centric computing, which will provide relevant services for every location.

In recent years, the amount of large-volume, diverse data, particularly chronological data such as sensor data and human location data, has continued to grow at an explosive pace. There is a strong demand to take this type of "big data" and efficiently extract valuable information that can be put to immediate use in delivering services, such as various navigation services.

A number of data-processing techniques have emerged for handling big data (Figure 1). One of these, parallel batch processing, as in Hadoop, has become a focus of attention. In parallel batch processing, the dataset is divided and quickly processed by multiple servers.
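The divide-and-process idea behind parallel batch processing can be illustrated with a minimal sketch. It is not Hadoop's API: threads stand in for servers, and the function names `split`, `count_partition`, and `parallel_batch_count` are hypothetical names chosen for this illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Divide the dataset into n_parts roughly equal partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_partition(partition):
    """Per-worker step: count occurrences of each item in one partition."""
    return Counter(partition)


def parallel_batch_count(records, n_workers=4):
    """Process every partition on its own worker, then merge the partial counts."""
    parts = split(records, n_workers)
    # Threads are an assumption standing in for separate servers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_partition, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    print(parallel_batch_count(["a", "b", "a", "c", "a"], n_workers=2))
```

The whole dataset has to be available before the job starts, which is why this style works on a snapshot rather than on data that is still arriving.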

Another technology that has also received interest is complex event processing (CEP), which handles a stream of incoming data in real time. This has the benefit of being extremely fast because it processes data in memory.
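As a rough illustration of in-memory stream processing, the sketch below keeps a time-based sliding window of events in memory and updates an average as each event arrives. It does not reproduce any particular CEP engine; the class name `SlidingWindowAverage` and its parameters are assumptions made for this example.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the events of the last window_seconds, held in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Ingest one event and return the current windowed average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


if __name__ == "__main__":
    window = SlidingWindowAverage(window_seconds=10)
    for ts, reading in [(0, 10.0), (5, 20.0), (12, 30.0)]:
        print(ts, window.add(ts, reading))
```

Because only the current window is retained, memory use stays bounded, but history older than the window is no longer available for analysis.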

The goal of extracting valuable information more quickly, from larger datasets, requires a data-processing technology that is disk-based and can quickly produce analytic results.

While there are both batch and incremental disk-based processing techniques, obtaining analytic results from either one quickly (responsiveness) remains a problem.

Because batch techniques perform a batch process on a snapshot of the data, there will always be a fixed lag time before new information can be reflected in the analytic results.

Conversely, with incremental processing, new data is processed consecutively as it arrives, but updating the analytic results directly requires the disk to be accessed numerous times. This creates a bottleneck for analytic processing overall, which ultimately cannot keep up with the pace of incoming data (Figure 2). Quickly reflecting new data in analytic results, therefore, requires reducing the number of disk accesses.
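The trade-off between the two disk-based styles can be made concrete with a toy model. In the sketch below, `CountingDisk` is a hypothetical dict-backed stand-in for a disk that counts accesses to stored results only (reads of the input snapshot are deliberately not modeled). The batch version writes each result once per run, while the incremental version reads and rewrites a result for every arriving record.

```python
class CountingDisk:
    """Dict-backed stand-in for a disk that counts every read and write."""

    def __init__(self):
        self.data = {}
        self.accesses = 0

    def read(self, key, default=0):
        self.accesses += 1
        return self.data.get(key, default)

    def write(self, key, value):
        self.accesses += 1
        self.data[key] = value


def batch_totals(snapshot, disk):
    """Batch: aggregate a whole snapshot in memory, then write each result once."""
    totals = {}
    for key, amount in snapshot:
        totals[key] = totals.get(key, 0) + amount
    for key, total in totals.items():
        disk.write(key, total)


def incremental_update(record, disk):
    """Incremental: fold one new record into the stored result immediately."""
    key, amount = record
    disk.write(key, disk.read(key) + amount)


if __name__ == "__main__":
    events = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

    batch_disk = CountingDisk()
    batch_totals(events, batch_disk)

    inc_disk = CountingDisk()
    for event in events:
        incremental_update(event, inc_disk)

    print(batch_disk.accesses, inc_disk.accesses)
```

Both approaches end with the same stored totals. The incremental one keeps results current after every record, at the cost of far more result accesses, which is the bottleneck the text describes.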

Fujitsu has developed a technology it calls "adaptive locality-aware data reallocation," which dramatically reduces the number of accesses, along with distributed parallel middleware for incremental processing.

With adaptive data localization, data is optimally allocated by the following three steps (Figure 3):

• Record data-access history: Record sets of continuously accessed data.

• Calculate optimal allocation: Based on step 1, group sets of data that tend to be accessed continuously.

• Reallocate data dynamically: Based on step 2, specify a location on disk for data belonging to a group and allocate it there.
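The three steps can be sketched as follows. The announcement does not say how groups are computed, so this example makes an assumption: keys that were accessed together at least `min_count` times are merged into one group with a simple union-find. All function names and the `min_count` threshold are hypothetical, and a "block" here is just an integer label for a disk location.

```python
from collections import Counter
from itertools import combinations


def record_history(history, accessed_keys):
    """Step 1: record one set of keys that were accessed together."""
    history.append(frozenset(accessed_keys))


def compute_groups(history, min_count=2):
    """Step 2: group keys co-accessed at least min_count times (assumed rule)."""
    pair_counts = Counter()
    for keys in history:
        pair_counts.update(combinations(sorted(keys), 2))

    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    for keys in history:
        for key in keys:
            find(key)  # register every key, even if it is never grouped
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            parent[find(a)] = find(b)

    groups = {}
    for key in list(parent):
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


def reallocate(groups):
    """Step 3: assign each group one block so its members are stored together."""
    return {key: block for block, group in enumerate(groups) for key in group}


if __name__ == "__main__":
    history = []
    for accessed in (["u1", "i7", "i9"], ["u1", "i7"], ["u2", "i3"], ["u2", "i3"]):
        record_history(history, accessed)
    groups = compute_groups(history)
    print(groups)
    print(reallocate(groups))
```

In a running system the history would keep growing and the grouping would be recomputed periodically, so that placement follows shifts in the access pattern.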

This makes it possible to acquire desired data through a small number of continuous accesses rather than numerous random accesses, which vastly increases overall throughput in a distributed-processing system. Also, by monitoring and automatically recognizing patterns of data access, this technology can gradually accommodate the hard-to-anticipate data characteristics of social-infrastructure systems.
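A small, hypothetical cost model shows why co-locating data helps. It assumes that fetching a set of keys costs one disk access per distinct block touched; the placements and queries below are made-up illustrations, not measurements from the announcement.

```python
def blocks_touched(keys, placement):
    """Number of distinct disk blocks that must be read to fetch all keys."""
    return len({placement[key] for key in keys})


def workload_cost(queries, placement):
    """Total block reads for a list of queries under a given placement."""
    return sum(blocks_touched(query, placement) for query in queries)


if __name__ == "__main__":
    queries = [["u1", "i7"], ["u1", "i7", "i9"]]
    scattered = {"u1": 0, "i7": 5, "i9": 9}  # each key in a different block
    colocated = {"u1": 0, "i7": 0, "i9": 0}  # co-accessed keys share a block
    print(workload_cost(queries, scattered), workload_cost(queries, colocated))
```

Under this simplified model, the same queries need fewer block reads once the keys they touch together are stored together.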

This technology can perform analytic processing on big data using incremental processing while accepting data as quickly as it arrives, allowing for rapid analytic processing of current data.

This technology was used in the analytical processing portion of an electronic commerce recommendation system, where it was shown to operate with about one-tenth the number of disk accesses of previous technologies. Consequently, whereas batch processing had conventionally been used for analytical processing of large data volumes, incremental processing is now suitable. This greatly reduces the time required for new data to be reflected in analytical results. When applied to analytic processes that had been run as overnight batches because of the hours-long processing time required with batch processing, this technology makes analytical results available for use in a matter of minutes.

Fujitsu Laboratories plans to make further performance enhancements to the technology and conduct verification testing with the aim of applying it to commercial products and services in fiscal 2013.

0 0