Sector/Sphere: High Performance Distributed File System and Parallel Data Processing Engine


1. Overview
Sector/Sphere was created by Dr. Yunhong Gu in 2006 and is now maintained by a group of open source developers. It is available from http://sector.sourceforge.net/
Sector: distributed file system
Sphere: parallel data processing framework
In some benchmark cases, Sector/Sphere has been reported to be about twice as fast as Hadoop.

2. Sector
Sector system architecture:

The figure shows the overall architecture of the Sector system, which consists of three parts:
Security Server: maintains user accounts, user passwords, file access information, and the IP addresses of the authorized slave nodes
Master: maintains the metadata of the files stored in the system, controls the running of all slave nodes, and responds to users' requests
Slaves: the nodes that store the files managed by the system and process the data upon the request of a Sector client
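To make the division of roles more concrete, here is a minimal Python sketch of a master that tracks which slaves hold a file. It is only an illustration: the class `ToyMaster`, its methods, the IP addresses, and the file path are all hypothetical and are not part of the real Sector API. It also assumes that the client contacts a slave directly once the master has told it where the file is.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a Sector master (not the real API)."""
    # In Sector this list would be provided by the security server.
    authorized_slaves: set[str]
    # Metadata: file path -> addresses of slaves that store a copy.
    locations: dict[str, set[str]] = field(default_factory=dict)

    def register_replica(self, path: str, slave: str) -> None:
        # Reject slaves that the security server has not authorized.
        if slave not in self.authorized_slaves:
            raise PermissionError(f"slave {slave} is not authorized")
        self.locations.setdefault(path, set()).add(slave)

    def locate(self, path: str) -> list[str]:
        # A client asks the master where a file lives; unknown files
        # simply yield an empty list in this sketch.
        return sorted(self.locations.get(path, set()))


master = ToyMaster(authorized_slaves={"10.0.0.1", "10.0.0.2"})
master.register_replica("/data/log.dat", "10.0.0.2")
master.register_replica("/data/log.dat", "10.0.0.1")
print(master.locate("/data/log.dat"))
```

The point of the sketch is that the master only holds metadata, while the file contents stay on the slaves.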
The clients include:
1. Sector file system client API: access Sector files in applications using the C++ API
2. Sector system tools
3. FUSE: mount the Sector file system as a local directory
4. Sphere programming API
A more detailed figure:

Features:
1. Unlike Hadoop, Sector does not split user files into blocks. Instead, every Sector slice is stored as a single file in the native file system
2. Sector runs an independent security server. This design allows different security service providers to be deployed. In addition, multiple Sector masters can use the same security service
3. Topology aware and application aware
4. Uses UDP for message passing and UDT for data transfer

Replication:
1. Provides software-level fault tolerance (no hardware RAID is required)
2. All files are replicated to a specified number of copies by default
3. By default, a replica is created on the furthest node
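The "furthest node" rule can be pictured with a small Python sketch. It assumes, purely for illustration, that each node's position is a (data center, rack, node) tuple and that distance is the number of levels below the deepest shared ancestor. The function names and the topology are hypothetical, and Sector's real distance metric may differ.

```python
from __future__ import annotations


def topo_distance(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Levels below the deepest common ancestor (a and b have equal depth)."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return len(a) - shared


def pick_replica_target(source: tuple[str, ...],
                        candidates: dict[str, tuple[str, ...]]) -> str:
    """Return the candidate node that is furthest from the source location."""
    return max(candidates, key=lambda name: topo_distance(source, candidates[name]))


source = ("dc1", "rack1", "n1")
candidates = {
    "n2": ("dc1", "rack1", "n2"),  # same rack: distance 1
    "n3": ("dc1", "rack2", "n3"),  # same data center: distance 2
    "n4": ("dc2", "rack1", "n4"),  # other data center: distance 3
}
print(pick_replica_target(source, candidates))  # n4
```

Placing the copy far away means that the loss of a whole rack or site is less likely to take out every replica at once.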

UDT:
A high performance data transfer protocol designed for transferring large volumetric datasets over high speed wide area networks. Such settings are typically disadvantageous for the more common TCP protocol.
UDT uses UDP to transfer bulk data with its own reliability control and congestion control mechanisms. The new protocol can transfer data at a much higher speed than TCP does.
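Since UDP itself does not guarantee delivery, reliability has to be added on top. The following Python sketch shows one generic way to do that: the receiver detects gaps in sequence numbers and reports them so the sender can retransmit. This is a toy model under that assumption, not UDT's actual algorithm or wire format. The class `ToyReceiver` is hypothetical and no real sockets are involved.

```python
class ToyReceiver:
    """Toy gap detector for numbered packets (not the real UDT protocol)."""

    def __init__(self) -> None:
        self.next_expected = 0   # lowest sequence number not yet seen in order
        self.lost = set()        # sequence numbers still missing
        self.received = {}       # seq -> payload

    def on_packet(self, seq: int, payload: bytes) -> list:
        """Store a packet and return the sequence numbers to report as lost."""
        newly_lost = []
        if seq >= self.next_expected:
            # Everything skipped between next_expected and seq is missing.
            newly_lost = list(range(self.next_expected, seq))
            self.lost.update(newly_lost)
            self.next_expected = seq + 1
        else:
            # A retransmitted packet fills an earlier gap.
            self.lost.discard(seq)
        self.received[seq] = payload
        return newly_lost

    def data(self):
        """Return the reassembled bytes, or None while gaps remain."""
        if self.lost:
            return None
        return b"".join(self.received[i] for i in range(self.next_expected))


r = ToyReceiver()
r.on_packet(0, b"a")
print(r.on_packet(2, b"c"))  # [1]  packet 1 was skipped
r.on_packet(1, b"b")         # retransmission fills the gap
print(r.data())              # b'abc'
```

Congestion control, which decides how fast the sender may transmit, is left out of the sketch entirely.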

Limitations:
1. File size is limited by the available space on individual storage nodes
2. Users may need to split their datasets into files of suitable size
3. Sector is designed to provide high throughput on large datasets, rather than extremely low latency on small files
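Because Sector stores each file whole, splitting a large dataset is the user's job. A simple approach, sketched below in Python, is to pack whole records into parts up to a size limit so that no record is cut in half. The function `split_records` and the size limit are hypothetical and are not part of the Sector tools.

```python
from __future__ import annotations


def split_records(records: list[bytes], max_bytes: int) -> list[list[bytes]]:
    """Group records into parts of at most max_bytes each.

    Records are never split; a single record larger than max_bytes
    ends up alone in its own part.
    """
    parts: list[list[bytes]] = []
    current: list[bytes] = []
    size = 0
    for rec in records:
        # Start a new part if adding this record would exceed the limit.
        if current and size + len(rec) > max_bytes:
            parts.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        parts.append(current)
    return parts


parts = split_records([b"aaaa", b"bb", b"cccc", b"d"], max_bytes=6)
print(parts)  # [[b'aaaa', b'bb'], [b'cccc', b'd']]
```

Each resulting part could then be written out as one file and uploaded separately.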

3. Sphere
Sphere is a parallel data processing engine integrated into Sector, and it can be used to process data stored in Sector in parallel.
Sphere uses a stream processing computing paradigm. A stream is an abstraction in Sphere that represents either a dataset or a part of a dataset (a Sector dataset consists of one or more physical files).
This figure illustrates how Sphere processes the segments in a stream.
SPE: Sphere Processing Engine
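The idea of several SPEs working through the segments of a stream can be sketched in Python as follows. Threads stand in for SPEs here, whereas in Sphere the processing runs on the slave nodes. The names `run_stream`, `word_count_udf`, and `num_spes` are hypothetical and do not belong to the Sphere programming API.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count_udf(segment):
    """Example per-segment function: count the words in one segment."""
    return sum(len(line.split()) for line in segment)


def run_stream(stream, udf, num_spes=2):
    """Apply udf to every segment, using num_spes workers as stand-in SPEs."""
    with ThreadPoolExecutor(max_workers=num_spes) as pool:
        # pool.map keeps results in the same order as the input segments.
        return list(pool.map(udf, stream))


# A stream modelled as a list of segments; each segment is a list of lines.
stream = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]
print(run_stream(stream, word_count_udf))  # [5, 1, 6]
```

Each segment is processed independently, which is what allows the work to be spread over many nodes.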

This figure illustrates the basic model that Sphere supports. Sphere also supports some extensions of this model, which occur quite frequently:
1. Processing multiple input streams.
2. Shuffling input streams.
Interested readers can refer to: "Sector and Sphere: The Design and Implementation of a High Performance Data Cloud"
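One way to picture shuffling is as routing each record to a bucket chosen by hashing its key, so that records with the same key end up together. The Python sketch below assumes this hash-based scheme for illustration only. The function `shuffle_to_buckets` is hypothetical, and the paper cited above describes what Sphere actually does.

```python
import zlib


def shuffle_to_buckets(records, num_buckets):
    """Distribute (key, value) records into buckets by a hash of the key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        # crc32 gives the same result on every run, unlike Python's built-in
        # hash() for strings, which is randomized per process.
        idx = zlib.crc32(key.encode("utf-8")) % num_buckets
        buckets[idx].append((key, value))
    return buckets


records = [("apple", 1), ("pear", 2), ("apple", 3), ("plum", 4)]
buckets = shuffle_to_buckets(records, num_buckets=3)
for i, bucket in enumerate(buckets):
    print(i, bucket)
```

After such a step, each bucket can be handed to a later processing stage as a new input stream.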

4. References
Sector and Sphere: The Design and Implementation of a High Performance Data Cloud
http://sector.sourceforge.net/
http://en.wikipedia.org/wiki/Sector/Sphere
http://dongxicheng.org/mapreduce/streaming-mapreduce-sphere/
http://en.wikipedia.org/wiki/UDP-based_Data_Transfer_Protocol
http://udt.sourceforge.net/