Database Scaling, Partitioning and Sharding


Scaling

Horizontal Scaling:

Horizontal scaling is about adding more machines to improve the responsiveness and availability of a system, including a database. The idea is to distribute the workload across multiple machines.

Vertical Scaling:

Vertical scaling is about adding more capacity, in the form of CPU and memory, to an existing machine (or machines) to improve the responsiveness and availability of a system, including a database. In a virtual machine setup this can be configured virtually, instead of adding real physical hardware.

In the database world, horizontal scaling is often based on partitioning the data, i.e. each node contains only part of the data; in vertical scaling the data resides on a single node, and scaling is done through multi-core, i.e. spreading the load between the CPU and RAM resources of that machine.

With horizontal scaling it is often easier to scale dynamically, by adding more machines to the existing pool. Vertical scaling is limited to the capacity of a single machine; scaling beyond that capacity often involves downtime and comes with an upper limit.

Good examples of horizontal scaling are Cassandra and MongoDB; a good example of vertical scaling is MySQL on Amazon RDS (Amazon's managed MySQL offering), which provides an easy way to scale vertically by switching from smaller to bigger machines, a process that often involves downtime.

In-memory data grids such as GigaSpaces XAP, Coherence, etc. are often optimized for both horizontal and vertical scaling, simply because they are not bound to disk: horizontal scaling through partitioning, and vertical scaling through multi-core support.

You can read more on this subject in my earlier posts: Scale-out vs. Scale-up and The Common Principles Behind the NOSQL Alternatives.


Partitioning

Horizontal Partitioning in a Database

Example: a table Employees has the columns id, name, geographical location, email, designation, and phone.

Example 1: keep all the columns, but distribute the records across multiple machines, say ids 1-100000 on one machine, 100001-200000 on the next, and so on (a minimal routing sketch follows below).

Example 2: keep a separate database per region, e.g. Asia Pacific, North America.

Key: picking a set of rows based on some criterion.
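To make Example 1 concrete, here is a minimal sketch of range-based row routing in Python. Everything in it (SHARD_RANGES, route_by_id, the machine names) is an illustrative assumption, not the API of any particular database:

```python
# Minimal sketch of horizontal partitioning by id range: every machine
# stores full rows (all columns), but only for its slice of the id space.
SHARD_RANGES = [
    (1, 100_000, "db-machine-1"),       # hypothetical machine names
    (100_001, 200_000, "db-machine-2"),
    (200_001, 300_000, "db-machine-3"),
]

def route_by_id(employee_id: int) -> str:
    """Return the machine holding the Employees row with this id."""
    for low, high, machine in SHARD_RANGES:
        if low <= employee_id <= high:
            return machine
    raise ValueError(f"no partition covers id {employee_id}")

print(route_by_id(150_000))  # -> db-machine-2
```

Example 2 works the same way, except the routing criterion would be the region column instead of an id range.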


Vertical Partitioning in a Database

It is similar to normalization, in that one table is divided into multiple tables, which are then joined back together when required.

Example: id, name, and designation are put in one table, while phone and email, which may not be frequently accessed, are put in another (a runnable sketch follows below).

Key: picking a set of columns based on some criterion.
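As a runnable sketch of the example above, here is the same vertical split implemented with Python's built-in sqlite3 module; the table names (employees_core, employees_contact) are made up for illustration:

```python
import sqlite3

# Vertical partitioning: hot columns in one table, rarely accessed
# contact columns in another, joined back on id only when required.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees_core (
        id INTEGER PRIMARY KEY,
        name TEXT,
        designation TEXT
    );
    CREATE TABLE employees_contact (
        id INTEGER PRIMARY KEY REFERENCES employees_core(id),
        phone TEXT,
        email TEXT
    );
""")
conn.execute("INSERT INTO employees_core VALUES (1, 'Alice', 'Engineer')")
conn.execute("INSERT INTO employees_contact VALUES (1, '555-0100', 'alice@example.com')")

# Frequent queries touch only the narrow core table; the join cost is
# paid only when contact details are actually needed.
row = conn.execute("""
    SELECT c.name, c.designation, k.email
    FROM employees_core AS c
    JOIN employees_contact AS k ON c.id = k.id
""").fetchone()
print(row)  # ('Alice', 'Engineer', 'alice@example.com')
```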


Sharding

(Source: http://en.wikipedia.org/wiki/Shard_(database_architecture))

A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.

Some data within a database remains present in all shards, but some appears only in a single shard. Each shard (or server) acts as the single source for this subset of data.
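As a hedged illustration of how a client might pick the shard that owns a given key, here is a naive hash-based router in Python; the shard hostnames and the shard_for helper are hypothetical:

```python
import hashlib

# Each shard is a separate server instance holding a disjoint subset
# of the rows; the subset is chosen by hashing the shard key.
SHARDS = ["shard-0.example.com", "shard-1.example.com", "shard-2.example.com"]

def shard_for(key: str) -> str:
    # Use a stable hash (not Python's process-randomized hash()) so the
    # routing is identical across client processes.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("customer:42"))  # always routes to the same shard
```

Note that this modulo-N scheme remaps almost every key whenever a shard is added or removed; the consistent hashing discussed below is the usual remedy.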

Disadvantages include:

- A heavier reliance on the interconnect between servers.

- Increased latency when querying, especially where more than one shard must be searched.

- Data or indexes are often only sharded one way, so that some searches are optimal while others are slow or impossible.

- Issues of consistency and durability, due to the more complex failure modes of a set of servers, which often lead to systems making no guarantees about cross-shard consistency or durability.

In practice, sharding is complex. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this approach is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it and in terms of identifying candidates to be sharded separately. Consistent hashing is one form of automatic sharding, used to spread large loads across multiple smaller services and servers.
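To illustrate the consistent hashing just mentioned, here is a minimal ring in Python (the class name and vnode count are arbitrary choices for this sketch): each server is hashed onto a ring at several points, and a key is stored on the first server clockwise from the key's own hash, so adding or removing a server moves only the keys in the affected arcs.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    # Stable 128-bit hash used for both servers and keys.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        # Place each server at `vnodes` pseudo-random points on the ring
        # ("virtual nodes") to smooth out the load distribution.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
print(ring.server_for("customer:42"))
```

Rebuilding the ring without one server leaves every key owned by the remaining servers in place, which is exactly the property that makes this practical for spreading load across a changing pool of machines.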

Shards compared to horizontal partitioning

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a single database server. It may offer an advantage by reducing index size (and thus search effort), provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index: e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where the customer's zip code already indicates where the row will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does so across potentially multiple instances of the schema. The obvious advantage is that the search load for the large partitioned table can now be split across multiple servers (logical or physical), not just across multiple indexes on the same logical server.

Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost if querying the database required both instances to be queried just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated to each server as complete units.

This is also why sharding is related to a shared-nothing architecture: once sharded, each shard can live in a totally separate logical schema instance, physical database server, data center, or continent. There is no ongoing need to retain shared access (between shards) to the other unpartitioned tables in other shards.

This makes replication across multiple servers easy (simple horizontal partitioning does not). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.

There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex design choice in the architecture of sharded systems: approaches range from making these tables effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding), with many options in between.

A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.
