Ceph storage: the meaning of PGs in Ceph
Ceph introduces the concept of the PG (placement group). A PG is purely a logical construct and does not correspond to any physical entity; the notes below explain it clearly.
Ceph first maps each object to a PG, and then maps the PG to OSDs. An object can be part of a data file, a journal file, or a directory file (including embedded inode entries).
With one OSD, there are 192 PGs by default.
With two OSDs, there are 2 * 192 = 384 PGs by default.
PG (Placement Group) notes
Miscellaneous copy-pastes from emails, when this gets cleaned up it should move out of /dev.
Overview
PG = “placement group”. When placing data in the cluster, objects are mapped into PGs, and those PGs are mapped onto OSDs. We use the indirection so that we can group objects, which reduces the amount of per-object metadata we need to keep track of and processes we need to run (it would be prohibitively expensive to track e.g. the placement history on a per-object basis). Increasing the number of PGs can reduce the variance in per-OSD load across your cluster, but each PG requires a bit more CPU and memory on the OSDs that are storing it. We try and ballpark it at 100 PGs/OSD, although it can vary widely without ill effects depending on your cluster. You hit a bug in how we calculate the initial PG number from a cluster description.
There are a couple of different categories of PGs; the 6 that exist (in the original emailer’s ceph -s output) are “local” PGs which are tied to a specific OSD. However, those aren’t actually used in a standard Ceph configuration.
Mapping algorithm (simplified)
Many objects map to one PG.
Each object maps to exactly one PG.
One PG maps to a single list of OSDs, where the first one in the list is the primary and the rest are replicas.
Many PGs can map to one OSD.
A PG represents nothing but a grouping of objects; you configure the number of PGs you want (see http://ceph.com/wiki/Changing_the_number_of_PGs ), number of OSDs * 100 is a good starting point, and all of your stored objects are pseudo-randomly evenly distributed to the PGs. So a PG explicitly does NOT represent a fixed amount of storage; it represents 1/pg_num‘th of the storage you happen to have on your OSDs.
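That starting point can be sketched in a few lines. This is a minimal illustration, not a Ceph tool: the function name is mine, and rounding up to a power of two follows common Ceph practice rather than anything stated in these notes.

```python
def suggested_pg_num(num_osds, target_pgs_per_osd=100):
    # "number of OSDs * 100 is a good starting point" (per the text above).
    raw = num_osds * target_pgs_per_osd
    # Rounding up to the next power of two is common Ceph practice
    # (an assumption of mine, not stated in this document).
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggested_pg_num(4))  # 400 rounded up to 512
```

Remember from the paragraph above that this only sets the granularity of placement, not the amount of storage each PG holds.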
Ignoring the finer points of CRUSH and custom placement, it goes something like this in pseudocode:
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
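The object-to-PG half of the pseudocode above can be made runnable with a stable hash. md5 here is only a deterministic stand-in for Ceph's actual object hash (rjenkins1), and the object name and pg_num are illustrative:

```python
import hashlib

NUM_PG = 128  # pg_num for the pool (illustrative value)

def stable_hash(name):
    # Deterministic stand-in for Ceph's rjenkins1 object hash.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def object_to_pg(object_name):
    locator = object_name
    return stable_hash(locator) % NUM_PG

pg = object_to_pg("rbd_data.1234")
print(pg)  # the same object name always lands in the same PG
```

In real Ceph the next step, crush(pg), then turns that PG id into the list of OSDs.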
If you want to understand the crush() part in the above, imagine a perfectly spherical datacenter in a vacuum ;) that is, if all osds have weight 1.0, and there is no topology to the data center (all OSDs are on the top level), and you use defaults, etc, it simplifies to consistent hashing; you can think of it as:
def crush(pg):
    all_osds = ['osd.0', 'osd.1', 'osd.2', ...]
    result = []
    # size is the number of copies; primary+replicas
    while len(result) < size:
        r = hash(pg)
        chosen = all_osds[r % len(all_osds)]
        if chosen in result:
            # osd can be picked only once
            continue
        result.append(chosen)
    return result
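As written, the sketch above would loop forever on a collision, since hash(pg) never changes between attempts. A runnable version of the same idea follows; salting the hash with an attempt counter is my addition so re-picks can resolve, and the real CRUSH algorithm is of course far more involved:

```python
import hashlib

def crush_sketch(pg, all_osds, size=3):
    # size is the number of copies; primary + replicas.
    result = []
    attempt = 0
    while len(result) < size:
        # Salt the hash with the attempt number (my addition) so the
        # loop makes progress when it re-picks an already-chosen OSD.
        r = int(hashlib.md5(f"{pg}:{attempt}".encode()).hexdigest(), 16)
        attempt += 1
        chosen = all_osds[r % len(all_osds)]
        if chosen in result:
            continue  # an OSD can be picked only once
        result.append(chosen)
    return result

osds = ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
acting = crush_sketch(7, osds)
print(acting[0], acting[1:])  # primary, then replicas
```

The key property to notice is determinism: the same pg id always yields the same OSD list, which is what lets every client compute placement independently.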
User-visible PG States
Todo
diagram of states and how they can overlap
- creating
- the PG is still being created
- active
- requests to the PG will be processed
- clean
- all objects in the PG are replicated the correct number of times
- down
- a replica with necessary data is down, so the pg is offline
- replay
- the PG is waiting for clients to replay operations after an OSD crashed
- splitting
- the PG is being split into multiple PGs (not functional as of 2012-02)
- scrubbing
- the PG is being checked for inconsistencies
- degraded
- some objects in the PG are not replicated enough times yet
- inconsistent
- replicas of the PG are not consistent (e.g. objects are the wrong size, objects are missing from one replica after recovery finished, etc.)
- peering
- the PG is undergoing the Peering process
- repair
- the PG is being checked and any inconsistencies found will be repaired (if possible)
- recovering
- objects are being migrated/synchronized with replicas
- backfill
- a special case of recovery, in which the entire contents of the PG are scanned and synchronized, instead of inferring what needs to be transferred from the PG logs of recent operations
- incomplete
- a pg is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information.
- stale
- the PG is in an unknown state - the monitors have not received an update for it since the PG mapping changed.
- remapped
- the PG is temporarily mapped to a different set of OSDs from what CRUSH specified
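These states overlap in practice: tools like ceph -s report each PG's state as a '+'-joined combination, e.g. active+clean or active+degraded+recovering. A tiny sketch of pulling such a combined string apart (the sample value and function name are mine):

```python
def pg_states(state_string):
    # A PG's reported state is a '+'-joined combination of the
    # individual states listed above, e.g. "active+clean".
    return set(state_string.split('+'))

states = pg_states("active+degraded+recovering")
print(sorted(states))  # ['active', 'degraded', 'recovering']
```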