The Greenplum segment recovery process


# Two segments are already known to be down, so add -R to the start command to bring the system up in restricted mode.

[gpadmin1@hadoop1 ~]$ gpstart -R   
20101027:14:11:55:gpstart:hadoop1:gpadmin1-[INFO]:-Starting gpstart with args: -R
20101027:14:11:55:gpstart:hadoop1:gpadmin1-[INFO]:-Gathering information and validating the environment...
20101027:14:11:55:gpstart:hadoop1:gpadmin1-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.0.1.0 build 1'
20101027:14:11:55:gpstart:hadoop1:gpadmin1-[INFO]:-Greenplum Catalog Version: '201005134'
20101027:14:11:55:gpstart:hadoop1:gpadmin1-[INFO]:-Starting Master instance in admin mode
20101027:14:11:56:gpstart:hadoop1:gpadmin1-[INFO]:-Obtaining Greenplum Master catalog information
20101027:14:11:56:gpstart:hadoop1:gpadmin1-[INFO]:-Obtaining Segment details from master...
20101027:14:11:56:gpstart:hadoop1:gpadmin1-[INFO]:-Master Started...
20101027:14:11:56:gpstart:hadoop1:gpadmin1-[INFO]:-Shutting down master
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[WARNING]:-Skipping startup of segment marked down in configuration: on hadoop1 directory /home/gpadmin1/gp4datap1/aligp0 <<<<<
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[WARNING]:-Skipping startup of segment marked down in configuration: on hadoop1 directory /home/gpadmin1/gp4datap2/aligp1 <<<<<

20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:---------------------------
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Master instance parameters
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:---------------------------
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Database                 = template1
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Master Port              = 2345
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Master directory         = /home/gpadmin1/gp4master/aligp-1
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Timeout                  = 60 seconds
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Master standby start     = On
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:---------------------------------------
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-Segment instances that will be started
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:---------------------------------------
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   Host      Datadir                           Port    Role
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop2   /home/gpadmin1/gp4datam1/aligp0   40000   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop2   /home/gpadmin1/gp4datam2/aligp1   40001   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop2   /home/gpadmin1/gp4datap1/aligp2   30000   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop3   /home/gpadmin1/gp4datam1/aligp2   40000   Mirror
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop2   /home/gpadmin1/gp4datap2/aligp3   30001   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop3   /home/gpadmin1/gp4datam2/aligp3   40001   Mirror
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop3   /home/gpadmin1/gp4datap1/aligp4   30000   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop1   /home/gpadmin1/gp4datam1/aligp4   40000   Mirror
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop3   /home/gpadmin1/gp4datap2/aligp5   30001   Primary
20101027:14:11:57:gpstart:hadoop1:gpadmin1-[INFO]:-   hadoop1   /home/gpadmin1/gp4datam2/aligp5   40001   Mirror

Continue with Greenplum instance startup Yy|Nn (default=N):
> y
20101027:14:11:58:gpstart:hadoop1:gpadmin1-[INFO]:-Starting standby master
20101027:14:11:58:gpstart:hadoop1:gpadmin1-[INFO]:-Checking if standby master is running on host: hadoop2  in directory: /home/gpadmin1/gp4master/aligp-1
20101027:14:11:58:gpstart:hadoop1:gpadmin1-[INFO]:-No db instance process, entering recovery startup mode
20101027:14:11:59:gpstart:hadoop1:gpadmin1-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
.....
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-Process results...
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-----------------------------------------------------
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-   Successful segment starts                                            = 10
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-   Failed segment starts                                                = 0
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-Skipped segment starts (segments are marked down in configuration)   = 2    <<<<<<<<
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-----------------------------------------------------
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-Successfully started 10 of 10 segment instances, skipped 2 other segments
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-----------------------------------------------------
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-****************************************************************************
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-There are 2 segment(s) marked down in the database
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-management utility which will recover failed segment instance databases.
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[WARNING]:-****************************************************************************
20101027:14:12:04:gpstart:hadoop1:gpadmin1-[INFO]:-Starting Master instance hadoop1 directory /home/gpadmin1/gp4master/aligp-1 in RESTRICTED mode
20101027:14:12:05:gpstart:hadoop1:gpadmin1-[INFO]:-Command pg_ctl reports Master hadoop1 instance active
NOTICE:  Master mirroring synchronizing
20101027:14:12:08:gpstart:hadoop1:gpadmin1-[WARNING]:-Database started but warnings generated       <<<<<
20101027:14:12:08:gpstart:hadoop1:gpadmin1-[INFO]:-Check status of database with gpstate utility
[gpadmin1@hadoop1 ~]$ psql -c 'select * from gp_segment_configuration;'
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    4 |       2 | p    | p              | s    | u      | 30000 | hadoop2  | hadoop2 |            10000 |
    6 |       4 | p    | p              | s    | u      | 30000 | hadoop3  | hadoop3 |            10000 |
   10 |       2 | m    | m              | s    | u      | 40000 | hadoop3  | hadoop3 |            20000 |
   12 |       4 | m    | m              | s    | u      | 40000 | hadoop1  | hadoop1 |            20000 |
   11 |       3 | m    | m              | s    | u      | 40001 | hadoop3  | hadoop3 |            20001 |
    5 |       3 | p    | p              | s    | u      | 30001 | hadoop2  | hadoop2 |            10001 |
    7 |       5 | p    | p              | s    | u      | 30001 | hadoop3  | hadoop3 |            10001 |
   13 |       5 | m    | m              | s    | u      | 40001 | hadoop1  | hadoop1 |            20001 |
    1 |      -1 | p    | p              | s    | u      |  2345 | hadoop1  | hadoop1 |                  |
   14 |      -1 | m    | m              | s    | u      |  2345 | hadoop2  | hadoop2 |                  |
    2 |       0 | m    | p              | s    | d      | 30000 | hadoop1  | hadoop1 |            10000 |
    8 |       0 | p    | m              | c    | u      | 40000 | hadoop2  | hadoop2 |            20000 |
    3 |       1 | m    | p              | s    | d      | 30001 | hadoop1  | hadoop1 |            10001 |
    9 |       1 | p    | m              | c    | u      | 40001 | hadoop2  | hadoop2 |            20001 |
(14 rows)
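
To pick the affected instances out of the full listing, the output can be filtered. The following is a minimal sketch, assuming the column order shown above (dbid, content, role, preferred_role, mode, status, port, hostname); the function name `unhealthy_segments` is made up for illustration and is not a Greenplum utility.

```shell
# Hypothetical helper: read `select * from gp_segment_configuration` output
# on stdin and print the rows whose status is not 'u' or whose mode is not 's'.
# Assumes the column order shown in the transcript above.
unhealthy_segments() {
    awk -F'|' '
        NF >= 8 {
            # Trim the padding psql adds around each column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            # Skip the header row; data rows start with a numeric dbid.
            if ($1 !~ /^[0-9]+$/) next
            if ($6 != "u" || $5 != "s")
                printf "dbid=%s content=%s role=%s mode=%s status=%s %s:%s\n", $1, $2, $3, $5, $6, $8, $7
        }'
}

# Usage (assumes psql connects to the master as in the transcript):
#   psql -c 'select * from gp_segment_configuration;' | unhealthy_segments
```

Against the listing above, this would report the two down instances on hadoop1 and the two change-tracking instances on hadoop2.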

[gpadmin1@hadoop1 ~]$ gprecoverseg
20101027:14:12:36:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Starting gprecoverseg with args:
20101027:14:12:36:gprecoverseg:hadoop1:gpadmin1-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.0.1.0 build 1'
20101027:14:12:36:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Obtaining Segment details from master...
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Greenplum instance recovery parameters
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Recovery type              = Standard
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Recovery 1 of 2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Synchronization mode                        = Incremental
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance host                        = hadoop1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance address                     = hadoop1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance directory                   = /home/gpadmin1/gp4datap1/aligp0
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance port                        = 30000
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance replication port            = 10000
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance host               = hadoop2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance address            = hadoop2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance directory          = /home/gpadmin1/gp4datam1/aligp0
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance port               = 40000
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance replication port   = 20000
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Target                             = in-place
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Recovery 2 of 2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Synchronization mode                        = Incremental
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance host                        = hadoop1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance address                     = hadoop1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance directory                   = /home/gpadmin1/gp4datap2/aligp1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance port                        = 30001
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Failed instance replication port            = 10001
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance host               = hadoop2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance address            = hadoop2
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance directory          = /home/gpadmin1/gp4datam2/aligp1
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance port               = 40001
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Source instance replication port   = 20001
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:-   Recovery Target                             = in-place
20101027:14:12:37:gprecoverseg:hadoop1:gpadmin1-[INFO]:----------------------------------------------------------

Continue with segment recovery procedure Yy|Nn (default=N):
> y
20101027:14:12:38:gprecoverseg:hadoop1:gpadmin1-[INFO]:-2 segment(s) to recover
20101027:14:12:38:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Ensuring 2 failed segment(s) are stopped
.
20101027:14:12:39:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Updating configuration with new mirrors
20101027:14:12:39:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Updating mirrors
.
20101027:14:12:40:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Starting mirrors
20101027:14:12:40:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
..
20101027:14:12:42:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Process results...
20101027:14:12:42:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Pausing prober
20101027:14:12:42:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Updating configuration to mark mirrors up
20101027:14:12:43:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Unpausing prober
20101027:14:12:43:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Updating primaries
20101027:14:12:43:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Commencing parallel primary conversion of 2 segments, please wait...
...
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Process results...
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Done updating primaries
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-******************************************************************
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Updating segments for resynchronization is completed.
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-For segments updated successfully, resynchronization will continue in the background.
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-Use  gpstate -s  to check the resynchronization progress.
20101027:14:12:46:gprecoverseg:hadoop1:gpadmin1-[INFO]:-******************************************************************
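
The manual step above (check the catalog, then run gprecoverseg) could be wrapped in a small script. This is only a sketch: the function name `recover_if_needed` is hypothetical, and it assumes that `psql` and `gprecoverseg` are on the PATH and connect to the master with default settings, as in the transcript.

```shell
# Hypothetical wrapper: run gprecoverseg only when the catalog lists
# at least one segment with status 'd' (down).
recover_if_needed() {
    # -A: unaligned output, -t: rows only, so the result is a bare number.
    down=$(psql -At -c "select count(*) from gp_segment_configuration where status = 'd'") || return 2
    if [ "$down" -eq 0 ]; then
        echo "no segments marked down"
        return 0
    fi
    echo "$down segment(s) marked down, starting recovery"
    gprecoverseg -a   # -a: skip the interactive confirmation prompt
}

# Usage:
#   recover_if_needed
```

Skipping the confirmation prompt is a design choice for unattended runs; when running by hand, the interactive form shown in the transcript lets you review the recovery plan first.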
[gpadmin1@hadoop1 ~]$ psql -c 'select * from gp_segment_configuration;'
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    4 |       2 | p    | p              | s    | u      | 30000 | hadoop2  | hadoop2 |            10000 |
    6 |       4 | p    | p              | s    | u      | 30000 | hadoop3  | hadoop3 |            10000 |
   10 |       2 | m    | m              | s    | u      | 40000 | hadoop3  | hadoop3 |            20000 |
   12 |       4 | m    | m              | s    | u      | 40000 | hadoop1  | hadoop1 |            20000 |
   11 |       3 | m    | m              | s    | u      | 40001 | hadoop3  | hadoop3 |            20001 |
    5 |       3 | p    | p              | s    | u      | 30001 | hadoop2  | hadoop2 |            10001 |
    7 |       5 | p    | p              | s    | u      | 30001 | hadoop3  | hadoop3 |            10001 |
   13 |       5 | m    | m              | s    | u      | 40001 | hadoop1  | hadoop1 |            20001 |
    1 |      -1 | p    | p              | s    | u      |  2345 | hadoop1  | hadoop1 |                  |
   14 |      -1 | m    | m              | s    | u      |  2345 | hadoop2  | hadoop2 |                  |
    8 |       0 | p    | m              | r    | u      | 40000 | hadoop2  | hadoop2 |            20000 |
    2 |       0 | m    | p              | r    | u      | 30000 | hadoop1  | hadoop1 |            10000 |
    9 |       1 | p    | m              | r    | u      | 40001 | hadoop2  | hadoop2 |            20001 |
    3 |       1 | m    | p              | r    | u      | 30001 | hadoop1  | hadoop1 |            10001 |

(14 rows)

 

After a few seconds, the same query shows:

 

[gpadmin1@hadoop1 ~]$ psql -c 'select * from gp_segment_configuration;'
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | replication_port | san_mounts
------+---------+------+----------------+------+--------+-------+----------+---------+------------------+------------
    4 |       2 | p    | p              | s    | u      | 30000 | hadoop2  | hadoop2 |            10000 |
    6 |       4 | p    | p              | s    | u      | 30000 | hadoop3  | hadoop3 |            10000 |
   10 |       2 | m    | m              | s    | u      | 40000 | hadoop3  | hadoop3 |            20000 |
   12 |       4 | m    | m              | s    | u      | 40000 | hadoop1  | hadoop1 |            20000 |
   11 |       3 | m    | m              | s    | u      | 40001 | hadoop3  | hadoop3 |            20001 |
    5 |       3 | p    | p              | s    | u      | 30001 | hadoop2  | hadoop2 |            10001 |
    7 |       5 | p    | p              | s    | u      | 30001 | hadoop3  | hadoop3 |            10001 |
   13 |       5 | m    | m              | s    | u      | 40001 | hadoop1  | hadoop1 |            20001 |
    1 |      -1 | p    | p              | s    | u      |  2345 | hadoop1  | hadoop1 |                  |
   14 |      -1 | m    | m              | s    | u      |  2345 | hadoop2  | hadoop2 |                  |
    8 |       0 | p    | m              | s    | u      | 40000 | hadoop2  | hadoop2 |            20000 |
    2 |       0 | m    | p              | s    | u      | 30000 | hadoop1  | hadoop1 |            10000 |
    9 |       1 | p    | m              | s    | u      | 40001 | hadoop2  | hadoop2 |            20001 |
    3 |       1 | m    | p              | s    | u      | 30001 | hadoop1  | hadoop1 |            10001 |

(14 rows)

 

Note the value of the mode column in gp_segment_configuration at each stage.

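1. Because the two primary instances on hadoop1 (ports 30000 and 30001) went down, the two corresponding mirror instances on hadoop2 (ports 40000 and 40001) took over the primary role and their mode changed to c, i.e. change tracking. In this mode they record the changes made during the period in which the original primary instances are down.

2. After gprecoverseg is run, the mode of all four instances changes to r (resynchronizing). At this point the system is applying the recorded changes to bring each primary and mirror pair back in sync.

3. Checking the instance states again shows that synchronization has finished: the mode column is now s (synchronized) for every instance.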
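
The three mode codes seen in the listings can be summarized as a small lookup. This is a minimal sketch; the function name `describe_mode` is made up for illustration, and the state names follow the documentation passage quoted at the end of this post.

```shell
# Hypothetical helper: translate a gp_segment_configuration mode code
# into the state name used in the Greenplum 4.0 documentation.
describe_mode() {
    case "$1" in
        s) echo "synchronized" ;;
        c) echo "change tracking" ;;
        r) echo "resynchronizing" ;;
        *) echo "unknown mode: $1" ;;
    esac
}

# Usage (assumes psql connects to the master as in the transcript):
#   psql -At -c 'select distinct mode from gp_segment_configuration;' |
#       while read -r m; do describe_mode "$m"; done
```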

 

The following passage is from the GPAdmin 4.0 documentation:

In the event of a segment failure, the file replication process is stopped and the mirror segment is automatically brought up as the active segment instance. All database operations then continue using the mirror. While the mirror is active, it is also logging all transactional changes made to the database. This system state is known as Change Tracking mode. When the failed segment is ready to be brought back online, administrators initiate a recovery process to bring it back into operation. The recovery process synchronizes with the mirror and only copies over the changes that were missed while the segment was down. This system state is known as Resynchronizing mode. Once all mirrors and their primaries are synchronized again, the system state becomes Synchronized.
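
Since resynchronization continues in the background after gprecoverseg returns, a script may want to wait for the Synchronized state instead of re-running the query by hand. The following is a sketch under stated assumptions: the function name `wait_for_sync` and its two parameters are hypothetical, and `psql` is assumed to connect to the master with default settings, as in the transcript.

```shell
# Hypothetical helper: poll the master until every row in
# gp_segment_configuration reports mode 's' and status 'u',
# or give up after max_tries polls.
wait_for_sync() {
    interval=${1:-5}     # seconds between polls
    max_tries=${2:-60}   # number of polls before giving up
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        pending=$(psql -At -c "select count(*) from gp_segment_configuration where mode <> 's' or status <> 'u'") || return 2
        if [ "$pending" -eq 0 ]; then
            echo "all segments synchronized"
            return 0
        fi
        echo "$pending segment(s) not yet synchronized"
        sleep "$interval"
        tries=$((tries + 1))
    done
    return 1   # still not synchronized after max_tries polls
}

# Usage: poll every 5 seconds, at most 60 times.
#   wait_for_sync 5 60
```

The function returns 0 once everything is synchronized, 1 on timeout, and 2 if the query itself fails, so a caller can tell a slow resync from a connection problem.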