My PeopleSoft Disaster Recovery Adventure
It was a sunny and cool mid-afternoon Tuesday late in the fall. Everyone in the office was back from lunch, and I was looking at another relatively uneventful week of production support. The financials folks were closing the sub-ledgers, and the HRMS folks were getting ready for year-end bonuses. The spreadsheet that I had been tweaking all afternoon went unsaved as I was returning to my cubicle with a Payday candy bar and an ice water.
Then the lights went out. And there was that eerie stillness as the air conditioning stopped and the PCs' CPU fans fell silent. Employees paused and looked up from blank screens. After a long five seconds, everything jumped back to life.
It took about five minutes, but word spread quickly as to the cause. We could see it all too clearly from the 12th floor window. A large water main under the street had burst near the adjacent building, sending thousands of gallons of water up through the concrete, over the curb, and pouring into the ventilation shafts for the downtown power generators.
There was this one guy on the street hopelessly trying to divert the flow of water around one ventilation shaft with a large piece of plywood. But he was only one person and a large whirlpool had formed where the other ventilation shaft had been 30 feet up the street.
It was rumored that the city took two hours to get the water turned off, but I wasn't there to see it because they had sent everyone home. Fifteen minutes after the power outage, the life safety guy came across the emergency intercom telling everyone that we would have to evacuate the building because it might lose power. We were obviously running on backup generators at that time.
Before I went home for the day, I did one last check of the PeopleSoft systems. Our UNIX database and application servers couldn't see the application disks. That sounded bad.
7 days earlier
The PeopleSoft security administrator had turned in her two-week notice, and her boss delegated her duties and assignments. As the lone consultant, I was assigned to review the disaster recovery documentation that she had prepared and to send it to the Disaster Recovery team. We planned to test the complete disaster recovery plan two months later.
The scope of the document was simply to instruct a user who knew little about PeopleSoft how to start it back up in the unlikely event we ever had to bring it up in the disaster recovery center. It had the following bullet points:
1) Start the application server and the web server using a custom start-up script.
2) Start the UNIX process scheduler server using the start-up script.
3) Start the NT process scheduler using psadmin.
4) Perform some sanity checks.
This assumed the infrastructure had been restored and the database had been restored and was ready to go.
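For illustration, the UNIX-side bullet points might be wrapped in a script like the sketch below. This is a hypothetical reconstruction, not the client's actual script: PS_HOME, the domain name, and the PIA path are placeholders, and the psadmin flags should be checked against your PeopleTools release.

```shell
#!/bin/sh
# Sketch of the UNIX-side DR start-up sequence. All paths and names are
# hypothetical. The NT process scheduler (step 3) is started separately
# on the Windows box with psadmin.
PS_HOME=${PS_HOME:-/psoft/fscm}     # hypothetical install root
APPDOMAIN=${APPDOMAIN:-FSPRD}       # hypothetical app server domain

run() { echo "+ $*" && "$@"; }      # echo each command before running it

start_peoplesoft() {
    run "$PS_HOME/appserv/psadmin" -c boot -d "$APPDOMAIN"    # 1) app server
    run "$PS_HOME/webserv/peoplesoft/bin/startPIA.sh"         # 1) web server (PIA)
    run "$PS_HOME/appserv/psadmin" -p start -d "$APPDOMAIN"   # 2) UNIX process scheduler
}
```

In a real runbook each step would be followed by a verification (a ps -ef check, a log grep) before moving on to the next.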
Given those assumptions, I was adding some practical instructions on reconfiguring the app servers, since IP addresses would likely change, and thinking through infrastructure dependencies like networking and IP aliases. But I had no idea how timely this assignment would be, or how incomplete the document would ultimately prove to be.
Day 0
When the water rushed down the ventilation shaft of the power company's generators (knocking out power to the entire downtown area), it then flowed into the adjoining rooms. One of these rooms was known affectionately to the IT staff as Data Center 1, or DC1. Even though my client had built a brand-new, state-of-the-art data center in their own building, much of the IT infrastructure had remained here. The SAN storage array that drove most of the critical applications resided here. The water filled this data center to a depth of about three feet. Servers lived or died depending on how high they were mounted in the rack.
Data Center Two (DC2) was built directly under Data Center One so the water dumped in through the ceiling. Here the server survival odds were more random. Some servers received quite a soaking while others remained relatively dry.
Within hours a team was assembled and on a plane to the disaster recovery site in Pennsylvania.
The Disaster Recovery Architecture
Several months earlier, an IT initiative took place where the enterprise applications that were critical to ongoing business operations were identified. At that time it was determined that PeopleSoft FSCM was a first-level priority, and PeopleSoft HRMS was a third-level priority. The logic was that HRMS's critical functions (like benefits and payroll) were already outsourced. Financials was considered critical because it was needed to invoice customers and collect money.
Here is a simplified view of the PeopleSoft pre-disaster architecture:
Clients connected to one of two UNIX web servers through a load balancer. The application servers also ran on the web servers. One of the App/Web server boxes had the PeopleSoft application installed on SAN disk storage and the report repository directory was also installed on SAN disk storage.
The database server was a UNIX server running Oracle. The database files, along with the UNIX process scheduler, used SAN disk storage.
The SAN had the capacity to replicate any changes made to any files in a real-time fashion to the disaster recovery facility. This was enabled for the PeopleSoft UNIX servers.
The Windows cluster was used to provide an NT process scheduler as well as a file server for the production environment. The application used SAN disk storage, but it was on a different frame and wasn't replicated to the disaster recovery environment.
My client had allocated servers in the disaster recovery environment that would be used for the PeopleSoft Financials application and other critical applications. However, there was no redundancy built into the servers at the disaster recovery facility. Here's a quick sketch of the PeopleSoft architecture at the disaster recovery facility:
Ironically, the online portion of the HRMS system kept right on running. Since HRMS wasn't on the disaster recovery SAN, it didn't have any components in an old data center. The new data center was untouched, so HRMS didn't miss a beat. The NT process scheduler and file server did go for a swim, but at the time I wasn't worried. We could live without them for a few days.
Day 1
The first 24 hours were spent establishing the infrastructure at the disaster recovery facility and starting the mission-critical applications. PeopleSoft Financials was deemed critical, but it was at a lower priority than other applications. As a result, the PeopleSoft Financials servers and components didn't start coming on-line until 48 hours after the flood.
An IT Command Center was established later in the day to field all of the calls, prioritize them, and route them to the appropriate infrastructure project manager. They also tracked the issues and reported progress.
In the flooded data centers, crews were busy pumping out water and trying to salvage what they could. Wet servers were moved into a staging area to be cleaned and tested. Miraculously, the SAN that so many applications depended on stayed relatively dry, and a first priority was to move it out of the data center before the harsh conditions damaged it further.
Day 2
Since the PeopleSoft Financials system was still down, we got together with the business community to plan out how we were going to restart the application. The disaster happened during the end-of-month close, so we needed to get the system up and running as quickly as possible. It was critical that everything start cleanly without corrupting any data, since database backup functionality hadn't yet been established in the DR facility.
We decided on the following plan:
1) When the database came back up, an admin would log on (via SQL*Plus) and determine the date and time through which the database was current. We thought we could get an idea by looking at the most recent entries in the process monitor or integration broker tables, since these tables are updated very frequently. The business community would use this information to identify transactions that needed to be re-keyed.
2) The PeopleSoft administrator would then put all Jobsets (jobs scheduled through the master process scheduler) on hold by running a SQL script. We weren't as worried about putting user-defined recurrences on hold, since end users wouldn't be running processes under their own IDs anyway.
3) All user accounts would be locked out except a small subset needed to test the application. Once again, this was accomplished with a SQL script.
4) The administrator would then start the web server and application server and let the subset of users do sanity checks on the application. If it checked out, we would enable the accounts of users who were directly responsible for completing the Financials closing activities. If the application continued to perform well, we'd turn it on for the entire user base.
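Step 1 boils down to a single query. Here is a sketch of how I remember it; PSPRCSRQST and its LASTUPDDTTM column are the Process Monitor request table and timestamp as of the PeopleTools releases I worked with, so verify the names against your release:

```shell
# Emit the currency-check query; on the database server, pipe it into
# SQL*Plus, e.g.:  currency_sql | sqlplus -s "$ORA_USER@$ORA_SID"
currency_sql() {
    cat <<'SQL'
select to_char(max(LASTUPDDTTM), 'YYYY-MM-DD HH24:MI:SS')
from PSPRCSRQST;
SQL
}
currency_sql
```

The most recent timestamp in a constantly-updated table like this gives a good lower bound on how current the restored database is.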
Guess what? Nothing is ever as easy as you think. Here are some of the little bugs that extended the process.
I received word in the mid-afternoon that the PeopleSoft servers were accessible (although the database was still down), so I tried to log on to the database server first. My password wasn't accepted, and the SA had to reset it.
Once I was logged on, the mirrored mount points seemed to be there and I was able to execute psadmin. Several home directories and users were missing and the SA had to bring them back from a tape backup. Once the users were re-established, we had to reset all of the passwords to their prior values.
On the web/application server, I was able to log on once the SA had reset my password. Unfortunately, none of the home directories existed (they were on local storage and weren't replicated to the DR site). The SA had to bring them back from a tape backup.
Finally we received word that the database was back on-line.
First, I ran SQL against the process monitor tables and determined that the database was current to within five minutes of the outage. The users later verified that we didn't lose any transactions. I don't know how much mirroring to your DR site costs, but I suspect it was worth every penny!
Next, I ran a SQL script to lock out users and disable all of the Jobsets. So far so good:
update psoprdefn set acctlock = 1
where oprid not in ('ASMITH', 'BWHITE', 'CSTORY', 'DSWAN', 'BATCH', 'VP1', 'PTWEBSERVER')
/
update PS_SCHDLDEFN set schedulestatus = 0
/
I ran another script to update the distribution node URL to go directly to the web server instead of going through the load balancer. There was no load balancer at the disaster recovery site.
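That script amounted to a one-line update. A hypothetical sketch is below; PS_CDM_DIST_NODE with its URI_HOST/URI_PORT columns is the report distribution node table as I recall it, and the host and port are made-up values:

```shell
# Generate the update that points the distribution node straight at one
# web host instead of the (now missing) load balancer.
dist_node_sql() {   # usage: dist_node_sql HOST PORT
    echo "update PS_CDM_DIST_NODE set URI_HOST = '$1', URI_PORT = $2;"
}
dist_node_sql dr-webhost 8000
```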
We had been running Performance Monitor, but our Performance Monitor database was MIA, so I commented out the EnablePPMAgent setting in the application server and process scheduler configurations.
Then I tried to start up the application server. It errored out with "Missing or invalid version of SQL library libpsora". It turns out the Oracle Home directory was missing, and it took some time to bring it back from tape. I guess it should have been mirrored or pre-installed on the server.
Finally the application server and web server started. While users were doing their sanity checks, I started the UNIX process scheduler.
I had to tweak the REN server configuration to remove the second app server that we no longer had.
We decided to leave jobsets inactive and manually run whatever had to be run immediately. Since so many of our jobsets imported files from systems that were no longer available, this seemed like a prudent approach until we had the time to analyze all of the jobs.
Day 3
Things started getting a little unstable as work to bring the remaining applications up caused changes in about every corner of the technical infrastructure. For example, the Oracle database listener went down for no obvious reason, and many servers that resided in the undamaged data center were unpingable from the disaster recovery site.
The crontab file on our batch server had disappeared. We scheduled our outbound and inbound file processing via cron, so this was a problem. Fortunately, it was easy to correct once we saw what the problem was.
We unlocked all of the user accounts to allow full access to the application. PeopleSoft performed well running at the disaster recovery site as a result of its server-centric architecture. Most users didnt notice a performance impact. Other applications that required more data processing on the client didnt fare as well since the network bandwidth was now very limited.
The Windows process scheduler server was never identified as one to be moved to the disaster recovery site. Fortunately, only the server's storage resided in the flooded data center; the server itself stayed dry in the undamaged one. So alternate storage was identified, and we were able to restore the application directories relatively quickly. The only catch was that we lost all changes since the most recent backup, which amounted to a few files that we were able to re-create.
Unfortunately, performance on the Windows process scheduler was much slower since the database now resided across the country. We could live with this since most of the heavy processing would be done on the UNIX database server.
Getting the Windows cluster network name working again took several days due to an error that stated "You were not connected because a duplicate name exists on the network" every time a user tried to connect. Resolving the error wasn't a priority for the SAs because of the number of boxes that were still down. We had to tell users how to re-map their drives if they needed things like App Designer or the PeopleSoft ODBC driver. Also, we had to change file locations at Setup Financials/Supply Chain > Common Definitions > File Locations and Images > File Locations, changing the cluster name to the server name so App Engines and SQRs that used this information would work.
Journal posting wasn't working, and it didn't take long to determine why: the COBOL environment was broken. It seemed to be a problem with the COBOL license manager, which didn't like the fact that it found itself on a new server. We had to completely reinstall COBOL to get it working again.
Toward the afternoon we felt like we knew which jobs could run, but we didn't want them to play catch-up and run once for each run-time missed since we went down. So I pushed the next start date/time forward by the required number of days by running an update in this format for each job:
update ps_schdldefninfo set nextstartdttm = nextstartdttm + 3 where SCHEDULENAME = 'ARUPDATE';
I didn't activate the jobs through the script; I let the analyst do it on-line through the jobset pages. I would have let the analyst update the next start date/time too, but I don't believe this field is accessible on-line.
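With several jobsets to push forward, a tiny generator saves re-typing the update for each one. A sketch, using the same table and column names as the update format above (recalled from memory, so check them against your release):

```shell
# Emit one next-start-date bump per jobset name.
bump_sql() {    # usage: bump_sql DAYS JOB [JOB...]
    days=$1; shift
    for job in "$@"; do
        echo "update ps_schdldefninfo set nextstartdttm = nextstartdttm + $days where SCHEDULENAME = '$job';"
    done
}
bump_sql 3 ARUPDATE
```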
Lessons Learned
I must say I learned a lot during this experience. Here are some of the lessons I took away:
Don't put off creating a disaster recovery plan.
Test the disaster recovery plan.
Include every server that your application depends on in the disaster recovery plan.
Carefully review the infrastructure group's disaster recovery plan for each one of your servers. Understand the priority and make sure it corresponds with the application's needs.
Even if you have real-time replication to the disaster recovery center, don't expect the application to be available immediately. Servers still have to be provisioned, networking must be set up, and other applications are probably in line before yours.
Don't forget about impacts to your batch job schedule. At least plan to meet with the people who know it well to talk through things before restarting the process schedulers.
Even after your application is back up and running, expect new problems to show up as the entire infrastructure is being re-established in a new environment.
Have a comprehensive system checkout script, and use it before you let users get into the system. Execute it daily for the first week.
You may not have regular backups in the DR facility like you did back home. Manage the system accordingly.
Don't expect to have dev/test environments for the foreseeable future. Have an idea about how business can be conducted without one. For example, how comfortable do you feel about applying those tax updates without testing, especially given limited backup functionality?
Consider the effort to migrate everything from the DR center back home, at least at a high level.
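As a starting point for that checkout script, something like the sketch below works: each probe prints OK or FAIL, and the list grows as you discover new failure modes. The commented-out probes are examples with made-up hostnames and ports:

```shell
#!/bin/sh
# Minimal checkout harness: run each probe, report OK/FAIL, keep going.
check() {   # usage: check DESCRIPTION COMMAND [ARGS...]
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $desc"
    else
        echo "FAIL $desc"
    fi
}
# Example probes (hypothetical hosts; uncomment and adapt):
# check "Oracle listener reachable"  nc -z dr-dbhost 1521
# check "PIA signon page up"         curl -sf http://dr-webhost:8000/ps/signon.html
# check "batch crontab present"      crontab -l
check "shell sanity" true
```

Keep the output terse enough to eyeball in seconds; the point is a daily one-command health check, not a monitoring system.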