My PeopleSoft Disaster Recovery Adventure

It was a sunny and cool mid-afternoon Tuesday late in the fall. Everyone in the office was back from lunch, and I was looking at another relatively uneventful week of production support. The financials folks were closing the sub-ledgers, and the HRMS folks were getting ready for year-end bonuses. The spreadsheet I had been tweaking all afternoon sat unsaved as I returned to my cubicle with a Payday candy bar and an ice water.

Then the lights went out. And there was that eerie stillness as the air conditioning stopped and the PCs' CPU fans fell silent. Employees paused and looked up from blank screens. After a long five seconds, everything jumped back to life.

It took about five minutes, but word spread quickly as to the cause. We could see it all too clearly from the 12th floor window. A large water main under the street had busted near the adjacent building, sending thousands of gallons of water up through the concrete, over the curb, and pouring into the ventilation shafts for the downtown power generators.

There was this one guy on the street hopelessly trying to divert the flow of water around one ventilation shaft with a large piece of plywood. But he was only one person and a large whirlpool had formed where the other ventilation shaft had been 30 feet up the street.

It was rumored that the city took two hours to get the water turned off. But I wasn't there to see it because they had sent everyone home. Fifteen minutes after the power outage, the life safety guy came over the emergency intercom telling everyone that we would have to evacuate the building because it might lose power. We were obviously running on backup generators at that time.

Before I went home for the day, I did one last check of the PeopleSoft systems. Our UNIX database and application servers couldn’t see the application disks. That sounded bad.

7 days earlier

The PeopleSoft security administrator had turned in her two week notice and her boss delegated her duties and assignments. As the lone consultant, I was assigned to review the disaster recovery documentation that she had prepared and to send it to the Disaster Recovery team. We planned to test the complete disaster recovery plan two months later.

The scope of the document was simply to instruct someone who knew little about PeopleSoft how to bring the system back up in the unlikely event we ever had to run it from the disaster recovery center. It had the following bullet points:
1) Start the application server and the web server using a custom start-up script.
2) Start the UNIX process scheduler server using the start-up script.
3) Start the NT process scheduler using psadmin.
4) Perform some “sanity checks”.

This assumed that the infrastructure and the database had already been restored and were ready to go.

Given those assumptions, I was adding some practical instructions on how to reconfigure the app servers (since IP addresses would likely change) and thinking about infrastructure dependencies like networking and IP aliases. But I had no idea how timely this assignment would be, or how incomplete the document would ultimately prove to be.

Day 0
When the water rushed down the ventilation shaft of the power company’s power generators (knocking out power to the entire downtown area), it then flowed to the adjoining rooms. One of these rooms was known affectionately to the IT staff as “Data Center 1”, or DC1. Even though my client had built a brand-new state of the art data center in their own building, much of the IT infrastructure had remained here. The SAN storage array that drove most of the critical applications resided here. The water filled this data center up to a depth of about three feet. Servers lived or died depending on how high they were mounted in the rack.

Data Center Two (DC2) was built directly below Data Center One, so the water poured in through the ceiling. Here the server survival odds were more random. Some servers received quite a soaking while others remained relatively dry.

Within hours a team was assembled and on a plane to the disaster recovery site in Pennsylvania.

The Disaster Recovery Architecture
Several months earlier, an IT initiative had identified the enterprise applications that were critical to ongoing business operations. At that time it was determined that PeopleSoft FSCM was a first-level priority and PeopleSoft HRMS was a third-level priority. The logic was that HRMS's critical functions (like benefits and payroll) were already outsourced. Financials was considered critical because it was needed to invoice customers and collect money.

Here is a simplified diagram of the PeopleSoft pre-disaster architecture:

[Figure: Pre-disaster PeopleSoft architecture]

Clients connected to one of two UNIX web servers through a load balancer, and the application servers also ran on those boxes. One of the app/web servers had the PeopleSoft application installed on SAN disk storage, along with the report repository directory.

The database server was a UNIX server running Oracle. The database files, along with the UNIX process scheduler, used SAN disk storage.

The SAN could replicate file changes in real time to the disaster recovery facility, and this replication was enabled for the PeopleSoft UNIX servers.

The Windows cluster provided an NT process scheduler as well as a file server for the production environment. Its application directories were also on SAN disk storage, but on a different frame that wasn't replicated to the disaster recovery environment.

My client had allocated servers in the disaster recovery environment for the PeopleSoft Financials application and other critical applications. However, there was no redundancy built into the servers at the disaster recovery facility. Here's a quick sketch of the PeopleSoft architecture at the disaster recovery facility:
[Figure: Post-disaster PeopleSoft architecture]

Ironically, the online portion of the HRMS system kept right on running. Since HRMS wasn't on the disaster recovery SAN, it didn't have any components in the old data centers. The new data center was untouched, so HRMS didn't miss a beat. The NT process scheduler and file server did go for a swim, but at the time I wasn't worried. We could live without them for a few days.

Day 1
The first 24 hours were spent establishing the infrastructure at the disaster recovery facility, and starting the mission-critical applications. PeopleSoft Financials was deemed critical, but it was at a lower priority than other applications. As a result, the PeopleSoft Financials servers and components didn’t start coming on-line until 48 hours after the flood.

An “IT Command Center” was established later in the day to field all of the calls, prioritize them, and route them to the appropriate infrastructure project manager. They also tracked the issues and reported progress.

In the flooded data centers, crews were busy pumping out water and trying to salvage what they could. “Wet” servers were moved into a staging area to be cleaned and tested. Miraculously, the SAN that so many applications depended on stayed relatively dry, and a first priority was to move it out of the data center before the harsh conditions damaged it further.

Day 2
Since the PeopleSoft Financials system was still down, we got together with the business community to plan out how we were going to restart the application. The disaster happened during the end of month close, so we needed to get the system up and running as quickly as possible. It was critical that everything start cleanly without corrupting any data since the database backup functionality hadn’t yet been established in the DR facility.

We decided on the following plan:
1) When the database came up, an administrator would log on (via SQL*Plus) and determine the date and time through which the database was current. We thought we could get an idea by looking at the most recent entries in the process monitor or integration broker tables, since these tables are updated very frequently (see the sample query after this list). The business community would use this information to identify transactions that needed to be re-keyed.
2) The PeopleSoft administrator would then put all Jobsets (jobs scheduled through the master process scheduler) on hold by running a SQL script. We weren't as worried about putting user-defined recurrences on hold since end users wouldn't be running processes under their own IDs anyway.
3) All user accounts would be locked out except a small subset needed to test the application. Once again this was accomplished with a SQL script.
4) The administrator would then start the web server and application server and allow the subset of users a chance to do sanity checks on the application. If it checked out, we would enable the accounts of users who were directly responsible for completing the Financials closing activities. If the application continued to perform well, we’d turn it on for the entire user base.
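For step 1, a query along these lines gives a rough high-water mark. This is a sketch rather than the exact SQL we used; PSPRCSRQST is the core process monitor request table, and its LASTUPDDTTM column is touched every time a request changes status, so its maximum value approximates how current the restored database is:

-- Rough "how current is this database?" check against the process monitor
select max(LASTUPDDTTM) as last_activity
from PSPRCSRQST;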

Guess what? Nothing is ever as easy as you think. Here are some of the “little” bugs that extended the process.

I received word that the PeopleSoft servers were accessible (although the database was still down) in the mid-afternoon, so I tried to log on to the database server first. My password wasn’t accepted and the SA had to reset it.

Once I was logged on, the mirrored mount points seemed to be there and I was able to execute “psadmin”. Several home directories and users were missing and the SA had to bring them back from a tape backup. Once the users were re-established, we had to reset all of the passwords to their prior values.

On the web/application server, I was able to log on once the SA had reset my password. Unfortunately none of the home directories existed (they were on local storage and weren’t replicated to the DR site). The SA had to bring them back from a tape backup.

Finally we received word that the database was back on-line.

First, I ran SQL to check the process monitor tables and determined that the database was current to within five minutes of the outage. The users later verified that we didn't lose any transactions. I don't know how much mirroring to your DR site costs, but I suspect it was worth every penny!

Next, I ran an SQL script to lock out users and disable all of the Jobsets. So far so good.
-- Lock every operator ID except the handful needed for testing
update psoprdefn set acctlock = 1
where oprid not in ('ASMITH','BWHITE','CSTORY','DSWAN','BATCH','VP1','PTWEBSERVER')
/
-- Put all jobsets on hold
update PS_SCHDLDEFN set schedulestatus = 0
/


I ran another script to update the distribution node URL to go directly to the web server instead of going through the load balancer. There was no load balancer at the disaster recovery site.
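Something along these lines does the trick. The record and field names here (PS_CDM_DIST_NODE, the table behind the report node definition, and its URL field) and the host names are illustrative rather than taken from our environment, so verify them against your PeopleTools release before running anything similar:

-- Illustrative only: point report distribution at the surviving web server
-- instead of the load balancer address that no longer exists
update PS_CDM_DIST_NODE
set URL = 'http://drwebserver01.example.com/psreports/fsprd'
where URL = 'http://prodloadbalancer.example.com/psreports/fsprd';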

We had been running Performance Monitor, but our Performance Monitor database was MIA, so I commented out the EnablePPMAgent setting in the application server and process scheduler configurations.

Then I tried to start up the application server. It errored out with “Missing or invalid version of SQL library libpsora”. Turns out the Oracle Home directory was missing. It took some time to bring this back from tape. I guess it should have been mirrored or pre-installed on the server.

Finally the application server and web server started. While users were doing their “Sanity Checks”, I started the UNIX process scheduler.

I had to tweak the REN server configuration to remove the second app server that we no longer had.

We decided to leave jobsets inactive and manually run whatever had to be run immediately. Since so many of our jobsets imported files from systems that were no longer available, this seemed like a prudent approach until we had the time to analyze all of the jobs.
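For the analysis itself, a quick inventory from the same PS_SCHDLDEFN table used in the hold script gives the list of jobsets to walk through with the analysts (a sketch; the recurrence details live in related tables):

-- List every jobset and whether it is currently on hold (0, per the script above) or active
select SCHEDULENAME, SCHEDULESTATUS
from PS_SCHDLDEFN
order by SCHEDULENAME;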

Day 3

Things started getting a little unstable as work to bring the remaining applications up caused changes in just about every corner of the technical infrastructure. For example, the Oracle database listener went down for no obvious reason, and many servers that resided in the undamaged data center were unpingable from the disaster recovery site.

The crontab file on our batch server had disappeared. We schedule our outbound and inbound file processing via Cron, so this was a problem. Fortunately this was easy to correct once we saw what the problem was.

We unlocked all of the user accounts to allow full access to the application. PeopleSoft performed well running at the disaster recovery site as a result of its server-centric architecture. Most users didn’t notice a performance impact. Other applications that required more data processing on the client didn’t fare as well since the network bandwidth was now very limited.

The Windows process scheduler server was never identified as one to be moved to the disaster recovery site. Fortunately only the server’s storage resided in the flooded data center – the server itself stayed dry in the undamaged one. So alternate storage was identified and we were able to restore the application directories relatively quickly. The only catch was that we lost all changes since the most recent backup which amounted to a few files that we were able to re-create.

Unfortunately, performance on the Windows process scheduler was much slower since the database now resided across the country. We could live with this since most of the heavy processing would be done on the UNIX database server.

Getting the Windows cluster network name working again took several days because of an error that stated “You were not connected because a duplicate name exists on the network” every time a user tried to connect. Resolving the error wasn't a priority for the SAs because of the number of boxes that were still down. We had to tell users how to re-map their drives if they needed things like App Designer or the PeopleSoft ODBC driver. We also had to change file locations at Setup Financials/Supply Chain > Common Definitions > File Locations and Images > File Locations, replacing the cluster name with the server name so the App Engines and SQRs that used this information would work.

Journal posting wasn't working, and it didn't take long to determine why: the COBOL environment was broken. It seemed to be a problem with the COBOL license manager, which didn't like the fact that it found itself on a new server. We had to completely reinstall COBOL to get it working again.

Toward the afternoon we felt like we knew which jobs could run, but we didn't want them to play “catch up” and run once for each run-time missed while we were down. So I pushed each job's next start date/time out the required number of days by running an update in this format:
-- Push the next scheduled run out the required number of days (three here)
update ps_schdldefninfo set nextstartdttm = nextstartdttm + 3 where SCHEDULENAME = 'ARUPDATE';

I didn’t activate the jobs through the script, but I let the analyst do it on-line through the jobset pages. I would have let the analyst update the next start date/time too, but I don’t believe this field is accessible on-line.
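Before handing the jobsets back to the analyst, a quick check against the same ps_schdldefninfo table (again a sketch, not the exact script) confirms that nothing is still scheduled to fire in the past:

-- Any jobset whose next start is still in the past will try to catch up
select SCHEDULENAME, NEXTSTARTDTTM
from ps_schdldefninfo
where NEXTSTARTDTTM < sysdate
order by SCHEDULENAME;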

Lessons Learned

I must say I learned a lot during this experience. Here are some of the lessons I took away:

• Don't put off creating a disaster recovery plan.
• Test the disaster recovery plan.
• Include every server that your application depends on in the disaster recovery plan.
• Carefully review the infrastructure group's disaster recovery plan for each one of your servers. Understand the priority and make sure it corresponds with the application's needs.
• Even if you have real-time replication to the disaster recovery site, don't expect the application to be available immediately. Servers still have to be provisioned, networking must be set up, and other applications are probably in line before yours.
• Don’t forget about impacts to your batch job schedule. At least plan to meet with the people who know it well to talk through things before restarting the process schedulers.
• Even after your application is back up and running, expect new problems to show up as the entire infrastructure is being re-established in a new environment.
• Have a comprehensive system checkout script, and use it before you let users get into the system. Execute it daily for the first week. (A minimal sketch follows this list.)
• You may not have regular backups in the DR facility like you did back home. Manage the system accordingly.
• Don’t expect to have dev/test environments for the foreseeable future. Have an idea about how business can be conducted without one. For example, how comfortable do you feel about applying those tax updates without testing, especially given limited backup functionality?
• Consider the effort to migrate everything from the DR center back home, at least at a high level.
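
For the checkout script, a minimal SQL sketch built from the tables that appear earlier in this story (names illustrative; you would add psadmin status checks and an online login test on top of it) might look like this:

-- Post-restore sanity checks (illustrative; extend for your environment)
-- How many accounts are still locked?
select count(*) as locked_accounts from PSOPRDEFN where ACCTLOCK = 1;
-- Are any jobsets active that should still be on hold?
select SCHEDULENAME from PS_SCHDLDEFN where SCHEDULESTATUS <> 0;
-- Has the process scheduler touched a request recently?
select max(LASTUPDDTTM) as last_scheduler_activity from PSPRCSRQST;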
