Specific Guidance for Each Step in ITDR
The information below offers a comprehensive overview of the ITDR workflow steps and shows the guidance provided within the tool for each step.
Section 0 - Admin Work
DR Plan Owner
The name of the person who will be completing this DR Plan.
Select Organization or Business
See examples below
- School/Unit: SoM, R&DE, H&S, UIT
- Department/Organization: Pediatrics, Finance and Administration, Program, Enterprise Technology
- Group/Team/Division: Genome, IT Systems, Director/Admin, DARRA / SeRA
Date Created
Add the date the plan was created. If this is a new plan being created today, click the clock icon to add the current date.
Select Service Application
Click the blue button below to "Select Service/Application". This will open the database of all previously identified Stanford services and applications. If the service you are creating this DR Plan for is not listed, you will be able to add it in the next step. To create a new service/application entry, wait for the database to load, then click the blue "Create New Entry" button in the upper right-hand corner. The form will open.
Service/Application Name
What service or application is this DR Plan being created for?
Admin-Description
This description is being carried over from the service/application description. To edit, please click the pencil icon below.
Service/Application Description
If the description below requires revision, click the edit pencil to open the field and complete the edits.
Deployment Model
The production deployment type is pre-selected based on the current entry for your service. If you want to change the production deployment type, click on the pencil and check or uncheck the box(es) with the appropriate deployment model(s) implemented for the production instance.
When to use this plan
This plan will be executed in the event of a service outage. This plan is to be executed only when a disaster has been declared and conditions meet scenarios defined in DR Plan, Part A.
Section 1 - Scope
Services and Applications in Scope of this DR Plan
If there are additional applications that are impacted by a system degradation/outage, or that recover along with the infrastructure of the primary service in scope of this plan, please list those additional applications here.
Section 2 - Vendor Information
Vendor Name
Click the blue "Create Vendor" button below to open the Vendor table. From there you can select the appropriate vendor. If the required vendor is not listed and you need to create a new vendor entry, once you open the vendor table, click the blue Vendor button in the upper right-hand corner and a form will open that allows you to create a new vendor listing.
Vendor Documentation
In the following step, you will be asked to select which document types you have secured from the vendor. If you do not have the vendor SOC reports, it is a best practice to request them. SOC2 reports are often restricted and shared only upon request, and it is possible you could be requested to sign a non-disclosure agreement (NDA) to receive it.
Common types of vendor documentation are described below. Please identify any vendor documentation you have secured in support of this service.
- SOC2 reports focus on controls related to security, availability, confidentiality, privacy, and processing integrity. These reports are often requested by customers to demonstrate a service organization's commitment to data security and compliance. These are often restricted to a narrow audience.
- SOC3 reports are public reports that demonstrate an organization's commitment to data security and privacy. SOC3 reports are designed for general public use and are often less detailed than SOC2 reports.
- MSA, or Master Service Agreement, is a comprehensive contract that establishes the general terms and conditions for a long-term business relationship. MSAs often include information that will help you complete additional sections in this workflow.
- Other would be used for any type of documentation that falls outside of the predefined options. If you select "Other" a required text field will open for you to add a descriptor.
Section 3 - Service Infrastructure
Is there a non-production environment available for this service?
Select "Yes" if you have any type of non-production environment.
Non-Production Environments Available for this Service
Select all that apply. If "Other" is selected, a required text field will open for you to add a descriptor.
General Assumptions and Dependencies
Network, authentication, load balancer, and other foundational services are general dependencies common to most services. In this section, please identify and document any specific upstream dependencies, external services, or integrations that are crucial for your service. This includes dependencies on cloud providers and other specialized infrastructure or third-party services your application relies on.
Describe Vendor’s Infrastructure
Click on the edit pencil icon to the left of the vendor name in the table below. This will open a form for you to describe the vendor's infrastructure. You will be able to add/paste in text and might be able to leverage language from the service description, product architecture, contract, SLA, vendor DR plans, or other vendor documentation. You can also upload a diagram or other related documentation.
Vendor's Infrastructure
Describe the features of the vendor’s infrastructure that make the service highly available and resilient. Pull language from the service description, product architecture, contract, SLA, vendor DR plans, etc.
Vendor’s Infrastructure Attachment
If the vendor has provided any diagrams or documentation specific to the infrastructure, please upload it here.
Production Hardware/VM Details
For your PRODUCTION instance you can attach a document or diagram, or you can add each component manually. YOU DO NOT NEED TO DO BOTH! To add a component, select "List Components" then click the blue "+ Add More" button to open the form.
List Components - IaaS
- Provider
- Region / Availability Zone
- Instance Type
- O/S Name & Version
- Other Characteristics (auto-scale, etc.)
List Components - On Prem
- System
- Location
- Hostname
- Manufacturer Model #
- # Processors
- Memory
- O/S Name & Version
- IF VM, note which ESX host
- Additional Details
Disaster Recovery (DR) Instance Hardware/VM Details
For your DISASTER RECOVERY instance you can attach a document or diagram, or you can add each component manually. YOU DO NOT NEED TO DO BOTH! To add a component, select "List Components" then click the blue "+ Add More" button to open the form.
List Components - DR
Please note: The list of components for DR is the same as shown above for production.
Section 4 - Recovery Time and Resources
DR Team - Roles & Responsibilities, Contact Info
Click the blue "+ Add More" button to open a form to enter a member of the DR Team. Repeat the process for each member of the team.
- Role: The role or title (rather than a named person), e.g., engineer, field tech, or programmer.
- Skillset: Provide a list of the skills, experience, and certifications needed to complete the activities outlined in this plan. Also include any technology-related permissions that are required to access documentation or systems in scope of recovery activities.
- Current Primary Person/Team: The name of the primary person currently in this role. If this role is filled through an on-call schedule, provide the name of the on-call team.
- Primary Contact Phone Number: If your team uses a single phone number to support a rotating on-call schedule, please enter that phone number here.
- Primary Contact Email Addresses: Add the best email to reach the primary contact in the event of a system outage.
- Alternate Person/Team: The name of the secondary or alternate person currently in this role. If this role is filled through an on-call schedule, provide the name of the on-call team.
- Alternate Person/Team Phone Number: If your team uses a single phone number to support a rotating on-call schedule, please enter that phone number here.
- Alternate Person/Team Email Addresses: Add the best email to reach the alternate contact in the event of a system outage.
- On-call Schedule: If your team leverages a published on-call schedule to identify primary and alternate staff, please link that schedule here.
Vendor/Suppliers
Include all vendors that would support recovery efforts, including consultants and hardware vendors.
Section 5 - Outage Notifications, Escalations, Monitoring and Communications
Vendor Outage Notification
The next three fields are focused on vendor procedures and notifications to enable the service team to be notified of a system outage or degradation, report issues, and escalate the issue using the vendor's support services.
- Vendor Outage Notification: If your service is supported by a vendor, describe how the vendor relays outage or system interruption/degradation information to the Stanford service team, e.g., email, text, or call. If another Stanford team is engaged, describe that team's responsibilities.
- Vendor System Status Portal: Add the link to the vendor system status portal used by the Stanford service team and/or internal support teams to view system status updates. If permissions are required to access the portal, please specify.
- Vendor Escalation: Describe the channels to escalate within a vendor’s organization, e.g., submit a ticket or call the help desk. If you have a SaaS or IaaS service, these terms might be described within the vendor documentation. If it is unknown, submit a request to the vendor for clarification.
Monitoring and Alerts
In this section, you will document how the service team has configured service monitoring, which channels are used to send alerts, which roles or teams are responsible for monitoring alerts, and the cadence of this monitoring.
- Which monitoring tools are being used? Document the tools being used to monitor service health and send alerts when the system experiences an issue.
- Which team is responsible for monitoring alerts? Document the name(s) of the organizational team(s) responsible for receiving alerts and initiating triage on the issue and impact.
- At what frequency are alerts monitored? Document the frequency or practice for monitoring alerts, e.g., 24/7, hourly, daily, or upon receipt of a ticket.
- How are alerts being communicated to the team responsible for monitoring? If alerts are sent through multiple channels, select all that apply.
- Other: Note the alternative alert channel for this system.
Special Communication and Service Level Considerations
Some service outages require direct communications to the business users, specific clients, or a group of VIPs. For any communications that need to be executed outside of the centralized University IT Major Incident process, provide instructions. Details should include the target of the communication, who should execute the communication, and the cadence of the communication.
Section 6 - Recovery
Recovery Objectives
- Protect personnel, assets, documents, and intellectual property from further injury or damage.
- Minimize economic losses resulting from service interruptions.
- Define steps toward achieving an orderly and complete restoration of service functionality.
- Meet recovery time frame goals deemed critical by University leaders and UIT managers.
RTO stands for Recovery Time Objective, which is the maximum acceptable time it takes to restore a system or application after a disruption or outage. This is an amount of time, in hours, represented by a number.
- Length of time available for recovering disrupted systems and resources, based upon the acceptable level of downtime.
- Maximum amount of time the system can be down, post-disaster.
- A disruption becomes a disaster when the elapsed time for an incident exceeds the RTO.
- The expectation is that a disaster must be declared (and thus, the disaster recovery plan invoked) if you anticipate not being able to meet the RTO target for system and application restoration.
- Very important metric for the business; they must define the target.
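The declaration rule in the bullets above can be sketched as a simple comparison. This is a minimal illustration only, not part of the ITDR tool; the function name and the idea of adding an estimated remaining recovery time to the elapsed time are assumptions made for the example.

```python
def should_declare_disaster(elapsed_hours: float,
                            remaining_hours: float,
                            rto_hours: float) -> bool:
    """Return True when the projected outage length would exceed the RTO.

    elapsed_hours:   time since the disruption began
    remaining_hours: estimated time still needed to restore service (an assumption)
    rto_hours:       the RTO defined by the business, in hours
    """
    return elapsed_hours + remaining_hours > rto_hours


# Hypothetical values: 1 hour into the outage, 4 more hours expected, RTO of 4 hours.
print(should_declare_disaster(1.0, 4.0, 4.0))
```

In this sketch the projected outage is 5 hours against a 4-hour RTO, so the check indicates that a disaster should be declared and the plan invoked.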
RPO stands for Recovery Point Objective, which is the maximum amount of data loss an organization can tolerate in the event of a disaster or system failure, measured as the time elapsed since the last successful data backup. This is an amount of time, in hours, represented by a number.
- Tolerance for loss of data measured in terms of the time between the last backup of data and the disaster event.
- Maximum amount of lost data allowable (time since last backup or data replication).
- Point at which information used by an activity must be restored to enable the activity to operate or resume.
- Primarily an IT metric, driven by the selected backup strategy.
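As a rough illustration of the RPO definition above, the data-loss window can be computed as the time between the last successful backup and the disaster event, then compared with the RPO. The function names and timestamps below are hypothetical and are not taken from the ITDR tool.

```python
from datetime import datetime


def data_loss_hours(last_backup: datetime, disaster: datetime) -> float:
    """Hours of data at risk: time elapsed since the last successful backup."""
    return (disaster - last_backup).total_seconds() / 3600


def within_rpo(last_backup: datetime, disaster: datetime, rpo_hours: float) -> bool:
    """Return True when the data-loss window does not exceed the RPO."""
    return data_loss_hours(last_backup, disaster) <= rpo_hours


# Hypothetical timestamps: backup at midnight, disaster at 06:30 the same day.
last_backup = datetime(2024, 1, 1, 0, 0)
disaster = datetime(2024, 1, 1, 6, 30)

print(data_loss_hours(last_backup, disaster))
print(within_rpo(last_backup, disaster, rpo_hours=4))
print(within_rpo(last_backup, disaster, rpo_hours=24))
```

Under these assumptions, a daily backup can leave up to 24 hours of data at risk, which is why the backup frequency documented later in this section should be consistent with the stated RPO.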
Does this service have data backup implemented?
If yes:
- Describe which backup solution is being used. Solutions can include disk-based replication, BaRS, HP Data Protector, Database Dumps, RMAN, as well as other solutions.
- Describe the data being backed up. Data could include operating system, configuration, and other types of data.
- What is the retention policy for backups? Retention time would be determined by your organization or perhaps by clients. If the retention policy is not determined, add a comment reflecting that it is not determined. It is recommended that the plan be updated when the retention policy is clarified.
- Describe how often backups are performed. Backups can be performed hourly, daily, weekly, or at some other cadence.
- What backups are being stored?
RTO Alert
The RTO entered for this service EXCEEDS the total estimated time to recover as detailed in the recovery steps. Please review the RTO and recovery steps to ensure they are accurate.
Recovery Steps, Execution Lists, Instructions
Use this form to detail step-by-step recovery instructions that would be followed in the event of a service disaster. The text box is configured to number your steps. Click the “+Add More” button to add additional sets of recovery steps. For each set of recovery steps, you will be required to enter an estimated time to execution. The sum of all recovery steps should equal the service recovery time objective (RTO).
- Process Name: Name the process, e.g., Troubleshooting, Application restore, Server rebuild, or Recovery of Full Redundancy.
- Process Details: This text box will assist you with numbering the steps in this process. In each step, provide instructions. For scripts that need to be run, the scripts can be added directly to this text box, or a link for the script location can be added to one of the numbered steps.
- Process Owner: List the name of the role, team or individual responsible for completing this recovery process.
- Process Time Estimate: Note the amount of time in hours and minutes, rounding up to the nearest half hour. For example, if your process takes 3 hours and 15 minutes, you would enter 3.5; if your process takes 2 hours and 47 minutes, you would enter 3.
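The rounding rule for the Process Time Estimate, and the comparison of the summed estimates with the RTO, can be sketched as follows. This is an illustrative example only: the process durations and the RTO value are hypothetical, and the tool's own calculation may differ.

```python
def round_up_to_half_hour(hours: int, minutes: int) -> float:
    """Convert hours and minutes to decimal hours, rounded up to the next 0.5."""
    total_minutes = hours * 60 + minutes
    half_hours = -(-total_minutes // 30)  # integer ceiling division
    return half_hours / 2


# Hypothetical recovery processes: (process name, hours, minutes)
processes = [
    ("Troubleshooting", 0, 40),
    ("Server rebuild", 3, 15),
    ("Application restore", 2, 47),
]

estimates = {name: round_up_to_half_hour(h, m) for name, h, m in processes}
total_hours = sum(estimates.values())

rto_hours = 8.0  # hypothetical RTO, for illustration only
print(estimates)
print(f"Total estimate: {total_hours} h, RTO: {rto_hours} h")
```

Rounding each process up before summing keeps the total conservative, so a gap between the total and the RTO is a prompt to review either the recovery steps or the RTO itself.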
Section 7 - Scenarios
Select the Scenario(s)
Click "Select or Create" to open the list of scenarios. Select scenarios that are relevant to your service. If there are scenarios not listed that you would like to include, click the "Select or Create" button in the upper right-hand corner of the list to add your own.
Scenario Detail
- For each of the scenarios selected, add the steps to be taken to mitigate the impact.
Steps to mitigate the impact of this scenario: List the steps or approach to mitigate the impacts of this specific scenario.
Section 8 - Testing/Drills/Maintenance
Testing/DR Exercises
Document any testing, drills, or exercises conducted to validate the steps and RTO/RPO documented in this plan.
Details: List the activities applied to validate the disaster recovery procedures and recovery objectives (RPO/RTO).
Scheduled Maintenance, Patching, Server Reboot
Document any regular maintenance activity on the system infrastructure.
Details: Document any regularly scheduled maintenance activity on the system infrastructure. This information helps demonstrate the level of engagement and frequency of support provided.
Section 9 - Closing Step and Approval Setup
Additional Comments
If there are any additional details that have not already been addressed earlier in this plan, please add them in the text box below.
Additional DR Documentation
If there is any service related documentation supporting disaster recovery activity that was not uploaded earlier in the plan, upload it here.
Local Approval Instructions
Local Approval
This plan now needs to be approved by the service owner, referred to as "Local Approval." The local approver field has been pre-populated with the document owner named in Section 0. If that person is not the service owner, please edit the field to reflect the name of the service owner. To edit the field, click the pencil icon, and click the X to the right of the name that has been pre-populated. Once the pre-populated name is removed, type in the name or SUNet email address of the service owner.
Once you complete this section, a dialogue box will pop up giving you the option to return to your unit or school's CardinalShield ITDR Dashboard. Additionally, the service owner will be notified by email, advising them that this plan is waiting for their review and approval.
Closing Approval
Once Local Approval has been completed, the CardinalShield ITDR Program managers will be notified. This will prompt closing approval which includes a high-level review of plan details and formatting. This step is to ensure plans are being filed properly within CardinalShield and not to validate plan data. If you have any questions about this activity, you can email itresilience@stanford.edu.
Local Approver
The person named in this field should be the current service owner. They will be responsible for reviewing and providing local approval for this disaster recovery plan.
Section 10 - Local Approval
Date and Time
When you, the owner of the service named in this plan, are ready to provide local approval of this plan, click the clock icon to select the current date and time.
Signature
If you have already created a personal signature in CardinalShield, click the "User Profile Signature" button below to add it to the Signature text box. If you have not created a personal signature, use your mouse and cursor to sign your name on the line in the Signature text box. Use Reset or Undo to remove your attempt. When you have completed a preferred signature, click the blue "Confirm" button.
Section 12 - Closing Approval
Date and Time
When you, the owner of the service named in this plan, are ready to provide local approval of this plan, click the clock icon to select the current date and time.
Signature
If you have already created a personal signature in CardinalShield, click the "User Profile Signature" button below to add it to the Signature text box. If you have not created a personal signature, use your mouse and cursor to sign your name on the line in the Signature text box. Use Reset or Undo to remove your attempt. When you have completed a preferred signature, click the blue "Confirm" button.
