Operational Principles and Practices for All UIT Servers

Summary

This document defines how UIT servers should be built, configured, and operated - whether physical, virtual, or containerized, on campus or in the cloud.

Principles
Practices
AWS Practices
Puppet Practices
Editors and Contributors
References

Principles

Virtualize or Containerize by Default

Many of the principles and practices in this document apply equally to campus-based virtual machines or containers and to cloud-based IaaS instances or containers. However, physical machines generally require a different approach (unless they boot from a SAN or NAS, but that requires extra custom work that is unique to physical systems). Since virtual servers are more portable, faster to provision and start, and often better optimized, in an ideal world the only physical servers would be those hosting virtual servers. Deployments should use the platform’s native virtualization whenever possible. This means that moving from on-prem to the cloud often requires refactoring the VM platform: there is no need to migrate the VMware VM layer to AWS, for example.

Loose Coupling

Loosely coupled services are more robust, scalable, and adaptable than tightly coupled services. Services must be identified using DNS names rather than IP addresses, and must use DNS names rather than IP addresses when addressing other services. The names should be specific to the service: for an app called foo, configuring a DNS alias of foo.db.stanford.edu that points to db.stanford.edu is more robust than configuring db.stanford.edu directly, since the alias can be changed to point to a different database server without changing the configuration for foo. Services should be exposed as publicly as possible. Services should not rely on IP-based access controls; they should use strong authentication and transport-level encryption instead, so that clients and other services can scale automatically or change IP addresses (including moving to a different provider network). Where services do not use well-known ports, they should use DNS SRV records to provide port information to clients.
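As a rough illustration of how a client might consume SRV data, the sketch below picks a target from a set of already-resolved SRV records: lowest priority first, then a weighted random choice, loosely following RFC 2782. The record values and host names are hypothetical, and the DNS lookup itself is left out because the Python standard library has no SRV resolver.

```python
import random
from typing import NamedTuple, Optional, Sequence


class SrvRecord(NamedTuple):
    """One SRV answer: priority, weight, port, and target host name."""
    priority: int
    weight: int
    port: int
    target: str


def pick_srv_target(records: Sequence[SrvRecord],
                    rng: Optional[random.Random] = None) -> SrvRecord:
    """Choose a record from the lowest-priority group, weighted by `weight`."""
    if not records:
        raise ValueError("no SRV records to choose from")
    rng = rng or random.Random()
    best_priority = min(record.priority for record in records)
    candidates = [r for r in records if r.priority == best_priority]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights are zero: treat the candidates as equally likely.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical records for a service that does not use a well-known port.
EXAMPLE_RECORDS = [
    SrvRecord(10, 60, 8443, "foo-a.example.org"),
    SrvRecord(10, 40, 8443, "foo-b.example.org"),
    SrvRecord(20, 0, 8443, "foo-backup.example.org"),
]
```

A client would connect to the chosen record's `target` and `port`, and fall back to the higher-numbered priority group only when every target in the preferred group is unreachable.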

Private IPs

Many organizations, and home networks, use RFC1918 private address spaces (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16). RFC1918 address spaces must not be routed over the internet, but can be routed over large internal networks (like the SUNet shadow nets). Cloud providers also use the same RFC1918 addresses internally; in fact, AWS assigns the same address range (172.31.0.0/16) to the default VPC in every AWS account. While it may be tempting to bridge or route RFC1918 networks between campus and external providers like AWS using Direct Connect or a VPN, doing so is at odds with our Loose Coupling principle because it creates a tightly coupled dependency on the campus network. Stanford is located in an active earthquake region, so we should never assume that the campus network will always be available.
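To make the address ranges concrete, here is a minimal sketch that tests whether an address falls inside one of the three RFC1918 blocks listed above. It uses explicit networks rather than `ipaddress`'s `is_private` attribute, which also covers other reserved ranges. The function name is illustrative only.

```python
import ipaddress

# The three RFC1918 private address blocks.
RFC1918_NETWORKS = tuple(
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
)


def is_rfc1918(address: str) -> bool:
    """Return True if `address` lies inside an RFC1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)
```

For example, an address in the AWS default VPC range such as 172.31.4.7 is RFC1918 space, so a service endpoint that resolves to it would not be reachable from outside that VPC.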

Public IPs

To support loose coupling, service endpoints must have public IP addresses. For single server services, or pools of servers running without a load balancer, the servers must have public IP addresses. For clusters / pools of servers behind a load balancer, the load balancer must have a public IP, but the servers can have private IPs.

DNS Naming

All service endpoints must have names in DNS that resolve to public IPs. Servers behind load balancers can have, but do not need, public DNS names. If the hosting provider creates and maintains fixed DNS names for services, those names can be aliased in Stanford’s DNS using CNAME resource records. If the services or hosts have changing DNS names (for example, EC2 instances have DNS names based on their public IP), then the associated domains should be delegated to a DNS service (preferably the provider’s) that provides an API for timely changes.

Leverage the Features of Our Tools

We often deploy new tools without leveraging the advantages they bring. For example, consider Splunk. Before Splunk was deployed on campus, many UIT groups dumped raw log files to shared file systems, or forwarded logs via syslog/rsyslog to a central logger. After Splunk was deployed, logs continued to be forwarded to Splunk via rsyslog, which required few changes on the log generators but ignored the Splunk Forwarder’s ability to parse specific log files into structured data.

Use Service Specific Images

All non-physical systems should be run from pre-built images. Images should be specific to the services they run; there should be no “monster” images with all possible software installed. Images should be built using configuration management tools such as Puppet, Chef, or Ansible. UIT has standardized on Puppet for Linux systems and Microsoft SCCM for Windows. Post-boot configuration should be kept to a bare minimum, and images should be as ready to run as possible. However, images should not contain any configuration specific to an instance of the service or application. For example, a Drupal or WordPress image could contain the configuration for Stanford’s SAML IdP, and most of the SAML SP configuration, but would not contain the configuration for the database. Images should be frequently, and automatically, built and tested.

Configuration and Packaging Are Code

Local package definitions, configuration files, and configuration management tool definitions, scripts, etc. should be treated as code. See the Code Management section below for more details.

Stateless Systems

All service or application state should be outside the image, in databases, or on reliable storage systems (on campus: SAN or NAS; AWS: EFS, S3, or independent EBS volumes). In some environments - containers, for instance - state can also be injected at boot time via environment variables.
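As one possible shape for boot-time injection, the sketch below reads instance-specific database settings from environment variables, so the image itself carries none of them. The variable names (`APP_DB_HOST`, `APP_DB_PORT`, `APP_DB_NAME`) and the default port are assumptions made for illustration, not names defined by this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DatabaseSettings:
    host: str
    port: int
    name: str


def load_database_settings(env: Optional[Mapping[str, str]] = None) -> DatabaseSettings:
    """Build settings from the environment; fail fast if anything is missing."""
    env = os.environ if env is None else env
    missing = [key for key in ("APP_DB_HOST", "APP_DB_NAME") if not env.get(key)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return DatabaseSettings(
        host=env["APP_DB_HOST"],
        # Hypothetical default; the port is optional in this sketch.
        port=int(env.get("APP_DB_PORT", "3306")),
        name=env["APP_DB_NAME"],
    )
```

Failing at startup when a setting is missing makes a misconfigured instance easy for a health check or scaling service to detect and replace.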

No Patching or Updates on Running (Virtual) Systems

Physical servers must be patched according to MinSec requirements. Since virtual servers (VMware or Hyper-V VMs, AWS EC2 instances, Docker containers, etc.) should have no local state, they should not be patched in place; instead, they should be replaced with instances running a newer version of the image. If application patching also patches the database, it’s preferable to split the application and database patching into separate processes. If the patching cannot be separated, or the patches are not backwards-compatible (i.e. the new application version cannot use the old database version, or vice versa), then the service will have to be stopped while a patched image is built and used to patch the database. Obviously, this is not ideal, and should be addressed with the application vendor.

Practices

Minimum Security Standards

Unless ISO has approved alternative mitigations, Stanford’s Minimum Security Standards must be followed for all environments (campus and cloud, physical and virtual).

Access Control

Servers should only run the minimum necessary services. Both host-based and network-based access controls should be implemented. Public services should obviously be open to the world. Ideally, services with strong authentication and transport-level encryption should also open access to their service ports as widely as possible (see the Loose Coupling principle). Servers with weak or no authentication and no transport-level encryption should only allow access from approved clients, or should use secure proxies or gateways. Take Splunk as an example: Splunk Indexers (servers) should be open to the world, since Splunk Forwarders authenticate to the Splunk Indexers using client certs over TLS-encrypted connections. All non-service ports should be blocked; administrative ports (such as SSH) should be restricted. Administrative ports for servers within private networks (AWS VPCs, for instance) should not be directly accessible from the internet. A secure bastion host should be used as a gateway to those servers, and access should only be allowed from the bastion. Access from the bastion must also require multi-factor authentication.

Secure Bastion Hosts

Secure bastion hosts should enforce multi-factor authentication (e.g. SSH key and Duo, or Kerberos and Duo), or only allow access via physically secured credentials (e.g. SSH keys generated on a PIN- and touch-protected YubiKey). While bastion hosts must be used to access other servers within the protected network, those hosts must not store credentials for access to servers. Bastion hosts using SSH keys should not allow users to upload additional trusted keys; only keys installed by configuration management should be trusted.

Software Installation

Software should never be manually installed. Configuration management tools should be used to install native OS packages from trustworthy sources, or locally created packages, or from versioned tarballs (also from trustworthy sources) with hash and/or signature verification.
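A minimal sketch of the hash verification step for a versioned tarball, assuming the expected SHA-256 digest is published by the trusted source and pinned alongside the configuration management code. The function names are illustrative; in practice the configuration management tool's own checksum support would normally do this work.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large tarballs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path: str, expected_sha256: str) -> None:
    """Raise ValueError unless the file matches the pinned digest."""
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
```

A hash only proves that the download matches the pinned value; signature verification is still needed to establish who published the tarball.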

Logging

Logs, especially access and audit logs, should be sent to Stanford’s Splunk service.

Monitoring

Most cloud providers have some basic service and instance monitoring and alerting capabilities. Most groups will need additional monitoring tools, either integrated with the cloud provider’s tools, or in addition to them. Application-level monitoring is still required in the cloud, but is usually application specific. High-level monitoring that can be done across a load balanced pool of servers can run from anywhere. Lower-level monitoring that needs access to individual servers behind a load balancer must be run inside the same environment as the servers, since individual servers are not directly accessible from the internet.

Code Management

All locally developed code - including applications, packaging, configuration files, configuration management definitions, scripts, etc. - must be under the control of a revision control system, to facilitate collaboration and to ensure that code is preserved and auditable. UIT staff should use git and code.stanford.edu (public code could also be hosted on github.com). Sensitive data should be encrypted with a tool like git-crypt before being committed, and commits for sensitive systems should be GPG signed.

Process Automation

Manual processes are error-prone, and rely on manually maintained documentation keeping up with changes to the process. Automated processes are less error-prone, especially when used frequently. Package, image, system, and configuration building and testing should be automated; deployment of updates should also be automated, although there are times when a human approval step is appropriate. Similarly, scaling and recovery should also be automated. If a system crashes, it should be automatically replaced. If a system is hung or misbehaving, it should be terminated (then automatically replaced), or removed from the pool and saved for further analysis.

Credential Management

Systems should not use local password stores. Configuration management tools should be used to create and manage user accounts, and to configure authentication (.k5login files for Kerberized systems; SSH authorized_keys files for other Linux systems). Systems should use frequently changing role-based credentials where available (AWS, for instance), or decrypt credentials as required. API keys, database passwords, and other credentials should be audited and rotated regularly.

Service Availability

All systems should be architected to survive failures. In practice, this means leveraging scaling services such as AWS Auto Scaling to ensure that the correct number of servers is running (even if the correct number is one). In some cases this will require the systems to update their own DNS records at startup. Systems should be located (or be allowed to run) in multiple locations. Use multiple ECHs on campus, and all the availability zones in a region in AWS.

Server Grouping

Servers in the same pool should be grouped into the same subnets (campus, GCP) or security groups (AWS). Unrelated servers should not share subnets or security groups. Servers in different tiers of the same application should also be grouped separately.

Load Balancing

Services should also mask the failure of individual systems and spread the load across multiple backends. Some applications can rely on round-robin DNS, while others can leverage failover DNS (as provided by AWS Route 53). However, DNS solutions can cause problems for some clients, and in many cases a load balancer is required to fully mask outages. In AWS, Elastic Load Balancers (ELBs) can be associated with Auto Scaling Groups (ASGs) to automatically add and remove backend servers as the ASG scales up and down. NOTE: AWS ELB IP addresses change frequently, especially during scaling events. Regular DNS records should be CNAMEs that point to the ELB’s DNS name, or the domain can be delegated to AWS Route 53, in which case an alias A record (not a CNAME, and specific to Route 53) can be used instead. Load balancers must be available in the same locations as the backend systems, but not necessarily the same networks (on campus the load balancers can be in a separate VLAN / CIDR space; in AWS the ELBs should have dedicated subnets in each availability zone). Depending on the application, load balancing may also be required between application tiers.

Backup and Recovery

The same backup and recovery requirements apply to application data, regardless of where the data is stored. Existing backup solutions can be used, or data can be backed up to AWS S3. S3 can also be configured to retain multiple versions of objects, to migrate older objects (files) to long-term storage (Glacier), and to automatically expire objects (and old versions of objects) after a specific retention period. Automatic replication to S3 buckets in other regions can also be configured. Code and configuration are already backed up by code.stanford.edu, but can also be backed up with the application (with suitable protections for credentials).

Image Management

Most platforms do not include image management, so processes are needed to identify and remove old, unused images.

AWS Specific Practices

There are some specific practices that should be followed when building and running systems in AWS (using EC2 or ECS for servers).

AWS Account Management

AWS accounts are free (until resources are created / used), so create as many as are needed. Use consolidated billing to receive one bill each month. Consolidated billing is another reason to use multiple AWS accounts: while the billing records can be configured to contain resource tags, the account incurring the cost is included in every entry. When creating a new AWS account, use a mailing list for the email address rather than a personal email address. Accounts under the UIT consolidated billing accounts receive lower egress charges for traffic from AWS to campus. When a new account is created, create a long, random password and store it securely (in multiple places). Enable MFA for the “root” account, and ensure that multiple people have access to the MFA device (usually by having multiple people configure their Duo or Google Authenticator app at the same time). Create admin roles, and enable SAML SSO to map workgroup membership to those roles. Create IAM user accounts for admins, and have each admin also configure MFA for their account. The admins should also generate and download AWS access keys to use with the AWS CLI tools and other tools that use the AWS API (such as Terraform or Packer). Create separate IAM user accounts (and keys) for automation tools.

Root Account Credential Escrow

As a matter of business continuity, the AWS accounts associated with production services should escrow root account credentials, perhaps with the Information Security Office or another central team’s repository. Best practice is then to grant roles with adequate privileges to the admins who are operationally responsible for production services. In this way, operational staff have adequate authority, admin actions (changes, logins, etc.) are associated with a person, and roles can be lifecycle managed without affecting the entire AWS account.

CloudTrail

CloudTrail provides an audit trail for events in an AWS account. Enable CloudTrail and configure Stanford’s Splunk service to pull logs from CloudTrail.

Network Segmentation

Create a VPC for each application (either in a shared AWS account, or in an application-specific AWS account). Within each VPC, create unique subnets for each tier in each Availability Zone. Ensure that routes exist to pass traffic from internal systems to the internet via the Internet Gateway (IGW). Applications should communicate with each other over the “public” internet, rather than via VPC peering; this maintains the loose coupling we want, allowing apps to be re-deployed as needed. Each VPC should have a unique bastion host.

Security Groups

Security groups serve two purposes in AWS: they identify a group of instances, and they control access to the same group of instances.

IAM Instance Role Profiles

AWS Identity and Access Management (IAM) supports the creation of IAM roles to define a set of permissions; those roles can be assumed by IAM users and by AWS services. IAM also supports instance role profiles, which allow IAM roles to be assigned to EC2 instances. The roles include temporary credentials that can be retrieved from the instance metadata; the credentials are rotated frequently, so processes should periodically refresh credentials from the instance metadata. The role credentials can be used to make CLI or API calls against AWS services.
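The refresh logic can be sketched as a small cache that re-reads the credentials shortly before they expire. This is a sketch under stated assumptions: the `fetch` callable is a hypothetical stand-in for an HTTP GET of the role's credentials document from the instance metadata service, and the five-minute margin is an arbitrary choice. The AWS SDKs and CLI normally perform this refresh themselves, so code like this is only needed for tools that read the metadata directly.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

# The credentials document is served under this metadata path, followed by
# the role name. Shown for reference; this sketch does not call it directly.
METADATA_PATH = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


class RoleCredentials:
    """Cache role credentials and refresh them shortly before they expire."""

    def __init__(self, fetch: Callable[[], str],
                 refresh_margin: timedelta = timedelta(minutes=5)) -> None:
        self._fetch = fetch            # returns the JSON credentials document
        self._margin = refresh_margin  # assumed safety margin before expiry
        self._cached: Optional[dict] = None
        self._expires: Optional[datetime] = None

    def get(self, now: Optional[datetime] = None) -> dict:
        now = now or datetime.now(timezone.utc)
        if self._cached is None or now >= self._expires - self._margin:
            document = json.loads(self._fetch())
            # The Expiration field is a UTC timestamp such as 2030-01-01T01:00:00Z.
            self._expires = datetime.strptime(
                document["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            self._cached = document
        return self._cached
```

Callers ask for `get()` each time they need credentials rather than holding on to the returned values, so a rotation is picked up without restarting the process.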

Create Images

Amazon Machine Images (AMIs) can be created from existing EC2 instances. While it’s possible to create an image from a running EC2 instance, a better approach is to allow the instance to be rebooted as part of the image creation process so that the file system is consistent. The best approach is to automate image builds using a combination of a continuous integration tool (e.g. Jenkins or GitLab-CI), and a tool like Packer (which automates the creation of a new EC2 instance, then creates an image from it) to configure the image using a configuration management tool (e.g. Puppet). For example, to update an AWS auto-scaling group, the process is:

Build a new version of the AMI
Create a new launch configuration with the same settings as the current one, but pointing to the new AMI
Update the auto-scaling group to use the new launch configuration
Replace the old running instances with new ones. The easiest way is to double the desired capacity of the pool, wait for all the new servers to start, then reset the desired capacity back to its original value. By default, AWS will terminate the servers which use the older launch configuration as it shrinks the auto-scaling group (this is how www.stanford.edu is implemented).
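The last three steps above can be sketched against the AWS Auto Scaling API. In this sketch the `client` argument is assumed to be a boto3 Auto Scaling client (or any object with the same four methods), `settings` holds the launch configuration parameters copied from the current one, and `wait_until_healthy` is a hypothetical callback that blocks until the new servers are in service. Building the AMI (step one) is left to the image pipeline.

```python
from typing import Any, Callable, Dict, Tuple


def roll_auto_scaling_group(client: Any,
                            group_name: str,
                            new_config_name: str,
                            ami_id: str,
                            settings: Dict[str, Any],
                            wait_until_healthy: Callable[[str, int], None]) -> Tuple[int, int]:
    """Point the group at a new AMI, then cycle instances by doubling the pool."""
    # New launch configuration: same settings, new image.
    client.create_launch_configuration(
        LaunchConfigurationName=new_config_name,
        ImageId=ami_id,
        **settings,
    )
    # Switch the group over; only newly launched instances are affected.
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        LaunchConfigurationName=new_config_name,
    )
    group = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    original = group["DesiredCapacity"]
    doubled = original * 2
    if doubled > group["MaxSize"]:
        raise ValueError("MaxSize is too small to double the pool")
    # Double the pool, wait for the new servers, then shrink back.
    client.set_desired_capacity(AutoScalingGroupName=group_name,
                                DesiredCapacity=doubled)
    wait_until_healthy(group_name, doubled)
    client.set_desired_capacity(AutoScalingGroupName=group_name,
                                DesiredCapacity=original)
    return original, doubled
```

Passing the client in, rather than creating it inside the function, keeps the sketch free of credentials and region settings and lets the sequence be exercised against a stub before it touches a real group.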

Create Launch Configurations

Autoscaling launches systems based on Launch Configurations, which define the tags, AMI, instance type, security group, the default SSH key, IAM Instance Role Profile and any additional EBS volumes that will be created for each instance. Launch Configurations can also be used to associate user data (cloud-init scripts) with instances; the cloud-init scripts can be used to customize the instance when it boots (acquire credentials, attach external storage, etc). Multiple launch configurations can be maintained, but an autoscaling group can only be associated with one launch configuration at a time.

Auto Scale Everything

As mentioned before, even single instance services should use auto-scaling to ensure that downtime is minimized. Create appropriate policies for scaling a pool of servers up and down as demand changes (this will depend on the application - some may need to scale based on CPU or memory consumption, others may need to scale based on network I/O).

Use a Load Balancer

Most services will require more than a single instance, and most services with more than a single instance will require a load balancer. AWS provides an Elastic Load Balancer (ELB) service, and ELBs can be associated with an auto-scaling group. ELBs can be configured to perform healthchecks on backend instances, and remove unhealthy instances from the pool. ELBs can be configured to load balance HTTP and HTTPS traffic (with SSL offload), or to just load balance TCP traffic. Where it’s possible, it’s better to use HTTP/HTTPS than TCP. Ensure that your application has an endpoint that can be used for healthchecks by the load balancer.
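A health check endpoint can be very small. The sketch below, using only the Python standard library, answers a load balancer's HTTP probe with 200 when the application considers itself healthy and 503 otherwise. The `/healthz` path and the `application_is_healthy` placeholder are assumptions for illustration; a real check would test whatever the service depends on, such as its database connection.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def application_is_healthy() -> bool:
    """Placeholder check; replace with real dependency checks."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            healthy = application_is_healthy()
            status = 200 if healthy else 503
            body = b"ok\n" if healthy else b"unhealthy\n"
        else:
            status, body = 404, b"not found\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        # Keep frequent load balancer probes out of the error log.
        pass


def make_health_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Create (but do not start) the server; port 0 picks a free port."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Calling `make_health_server("0.0.0.0", 8080).serve_forever()` would start it on a fixed port that the load balancer's health check can then be pointed at.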

Logging

Logs can be sent to Splunk directly from each instance, or they can be aggregated onto a single logging instance, which sends them to Splunk, or they can be aggregated into a shared EFS file system, and monitored by a Splunk Forwarder. In any of these scenarios, Splunk Forwarders with proper log file format configuration, and using client certificates for authentication, are preferred over generic syslog/rsyslog forwarders.

Databases

AWS’ Multi-AZ Relational Database Service (RDS) is preferred over databases running on EC2 instances. RDS MySQL, Aurora, and Postgres have no effective limit on the number of databases per instance, while RDS Oracle and SQL Server have a 12 database per-instance limit. RDS instances can be shared, but each service must have a unique alias for the RDS instance to facilitate splitting of the RDS instance, etc. If your application can use read-only replicas, you can configure RDS instances with read-only replicas. You can also configure read-only replication to replicas in other AWS regions.

Application Data

Other application data can be stored in EFS, or independent EBS volumes. EFS is preferred for shared data if the application is supported on NFS and performance is sufficient. Independent EBS volumes are similar to LUNs on a SAN, and can be formatted with any Linux file system. Cloud-init scripts are needed to find the EBS volume(s) and attach them to an instance, and the instance must be in the same availability zone as the EBS volume. EBS volume snapshots can be taken at any time, and used to create new EBS volumes. S3 is also an option for applications that support it directly or via a plugin / module.

Cost Monitoring

While almost everything in AWS is billed for, there are a few services that tend to make up the most significant portion of the bill: EC2 instances, RDS instances, storage usage (EFS, EBS, and S3), and data egress. AWS does not charge for data ingress (data going from outside into AWS). Reserved instances can be used to reduce ongoing costs, and to ensure that you will be able to get the instances you need, but billing can be complicated when used with consolidated billing. Spot instances can be useful when you can be flexible about when processing is done. Be aware that spot instances can be terminated at any time, so all state should be in external storage.

Puppet Practices

A centrally managed Puppet service is available to departments within UIT. If you run your own Puppet infrastructure, follow the best practices below.

Use Puppet 5 or later

Puppet Enterprise 3.x was deprecated by Puppet Labs in December 2016; this means that there is also no official support for the free/community version of Puppet 3.x. While many Linux distributions still only provide Puppet 3.x, Puppet Labs provides the latest Puppet releases via their own YUM and APT repos.

Use Modules

All functionality should be bundled into Puppet modules; the actual definitions for individual hosts / systems should be minimal. There is a large community building and publishing modules on the official Puppet Forge. UIT should leverage those shared modules as much as possible.

Use Hiera Data in Modules

There have been ways to use Hiera data in modules in previous versions of Puppet, but Puppet 4 has Hiera support built in. Hiera data in modules can make the code much cleaner, by removing many of the conditional checks for different OS families, versions, architectures, etc.

Every Module Has a Git Repo

Modern Puppet best practices use a unique Git repository for each module, allowing every module to be versioned, and forked, independently. This also allows modules to be used in a masterless Puppet environment, which is very useful for image building.

Use R10K and Librarian-Puppet

Since every module now has a unique version, some mechanism is required to manage the set of module versions used for a group of systems. Puppet already has support for multiple environments (production, dev, test, etc.). R10K is a tool that can manage the modules for a set of environments. Librarian-Puppet is another tool that can be used to download and install a set of modules. The main difference between the two is that Librarian-Puppet will attempt to recursively resolve all module dependencies but does not work with environments, while R10K works with environments but does not recursively resolve all module dependencies. Since both use the same Puppetfile format to define the set of required modules, R10K can be used to manage the environments, then Librarian-Puppet can be used within each environment to ensure that all module dependencies are fulfilled.
