Nuclear protocols & policies for your AWS Production environment.

Nuclear bombs have launch codes, protocols, and launch policies. Your AWS production environment should have the same.


Transition to a “no-review. no-deployment.” culture.

Although this is more of a security and operations topic, we’ll apply DevOps methodology here. In other words, we’ll start with the culture part of the equation rather than tackling technologies or processes first. It is common sense to review every input and commit, but that common sense falls into a gray area when the request comes with special circumstances…

Critical patches and updates? The priority does not change the requirements, so a review must still take place. A request from higher up? This one is a bit touchy. If a higher-up wants to bypass the requirements, that person will have to either take it up with the next person up the ladder or put the request in writing, on the record.

Patches and fixes need to be reviewed regardless of urgency or who is requesting them. If it’s not reviewed, it’s not deployed.

Restrict API calls on mission-critical environments.

Regardless of how many company policies and memos you send, you’ll come to understand that there is always a chance somebody will do the exact opposite. It will most likely be an accident, and that’s OK because you have been planning for it!

Just as the president is the only one able to make the decision (he can’t launch the bomb himself, but he can start the chain of events leading to its detonation), only a chosen few should be allowed to make API calls that would disrupt the production environment. Until a better solution comes along, I suggest locking down the production environment by leveraging the AWS tagging system and IAM as follows…

  1. Add an environment tag (key: value) to all resources
  2. Create Production and Non-Production user groups
  3. Leverage the IAM StringEquals or StringNotEquals condition operators to restrict API calls based on environment tags
  4. Create a ProductionAdmins role for production management
  5. Use role-assumption policies so that only the Production group can assume ProductionAdmins
  6. Add service-level API blocks by introducing the aws:CalledVia condition key

With the above, we can grant the chosen few the right to assume a role that can make those disruptive API calls on our precious production environment, while blocking both “direct” and “indirect” interaction by unauthorized users.
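As a rough sketch of steps 3, 5, and 6, the policy documents could be generated like this. The tag key `Environment`, the value `Production`, the EC2 action list, the statement IDs, and the account ID are all assumptions made for illustration; only the group and role names come from the steps above.

```python
import json

# Illustrative assumptions: tag key/value, action list, and statement IDs.
ENV_TAG_KEY = "Environment"
PROD_TAG_VALUE = "Production"
DISRUPTIVE_ACTIONS = [
    "ec2:TerminateInstances",
    "ec2:StopInstances",
    "ec2:RebootInstances",
]


def deny_on_production_tag():
    """Step 3: deny disruptive calls on resources tagged as production."""
    return {
        "Sid": "DenyDisruptiveCallsOnProduction",
        "Effect": "Deny",
        "Action": DISRUPTIVE_ACTIONS,
        "Resource": "*",
        "Condition": {
            "StringEquals": {f"aws:ResourceTag/{ENV_TAG_KEY}": PROD_TAG_VALUE}
        },
    }


def deny_when_called_via(service="cloudformation.amazonaws.com"):
    """Step 6: deny the same calls when a service makes them on the user's behalf."""
    return {
        "Sid": "DenyIndirectDisruptiveCalls",
        "Effect": "Deny",
        "Action": DISRUPTIVE_ACTIONS,
        "Resource": "*",
        # aws:CalledVia is a multivalued key, so it needs a set operator.
        "Condition": {"ForAnyValue:StringEquals": {"aws:CalledVia": [service]}},
    }


def allow_assume_production_admins(account_id):
    """Step 5: let the Production group assume the ProductionAdmins role."""
    return {
        "Sid": "AllowAssumeProductionAdmins",
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": f"arn:aws:iam::{account_id}:role/ProductionAdmins",
    }


def policy(*statements):
    """Wrap statements in an IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


if __name__ == "__main__":
    # "123456789012" is a placeholder account ID.
    non_production_group = policy(deny_on_production_tag(), deny_when_called_via())
    production_group = policy(allow_assume_production_admins("123456789012"))
    print(json.dumps(non_production_group, indent=2))
    print(json.dumps(production_group, indent=2))
```

Note that the aws:CalledVia statement, as written here, is not scoped to the production tag; putting both conditions in a single statement would narrow it to production resources only. Which actions count as "disruptive" is a decision for your own team.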

Standardization is OK. Enforced Standardization is even better.

This one is one of my favorites because…

  1. We can’t apply the previous suggestions unless we standardize how our resources are tagged.
  2. Expecting everybody to provide the same specific values each time is, believe it or not, asking too much!
  3. Again, it’s all about the culture.

Fortunately, there are several ways to go about standardization. You can use AWS Service Catalog for predefined tags, custom Lambda scripts, or third-party tools such as Cloud Custodian, among others.
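Whichever tool you pick, the core check is the same. Here is a minimal sketch of the kind of validation a custom Lambda script might run; the required keys (`Environment`, `Owner`, `CostCenter`) and the allowed values are hypothetical and should be replaced with whatever your team agrees on.

```python
# Hypothetical tagging standard: key -> set of allowed values,
# or None when any non-empty value is acceptable.
REQUIRED_TAGS = {
    "Environment": {"Production", "Staging", "Development"},
    "Owner": None,
    "CostCenter": None,
}


def tags_to_dict(tag_list):
    """Convert the AWS-style [{"Key": ..., "Value": ...}] list into a dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_list}


def find_tag_violations(tags, required=REQUIRED_TAGS):
    """Return a list of human-readable problems; an empty list means compliant."""
    violations = []
    for key, allowed in required.items():
        value = tags.get(key)
        if not value:
            violations.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            violations.append(f"invalid value for {key}: {value!r}")
    return violations


if __name__ == "__main__":
    resource_tags = tags_to_dict([
        {"Key": "Environment", "Value": "prod"},
        {"Key": "Owner", "Value": "james"},
    ])
    for problem in find_tag_violations(resource_tags):
        print(problem)
```

A check like this can start in report-only mode, which fits the softer approach to the culture shift, and be switched to blocking once the team has settled on the standard.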

Keep in mind that you can’t go in guns blazing and declare that only 10 tags may be applied to resources and the rest can go down the drain. Again, this is a culture shift, and you may want to start the conversation with a much softer approach. But yeah, we’d love to get rid of all the clutter and all those non-compliant tag values!

BONUS: this will help a lot with cost control and budgeting for your cloud infrastructure.

Alarm & notification as warning and confirmation.

Ah yes! This is the holy grail of operations and a constant source of conflict. Hopefully, you or your company have already implemented a system for confirming and notifying on API calls. If something is happening in the production environment, we want to know!

This can easily be implemented with the following formulas…

  1. New resources? CloudWatch + CloudTrail + S3 + Lambda + SNS
  2. Resource state change? CloudWatch + SNS topic -OR- AWS Config
  3. All the above? AWS Config + SNS -OR- CloudTrail + CloudWatch

The above can be supplemented with custom Lambda scripts that make sure all members of the Production group are subscribed to the SNS topic mentioned above. Having both the team and the person making the API calls get notified of what’s going on serves as…

  1. Blast radius containment. If a disruptive API call is made, you want to make sure it is not repeated back to back.
  2. Intrusion detection / hacked account awareness. If James is on vacation, there’s no way James is terminating servers in production at the same time.
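To make the first formula a bit more concrete, here is a hedged sketch of the Lambda piece: it turns a CloudTrail API-call event (as delivered through CloudWatch Events) into an SNS notification. The `TOPIC_ARN` environment variable and the `[PRODUCTION]` subject prefix are assumptions for illustration, and the event fields used are the standard CloudTrail record fields.

```python
import os


def build_notification(event):
    """Build the SNS subject and message from a CloudTrail API-call event."""
    detail = event.get("detail", {})
    who = detail.get("userIdentity", {}).get("arn", "unknown principal")
    action = detail.get("eventName", "UnknownAction")
    source = detail.get("eventSource", "unknown service")
    region = detail.get("awsRegion", "unknown region")
    when = detail.get("eventTime", "unknown time")

    # Keep the subject short: SNS requires it to be under 100 characters.
    subject = f"[PRODUCTION] {action} by {who.split('/')[-1]}"[:99]
    message = (
        f"API call: {action}\n"
        f"Service:  {source}\n"
        f"Caller:   {who}\n"
        f"Region:   {region}\n"
        f"Time:     {when}"
    )
    return {"Subject": subject, "Message": message}


def lambda_handler(event, context):
    """Lambda entry point: publish the notification to the team's SNS topic."""
    import boto3  # imported here so build_notification works without boto3

    # TOPIC_ARN is a hypothetical environment variable set on the function.
    boto3.client("sns").publish(
        TopicArn=os.environ["TOPIC_ARN"], **build_notification(event)
    )


if __name__ == "__main__":
    sample = {
        "detail": {
            "eventName": "TerminateInstances",
            "eventSource": "ec2.amazonaws.com",
            "awsRegion": "us-east-1",
            "eventTime": "2024-01-15T03:12:45Z",
            "userIdentity": {"arn": "arn:aws:iam::123456789012:user/james"},
        }
    }
    print(build_notification(sample)["Subject"])
```

Putting the caller in the subject line is deliberate: it is what lets the team spot, at a glance, that "James" is terminating servers while James is on vacation.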

Enforce deployment curfews.

Strangely enough, this is one of those issues nobody thinks about. All the policies and processes we just mentioned would be of no use if somebody tries to detonate a bomb in the wee hours of the morning. Therefore, a preset curfew is needed to restrict the Production group’s ability to assume the role. We can do that by…

  1. Using policy conditions with DateGreaterThan or DateLessThan
  2. Or using a mix of CloudWatch + CloudTrail + S3 + Lambda
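A minimal sketch of the first option follows. IAM date conditions compare aws:CurrentTime against absolute timestamps rather than a recurring time of day, so this sketch assumes the trust policy on the ProductionAdmins role is regenerated for each deployment window (for example, by a scheduled job). The 09:00 to 17:00 UTC window and the account ID are made-up values.

```python
import json
from datetime import date, datetime, time, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def curfew_trust_policy(account_id, day, opens_at=time(9, 0), closes_at=time(17, 0)):
    """Trust policy that only allows assuming the role inside one day's window (UTC)."""
    opens = datetime.combine(day, opens_at, tzinfo=timezone.utc)
    closes = datetime.combine(day, closes_at, tzinfo=timezone.utc)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                # Both conditions must hold: after opening AND before closing.
                "Condition": {
                    "DateGreaterThan": {"aws:CurrentTime": opens.strftime(TIMESTAMP_FORMAT)},
                    "DateLessThan": {"aws:CurrentTime": closes.strftime(TIMESTAMP_FORMAT)},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID; generate the window for today's date.
    document = curfew_trust_policy("123456789012", datetime.now(timezone.utc).date())
    print(json.dumps(document, indent=2))
```

Outside the window, neither condition block matches, so the role simply cannot be assumed. Sessions that were already issued stay valid until they expire, so keeping the role's maximum session duration short is worth considering.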

You can go a bit further and limit where these bombs are detonated from by using the IpAddress condition. I think that’s too much, especially if we want to allow our admins to work remotely.

In conclusion…

Just as a nuclear bomb follows a set of protocols and policies to prevent a catastrophe, we should make sure our environments have their own protocols and policies. It’s no fun to triage issues caused by small mistakes that should have been caught by the processes mentioned above.

AWS Engineer and DevOps dude. Keep it simple and to the point!