With more than 5 million articles from over 7,000 brands, OTTO is one of the leading German online shopping platforms. In the future, it will open up to even more brands and partners as part of its transformation. OTTO is part of the internationally active Otto Group, with headquarters in Hamburg, and employs 6,100 people throughout Germany. In the 2020/21 financial year, OTTO generated revenues of 4.5 billion euros.
At OTTO, we faced several challenges operating AWS CloudFormation StackSets at scale. We must govern several hundred AWS accounts for our product teams, all while balancing the need for agility and control.
At this scale, operations can take a lot of time, because several operational tasks come up again and again: AWS accounts leave the AWS Organization or teams nuke their AWS account; StackSet instances drift, because not all resources required for compliance can be protected (SCP limitations); existing AWS accounts join the AWS Organization and all mandatory StackSets need to be deployed to them; and manual steps should be reduced to a minimum. Furthermore, the service itself offers no feature to get an overview of drifted instances or of the general health and compliance of your StackSets.
The cloud competence center at OTTO IT, also known as the Governance at Scale (GAS) team, developed a self-healing solution for StackSets that is integrated into the OTTO tooling ecosystem with Confluence and Microsoft Teams.
OTTO worked with globaldatanet to set up its Landing Zone. globaldatanet is an award-winning AWS Advanced Consulting Partner and longtime Cloud Solution Provider for OTTO, supporting the team in cloud security and GAS. With their focus on building cloud-native serverless solutions, they have supported over 100 companies within 5 years in developing and innovating products and services in the cloud.
In this post, we’ll demonstrate how to implement fully automated, enterprise-scale self-healing for StackSets using AWS Step Functions, and how to create a dashboard that gives you an overview of your StackSet health and compliance and reduces operational time.
The solution workflow includes the following steps:
Let’s see how this works.
The following prerequisites are necessary for following along with the contents of this post:
The following architecture diagram shows the complete self-healing StackSets solution.
Architecture of fully-automated Self Healing Solution with integration to Confluence.
The solution requires a JSON file in AWS Systems Manager Parameter Store. The easiest way is to create it automatically based on the StackSet configurations and the tags assigned there. We go into more detail about this in the section "Automatically create StackSets configuration Parameter Store" of this article. In the following, we describe which tags we introduced for our StackSets and what we need them for.
⚠️ AWS tags do not allow commas in values, so we use ":" as the divider for arrays.
| Tag | Value | Description | Example |
|---|---|---|---|
| antidependson | StackSet name | Marks StackSets that collide with each other. | MYSTACKSET |
| dependson | [List of StackSet names] | List of StackSets that need to be rolled out before deploying this StackSet (e.g., enable Config before activating Config rules). NOTE: Please reduce to only one dependson StackSet for now. Form "chains" for multi-dependencies. | MY-STACKSET1:MYSTACKSET2 |
| mandatory | true or false | The StackSet instances must be present on all AWS accounts. | true |
| selfhealing | true or false | The StackSet can be healed via delete and redeploy (exception: e.g., IDP roles). Parameter overrides will be cached. | true |
| region | [Regions] | List of Regions in which the StackSet instances are to be deployed. | eu-west-1:eu-central-1:us-east-1 |
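To make the tag convention concrete, here is a minimal sketch of how the raw tag strings could be turned into typed configuration values. The function name `parse_stackset_tags` and the shape of the resulting dictionary are assumptions for illustration; the actual JSON layout used in Parameter Store is not shown in this post.

```python
from typing import Dict, List, Union

TagValue = Union[bool, str, List[str]]

# Tags from the table above, grouped by how their value is interpreted.
LIST_TAGS = {"dependson", "region"}
BOOL_TAGS = {"mandatory", "selfhealing"}


def parse_stackset_tags(tags: Dict[str, str]) -> Dict[str, TagValue]:
    """Turn raw StackSet tag strings into typed values (hypothetical helper)."""
    config: Dict[str, TagValue] = {}
    for key, raw in tags.items():
        value = raw.strip()
        if key in LIST_TAGS:
            # ":" is the array divider because tag values cannot contain commas.
            config[key] = [item for item in value.split(":") if item]
        elif key in BOOL_TAGS:
            config[key] = value.lower() == "true"
        elif key == "antidependson":
            config[key] = value
        # Any other tag is unrelated to self-healing and is ignored.
    return config


example = parse_stackset_tags({
    "region": "eu-west-1:eu-central-1:us-east-1",
    "mandatory": "true",
    "selfhealing": "false",
    "dependson": "MY-STACKSET1",
})
```

Keeping the parsing in one small function means the ":" convention only has to be known in a single place.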
The automated generation of the StackSet configuration as JSON inside Parameter Store is a multi-purpose utility:
The Lambda function responsible for this task is invoked via an Events rule:
Every time a StackSet operation finishes with the status "succeeded".
This is because the tags on a StackSet are part of the StackSet itself, not additional items describing it. A change to the tags therefore always results in a StackSet update operation.
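As a rough sketch, such a rule could use an event pattern like the one below. The exact rule used at OTTO is not shown in this post; the `detail-type` and `status-details` field names are our reading of the CloudFormation events published to Amazon EventBridge and should be verified against the current documentation before use. The `matches` helper is a hypothetical, heavily simplified matcher (exact values and nested objects only) that lets you try the pattern locally.

```python
from typing import Any, Dict

# Assumed event pattern; verify the field names against the EventBridge docs.
STACKSET_OPERATION_SUCCEEDED: Dict[str, Any] = {
    "source": ["aws.cloudformation"],
    "detail-type": ["CloudFormation StackSet Operation Status Change"],
    "detail": {"status-details": {"status": ["SUCCEEDED"]}},
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified local check: lists hold allowed values, dicts are nested."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.cloudformation",
    "detail-type": "CloudFormation StackSet Operation Status Change",
    "detail": {"action": "UPDATE", "status-details": {"status": "SUCCEEDED"}},
}
print(matches(STACKSET_OPERATION_SUCCEEDED, sample_event))
```

Filtering on the succeeded status in the rule itself keeps the Lambda function from being invoked for operations that are still running or have failed.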
From a computer science perspective, the Lambda function is quite interesting: the primary problem was to build an unweighted tree based on the "dependson" and "antidependson" tags and then compile it into an ordered one-dimensional list, reminiscent of the good old "travelling salesman" problem.
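One standard way to linearize such dependencies is a topological sort. The sketch below uses Python's `graphlib` module (Python 3.9 or later) and is not necessarily how the Lambda function at OTTO is implemented. The function names and the configuration layout are assumptions for illustration, and `antidependson` is only reported as a set of colliding pairs here, not resolved.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set, Tuple


def deployment_order(config: Dict[str, dict]) -> List[str]:
    """Order StackSets so that every 'dependson' entry comes first."""
    graph = {name: set(cfg.get("dependson", [])) for name, cfg in config.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle.
        raise ValueError(f"Circular dependson chain: {err.args[1]}") from err


def colliding_pairs(config: Dict[str, dict]) -> Set[Tuple[str, str]]:
    """Collect StackSet pairs that are marked as colliding via 'antidependson'."""
    pairs: Set[Tuple[str, str]] = set()
    for name, cfg in config.items():
        other = cfg.get("antidependson")
        if other and other in config:
            first, second = sorted((name, other))
            pairs.add((first, second))
    return pairs


stacksets = {
    "CONFIG-RULES": {"dependson": ["ENABLE-CONFIG"]},
    "ENABLE-CONFIG": {},
    "LEGACY-LOGGING": {"antidependson": "ENABLE-CONFIG"},
}
print(deployment_order(stacksets))
print(colliding_pairs(stacksets))
```

A circular `dependson` chain surfaces as a `ValueError`, so a misconfigured tag fails loudly instead of producing an arbitrary order.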
AWS Step Functions is a cloud service that enables you to coordinate the components of distributed applications and microservices using visual workflows. It allows you to build and automate the execution of complex processes and tasks across multiple AWS services, using a visual interface to define and execute your workflows. Since the self-healing solution needs a complex workflow, we decided to use Step Functions for this use case. In the following, we explain the self-healing workflow.
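To illustrate the kind of decision such a workflow has to make for each StackSet, here is a minimal planning sketch. It is not the actual state machine: the function `plan_healing`, the action names, and the input layout are assumptions for illustration. Only the `mandatory`, `selfhealing`, and `region` tags and the `DRIFTED` drift status come from the StackSet configuration and CloudFormation itself.

```python
from typing import Dict, List, Set, Tuple

Instance = Tuple[str, str]  # (account ID, Region)


def plan_healing(
    config: dict,
    deployed: Dict[Instance, str],
    org_accounts: Set[str],
) -> List[Tuple[str, str, str]]:
    """Return (action, account, region) tuples for one StackSet.

    `deployed` maps each existing stack instance to its drift status.
    """
    actions: List[Tuple[str, str, str]] = []
    for (account, region), drift in sorted(deployed.items()):
        if account not in org_accounts:
            # The account left the organization: clean up the orphaned instance.
            actions.append(("remove", account, region))
        elif drift == "DRIFTED" and config.get("selfhealing"):
            # Heal via delete and redeploy, as described for the selfhealing tag.
            actions.append(("redeploy", account, region))
    if config.get("mandatory"):
        for account in sorted(org_accounts):
            for region in config.get("region", []):
                if (account, region) not in deployed:
                    # Mandatory StackSets must exist on every account.
                    actions.append(("deploy", account, region))
    return actions


plan = plan_healing(
    {"mandatory": True, "selfhealing": True, "region": ["eu-west-1"]},
    {("111", "eu-west-1"): "DRIFTED", ("999", "eu-west-1"): "IN_SYNC"},
    {"111", "222"},
)
print(plan)
```

Separating the planning from the execution keeps the decision logic testable without touching any AWS account; each resulting action could then be mapped to one step of the workflow.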
ƛ Serverless Functions
While developing the solution, we ran into several limitations. Here are our findings and how we solved them.
After each execution of the StackSet Health StepFunction, we aim to notify our GAS team about the actions taken during the previous run. Therefore, we have implemented a Teams notification that includes a status update, a link to the generated dashboard, and a link to the log file.
The following screenshot illustrates an example of a Teams notification. It provides a summary report and directs you to the dashboard and log file for further details.
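A notification of this kind could be sent as sketched below, assuming a Teams incoming webhook that accepts the legacy MessageCard format. The function names, the statistics keys, and the URLs are placeholders for illustration and not the values used at OTTO.

```python
import json
import urllib.request
from typing import Dict


def build_teams_card(
    summary: str, stats: Dict[str, int], dashboard_url: str, log_url: str
) -> dict:
    """Build a MessageCard payload with a status summary and two links."""
    facts = [{"name": name, "value": str(value)} for name, value in stats.items()]

    def link(label: str, uri: str) -> dict:
        return {
            "@type": "OpenUri",
            "name": label,
            "targets": [{"os": "default", "uri": uri}],
        }

    return {
        "@type": "MessageCard",
        "@context": "http://schema.org/extensions",
        "summary": summary,
        "title": summary,
        "sections": [{"facts": facts}],
        "potentialAction": [
            link("Open dashboard", dashboard_url),
            link("Open log file", log_url),
        ],
    }


def send_to_teams(webhook_url: str, card: dict) -> int:
    """POST the card to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(card).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


card = build_teams_card(
    "StackSet Health run finished",
    {"Healed instances": 3, "Drifted instances": 1},
    "https://example.com/dashboard.html",  # placeholder URL
    "https://example.com/run.log",  # placeholder URL
)
```

Building the payload separately from sending it makes it easy to check the message content in a unit test without calling the webhook.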
Our StackSet Health Dashboard is a simple HTML file that is generated by a Lambda function, saved in S3, and distributed through CloudFront. You can integrate this dashboard into your Confluence or any other internal wiki. The dashboard is secured via a CloudFront Function. Additionally, you can add a firewall to restrict access to a specific CIDR range or geographic region and prevent access from third parties. The screenshot below provides an example of the overall StackSet health status information for an entire AWS Organization.
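The HTML generation itself can stay very small. The following sketch renders one table row per StackSet; the function name `render_dashboard`, the input layout, and the colors are assumptions for illustration, and the upload to S3 is left out.

```python
from html import escape
from typing import Dict

GREEN, RED = "#2e7d32", "#c62828"


def render_dashboard(health: Dict[str, Dict[str, int]]) -> str:
    """Render instance counts per StackSet as a standalone HTML page."""
    rows = []
    for name in sorted(health):
        in_sync = health[name].get("IN_SYNC", 0)
        drifted = health[name].get("DRIFTED", 0)
        color = RED if drifted else GREEN
        rows.append(
            f"<tr><td>{escape(name)}</td><td>{in_sync}</td>"
            f'<td style="color:{color}">{drifted}</td></tr>'
        )
    return (
        "<!DOCTYPE html><html><head><meta charset=\"utf-8\">"
        "<title>StackSet Health</title></head><body><table>"
        "<tr><th>StackSet</th><th>In sync</th><th>Drifted</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )


page = render_dashboard({"ENABLE-CONFIG": {"IN_SYNC": 120, "DRIFTED": 2}})
```

Escaping the StackSet names with `html.escape` prevents unexpected characters in a name from breaking the generated page.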
In this post, we demonstrated a solution to automatically heal AWS CloudFormation StackSets at scale. By implementing this solution, we reduced the manual effort for StackSet cleanup operations by 4 hours per week, improved the overall reliability of our StackSets, increased compliance across the organisation, and gained a daily updated overview of all StackSet instances through the dashboards. In summary, the self-healing CloudFormation StackSets solution combines automation, monitoring, and self-recovery capabilities to deliver a robust and resilient system for StackSets.