The evolution of our staging environment
Today, we will be talking about the staging environment at Babbel and how we recently improved it. As a reader of our tech blog, there’s a good chance that you are already familiar with the concept of a staging environment. I will nevertheless start with a brief definition, so that we share a common understanding before going into the details of how to secure one. Bear with me.
A brief introduction to staging environments
A staging environment is one or more servers or services onto which you deploy your software before you roll it out to your production environment. The staging environment should resemble your production environment as closely as possible. The purpose of deploying to staging is to make your releases more robust through pre-release testing in this environment. Its place in the delivery process is shown in the following illustration:
The first step is the usual one: you, as a developer, work on a feature. Once it’s finished, in most common setups you push this feature to a build server (like TravisCI or Jenkins), which runs your automated tests and linter checks, and in some cases produces a build artifact, e.g. a compiled binary or a Docker image. Then, to make sure it works properly, you deploy it (either automatically or manually, depending on how far along you are on your way to Continuous Delivery) to the staging environment. That is where manual testing happens, done either by QA or by yourself, depending on your company’s processes. Only after this step has passed do you feel free to deliver your feature to your customers.
You can read more about the staging environment concept on Wikipedia.
Staging environment protection
There are several reasons why you want to protect staging from external access: you don’t want to expose half-baked features (this is why you have staging in the first place), and duplicate content may hurt your brand. There are different ways to approach this, ranging from the simplest, like basic auth, to more comprehensive ones, like the one we use at Babbel.
Many years ago we had a setup where the staging environment was protected only by HTTP basic auth. You know, the one that asks you for a username and password in a standard browser dialog.
In the ApplicationController of our Ruby on Rails application, we just needed to insert one line:
http_basic_authenticate_with name: "babbel", password: "secret"
It’s a wonderful solution at first, but at some point it stops being satisfactory. For instance, you might want to give different internal users different passwords. Or you’re operating a microservice/serverless architecture that just doesn’t have a single entry point anymore.
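To make the "different passwords for different users" point concrete, here is a minimal sketch in plain Ruby of what a per-user basic auth check could look like. It is not what we ran in production: the user table, the helper names and the passwords are all made up for illustration, and real credentials would live in a secret store rather than in code.

```ruby
require "digest"

# Hypothetical per-user credentials (assumption for this sketch only).
STAGING_USERS = {
  "alice" => "correct horse",
  "bob"   => "battery staple"
}.freeze

# Compare two strings without leaking where they differ:
# hash both to a fixed length, then XOR every byte pair.
def secure_equal?(a, b)
  da = Digest::SHA256.digest(a).bytes
  db = Digest::SHA256.digest(b).bytes
  da.zip(db).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
end

# Returns the user name if the Authorization header carries valid
# "Basic base64(user:password)" credentials, otherwise nil.
def authenticated_user(authorization_header)
  scheme, encoded = authorization_header.to_s.split(" ", 2)
  return nil unless scheme&.casecmp?("Basic") && encoded

  # unpack1("m") is a lenient base64 decode from the Ruby core library.
  user, password = encoded.unpack1("m").split(":", 2)
  return nil unless user && password

  expected = STAGING_USERS[user]
  # Run the comparison even for unknown users so both paths do similar work.
  valid = secure_equal?(password, expected || "")
  expected && valid ? user : nil
end
```

In a Rails controller, the same idea would typically go through `authenticate_or_request_with_http_basic`, which yields the user name and password to a block. Even then, every service needs its own copy of this check, which is exactly the problem with architectures that have no single entry point.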
Staging environment at Babbel before
For a long time at Babbel we used a setup where the staging environment was simply inaccessible from outside our Amazon Virtual Private Cloud, VPC (for simplicity, let’s define it as a “cloud corporate network”). There were a couple of security additions to that, but this was the main idea.
However, we ran into an issue with mobile testing. For that, we needed our staging environment to be publicly accessible at the network level but still locked down somehow. We wanted to make it accessible to a mobile device farm, as we don’t have all the required device/OS combinations in-house and virtualized solutions are not always suitable. There was one more restriction: this device farm cannot use VPN tunnels to access our private VPC. So we had to come up with a better approach, one that opens our staging environment only to those who should have access.
Let’s start thinking about how it can be done. Extremely simplified, our architecture looks like this:
Basically, we have two VPCs: one is accessible from the outside, which is our production one, and the other is accessible only from within the VPC itself (or from anywhere if you’re connected via a VPN tunnel), which is our staging one. Inside each subnet, we have a load balancer (AWS ELB) and a set of EC2 machines attached to it. Obviously, since Babbel has a big, multilayered infrastructure, we have more than one VPC, more than two subnets, and not one load balancer but quite a few, because we have way more than one service. There are also API Gateways, Lambda functions, and other tooling. However, they don’t matter much for this story.
Staging environment at Babbel after
This is our new workflow for the staging environment.
1. A request comes to the CDN. The CDN checks whether the request contains a signed cookie (explained below). If it does, skip to step 5.
2. The user has to go to a service that we call the “Passport service”.
3. At this service, the user can log in using GSuite credentials.*
4. On success, the user is redirected back to the CDN…
5. … and the request can now go to the backend service or S3.
* The GSuite part is not ready yet; currently we’re using shared-secret authentication for this part.
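The workflow above boils down to a single decision at the edge, which can be sketched in a few lines of Ruby. This is only a model of the flow, not code that runs anywhere in our stack: CloudFront performs the real check itself, so the `signature_valid` callable stands in for it, and the Passport URL, the `return_to` parameter and the idea of sending the user to Passport via a redirect are assumptions made for the example.

```ruby
require "uri"

# Cookie names CloudFront uses for signed cookies with a custom policy.
SIGNED_COOKIE_NAMES = %w[
  CloudFront-Policy CloudFront-Signature CloudFront-Key-Pair-Id
].freeze

# Hypothetical login URL of the Passport service (assumption).
PASSPORT_URL = "https://passport.example.com/login"

# Decide what happens to a request that reaches the staging CDN.
# cookies:         Hash of cookie name => value sent by the client
# signature_valid: callable standing in for CloudFront's own verification
def handle_staging_request(cookies, requested_url, signature_valid:)
  has_all = SIGNED_COOKIE_NAMES.all? { |name| cookies.key?(name) }

  if has_all && signature_valid.call(cookies)
    # Step 5: the request may go on to the backend service or S3.
    { action: :forward_to_origin, url: requested_url }
  else
    # Steps 2 to 4: log in at Passport first, then come back.
    return_to = URI.encode_www_form_component(requested_url)
    { action: :redirect_to_passport, location: "#{PASSPORT_URL}?return_to=#{return_to}" }
  end
end
```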
This is what the infrastructure for this approach looks like:
Ooooh, looks insecure, doesn’t it? Both our production and staging load balancers are publicly accessible, as are both CDNs. As our initial task was to protect them, you are probably wondering how this is achieved. The trick is the signed cookie mechanism of Amazon CloudFront. Its purpose is basically to make the CDN accessible only to requests that carry a cookie signed by a particular trusted service. As a side note, to add this signed cookie mechanism we didn’t need to change either our applications or even the load balancers. Please keep in mind that this is not something that can be implemented with any CDN; it is a specific feature of CloudFront.
The user first has to log in to the “Passport service”. It uses SAML authentication based on GSuite single sign-on, which means that we can control at the GSuite organisation level who should have access to the staging environment. The “Passport service” issues a signed cookie (actually, three cookies), which allows the client to access the CDN. If the signed cookie is valid, the CDN passes the request to the downstream service, e.g. the load balancer.
However, there is one caveat. For the CDN to work, our ELB (load balancer) has to be publicly accessible and have a DNS record. This means that even if we protect our CDN, the ELB is still open, and anyone who knows its DNS name can access it.
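For the curious, here is a rough sketch of how a service like Passport could produce those three cookies using nothing but Ruby’s standard library. It assumes the custom-policy variant of CloudFront signed cookies (`CloudFront-Policy`, `CloudFront-Signature`, `CloudFront-Key-Pair-Id`); whether our Passport service uses that variant or a canned policy is not covered here. The function names, resource URL and key pair ID are illustrative, not taken from our code.

```ruby
require "openssl"
require "json"

# CloudFront expects base64 with "+", "=" and "/" replaced
# by "-", "_" and "~" so the value is safe inside a cookie.
def cloudfront_base64(bytes)
  [bytes].pack("m0").tr("+=/", "-_~")
end

# Build the three signed cookies for a custom policy.
# resource:    URL pattern the cookies grant access to
# expires_at:  Time after which the cookies stop being valid
# key_pair_id: ID of the public key registered with CloudFront
# private_key: OpenSSL::PKey::RSA holding the matching private key
def signed_cookies(resource:, expires_at:, key_pair_id:, private_key:)
  # JSON.generate emits compact JSON, i.e. without extra whitespace.
  policy = JSON.generate(
    "Statement" => [{
      "Resource"  => resource,
      "Condition" => { "DateLessThan" => { "AWS:EpochTime" => expires_at.to_i } }
    }]
  )
  # CloudFront verifies an RSA signature over the policy, using SHA-1.
  signature = private_key.sign(OpenSSL::Digest.new("SHA1"), policy)

  {
    "CloudFront-Policy"      => cloudfront_base64(policy),
    "CloudFront-Signature"   => cloudfront_base64(signature),
    "CloudFront-Key-Pair-Id" => key_pair_id
  }
end
```

The service would then set these as cookies on a domain shared with the CDN, e.g. with a call like `signed_cookies(resource: "https://staging.example.com/*", expires_at: Time.now + 3600, key_pair_id: "KEXAMPLE", private_key: key)`, where every argument value is a placeholder.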
Fortunately, this is easily solved by another mechanism. The CDN itself can add a custom header to the request, and that’s what we leveraged. When a request goes through the CDN, it acquires this new header, whose value is a shared secret, and our backend server checks it. This is not optimal, because it requires logic that is not app-specific to be implemented in the app. However, it is just a temporary step until we switch to the relatively new load balancers from AWS, ALB, which support the signed cookie check by themselves.
So, if you access our CDN without the signed cookie, you will not get through. If you then try to access the ELB directly, bypassing the CDN, you will get blocked by the shared-secret header check. Yay, our staging is public and at the same time secured!
Thank you for reading. We would appreciate it if you told us how your company approaches staging environment architecture.
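As an illustration of that backend check, here is a minimal Rack-style middleware written in plain Ruby. It is a sketch under assumptions rather than our actual implementation: the header name `X-Origin-Secret` (seen by Rack as `HTTP_X_ORIGIN_SECRET`), the class name and the 403 response are all invented for the example.

```ruby
require "digest"

# Rejects every request that does not carry the secret header
# that the CDN is configured to add on its way to the origin.
class OriginSecretCheck
  # Rack exposes the request header "X-Origin-Secret" under this key.
  # The header name is hypothetical.
  HEADER = "HTTP_X_ORIGIN_SECRET"

  def initialize(app, secret:)
    raise ArgumentError, "secret must not be empty" if secret.to_s.empty?

    @app = app
    @secret_digest = Digest::SHA256.digest(secret)
  end

  def call(env)
    presented = Digest::SHA256.digest(env[HEADER].to_s)
    if constant_time_equal?(presented, @secret_digest)
      @app.call(env) # request came through the CDN, let it pass
    else
      [403, { "content-type" => "text/plain" }, ["Forbidden\n"]]
    end
  end

  private

  # Both inputs are fixed-length digests, so XOR-ing byte pairs
  # compares them without an early exit on the first mismatch.
  def constant_time_equal?(a, b)
    a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
  end
end
```

In a Rack or Rails app this would be mounted with something like `use OriginSecretCheck, secret: ENV.fetch("ORIGIN_SECRET")`, where the environment variable name is again just a placeholder.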
Photo by Science in HD on Unsplash