How to do more with fewer servers

Do we increase the maximum number of server instances that can be allocated or do we optimize the current setup? We chose the latter.

By Leszek Zalewski

May 27, 2020

March 2020 won’t be easily forgotten. As schools, offices, restaurants and more began to close all around the world due to the COVID-19 pandemic, it seemed like “normal” life was coming to a halt.

While we at Babbel were swapping our daily office routines for makeshift desks in our living rooms and bedrooms, we suddenly began to see a drastic increase in traffic on both our web and mobile applications.

This exhausted our auto scaling pool of server resources, which meant that we had to make a decision: do we increase the maximum number of server instances that can be allocated, or do we optimize the current setup? We chose the latter.

Find out how tuning auto scaling alarms and switching to Puma threads resulted in a 2 × faster application running on only one third of the servers.

Traffic almost doubling since beginning of March

Overview

In this post we first provide some background on the application, its infrastructure and its auto scaling rules. We then walk through the optimization plan and each of its phases and steps. At the end we summarize the improvements across 3 key performance indicators: resource utilization, budget and application performance.

Table of contents:

Background
Plan
Phase 1: Throttle Scaling
Phase 2: Web Server
Phase 3: Revised Auto Scaling
Summary

Background

This post is about one of our core applications which handles the account and session management using Ruby on Rails.

Originally we ran it on AWS OpsWorks with a fixed number of servers. Some time ago we migrated it to run within Docker on Amazon Elastic Container Service (Amazon ECS for short) with AWS Fargate instances. This enabled us to use auto scaling, whose configuration we hadn’t changed since.

Application

Each instance:

had 1 vCPU and 2GB of memory
was running a Puma application server in clustered mode with 6 processes and disabled threaded mode – 1 thread per process (since the service was never prepared for thread-safety)

All of them were running within one ECS Service behind an AWS Application Load Balancer (ALB for short), using the default round-robin routing.

Auto scaling

AWS auto scaling is composed of 3 elements:

Target defines what to scale and the min/max boundaries of the size being adjusted
Policy defines how much the size of the target should change, and how often it can be triggered
CloudWatch Alarm acts as a trigger for a policy whenever a defined threshold on a given metric is reached

We had the target configured to run a minimum of 15 and a maximum of 32 instances for the given application, with the following mix of policies and alarms:

Slow scale up: change size by +1
1. alarm on maximum service CPU being over 60% within two consecutive checks, each check spanning 5 minutes
2. alarm on ALB p99 latency above 1 second within one check across 1 minute
Fast scale up: change size by +3
1. alarm on ALB p99 latency above 3 seconds within one check across 1 minute
Slow scale down
1. alarm on average service CPU being below 15% within two consecutive checks, each check spanning 5 minutes
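The interplay of these policies can be sketched in a few lines of Ruby. This is a simplified model, not the actual AWS configuration: the method and constant names are made up for illustration, the scale down step of -1 is an assumption (the step size is not stated above), and each metric is treated as if it had already breached its threshold for the required number of checks.

```ruby
# Hypothetical model of the original setup; names are illustrative, not an AWS API.
MIN_INSTANCES = 15
MAX_INSTANCES = 32

# Returns the desired instance count after one evaluation round.
# max_cpu and avg_cpu are percentages, p99 is the ALB latency in seconds.
def desired_count(current, max_cpu:, avg_cpu:, p99:)
  change =
    if p99 > 3.0
      3    # fast scale up
    elsif p99 > 1.0 || max_cpu > 60
      1    # slow scale up
    elsif avg_cpu < 15
      -1   # slow scale down (step size is an assumption)
    else
      0
    end

  # The target's min/max boundaries always win over the policy.
  (current + change).clamp(MIN_INSTANCES, MAX_INSTANCES)
end

puts desired_count(20, max_cpu: 30, avg_cpu: 20, p99: 3.5) # fast scale up
puts desired_count(31, max_cpu: 30, avg_cpu: 20, p99: 3.5) # capped at the maximum
puts desired_count(15, max_cpu: 20, avg_cpu: 10, p99: 0.2) # held at the minimum
```

Note how a single latency reading above 3 seconds is enough to add 3 instances, while removing them requires average CPU to fall below 15%.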

The combination of the above caused a fast scale up early in the morning, and a scale down only in the middle of the night. Since the traffic increase, we had been running at the maximum of 32 instances for a longer part of each day.

Because of this we had no more room to breathe in case traffic kept growing, which prompted us to revise the above alarms and policies.

Inefficient auto scaling alarms, causing maximum number of instances to run for majority of the day

Budget

The other effect of increased traffic and constantly running at the maximum number of instances was a higher cost of running the application, which almost doubled compared to the baseline from the months before.

Cost of ECS infrastructure has almost doubled as well

Optimizations

What made us think we could squeeze more out of the existing configuration was:

Resource underutilization: average CPU utilization was between 15% and 20%
On average, 75% of the time spent in the application was spent waiting on IO, and 90% for the most requested endpoint
Some 3rd party APIs can be very slow (above 1 second), causing latency spikes that triggered auto scaling
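To get a feeling for what those IO numbers imply, here is a back-of-the-envelope calculation. It uses a common rule of thumb, not a measurement from this project: if a request spends a fraction of its time waiting on IO, roughly 1 / (1 - fraction) concurrent requests are needed to keep one CPU core busy. The method name is hypothetical, and the estimate ignores scheduling overhead and lock contention.

```ruby
# Rule-of-thumb estimate (an assumption, not a measured value): how many
# concurrent requests keep one core busy for a given share of IO wait.
def threads_to_saturate_core(io_fraction)
  raise ArgumentError, "io_fraction must be in [0, 1)" unless (0...1).cover?(io_fraction)

  (1.0 / (1.0 - io_fraction)).round
end

puts threads_to_saturate_core(0.75) # average endpoint: 4
puts threads_to_saturate_core(0.90) # most requested endpoint: 10
```

Under this estimate, a process handling one request at a time leaves most of its CPU share idle while it waits.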

Plan

To avoid increasing maximum capacity, we decided to revise our auto scaling and application setup. To achieve better utilization and scaling we planned three phases:

Throttle scaling: scale up slower and scale down faster, by being more resilient to single latency spikes.
Web server: because the app is IO-heavy, test Puma with threads.
Revise auto scaling: utilize a single instance as much as possible before performance degrades.

Charts glossary

All the charts focus on 3 key areas:

total number of instances running, indicating how auto scaling is working
average CPU utilization across the application
p95 latency on ALB, multiplied by 10 to increase visibility on the chart, to show when performance degrades

Phase 1: Throttle scaling

First attempt

At that point a single latency spike could trigger booting 3 new instances at once. To avoid this, we increased the number of consecutive checks that had to be above the latency threshold from 1 to 3 before either of the scale up policies was triggered.
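The effect of requiring several consecutive checks can be sketched as follows. This is a simplified stand-in for how an alarm evaluates its datapoints, with a hypothetical method name and made-up latency values; real CloudWatch alarms offer more options than modeled here.

```ruby
# True when the most recent `periods` datapoints are all above `threshold`.
def alarm?(datapoints, threshold:, periods:)
  return false if datapoints.size < periods

  datapoints.last(periods).all? { |value| value > threshold }
end

single_spike = [0.4, 0.5, 3.6] # latency per 1-minute check, in seconds (made up)
sustained    = [3.2, 3.5, 3.6]

puts alarm?(single_spike, threshold: 3.0, periods: 1) # true: one spike is enough
puts alarm?(single_spike, threshold: 3.0, periods: 3) # false: the spike is ignored
puts alarm?(sustained, threshold: 3.0, periods: 3)    # true: the slowdown persists
```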

For scale down we increased the CPU threshold from 15% to 20% and reduced a single check time from 5 minutes down to 2 minutes. This should help scale down sooner.

The results were more or less as expected:

we didn’t scale up to full 32 instances at the beginning of the day
we were able to scale down during the first half of the day
and then scale down a little bit sooner after the main traffic went away
no visible change in p95 latency

First small improvements in auto scaling, first scale downs during the day

Second attempt

While checking latency graphs we noticed that p99 was spiking a lot, and was neither stable nor representative. There also wasn’t much we could fix, as the vast majority of the spikes were caused by less frequent calls to OAuth providers.

Because of that we decided to switch the alarms from p99 latency to p95 latency. p95 was far more stable, so when it is high, the system really is slow for too many people. With this we also adjusted the thresholds: from 3 to 2 seconds for the fast scale up policy, and from 1 to 0.7 seconds for the slow scale up policy.
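A small sketch shows why p99 reacts to a handful of slow calls while p95 does not. The latency values are made up for illustration, and the nearest-rank method used here is only one of several ways to compute a percentile; it is not necessarily the method the ALB metrics use.

```ruby
# Nearest-rank percentile: the smallest value such that at least pct percent
# of the samples are less than or equal to it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# 100 made-up requests: 97 fast ones and 3 slow calls to an external provider.
latencies = Array.new(97, 0.2) + Array.new(3, 4.0)

puts percentile(latencies, 95) # 0.2, unaffected by the 3 slow calls
puts percentile(latencies, 99) # 4.0, dominated by them
```

With these numbers an alarm on p99 above 3 seconds would fire even though 97% of requests were fast.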

As we did not observe any performance regression from the last change, we decided to also:

increase the CPU threshold again from 20% to 25% for scale down policy
reduce the minimum number of instances from 15 to 8, as during the night we were way below 20% CPU

Results:

way better scaling: from 8 to 26 instances, instead of 15 to 32 instances
a bit better CPU utilization
a bit of latency regression, but not significant enough

Second improvements brought more variability through the day, also it doesn't run at maximum 32 instances anymore

Third attempt

With the last two improvements we had a bit more room to think about the CPU patterns we were seeing so far:

quite low average utilization
constantly spiking maximum peaks

CPU utilization showed that some instances were constantly doing more work than others, which were waiting for IO. Given the amount of IO this application performs, this suggested that round-robin request routing on the ALB wasn’t optimal. Instances still processing long-running IO requests could be assigned even more requests, causing high latency spikes.

It turned out that on November 25th, 2019, AWS had released a new routing algorithm for ALB, called “least outstanding requests”:

“With this algorithm, as the new request comes in, the load balancer will send it to the target with least number of outstanding requests. Targets processing long-standing requests or having lower processing capabilities aren’t burdened with more requests and the load is evenly spread across targets. This also helps the new targets to effectively take load off of overloaded targets.” (AWS / What’s New)
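The difference between the two routing strategies can be illustrated with a toy model. This is not how the ALB is implemented; the target names, in-flight counts and method names are all made up, and the model only tracks how many requests each target currently has outstanding.

```ruby
# Round robin: the next target in order, regardless of how busy it is.
def dispatch_round_robin(targets, requests)
  requests.times.map { |i| targets[i % targets.size] }
end

# Least outstanding requests: always the target with the fewest in-flight requests.
def dispatch_least_outstanding(outstanding, requests)
  counts = outstanding.dup
  requests.times.map do
    target = counts.min_by { |_name, in_flight| in_flight }.first
    counts[target] += 1
    target
  end
end

# Made-up snapshot: "a" is stuck on slow IO calls, "b" is idle.
in_flight = { "a" => 5, "b" => 0, "c" => 3 }

p dispatch_round_robin(in_flight.keys, 3)    # ["a", "b", "c"]
p dispatch_least_outstanding(in_flight, 3)   # ["b", "b", "b"]
```

Round robin hands the already overloaded target "a" yet another request, while least outstanding requests steers new work to the idle target until the load evens out.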

We had been planning to use it earlier, but it wasn’t available in the AWS Terraform provider until March 6th, 2020. Just in time.

After enabling the new algorithm, latency improved but auto scaling got worse. As it turned out, the timing of our deployment was unfortunate: we deployed during one of the attacks we were handling at that time.

This made us rethink the slow scale up alarm based on maximum CPU. We changed it to average CPU being higher than 40%, instead of maximum CPU above 50%, a condition that was now being met even more often.

The last thing we adjusted was the minimum number of instances: we increased it from 8 to 10, as auto scaling was behaving unstably.

This yielded good results:

p95 got a little bit better than before employing the new ALB algorithm
even fewer instances running to support the traffic
better CPU utilization, although we were still scaling on latency spikes rather than on CPU utilization

The last point was a perfect indication that we should move on to web server configuration.

Maximum CPU spikes caused auto scaling to run at max instances again; after switching the alarm to average CPU, we got better CPU utilization and stable auto scaling

Phase 2: Web server

Background

When we switched to ECS we also switched from Apache with Passenger to Puma. We weren’t running with multiple threads because, given the age of the application, we weren’t sure whether it was thread safe.

So now we had 3 options:

Increase number of processes in Puma cluster
Switch from processes to threads with Puma
Use a combination of both

We decided to try threaded mode in Puma first, for 2 reasons:

We were using an instance with a single CPU, so it didn’t make much sense to increase the process count
We wanted to test app for thread safety anyway 😊
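Why threads are attractive for an IO-heavy application, even on a single CPU, can be shown with a minimal experiment. Here `sleep` is a stand-in for waiting on a database or a 3rd party API; the method names and the request count are made up, and the timings will vary from machine to machine.

```ruby
# Stand-in for one request that spends its time waiting on IO.
def fake_io_request
  sleep 0.05
end

# Measures how long the given block takes, in seconds.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

REQUESTS = 8

# One request at a time: the waits add up.
sequential = elapsed { REQUESTS.times { fake_io_request } }

# One thread per request: the waits overlap.
threaded = elapsed do
  REQUESTS.times.map { Thread.new { fake_io_request } }.each(&:join)
end

puts format("sequential: %.2fs, threaded: %.2fs", sequential, threaded)
```

While one thread waits, Ruby can run another, so the total time approaches that of a single request instead of the sum of all of them. CPU-bound work would not benefit in the same way.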

Action

To make sure the accounts application was thread safe we:

Reviewed the Rack middlewares in use for anything suspicious
Switched from using a Redis connection directly to using a connection pool
Configured Puma to run in threaded mode only, using the default maximum of 16 threads instead of 6 processes
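The idea behind the connection pool step can be sketched with Ruby's standard library alone. This is not the code used in the accounts application: `TinyPool` and `FakeConnection` are hypothetical names, the fake connection does no network IO, and in a real application a maintained library (for example the connection_pool gem) would be the more likely choice.

```ruby
# Minimal fixed-size pool built on Queue, which is thread-safe.
class TinyPool
  attr_reader :connections

  def initialize(size:, &factory)
    @connections = Array.new(size) { factory.call }
    @available = Queue.new
    @connections.each { |connection| @available << connection }
  end

  # Checks a connection out, yields it, and always returns it to the pool.
  def with
    connection = @available.pop # blocks while every connection is in use
    yield connection
  ensure
    @available << connection if connection
  end
end

# Stand-in for a Redis client; it only counts calls and is not thread safe itself.
class FakeConnection
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def incr
    @calls += 1
  end
end

pool = TinyPool.new(size: 4) { FakeConnection.new }
threads = 16.times.map do
  Thread.new { 100.times { pool.with { |connection| connection.incr } } }
end
threads.each(&:join)

puts pool.connections.sum(&:calls) # 1600
```

Because each connection is handed to only one thread at a time, 16 threads can share 4 connections without stepping on each other.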

With integration tests passing in the staging environment and the application showing no unexpected errors for a few days, we decided to do a test run in production.

The results were more than satisfying:

huge drop in latency across the board: p99, p95, p90, p75
slightly fewer instances supporting the same traffic
no real change in average CPU utilization
stabilized maximum CPU, with fewer spikes

After this we also tested a combination of 3 processes in cluster mode with 8 threads each, but saw no improvement over plain threaded mode. This led us to conclude that we could safely move on to the last phase.

The accounts application is exposed as a gem to other services. The majority of these services already use Datadog Application Performance Monitoring with distributed tracing to see how the various layers behave. This allowed us to see the impact on the gem side alone, and thus on all the clients using it.

Latency on the gem dropped across the board

Phase 3: Revised auto scaling

As the application was now able to handle far more traffic, we decided to make better use of the CPU, mainly by increasing the auto scaling alarm thresholds:

for scale up, we increased average CPU threshold from 40% to 55%
for scale down, we also increased average CPU from 25% to 40%

This yielded good results: we were running with a maximum of 8 instances and a minimum of 5. The only downside was higher p95 latency, so we decreased the CPU thresholds by 5%. We also reduced the minimum number of instances to 2, so that if traffic returns to the February baseline, even fewer instances will be used to support it.

Adjusted rules made the application better utilize the CPU and run on fewer instances

Summary

The initial work on optimizations brought some required headroom for scaling and better cost efficiency.

Switching to threaded mode improved and stabilized performance, which in turn contributed to the final cost reduction, mainly because the application spends most of its time on IO operations.

Auto Scaling & Resources Utilization ✔️

At the beginning we were running 15 to 32 instances; now we run 4 to 11, which is roughly 3 × fewer.

Comparison of instances running for a week, compared with a month before and difference between them

This means we had basically been overprovisioning and underutilizing resources. It looks better now, with CPU utilization on average higher by 25%.

Comparison of CPU utilization for a week, compared with a month before and difference between them

Budget ✔️

At the moment, with traffic still higher than at the beginning of March, the ECS cost is:

4 × lower compared to cost from before optimizations
2.4 × lower compared to cost from before traffic increase

Performance ✔️

Our goal for performance was not to make it worse. The outcome went two ways:

For the application clients it looks way better, as we reduced the latency on ALB:

p95 is on average faster by 250ms (259ms instead of 513ms)

Comparison of weekly p95 latency on Application Load Balancer, compared with a month before and difference between them

p99 is on average faster by 400ms (416ms instead of 823ms)

Comparison of weekly p99 latency on Application Load Balancer, compared with a month before and difference between them

Within the application itself, p95 latency has worsened by around 32%, or ~60ms.

Comparison of weekly p95 latency in application alone, compared with a month before and difference between them

The improvement on the ALB more than compensates for the regression at the application level, as in the end the application is 2 × faster for its clients.
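The “2 ×” and “3 ×” figures can be checked against the numbers quoted above with a few lines of arithmetic. The method name is made up; the inputs are the latency and instance counts from this summary.

```ruby
# Ratio of a "before" value to an "after" value, rounded to 2 decimals.
def ratio(before, after)
  (before.to_f / after).round(2)
end

puts ratio(513, 259) # p95 on ALB: 1.98 times faster (254ms saved)
puts ratio(823, 416) # p99 on ALB: 1.98 times faster (407ms saved)
puts ratio(32, 11)   # peak instances: 2.91 times fewer
puts ratio(15, 4)    # minimum instances: 3.75 times fewer
```

Both latency percentiles land just under a factor of 2, and the instance counts bracket the factor of 3 mentioned earlier.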

It could possibly be improved further by combining Puma’s threaded mode with clustered mode, for example by using 3 processes with 8 or 16 threads each on an instance with 2 CPUs.

