AWS Fargate for Data Engineering

How we moved our data pipelines from AWS Lambda to AWS Fargate without reengineering our development and deployment workflows

By Riccardo Bassetto

December 5, 2019

As 2019 comes to an end, we are reflecting on our progress in moving away from legacy systems and building a modern Data Platform that is flexible, scalable, and reliable. The goal is to reduce friction for data scientists and analysts and to give them more freedom to use all kinds of data in their daily work. It also seemed like a good opportunity to start a blog and write about our experiences.

Following this article, we will regularly publish data stories from Babbel.

TL;DR: we are happy to share the Terraform module we built to make migrating tasks from AWS Lambda to AWS Fargate easy.

Background

The centerpiece of our legacy system is a MySQL data warehouse (AWS Aurora) that contains most of the processing logic, expressed in SQL as stored procedures. As almost the whole company depends on the data we deliver, we had to find a way to migrate to a modern Data Platform built around a proper data lake, in our case built on Snowflake, without affecting everyday business operations. The first step was to extract the logic locked in those procedures so that we could monitor and maintain it more easily.

Since then, AWS Lambda has been the core component of our infrastructure. It enabled us to migrate legacy pipelines to simpler, more elastic components and to quickly build new parts of our platform. Introducing a clear DevOps strategy alongside it was also a fundamental part of the process.

Today we orchestrate hundreds of Lambda functions performing all kinds of ETL jobs, from fetching data from external services to data cleansing and processing. Some of these lambdas react to CloudWatch Events, while others are part of more complex workflows built on top of AWS Step Functions.

We package all our lambdas with AWS Chalice, following this code structure:

.
├── .chalice/              // chalice configuration and policies
├── chalicelib/            // lambda functions logic
├── terraform/             // aws infrastructure
├── tests/                 // unit and integration tests
├── Makefile               // automations
├── app.py                 // lambda functions handlers
├── requirements-test.txt  // python packages required for tests
└── requirements.txt       // python packages required by lambdas

Challenges with AWS Lambda

The biggest limitation of AWS Lambda in our case is the runtime limit. When AWS raised the maximum runtime to 15 minutes in October 2018, it bought us some time to keep fighting technical debt without investing too much energy in designing a new, more complex infrastructure for our pipelines.
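To illustrate how the runtime limit shapes handler code, here is a minimal sketch of a batch loop that stops before the Lambda timeout. `get_remaining_time_in_millis()` is the method Lambda exposes on the context object; `process_batches`, `SAFETY_MARGIN_MS`, and `FakeContext` are hypothetical names used only for illustration, not code from our pipelines.

```python
"""Sketch: stop processing before the Lambda timeout is reached."""

SAFETY_MARGIN_MS = 10_000  # assumed buffer for cleanup; tune per job


def process_batches(batches, context, handle=lambda batch: None):
    """Process batches until done or until time runs short.

    Returns the number of batches processed; the caller has to
    reschedule whatever is left over.
    """
    done = 0
    for batch in batches:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # not enough time left for another batch
        handle(batch)
        done += 1
    return done


class FakeContext:
    """Stand-in for the Lambda context, for local experiments only."""

    def __init__(self, remaining_ms, cost_per_call_ms):
        self.remaining_ms = remaining_ms
        self.cost_per_call_ms = cost_per_call_ms

    def get_remaining_time_in_millis(self):
        # Simulate time passing: each check "costs" a fixed amount.
        current = self.remaining_ms
        self.remaining_ms -= self.cost_per_call_ms
        return current
```

Every job that can outgrow the limit needs this kind of bookkeeping plus a way to resume the leftover work, which is part of what makes long-running pipelines awkward on Lambda.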

We currently process around 70M events per day, and as Babbel is a fast-growing company, the quantity of data that our platform ingests and processes is continuously increasing. As a result, the memory and runtime demands of our lambdas keep increasing as well. This growth also created a need for new data services that require more computational resources.

Rethinking our development process was not an option. Our team's productivity is mostly a consequence of the clean code structure that we embraced over time, which also cuts the time required for code reviews.

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

~ Systemantics: How Systems Really Work and How They Fail — John Gall ~

Scheduling Jobs on AWS Fargate

Two main drivers led us to AWS Fargate: we could keep our development cycle, and we could use the same code structure as in our lambdas. Thus, we were able to remain very productive while adopting a new technology, and to keep using the tools that we developed to foster fast and unified development of lambda functions.

We started by writing Terraform modules for managing ECS clusters across multiple stages and for deploying single-purpose Fargate tasks triggered by CloudWatch Events. The aim of the initiative was to build a small framework for porting our lambdas to Fargate while limiting the changes to our code structure and, where possible, sharing code between lambdas and the new tasks.
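A minimal sketch of how one piece of logic could serve both a lambda and a Fargate task. All names here (`run_job`, `lambda_handler`, `main`, the `JOB_DATE` variable) are hypothetical; in a structure like ours, the shared function would live in chalicelib/ and the Lambda handler in app.py.

```python
"""Sketch: one job function, two entry points (Lambda and container)."""
import os


def run_job(job_date):
    """Shared logic; in practice this would live in chalicelib/."""
    return {"job_date": job_date, "status": "ok"}


def lambda_handler(event, context):
    """Lambda entry point: parameters arrive in the event payload."""
    return run_job(event["job_date"])


def main(environ=os.environ):
    """Container entry point: parameters arrive as environment variables.

    The container's command would invoke this function (assumption).
    """
    result = run_job(environ["JOB_DATE"])
    print(result)
    return 0  # exit code for the container process
```

Keeping the entry points thin means the same tests cover the logic regardless of where it runs.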

In particular, the module we are presenting here manages the entire lifecycle of a task, including the permissions granted to it.

module "fargate_task" {
  source                  = "babbel/ecs-fargate-scheduled-task/aws"
  region                  = "eu-west-1"
  task_name               = "fargate-task"
  schedule_expression     = "rate(30 minutes)"
  ecs_cluster_arn         = "my-cluster"
  task_execution_role_arn = "arn:aws:iam::123456789012:role/my-task-execution-role"
  task_definition         = "task-definition.json"
  task_policy             = "policy.json"
  vpc_subnets             = ["subnet-123456789abcdefgh", "subnet-abcdefgh123456789"]
  vpc_security_groups     = ["sg-123456789abcdefgh"]
}

It’s written in Terraform 0.12 and comes with a complete example that sets up an entire stack of additional services needed for a successful deployment (VPC, Subnets, NAT Gateway, ECS Cluster, Task Execution IAM Role).

This module creates the following resources:

CloudWatch Events IAM Role: role giving permissions to the CloudWatch Events service to run Amazon ECS tasks on your behalf and to pass a task execution role and a task role to a scheduled Fargate task [Docs];
CloudWatch Events Rule: rule containing the schedule expression (cron or rate);
CloudWatch Events Target: target processing the event, in this case, an ECS Fargate task;
Task IAM Role: role assumed by the ECS Fargate task, with the policies attached that the Docker container needs to make AWS API requests [Docs];
ECS Task Definition: a collection of parameters needed by ECS to run a Docker container;
CloudWatch Log Group: a group of log streams coming from the ECS Fargate task sharing the same retention, monitoring, and access control settings.
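To make the relationship between the rule, the target, and the task more concrete, here is a rough sketch of such a target, expressed as the dictionary shape that the CloudWatch Events `PutTargets` API accepts. The module manages this through Terraform, not Python; the function name `build_fargate_target`, the target id, and the choice to disable public IPs (assuming private subnets behind a NAT gateway) are assumptions for illustration.

```python
"""Sketch: the shape of a CloudWatch Events target that runs a Fargate task."""


def build_fargate_target(cluster_arn, events_role_arn, task_definition_arn,
                         subnets, security_groups):
    """Return one entry for the `Targets` list of the PutTargets API."""
    return {
        "Id": "fargate-task",        # placeholder target id
        "Arn": cluster_arn,          # the target ARN is the ECS cluster
        "RoleArn": events_role_arn,  # the CloudWatch Events IAM role
        "EcsParameters": {
            "TaskDefinitionArn": task_definition_arn,
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": list(subnets),
                    "SecurityGroups": list(security_groups),
                    # Assumption: tasks run in private subnets.
                    "AssignPublicIp": "DISABLED",
                }
            },
        },
    }
```

The rule carries only the schedule; everything needed to start the task (cluster, task definition, networking, and the role allowed to launch it) sits on the target.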

Task definitions and IAM policies are defined in JSON, just as we configure our lambdas through Chalice's config.json and policy.json files. Adding a new task to an ECS cluster is as simple as pushing the Docker image to Amazon ECR, writing the JSON configurations mentioned above, and adding a new module block to the infrastructure code.
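As an illustration of the two JSON files, the following sketch builds the contents of a minimal container definition and a minimal IAM policy. It assumes that task-definition.json holds a list of ECS container definitions and that the task only needs read access to one S3 bucket; the image URI, bucket name, and log group are placeholders, so check the module's documentation for the exact format it expects.

```python
"""Sketch: build minimal task-definition.json and policy.json contents."""
import json

# Assumed format: a list of ECS container definitions.
container_definitions = [
    {
        "name": "fargate-task",
        # Placeholder image URI in Amazon ECR.
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/fargate-task:latest",
        "essential": True,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "fargate-task",  # placeholder log group
                "awslogs-region": "eu-west-1",
                "awslogs-stream-prefix": "ecs",
            },
        },
    }
]

# Permissions for the task role (placeholder bucket).
task_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        }
    ],
}

task_definition_json = json.dumps(container_definitions, indent=2)
policy_json = json.dumps(task_policy, indent=2)
```

The two resulting strings are what would be saved as task-definition.json and policy.json next to the module block.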

The module is available through the official Terraform Registry.

Conclusion

Running Docker containers on AWS Fargate gives us more flexibility when writing tasks, and we worry less about runtime and memory when dealing with data operations. Overall, it is a convenient transition from AWS Lambda in the context of migrating pipelines, and with the help of this Terraform module we are saving time and effort while doing it.

This post was co-edited by Dženan Softić (Team Lead Data Engineering). Thanks also to Andrzej Neumann (Team Lead Data Engineering) and Mehdi Talbi (Data Engineer) for their support during the project implementation and release.
