A developer journey through the pros and cons of AWS Lambda functions, SQS, DynamoDB, API Gateway, and CloudFormation (with SAM templates), and the real power of Lambda Powertools and Pydantic.

Introduction

Have you ever had to manage big data, such as system logs? No worries - AWS SAM and DynamoDB are here to help you. That is exactly what prompted this article - we had system logs stored in a MySQL server, and one of our clients complained that the logs listing page in the administration panel took too long to load (there were about 1 million logs). MySQL wasn't the right place to store this data, and we had to find a better solution, so we decided to separate the logs from the main application.

Big data can be handled in many different ways - serverless or with a dedicated server that processes the requests. But here (just like in every other task) the real question is "What exactly do we need?". This question is crucial because it can push us in the right direction. There is a lot going on behind the scenes - we don't want unnecessary services to maintain, because they would make the task and the application too complex (Elasticsearch, for example). At the same time, we want to cover all of the requested functionality - searching, sorting, filtering, writing, etc.

Structure

Used resources:

  1. DynamoDB - for storing the logs
  2. Lambda functions - for reading and writing logs
  3. Simple Queue Service - for triggering the "logs writing" lambda function; two queues - LogsQueue is the main queue, DeadLetterQueue is the "failed jobs" queue
  4. API Gateway - for triggering the "logs reading" lambda function
  5. CloudFormation (via the Serverless Application Model templates) - for creating and managing all resources in the stack

The main application creates a job that contains the log information in the SQS Queue (LogsQueue). Then, the LogsQueue triggers a lambda function that writes the logs to the database (a DynamoDB table called LogsTable). As for the logs listing and filtering - there is an API Gateway with two available routes: "/" and "/{uuid}" associated with two lambda functions respectively - LogsReader (for retrieving and filtering all logs) and LogReader (for retrieving only a specific log using its UUID). These two functions read directly from the database. The whole structure is illustrated below.

Serverless App Structure

SQS Queues

The Logs serverless application uses two queues - one main queue (LogsQueue) and one dead-letter queue (DeadLetterQueue). The main queue is the bridge that connects the main application and the lambda function for storing the logs in the database.
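
To make this more concrete, here is a minimal sketch of what the producer side could look like - the queue URL environment variable and the exact fields sent are assumptions for illustration, not the main application's actual code:

import json
import os

import boto3

sqs = boto3.client('sqs')


def push_log(account_id: int, user_id: int, log_type: str, sub_type: str, url: str) -> None:
    """Send one log entry to LogsQueue as a JSON message body."""
    message = {
        'account_id': account_id,
        'user_id': user_id,
        'type': log_type,
        'sub_type': sub_type,
        'url': url,
    }
    sqs.send_message(
        QueueUrl=os.environ['LOGS_QUEUE_URL'],  # assumed environment variable
        MessageBody=json.dumps(message),
    )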

Sometimes messages cannot be processed - because of incorrect conditions in the producer or consumer application, for example, or an unanticipated state change that interferes with your application's code. After a configurable number of failed processing attempts (the maximum receive count in the queue's redrive policy), these messages are forwarded to the dead-letter queue, from where you can run the processing again. Dead-letter queues are also helpful for debugging your application, because they let you isolate unconsumed messages and figure out why their processing fails.
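
In this project the queues and their redrive relationship are defined in the SAM template, but the same setup can be sketched with boto3; the queue names and the retry count of five below are assumptions for illustration:

import json

import boto3

sqs = boto3.client('sqs')

# Assumed queue names - in the real stack both queues are created by the SAM template
logs_queue_url = sqs.get_queue_url(QueueName='LogsQueue')['QueueUrl']
dlq_url = sqs.get_queue_url(QueueName='DeadLetterQueue')['QueueUrl']
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=['QueueArn']
)['Attributes']['QueueArn']

# After 5 failed receives, a message is moved to the dead-letter queue
sqs.set_queue_attributes(
    QueueUrl=logs_queue_url,
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': '5',
        })
    },
)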

Lambda Functions

With the compute service Lambda, you can run code without setting up or maintaining servers. Additionally, you can utilise layers to organise your app and, even better, reuse these layers in other lambda functions. By doing this, you can segregate the core code components (helpers, DB/storage connections, Pydantic models, etc.) and reuse them throughout all lambda functions in the application without having to duplicate the code.

I've created one base layer (UtilsLayer) where I've put the DynamoDB connection and pagination clients (from boto3), the base models that the functions will inherit, helper functions, etc. Pydantic and Lambda Powertools played a big role here.
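
As a rough sketch (the module layout, names, and environment variable are assumptions), the shared layer could expose the boto3 clients like this:

import os

import boto3

# Shared clients, created once per Lambda container and imported from the layer
dynamodb = boto3.resource('dynamodb')
logs_table = dynamodb.Table(os.environ.get('LOGS_TABLE', 'LogsTable'))  # assumed environment variable

# Low-level paginator, handy for Query calls that span multiple pages
query_paginator = boto3.client('dynamodb').get_paginator('query')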

Lambda Powertools and Pydantic - the best combination for Lambda function data handling

This is an awesome collection of Python tools for AWS Lambda functions that makes it easier to implement best practices like tracing, structured logging, validation, event parsing, and many more. The automatic event parsing is just fantastic - nice syntax, quick validation, and data management. You can check how it works here. Lambda Powertools can be used as a Python package or directly as a layer in the application. Another very powerful aspect of this library is that it supports Pydantic. This raises the level of the application structure dramatically - you can implement the whole validation with just one decorator (@event_parser). Feel free to check their documentation, it is worth it.

Here is an example of how Lambda Powertools and Pydantic are used together in the Logs Serverless Application.

You can see the base LogModel with all of its fields declared. It is located in the Utils Layer since all functions will use it.

from datetime import datetime
from typing import Optional
from uuid import UUID, uuid4

from aws_lambda_powertools.utilities.parser import BaseModel, Field

class LogModel(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    account_id: int = Field(gt=-1)
    user_id: int = Field(gt=-1)
    type: str = Field(min_length=1)
    sub_type: str = Field(min_length=1)
    url: str = Field(min_length=1)
    payload: Optional[str] = Field(min_length=1)

    submitter_id: Optional[str]
    submitter_country: Optional[str]
    submitter_city: Optional[str]
    submitter_platform: Optional[str]
    submitter_browser: Optional[str]
    submitter_agent: Optional[str]

    created_at: datetime = Field(default_factory=datetime.now)

    def to_dict(self, *args, **kwargs):
        data = self.dict(*args, **kwargs)
        data['id'] = data['id'].hex

        # account_id and user_id are currently BigInt columns in MySQL;
        # convert them to strings to allow future DB structure updates
        if isinstance(data['account_id'], int):
            data['account_id'] = str(data['account_id'])

        if isinstance(data['user_id'], int):
            data['user_id'] = str(data['user_id'])

        data['created_at'] = data['created_at'].isoformat()

        # Internal fields for the GSIs
        data['account_id#type'] = f'{data["account_id"]}#{data["type"]}'
        data['status'] = 'OK'

        return data

And this is the LogsWriterFunction input validation model. It looks simple, doesn't it?

from aws_lambda_powertools.utilities.parser.models import SqsModel, SqsRecordModel
from aws_lambda_powertools.utilities.parser.types import Json
from typing import List
from models import LogModel


class Params(SqsRecordModel):
    # The SQS message body arrives as a JSON string;
    # Json[...] tells the parser to decode it into a LogModel
    body: Json[LogModel]


class WriteLogModel(SqsModel):
    Records: List[Params]

And now comes the best part - the handler method (LogsWriterFunction). The whole complex validation logic happens behind the scenes. The code is shorter, simpler, and nicely structured.

from aws_lambda_powertools.utilities.parser import event_parser
from aws_lambda_powertools.utilities.typing import LambdaContext

# WriteLogModel is the input model defined above; save_log is a helper from the Utils Layer

@event_parser(model=WriteLogModel)
def lambda_handler(event: WriteLogModel, context: LambdaContext):
    for record in event.Records:
        save_log(record.body)

    return {"statusCode": 200}

DynamoDB

Probably the most complex and hard-to-research part was the logs filtering. Unless you add Elasticsearch as an additional service on top of DynamoDB, this database doesn't offer many options for searching (or at least, not many efficient ones). Yes, you can search using the Scan operation instead of Query, but it's slow and not recommended for large amounts of data. The real power of DynamoDB is how it separates storage into partitions - it's ideal for storing big data.

In this project, we benefit from a feature called "Global Secondary Index", or "GSI" for short. It saved us from creating a new Elasticsearch instance, which would have been more expensive and would have required maintenance. These indexes are a powerful tool for handling "not too complex" filtering cases.

Each global secondary index requires a partition key and an optional sort key, and the index key schema can differ from the base table schema. A table with a simple primary key can have a global secondary index with a composite primary key, and vice versa. Every GSI maintains an internal copy of the main table, keyed by the requested partition and sort key fields. This way, you can search and filter (by the sort key) very fast.

GSIs

  • AccountIndex - used for filtering by account
    • Partition key: account_id
    • Sort key: created_at
  • TypeIndex - used for filtering by type
    • Partition key: type
    • Sort key: created_at
  • AccountTypeIndex - used for filtering by account and type simultaneously
    • Partition key: account_id#type
    • Sort key: created_at
  • SortingIndex - used for sorting all available logs; used in the "all logs" API endpoint. The Scan operation cannot sort the logs because they live in different partitions, so this sort key is the only way to "cheat" and sort them. This approach has a lot of cons, but it's the only way to sort the data. Because all the data ends up in one partition, it's recommended to use it with a "limit" and pagination (see the query sketch after this list).
    • Partition key: status (it's set to "OK" for all records so all records are located in the same partition)
    • Sort key: created_at
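
To make the GSIs more concrete, here is a minimal sketch of how the LogsReader function could query the AccountIndex with a limit and pagination; the function and parameter names are assumptions:

from typing import Optional

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('LogsTable')


def logs_for_account(account_id: str, limit: int = 25, start_key: Optional[dict] = None) -> dict:
    """Return one page of logs for an account, newest first."""
    kwargs = {
        'IndexName': 'AccountIndex',
        'KeyConditionExpression': Key('account_id').eq(account_id),
        'ScanIndexForward': False,  # descending order on the created_at sort key
        'Limit': limit,
    }
    if start_key:
        kwargs['ExclusiveStartKey'] = start_key  # cursor returned with the previous page

    response = table.query(**kwargs)
    return {
        'items': response['Items'],
        'next_key': response.get('LastEvaluatedKey'),  # pass back as start_key for the next page
    }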

Other fields

  • id - UUID V4
  • account_id - string
  • user_id - string
  • type - string
  • sub_type - string
  • url - string
  • payload - string/json
  • submitter_id - string, optional
  • submitter_country - string, optional
  • submitter_city - string, optional
  • submitter_platform - string, optional
  • submitter_browser - string, optional
  • submitter_agent - string, optional
  • created_at - string, ISO 8601

AWS SAM - A Cloudformation Templates Translator

The AWS Serverless Application Model (SAM) is an open-source framework for developing serverless apps. It offers a straightforward syntax for defining functions, APIs, databases, and mappings of event sources. You can define and model the application you want using YAML with just a few lines per resource. You can create serverless applications more quickly since SAM expands and translates the SAM syntax into AWS CloudFormation syntax during deployment.

The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application's build environment and API. Using SAM CLI you can also easily deploy your application.

The best part is that the SAM templates are reusable - if you put them into a completely new account and deploy them using SAM CLI, everything will be set up after a few minutes.

The Logs Serverless Application uses SAM for creating the resources and their connections, for deployment and testing, and for most of the DevOps-related tasks in the project.

I hope this article gave you an overall idea of how powerful serverless applications are. Combined with tools like Lambda Powertools and Pydantic, they can lead to amazing results and solutions.

Happy Coding!