Cloud Service Template Compiler in Python

Object Model and Architecture

Asher Sterkin
Python in Plain English

--

Context and Problem Statement

As argued in another article, an order-of-magnitude improvement in software development productivity is required to realize the full potential of Serverless Cloud technology. Existing tool chains are conceptually locked into the 50-year-old Unix model and are completely inadequate for the hyper-fast yet bug-free development process required nowadays.

In particular, we are looking for a way to completely liberate application developers from any DevSecOps concerns by pushing all boilerplate script generation to automatic tools. More specifically, we would like a tool that automatically generates all cloud platform-specific deployment scripts from pure application code. While the general principles remain the same for any cloud platform and any programming language, to make things tangible within the scope of this article we will limit ourselves to the AWS Cloud and the Python programming language. The initial problem statement, therefore, could be illustrated as follows:

Fig 1: Python Service Template Compiler Problem Statement

In this article, we will provide a general overview of the selected Python Service Model and its internal implementation mechanics. The Service Template Compiler solution is built on top of the Serverless Cloud Importer, described in more detail in another article. Both technologies are cornerstone ingredients of the Cloud AI Operating System (CAIOS) project, an initiative currently run by BST LABS (the advanced engineering arm of BlackSwan Technologies). For additional details about the CAIOS project, see the original position paper.

Acknowledgements

Mordehay Kontorer and Piotr Orzeszek from BlackSwan Technologies’ BST LABS took an active part in developing the first version of this component. Scott Lichtman provided invaluable feedback while reviewing the initial draft of this paper.

Decision Drivers

Ideally, we should support automatic service packaging from pure Python code without any mention whatsoever of the underlying cloud platform or even communication protocol details, thus helping application developers fully concentrate on domain logic.

Here, we would like to make an even bolder assertion: if application developers need to deal with low-level IT details of the underlying cloud platform and/or communication protocol, less attention will be paid to proper domain modeling.

Development speed is the second decision-driving factor. Simply put, slow development speed kills innovation. If it takes too long and involves too many people to launch the initial version of a service, company management might be reluctant to approve any significant changes going forward. This is the main reason why so many services and whole systems stagnate after initial delivery. On the other hand, if developing a service is truly affordable, nobody will be too concerned about replacing it with a new and improved version, or about running development of multiple versions in parallel (concurrent engineering).

Within CAIOS, we set as a strategic goal ensuring that developing the initial, ready-for-integration version of a simple CRUD REST service should take minutes; developing a mid- or high-complexity REST or WebSockets service, a few hours. The same goal holds for developing a non-trivial, long-running workflow. We are after ten-fold productivity improvements.

Cost control is the final decision-making factor; let’s reflect on this for a moment. Developing an initial proof of concept must be really inexpensive, not only in terms of development effort but also in terms of underlying cloud resource consumption. For a POC, nobody really needs multi-region redundancy, data encryption, or network security. What we need is to quickly explore the core application and domain logic to validate whether we have the right solution for the problem at hand. However, when we are done, this POC version can only take the project so far: nobody will approve its deployment to a production environment. And the last thing we would want is to start rewriting the service to add production-hardening adornments.

The same logic applies to porting a service from one cloud platform to another. Production environment adjustments must be treated similarly to what happens with normal compilers: changing the target platform and/or optimization modes occurs through compilation switches, not code changes. In our case, the handling of environment adjustments will be split between service packaging and deployment.

Industry Landscape

When analyzing available Service Template Generator solutions for AWS, it is easier to start with those that come from AWS itself before moving to those provided by 3rd-party open-source or commercial projects. Without pretending to be complete, here is the picture we have assembled so far:

Fig 2: AWS Serverless Ecosystem

There are quite a few 3rd-party libraries and tools that aim to provide either more convenient or more portable serverless cloud application development support:

Fig 3: 3rd-Party Serverless Frameworks

A detailed analysis of the pros and cons of every solution would be a fascinating yet lengthy competitive market analysis; this might be a subject for another paper. Here, suffice it to say that both AWS and 3rd-party solutions can be split into two categories, based on feature/benefit:

  1. Make it easier to write AWS CloudFormation templates by either providing YAML macros (e.g., SAM) or using a conventional programming language (e.g., CDK). If we treat the AWS CloudFormation template structure as machine-level language, these solutions could be categorized as macro-assemblers: they make things a bit easier without raising the level of abstraction.
  2. Generate AWS CloudFormation templates from a programming language (e.g., AWS Chalice or Zappa) without providing a complete solution, thus leaving substantial boilerplate configuration to be prepared using the same low-level YAML or JSON. They also normally leave too many communication protocol details to be encoded in the form of, say, Python function decorators.

In contrast, with the CAIOS Service Template Compiler, we want to generate an underlying cloud platform configuration file(s) completely from a pure high-level programming language (e.g. Python) code with minimal, if any, communication protocol detail mentioned.

Service Template Model

While developing the CAIOS Service Template Model, we evaluated two basic architectural choices:

  • Automatic conversion based on a naming convention inspired by the Ruby on Rails’ “convention over configuration” doctrine
  • Python class and method decorators, following common practice

We chose the first option, using a naming convention, since it delivers clear API call semantics while leaving open the possibility of enforcing industry best practices, security, and cost control policies.

The second question was whether we should treat a service as a:

  • Python class
  • Python package (presumably containing multiple Lambda functions)

Each option has advantages and disadvantages. After experimentation, we decided to adopt the first option (defining every service as a Python class) for the following reasons:

  • It more naturally reflects the service deploy/shut down life cycle through service class instance (objects)
  • It more naturally reflects service template/service instance relationships through service class/service class instances (objects)
  • Internally, it needs to be converted into a package/module-per-Lambda-function structure, thus addressing the second option, which could be explicitly supported should the need arise
  • A Python class more naturally reflects service internal resources (e.g. storage) allocations and access
  • It more naturally reflects service parameters and external dependencies

We, therefore, will treat the last class definition in the service specification module as the service class and will treat each of its methods as a computation unit specification using the following naming convention:

  • __init__(self, …): service initialization with external parameters
  • _<function_name>(self, …): simple Lambda function to be invoked from a StepFunctions workflow, resource event trigger or directly (for testing purposes)
  • on_<function_name>(self, url, connection_id, …): Lambda function serving an external WebSockets API call
  • <verb>_<entity_name>(self, …): Lambda function serving an external REST API call (see comments on the HTTP API naming convention below)
  • async _<function_name>(self, …): internal StepFunctions workflow to be started by some internal API call, event trigger, or externally (for testing purposes); could be specified to run forever
  • async on_<function_name>(self, …): StepFunctions workflow to be automatically invoked by an external WebSockets API call
  • async <verb>_<entity>(self, …): StepFunctions workflow to be automatically invoked by an external REST API call
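To make the convention concrete, here is a hypothetical sketch of a service class following it. All names and the in-memory storage are illustrative assumptions for this article, not actual CAIOS code:

```python
# Hypothetical CAIOS-style service class; each method maps to a cloud
# function according to the naming convention above.
class OrderService:
    def __init__(self, table_name: str):
        # Service initialization with external parameters
        self._table_name = table_name
        self._orders = {}  # stands in for a cloud storage resource

    def _validate(self, order: dict) -> bool:
        # Plain Lambda function: invoked from a workflow or directly in tests
        return "item" in order and order.get("quantity", 0) > 0

    def create_order(self, order: dict) -> dict:
        # REST convention: <verb>_<entity> maps to HTTP POST /orders
        if not self._validate(order):
            raise ValueError("invalid order")
        order_id = str(len(self._orders) + 1)
        self._orders[order_id] = order
        return {"order_id": order_id}

    def get_order(self, order_id: str) -> dict:
        # REST convention: maps to HTTP GET /orders/{order_id}
        return self._orders[order_id]
```

Locally, this behaves as a normal Python class, which is what enables the local-testing workflow discussed later in this article.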

For the HTTP API, we try to stick with the REST API naming convention, using a common sense-based extended vocabulary. Following REST API guidelines, we implement resource creation via the HTTP POST method, resource retrieval via the HTTP GET method, etc. To make the Python code look more natural, however, we allow some flexibility in verb naming, so that it is possible to define register_person (automatically converted to HTTP POST /people) and unregister_person (automatically converted to HTTP DELETE /people/{person_id}) functions rather than the more intimidating create_person and delete_person. A more detailed description of the HTTP API naming convention and translation process will be presented in a separate article.
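The verb-to-HTTP translation could be sketched roughly as follows. The vocabulary table and the naive pluralization here are our illustrative assumptions, not the actual CAIOS mapping:

```python
# Illustrative extended verb vocabulary; the real CAIOS mapping table is not
# published, so these entries are assumptions.
VERB_TO_HTTP = {
    "create": "POST", "register": "POST", "add": "POST",
    "get": "GET", "list": "GET", "fetch": "GET",
    "update": "PUT", "modify": "PUT",
    "delete": "DELETE", "unregister": "DELETE", "remove": "DELETE",
}

def method_to_route(name: str) -> tuple:
    """Translate a method name like 'register_person' into (HTTP method, resource path)."""
    verb, _, entity = name.partition("_")
    http_method = VERB_TO_HTTP[verb]
    # Naive pluralization: the article's example maps 'person' to '/people',
    # which needs an inflection dictionary; appending 's' keeps the sketch short.
    resource = f"/{entity}s"
    # Item-level operations address a single resource instance by id
    if http_method in ("GET", "PUT", "DELETE"):
        resource += f"/{{{entity}_id}}"
    return http_method, resource
```

For example, `method_to_route("unregister_person")` yields a DELETE route addressing a single entity by id.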

Productivity Benefits and Focus Shift

The proposed service model provides an order-of-magnitude productivity boost. What previously took days and weeks, especially for people without a strong background in cloud technologies, now takes hours or minutes (we measured it). This is especially true for CRUD-like REST services, where an initial code skeleton can be generated automatically from just a list of entities (or a company-wide template).

Now, the main challenge and focus will shift from infrastructure heavy lifting to proper domain modeling. What kind of entities do we want to reflect in the system? What are their relationships? What operations does the service have to support? These questions will never be easy to answer, and no automation will help. All we can hope for is to eliminate all infrastructure scaffolding concerns, making the main problem clearly visible and securing the attention it actually deserves.

Internal Function Invocations

There is no free lunch… at least not in software. The ease of use and high productivity potential of the CAIOS Service Template Model creates the (partial) illusion that the service class is a normal Python class. In some senses, it is. For example, the same service code can be run locally for testing purposes prior to uploading it to the cloud. That saves a lot of time at the initial stages of development.

However, one needs to keep in mind that service class methods are going to be converted into Lambda functions and StepFunctions StateMachines. The most important implication is that when one service class method calls another, it means invoking a Lambda function or starting a StepFunctions StateMachine execution. The service class would therefore seldom be the right place for keeping common code to be invoked from multiple places. For non-trivial domain services, it is more appropriate to extract this common code into a separate conventional Python class or package. But even here, one needs to keep in mind that object properties, unless they encapsulate some storage or messaging resource, are not shareable between Lambda functions. In future versions, we may consider relaxing some of these restrictions, but for now they need to be taken into account. This initial training process, however, does not take too much time, and newcomers usually ramp up in a couple of days.
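The guidance about extracting common code into a plain Python class could be illustrated as follows (all names are hypothetical):

```python
# Illustrative: shared domain logic lives in a plain Python class, packaged
# with every Lambda function and called in-process. Calling a service-class
# method instead would become a cross-Lambda invocation after compilation.
class PriceCalculator:
    """Stateless helper: safe to package with each Lambda function."""
    TAX_RATE = 0.17  # hypothetical tax rate for this sketch

    @staticmethod
    def total(net: float) -> float:
        return round(net * (1 + PriceCalculator.TAX_RATE), 2)

class BillingService:
    def create_invoice(self, net: float) -> dict:
        # Calls the helper in-process; no Lambda-to-Lambda invocation involved
        return {"net": net, "total": PriceCalculator.total(net)}

    def preview_invoice(self, net: float) -> dict:
        # Reuses the same helper instead of calling self.create_invoice,
        # which would turn into a separate cloud function invocation
        return {"total": PriceCalculator.total(net)}
```

Note that `PriceCalculator` keeps no instance state: per the restriction above, object properties would not survive the split into separate Lambda functions.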

Workflow Programming

This is a big topic that deserves a separate discussion; here, we will touch on only the most important aspects. Many existing solutions advocate specifying workflows in the form of a graph using some static configuration language such as JSON or YAML. For example, the AWS StepFunctions StateMachine language is JSON (in SAM it can be specified using simplified YAML). While this might be a valid approach from a workflow engine vendor's perspective, we consider that kind of specification to be the low-level machine language (of the cloud computer) and strive to compile regular programming language code into it automatically.

Ironically, what happens next is that many vendors provide programming language wrappers for building such JSON/YAML configuration files. For example, the AWS StepFunctions Data Science SDK does it for AWS StepFunctions and SageMaker, while Apache Airflow uses Python functions for building workflow graphs. We do not think this approach offers any real improvement, and prefer expressing workflows in plain Python code.

The CAIOS Service Template Compiler supports all Python control flow structures, waiting (sleep), and parallel computing, and automatically converts them into the corresponding AWS StepFunctions states. This is achieved by mapping Python Abstract Syntax Tree nodes onto semantically corresponding StepFunctions States, as illustrated below:

  • a = b: Pass State
  • self._<function_name>(…): Task State
  • return: Pass State
  • if …/elif …/else …: Choice State
  • x = … if … else …: Choice State
  • while …: Choice State
  • break: Pass State
  • continue: Pass State
  • sleep: Wait State
  • gather: Parallel State

and more to follow.
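As an illustration, here is a hypothetical workflow method annotated with the states each construct would map to under the scheme above. The service and its logic are invented for this sketch:

```python
import asyncio

# Illustrative async workflow method; under the AST mapping above, each
# construct would compile into the corresponding Step Functions state.
class IngestService:
    async def _process(self, batch: list) -> dict:
        count = 0                        # a = b      -> Pass State
        while batch:                     # while ...  -> Choice State
            item = batch.pop()
            if item.get("valid"):        # if/else    -> Choice State
                count += 1
            await asyncio.sleep(0)       # sleep      -> Wait State
        return {"processed": count}      # return     -> Pass State
```

Run locally (e.g., with `asyncio.run`), the method behaves as ordinary Python, which is what makes workflow logic testable before compilation.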

Full support of AWS StepFunctions capabilities, such as Express Workflows and SageMaker integration, is planned for future versions.

Python Service Template Compiler Architecture

This section provides a high-level overview of the CAIOS Service Template Compiler (CAIOS STC) architecture, leaving more detailed discussion of specific components to future publications.

Layered Architecture

The CAIOS STC needs to address multiple requirements in terms of cloud-platform portability and seamless extension of the system with support for additional cloud resources (e.g., a new serverless database engine or messaging system). Such a set of requirements is probably best reflected in the Open-Closed Architecture Principle:

it should be possible to add new capabilities to the system without changing the implementation of existing ones

When we add a new capability to the system, we need to present the new functionality at an adequate (not too low, not too high) level of abstraction, with the correct knobs to achieve the right price/performance and security (among other things, a correct implementation of the “principle of least privilege”). To achieve these goals, we came up with a layered system architecture, as illustrated below:

Fig 4: CAIOS Core Layered Architecture

Here is a brief description of each layer (bottom first):

  • caios-py-kernel: the mandatory part of the system, portable across all cloud platforms
  • caios-py-kernel-aws: the CAIOS “hardware abstraction layer” implementation for a particular cloud platform (in this case, AWS); starting from this layer, it is possible to automatically compile a service class with private cloud functions
  • caios-py-lib: portable plugins defining and implementing specific interfaces and API protocols (e.g., MutableMapping for DB access and the HTTP API naming convention parser)
  • caios-py-lib-aws: the system plugins’ “hardware abstraction layer” implementation for a particular cloud platform (in this case, AWS); here, the abstract interfaces and protocols specified above are implemented on top of specific cloud resources (e.g., AWS S3, DynamoDB, API Gateway, StepFunctions, etc.)
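As an illustration of the kind of interface caios-py-lib defines, here is a minimal MutableMapping-based key-value facade. The in-memory dict stands in for whatever cloud resource (e.g., DynamoDB or S3) a platform-specific layer would plug in; this sketch is ours, not actual CAIOS code:

```python
from collections.abc import MutableMapping

# Illustrative: a MutableMapping facade over a key-value store. Portable
# service code programs against this interface; a cloud-specific layer
# (e.g., caios-py-lib-aws) would back it with DynamoDB or S3.
class KeyValueStore(MutableMapping):
    def __init__(self):
        self._data = {}  # a DynamoDB table in a real AWS implementation

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

Service code then writes, say, `store["user-1"] = profile` with no cloud-specific API in sight, which is exactly the open-closed extension point the layering aims for.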

CAIOS STC (portable part)

We can now briefly describe what happens when a user asks to package (prepare for cloud deployment) a particular service:

Fig 5: CAIOS STC (portable part)

At a high level, the compilation process consists of three steps:

  • build the service specification document
  • build the service target package
  • calculate a digest for every cloud function (to automatically enforce a cloud function cold start when something changes)
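The digest step could be sketched as follows; this is our illustration, and the actual CAIOS digest algorithm and inputs may differ:

```python
import hashlib

# Illustrative: a per-function digest over the cloud function's source text.
# Any source change yields a new digest, so the packaged function is
# redeployed and a cold start is enforced.
def function_digest(source_code: str) -> str:
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()
```

The digest is deterministic, so unchanged functions keep their value across builds and are not needlessly redeployed.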

Building the service specification document is the most involved and complex process, which could roughly be classified as a Python Abstract Syntax Tree rewrite. Here, we need to parse the original service module, identify the service class (by convention, the last class in the module body), deal properly with module imports and globals, and convert every service class function into a separate cloud function.
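The first part of that rewrite, locating the service class and its methods, could be sketched with Python's standard ast module (the service module source here is invented for illustration):

```python
import ast

# Invented service module source for this sketch; per the convention,
# the last class in the module body is the service class.
SERVICE_SOURCE = '''
class Helper:
    pass

class GreetingService:
    def __init__(self, prefix):
        self.prefix = prefix

    def get_greeting(self, name):
        return f"{self.prefix}, {name}"
'''

def find_service_class(source: str) -> tuple:
    """Parse a service module and return (service class name, its method names)."""
    module = ast.parse(source)
    classes = [node for node in module.body if isinstance(node, ast.ClassDef)]
    service = classes[-1]  # by convention, the last class in the module body
    methods = [node.name for node in service.body
               if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return service.name, methods
```

Each discovered method would then be rewritten into a separate cloud function per the naming convention described earlier.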

CAIOS STC (cloud-specific part)

We are now ready to convert the cloud-neutral service specification document into a cloud platform-specific deployment script (in this case, an AWS CloudFormation Template) and to generate boilerplate plumbing code for each cloud function. This process is illustrated below:

Fig 6: CAIOS STC (AWS-specific part)

Here, we implement the complete packaging process for a particular cloud platform (in this case, AWS):

  • First, the service class compilation process outlined above is initiated.
  • Second, the service configuration is retrieved; if no service-specific configuration is provided, reasonable defaults are used. Here, all service resources such as database, messaging, API, security, the import system, logging and tracing, etc. can be specified at the required level of detail for each deployment mode (e.g., dev, test, stage, prod).
  • Third, an API-specific scan of service functions is performed to identify which type of API Gateway, if any, is required for that service.
  • Fourth and last, the service (e.g., CloudFormation Stack) template is generated; here, all necessary parts such as template parameters, conditions, resources per cloud function, common resources, and outputs are generated in the correct order.
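A drastically simplified sketch of the last step might look like this. The resource shape follows standard CloudFormation, but the structure, runtime, and the hypothetical ServiceRole/DeploymentBucket references are our illustrative assumptions, not actual CAIOS output:

```python
# Illustrative: assemble a minimal CloudFormation template with one Lambda
# resource per service method. Real output also includes conditions, IAM
# roles, API Gateway resources, outputs, etc.
def build_template(service_name: str, functions: list) -> dict:
    resources = {}
    for fn in functions:
        logical_id = f"{service_name}{fn.title().replace('_', '')}Function"
        resources[logical_id] = {
            "Type": "AWS::Lambda::Function",
            "Properties": {
                "Handler": f"{fn}.handler",
                "Runtime": "python3.9",
                # DeploymentBucket and ServiceRole are hypothetical shared
                # resources assumed to be defined elsewhere in the template
                "Code": {"S3Bucket": {"Ref": "DeploymentBucket"}, "S3Key": f"{fn}.zip"},
                "Role": {"Fn::GetAtt": ["ServiceRole", "Arn"]},
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"DeploymentBucket": {"Type": "String"}},
        "Resources": resources,
    }
```

Generating the template as a plain dict keeps ordering and cross-referencing under the compiler's control before serialization to JSON or YAML.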

What’s Next?

Once we have the CAIOS Kernel running on a particular cloud platform, we can start adding 3rd-party and project libraries. Most important, every new component can be fully tested, both locally and on the cloud, using the CAIOS STC machinery via simple CAIOS CLI commands. For example, the caios test run command automatically runs unit, local integrated, and remote integrated tests for the service, while remote integrated test runs are accompanied by automatic service compilation and upload to the cloud.

Such extensibility is ensured by an open-ended CAIOS kernel namespace architecture, as illustrated below:

Fig 7: CAIOS STC Open Namespace Structure

This structure enables the implementation of different types of cloud functions and resources (e.g., Database) without modification of existing ones. For example, the AWS-specific kernel implementation comes as a relatively straightforward extension of this namespace:

Fig 8: CAIOS STC AWS kernel components

Conclusion

The CAIOS STC architecture allowed us to achieve a 20-fold productivity gain from the very outset. Moreover, junior developers without prior AWS cloud experience were able to start developing REST API services after just a couple of days of initial training in the CAIOS development environment and HTTP API naming convention (people just need to wrap their heads around the automatic conversion of Python functions to HTTP methods).

In this paper, we provided a high-level overview of the project motivation, decision-making factors, and architecture. A more detailed discussion of implementing particular interfaces and protocols will come in forthcoming publications. Stay tuned.
