Tomammon.net – Operational Intent

This is Part 5 in a 7-part series discussing the www.tomammon.net online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

What is “Operational Intent”?

I see Operational Intent as a set of maintenance practices that go along with the decisions made during the design of a system. It’s the stuff you have to do to keep the system running smoothly. If the Operational Intent is created at the same time you are designing the network, you are far more likely to design something that the business can actually consume and use as a competitive advantage. Hint: the person designing the network should not be the principal author of Operational Intent.

For a more concrete example, consider the routing policy portion of a network design. Let’s say that route redistribution between multiple routing domains is a part of your design. There are a number of ways to control the flow of prefixes from one domain to another, each with its own pros and cons (a small configuration sketch follows the list of questions below). Once you have selected a method for controlling the redistribution of routing information, you have created an entity that has to be cared for and looked after. For example, if you chose prefix lists to control the routing information flow, you have to understand when and how to maintain these prefix lists. To determine Operational Intent for this part of the design, document the answers to questions like these:

  • When and why will the prefix lists be modified?
  • How specific should the matches in the prefix lists be? Should they always be exact matches, or can we use variable prefix lengths? Does it even matter how long the match is, so long as it matches the new prefixes being added?
  • How will our automation framework interact with this component? Is the name of the prefix list constructed using some type of logic? Or is the name just known in a database or some other state store? Should humans normally be touching this part of the configuration?
  • If human operators are going to be touching this, what is the minimum level of skill that I, as the designer, assume they will possess? If the operators’ skill changes over time, does that have practical consequences for the function of the system down the road?
  • What are the consequences to the stability of the system if this prefix list is accidentally deleted? What if it’s blown wide open with something like “permit 10.0.0.0/8 le 32”?
  • As parts of the network are divested or consolidated, will this component be audited and cleaned up, and if so, how?
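To make the example concrete, here is a minimal sketch of prefix-list-controlled redistribution in IOS-style syntax. The names, prefixes, and protocols are hypothetical; the point is that the prefix list is an artifact to which every question above applies.

    ! hypothetical example: only two data-center aggregates may leak into OSPF
    ip prefix-list DC-TO-CAMPUS seq 10 permit 10.20.0.0/16
    ip prefix-list DC-TO-CAMPUS seq 20 permit 10.30.0.0/16
    !
    route-map REDIST-DC-TO-CAMPUS permit 10
     match ip address prefix-list DC-TO-CAMPUS
    !
    router ospf 1
     redistribute eigrp 100 subnets route-map REDIST-DC-TO-CAMPUS

Who edits DC-TO-CAMPUS, how specific its entries must be, and what happens when it is deleted or loosened are exactly the things an OI document should spell out.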

It’s not always possible to know the answers to these types of questions in advance, but they serve as a way to bring the design of the system down to operational reality. If you are designing something and can’t answer most of the questions above, or if the answers trouble you, that’s a signal that you need to rethink the design. If you are an operator, and the answers to these questions are unclear or are wildly different when asked of different members of your team, that’s a sign that something isn’t right with the design.

Another valuable role of an OI document is to connect the expertise of the system designer to the expertise of the operator. In my career I have met some very talented operators who could run circles around me in their ability to monitor and automate the networks I have built. Working with individuals like that is a very rewarding experience for me, because my designs almost always become more practical and more functional after they have been examined from an operational perspective. The OI document allows the designer to spell out their assumptions about how the system will be maintained, and the value of communicating those assumptions cannot be overstated.

Operational Intent for Tomammon.net

Here’s a representative sample of what an OI document might look like for Tomammon.net. Since the networking is pretty simple, we’ll focus on the application services.

Maintenance of Static Content and Application Code

  • Static content and application code are all hosted on github in the tomammon.net repo.
  • Changes made to any of these elements are pushed up to the github repo from the dev environment, and then pulled down by each Content Node independently.
  • For risky changes, a Content Node can be pulled out of the round-robin DNS A record before pulling the new code down, and then tested. Rolling back to an older, functional version of the code is handled by git on the Content Node.
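A rough sketch of that workflow on a Content Node, assuming the working copy lives in a hypothetical /opt/tomammon.net directory:

    # after removing this node's address from the round-robin A record...
    cd /opt/tomammon.net
    git pull origin master       # pull the new code down, then test

    # if the new code misbehaves, roll back to the last known-good commit or tag
    git log --oneline -5
    git checkout <known-good-commit>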

Monitoring the Application and Database Layers

  • The app server can be monitored directly via the “testapi” call; see the code example below.
  • The database contains a static table with generic content, which is retrieved using the “testdb” call; see the code example below.
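The original endpoint code isn’t reproduced here, but a minimal sketch of what the two calls might look like on the app server layer follows. The hostname, credentials, table name, and parameter handling are assumptions, not the production code.

    <?php
    // appsrv-* (sketch): health-check endpoints for the app and database layers
    header('Content-Type: application/json');

    $call = isset($_GET['call']) ? $_GET['call'] : '';

    if ($call === 'testapi') {
        // proves the app server itself is up, without touching the database
        echo json_encode(array('status' => 'ok', 'layer' => 'appsrv'));
    } elseif ($call === 'testdb') {
        // proves the local MariaDB slave is reachable and answering queries
        $db = new mysqli('tandb_slave', 'monitor', 'monitor-password', 'resume');
        if ($db->connect_error) {
            http_response_code(500);
            echo json_encode(array('status' => 'error', 'detail' => 'db unreachable'));
            exit;
        }
        $result = $db->query('SELECT content FROM test_static LIMIT 1');
        echo json_encode(array('status' => 'ok', 'row' => $result->fetch_assoc()));
    } else {
        http_response_code(400);
        echo json_encode(array('status' => 'error', 'detail' => 'unknown call'));
    }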

Common Database Problems and Solutions

  • If connectivity problems arise in the transport network between the Content Nodes and the master database, replication problems can result. Before addressing these problems, confirm that network connectivity is healthy and stable between the slave and master.
  • Database replication problems can often be cleared by simply restarting replication from the MariaDB client (on the slave side, in the Content Node inside the tandb_slave container), using these commands:
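A typical sequence, run from the MariaDB client on the slave, looks something like this (the exact recovery steps depend on the error reported):

    STOP SLAVE;
    START SLAVE;

    -- confirm both replication threads are running and the error has cleared
    SHOW SLAVE STATUS\G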


Final Thoughts

If there is a healthy balance of power between the designers and the operators of systems, the concepts of Operational Intent can produce real wins for the business. At its heart, OI is about collaboration between designers and operators as equal partners, enabled by open communication about technical decisions and requirements.

In the next article in the series, I’ll make some confessions about the weaknesses and problems of the resume application and its supporting infrastructure.

Tomammon.net – Services Infrastructure and Application Code

This is Part 3 in a 7-part series discussing the www.tomammon.net online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

In this post we will take a look at the application and the services that it consumes. I wrote most of the code myself; the only exception is the frontend HTML and CSS, which I took from a template several years ago. Unfortunately, the author didn’t leave his contact information in the source code and I have long since forgotten his name. If you are him, please email me so that I can give you credit.

Let’s start with a quick recap of the high level design of the application. We have Content Nodes, a Data Warehouse, DNS, and Internet Transport. We’ll focus on the content nodes for today. We’ll start our tour at the point where the user’s browser connects to one of the content nodes. In the next post (Public Cloud Components) we will look at how DNS fits in to all of this.

The content node is where the bulk of the application’s work is done. It is built using a set of 6 docker containers, which interact as shown in the diagram below.

Here are a few things to notice:

  • The Load Balancer distributes incoming HTTP requests using the round-robin method to each of the websrv-* containers. Here’s a snippet of the relevant HAProxy configuration:
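The original snippet isn’t reproduced here, but a representative frontend/backend pair looks something like this (server names and ports are assumptions; the real values are in the full config):

    frontend http-in
        bind *:80
        default_backend websrv_pool

    backend websrv_pool
        balance roundrobin
        # health checks let HAProxy pull a broken websrv container out of rotation
        server websrv-1 websrv-1:80 check
        server websrv-2 websrv-2:80 check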

The full config file is available here.

  • Once an HTTP request has been dispatched from the load balancer, the websrv-* containers serve the static content (static HTML code in index.php, etc.) and make API calls to the appsrv-* containers for the parts of the app that are dynamic. For example, the data in the “Vendors and Technologies” section of the resume lives in several MariaDB database instances, and not natively in the index.php file. This content is retrieved by index.php making an API call to the app server layer, like this:
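The exact call isn’t shown here, but a sketch of the pattern from index.php might look like this (the endpoint name and hostname are assumptions):

    <?php
    // index.php (sketch): ask the app server layer for the dynamic resume data
    $response = file_get_contents('http://appsrv-1/api.php?call=vendors');
    $vendors  = json_decode($response, true);

    foreach ($vendors as $vendor) {
        echo '<li>' . htmlspecialchars($vendor['name']) . '</li>';
    }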

  • The appsrv-* containers answer the API calls made from the web server layer, translate them into SQL queries, query the local MariaDB slave, and return the results of the query back to the web server layer in JSON format. Here’s an example of one of those API endpoints, which resides on the appsrv-* containers:
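The production endpoint isn’t reproduced here, but the pattern is: accept the call, run a query against the local slave, and return the rows as JSON. A sketch, with hypothetical table, column, and credential names:

    <?php
    // appsrv-* (sketch): translate an API call into a SQL query against the local slave
    header('Content-Type: application/json');

    $db = new mysqli('tandb_slave', 'appuser', 'app-password', 'resume');
    if ($db->connect_error) {
        http_response_code(500);
        echo json_encode(array('error' => 'database unavailable'));
        exit;
    }

    $result = $db->query('SELECT name, category FROM vendors ORDER BY name');
    $rows = array();
    while ($row = $result->fetch_assoc()) {
        $rows[] = $row;
    }
    echo json_encode($rows);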

  • The MariaDB database is a slave receiving row-based replication from the master DB instance, which we’ll cover a little later on in the post.

That’s a quick drive-by of the flow of data through a Content Node. If you’re interested in more details on any of the code, leave a comment below or email me and I’d be happy to elaborate. You can also see all of the configuration files and application code used by the resume application in my github account if you are so inclined.

Container Configuration

A Linux container is, basically, a way to package application code so that it is portable and free of dependencies on the host it runs on. Sort of like a virtual machine, but way more lightweight. Using containers is one way to implement a microservices architecture. I chose to run my containers with Docker. See the references section at the end for some good resources on learning containers and Docker.

The containers are connected as shown below.

  • For all but the database slave, I have built custom images for each service in the application, and these are hosted on dockerhub. The Dockerfiles for these are located in my github account. A Dockerfile is a set of instructions for building a custom docker image.
  • A Docker network is really just a software bridge running on the host. After deploying the app, you can see this bridge on the host:
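The original output isn’t reproduced here, but the bridge is easy to find with standard commands:

    # list the networks docker-compose created for the application
    docker network ls

    # the corresponding software bridges appear in the host's interface list
    ip link show type bridge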

  • The load balancer has front and back side connectivity, similar to how you might implement it with a physical appliance. It has a bind mount to allow the configuration file to live on the host server instead of inside the container itself. This container is based on Alpine Linux, a very lightweight distro that is popular in container deployments. Here’s some output from the docker inspect command that shows how the load balancer’s bind mount and network connections are configured:
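The original output isn’t reproduced here, but the relevant sections can be pulled out with format filters like these (the container name is an assumption):

    # the bind mount that maps the HAProxy config in from the host
    docker inspect -f '{{ json .Mounts }}' loadbalancer

    # the front-side and back-side networks the container is attached to
    docker inspect -f '{{ json .NetworkSettings.Networks }}' loadbalancer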

  • The webserver and appserver containers are based on the official PHP docker images – the only customization needed was adding the mysqli database extension, which is not included in the official images I wanted to use. The Dockerfile for the appserver image shows how this is done:
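The real Dockerfile lives in the github repo; a minimal sketch of the idea, with an assumed base image tag, is just two lines:

    # start from the official PHP+Apache image and add the mysqli extension
    FROM php:7.2-apache
    RUN docker-php-ext-install mysqli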

  • docker-compose is a Python utility that allows you to specify all of the parameters for your containers in a YAML configuration file. Without it, you’d have to specify all of the options for each container, such as the networks it connects to, the image it requires, the mounts it requires, etc., at runtime, which can get tedious. The full docker-compose for www.tomammon.net is available in my github account, but here’s a section of the docker-compose.yml file that I built to specify the parameters for all of the containers. This one focuses on the load balancer container:
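The snippet itself isn’t reproduced here, but a representative slice for the load balancer might look like this (the image name, paths, and network names are assumptions):

    version: "3"
    services:
      loadbalancer:
        image: tomammon/haproxy-lb:latest    # hypothetical image name
        ports:
          - "80:80"                          # front side: exposed to end users
        volumes:
          # bind mount so the HAProxy config lives on the host, not in the container
          - ./haproxy/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro
        networks:
          - frontend
          - backend                          # back side: reaches the websrv-* containers

    networks:
      frontend:
      backend: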

Database Replication

The slave database instances (tandb_slave containers) are read-only copies of the database running on the master. Changes to the database are made on the master and then replicated out to the slaves. As we’ll discuss in Part 5: Operational Intent, this scheme has some maintenance and operations drawbacks, but overall it is the simplest choice available, and as we all know, Simplicity is Sophistication.
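For reference, pointing a slave at the master boils down to a handful of statements like these, run on the slave (hostnames, credentials, and log coordinates are placeholders):

    CHANGE MASTER TO
        MASTER_HOST='master-db.example.net',
        MASTER_USER='repl',
        MASTER_PASSWORD='repl-password',
        MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=4;
    START SLAVE;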

Designing for High Availability

We can see that there are several places in the application where failures could be absorbed while allowing the application to continue functioning. For example, there are multiple web services and multiple app services, and the load balancer performs health checks so that it can remove broken containers from service. But we still have some single points of failure: the load balancer itself is not in an HA pair, and neither is the database slave.

These were deliberate choices in the design. There is always a tradeoff between complexity, maintenance burden, and redundancy; often, reducing the first two comes at the expense of the third, and vice versa. To balance these factors, I chose to wrap up the entire set of microservices in a VMware virtual machine, which I could then replicate as many times as I wanted. This replication would also give me one other key benefit – I now have the possibility of locating the content near the users who will be consuming it. Add to this the relationship of the master database to the application, and we end up with an application that looks like this:

All of this is running over a combination of private networks, public Internet connectivity for end users, private VPN tunnels that ride over public Internet transport, and public cloud infrastructure. Currently, the production application has two Content Nodes, one in Salt Lake City, Utah, and another in Romania. As we progress in this series of blog posts, we’ll discuss the interaction between all of these components in greater detail.

References

  1. HAProxy Official Documentation
  2. HAProxy Quickstart w/ full example config file
  3. Creating a Simple REST API in PHP
  4. How To Install and Use Docker on CentOS 7
  5. Learn Docker in 12 Minutes
  6. Overview of Docker Compose
  7. How Does MySQL Replication Really Work?
  8. How To Set Up Master Slave Replication in MySQL
  9. Tom’s Github Account – tomammon.net repo
  10. Tom’s DockerHub Account