– The Ugly Bits

This is Part 6 in a 7-part series discussing the online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

In this post we’ll take a look at some of the design problems in the resume application.

Some Obvious Problems

The first set of problems comes from the low-budget nature of the project. There is no proper datacenter for most of the components, and the hypervisors running the data warehouse and one of the Content Nodes have lots of single points of failure in their power, cooling, and network connectivity. There’s also no redundancy for internet connectivity. The system is built to tolerate some of these failures – for example, if the data warehouse goes away for a short period, both Content Nodes will continue to serve content, though no updates to the data can be made. But in general, this is not a robust application in terms of infrastructure.

Backend and Container Networking

The use of remote access VPN as a backend network (for remote Content Nodes) is hokey at best. But, when you’re on a shoestring, you do what you have to do to get things working. On the container side, I could have used one of the available overlays to connect the slave databases to the master, at a container level, instead of letting Docker do its thing with iptables. I chose not to go that route in order to speed up delivery of the project.

DNS issues

The DNS infrastructure is somewhat fragile.

  • The Content Node HA mechanism is round-robin DNS. I have plans to build some monitoring logic to pull out the A record representing a non-responsive Content Node, but this is not currently implemented. This means that if one of the Content Nodes goes offline, some percentage of client traffic will end up getting dropped on the floor.
  • Even when the monitoring logic pulls the record, most browsers cache DNS data independently of the OS, and they also don’t necessarily care about what the stub resolver in the OS is doing after they get an answer. Just because the TTL is set to 300 seconds doesn’t mean that the browser’s DNS cache is going to expire it that quickly. Even if it did, 5 minutes is a long time to wait for a failed content node to disappear.
  • DNS is hosted entirely on Route53. ’nuff said.
  • There is no real CDN. If you happen to be in Europe and if your stub resolver happens to pick the Romanian content node, you’ll get a little better performance than if it picked the North American node. But that choice is not deterministic based on your location.
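The monitoring logic mentioned above is straightforward to sketch. Here’s a minimal, hypothetical example in Python – the node IPs, health-check URL, and function names are all my own invention, not part of the deployed code – that decides which Content Node addresses should remain in the round-robin A record:

```python
import urllib.request

# Hypothetical list of Content Node public IPs in the round-robin A record.
CONTENT_NODES = ["203.0.113.10", "198.51.100.20"]

def node_is_healthy(ip, fetch=None, timeout=5):
    """Return True if the node answers its health-check URL with HTTP 200.
    `fetch` is injectable so the logic can be tested without a live node."""
    if fetch is None:
        def fetch(url):
            return urllib.request.urlopen(url, timeout=timeout).status
    try:
        return fetch(f"http://{ip}/index.php") == 200
    except OSError:
        return False

def records_to_keep(nodes, fetch=None):
    """Filter the A record set down to responsive nodes. Never returns an
    empty set: if every node looks dead, the monitor is more likely broken
    than the whole fleet, so leave the record alone."""
    healthy = [ip for ip in nodes if node_is_healthy(ip, fetch=fetch)]
    return healthy if healthy else list(nodes)
```

The surviving list would then be pushed back to Route53 (for example, as an UPSERT via the ChangeResourceRecordSets API). The never-return-empty rule guards against the monitor itself losing connectivity and yanking every record at once.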

Application Code

While debugging a different, unrelated problem with the web service containers, I realized I had a single point of failure between the web server and app server layers inside the Content Nodes. You can find this in the index.php code on my github account. Do you see it? Hint: look at line 15 of index.php. It relates to how I am using bind mounts in docker to keep the code on the docker host instead of inside each container. I can think of a few ways to mitigate this defect, but do you have ideas? If so, put them in the comments below or hit me up on twitter – I’d love to discuss them with you.

I’m sure that I’m missing other problems with my design, and I would love to hear any feedback that anybody wants to share. My purpose in building the application has been to broaden my horizons and try to get a taste of life outside my silo. If you are someone who builds web applications, I’d love to hear any criticism you might have – I’m sure I will benefit greatly from it.

The next post is the last in this series. I’ll summarize what I have learned and how my perspective has changed, and hopefully that will inspire you to look outside your particular silo as well.

– Operational Intent

This is Part 5 in a 7-part series discussing the online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

What is “Operational Intent”?

I see Operational Intent as a set of maintenance practices that go along with the decisions made during the design of a system. It’s the stuff you have to do to keep the system running smoothly. If the Operational Intent is created at the same time you are designing the network, you are far more likely to design something that the business can actually consume and use as a competitive advantage. Hint: the person designing the network should not be the principal author of Operational Intent.

For a more concrete example, consider the routing policy portion of a network design. Let’s say that route redistribution between multiple routing domains is a part of your design. There are a number of ways to control the flow of prefixes from one domain to another, each with their pros and cons. Once you have selected a method for controlling the redistribution of routing information, you have created an entity that has to be cared for and looked after. For example, if you chose prefix lists to control the routing information flow, you have to understand when and how to maintain these prefix lists. To determine Operational Intent for this part of the design, document the answers to questions like these:

  • When and why will the prefix lists be modified?
  • How specific should the matches in the prefix lists be? Should they always be exact matches, or can we use variable prefix lengths? Does it even matter how long the match is, so long as it matches the new prefixes being added?
  • How will our automation framework interact with this component? Is the name of the prefix list constructed using some type of logic? Or is the name just known in a database or some other state store? Should humans normally be touching this part of the configuration?
  • If human operators are going to be touching this, what is the minimum level of skill that I, as the designer, assume they will possess? If the operators’ skill changes over time, does that have practical consequences for the function of the system down the road?
  • What are the consequences to the stability of the system if this prefix list is accidentally deleted? What if it’s blown wide open with something like “permit le 32”?
  • As parts of the network are divested or consolidated, will this component be audited and cleaned up, and if so, how?

It’s not always possible to know the answers to these types of questions in advance, but they serve as a way to bring the design of the system down to operational reality. If you are designing something and can’t answer most of the questions above, or if the answers trouble you, that’s a signal that you need to rethink the design. If you are an operator, and the answers to these questions are unclear or are wildly different when asked of different members of your team, that’s a sign that something isn’t right with the design.

Another valuable role of an OI document is to connect the expertise of the system designer to the expertise of the operator. In my career I have met some very talented operators who could run circles around me in their ability to monitor and automate the networks I have built. Working with individuals like that is a very rewarding experience for me, because my designs almost always become more practical and more functional after they look at them from an operational perspective. The OI document allows the designer to spell out their assumptions about how the system will be maintained. The value of that communication of assumptions cannot be overstated.

Operational Intent for

Here’s a representative sample of what an OI document might look like for this application. Since the networking is pretty simple, we’ll focus on the application services.

Maintenance of Static Content and Application Code

  • Static content and application code are all hosted on github in the tommmonet repo.
  • Changes made to any of these elements are pushed up to the github repo from the dev environment, and then pulled down by each Content Node independently.
  • For risky changes, a Content Node can be pulled out of the round-robin DNS A record before pulling the new code down, and then tested. Rolling back to an older, functional version of the code is handled by git on the Content Node.

Monitoring the Application and Database Layers

  • The app server can be monitored directly via the “testapi” API call.
  • The database contains a static table with generic content, which is retrieved via the “testdb” API call.
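The two checks lend themselves to simple automation. Here’s a hypothetical monitoring sketch in Python – the endpoint URLs and function names are my own assumptions, not the production code, which lives in the repo:

```python
import json
import urllib.request

# Hypothetical health-check endpoint URLs; the real paths live in the repo.
CHECKS = {
    "app": "http://localhost/api.php?call=testapi",
    "db":  "http://localhost/api.php?call=testdb",
}

def classify(body):
    """Interpret a health-check response body. The endpoints are assumed to
    return JSON; any parse failure or empty payload counts as unhealthy."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return "unhealthy"
    return "healthy" if payload else "unhealthy"

def run_checks(fetch=None):
    """Run every check and return {check_name: status}. `fetch` is
    injectable so the logic can be exercised without a live Content Node."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read().decode()
    return {name: classify(fetch(url)) for name, url in CHECKS.items()}
```

A cron job (or the future DNS-pruning monitor) could call `run_checks()` periodically and alert, or pull the node’s A record, when either layer goes unhealthy.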

Common Database Problems and Solutions

  • If connectivity problems arise in the transport network between the Content Nodes and the master database, replication problems can result. Before addressing these problems, confirm that network connectivity is healthy and stable between the slave and master.
  • Database replication problems can often be cleared by simply restarting replication from the MariaDB client, on the slave side, inside the tandb_slave container on the Content Node.
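A typical restart sequence, run from the MariaDB client on the slave, looks like this (these are the standard MariaDB replication commands; if the slave has drifted too far, a full re-sync from a master dump may be needed instead):

```sql
-- Inside the tandb_slave container on the affected Content Node:
STOP SLAVE;
START SLAVE;

-- Verify that both the IO and SQL threads show "Yes" and that
-- Seconds_Behind_Master is trending toward zero:
SHOW SLAVE STATUS \G
```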


Final Thoughts

If there is a healthy balance of power between the designers and the operators of systems, the concepts of Operational Intent can produce real wins for the business. At its heart, OI is about collaboration between designers and operators as equal partners, enabled by open communication about technical decisions and requirements.

In the next article in the series, I’ll make some confessions about the weaknesses and problems of the resume application and its supporting infrastructure.

– Public Cloud Integration

This is Part 4 in a 7-part series discussing the online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

In this post we’ll see the role that public cloud plays in the application. The majority of the business logic runs in private facilities. However, it makes sense (as is the case in many modern applications) to run part of it from the public cloud. Here’s the strategic view of the application again:


The biggest cloud component of the application is the DNS Service.

DNS Design

Amazon’s Route53 is used to provide name services, and is authoritative for the application’s zone. Global site redundancy for the application is provided by populating a single A record with the public IP addresses of each Content Node to create a round-robin DNS entry. Two Content Nodes are currently in service, as shown here:

The TTL for these records is set low at 300 seconds, to allow for changes in the recordset to propagate quickly. This approach has some major flaws, which we’ll discuss in Part 6: The Ugly Bits. Also, in Part 5: Operational Intent, I’ll show how the AWS API is used to make these DNS records responsive to the health of the application.
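For reference, the recordset above corresponds to a Route53 change batch like the following (the record name and addresses are placeholders, not the real values):

```json
{
  "Comment": "Round-robin A record for the Content Nodes",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com.",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [
          { "Value": "203.0.113.10" },
          { "Value": "198.51.100.20" }
        ]
      }
    }
  ]
}
```

A change batch in this shape can be applied with the Route53 ChangeResourceRecordSets API, or with `aws route53 change-resource-record-sets` from the CLI.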

Static Assets

Some of the static assets (such as my profile picture image) for the site are located in an Amazon S3 bucket, which is called via the domain:


It’s pretty painless to set up a simple static website on S3. All of the dynamic components of the application are located in private compute facilities, so I’m only using public cloud for very specific parts of the application, which keeps the costs quite low.


  1. Hosting a Static Website on Amazon S3
  2. Example: Setting up a Static Website Using a Custom Domain
  3. How Do I Configure an S3 Bucket for Static Website Hosting?

– Services Infrastructure and Application Code

This is Part 3 in a 7-part series discussing the online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

In this post we will take a look at the application and the services that it consumes. I wrote most of the code myself. The only code I didn’t write myself was the frontend HTML and CSS, which I took from a template several years ago. Unfortunately, the author didn’t leave his contact information in the source code and I have long since forgotten his name. If you are him, please email me so that I can give you credit.

Let’s start with a quick recap of the high level design of the application. We have Content Nodes, a Data Warehouse, DNS, and Internet Transport. We’ll focus on the Content Nodes today. We’ll start our tour at the point where the user’s browser connects to one of the Content Nodes. In the next post (Public Cloud Integration) we will look at how DNS fits into all of this.

The content node is where the bulk of the application’s work is done. It is built using a set of 6 docker containers, which interact as shown in the diagram below.

Here are a few things to notice:

  • The Load Balancer distributes incoming HTTP requests using the round-robin method to each of the websrv-* containers. Here’s a snippet of the relevant HAProxy configuration:
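A representative sketch of that configuration is shown below – the backend and server names are placeholders rather than the exact production values:

```
frontend http_in
    bind *:80
    default_backend websrv_pool

backend websrv_pool
    balance roundrobin
    # "check" enables health checks so dead containers are pulled from rotation
    server websrv-1 websrv-1:80 check
    server websrv-2 websrv-2:80 check
```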

The full config file is available here.

  • Once an HTTP request has been dispatched from the load balancer, the websrv-* containers serve the static content (static HTML code in index.php, etc.) and make API calls to the appsrv-* containers for the parts of the app that are dynamic. For example, the data in the “Vendors and Technologies” part of the resume:

lives in several MariaDB database instances, and not natively in the index.php file. This content is retrieved by index.php making an API call to the app server layer, like this:

  • The appsrv-* containers answer the API calls made from the web server layer, translate them into SQL queries, query the local MariaDB slave, and return the results of the query back to the web server layer in JSON format. Here’s an example of one of those API endpoints, which resides on the appsrv-* containers:
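The real endpoints are PHP and live in the github repo. As a language-neutral sketch of the pattern (named API call in, SQL lookup against the slave, JSON back out), here’s a Python version with invented call and table names:

```python
import json

# Invented mapping from API call names to SQL; the real endpoints are
# PHP scripts running in the appsrv-* containers.
QUERIES = {
    "vendors": "SELECT name, category FROM vendors ORDER BY name",
    "testdb":  "SELECT 1 AS alive",
}

def handle_call(call, execute):
    """Answer one API call: look up its SQL, run it against the local
    slave via `execute(sql) -> list[dict]`, and return a JSON string.
    Unknown calls return a JSON error object rather than raising."""
    sql = QUERIES.get(call)
    if sql is None:
        return json.dumps({"error": f"unknown call: {call}"})
    return json.dumps({"rows": execute(sql)})
```

In production, the `execute` callable would wrap a mysqli (or equivalent) cursor pointed at the local MariaDB slave.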

  • The MariaDB database is a slave receiving row-based replication from the master DB instance, which we’ll cover a little later on in the post.

That’s a quick drive-by of the flow of data through a Content Node. If you’re interested in more details on any of the code, leave a comment below or email me and I’d be happy to elaborate. You can also see all of the configuration files and application code used by the resume application in my github account if you are so inclined.

Container Configuration

A Linux container is, basically, a way to package application code so that it is portable and free of dependencies on the host it runs on. Sort of like a virtual machine, but way more lightweight. Using containers is one way to implement a microservices architecture. I chose to run my containers with Docker. See the references section at the end for some good resources on learning containers and Docker.

The containers are connected as shown below.

  • For all but the database slave, I have built custom containers for each service in the application, and these are hosted on dockerhub. The Dockerfiles for these are located in my github account. A Dockerfile is a set of instructions for building a custom docker container.
  • A Docker network is really just a software bridge running on the host. After deploying the app, you can see that this bridge is running on the host:

  • The load balancer has front and back side connectivity, similar to how you might implement it with a physical appliance. It has a bind mount to allow the configuration file to live on the host server instead of inside the container itself. This container is based on Alpine Linux, a very lightweight distro that is popular in container deployments. Here’s some output from the docker inspect command that shows how the load balancer’s bind mount and network connections are configured:

  • The webserver and appserver containers are based on the official PHP docker images – the only customization I needed was the mysqli database extension, which is not part of the official images. The Dockerfile for the appserver image shows how this is done:
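A minimal version of that customization looks like this – the base image tag is an assumption; any official php:*-apache tag works the same way:

```dockerfile
# Base image tag is a placeholder -- the real Dockerfile is in the repo.
FROM php:7.2-apache

# The official PHP images ship a helper script for compiling core
# extensions; mysqli is required to talk to the MariaDB slave.
RUN docker-php-ext-install mysqli
```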

  • docker-compose is a Python utility that allows you to specify all of the parameters for your containers in a YAML configuration file. Without it, you’d have to specify all of the options for each container – the networks it connects to, the image it requires, the mounts it needs, and so on – at runtime, which gets tedious. The full docker-compose.yml is available in my github account, but here’s the section that specifies the parameters for the load balancer container:
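A fragment in the spirit of that file might look like the following – service names, networks, and paths here are illustrative, not the production values:

```yaml
version: "3"
services:
  loadbalancer:
    image: haproxy:alpine
    ports:
      - "80:80"                  # front-side connectivity
    networks:
      - frontend                 # toward incoming client traffic
      - backend                  # toward the websrv-* containers
    volumes:
      # Bind mount keeps the HAProxy config on the host, not in the image.
      - ./haproxy/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro

networks:
  frontend:
  backend:
```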

Database Replication

The slave database instances (tandb_slave containers) are read-only copies of the database running on the master. Changes to the database are made on the master and then replicated out to the slaves. As we’ll discuss in Part 5: Operational Intent, this scheme has some maintenance and operations drawbacks, but overall it is the simplest choice available, and as we all know, Simplicity is Sophistication.
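Pointing a fresh slave at the master is a one-time setup step on each slave. A minimal sketch, with the host, credentials, and binlog coordinates as placeholders:

```sql
-- Run once on each slave (inside the tandb_slave container):
CHANGE MASTER TO
  MASTER_HOST='master.example.net',
  MASTER_USER='repl',
  MASTER_PASSWORD='********',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;
```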

Designing for High Availability

We can see that there are several places in the application where failures could be absorbed while allowing the application to continue functioning. For example, there are multiple web services and multiple app services, and the load balancer is doing health checks to enable it to remove broken containers from service if they fail. But we still have some single points of failure. The load balancer itself is not in an HA pair, and neither is the database slave.

These were deliberate choices in the design. There is always a tradeoff between complexity, maintenance burden, and redundancy: more redundancy usually comes at the cost of more complexity and a higher maintenance burden, and vice versa. To balance these factors, I chose to wrap up the entire set of microservices in a VMware virtual machine, which I could then replicate as many times as I wanted. This replication also gives me one other key benefit – the ability to locate the content near the users who will be consuming it. Add to this the relationship of the master database to the application, and we end up with an application that looks like this:

All of this is running over a combination of private networks, public Internet connectivity for end users, private VPN tunnels that ride over public Internet transport, and public cloud infrastructure. Currently, the production application has two Content Nodes, one in Salt Lake City, Utah, and another in Romania. As we progress in this series of blog posts, we’ll discuss the interaction between all of these components in greater detail.


  1. HAProxy Official Documentation
  2. HAProxy Quickstart w/ full example config file
  3. Creating a Simple REST API in PHP
  4. How To Install and Use Docker on CentOS 7
  5. Learn Docker in 12 Minutes
  6. Overview of Docker Compose
  7. How Does MySQL Replication Really Work?
  8. How To Set Up Master Slave Replication in MySQL
  9. Tom’s Github Account – repo
  10. Tom’s DockerHub Account

– Network Infrastructure

This is Part 2 in a 7-part series discussing the online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

The Network Base Layer

The network infrastructure surrounding the application is pretty simple. All of the components to the left of the Internet cloud in the diagram below run on my home network infrastructure, but only the parts of my home network relevant to our topic are shown. Let’s start with a connectivity-focused perspective.

Network Connectivity Design

Most of the connectivity is delivered with physical appliances, while the nodes that run the application code are all virtualized. The load balancer is the only device that straddles these two worlds. This device is a VNF running the following software:

  • CentOS Linux for the operating system
  • HAProxy as the load balancing engine
  • The Free Range Routing suite to provide connectivity into the OSPF routing domain

The load balancer advertises a loopback, taken from PA space, into the core network. This public IP address is one of the addresses configured in the DNS A record for the application. We’ll discuss the DNS components in Part 4: Public Cloud Integration.

OSPF Design

A simple single-area OSPF design keeps things clean. Since the routing domain is quite small, there is no need to run multiple areas or turn nerd knobs. Here are a few things to keep in mind:

  • The Internet Router unconditionally redistributes a default route into the OSPF domain to provide Internet access to the core network. Since there are no other Internet exit points, there’s no need to make the advertisement of the Type 5 LSA conditional on a received default route.
  • The VPN headend (a Cisco ASA 5505) is running as an AnyConnect Remote Access VPN. It injects /32 static routes for remote clients, which are then redistributed into OSPF to provide connectivity for remote Content Nodes. The remote Content Nodes run OpenConnect, an open source SSL VPN client, to connect to the VPN headend.
  • The load balancer is not technically an ABR since all it is really doing, from an OSPF perspective, is advertising a stub network that represents its public loopback IP.
  • The load balancer is only required because other public websites run on this same hypervisor, using the same public IP address. This first load balancer simply directs traffic intended for the resume application to the local Content Node using an “ACL” – HAProxy’s equivalent of an F5 iRule.
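The routing behaviors described above can be illustrated with rough FRR-style configuration. This is a sketch only – the loopback prefix and any names are placeholders, not the production config:

```
! On the load balancer VNF (FRR): advertise the public loopback
! into area 0 as a stub network. The prefix is a placeholder.
router ospf
 passive-interface lo
 network 192.0.2.10/32 area 0

! On the Internet Router: unconditionally originate a default route.
! "always" makes the Type 5 advertisement unconditional on any
! received default, matching the single-exit design.
router ospf
 default-information originate always
```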

Compute Virtualization and Containers

The hypervisors running on the local network provide the Data Warehouse containing all of the dynamic content for the application, one of the Content Nodes, and the load balancer. Since the data warehouse resides on a shared database host, it is containerized. From the perspective of the resume application, there is no requirement for the data warehouse to run in a container – it was just a convenient way to provide an isolated environment on a compute resource that was already in service. We’ll go into much more detail on the use of containers in Part 3.

The links below dive into the details of

Under the Hood of

I began my network automation journey in earnest a few years ago. As I learned about the building blocks of automation, I found my mind wandering beyond the confines of vendor-driven network designs. In my years as a networking engineer I have learned that I need to deliver a real product, satisfying real business requirements, to truly understand how a thing works. With those principles in mind I set out to build my online resume as a microservices-oriented web application. In this series of blog posts I will describe its architecture and share lessons that I have learned along the way.


My resume application needed to demonstrate my basic functional understanding of:

  • Microservices Architecture
  • APIs
  • Application Delivery Control (load balancing)

This was in addition to all of the packet delivery expertise I have developed over the course of my career. I also wanted to provide some level of fault tolerance and high availability. Since no revenue would be directly generated by the application, I would need to make use of virtualization technologies to meet all of these requirements at minimal cost.

Finally, I should say that my goal was not to become a software developer, or to claim that I have any special proficiency with any of these technologies, outside of the networking components. My purpose with the project was to gain an appreciation for the workloads that run on the networks I build, and to gain the perspective needed to be a successful network architect.

At the highest level, the resume application consists of 4 major components:

  • Content Nodes – the discrete compute units that serve application content to users
  • Data Warehouse – a central database that houses all of the dynamic content
  • Domain Name System (DNS)
  • Internet Transport

The links below will dive into the details of how these components interact and connect. I’ll also discuss the supporting network infrastructure and other parts of the system that make it all work.