Tomammon.net – Operational Intent

This is Part 5 in an 7-part series discussing the www.tomammon.net online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.

What is “Operational Intent”?

I see Operational Intent as a set of maintenance practices that go along with the decisions made during the design of a system. It’s the stuff you have to do to keep the system running smoothly. If the Operational Intent is created at the same time you are designing the network, you are far more likely to design something that the business can actually consume and use as a competitive advantage. Hint: the person designing the network should not be the principal author of Operational Intent.

For a more concrete example, consider the routing policy portion of a network design. Let’s say that route redistribution between multiple routing domains is a part of your design. There are a number of ways to control the flow of prefixes from one domain to another, each with their pros and cons. Once you have selected a method for controlling the redistribution of routing information, you have created an entity that has to be cared for and looked after. For example, if you chose prefix lists to control the routing information flow, you have to understand when and how to maintain these prefix lists. To determine Operational Intent for this part of the design, document the answers to questions like these:

  • When and why will the prefix lists be modified?
  • How specific should the matches in the prefix lists be? Should they always be exact matches, or can we use variable prefix lengths? Does it even matter how long the match is, so long as it matches the new prefixes being added?
  •  How will our automation framework interact with this component? Is the name of the prefix list constructed using some type of logic? Or is the name just known in a database or some other state store? Should humans normally be touching this part of the configuration?
  • If human operators are going to be touching this, what is the minimum level of skill that I, as the designer, assume they will possess? If the operators’ skill changes over time, does that have practical consequences for the function of the system down the road?
  • What are the consequences to the stability of the system if this prefix list is accidentally deleted? What if it’s blown wide open with something like “permit 10.0.0.0/8 le 32”?
  • As parts of the network are divested or consolidated, will this component be audited and cleaned up, and if so, how?

It’s not always possible to know the answers to these types of questions in advance, but they serve as a way to bring the design of the system down to operational reality. If you are designing something and can’t answer most of the questions above, or if the answers trouble you, that’s a signal that you need to rethink the design. If you are an operator, and the answers to these questions are unclear or are wildly different when asked of different members of your team, that’s a sign that something isn’t right with the design.

Another valuable role of an OI document is to connect the expertise of the system designer to the expertise of the operator. In my career I have met some very talented operators who could run circles around me in their ability to monitor and automate the networks I have built. Working with individuals like that is a very rewarding experience for me, because my designs almost always become more practical and more functional after they look at it from a practical operational perspective. The OI document allows the designer to spell out their assumptions about how the system will be maintained. That communication of assumptions cannot be overvalued.

Operational Intent for Tomammon.net

Here’s a representative sample of what an OI document might look like for Tomammon.net. Since the networking is pretty simple, we’ll focus on the application services.

Maintenance of Static Content and Application Code

  • Static content and application code are all hosted on github in the tommmonet repo.
  • Changes made to any of these elements are pushed up to the github repo from the dev environment, and then pulled down by each Content Node independently.
  • For risky changes, a Content Node can be pulled out of the round-robin DNS A record before pulling the new code down, and then tested. Rolling back to an older, functional version of the code is handled by git on the Content Node.

Monitoring the Application and Database Layers

  • The app server can be monitored directly via the “testapi” call, see code example for the API below.
  • The database contains a static table with generic content, which is retrieved using the “testdb” call, see code example for the API below.

Common Database Problems and Solutions

  • If connectivity problems arise in the transport network between the Content Nodes and the master database, replication problems can result. Before addressing these problems, confirm that network connectivity is healthy and stable between the slave and master.
  • Database replication problems can often be cleared by simply restarting replication from the MariaDB client (on the slave side, in the Content Node inside the tandb_slave container), using these commands:

 

Final Thoughts

If there is a healthy balance of power between the designers and the operators of systems, the concepts of Operational Intent can produce real wins for the business. At its heart, OI is about collaboration between designers and operators as equal partners, enabled by open communication about technical decisions and requirements.

In the next article in the series, I’ll make some confessions about the weaknesses and problems of the resume application and its supporting infrastructure.

Leave a Reply

Your email address will not be published. Required fields are marked *