This is Part 6 in a 7-part series discussing the www.tomammon.net online resume application. Check out the architectural overview for some context, or see the links at the end of the article to navigate to other posts in this series.
In this post we’ll take a look at some of the design problems in the resume application.
Some Obvious Problems
The first set of problems comes from the low-budget nature of the project. There is no proper datacenter for most of the components, and the hypervisors running the data warehouse and one of the Content Nodes have lots of single points of failure in their power, cooling, and network connectivity. There's also no redundancy for internet connectivity. The system is designed to tolerate some of these failures – for example, if the data warehouse goes away for a short period, both Content Nodes will continue to serve content, but no updates to the data can be made. In general, though, this is not a robust application in terms of infrastructure.
Backend and Container Networking
The use of remote access VPN as a backend network (for remote Content Nodes) is hokey at best. But, when you’re on a shoestring, you do what you have to do to get things working. On the container side, I could have used one of the available overlays to connect the slave databases to the master, at a container level, instead of letting Docker do its thing with iptables. I chose not to go that route in order to speed up delivery of the project.
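To make the overlay alternative concrete, here is a hypothetical sketch of what that could look like as a Compose file. This is not the project's actual configuration: the service names, image, and version are placeholders, and an overlay network requires the Docker hosts to be joined into a Swarm, which is part of the operational overhead I avoided.

```yaml
# Hypothetical sketch only -- service names and images are placeholders.
# An overlay network would let the master and slave databases talk
# container-to-container across hosts, instead of relying on Docker's
# default bridge plus iptables NAT. Requires the hosts to be in a Swarm.
version: "3.8"

services:
  db-master:
    image: mysql:8.0          # the real stack's image/version may differ
    networks:
      - dbnet
  db-slave:
    image: mysql:8.0
    networks:
      - dbnet

networks:
  dbnet:
    driver: overlay
    attachable: true          # lets standalone containers join the network too
```

The trade-off is real: the overlay gives you clean cross-host container networking, but it drags in Swarm membership and another moving part to debug, which is exactly what a shoestring project doesn't need.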
The DNS infrastructure is somewhat fragile.
- The Content Node HA mechanism is round-robin DNS. I have plans to build some monitoring logic to pull out the A record representing a non-responsive content node, but this is not currently implemented. This means that if one of the content nodes goes offline, some percentage of client traffic will end up getting dropped on the floor.
- Even when the monitoring logic pulls the record, most browsers cache DNS data independently of the OS, and they also don’t necessarily care about what the stub resolver in the OS is doing after they get an answer. Just because the TTL is set to 300 seconds doesn’t mean that the browser’s DNS cache is going to expire it that quickly. Even if it did, 5 minutes is a long time to wait for a failed content node to disappear.
- DNS is hosted entirely on Route53. ’nuff said.
- There is no real CDN. If you happen to be in Europe and your stub resolver happens to pick the Romanian content node, you'll get a little better performance than if it picked the North American node. But that selection is effectively random; it is not based on your location.
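For the first problem in the list above, here is a rough sketch of what the planned (but not yet implemented) monitoring logic might look like. It assumes Route53 holds the content-node addresses as multiple values in a single round-robin A record set, so "pulling" a dead node means rewriting the set with only the healthy addresses. The IPs below are documentation placeholders, the hosted-zone ID is elided, and the actual boto3 call appears only as a comment.

```python
"""Sketch of round-robin DNS health monitoring for the content nodes.
All hostnames and IPs are placeholders, not the application's real values."""
import urllib.request
import urllib.error

RECORD_NAME = "www.tomammon.net."
TTL = 300
CONTENT_NODES = {                       # placeholder IPs (TEST-NET-3 range)
    "203.0.113.10": "http://203.0.113.10/",
    "203.0.113.20": "http://203.0.113.20/",
}


def is_healthy(url, timeout=5):
    """Return True if the node answers an HTTP GET with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def build_upsert_change(record_name, healthy_ips, ttl=TTL):
    """Build a Route53 ChangeBatch that rewrites the round-robin A record
    set so it contains only the healthy IPs."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip} for ip in sorted(healthy_ips)],
            },
        }]
    }


def monitor_once(nodes=CONTENT_NODES):
    """Probe every node; if any are down, return the corrective ChangeBatch."""
    healthy = [ip for ip, url in nodes.items() if is_healthy(url)]
    if len(healthy) == len(nodes) or not healthy:
        return None                     # nothing to do, or don't empty the set
    return build_upsert_change(RECORD_NAME, healthy)
    # With boto3 and the real hosted-zone ID, applying it would look like:
    # boto3.client("route53").change_resource_record_sets(
    #     HostedZoneId="Z...", ChangeBatch=batch)
```

Note the guard against removing the last healthy record: serving a possibly-stale node beats serving NXDOMAIN. And as the next bullet points out, even a perfect version of this script is at the mercy of browser-side DNS caching.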
While debugging a different, unrelated problem with the web service containers, I realized I had a single point of failure between the web server and app server layers inside the content nodes. You can find it in the index.php code on my GitHub account. Do you see it? Hint: look at line 15 in index.php. This relates to how I am using mounts in Docker to keep the code on the Docker host instead of inside each container. I can think of a few ways to mitigate this defect, but do you have ideas? If so, put them in the comments below or hit me up on Twitter – I'd love to discuss them with you.
I’m sure that I’m missing other problems with my design, and I would love to hear any feedback that anybody wants to share. My purpose in building the application has been to broaden my horizons and try to get a taste of life outside my silo. If you are someone who builds web applications, I’d love to hear any criticism you might have – I’m sure I will benefit greatly from it.
The next post is the last in this series. I’ll summarize what I have learned and how my perspective has changed, and hopefully that will inspire you to look outside your particular silo as well.