SMS Blog

When ‘It Works’ Isn’t Enough: Engineering for Operations and Scalability

Engineers are trained to make things work, though I have seen little concern for what happens after that important step. As engineers progress in their careers, they gain a wider perspective that sees farther than the “make it work” mentality that has been ingrained in us. Whatever is designed or built has to be maintained and monitored by somebody. The engineer who builds a system is not the only one expected to operate it, so it is our responsibility as good engineers to take the extra step and make the maintainers’ lives as easy as reasonably possible.

💡 Operations is more important than engineering.

To that end, this blog will attempt to briefly outline a perspective that I wish I had when starting my career.

Guiding Principle: Must be easy to teach

Most everything in this blog can be derived from this one guiding principle. If something is easy to teach, then lower level engineers/techs are able to solve problems with it. If something is difficult to teach, then it should be redesigned so that more than the person or team who built it can work on it. Some complexity is unavoidable, but it should be minimized to the greatest degree possible.

“Everything should be made as simple as possible, but no simpler.”

Albert Einstein

Less is more

Because our guiding principle is for the easiest possible solution to teach, a few things become evident:

  • Fewest configurations – Results in the smallest configuration, with fewer opportunities for typos
  • Defaults are good – Only change when necessary. 5% better needs to be weighed against the potential 50% increase in complexity. I’ve seen instances where every nerd knob that can be tweaked is, and it’s a nightmare.
  • Duplication is good – The more configurations you can duplicate, the better. This leads to “just copy/paste from an existing working one” being a viable option.
  • DRY (Don’t Repeat Yourself) – Nesting is your friend. If you can abstract out a piece of logic in a simple manner that does more than just obscure what’s happening, then do it to the greatest possible degree. This leads to a smaller configuration size for people to try to understand, with the option to look inside the referenced code if they so desire.

Naming Schemas

When designing a solution, configurations must follow a strict naming schema, and once that requirement is met, names should be made human readable wherever possible. This is not to say that anything goes; a strict naming schema that is also easy to read is by far the best.

These are the steps I follow when constructing a naming schema:

  1. Figure out the format limitations (numbers, letters, special characters, character length, etc.). This gives you the limits of what you have to work with.
  2. Figure out what necessary pieces of information must be included (organization name, building, purpose, etc.)
  3. Use delimiters between these pieces if possible (; or - are best).
    • Underscores and colons are not recommended because they can be mistaken for other characters (spaces or semi-colons) and are easier to typo because you have to press two keys for a single character instead of one.
  4. For names that allow letters, use lowercase or uppercase; never both. Simple rules are best, and going back to our guiding principle, “It’s all lowercase” is far easier to teach and understand than trying to make the name look pretty with capitalization. As with the delimiters, allowing a mix of upper and lowercase letters is a typo nightmare waiting to happen. Do everyone a favor and kill these typo sources before they can grow into problems.
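The steps above can be sketched as a small builder and parser. This is only an illustration under stated assumptions: the four fields (org, building, purpose, index), the 32-character limit, and the function names `build_name` and `parse_name` are hypothetical, not part of any real platform.

```python
import re

# Hypothetical schema for illustration: <org>-<building>-<purpose>-<index>
# Each field is lowercase letters/digits only; the hyphen is the only delimiter.
FIELD = re.compile(r"[a-z0-9]+")
MAX_LENGTH = 32  # assumed platform limit; substitute the real one from step 1


def build_name(org: str, building: str, purpose: str, index: int) -> str:
    """Join the required fields into one lowercase, hyphen-delimited name."""
    fields = [org, building, purpose, str(index)]
    for field in fields:
        if not FIELD.fullmatch(field):
            raise ValueError(
                f"invalid field {field!r}: use lowercase letters and digits only"
            )
    name = "-".join(fields)
    if len(name) > MAX_LENGTH:
        raise ValueError(f"{name!r} exceeds {MAX_LENGTH} characters")
    return name


def parse_name(name: str) -> dict:
    """Split a name back into its fields without consulting any lookup table."""
    org, building, purpose, index = name.split("-")
    return {"org": org, "building": building, "purpose": purpose, "index": int(index)}


if __name__ == "__main__":
    name = build_name("acme", "hq", "web", 2)
    print(name)
    print(parse_name(name))
```

Because the rule set is “lowercase, digits, hyphens,” the validation fits in one regular expression, which is roughly the point: a schema that is easy to teach also tends to be easy to check.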

Avoid decoder rings like the plague. An example of this would be attempting to bake into a name the customer as a single-digit ID, the state as another 3-digit ID, and some other value as yet another ID. This ends up becoming the opposite of readable and is very difficult to teach. It would require operators to always keep many lookup tables open in order to quickly identify what they are looking at. It is painful to see a server named ha-s4d-72c34-2 and have to cross-reference four tables to figure out what it is for or where it is. Looking something up in an assignment sheet is inevitable for anything that is built at scale, but I make a distinction between decoder rings and assignments.

  • Decoder Rings = Requires multiple lookups in multiple sections (Bad, hard to teach)
  • Assignments = Requires a single lookup (Good, easy to teach)

Bonus points are awarded the more values you can derive from a single assignment. An example of this could be a subnet matching the VLAN ID in some way. Systems people usually speak in VLAN, and network people speak in subnet, so meshing those two together makes communication much easier. For instance, the third octet of a subnet could match the VLAN ID, so when a systems guy has a problem with VLAN 235, the network guy would instantly know that the subnet is 10.10.235.0/24.
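As a rough sketch of that VLAN-to-subnet convention, the mapping can be written in a few lines of Python using the standard `ipaddress` module. The 10.10.x.0/24 layout comes from the example above; the function names `subnet_for_vlan` and `vlan_for_subnet` are made up for illustration.

```python
import ipaddress


def subnet_for_vlan(vlan_id: int) -> ipaddress.IPv4Network:
    """Derive the /24 whose third octet equals the VLAN ID (assumed 10.10.x.0/24 layout)."""
    if not 0 <= vlan_id <= 255:
        # A single octet only holds 0-255, so larger VLAN IDs cannot use this convention.
        raise ValueError(f"VLAN {vlan_id} does not fit in one octet")
    return ipaddress.ip_network(f"10.10.{vlan_id}.0/24")


def vlan_for_subnet(subnet: str) -> int:
    """Recover the VLAN ID from the third octet of the subnet."""
    network = ipaddress.ip_network(subnet)
    return network.network_address.packed[2]


if __name__ == "__main__":
    print(subnet_for_vlan(235))
    print(vlan_for_subnet("10.10.235.0/24"))
```

Note the built-in limit: VLAN IDs can go up to 4094, while one octet stops at 255, so this particular convention only holds while the VLAN IDs in use stay within that range.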

Whatever is built needs to eventually integrate into many systems in some fashion. This could be logs sent off to a SIEM, some sort of health/monitoring dashboard, or a system that hasn’t been built or even thought of yet. With a rigorous naming schema, everybody is working off the same page and using the same language.

Scale Considerations

“Prototypes are easy. Production is hard.”

Elon Musk

Designing at scale is a different mindset than simply making something work. This seems self-evident but is not usually thought of in the initial design/setup. An easy test to see if a schema will work at scale is to look at the maximum value possible and ask yourself, “What happens if we need more?” Even if you anticipate there will never be more than 40 of something, it’s not our role to enforce that limit through a schema we define. It could be something as simple as mapping a customer ID to only one digit in the schema, and then panicking when you inevitably onboard customer number 10.
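The “what happens if we need more” test can be made concrete with a little arithmetic. The sketch below assumes a fixed-width decimal field inside a name; `field_capacity` and `id_fits` are hypothetical helper names used only for illustration.

```python
def field_capacity(width: int, alphabet_size: int = 10) -> int:
    """Number of distinct values a fixed-width field can hold."""
    return alphabet_size ** width


def id_fits(width: int, value: int) -> bool:
    """True if a non-negative decimal ID can be written in `width` digits."""
    return len(str(value)) <= width


if __name__ == "__main__":
    # A one-digit customer field holds ten values (0-9), so customer 10 breaks it.
    print(field_capacity(1))
    print(id_fits(1, 9))
    print(id_fits(1, 10))
    # Widening the field to four digits leaves room to grow.
    print(id_fits(4, 10))
```

Running a check like this against the largest value you can imagine, and then one beyond it, is a cheap way to catch a schema that quietly enforces a limit.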

Final Thoughts

Though there are undoubtedly books that expound on this in much more detail, I thought it beneficial to at least outline the mindset and provide some examples to support it. I don’t like to see good engineers build or design something that works but is almost unmaintainable because they failed to adhere to these principles. “It works” is unfortunately almost never followed by “and then.”
