This blog post about Termites is a stub - you can help one twenty three cloud street grow by expanding upon your experience in the comments.
What is a Termite?
We know a few things about the Termite
Termites are insects from the order Isoptera
Isoptera: Eusocial, wood-eating, colony-forming insects.
Over 3,000 species, with many yet to be discovered.
Incomplete metamorphosis, nymphs develop within the colony.
Ew
Critical decomposers and soil engineers
Ever met a soil architect?
“Useful” as food for some animals and even humans in some cultures
Can be destructive to wooden structures.
The choice is their own - like the honey badger, they take what they want.
Termite Jokes need a lot of citation to understand(1)
Termite 1: "Hey, guess what? I'm feeling multi-potent today! I could become a worker or even a soldier!"
Termite 2: "Oh yeah? Well, I'm totipotent! I could become anything - even the queen!"
Termite 1: "Wait a second... aren't you just a first instar?"
Termite 2: "Maybe. But at least I'm not stuck being a pseudergate like you!"
Termite 1: "Ouch. Way to rub salt in my wingless wounds."
Termites are causing your production outages.
I’m not here to talk to you about a group of detritophagous eusocial insects, or the Rhinotermitidae, or the Nasutitermes triodiae of the Northern Territory (located in the southern hemisphere, east of Western Australia, and west of New South Wales). I’m here to talk to you about the hidden problems in your architecture that have become your “unknown unknowns”. These problems blindside you like a rogue wave of fermented shark hitting your unsuspecting palate in a dimly lit Reykjavik bar, leaving you questioning every life choice that led you to this moment of gastronomic revelation and existential crisis.
When was the last time you spent more than 2 minutes discovering the root cause of a failure? When was the last time you spent more than 2 days? 2 weeks? How long were you down?
Resiliency should be a first-class citizen in your architecture, design and planning, and a major part of resiliency is observability.
Let me guess - the last time you built something, you thought about scaling, maintenance patterns, architecture of your messaging and eventing, data patterns - oh and only then did you think about monitoring it?
You should have thought about that all along the way - how will you know if you have termites deep in the soul of your application if you don’t have the ability to poke around those parts and check them routinely?
Termites are a symptom of the lack of observability
They can hide for a long time before becoming a lasting problem, and that can lure you into a false sense of stability.
Let’s look at this basic 3-tier architecture:
Fantastic - we’ve all seen this before. Now let’s add our logging system.
Looking good - what could go wrong?
Well, if you deployed this with Docker (or ECS, or most likely any container platform), then you’ve just introduced a single point of failure by adding observability as an afterthought.
Docker provides two modes for delivering messages from the container to the log driver:
(default) direct, blocking delivery from container to driver
non-blocking delivery that stores log messages in an intermediate per-container buffer for consumption by driver
Quote Source: https://docs.docker.com/engine/logging/configure/
In blocking mode - which is the default in Docker, and ECS - if your logging provider/service is having a bad time, then so are your containers. Can’t write to your logging provider? Well then your application can stall on its next log write, fail its health checks, and get restarted over and over again until the logging provider is back online.
In non-blocking mode - which you have to opt into, because it is not the default - a logging provider failure does not impact the functionality of the container!
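To make the difference concrete, here is a toy model in Python. It is a sketch, not how Docker is implemented: the bounded `queue.Queue` stands in for the buffer between container and log driver, nothing ever drains it (the logging provider is "down"), and `run_app`, `buffer_size` and `stall_timeout` are names made up for this illustration. In real Docker you opt in with `--log-opt mode=non-blocking`, optionally sized with `--log-opt max-buffer-size`.

```python
import queue


def run_app(n_requests: int, mode: str, buffer_size: int = 2,
            stall_timeout: float = 0.05) -> dict:
    """Handle n_requests while the log driver is down (nothing drains the buffer)."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    served = 0
    dropped = 0
    for i in range(n_requests):
        line = f"handled request {i}"
        try:
            if mode == "blocking":
                # The app waits for the driver to accept the line. The timeout
                # stands in for a health check noticing the app has frozen.
                buf.put(line, timeout=stall_timeout)
            else:
                # Non-blocking: hand the line over if there is room, never wait.
                buf.put_nowait(line)
        except queue.Full:
            if mode == "blocking":
                # The app is stuck on a log write and stops serving traffic.
                return {"served": served, "dropped": dropped, "stalled": True}
            # Non-blocking: the log line is lost, but the request still completes.
            dropped += 1
        served += 1
    return {"served": served, "dropped": dropped, "stalled": False}


if __name__ == "__main__":
    print("blocking:    ", run_app(5, "blocking"))
    print("non-blocking:", run_app(5, "non-blocking"))
```

In this model the blocking app serves requests only until the buffer fills and then stalls, while the non-blocking app serves every request and loses some log lines instead. That is the trade-off you are choosing: non-blocking mode keeps the container alive at the cost of dropped messages when the buffer is full.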
If you (hopefully) tested the failure modes of the initial system before introducing logging, you would have been none the wiser about this failure case. You might be able to run for a very long time before your logging provider has an issue that causes an outage. You’ve got a termite infestation now - it’s just a matter of time before this fails and takes your whole stack down.
This is just one example of a termite infestation - your mileage will vary!
How can I check for termites in my stack?
Simple. Test your failure cases. In your staging environment make something fail, see how the system reacts and plan around that failure case. Then do it again and again and again. You can have great diagrams showing multiple arrows going to multiple boxes to describe your resiliency - but just like a backup, which you don’t know is real until you test the restoration, you don’t know your failure modes until you test them. This is how to discover your “unknown unknowns”.
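As a rough sketch of what "make something fail and see how the system reacts" can look like at the smallest scale, the Python below injects a logging outage into a request handler and checks that requests still succeed. Everything here is hypothetical and exists only for illustration: `FlakyLogSink` is a stand-in for your logging provider's client, and `handle_request` is a placeholder for your business logic. A real test would kill or block the actual dependency in staging.

```python
class FlakyLogSink:
    """Hypothetical logging client that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.available = True
        self.lines: list[str] = []

    def send(self, line: str) -> None:
        if not self.available:
            raise ConnectionError("logging provider unreachable")
        self.lines.append(line)


def handle_request(payload: str, sink: FlakyLogSink) -> str:
    """Placeholder business logic that must keep working when logging is down."""
    result = payload.upper()
    try:
        sink.send(f"processed {payload!r}")
    except ConnectionError:
        # Logging is best-effort here: losing a line beats failing the request.
        pass
    return result


def test_survives_logging_outage() -> None:
    sink = FlakyLogSink()
    assert handle_request("ok", sink) == "OK"

    sink.available = False  # inject the failure
    assert handle_request("still ok", sink) == "STILL OK"

    sink.available = True  # recovery
    assert handle_request("back", sink) == "BACK"

    # Only the lines sent while the sink was up were recorded.
    assert sink.lines == ["processed 'ok'", "processed 'back'"]


if __name__ == "__main__":
    test_survives_logging_outage()
    print("request handling survived the logging outage")
```

Remove the `try`/`except` in `handle_request` and the test fails during the injected outage, which is exactly the kind of termite you want to find in staging rather than in production.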
It’s great to be able to recover from a failure quickly, and it’s great to detect a failure quickly - but these pale in comparison to your ability to INCREASE THE TIME BETWEEN FAILURES. To do this, you must understand what could fail (everything), how it will fail WHEN (not if) it does fail, and how your system will react. Then plan for that failure case.
How can I prevent termites from Day 1?
In Australia, at a cost of more than A$1.5 billion per year,[242] termites cause more damage to houses than fire, floods and storms combined.[243] In Malaysia, it is estimated that termites caused about RM400 million of damages to properties and buildings.[244] The damage caused by termites costs the southwestern United States approximately $1.5 billion each year in wood structure damage, but the true cost of damage worldwide cannot be determined.[236][245]
Quote Source: https://en.wikipedia.org/wiki/Termite
You treat your observability and resilience as first-class citizens alongside your planning for hardware, software and cost management. Early in the design process you think about the failure cases and implement mechanisms to test them routinely.
Experiencing 'termites' can be a tough lesson, as many of us can attest.
I’d be eager to hear how you’ve faced off with these nuisances - share your 'termite' tales in the comments below.