Originally shared Apr. 4 2014. Updated for context and reflection.
Actually I pulled the trigger yesterday, but didn't feel it till this morning.
So we have this "new" IP Address Management (IPAM) software (InfoBlox),
which also does DHCP and DNS. Well, yesterday around 11:30am, I was in the IPAM
section creating a new network. I mistyped the network address and had to delete it out of IPAM. While doing that, I must have
highlighted the users’ network, which checked its check box without my realizing it, so it was deleted too.
I found out this morning, when I received a bunch
of calls that users at one site could not log in. I knew DHCP
was the issue because the users’ IPs were 169.254.x.x/16, the self-assigned range a client falls back to when no DHCP server answers. I jumped on the switch and used “sh ip
dhcp snooping binding” to see if any clients had received addresses.
Total disruption for 10 users was about 30 minutes.
Lessons learned: Slow down with newer/unfamiliar software.
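The 169.254.x.x clue above is easy to check in bulk. Here is a minimal sketch, assuming you already have a list of client addresses from somewhere (the host names and addresses below are made up for illustration, and the function names are mine, not part of any tool mentioned here). It uses Python's standard ipaddress module to flag clients sitting on a self-assigned link-local address.

```python
import ipaddress


def is_self_assigned(addr: str) -> bool:
    """Return True if addr is in the link-local range (169.254.0.0/16 for IPv4)."""
    return ipaddress.ip_address(addr).is_link_local


def clients_without_lease(clients: dict) -> list:
    """Return the names of clients whose address suggests DHCP never answered."""
    return sorted(name for name, addr in clients.items() if is_self_assigned(addr))


if __name__ == "__main__":
    # Hypothetical inventory: host name -> currently configured IPv4 address.
    observed = {
        "pc-01": "169.254.10.20",
        "pc-02": "10.1.20.15",
        "pc-03": "169.254.77.3",
    }
    for name in clients_without_lease(observed):
        print(f"{name} has a self-assigned address; check its DHCP scope")
```

A link-local address only says the client got no answer. It does not say why, so the switch and the DHCP server are still the next places to look.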
Reflection, Years Later
What stands out now isn’t the mistake. It’s the delay.
The dependency was there the entire time. The system just didn’t surface it.
Deleting a network object also removed the DHCP scope, but the interface treated those as separate concerns. Nothing warned that a live service was tied to that entry.
The impact didn’t show up immediately because the system was still coasting on existing leases. Everything looked healthy until renewal time. Only then did the dependency make itself known.
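The coasting window can be estimated ahead of time. The sketch below assumes the default DHCP timers from RFC 2131 (renewal at 50% of the lease, rebinding at 87.5%) and a 24-hour lease. The lease length, the timestamps, and the function names are illustrative assumptions, not values from this incident.

```python
from datetime import datetime, timedelta


def lease_milestones(lease_start: datetime, lease_seconds: int) -> dict:
    """Compute when a client renews (T1), rebinds (T2), and finally loses its lease."""
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_t1": lease_start + lease * 0.5,     # RFC 2131 default T1
        "rebind_t2": lease_start + lease * 0.875,  # RFC 2131 default T2
        "expiry": lease_start + lease,
    }


def coasting_time(lease_start: datetime, lease_seconds: int,
                  scope_deleted_at: datetime) -> timedelta:
    """How long a client keeps working after the scope is gone, if nothing answers it."""
    return lease_milestones(lease_start, lease_seconds)["expiry"] - scope_deleted_at


if __name__ == "__main__":
    # Hypothetical client: lease obtained at 08:00, scope deleted at 11:30 the same day.
    start = datetime(2014, 4, 3, 8, 0)
    deleted = datetime(2014, 4, 3, 11, 30)
    for name, when in lease_milestones(start, 86400).items():
        print(f"{name}: {when}")
    print("coasting:", coasting_time(start, 86400, deleted))
```

With these assumed numbers the client fails its renewal that evening but holds its address until the next morning, which is the kind of gap between action and symptom described above.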
That pattern has shown up repeatedly since then.
Modern platforms tend to collapse multiple services behind a single portal.
The UI emphasizes structure and organization, but often hides runtime behavior.
Actions that look administrative can have operational consequences, sometimes delayed, sometimes quiet.
What this incident reinforced is that dependency awareness is often retrospective. You learn what mattered after it stops working. The outage becomes the documentation the portal never provided.
Today, I approach unfamiliar systems assuming that dependencies exist even when they aren’t visible. If a tool makes destructive actions easy, I assume the blast radius is larger than advertised. And if nothing breaks right away, I don’t take that as proof that nothing was affected.
The outage was small. The lesson wasn’t.
Some systems only explain themselves when they fail.
