
Thursday, January 15, 2026

Shot myself in the foot today.

Originally shared Apr. 4 2014. Updated for context and reflection.

Actually I pulled the trigger yesterday, but didn't feel it till this morning.


So we have this "new" IP Address Management (IPAM) software (InfoBlox), which also does DHCP and DNS. Yesterday, around 11:30am, I was in the IPAM section creating a new network. I mistyped the network address and had to delete it out of IPAM. I must have highlighted the users’ network, which checked its check box, without realizing it, because this morning I received a bunch of calls that users at one site could not log in. I knew DHCP was the issue because the users’ IPs were 169.254.x.x/16, the self-assigned range a client falls back to when no DHCP server answers. I jumped on the switch and used “sh ip dhcp snooping binding” to see if any clients had received addresses.
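The 169.254.x.x tell can be checked in code as well as by eye. Here is a minimal sketch using Python's standard `ipaddress` module; the helper name and the sample addresses are made up for illustration, not taken from the incident.

```python
import ipaddress


def looks_self_assigned(addr: str) -> bool:
    """Return True if addr is an IPv4 link-local address (169.254.0.0/16).

    Clients typically fall back to this range when no DHCP server
    answers, so seeing it usually points at a DHCP problem.
    """
    ip = ipaddress.ip_address(addr)
    return ip.version == 4 and ip.is_link_local


# Hypothetical client addresses, for illustration only.
for addr in ["169.254.17.42", "10.20.30.40"]:
    status = "self-assigned, suspect DHCP" if looks_self_assigned(addr) else "looks leased"
    print(f"{addr}: {status}")
```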



There were a few, but their leases were old; we set our lease time to 1 day (86400 sec). This led me to check the DHCP server, where I searched for the network and found it missing! In this new software the IPAM and DHCP databases are connected: deleting the network deletes the DHCP scope for that network. Of course, the reason we didn’t get any calls yesterday is that all the clients had already received their leases for the day and were good to go until this morning, when they tried to renew their IP addresses. I rebuilt the network and the DHCP scope, and the clients started receiving valid addresses.
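The delay makes sense once the lease timers are written out. The sketch below assumes clients use the RFC 2131 default timers (renew at 50% of the lease, rebind at 87.5%); the 08:00 grant time and the date are assumptions for illustration, and only the 1-day lease and the 11:30am deletion come from the incident.

```python
from datetime import datetime, timedelta

LEASE_SECONDS = 86400  # 1-day lease, as configured in the post


def lease_timeline(granted: datetime, lease_seconds: int = LEASE_SECONDS) -> dict:
    """Return the renew (T1), rebind (T2), and expiry times for a lease.

    Assumes the RFC 2131 defaults: T1 = 50% and T2 = 87.5% of the lease.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_t1": granted + lease * 0.5,
        "rebind_t2": granted + lease * 0.875,
        "expires": granted + lease,
    }


# Assumed: a client picked up its lease at 08:00 on the day of the deletion.
granted = datetime(2014, 4, 3, 8, 0)
scope_deleted = datetime(2014, 4, 3, 11, 30)

timeline = lease_timeline(granted)
for name, when in timeline.items():
    print(f"{name}: {when:%a %H:%M}")

# The address stays valid long after the scope is gone.
print("valid when scope was deleted:", scope_deleted < timeline["expires"])
```

Under these assumptions the client keeps a working address until the next morning, which matches when the calls started.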

The total disruption for 10 users was about 30 minutes.

Lessons learned: Slow down with newer/unfamiliar software.

Reflection, Years Later
What stands out now isn’t the mistake. It’s the delay.
The dependency was there the entire time. The system just didn’t surface it. 
Deleting a network object also removed the DHCP scope, but the interface treated those as separate concerns. Nothing warned that a live service was tied to that entry.
The impact didn’t show up immediately because the system was still coasting on existing leases. Everything looked healthy until renewal time. Only then did the dependency make itself known.
That pattern has shown up repeatedly since then.
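The warning the interface never gave can be sketched as a pre-delete dependency check. Everything below is a hypothetical in-memory model (the `Network`, `Ipam`, and `DependencyError` names are invented here); it is not the InfoBlox API, only an illustration of the guard I would have wanted.

```python
from dataclasses import dataclass, field


class DependencyError(Exception):
    """Raised when a delete would take a live service down with it."""


@dataclass
class Network:
    cidr: str
    has_dhcp_scope: bool = False
    active_leases: int = 0


@dataclass
class Ipam:
    networks: dict = field(default_factory=dict)

    def add(self, net: Network) -> None:
        self.networks[net.cidr] = net

    def dependencies(self, cidr: str) -> list:
        """List what else would disappear if this network were deleted."""
        net = self.networks[cidr]
        deps = []
        if net.has_dhcp_scope:
            deps.append(f"DHCP scope for {cidr}")
        if net.active_leases:
            deps.append(f"{net.active_leases} active leases")
        return deps

    def delete(self, cidr: str, force: bool = False) -> None:
        """Delete a network, refusing if dependencies exist unless forced."""
        deps = self.dependencies(cidr)
        if deps and not force:
            raise DependencyError(f"deleting {cidr} also removes: {', '.join(deps)}")
        del self.networks[cidr]


# Hypothetical networks: one mistyped and empty, one serving users.
ipam = Ipam()
ipam.add(Network("10.99.0.0/24"))
ipam.add(Network("10.20.0.0/24", has_dhcp_scope=True, active_leases=10))

ipam.delete("10.99.0.0/24")  # nothing depends on it, so this succeeds
try:
    ipam.delete("10.20.0.0/24")  # the accidental selection
except DependencyError as err:
    print("blocked:", err)
```

The design choice is that the destructive path has to name its side effects before it proceeds, so the dependency shows up at delete time rather than at renewal time.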

Modern platforms tend to collapse multiple services behind a single portal. 
The UI emphasizes structure and organization, but often hides runtime behavior. 
Actions that look administrative can have operational consequences, sometimes delayed, sometimes quiet.

What this incident reinforced is that dependency awareness is often retrospective. You learn what mattered after it stops working. The outage becomes the documentation the portal never provided.

Today, I approach unfamiliar systems assuming that dependencies exist even when they aren’t visible. If a tool makes destructive actions easy, I assume the blast radius is larger than advertised. And if nothing breaks right away, I don’t take that as proof that nothing was affected.
The outage was small. The lesson wasn’t.

Some systems only explain themselves when they fail.
