Last Saturday, I was out in the garage, trying to clean up the huge mess of wires in my racks. Because I had always thrown things together, my cable management was awful, and it was actually starting to bother me. I spent several hours tracing cables, pruning things with lights off, measuring, and so on. Since the blade center happens to run what we at home call production, like this site and routing (hence Netflix), I chose to be careful about where I was working.

Yesterday, I spent a significant amount of time setting up my 3120 stack to use port channel uplinks for better redundancy. Since each one is connected to a different member of my 3750 stack, it was easy, but tedious. That meant I also had to move my uplinks in ESX to use them, and then migrate my VMs’ networking over as well. It was a success, and I felt pretty good about being able to lose a cable here and there, as long as it wasn’t a direct connection between the NetApp 7-Mode filer and the blade center.

So Saturday night, before bed, I turn off the lights and close the garage door. There was this smell. To me, it smelled like electronics that were hot and on fire. I couldn’t find any source, so I let it go and went to bed around 10:30. In bed, I tossed, turned, thought, theorized, and couldn’t let it go. Back downstairs I go, in fuzzy PJ pants because hey, it’s cold. I go into the garage and what do ya know, the smell is still there, but still no fire. I nose around a bit, a new thing since my surgery for sinus stuff in May, and determine the source is most likely the blade center.

The display on the c7000 shows a management connectivity issue on all blades, so I say to myself: that has to be it, the management module. So I do the only thing that makes sense to me: I pull out the management module and reseat it. This was bad idea number 1. Since I didn’t have a secondary management module, all the fans ramped up to high and went orange.
You see, this module controls the cooling subsystem, the power subsystem, and the management interfaces of the blades and modules. The health light didn’t come back on after it was reseated, either. Shit, I thought. Perhaps it wasn’t as hot-swappable as I thought… so… I’ll just power cycle the whole blade center. Storage and networking will remain up, so the risk of issues should be lower. So, I pull the power on the blade center. This was bad idea number 2.

Upon re-applying power, I got a surprise. All the fans came on again, all on high. And that was it. No power to any blade, module, the admin module, or anything else. It was dead and loud. The entire environment was down. At this point, it’s 2:30am and I’m full of self-loathing and regret.

My next move is to plug in the emergency Belkin router I have configured for my IP address. It’s my break-fix item for when the virtual routers are down, so that I can at least get online, pacify the kids, and research what I’ve done. In this specific case, I jumped right onto eBay to assess the damage of a new module. It didn’t take long to find redemption: a PAIR of these things coming from Boston for $25 with free shipping. Ordered and paid for, now I just have to wait for them.

As one would expect, it took about 3 days for them to travel 10 hours’ worth of interstate. Wednesday, they arrived, along with some 10Gb SFP modules I had bought Friday for those 3120s I have. As soon as I get home, I rip them open, shove them into the admin slots, and apply power. All the fans come up on high… and settle down with all green lights. The management display in front shows a discovery progress bar. We’re alive.

In about 15 minutes, all my systems have booted and my normal environment is back up with little left to do. The only cleanup was syncing the firmware on both modules, since one was at 4.60 and the other at 4.40. I also had to re-create my admin accounts and reconfigure IPv4 and the other settings that the admin module held in memory.
So the lessons here are easy:
- Always have your redundant parts redundant.
- Don’t jack with anything that isn’t redundant without a replacement on standby.
- If there isn’t an actual fire, leave it alone.