Ninth Circle of Hell
Take a look at this. If nothing loads, or you get an error, I’m still at work.
Update: Home now, finally. We were down for 5 hours, the longest unscheduled downtime we’ve had in quite some time. For some reason our NFS server panicked and rebooted, and when it came back up, it could only see one of the two RAIDs we had connected to it at a time. Since we had critical systems on both, everything had to come down while we were troubleshooting the problem.
Makes absolutely no sense, does it? Let me try again. I’m part of a team in charge of two main UNC systems: www.unc.edu and most of the other URLs ending in unc.edu, and blackboard.unc.edu, which is our online course stuff. Both of those systems take up huge amounts of space, measuring in the hundreds of gigs, so they’re not actually kept on our main computers. All of the blackboard.unc.edu files, and a lesser but still large portion of the www.unc.edu files, are kept on what you can think of as two gigantic hard drives. That’s not anywhere near correct, but for our purposes it’ll do. Those are the RAIDs. In order to get these files out to the world, we connect the RAIDs to a file server, which then acts as if all these files were actually part of the machine. Then we connect parts of the file server to the actual machines that run www.unc.edu and blackboard.unc.edu in the same manner. (BTW, you’ll notice all of the links above are to Sun boxes. We also have Linux boxes, 20 or so, but they are new, and have not yet made it to the production front lines.)
What happened was that something (we still don’t know what, exactly) caused the file server to get the two RAIDs confused. First it insisted that both had the same exact content; then, once we got past that, it refused to see both at the same time. Fixing it essentially boiled down to trying different things in order to isolate the problem while we wended our way up Sun’s tech support ladder. Finally we got to a guy who not only recognized the problem, but had a document listing the steps we needed to take to solve it, even though he also had no idea what initially caused it. Once we got through that, we spent another half hour calming down the rest of the machines, turning our monitoring software back on, and doing general cleanup. That brings me to now, which gives me two hours to get the Carnival done. Joy. Possibly I’ll post a more technical explanation of what went wrong tomorrow, but if so it’ll be more for my benefit than anything else, as I can’t imagine it’ll be any more interesting