Last Friday morning, and I’m talking early Friday morning, from 2 am to 5, our online courseware team attempted to apply a gradebook patch to the system. The timing of the patch was more or less forced on us by the severity of the bug and the upcoming end of the semester.
Normally we wouldn’t touch a critical app like this at the end of the semester. There are too many users accessing it, and too many possible negative repercussions.
But the bug forced our hand. We had determined earlier in the year that the system was not averaging grades correctly for some courses. How many courses? We had no idea. Some were fine, some were not. The only way to check was to download the course grades into a spreadsheet, average them, then compare the result to what the courseware thought they should be.
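For the curious, the manual check described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the CSV layout, the column names, the `find_mismatches` helper, the tolerance, and the plain unweighted mean are all made up for the example and are not the courseware's actual export format or grading rules.

```python
import csv
import io
from statistics import mean


def find_mismatches(export_csv, reported_averages, tolerance=0.01):
    """Recompute each student's average from an exported gradebook and
    return {student: (recomputed, reported)} wherever the two disagree.

    Hypothetical format: one row per student, a 'student' column, and one
    column per graded item. Assumes a plain unweighted mean.
    """
    mismatches = {}
    for row in csv.DictReader(io.StringIO(export_csv)):
        student = row.pop("student")
        # Skip blank cells (ungraded items) rather than counting them as zero.
        scores = [float(v) for v in row.values() if v and v.strip()]
        recomputed = mean(scores)
        reported = reported_averages[student]
        if abs(recomputed - reported) > tolerance:
            mismatches[student] = (recomputed, reported)
    return mismatches


# Made-up sample data: bob's reported average is wrong.
sample_export = """student,quiz1,quiz2,final
alice,90,80,100
bob,70,80,90
"""
sample_reported = {"alice": 90.0, "bob": 75.0}

result = find_mismatches(sample_export, sample_reported)
print(result)
```

A course with weighted categories or dropped scores would need a different formula, which is part of why checking by hand was so tedious.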
Now, even at UNC, it’s important for the grades to be accurate. Basic to the nature of the institution, you might say. The idea that some few thousands of courseware grades might not be correct is abhorrent. Not because of any great devotion on our part to the idea of exactly correct measurements of student achievement and the relative rankings thereof, though those are undeniably nice, but because if the problem was not fixed, then none of the courseware grade averages could be trusted, forcing every professor and grad student who uses the application to be instructed in how to:
1. Download the grades into a spreadsheet.
2. Average them within the spreadsheet.
This sounds simple enough, but for many UNC professors, this application and email constitute the sum total of their computer experience. Most would have no problem following the directions above, though their querulous and entirely understandable feedback at having been forced to do so could reasonably be expected to be large in quantity and deafening in tone. A significant minority would have to be physically guided through the process. Essentially, the small number of part time help desk staff associated with the courseware would be totally overwhelmed.
Fortunately, I’m not one of them, though I might be pressed into duty if the bottom were to fall out. And really, what is the frigging point of having courseware that can’t add and subtract correctly?
Friday was the third time we attempted to fix the problem. The first patch, one we were assured would fix the problem, didn’t fix the problem. The second choked on a database constraint, the patch developers not having considered the possibility that the fix might have to be run against a system with actual data in it. The third patch was a better version of the second, or so we were promised.
Now, not all of the above is entirely the fault of the app developers, piss-poor developers though they are, in our considered opinion. It’s basically impossible to replicate our system in any meaningful way–it’s too large. Certainly we can’t afford to, and without an exact copy of the production system to test with, there’s no way to be sure a patch will apply correctly. UNC and maybe one or two other institutions are on the bleeding edge when it comes to this app. If a bug exists, we’re going to find it, and find it first.
It’s very annoying, but we don’t have much of a choice. According to the historical scuttlebutt, the app we used before this was even worse. This is what happens when large applications are outsourced.
Since we’ve no real way to test the patch against a comparable system, everything has to be shut down and backed up before a patch can be applied, just in case. The database alone takes an hour to replicate.
On the other hand, this appears to be strictly a database problem, and we had provided the developers with access to a copy of ours.
They never even attempted to connect to it. But we had to try and apply the patch anyway.
Friday morning between 2 and 5 is the time when courseware usage is lowest, though there are no times when it is entirely idle. If we were going to inconvenience the fewest people possible, the patch had to be applied then. At 2 I shut down the system and disabled the alarms. The database administrator started the db backup, and my boss did the same for the file system. Were it not for a new NetApp, that would not have been possible without serious pain. The courses alone take up 165 gigabytes.
Yes, there are hard drives on PCs bigger than that, but try moving chunks of information that size around on them. And yes, we used an incremental backup instead of a full backup, but it still takes time.
Backups were finished at 3. The actual patch application took maybe 3 minutes, about half the time it took to restart the servers and re-enable the monitors. Testing the fix took another hour and a half.
Result? About 25% of the courses we knew had problems were fixed. Whatever was breaking the other 75% remained unaffected.
So sometime in the near future another early morning is going to present itself–we hope. The vendor of the app has been strangely silent since the weekend.
But no cloud is so black that it doesn’t have a silver lining. Today, by way of thanks for the late night/early morning, the project manager placed a six pack of weird beer on my desk and a pound of pistachios on the desk of my boss.
Bought with her own money, I should add, lest you think that the hard-earned dollars of the North Carolina taxpayer regularly go towards the purchase of yuppie alcohols.
Not that it would have stopped me from taking the beer. I’m willing to bet they would taste even better that way.
In another plus, it turns out the boss doesn’t care for pistachios. Now I have something to go with the beer.
Update: It seems I have left out information critical to the full understanding of the post above.
1. Dogfish Head Chicory Stout
2. Abita Turbo Dog
3. Abita Purple Haze
4. Weihenstephaner Hefeweissbier Dunkel
5. Weeping Radish weizen
6. Cottonwood Low Down Brown