Chapter 2: Stop Screwing With Me

Writes an angry user one Sunday morning at 7:48 a.m.:

“I don’t who know who’s doing the back up this morning, but who ever it was cut me off in the middle of my writing a complicated and lengthy assessment on a patient that is now lost. I know you can tell when we’re using the [database], so why did this happen?”

The organization employs a medical record system that runs on Oracle 10. The front-end application is a Visual Basic application that uses ODBC to connect clients to the back-end database. Our normal business hours are Monday through Friday, 8:30 a.m. to 9:00 p.m.; however, users do have remote access to our systems outside of normal business hours. We therefore implemented a server maintenance schedule that occurs Sunday mornings, knowing that some staff would still be inconvenienced by this decision, but that the database would at least be available during normal business hours most of the time.

In theory, one could ask Oracle “who’s logged in right now” and it would tell you as much as it knows (which may or may not be the whole story because of certain design aspects of the database and the application). Of course, the basic problem is that if we asked the database this question, most of the time at least some user would be logged in because of remote access. Consequently, we made a decision to perform cold backups of the Oracle database on Sunday mornings, which would bring down the database for about 3 hours each Sunday morning. Upgrades, patches, and other server changes may make the database unavailable for longer periods. We did provide notice to our user community of our maintenance schedule.
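As a rough illustration of what asking Oracle “who’s logged in right now” might look like, here is a minimal Python sketch. It assumes the session list comes from Oracle’s V$SESSION view; the Session class, the recently_active helper, the 15-minute cutoff, and the sample rows are all hypothetical and are not part of our actual maintenance procedure.

```python
from dataclasses import dataclass

# Query against Oracle's V$SESSION view. TYPE = 'USER' leaves out the
# database's own background processes. Running it requires a database
# connection, which this sketch deliberately omits.
SESSION_QUERY = """
SELECT username, machine, program, status, last_call_et
  FROM v$session
 WHERE type = 'USER'
   AND username IS NOT NULL
"""


@dataclass
class Session:
    """One row of the query above (hypothetical container for illustration)."""
    username: str
    machine: str
    program: str
    status: str        # e.g. 'ACTIVE' or 'INACTIVE'
    last_call_et: int  # seconds since the session entered its current status


def recently_active(sessions, idle_cutoff_seconds=900):
    """Return sessions that are ACTIVE or went idle less than the cutoff ago."""
    return [
        s for s in sessions
        if s.status == "ACTIVE" or s.last_call_et < idle_cutoff_seconds
    ]


# Made-up sample rows standing in for a real query result.
sample = [
    Session("NURSE1", "WS-014", "emr.exe", "INACTIVE", 120),
    Session("CLERK2", "WS-022", "emr.exe", "INACTIVE", 7200),
    Session("DOC3", "REMOTE-7", "emr.exe", "ACTIVE", 3),
]

busy = recently_active(sample)
print([s.username for s in busy])  # ['NURSE1', 'DOC3']
```

Even a check like this only reports when a session last talked to the database. A user typing a long note into the client may well look idle until the moment they save, which is part of why the database’s answer may not be the whole story.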

The user, however, raises two points. First, can’t an IT department find an alternate way to back up the database that would not cause a server outage? And second, why isn’t the IT department omniscient enough to know whether a user has been bad or good?

To the first point, a cold backup is a reliable method of backing up an Oracle database, but it is neither the only method nor the most sophisticated. Oracle also supports hot backups and RMAN-based backups. We use cold backups because they are the simplest and most certain way to ensure the database can be recovered in the event of a system problem. Our medical record system is the only database we support that uses Oracle as its database engine (we also support a version of Pervasive, two versions of Microsoft’s SQL Server, MySQL, and various flavors of Microsoft Access), so we are not able to retain a full-time Oracle expert to administer it. A more experienced database administrator would be able to configure hot backups to run safely (which would not require the database to be down), or would be able to configure RMAN, which is integrated into the Oracle administrative tools, to perform the backups.

So, the technology is there, but the expertise is outside of our current capabilities. Surprising? Probably not. Every database technology in the end performs a similar set of tasks: storing and retrieving data efficiently. However, how this simple idea is implemented varies widely across database engines and operating systems, and expertise has developed around each one. The typical corporate IT department is unlikely to have this expertise in-house, because a specialist costs a great deal relative to the value the organization gets from that specialist when weighed against other priorities. Smaller IT departments are generally made up of generalists who have broad but relatively shallow knowledge of the numerous systems and components that the department babysits for the company.

Expertise not maintained in-house must be contracted from an outside pool of IT experts. However, there is no standard or certification for objectively evaluating external expertise (as there is for physicians and lawyers, both of whom must pass a state-sponsored certification exam). In addition, many IT departments elect to keep critical systems under the control of in-house staff, even if more expert help is available to them.

In our case, by design, we elected to depend on the vendor of the health record system for Oracle database support. Our approach was to call on this expert only for dire emergencies. The inconvenience to our users from Sunday morning backups seemed less than dire, so we did not seek further advice from the vendor on how to mitigate it. That meant we would need to develop some Oracle expertise in-house to handle day-to-day maintenance of the database. The extent of our knowledge was how to perform cold backups.

If the IT department were to hire a contractor who was an Oracle 10 expert to implement RMAN for backups and recovery, an internal member of the IT department would also need to be trained to operate RMAN, address errors, test the backups to see if they are recoverable, and modify the RMAN configuration in response to changes to the database (for example, as a consequence of an upgrade to the application). Over the longer term, the initial cost to configure RMAN is small compared to the ongoing cost of ensuring that RMAN continues to work properly after implementation. Additionally, the IT department itself would need to cope with staff turnover: what happens to the knowledge about RMAN when the trained internal resource leaves the organization, or is promoted?

This problem is not really avoided if the department elects to contract with the Oracle consultant for ongoing support, in the sense that in the long term, the consultant may stop providing the service, may become unavailable, or may want to be paid considerably more for his expertise than was originally bargained for. So, either way, the total cost over the long run has to be balanced against the relative importance of implementing the service, in relation to the longer list of competing priorities for the IT department. Given the basic kind of economic decisions made by small IT departments, inconveniencing a few users on Sunday mornings will almost always cost less than the relative expense and difficulty of a more sophisticated system.

As to the second point, users often presume that IT staff watch their every move like a bunch of voyeurs at the digital keyhole. As technology has developed, so have the tools for monitoring user activity. But the truth of the matter is that we typically do not have enough time to review this activity unless there is a problem or issue. And in the example above, while we may have been able to detect that the user was logged in, there was no way to know if the user was reading the news on Yahoo! or typing the thirteenth page of his graduate thesis.

Could we do better? Of course. A larger budget would ease the trade-offs that IT departments make because of scarce resources. As to the problem of kicking users out: we made a point of doing our best to post notice of unanticipated outages during business hours, but there is a limit to how effective notice of regularly scheduled outages will be for the hard-headed who insist on working on complicated matters in the middle of our backup schedule. And you just can’t make everyone happy.

Published by

faithatlaw
