Lost Data in the Cloud: How Sad

The headlines are ablaze because somebody over at the company, Danger, upgraded a storage array without making a backup, and voilà – bye-bye T-Mobile contact data.  (See the article on The Washington Post here.)  Nik Cubrilovic’s point in his article is that data has a natural lifecycle, and you should be able to survive without the contacts on your phone.  But he also makes the point that all sysadmins have memories of failing to recover some data at some point, and sweating bullets as a result.  His commentary boils down to this: this stuff is hardly as reliable as we expect it to be.  “Cloud” computers are no different, except that they are generally managed by professionals, who improve the odds of a successful recovery compared to the basement enthusiasts.

Having a backup plan is important.  Testing your backups periodically is important.  But generally, the rule is that the most important data gets the most attention.  If you have to choose between backing up your T-Mobile contacts and your patients’ health records, the latter will probably get more attention.  That’s partly because there are laws that require more attention to the latter.  But it is also because you probably won’t die if you can’t call your Aunt Susan without first emailing your mom for her number.  You can die if your doctor unknowingly prescribes you a medication that interacts with something missing from your chart because of data loss.

But the bottom line is this: data loss is inevitable.  There is a tremendous amount of data being stored today by individuals and businesses.  Even the very largest and most sophisticated technology businesses on Earth have had recent data losses that made the headlines.  But the odds of data loss from doing nothing about backups are still higher than if you at least use a cloud service.  Oh, and if you use an iPhone with MobileMe, it syncs your contacts among your iPhone, your computer, and Apple’s http://www.me.com, so you actually have three copies of your contacts floating around, not just one copy in the “cloud.”  Maybe you T-Mobile people aren’t better off by “sticking together.”

Chapter 3: Activate Me

A cornerstone of security for most computer systems is the user account.  The user account is a way of defining what each human being on the system can do with (or to) the system.  Universally, systems are designed with a user hierarchy in mind: users at the bottom rungs of polite computer society may be able to log in and look at a few things, but not make any changes or see anything particularly sensitive.  Those at the top may exercise complete control over core system functions or services.   The two basic tenets of a security plan are: (1) give each user the fewest privileges on the computer system practical for that person’s function, and (2) limit the number of user accounts that have complete access, and assign these to the trusted few in the organization.

These principles have a corollary consequence – the IT department is typically the organizational unit that controls privileges for new staff who join the organization.  The process to do this is relatively straightforward: the hiring supervisor completes a form online that notifies the IT department of a new user account to be created.  The actual technical process to establish a new account is relatively lengthy due to the ever-increasing number of systems and applications that require a password.  Not surprisingly, our user community is made unhappy when a new user account doesn’t work “out of the box.”  This problem culminated in a meeting of some of the unhappy users with me, the purpose of which I think was as much to remind me of where my bread was buttered as to seek a better way to activate new accounts.

Before a process can be improved, one must understand the steps involved in it.  Process improvement also requires collecting data on how often the problem occurs, so that improvements can be measured as the process changes.  But in this case, the real problem was a more general frustration with the technology and the sense that the technology department had the wrong priorities, or at least a list of priorities at variance with what this group of users thought the department’s priorities should be.

So what do you do?  For one thing, having enough notice of a new user account helps ensure that the account is created on time.  Having time to set up the account also allows IT to test it and make sure that it works before turning it over to the user.  As we discovered, having a written checklist of the process also helps cut down on errors (especially if the administrator is interrupted while activating the account, which surely never happens elsewhere).  There are also technology solutions for managing accounts across multiple information systems (for example, some kind of single sign-on technology that stores the account information of the other systems within the SSO system).  These solutions typically cache subordinate system passwords and pass them to those systems on demand, so that the user need only remember the primary account password (such as their Active Directory login).
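As a rough illustration of the caching idea described above, here is a toy sketch in Python.  All of the names, the plain-dict “directory,” and the unencrypted store are hypothetical simplifications – a real SSO product encrypts credentials and speaks each subordinate system’s protocol – but the flow is the same: authenticate once against the primary account, then let the SSO layer replay stored credentials.

```python
# Toy sketch of an SSO credential cache: after a user authenticates once
# with a primary (e.g., Active Directory) password, the SSO layer hands
# out stored subordinate-system credentials on demand.  Names and the
# plaintext store are illustrative only; real products encrypt all of this.

class CredentialCache:
    def __init__(self):
        self._store = {}           # user -> {system: (login, password)}
        self._authenticated = set()

    def authenticate(self, user, primary_password, directory):
        """Check the primary password against the directory (a dict here)."""
        if directory.get(user) == primary_password:
            self._authenticated.add(user)
            return True
        return False

    def register(self, user, system, login, password):
        """Record a subordinate system's credentials for this user."""
        self._store.setdefault(user, {})[system] = (login, password)

    def credentials_for(self, user, system):
        """Release a subordinate credential only after primary auth."""
        if user not in self._authenticated:
            raise PermissionError("authenticate with the primary password first")
        return self._store[user][system]


directory = {"jdoe": "s3cret"}          # stand-in for Active Directory
sso = CredentialCache()
sso.register("jdoe", "ehr", "jdoe_ehr", "ehr-pass")
assert sso.authenticate("jdoe", "s3cret", directory)
print(sso.credentials_for("jdoe", "ehr"))   # the cached EHR login
```

The design point is simply that the user remembers one password while the cache shoulders the rest – which is also why the cache itself becomes the crown jewel to protect.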

We also implemented a feedback process so that a new user (or their supervisor) could provide feedback to the IT department on problems with the account.  This information can be used for training or for process improvement, particularly where there are trends evident in the errors over time.  The problem with this process was that the number of errors reported was relatively small over time, and the fact is that you will not ever have a zero error rate with any process, no matter how much attention you put on it.  However, if you activated thousands of accounts each year, the data collected would be more useful to you.
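The small-sample problem above is easy to see with a few lines of Python.  The activation log below is made up for illustration; the point is that with only a handful of activations per quarter, a single bad account swings the quarterly error rate wildly, which is why our trend data was hard to learn from.

```python
# Per-quarter error rates from a (made-up) account-activation log.
# With small quarterly counts, one error moves the rate by 25-33 points,
# so apparent "trends" are mostly noise.
from collections import defaultdict

activations = [
    # (quarter, had_error)
    ("2009Q1", False), ("2009Q1", True),  ("2009Q1", False),
    ("2009Q2", False), ("2009Q2", False),
    ("2009Q3", True),  ("2009Q3", False), ("2009Q3", False), ("2009Q3", False),
]

totals = defaultdict(int)
errors = defaultdict(int)
for quarter, had_error in activations:
    totals[quarter] += 1
    errors[quarter] += had_error      # bool counts as 0 or 1

rates = {q: errors[q] / totals[q] for q in sorted(totals)}
print(rates)
```

With thousands of activations a year, the same arithmetic starts to separate real process drift from one-off bad days.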

All of these tools only work when there is a good relationship between the users requesting accounts and the IT staff who create them.  And for IT managers, this may be the underlying issue that causes the actual tension in the room.

One way to improve user relations is to regularly talk with them to understand the issues and to get feedback on the IT department.  This goes beyond an annual user survey and requires an IT manager’s attendance at meetings with users.  In addition, having avenues to communicate with the user community when there are system issues is important.  Finally, advertising the efforts of the IT department to improve processes with the most complaints can help improve how users feel about the department’s services and staff.  Whenever you can, take the complaint as an opportunity to improve relations with your customers and advertise your success at resolving it.

Chapter 2: Stop Screwing With Me

Writes an angry user one Sunday morning at 7:48 a.m.:

“I don’t who know who’s doing the back up this morning, but who ever it was cut me off in the middle of my writing a complicated and lengthy assessment on a patient that is now lost.  I know you can tell when we’re using the [database], so why did this happen?”

The organization employs a medical record system that runs on Oracle 10.  The front-end application is a Visual Basic application that uses ODBC to connect clients to the back-end database.  Our normal business hours are Monday through Friday, 8:30 a.m. to 9:00 p.m.; however, users do have remote access to our systems outside of normal business hours.  We therefore implemented a server maintenance schedule on Sunday mornings, knowing that some staff would still be inconvenienced by this decision, but that at least most of the time the database would be available during normal business hours.

In theory, one could ask Oracle “who’s logged in right now” and it would tell you as much as it knows (which may or may not be the whole story because of certain design aspects of the database and the application).  Of course, the basic problem is that if we asked the database this question, most of the time at least some user would be logged in because of remote access.  Consequently, we decided to perform cold backups of the Oracle database on Sunday mornings, which bring the database down for about three hours each Sunday morning.  Upgrades, patches, and other server changes may make the database unavailable for longer periods.  We did provide notice to our user community of our maintenance schedule.
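One small thing a help desk (or a login script) can do with a fixed schedule like ours is warn users when they are about to work inside the window.  The sketch below checks whether a given time falls inside a Sunday-morning maintenance window; the 6:00–9:00 a.m. boundaries are hypothetical stand-ins, since our window simply lasted about three hours on Sunday mornings.

```python
# Sketch: does a given timestamp fall inside the weekly maintenance window?
# The 6:00-9:00 a.m. Sunday window is an assumed example, not our actual
# published schedule.
from datetime import datetime, time

WINDOW_DAY = 6            # Monday = 0 ... Sunday = 6
WINDOW_START = time(6, 0)
WINDOW_END = time(9, 0)   # end is exclusive

def in_maintenance_window(when: datetime) -> bool:
    return (when.weekday() == WINDOW_DAY
            and WINDOW_START <= when.time() < WINDOW_END)

# A Sunday at 7:48 a.m. -- right when our angry user was typing.
print(in_maintenance_window(datetime(2009, 10, 11, 7, 48)))   # True
print(in_maintenance_window(datetime(2009, 10, 12, 7, 48)))   # False (a Monday)
```

Surfacing this check at login time is cheap compared to fielding the Sunday-morning email afterward, though as the rest of the chapter notes, some users will plow ahead regardless.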

The user, however, raises two points.  First, can’t an IT department find an alternate way to back up the database that would not cause a server outage?  And second, why isn’t the IT department omniscient enough to know if a user has been bad or good?

To the first point, a cold backup is a reliable method of backing up an Oracle database, but it is neither the only nor the most sophisticated method.  Oracle supports a number of alternatives, including hot backups and RMAN-based backups.  We use cold backups because they are the simplest and most certain way to ensure that the database can be recovered in the event of a system problem.  Our medical record system is the only database we support that uses Oracle as the database engine (we also support a version of Pervasive, two versions of Microsoft’s SQL Server, MySQL, and various flavors of Microsoft Access), so we cannot justify retaining a full-time Oracle expert to administer our database.  A more sophisticated database administrator would be able to configure hot backups to run safely (which would not require the database to be down), or to configure RMAN, which is integrated into the Oracle administrative tools, to perform backups.

So, the technology is there, but the expertise is outside of our current capabilities.  Surprising?  Probably not.  Every database technology in the end performs a similar set of tasks – storing and retrieving data efficiently.  However, how this simple idea is implemented varies widely across database engines and operating systems, and expertise has developed around each one.  The typical corporate IT department is unlikely to have this expertise in-house because the cost of the resource outweighs its utility to the organization compared with other priorities.  Smaller IT departments generally are made up of generalists who have broad but relatively shallow knowledge of the numerous systems and components that the IT department babysits for the company.

Expertise not maintained in-house must be contracted from an outside pool of IT experts.  However, there is no standard or certification for objectively evaluating external expertise (as there is for physicians and lawyers, both of whom must pass a state-sponsored licensing exam).  In addition, many IT departments elect to maintain control of critical systems via in-house staff, even if more expert staff are available to them.

In our case, by design, we elected to depend on the vendor of the health record system for Oracle database support, calling on this expert only in dire emergencies.  The inconvenience to our users of Sunday morning backups seemed less than dire, hence we did not seek further advice from the support vendor on how to mitigate it.  That meant we would need to develop some Oracle expertise in-house to do the day-to-day maintenance on the database; the extent of our knowledge was using the cold backup technology to perform backups.

If the IT department were to hire a contractor who was an Oracle 10 expert to implement RMAN for backups and recovery, an internal member of the IT department would also need to be trained to operate RMAN, address errors, test the backups to see if they are recoverable, and modify the RMAN configuration as a result of changes to the database (for example, as a consequence of an upgrade to the application).  Over the longer term, the initial cost to configure RMAN is small compared to the ongoing cost of ensuring that RMAN continues to work properly post-implementation.  Additionally, the IT department itself would need to cope with staff turnover – what happens to the knowledge about RMAN when the trained internal resource leaves the organization, or is promoted?

This problem is not really avoided if the department elects to contract with the Oracle consultant for ongoing support: in the long term, the consultant may stop providing the service, may become unavailable, or may want to be paid considerably more for his expertise than was originally bargained for.  So, either way, the total cost over the long run has to be balanced against the relative importance of implementing the service, in relation to the longer list of competing priorities for the IT department.  Given these basic economics of small IT departments, inconveniencing a few users on Sunday mornings will almost always cost less than the relative expense and difficulty of a more sophisticated system.

As to the second point, users often presume that IT staff watch their every move like a bunch of voyeurs at the digital keyhole.  As technology has developed, so have the tools for monitoring user activity.  But the truth of the matter is that we typically do not have enough time to review this activity unless there is a problem or issue.  And in the example above, while we may have been able to detect that the user was logged in, there was no way to know if the user was reading the news on Yahoo! or typing the thirteenth page of his graduate thesis.

Could we do better?  Of course.  A larger budget would ease the hard choices that IT departments make because of scarce resources.  As to the problem of kicking users out – we made a point of doing our best to post notice of unanticipated outages during business hours, but there is a limit to how effective notice of regularly scheduled outages will be for the hard-headed who insist on working on complicated matters in the middle of our backup schedule.  And you just can’t make everyone happy.

Lessons From IT Management: Introduction

For the last ten years, I have worked for a health center that serves several underserved populations: the gay and lesbian community, HIV-positive patients, and patients who lack sufficient access to health care.  Over that time, we have built a complex and extensive information system to help support the mission of the organization.

This series is about how technology can be integrated into the delivery of health care, and the problems that come up along the way in getting the technology to work.  I suspect that technology causes suffering for some in spite of our best efforts to the contrary.  But our purpose in implementing technology is to reduce suffering by passing repetitive tasks to the computer while increasing the amount of time available to people to do what they are good at (like doctoring, lawyering, and so on).  Within healthcare, automation can also reduce patient suffering by reducing errors (for example, by ensuring accurate prescriptions, or reducing the number of times the same data must be entered into systems that support patient care), which should improve the quality of care that patients receive from their physicians.  When used properly, technology should also bring relevant knowledge to the user as they are doing their job (by making negative drug interactions known to a prescriber, for example).

But technology can cause trouble for users who were perfectly happy with their paper documents.  The transition from paper to an electronic system can be tricky; moving from one computer system to a newer one can also pose real challenges.  This series is meant to help technologists and users avoid some of the common pitfalls of technology as both charge full steam into implementing health IT to take advantage of the incentives in the ARRA.

This series is also about the place where the rubber of our lofty humanitarian and economic goals meets the road of personality disorders, unreasonable expectations, and inefficiency – which is to say, the path to getting a computer system working for the people who will ultimately use it.  For the technologist, I do not think you can avoid the road (there are not yet helicopters in the arena of health IT implementation – though one day there may be), but you may at least find some solace in the fact that you are not the only one to have traveled this path.  For end users who happen to read this book, you might recognize a peer or yourself in these pages and gain some insight into why your IT staff always seem so grumpy.

While others have contributed to the subject matter, any mistakes in this series remain solely those of the author.   Please feel free to contribute by making comments on the blog.  And good luck to those of you implementing technology.

Lessons from IT Management, Chapter 1: Information Insecurity

To improve the security of our network, we decided to close port 3389 and no longer publish a Windows terminal server to the internet.  To continue to support remote access, we implemented a secure sockets layer (SSL) virtual private network (VPN) device that allows users outside the corporate network to create a secure tunnel into it.  As implemented, end users were required to use a particular operating system and to follow relatively simple instructions to install a small program that would initiate the tunnel from the home user’s workstation to the corporate network.  Authentication relies on the existing Active Directory accounts, so users didn’t need another login.  The appliance also gave us control over which accounts could have remote access, so we could block known trouble accounts, such as guest and administrator, from reaching the protected network.

By corporate policy, remote access was originally designed for clinicians to be able to access patient medical records while the clinician was on call.  Over time, end users have been able to use remote access to work in the comfort of their homes, whether on call or not.  However, in no case has the corporation required that end users be able to work remotely as a matter of course, except for a few traveling staff that work during the day at a third party facility.

Nonetheless, users had gotten it into their heads that working from home was a right, not a privilege.  And with that right flows an obligation on the part of IT to support the home user’s network configuration.  The change by IT to the method of remote access was therefore unwelcome by some and was met with resistance, even though the new method was in fact more secure, was recommended by our outside information security consultant, and addressed a major security vulnerability in our network.

There were really two important lessons from this experience.  First, proactive communication with, and involvement of, the user community is an important element of any plan to change remote access.  I suspect some would still have grumbled at the change, but we might have headed off some of the complaints simply by better explaining why the change was being made.  Second, remote access had grown organically over time, such that a lot of people were using it on a wide variety of home computers and home networks.  Many staff were not particularly competent with their home firewalls, routers, or other network devices when changes to those devices were needed to access the corporate network.  We also underestimated how diverse home setups were and how much configuration could be required for the SSL VPN device to connect to our network and establish the tunnel for secure communications.

We also discovered that the device was not particularly compatible with OS X (there was a guest kiosk function that worked within OS X, but the screen resolution and performance were poor, making it effectively unusable for staff who had to be logged in for longer periods of time).  We had not realized how many staff were actually using Macs at home, so this also caught us off guard.  Of course, Parallels and VMware both offer virtualized Windows XP desktops (with which the appliance was compatible), but users still complained that they had to implement this in order to access the network.

Inherently, there is tension between user access and security, and it is up to IT management to determine how much pain to inflict upon the users to protect network assets.  Not everyone will be happy with the balance.  In this case, I still think we made the right call, but we didn’t implement according to a complete plan.  Next time will no doubt be better.

You Aren’t Working Hard Enough

Lessons from IT Management: Prologue

Published in serial fashion, the blog posts in the Technology | Management section of this blog are some thoughts on managing an IT department from an insider’s perspective.

This series is about where the rubber meets the road when it comes to implementing technology for a lowest common denominator of sorts: office employees.  My hope is that others may learn from our mistakes, that those who do this stuff for a living may feel a bit of catharsis, and that we can examine ways technology itself might reduce some of what goes on in offices all over the world today.  And I also hope that you will laugh from time to time at some of the tales told tall in this series.  Sometimes IT staff and users want things that are just plain silly.

Even though others have contributed to this series (and some, unwittingly), any mistakes that may remain in the text are mine alone.  Please feel free to comment or contribute if you are so inclined, and enjoy the epic saga that follows!