Date: 2016-02-22

In the automotive world, a defective part from a supplier can result in expense and tragedy. Witness the massive Takata airbag recall, affecting dozens of manufacturers and tens of millions of vehicles on the road today, including potentially yours :(

The IT world is no different: we are all dependent on components from others, especially Linux and open source code. Bugs are found, some are serious -- and they must be quickly patched at considerable effort and expense, otherwise tragedy may await.

Last week, a particularly nasty bug was found in the widely used glibc code that enabled bad guys to essentially take over a DNS server. More details here, here and here.

The severity of the bug resulted in a "PATCH NOW!" directive to the IT community at large. While not as nasty as the infamous Heartbleed or Venom bugs, this one merited a serious and immediate response.

For many IT shops, this sort of all-too-common fire drill involves not only a lot of effort, but downtime as well.

Except for Oracle shops, that is.

The Magic Of Ksplice

Typically, it's very hard to patch code while it's running -- sort of like swapping out an airliner's engine at cruising altitude. Much easier to just land the plane.

One approach is to simply update the code on storage, and restart everything. Downtime is never ideal, and you silently cross your fingers when things are coming back up.
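Part of why "restart everything" is the rule: on Linux, a process that loaded the old library keeps the old copy mapped even after the file on disk has been replaced, and the kernel marks that mapping as "(deleted)" in /proc/<pid>/maps. Here is a minimal sketch of spotting those leftovers. The function name and the sample text are made up for illustration, and this is not how any particular tool does it.

```python
from typing import List


def stale_library_mappings(maps_text: str, library: str = "libc") -> List[str]:
    """Return paths of deleted-but-still-mapped files whose name contains `library`.

    `maps_text` is expected to look like the contents of /proc/<pid>/maps.
    """
    marker = " (deleted)"
    stale: List[str] = []
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no pathname
        path = fields[5].strip()
        basename = path.rsplit("/", 1)[-1]
        if path.endswith(marker) and library in basename:
            real_path = path[: -len(marker)]
            if real_path not in stale:
                stale.append(real_path)
    return stale


# Illustrative sample only; addresses, inodes and paths are invented.
sample = """\
00400000-0040c000 r-xp 00000000 08:01 131 /usr/bin/cat
7f3a00000000-7f3a001bb000 r-xp 00000000 08:01 917 /lib64/libc-2.17.so (deleted)
7f3a001bb000-7f3a003ba000 ---p 001bb000 08:01 917 /lib64/libc-2.17.so (deleted)
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""

stale = stale_library_mappings(sample)
```

On a real system you would feed it the maps file of each running process. Anything it reports is still running the old code and needs a restart.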

Unfortunately, that's so often the norm.

Another popular approach is to use clustering technology for fast failovers. Update shared code on storage, and then fail over each server in sequence to minimize production impact. Better, but it's still a process that deserves close supervision.
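For the curious, the rolling approach boils down to a loop like the one below. This is only a sketch: the four callbacks (drain, patch_and_restart, is_healthy, rejoin) are hypothetical placeholders for whatever your clustering software actually provides, and the node names are invented.

```python
from typing import Callable, List


def rolling_update(
    nodes: List[str],
    drain: Callable[[str], None],
    patch_and_restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rejoin: Callable[[str], None],
) -> List[str]:
    """Patch nodes one at a time and return the nodes that rejoined the cluster.

    Stops at the first node that fails its health check, leaving it out of
    the cluster so a human can take a look.
    """
    completed: List[str] = []
    for node in nodes:
        drain(node)              # fail work over to the remaining nodes
        patch_and_restart(node)  # pick up the updated code
        if not is_healthy(node):
            break                # this is the "close supervision" part
        rejoin(node)
        completed.append(node)
    return completed


# Dry run with fake callbacks that only record what would happen.
log: List[str] = []
patched = rolling_update(
    ["node-a", "node-b", "node-c"],
    drain=lambda n: log.append(f"drain {n}"),
    patch_and_restart=lambda n: log.append(f"patch {n}"),
    is_healthy=lambda n: n != "node-b",  # pretend node-b comes back sick
    rejoin=lambda n: log.append(f"rejoin {n}"),
)
```

In the dry run, node-a is patched and rejoins, node-b fails its health check, and node-c is never touched. Only one node is ever out of service at a time, which is the whole point.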

These days, there's a third approach, and that's Ksplice.

Ksplice is a Linux framework, originally developed at MIT and part of Oracle since 2011, that enables hot-swapping of code modules on-the-fly with ZERO disruption. Old code is unhooked and new code is re-hooked -- and the application is completely unaware that anything has happened.
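To make the unhook and re-hook idea concrete, here is a toy analogy, not Ksplice itself. Ksplice works on running machine code, while this sketch just swaps an entry in a Python dictionary. All names here (the dispatch table, resolve_v1, resolve_v2, hot_swap) are invented for illustration.

```python
from typing import Callable, Dict

# Hypothetical dispatch table: callers always look up the current
# implementation by name instead of holding a direct reference to it.
_dispatch: Dict[str, Callable[[str], str]] = {}


def resolve_v1(hostname: str) -> str:
    """Stand-in for the buggy routine (illustrative only)."""
    return f"v1:{hostname}"


def resolve_v2(hostname: str) -> str:
    """Stand-in for the patched routine (illustrative only)."""
    return f"v2:{hostname}"


def call(name: str, arg: str) -> str:
    """What the 'application' does: call through the indirection."""
    return _dispatch[name](arg)


def hot_swap(name: str, new_impl: Callable[[str], str]) -> Callable[[str], str]:
    """Unhook the old implementation, hook in the new one, return the old."""
    old_impl = _dispatch[name]
    _dispatch[name] = new_impl
    return old_impl


_dispatch["resolve"] = resolve_v1
before = call("resolve", "example.com")  # served by the old code
hot_swap("resolve", resolve_v2)          # no restart, no reload
after = call("resolve", "example.com")   # same call site, new code
```

The caller never changes and never stops. It simply starts getting answers from the new code on its next call.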

Ksplice has been happily performing non-disruptive online patching on kernel code in Oracle Linux for several years. It works, and it works great.

But what about user space code, like the aforementioned glibc?

User space code is much harder, as Oracle has to develop individual harnesses for specific, widely used libraries. It's not a generic capability. If you're running user code that has been modified with one of our Ksplice harnesses, you're golden -- otherwise you have to patch like everyone else.

Fortunately, the code in question already had a Ksplice harness as part of our distribution. Anyone running recent Oracle Linux code (and subscribing to the appropriate service) was able to non-disruptively patch the defective library code with ZERO downtime and ZERO impact to production.

It's an impressive feat, one that I'm sure was appreciated by more than a few Oracle customers.

More on Ksplice here, along with additional commentary here and here.

It should be pointed out that it's the same Linux that's used in our Oracle Engineered Systems: Exadata, Exalogic, Exalytics, SuperCluster, Oracle Database Appliance, Big Data Appliance, Private Cloud Appliance, Zero Data Loss Recovery Appliance and so on.

The Value Of The Red Stack

One of the things I didn't fully appreciate before coming to Oracle is that owning all the pieces in an IT stack lets you do some pretty impressive things.

Like automatically hot-patching a critical bug across a vast army of Linux-based systems with no downtime.

While I do see the historical appeal of rolling-your-own IT stacks, it's hard to argue with the ease and convenience of a completely engineered stack. Including nifty features like this one.

Just curious: for all you non-Oracle Linux admins out there, how long did it take you to find, patch and restart every affected system?

Just curious :)