Is your hot backup hot enough?

A Q&A with Control Global

As the bar for resiliency continues to get higher for small and large systems, are traditional models of server failover good enough?

Chris Little

Hot server backup plays an important role in mission-critical systems. At the same time, the bar for resiliency continues to rise for both small and large systems. That leaves many in the field wondering whether traditional models of server failover are good enough. To find out what most automation experts think about hot backup and what can be done to ensure proper backup, we talked with Chris Little, who represents VTScada by Trihedral.

Q: How do most automation experts define hot backup?

A: In the simplest terms, it means that when a primary server fails, another takes its place without human intervention. Ideally, the software supports more than just one backup, and at least one of them is not in the same location as the others.

Q: What is the problem with how this is usually done?

A: Well, one problem is that for many software products there is a hard limit on how many backup machines you can configure. For some common platforms the limit is two. That’s a problem since an incident serious enough to take out a server can easily take out two. Even if these computers are not in the same building, events like floods, hurricanes, or earthquakes can damage multiple facilities.

But another potentially serious consideration is how many products handle historical data. If the system is configured so that each backup machine logs its own data, then each one will have a different view of the historical data. No two will have the same timestamps in their historians.

Not only does this independent polling overtax I/O devices and communications networks, it also makes trending and reporting inconsistent from one computer to the next.

Worse, trying to correct for this problem can involve complicated, system-intensive batch files that copy datasets from one machine to another on a timed schedule.
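To make the divergence concrete, here is a minimal sketch (hypothetical Python, not any particular product's code) of two backup servers that each poll the same tag on their own clocks:

```python
import random

def poll_plc(tag):
    """Stand-in for a real PLC/RTU read of one tag (hypothetical)."""
    return round(random.uniform(40.0, 45.0), 2)

def independent_history(tag, start, period, samples):
    """A server polling on its own clock; its timestamps are unique to it."""
    return [(start + i * period, tag, poll_plc(tag)) for i in range(samples)]

# Two backups polling the same tag, started a fraction of a second apart:
hist_a = independent_history("Pump1.Flow", start=1000.00, period=5.0, samples=3)
hist_b = independent_history("Pump1.Flow", start=1000.37, period=5.0, samples=3)

# hist_a timestamps: 1000.00, 1005.00, 1010.00
# hist_b timestamps: 1000.37, 1005.37, 1010.37 -> no shared records, the PLC
# answers twice as many polls, and trends differ by workstation.
```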

Q: How do you turn up the heat on your backup?

A: Rather than every server polling its own I/O from the PLCs or RTUs, only the designated primary server should do the polling. The backup machines simply synchronize with that primary. This creates a coherent dataset across all machines: every piece of time-stamped data is identical regardless of which workstation you are using. This is the approach used in VTScada software. Its integrated Historian was built from the ground up for distributed systems and synchronizes efficiently across the network.
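As a rough illustration of that pattern (a hypothetical sketch, not VTScada's internal code), the primary is the only node that reads the device, and every backup receives the very same time-stamped records:

```python
import time

def poll_plc(tag):
    """Stand-in for a real PLC/RTU read (hypothetical)."""
    return 42.0

class HistorianNode:
    """One server's historian: the primary polls; backups receive synced records."""
    def __init__(self, name):
        self.name = name
        self.records = []  # list of (timestamp, tag, value)

    def log(self, record):
        self.records.append(record)

primary = HistorianNode("Primary")
backups = [HistorianNode("Backup1"), HistorianNode("Backup2"), HistorianNode("Offsite")]

for _ in range(3):
    record = (time.time(), "Pump1.Flow", poll_plc("Pump1.Flow"))
    primary.log(record)       # only the primary talks to the device...
    for b in backups:
        b.log(record)         # ...and each backup synchronizes the same record
    time.sleep(1)

# Every node now holds identical timestamps and values, so trends and reports
# match no matter which workstation an operator uses.
assert all(b.records == primary.records for b in backups)
```

Note that nothing here limits the number of backups; adding an off-site node is just another subscriber to the same records.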

Q: But what if the network goes down and these distributed backups can’t sync with the primary?

A: In that case, each computer becomes the primary server for its local I/O and logs independently. Operators at each location can continue monitoring and controlling their local assets and can still see local history. When the network is restored, VTScada uses bi-directional synchronization to share time-stamped data from each isolated computer with every other machine. Again, the result is a single coherent dataset across the whole distributed application.
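One way to picture that bi-directional synchronization (again a hypothetical sketch, not the product's actual protocol) is as a union of each node's records keyed by timestamp and tag:

```python
def merge_histories(*histories):
    """Union of per-node records keyed on (timestamp, tag), so nothing logged
    during the outage is lost and nothing is duplicated."""
    merged = {}
    for history in histories:
        for ts, tag, value in history:
            merged[(ts, tag)] = value
    # One coherent, time-ordered dataset for every node to adopt.
    return sorted((ts, tag, val) for (ts, tag), val in merged.items())

# While isolated, each site logged only its own local I/O:
site_a = [(1000.0, "SiteA.Level", 3.1), (1005.0, "SiteA.Level", 3.2)]
site_b = [(1000.0, "SiteB.Flow", 41.7), (1005.0, "SiteB.Flow", 42.0)]

# After the network is restored, both servers end up with the same history:
combined = merge_histories(site_a, site_b)
```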

Q: How difficult is it to code versus the traditional approach?

A: That’s another important consideration when it comes to long-term system resiliency. Traditionally, failover and synchronization require some form of script coding. While this can provide a high degree of control, it also introduces complexity and risk. This can be a serious problem down the road if the system requires expansion or troubleshooting and the person who coded it has retired or won the lottery. The more custom code an application includes, the more due diligence is required each time the software is upgraded to the latest version. For this reason, many applications are frozen at a specific version, unable to take advantage of the latest features and security updates.

VTScada allows you to configure basic and advanced failover scenarios, as well as bi-directional synchronization, in seconds with no custom coding. Because VTScada is unique in that all of its core features are developed and maintained in-house, you can be confident that everything will continue to work in lockstep with every new version.

Q: That sounds like a benefit for smaller systems. Is this a better mousetrap for big players too?

A: Absolutely. While this is a practical approach for smaller users, who often assume that resilient hot-backup solutions are only available to big players, it can and does remove complexity and inconsistency from massive monitoring and control systems. The more remote monitoring sites there are, the greater the danger of mismatched datasets across the same time period.

Q: What is the difficulty in moving toward this kind of system?

A: When specifying new and updated systems, redundancy often shows up as a single checkmark. It’s included or it isn’t; but as we discussed earlier, what “it” means can vary widely. It could mean a system with few levels of redundancy, high overhead, and poor resilience over time. Be sure to ask questions. How many redundant servers will this include? Where will they be located? Do we need a third-party historian product that requires its own licensing, support, and failover strategy? How is historical data synchronized and protected? What is the expected life cycle of this application? Will my servers need to be isolated and frozen at a specific version? What is the upgrade strategy? Who will be able to maintain it fifteen years from now? The answers to these questions will tell you a lot about how “hot” your hot backup will be.