The Cost of Failure – Why Mission-Critical Should be Your Mission

Explore the conditions that allow gaps to emerge in the most hardened process control systems and ways to design resiliency into every part of your architecture.

With many important things in life, we are often on the brink of failure and do not even know it. This is never truer than in mission-critical control systems like SCADA (Supervisory Control and Data Acquisition). If you are reading this, then you probably already think a lot about how to keep your critical systems online. In this article, rather than diving deep into technical best practices or IT policies, we will look at how blind spots develop in complex systems even when smart people are actively looking for them.

“I’m your friendly neighborhood sinkhole detector,”

said no one, ever. Sinkholes are excellent examples of how potentially disastrous gaps emerge unnoticed in complex systems. They seemingly appear out of nowhere without warning. Yet they develop over long periods and leave plenty of clues. This is a picture of a sinkhole that suddenly erased a major intersection in the Japanese city of Fukuoka. Remarkably, no one was injured.

In the hole you can clearly see the utilities whose operators could have noticed evidence that something was amiss. The water and gas utilities may have experienced a loss of pressure or minor leaks. Telecom providers may have noticed intermittent signal losses. In the days leading up to this event, the City may have filled a few cracks and holes in the pavement.

The problem was that all these clues were siloed and modularized. There was no one whose job was to unify all this different information. There is no friendly neighborhood sinkhole detector.

So, the opposite of a sinkhole is… an iPhone?

In a manner of speaking, yes. The iPhone is an example of a unified system approach done well. When the legendary Apple CEO Steve Jobs appeared on stage at the Macworld Conference & Expo in 2007, the audience was already expecting him to introduce the first iPhone. Most assumed that this would simply be a flip phone grafted onto their popular iPod music device. What Jobs pulled from his pocket that day was much more.

It was a single, unified platform that brought together the Internet, email, an accelerometer, a digital camera and a full marketplace of programs, or apps, that anyone could create and share with users. All with the confidence of knowing there was centralized quality control by a single company that oversaw the hardware and software. All core features can be counted on to work together predictably over time, with little responsibility on the end user to maintain them. When there are issues, they are quickly identified, and the fixes rolled out painlessly.

Why do bad things happen to good control systems?

Let us break the reasons into three categories:
1) System Architecture
2) Cyber Attacks
3) The Underestimated Cost of Data Recovery

1) System Architecture

These are some of the most common gaps that are easily missed in complex system designs.

Single Points of Failure – One of the most common reasons control systems go down is a single point of failure somewhere in the system. If your system resides on a single hard drive, then that is the weak point. Maybe you have multiple drives in one PC but the CPU or power supply fails. You might have two servers in your main office, but that office is now underwater. You may have servers in two different locations but just one network connecting them to each other and to your I/O devices.

Be sure to trace the path from I/O, to the PLC, to the HMI, all the way out to remote access and alarm notifications. Identify individual components that can take down your whole system.
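The trace described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topology, the component names and the helper functions are hypothetical, and a real audit would also cover power, storage and physical location. The idea is to remove each component in turn and check whether a signal from the I/O can still reach the operator.

```python
from collections import deque

def reachable(links, start, goal, removed=None):
    """Breadth-first search: can start still reach goal if `removed` is down?"""
    if start == removed or goal == removed:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, start, goal):
    """Return every intermediate component whose loss cuts start off from goal."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    return sorted(n for n in nodes - {start, goal}
                  if not reachable(links, start, goal, removed=n))

# Hypothetical topology: two SCADA servers, but one switch and one alarm modem.
links = {
    "io": ["plc"],
    "plc": ["switch"],
    "switch": ["server_a", "server_b"],
    "server_a": ["hmi", "modem"],
    "server_b": ["hmi", "modem"],
    "modem": ["operator_phone"],
    "hmi": [],
}

print(single_points_of_failure(links, "io", "operator_phone"))
```

In this made-up layout the redundant servers are not the weak point. The PLC, the switch and the single alarm modem are.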

Limited Levels of Redundancy – For most mission-critical control systems, redundancy is a cornerstone of high availability. Most specifications for new and upgraded SCADA systems include some requirement for server redundancy. Once a bidder checks this box for their proposed SCADA software, the matter is settled. It is redundant. The problem is that not all redundancy is equal. Most SCADA platforms only support two redundant SCADA servers. Worse still, most SCADA products rely on third-party databases such as Oracle or MySQL for their Historian, and these require their own methodology for failover and synchronisation.
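To see why the number of redundant servers matters, consider a rough availability estimate. This is a simplified sketch, not a reliability model: it assumes every server fails independently and has the same availability, and the 99% figure is an assumed value chosen purely for illustration.

```python
def system_availability(server_availability, servers):
    """Probability that at least one of `servers` independent servers is up."""
    return 1 - (1 - server_availability) ** servers

HOURS_PER_YEAR = 8760

# Assumed figure for illustration: each server is available 99% of the time.
for n in (1, 2, 3):
    a = system_availability(0.99, n)
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"{n} server(s): availability {a:.6f}, about {downtime:.2f} hours down per year")
```

The independence assumption is the important caveat. Servers that share a room, a network or a power supply tend to fail together, which is exactly the single-point-of-failure problem described above.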

“Virtually Redundant” May Not Actually Mean Redundant – Virtualized servers have become an important tool for IT departments to effectively manage their systems. Rather than configuring a separate physical machine for each server, a developer can create multiple server instances, each with its own operating system, running on a single physical computer. This can be an efficient way to create a redundant architecture while reducing the cost of maintaining multiple machines. You can even create virtualized network routers.

The obvious problem is that the single physical computer can become the single point of failure. The less obvious danger is that the physical server may lack the processing power and network bandwidth required to run two or more virtual servers, resulting in poor performance and server failure. Virtual machines (VMs) may also lack the physical ports required for proper failover to external hardware such as redundant voice modems. Also, complex virtualized designs can make it even harder to spot the single points of failure discussed above. For example, you may have a virtual router that only resides on one of your ‘redundant’ virtual servers.
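One way to catch this is to compare where each virtual machine actually runs against the redundancy you believe you have. The sketch below assumes you can export a VM-to-host mapping from your virtualization platform. The names and the `shared_host_risks` helper are hypothetical.

```python
def shared_host_risks(placement, redundancy_groups):
    """Flag redundancy groups whose members all run on one physical host."""
    risks = {}
    for group, members in redundancy_groups.items():
        hosts = {placement[vm] for vm in members}
        if len(hosts) < 2:
            risks[group] = sorted(hosts)
    return risks

# Hypothetical inventory: which physical host each VM runs on.
placement = {
    "scada_vm1": "host_a",
    "scada_vm2": "host_b",
    "router_vm1": "host_a",
}

# Hypothetical design intent: the VMs that are meant to back each other up.
redundancy_groups = {
    "scada": ["scada_vm1", "scada_vm2"],
    "router": ["router_vm1"],
}

print(shared_host_risks(placement, redundancy_groups))
```

Here the SCADA servers are spread across two hosts, but the only virtual router lives on host_a, so losing that host isolates both servers.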

2) Cyber Attacks

Some threats to critical systems are intentional and malicious.

Distributed Denial of Service (DDoS) – This very common and disruptive form of attack requires little investment or expertise to accomplish. Rather than trying to gain control of specific hardware or software applications, the assailant simply floods the network that connects them with meaningless requests, making it impossible for legitimate information to get through. The assailant may generate these requests themselves or, more typically, instruct large numbers of infected computers around the world to target the victim’s public IP addresses (static or dynamic). Once begun, these attacks can be difficult to stop and even harder to recover from.

One solution is to employ one or more Virtual Private Networks (VPNs) between your servers and remote I/O devices. VPNs create secure tunnels through a public or private network. You can also configure your firewall product to reject excessive connection requests or to accept requests only from whitelisted computers at specific times of day.
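The firewall rules described above are configured in the firewall product itself, and the syntax depends on the vendor. Purely to illustrate the logic, here is a minimal Python sketch that combines a whitelist with permitted hours and a simple sliding-window request limit. The `ConnectionFilter` class, the address and the limits are all assumptions made for this example.

```python
from collections import defaultdict, deque
from datetime import time

class ConnectionFilter:
    """Accept requests only from whitelisted addresses, in hours, at a sane rate."""

    def __init__(self, allowed, max_requests, window_seconds):
        self.allowed = allowed              # {ip: (earliest time, latest time)}
        self.max_requests = max_requests
        self.window = window_seconds
        self.recent = defaultdict(deque)    # ip -> timestamps of accepted requests

    def accept(self, ip, now_seconds, clock_time):
        hours = self.allowed.get(ip)
        if hours is None or not (hours[0] <= clock_time <= hours[1]):
            return False                    # unknown sender or outside permitted hours
        stamps = self.recent[ip]
        while stamps and now_seconds - stamps[0] >= self.window:
            stamps.popleft()                # forget requests older than the window
        if len(stamps) >= self.max_requests:
            return False                    # too many requests in the window
        stamps.append(now_seconds)
        return True

# Hypothetical rule: one remote site, 06:00 to 18:00, at most 3 requests per 10 seconds.
rules = {"10.0.0.5": (time(6, 0), time(18, 0))}
firewall = ConnectionFilter(rules, max_requests=3, window_seconds=10.0)

for second in (0, 1, 2, 3, 12):
    print(second, firewall.accept("10.0.0.5", second, time(9, 0)))
print("stranger", firewall.accept("203.0.113.9", 13, time(9, 0)))
```

Only whitelisted addresses are tracked, so a flood from unknown senders does not consume memory in the filter itself.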

Ransomware – The goal with this strategy is to encrypt the user’s data and then sell the decryption key back to the user, with payment usually demanded in Bitcoin or another cryptocurrency. These attacks are often sophisticated and targeted at specific utilities or companies. The key is for the assailant to infect a target computer through some form of social engineering, such as phishing, where users are tricked into opening email links, or by inserting an infected USB…


What do you want to do with it? Your sense of civic duty may inspire you to plug it into your home computer to look at the pictures on the SD Card and hopefully identify the owner. You were probably trained to not plug strange thumb drives into your work laptop, but this is a camera, and it is your own computer, right? Besides, who would waste an expensive-looking camera on the off chance that someone might plug it in and then eventually bring work home and then bring it back to work where they breach their company’s carefully air-gapped (i.e., not online) control system? If the value of the target is high, the cost of a second-hand camera or two is a small price to pay.

This is just one of the ever-evolving ways that assailants target the staff of the companies they wish to attack or extort. Protecting against these types of threats requires ongoing training and vigilance. Remember, there is always a “first time” for new exploits.

Assume the Bad Guys are Already In

The best methodology, and often the least considered, is to assume that the hackers have already penetrated your security and have access to your system. It is an uncomfortable thought, but if you start from that position, you can begin to develop ways to identify intruders and limit the damage they can do once they are past the lock on your front door.

As luck would have it, at the time of publication, an American software company that provides information technology solutions to businesses and US governmental organizations discovered that their software had been compromised. Hackers were able to access the networks of over 18,000 of their customers for weeks before being discovered.

3) The Underestimated Cost of Data Recovery

While prevention is obviously the best strategy for the management and availability of your data, from both an architectural and a security point of view, what happens if something does fail? Suppose a server fails or an attacker gains entry to the system. In addition to the potential loss of real-time monitoring and control, what is the cost of recovering any data that is lost? This can be very high, often tens if not hundreds of times more than the cost of the systems employed. SCADA users rely on historical data to produce trends and reports needed to keep their process running smoothly. Billing systems also need accurate customer records to ensure revenue and continuity of customer service.

a. Manual Syncing of Data – Assuming that there are even backups to work from, it is often a long and cumbersome process to manually synchronize secondary computers or backed up databases. This can run from hours to even days, depending on the amount of data that is being restored. It can also tie up valuable technical and human resources that could be spent on other important tasks.

b. Data Loss – Data itself may be lost due to gaps in the system, resulting in inaccurate reporting. This may have a knock-on effect, as these reports may be assumed to be correct in the absence of any other methodology against which to benchmark the information. This can result in operational inefficiencies that persist over a long period.

c. Complexity of Procedures – This is often overlooked. The complexity of the restoration process itself can lead to errors when data is re-entered. The effect is similar to those described in a and b above.

To summarize, using systems that employ better architecture, that allow data to be distributed over more than just two servers, that are kept fully in sync and restored automatically after any outage, leads to a much higher performing system and a much lower operational cost.

Building Resilience into Your System

System-wide redundancy – As discussed above, many SCADA software platforms are limited in the number of redundant servers they can support. Usually, there is a primary and a backup and no more. Two servers failing at the same time is not such an unlikely event. Look for a product that can provide unlimited levels of redundancy. Redundancy should not be limited to servers. Ensure that there is robust failover for all critical components such as alarm notifications (email, SMS text message, text-to-speech call-out), remote thin client access, communication networks, etc. Also, if you do have redundant networks, how do you ensure that the standby network is still available? Is there an alarm that can inform you that it has failed? Can the system alternate between them to ensure that they are both working?
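The last two questions can be made concrete with a small sketch: probe every network path, standby included, raise an alarm for any that fail, and rotate the active path so each one is exercised regularly. The probes here are hypothetical stand-ins. A real probe might open a connection to a PLC over that specific path.

```python
def failed_networks(probes):
    """Run every probe, including the standby's, and name the paths that failed."""
    failed = set()
    for name, probe in probes.items():
        try:
            healthy = probe()
        except OSError:
            healthy = False                 # treat a connection error as a failure
        if not healthy:
            failed.add(name)
    return failed

def next_active(networks, current, failed):
    """Pick the next healthy network after `current`, wrapping around the list."""
    start = networks.index(current)
    for step in range(1, len(networks) + 1):
        candidate = networks[(start + step) % len(networks)]
        if candidate not in failed:
            return candidate
    return None                             # every path is down

# Hypothetical probes: the fibre link is healthy, the cellular standby is not.
probes = {"fibre": lambda: True, "cellular": lambda: False}
networks = list(probes)

failed = failed_networks(probes)
for name in sorted(failed):
    print(f"ALARM: network '{name}' failed its health check")
print("Next active network:", next_active(networks, "fibre", failed))
```

Without the check, the failed standby in this example would go unnoticed until the day the primary link also failed.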

Real-time System Backup and Bi-directional Synchronization – Traditionally, SCADA users back up their applications online or offline. The offline approach involves shutting down the system while you copy the files to a storage device that is kept offsite. This leaves operators blind and unable to manage any alarms that occur during the process. Alternatively, online backups are done while the system is running. While more convenient, this can cause corruption should the live system attempt to write to a file while it is being read by the backup process.

As mentioned earlier, a separate backup methodology is often required for third-party historical databases. Automating this process may require custom scripting, while manual backups are prone to being forgotten. Additionally, although many SCADA platforms can perform automatic server failover, only a select few will automatically sync the historical data. On the others, the data must be synchronised later.

The solution is a technology called bi-directional synchronisation. This provides real-time synchronisation of all the services that make up most SCADA systems. In addition to the historian, this includes events, alarms, security accounts and all application settings.

An RPC Manager controls traffic between the various services which can be synced across all servers or divided across servers to optimize performance in larger applications. This means that each installed SCADA server can be an up-to-the-second copy of every part of the application. No more missed backups.
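To illustrate the general idea of bi-directional synchronisation (not how any particular product implements it), the sketch below reconciles two historians after a network split. It assumes records are append-only and keyed by tag and timestamp, so two servers never hold conflicting values for the same key. The tag name and values are made up.

```python
def synchronize(store_a, store_b):
    """Copy the records each store is missing from the other, in both directions."""
    missing_in_b = {k: v for k, v in store_a.items() if k not in store_b}
    missing_in_a = {k: v for k, v in store_b.items() if k not in store_a}
    store_b.update(missing_in_b)
    store_a.update(missing_in_a)
    return len(missing_in_a), len(missing_in_b)

# Hypothetical historians keyed by (tag, timestamp). Each logged a reading
# that the other missed while the link between them was down.
server_a = {("pump1.flow", 1): 10.0, ("pump1.flow", 2): 10.4}
server_b = {("pump1.flow", 1): 10.0, ("pump1.flow", 3): 10.9}

copied_to_a, copied_to_b = synchronize(server_a, server_b)
print(f"copied {copied_to_a} record(s) to A and {copied_to_b} to B")
print("in sync:", server_a == server_b)
```

A one-way backup would have kept only one server's view of the outage. Merging in both directions preserves what each side recorded.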

Integrated Software Platforms – As we learned from the story of the sinkhole and the iPhone, gaps emerge over time when disparate pieces are cobbled together. Many control system platforms use third-party products for core components such as Historians, alarm notifications, thin client servers and scripting languages. A single integrated product ensures that these components will continue to work together seamlessly with each new software version. It also eliminates the risk that one of these components might be altered or discontinued by its manufacturer. Best of all, a unified approach means one install, one license agreement, one training track and one support contract.

Application Version Control – Many system failures are the result of changes made by employees. These might be malicious acts by disgruntled workers or the unexpected consequences of an innocent configuration change in a complex system. Regardless of intent, when things go wrong it is vital to identify who did what and why. Even more importantly, authorized users need to be able to roll back to the last known working version as soon as possible. While some SCADA providers support third-party application version control, there are many benefits to this functionality being a native part of the software. In addition to the points mentioned in the section above, an additional benefit is the ability to automatically distribute the encrypted change list across redundant servers.
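The two requirements above, knowing who did what and why, and being able to roll back quickly, can be sketched as an append-only change history. This is a generic illustration with hypothetical names and settings, not a description of any vendor's version control. Encryption and distribution across servers are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Change:
    version: int
    author: str
    reason: str
    settings: dict

@dataclass
class ConfigHistory:
    changes: list = field(default_factory=list)

    def commit(self, author, reason, settings):
        """Record a new version along with who made it and why."""
        change = Change(len(self.changes) + 1, author, reason, dict(settings))
        self.changes.append(change)
        return change.version

    def current(self):
        return dict(self.changes[-1].settings)

    def roll_back_to(self, version, author):
        """Restore an earlier version by recording it as a new change."""
        if not 1 <= version <= len(self.changes):
            raise ValueError(f"unknown version {version}")
        target = self.changes[version - 1]
        return self.commit(author, f"Roll back to version {version}", target.settings)

# Hypothetical history: a risky change to an alarm setpoint, then a rollback.
history = ConfigHistory()
history.commit("alice", "Initial configuration", {"high_level_alarm": 4.5})
history.commit("bob", "Stop nuisance alarms", {"high_level_alarm": 9.0})
history.roll_back_to(1, "alice")

for change in history.changes:
    print(change.version, change.author, change.reason, change.settings)
```

Because the rollback is itself recorded as a new version, the change that caused the problem stays in the history for later review.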

Fast Response to Vulnerabilities from the Vendor – Control Systems are complex and are meant to last decades. Software platforms regularly release new versions with new features and often connect to devices that were developed long after the application was deployed. This all but ensures that exploitable security gaps will appear over time. No software is future-proof out of the box.

The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) is a governmental organization that regularly conducts vulnerability and malware analysis on commercial products that are used in critical infrastructure. When they identify a potential security exploit for one of these products, they contact the vendor with the information needed to recreate the problem. The vendor then has a period in which to patch the vulnerability and distribute the solution to affected users. After that, ICS-CERT makes the vulnerability (and the fix if there is one) public.

This graph, from data published by Trend Micro, shows the number of days it took common SCADA software vendors to patch security issues once they were notified by ICS-CERT. It is concerning to think that many of these vulnerabilities are publicly announced before the respective vendors have provided a fix. Vendors’ dependence on third-party software such as external databases, programming languages, drivers and alarm notification systems increases the time it takes them to resolve these issues. The moral of this story is to be sure to pick a software company that has a proven record of responding to these threats in a timely fashion.

VTScada – The Industry’s Most Powerful SCADA Software™
In the bar graph above, you may have noticed that the software vendor on the far left is Trihedral, the makers of VTScada. This award-winning software has been making complex infrastructure easy to configure for over thirty years. Its unique unified design eliminates the gaps that plague other software platforms and allows systems to easily scale from 2 to over 2 million I/O.

VTScada helps to eliminate downtime by allowing you to configure any number of redundant servers with automatic failover. The native Enterprise Historian supports bi-directional synchronisation across all servers to keep your priceless data safe and available when you need it. Advanced Version Control is part of every application. VTScada is highly secure and has been used in some of the largest systems in North America for decades in industries such as power generation, broadcasting, water and wastewater, manufacturing and oil & gas.

Try the free version of our award-winning software
VTScadaLIGHT is a Development and Runtime license perfect for small industrial and personal applications with up to 50 I/O. Individuals, businesses, and non-profits can install it on up to 10 PCs. There is even a step-by-step video tutorial to get you started. Download VTScadaLIGHT here:

Article By Chris Little – Media Relations at VTScada by Trihedral