A Mission Critical Mindset at Any Scale

High Reliability Isn’t Just for Large Systems

Small does not mean unimportant.

For those who don’t spend a lot of time hanging from ropes, a carabiner is a simple metal shackle that allows users to connect one thing to another quickly and securely. For example, climbers use them as part of a safety system that includes pitons, harnesses, and rope. As they make their way up a rockface, they hammer in a piton, attach the carabiner and pass through a rope. In this way they can create an extended redundant system that can support multiple people. But at its most simple level, one carabiner must be able to hold one person. It cannot fail. It is mission critical.

Is it the size of the system that dictates its criticality? No. The life of one rock climber is precious. A string of Christmas lights hanging from 100 carabiners is not. In the automation world, many systems that are small in scope can still be critical in nature. The knock-on effects of a system outage may introduce errors and defects in the items being produced, and those defects may go unnoticed.

In industrial systems, “mission critical” was traditionally synonymous with big, mostly because the equipment and infrastructure required to harden an industrial system were once prohibitively expensive. Smaller utilities and businesses simply learned to live with the consequences of downtime. They worked around the failures in their critical systems because, at one time, there was no other choice. However, recent advancements in technology have made mission critical methodologies far more affordable.

In this article, we will focus on supervisory control and data acquisition (SCADA) systems and discuss the factors that keep smaller entities from adopting the kind of zero-downtime principles that have been a standard part of industries such as power generation and pharmaceutical production.

What does “mission critical” actually mean?

Mission critical means it just has to work. Not just some of the time, but all of the time. Critical can mean different things to different people. Some processes and systems can tolerate some downtime; others cannot. Remember Apollo 13? Having to power down and restart a system to make changes can leave you in an extremely vulnerable position. Will it restart? Will your software crash? There are countless examples in the industry of a simple restart not going as planned, followed by significant downtime.

A mission critical mindset

A mission critical mindset needs to be part of the DNA of an organization. You need to look at every aspect that has an impact on your system. This is typically a drill-down approach. Start with what you are trying to achieve and then drill deeper to see where there are dependencies. As you do this, you can look at the interaction between different parts of your system, large or small. That way you can build a better picture of the what-ifs. What if this part of the system fails? What systems would be affected? By working through this process systematically and then reviewing it with others in your organization, you are more likely to eliminate blind spots and maximize uptime. High uptime has never been more affordable or achievable.
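The what-if exercise can be sketched as a small dependency walk. The following is a minimal illustration in Python, not a prescribed method. The component names and the dependency map are hypothetical and are not taken from any real system. Given one failed component, the function lists everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical map: each component -> the components it depends on.
DEPENDS_ON = {
    "operator_display": ["scada_server"],
    "alarm_notifications": ["scada_server", "cell_modem"],
    "scada_server": ["plc_network", "ups"],
    "plc_network": ["ups"],
    "cell_modem": [],
    "ups": [],
}


def affected_by(failed, depends_on=DEPENDS_ON):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


print(sorted(affected_by("cell_modem")))  # only alarm notifications are hit
print(sorted(affected_by("ups")))         # everything downstream of the UPS
```

In this made-up map, losing the UPS takes out four other components, which is the kind of blind spot the drill-down review is meant to expose.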

Case Study – The Gemini Telescope, Chile

This is an excerpt from a case study featured as a video in this eBook. Like our carabiner example, it also takes place on a mountain.

Some days you just can’t climb the mountain.

Paul Collins is the Electrical Supervisor at Gemini South, an optical/infrared observatory at the top of Cerro Pachón in Chile. “We are up in the foothills of the Andes at 2,700 meters. In the early stages of the telescope, we would have snowstorms and bad weather, where we would lose power or connection and we couldn’t come up for days on end. People would ask, ‘What’s going on in the telescope?’ I’d have to say, ‘I can’t tell you.’”

After trying many SCADA products, he discovered VTScada software by Trihedral, a comprehensive platform with integrated SCADA components including Enterprise historian, communication drivers, alarm notifications, and thin client connections. Collins created a sophisticated application that allows the maintenance team to work safely by day and provides remote observers with system information as they scan the skies at night.

Gemini’s Mission Critical Systems

Remote observing – “We don’t have observers up here on the telescope at night. We do everything remotely.” Observers can use thin clients to see if there has been rain, earthquakes, or losses of communication.

Remote operations – “If we’re working at the telescope, we put the system in ‘Summit Mode’. That means no remote control from the base facility. When we leave, we put it in ‘Standby’. When the observers take over in the evening, it goes into ‘Base Mode’. This prevents someone in the base facility from controlling systems while we are working on the site.”
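The mode interlock Collins describes can be illustrated with a minimal Python sketch. Only the three mode names come from the quote. The class, the method names, and the rule that Standby also rejects remote commands are assumptions made for illustration. This is not how the Gemini application or VTScada is actually implemented.

```python
from enum import Enum


class Mode(Enum):
    SUMMIT = "Summit Mode"   # staff working on site
    STANDBY = "Standby"      # assumption: site unattended, nobody in control
    BASE = "Base Mode"       # remote observers in control


class TelescopeInterlock:
    """Hypothetical gate placed in front of every remote command."""

    def __init__(self):
        self.mode = Mode.STANDBY

    def set_mode(self, mode):
        self.mode = mode

    def remote_command_allowed(self):
        # Assumption: only Base Mode accepts commands from the base facility.
        return self.mode is Mode.BASE

    def send_remote_command(self, command):
        if not self.remote_command_allowed():
            return f"REJECTED ({self.mode.value}): {command}"
        return f"ACCEPTED: {command}"


interlock = TelescopeInterlock()
interlock.set_mode(Mode.SUMMIT)
print(interlock.send_remote_command("open shutter"))
interlock.set_mode(Mode.BASE)
print(interlock.send_remote_command("open shutter"))
```

The point of routing every remote command through one gate is that nobody has to remember to check the mode. The check cannot be skipped.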

Cooling systems – “The main thing for a telescope is cooling. If we lose coolant, it shuts everything down and if you shut down an instrument due to temperature, it could take days to weeks to return it to normal operation. Our instruments operate on a helium system that is cooled by our chillers.” With one screen, Collins can see the status of systems including power generation, UPSs, chillers, temperature control valves, coolant pumps, and air handling units.

Fire alarm system – “We recently replaced our fire alarm system up here. The alarm company wanted to sell us monitoring software for thousands of dollars. So, we went out and bought a $250 Modbus to TCP/IP converter and there you go.”

“The great thing about this system for me is that I can connect to all my PLCs, but I also have the Modbus and SNMP connections. That brought everything together tremendously for us. I’m quite pleased with the way this system turned out.”

Read the full case study or watch the video here:

Underestimating the cost of failure.

Because many smaller users see downtime as an unavoidable cost of doing business, they don’t look closely at what the loss of monitoring, control, and alarms is already costing them.

Public safety – Loss of alarms and real-time data can lead to dangerous spills and leaks that pose immediate and long-term threats to the public and staff alike. Protecting people is the highest priority of a mission critical system.

Damage to equipment and infrastructure – Set points and alarms play an important role in protecting and extending the life of important industrial assets such as pumps, motors, valves, and UV sterilization equipment. Alarm notification systems ensure that those alarms reach personnel who can react in a timely manner. Without alarms, an extended power fault could ruin a motor, a jammed tree branch could damage a pump and lead to a spill, and a loss of power could destroy a chemical process.

Fines and legal accountability – In addition to the cost of the actual cleanup, there is also the risk of governmental fines, which can be in the millions of dollars. In the last decade, water and wastewater utilities and the individuals who run them have faced increased legal accountability. Add to that the possibility of civil and class-action lawsuits from those affected by industrial accidents.

Loss of data – Some costs of failure can be harder to quantify. What happens if a historical data server is destroyed? What happens if a loss of communication results in data not being logged? The cost of recovering lost data can be hundreds of times more than the cost of the system itself. Even assuming that there are backups to work from, it is often a long and cumbersome process to manually synchronize secondary computers or backed-up databases. As a result, history may be permanently lost, resulting in inaccurate reporting. This may have a knock-on effect, as these reports may be assumed to be correct, leading to operational inefficiencies for years.

Loss of production revenue – In simple terms, system downtime leads to production loss. This is compounded if your software application needs to be shut down each time you make even a small configuration change. Downtime, even in small amounts, adds up over time and affects profitability.
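To see how small outages add up, consider a rough calculation. The restart duration, the number of changes per month, and the hourly production value below are made-up figures chosen only for illustration. Substitute your own.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def annual_downtime_hours(minutes_per_restart, restarts_per_month):
    """Total hours per year lost to restarts for configuration changes."""
    return minutes_per_restart * restarts_per_month * 12 / 60


def availability_percent(downtime_hours):
    """Share of the year the system was up, as a percentage."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


def annual_cost(downtime_hours, revenue_per_hour):
    """Production revenue lost while the system was down."""
    return downtime_hours * revenue_per_hour


# Hypothetical inputs: a 15-minute restart, four times a month,
# at a plant producing 5,000 (in any currency) per hour.
hours = annual_downtime_hours(minutes_per_restart=15, restarts_per_month=4)
print(f"{hours:.1f} hours down per year")
print(f"{availability_percent(hours):.2f}% availability")
print(f"{annual_cost(hours, revenue_per_hour=5000):,.0f} lost per year")
```

With these assumed figures, four short restarts a month come to 12 hours of downtime a year, before counting any restart that does not go as planned.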

What makes a SCADA system mission critical?

Built in scales better than bolt on – Many software platforms, though sold as a single product, rely on third-party products for core SCADA components such as historians, alarm notifications, thin clients, and scripting. Over time, these components tend to work less and less well together, increasing the likelihood of downtime or loss of functionality. Third-party alarm notification systems are a common example. A single SCADA product ensures that everything continues to work together across new software versions. It also eliminates the risk that components are altered or discontinued by their manufacturers. Best of all, a unified approach means one install, one license agreement, one training track, and one support contract.

System-wide redundancy – Many SCADA platforms are limited to a primary and a backup server. Look for a product that can provide unlimited levels of redundancy. Ensure that there is robust failover for all components, including alarm notifications (email, SMS text message, text-to-speech call-out), thin clients, and networks. Can you configure a redundant communications network? If so, is there an alarm to inform you if the backup fails?
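The two questions above, which server takes over and who tells you when a backup is down, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the server names, the health dictionary, and the alarm wording are hypothetical, and no particular SCADA product works exactly this way.

```python
def select_active(servers, health):
    """Return the first healthy server in priority order, or None."""
    for name in servers:
        if health.get(name, False):
            return name
    return None


def redundancy_status(servers, health):
    """Pick the active server and raise an alarm for every unhealthy one.

    Idle backups are checked too, so a dead backup is reported
    before it is needed rather than at the moment of failover.
    """
    active = select_active(servers, health)
    alarms = []
    for position, name in enumerate(servers):
        if not health.get(name, False):
            role = "primary" if position == 0 else "backup"
            alarms.append(f"{role} server {name} is unavailable")
    if active is None:
        alarms.append("no server available: system is down")
    return active, alarms


# Hypothetical three-server priority list with one failed backup.
servers = ["plant-a", "plant-b", "offsite"]
health = {"plant-a": True, "plant-b": False, "offsite": True}
print(redundancy_status(servers, health))
```

Because the list can be any length, the same logic covers two servers or twenty. The important design choice is that backups are monitored even while they are idle.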

Application Version Control – Many system failures are the unexpected consequences of innocent configuration changes or, sometimes, the malicious acts of disgruntled workers. When things go wrong, it is vital to identify who did what and to roll back to the last known working version immediately. While some SCADA providers support third-party version control, there are benefits to this being a native component, such as the ability to automatically distribute the encrypted change list across all servers.

Real-time system backup and bi-directional synchronization – Traditionally, SCADA systems are backed up offline or online. The former involves shutting down the system, leaving operators blind and unable to manage alarms. The latter can corrupt data during the process. Additionally, few platforms automatically sync historical data after failover. Often a separate backup methodology is required for third-party historians. Automating backups may require custom scripting. Manual backups are easily forgotten. Systems that support bi-directional synchronization provide real-time backup of all your SCADA services. In addition to the historian, this includes events, alarms, security, and application settings. This means each SCADA server can be an up-to-the-second copy of your whole application. No missed backups.
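The idea behind bi-directional synchronization can be shown with a minimal Python sketch. It assumes each history record is uniquely identified by a (tag, timestamp) pair and never changes once logged. The tag names and values are hypothetical, and this is an illustration of the concept rather than a description of how any historian is implemented.

```python
def sync_bidirectional(history_a, history_b):
    """Copy records missing from either history into the other.

    Both arguments are dicts keyed by (tag, timestamp). Returns the
    number of records added to A and to B. Records present on both
    sides are left untouched.
    """
    missing_in_b = {k: v for k, v in history_a.items() if k not in history_b}
    missing_in_a = {k: v for k, v in history_b.items() if k not in history_a}
    history_b.update(missing_in_b)
    history_a.update(missing_in_a)
    return len(missing_in_a), len(missing_in_b)


# Hypothetical outage: the two servers lost contact after timestamp 1
# and each kept logging what it could see.
server_a = {("flow", 1): 10.0, ("flow", 2): 11.0}
server_b = {("flow", 1): 10.0, ("flow", 3): 12.0}

added_to_a, added_to_b = sync_bidirectional(server_a, server_b)
print(added_to_a, added_to_b)
print(server_a == server_b)
```

A one-way copy from primary to backup would have discarded the record that only the backup saw. Merging in both directions is what lets either server serve as the full copy afterwards.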

Fast response to vulnerabilities from the vendor – Software platforms regularly release new versions and features and often connect to devices developed long after applications are deployed. As a result, security gaps will inevitably appear over time. The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) regularly conducts vulnerability analysis on products used in critical infrastructure. When it identifies a potential security exploit, it contacts the vendor, who then has time to patch the vulnerability and distribute the solution before the vulnerability (and hopefully the fix) is made public.

Pick a software company with a record of responding to threats in a timely fashion. The article linked below contains a graph that shows the number of days for common SCADA software vendors to patch security issues once notified by ICS-CERT.

VTScada Software by Trihedral

Many of the principles described above apply to many kinds of SCADA, but others are unique to VTScada by Trihedral, the platform described in the Gemini Telescope example above. VTScada is multi-award-winning SCADA software that has been making complex infrastructure easy to configure for thirty-five years. Its unique unified design eliminates the gaps that plague other software platforms and allows systems to easily scale from two to over two million I/O. VTScada helps to eliminate downtime by supporting any number of redundant servers with automatic failover. Application changes can be made seamlessly online and easily rolled out across the network to multiple servers and thin clients, maximizing system uptime. The native Enterprise Historian supports bi-directional synchronization across all servers to keep your priceless data safe and available when you need it. Advanced Version Control is part of every application. VTScada is highly secure and has been used in some of the largest systems in the world for decades in industries such as power generation, broadcasting, water and wastewater, manufacturing, and oil & gas.