Human Error: How Automation Can Mitigate Operational Risk

  • July 11, 2019
  • Feature

By Neil Ferguson, VP, Sales Engineering at Opsview

Organizations are using automation to reduce the IT staff time taken up by mundane tasks, which helps promote productivity and make digital transformation a success. Automation can also monitor operations more effectively, reducing risk and flagging issues caused not by software bugs, attacks or unauthorized applications but by human error.

Automated monitoring, for example, could have helped prevent the Amazon Web Services (AWS) S3 outage caused by human error. According to Amazon’s site, the person involved intended to execute a command to remove a small number of servers, but mistakenly removed additional servers that supported two other subsystems. The error took down services such as Quora, and the disruption lasted four hours.

As companies scale and data workloads grow exponentially, automated monitoring is becoming the key to reducing costly outages, of which Gartner estimates human error is a leading cause. The Ponemon Institute has reported that human error accounts for 22% of unplanned outages, and noted that this number has held steady, “indicating no progress in reducing what should be an avoidable cause of downtime.”

 

Automating IT Processes Key to Outage Reduction

The internet is replete with stories of outages, increasingly attributed to human error. As Ponemon indicates, avoidable causes like human error need to be addressed. One motivator should be the high cost of such outages. The May Salesforce outage is a good example. Using Gartner’s benchmark figure of $5,600 per minute, published reports put the cost of the outage, which lasted 15 hours and eight minutes, at about $5 million.
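The arithmetic behind that estimate can be checked in a few lines. The sketch below is illustrative only: the function name `downtime_cost` is hypothetical, and the per-minute rate is simply the Gartner benchmark figure quoted above, not a measured cost for any particular company.

```python
# Illustrative only: estimates outage cost from a flat per-minute rate.
COST_PER_MINUTE_USD = 5_600  # Gartner benchmark figure quoted in the article


def downtime_cost(hours: int, minutes: int, rate: int = COST_PER_MINUTE_USD) -> int:
    """Return the estimated cost in dollars of an outage of the given length."""
    total_minutes = hours * 60 + minutes
    return total_minutes * rate


if __name__ == "__main__":
    # 15 hours, 8 minutes = 908 minutes
    print(f"${downtime_cost(15, 8):,}")  # prints $5,084,800
```

At 908 minutes the flat-rate estimate comes to $5,084,800, which is consistent with the reported figure of about $5 million.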

Organizations need to explore every avenue to reduce downtime, notably the 22% attributed to human error. Automating IT processes like monitoring can significantly cut down on human error and, therefore, outages. Here are elements to consider:

  1. A single source of truth. Set up correctly, automation can greatly increase the likelihood of identifying small issues before they snowball into much bigger ones. Automation helps ensure consistency across an environment. When issues do arise, an alert is sent to a member of the IT team, who can quickly identify the root cause and take measures to resolve it. This helps prevent scenarios such as cascade failure, in which a server issue spreads to seemingly unrelated systems and the IT team fixes the original server but misses all the other affected devices.
  2. Improved accountability. Automation makes the whole process of reviewing a system’s health accountable, providing a clear timeline of previous audits. CIOs and IT staff get a complete breakdown of what has occurred and when, with any anomalies that might be buried within the system flagged.
  3. Scale ready. As enterprises scale up, the risk of human error scales with them. When IT teams are managing 10,000 devices they are inevitably spread too thin, and outages caused by human error, already 22% of unplanned outages, become more likely. Automated monitoring mitigates that increased risk by providing a level of consistency across the IT infrastructure that over-tasked teams cannot achieve.
  4. Added business value. The constant churn of new technologies has made it more challenging for organizations to stay competitive in today’s digital business culture. Automation can free up IT staff time for the myriad operational and customer service improvements organizations must make to win in their market sector. Using automation, IT can add tremendous business value to the organization.
  5. Increased productivity. Automating monitoring and other processes helps control costs, since there are fewer manual tasks to complete. Where once there were many employees plugging gaps and fixing issues, there now need to be fewer, highly trained employees who can configure the automation system to a high standard and help with more strategic projects.
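
As a rough illustration of the first and third points, the sketch below applies the same threshold rules to every host and reports every breach, so that a cascading problem is not reduced to a single alert for the original server. All names (`Reading`, `THRESHOLDS`, `find_alerts`), metric names and limits are hypothetical and do not represent any particular monitoring product.

```python
from dataclasses import dataclass

# Hypothetical limits, applied identically to every host.
THRESHOLDS = {"cpu_percent": 90.0, "disk_percent": 85.0}


@dataclass(frozen=True)
class Reading:
    host: str
    metric: str
    value: float


def find_alerts(readings: list[Reading]) -> list[str]:
    """Return one alert message per reading that exceeds its threshold."""
    alerts = []
    for r in readings:
        limit = THRESHOLDS.get(r.metric)
        # Metrics without a configured limit are ignored rather than guessed at.
        if limit is not None and r.value > limit:
            alerts.append(f"{r.host}: {r.metric}={r.value} exceeds {limit}")
    return alerts


if __name__ == "__main__":
    sample = [
        Reading("web-01", "cpu_percent", 97.0),
        Reading("db-01", "disk_percent", 91.5),
        Reading("web-02", "cpu_percent", 40.0),
    ]
    for alert in find_alerts(sample):
        print(alert)
```

Because the rules live in one place and run the same way on every device, the check does not degrade as the device count grows, which is the consistency argument made above.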

 

Human Error Can Be Reduced

Some degree of human error is inevitable. There is no such thing as a completely fool-proof IT environment. However, implementing improvements like automated monitoring across the IT environment is a productive step in curbing the risk of outages caused by human error.

One could make the case that human error is not necessarily the fault of the employee, but rather of the business itself, due to its procedures and the way it chooses to monitor its assets. Regardless of the ability level of your employees, however, the biggest factor affecting human error is consistency. When people are given a litany of monotonous tasks, consistency will drop. Automation and a consistent procedural approach to IT monitoring deliver the level of consistency organizations must have if they want to avoid outages.

The longer-term objectives of a business, whether scaling through acquisition, new product lines or a disruptive technology play, all depend on the smoothly flowing workloads that support them. There is little margin for error, human or mechanical, in this modern environment of hyper-competitiveness and digital transformation. Automation is no longer a nice-to-have element but an essential part of any organization’s success.

About the Author

Neil Ferguson is the Vice President, Sales Engineering and Systems for Opsview, a company that provides unified insight into dynamic IT operations on-premises, in the cloud or hybrid.
