When humans make tech mistakes

Credit to Author: Susan Bradley | Date: Mon, 18 Apr 2022 08:54:00 -0700

We often think vendors are perfect. They have backups. They have redundancy. They have experts who know exactly how to deploy solutions without fail. And then we see they aren’t any better than we are.

Let’s look at a few recent examples.

In the small to mid-sized business (SMB) space, StorageCraft has long been a trusted backup software vendor. One of the first to make image backups easy to do, it was used and recommended by many managed service providers. After StorageCraft was acquired by Arcserve in March 2021, there were no immediate major changes in how the company ran.

Then, last month, many customers’ cloud backups were permanently lost. As Blocks and Files reported: “During a recent planned maintenance window, a redundant array of servers containing critical metadata was decommissioned prematurely. As a result, some metadata was compromised, and critical links between the storage environment and our DRaaS cloud (Cloud Services) were disconnected. Engineers could not re-establish the required links between the metadata and the storage system, rendering the data unusable. This means partners cannot replicate or failover machines in our datacenter.”

As of April 16, the status report said: “All affected machines are now enabled with a buildup of recovery points occurring. All throttling has been turned off and uploads are working as normal. The time to replicate data will depend on each customer’s upload bandwidth and data volume.”

That doesn’t help if you had an older backup in your cloud repository that you wanted to keep.

Next up, Atlassian, which indicated on April 4 that approximately 400 Atlassian Cloud customers experienced a full outage across their Atlassian products. As the company noted on its site:

“One of our standalone apps for Jira Service Management and Jira Software, called “Insight – Asset Management,” was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application. However, two critical problems ensued:

“Communication gap. First, there was a communication gap between the team that requested the deactivation and the team that ran the deactivation. Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.

“Faulty script. Second, the script we used provided both the ‘mark for deletion’ capability used in normal day-to-day operations (where recoverability is desirable), and the ‘permanently delete’ capability that is required to permanently remove data when required for compliance reasons. The script was executed with the wrong execution mode and the wrong list of IDs. The result was that sites for approximately 400 customers were improperly deleted.”

While these incidents may not have directly affected you, it’s wise to use them as lessons to learn from.

First and foremost, always review (in either your contract with a vendor or the terms of licensing) what the vendor’s responsibilities are and what remedies you may have should a problem occur. In both of these cases, StorageCraft and Atlassian will abide by the terms they agreed to. If you are a larger client, you may be able to negotiate the contract terms and the remedies available; if you’re a smaller client, the end user license agreement and the terms it contains dictate what the vendor will do. If you rely on a vendor and its services, plan on something going wrong at some point. The key is to judge vendors by how they handle their mistakes, not their successes.

Will they reimburse you for the value of your loss? Will they take extraordinary steps to make you whole, or nearly so? Often, how quickly they fess up to what’s happened can matter more than how they handle your data.

In both cases, human error was to blame. I can still remember the time I was working on a DOS computer and accidentally typed del *.* at the root of the C: drive rather than in the subdirectory I intended. It’s a lesson that stays with me to this day. Whenever I do anything involving deletion, I pause and ask whether I have a backup in case I make a mistake. I check where I am performing the action. I ask myself whether I am deleting the right item.
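To make that pause concrete, here is a minimal Python sketch of the same habit: check where you are, check that a backup exists, and confirm the exact item before anything is removed. The paths and the backup check are hypothetical placeholders, not production tooling.

    from pathlib import Path
    import shutil

    # Assumed, hypothetical locations -- adjust before use.
    EXPECTED_BASE = Path("C:/data/scratch")   # the only tree this helper will ever touch
    BACKUP_ROOT = Path("D:/backups")          # where a copy is expected to exist first

    def safe_delete(target: str) -> None:
        target_path = Path(target).resolve()
        base = EXPECTED_BASE.resolve()

        # Check 1: am I where I think I am? Refuse anything outside the expected tree.
        if base != target_path and base not in target_path.parents:
            raise ValueError(f"Refusing to delete outside {base}: {target_path}")

        # Check 2: do I have a backup? Here that just means a same-named copy
        # exists under BACKUP_ROOT -- a stand-in for whatever check you really use.
        backup_copy = BACKUP_ROOT / target_path.name
        if not backup_copy.exists():
            raise RuntimeError(f"No backup at {backup_copy}; not deleting {target_path}")

        # Check 3: am I deleting the right item? Require the name to be retyped.
        answer = input(f"Delete {target_path}? Retype its name to confirm: ")
        if answer.strip() != target_path.name:
            print("Confirmation did not match; nothing deleted.")
            return

        shutil.rmtree(target_path)
        print(f"Deleted {target_path}")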

No matter whether you are a single user or manage a network of computers (on-premises or in the cloud), always have a full backup. And keep more than one way to recover data after a problem, from full image backups down to simple copies of key directories.
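As a simple illustration of the “simple copies” idea, here is a short Python sketch that copies a directory to a timestamped folder so older versions stay available alongside whatever image backups you already run. The source and destination paths are placeholders.

    import shutil
    from datetime import datetime
    from pathlib import Path

    def snapshot_directory(source: str, dest_root: str) -> Path:
        """Copy a directory to a timestamped folder so older versions are kept."""
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        destination = Path(dest_root) / f"{Path(source).name}-{stamp}"
        shutil.copytree(source, destination)
        return destination

    # Example (paths are placeholders): keep dated copies of a working folder
    # alongside your regular image backups.
    # snapshot_directory("C:/projects/ledger", "E:/copies")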

Next, if you are an MSP, urge your staff to double-check your scripts; we often reuse scripts without auditing them to ensure they still do what we intend. Reading the details of the Atlassian failure is painful: the teams didn’t communicate well and ended up deleting information they never planned to delete. When you are planning a major change to your infrastructure, communication is key to success.
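The Atlassian post-mortem describes a single script that could both mark items for deletion and permanently delete them, and it was fed the wrong IDs. One way to make a reusable script safer to audit is to make the dangerous paths opt-in. The sketch below is a hypothetical Python example, not Atlassian’s tooling: it defaults to a dry run and to recoverable “mark for deletion,” and requires explicit flags for anything irreversible.

    import argparse

    def mark_for_deletion(app_id: str) -> None:
        # Hypothetical stand-in for a recoverable, soft-delete operation.
        print(f"[soft] marked {app_id} for deletion (recoverable)")

    def permanently_delete(app_id: str) -> None:
        # Hypothetical stand-in for an irreversible delete.
        print(f"[hard] permanently deleted {app_id} (NOT recoverable)")

    def main() -> None:
        parser = argparse.ArgumentParser(description="Deactivate a list of app IDs.")
        parser.add_argument("ids_file", help="File with one app ID per line")
        parser.add_argument("--permanent", action="store_true",
                            help="Permanently delete instead of marking for deletion")
        parser.add_argument("--execute", action="store_true",
                            help="Actually run; without this flag the script only prints a plan")
        args = parser.parse_args()

        with open(args.ids_file) as handle:
            app_ids = [line.strip() for line in handle if line.strip()]

        action = permanently_delete if args.permanent else mark_for_deletion
        for app_id in app_ids:
            if args.execute:
                action(app_id)
            else:
                verb = "permanently delete" if args.permanent else "mark"
                print(f"[dry run] would {verb} {app_id}")

    if __name__ == "__main__":
        main()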

That goes for communications from vendors, too. I’m a Microsoft 365 user, and I rely on two different channels to keep track of issues. The Microsoft 365 Twitter account lets me get alerts when there are problems. (You can install the Twitter app and set it to send a push notification whenever there’s a status change.) Alternatively, you can set up notifications from the Microsoft 365 message center to stay up to date. For any vendor you use regularly, check whether it offers communication channels that will keep you informed.
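If you prefer to pull status information programmatically rather than wait for a push alert, Microsoft Graph exposes the Microsoft 365 message center through its service communications API. The sketch below assumes you already have an access token with the appropriate read permission; verify the endpoint and fields against Microsoft’s current documentation before relying on it.

    import requests

    # Message center feed via Microsoft Graph's service communications API.
    GRAPH_URL = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/messages"

    def recent_message_titles(access_token: str, top: int = 10) -> list[str]:
        """Return the titles of recent message center posts."""
        response = requests.get(
            GRAPH_URL,
            headers={"Authorization": f"Bearer {access_token}"},
            params={"$top": top},
        )
        response.raise_for_status()
        return [item.get("title", "") for item in response.json().get("value", [])]

    # Example: print(recent_message_titles(my_token))  # my_token acquired separately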

Remember that technology is driven by human decisions and humans make mistakes. Don’t assume mistakes won’t occur. Plan on what you’ll do when vendors make mistakes. After all, they’re only human.
