Data Protection in the Cloud Era with Nasuni
Ryan Miller provides insights on how organizations should evaluate data protection solutions in the cloud era.
June 26, 2024 | Ryan Miller
Prior to joining Nasuni, I spent 20 years in Professional Services helping customers with their data protection needs. Much of that was dedicated to the deployment of data protection solutions, but it also involved consultative services, such as performing health checks and providing best practice recommendations. Suffice it to say, the 3-2-1-1 data protection paradigm was solidly entrenched in my way of thinking.
When I started at Nasuni, I quickly realized the company’s data protection story was seemingly at odds with this industry standard in multiple ways. First, having primary and backup data in the same basket is considered a major no-no. Second, the idea of equating a snapshot with a backup runs against the grain. And these are just two examples.
This realization forced me to spend some brain cycles really examining what 3-2-1-1 accomplishes at a fundamental level. This blog post is a summation of that journey: stepping back to first principles, thinking about how data protection has traditionally been done, and asking how the same goals can be achieved in a new paradigm. Ultimately, the core question was simple: How do we evaluate data protection in the cloud era?
What Are We Solving For?
When I talk with prospective customers about moving their unstructured file data to a cloud-centric platform, it’s not uncommon for the topic of data protection to come up. Depending on the role of the person I’m talking to, this can be in a few different contexts:
- Protection against data loss
- Protection against security threats (i.e., hackers)
- Protection against cloud (region) failure
I’m going to talk primarily about data loss here, although aspects of security and cloud failure almost always end up finding a way into the overall conversation.
Why 3-2-1-1 Worked
I mentioned above the idea of the 3-2-1-1 data protection strategy: 3 copies of data on 2 different storage media, one (1) of which is offsite, and one (1) immutable. The convenient thing about this strategy is that it provides protection against a broad spectrum of scenarios. Files get accidentally deleted? Restore from a backup. Did someone hack into the network and compromise your data? Restore from an inaccessible (offsite) or immutable backup. Did your primary data center go down? An offsite copy, often a replica at a DR site, is available.
Where 3-2-1-1 Fails
This all sounds great in theory, but executing on these contingency plans in a way that makes recovery possible and practical can be another matter entirely. If data protection using these traditional methods were easy, you’d rarely hear about ransomware attacks. An organization impacted by ransomware would be able to quickly and reliably recover from a backup and the incident would be a minor blip on the radar screen. We all know that is not what happens. Clearly, as ransomware has revealed, recovery using traditional backup isn’t quite so easy.
Although it is easy to become myopically focused on the numbers 3, 2, 1 and 1, sometimes it’s a good idea to step back and ask: “What is the fundamental capability I need to provide, without the biases that come with considering ‘How it has always been done’?”
In other words, is there another way to protect my data that may not exactly conform to 3-2-1-1, but still provides me with an equivalent or improved level of protection? Before we answer that, let’s look at how we arrived where we are today.
How Did We Get To 3-2-1-1 Anyway?
One of the primary considerations for protection against data loss is durability. Traditionally, storage admins have had limited levers to pull to improve durability, the most common being the RAID level they used. The choice often went something like this:
- Maximize performance and reliability, but incur maximum cost = RAID10
- Balance performance, reliability, and cost = RAID6
The problem is that RAID was an availability play vs. a durability one. The goal was to have data available in the event of disk failure, especially when the majority of disks were mechanical spinning disks. As a result, backups were leveraged to cover the gap. If the RAID volume failed or if data was lost, backups would save the day. Thus, backups became the backbone of data protection.
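To make that durability gap concrete, here is a rough back-of-envelope sketch in Python of the annual probability of losing a RAID6 group to a triple-drive failure during rebuilds. The drive count, failure rate, rebuild window, and independence assumptions are purely illustrative; real-world durability also depends on unrecoverable read errors, correlated failures, and operational factors.

```python
# Back-of-envelope RAID6 durability estimate (illustrative assumptions only).
# Model: data is lost if two more drives fail while the first failure is rebuilding.
# Ignores unrecoverable read errors and correlated failures, so it is optimistic.

HOURS_PER_YEAR = 8760

def raid6_annual_loss_probability(drives: int, afr: float, rebuild_hours: float) -> float:
    """Approximate annual probability of losing a RAID6 group.

    drives        -- number of drives in the group
    afr           -- annualized failure rate per drive (e.g., 0.02 = 2%)
    rebuild_hours -- time to rebuild a failed drive
    """
    # Probability that any given drive fails during one rebuild window.
    p_fail_in_window = afr * rebuild_hours / HOURS_PER_YEAR

    # Expected first-drive failures per year across the group.
    first_failures_per_year = drives * afr

    # Chance that a second, then a third, drive fails before the rebuild completes.
    p_second = (drives - 1) * p_fail_in_window
    p_third = (drives - 2) * p_fail_in_window
    return first_failures_per_year * p_second * p_third

if __name__ == "__main__":
    # Hypothetical 12-drive group, 2% AFR, 24-hour rebuild.
    p = raid6_annual_loss_probability(drives=12, afr=0.02, rebuild_hours=24)
    print(f"Approximate annual loss probability: {p:.2e}")
    # Roughly 8e-8 per group per year under these optimistic assumptions.
    # Bounded array durability like this is exactly the gap backups were asked to cover.
```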
For years, when considering factors such as DR and data availability at multiple locations, a typical data protection strategy might look like this:
- Scenario A: Create a local backup to another disk array because I’m paranoid about something happening to my primary storage array or losing data due to user error.
- Scenario B: Replicate that data to another data center because I’m paranoid about site loss. The bonus? This allows you to check off a box saying failover is possible.
- Scenario C: Write a copy to tape or to the cloud, sending it offsite for long-term retention requirements, since my primary and replica disks are expensive and it’s not feasible (either technically or financially) to store data long term on-prem.
With this strategy, we have 3 copies of data (primary, replica, and tape) on at least 2 types of storage media (primary disk, replica disk, and bonus tape), with 1 offsite (the replica) and 1 air gapped (the tape). All of this stemmed primarily from a durability problem, and the only available option was using backups to bridge the gap. Thus, 3-2-1-1 was born and became very popular.
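As a quick illustration of what the rule actually asks for, here is a toy sketch, with hypothetical names and copies, that checks whether a set of data copies satisfies 3-2-1-1.

```python
# Toy 3-2-1-1 check: 3 copies, 2 distinct media, 1 offsite, 1 immutable.
# The Copy fields and the example plan below are hypothetical.
from dataclasses import dataclass

@dataclass
class Copy:
    name: str
    medium: str        # e.g. "disk", "tape", "object"
    offsite: bool
    immutable: bool

def satisfies_3_2_1_1(copies: list[Copy]) -> bool:
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
        and any(c.immutable for c in copies)
    )

plan = [
    Copy("primary array", medium="disk", offsite=False, immutable=False),
    Copy("replica at DR site", medium="disk", offsite=True, immutable=False),
    Copy("offsite tape", medium="tape", offsite=True, immutable=True),
]
print(satisfies_3_2_1_1(plan))  # True
```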
Questions for 3-2-1-1 Adherents
Unfortunately, it’s virtually impossible to determine the potential durability of a 3-2-1-1 backup strategy, due to the sheer number of variables involved. However, I can say that over my 20-year career of working with customers and implementing data protection solutions, there are a number of recovery scenarios that cause people to cross their fingers and hope for the best, even with 3-2-1-1. For example:
- If you write to tape for offsite purposes, what is your confidence level of being able to read data off tape from 2 weeks ago?
- If you had to recover the majority of your organization’s data following a ransomware attack, what is your confidence level of being able to successfully read that volume of data from backups in a timely manner?
- What percentage (estimate) of your restore operations are completely successful?
- When was the last time you performed a DR exercise? Was it successful?
- Have you ever needed to perform a true unplanned DR of any subset of your organization’s data?
- If you’ve performed a failover to another site successfully, were you able to failback confidently, or was there concern for data loss?
These are critical questions that should prompt strong, confident answers. No one should be relying on a data protection solution that forces them to cross their fingers.
How Object Storage Changes Data Protection
Object storage is a completely different animal from traditional storage. It is built for durability (eleven nines, i.e. 99.999999999%, and beyond). It’s extremely cost effective on a per TB basis. And it is massively scalable in ways that traditional RAID groups are not. The introduction of a highly durable, cost-effective, and scalable storage medium has massive implications for data protection. This is where the brain cycles I mentioned earlier, spent comparing traditional data protection with Nasuni’s approach, produced some light-bulb moments.
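To put “eleven nines” in perspective, here is a quick back-of-envelope calculation; the object count is an arbitrary assumption, and the durability figure is a published design target, not a guarantee.

```python
# Expected annual object loss at "eleven nines" design durability.
# The object count is an arbitrary illustration.
durability = 0.99999999999          # 99.999999999% annual durability per object
objects = 10_000_000                # hypothetical 10 million objects stored

expected_losses_per_year = objects * (1 - durability)
print(f"Expected objects lost per year: {expected_losses_per_year:.4f}")
# ~0.0001 objects/year, i.e. on average one object lost every ~10,000 years
# at this scale -- a very different baseline from a single RAID group.
```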
Let’s reconsider scenarios A, B, and C above with the use of object storage:
- Scenario A (hardware failure): Techniques such as erasure coding and data sharding mean that data written to object storage is inherently stored redundantly, multiple times, across many disks and nodes. As a result, individual bits of data will not be lost due to factors such as bit rot, disk failure, RAID group failure, or controller failure. Thus, the need for backups to address things like hardware failure is greatly diminished.
- Scenarios A & B (user error/malicious user; long term retention): With object storage it becomes much more feasible to take an approach wherein any changes are written immutably as new objects (vs. overwriting old ones) and retained as objects for long periods of time. This is largely due to the scalability of object storage, but also bolstered by the low cost per GB afforded when using object storage from a public cloud provider. This is especially true with regards to long-term retention requirements. Recovery from user error or even large-scale ransomware attacks can be performed simply by moving pointers vs. pulling data from another piece of media. In this situation, the need for 3rd party backups to address data loss goes away.
- Scenarios B & C (site loss; offsite): Most organizations use object storage from one of the big three providers. These hyperscale data centers are engineered and built to be more durable and resilient than almost any individual organization’s on-prem data center, and the object storage they host can be replicated between data centers and even across geographies. When using a public cloud provider, the data is inherently offsite from the organization’s on-prem facilities, and depending on the storage class chosen, the data is also automatically replicated across data centers.
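To make the “recovery by moving pointers” idea concrete, here is a minimal sketch using generic S3-style object versioning with boto3. The bucket and key names are hypothetical, and this illustrates versioned object storage in general, not how Nasuni implements recovery.

```python
# Minimal sketch: recover a file by promoting an earlier object version,
# rather than restoring it from separate backup media.
# Bucket/key names are hypothetical; assumes versioning is enabled on the bucket.
import boto3

s3 = boto3.client("s3")
bucket = "example-file-data"          # hypothetical bucket
key = "projects/report.docx"          # hypothetical object key

# List the versions of this object, newest first.
versions = [
    v for v in s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", [])
    if v["Key"] == key
]
prior = versions[1]                   # the version just before the bad overwrite

# "Restore" by copying the prior version forward as the new current version.
# No data is pulled from other media; the store simply gains a new latest version.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key, "VersionId": prior["VersionId"]},
)
```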
Adding it all up, object storage has completely changed the landscape of what is possible with data protection. The need to account for hardware failure is addressed by erasure coding and data sharding; the need to replicate to a DR site is addressed by how the public cloud providers have built out their data centers and their options for geo-replication; and lastly, the need for long-term retention is addressed by the extremely low cost of object storage, where large amounts of data can affordably be kept in an available form. And when you think about leveraging that data with an AI or ML engine, the historical information that was often treated as ‘out of sight, out of mind’ for cost-reduction purposes suddenly becomes a potential goldmine.
Considerations with Object Storage
Since there is no such thing as a free lunch, I would be remiss if I didn’t acknowledge that there are some considerations to take into account when it comes to object storage, some of which you may already be thinking of:
- Public vs. Private: I’ve been assuming use of public cloud resources. An organization can use a private object store to gain the benefits of using a highly durable and scalable storage platform, but they would have to replicate to another storage cluster at another site and ensure their hardware maintenance practices are sufficient. Generally speaking, private object storage cost per TB is higher than that from a public provider.
- Resilience: The lowest-cost object storage from a public provider is typically a storage class where the data resides in a single availability zone, essentially a single facility. These data centers are highly resilient in terms of having things like backup generators and multiple connections to the outside world, but they are still single facilities. For additional protection, the customer would have to pick a storage class that replicates the data across multiple availability zones within the same region, or to another region. This does add to the cost (but it is still far less costly than another cluster of on-prem object storage).
- Hyperscale Failure: Nothing is impossible, so there is potential for cloud provider failure, but it’s important to keep things in perspective. The chance of a hyperscaler suffering a major, prolonged outage is far lower than the chance of an individual organization that operates and maintains its own data center experiencing one. This is a numbers game, and the hyperscaler wins. The difference is psychological comfort: we all know driving a vehicle is more dangerous than flying in an airplane, but driving allows us to maintain some control, so we are more comfortable.
- Air Gap: Finally, the elephant in the room. I’ve found there are two lines of thought on the topic of air gapping. There are those who believe that ANY online storage is susceptible to hackers, and thus data cannot be considered air gapped if it’s online. Then there are those who believe that even a traditional air gap is susceptible to tampering (such as when a tape falls off a truck). In the online case, mitigating the primary threat of a hacker comes down to good security practices, such as enabling some form of object lock on the object storage side, configuring RBAC properly, and enabling MFA for the primary admin account(s) that might have permissions to delete containers/buckets (a minimal sketch of these controls follows this list). Both lines of thinking are valid, and at the end of the day, each person has to decide where on the spectrum they fall.
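As a minimal illustration of the security controls mentioned above, here is a sketch using boto3 that creates an Object Lock enabled bucket with a default compliance-mode retention period. The bucket name, region, and retention window are hypothetical; RBAC policies and MFA enforcement would be configured separately in IAM.

```python
# Minimal sketch: object lock with a default compliance-mode retention window.
# Bucket name, region, and retention period are hypothetical examples.
import boto3

s3 = boto3.client("s3", region_name="us-east-2")

# Object Lock must be enabled at bucket creation; this also enables versioning.
s3.create_bucket(
    Bucket="example-immutable-files",
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
    ObjectLockEnabledForBucket=True,
)

# Default retention: object versions cannot be deleted or overwritten for 30 days,
# even by an administrator, while the compliance-mode lock is in effect.
s3.put_object_lock_configuration(
    Bucket="example-immutable-files",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```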
Conclusion
The traditional 3-2-1-1 strategy served its purpose, and served it well, for a long time. However, with the advent and widespread adoption of object storage, and the simultaneous rise of ransomware, data protection experts should step back, examine what 3-2-1-1 was really trying to solve for, and potentially change the approach instead of blindly adhering to it. We can take advantage of the durability and scalability of object storage and modernize how we approach data protection in a way that maintains sufficient protection, simplifies the overall approach, and ultimately reduces RPOs, RTOs, and mean time to recovery (MTTR) in dramatic fashion.