#BackToBasics / T-SQL Tuesday : Are your backups worthless?
This isn't Kenneth's first rodeo – he hosted #63, on security, last February. This time around, the topic is backup and recovery.
One type of post Kenneth has asked for, specifically, is about backup basics. This is very convenient for me, since I owe a 12th and final "Back to Basics" post for the year as well, and these two will tie in together quite nicely.
This is not a loaded question. Yet.
A backup is the most essential way you can protect yourself against everything from tripping on a power cord, to hard disk failures, to data center explosions. If you're already taking backups and are comfortable with your backup policies, the recovery model you're using, and that you can meet your RPO/RTO, feel free to skip ahead to the loaded part of the question.
If you are not taking backups, or you are relying on SAN software or anything that makes copies of the MDF/LDF files, or you aren't familiar with recovery models, backup types, or what RPO and RTO are, you are potentially putting your company and its data at great risk. If any of these is true, please keep reading.
One type of exchange that I observe at least weekly out on forums always goes something like this:
User A: "I deleted data / dropped a table / updated without a where clause, is there any way to undo it?"
User B: "Do you have backups?"
User A: "No."
User B: "Then probably not."
While there are exceptions, that's usually the end of the conversation; without backups, there's little hope the deleted data is still available anywhere, and that can sometimes be a very expensive lesson to learn (just ask JournalSpace). You may be able to go dumpster diving in the transaction log, but this is extremely tedious. And it may be impossible anyway, depending on your recovery model, your last backup of any kind, and whether your system is even still up and accessible (remember some aspects of disaster recovery are from real, physical disasters).
Assuming I have put enough fear into you that you are now planning to make sure you are backing up your databases, your next questions will probably be:
- Which types of backups should I be running?
- Which recovery model should I use?
- How often should I run backups?
- How long should I keep my backups?
There are three different backup types that are relevant for this conversation:
- Full backup – Think of this as a "snapshot" of your database at a specific point in time. (Documentation)
- Log backup – This is a backup of all transaction log activity since the last full or log backup. (Documentation)
- Differential backup – Think of this as a way to apply, using a single "snapshot," all of the changes that have happened since the last full backup, without incurring the cost and space of every individual transaction (e.g. if a row changes fifteen times, the transaction log has a record of all fifteen changes, while the differential only has the new version). (Documentation)
I would suggest that, in most cases, you'll want to be taking both full and log backups (and perhaps differential as well).
There are 3.5 different recovery models in SQL Server, regardless of edition, and they allow you to protect data in different ways.
- Full – SQL Server allows you to recover to a point in time, anywhere between the time of the last full backup and the most recent log backup. Recovering to a point in time is very useful if, for example, you need to recover a system to the moment before a critical table was dropped or a destructive update was performed. (Documentation)
- Simple – Taking backups in simple recovery mode will not allow you to restore a point in time, unless that point in time happens to coincide with the time the last full or differential backup was taken. The benefit is that SQL Server eliminates the requirement of taking log backups, and as a result can make more efficient use of log space. You cannot take log backups in simple recovery. (Documentation)
- Pseudo-Simple – This is the .5 I mentioned, and it isn't using an official Microsoft term, but one Paul Randal coined to represent a troublesome scenario. You have set a database to full recovery model, but have yet to take any proper backups. While the metadata might say the database is in full, you're not going to be able to fully recover in the event of a disaster. The result is that you aren't really protected by full recovery until you start actively taking backups.
- Bulk-logged – This works like full recovery by default, but operates like simple recovery during any bulk operations (think large ETL processes). The benefit is to not incur the logging cost of these operations, but this comes at a cost: you cannot restore to a point in time that falls inside of any of these bulk operations. (Documentation)
Again, I would say that, in most cases, you should be running in the Full recovery model. Some good background on the choices (and the problems the wrong choice can lead to) is over on Stack Exchange:
How often you take full and log backups, and how long you keep them, will depend on – among other things – two concepts we haven't discussed yet.
- Recovery Point Objective (RPO) – this is the measure of how much data, in time, the business can afford to lose. (WikiPedia definition)
- Recovery Time Objective (RTO) – this is the measure of how long it takes to recover from a disaster, a.k.a. "how long can we be down." (WikiPedia definition)
Most companies will say, well, our RPO and RTO should both be zero. With enough money and hardware, this becomes close to possible, but for most of us, there are trade-offs here. These numbers are often driven by service level agreements as much as anything, but let me give an example. If your RPO is 1 minute, you need to take a successful transaction log backup every minute. This might be reasonable, but this will affect your RTO, because if you have a disaster at 11:59 PM, and need to restore to 11:58 PM, you will need to restore at least one full backup, perhaps a differential backup, but then you need to apply 1,438 log backups. There is no way you can do that in a minute, so you can see why if your RPO is as small as 1 minute, your RTO is unlikely to be that low too. With a higher RPO, you could bring the RTO down – with an RPO of 5 minutes, a log backup every 5 minutes means you would be applying only 288 log files, and more likely to recover within a much shorter period.
Typical schedules I see take full backups once a week, differential backups once a day, and log backups every 5 or 15 minutes. This will provide a known RPO of 5 or 15 minutes, and as a result will attempt to minimize RTO. Your mileage may vary, depending on a variety of factors, including the amount of changes that take place to your data, and the performance and capacity of the drives where you are putting your backups.
As far as retention goes, environments I see use retention policies between 30 and 90 days. A number of factors feed into this; there may be compliance influences in some industries, but otherwise you'll probably just be constrained by how far back you think you might need to recover data from, and how much disk space you have to store it all.
A common excuse I see for not taking backups (or not taking them frequently enough) is that there isn't enough space to store them. I see the same excuse for using simple recovery instead of full, because there isn't enough storage to hold all the log backups.
This is your data and, IMHO, cheaping out on disk space is like buying a Ferrari and refusing to pay for a license plate or, more interestingly, insurance. You can get storage for pennies per GB, and even plugging in an external USB drive would be a good start. Sure, performance will not be awesome, but nobody cares how long a background backup process takes (unless it takes longer than your backup frequency).
Now, all of the above may be review for you, but a much more important part of this story is that you need to be TESTING your backups. I've seen many customers who have been happily taking backups and storing them on some drive somewhere, and then when disaster strikes and they actually need to restore them, they can't – maybe they had been backing up corruption all along, or the backups were failing but they were ignoring alerts, or they weren't taking log backups frequently enough to meet their RPO, or they were only taking full backups.
The moment disaster strikes is the worst possible time to learn, for the first time, that you are not prepared for that disaster. As far as I'm concerned:
If you aren't actively testing the restore of all of your backups, you don't have backups.
Here is what I suggest to customers who aren't already doing something along these lines: set up log shipping, restoring the database and logs to another server. Run your regression tests, DBCC checks, and anything else there. This allows you to be actively testing your backups on a continuous cycle. Some customers even do this on a delay, so the newest log backup they're every applying is 12 hours or 24 hours old. This gives them an "oops" window where – if someone ran an erroneous delete, for example – they can look at or recover an earlier version of the data, without having to go through the pain of the restore, because the "oops" transaction hasn't been committed on the secondary yet. This is a big benefit log shipping has over most other HA/DR features, which apply changes as immediately as possible.
This post was meant to provide some initial thoughts on backups, and hopefully raise concerns if these concepts are new to you. But it is certainly not meant to be a complete and exhaustive treatment of the topic. There is no shortage of material out there to help you get fully up to speed and confident in your backup and recovery plan: