Creating an append-only backup with restic and AWS
I wanted to backup my data. Easy premise, but harder to get right.
There are a few questions to answer if you want to create a backup: Which incidents do you want to recover from? Which software should you use? And where should the backups be stored?
At the end, I will present my current restic configuration.
Creating a backup strategy means thinking about the kinds of incidents you want to recover from. The (data) loss of your drive? The loss of your drive and one backup? Two? Besides disk failure, there may be other considerations as well, like where you and your backup(s) are located and how frequently natural or human-made disasters occur there.
Consider how long you want to store the data and how important it is, and then search for a solution that fits your risk acceptance.
A good starting point for your backup strategy is the 3-2-1 rule:1
- Keep 3 copies of any important file: 1 primary and 2 backups
- Keep the files on 2 different media types to protect against different types of hazards.
- Store 1 copy offsite (e.g., outside your home or business facility)
For me, this means that the data that lives on my computer’s and server’s disks should get mirrored to my NAS and some cloud storage. The data that lives only on my NAS will be mirrored to cloud storage as well, but since that is only one copy, I will create another backup on a hard drive and give that to some relative.
This way, I will have at least three copies (PC/server, NAS, hard drive, cloud), on two mediums (hard drives and cloud storage), and offsite backups. This should save me from data loss, the cloud provider going down, and disasters.
Now that you know where you want to store the data, you have to decide which software to use.
There are many ways to store backups: a simple copy, compressed archives, incremental backups, or deduplicated archives.
A simple copy takes up a lot of disk space. Compressed archives save some of that, but you still have to store each version of the files separately, which is a huge waste of resources. Incremental backups save space by only storing changes to a full backup, but if one incremental backup fails, the complete backup from that point on may be broken.
A good compromise is a deduplicating backup. It is basically a full backup, but the data is split into chunks. Duplicate chunks, e.g., the same file from two different points in time, are stored only once.2 If you want to recover your data, the chunks are read and reassembled into your files. There are some dangers here as well, like bit flips that might alter a chunk, but those typically affect only one file (and can sometimes be repaired).
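The idea can be sketched in a few lines of Python. This is a toy with fixed-size chunks (restic actually uses content-defined chunking and a more elaborate repository format), just to show how duplicate chunks are stored once:

```python
import hashlib

def backup(data: bytes, store: dict, chunk_size: int = 4) -> list:
    """Split data into chunks; store each unique chunk once, keyed by its hash."""
    refs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a duplicate chunk is not stored again
        refs.append(digest)
    return refs

def restore(refs: list, store: dict) -> bytes:
    """Reassemble the original data from its chunk references."""
    return b"".join(store[ref] for ref in refs)
```

Backing up two versions of a file that share a chunk grows the store by less than a full copy, and each version can still be restored from its reference list.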
Another thing to consider is encryption, especially if the backups are not under your control. Encryption is a great way to ensure that no one can read your data, but should you lose access to your keys, it is also a great way to ensure that you can’t read your own data.
In the end, I wanted a deduplicated and encrypted backup. There are many programs written for this use case, but for me it came down to borg or restic. I chose restic for its encryption, since borg’s is weaker when multiple clients update the same repository.3
Restic does everything I want, is a single binary (portability!), and has an explicit threat model that works for me.
A big disadvantage is its key management. In restic, your password doesn’t encrypt the data directly, but rather a key file. This key file holds the material for the key that actually encrypts the files. This means that an attacker who has once obtained a decrypted key file can read all the repository data forever, even if their key is revoked. This is, of course, only a worry if the attacker has access to the repository storage, but still something to keep in mind.
Since an attacker who can access restic keys can also read all the other data on my system, protecting the repository against reading is somewhat pointless and has to be solved by other means. I can, however, protect myself against an attacker deleting or overwriting my backups by making them append-only: once the data is written, it can’t be overwritten or deleted.
There are two ways to achieve this: a VPS with enough storage and rclone (see: ruderich.org), or a storage provider. A VPS just for backups would have meant a lot of manual work, since I would have wanted to keep it out of my usual infrastructure. I therefore chose managed storage.
I considered quite a few providers to store my backups at. On the list were:
- Hetzner Storage Box
- OVH Object Storage
- Scaleway Object Storage
- AWS S3
- Some VPS to store my data on
My goal was to find a European storage provider with S3 versioning and IAM access policies, with the ability to restrict deletion to subdirectories (`locks/*`) and to restrict the deletion of noncurrent object versions.
Sadly (and I would love to be corrected on this, write me!), only AWS had those features. Two came close:
- OVH has IAM access policies (though they seem to need a bit of work), but no versioning
- Scaleway has versioning and some IAM management (though you can’t restrict permissions to certain paths), but no DeleteObjectVersion restriction, meaning anyone with DeleteObject permission can delete any version of any file
So, AWS it is. I am not completely happy, but since restic’s encryption seems to be holding up, I will use it for my server backups. My personal backup will have to wait a bit.
I created an S3 bucket with versioning enabled and a corresponding user with the following permissions:
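A policy along these lines should match that description (the bucket name and source IP here are placeholder assumptions, not my real values):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadWriteList",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-backup-bucket",
        "arn:aws:s3:::my-backup-bucket/*"
      ],
      "Condition": { "IpAddress": { "aws:SourceIp": "198.51.100.7/32" } }
    },
    {
      "Sid": "AllowDeleteLocksOnly",
      "Effect": "Allow",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-backup-bucket/locks/*",
      "Condition": { "IpAddress": { "aws:SourceIp": "198.51.100.7/32" } }
    }
  ]
}
```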
This allows writing (and overwriting, but that’s what versioning is for) and listing, but deleting only objects in the `locks` folder. It also filters by source IP, so only the server that backs up to that repository can actually access it. Someone gaining access to that server and reading files would trigger alarms, so the backup should be fairly safe.
Forgetting a backup gives me the following error:
```
Remove(<snapshot/7642d7e379>) returned error, retrying after 1.080381816s: client.RemoveObject: Access Denied.
```
I also implemented some lifecycle rules like removing unfinished multipart uploads and deleting the noncurrent locks after a few days.
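Such a lifecycle configuration could look roughly like this (the day counts are examples, not my exact settings):

```json
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {},
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    },
    {
      "ID": "expire-noncurrent-lock-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "locks/" },
      "NoncurrentVersionExpiration": { "NoncurrentDays": 7 }
    }
  ]
}
```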
The backup automation is done via a cron job that runs once a day:
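A minimal sketch of such a job (the paths and backup targets are assumptions; adapt them to your setup) could be a script in `/etc/cron.daily`:

```shell
#!/bin/sh
# /etc/cron.daily/backup -- run one restic backup per environment file
set -eu

for env in /etc/backup-envs/*; do
    # Each file exports RESTIC_REPOSITORY, RESTIC_PASSWORD,
    # and the credentials for the backend
    . "$env"
    restic backup /home /etc --exclude-caches
done
```

Note that there is no `restic forget` or `prune` here: with the append-only permissions above, deleting old snapshots from the server would fail anyway.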
The files in `backup-envs` look like this, but you can use any restic backend you want. The repository just has to be initialized.
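For an S3 repository like mine, such an environment file might contain something like this (all values are placeholders):

```shell
# backup-envs/server -- restic environment for one repository
export RESTIC_REPOSITORY="s3:s3.eu-central-1.amazonaws.com/my-backup-bucket"
export RESTIC_PASSWORD="correct horse battery staple"
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
```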
I am still in the process of uploading the data, but I will consider moving the `data` directory to a different storage class, since it shouldn’t see many operations in general.
The list is from Data Backup Options (2012), US-CERT ↩︎
This also means that it is sometimes better not to compress your files before backing them up. ↩︎
Under these circumstances Borg guarantees that the attacker cannot
- modify the data of any archive without the client detecting the change
- rename, remove or add an archive without the client detecting the change
- recover plain-text data
- recover definite (heuristics based on access patterns are possible) structural information such as the object graph (which archives refer to what chunks)
When the above attack model is extended to include multiple clients independently updating the same repository, then Borg fails to provide confidentiality (i.e. guarantees 3) and 4) do not apply any more).