I have 6 backups of my PC on an external hard disk, and they waste a lot of space because all of them are full (not incremental) backups, meaning that the majority of the contents are duplicated across the backups.
My idea is to trim those backups, keeping only the files MISSING from the original backup source, REGARDLESS of folder location.
The backups are “mountable units”, so I can use common file/folder utilities to find duplicates and missing files between two sets. I have tried tools such as Windiff and Comparator pro, but when comparing against the current file set, both of them report files that were merely moved to other folders as missing.
What I need is a tool that will list the files that are present in the backup but missing from the backup source, wherever they are, even if they have been moved elsewhere.
I too have been looking for space efficient backups and here are some applications I’ve found:
- Dupemerge – recommended for bulk dedupe
- Hard Link Shell Extension
- duplicati – recommended for backups
- Hardlink Backup (formerly RsyncBackup)
Dupemerge is a command line program that dedupes directories using hard links. It looks at the directory or directories you point it to and, if there are duplicate files, hard links them. If you back up on a regular basis, you can schedule it to run after the backup and free up the space.
CloneSpy is a GUI program that displays lists of duplicate files. You can manually dedupe some files or have it automatically dedupe the files for you. This program started out as a tool for removing duplicate files and the hard link capability was added later, so if you want the files hard linked, you must turn that on in the options. I don’t know whether CloneSpy knows the NTFS hard link limitations, but Dupemerge does.
There are also some Windows Explorer extensions that create hard links and let you see which files are hard linked from within Windows Explorer. Hard Link Shell Extension puts a red “shortcut” arrow overlay on files that have been hard linked. On local drives, the file properties also show which files are hard linked together. It is nice to see which files are duplicates and which are unique. If there is a chance of editing the files, the red arrow also acts as a warning that editing such a file actually edits all the hard linked files at the same time. The website http://schinagl.priv.at/nt/hardlinkshellext/hardlinkshellext.html has a ton of information about Hard Links, Junctions, and Symbolic Links.
Using hard links like this is nice because each backup folder looks like a complete full backup, but common files within a backup and across backups only use up space once, usually. I say “usually” because NTFS has a limitation of 1023 hard links to one file, and dupemerge will only hard link 1022 files to one file, so if you have more than 1022 copies, a second copy of the data has to be stored for another 1022 hard links to link to.
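To make the idea concrete, here is a minimal Python sketch of hard link deduplication with a cap on links per stored copy. It is not how Dupemerge or CloneSpy are implemented; the function names, the `max_links` parameter, and the `.dedupe-tmp` suffix are my own assumptions, and the default cap of 1022 is simply the figure quoted above. Note that replacing a file with a link discards that file's own timestamps and attributes.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root, max_links=1022):
    """Replace duplicate files under root with hard links.

    max_links is an assumed cap on directory entries per stored copy.
    Returns the number of files that were replaced by a link.
    """
    masters = {}  # digest -> path of the copy currently being linked to
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symbolic links alone
            digest = file_digest(path)
            master = masters.get(digest)
            if master is None:
                masters[digest] = path  # first copy seen becomes the master
            elif os.path.samefile(master, path):
                continue  # already linked to the master
            elif os.stat(master).st_nlink >= max_links:
                masters[digest] = path  # cap reached: keep a second stored copy
            else:
                tmp = path + ".dedupe-tmp"  # hypothetical temporary name
                os.link(master, tmp)   # create the new link first...
                os.replace(tmp, path)  # ...then swap it over the duplicate
                replaced += 1
    return replaced
```

Linking to a temporary name and then swapping it in means the duplicate is never deleted before its replacement exists. Try it on a copy of your data first.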
There are also programs intended for space-efficient backups, such as duplicati (a Windows port of the Linux duplicity backup program). From what I understand, this GUI program dedupes by hashing the data. It reminds me of using rsync for backup. The current version of duplicati is much improved, and I would recommend it. It can manage your backups by setting how many backups to keep, how much space to use, the maximum age of backups, and so on. I use this program for long-term backups.
Hardlink Backup (formerly RsyncBackup) is a GUI program that dedupes by hard links. (I haven’t used this program since it was rebranded.) Because I was going to use the program in a commercial environment, I didn’t test it much further. However, it did appear to work well.
Rdiff-Backup (a command line program) also dedupes by hard links. The thing I didn’t care for was that it puts a directory with all the revision history in the backup directory. If that is necessary, I wish they had hidden it by making it a dot directory. This program is similar to an rsync backup program.
Just to warn you, with Windows it isn’t always clear how much space hard linked files are using. I believe the overall drive statistics are correct and show the actual space used. However, if you open the properties of the backup directories, it looks as if no space was saved by hard linking. Eventually, you should be able to store what looks like over 100% of the capacity of the drive. This is because Explorer counts the space each file entry takes up, but doesn’t check whether two file entries are pointing to the same piece of data on the disk.
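If you want to check the difference yourself, a rough Python sketch like the one below adds up sizes both ways: once per file entry (the way I believe Explorer counts) and once per unique piece of data. The function name is made up for illustration, and it assumes a Python version and file system where `os.lstat` reports a usable file ID in `st_ino`. It counts logical file sizes, not allocated clusters.

```python
import os


def disk_usage_summary(root):
    """Return (apparent_bytes, unique_bytes) for all files under root.

    apparent_bytes counts every file entry; unique_bytes counts each
    underlying file only once, however many hard links point to it.
    """
    apparent = 0
    unique = 0
    seen = set()  # (device, file ID) pairs already counted
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique
```

The gap between the two numbers is roughly the space the hard links are saving you.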
If you only want to keep files that are not in the original backup, regardless of their location (e.g. File1 is in Backup1 and also in another backup, where File1 was not changed but only relocated, and you want a program to remove the second File1), then the program I would suggest is CloneSpy. This is what the program was originally designed for. CloneSpy has many options. I’ve used it for a similar task, where I compared folders in three successive passes, each involving FolderD. That way, among all the folders, only unique files exist.
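If you would rather script the original question (list the files in a backup whose content exists nowhere in the current source, whatever folder they are in), a minimal Python sketch could look like this. It matches files purely by content, using size plus SHA-256, and ignores names and paths. The function names are my own, and it only lists files, it doesn’t delete anything.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def walk_files(root):
    """Yield the full path of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def missing_from_source(source_root, backup_root):
    """List backup files whose content appears nowhere under source_root."""
    # Index the source by size first, so only plausible matches get hashed.
    source_by_size = {}
    for path in walk_files(source_root):
        source_by_size.setdefault(os.path.getsize(path), []).append(path)

    source_digests = {}  # size -> set of digests, filled on demand
    missing = []
    for path in walk_files(backup_root):
        size = os.path.getsize(path)
        if size not in source_by_size:
            missing.append(path)  # no source file has this size at all
            continue
        if size not in source_digests:
            source_digests[size] = {file_digest(p) for p in source_by_size[size]}
        if file_digest(path) not in source_digests[size]:
            missing.append(path)
    return sorted(missing)
```

You would call it with the source drive as `source_root` and a mounted backup as `backup_root`. Everything it returns is a file worth keeping from that backup.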
Actually, a simpler way to do the folder-by-folder comparison in CloneSpy is to put all the directories in one group and tell it to delete newer files. That leaves the oldest copy of each dupe set, so you can tell when a version of the file was first created. It also dedupes files within the directories as well as across them, resulting in only one copy of each unique file.
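The same “keep the oldest copy” rule can be sketched in a few lines of Python. This is only an illustration of the rule, not CloneSpy’s actual logic. The function names are hypothetical, “oldest” here means the earliest modification time, and the function only returns the files it would delete so you can review the list first.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def newer_duplicates(roots):
    """Return every duplicate file except the oldest copy in each dupe set."""
    groups = defaultdict(list)  # digest -> all paths with that content
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                groups[file_digest(path)].append(path)

    to_delete = []
    for paths in groups.values():
        if len(paths) < 2:
            continue  # unique content, nothing to remove
        # Oldest modification time first; the path breaks ties.
        paths.sort(key=lambda p: (os.path.getmtime(p), p))
        to_delete.extend(paths[1:])
    return sorted(to_delete)
```

Passing all the backup folders in one list gives the single-group behaviour described above, including dupes inside a single folder.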