I have been thinking about global deduplication for online backups for small businesses. I saw one vendor--and I do not know how many others use this method--gather deduplication hashes from, say, 100 machines, given that some of those machines will hold different data. I am going to work through a very simple problem; I know the real problem is more complex. Let's say each hash entry I track is 4096 bytes, and on each machine I find 200 hashes that I want to deduplicate across the environment.
Storage space in memory is 200 hashes * 100 machines * 4096 bytes per hash, or 78.125 MiB. Well, we all know that 200 is going to be a very low number; a more realistic value might be 10,000. Now the problem looks like 10,000 * 100 * 4096 = 3.8 GiB. So what happens? Most machines might have 4 GiB of memory, so with the operating system and applications loaded, the system will page the hashes out. This likely will not be a big problem for the backup, since backup runs in the background and the most commonly used hashes will likely stay in memory. The problem gets ugly when it comes to restoring the data. The system will likely page itself into the ground, and the restore time will be very, very slow. If the machine has, say, 2 GiB of memory instead of 4 GiB, you are in a world of trouble. Global software dedupe might work for a small number of machines, but as the machine count grows, either the vendor must limit the number of hashes deduplicated against across machines, or the software must cap the hash table based on the memory size of the machine, which limits the effectiveness of the deduplication.
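The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation under the assumptions already stated in the text (4096 bytes per hash entry, 100 machines); the function name and constants are mine, not any vendor's.

```python
# Rough memory footprint of a global dedup hash index, using the
# assumptions from the text: 4096-byte hash entries, 100 machines.

ENTRY_BYTES = 4096   # bytes per hash entry (assumption from the example)
MACHINES = 100       # machines sharing the global index

def index_size_gib(hashes_per_machine: int) -> float:
    """Total index size in GiB for a given per-machine hash count."""
    total_bytes = hashes_per_machine * MACHINES * ENTRY_BYTES
    return total_bytes / 2**30

# 200 hashes/machine -> ~0.076 GiB (78.125 MiB); 10,000 -> ~3.8 GiB,
# which already crowds a 4 GiB machine once the OS and apps are loaded.
for count in (200, 10_000):
    print(f"{count:>6} hashes/machine -> {index_size_gib(count):.3f} GiB")
```

The point of the exercise: the index grows linearly in both machine count and per-machine hash count, so it outpaces the RAM of a typical small-business machine quickly.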
It is your responsibility to understand how the software works at scale. Remember: backup is not about backup, it is about restore, which is why it is called backup/restore.