Machine learning is revolutionizing the way technology is deployed in many different fields, but when it comes to machine learning in data storage systems, things have been a little less dramatic. Big data storage and storage tiering are two areas where the use of machine learning in storage systems shows promise, but it's in the area of solid state drive (SSD) storage that machine learning may offer the biggest opportunities for improvement.
To understand why, it's necessary to take a quick look at how SSDs work. When flash NAND is made and sold to SSD makers, it comes pre-configured with various trim or register settings. When a controller orders a write to the NAND, the write is made at a certain voltage. Likewise, when a read is carried out, the controller is told whether a cell holds a charge of a certain voltage. These settings are set by the NAND manufacturer and are not the concern of the controller maker.
Now let's assume that the voltage in question is preset at 7V. It turns out that when NAND is new, it's not necessary to supply a voltage of 7V to a cell. Not only might a charge of 2V work just as well, but it will also allow the cell to work for longer before wearing out. As the cell ages through write cycles it will become necessary to increase that voltage to, say, 3.5V, and only much later in the cell's lifecycle will the default 7V be necessary. And finally it may be possible to prolong the life of the cell by applying a voltage even greater than the default 7V.
By changing the default read and write voltage of the NAND and various other flash register settings, either once or several times during the life of an SSD, it may be possible to increase the endurance of the storage device significantly. And in fact endurance is not the only factor that could be enhanced. Other settings could increase the performance of the SSD, and still others its data retention capabilities.
Machine learning in storage systems for optimization
Ultimately, choosing these settings is an optimization problem: endurance may be improved by altering settings without affecting anything else, for example, but in most cases it can be improved only at the expense of performance or retention, or perhaps both. In practice the easiest trade-off is endurance versus retention. That's because many manufacturers choose settings that offer retention that can be measured in months, but in a data center environment an SSD may only be required to retain data for a day or two when the power is off. By altering the settings to lower the retention time, big gains can often be made in endurance.
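To make the trade-off concrete, here is a minimal Python sketch of choosing between candidate sets of register settings. Everything in it is an assumption made for illustration: the `Candidate` fields, the setting names and all of the numbers are hypothetical and do not come from any NAND datasheet. It simply picks the candidate with the highest endurance that still meets a required power-off retention time and a latency ceiling.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One hypothetical set of register settings and its measured behaviour."""
    name: str
    endurance_cycles: int    # program/erase cycles before wear-out (made-up figure)
    retention_days: float    # power-off data retention (made-up figure)
    read_latency_us: float   # read latency (made-up figure)


def best_for_endurance(
    candidates: Sequence[Candidate],
    min_retention_days: float,
    max_latency_us: float,
) -> Optional[Candidate]:
    """Return the highest-endurance candidate that satisfies both constraints."""
    feasible = [
        c for c in candidates
        if c.retention_days >= min_retention_days
        and c.read_latency_us <= max_latency_us
    ]
    if not feasible:
        return None  # no candidate meets the requirements
    return max(feasible, key=lambda c: c.endurance_cycles)


# Illustrative values only.
CANDIDATES = [
    Candidate("factory_default", 3_000, 90.0, 80.0),
    Candidate("short_retention", 12_000, 3.0, 85.0),
    Candidate("aggressive", 20_000, 0.5, 120.0),
]

# A data center that only needs two days of power-off retention.
choice = best_for_endurance(CANDIDATES, min_retention_days=2.0, max_latency_us=100.0)
print(choice.name if choice else "no feasible settings")
```

With only three candidates this is trivial. The difficulty described in the article comes from the number of settings and the way they interact, which makes the list of candidates far too large to enumerate by hand.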
But here's the problem. Conventional 2D NAND may have 30 – 50 settings, and there is a highly complex interaction between them. That means that changing one can have a large and unexpected effect on another, making it very hard to optimize the settings manually to achieve a particular desired outcome. And when it comes to 3D NAND – the vertically stacked arrays of cells that most NAND makers are switching to – there can be thousands of settings. That makes the optimization fiendishly complex and practically impossible - for humans, at least.
And that's where machine learning in storage systems comes into the equation: humans may not be able to optimize the thousands of NAND settings in 3D NAND, but it's the type of exercise that machine learning systems excel at.
Machine learning in storage systems at early stages
The only company known to be doing this type of machine learning in storage systems at the moment is an Irish company called NVMdurance, which bills itself as an "automated flash memory optimization company." Its machine learning technology allows it to take individual manufacturers' flash NAND and automatically generate viable sets of flash register settings optimized for different operating requirements.
But even when using machine learning in storage systems in this way, the process is far from quick, according to Pearse Coyle, the company's CEO. "It takes us three months and a hundred pieces of the new flash to generate settings," he says.
To get an idea of the complexity involved, the company takes the 100 pieces of NAND hardware and subjects them to reads and writes, measuring the results. It then builds a software model of the hardware, and produces "hundreds of millions" of virtual devices, according to Coyle. The machine learning system then tests different parameters on these virtual devices, taking the most promising ones to test on real hardware.
'Billions of permutations'
How many different parameters does the machine learning system test? "There are many billions of permutations, and we use thousands of CPUs in the cloud to do the testing," says Coyle. "The search space is actually too big but we quickly see associations between parameters so we are able to reduce the number of dimensions."
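The general shape of that workflow (model the hardware in software, score many candidate settings against virtual devices, then take only the most promising ones to real hardware) can be sketched in a few lines of Python. This is not NVMdurance's method, which the article does not describe in detail: the scoring function below is a toy stand-in for a real device model, plain random search stands in for whatever search the company actually uses, and all names and numbers are hypothetical.

```python
import random
from statistics import mean


def make_virtual_devices(n_devices, rng):
    """Each virtual device is just an offset standing in for manufacturing variation."""
    return [rng.gauss(0.0, 0.05) for _ in range(n_devices)]


def simulated_endurance(settings, device_offset):
    """Toy surrogate model: the score peaks when every setting is near 0.3 + offset.

    A real model would be fitted to measurements taken from physical NAND.
    """
    return -sum((s - (0.3 + device_offset)) ** 2 for s in settings)


def shortlist_settings(n_candidates, n_settings, n_devices, k, seed=0):
    """Randomly sample candidate settings and keep the k best for hardware testing."""
    rng = random.Random(seed)
    devices = make_virtual_devices(n_devices, rng)
    scored = []
    for _ in range(n_candidates):
        settings = tuple(rng.random() for _ in range(n_settings))
        # Average over all virtual devices so one lucky device does not dominate.
        score = mean(simulated_endurance(settings, d) for d in devices)
        scored.append((score, settings))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for score, settings in shortlist_settings(200, 4, 20, k=5):
        print(f"{score:.4f}", [round(s, 2) for s in settings])
```

Random sampling scales badly as the number of settings grows, which is consistent with Coyle's remark that the search space is too big to cover and that spotting associations between parameters is what makes it tractable.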
Using machine learning, Coyle says the company's technology is able to optimize the NAND's register settings for endurance, performance or data retention, or even to produce a dynamic set of settings for a two-phase life: the first configuration is optimized for performance, and then, when performance starts to drop as the NAND ages, a second is optimized for long-term storage.
A further complication with the use of 3D NAND is that the quality of the storage media is poorer compared to 2D NAND, says Coyle. As a result, manufacturers specify that a complex form of error correction called low-density parity-check (LDPC) should be used with it. LDPC involves the use of tables called log-likelihood ratio (LLR) tables, and these are time consuming and hard to generate, and specific to a particular type of NAND with specific settings. Because of this they are supplied by the NAND manufacturer to SSD makers who want to use a particular type of NAND.
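For readers unfamiliar with the term, a log-likelihood ratio expresses how strongly a given read result suggests that a cell stores a 0 rather than a 1. The Python sketch below shows one simple way such a table could be estimated from counted read results. It is an illustration only, not how any NAND maker builds its tables: the bin counts are invented, the sign convention (positive means "more likely 0") is an assumption, and it assumes 0s and 1s are written equally often.

```python
import math


def llr_table(counts_zero, counts_one, smoothing=1.0):
    """Estimate one LLR per read bin from observation counts.

    counts_zero[i]: times a cell known to store 0 was read into bin i
    counts_one[i]:  times a cell known to store 1 was read into bin i
    LLR = ln(P(bin | 0) / P(bin | 1)), which equals ln(P(0 | bin) / P(1 | bin))
    when 0 and 1 are equally likely (assumed here).
    """
    if len(counts_zero) != len(counts_one):
        raise ValueError("both count lists must cover the same bins")
    n_bins = len(counts_zero)
    # Additive smoothing keeps empty bins from producing log(0).
    total_zero = sum(counts_zero) + smoothing * n_bins
    total_one = sum(counts_one) + smoothing * n_bins
    table = []
    for c0, c1 in zip(counts_zero, counts_one):
        p_bin_given_zero = (c0 + smoothing) / total_zero
        p_bin_given_one = (c1 + smoothing) / total_one
        table.append(math.log(p_bin_given_zero / p_bin_given_one))
    return table


# Invented counts for three read bins: "clearly 0", "ambiguous", "clearly 1".
print(llr_table([90, 9, 1], [1, 9, 90]))
```

Because the counts depend on how the cells were written and read, a change to the register settings shifts the counts and so changes the table.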
LLR tables using machine learning in storage systems
So here's another problem: if the NAND settings are changed, perhaps to settings which are optimized for greater endurance, then the LLR tables are no longer valid. "This completely screws SSD makers who want to differentiate their offerings with different settings," says Coyle. "We reckon that there must be 60 or so SSD makers who can't go to market with 3D flash because they can't use the supplied LLR tables." But NVMdurance's machine learning in storage systems technology can automatically generate LLR tables for any sets of 3D NAND register settings that it comes up with.
At this point it's worth asking why all of this is important in a business context. Is the ability to use machine learning in storage systems to optimize an SSD for better endurance (or anything else) really that important?
Business case for machine learning in storage systems
The answer to that question is an unequivocal "yes," according to Tom Coughlin, founder of data storage consulting firm Coughlin Associates. "Endurance is down significantly with 3D NAND, so there is a need for better endurance," he says. "One way to get costs down is to get endurance back up. This technology may also help compensate for differences that show up in manufacturing, leading to higher fab yields and therefore lower costs."
And Coyle says that higher endurance is particularly important in the growing embedded device market, where replacing storage devices is difficult. "You don't want to have to throw a car away just because the storage chips are no good anymore."
He adds that for cloud providers offering solid state storage as a service these sorts of endurance gains mean that they can keep making money from their SSD assets for much longer – generating a far higher return on them.
Coyle also points out that for hyperscale users, this use of machine learning in storage systems can allow them to run software in SSD controllers that divides the SSD's life into various stages. It would monitor the SSD, looking at how long it has been running, the number of cycles it has performed, the error rates and so on, and when certain thresholds are reached it could start using new register settings. This could ensure that endurance is maximized, or it could be used to convert high performance SSDs into longer lasting but slower ones as they age.
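A staged scheme of this kind might look something like the Python sketch below. It is an assumption-laden illustration rather than real controller firmware: the stage names, the thresholds and the idea of tracking only cycle count and raw bit error rate are all hypothetical, and each stage name stands in for a full set of register settings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """A life stage and the limits beyond which the drive moves to the next one."""
    name: str
    max_pe_cycles: float       # leave this stage once cycles exceed this
    max_bit_error_rate: float  # ...or once the raw bit error rate exceeds this


# Hypothetical stages; the last one has no exit thresholds.
STAGES = [
    Stage("early_life_performance", 3_000, 1e-4),
    Stage("mid_life_balanced", 10_000, 1e-3),
    Stage("late_life_retention", math.inf, math.inf),
]


def next_stage(current_index, pe_cycles, bit_error_rate, stages=STAGES):
    """Return the stage index the drive should be in, never moving backwards."""
    index = current_index
    while index < len(stages) - 1 and (
        pe_cycles > stages[index].max_pe_cycles
        or bit_error_rate > stages[index].max_bit_error_rate
    ):
        index += 1
    return index


# A drive in its first stage that has passed the cycle threshold moves on.
stage = next_stage(0, pe_cycles=5_000, bit_error_rate=1e-6)
print(STAGES[stage].name)
```

The function only ever advances the stage, on the assumption that wear is not reversible, so a temporarily low error rate never sends an aged drive back to its early-life settings.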
The final question that's worth asking is how effective this machine learning in storage systems technology is in practice. What are the potential gains? "I have seen twenty fold increases in endurance with some trade-offs, but realistically five to seven times endurance gains are what is probably possible," concludes Coughlin.