Big data requires big file systems. That's where the open source GlusterFS file system is aiming to fit in with the upcoming GlusterFS 3.3 release.
The Gluster project is out this week with the second beta release of GlusterFS 3.3; the final release is expected before the end of the year. The new release provides an integration point for Apache Hadoop, enabling Hadoop users to use Gluster for storage. According to Gluster, its file system is also compatible with Hadoop's own HDFS (Hadoop Distributed File System), though Gluster provides some additional benefits, including scalability and performance improvements.
"GlusterFS 3.3 brings two new protocols into the file system," AB Periasamy, CTO and co-founder of Gluster, told InternetNews.com. "One of them is the object protocol so you can access the data as objects, similar to the Amazon S3 protocol."
Periasamy noted that the second protocol is an HDFS-compatible API.
"So you can do big data applications and MapReduce on Gluster," Periasamy said.
As to why Gluster is adding support for Hadoop now, Periasamy cited a number of reasons. He said that the entire stack is converging in the market. Previously there were separate pools of storage, such as SAN and NAS, and each one was tailored for specific types of applications.
"We see that object storage is emerging now as another alternative to storing long term unstructured data," Periasamy said. "We are now able to easily scale up and access storage across the Internet."
In terms of HDFS and Hadoop, Periasamy noted that with Hadoop there is data and then there are applications. He explained that the Hadoop MapReduce framework has enabled a wide range of applications and is growing into a powerful ecosystem.
"The storage engine was initially written just to handle certain workloads," Periasamy said. "The metadata server is also a hard bottleneck."
He explained that with HDFS, all the metadata has to be stored centrally in a single system's memory, which is a performance bottleneck for scale-out. In contrast, Periasamy noted that Gluster already has a powerful storage engine with no such metadata bottleneck.
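To make the contrast concrete, here is a minimal Python sketch of the two lookup styles. It is an illustration only: the article does not describe how Gluster locates files, and hash-based placement is used here simply as one common way to avoid a central lookup table. The server names, the `CentralMetadataServer` class, and the `locate_by_hash` function are all hypothetical.

```python
import hashlib

# Hypothetical storage nodes, named only for this illustration.
SERVERS = ["server-a", "server-b", "server-c"]


class CentralMetadataServer:
    """Toy model: one in-memory table maps every file path to its location."""

    def __init__(self):
        self.table = {}
        self.lookups = 0  # counts requests that had to pass through this node

    def register(self, path, server):
        self.table[path] = server

    def locate(self, path):
        # Every client lookup lands on this single process.
        self.lookups += 1
        return self.table[path]


def locate_by_hash(path, servers=SERVERS):
    """Toy model: any client derives the location from the path alone.

    Assumption for illustration; not taken from the article.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

In the first model, the lookup table and the request load both grow on one machine as files and clients are added. In the second, each client computes the answer locally, so no single node has to hold or serve all of the metadata.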
"We addressed the big data market from the storage perspective," Periasamy said. "The Hadoop project addressed the big data problem from the analytics perspective."
As such, in Periasamy's view, it makes perfect sense for the Hadoop community to collaborate with Gluster and take advantage of its big data storage backend. He explained that Hadoop itself was already modular enough to enable the Gluster file system to plug into it as well.
Additionally, Periasamy noted that Gluster has a replication model that enables better geographical scale-out possibilities for Hadoop. Periasamy explained that Gluster has a geo-replication module that keeps multiple sites in sync. The replication model does not rely on snapshots; rather, it synchronizes data as it changes.
"We have the ability to synchronize the bits when they change and that enables us to have a continuous geographic replication," Periasamy said.
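The idea of shipping only changed data, rather than whole snapshots, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Gluster's geo-replication code: the fixed block size, the hash comparison, and the function names `changed_blocks` and `apply_changes` are all hypothetical choices made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def block_digests(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of a byte string."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return (index, bytes) for each block of `new` that differs from `old`."""
    old_digests = block_digests(old, block_size)
    changes = []
    for index, start in enumerate(range(0, len(new), block_size)):
        block = new[start:start + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_digests) or old_digests[index] != digest:
            changes.append((index, block))
    return changes


def apply_changes(replica, changes, new_length, block_size=BLOCK_SIZE):
    """Patch the remote copy using only the blocks that changed."""
    # Truncate or zero-pad the replica to the new length, then overwrite.
    buf = bytearray(replica[:new_length].ljust(new_length, b"\0"))
    for index, block in changes:
        start = index * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


primary_before = b"aaaabbbbcccc"
primary_after = b"aaaaXXXXccccdd"
delta = changed_blocks(primary_before, primary_after)
replica = apply_changes(primary_before, delta, len(primary_after))
```

Here only the second block and the newly appended tail cross the wire, and the replica ends up identical to the primary. A snapshot-based approach would instead copy a full point-in-time image on each cycle.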
Moving forward, Periasamy said that geo-replication will continue to be enhanced in the next release of Gluster as well.
"In GlusterFS 3.4 we will enable you to have active failover from one site to another," Periasamy said.