Nov 15, 2011

How is Hadoop able to provide such high capacity at lower cost than relational database deployments?

What is so different about Hadoop that lets it keep its cost per terabyte of hard disk capacity so much lower than other NoSQL processing platforms?  One figure I just saw was that Hadoop can support 4TB of hard disk capacity per node, with each node costing around $4,000, while other choices were averaging $10,000-$12,000 per terabyte.  That is a huge difference, and I'm wondering exactly why.

Hi JOiseau,

Here's a brief blurb on why Hadoop itself is so cheap, compared to other software solutions.

Why the Hoopla over Hadoop?

"Hadoop is cheap for two reasons. First, it's open source, which means no proprietary licensing fees. Hadoop is an Apache Software Foundation project, created by Doug Cutting and Mike Cafarella, who implemented an idea described in papers by Google. Apache maintains a list of Hadoop distributions, and IBM has announced plans to offer its own distribution.

"Second, Hadoop doesn't require special processors or expensive hardware. As Cutting explained, you just need “something with an Intel processor, a networking card, and usually four or so hard drives in it.”"

The Hadoop Distributed File System (HDFS) is one key to the lower cost, along with the Hive distributed data warehouse built on top of it.  Most people know that Hadoop is very scalable; what that scalability buys you is the ability to spread big data processing tasks across many nodes built from cheap x86 servers.  Each node runs about $4,000 because of the low price of that commodity hardware, whereas most relational database deployments come in at around $10,000 to $12,000 per terabyte, as you noted.
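To make the gap concrete, here is a minimal back-of-the-envelope sketch in Python using only the figures quoted in this thread; the node price, node capacity, and relational range are illustrative assumptions, not vendor pricing.

# Rough cost-per-terabyte comparison using the figures from this thread.
# All numbers are illustrative assumptions, not actual quotes.

def cost_per_tb(total_cost_usd, capacity_tb):
    """Return hardware cost per terabyte of raw disk capacity."""
    return total_cost_usd / capacity_tb

# Hadoop: ~$4,000 commodity x86 node holding ~4 TB of disk
hadoop = cost_per_tb(total_cost_usd=4_000, capacity_tb=4)

# Relational deployments quoted in the question: $10,000-$12,000 per TB
relational_low, relational_high = 10_000, 12_000

print(f"Hadoop node:      ~${hadoop:,.0f} per TB")   # ~$1,000 per TB
print(f"Relational range: ~${relational_low:,} - ${relational_high:,} per TB")
print(f"Rough multiple:   {relational_low / hadoop:.0f}x - {relational_high / hadoop:.0f}x more")

One caveat: raw disk capacity isn't the same as usable capacity. HDFS replicates each block (three copies by default), so the effective cost per usable terabyte is several times the raw figure, though still well below the quoted relational range.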
