What is Hadoop?

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. Basically, it's a way of storing enormous data sets across clusters of servers and then running distributed analysis applications on each cluster.

Hadoop is designed to run on a large number of machines that don’t share any memory or disks. That means you can buy a bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you load your organization’s data into Hadoop, the software busts that data into pieces and spreads them across your different servers. Hadoop keeps track of where the data resides. And because multiple copies of each piece are stored, data held on a server that goes offline can be automatically replicated from a known good copy.
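To make the split-and-replicate idea concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API: the function names, the server names, and the simple round-robin placement are all assumptions made for illustration, and the real placement policy in Hadoop is more involved (it takes rack layout into account, for instance).

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the input into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, servers: list, replication: int) -> dict:
    """Assign every block to `replication` distinct servers, round-robin.

    Returns a map of block id -> list of servers holding a copy. This map
    plays the role of the bookkeeping that tracks where the data resides.
    """
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def handle_server_loss(placement: dict, servers: list, lost: str) -> dict:
    """Forget the lost server and re-copy each affected block elsewhere."""
    survivors = [s for s in servers if s != lost]
    for holders in placement.values():
        if lost in holders:
            holders.remove(lost)
            # Copy from a surviving replica to the first server lacking one.
            for candidate in survivors:
                if candidate not in holders:
                    holders.append(candidate)
                    break
    return placement


# Hypothetical four-server cluster keeping three copies of every block.
servers = ["s0", "s1", "s2", "s3"]
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), servers, replication=3)
handle_server_loss(placement, servers, lost="s1")
```

After the simulated loss of `s1`, every block is again held by three distinct servers, which is the property the paragraph above describes.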


The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.


If you remember nothing else about Hadoop, keep this in mind: it has two main parts, a data processing framework and a distributed filesystem for data storage.

Hadoop Distributed Filesystem (HDFS)

The distributed filesystem is the Hadoop component that holds the actual data, spread across the storage of the machines in a cluster. By default, Hadoop uses the Hadoop Distributed File System (HDFS), although it can use other file systems as well.

Data Processing Framework & MapReduce

The data processing framework is the tool used to work with the data itself. By default, this is the Java-based system known as MapReduce. Hadoop is not really a database: it stores data and you can pull data out of it, but there are no queries involved, SQL or otherwise. Hadoop is more of a data warehousing system, so it needs a system like MapReduce to actually process the data. MapReduce runs as a series of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed. Using MapReduce instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of complexity.
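To give a feel for what a MapReduce job does, here is a minimal word-count sketch that simulates the map, shuffle, and reduce steps in plain Python on a single machine. Real Hadoop jobs are written against the Java MapReduce API and run across the cluster; none of the function names below come from Hadoop, they are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run the three steps in order and return a word -> count mapping."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: in Hadoop, these lines would be read from the stored data.
counts = run_job(["the quick fox", "the lazy dog", "The end"])
```

Even for a task this small, the work has to be expressed as separate map and reduce functions rather than a one-line query, which is the trade of flexibility for complexity described above.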

What is it good for?

Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine, but Hadoop can handle them. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the kind of platform Google built. Those are just a couple of examples.

Main Vendors

Amazon Web Services

Amazon offers a version of Apache Hadoop on its EC2 infrastructure.


Cloudera distributes a platform of open source Apache projects called Cloudera's Distribution including Apache Hadoop, or CDH. In addition, Cloudera offers its enterprise customers a family of products and services that complement the open-source Apache Hadoop platform. These include comprehensive training sessions, architectural services, and technical support for Hadoop clusters in development or in production. Cloudera serves a wide range of customers, including retail, government, financial services, healthcare, life sciences, digital media, advertising, networking, and telephony enterprises.


IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. BigInsights Enterprise Edition builds on Apache Hadoop, adding capabilities meant to withstand the demands of an enterprise.


The Intel® Distribution for Apache Hadoop is a product based on Apache Hadoop 1, containing optimizations for Intel's latest CPUs and chipsets. It includes the Intel® Manager for Apache Hadoop for managing a cluster.


Hortonworks develops, distributes, and supports a 100% open source distribution of Apache Hadoop for the enterprise, and also offers training, support, and services.