Apache™ Hadoop® notes
Russell Bateman
Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
Hadoop consists of (a) common libraries and utilities used by the other Hadoop modules, (b) a distributed filesystem (HDFS) that stores data across the (commodity) machines, (c) a resource-management platform (YARN) for managing cluster resources and scheduling user applications, and (d) a programming model (MapReduce) for large-scale data processing.
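To make (b) concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (hdfs://localhost:9000) and the path /tmp/hello.txt are assumptions for illustration only; adjust them to your cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsSketch
    {
        public static void main( String[] args ) throws Exception
        {
            Configuration conf = new Configuration();

            // assumption: a NameNode is reachable at this address
            conf.set( "fs.defaultFS", "hdfs://localhost:9000" );

            FileSystem fs   = FileSystem.get( conf );
            Path       path = new Path( "/tmp/hello.txt" );   // hypothetical path

            // write a small file into the distributed filesystem...
            try( FSDataOutputStream out = fs.create( path, true ) )
            {
                out.write( "Hello, HDFS\n".getBytes( StandardCharsets.UTF_8 ) );
            }

            // ...then read it back
            try( BufferedReader in = new BufferedReader(
                    new InputStreamReader( fs.open( path ), StandardCharsets.UTF_8 ) ) )
            {
                System.out.println( in.readLine() );
            }
        }
    }

The same API works against the local filesystem if fs.defaultFS is left at its default, which is handy for experimenting before a cluster is available.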
MapReduce is a programming model for processing large data sets with parallel, distributed algorithms on a cluster: a job is split into map tasks, which transform input records into key-value pairs, and reduce tasks, which aggregate the values emitted for each key.
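A minimal sketch of the model is the canonical word count: the mapper emits ( word, 1 ) for every word it sees, and the reducer sums the counts per word. Class names and the input/output paths taken from the command line are illustrative, not part of Hadoop itself.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount
    {
        // map: emit ( word, 1 ) for every word in the input split
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
        {
            private static final IntWritable ONE  = new IntWritable( 1 );
            private final        Text        word = new Text();

            @Override
            public void map( Object key, Text value, Context context ) throws IOException, InterruptedException
            {
                StringTokenizer tokenizer = new StringTokenizer( value.toString() );
                while( tokenizer.hasMoreTokens() )
                {
                    word.set( tokenizer.nextToken() );
                    context.write( word, ONE );
                }
            }
        }

        // reduce: sum the counts emitted for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
        {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce( Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException
            {
                int sum = 0;
                for( IntWritable value : values )
                    sum += value.get();
                result.set( sum );
                context.write( key, result );
            }
        }

        public static void main( String[] args ) throws Exception
        {
            Job job = Job.getInstance( new Configuration(), "word count" );
            job.setJarByClass( WordCount.class );
            job.setMapperClass( TokenizerMapper.class );
            job.setCombinerClass( IntSumReducer.class );
            job.setReducerClass( IntSumReducer.class );
            job.setOutputKeyClass( Text.class );
            job.setOutputValueClass( IntWritable.class );
            FileInputFormat.addInputPath( job, new Path( args[ 0 ] ) );     // input directory
            FileOutputFormat.setOutputPath( job, new Path( args[ 1 ] ) );   // output directory (must not exist)
            System.exit( job.waitForCompletion( true ) ? 0 : 1 );
        }
    }

Using the reducer as a combiner, as above, is a common optimization: partial sums are computed on the map side before data crosses the network.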
Hadoop is distributed as a tarball containing dæmons, executables and other binaries. It is downloaded from the project site, Welcome to Apache™ Hadoop®, which is also the best place to read about what Hadoop is.