1. Introduction
Humanity produces staggering amounts of data every day, with the world more connected than ever through social media platforms, messaging apps, and audio/video streaming services.
In fact, it’s widely estimated that around 90% of all existing data was generated in just the last few years.
Since data has become strategically important in the modern world and most of it exists in an unstructured format, we need a framework capable of processing such vast data sets – at the scale of petabytes and beyond – commonly referred to as Big Data.
In this tutorial, we’ll explore Apache Hadoop, a widely recognized technology for handling big data that offers reliability, scalability, and efficient distributed computing.
2. What Is Apache Hadoop?
Apache Hadoop is an open-source framework for storing and processing large-scale datasets in a distributed computing environment. It’s designed to scale from a single server to thousands of machines, each offering local computation and storage.
The framework leverages the power of cluster computing – processing data in small chunks across multiple servers in a cluster.
Also, the flexibility of Hadoop in working with many other tools makes it the foundation of modern big data platforms – providing a reliable and scalable way for businesses to gain insights from their growing data.
3. Core Components of Hadoop
Hadoop’s four core components form the foundation of its system, enabling distributed data storage and processing.
3.1. Hadoop Common
Hadoop Common is a set of shared Java libraries and utilities that the other Hadoop modules depend on.
3.2. Hadoop Distributed File System (HDFS)
HDFS is a distributed file system built for large datasets that provides high-throughput access to data, along with high availability and fault tolerance.
Simply, it’s the storage component of Hadoop, storing large amounts of data across multiple machines and capable of running on standard hardware, making it cost-effective.
3.3. Hadoop YARN
YARN, an acronym for Yet Another Resource Negotiator, provides a framework for scheduling jobs and managing the system resources of the distributed cluster.
In a nutshell, it’s the resource management component of Hadoop, which manages the resources utilized for processing the data stored in HDFS.
3.4. Hadoop MapReduce
A simple programming model that processes data in parallel: the framework splits the input across the nodes, the map phase transforms each chunk into intermediate key-value pairs, and the reduce phase aggregates those pairs into the final output.
So, in layman’s terms, it’s the brain of Hadoop, providing the primary processing engine capabilities in two phases – mapping and reducing.
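Before running a real job, we can build some intuition for these two phases with a rough single-machine analogy using standard Unix tools on the text files in the current directory – this only mimics the model (tokenizing stands in for the map phase, sorting for the shuffle, and counting for the reduce phase) and doesn’t involve Hadoop at all:
# "map": emit one word per line from the input files
# "shuffle": sorting groups identical words together
# "reduce": count the occurrences of each word
cat *.txt | tr -s ' ' '\n' | sort | uniq -c
Hadoop applies the same idea, but distributes the map and reduce tasks across the machines in the cluster and handles the shuffling between them.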
4. Setup
Hadoop is supported on the GNU/Linux platform, so let’s set up our Hadoop cluster on a Linux OS.
4.1. Prerequisites
First, we need Java 8 or 11 installed for the latest Apache Hadoop, and we can follow the official recommendations on supported Java versions.
Next, we need to install SSH and ensure that the sshd service is running, since the Hadoop scripts use it to manage the Hadoop daemons.
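For instance, on a Debian-based system, checking the Java installation and setting up passphraseless SSH to localhost (as the single-node setup guide recommends) could look like this – the package manager commands will differ on other distributions:
# verify the installed Java version
java -version

# install SSH (package names assume a Debian-based distribution)
sudo apt-get install ssh pdsh

# enable passphraseless SSH to localhost for the Hadoop scripts
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost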
4.2. Download and Install Hadoop
We can follow the detailed guide to install and configure Hadoop in Linux.
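As a minimal sketch – assuming we download from the Apache archive, install under /usr/local/hadoop (the path that appears in the version output below), and have OpenJDK 11 in its usual location – the installation boils down to extracting the tarball and exporting a few environment variables:
# download and extract Hadoop 3.4.0 (the mirror URL is an assumption)
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar -xzf hadoop-3.4.0.tar.gz
sudo mv hadoop-3.4.0 /usr/local/hadoop

# point Hadoop at the Java installation and put the Hadoop binaries on the PATH
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin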
Once setup is complete, let’s run the following command to verify the version of the installed Hadoop:
hadoop version
Here’s the output of the above command:
Hadoop 3.4.0
Source code repository git@github.com:apache/hadoop.git -r bd8b77f398f626bb7791783192ee7a5dfaeec760
Compiled by root on 2024-03-04T06:29Z
Compiled on platform linux-aarch_64
Compiled with protoc 3.21.12
From source with checksum f7fe694a3613358b38812ae9c31114e
This command was run using /usr/local/hadoop/common/hadoop-common-3.4.0.jar
Notably, Hadoop supports three operation modes – standalone, pseudo-distributed, and fully-distributed. However, by default, it’s configured to run in the standalone (local, non-distributed) mode, which essentially runs everything as a single Java process.
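For reference, switching to the pseudo-distributed mode mainly means pointing the default file system at a local HDFS instance and dropping the replication factor to one, along the lines of the official single-node setup guide – here’s a minimal sketch of the two relevant configuration files:
<!-- etc/hadoop/core-site.xml -->
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
After that, we format the NameNode with hdfs namenode -format and bring HDFS up with the start-dfs.sh script.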
5. Basic Operations
Once our Hadoop cluster is up and running, we can perform many operations.
5.1. HDFS Operations
Let’s check out a few handy operations for managing files and directories using the HDFS command-line interface.
For instance, we can upload a file to HDFS:
hdfs dfs -put /local_file_path /hdfs_path
Similarly, we can download the file:
hdfs dfs -get /hdfs_file_path /local_path
Let’s list all the files in the HDFS directory:
hdfs dfs -ls /hdfs_directory_path
And, here’s how we can read the content of a file at the HDFS location:
hdfs dfs -cat /hdfs_file_path
Also, this command checks the HDFS disk usage:
hdfs dfs -du -h /hdfs_path
Furthermore, HDFS provides other useful commands, like -mkdir to create a directory, -rm to delete a file or directory, and -mv to move or rename a file.
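For instance, with placeholder paths, those commands look like this (-p creates missing parent directories, and -r removes a directory recursively):
hdfs dfs -mkdir -p /hdfs_new_directory
hdfs dfs -mv /hdfs_source_path /hdfs_target_path
hdfs dfs -rm -r /hdfs_directory_path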
5.2. Running MapReduce Job
The Hadoop distribution includes a few simple, introductory examples to explore MapReduce, bundled in the hadoop-mapreduce-examples-3.4.0 jar file.
For instance, let’s look at WordCount, a simple app that scans the given input files and extracts the number of occurrences of each word as output.
First, we’ll create two text files – textfile01.txt and textfile02.txt – in the input directory with some content:
mkdir input
echo "Introduction to Apache Hadoop" > input/textfile01.txt
echo "Running MapReduce Job" > input/textfile02.txt
Then, let’s run the MapReduce job and create output files:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount input output
The output log shows the mapping and reducing operations among other tasks:
2024-09-22 12:54:39,592 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-09-22 12:54:39,592 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2024-09-22 12:54:39,722 INFO input.FileInputFormat: Total input files to process : 2
2024-09-22 12:54:39,752 INFO mapreduce.JobSubmitter: number of splits:2
2024-09-22 12:54:39,835 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1009338515_0001
2024-09-22 12:54:39,835 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-09-22 12:54:39,917 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2024-09-22 12:54:39,918 INFO mapreduce.Job: Running job: job_local1009338515_0001
2024-09-22 12:54:39,959 INFO mapred.MapTask: Processing split: file:/Users/anshulbansal/work/github_examples/hadoop/textfile01.txt:0+30
2024-09-22 12:54:39,984 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2024-09-22 12:54:39,984 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2024-09-22 12:54:39,984 INFO mapred.MapTask: soft limit at 83886080
2024-09-22 12:54:39,984 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2024-09-22 12:54:39,984 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2024-09-22 12:54:39,985 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2024-09-22 12:54:39,998 INFO mapred.LocalJobRunner:
2024-09-22 12:54:39,999 INFO mapred.MapTask: Starting flush of map output
2024-09-22 12:54:39,999 INFO mapred.MapTask: Spilling map output
2024-09-22 12:54:39,999 INFO mapred.MapTask: bufstart = 0; bufend = 46; bufvoid = 104857600
2024-09-22 12:54:39,999 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214384(104857536); length = 13/6553600
2024-09-22 12:54:40,112 INFO mapred.LocalJobRunner: Finishing task: attempt_local1009338515_0001_r_000000_0
2024-09-22 12:54:40,112 INFO mapred.LocalJobRunner: reduce task executor complete.
2024-09-22 12:54:40,926 INFO mapreduce.Job: Job job_local1009338515_0001 running in uber mode : false
2024-09-22 12:54:40,928 INFO mapreduce.Job: map 100% reduce 100%
2024-09-22 12:54:40,929 INFO mapreduce.Job: Job job_local1009338515_0001 completed successfully
2024-09-22 12:54:40,936 INFO mapreduce.Job: Counters: 30
File System Counters
    FILE: Number of bytes read=846793
    FILE: Number of bytes written=3029614
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=2
    Map output records=7
    Map output bytes=80
    Map output materialized bytes=106
    Input split bytes=264
    Combine input records=7
    Combine output records=7
    Reduce input groups=7
    Reduce shuffle bytes=106
    Reduce input records=7
    Reduce output records=7
    Spilled Records=14
    Shuffled Maps =2
    Failed Shuffles=0
    Merged Map outputs=2
    GC time elapsed (ms)=4
    Total committed heap usage (bytes)=663748608
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=52
File Output Format Counters
    Bytes Written=78
Once the MapReduce job is over, the part-r-00000 text file is created in the output directory, containing each word and its number of occurrences:
hdfs dfs -cat output/part-r-00000
Here’s the content of the file:
Apache 1
Hadoop 1
Introduction 1
Job 1
MapReduce 1
Running 1
to 1
Similarly, we can check out other examples available in the hadoop-mapreduce-examples-3.4.0 jar file, like TeraSort to perform large-scale sorting of data, RandomTextWriter to generate random text data useful for benchmarking and testing, and Grep to search for matching strings in the input files.
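For instance, here’s one quick way to try these out – the row count and directory names below are arbitrary, and TeraGen is the companion example that generates input for TeraSort:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar teragen 1000 tera_input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar terasort tera_input tera_output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep input grep_output 'Hadoop'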
5.3. Manage Services on YARN
Now, let’s see a few operations to manage services on Hadoop. For instance, here’s the command to check the node status:
yarn node -list
Similarly, we can list all running applications:
yarn application -list
Here’s how we can deploy a service from its service definition file:
yarn app -launch service_name service_definition.json
Likewise, we can register a service definition without starting it right away:
yarn app -save service_name service_definition.json
Then, we can start, stop, and destroy the service respectively:
yarn app -start service_name
yarn app -stop service_name
yarn app -destroy service_name
Also, there are useful YARN administrative commands, like daemonlog to get or set a daemon’s log level, nodemanager to start a node manager, proxyserver to start the web proxy server, and timelineserver to start the timeline server.
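For example, assuming the ResourceManager web interface is listening on localhost:8088, we could check its log level and start the auxiliary servers like this:
yarn daemonlog -getlevel localhost:8088 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
yarn proxyserver
yarn timelineserver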
6. Hadoop Ecosystem
Hadoop stores and processes large datasets, but it relies on supporting tools for tasks like data ingestion, analysis, and extraction. Let’s list some important tools within the Hadoop ecosystem:
- Apache Ambari: a web-based cluster management tool that simplifies deployment, management, and monitoring of Hadoop clusters
- Apache HBase: a distributed, column-oriented database for real-time applications
- Apache Hive: a data warehouse infrastructure that allows SQL-like queries to manage and analyze large datasets
- Apache Pig: a high-level scripting language for easy implementation of data analysis tasks
- Apache Spark: a cluster computing framework for batch processing, streaming, and machine learning
- Apache ZooKeeper: a service that provides reliable, high-performance, and distributed coordination services for large-scale distributed systems
7. Conclusion
In this article, we explored Apache Hadoop, a framework offering scalable and efficient solutions for managing and processing Big Data – essential in today’s data-driven world.
We began by discussing its core components, including HDFS, YARN, and MapReduce, followed by the steps to set up a Hadoop cluster.
Lastly, we familiarized ourselves with basic operations within the framework, providing a solid foundation for further exploration.