WORLD OF HADOOP ON AWS THROUGH THE PATH OF TERRAFORM (HAT)

Rishabh Jain
Oct 21, 2020 · 7 min read

Hey, enthusiasts searching for cloud automation and integration of technologies: you are at the right place! Here I am going to show an awesome integration of Cloud and Big Data: I am going to launch a whole Hadoop Cluster on the AWS Cloud, without ever visiting the cloud console, in just one click, using Terraform.

This setup is named HAT (Hadoop on AWS with Terraform).

Introduction of Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Introduction of AWS

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, offering over 175 fully featured services from data centers globally. Millions of customers — including the fastest-growing startups, largest enterprises, and leading government agencies — are using AWS to lower costs, become more agile, and innovate faster.

Introduction of Terraform

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions. Configuration files describe to Terraform the components needed to run a single application or your entire datacenter. Terraform generates an execution plan describing what it will do to reach the desired state, and then executes it to build the described infrastructure.

Before starting the creation of the whole Hadoop Cluster, here is a reference blog for those who are new to Terraform and its integration with the cloud.

The prerequisites for the setup are listed below:

  1. AWS CLI V2.
  2. Terraform Installed and Executable.
  3. AWS Cloud Account.

Now let us start building the whole Hadoop Cluster, beginning with the Hadoop Master Node. I will provide the GitHub links to the files used, along with the Terraform code that launches the whole Hadoop Cluster on the cloud in just one click.

1. MASTER NODE

Here is the Terraform code written to launch the Hadoop Master Node on the cloud.
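The exact files are in the GitHub repository linked at the end of this post. To give an idea of what the master code does, here is a minimal sketch, not the repository's exact contents: the AMI ID, key pair, security settings, RPM file names and Hadoop 1.x paths are assumptions, and the JDK and Hadoop RPMs (from the repositories linked below) are assumed to already be on the instance, for example copied with a file provisioner.

```hcl
# Minimal sketch of a master (NameNode) launch. All names below are assumptions;
# security group / firewall settings are omitted for brevity.

provider "aws" {
  region  = "ap-south-1"
  profile = "default"
}

resource "aws_instance" "master" {
  ami           = "ami-0447a12f28fddb066"   # assumed Amazon Linux 2 AMI
  instance_type = "t2.micro"
  key_name      = "hadoopkey"               # an existing EC2 key pair

  tags = {
    Name = "master"
  }

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hadoopkey.pem")
    host        = self.public_ip
  }

  # Configure the instance as the NameNode: install the JDK and Hadoop RPMs
  # (assumed already present on the instance), point dfs.name.dir at /nn,
  # format the NameNode and start the daemon.
  provisioner "remote-exec" {
    inline = [
      "sudo rpm -i jdk-8u171-linux-x64.rpm",
      "sudo rpm -i hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo mkdir -p /nn",
      "echo '<configuration><property><name>dfs.name.dir</name><value>/nn</value></property></configuration>' | sudo tee /etc/hadoop/hdfs-site.xml",
      "echo '<configuration><property><name>fs.default.name</name><value>hdfs://0.0.0.0:9001</value></property></configuration>' | sudo tee /etc/hadoop/core-site.xml",
      "echo Y | sudo hadoop namenode -format",
      "sudo hadoop-daemon.sh start namenode",
    ]
  }
}

output "master_public_ip" {
  value = aws_instance.master.public_ip
}
```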

After the code is written, run it with the three commands listed below.

  1. terraform init => initializes the working directory and downloads the provider plugins required by the code.
  2. terraform validate => checks the written configuration for syntax and consistency errors.
  3. terraform apply => creates the described resources on the cloud.

You can also use the command terraform apply -auto-approve. It works the same as terraform apply, except that it skips the interactive confirmation prompt.
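Put together, the workflow looks like this on the command line (run from the directory that contains the .tf file):

```
terraform init                  # download the provider plugins
terraform validate              # check the code for errors
terraform apply                 # review the plan and type yes to create the resources
terraform apply -auto-approve   # same as apply, but skips the confirmation prompt
```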

If all goes well, you will see the following output after running the code.

Terraform Init

OUTPUT FOR TERRAFORM INIT

Terraform Validate

OUTPUT FOR TERRAFORM VALIDATE

Terraform Apply

After terraform apply completes, you will see in the AWS EC2 console that an instance named master has been created, as shown below.

MASTER INSTANCE CREATED IN AWS CLOUD

The code is written so that you do not need to visit the cloud console at all; the dashboard of the Name Node (Master Node) opens automatically, as shown below.

MASTER NODE DASHBOARD
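One way to get the "dashboard opens by itself" behaviour described above is a local-exec provisioner that runs once the master instance is up. This is only a sketch building on the master example earlier: the browser command is an assumption (it assumes a Windows machine with Chrome; use xdg-open or open on Linux/macOS), and 50070 is the default NameNode web UI port for Hadoop 1.x.

```hcl
# Sketch: open the NameNode dashboard automatically after the master is created.
resource "null_resource" "open_dashboard" {
  depends_on = [aws_instance.master]

  provisioner "local-exec" {
    # 50070 is the Hadoop 1.x NameNode web UI port (9870 on Hadoop 3.x).
    command = "start chrome http://${aws_instance.master.public_ip}:50070"
  }
}
```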

2. SLAVE NODE

Here is the Terraform code to launch the 3 Hadoop Slave Nodes on the cloud.
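Again, the exact code is in the linked repository; a minimal sketch of the idea is shown here. It assumes the same AMI, key pair and RPMs as the master sketch, and takes the master's public IP as a variable so the DataNodes know which NameNode to join.

```hcl
# Sketch: three identical DataNodes pointed at the master. Names are assumptions.
# (provider "aws" block same as in the master code)

variable "master_ip" {
  type        = string
  description = "Public IP of the NameNode launched earlier"
}

resource "aws_instance" "slave" {
  count         = 3
  ami           = "ami-0447a12f28fddb066"
  instance_type = "t2.micro"
  key_name      = "hadoopkey"

  tags = {
    Name = "slave-${count.index + 1}"
  }

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hadoopkey.pem")
    host        = self.public_ip
  }

  # Install the RPMs, point dfs.data.dir at /dn, register with the NameNode
  # and start the DataNode daemon.
  provisioner "remote-exec" {
    inline = [
      "sudo rpm -i jdk-8u171-linux-x64.rpm",
      "sudo rpm -i hadoop-1.2.1-1.x86_64.rpm --force",
      "sudo mkdir -p /dn",
      "echo '<configuration><property><name>dfs.data.dir</name><value>/dn</value></property></configuration>' | sudo tee /etc/hadoop/hdfs-site.xml",
      "echo '<configuration><property><name>fs.default.name</name><value>hdfs://${var.master_ip}:9001</value></property></configuration>' | sudo tee /etc/hadoop/core-site.xml",
      "sudo hadoop-daemon.sh start datanode",
    ]
  }
}
```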

After the code is written, run the same three commands as before (terraform init, terraform validate, and terraform apply) to deploy it.

If all goes well, you will see the following output after running the code.

Terraform Init

Terraform Validate

Terraform Apply

After this, you can see on the cloud that three instances have been created and connected to the Master Node, as shown below.

3 SLAVE NODE INSTANCES CREATED

You can check on the dashboard of the Master Node that the Data Nodes are connected to the Name Node.

DATANODES CREATED AND CONNECTED
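If you prefer the terminal over the dashboard, the same check can be done from the master node with a standard Hadoop admin command (hadoop dfsadmin on Hadoop 1.x, hdfs dfsadmin on newer versions):

```
hadoop dfsadmin -report    # run on the master; lists every live DataNode and its capacity
```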

3. CLIENT NODE

Now let us come to the Client Node. Here we will be launching 2 Client Nodes on the cloud.

The Terraform code to launch the clients is shown below.
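As with the other nodes, the exact files are in the linked repository. A minimal sketch under the same assumptions: each client only needs core-site.xml to point at the NameNode, and in this sketch each one also uploads a small file to HDFS so we can verify it afterwards.

```hcl
# Sketch: two client nodes, each uploading one file to HDFS. Names are assumptions.
# (provider "aws" block and "master_ip" variable same as in the slave code)

resource "aws_instance" "client" {
  count         = 2
  ami           = "ami-0447a12f28fddb066"
  instance_type = "t2.micro"
  key_name      = "hadoopkey"

  tags = {
    Name = "client-${count.index + 1}"
  }

  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file("hadoopkey.pem")
    host        = self.public_ip
  }

  # Point the client at the NameNode and upload a test file to HDFS.
  provisioner "remote-exec" {
    inline = [
      "sudo rpm -i jdk-8u171-linux-x64.rpm",
      "sudo rpm -i hadoop-1.2.1-1.x86_64.rpm --force",
      "echo '<configuration><property><name>fs.default.name</name><value>hdfs://${var.master_ip}:9001</value></property></configuration>' | sudo tee /etc/hadoop/core-site.xml",
      "echo 'hello from client ${count.index + 1}' > client${count.index + 1}.txt",
      "hadoop fs -put client${count.index + 1}.txt /",
    ]
  }
}
```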

After the code is written, run the same three commands again (terraform init, terraform validate, and terraform apply) to deploy it.

If all goes well, you will see the following output after running the code.

Terraform Init

Terraform Validate

Terraform Apply

You can see that 2 client instances are now created on the cloud, as shown below.

2 CLIENTS CREATED

Moreover, you can verify it from the Master Node dashboard by browsing the file system, as shown below.

FILES CREATED BY EACH CLIENT
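The same verification works from the command line on the master or on any client, using the standard HDFS listing command:

```
hadoop fs -ls /    # shows the files uploaded by each client
```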

After the whole Hadoop Cluster is complete, you can tear the infrastructure down with terraform destroy or terraform destroy -auto-approve.

Here are the GitHub links to all the repositories used above.

1. GitHub Repository For All Codes and Files of Slave, Client Nodes

2. GitHub Repository For Hadoop File

3. GitHub Repository For JDK File.

4. GitHub Repository For Master Node Files.

You may face an error while pushing or pulling the code, as my account's storage quota is exhausted for this month.

So, you will have to pull the code and use your own repositories.

You can upload the content through the Reference link given below.

Thanks a lot!! I hope you learned from it and liked it.


Rishabh Jain

I am a tech enthusiast, researcher and an integration seeker. I love to explore and learn about the right technology and right concepts from its foundation.