According to popular articles, Hadoop uses parallelism to upload split data and thereby addresses the Velocity problem. Is that true?

Mangesh Prakash Jadhav
Nov 18, 2020

Hadoop Cluster Setup :

To study this, I created a Hadoop cluster in the AWS cloud with 1 namenode and 3 datanodes; my laptop works as the Hadoop client. From my laptop I can upload and read files.

After creating the cluster, start the Hadoop services on the namenode and on all datanodes. Now we can upload a file from the client.

All nodes are in remote locations and connected through a network (the internet), so some networking concepts apply here. Data sent or received over a network travels in packets, and the data carried inside a packet is called the payload.

In a Hadoop cluster, the data uploaded from the client also arrives in packets, so we need to capture these packets. For that we use a program named ‘tcpdump’.

Before uploading the file from the client, run the following command on the namenode and on each datanode: ‘# tcpdump -i <name of ethernet card> -n -X tcp port <port number used in namenode> > <name of the file>’.

This command captures all TCP packets going to or coming from the port number given in the command and stores the output in a file, so we can analyze it later. The -n option turns off name resolution, so addresses appear as numbers, and -X prints the contents of each packet in hex and ASCII.

File Uploading :

Now we can upload the file from the client using the command ‘# hadoop fs -put <nameoffile> <location>’, and check whether the file was uploaded with ‘# hadoop fs -ls /’.

Analysis of Packets :

The file I uploaded to the cluster contains the following data.

Packets in namenode :

When I analyzed these packets, I found that the client does not upload any of the file's data to the namenode.

Packets in datanode2 :

Packets in datanode3 :

Packets in datanode1 :

When I analyzed all the packets, I found that the namenode does not send any file data to the datanodes, and the client does not upload data to all datanodes in parallel. The client first uploads the data to one datanode, that datanode then sends the data to another datanode, and so on.
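One way to check who sent file data to whom is to add up the payload bytes per sender and receiver in each capture file. The sketch below assumes the default text output of tcpdump with -n (one header line per packet ending in "length <bytes>", followed by hex dump lines when -X is used). The function name summarize_flows is only illustrative and is not part of Hadoop or tcpdump.

```shell
#!/bin/sh
# Sum TCP payload bytes per "source IP -> destination IP" pair from the
# text output of `tcpdump -n` (hex dump lines from -X are ignored).
# Reads the capture from the files given as arguments, or from stdin.
summarize_flows() {
    awk '
    $2 == "IP" && $4 == ">" {
        src = $3; dst = $5
        sub(/:$/, "", dst)          # drop the colon after the destination
        sub(/\.[0-9]+$/, "", src)   # drop the port, keep the IPv4 address
        sub(/\.[0-9]+$/, "", dst)
        len = 0
        for (i = 6; i < NF; i++)
            if ($i == "length") { len = $(i + 1) + 0; break }
        # Packets with length 0 (pure ACKs, handshakes) carry no payload.
        if (len > 0) bytes[src " -> " dst] += len
    }
    END {
        for (pair in bytes) print pair, bytes[pair], "bytes"
    }' "$@" | sort
}
```

Running it on the file saved on each node, for example `summarize_flows datanode2.txt` (a hypothetical file name), lists which addresses actually sent payload to that node and how much.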

In my case, the client first sent the data to datanode 2, then datanode 2 sent it to datanode 3, and then datanode 3 sent it to datanode 1. This means the upload does not use parallelism.
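The order described above can be reconstructed mechanically once each capture has been reduced to "sender receiver" pairs. This is a minimal sketch under that assumption; the function name trace_pipeline and the node labels are only illustrative.

```shell
#!/bin/sh
# Read "sender receiver" pairs from stdin and print the chain of hops
# starting at the node named in the first argument.
trace_pipeline() {
    awk -v node="$1" '
    { next_hop[$1] = $2 }
    END {
        path = node
        # Stop at the end of the chain, or when a node repeats (loop guard).
        while ((node in next_hop) && !(node in seen)) {
            seen[node] = 1
            node = next_hop[node]
            path = path " -> " node
        }
        print path
    }'
}

# The pairs observed in this experiment, in no particular order.
printf '%s\n' \
    'client datanode2' \
    'datanode3 datanode1' \
    'datanode2 datanode3' | trace_pipeline client
# prints: client -> datanode2 -> datanode3 -> datanode1
```

A single chain with one receiver per sender is what a sequential, node-to-node transfer looks like. If the client had uploaded to all datanodes in parallel, the client would appear as the sender in three separate pairs instead.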

Conclusion :

The claim that Hadoop uses parallelism to upload the split data while addressing the Velocity problem is not true.
