1.Introduction:-

Hadoop has more or less become synonymous with ‘Big Data’ today. Hadoop is an open source project, and a number of vendors have developed their own distributions, adding new functionality or improving the code base. But the many distributions also make it harder to decide which one suits your needs. And why have vendor distributions at all when there is a ‘standard’ Hadoop distribution? Who are the major vendors, and how do they compare? Read on to find out.

2.Why Hadoop Distribution:-

Hadoop is Apache software which is freely available for download and use. So why do we need distributions at all?

  • Distributions package Hadoop into easy-to-install packages that system administrators can manage effectively.

  • Distributions bundle versions of components that work well together. This provides a working Hadoop installation right out of the box.

  • Distribution makers strive to ensure good quality components.

  • Sometimes they lead the way by including performance patches on top of the ‘vanilla’ versions, and they have predictable product release road maps.

  • They keep up with developments and bug fixes, and they come with support, which can be very valuable for a production-critical cluster.

3.Leading Hadoop Distribution Vendors in the Market:-

  • There are currently three leading Hadoop distributions in the market: (i) Cloudera, (ii) Hortonworks, and (iii) MapR.

  • Choosing among them is neither trivial nor especially hard. A basic analysis of how you approach your work and your data will help you pick the most suitable distribution.

  • If you’re anxious to test things out, all the vendors offer free versions, but each has some level of restriction, based either on functionality or on the number of nodes that can be added to a cluster.

  • If you need to get up and running really quickly, each vendor offers VM images with Linux and Hadoop already installed.

4.Which one to choose:-

  • It depends entirely on your business requirements, because Hadoop itself is free software licensed under the Apache License.

  • All these vendors provide patches and updates to the core Hadoop distribution, something that everyone benefits from.

  • So it’s best to turn your attention instead to the strengths and weaknesses of each product and the add-ons available for your use.

5.CLOUDERA vs HORTONWORKS vs MapR:-

  • There are many similarities among these three distributions.

  • All three distributions, Cloudera, Hortonworks and MapR, are focused on Hadoop, and their entire revenue comes from offering enterprise-ready Hadoop distributions.

    • The MapR distribution is the way to go if it’s all about the product.

    • If open source is your priority, then the Hortonworks Hadoop distribution is for you.

    • If your business requirements fit somewhere in between, then opting for Cloudera Distribution for Hadoop might be a good decision.

      All three vendors also provide support to help users with the problems they face, along with demonstrations if required. All three Hadoop distributions have stood the test of time, ensuring stability and security to meet business needs.

      Cloudera:-

      • Cloudera is the best-known player and market leader in the Hadoop space, and it was the first to release a commercial Hadoop distribution. It tops the list when it comes to building innovative tools.

      • The management console, Cloudera Manager, is easy to use and has a rich user interface that displays all the information in an organized and clean way.

      • The proprietary Cloudera management suite automates the installation process and provides other enhanced services to users, such as displaying the real-time node count and reducing deployment time.

      • Cloudera offers consulting services to bridge the gap between what the community provides and what organizations need to integrate Hadoop technology into their data management strategy.

      Hortonworks:-

      • Hortonworks, founded by Yahoo engineers, provides a ‘service only’ distribution model for Hadoop.

      • Hortonworks differs from the other Hadoop distributions in that it is an open enterprise data platform available free for use. The Hortonworks Hadoop distribution, HDP, can easily be downloaded and integrated for use in various applications.

      • Hortonworks was the first vendor to provide a production-ready Hadoop distribution based on Hadoop 2.0. Though CDH had Hadoop 2.0 features in its earlier versions, not all of its components were considered production ready.

      MapR:-

      • MapR is also a platform-focused provider like Hortonworks and Cloudera.

      • MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than the stock Hadoop database, HBase, running on competing distributions.

      • Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.

      • Unlike Cloudera and Hortonworks, the MapR Hadoop distribution takes a more distributed approach to storing metadata on the processing nodes, because it depends on a different file system, the MapR File System (MapR-FS), and does not have a NameNode architecture.

      Features:-

      | Feature | Hortonworks | Cloudera | MapR |
      |---|---|---|---|
      | Dependability | | | |
      | High availability | Single point failure recovery | Single point failure recovery | Self-healing across multiple failures |
      | MapReduce HA | Restarts jobs | Restarts jobs | Continues without restart |
      | Upgrading | Planned downtime | Rolling upgrades | Rolling upgrades |
      | Replication | Data | Data | Data and metadata |
      | Snapshots | Consistent only for closed files | Consistent only for closed files | Consistent for all files and tables |
      | Disaster recovery | Parallel cluster | File copy scheduling | Mirroring |
      | Manageability | | | |
      | Management tools | Ambari | Cloudera Manager | MapR Control System |
      | Heat map, alarms, alerts | Yes | Yes | Yes |
      | Integration with REST API | Yes | Yes | Yes |
      | Data and job placement control | No | No | Yes |
      | Performance & Scalability | | | |
      | Metadata architecture | Centralized | Centralized | Distributed |
      | Data ingest | Batch mode write | Batch mode write | Batch and streaming write |
      | HBase performance | Latency spikes | Latency spikes | Consistent low latency |
      | NoSQL applications | Mainly batch applications | Mainly batch applications | Batch and real-time applications |
      | Data access | | | |
      | File system access | HDFS, read-only NFS | HDFS, read-only NFS | HDFS, read/write NFS |
      | File I/O | Append only | Append only | Read/write |
      | Security | | | |
      | ACLs | Yes | Yes | Yes |
      | Wire-level authentication | Kerberos | Kerberos | Kerberos, native |

      Comparison between MapR & Hadoop File Systems :-

      MapR-FS vs HDFS

      When data is written to MapR-FS, it is sharded into chunks. The default chunk size is 256 megabytes. Chunks are striped across storage pools in a series of blocks, into logical entities called containers. Striping the data across multiple disks allows data to be written faster, because the file is split across the three physical disks in a storage pool but remains in one logical container.

      When data is written to HDFS, it is distributed across nodes. HDFS splits data into blocks of 128 megabytes by default and distributes these blocks across different locations throughout your cluster. Files are automatically distributed as they are written.
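To make the two default sizes above concrete, here is a minimal Python sketch that counts how many pieces a file is split into under each default. The helper name `count_pieces` and the 1 GB example file are illustrative assumptions; this is plain arithmetic, not a call into either file system.

```python
MB = 1024 * 1024

HDFS_BLOCK_SIZE = 128 * MB    # default HDFS block size quoted above
MAPRFS_CHUNK_SIZE = 256 * MB  # default MapR-FS chunk size quoted above


def count_pieces(file_size_bytes: int, piece_size_bytes: int) -> int:
    """Hypothetical helper: number of fixed-size pieces needed to hold a file."""
    if file_size_bytes < 0 or piece_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and the piece size positive")
    # Integer ceiling division, so a partly filled last piece still counts.
    return (file_size_bytes + piece_size_bytes - 1) // piece_size_bytes


one_gb = 1024 * MB
hdfs_blocks = count_pieces(one_gb, HDFS_BLOCK_SIZE)
maprfs_chunks = count_pieces(one_gb, MAPRFS_CHUNK_SIZE)
print(f"1 GB file: {hdfs_blocks} HDFS blocks, {maprfs_chunks} MapR-FS chunks")
```

Both sizes are configurable defaults, so the counts on a real cluster depend on how it has been set up.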

      MapR-FS distributes and replicates the namespace information throughout the cluster, in the same way that data is replicated. Each volume has a name container, which contains the metadata for the files in that volume. The CLDB service typically runs on multiple nodes in the cluster. The CLDB is used to locate the name container for the volume, and the client connects to the name container to access the file metadata.

      In HDFS, metadata is managed by the NameNode. Before any operations can be performed on data stored in HDFS, an application must contact the NameNode. The single NameNode maintains metadata information for all the physical data blocks that comprise the files. This can create performance bottlenecks.
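The difference between the two metadata designs can be sketched with a toy model that only counts which metadata holder each request lands on. This is a rough illustration under stated assumptions, not how either system is implemented: the paths are made up, the first path component is treated as the volume name, and caching, replication and the CLDB lookup itself are ignored.

```python
from collections import Counter

# Hypothetical file paths; the first component stands in for a volume name.
PATHS = [
    "/sales/2016.csv", "/sales/2017.csv",
    "/logs/web.log", "/logs/app.log",
    "/users/a.json", "/users/b.json",
]


def volume_of(path: str) -> str:
    """Return the first path component, used here as the volume name."""
    return path.strip("/").split("/")[0]


def centralized_load(paths):
    """Every metadata request goes to one server (NameNode style)."""
    return Counter("namenode" for _ in paths)


def distributed_load(paths):
    """Each request goes to the name container of the file's volume."""
    return Counter(f"name-container:{volume_of(p)}" for p in paths)


central = centralized_load(PATHS)
spread = distributed_load(PATHS)
print("centralized:", dict(central))
print("distributed:", dict(spread))
```

In the centralized case all six requests hit the same holder, while in the distributed case they are spread over three, which is the intuition behind the bottleneck argument above.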

      MapR-FS uses replication for high availability and fault tolerance. Replication protects against hardware failures. File chunks, table regions and metadata are automatically replicated, and there is generally at least one replica on a different rack.

      In HDFS, data stored on any node is replicated multiple times across the cluster. These replicas prevent data loss: if one node fails, other nodes can continue processing the data.

      MapR-FS avoids a single point of failure and performance bottlenecks by fully distributing the metadata for files and directories.

      In HDFS, the NameNode can become a single point of failure and a performance bottleneck.

      MapR-FS allows updates to files in increments as small as 8 KB. The smaller I/O size reduces overhead on the cluster, allows for snapshots, and is one of the reasons MapR-FS supports random reads and writes, even during ingestion.

      Data in HDFS is immutable. If the source data changes, the new data must be appended to the existing data, or else the data must be reloaded into the cluster.
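The practical difference between the two write models can be sketched in a few lines of Python. This is a toy in-memory model, not either file system's API: the class `AppendOnlyFile` and the helper `rewrite_with_update` are hypothetical names, and a `bytearray` stands in for a file that supports random writes.

```python
class AppendOnlyFile:
    """Toy model of an append-only file: existing bytes can never change."""

    def __init__(self, data: bytes = b""):
        self._data = bytes(data)

    def append(self, more: bytes) -> None:
        self._data += more

    def read(self) -> bytes:
        return self._data


def rewrite_with_update(f: AppendOnlyFile, offset: int, new_bytes: bytes) -> AppendOnlyFile:
    """The 'reload' path: build a whole new file with the change applied."""
    old = f.read()
    return AppendOnlyFile(old[:offset] + new_bytes + old[offset + len(new_bytes):])


# Append-only model: changing the first five bytes means writing a new file.
original = AppendOnlyFile(b"hello world")
updated = rewrite_with_update(original, 0, b"HELLO")

# Random-write model: the same change is a small in-place update.
buf = bytearray(b"hello world")
buf[0:5] = b"HELLO"

print(original.read(), updated.read(), bytes(buf))
```

In the first model the whole file is rewritten to change five bytes; in the second only the affected range is touched.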

      MapR-FS is written in C. Native code does not rely on a garbage collector, so it avoids the garbage-collection overhead of a Java runtime, which can translate to faster performance.

      HDFS is written in Java.

      Advantages and Disadvantages :-

      Cloudera

      Advantages

      • Cloudera has proprietary management software known as Cloudera Manager.

      • CDH has a user-friendly interface with many features.

      • Cloudera has Hue and the SQL query handling interface Impala, as well as “Cloudera Search” for easy, real-time access to data.

      Disadvantages

      • CDH is comparatively slower than the MapR Hadoop distribution.

      • Cloudera is commercially licensed; however, it also allows free use of its open-source projects and offers a free trial.

      • The free package does not include the management suite Cloudera Manager or any other proprietary software.

      Hortonworks

      Advantages

      • Hortonworks is the only commercial vendor to distribute complete open source Apache Hadoop without additional proprietary software.

      • Hortonworks was the first vendor to provide a production-ready Hadoop distribution based on Hadoop 2.0.

      • Hortonworks has Ambari for management, and it makes Hive faster through its Stinger project. HDP uses Stinger for handling queries and Apache Solr for searching data.

      • Hortonworks HDP is available as a native component on Windows Server. A Windows-based Hadoop cluster can be deployed on Windows Azure through the HDInsight service.

      Disadvantages

      • The Ambari management interface on HDP is a basic one and does not have many rich features.

      • Hortonworks is also comparatively slower than the MapR Hadoop distribution.

      MapR

      Advantages

      • MapR is one of the fastest Hadoop distributions, with multi-node direct access. It replaces the HDFS components with its own proprietary file system, MapR-FS.

      • MapR’s Drill is a low-latency distributed query engine for large-scale datasets. Drill is designed to scale to several thousand nodes and query petabytes of data at interactive speeds.

      • Through a recent partnership with Canonical, the creator of the Ubuntu operating system, MapR is offering Hadoop as a default component of Ubuntu.

      Disadvantages

      • MapR does not have as good an interface console as Cloudera.

      • MapR is also commercially licensed; like Cloudera, it allows the use of its open-source projects.

      • The free, community-supported MapR edition includes Hadoop, MapR-DB, and MapR Streams, but comes with only one NFS Gateway.

      6.Summary:-

      Choosing a Hadoop distribution depends on the obstacles an organization faces in implementing Hadoop in its enterprise. The right choice will help organizations connect Hadoop to different data analysis platforms with flexibility, reliability and visibility. Each Hadoop distribution has its own pros and cons.

      When choosing a Hadoop distribution for business needs, it is imperative to weigh the additional value each distribution offers against its risk and cost, so that the distribution proves beneficial for your enterprise.

      The world of Hadoop is getting bigger and bigger, and the list of options can be overwhelming if you don’t know what you’re looking for. Hopefully these considerations, along with their specific criteria, can lead you in the right direction as you search for the best Hadoop distribution for your needs.