Hadoop Distributions: Comparing Cloudera, Hortonworks, MapR, and Intel

Introduction

Welcome to this Pythian series about Big Data technologies. In this article, we look at Hadoop, its distributions, and the vendors that offer them. Hadoop is a family of technologies commonly associated with big data. Many vendors and tools build on Hadoop's core building blocks: the Hadoop Distributed File System (HDFS) and MapReduce.

Cloudera: The Oldest Distribution

One of the best-known and oldest Hadoop distributions is Cloudera's CDH. When adopting Hadoop, it is often more convenient to use one of the available distributions than to start from plain Apache Hadoop and integrate additional tools later. Cloudera offers a wide range of open-source and commercial tools, including popular components such as Hive and Pig. It also provides a proprietary tool, Cloudera Manager, which helps companies manage Hadoop efficiently at scale.

MapR: A Proprietary Option

Another major player in the Hadoop space is MapR. Unlike Cloudera's distribution, MapR's platform is proprietary. However, it implements the same API as Apache Hadoop, so applications written against that API remain compatible. MapR uses its own file system, MapRFS, which is similar to HDFS but is implemented in C rather than Java. MapR has also developed its own management framework and integrates open-source tools such as Apache Hive and HBase into its platform.
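To illustrate what "same API, different implementation" means in practice, here is a minimal sketch in Python. It does not use any real Hadoop or MapR library: the `FileSystem` interface, the two toy backends, and `word_count` are all hypothetical names invented for this example. The point is only that client code written against a shared interface does not need to change when the storage layer underneath is swapped.

```python
from typing import Protocol


class FileSystem(Protocol):
    """Hypothetical stand-in for the API that client code programs against."""

    def write(self, path: str, data: bytes) -> None: ...

    def read(self, path: str) -> bytes: ...


class DictBackedStore:
    """Toy backend (assumption): keeps files in a dictionary."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class ListBackedStore:
    """Toy backend (assumption): same methods, different internals."""

    def __init__(self):
        self._blobs = []  # list of (path, data) pairs

    def write(self, path, data):
        # Drop any earlier version of the file, then append the new one.
        self._blobs = [(p, d) for p, d in self._blobs if p != path]
        self._blobs.append((path, data))

    def read(self, path):
        for stored_path, data in self._blobs:
            if stored_path == path:
                return data
        raise FileNotFoundError(path)


def word_count(fs: FileSystem, path: str) -> int:
    """Client code: depends only on the shared interface, not on the backend."""
    return len(fs.read(path).decode("utf-8").split())
```

Calling `word_count` with either backend gives the same result, which is the kind of compatibility the article describes between MapR's platform and the Apache Hadoop API.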

Hortonworks: The Open-Source Advocate

Hortonworks, a more recent entrant in the market, originated as a spinoff from Yahoo. An essential aspect of Hortonworks' philosophy is its commitment to open-source software. The company relies heavily on the community ecosystem and does all of its work in open-source projects. Unlike MapR, Hortonworks is completely open source, which makes it an ideal choice for organizations seeking open-source solutions for their Hadoop deployments.

Intel: Optimizing Hadoop for their CPUs

In addition to these vendors, Intel has recently entered the Hadoop distribution market. The move is somewhat surprising for a hardware company. However, Intel's motivation is to optimize the Hadoop tools and platform specifically for Intel CPUs. The aim is to make Hadoop run faster and more efficiently on Intel's platform, which could lead Hadoop customers to choose Intel over other hardware options. It remains to be seen how Intel's distribution will be received by the broader market.

Choosing the Right Hadoop Distribution

When selecting a Hadoop distribution for your project, it is crucial to carefully evaluate the available options. Considerations include the level of openness, proprietary tools, integration with open-source technologies, and any potential optimizations for specific hardware platforms. Each vendor offers a unique set of features, and understanding these differences will help you make an informed decision.
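One simple way to organize such an evaluation is to write down the traits that matter to you and filter the vendors against them. The sketch below is illustrative only: the trait labels and the `shortlist` helper are made up for this example, and the trait sets merely restate what this article says about each vendor. They are not a complete or current feature comparison.

```python
from typing import Iterable, List

# Traits as described in this article (assumption: simplified labels,
# not an authoritative or exhaustive feature list).
TRAITS = {
    "Cloudera": {"open-source tools", "proprietary management tool"},
    "MapR": {
        "proprietary platform",
        "own file system",
        "own management framework",
        "open-source tools",
    },
    "Hortonworks": {"fully open source", "open-source tools"},
    "Intel": {"optimized for Intel CPUs"},
}


def shortlist(required: Iterable[str]) -> List[str]:
    """Return the vendors whose traits include every required trait."""
    needed = set(required)
    return sorted(name for name, traits in TRAITS.items() if needed <= traits)


if __name__ == "__main__":
    print(shortlist({"fully open source"}))  # ['Hortonworks']
    print(shortlist({"own file system"}))    # ['MapR']
```

A real evaluation would add your own criteria, such as support terms, cost, and existing hardware, and would verify each trait against the vendor's current documentation.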

Conclusion

In this article, we explored the Hadoop distributions offered by Cloudera, MapR, Hortonworks, and Intel. Cloudera's is one of the oldest and best-known distributions, offering both open-source and proprietary tools. MapR provides a proprietary platform with its own file system and management framework, while also integrating open-source tools. Hortonworks, born from Yahoo, is fully committed to open-source software. Lastly, Intel has ventured into the Hadoop space, aiming to optimize Hadoop for its CPUs. Understanding the differences between these distributions will help you select the one that meets your specific needs and goals.