Everyone is talking about Hadoop these days as a way to tackle serious business problems in BI. Let’s take a closer look at this data storage and processing platform that lies at the heart of many big data solutions.
In a nutshell, Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
Interesting Fact: “Hadoop” was the name of a yellow toy elephant owned by the son of one of its inventors.
What are the benefits of using Hadoop?
One of the top reasons that organizations turn to Hadoop is its ability to store and process huge amounts of any kind of data, both structured and unstructured, quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things, that’s a key consideration.
Other benefits of Hadoop include:
- Computing power. Its distributed computing model quickly processes big data. The more computing nodes you use, the more processing power you have.
- Flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.
- Fault tolerance. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. And it automatically stores multiple copies of all data.
- Low cost. The open-source framework is free and uses commodity hardware to store large quantities of data.
- Scalability. You can easily grow your system simply by adding more nodes. Little administration is required.
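To make the fault-tolerance point above concrete, here is a minimal, single-process sketch of HDFS-style block replication. This is an illustration of the idea only, not actual Hadoop code: the `Cluster` class, its method names, and the re-replication policy are our own simplified inventions (real HDFS also considers rack placement, heartbeats and more).

```python
import random

# Toy sketch of HDFS-style block replication (illustrative only, not the
# real HDFS implementation). `Cluster` and its methods are invented names.

REPLICATION_FACTOR = 3  # HDFS's default replication factor


class Cluster:
    def __init__(self, node_names):
        # Map each node name to the set of block ids it stores.
        self.nodes = {name: set() for name in node_names}

    def store(self, block_id):
        # Place copies of the block on REPLICATION_FACTOR distinct nodes.
        targets = random.sample(sorted(self.nodes), REPLICATION_FACTOR)
        for node in targets:
            self.nodes[node].add(block_id)

    def fail_node(self, name):
        # Simulate a node going down: its blocks are lost locally, but
        # surviving replicas let the cluster re-replicate each lost block
        # onto another node, restoring the replication factor.
        lost = self.nodes.pop(name)
        for block_id in lost:
            holders = [n for n, b in self.nodes.items() if block_id in b]
            candidates = [n for n in self.nodes if block_id not in self.nodes[n]]
            if candidates and len(holders) < REPLICATION_FACTOR:
                self.nodes[random.choice(candidates)].add(block_id)

    def replicas(self, block_id):
        # Count how many live nodes hold a copy of the block.
        return sum(block_id in blocks for blocks in self.nodes.values())
```

Even when a node holding a copy fails, the block remains readable from its other replicas, and re-replication quietly brings the copy count back up, which is why jobs can simply be redirected to healthy nodes.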
Many organizations are looking to Hadoop as their next big data platform. Here are some of the more popular uses for the framework today.
- Low-cost storage and active data archive. The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional records, social media activity, sensor and machine output, scientific results and clickstreams. The low-cost storage lets you keep information that is not currently critical but that you might want to analyze later.
- Staging area for a data warehouse and analytics store. One of the most prevalent uses is to stage large amounts of raw data for loading into an enterprise data warehouse (EDW) or an analytical store for activities such as advanced analytics, query and reporting, etc. Organizations are looking at Hadoop to handle new types of data (e.g., unstructured), as well as to offload some historical data from their enterprise data warehouses.
- Data lake. Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.
- Sandbox for discovery and analysis. Because Hadoop was designed to handle large volumes of data in a variety of shapes and forms, it makes a good platform for exploratory analytics. Big data analytics on Hadoop can help run your organization more efficiently, uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides a quick, low-risk opportunity to innovate with minimal investment.
Certainly Hadoop provides an economical platform for storing and processing large and diverse data. The next step is to transform and manage that data and use analytics to quickly identify new and useful insights.
What are the challenges of using Hadoop?
As attractive as Hadoop is, there is still a steep learning curve involved in understanding what role it can play for an organization, and how best to deploy it.
First, MapReduce is not a good match for every problem. It works well for simple information requests and for problems that can be divided into independent units, but it is inefficient for iterative and interactive analytic tasks. MapReduce is also file-intensive: because the nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require multiple map-shuffle/sort-reduce phases to complete. Each phase writes a new set of intermediate files, which is inefficient for advanced analytic computing.
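The map-shuffle/sort-reduce flow described above can be sketched in a few lines of single-process Python, using the classic word-count example. This is a conceptual sketch, not Hadoop code: real Hadoop distributes the map and reduce tasks across nodes and moves intermediate files between them during the shuffle, and an iterative algorithm would have to repeat this entire pipeline once per iteration.

```python
from collections import defaultdict

# Single-process sketch of MapReduce's three phases (word count).
# On a real cluster, map and reduce tasks run on many nodes, and the
# shuffle step writes and transfers intermediate files between them --
# the expensive part that iterative algorithms must repeat each pass.


def map_phase(documents):
    # Map: emit a (word, 1) pair for every word, independently per document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)


def shuffle_phase(pairs):
    # Shuffle/sort: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: combine each key's values into a final result.
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle_phase(map_phase(documents)))
```

A single pass like `word_count(["the quick fox", "The lazy dog"])` fits MapReduce well because every map call is independent; an algorithm that needs the reduce output as input to the next round must chain whole pipelines end to end, which is where the inefficiency comes from.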
Second, there’s a talent gap. It can be difficult to find entry-level programmers who have sufficient Java skills to be productive with MapReduce. That’s one reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. It is much easier to find programmers with SQL skills than MapReduce skills.
Another challenge is Hadoop’s fragmented data security, though new tools and technologies are surfacing. The Kerberos authentication protocol is a great step toward making Hadoop environments secure.
And, Hadoop does not have easy-to-use, full-feature tools for data management, data cleansing, governance and metadata. Especially lacking are tools for data quality and standardization.
Perhaps the biggest takeaway is understanding that Hadoop is not meant to replace your current data infrastructure, only to augment it. Once this important distinction is made, it becomes easier to start thinking about how Hadoop can help your organization without ripping out the guts of your existing data processes.
If you want to keep up on the latest with Hadoop, follow the conference that’s going on right now in San Jose: http://2015.hadoopsummit.org/
Are you thinking about using Hadoop in your organization? Do you already use Hadoop? Let us know in the comments below!