I had earmarked this series to write about and just wanted to wait for part four to be published… then somehow missed it completely in my feed, until I remembered and checked back today. So, without further ado, here you go!

Hadoop 2013 – Part One: Performance

Hadoop 2013 – Part Two: Projects

Hadoop 2013 – Part Three: Platforms

Hadoop 2013 – Part Four: Players

Hadoop and Splunk Use cases

March 17th, 2013

Hadoop and Splunk Use cases:

The Splunk and Hadoop communities can benefit from each other’s strengths. Below are several examples of customers that use both environments. Splunk is a primary tool used by this company for making use of big data and gaining real-time operational intelligence from their infrastructure.

Every vendor should have a cookbook of use cases like this.

Pig Eye for the SQL Guy

March 6th, 2013

Pig Eye for the SQL Guy:

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

It’s always nice to explain the new with the old…

Cloudera who? Intel announces its own Hadoop distribution:

Hadoop is a juggernaut when it comes to big data. Intel is a juggernaut when it comes to data center infrastructure. Its decision to enter into the open source software market is a big one for the chip company, for the Hadoop ecosystem and for the myriad startups playing in this space.

Plus Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Via myNoSQL.

EMC to Hadoop competition: “See ya, wouldn’t wanna be ya.”:

EMC Greenplum rolled out a new Hadoop distribution that fuses the popular big data platform with its flagship MPP database technology. Co-founder Scott Yara thinks the company’s huge investment puts it in the catbird seat among Hadoop vendors.

Greenplum HAWQ: yet another Hadoop distribution, this time Greenplum RDBMS tied to HDFS.

Cloudera makes SQL a first-class citizen in Hadoop:

Not content to watch its competitors leave it in the dust, veteran big data startup Cloudera is fundamentally changing the face of its flagship Hadoop distribution into something much more appealing.

Monash also writes about it: Quick notes on Impala and More on Cloudera Impala.

The future is hybrid, kind of like SQL and NoSQL combined. Which is what NoSQL stands for, according to some: Not Only SQL ;-) Hadapt is betting big on hybrid being a requirement for analytic platforms, and Curt Monash nicely sums up the new v2 offering that should be available next quarter.

Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS” – and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:

  • Dump multi-structured data into Hadoop.
  • Refine or just move some of it into an RDBMS.
  • Bring in data from other RDBMS.
  • Process all of the above via Hadoop MapReduce.
  • Process all of the above via SQL.
  • Use full-text indexes on the data.

via Hadapt Version 2.

Drill is Apache’s Dremel.

In the era of big data, there is increasing demand for ever-faster ways to analyze — preferably in an interactive way — information sitting in Hadoop. Now the Apache Foundation is backing an open-source version of Dremel, the tool Google uses for these jobs, as a way to bring that speedy analysis to the masses. The proposed tool is called Drill and the Apache Foundation documents describe it as “a distributed system for interactive analysis of large-scale datasets.”

via For fast, interactive Hadoop queries, Drill may be the answer.

Now it’s VMware’s turn: Meet Spring Hadoop. MapReduce in Spring, MapReduce in Excel… exciting times!

Microsoft’s Hadoop play is shaping up, and it includes Excel. Curious to see how the marriage of Excel and Hadoop is going to work out.