May 28th, 2013
I had earmarked this series to write about, but wanted to wait for part four to be published… and then somehow totally missed it in my feed, until I remembered and decided to check back today. So, without further ado, here you go!
March 17th, 2013
The Splunk and Hadoop communities can benefit from each other’s strengths. Below are several examples of customers that use both environments. For one such company, Splunk is a primary tool for making use of big data and gaining real-time operational intelligence from their infrastructure.
Every vendor should have such a cookbook of use cases
March 6th, 2013
For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
It’s always nice to explain the new with the old…
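To give a flavor of the mapping the guide describes, here is a hypothetical sketch (mine, not the article’s, with made-up click data) of the key semantic difference: SQL’s GROUP BY collapses each group straight into aggregate rows, while Pig’s GROUP produces named bags of grouped records that a later FOREACH … GENERATE then reduces. In Python terms:

```python
from collections import defaultdict

# A toy "relation": (user, url) click records, as Pig might LOAD them.
clicks = [
    ("alice", "/home"),
    ("bob", "/search"),
    ("alice", "/search"),
    ("bob", "/home"),
    ("alice", "/home"),
]

# SQL style: one step, groups collapse straight to aggregate rows.
#   SELECT user, COUNT(*) FROM clicks GROUP BY user;
sql_counts = defaultdict(int)
for user, _url in clicks:
    sql_counts[user] += 1

# Pig style: two explicit steps, and the intermediate grouping is a value.
#   grouped = GROUP clicks BY user;    -- yields (group_key, bag of records)
#   counts  = FOREACH grouped GENERATE group, COUNT(clicks);
grouped = defaultdict(list)
for record in clicks:
    grouped[record[0]].append(record)  # each bag holds the full original tuples

pig_counts = {user: len(bag) for user, bag in grouped.items()}

# Both routes arrive at the same aggregate.
assert dict(sql_counts) == pig_counts
print(pig_counts)
```

The intermediate bag is exactly the disorienting part for SQL folks: in Pig you can inspect, filter, or flatten `grouped` before ever aggregating, whereas SQL hides that stage inside the GROUP BY.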
February 27th, 2013
Hadoop is a juggernaut when it comes to big data. Intel is a juggernaut when it comes to data center infrastructure. Its decision to enter into the open source software market is a big one for the chip company, for the Hadoop ecosystem and for the myriad startups playing in this space.
As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.
February 26th, 2013
EMC Greenplum rolled out a new Hadoop distribution that fuses the popular big data platform with its flagship MPP database technology. Co-founder Scott Yara thinks the company’s huge investment puts it in the catbird seat among Hadoop vendors.
Greenplum HAWQ: yet another Hadoop distribution, this time the Greenplum RDBMS tied to HDFS.
December 26th, 2012
Not content to watch its competitors leave it in the dust, veteran big data startup Cloudera is fundamentally changing the face of its flagship Hadoop distribution into something much more appealing.
October 23rd, 2012
The future is hybrid, kind of like SQL and NoSQL combined (which is what NoSQL stands for, according to some: Not Only SQL). Hadapt is betting big on hybrid being a requirement for analytic platforms, and Curt Monash nicely sums up the new v2 offering, which should be available next quarter.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than as “a better scale-out RDBMS”, and rightly so, given its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process all of the above via Hadoop MapReduce.
- Process all of the above via SQL.
- Use full-text indexes on the data.
via Hadapt Version 2.
August 21st, 2012
In the era of big data, there is increasing demand for ever-faster ways to analyze — preferably in an interactive way — information sitting in Hadoop. Now the Apache Foundation is backing an open-source version of Dremel, the tool Google uses for these jobs, as a way to bring that speedy analysis to the masses. The proposed tool is called Drill and the Apache Foundation documents describe it as “a distributed system for interactive analysis of large-scale datasets.”
March 2nd, 2012
Microsoft’s Hadoop play is shaping up, and it includes Excel. Curious to see how the marriage of Excel and Hadoop is going to work out.