March 6th, 2013
For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, the amounts of data are often so large that working with them on a traditional SQL database is too slow to be practical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
It’s always nice to explain the new with the old…
December 26th, 2012
Not content to watch its competitors leave it in the dust, veteran big data startup Cloudera is fundamentally changing the face of its flagship Hadoop distribution into something much more appealing.
September 17th, 2010
Chen Shapira apparently went through a similar information-gathering and fact-finding exercise as I did with NoSQL, but she’s much better at writing it all up, as this concise and complete article shows: NoSQL Deep Dive – The Missing White Paper.
Highly recommended for all folks who understand SQL RDBMS, and need a quick way to understand NoSQL’s theoretical underpinnings well enough to make sense at the next cocktail party…
July 27th, 2010
Now, depending on your viewpoint, that title could just as well read Keep a Hadoop Cluster in Your Back Pocket. So we’re talking about when it makes sense to combine an old-fashioned SQL RDBMS with a fancy, modern NoSQL system, looked at from both angles.
Awkward it may be, but SQL is a lot more succinct and readable than multiple lines of API calls or crazy, math-like relational algebra languages. And there’s nothing intrinsically slow about the language itself. If you could run “SELECT * FROM table WHERE …” on Cassandra, it would be no slower than specifying the same conditions via API calls.
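To make the succinctness point concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database purely as a stand-in (it is not Cassandra, and the `users` table, its columns, and both function names are made up for illustration). The same conditions are expressed once as a SQL statement and once as hand-written, API-style filtering code; both return the same rows.

```python
import sqlite3

# Hypothetical sample data: (name, country, age)
rows = [
    ("alice", "US", 34),
    ("bob", "DE", 27),
    ("carol", "US", 19),
    ("dave", "FR", 45),
]


def query_with_sql(conn):
    """Declarative version: state the conditions, let the engine do the work."""
    cur = conn.execute(
        "SELECT name FROM users WHERE country = ? AND age >= ? ORDER BY name",
        ("US", 21),
    )
    return [name for (name,) in cur]


def query_with_api_calls(records):
    """Imperative version: spell out each condition and the sort by hand."""
    matches = []
    for name, country, age in records:
        if country != "US":
            continue
        if age < 21:
            continue
        matches.append(name)
    matches.sort()
    return matches


# SQLite in memory stands in for whatever store you actually use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, country TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)

sql_result = query_with_sql(conn)
api_result = query_with_api_calls(rows)
print(sql_result, api_result)
```

The two functions express identical conditions; the difference is only in how much of the "how" you have to write yourself. This sketch says nothing about performance on a real distributed store.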
Netezza blogger Phil Francisco, on the other hand, explains how it makes sense for some of their customers to use Hadoop as a large online archive for their colder data.
We have seen customers deploy [patterns] in which the Hadoop Cluster is used for long-term data retention, or as a “queryable archive”. Here one could think of Hadoop as a complementary analytic extension of the Netezza TwinFin when there is far less premium placed on low-latency or high-performance. [...] the queryable archive could also retain long-term copies of structured data that had previously been loaded into the high-performance TwinFin appliance.
There you go. Let me know if you have any other thoughts about how to combine SQL and NoSQL in useful ways.
July 15th, 2010
The recent releases of SQL Anywhere 12 and SQL Server Compact 4.0 got me thinking about a pet project of mine, basically providing a very cheap SQL persistence layer to fill the gap between Excel / MS Access and the multi-gigabytes typically stored in ‘real’ RDBMS. I’ve been thinking in terms of SQL Clouds such as Microsoft’s SQL Azure or Amazon’s Relational Database Service. But why limit the developer’s freedom with the cloud services, and not just provide a simple integrated SQL store? H2 could be another popular candidate, but limited to Java.
I’ll be looking into that some more… for now, there are 10 Cool New Features In SQL Anywhere 12 and an Introduction to SQL Server Compact 4.0, the Next Gen Embedded Database from Microsoft. And as a follow-up on a slightly related topic, the (SQL) Azure Appliance: Gartner’s Chris Wolf has a nice write-up, On-Premise Microsoft Azure: An Inevitable Milestone in Azure’s Evolution.
July 13th, 2010
Dell, eBay, Fujitsu, and HP intend to deploy the appliance in their datacenters to offer new cloud services.
While that’s nice to know, I’m much more interested to hear when we can get our own appliance, and what the configuration sizes and support requirements are going to be.
On a related note, the SQL CAT team released its SQL Azure Customer Best Practices a while ago.
July 9th, 2010
SQL legend Mike Stonebraker gave a session on SQL Urban Myths, and how VoltDB works around them. Todd Hoff then took the arguments and described, analyzed and commented on them in great detail. An excellent read for anybody interested in SQL DBs and their future!