Google’s Megastore

April 26th, 2011

I don’t think I’ve written about Google’s Megastore yet, so here’s a quick summary of worthwhile resources.

Megastore is the data engine supporting Google App Engine. It’s a scalable structured data store providing full ACID semantics within partitions but weaker consistency guarantees across partitions.
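To make the "ACID within a partition, weaker guarantees across partitions" idea concrete, here is a toy sketch in Python. It is not Megastore's implementation or API: the class names, the `transfer` example and the in-memory outbox are all made up for illustration. It only assumes that a write inside one partition is all-or-nothing, while an effect on another partition is queued and applied later.

```python
from collections import deque


class Partition:
    """One partition: a write inside it is applied all-or-nothing."""

    def __init__(self):
        self.data = {}

    def transact(self, updates, precondition=None):
        # Check first, then apply every update together, so a failed
        # check leaves the partition untouched.
        if precondition is not None and not precondition(self.data):
            return False
        self.data.update(updates)
        return True


class ToyStore:
    """Hypothetical store: atomic per partition, queued across partitions."""

    def __init__(self, names):
        self.partitions = {name: Partition() for name in names}
        self.outbox = deque()  # cross-partition messages not yet applied

    def transfer(self, src, dst, amount):
        # The debit is atomic inside the source partition ...
        p = self.partitions[src]
        balance = p.data.get("balance", 0)
        ok = p.transact(
            {"balance": balance - amount},
            precondition=lambda d: d.get("balance", 0) >= amount,
        )
        if ok:
            # ... but the credit is only queued for the other partition.
            self.outbox.append((dst, amount))
        return ok

    def deliver_pending(self):
        # Apply queued messages; until this runs, readers of the
        # destination partition see stale data.
        while self.outbox:
            dst, amount = self.outbox.popleft()
            p = self.partitions[dst]
            p.transact({"balance": p.data.get("balance", 0) + amount})
```

After `transfer("alice", "bob", 30)` the debit is visible in the "alice" partition right away, but "bob" only sees the credit once `deliver_pending()` has run. That window is the weaker cross-partition guarantee in miniature.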

James Hamilton’s take on Google Megastore: The Data Engine Behind GAE. His blog is worth following for anyone interested in scaling infrastructure in general, not just DBs. Todd Hoff’s write-up is Google Megastore – 3 Billion Writes and 20 Billion Read Transactions Daily; his blog is about everything High Scalability. And last but not least, the Storage Mojo take on Google’s Megastore, from a storage insider.

The 451 group’s Matt Aslett argues that Necessity is the mother of NoSQL.

Necessity is particularly relevant when looking at the history of the NoSQL databases. While it is easy for the incumbent database vendor to dismiss the various NoSQL projects as development playthings, it is clear that the vast majority of NoSQL projects were developed by companies and individuals in response to the fact that the existing database products and vendors were not suitable to meet their requirements with regards to the other five factors: scalability, performance, relaxed consistency, agility and intricacy.


The fact that Facebook, LinkedIn, Google and Amazon have had to develop and support their own database infrastructure is not a healthy sign. In a perfect world, they would all have better things to do than focus on developing and managing database platforms. That explains why the companies have also all chosen to share their projects. Google and Amazon did so through the publication of research papers, which enabled the likes of Powerset, Facebook, Zvents and Linkedin to create their own implementations. These implementations were then shared through the publication of source code, which has enabled the likes of Yahoo, Digg and Twitter to collaborate with each other and additional companies on their ongoing development.

He also posts an interesting chart of the evolution of NoSQL.

MySQL just announced a pre-release snapshot which comes with an integrated Memcached plugin accessing the InnoDB storage engine directly: NoSQL to InnoDB with Memcached

The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL. Today, MySQL is announcing the preview of the NoSQL to InnoDB via memcached. This offering provides users with the best of both worlds – maintain all of the advantages of rich SQL query language, while providing better performance for simple queries via direct access to shared data. In this preview release, memcached is implemented as a MySQL plugin daemon, accessing InnoDB directly via the native InnoDB API
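For a feel of what "NoSQL access" means on the wire, here is a minimal sketch of the standard memcached text protocol that a client would speak to such a daemon. It only builds and parses protocol messages; the key `user:1` is a made-up example, and how the plugin maps keys to InnoDB rows is not covered by the announcement quoted above, so nothing here should be read as the plugin's actual behaviour.

```python
def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Format a memcached text-protocol 'set' request."""
    # Header: set <key> <flags> <exptime> <bytes>, then the data block.
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Format a memcached text-protocol 'get' request for one key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(raw: bytes):
    """Return the value from a single-key 'get' reply, or None on a miss."""
    # A hit looks like: VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    # A miss is just:   END\r\n
    if raw.startswith(b"END"):
        return None
    header, rest = raw.split(b"\r\n", 1)
    _value, _key, _flags, length = header.split()
    return rest[: int(length)]
```

In practice these bytes would go over a TCP socket to wherever the daemon listens (11211 is memcached's usual default port; whether the preview release uses it is an assumption). The point is that a simple key lookup skips SQL parsing entirely.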

I wouldn’t be surprised to see more of this kind of integration also from other DB vendors.

Structure Big Data Roundup

March 29th, 2011

A good number of articles from Derrick Harris over at GigaOm rounding up the Structure Big Data Conference. First, there’s a look at Hadoop, Cloudera, and alternatives to Cloudera from IBM, DataStax, Hadapt, etc. in As Big Data Takes Off, the Hadoop Wars Begin; second, there’s a piece about Why Big Data Startups Should Take a Narrow View:

[…] analyzing social media data is not the same, either in technique or in purpose, as analyzing user data to feed a recommendation engine for a site like Netflix. And herein lies the opportunity. […] It’s a situation just begging for startups to fill the void between big data tools and actually using them for a particular task.

So where are the NoSQL startups targeting the financial industry?

Great Bloomberg interview with Cloudera CEO Mike Olson on open source and big data.

Via the 451 group

Meet MapR, a Competitor to Hadoop Leader Cloudera.

They are said to be building a proprietary replacement for the Hadoop Distributed File System that’s allegedly three times faster than the current open-source version. It comes with snapshots and no NameNode single point of failure (SPOF), and is supposed to be API-compatible with HDFS, so it can be a drop-in replacement.

Lots of famous names in the company, which probably explains why they were able to raise $9m in funding without ever shipping anything. I don’t know enough about them right now, but it’ll be interesting to see whether they have enough of an advantage to keep an edge over plain Hadoop and HDFS as the open source version catches up.

Microsoft Graph DB Trinity

March 23rd, 2011

Microsoft published information about its research project Trinity, a hypergraph DB.

Trinity is a graph database and computation platform over distributed memory cloud. As a database, it provides features such as highly concurrent query processing, transaction, consistency control. As a computation platform, it provides synchronous and asynchronous batch-mode computations on large scale graphs. Trinity can be deployed on one machine or hundreds of machines.

At this time, Trinity is not available for download or use, but it’s already used in production internal to Microsoft.

MongoDB 1.8 Released, Supports Journaling for Fast Crash Recovery. See more on the MongoDB Journaling page. Good stuff by the Mongo guys, bringing it one step closer to enterprise readiness.

Acunu NoSQL Appliance

February 24th, 2011

GigaOm reports that Big Data Startup Acunu Raises Small Funding. These guys (based in London) are building a hardware appliance with SSDs for a variety of NoSQL stores, currently supporting Cassandra and an Amazon S3-compatible RESTful interface.

The Acunu Storage Core is an open source next-generation storage stack built from the ground up for Big Data. Running inside the Linux Kernel, it powers databases like Acunu’s Distribution for Cassandra through a native key-value interface. It brings a host of benefits, including performance from large cost-effective SATA disks and consumer-grade SSDs — while actually enhancing data integrity. It virtually eliminates the risk and performance impact of disk rebuilds.

Let’s see if there’s a market for NoSQL appliances…

With a flurry of recent BI-oriented partnerships, it’s no surprise Cloudera is attracting so much interest.

via How Cloudera Became a Leader in BI/Hadoop