A few months ago, EMC announced it was acquiring Greenplum -- maker of advanced data warehouse database software -- as the foundation of a new Data Computing Products Division here at EMC.
This morning, EMC announced the next logical step in the journey: not only formally launching the new product division, but introducing the first of many EMC products to eventually come in this space: a purpose-built data computing appliance.
Think of it as a storage array with an interesting new presentation layer :-)
And even if large-scale DW/BI isn't your thing, you still might be interested in what we're doing here ...
Yet Another Form Of Information Management for EMC?
If you're familiar with EMC's tag line, it's simple: "Where Information Lives". And one way of visualizing EMC's very broad portfolio is that it's largely information-centric -- we store it, back it up, secure it, virtualize it, etc.
Think back: many years ago, EMC made a big investment in adding value to unstructured data (e.g. content) as part of business process workflows: Documentum was quickly joined by a string of subsequent investments now known internally as IIG. End-to-end, we can build applications that scan, store, manage, process and query just about any form of document-oriented content -- a powerful foundation for many thousands of business processes today.
More than a few people see our investment in Greenplum as a parallel theme: the beginning of EMC investing in adding value to structured (e.g. tabular) information.
At a high level, their mission statement is pretty clear: helping customers get greater insight and value from their data than ever before. Interesting enough on its own.
But it's the "how" that I find even more interesting: use of open standards, cloud computing, virtualization, and (fascinatingly enough) -- social collaboration. From a pure technology architecture perspective, the path ahead seems clear: MPP shared-nothing models that scale out linearly and exploit mainstream, industry-standard technology.
No need to point to an Infiniband pipe and say "hey, that's what makes us different" :-)
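If it helps to picture what "shared-nothing" means in practice, here's a toy Python sketch of the general pattern -- purely my own illustration, not Greenplum internals: each segment owns its own slice of the data and computes a partial result locally, and only those small partial results travel to a coordinator.

```python
# Toy shared-nothing aggregate (illustration only, not Greenplum internals).
# Each "segment" owns its own rows; only tiny partial results are exchanged.
from multiprocessing import Pool

def partial_sum_count(rows):
    """Runs on one segment: aggregate only the data that segment owns."""
    return sum(rows), len(rows)

def parallel_average(segments):
    """Coordinator: scatter the work, gather the small partial results."""
    with Pool(len(segments)) as pool:
        partials = pool.map(partial_sum_count, segments)
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    # Four hypothetical segments, each holding its own slice of a column.
    segments = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
    print(parallel_average(segments))   # 5.5
```

Because no segment ever needs another segment's data to do its share of the work, adding segments adds capacity and throughput in roughly equal measure.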
The Greenplum Folks Have Been Very Busy ...
As you can see, no one's been sitting on their hands for the last 75 days.
First, we now formally have a new product division here at EMC.
To most outsiders, that doesn't sound like a lot, but as anyone who's been through it will tell you, there's a lot of heavy lifting involved: facilities, support systems, chart of accounts, product roadmaps, integrating stand-alone functions as part of the broader EMC.
And hiring. Yes, hiring. No shortage of open positions in the new group -- it has grown by 30% in just two months, and is likely to continue at a similar pace for the near future.
During all of this, the Greenplum team shipped their most advanced software platform yet: Greenplum Database 4.0, which you can learn more about here.
As part of EMC, customer interest in Greenplum-based solutions is now white-hot and a little difficult to keep up with. I consider that a high-class problem to have, by the way.
They also updated their reference hardware architectures to be "private cloud" compatible -- running under VMware and -- more recently -- Vblocks.
And, oh yes, they're announcing a nifty new purpose-built data computing appliance today.
More on that in a moment.
The Expanding Greenplum Portfolio
At the core is the Greenplum Database.
I ceased to be a decent database geek over a decade ago, but poring through the architectural backgrounders, I found a lot to like -- especially for an analytical workload that is typically partitionable and uncorrelated in nature.
This isn't your father's relational database architecture that's been pressed into duty as a true data warehouse.
In the spirit of making things easy for people, the single node edition is a freebie -- download it, play with it, even put it to work if you'd like.
Where I tend to get really interested is the new Greenplum Chorus product: it turns the traditional IT-centric model on its head -- putting more power in the hands of knowledgeable users, enabling them to make better decisions with better data.
Three Consumption Models For Advanced Data Warehousing and Analytics
The new Data Computing Appliance discussion fits in neatly with the other two options that are already in the marketplace today.
Some people prefer to build their own supporting DW/BI infrastructure: servers, storage, network, software, etc. and do the integration and support themselves.
Indeed, I get to meet customers who are very proficient at this particular workload, and know exactly what they're doing.
For these folks, Greenplum is available as a pure software offering with your choice of hardware.
Others look at something pre-integrated and virtualized (like a Vblock) and see data warehousing as just another workload for their private cloud -- especially if their interest is in moving to a self-service model using Greenplum Chorus.
They prefer infrastructure that does well at many use cases, rather than just one. And even if you're not interested in a Vblock, it's great that Greenplum software runs nicely in a VM if needed.
And, finally, we have a growing number of purpose-built appliances in this space from the usual players: Teradata, Netezza (now being acquired by IBM) and more recently Oracle's Exadata.
If you're looking at appliances, please add the new Greenplum Data Computing Appliance (DCA) to your list of potential options -- you'll see why in a moment.
Data In. Decisions Out.
Basically, that's what a good appliance is supposed to do: deliver on the core value proposition with as little drama and fuss as possible.
As you can see from this slide, the Greenplum DCA intends to compete against the other options in two primary regards.
First, as we'll see in a moment, they've got a clear advantage in data ingestion performance. To organizations that depend on analytics to drive their business, this is turning out to be a Really Big Deal -- if you can't load the data fast enough, you can't ask it any useful questions. And, by all indications, the demand to load up anything and everything seems to be growing exponentially.
Second, there's sheer price/performance. As you'll see, the Greenplum DCA leverages Greenplum's shared-nothing architecture nicely, and thus can use industry-standard technology components throughout.
Other than some nice packaging and single-source support, this is something you could probably build yourself if you had the time, the budget and the inclination.
All the "secret sauce" is entirely in Greenplum's advanced software.
By being on the Intel price/performance curve -- and using a shared-nothing architecture -- we're talking about both astounding levels of end-user performance and very linear scalability at extreme scale.
Not that it matters, but it has a cool green LED light bar across the face :-)
Back To Ingestion Speeds
As I mentioned before, most people tend to focus on query speeds when evaluating these sorts of solutions, and there's nothing wrong with that.
But people who are really into this stuff have realized that a query on stale data -- especially when there's fresher stuff at hand -- isn't all that interesting.
A single rack of Greenplum DCA ingests at 10TB per hour. Need more? Scale linearly with more racks -- up to 24. That's ~240TB (roughly a quarter of a petabyte) per hour data loading speed.
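If you want to put those figures to work, the back-of-the-envelope math is straightforward (my arithmetic, using only the numbers quoted above):

```python
# Back-of-the-envelope load times, using the figures quoted above:
# ~10 TB/hour per rack, scaling linearly up to 24 racks.
TB_PER_HOUR_PER_RACK = 10
MAX_RACKS = 24

def load_hours(dataset_tb, racks):
    """Hours to ingest a dataset of dataset_tb terabytes on a given rack count."""
    return dataset_tb / (TB_PER_HOUR_PER_RACK * racks)

print(load_hours(100, 1))                 # 10.0 hours on a single rack
print(load_hours(100, 4))                 # 2.5 hours on four racks
print(TB_PER_HOUR_PER_RACK * MAX_RACKS)   # 240 TB/hour at full scale
```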
Business people really, really care about having fresh data to make decisions that drive the business.
The fresher, the better.
The Money Chart
OK, this is the chart that all of you will want to study in detail -- it's a side-by-side comparison of the new Greenplum DCA against the more well-known alternatives that are out there.
The first thing to note (once again!) is the MPP shared-nothing architecture.
Linear scalability can be debated back and forth, but it's a far easier case to make when the workload is highly divisible (as is the case here), units of computing and storage are rather modest and granular (ditto), and there's an absolute minimum of coordination required between the nodes.
Note that Oracle's Exadata doesn't have the benefit of this approach :-)
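For the skeptics, there's a quick way to see why minimal coordination between nodes matters so much: if only a tiny fraction of the work is serialized on coordination, speedup stays close to linear as you add units. A rough Amdahl-style calculation (my own sketch, not a vendor benchmark):

```python
# Amdahl-style speedup: s is the fraction of work serialized on coordination,
# n is the number of parallel units (nodes, racks, etc.).
def speedup(n, s):
    return 1.0 / (s + (1.0 - s) / n)

for s in (0.001, 0.01, 0.10):
    print(f"serial fraction {s:.3f}: speedup at 24 units = {speedup(24, s):.1f}x")
# 0.001 -> ~23.5x (nearly linear)
# 0.010 -> ~19.5x
# 0.100 -> ~7.3x  (coordination starts to dominate)
```

A highly divisible analytic workload spread over granular units keeps that serial fraction small, which is exactly the point being made above.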
A full rack of Greenplum DCA has 16 of the latest-greatest Intel servers. Well, I heard today there are two additional redundant servers that do master coordination duties (very lightly loaded), so -- technically -- the number is 18, but there are 16 you really care about.
That's 192 cores in a single rack, all put to good use thanks to the magic of Greenplum software. Needless to say, as Intel offers more/faster cores, those can easily be put to work as well.
If I'm reading the chart correctly, they're using 600GB drives for a good mix of capacity and performance -- yielding 36TB of usable capacity in uncompressed form.
But compression is a big part of the equation here: not only does it reduce the storage consumed, but it means that there's far less data to transfer between server and storage, giving the potential to dramatically bump query performance.
Your results may vary, but the team told me that almost all customers see anywhere from ~2.5x compression (for hard-to-compress data) all the way up to ~10x-20x in many instances. Thus the "144 TB" usable (compressed) capacity number shown here is probably more conservative than it needs to be.
Getting a more precise number usually means trying it with your own data :-)
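If you'd rather sanity-check the chart's numbers yourself, here's the arithmetic spelled out (using only the figures above; the per-server core count simply falls out of 192 cores across 16 servers):

```python
# Sanity-checking the figures quoted above for a single full rack.
SEGMENT_SERVERS = 16            # plus 2 lightly-loaded master servers
TOTAL_CORES = 192
USABLE_TB_UNCOMPRESSED = 36     # usable capacity from the 600 GB drives

print(TOTAL_CORES / SEGMENT_SERVERS)    # 12 cores per segment server
print(144 / USABLE_TB_UNCOMPRESSED)     # 4.0 -> the 144 TB figure implies ~4x compression

# Effective capacity across the compression range the team quoted:
for ratio in (2.5, 4, 10, 20):
    print(f"{ratio:>4}x compression -> ~{USABLE_TB_UNCOMPRESSED * ratio:.0f} TB effective")
```

Which is why the 144 TB figure looks conservative: anything beyond ~4x compression on your data pushes the effective capacity well past it.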
And those numbers on the bottom aren't typos. The Greenplum DCA can linearly scale to 24 racks in a single logical complex.
That's one whackload of crunch if you think about it.
A Dash Of Initial EMC Value-Add, As Well
Given the short timelines for this project, I'm very pleased that we were able to get a decent helping of EMC value-add into the product, with the opportunity for more to come.
First, the Greenplum DCA is specified and manufactured just like any other EMC data appliance product: think Data Domain, Avamar, Centera, Atmos, etc. It comes off largely the same kind of assembly line.
Second, it's supported directly by the EMC Customer Service organization -- same mission-critical support we've always offered for all of our platforms.
And finally, we've done some integration with both our backup and replication technologies in this space, using Data Domain and RecoverPoint.
Data Warehousing and Data Protection
Yes, there are people who realize that the warehouse engine is powering key parts of the business, and deserves the same protection as other critical systems. As a matter of fact, there are a *lot* of these people.
Simply reloading the data warehouse isn't a viable option for these folks. Nor is tape. Going further, some of the more popular snap approaches can interfere with physical data layout -- an important concern for performance.
The first integration is between the Greenplum DCA and the larger Data Domain appliances.
Note the bit about a "native utility". I'm really curious what's in there, because having the backup utility understand both how the Greenplum DCA lays out its data and how the Data Domain appliance operates -- well, there's a lot of opportunity for value-add in this space, I'd offer.
And, of course, the Data Domain engine can replicate copies of itself at a distance if you'd like additional protection.
While this backup integration is useful, I tend to gravitate towards the RecoverPoint appliance, shown here.
Ordinarily, data is stored twice in a Greenplum environment for data protection purposes. Well, one of those copies can be external, if needed. One of the neat attributes of the DCA is that it's "SAN ready", and this is an example of putting that feature to use.
The "second copy" of data warehouse data can now be easily replicated using RecoverPoint, using all of its powerful capabilities -- point-in-time snapshots, journaling, compression -- the works.
And, because all the replication handling is completely external, nothing gets in the way of application-level performance.
I'm just itching to go head-to-head with the other guys using this particular feature. I've found that none of them particularly wants to have either the backup or remote replication discussion.
Well, we do.
Sizing and Scaling
For most IT organizations, attempting to size and predict growth for a business-critical DW environment is a thankless task indeed. There's no good way to predict how big it will get, or how fast it will need to be. People try all the time, but they usually end up missing the mark (high or low) by a significant margin.
If the project is successful (good news for the business) it ends up needing to be really big and really fast (challenging for IT). This usually results in the typical "have a hunch, provision a bunch" methodology :-)
Greenplum DCA changes the game: start small, grow in reasonable increments, get really big.
Each new unit of resource is transparently added to the pool; resource balancing happens automatically and in the background. Put what's needed on the floor to meet today's requirement, with no need to worry about what's coming at you down the road.
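To picture what "transparently added to the pool" looks like, here's a toy hash-distribution model -- purely illustrative, and not how Greenplum actually redistributes data: rows hash to whichever segments exist, and when the pool grows, the affected rows are quietly re-homed behind the scenes.

```python
# Toy model of growing a hash-distributed pool (illustrative only --
# not Greenplum's actual redistribution mechanism).
import zlib

def placement(keys, n_segments):
    """Map each row key to a segment with a simple deterministic hash."""
    return {k: zlib.crc32(k.encode()) % n_segments for k in keys}

keys = [f"row{i}" for i in range(10_000)]
before = placement(keys, 8)     # hypothetical starting pool of 8 segments
after = placement(keys, 12)     # grow the pool to 12 segments

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of rows land on a different segment after the expansion")
```

The important part, from the IT side, is that none of this bookkeeping is the administrator's problem -- you add the capacity, and the balancing happens in the background.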
OK, What Does It All Mean?
First, if you've been looking at all-in-one data warehousing appliances, there's a serious new option to go consider -- one that uses an entirely different set of technology assumptions than the other guys. Choice is good.
Second, I think customers should have options in how they consume software functionality, and data warehousing is no exception. Build your own environment using Greenplum software, if you like. Put the software in virtual machines, and run it on your existing VMware farm, or a Vblock, or even a compatible service provider if you like.
Or, if your requirement is focused enough (and you're in a hurry), consider the optimized appliance approach.
Customers choose, not vendors :-)
Third, if you track how the various vendors are lining up against each other, you'll notice a clear shot from EMC in the general direction of Oracle, Teradata and Netezza. That means we think this market is important, and there's a clear opportunity to disrupt the established players -- using a potent recipe of standard hardware, legacy-free architectures and advanced open software.
Finally, the term "big data" isn't entirely marketing hype. Prior to the Greenplum acquisition, I was already meeting plenty of IT shops who were living in this world and trying to figure out what to do. Now, with Greenplum as part of the team, I'm meeting a lot more.
Maybe we'll get a chance to chat with you about it before too long :-)
Certainly looks very impressive! I am wondering if you know offhand what the licensing model is for the software version? Is it per server, per core, per socket, or something else? Oracle, of course, has crazy core calculations for their DB, and I think IBM does something similar.
Also, I look forward to the day when we can see how workloads on these types of systems (EMC/Oracle/etc.) run vs. what seems to be becoming a big trend of running things like Hadoop. I'm sure the up-front costs of the higher-end stuff are greater, but the efficiency of the system itself could make up for that pretty easily in some cases, I suspect.
Something as simple as this:
http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
Thought it was kind of funny reading Google admit that their new indexing system, based on their distributed database, scales really well but carries a measured 30-fold overhead versus the more traditional way they did things. That 30-fold can add up fast!
Posted by: nate | October 13, 2010 at 01:22 PM