BigData - Why is all this happening?

Because paid databases no longer have added value over open source. There is also no real difference or real margin in hardware. The new analytical databases have no open source alternative, which means customers are willing to pay good money for them. Thus, Oracle, IBM, EMC and HP are all taking what I call the mainframe approach. There are two very good reasons why mainframes have been around for so long: They are very stable and dependable and ... they are extremely hard to get out of. This is exactly what the big four are trying to achieve with these new analytical databases.

Oh... and there is the fact that data is growing exponentially worldwide. Something to do with machine-generated data, they say...

Preface

For the last few years, since roughly 2005, we have been experiencing a surge of interest in database technologies. Some have been around for a while and some are new. This comes in direct response to the explosion of data in the world. In the past, only 4-5 different DB vendors and technologies were in our evoked set, all quite similar in concept. Oracle, MSSQL, MySQL & DB2 basically dominated the market completely. Today customers are considering a variety of vendors boasting a variety of unique technologies. This paper gives an overview of the new BigData technologies and products.

Trends and brands

We can identify a few trends and emerging brands in the BigData world. As you read the following paragraphs please keep in mind this is all happening while we are all moving to the cloud.

NoSQL

“I don’t need a database, I just need a way to store a lot of data in a key value fashion”. This should enable very fast loading and retrieving of data but does not provide any SQL or ACID features. This trend has gained a lot of traction with products like Cassandra, MongoDB & CouchDB; however, the implementation scenarios are somewhat limited. Another perspective to consider is what you give up: the data management capabilities offered by SQL and extensions like the analytical functions.
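To make the trade-off concrete, here is a minimal sketch of the key value idea in Python. The class and the sample records are hypothetical and do not represent the API of Cassandra, MongoDB or CouchDB; the sketch only illustrates why lookups by key are fast while anything resembling a SQL WHERE clause turns into a full scan.

```python
# Hypothetical in-memory key value store; names are illustrative only.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Overwrites silently: no schema, no constraints, no transaction.
        self._data[key] = value

    def get(self, key, default=None):
        # A lookup by key is a single hash probe.
        return self._data.get(key, default)

    def scan(self, predicate):
        # Without SQL or indexes, filtering means visiting every record.
        return [k for k, v in self._data.items() if predicate(v)]


store = KeyValueStore()
store.put("user:1", {"name": "Ann", "country": "IL"})
store.put("user:2", {"name": "Bob", "country": "US"})

ann = store.get("user:1")
us_users = store.scan(lambda record: record["country"] == "US")
```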

Columnar DBs

The idea behind columnar DBs is simple: store the data so that each column’s data is sequential on disk. This decreases the seek and fetch time and the data volume in analytic queries. The simple explanation is this: if I select seven columns from a table that has seventy columns, I fetch only the relevant seven columns from disk, while with a row store I would be fetching all columns, because I am fetching blocks of data that are composed of all columns. I would be scanning more disk & fetching ten times the amount of data. This in itself gives you roughly a 10x factor in query performance. Add on top of that the MPP shared-nothing architecture and compression that some new DBs offer and you are way ahead of any traditional OLTP RDBMS when it comes to analytics.
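The seven-out-of-seventy arithmetic above can be sketched in a few lines of Python. The row count and the fixed 8 bytes per value are assumptions made only for illustration; real tables have mixed column widths, and compression changes the picture further.

```python
NUM_COLUMNS = 70       # columns in the table (from the example above)
SELECTED_COLUMNS = 7   # columns the query actually needs
NUM_ROWS = 1_000       # assumed row count
VALUE_SIZE = 8         # assumed bytes per value, same for every column


def bytes_read_row_store(num_rows, num_columns, value_size):
    # Rows are stored whole, so every block read carries all columns.
    return num_rows * num_columns * value_size


def bytes_read_column_store(num_rows, selected_columns, value_size):
    # Each column is stored contiguously, so only the selected ones are read.
    return num_rows * selected_columns * value_size


row_bytes = bytes_read_row_store(NUM_ROWS, NUM_COLUMNS, VALUE_SIZE)
col_bytes = bytes_read_column_store(NUM_ROWS, SELECTED_COLUMNS, VALUE_SIZE)
ratio = row_bytes / col_bytes
```

Under these assumptions the row store reads ten times as many bytes as the column store, which is where the 10x figure comes from.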

There are multiple commercial and community/open-source versions of columnar DBs that are getting a lot of traction in the market as they are pushed by IT giants like IBM (Netezza), HP (Vertica), EMC (Greenplum) and SAP (HANA & Sybase IQ). These databases provide an order-of-magnitude improvement in analytic performance, but most have problems handling any load of OLTP work.

In-Memory

As RAM becomes cheaper and machines with a bigger RAM capacity become available, in-memory databases like VoltDB and Memcached are starting to become players in the big data realm alongside classic big data solutions like ParAccel, which present all-in-memory options. If in the past a 10TB database would be extremely expensive to implement in memory, today it is very feasible and relatively inexpensive, and the use cases exist. With use cases ranging from algo-trading to real-time billing, 10TB in-memory databases are not even considered big any more. By spreading data across machines with ultra-fast networking, In-Memory is a very valid option. This is a great solution for high-speed OLTP in BigData systems.

With the introduction of the idea of hot data & warm data, implementations already exist, and will continue to grow, where In-Memory expands to flash drives to store an even bigger amount of “warm” data.
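The hot/warm split can be pictured with a small Python sketch. Everything here is an assumption for illustration: a plain dict stands in for the flash tier, the hot tier is limited by item count rather than bytes, and least-recently-used demotion is just one possible policy, not how any particular product works.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical two-tier store: 'hot' stands in for RAM, 'warm' for flash."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # most recently used keys sit at the end
        self.warm = {}

    def put(self, key, value):
        self.warm.pop(key, None)  # a key lives in exactly one tier
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._demote_if_needed()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.warm:
            value = self.warm.pop(key)
            self.put(key, value)  # promote warm data that is read again
            return value
        raise KeyError(key)

    def _demote_if_needed(self):
        # Push the least recently used keys down to the warm tier.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value


store = TieredStore(hot_capacity=2)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    store.put(key, value)
```

After the three writes, "a" has been demoted to the warm tier; reading it again promotes it back and demotes the least recently used hot key instead.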

Sharding

A lot of organizations are using MySQL sharding to satisfy their big data needs. This is easier said than done. It is not simple to manage huge amounts of data over sharded MySQL databases. Innovators like Xeround cloud are building tools that enable customers to use their existing investment in MySQL and grow into the cloud.
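A minimal Python sketch of hash-based shard routing shows both the idea and one reason it is hard to manage. The shard names and keys are hypothetical, and hash-modulo routing is only one of several possible schemes (range-based and directory-based sharding are common alternatives).

```python
import hashlib

# Hypothetical shard names; in practice these would map to MySQL instances.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(key, shards=SHARDS):
    # Use a stable hash: Python's built-in hash() is salted per process for
    # strings, so it would route the same key differently after a restart.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Adding a fifth shard changes the modulo, so most keys land elsewhere
# and their rows would have to be migrated.
keys = [f"customer-{i}" for i in range(1000)]
five_shards = SHARDS + ["mysql-shard-4"]
moved = sum(1 for k in keys if shard_for(k) != shard_for(k, five_shards))
```

With naive modulo routing, growing from four to five shards relocates the large majority of keys, which is part of what makes sharded MySQL painful to operate at scale.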

Hadoop

Hadoop was inspired by the GFS & MapReduce papers published by Google but was actually developed largely at Yahoo. It started out based on Apache Nutch, and the original goal was to create a web-scale, crawler-based search engine. Hadoop is open source and since 2008, when it hit web scale, it has been a great hype. It is now in the plateau phase. Hadoop is built on top of a very reliable distributed file system called HDFS and was primarily built for batch processing of map-reduce jobs. Several data management technologies were built on top of Hadoop. HBase (a key value store), Hive (a light SQL DB) and Pig (a high level language over MapReduce) are the ones more relevant to the BigData arena.

We tend to classify Hadoop as a great solution for low-touch data. There are multiple Hadoop distributions available.
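For readers new to the map-reduce model mentioned above, here is a single-process Python sketch of the classic word count. It does not use the Hadoop API; the function names are hypothetical, and the point is only to show the three phases (map, shuffle, reduce) that Hadoop distributes across machines.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one input record.
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in groups.items())


counts = word_count(["big data is big", "data is growing"])
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread over many machines, which is what makes the model a fit for batch processing.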

BI Tools

The BI tools realm is also seeing new sophisticated players penetrating the market with agile, easy to use tools. Companies like Tableau, QlikView & Pentaho are offering BI tools that are considered next generation to Cognos and Business Objects. Focusing on performance and usability, the new BI vendors are quickly gaining ground. They are focusing more on analytics and dashboarding than on reporting. EMC Greenplum is even coming out with its own “productivity layer” called Chorus.

There is also another set of tools that belong to the data mining family like SAS & SPSS.

Who is buying who and why?

Oracle bought Sun, which supports their appliance (mainframe) strategy quite well. So now their world class database has the hardware it needs. Oracle uses the HW to delegate some of the work done in the DB to the hardware layer. This accelerates performance. Exadata is the name, easiest upgrade from your existing Oracle is the game.

The biggest problem Oracle has is SAP, which has “in 9 months” developed HANA. SAP is very blunt about replacing all the Oracle DBs running SAP with HANA because it runs “100,000” times faster. SAP also bought Sybase, which sells Sybase IQ, a veteran columnar implementation that has a very large install base. SAP bought Sybase for their mobile offering, not their databases, but now that they have two BigData databases, they must be a big player in the market.

IBM bought Netezza. Yet another appliance. IBM also has DB2 and SolidDB, which is an in-memory database. IBM also has a Hadoop offering called BigInsights. Netezza has been around for a while and their install base is quite large. Combined with the IBM footprint in the IT world, we are bound to see a lot of Netezza boxes around... or not. With IBM it can go either way, as they have been known to have all kinds of influences on products, from stellar growth to rot in the basement.

EMC bought Greenplum. Not sure if the storage market has become less lucrative for EMC because of NetApp and IBM XIV or simply because the price was right and the timing was good. In any case this is another dominant player in the market with a great install base and an excellent sales organization in EMC, which has embraced Greenplum. You can buy Greenplum as a software-only product but not if you want performance. Did I say MF strategy already?

HP has bought Vertica, minus VoltDB, which was spun off earlier, because... it was the only one left? Leo had some money to spend? Oracle bought Sun and walked away from their long time partnership? All of the above? Who knows. One thing is sure, an appliance was made available shortly. You can still buy Vertica as a service but not from HP... funny.

Who was not bought?

There are many vendors out there that are still independent, and thus relatively small, but selling well. Infobright and ParAccel are just a couple of names.

Are they all the same?

No, they are not. They are all built with different technologies and behave differently in different scenarios. If you are venturing into the BigData world, you should spend time understanding your use case and how the different technologies impact the performance of your use case.

Here is an updated list of promising BigData players:
http://venturebeat.com/2012/10/16/big-data/view-all/

Copyrights © 2012 MidLink Computing LTD - Your link to innovation | www.midlink.co.il