Like a lot of you, I’ve been following with interest Percona’s testing of the open source column databases. One thing I think is pretty cool about some of the column databases that work with MySQL is that they don’t require you to create indexes; in general, the column itself is the index. Not having to create indexes is nice because lots of indexes can really bog down a database under heavy load or DML activity, since every index has to be maintained on each insert and update.
In Percona’s test, they showed the load time for all the different databases, but I noticed that the times didn’t include index creation for LucidDB or MonetDB. I decided to follow Vadim’s link on the LucidDB index creation and totaled up the time it took to create the indexes. The index and statistics builds took 384,314 seconds, and when you add that to the 140,736 seconds for the table load, you get about 6 days just to create the database. That’s quite a difference from the 6 hours for InfiniDB and 14 hours for InfoBright, neither of which needs or uses indexes.
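For reference, here’s the back-of-the-envelope arithmetic behind that "6 days" figure (the two input numbers are the ones quoted above):

```python
# Totaling the LucidDB build times quoted from Vadim's writeup.
index_and_stats_s = 384_314   # index + statistics creation, in seconds
table_load_s = 140_736        # table load, in seconds

total_s = index_and_stats_s + table_load_s
total_days = total_s / 86_400  # 86,400 seconds per day

print(f"total: {total_s} s = {total_days:.1f} days")  # ~6.1 days
```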
I’m sure indexes provide a benefit for some column DBs in various use cases, but if the database were really dynamic and required a lot of new indexed objects to be added, continuous heavy loads, or DML, it would seem that indexes could really put a ding in things. In that case, column DBs that don’t require indexes could have an edge.
7 comments:
Indexes enable key RDBMS features like PRIMARY KEYs and FOREIGN KEYs, and without those indexes, relational integrity cannot be guaranteed. In my opinion, going without them is suicide from a BI standpoint.
PK and FK indexes imply B-tree structures, but they can be maintained in other structures that are more easily updated. Tokutek is working on Fractal indexing, which looks interesting, and other storage engines like Kickfire include multiple types of indexes, from hash-based indexes for quick updates of primary keys, to bitmaps, to specialized join indexes which improve the performance of joins.
Even with a column store, lookups on columns with low cardinality can result in a large amount of data being examined. In some cases, bitmap indexes can reduce this work significantly.
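To make the bitmap idea concrete, here’s a toy sketch (an illustration only, not how LucidDB or any real engine stores its bitmaps): one bitmap per distinct value, with bit i set when row i matches, so a compound predicate becomes a single bitwise AND instead of a column scan. The column values below are made up for the example.

```python
# Toy bitmap index: a Python int serves as the bitmap for each distinct value.
def build_bitmap_index(column):
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

region = ["ASIA", "EUROPE", "ASIA", "AMERICA", "ASIA", "EUROPE"]
nation = ["CHINA", "FRANCE", "JAPAN", "BRAZIL", "CHINA", "GERMANY"]

region_idx = build_bitmap_index(region)
nation_idx = build_bitmap_index(nation)

# Compound predicate: region = 'ASIA' AND nation = 'CHINA' is one bitwise AND;
# matching rows are found without scanning either column.
matches = region_idx["ASIA"] & nation_idx["CHINA"]
row_ids = [i for i in range(len(region)) if matches >> i & 1]
print(row_ids)  # rows 0 and 4
```

The lower the cardinality, the fewer bitmaps are needed and the better they compress, which is why this approach pays off exactly where plain column scans struggle.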
Also, an alternative to some forms of indexing is materialization, which is an approach taken by Vertica, in addition to other index methods.
I understand the need for PKs but not FKs... shouldn't referential integrity be enforced before data is loaded into the DW? As most will agree, I want the DW crunching and quickly producing data, not doing much else.
Data marts are always at least partially denormalized, and mistakes can be easily made in ETL processes such that a bad schema could be populated.
If you aren't running in strict mode, you could also end up with values you weren't expecting, but that is another discussion.
People also use non-ETL processes such as INSERT ... SELECT and CREATE TABLE AS ... SELECT which can result in accidental inconsistencies.
While working at Kickfire, I worked with multiple customers who thought that their data was referentially consistent, but in fact, it wasn't.
If you don't have PK and FK constraints and aren't prepared to run regular data validation queries, you are putting your data at risk.
A few comments:
(1) For LucidDB, the most expensive indexes to build on the fact table (those on foreign key columns with high cardinality) may not actually be getting used by the optimizer. I was working with Vadim to find out whether this was the case, but it turned out that he had already had to recycle the database storage for the next DBMS run, so we weren't able to use EXPLAIN PLAN. If that's the case (or if it turns out that their presence adds marginal benefit), skipping their creation would eliminate a large portion of the overall index creation time (and a lot of storage).
(2) I second what Justin Swanhart said about bitmap indexes. This (combined with star join optimization) is the reason that LucidDB is able to go faster with increasing query selectivity. Saying "the column is the index" only works if you take Vertica's presort approach (and you can afford multiple materializations of the same table in different sort order). See my blog post at http://thinkwaitfast.blogspot.com/2010/01/spin-cycle.html
(3) Although LucidDB currently only supports PRIMARY KEY (not FOREIGN KEY), indexing the foreign key columns is still needed since that's what the star join optimization relies on. But per (1) above, doing this blindly is not always helpful, so an autotune feature would be a nice addition (we could probably figure it out automatically from the results of ANALYZE TABLE ESTIMATE STATISTICS, which uses data sampling).
(4) It's true that referential integrity often comes "by construction" in a warehouse environment assuming your ETL surrogate key generation/lookup flows are perfect, but it's still a useful feature to have available at least as a QA measure (even if you don't enable it at all times), so long term we'd like to get this in LucidDB too.
A product without indexes will not be able to support a large segment of the market for reporting.
I know Oracle does amazing things for star queries with bitmap indexes and others tell me that Kickfire and LucidDB do the same.
Some of these engines without indexes can still run some of those queries, but at a huge increase in power consumption, because so much extra work is done, and at the cost of concurrent queries, because all of the disks are kept busy scanning entire tables even for highly selective queries.
Mark –
Good to hear from you on this subject. I will (shockingly!), however, have to respectfully disagree with your blanket statement that DB’s with no indexes can’t compete well in the reporting or DW markets.
First, let me say indexes certainly have their place as all of us know. And there are certainly use cases where an indexed DB will likely perform better than our InfiniDB engine (e.g. many / most columns selected coupled with PK lookup). In fact, we can take a look at the impact selectivity has for the bitmap index for LucidDB using the SSB. SSB Series 3 is progressively more selective as the compound predicates go from region to nation to city as it moves from Q3.1 to Q3.4. As a percentage of the total rows it progresses something like 3.6%, 0.14%, 0.00579%, to the most selective at 0.000081% for Q3.4. At the most selective case tested, the LucidDB bitmap index indeed was fastest of all open source systems tested under those conditions. In fact, for even fewer rows LucidDB would likely run even faster. For this data set and query complexity, that is one indicator where bitmaps may be more effective. Certainly there will be other cases where they will likely do even better.
But there are certainly use cases where a no-indexed DB will outrun an indexed row-DB. Netezza has been doing this for years, and other column DB’s like Vertica and Paraccel have seen good success too. There are probably plenty of opportunities out there where people are analyzing 0.1%, 1%, 10% or more of their data that we feel we can be very competitive.
Remember, too, that it’s not like our engine or the others mentioned above have no mechanism to help reduce I/O – they just don’t use indexes. InfoBright has their KnowledgeGrid; we have our Extent Map; Netezza uses Zone Maps, etc. All of these help cut down unneeded I/O, but do so without indexes. And there are other factors at play as well (e.g. Netezza with hardware, us with scale-up and MPP, etc.). It’s not a table/column-scan-run-wild situation. In fact, since these tests were on column DBs, there is no scanning of whole tables.
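The zone-map/extent-map idea can be sketched roughly like this (a toy illustration; the real structures in InfiniDB, InfoBright, and Netezza differ in detail, and the extent size and data below are made up): keep the min and max value per extent of rows, and skip any extent whose range can’t contain the predicate value.

```python
# Toy min/max "zone map": skip extents whose [min, max] can't match the predicate.
EXTENT_ROWS = 4

def build_zone_map(column):
    zones = []
    for start in range(0, len(column), EXTENT_ROWS):
        extent = column[start:start + EXTENT_ROWS]
        zones.append((start, min(extent), max(extent)))
    return zones

def extents_to_scan(zones, value):
    # Only extents whose range covers the value need any real I/O.
    return [start for start, lo, hi in zones if lo <= value <= hi]

# Dates loaded roughly in time order, so each extent covers a tight range.
orderdate = [19920101, 19920115, 19920203, 19920228,
             19930104, 19930119, 19930207, 19930301,
             19940102, 19940117, 19940205, 19940302]

zones = build_zone_map(orderdate)
print(extents_to_scan(zones, 19930119))  # only the 1993 extent is touched
```

This is also why loading the data in a meaningful order (e.g. by month, as suggested below for InfiniDB) matters so much: tight per-extent ranges are what make the pruning effective.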
I used to not buy the no-indexes talk either until I actually started running tests a few years back at MySQL and saw it for myself. The guy who just did our recent product viability test against a leading row database – Bert Scalzo – admits to the same thing in the results paper he did, but now he’s seen it in action and states he’s surprised at what happened. These types of end results are why I was pushing Marten so hard to go ‘all in’ with MySQL and column DB’s before Sun bought MySQL. Alas, just as I got him to make it a priority for MySQL, he left and Sun really never took that much interest.
For those interested in further reducing the elapsed time for these queries, and likely the power consumption, with InfiniDB, we would definitely recommend loading the data by month and tuning the queries slightly. For a number of the queries, this enables avoiding significant amounts of I/O. In fact, for query Q3.4 it reduced the I/O by better than 95%, also reducing the elapsed time on a different server from 303 seconds to 25 seconds. No, we haven’t done a power study, but it certainly looks like the right direction, both from an efficiency standpoint and for delivering competitive performance on even very selective queries.
Bottom line for me: the proof is in the pudding and the particular use case. Will we always win against an indexed DB? Nope. But is there good evidence and reason to believe no-indexed DB’s like ours and the others mentioned above can indeed address plenty of needs in the DW/reporting market? I think so. And it’s certainly easy to find this out – just download our database or one of the others and see if it works for your situation.
Last note: one thing we will be adding soon is PKs, but more for uniqueness enforcement than query help.
Thanks,
Robin
I agree with Mark that although some applications can get by without indexes, some really need them. The conventional wisdom is that indexes are too expensive to maintain, but that wisdom is informed by experience with the 40-year-old B-tree data structure. More modern data structures, such as Fractal Trees, can maintain indexes with two orders of magnitude fewer disk I/Os.
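The intuition behind that claim can be sketched with a toy model (this is not Tokutek’s actual Fractal Tree algorithm, just an illustration of write buffering; the batch size is arbitrary): instead of touching the on-disk index once per inserted row, queue the inserts and apply them in batches, amortizing one "write" over many rows.

```python
# Toy buffered index: one simulated disk write per batch instead of per row.
class BufferedIndex:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []       # pending inserts, held in memory
        self.index = []        # stands in for the on-disk structure
        self.disk_writes = 0   # counts simulated I/Os

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # One batched merge instead of one write per key.
        self.index = sorted(self.index + self.buffer)
        self.buffer = []
        self.disk_writes += 1

idx = BufferedIndex(batch_size=100)
for key in range(10_000):
    idx.insert(key)

# A naive row-at-a-time index would do 10,000 writes here; batching does 100.
print(idx.disk_writes)
```

Real write-optimized structures buffer messages inside the tree itself rather than in a single flat queue, but the amortization principle is the same.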
Product plug: Tokutek's TokuDB storage engine for MySQL is available now. TokuDB 2.2.0 (without ACID crash recovery) is generally available. TokuDB 3.0.2 (with full ACID compliance) is available in Beta.