When Your Database Analytics Environment Isn't Large Enough: Why Size Matters

by Geno Valente -- To succeed in today's data-rich environment, Business Intelligence (BI) experts are looking to un-archive data and put it back into the analytics domain. "Life Events" are now being analyzed using supercomputing resources. Events such as having children, buying a new car, or even a simple TV purchase do not recur every six months for the same consumer. To understand customers, their families, and their patterns over the full life of the relationship with them, data warehousing experts want to include up to 10 or 20 years of data in their analytic environments. All they need is good performance and a beneficial return on investment (ROI) to get the budget for their programs.

Unfortunately, today's analytic environments are already struggling to maintain an ROI that can keep 6 to 13 months of consumer data in play. Many Fortune 500 companies have data warehouse (DW) systems that archive this time-series customer data for two reasons. The first is price: today's warehouse solutions cost between $100K and $200K per terabyte of user data, making long time-series environments cost prohibitive. Companies are looking to cut costs, and it would take a brave soul indeed to approach the CIO with a project asking for a 10x budget increase to put 10 years of data into play versus the 13 months they currently have. The second is performance at scale. Existing solutions that scale up to and beyond the petabyte range are good at operational BI but perform poorly for ad hoc query access. These systems are tuned for canned queries and reporting, but the PhDs and statisticians who perform "wide and deep" analytics don't want to be told what queries to ask or have their access patterns restricted. They want to iterate on the "ask, learn, ask more, learn more" cycle without restrictions on the data. Further, they cannot wait hours or days for results, because they will lose their train of thought or simply waste hours, slowing progress. If analysts can stay focused and get to the "aha" moment faster, their ROI targets can be met with one really good query. "Competing on Analytics" is the key to success today, and existing solutions are putting up these two roadblocks to larger, faster, cheaper data environments.
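The budget arithmetic behind that 10x figure is simple to sketch. Assuming data volume grows roughly linearly with the retention window, and using the per-terabyte prices cited above (the monthly volume below is an invented figure, purely for illustration):

```python
# Rough, illustrative cost model: system cost scales with retained data volume.
# The monthly volume and per-terabyte price are assumptions for illustration only.

MONTHLY_DATA_TB = 10        # assumed new user data per month, in terabytes
PRICE_PER_TB = 150_000      # midpoint of the $100K-$200K per terabyte range

def warehouse_cost(months_retained: int) -> int:
    """Estimated system cost for a given retention window."""
    return MONTHLY_DATA_TB * months_retained * PRICE_PER_TB

cost_13_months = warehouse_cost(13)       # today's typical retention window
cost_10_years = warehouse_cost(10 * 12)   # the desired retention window

print(f"13 months: ${cost_13_months / 1e6:.1f}M")
print(f"10 years:  ${cost_10_years / 1e6:.1f}M")
print(f"ratio:     {cost_10_years / cost_13_months:.1f}x")  # about 9.2x, i.e. roughly 10x
```

At a fixed price per terabyte, retention and budget scale together; the ratio only improves if the price per terabyte falls.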

There are many other examples where large data sets should be leveraged but are not. Medical research and ongoing drug trials may not show progress or side effects for years. Wouldn't it be nice to record vital measurements (blood pressure, heart rate, side effects, positive results, etc.) with more regularity and keep them available to researchers for longer periods of time? Add to that the ability to combine this time series with other clinical research data, and then combine that result with the new wave of personal genomic data, to find links between the research, what is actually happening, and which genomic markers might be associated with success or failure. Obviously, this would result in massive "joins" across petabytes of data and would be prohibitive for many DW/BI solutions because of size, performance, cost, or, more likely, all of the above.
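A minimal sketch of the kind of join being described, using pandas and an entirely hypothetical schema (the table and column names are invented for illustration; at petabyte scale this work would happen inside the warehouse rather than in memory):

```python
import pandas as pd

# Hypothetical tables; at real scale these would live in the warehouse, not in memory.
vitals = pd.DataFrame({          # longitudinal vital-sign measurements
    "patient_id": [1, 1, 2],
    "measured_at": pd.to_datetime(["2008-01-05", "2008-06-05", "2008-02-10"]),
    "systolic_bp": [128, 122, 141],
})
trials = pd.DataFrame({          # clinical-trial observations
    "patient_id": [1, 2],
    "trial_arm": ["drug_a", "placebo"],
    "outcome": ["improved", "no_change"],
})
genomics = pd.DataFrame({        # per-patient genomic attributes
    "patient_id": [1, 2],
    "variant_x_present": [True, False],
})

# The "massive join": vitals x trial outcomes x genomic attributes, keyed on patient.
combined = (vitals
            .merge(trials, on="patient_id")
            .merge(genomics, on="patient_id"))

# Example ad hoc question: do trial outcomes differ when variant X is present?
print(combined.groupby(["variant_x_present", "outcome"]).size())
```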

Why Systems Aren’t Ready Today

The "Database Analytics" field encompasses a spectrum of applications, and the systems deployed by enterprises are variously referred to as Business Intelligence (BI) systems, Decision Support Systems (DSS), Data Warehouses (DW), and Data Marts (DM). The goal of analytics is to extract information useful to the business. This is achieved by collating operational data with historical summaries and third-party databases and performing analysis on the resulting dataset. The output falls broadly into two main groups: Operational Reporting and Ad Hoc Data Exploration.

Operational reporting is typically used for efficiency management, dashboards, and everyday performance monitoring. These solutions are used by business-line managers who own profit and loss, run sales teams, and are looking for simple but very useful reports. The Ad Hoc Data Exploration group is completely different. These skilled analysts and PhD statisticians explore data "wide and deep" for such projects as strategic campaign management, pattern discovery, and even data quality. These people don't want a dashboard; they want one simple thing: unrestricted SQL access to any and all data they can affordably get their hands on.


As described above, these two groups differ significantly in use model, end users, and the owners/buyers of such systems.

Historically, the vast majority of enterprise spending has been directed toward "traditional BI," the Operational Reporting group of applications. Recently the focus has shifted toward ad hoc data exploration. The reasons for this shift are simple. Traditional BI, meaning canned queries that generate operational reports, is now well established in all enterprises and has been for about a decade. This level of analytics is the assumed baseline for all enterprises, a bare necessity for survival.

Beyond survival, however, enterprises need to compete strategically, and "Competing on Analytics" is the goal today. Companies can now create and maintain a sustainable strategic advantage through analytics by discovering patterns hidden within data. Pattern discovery implies unconstrained, ad hoc data exploration.

Performance Challenges in Ad Hoc Queries

The major challenges analysts face today are large and growing data sets, and legacy systems that work well for predictable queries (operational reporting) but poorly for ad hoc querying. This is not surprising given that historic spending has been directed toward optimizing traditional BI. As a consequence, an entire industry now exists to provide "work-arounds":

  • Implementing query-aware data models
  • Carefully constraining query workloads, query complexity, and query schedules
  • Co-locating data partitions sympathetically with the “most important” joins
  • Segregating ad-hoc querying into a separate environment

In the end, these remain work-arounds and band-aids, not real solutions that address the full spectrum of requirements for ad hoc exploration: unpredictable data access patterns and data sets that can be incredibly large. The third work-around in particular, partition co-location, is illustrated in the sketch below.
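A minimal sketch of that partition co-location work-around, with an invented node count, hash scheme, and tables (real MPP warehouses express this through distribution keys in their DDL). Both tables are hash-partitioned on the join key so matching rows land on the same node and the favored join never crosses the network; the same property is why the trick stops helping as soon as an ad hoc query joins on a different column:

```python
from collections import defaultdict

NUM_NODES = 4  # assumed cluster size, for illustration

def node_for(key) -> int:
    """Hash-partition a join key onto a node."""
    return hash(key) % NUM_NODES

# Hypothetical rows for two tables that are frequently joined on customer_id.
orders = [(101, "order_1"), (102, "order_2"), (101, "order_3")]   # (customer_id, order)
customers = [(101, "Alice"), (102, "Bob")]                        # (customer_id, name)

# Co-locate both tables on customer_id: matching rows share a node, so the
# "most important" join is node-local and needs no data movement.
placement = defaultdict(lambda: {"orders": [], "customers": []})
for row in orders:
    placement[node_for(row[0])]["orders"].append(row)
for row in customers:
    placement[node_for(row[0])]["customers"].append(row)

# Node-local join. It only works because both tables were partitioned on the
# same key; an ad hoc join on any other column would force data to move
# between nodes, which is where this work-around stops helping.
for node, tables in placement.items():
    for cust_id, name in tables["customers"]:
        for order_cust_id, order in tables["orders"]:
            if order_cust_id == cust_id:
                print(f"node {node}: {name} -> {order}")
```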

Larger data sets are required when the analysis needs to be "wider and deeper." The data must be examined at a more detailed level of granularity and over longer time series, and much of it is historical archived data as well as sub-transactional data. Sub-transactional data such as click-stream logs and RFID logs have low information density and therefore cannot, on their own, justify the ROI of a major system investment. Additionally, it is a critical requirement that these large data sets be quickly loaded into and unloaded from the analytics system to minimize non-productive use of the enterprise's IT infrastructure.

Cost-Effective, Expanded and Accelerated Analysis

As enterprise customers and research organizations continue to grow their data, they will demand four essential benefits: Usability, Scalability, Cost-Efficiency, and Eco-Friendliness. Obviously, a system that could store all data, answer every query in seconds, cost no money, and take up no power or space would be ideal, but it is not possible. The desire, however, is still there. The sweet spot in the ad hoc analytic market now is systems that offer a cost-effective way for a database environment to scale beyond a petabyte, are simple and fast to use, allow unrestricted ad hoc access, and are "green." That is no small feat, but by correctly applying the proper mix of commodity, open-source, and acceleration technology, new vendors are breaking price/performance barriers that were once considered impossible. Today, you can get an analytic environment for $20K per terabyte of user data: one that scales beyond many petabytes, is data-model agnostic, and performs well regardless of data partitioning. This is starting to commoditize the analytic world as you know it today and makes systems justifiable that never were before. This is not a "magic pill" to help you grow the size of your database, but a reality from a select few vendors who are bringing unrestricted access to the analytic community at price points never thought possible. In the near future, we should expect systems to move below $5K/TB of user data, become faster, and become ever more "green," enabling the next wave of large database analytics to grow even faster. Petabyte database analytic environments for the masses: who would have thought?
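The price points quoted here translate into very different totals at scale. A quick illustrative calculation, assuming 1 PB (1,000 TB) of user data (the volume is an assumption; the per-terabyte prices are the ones cited in this article):

```python
# Illustrative system cost at 1 PB of user data for each cited price point.
USER_DATA_TB = 1_000  # assumed volume: one petabyte

price_points = [
    ("legacy warehouse, low end", 100_000),
    ("legacy warehouse, high end", 200_000),
    ("commodity/accelerated today", 20_000),
    ("projected near future", 5_000),
]

for label, price_per_tb in price_points:
    total = USER_DATA_TB * price_per_tb
    print(f"{label:30s} ${total / 1e6:,.0f}M")
```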

Author bio.
Geno Valente is vice president at XtremeData Inc. in Schaumburg, IL. He has spent over 13 years helping support, sell, and market computing technology into markets including financial services, bioinformatics, high-performance computing, and WiMAX/LTE. - www.xtremedatainc.com/