I wanted to blog about what a massive difference migrating from Hive to Spark made for a large client (an S&P 500 company) we work with. The client works with hundreds of terabytes of structured data: not really a "big data" problem, but one that sits in that uncomfortable space where Oracle is a little too small.
When we were initially engaged to help, their existing approach used Hive queries over Avro storage, with a normalized schema of a dozen or so tables .. something you would expect fairly competent people coming from database technology to come up with. Batch report generation was built with Oozie jobs that called a slew of Hive scripts; these wrote their results into final aggregated reporting tables via a set of temporary tables, and the results were then exported to end-user-facing systems.
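To make the shape of that pipeline concrete, here is a minimal sketch of the multi-stage pattern (normalized source tables, an intermediate temporary table, a final aggregated reporting table). It uses Python's built-in SQLite as a stand-in for Hive, and all table and column names are hypothetical, not the client's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two stand-ins for the normalized source tables (the real schema had ~12).
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 10, 50.0), (3, 20, 75.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "EAST"), (20, "WEST")])

# Stage 1: a Hive script joins normalized tables into a temporary table.
cur.execute("""
    CREATE TEMP TABLE tmp_order_region AS
    SELECT o.order_id, o.amount, c.region
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""")

# Stage 2: a later script aggregates the temp table into the final
# reporting table that gets exported to end-user-facing systems.
cur.execute("""
    CREATE TABLE report_revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM tmp_order_region
    GROUP BY region
""")

print(dict(cur.execute("SELECT region, revenue FROM report_revenue_by_region")))
# {'EAST': 150.0, 'WEST': 75.0}
```

In the real system each stage was a separate Hive script orchestrated by Oozie, so every intermediate table was materialized to disk between stages, which is a large part of why the end-to-end runs took days.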
It all sort of worked: it took days to run through a few reports, often failed on large data sets, and needed constant query tuning and patches. Testing was really hard because it required data transfers and ingestion across different environments, resulting in (take one guess) more patches and fixes to production.
Stay tuned to see how, in a matter of a few weeks, we took this functional but painful system, with its heavy manual intervention, to an automated, scalable, testable success story with massive ROI.