While reading Anand’s post the other day about our client’s recent Hive-to-Spark migration, I thought it would be a useful exercise to dive a little deeper into how Hive works. On the surface it looks just like relational SQL: you write your query – select * from my_table where my_column is not null – and press execute, and after some time a table comes back. But what’s happening under the hood that makes Hive different from relational databases like Oracle and Sybase?
When you submit a HiveQL query, it is first sent through the Compiler, where several analysis steps are performed. First, the query is parsed into an abstract syntax tree (AST) representation (you can read more about ASTs here). Each node in this tree represents a language construct; in this case that could be a column name or an operator such as ‘like’ or ‘=’. Next, the AST is transformed into an internal representation of the query, and validations such as column verification and type checking are performed. Partition information, if applicable, is also collected from the Metastore in this step. From the internal query representation the Compiler produces a logical plan, which is a tree of operators – such as filter or join – that make up the query logic. The initial plan is then optimized to improve performance, for example by combining several joins into one multi-way join. The final step in the Compiler is to generate the query plan, a set of map-reduce tasks that will be submitted to Hadoop to produce the query result.
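To make those stages a little more concrete, here is a minimal Python sketch of the same pipeline for the example query from the top of this post. It is a toy model and not Hive's actual code: `METASTORE`, `Node`, and every function name are made up for illustration, the parser only understands this one query shape, and the "optimization" is a stand-in rewrite rather than one of Hive's own rules.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the Metastore: table name -> {column: type}.
METASTORE = {"my_table": {"my_column": "string", "id": "int"}}


@dataclass
class Node:
    """One AST node: a language construct plus its children."""
    kind: str
    value: str = ""
    children: list = field(default_factory=list)


def parse(query):
    """Step 1: turn 'select * from T where C is not null' into a small AST."""
    t = query.lower().split()
    shape_ok = (len(t) == 9 and t[:3] == ["select", "*", "from"]
                and t[4] == "where" and t[6:] == ["is", "not", "null"])
    if not shape_ok:
        raise SyntaxError("this toy parser only handles one query shape")
    predicate = Node("is_not_null", children=[Node("column", t[5])])
    return Node("query", children=[Node("from", t[3]),
                                   Node("where", children=[predicate])])


def analyze(ast):
    """Step 2: validate names against the metastore, build an internal form."""
    table = ast.children[0].value
    column = ast.children[1].children[0].children[0].value
    if table not in METASTORE:
        raise ValueError(f"unknown table: {table}")
    if column not in METASTORE[table]:
        raise ValueError(f"unknown column: {column}")
    return {"table": table, "filter_column": column}


def logical_plan(info):
    """Step 3: a chain of operators, listed from the data source upward."""
    return [("TableScan", info["table"]),
            ("Filter", info["filter_column"] + " is not null"),
            ("Select", "*")]


def optimize(plan):
    """Step 4: a toy rewrite (not one of Hive's): drop the no-op 'select *'."""
    return [op for op in plan if op != ("Select", "*")]


def physical_plan(plan):
    """Step 5: pack operators into tasks; scan plus filter needs no reducer."""
    return [{"stage": "Stage-1", "type": "map-only", "operators": plan}]


def compile_query(query):
    """Run all five toy stages in order and return the list of tasks."""
    return physical_plan(optimize(logical_plan(analyze(parse(query)))))


if __name__ == "__main__":
    for task in compile_query("select * from my_table where my_column is not null"):
        print(task)
```

Each function takes the previous stage's output as its input, which is the main point: by the time anything reaches Hadoop, the original SQL text has been replaced by a list of tasks. A misspelled column name fails in `analyze`, before any job is submitted.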
Once the query plan is ready, the Compiler hands it back to the Driver, which executes the plan by submitting its component map-reduce tasks to Hadoop. This is done via a middle man of sorts called the Execution Engine, which submits the tasks in dependency order, tracks the resulting jobs, and sends the final result back to the Driver to be passed along to the client that submitted the query.
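The hand-off between the Driver and the Execution Engine can be sketched in a few lines of Python as well. Again, this is only an illustration under stated assumptions: the class and method names are hypothetical, `submit` is a plain callable standing in for a real Hadoop job submission, and the plan is assumed to list its final stage last.

```python
class ExecutionEngine:
    """Hypothetical middle man: runs tasks in dependency order and tracks them."""

    def __init__(self, submit):
        self.submit = submit   # callable standing in for a Hadoop job submission
        self.status = {}       # stage name -> "RUNNING" or "DONE"

    def run(self, tasks):
        done, results = set(), {}
        pending = list(tasks)
        while pending:
            # A task is ready once every stage it depends on has finished.
            ready = [t for t in pending if set(t["depends_on"]) <= done]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable task dependencies")
            for task in ready:
                self.status[task["stage"]] = "RUNNING"
                results[task["stage"]] = self.submit(task)
                self.status[task["stage"]] = "DONE"
                done.add(task["stage"])
                pending.remove(task)
        # Assumption: the last task listed in the plan produces the final result.
        return results[tasks[-1]["stage"]]


class Driver:
    """Hypothetical driver: gets a plan from the compiler, then executes it."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def execute(self, query):
        plan = self.compiler(query)    # the compiler returns the list of tasks
        return self.engine.run(plan)   # the engine returns the final result


if __name__ == "__main__":
    def fake_compiler(query):
        # A made-up two-stage plan; a real plan comes from the Compiler.
        return [{"stage": "Stage-1", "depends_on": []},
                {"stage": "Stage-2", "depends_on": ["Stage-1"]}]

    def fake_submit(task):
        return "output of " + task["stage"]

    driver = Driver(fake_compiler, ExecutionEngine(fake_submit))
    print(driver.execute("select * from my_table where my_column is not null"))
```

The Driver never talks to Hadoop directly here. It only knows how to ask for a plan and how to hand that plan to the engine, which mirrors the division of labor described above.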
So there you have it. While on the surface Hive query execution may look exactly like it does in your familiar relational databases, behind the scenes your query is parsed, validated, planned, optimized, and turned into map-reduce tasks that run on Hadoop before your results come back. Hopefully now when you click that ‘Execute’ button you’ll have a better understanding and appreciation of everything that takes place next!