What is Apache Hive?
Apache Hive is open-source data warehousing software designed to read, write, and manage large data sets stored in the Apache Hadoop Distributed File System (HDFS), one part of the larger Hadoop ecosystem.
With extensive documentation and continuous updates, Apache Hive continues to innovate in making data processing easily accessible.
The History of Apache Hive
Apache Hive is an open source project that was conceived by co-creators Joydeep Sen Sarma and Ashish Thusoo during their time at Facebook. Hive started as a subproject of Apache Hadoop, but has graduated to become a top-level project of its own. As the limitations of Hadoop and MapReduce jobs became more apparent and data volumes grew from 10 GB per day in 2006 to 1 TB per day and then to 15 TB per day within a few years, Facebook engineers could no longer run complex jobs with ease, which led to the creation of Hive.
Apache Hive was created to achieve two goals. First, it provided a declarative SQL-based language that also let engineers plug in their own scripts and programs when SQL wasn’t enough. Because most of the engineering world already had SQL skills, this allowed them to use Hive with minimal disruption or retraining compared to other tools.
Second, it provided a centralized metadata store (based on Hadoop) of all datasets in the organization. Although initially developed within the walls of Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
What are some of the features of Hive?
Apache Hive supports analysis of large datasets stored in Hadoop HDFS and supported file systems such as Amazon S3, Azure Blob Storage, Azure Data Lake Storage, Google Cloud Storage, and Alluxio.
It provides a SQL-like query language called HiveQL with schema on read and transparently converts queries into Apache Spark, MapReduce, and Apache Tez jobs. Other features of Hive include:
- Hive’s data functions help process and query large data sets. These functions provide string manipulation, date manipulation, type conversion, conditional operators, mathematical functions, and more.
- Storing metadata in a relational database management system
- Different types of storage such as Parquet, plain text, RCFile, HBase, ORC
- Operating on compressed data stored in the Hadoop ecosystem using compression algorithms
- Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions
- SQL-like queries (HiveQL), which are implicitly converted to MapReduce, Tez, or Spark jobs
Apache Hive architecture and key components
The key components of the Apache Hive architecture are Hive Server 2, the Hive Query Language (HQL), the external Apache Hive Metastore, and the Hive Beeline Shell.
Hive Server 2
Hive Server 2 accepts incoming requests from users and applications, creates an execution plan, and automatically generates a YARN job to process SQL queries. The server also supports the Hive optimizer and Hive compiler to streamline data extraction and processing.
Hive Query Language
By enabling the implementation of code reminiscent of SQL, Apache Hive removes the need for lengthy Java MapReduce code to sort through unstructured data and allows users to perform queries using built-in HQL statements. These statements can be used to navigate large data sets, refine results, and share data in a cost-effective and time-efficient manner.
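To illustrate why a declarative query is shorter than hand-written processing code, the sketch below computes the same per-page view count twice: once with explicit map, shuffle, and reduce steps in plain Python, and once with a single SQL statement. SQLite from the Python standard library stands in for Hive here only so that the example runs anywhere. The table `page_views` and its columns are hypothetical, and the query sticks to basic `SELECT ... GROUP BY` syntax of the kind HiveQL also accepts.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user, page) pairs.
ROWS = [("ann", "/home"), ("bob", "/home"), ("ann", "/docs"), ("cat", "/home")]


def count_by_hand(rows):
    """Imperative version: explicit map, shuffle (group), and reduce steps."""
    mapped = [(page, 1) for _user, page in rows]  # map: emit (key, 1)
    groups = defaultdict(list)
    for key, value in mapped:  # shuffle: group values by key
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}  # reduce


# Declarative version: one statement describes the result, not the steps.
QUERY = "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page"


def count_with_sql(rows):
    """Run QUERY on an in-memory SQLite database (a stand-in for Hive)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()
```

Both functions return the same mapping from page to view count. In Hive, the engine rather than the user would be responsible for turning the query into distributed jobs.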
The Hive Metastore
The central repository of the Apache Hive infrastructure, the metastore is where all Hive metadata is stored. This includes table names, column names, data types, partition information, and data location in HDFS. In the metastore, metadata can also be formatted into Hive tables and partitions to compare data between relational databases.
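As a rough illustration of the kind of information a metastore keeps, the sketch below models a tiny catalog in an in-memory SQLite database. It is not the real metastore schema, which is considerably larger and uses different table names. The three tables (`tbls`, `cols`, `parts`), the helper functions, and the sample `sales` table are all hypothetical simplifications.

```python
import sqlite3


def create_catalog():
    """Create a toy metadata catalog in an in-memory relational database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tbls  (tbl_id INTEGER PRIMARY KEY, name TEXT UNIQUE, location TEXT);
        CREATE TABLE cols  (tbl_id INTEGER, position INTEGER, name TEXT, type TEXT);
        CREATE TABLE parts (tbl_id INTEGER, spec TEXT, location TEXT);
    """)
    return conn


def register_table(conn, name, location, columns, partitions=()):
    """Record a table's columns, partitions, and storage locations."""
    cur = conn.execute(
        "INSERT INTO tbls (name, location) VALUES (?, ?)", (name, location)
    )
    tbl_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cols VALUES (?, ?, ?, ?)",
        [(tbl_id, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )
    # Each partition lives in its own subdirectory of the table location.
    conn.executemany(
        "INSERT INTO parts VALUES (?, ?, ?)",
        [(tbl_id, spec, f"{location}/{spec}") for spec in partitions],
    )
    return tbl_id


def describe(conn, name):
    """Return (column, type) pairs in order, as a query planner might request."""
    return conn.execute(
        "SELECT c.name, c.type FROM cols c JOIN tbls t ON c.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY c.position",
        (name,),
    ).fetchall()


def partition_locations(conn, name):
    """Return the storage location of every partition of a table."""
    rows = conn.execute(
        "SELECT p.location FROM parts p JOIN tbls t ON p.tbl_id = t.tbl_id "
        "WHERE t.name = ? ORDER BY p.spec",
        (name,),
    ).fetchall()
    return [location for (location,) in rows]


catalog = create_catalog()
register_table(
    catalog,
    "sales",
    "hdfs:///warehouse/sales",
    [("id", "BIGINT"), ("amount", "DOUBLE")],
    partitions=["dt=2024-01-01", "dt=2024-01-02"],
)
```

Calling `describe(catalog, "sales")` returns the column list, and `partition_locations(catalog, "sales")` returns one directory per partition, which is the sort of lookup a query planner needs before it can read any data.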
Hive Beeline Shell
In line with other database management systems (DBMS), Hive has its own built-in command line interface where users can execute HQL statements. In addition, Hive provides JDBC and ODBC drivers, so queries can also be issued from a Java Database Connectivity or Open Database Connectivity application.
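A minimal sketch of how a one-off Beeline invocation might be assembled from Python is shown below. It uses Beeline's `-u` (JDBC URL), `-n` (user name), and `-e` (query) options and the `jdbc:hive2://` URL scheme. The host name `hive.example.com`, the user `analyst`, and the helper `beeline_command` are hypothetical, and port 10000 is assumed because it is the usual Hive Server 2 default.

```python
import shlex


def beeline_command(host, query, port=10000, database="default", user=None):
    """Build the argument list for running a single query through Beeline."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    args = ["beeline", "-u", url]
    if user is not None:
        args += ["-n", user]  # connect as this user
    args += ["-e", query]  # execute the query and exit
    return args


# Hypothetical connection details, for illustration only.
cmd = beeline_command("hive.example.com", "SHOW TABLES", user="analyst")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list makes it suitable for `subprocess.run(cmd)` without shell quoting problems. Whether a password or other authentication options are needed depends on how the cluster is configured.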
How does Apache Hive software work?
Hive Server 2 accepts incoming requests from users and applications, creates an execution plan, and automatically generates a YARN job to process SQL queries. The YARN job can be generated as a MapReduce, Tez, or Spark workload.
This job runs as a distributed application in Hadoop. Once the SQL query has been processed, the results are returned to the end user or application, or written back to HDFS.
The Hive Metastore uses a relational database such as Postgres or MySQL to persist table metadata, and Hive Server 2 retrieves the table structure from it as part of its query planning. In some cases, applications may also interrogate the metastore as part of their underlying processing.
Hive workloads run on YARN, the Hadoop resource manager, to provide a processing environment capable of running Hadoop jobs. This processing environment consists of allocated CPU and memory from the various worker nodes in the Hadoop cluster.
YARN attempts to use HDFS metadata to ensure processing runs where the necessary data resides, with Hive automatically generating the code for SQL queries as MapReduce, Tez, or Spark jobs.
Hive originally ran only on MapReduce, and most Cloudera Hadoop deployments have Hive configured to use MapReduce or sometimes Spark. Hortonworks (HDP) deployments typically have Tez configured as the runtime.
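The runtime is selected through Hive's `hive.execution.engine` property, which takes the values `mr`, `tez`, or `spark`. The sketch below builds the matching `SET` statement for a session. The helper name `set_engine_statement` is hypothetical, and which engines are actually available depends on the Hive version and distribution.

```python
# Friendly names mapped to the values accepted by hive.execution.engine.
ENGINES = {"mapreduce": "mr", "tez": "tez", "spark": "spark"}


def set_engine_statement(engine):
    """Return the HiveQL statement that selects an execution engine."""
    key = engine.lower()
    if key not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {sorted(ENGINES)}")
    return f"SET hive.execution.engine={ENGINES[key]};"


statement = set_engine_statement("Tez")
```

A statement like this would be issued at the start of a session, for example through Beeline, before the queries that should run on that engine.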
What are the five different data types used by Apache Hive?
By using batch processing, Apache Hive can efficiently extract and analyze petabytes of data at fast speeds, making it ideal not only for processing the data but also for executing ad hoc queries.
Apache Hive data types consist of five categories: Numeric, Date/Time, String, Complex, and Misc.
Numeric data types
As the name suggests, these data types hold numbers. Integer examples are ‘TINYINT’, ‘SMALLINT’, ‘INT’ and ‘BIGINT’; Hive also offers ‘FLOAT’, ‘DOUBLE’ and ‘DECIMAL’ for non-integer values.
Date and time data types
These data types allow users to enter a time and a date, with ‘TIMESTAMP’, ‘DATE’ and ‘INTERVAL’ all accepted entries.
String data types
Again, this data type is very straightforward and allows typed text, or ‘strings’, to be implemented as data for processing. String data types include ‘STRING’, ‘VARCHAR’, and ‘CHAR’.
Complex data types
One of the most advanced data types, complex data types record more elaborate data and consist of types such as ‘STRUCT’, ‘MAP’, ‘ARRAY’ and ‘UNION’.
Miscellaneous types
Data types that do not fit into any of the other four categories are known as miscellaneous data types and can take inputs such as ‘BOOLEAN’ or ‘BINARY’.
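The five categories above can be sketched as a small lookup table, together with a helper that assembles a `CREATE TABLE` statement using types from several of them. The function names, the `events` table, and its columns are hypothetical. The numeric set also lists `FLOAT`, `DOUBLE`, and `DECIMAL`, and the union type is written `UNIONTYPE`, which is how Hive spells it in DDL.

```python
import re

# Category lookup; type names are the base keywords without parameters.
TYPE_CATEGORIES = {
    "Numeric": {"TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE", "DECIMAL"},
    "Date/Time": {"TIMESTAMP", "DATE", "INTERVAL"},
    "String": {"STRING", "VARCHAR", "CHAR"},
    "Complex": {"STRUCT", "MAP", "ARRAY", "UNIONTYPE"},
    "Misc": {"BOOLEAN", "BINARY"},
}


def category_of(type_expr):
    """Classify a type expression such as 'ARRAY<STRING>' or 'VARCHAR(50)'."""
    # Strip parameters like <...> or (...) to get the base keyword.
    base = re.split(r"[<(]", type_expr.strip(), maxsplit=1)[0].strip().upper()
    for category, names in TYPE_CATEGORIES.items():
        if base in names:
            return category
    raise ValueError(f"unrecognized type: {type_expr!r}")


def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement after checking every column type."""
    for _name, type_expr in columns:
        category_of(type_expr)  # raises ValueError on an unknown type
    body = ",\n  ".join(f"{name} {type_expr}" for name, type_expr in columns)
    return f"CREATE TABLE {table} (\n  {body}\n)"


# Hypothetical table mixing numeric, string, date/time, complex, and misc types.
ddl = create_table_ddl(
    "events",
    [
        ("id", "BIGINT"),
        ("name", "VARCHAR(50)"),
        ("created", "TIMESTAMP"),
        ("tags", "ARRAY<STRING>"),
        ("active", "BOOLEAN"),
    ],
)
```

The resulting `ddl` string is plain HiveQL text. The sketch only checks type names against the lookup table and does not validate nested type parameters.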
How Map Join works in Apache Hive
In Apache Hive, Map Join is a feature employed to increase the speed and efficiency of a query by combining, or rather “joining,” data from two tables in the map phase, typically by holding the smaller table in memory, without going through the shuffle and reduce stages of the process.
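The idea behind a map-side join can be sketched in plain Python: the small table is loaded into an in-memory hash table, and the large table is then streamed past it in a single pass, so no sorting or shuffling by key is needed. This is a single-process illustration of the technique, not Hive's implementation. The function name and the sample tables are hypothetical. In Hive itself, automatic conversion to a map join is controlled by settings such as `hive.auto.convert.join`.

```python
def map_join(large_rows, small_rows, key):
    """Inner-join two iterables of dicts on `key` in one pass over large_rows."""
    # Build phase: load the small table into an in-memory hash table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large table and look up matches directly.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample tables: a large fact table and a small dimension table.
orders = [
    {"order_id": 1, "country": "DE"},
    {"order_id": 2, "country": "FR"},
    {"order_id": 3, "country": "XX"},  # no matching country row
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]

joined = list(map_join(orders, countries, "country"))
```

The approach only pays off when the small table fits comfortably in memory on every worker, which is why it is usually applied to joins between one large table and one small one.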
What is a relational database management system (RDBMS) and how does Apache Hive use it?
A relational database management system (RDBMS) is a database model that stores data in tables made up of rows and columns and allows different data sets to be connected and compared.
By using an RDBMS, Apache Hive can ensure that all data is stored and processed securely, reliably, and accurately because built-in features such as role-based security and encrypted communications ensure that only the right people have access to the extracted information.
What is the difference between Apache Hive and a traditional RDBMS?
There are a few key differences between Apache Hive and an RDBMS:
- RDBMS functions work on read and write many times, while Hive works on write once, read many times.
- Hive follows the schema-on-read rule, which means there is no validation, checking or parsing of data when it is loaded, only copying/moving files. In traditional databases, a schema is applied to a table when data is written, which is known as schema on write.
- Because Hive is based on Hadoop, it has to comply with the same Hadoop and MapReduce constraints, which a traditional RDBMS may not need to.
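The schema-on-read behavior described in the list above can be sketched as follows: raw lines are stored untouched, and the schema is applied only when the data is read. Fields that do not fit the declared type come back as `None`, which mirrors how Hive typically returns NULL for text data that does not match the column type. The function names, the three supported type names, and the sample data are hypothetical simplifications.

```python
def parse_field(raw, type_name):
    """Cast one raw text field; return None when it does not fit the type."""
    try:
        if type_name == "INT":
            return int(raw)
        if type_name == "DOUBLE":
            return float(raw)
        return raw  # STRING: keep the text as it is
    except ValueError:
        return None


def read_with_schema(lines, schema, delimiter=","):
    """Apply (name, type) pairs to raw delimited lines at read time."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Missing trailing fields read as None; extra fields are ignored.
        fields += [None] * (len(schema) - len(fields))
        yield {
            name: None if raw is None else parse_field(raw, type_name)
            for (name, type_name), raw in zip(schema, fields)
        }


# "Loading" is just keeping the raw lines; nothing is validated here.
RAW_LINES = ["1,ann,9.5", "two,bob,7", "3,cat"]
SCHEMA = [("id", "INT"), ("name", "STRING"), ("score", "DOUBLE")]

rows = list(read_with_schema(RAW_LINES, SCHEMA))
```

In a schema-on-write system, the second and third lines would be rejected at load time. Here they load without complaint, and the mismatches only surface as missing values when the data is queried.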
Apache Hive vs. Apache Spark
An analytics framework designed to process large volumes of data across multiple datasets, Apache Spark provides a powerful user interface capable of supporting multiple languages, from R to Python.
Hive provides an abstraction layer that represents data as tables with rows, columns, and data types to query and analyze using a SQL interface called HiveQL. Apache Hive supports ACID transactions with Hive LLAP. Transactions ensure consistent views of data in an environment where multiple users/processes access data at the same time for create, read, update, and delete (CRUD) operations.
Databricks offers Delta Lake, which is similar to Hive LLAP in that it provides ACID transactional guarantees, but offers several other benefits to help with performance and reliability when accessing data. Spark SQL is the Apache Spark module for interacting with structured data represented as tables with rows, columns, and data types.
Spark SQL is compatible with SQL 2003 and uses Apache Spark as the distributed engine to process the data. In addition to the Spark SQL interface, a DataFrames API can be used to interact with data using Java, Scala, Python and R. Spark SQL is similar to HiveQL.
Both use ANSI SQL syntax, and most Hive functions will run in Databricks. This includes Hive functions for conversions and date and time analysis, collections, string manipulation, mathematical operations, and conditional functions.
There are some Hive-specific functions that would have to be converted to the Spark SQL equivalent or that do not exist in Spark SQL in Databricks. You can expect all HiveQL ANSI SQL syntax to work with Spark SQL in Databricks.
This includes ANSI SQL analytical and aggregate functions. Hive is optimized for the Optimized Row Columnar (ORC) file format and also supports Parquet. Databricks is optimized for Parquet and Delta, but also supports ORC. We always recommend using Delta, which uses open source Parquet as the file format.
Apache Hive vs. Presto
A project originally established at Facebook, PrestoDB, more commonly known as Presto, is a distributed SQL query engine that allows users to process and analyze petabytes of data at a fast speed. Presto’s infrastructure supports the integration of relational and non-relational databases, from MySQL and Teradata to MongoDB and Cassandra.