Overview
Presto, a distributed SQL query engine, lets us run interactive analytic queries, and Hive lets us run batch processing, against data sources of all sizes, from gigabytes to petabytes. Presto can query data registered in the Hive Metastore. Hive is optimized for query throughput, while Presto is optimized for latency.
Hive follows a pull data processing model, whereas Presto follows a push model, like traditional DBMS implementations. Presto places a memory limit on query tasks, and daily or weekly report queries require a large amount of memory, so Hive is the better fit for those.
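The trade-off above can be sketched as a small routing rule: interactive queries that fit within Presto's memory limit go to Presto, and everything else goes to Hive. This is only an illustration. The `QueryJob` type, the `choose_engine` function and the 50 GB limit are hypothetical names and values, not part of the delivered platform.

```python
from dataclasses import dataclass

# Hypothetical per-query memory ceiling for the Presto cluster, in GB.
# The real value depends on how the cluster is configured.
PRESTO_QUERY_MEMORY_LIMIT_GB = 50.0


@dataclass
class QueryJob:
    name: str
    interactive: bool            # True for ad hoc exploration, False for scheduled reports
    estimated_memory_gb: float   # rough estimate of the peak memory the query needs


def choose_engine(job: QueryJob,
                  presto_limit_gb: float = PRESTO_QUERY_MEMORY_LIMIT_GB) -> str:
    """Return "presto" for interactive queries that fit in memory, otherwise "hive"."""
    if job.interactive and job.estimated_memory_gb <= presto_limit_gb:
        return "presto"
    # Batch jobs and memory-hungry queries fall back to Hive,
    # which favours throughput over latency.
    return "hive"


if __name__ == "__main__":
    jobs = [
        QueryJob("ad hoc data discovery", interactive=True, estimated_memory_gb=4.0),
        QueryJob("weekly report", interactive=False, estimated_memory_gb=400.0),
    ]
    for job in jobs:
        print(f"{job.name}: {choose_engine(job)}")
```

In practice the memory estimate would have to come from query history or table statistics; here it is supplied by hand.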
Infrastructure automation with Ansible and Terraform handles auto launching, auto scaling and auto healing of the Presto cluster and Hive on AWS On-Demand EC2 and AWS Spot Instances.
Problem Statement
- The client was looking to build a data processing and query platform, with cluster management, for their organization.
- The customer had large datasets on remote storage and wanted to use Presto for data discovery and Apache Hive and Tez for ETL jobs.
- They were already using AWS Cloud but wanted to automate cluster management and deployment for Presto and Hive using AWS Spot Instances.
Solution Offered
We offered a data processing and query platform with infrastructure automation that:
- Greatly simplifies, speeds up and scales big data analytics workloads.
- Processes data from external storage using fast execution engines such as Presto and Hive.
- Runs large and complex queries.
- Is cost effective: it uses AWS Spot Instances by default and heals the cluster when its size falls below the minimum cluster size.
- Automatically scales the cluster up and down according to CPU load.
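The healing and CPU-based scaling behaviour described in the last two points can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the thresholds, the step size, and the `ClusterState`, `ScalingPolicy` and `desired_node_count` names are hypothetical, and the actual solution implements this logic through Ansible, Terraform and AWS rather than through this code.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    running_nodes: int       # worker nodes currently alive
    avg_cpu_percent: float   # average CPU load across the cluster


@dataclass
class ScalingPolicy:
    # All values below are illustrative assumptions, not the customer's settings.
    min_nodes: int = 3
    max_nodes: int = 20
    scale_up_cpu: float = 75.0
    scale_down_cpu: float = 25.0
    step: int = 1


def desired_node_count(state: ClusterState, policy: ScalingPolicy) -> int:
    """Return the node count the cluster should converge to."""
    # Healing: if nodes were lost (for example, reclaimed Spot Instances)
    # and the cluster is below its minimum size, restore the minimum first.
    if state.running_nodes < policy.min_nodes:
        return policy.min_nodes
    # Scale up under high CPU load, capped at the maximum size.
    if state.avg_cpu_percent > policy.scale_up_cpu:
        return min(state.running_nodes + policy.step, policy.max_nodes)
    # Scale down under low CPU load, never below the minimum size.
    if state.avg_cpu_percent < policy.scale_down_cpu:
        return max(state.running_nodes - policy.step, policy.min_nodes)
    # Load is within the target band: keep the current size.
    return state.running_nodes


if __name__ == "__main__":
    policy = ScalingPolicy()
    for state in (ClusterState(1, 10.0), ClusterState(5, 90.0), ClusterState(5, 10.0)):
        print(state, "->", desired_node_count(state, policy))
```

Checking the minimum size before looking at CPU load means a cluster that has lost nodes is repaired even when it is idle.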