Introduction
Our client from Singapore was building a stock prediction platform that collected data from various third-party services: stock data from the Quandl stock data service and the latest news about multiple companies from the IBM Watson Discovery Service.
These data collection services ran on their on-premises VMs and wrote data to Ceph object storage. Airflow, deployed on one of the VMs, scheduled Spark jobs that collected new data files from Ceph, ran the transformations, and stored the data in Hive. Hadoop was also set up on their on-premises VMs.
A REST API and SDKs gave data scientists access to the Hive data warehouse, and their prediction algorithms ran on TensorFlow and persisted the results to MySQL.
A stock prediction dashboard was built on top of MySQL by consuming the REST APIs. The client wanted to migrate this entire technology stack to Google Cloud, so we started working on it in collaboration with them.
Technology Stack
- Node.js-based data collection services (on Google Compute Engine)
- Google Cloud Storage as the data lake (storing raw data coming from the data collection services)
- Apache Airflow (configuration and scheduling of the data pipeline that runs the Spark transformation jobs)
- Apache Spark on Cloud Dataproc (transforming raw data into structured data; see the sketch after this list)
- Hive data warehouse on Cloud Dataproc
- Play Framework in Scala (REST API)
- Python-based SDKs
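The Spark-on-Dataproc and Hive items above are where raw files from the data lake become queryable data. Below is a minimal PySpark sketch of that kind of transformation job, assuming the bucket, column, and table names shown; they are illustrative, not the client's actual identifiers.

```python
# Illustrative PySpark job: read raw JSON from the GCS data lake and append it
# to a Hive table on Dataproc. All names here are assumptions for the sketch.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("raw-to-warehouse")
    .enableHiveSupport()  # Dataproc clusters provide a Hive metastore
    .getOrCreate()
)

# Raw files dropped into the data lake by the data collection services.
raw = spark.read.json("gs://raw-stock-data-lake/quandl/*.json")

# A representative transformation: normalise column names/types and add a load date.
quotes = (
    raw.select(
        F.col("ticker"),
        F.col("date").cast("date").alias("trade_date"),
        F.col("close").cast("double").alias("close_price"),
    )
    .withColumn("load_date", F.current_date())
)

# Append into the Hive warehouse so the REST API and SDKs can query it.
spark.sql("CREATE DATABASE IF NOT EXISTS warehouse")
quotes.write.mode("append").format("hive").saveAsTable("warehouse.stock_quotes")
```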
Solution Offered
Steps used to build this Platform
In collaboration with the client's team, we worked through their requirements, such as the various data sources and data pipelines, for migrating their platform from on-premises infrastructure to Google Cloud Platform.
Data Collection Services on Google Compute Engine
We migrated all of their data collection services, the REST API, and the other background services to Google Compute Engine (VMs).
Updating the Data Collection Jobs to Write Data to Google Cloud Storage Buckets
The data collection jobs were developed in Node.js and wrote data to Ceph object storage, which the client used as their data lake. Our Node.js developers updated the existing code to write the data to Google Cloud Storage buckets instead, so Cloud Storage became the data lake in the new setup.
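To illustrate the change, here is a minimal sketch of writing a collected file to a Cloud Storage bucket. The client's collectors are written in Node.js; the sketch below shows the same pattern with the Python google-cloud-storage client, and the bucket and file names are assumptions.

```python
# Minimal sketch: upload a collected raw data file to the GCS data lake
# (bucket, paths, and object names are illustrative assumptions).
from google.cloud import storage


def upload_raw_file(bucket_name: str, local_path: str, object_name: str) -> None:
    client = storage.Client()  # uses the VM's default service account credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_path)


if __name__ == "__main__":
    # Example: push the latest Quandl pull into the data lake.
    upload_raw_file(
        "raw-stock-data-lake",
        "/tmp/quandl_2019-06-01.json",
        "quandl/2019-06-01.json",
    )
```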
Using Apache Airflow to Build Data Pipelines and Building the Data Warehouse with Hive and Spark
The client had already developed a set of Spark jobs that run every 3 hours, check for new files in the data lake (Cloud Storage buckets), run the transformations, and store the data in the Hive data warehouse. We migrated their Airflow data pipelines to Google Compute Engine, moved the Hive warehouse off on-premises HDFS, and used a Cloud Dataproc cluster for Spark and Hadoop; a minimal DAG sketch is shown below.
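As a rough illustration of such a pipeline, the sketch below is an Airflow DAG that submits a PySpark transformation job to a Dataproc cluster every 3 hours. The project, cluster, region, bucket, and job names are assumptions, and the operator shown is the Google provider's DataprocSubmitJobOperator rather than necessarily the operator used in the client's DAGs.

```python
# Illustrative Airflow DAG: run the Spark transformation on Dataproc every 3 hours.
# All identifiers (project, cluster, bucket, region) are assumptions for the sketch.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "stock-prediction-project"},
    "placement": {"cluster_name": "stock-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://raw-stock-data-lake/jobs/raw_to_hive.py"},
}

with DAG(
    dag_id="raw_to_hive_every_3_hours",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 */3 * * *",  # run every 3 hours
    catchup=False,
) as dag:
    # Submit the Spark transformation job to Dataproc; the job itself picks up
    # any new files in the data lake and appends them to the Hive warehouse.
    transform = DataprocSubmitJobOperator(
        task_id="spark_transform_to_hive",
        job=PYSPARK_JOB,
        region="asia-southeast1",
        project_id="stock-prediction-project",
    )
```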
Migrating the REST APIs to Google Compute Engine Instances
The REST API, which served prediction results to the dashboard and also acted as the data access layer for data scientists, was likewise migrated to Google Compute Engine instances (VMs).