Introduction
A retail corporation planned to build a data lake for analytics, collecting data from country-level servers, IoT devices, social media, clickstreams, and logs to research product discovery, product recommendations, and new product requirements. We had to build a data lake and a data warehouse for real-time and batch data processing, covering social media analytics, IoT analytics, image analytics, a clickstream-based recommendation system, and data warehouse ETL operations.
Problem Statement:
- Server data must be pulled in a specified DCD file format, which consists of 16 XML files.
- Monitoring of stores and refrigerators with IoT devices, with a data pipeline that collects sensor data, runs analytics, and detects anomalies based on the collected readings.
- The client wants to set up a data pipeline that collects real-time social media data for their product hashtags, for sentiment and intent analytics.
- A recommendation system that collects clickstream data from the web and mobile applications.
- Data scraping for product search and discovery.
- Data ingestion from the ERP solution for their vendors.
Solution Offered:
Real-time Social Media Analytics:
Real-time tweets are collected from the Twitter API, filtered by specific keywords, hashtags, language, and location. A Python application collects the data from Twitter and sends it to Google Cloud Pub/Sub; this application is deployed on Google App Engine. The data from Pub/Sub is consumed by Cloud Dataflow for further cleaning and transformation and then loaded into the BigQuery data lake.
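A minimal sketch of the collector is shown below, assuming Tweepy for the streaming API and the google-cloud-pubsub client. The project, topic, hashtags, and credentials are placeholders, not the client's actual values.

```python
import json

import tweepy
from google.cloud import pubsub_v1

# Hypothetical identifiers; replace with the real project, topic, and credentials.
PROJECT_ID = "retail-data-lake"
TOPIC_ID = "social-media-tweets"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


class TweetForwarder(tweepy.Stream):
    """Filtered Twitter stream that forwards each matching tweet to Pub/Sub."""

    def on_status(self, status):
        message = {
            "id": status.id,
            "text": status.text,
            "lang": status.lang,
            "created_at": status.created_at.isoformat(),
        }
        # Publish asynchronously; Dataflow consumes these messages downstream.
        publisher.publish(topic_path, json.dumps(message).encode("utf-8"))


stream = TweetForwarder("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
# Track product hashtags in English tweets only (keywords are placeholders).
stream.filter(track=["#ourproduct", "#ourbrand"], languages=["en"])
```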
Real-time IoT Analytics platform:
Sensor data is collected from IoT devices installed in warehouses and refrigerators at different locations. The devices were configured with Google Cloud IoT Core using the MQTT bridge, Google Pub/Sub was used as the messaging queue, and Google Cloud Dataflow handled transformation and cleaning. The cleaned data is sent to the BigQuery data lake for further analytics.
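On the device side, publishing a reading through the IoT Core MQTT bridge looks roughly like the following sketch, assuming paho-mqtt and PyJWT; the project, registry, and device identifiers are placeholders.

```python
import datetime
import json
import ssl

import jwt  # PyJWT
import paho.mqtt.client as mqtt

# Illustrative placeholders; use the registry and device registered in IoT Core.
PROJECT_ID = "retail-data-lake"
CLOUD_REGION = "us-central1"
REGISTRY_ID = "store-sensors"
DEVICE_ID = "fridge-001"
PRIVATE_KEY_FILE = "rsa_private.pem"


def create_jwt(project_id, private_key_file, algorithm="RS256"):
    """Create a short-lived JWT that IoT Core accepts as the MQTT password."""
    now = datetime.datetime.utcnow()
    claims = {"iat": now, "exp": now + datetime.timedelta(minutes=20), "aud": project_id}
    with open(private_key_file, "r") as f:
        private_key = f.read()
    return jwt.encode(claims, private_key, algorithm=algorithm)


client_id = (
    f"projects/{PROJECT_ID}/locations/{CLOUD_REGION}"
    f"/registries/{REGISTRY_ID}/devices/{DEVICE_ID}"
)
client = mqtt.Client(client_id=client_id)
# IoT Core ignores the username; the JWT goes in the password field.
client.username_pw_set(username="unused", password=create_jwt(PROJECT_ID, PRIVATE_KEY_FILE))
client.tls_set(ca_certs="roots.pem", tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect("mqtt.googleapis.com", 8883)
client.loop_start()

# Publish one refrigerator reading to the device telemetry topic; IoT Core
# forwards it to the Pub/Sub topic attached to the registry.
reading = {"device_id": DEVICE_ID, "temperature_c": 3.4, "door_open": False}
client.publish(f"/devices/{DEVICE_ID}/events", json.dumps(reading), qos=1)

client.loop_stop()
client.disconnect()
```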
Real-time Clickstream Analytics:
This use case supports the product recommendation system. Real-time clickstream data is captured by a Google Cloud Function with an HTTP trigger and sent to Google Pub/Sub. Before the data is analyzed in BigQuery, it is cleaned and transformed with Cloud Dataflow.
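A minimal sketch of such a Cloud Function is shown below, assuming a Python runtime; the project and topic names are hypothetical, as is the shape of the click event.

```python
import json
import os

from google.cloud import pubsub_v1

# Hypothetical names; configure these for the real environment.
PROJECT_ID = os.environ.get("GCP_PROJECT", "retail-data-lake")
TOPIC_ID = os.environ.get("CLICKSTREAM_TOPIC", "clickstream-events")

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def collect_click(request):
    """HTTP-triggered Cloud Function: validate a click event and forward it to Pub/Sub."""
    event = request.get_json(silent=True)
    if not event or "user_id" not in event or "page" not in event:
        return ("missing user_id/page", 400)
    future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    future.result()  # block until Pub/Sub acknowledges the message
    return ("ok", 204)
```

The web and mobile applications would then POST each click event to the function's HTTPS endpoint, and Dataflow subscribes to the topic downstream.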
Sales Analytics Platform:
The client wants a portal where each store manager uploads their data file in DCD format. On the backend, the file is converted to CSV and the data is published to Cloud Pub/Sub for further processing. Cloud Dataflow then performs data cleaning and some basic transformation, after which the data is written to BigQuery and to Bigtable (as a cache).
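The backend conversion step could look roughly like the sketch below. The real DCD layout (16 XML files per upload) is client-specific, so the XML element names, bucket, and topic here are purely illustrative.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

from google.cloud import pubsub_v1, storage

# Hypothetical names for illustration only.
PROJECT_ID = "retail-data-lake"
BUCKET = "store-dcd-uploads"
TOPIC_ID = "sales-records"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def xml_to_rows(xml_bytes):
    """Flatten one XML file from the DCD bundle into dict rows (assumed <record> tags)."""
    root = ET.fromstring(xml_bytes)
    for record in root.iter("record"):
        yield {child.tag: child.text for child in record}


def process_upload(blob_name):
    """Download an uploaded XML file, convert it to CSV, and publish each row to Pub/Sub."""
    bucket = storage.Client().bucket(BUCKET)
    rows = list(xml_to_rows(bucket.blob(blob_name).download_as_bytes()))
    if not rows:
        return

    # Keep a CSV copy in the bucket for audit/archival.
    fieldnames = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    bucket.blob(blob_name + ".csv").upload_from_string(buf.getvalue())

    # Publish each row for the Dataflow cleaning/transformation stage.
    for row in rows:
        publisher.publish(topic_path, json.dumps(row).encode("utf-8"))
```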
Technology Stack:
- App Engine
- Cloud Pub/Sub
- Cloud IoT Core
- Cloud Functions
- Cloud Dataflow
- BigQuery
- Bigtable
- Datalab
- Data Studio