KEMBAR78
Data Lake Implementation Improved Processing Time by 4X | PDF | Amazon Web Services | Cloud Computing
0% found this document useful (0 votes)
117 views5 pages

Data Lake Implementation Improved Processing Time by 4X

The customer, one of India's largest media companies, implemented a data lake on AWS to improve processing time for generating reports from raw data sources. This reduced report generation from 6-12 minutes to just 3 minutes, allowing the customer to make faster business decisions. The solution migrated the on-premise infrastructure to AWS and used services like S3, Redshift, and Tableau to extract, transform, load, and visualize the data in the new data lake.

Uploaded by

purum
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
117 views5 pages

Data Lake Implementation Improved Processing Time by 4X

The customer, one of India's largest media companies, implemented a data lake on AWS to improve processing time for generating reports from raw data sources. This reduced report generation from 6-12 minutes to just 3 minutes, allowing the customer to make faster business decisions. The solution migrated the on-premise infrastructure to AWS and used services like S3, Redshift, and Tableau to extract, transform, load, and visualize the data in the new data lake.

Uploaded by

purum
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Data lake implementation improved processing time by 4X

Customer: India’s largest media company

Summary
The customer is one of India’s leading media and entertainment companies. They
were looking to improve their ad placements across channels for improved
conversion along with taking other parameters like social media feedback into
consideration. With the push towards digital content, shifting the on-premise
infrastructure set up to cloud was necessary to optimize costs and manage high
volumes of data. Powerup was also to support and maintain the deployed AWS
services for them.
About Customer
The customer is one of the substantial News network houses in India with rights to
more than 3,818 movie titles, it entertains over 1 billion viewers across 172+
countries offering 80+ channels. The customer has always provided quality
entertainment across the globe and is committed to achieving maximum efficiency
in their ad conversion rates through strategic advertisement placements across
their multiple channels.
Problem Statement
The customer’s current on-premise infrastructure was proving expensive due to the
volume of data being generated and the shift to the cloud was the need of the hour.
Softwares provided TRP information on a weekly basis whereas some reports
needed to be generated every 6 to 12 minutes from the source to destination. All
the players in the business generated these reports to aid critical
decisions and the customer, with some major failure in its existing scheduled-based
processes were facing further delays. The time is taken to generate reports while
making changes in promos and ad placements were proving to be highly critical.
The customer is looking at a fully managed and scalable Infrastructure setup and
configuration on AWS. We proposed to create a data lake on AWS with all the source
data getting pushed to a commonplace. All data warehouse objects are to be
created as in the existing system. Migrate existing SQL Server Integration
Services (SSIS) extract, transform, load (ETL) jobs on Talend for the new data to start
moving to data warehouse along with the movement of Tableau dashboards on
AWS and point to Redshift, Redshift spectrum.

Proposed Solution
A plan was drafted to make the shift from a tight-knit synchronous architecture to
an event-based loosely coupled asynchronous architecture in order to ensure
accurate and on-time report generation as per the user’s requirement.
Their entire process transformation involved a cloud-first approach where the client
gathered on-premise data from multiple sources like SAP, Chrome feeds, Twitter
feeds, social media feedback in excel files and then piped it to cloud. This gave birth
to data warehouses where data got extracted and moved from physical to digital
form.
AWS landing zone to be setup with the following – organization account, centralized
logging, shared services, security and production accounts. The shared service
account used to deploy common applications like Bastion and Tableau server
whereas the security account is created only for audit purposes.
Appropriate users, groups and permissions created using Identity and Access
Management (IAM) service to access different AWS service along with Multi-factor
authentication (MFA) activation. The network is setup using Virtual Private Cloud
(VPC) with appropriate Classless inter-domain routing (CIDR) range, subnets and
route table creation.
VPN tunnel is setup between AWS and customer location. One-time data transfer
will be done directly to Amazon Simple Storage Service (S3) after which, the backup
file on S3 will be restored in Amazon Relational Database Service (RDS) and then the
same will be moved to Redshift entirely. Ultimately S3 will have the entire dump,
which will be deployed on Amazon Elastic Compute Cloud (EC2) on AWS.
Once collated as one single repository, the data could be easily transformed from
raw to columnar format using Lambda functions that can then be smoothly pushed
and visualized on Tableau.
An extract, transform, load (ETL) tool like Talend can be used to transfer incremental
data to S3. An SSH File Transfer Protocol (SFTP) server can be used to upload the
excel files to S3. Alternatively, Talend can be used to extract the data from excel files
and load it into S3. Active Directory Federation Services (ADFS) is configured to
provide federated access to Tableau server as on-premise AD has employees as well
as third party vendors added. Glue Crawlers will run periodically and scan the S3
data lake to automatically populate structured as well as unstructured data in S3,
which in turn can be connected with Amazon Redshift and all other data warehouses
being used.
CloudWatch service will be used for monitoring and Amazon Simple Notification
Service (SNS) will be used to notify the users in case of alarms or metrics crossing
thresholds. All snapshot backups will be regularly taken and automated based on
best practices.
Security and Logging
It was ensured that the system also had a built-in centralized log system which, kept
a check on various parameters like time taken for each process, success/failure of a
process, the reason for the failure of the process and so on. Data will be secured
and security groups will be used to control traffic at the Virtual Machine (VM) level.
Network Access Control Lists (NACLs) are used to control traffic at the subnet level
and VPC flow logs will be enabled to capture the entire network traffic. CloudTrail
will be enabled to capture all the Application Program Interface (API) activities. All
the logs will be sent to AWS Guard Duty for threat detection and identifying
malicious activities in the account and AWS Config will be enabled.
To ease the process, even an EC2 auto-recovery feature is enabled to address
failures, if any, so that data is not lost.
The solution was designed in a modular manner keeping in mind the possibility of
the addition of new channels and scalability in future, where components could be
added or removed without any code changes.

Application Support
Our support involved components that were developed or modified as a part of the
project implementation process.
o Broadcast Audience Research Council (BARC) sequences related to ETL
(extract, transform, load) pipeline functionality support.
o Data lake support on S3.
o For Amazon Redshift data warehouse, support on data issues on the stored
procedures migrated as a part of the project.
o Tableau Dashboard support on data links to Redshift.
The customer was to submit requests, prioritize defects and provide technical
support to Powerup before planned releases. Based on these inputs, we were able
to maintain a backlog of defects reported by client stakeholders. Powerup cloud
resources were to plan and schedule each release jointly with the customer’s team.
Conduct unit tests and provides support acceptance tests by the customer while
fixing business-critical issues if any. Deploy releases to production and warranty
support for any defects found in production releases. Conduct weekly project status
meetings with client stakeholders to review work progress, planned activities, risks
and issues, dependencies and action items, if any.
Opportunities: Post the shift to cloud, the customer was able to derive Sentiment
Analysis through TRP. These TRPs were based on Social Media data on new as well
as existing shows. It also paved the way for them to conduct GAP Analysis in order
to understand and compare the current Infrastructure and process improvements
with potential or expected performance, which helped enhance their efficiency.

Cloud platform
AWS.

Technologies used
Tableau, Redshift, DMS, Glue, Athena.

Business Benefits
o The customer enjoyed a fully managed and scalable Infrastructure set up and
configuration on AWS.
o 120 dashboards created and data processing time reduced from 2-hr to 30-
min.
o The immediate business impact recorded was the modular solution resulting
in the management being able to take improved and timely business-critical
decisions.
o Migration to cloud enabled a swift generation of critical reports in an end-to-
end time span of 3-min from 6-12min which improved the decision-making
capability of business leaders to a significant extent. Going forward, it is
anticipated that TRP is also to increase further due to this digital shift.

You might also like