Rakesh

Email : Rakeshdataengineer07@gmail.com
Phone : 9099299787

PROFESSIONAL SUMMARY:

 Over 10 years of experience delivering Data Integration, Data Migration, Data Modeling, Data Warehouse, and Business Intelligence solutions, with comprehensive hands-on expertise in data warehouse architecture and ETL and BI report development and testing.
 Expertise in the Enterprise Information Management areas including Data Governance, Business
Intelligence, Enterprise Data Warehousing, Analytics, Data Architecture, Data Quality, and
Metadata.
 Led the design, development, and implementation of Data Warehouse/Business Intelligence
Systems at different organizations including industry experience in Financial Services, Retail &
Manufacturing.
 Excellent skills in data warehousing, relational and multidimensional databases, data migration, system and data integration, reporting, analysis, software development, and data management.
 Extensive experience in analysis, design, development, implementation, and support of ETL and BI applications.
 Experience in ingestion, storage, querying, processing, and analysis of Big Data using Big Data ecosystem technologies such as Hadoop HDFS, MapReduce, Apache Pig, Hive, Scala, Sqoop, HBase, Flume, Oozie, Spark, Cassandra, Kafka, and Zookeeper.
 Experienced in creating data pipelines that integrate Kafka with Spark Streaming applications using Scala (a brief sketch follows this summary).
 Extensive experience in data migration, transformation, and modeling, with a focus on optimizing processes and implementing best practices. Demonstrated mastery in managing and optimizing data processes using a suite of tools within the Microsoft Azure ecosystem, including Azure Databricks, Apache Spark, PySpark, Python, R, Scala, SQL, Azure Synapse, Azure Data Factory, Power BI, Power Query, Snowflake, and Microsoft Fabric.
 In-depth knowledge of SQL including Data Query Language (DQL), Data Manipulation
Language (DML), and Data Definition Language (DDL).
 Comprehensive understanding of the data landscape across various industries, acquired through
diverse experience spanning sectors such as Telecom, Insurance, and Retail.

 Proven expertise in designing, developing, and maintaining scalable data pipelines, architectures,
and data solutions.
 Experience across the AWS stack (Glue, EC2, VPC, ELB, S3, EBS, QuickSight, RDS, Route 53 DNS, CloudWatch, CloudFormation, Auto Scaling, Lambda, Elastic Beanstalk, Airflow), plus Python, Scala, SQL, and PySpark, across the full SDLC from prototyping to deployment.
 In-depth knowledge of Snowflake database, schema, and table structures.
 Worked on SnowSQL and Snowpipe, and redesigned views in Snowflake to improve performance.

 Adept at working with cross-functional teams to deliver high-quality data products. Proficient in
Azure services, data warehousing, and ETL processes.
 Passionate about leveraging data to drive innovation and deliver business insights.
 Developed ETL jobs to load data from one source to another using Hadoop technologies such as Scala, PySpark, and Sqoop, landing data in a staging database, applying data profiling and business rules in a compute layer, and exposing the results for external teams to consume.
 Experienced in loading streaming data into HDFS using the Kafka messaging system.
 Understanding of analysis, design, and development of applications using the Microsoft Azure technology stack, primarily Azure Data Factory and Azure Databricks.
 Development-level experience in Microsoft Azure, Python, Azure Data Factory, Databricks notebooks, and the Azure Data Lake Storage file system.
 Experience in building ETL pipelines in Azure Databricks leveraging PySpark and Spark SQL.
 Big Data Engineer with experience in enterprise-level data warehouse tools: creating tables, distributing data through static and dynamic partitioning and bucketing, and writing and optimizing HiveQL queries to analyze data.
 Understanding of Snowflake cloud technology.
 Hands-on experience installing, configuring, and troubleshooting Hadoop ecosystem components such as MapReduce, HDFS, Hive, Pig, Sqoop, Spark, Flume, Zookeeper, Kafka, and Impala.
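
Illustrative of the Kafka-to-Spark streaming pattern referenced above, a minimal PySpark Structured Streaming sketch (a Python analogue of the Scala pipelines; the broker address, topic, schema, and paths are placeholder assumptions, and the spark-sql-kafka connector is assumed to be on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema for the JSON payloads on the topic.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read raw events from Kafka; the value column arrives as bytes.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "events")                      # placeholder topic
       .load())

# Parse the JSON payload into columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land the parsed events in a staging location as Parquet, with checkpointing.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "/staging/events")                # placeholder path
         .option("checkpointLocation", "/staging/events_chk")
         .start())
query.awaitTermination()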

TECHNICAL SKILLS:

Hadoop/Big Data         Apache Hadoop, MapReduce, Pig, Hive, Sqoop, Oozie, Flume,
                        Zookeeper, Impala, Spark, Ambari, Kafka, YARN, HDFS, Talend,
                        Ranger, Hortonworks and Cloudera distributions
Cloud Services          AWS, Snowflake, CloudWatch, S3, Athena, Glue, Amazon Redshift,
                        AWS CDK, AWS Lambda, EMR, EC2, Airflow, Azure Data Factory,
                        Azure Data Lake
NoSQL Databases         Apache HBase, MongoDB, Cassandra
RDBMS                   Oracle, MySQL, SQL Server, Teradata, DB2
Languages               C, C++, Java, Scala, PL/SQL, Transact-SQL, Python
Scripting Languages     Unix, Perl, JavaScript, Linux Bash shell scripting
Operating Systems       Windows 8/7/Vista, Red Hat, Ubuntu
API                     Apigee, MuleSoft (API integration and management)
Other Tools             PuTTY, WinSCP, Toad, Maven, Autosys, Jenkins, GitHub
Methodologies           Agile, SCRUM, Waterfall, Lean, Kanban
Cloud Platforms         AWS, Azure

WORK EXPERIENCE:

Role: Data Engineer
Geico, Dallas, TX                                                  Oct 2022 – Present

 Utilized Databricks notebooks to read data from the staging layer and applied business logic using PySpark, improving data processing efficiency and accuracy.
 Directed and collaborated with business, operational, and technical stakeholders on implementing the data governance strategy, using information governance, data quality, information architecture, and information asset management capabilities to support business operations, strategic goals, and objectives.

 Developed data ingestion pipelines on AWS EMR and leveraged PySpark for data processing.
 Worked with NoSQL databases, including DynamoDB, to design schemas for high-throughput
applications.
 Conducted health monitoring and implemented error recovery strategies for distributed systems.
 Ensured enterprise-grade security and compliance through integration of secure protocols and
frameworks.
 Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
 Cloud development and automation using Python, AWS Lambda, and the AWS CDK (Cloud Development Kit).
 Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into Amazon Redshift (see the sketch after this section).
 Performed data extraction, aggregation, and consolidation within AWS Glue using PySpark.
 Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes into schema RDDs (DataFrames).
 Good exposure to diagnosing and resolving performance issues in Snowflake using virtual warehouses, caching, query optimization, and search optimization techniques.
 Develop, maintain, and support ETL processes for loading data from multiple data sources into a
Redshift data warehouse.
 Developed advanced PL/SQL Packages, Procedures, Indexes and collections to implement
business logic.
 Developed complex database objects such as stored procedures, packages, and triggers using SQL and PL/SQL.
 Effectively made use of table functions, indexes, table partitioning, collections, materialized views, etc. using Oracle SQL, PL/SQL, and SQL Server.
 Set up full CI/CD pipelines and repositories to make use of the CI/CD environment.
 Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing.
 Responsible for Continuous Integration and Continuous Delivery process implementation using
Jenkins along with Python and Shell scripts to automate routine jobs.
 Hands-on experience with SnowSQL, Snowpark, and Streamlit programming in Snowflake.
 Good knowledge of data storage, query processing, and cloud services architecture in Snowflake.
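
A minimal PySpark sketch of the S3-to-Redshift load pattern described in this section, using plain Spark reads and a JDBC write rather than Glue DynamicFrames; the bucket, table, and connection details are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

# Read campaign files landed in S3 as Parquet (placeholder bucket/prefix).
campaigns = spark.read.parquet("s3://example-bucket/campaigns/")

# Example aggregation/consolidation step before loading.
daily = (campaigns
         .groupBy("campaign_id", F.to_date("event_ts").alias("event_date"))
         .agg(F.count("*").alias("events")))

# Write the consolidated result to Redshift over JDBC (placeholder connection;
# the Redshift JDBC driver must be on the classpath).
(daily.write
 .format("jdbc")
 .option("url", "jdbc:redshift://example-cluster:5439/analytics")
 .option("dbtable", "public.campaign_daily")
 .option("user", "etl_user")        # placeholder credentials
 .option("password", "***")
 .option("driver", "com.amazon.redshift.jdbc42.Driver")
 .mode("append")
 .save())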

Role: Data Engineer
Randstad Technologies, Charlotte, NC                       August 2021 – October 2022
 Analyzed and troubleshot JIRA tickets, followed up with stakeholders, and provided resolutions on a priority basis.
 Managed streaming data effectively while actively participating in the data modeling process,
ensuring data integrity and modelling accuracy.
 Enhanced report performance and accuracy by implementing efficient DAX calculations and
measures, improving decision-making capabilities for stakeholders.
 Implemented CI/CD pipelines for automated testing of Spark applications.
 Wrote data into Delta tables to simplify testing.
 Collaborated with architects and senior engineers to ensure high code quality and establish standard practices for running the code.

 Designed and implemented data pipelines using AWS Glue, S3, and DynamoDB to process and
store large datasets.
 Developed ETL processes for Big Data environments, leveraging Apache Spark and Hive for
efficient data transformation.
 Built scalable workflow orchestration using Apache Airflow to automate complex data workflows (see the DAG sketch after this section).
 Wrote and optimized complex SQL queries for Snowflake to support business intelligence and
reporting needs.
 Ensured system integration by collaborating across teams and implementing distributed software
solutions.
 Utilized Git for version control, including branching, merging, and maintaining codebase
integrity.
 Managed Databricks clusters and resources, monitoring performance and scaling as needed for workload demands.
 Designed and implemented scalable ETL pipelines using AWS Glue and Lambda to process and
transform large datasets.
 Supported the design and implementation of ETL pipelines using SQL Server and Azure Data
Factory to ensure accurate data flow between various systems.
 Worked with the team to implement data warehouses and integrated them with business intelligence
tools like Power BI and Tableau for generating actionable insights.
 Assisted in the integration of Azure Databricks with various data sources for efficient batch and real-
time data processing.
 Cloud Expertise: Experienced in leveraging AWS cloud services such as EC2, MSK, S3, RDS,
SNS, and SQS for scalable and efficient solutions.
 Big Data Ecosystems: Knowledgeable in big data tools and frameworks such as Hadoop, Apache
Spark, and Kafka for handling large-scale data processing.
 Software Engineering Practices: Adept at using modern tools and methodologies, including
GitHub, VSCode, and CI/CD pipelines.
 Managed data integration projects using ETL tools and frameworks, resulting in improved data
quality and accessibility for analytics.
 Developed and maintained data pipelines that ingested data from various sources into Hadoop and
SQL Server, enabling a centralized data repository.
 Established and enforced data standards, policies, and procedures to ensure organizational data
quality, integrity, and security.
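
A minimal sketch of an Airflow DAG of the kind used to schedule the ETL batches described above; the DAG id, schedule, and task callables are placeholder assumptions:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull source data (placeholder step).
    print("extracting")

def transform():
    # Apply business rules (placeholder step).
    print("transforming")

def load():
    # Load into the warehouse (placeholder step).
    print("loading")

with DAG(
    dag_id="etl_batch_sketch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the tasks in sequence: extract -> transform -> load.
    t_extract >> t_transform >> t_load
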
Role: Data Engineer
Bank of America, Charlotte, NC                             February 2021 – August 2021

Responsibilities:
 Hands-on experience handling Hive tables using Spark SQL.
 Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from various streaming sources such as Kafka and JMS.
 Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
 Implemented the Hadoop platform using Cloudera as a distribution platform, creating HIVE
tables atop datasets in the staging layer and optimizing performance through various Hive
performance techniques. Created Impala tables as target tables for data loading into the
integration layer.
 Leveraged Cloudera Stream Sets for efficient transfer of raw data files into the HDFS foundation
layer, ensuring seamless data transfer and storage. Employed Parquet file format in all target
tables within Impala to enhance performance and optimize storage efficiency.
 Used PySpark and Hive to analyze sensor data and cluster users based on their behavior during events.
 Developed data pipelines using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data into HDFS.
 Utilized Spark Core and Spark SQL for data transformations, building DataFrames and Resilient Distributed Datasets (RDDs) to streamline data processing and analysis within the Hadoop environment (see the sketch after this section). Worked in a collaborative team environment to ensure the successful implementation and optimization of Big Data solutions.
 Developed ingestion pipelines using Stream Sets to efficiently ingest client data files into the
Hadoop Distributed File System (HDFS), managing a massive volume of 3.5 petabytes of data.
Utilized Spark for data processing, transforming the ingested data in the processing phase before
loading it into the Hive warehouse for use by the Data Science team.
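
A minimal PySpark sketch of the staging-to-integration flow described above: querying a Hive staging table with Spark SQL and writing a partitioned, Parquet-backed target table (database, table, and column names are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-spark-sql-sketch")
         .enableHiveSupport()          # required to read/write Hive tables
         .getOrCreate())

# Query the staging-layer Hive table with Spark SQL (placeholder names).
events = spark.sql("""
    SELECT customer_id, event_type, event_date
    FROM staging.raw_events
    WHERE event_date >= '2021-01-01'
""")

# Write to the integration layer as a partitioned, Parquet-backed table.
(events.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("event_date")
 .saveAsTable("integration.events"))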

Role: Hadoop Developer
Cerner, Kansas City, MO                                  February 2020 – December 2020

 Created and deployed efficient ETL processes to transform and load data into a centralized data
warehouse, enhancing data accessibility for analytics.
 Designed and maintained Power BI dashboards that delivered actionable insights, resulting in a
25% boost in operational efficiency.
 Partnered with data analysts and business stakeholders to gather requirements and refine data
models for effective reporting and analysis.
 Maintained and optimized ETL pipelines, ensuring seamless data ingestion from cloud and on-premise sources to BigQuery and Cosmos DB.
 Integrated MuleSoft APIs to streamline data exchange between engineering and business systems.
 Engineered scalable data applications using Spark and Scala, optimizing the processing of large
datasets for improved performance.
 Designed and executed ETL workflows for both structured and unstructured data, ensuring
integrity and consistency across business operations.
 Integrated Databricks with on-premise databases, ensuring seamless data flow and availability for reporting and analysis using Power BI and other business intelligence tools (see the sketch after this section).
 Coordinated with cross-functional teams to design data platforms that supported enterprise-wide
data-driven solutions, improving reporting efficiency by 40%.
 Designed and implemented data ingestion frameworks in Azure Data Factory, automating ETL
processes and improving data consistency across the platform.
 Optimized Databricks clusters for performance tuning, reduced processing times by 20%, and
ensured the platform met the performance needs of the organization’s data analysts and data
scientists.
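
A minimal sketch, assuming a Databricks runtime with Delta Lake, of pulling an on-premise SQL Server table over JDBC and landing it as a Delta table for Power BI reporting; the host, database, table, and credentials are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-to-delta-sketch").getOrCreate()

# Read an on-premise SQL Server table over JDBC (placeholder connection;
# the SQL Server JDBC driver must be available on the cluster).
orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=sales")
          .option("dbtable", "dbo.orders")
          .option("user", "etl_user")          # placeholder credentials
          .option("password", "***")
          .load())

# Land the data as a managed Delta table that Power BI can query.
(orders.write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("analytics.orders"))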

Role: Application Architect IV
Bank of America, Charlotte, NC                                July 2018 – December 2019

Responsibilities:
 Worked on creating secondary indexes in HBase to join tables.
 Wrote Spark SQL queries for data analysis to meet business requirements and wrote HQL scripts to extract, transform, and load the data into the RDF and Helix databases.
 Generated client reports and compared them while building and deploying the code in Hive.
 Wrote a Scala API, called Helix, to compare data between the PAX database and the Cesium UI and verify that the two datasets match without errors.
 Developed validation code for Hive data quality rules using Spark/Scala.
 Extensively worked on the Hive database to create tables, develop data validation scripts, and automate the process.
 Developed and maintained data integration workflows within Informatica, harnessing its robust
mapping and transformation capabilities to create seamless data movement across systems,
ensuring high data quality and consistency throughout the ETL pipeline.
 Leveraged Informatica's advanced mapping and transformation tools to craft intricate data
integration workflows, facilitating seamless data movement and ensuring high data quality
throughout the ETL pipeline.
 Used JIRA for bug tracking and GitHub and Bitbucket for version control; worked under Agile methodology.

Role: Scala/Sr. Hadoop Developer
Client: Apptium Technologies, St. Louis, MO                    April 2016 – August 2018

Responsibilities:
 Involved in all phases of Software Development Life Cycle (SDLC) activities such as
development, implementation and support for Hadoop
 Imported and exported data between relational database systems such as MySQL and Oracle and HDFS/Hive using Sqoop.
 Used Sqoop and mongodump to move data between MongoDB and HDFS.
 Developed data pipelines using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze data.
 Migrated complex MapReduce programs into Apache Spark RDD transformations (see the sketch after this section).
 Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
 Worked on migrating the old Java stack to a Typesafe stack, using PySpark for back-end programming.
 Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data and
handled Data Skewness in Spark-SQL.
 Worked extensively with Sqoop, importing data from MySQL and storing it in HDFS.
 Performed advanced procedures such as text analytics and processing using Spark's in-memory computing with Scala.
 Expertise in the design and development of various web and enterprise applications using typesafe technologies such as Scala.
 Experienced in using Scala and Java tools such as IntelliJ IDEA and Eclipse.
 Used Talend for connecting, cleansing, and sharing cloud and on-premises data.
 Developed Spark, Pig and Hive Jobs to summarize and transform data
 Designed and developed Pig and Hive UDF's for Data enrichments
 Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
 Worked with streaming data using Kafka and Spark Streaming for data preparation.
 Used HBase to create snapshot tables for updating and deleting records.
 Worked with AWS to migrate entire data centers to the cloud using VPC, EC2, S3, EMR, RDS, Splice Machine, and DynamoDB services.
 Developed entire Spark applications in Python (PySpark) in a distributed environment.
 Migrated tables from SQL Server to HBase, which are still in active use.
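
A minimal PySpark sketch of the MapReduce-to-RDD migration pattern mentioned above, expressing a word-count style aggregation as RDD transformations; the HDFS paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-migration-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/")       # placeholder HDFS path

counts = (lines
          .flatMap(lambda line: line.split())    # map phase: emit tokens
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))      # reduce phase: sum per key

counts.saveAsTextFile("hdfs:///data/output/")    # placeholder output path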

Environment: Hadoop-Cloudera, HDFS, Pig, Hive, Sqoop, Kafka, Zookeeper, Spark, Python, Talend,
HBase, Scala, Shell Scripting, Maven, MapReduce, Amazon EMR, EC2, S3

Role: Scala/Hadoop Developer
Client: HSBC, Buffalo, NY                                       May 2015 – March 2016
Responsibilities:

 Imported data using Sqoop to load data from MySQL into HDFS on a regular basis
 Involved in collecting and aggregating large amounts of log data using Apache Flume and staging
data in HDFS for further analysis
 Collected and aggregated large amounts of web log data from different sources such as web
servers, mobile and network devices using Apache Kafka and stored the data into HDFS for
analysis
 Designed and Developed Talend Jobs to extract data from Oracle into MongoDB
 Developed multiple Kafka producers and consumers from scratch to implement the organization's requirements (see the sketch after this section)
 Responsible for creating and modifying topics (Kafka queues) as required, with varying configurations for replication factor, partitions, and TTL
 Wrote and tested complex MapReduce jobs for aggregating identified and validated data
 Created Managed and External Hive tables with static/dynamic partitioning
 Written Hive queries for data analysis to meet the Business requirements
 Increased HiveQL performance by splitting larger queries into smaller ones and introducing temporary tables between them
 Used Spark SQL to read data from external sources and processed the data using the PySpark computation framework
 Developed Spark scripts using Scala as per requirements
 Created extensive SQL queries to test the data against various databases
 Developed equivalent PySpark code for existing SAS code to extract summary insights from the Hive tables
 Designed and executed Spark SQL queries on Hive data in the Spark context, ensuring performance optimization
 Integrated Amazon Redshift with Spark using PySpark
 Used an open-source web scraping framework for Python to crawl and extract data from web pages
 Optimized Hive queries by setting different combinations of Hive parameters
 Developed UDFs (User Defined Functions) to extend the core functionality of Pig and Hive queries as required
 Extensive experience in writing Pig scripts to transform raw data from several data sources into baseline data
 Implemented workflows using Oozie for running MapReduce jobs and Hive queries
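
A minimal sketch of a Kafka producer/consumer pair of the kind described above, written here with the kafka-python client as one possible implementation; the broker, topic, and group id are placeholders:

from kafka import KafkaProducer, KafkaConsumer

# Produce a single placeholder event to the topic.
producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("weblogs", value=b'{"page": "/home", "user": "u1"}')
producer.flush()

# Consume from the same topic, starting from the earliest available offset.
consumer = KafkaConsumer(
    "weblogs",
    bootstrap_servers="broker:9092",
    auto_offset_reset="earliest",
    group_id="weblog-loader",
)
for message in consumer:
    # Each message value is raw bytes; downstream code would parse and stage it.
    print(message.value)
    break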

Environment: Hadoop, HDFS, MapReduce, Hive, Sqoop, Talend, Apache Kafka, Zookeeper, Spark,
HBase, Python, Shell Scripting, Oozie, Maven, Cloudera.

Role: Hadoop/Spark Developer
Client: Apps Associates, Hyderabad, India                          May 2014 – May 2015
Responsibilities:

 Responsible for building scalable distributed data solutions using Hadoop


 Worked comprehensively with Apache Sqoop and developed Sqoop scripts to interface data from
a MySQL database into the Hadoop Distributed File System (HDFS)
 Utilize parallel processes of the Hadoop Framework to ensure resource efficiency
 Created Managed tables and External tables in Hive and loaded data from HDFS.
 Used the Scala collections framework to store and process complex consumer information; based on the offers set up for each client, requests were post-processed and matched to offers.
 Used Slick to query and store data in the database in an idiomatic Scala fashion using the Scala collections framework.
 Strong core Scala, including experience with collections, type variance, implicit parameters, and implicit conversions.
 Functional programming experience in Scala, including higher-order functions, partial functions, partial application, and nested functions.
 Professional experience coding in Scala in a production environment.
 Worked on debugging and performance tuning of Hive and Pig jobs.
 Experienced in creating data pipelines integrating Kafka with Spark Streaming applications, using Scala to write the applications.
 Used the Python subprocess module to run UNIX shell commands.
 Used Spark SQL to read data from external sources and processed the data using the Scala computation framework.
 Extracted data from Agent Nodes into HDFS using Python scripts.
 Wrote shell scripts to extract data and perform analytics on top of it.
 Implemented Hive generic UDFs to incorporate business logic into Hive queries.
 Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, most visited pages, etc. (see the sketch after this section).
 Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs (such
as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and
shell scripts)
 Used Pig and Hive on HCatalog tables to analyze the data and create schemas for the HBase tables in Hive.
 Responsible for managing and scheduling jobs on the Hadoop cluster.
 Coordinated with the BI team to visualize the transformed data in a dashboard using Tableau.
 Assisted in creating and maintaining technical documentation for launching Hadoop clusters and executing Hive queries and Pig scripts.
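
A minimal PySpark sketch of the kind of web-log analysis described above, computing unique visitors and page views per day from a Hive table; the database, table, and column names are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weblog-analysis-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Daily unique visitors and page views from a placeholder web-log table.
daily_stats = spark.sql("""
    SELECT
        to_date(event_time)          AS log_date,
        COUNT(DISTINCT visitor_id)   AS unique_visitors,
        COUNT(*)                     AS page_views
    FROM weblogs.page_hits
    GROUP BY to_date(event_time)
    ORDER BY log_date
""")

daily_stats.show()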

Environment: Hadoop, HDFS, MapReduce, Hive, Sqoop, Zookeeper, HBase, Python, Shell Scripting,
Oozie, Maven, Cloudera, Tableau
