FARHAN
Email: shfarhanm@gmail.com PH: 9803496761
Data Engineer
Professional Summary
• 9+ years of experience in designing and developing data driven solutions – data warehousing, Business Intelligence,
analytics, data ingestion - extraction, transformation and loading of data from Transactional databases (OLTP) to Data
Warehousing Systems (OLAP)
• Analyzing and understanding source systems and business requirements to Design the Enterprise Data warehousing
and Business Intelligence Solutions, DataMart and Operational Data Store
• Experience in designing & developing applications using Big Data technologies HDFS, MapReduce, Sqoop, Hive,
PySpark & Spark SQL, HBase, Python, Snowflake, S3 storage, Airflow.
• Experience in job workflow scheduling and monitoring tools like Airflow and Autosys.
• Experienced in Designing, Developing, Documenting, Testing ETL jobs and mappings in Server and Parallel jobs
using DataStage to populate tables in Data Warehouse and Data marts.
• Worked in Production support team for maintaining the mappings, sessions and workflows to load the data in Data
Warehouse.
• Hands-on experience in using Hadoop ecosystem components like Hadoop, Hive, Pig, Sqoop, HBase, Cassandra,
Spark, Spark Streaming, Spark SQL, Oozie, Zookeeper, Kafka, Flume, MapReduce framework, Yarn, Scala, and Hue.
• Extensively worked on AWS services like EC2, S3, EMR, SageMaker, Athena, Glue Data Catalog,
RDS (including Aurora), Redshift, DynamoDB, ElastiCache (Memcached & Redis), QuickSight, and other services of
the AWS family.
• Worked closely with the Enterprise Data Warehouse team and Business Intelligence Architecture team to understand
repository objects that support the business requirement and process.
• Extensive knowledge in working with Azure cloud platforms (HDInsight, Data Lake, Databricks, Blob Storage, Data
Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
• Extensive experience in working with NoSQL databases and their integration, including DynamoDB, Cosmos DB, MongoDB,
Cassandra, and HBase.
• Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data to Azure Data
Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
• Work with cross functional teams planning, modeling, and implementing solutions utilizing NoSQL technologies.
• Expertise in transforming business requirements into analytical models, designing algorithms, building models,
developing Data mining, Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine
Learning Algorithms, Validation and Visualization, and reporting solutions that scale across a massive volume of
structured and unstructured data.
• Excellent knowledge about the architecture and components of Spark, and efficient in working with Spark Core,
Spark SQL, Spark Streaming, and expertise in building PySpark and Spark-Scala applications for interactive analysis,
batch processing, and stream processing.
• Extensively used the Spark DataFrame API over Cloudera platform to perform analytics on Hive data and used Spark
DataFrame operations to perform required validations on the data.
• Proficient in Python scripting and developed various internal packages to process big data.
• Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
• Strong experience in Data Analysis, Data Profiling, Data Cleansing & Quality, Data Migration, Data Integration
• Thorough knowledge in all phases of the Software Development Life Cycle (SDLC) with expertise in methodologies
like Waterfall and Agile.
• Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in
production.
• Worked with healthcare claims data and wrote SQL queries to organize the data and sorted, summarized, and reported
salient changes within the datasets.
• Evaluated and established the validity of incoming claims data and data element combinations to ensure accuracy and
completeness of all reporting results.
• Collaborated with data and product staff on various aspects of incoming healthcare claims data.
• Experienced in change implementation, monitoring and troubleshooting of AWS Snowflake databases and cluster
related issues.
• Good understanding of Data Modeling techniques, Normalization and Data Warehouse concepts using Star schema
and Snowflake schema modeling.
• Well versed with Snowflake features like clustering, time travel, cloning, logical data warehouse, caching etc.
• Developed Talend jobs to populate the claims data to data warehouse - star schema, snowflake schema, Hybrid
Schema.
• Strong knowledge in Logical and Physical Data Model design using ERWIN
• Good skills in Python programming
• Experienced on Big Data Hadoop Ecosystem components, HDFS, Apache Spark
• Strong knowledge of the Snowflake database; worked on reading data from semi-structured sources (XML files)
• Proficient in Talend Cloud Real Time Big Data Platform, Informatica Power Center, Informatica Power Exchange,
Informatica B2B Data Transformation, Oracle SQL and PL/SQL, Snowflake, Unix Shell Scripting
• Has experience with Informatica Power Center in all phases of Data Analysis, Design, Development, Implementation
and production support of Data Warehousing applications using Informatica Power Center 10.x/9.x/8.x, SQL,
PL/SQL, Oracle, DB2, Unix, PowerShell
• Worked on designing and developing ETL solutions for complex data ingestion requirements using Talend Cloud
Real Time Big Data Platform, Informatica Power Center, Informatica Intelligent Cloud Services, Python, PySpark and
implemented data streaming using Informatica Power Exchange.
• Developed PySpark programs and created the data frames and worked on transformations.
• Performed root cause analysis on slow-running solutions (ETL Jobs, reporting jobs, SQL Queries, Stored
Procedures and Views) and improved the solutions for better performance
• Collaborated with different teams like source vendors, data governance, and business teams on data quality issues, as
well as architecture or structure of data repositories
• Assisting in root cause analysis, investigating any data errors or anomalies and assisting in the implementation of
solutions & Framework to correct data problems and establishing and publishing KPIs and SLAs
• Created SOX Control, DQR Framework to monitor data quality issues and provide support for internal audit and
external SOX compliance audits
• Worked with Senior Management, Business users, Analytical Team, PMO and Business Analyst team on
Requirement discussion and Data integration strategy planning
• Research new technologies while keeping up to date with technological developments in relevant areas of ETL, Cloud
Solutions, Big Data, Analytics & Relational/NoSQL DBs
Professional Experience
Data Engineer
Morgan Stanley, New York August 2022 to Present
Responsibilities:
• Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify
workable items for further development.
• Participated in all phases of development life-cycle with extensive involvement in the definition and design meetings,
functional and technical walkthroughs.
• Created Talend jobs to copy the files from one server to another and utilized Talend FTP components
• Created and managed Source to Target mapping documents for all Facts and Dimension tables
• Used ETL methodologies and best practices to create Talend ETL jobs. Followed and enhanced programming and
naming standards.
• Created and deployed physical objects including custom tables, custom views, stored procedures, and Indexes to SQL
Server for Staging and Data-Mart environment.
• Designed and implemented ETL for data load from heterogeneous sources to SQL Server and Oracle as target databases
and for Fact and Slowly Changing Dimensions SCD-Type1 and SCD-Type2.
• Utilized Big Data components like tHDFSInput, tHDFSOutput, tPigLoad, tPigFilterRow, tPigFilterColumn,
tPigStoreResult, tHiveLoad, tHiveInput, tHbaseInput, tHbaseOutput, tSqoopImport and tSqoopExport.
• Partnered with ETL developers to ensure that data is well cleaned and the data warehouse is up to date for reporting
purposes, using Pig.
• Selected and generated data into CSV files and stored them into AWS S3 by using AWS EC2 and then structured and
stored in AWS Redshift.
• Performed simple statistical data profiling, such as cancel rate, variance, skewness, and kurtosis of trades, and runs of each
stock every day, grouped by 1-minute, 5-minute, and 15-minute intervals.
• Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded them into the data
warehouse.
• Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark
context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
• Involved in integration of Hadoop cluster with Spark engine to perform batch and GraphX operations.
• Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
• Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount
of trade.
• Boosted the performance of regression models by applying polynomial transformation and feature selection and used
those methods to select stocks.
• Generated report on predictive analytics using Python and Tableau including visualizing model performance and
prediction results.
• Utilized Agile and Scrum methodology for team and project management.
• Used Git for version control with colleagues.
Environment: Spark (PySpark, Spark SQL, Spark MLlib), Talend, Talend Data Integration 6.1/5.5.1, Talend Enterprise Big
Data Edition 5.5.1, Talend Administrator Console, Python 3.x (Scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS
EMR/EC2/S3/Redshift, and Pig.
Data Engineer
Mayo Clinic, Rochester, MN January 2021 to August 2022
Responsibilities:
• Architected analytical data pipeline including but not limited to stakeholders’ interviews, data profiling, and
extraction process designing from diverse sources, and data load optimization strategies.
• Using the Kimball four-step process (Business process definition, Grain declaration, Fact and Dimension identification),
designed Dimensional Data Models for Loan Servicing and Loan Origination with daily transactional facts and
Customer, Account, Loan status, Credit profile etc. slowly changing dimensions.
• Developed ETL using Microsoft toolset (SSIS, TSQL, MS SQL Server) to implement Type 2 Change Data Capture
process for various dimensions.
• After data extraction from AWS S3 buckets and DynamoDB, implemented a daily Python/SQL JSON parsing pipeline using
libraries (pandas, numpy, json, urllib, PyODBC, SQLAlchemy) for Credit Profile data including Experian
Credit Reports (Prequal Credit report, Full Credit Profile, BizAggs, SbcsAggregates, SbcsV1, SbcsV2 and Premier
Profile).
• Created external tables with partitions using Hive, AWS Athena and Redshift.
• Developed the PySpark code for AWS Glue jobs and for EMR.
• Conducted quantitative analyses of raw claims data, Rx claims, and various healthcare data sources
• Implemented and Designed AWS Solutions using EC2, S3, EBS, Elastic Load balancer (ELB), VPC, Amazon RDS,
CloudFormation, Amazon SQS, and other services of the AWS infrastructure.
• Parsed and evaluated “Lending Club ®” historical Small Business Loan JSON data using Python/SQL. The purpose was to
tune the in-house loan ScoreCard model and test different products’ What-if analysis in SSIS.
• Fulfilled Customer Behavior Score model ETL requirements using rolled-up dimensional model data. During the
process used Feature Engineering methodologies (identification, aggregation and processing) based upon feedback
from data scientists and Machine Learning algorithm modelers.
• Created data pipeline for different events in Azure Blob storage into Hive external tables and used various Hive
optimization techniques like partitioning, bucketing, and map joins.
• Worked on Azure Data Factory to integrate data of both on-prem (MySQL, PostgreSQL, Cassandra) and cloud (Blob
Storage, Azure SQL DB) and applied transformations to load back to Azure Synapse.
• Created pipelines in ADF using linked services to extract, transform and load data from multiple sources like Azure
SQL, Blob storage and Azure SQL Data warehouse.
• Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure
Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure
services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
• Primarily involved in Data Migration process using SQL, Azure SQL, SQL Azure DW, Azure storage and Azure Data
Factory (ADF) for Azure Subscribers and Customers.
• Implemented Custom Azure Data Factory (ADF) pipeline Activities and SCOPE scripts.
• Created Spark clusters and configured high concurrency clusters using Azure Databricks to speed up the preparation
of high-quality data.
• Primarily responsible for creating new Azure Subscriptions, data factories, Virtual Machines, SQL Azure Instances,
SQL Azure DW instances, HDInsight clusters and installing DMGs on VMs to connect to on-premises servers.
• Responsible for ingesting data from various source systems (RDBMS, Flat files, Big Data) into Azure (Blob Storage)
using framework model.
• Involved in application design and data architecture using cloud and Big Data solutions on AWS and Microsoft
Azure.
• Led the effort to migrate the legacy system to a Microsoft Azure cloud-based solution, re-designing the legacy
application solutions with minimal changes to run on the cloud platform.
• Worked on building the data pipeline to load data from the legacy SQL Server
to Azure Database using Data Factory, API Gateway Services, SSIS Packages, Talend Jobs, and custom .NET and
Python code.
• Built Azure WebJobs and Functions for Product Management teams to connect to different APIs and sources, extract the data, and
load it into Azure Data Warehouse.
• Built various pipelines to integrate Azure with AWS S3 and bring the data into Azure Database.
• Designed and developed ETL processes in AWS Glue to migrate Campaign data from external sources like S3 and
ORC/Parquet/Text files into AWS Redshift.
• Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
• Interacted directly with business users and the Data Architect on ongoing changes to the Data Warehouse design.
• Involved in Data modeling and design of data warehouse in star schema methodology with conformed and granular
dimensions and FACT tables.
• Identified/documented data sources and transformation rules required to populate and maintain data warehouse
content.
• Developed data ingestion modules using AWS Step Functions, AWS Glue and Python modules
• Implemented Azure Data Factory operations and deployment into Azure for moving data from on-premises into cloud
• Used Spark DataFrames to create various Datasets and applied business transformations and data cleansing operations
using Databricks Notebooks.
• Efficient in writing Python scripts to build ETL pipeline and Directed Acyclic Graph (DAG) workflows using Apache
E, Apache NiFi.
• Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL,
MongoDB) into HDFS.
• Consulted on Snowflake Data Platform solution architecture, design, development, and deployment, focused on
bringing a data-driven culture across the enterprise.
• Developed stored procedures/views in Snowflake and used them in Talend for loading Dimensions and Facts.
• Defined virtual warehouse sizing in Snowflake for different types of workloads.
• Created data warehouses, databases, schemas, and tables, and wrote SQL queries against Snowflake.
• Optimized the PySpark jobs to run on Kubernetes Cluster for faster data processing.
• Optimization of Hive queries using best practices and right parameters and using technologies like Hadoop, YARN,
Python, PySpark.
• Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
• Developed Spark applications in Python and PySpark on a distributed environment to load a huge number of CSV files
with different schemas into Hive ORC tables.
• Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation
from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
• Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming
to perform streaming analytics in Databricks.
• Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure
SQL DB).
• Extracted Tables and exported data from Teradata through Sqoop and placed them in Cassandra.
• Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and
Apache Ranger for authorization.
Environment: Microsoft SQL Server products (SSIS, SSRS), Python Anaconda, Snowflake, Azure, AWS S3, DynamoDB,
PostgreSQL, Azure Data Factory, Azure SQL, Azure Databricks, Azure
Data Engineer
Nationwide, Columbus, OH April 2018 to December 2020
Responsibilities:
• Implemented CARS (Customer Anti-Money Laundering Risk Scoring) and Transaction Monitoring (TM) Model
requirements and played key role in data source requirement analysis, ETL Datastage code development and
deployment.
• Broad understanding of healthcare data like claims, clinical data, quality metrics, and health outcomes.
• Installed and configured Apache Airflow for S3 bucket and Snowflake data warehouse and created DAGs to run in
Airflow.
• Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in
production.
• Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow.
• Presented efficient TM and CARS model enhancement strategies in terms of risk score assignment on various
financial activity and profile triggers-based risk factors by applying preferred feature picking methods using Entropy,
Mutual Information Gain and Decision tree to streamline High Risk Customer alert processing and SAR/CTR filing.
• Played a key role in the design and implementation of Predictive Analytics based enrichments on the CARS and TM models; in the
process used the Bayesian Networks algorithm and coordinated with multifaceted business domains and stakeholders to gain
knowledge regarding classification of Independent and Dependent risk factors from the perspective of High Risk customer
alert stacking for investigators and the Customer Due Diligence (CDD) process.
• Implemented TM and CARS model outlier identification algorithms using PySpark involving feature (Risk Factors)
engineering, StringIndexer, Vectors/VectorAssembler, Linear Regression, Evaluation (RMSE, Feature Correlation
check) to detect members’ unusual behavior, which in effect tunes the CDD and SAM processes through feedback.
• Ingested wide variety of structured, unstructured, and semi structured data into RDBMS (feasibility conditioned
according to architecture) as well as into AWS data ecosystems with batch processing and real-time streaming.
• Worked as Data engineer for members’ cluster and grouping for general activity reporting employing PySpark
classification approaches including LogisticRegression, Decision Tree, Random Forest (feature importance
identification) and also unsupervised K-Means Clustering for pattern matching.
• Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration and familiar with building custom
Airflow operators and orchestration of workflows with dependencies involving multi-clouds.
• Designed Stacks using Amazon Cloud Formation templates to launch AWS Infrastructure and resources. Developed
AWS CloudFormation templates to create custom sized VPC, subnets, EC2 instances, ELB and security groups.
• Worked on creating serverless microservices by integrating AWS Lambda, S3, CloudWatch, and API Gateway.
• Provided ML Data Engineer expertise in Negative News model enhancement with diverse data provided by
LexisNexis and other international vendors. For Data ingestion: AWS (EMR, Kinesis Streams & Firehose, RDS,
DynamoDB), Spark Streaming; For Data Prep: Python Web Scraping, PyPDF2, Spark Natural Language Processing
(Tokenizer, StopWordsRemover, CountVectorizer, Inverse Document Frequency, StringIndexer), AWS Glue (ETL),
IBM DataStage.
• Designed and developed data cleansing, data validation, load processes ETL using Oracle SQL and PL/SQL and
UNIX.
• Ran ETL jobs with Hadoop technologies and tools like Hive, Sqoop and Oozie to extract records from different
databases into HDFS.
• Installed NoSQL MongoDB on physical machines, virtual machines, and AWS.
• Supported and managed NoSQL databases: installed, configured, administered, and supported multiple NoSQL instances,
and performed database maintenance and troubleshooting.
• Experienced in developing web-based applications using Python, Django, QT, C++, XML, CSS, JSON, HTML,
DHTML, JavaScript and JQuery.
• Developed entire frontend and backend modules using Python on Django Web Framework.
• Coordinated with Marketing group in Machine Learning Lab activities for MSA (Member Sentiment Analysis) model
development using Spark Streaming, Python and PySpark NLP data prep techniques, Spark Alternating Least Squares
(ALS) model and tuned parameters.
• Effectively resolved the persistent overfitting problems in model tuning process by placing feedback controls,
periodic model review strategies (variable data split for train, test, evaluate) and detailed documentation so that data
anomalies, scalability and cold start problem don’t adversely affect the established model with passage of time.
• Implemented a Continuous Delivery pipeline with Docker, Jenkins, GitHub, and AWS AMIs; whenever a new
GitHub branch is created, Jenkins, the Continuous Integration server, automatically attempts to build a new Docker
container from it.
• Monitored resources and applications using AWS CloudWatch, including creating alarms to monitor metrics for services such
as EBS, EC2, ELB, RDS, and S3, and configured notifications for the alarms generated based on events defined.
• Worked with an in-depth level of understanding in the strategy and practical implementation of AWS Cloud-Specific
technologies including EC2 and S3.
• Managed AWS EC2 instances utilizing Auto Scaling and Elastic Load Balancer for QA and UAT environments as well as
infrastructure servers for Git and Puppet.
• Extensively used Kubernetes to handle the online and batch workloads required to feed
analytics and machine learning applications.
• Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS). AKS was used to
create, configure, and manage a cluster of virtual machines.
• Used Scala for its concurrency support, which played a key role in parallel processing of large data
sets.
• Developed MapReduce jobs in Scala, compiled to JVM bytecode, for data
processing.
Technologies used: HDFS, MapReduce, Hive, Pig, Cloudera, Impala, Oozie, Greenplum, MongoDB, Cassandra, Kafka,
Storm, Maven, Python, Cloud Manager, Solr, Nagios, Ambari, JDK, J2EE, Ajax, Struts, JSP, Servlets, Elasticsearch,
WebSphere, JavaScript, MRUnit
Jr. Big Data Developer
Hudda Infotech Private Limited, Hyderabad, India July 2016 to Jan 2018
Responsibilities:
• Worked on analyzing Hadoop cluster using different big data analytic tools including Pig, Hive, and MapReduce.
• Involved in loading data from LINUX file system to HDFS.
• Importing and exporting data into HDFS and Hive using Sqoop.
• Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI
team.
• Used Sqoop to import data into HDFS and Hive from other data systems.
• Configured performance tuning and monitoring for Cassandra read and write processes for fast I/O operations and
low latency. Used Java API and Sqoop to export data into DataStax Cassandra cluster from RDBMS.
• Experience working on processing unstructured data using Pig and Hive.
• Experienced in running Hadoop streaming jobs to process terabytes of xml format data.
• Involved in scheduling Oozie workflow engine to run multiple Hive and Pig jobs.
• Developed Pig Latin scripts to extract data from the web server output files to load into HDFS.
• Extensively used Pig for data cleansing.
• Implemented SQL, PL/SQL Stored Procedures.
• Worked on debugging, performance tuning of Hive & Pig Jobs.
• Implemented test scripts to support test driven development and continuous integration.
• Worked on tuning the performance of Pig queries.
• Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries
and Pig scripts.
• Actively involved in code review and bug fixing for improving the performance.
Technologies used: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, LINUX, Cloudera, Big Data, Java APIs, Java collection,
SQL.
Application Developer (ETL DataStage)
Dhruv Soft Services Private Limited, Hyderabad, India Mar 2015 to May 2016
Responsibilities:
• Used IBM InfoSphere suite products for ETL development, enhancement, testing, support, maintenance and
debugging software applications that support business units and support functions in consumer banking sector.
• Utilized Hadoop Ecosystem for Big Data sources in Customer Relationship Hub and Master Data Management: for
data ingestion: Kafka, Storm and Spark Streaming; for data landing: HBase, Phoenix relational DB layer on HBase;
for query and ETL used Phoenix, Pig and HiveQL; for job runtime management: Yarn and Ambari.
• Developed ETL packages using SQL Server Integration Services (SSIS) to perform data migration from legacy systems
like DB2, SQL Server, Excel sheets, XML files, and flat files to SQL Server databases.
• Performed daily database health checks and tasks including backup and restore using SQL Server tools like SQL
Server Management Studio, SQL Server Profiler, SQL Server Agent, and Database Engine Tuning Advisor on
Development and UAT environments.
• Performed the ongoing delivery, migrating client mini-data warehouses or functional data-marts from different
environments to MS SQL server.
• Involved in Implementation of database design and administration of SQL based database.
• Developed SQL scripts, Stored Procedures, functions and Views.
• Worked on DTS Packages and DTS Import/Export for transferring data from various sources, such as Oracle databases and
text-format data, to SQL Server 2005.
• Designed and implemented various machine learning models (e.g., customer propensity scoring model, customer
churn model) using Python (NumPy, SciPy, pandas, scikit-learn), Apache Spark (SparkSQL, MLlib).
• Provide performance tuning & optimization of data integration frameworks and distributed database system
architecture that is optimized for a.
• Designed and developed a solution in Apache Spark to extract transactional data from various HDFS sources and
ingest it into Apache HBase tables.
• Designed and developed Streaming jobs to send events and logs from Gateway systems to Kafka.
Environment: Hortonworks, DataStage 11.3, Oracle, DB2, UNIX, Mainframe, Autosys.