Sai Teja
Data Engineer
Email: teja.vzs28@gmail.com
Contact no: +1 (813)-724-6323
LinkedIn ID: linkedin.com/in/teja345
Certification ID: https://www.credly.com/badges/bc9134fc-3784-4410-b256-e57b4737f711
 PROFESSIONAL SUMMARY:
  Big Data and Cloud Data Engineering professional with 9+ years of experience building distributed data
   solutions, data analytics applications, and ETL and streaming pipelines leveraging big data, Hadoop ecosystem
   components, the Databricks platform, and AWS and GCP cloud services.
 Extensive experience building data pipelines with Apache Spark, Hive, Kafka, and orchestration using Apache
   Airflow/Cloud Composer.
  Expertise with Google Cloud Platform (GCP): BigQuery, Dataproc, Cloud Storage, Dataflow, Pub/Sub, Cloud
   Composer, and Airflow, supporting real-time and batch processing at scale.
  4+ years of GCP experience designing scalable ETL pipelines with BigQuery, Dataproc, and Cloud Composer.
  Proficient in Google Cloud Platform (GCP) services, including Vertex AI, BigQuery, Dataflow, and Dataproc, with
   expertise in designing frameworks for machine learning (ML) and large language model (LLM) operationalization.
   Skilled in AI compliance governance, data governance, and metadata management, with strong programming
   expertise in Python, Spark, and SQL.
  Skilled in designing and implementing scalable data pipelines for structured, semi-structured, and unstructured data
   across cloud platforms, particularly Google Cloud Platform (GCP), including BigQuery, Dataproc, Cloud Spanner, and
   Dataflow.
  Strong skills in Google Cloud Platform (GCP), including BigQuery and Bigtable, and in Informatica Cloud Data
   Governance and Catalog.
  Adept at working with large datasets, optimizing data models, and collaborating with cross-functional teams to
   deliver scalable and efficient data solutions that support business objectives.
 Strong expertise in Google BigQuery, including schema design, partitioning, clustering, materialized views, and
   cost optimization.
 Deep understanding of performance tuning, data model optimization, and data lake management in BigQuery and
   Cloud Storage.
 Hands-on experience setting up BigQuery Views in GCP and Google Analytics Hub for data sharing and analytics.
 Hands-on experience with Google's IAM APIs, security policies, and real-time data streaming.
  Proficient in building distributed data solutions, optimizing Databricks workflows, and creating scalable ETL pipelines
   using advanced data engineering techniques.
 Demonstrated hands-on experience with REST API development, IAM policies, and infrastructure development on
   AWS, with a strong focus on security and scalability.
 Designed and implemented ETL pipelines using Snowflake to process and load large-scale datasets for unified data
   warehousing.
 Hands-on expertise in Elasticsearch, enabling fast, scalable search and analytics across structured and semi-
   structured data. Proven track record of optimizing data models, implementing MLOps strategies, and deploying ML
   models for advanced analytics.
 Experienced in data governance, metadata management, and securing cloud platforms using IAM and CI/CD best
   practices. Adept in SQL, Python, PySpark, and orchestration tools like Airflow for robust and automated data
   pipelines.
 Extensive knowledge of CI/CD pipelines, Git workflows, and infrastructure-as-code tools such as Terraform.
 Strong understanding of dimensional data modeling, encompassing Star and Snowflake schemas.
     Proficient in CI/CD practices using Jenkins and container orchestration with Kubernetes.
     Experience in Extraction, Transformation, and Loading (ETL) of data from multiple sources into data warehouses
      and data marts, and in developing ETL data pipelines using PySpark.
     Architect, deploy, and manage cloud infrastructure components, leveraging services such as Google Cloud
      Platform (GCP) to support data processing and analytics workloads.
     Architected scalable ETL workflows using Dataproc (Apache Spark) for processing terabyte-scale datasets,
      optimizing performance and reducing data processing time by 30%.
     Expert in designing and managing ETL/ELT pipelines using tools like Cloud Dataflow, BigQuery, and Cloud Storage.
     Experienced in dimensional and relational data modeling using ER/Studio, Erwin, and Sybase PowerDesigner,
      including Star Join/Snowflake schema modeling, fact and dimension tables, and conceptual, logical, and physical
      data modeling.
     Expertise working with AWS Glue, RDS, DynamoDB, Kinesis, CloudFront, CloudFormation, S3, Athena, SNS, SQS,
      X-Ray, Elastic Load Balancing (ELB), and Amazon Redshift.
     Data Pipeline Design & Optimization: 5+ years of experience building and optimizing scalable data pipelines using
      tools like Google Cloud Composer, Dataflow, and Apache Beam.
     ETL & Big Data Processing: Skilled in ETL processes, Apache Hadoop and Spark ecosystems for large-scale data
      handling and transformation.
     Experienced in building, deploying, and managing SSIS packages with SQL Server Management Studio, creating
      and configuring SQL Server Agent jobs, configuring data sources, and scheduling packages through SQL Server
      Agent.
     Adept at using BI tools such as Power BI and QlikView to enhance reporting capabilities and develop BI
      applications per client requirements.
     Proficient in cloud services: AWS (Redshift, Glue), GCP (BigQuery, Dataflow, Pub/Sub), and Azure.
     Experience working with the open-source Apache Hadoop distribution and related technologies such as HDFS,
      MapReduce, Python, Pig, Hive, YARN, Hue, HBase, Sqoop, Oozie, ZooKeeper, Spark, Spark Streaming, Storm, Kafka,
      Cassandra, Impala, Snappy, Greenplum, MongoDB, and Mesos.
     Well-versed in Tableau Desktop and Tableau Online, with a strong understanding of their functionalities.
     Experienced in using Agile methodologies, including Extreme Programming, Scrum, and Test-Driven Development
      (TDD).
                                             TECHNICAL SKILLS:
    Hadoop/Spark Ecosystem           Hadoop, MapReduce, Pig, Hive/impala, YARN, Kafka, Flume, Oozie, Zookeeper,
                                     Spark, Airflow, Apache Beam
    Cloud Platforms                  AWS: Amazon EC2, S3, RDS, IAM, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda,
                                     EMR, Redshift, DynamoDB
                                     GCP: BigQuery, Cloud Storage, Dataflow, Vertex AI, Dataproc, Pub/Sub, Cloud Data
                                     Fusion, Cloud Composer, Cloud Functions, AI Platform, Cloud SQL, Cloud Spanner,
                                     Looker, Cloud Run, Infa Cloud Data Gov & Catalog, Google Bigtable
    ETL/BI Tools                     Informatica, SSIS, Tableau, Power BI, SSRS
    CI/CD                            Jenkins, Kubernetes, Helm, Docker, Splunk, Ant, Maven, Gradle.
    Ticketing Tools                  JIRA, Service Now, Remedy
    Database                         Oracle, SQL Server, Cassandra, Teradata, PostgreSQL, Snowflake, HBase, MongoDB
    Programming Languages            Scala, Hibernate, PL/SQL, R, Java
    Scripting                        Python, Shell Scripting, JavaScript, jQuery, HTML, JSON, XML.
    Web/Application server           Apache Tomcat, WebLogic, WebSphere
    Tools                            Eclipse, NetBeans
    Version Control                  Git, Subversion, Bitbucket, TFS.
    Scripting Languages              Python, Scala, R, PL/SQL, Shell Scripting
    DevOps Tools                     Jenkins, Docker, Kubernetes, Bitbucket
    Platforms                        Windows, Linux (Ubuntu), Mac OS, CentOS (Cloudera)
    Machine Learning & MLOps         Vertex AI, AutoML, MLOps for GenAI, real-time CDC ingestion
    PROFESSIONAL EXPERIENCE:
Boeing, Seattle, WA                                                                              April 2023 – Present
GCP Data Engineer
The project focused on migrating and optimizing Boeing's data infrastructure on Google Cloud Platform (GCP) to
enhance scalability, security, and real-time analytics capabilities. It involved designing data pipelines, ETL processes, and
data lakes to ensure seamless data integration and processing. The objective was to modernize legacy data systems,
improve data accessibility, and support AI/ML workloads for advanced analytics. The project emphasized cost
optimization, security compliance, and high availability. Additionally, it aimed at implementing CI/CD pipelines for
automated deployments and enhancing observability with monitoring and logging solutions. The solution was built to
support real-time and batch processing for mission-critical aerospace data.
Responsibilities:
     Designed and developed scalable, secure, and high-performance data pipelines on Google Cloud Platform (GCP).
     Migrated on-premises data warehouses to BigQuery and Cloud Storage, optimizing query performance and cost.
     Implemented data ingestion pipelines from various sources, including IoT devices, relational databases, and
      streaming services.
     Designed and optimized ETL pipelines using Dataflow, Dataproc, and BigQuery to support large-scale data
      processing.
     Developed data lakes on BigQuery and Cloud Storage, optimizing schema design, partitioning, and clustering for
      performance and cost efficiency.
     Built real-time and batch ETL pipelines using Cloud Dataflow (Apache Beam) for ingesting large datasets from
      multiple sources.
     Created and optimized BigQuery Views to enhance data accessibility and analytics capabilities for business teams.
     Implemented Delta Live Tables for real-time data ingestion and transformation.
     Integrated Elasticsearch with GCP pipelines to support high-speed search and log analytics for aerospace telemetry
      and operational datasets.
     Indexed logs and streaming data from Pub/Sub and Dataflow into Elasticsearch for real-time anomaly detection and
      system health monitoring.
     Automated workflows with Apache Airflow (Cloud Composer) to support advanced analytics.
     Designed and implemented feature engineering pipelines for ML workflows using Vertex AI and BigQuery ML,
      enabling advanced analytics and AI model deployment.
     Automated workflows using Cloud Composer (Apache Airflow), ensuring efficient data orchestration and reducing
      manual intervention by 30%.
     Collaborated with AI teams to enable ML model training and inference using Vertex AI.
     Ensured compliance with data governance and security standards through robust IAM policies, encryption, and
      auditing.
     Built and maintained Star Schema models to enhance query performance and storage efficiency.
     Automated data pipeline deployments using Terraform and Kubernetes.
     Troubleshot and resolved data quality issues, ensuring 99.9% data accuracy.
     Designed data lake architecture leveraging Cloud Storage, BigQuery, and Dataproc for structured and unstructured
      data.
     Automated data pipeline deployments using Terraform, Cloud Composer (Apache Airflow), and CI/CD tools.
   Developed data transformation logic using SQL, Python, and Spark on Dataproc for batch and real-time workloads.
   Monitored data pipelines using Cloud Logging, Cloud Monitoring, and Prometheus/Grafana dashboards.
   Optimized BigQuery queries to enhance performance and reduce cloud costs.
   Collaborated with data scientists and analysts to enable AI/ML model deployment in production environments.
   Worked closely with business stakeholders to understand data requirements and improve decision-making
    capabilities.
   Built ETL/ELT workflows using Cloud Dataflow, Apache Beam, and BigQuery for large-scale data processing, including
    schema design, partitioning, and clustering.
   Developed real-time data processing solutions using Pub/Sub, Dataflow, and Bigtable to support mission-critical
    analytics, focusing on low-latency data ingestion and transformation.
   Ensured disaster recovery and high availability strategies were in place using multi-region GCP services.
   Implemented data validation and anomaly detection mechanisms to maintain high data quality.
   Designed data retention and lifecycle management strategies for optimized storage usage.
   Managed GCP networking configurations such as VPC, Firewall Rules, and Cloud NAT for secure data transfers.
   Worked in Agile (Scrum) environments, participating in sprint planning, daily stand-ups, and retrospectives.
   Automated metadata management and lineage tracking using Data Catalog and Looker for analytics.
   Troubleshot and resolved performance bottlenecks, system failures, and data inconsistencies.
   Assisted in evaluating new GCP technologies to enhance Boeing’s cloud data platform.
   Provided technical support, training, and mentorship to junior engineers.
   Designed and implemented data partitioning and clustering strategies to optimize BigQuery performance and
    reduce costs.
   Optimized SQL queries and analytical workloads for cost efficiency in BigQuery and Cloud SQL.
   Deployed machine learning models using Vertex AI and TensorFlow for predictive analytics.
   Developed CI/CD pipelines using Cloud Build and GitHub Actions to automate deployments.
Environment: Hadoop/Big Data Ecosystem (Spark, Kafka, Hive, HDFS, Sqoop, Oozie, Cassandra, MongoDB), SQL, Cloud
Spanner, Google Cloud Storage, Python 3.x, PySpark, Data Warehousing, GCP (BigQuery, Dataproc, Dataflow, Data
Fusion, Pub/Sub), Vertex AI, TensorFlow, Terraform, Jenkins, Kubernetes, Looker, Elasticsearch, Databricks, Delta Live
Tables, Jira, Agile/Scrum
Optum, Eden Prairie, Minnesota                                                                    May 2022 – Mar 2023
Sr. Data Engineer
The project involved building scalable data pipelines to ingest, process, and unify customer data from multiple sources
using PySpark, Snowflake, and Kafka. Real-time and batch ETL workflows were implemented on AWS, leveraging services
like Glue, Lambda, and S3 for efficient data transformation and storage. Automation with tools like Airflow, Cloud
Composer, and Terraform improved orchestration, reducing manual intervention by 30%. Advanced data modeling in
Snowflake and Databricks optimized query performance and supported machine learning workflows. The project
delivered robust, low-latency data solutions, enabling actionable insights and supporting analytics across the
organization.
Responsibilities:
   Designed and built scalable, distributed data pipelines to ingest and process customer data from multiple sources.
   Developed and managed data solutions within a Hadoop Cluster environment using Hortonworks distribution.
   Created persistent customer keys to unify customer data across various accounts and systems.
   Designed and implemented scalable ETL pipelines on AWS, using PySpark for data transformation and Snowflake for
    data warehousing.
   Integrated Elasticsearch with real-time data pipelines to enable searchable and scalable indexing of patient and
    claims data, improving analytics response times by 30%.
   Designed custom Elasticsearch mappings and analyzers to enhance indexing quality and full-text search capabilities
    across structured and semi-structured datasets.
   Built and optimized distributed data pipelines using PySpark and Snowflake for large-scale datasets.
   Integrated and processed streaming data using Kafka and AWS Glue for real-time analytics.
   Developed and maintained ETL workflows using Airflow and Cloud Composer.
   Improved query performance by 30% through data model optimization and clustering strategies.
   Designed and implemented Star Schema models for enhanced reporting and analytics.
   Collaborated with business intelligence teams to support AI/BI Genie integrations.
   Developed distributed data processing solutions in PySpark to handle large-scale datasets in AWS EMR.
   Utilized AWS S3 as the main data storage solution, integrating with Snowflake for efficient data loading and
    querying.
   Managed data ingestion from various sources into Snowflake via AWS Glue and Apache Kafka for real-time
    streaming data processing.
   Built real-time data pipelines with Kafka to process and stream large volumes of data, ensuring low-latency data
    delivery to Snowflake.
   Built and monitored real-time streaming pipelines with Kafka and Pub/Sub, achieving low-latency data delivery.
   Designed and executed workflows with Cloud Composer (Airflow), improving orchestration efficiency.
   Implemented real-time data ingestion and transformation workflows in Snowflake using Kafka, achieving a 20%
    reduction in data processing latency.
   Designed data models using Neptune DB for unified customer views across multiple data sources.
   Developed real-time data streaming solutions using Kafka and AWS Glue, reducing data processing latency by 20%.
   Automated ETL workflows using Cloud Composer, enhancing orchestration efficiency and reducing manual
    intervention.
   Optimized Snowflake schemas and queries, resulting in a 30% improvement in query performance and cost
    efficiency.
   Collaborated with data science teams to design and deploy ML models using advanced feature engineering
    techniques.
   Created CI/CD pipelines for continuous deployment and monitoring of ETL processes.
   Developed data quality checks with Airflow/Cloud Composer, enhancing accuracy by 80%.
   Automated pipeline monitoring using Cloud Composer and optimized ETL processes with BigQuery for high-quality
    data analysis.
   Designed and optimized complex SQL queries and stored procedures in Snowflake for data extraction and
    reporting, achieving a 20% reduction in processing time.
   Developed and monitored ETL workflows using Airflow and Cloud Composer to support scalable data pipelines.
   Automated real-time data streaming using Kafka and Pub/Sub, enabling low-latency analytics in Snowflake.
   Deployed Airflow DAGs to orchestrate end-to-end data processes, reducing manual intervention by 30%.
   Collaborated with stakeholders to enhance data models and schemas in BigQuery for seamless data retrieval.
   Designed and deployed data solutions using AWS Glue, DynamoDB, and Elasticsearch, reducing query response
    times by 30%.
   Implemented RESTful APIs with AWS Lambda and secured integrations with IAM roles and policies.
   Enhanced system security by defining IAM roles, policies, and permissions, ensuring compliance with AWS
    standards.
   Ingested data from multiple file formats, including XML, JSON, CSV, and compressed files.
   Developed data partitioning and clustering strategies in Snowflake to enhance query performance and optimize
    storage costs.
   Automated data processing workflows using AWS Lambda and AWS Step Functions for event-driven ETL processes.
   Monitored the health and performance of ETL pipelines and Kafka streams using AWS CloudWatch and Kafka
    Monitoring Tools, ensuring reliable data flows.
   Built scalable data pipelines on Databricks using PySpark for data ingestion, transformation, and validation from
    various sources.
   Integrated Snowflake with Databricks to support analytics and machine learning workflows.
   Automated data workflows with Databricks and Cloud Composer, achieving a 30% increase in efficiency.
   Automated data ingestion and transformation processes using Python and Snowpipe, reducing data processing
    time by 20%.
   Designed and implemented Customer 360 solutions to unify patient and provider data across multiple systems.
   Collaborated with cross-functional teams to integrate clinical, claims, and behavioral data for a single customer view.
   Leveraged Apache Beam for distributed data processing in real-time.
   Managed and configured the CDP to capture, clean, and unify patient/member data from various touchpoints.
   Developed data pipelines to ingest first-party, third-party, and multi-source data for deeper customer insights.
   Collaborated with cross-functional teams to design and implement scalable data models, leveraging Star and
    Snowflake schemas.
   Automated workflows using Apache Airflow and improved data ingestion latency by 20%.
   Supported data science initiatives by delivering clean, processed data from Kafka and PySpark pipelines into
    Snowflake for advanced analytics and machine learning.
Environment: PySpark, ETL, Kafka, Snowflake, Hadoop, MapReduce, SQL Server, SQL scripting, PL/SQL, Python, AWS,
EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hive v2.3.1, Spark v2.1.3,
Java, Scala, SQL, Sqoop v1.4.6, Airflow v1.9.0, Customer 360, Customer Data Platform (CDP), Oracle, Databricks,
Tableau, Docker, Maven, Git, Jira
First Citizen Bank, Atlanta, GA                                                          Oct 2020 – Apr 2022
GCP Data Engineer
Description:
The project aimed to design and implement a scalable cloud-based data platform on Google Cloud Platform (GCP) to
support banking operations for data processing, analytics, and decision-making capabilities. It focused on migrating on-
premises data pipelines to GCP, ensuring efficient data ingestion, transformation, and storage using BigQuery, Dataflow,
and Cloud Storage. The objective was to modernize data infrastructure, optimize ETL workflows, and enable real-time
data streaming with Pub/Sub and Apache Kafka. Security and compliance were prioritized, implementing IAM roles,
encryption, and data governance policies. The project also aimed at cost optimization and performance tuning while
ensuring high availability and reliability. Additionally, it supported machine learning and business intelligence initiatives
by integrating with Looker and AI/ML services.
Responsibilities:
 Integrated data from various sources, including HDFS and HBase, into Spark RDDs to enable distributed data
   processing and analysis.
 Designed and implemented data lake and ETL pipelines on BigQuery to support banking analytics and decision-
   making.
 Migrated on-prem data pipelines to GCP, reducing processing time by 40%.
 Developed cloud-native data warehousing solutions using BigQuery, ensuring high performance and cost
   optimization.
   Utilized Data Catalog to improve metadata management and data lineage tracking.
   Collaborated with cross-functional teams to gather requirements and provide cloud-based solutions.
   Developed and managed BigQuery Views for business reporting and analytics.
   Integrated disparate data sources (HDFS, HBase) into a unified platform using GCP services such as Cloud Storage
    and Dataproc.
   Designed, developed, and deployed end-to-end data pipelines on GCP to support real-time and batch processing.
   Implemented ETL/ELT frameworks using Cloud Dataflow, Apache Beam, and BigQuery for efficient data
    transformation.
   Optimized Snowflake queries and warehouse configurations to enhance performance and reduce costs for large-
    scale banking datasets.
   Developed and maintained ETL/ELT processes using Snowflake’s Snowpipe for continuous data ingestion from
    diverse sources (e.g., banking transactions, customer profiles).
   Conducted performance tuning by analyzing query execution plans, clustering key optimizations, and partitioning
    large datasets.
   Led data modeling initiatives, including Star Schema and Snowflake Schema design for efficient querying and
    reporting.
   Implemented data security and governance measures including IAM roles and encryption.
   Created API Wrappers for seamless data integration across services.
   Deployed and managed Kubernetes clusters to ensure high availability and scalability.
   Automated workflow orchestration using Cloud Composer (Apache Airflow) to manage complex data dependencies.
   Integrated Pub/Sub messaging systems to enable real-time streaming and event-driven architectures.
   Optimized data ingestion from multiple sources such as on-prem databases, APIs, and third-party services into GCP.
   Ensured data security, encryption, and compliance with banking regulations (PCI DSS, GDPR, etc.).
   Built and maintained Cloud Storage-based data lakes for scalable and cost-effective data storage.
   Developed and maintained Terraform-based Infrastructure as Code (IaC) for automated cloud provisioning.
   Implemented logging and monitoring solutions using Stackdriver, Cloud Logging, and Cloud Monitoring.
   Designed and enforced data governance strategies, including role-based access control and audit logging.
   Created materialized views and partitioned tables in BigQuery for efficient query performance.
   Migrated on-premises databases to GCP using Dataproc, Data Fusion, and Database Migration Services.
   Set up CI/CD pipelines using Cloud Build, GitHub Actions, and Terraform for seamless deployment.
   Developed custom Python and SQL scripts for data transformation and validation.
   Enhanced data quality and integrity by implementing data validation checks and anomaly detection.
   Optimized cloud resource usage to minimize costs while maintaining performance benchmarks.
   Created technical documentation for data pipelines, architecture diagrams, and troubleshooting guides.
   Conducted knowledge-sharing sessions and provided training on GCP best practices to teams.
   Participated in Agile/Scrum methodologies, attending daily stand-ups, sprint planning, and retrospectives.
   Resolved performance bottlenecks and improved query execution time using indexing and caching strategies.
Environment: Hadoop, Spark, ETL, Python, SQL, Apache Airflow, Cloud Pub/Sub, Google Cloud Scheduler, Cloud
Composer, GCP (Cloud Storage, Dataflow, Data Fusion, Dataproc), Looker, Data Catalog, Snowflake, PySpark, Terraform,
Data Integration & Data Migration
Sonata Software, Bengaluru, India                                                           Oct 2016 – Jun 2020
Role: Data Engineer
Description:
I worked as a Data Engineer on a Customer Data Platform project that centralized customer data from various sources,
including CRM systems and web analytics platforms. The project aimed to build a scalable data platform to process and
analyze large volumes of diverse data, enabling actionable insights and efficient decision-making. It focused on designing
ETL pipelines, optimizing Snowflake data models, and supporting advanced analytics and machine learning initiatives
using Spark and AWS. Legacy systems were modernized to enhance performance, and diverse data formats were
processed to implement Data Lake concepts. Collaboration with cross-functional teams ensured alignment with business
needs, supported by Agile methodologies. The goal was to empower data-driven decisions through efficient, scalable,
and reliable solutions.
Responsibilities:
  Collaborated with business analysts and SMEs across departments to gather business requirements and identify
   workable items for further development.
  Developed Spark applications in Scala to enrich clickstream data merged with user profile data.
  Developed Spark code using Python and Spark SQL for faster testing and processing of data, loading data into
   Spark RDDs and performing in-memory computation to generate output with lower memory usage.
  Developed Pig UDFs for manipulating data according to business requirements and also developed custom Pig
   loaders.
 Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
  Developed ETL jobs using Spark SQL, RDDs, and DataFrames.
  Worked on a Scala codebase for Apache Spark, performing actions and transformations on RDDs, DataFrames,
   and Datasets using Spark SQL and Spark Streaming contexts.
 Built and maintained ETL pipelines using Apache Spark and AWS for a Customer Data Platform project.
 Utilized Snowflake and AWS S3 for data storage and warehousing, reducing data retrieval time by 35%.
 Created data models in Snowflake to improve query performance by 30%, enhancing reporting capabilities with
   Power BI.
 Collaborated with cross-functional teams to support machine learning and analytics initiatives.
  Performed advanced procedures such as text analytics and processing, using Spark's in-memory computing
   capabilities with Scala.
  Migrated MapReduce programs to Spark transformations using Scala.
  Worked with data feeds in formats such as JSON, CSV, and XML and implemented a data lake.
  Analyzed SQL scripts and designed solutions for implementation in PySpark.
  Used SQL queries and other tools to perform data analysis and profiling.
  Followed Agile methodology, participating in daily Scrum meetings, sprint planning, showcases, and retrospectives.
Environment: Spark, Scala, Hadoop, Python, PySpark, AWS, MapReduce, Pig, Databricks, ETL, HDFS, Hive, HBase, SQL,
Agile, Windows, Snowflake, Power BI, Apache Kafka
Sak Soft, Chennai, India                                                                      Nov 2014 – Aug 2016
Role: Hadoop Developer
Description:
 I worked on a project focused on developing a big data processing framework to analyze large volumes of unstructured
data. The project involved designing and implementing data ingestion pipelines using Apache Flume and Sqoop to
extract data from various sources, including relational databases and log files. I utilized Hadoop MapReduce for data
processing and transformation tasks, ensuring efficient handling of large datasets. Additionally, I collaborated with data
scientists to integrate machine learning models into the workflow for predictive analytics. Finally, I helped optimize the
performance of the Hadoop cluster for improved data processing speeds and resource utilization.
Responsibilities:
   Developed and maintained large-scale data ingestion pipelines using Apache Hadoop components such as HDFS,
    MapReduce, and Hive to process terabytes of data daily.
   Integrated data from multiple sources like relational databases, flat files, and logs into Hadoop HDFS for efficient
    storage and processing.
   Implemented MapReduce jobs to transform and aggregate data, performing batch processing on massive datasets.
   Worked with Hive and Pig for querying and analyzing large datasets stored in HDFS, optimizing performance for data
    analysis.
   Integrated RDBMS and log data into HDFS for processing, enhancing data availability and processing speed.
   Developed and maintained large-scale ETL pipelines using Hadoop components such as MapReduce, Hive, Pig, and
    HBase.
   Conducted performance optimization and tuning for Hadoop clusters, improving resource utilization and job
    execution times.
   Integrated structured and unstructured data using Apache Flume and Sqoop.
   Enhanced the data pipeline by integrating Sqoop for importing structured data from traditional RDBMS into HDFS.
   Performed data cleansing, transformation, and aggregation using Apache Pig scripts.
   Assisted in the design and implementation of HBase tables for real-time, low-latency access to processed data.
Environment: Python, HDFS, MapReduce, Hive, Pig, ETL pipelines, HBase, Hadoop clusters, SQL