Azure Data Engineer
Azure Storage:
Azure Storage is a Microsoft-managed service providing cloud storage that is highly available, secure,
durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage
Gen2, Azure Files, Azure Queues, and Azure Tables
An Azure account refers to the Azure billing account, which is mapped to the email ID that you
used to sign up for Azure. An account can contain multiple subscriptions; each subscription
can have multiple resource groups, and each resource group, in turn, can have multiple
resources.
---> Billing is done at the subscription level
To Create an Azure Storage Account:
Basics:
    1.) Subscription (by default, a subscription can hold up to 250 storage accounts per region; see the
        limits table later in these notes)
    2.) Resource group (A Resource group is a container that holds related resources for an Azure
        solution)
    3.) Storage account name (Globally Unique)
    4.) Region (Proximity to Users, Compliance Requirements, Redundancy and Disaster Recovery,
        Pricing, Service Availability based Region, Network Performance between your applications and
        the chosen region. Review the SLAs for Azure Storage services in different regions)
   5.) Performance (Standard and Premium)
   6.) Redundancy (LRS, GRS, ZRS, GZRS)
       LRS (locally redundant storage)---> Replicates data within a single data center
       GRS (geo-redundant storage)---> Replicates data to a secondary region for disaster recovery
       ZRS (zone-redundant storage)---> Replicates data across availability zones within one region
       GZRS (geo-zone-redundant storage)---> Combines GRS and ZRS for maximum redundancy.
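The naming rule and redundancy choices above can be checked before a storage account is created. The following is a minimal sketch, assuming the documented naming rule for storage accounts (3 to 24 characters, lowercase letters and digits only). The helper name `validate_storage_account_name` is hypothetical, and the check covers format only; global uniqueness can only be confirmed by Azure itself.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")

# Short descriptions of the redundancy options, taken from the list above.
REDUNDANCY_OPTIONS = {
    "LRS": "replicates data within a single data center",
    "GRS": "replicates data to a secondary region",
    "ZRS": "replicates data across availability zones",
    "GZRS": "combines GRS and ZRS",
}


def validate_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the format rule (not uniqueness)."""
    return _NAME_PATTERN.fullmatch(name) is not None


def describe_redundancy(option: str) -> str:
    """Look up the description for a redundancy option such as 'ZRS'."""
    return REDUNDANCY_OPTIONS[option.upper()]
```

For example, `validate_storage_account_name("mydatalake01")` passes, while `"My-Account"` fails because of the uppercase letters and the hyphen.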
Advanced:
   1.) Require secure transfer for REST API operations---> When enabled, the storage account accepts
       only HTTPS requests (encrypted with TLS); requests made over plain HTTP are rejected
   2.) Allow enabling public access on individual containers-----> By default, containers within a storage
       account are private. Enabling this option allows you to grant public access to specific containers if
       needed.
   3.) Enable storage account key access----> allows you to access the storage account using the
       account keys
   4.) Default to Azure Active Directory authorization in the Azure portal---> allows you to use Azure
       Active Directory (AD) for authentication and authorization instead of storage account keys. It
       provides more secure and granular access control to your storage account resources.
   5.) Minimum TLS version---> The lowest Transport Layer Security (TLS) version that clients may use.
       Choosing a higher version ensures stronger encryption and better security.
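   6.) Enable hierarchical namespace---> Organizes blobs into directories and enables Azure Data Lake
       Storage Gen2 capabilities on the account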
   7.) ACCESS PROTOCOLS - Enable SFTP and network file system v3----> Enabling these protocols
       allows you to access your storage account using SFTP (Secure File Transfer Protocol) and NFS
       (Network File System) v3.
   8.) BLOB STORAGE - Allow cross-tenant replication, and Access tier
   9.) AZURE FILES - Enable Large File Shares
Networking
   1.) Network access------> 1. Enable public access from all networks
                               2. Enable public access from selected virtual networks and IP addresses
                               3. Disable public access and use private access
Virtual networks
Network routing
   Routing Preferences ------> Microsoft network routing and Internet routing
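   With Microsoft network routing, traffic enters the Microsoft global network as close to the
   client as possible and stays on it for most of the path to the storage account. With Internet
   routing, traffic travels over the public internet for most of the path and enters the
   Microsoft network close to the storage account's region.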
Data Protection
    1.) Enable point-in-time restore for containers
    2.) Enable soft delete for blobs---> Soft delete enables you to recover blobs that were previously
        marked for deletion, including blobs that were overwritten. You choose the number of days to
        retain deleted blobs.
    3.) Enable soft delete for containers
    4.) Enable soft delete for file shares
    Tracking:
    Enable versioning for blobs---> Use versioning to automatically maintain previous versions of your
    blobs.
    Enable blob change feed ---> Keep track of create, modification, and delete changes to blobs in your
    account.
    Access control:
    Enable version-level immutability support---> Allows you to set a time-based retention policy at the
    account level that applies to all blob versions. Enable this feature to set a default policy at the
    account level. Without enabling this, you can still set a default policy at the container level or set
    policies for specific blob versions. Versioning is required for this property to be enabled.
Encryption:
    Encryption type -----> 1. Microsoft Managed keys
                            2. Customer Managed Keys
Customer Managed Keys-------> 1. Blob and file service only, or
                              2. To all service types.
Customer-managed key (CMK) support can be limited to blob service and file service only,
or to all service types. After the storage account is created, this support cannot be
changed.
Designing a partition strategy for files in Azure:
    1.   Choose a partition key: Determine a partition key based on the characteristics of your data, such
         as customer ID, date, or geographical location. This key will be used to distribute your data across
         different partitions.
    2.   Select a partitioning scheme: Azure provides two partitioning schemes: partition by range and
         partition by hash. Partition by range is suitable when you have sequential or time-based data.
         Partition by hash is useful when you want to distribute data uniformly across partitions.
    3.   Define the partitioning strategy: Implement the chosen partitioning scheme by creating a
         partition map. This map specifies the partition key, the partition boundaries (in the case of range
         partitioning), and the number of partitions (in the case of hash partitioning).
    4.   Distribute the data: When writing data to Azure, include the partition key in the data. Azure will
         use this key to determine the appropriate partition for storing the data.
A partition strategy for files has two parts that depend on one another: the partition key and the
partition logic. For example, if we take Create Date as the partition key, then the partition logic must
use that key to decide which partition each file is stored in.
Example for partition by range:
def get_partition_key(date):
    if "2020-01-01" <= date <= "2020-06-30":
        return "Partition A"
    elif "2020-07-01" <= date <= "2020-12-31":
        return "Partition B"
    else:
        return "Invalid Date Range"
# Example usage
file_date = "2020-05-15"
partition_key = get_partition_key(file_date)
print(partition_key) # Output: Partition A
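The range example above hard-codes its two ranges in `if` statements. A partition map, as described in step 3, keeps the boundaries as data instead. The following is a minimal sketch of that idea; the names `BOUNDARIES`, `PARTITIONS`, and `get_range_partition` are hypothetical, and real dates are used rather than strings so that malformed input fails early.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical partition map: each partition starts at its boundary date
# (inclusive) and ends just before the next boundary.
BOUNDARIES = [date(2020, 1, 1), date(2020, 7, 1), date(2021, 1, 1)]
PARTITIONS = ["Partition A", "Partition B"]


def get_range_partition(file_date: date) -> str:
    """Return the partition whose date range contains file_date."""
    idx = bisect_right(BOUNDARIES, file_date) - 1
    if idx < 0 or idx >= len(PARTITIONS):
        raise ValueError(f"{file_date} is outside the partition map")
    return PARTITIONS[idx]
```

Adding a partition then means appending one boundary and one name rather than editing the lookup logic.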
Example for partition by hash:
import hashlib
# Requires the azure-storage-blob package
from azure.storage.blob import BlobServiceClient

# Placeholder: supply the connection string for your own storage account
connection_string = "<your-storage-connection-string>"

def get_partition_key(file_name):
    # Generate a hash value for the file name
    hash_value = hashlib.md5(file_name.encode()).hexdigest()
    # Use the first two hex characters of the hash as the partition key
    return hash_value[:2]

def get_blob_client(file_name):
    # One container per partition. Container names need at least 3 characters,
    # so the 2-character key gets a prefix (the "partition-" prefix is an
    # assumption of this example). The container must already exist.
    container_name = "partition-" + get_partition_key(file_name)
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    return container_client.get_blob_client(file_name)

def store_file(file_name, file_content):
    # Store the file in the container that belongs to its partition
    get_blob_client(file_name).upload_blob(file_content)

def access_file(file_name):
    # Retrieve the file from the container that belongs to its partition
    return get_blob_client(file_name).download_blob().readall()
Azure Blob Storage uses the full blob name (account name + container name + blob name) as
the partition key.
Designing a partition strategy for analytical workloads
There are three main types of partition strategies for analytical workloads. These are listed here:
        Horizontal partitioning, which is also known as sharding
        Vertical partitioning
        Functional partitioning
Horizontal partitioning
In a horizontal partition, we divide the table data horizontally, and subsets of rows are stored in
different data stores. Each of these subsets of rows (with the same schema as the parent table)
are called shards. Essentially, each of these shards is stored in different database instances.
       NOTE
       Don't try to balance the data to be evenly distributed across partitions unless specifically
       required by your use case because usually, the most recent data will get accessed more
       than older data. Thus, the partitions with recent data will end up becoming bottlenecks
       due to high data access.
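As a rough illustration of horizontal partitioning, the sketch below groups rows into shards by a key while leaving every row's schema untouched. The function `shard_rows` and the sample `orders` data are hypothetical; in practice each shard would live in a separate database instance rather than in a Python dictionary.

```python
from collections import defaultdict


def shard_rows(rows, shard_key):
    """Group rows into shards; every shard keeps the full row schema."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_key(row)].append(row)
    return dict(shards)


# Hypothetical sample data: shard an orders table by order year.
orders = [
    {"order_id": 1, "customer": "Contoso", "order_year": 2021},
    {"order_id": 2, "customer": "Fabrikam", "order_year": 2022},
    {"order_id": 3, "customer": "Contoso", "order_year": 2022},
]
shards = shard_rows(orders, lambda row: row["order_year"])
```

Here the 2022 shard ends up larger than the 2021 shard, which matches the note above: shards do not have to be the same size.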
Vertical partitioning
In a vertical partition, we divide the data vertically, and each subset of the columns is stored
separately in a different data store. This is ideal for column-oriented data stores such as HBase,
Cosmos DB, and so on.
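Vertical partitioning can be pictured with a similar sketch that splits each row by column group. The function `split_columns` and the sample `customers` data are hypothetical; the key column is repeated in every group so that the pieces can be joined back together later.

```python
def split_columns(rows, key, column_groups):
    """Return one list of partial rows per column group, each keeping the key."""
    stores = {name: [] for name in column_groups}
    for row in rows:
        for name, columns in column_groups.items():
            partial = {key: row[key]}
            partial.update({column: row[column] for column in columns})
            stores[name].append(partial)
    return stores


# Hypothetical sample data: frequently read columns vs. rarely read columns.
customers = [
    {"id": 1, "name": "Ana", "city": "Lisbon", "bio": "Long profile text"},
    {"id": 2, "name": "Ben", "city": "Oslo", "bio": "Another long profile"},
]
stores = split_columns(
    customers,
    key="id",
    column_groups={"hot": ["name", "city"], "cold": ["bio"]},
)
```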
Functional partitioning
Functional partitions are similar to vertical partitions, except that here, we store entire tables or
entities in different data stores. They can be used to segregate data belonging to different
organizations, frequently used tables from infrequently used ones, read-write tables from read-
only ones, sensitive data from general data, and so on.
Designing a partition strategy for efficiency/performance
      Design effective folder structures to improve the efficiency of data reads and writes.
      Partition data such that a significant amount of data can be pruned while running
       queries.
      File sizes in the range of 256 megabytes (MB) to 100 gigabytes (GB) perform really
       well with analytical engines such as HDInsight and Azure Synapse on ADLS Gen2. So,
       aggregate the files to these ranges before running the analytical engines on them.
      For I/O-intensive jobs, try to keep the optimal I/O buffer sizes in the range of 4 to 16
       MB; anything too big or too small will become inefficient.
      Run more containers or executors per virtual machine (VM) (such as Apache Spark
       executors or Apache Yet Another Resource Negotiator (YARN) containers).
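The first two points above, folder structure and pruning, can be sketched together. The example assumes a date-based layout using the common `year=/month=/day=` folder naming; that naming, and the helper names `partition_path` and `prune`, are assumptions of this sketch rather than anything required by Azure.

```python
from datetime import date


def partition_path(dataset: str, day: date) -> str:
    """Build a date-partitioned folder path for a dataset."""
    return f"{dataset}/year={day.year}/month={day.month:02d}/day={day.day:02d}"


def prune(paths, year: int, month: int):
    """Keep only the folders a query for one month would need to read."""
    prefix = f"year={year}/month={month:02d}/"
    return [p for p in paths if prefix in p]


all_paths = [
    partition_path("sales", date(2020, 5, 15)),
    partition_path("sales", date(2020, 5, 16)),
    partition_path("sales", date(2020, 6, 1)),
]
may_paths = prune(all_paths, 2020, 5)
```

A query that filters on May 2020 only has to list and read `may_paths`; the June folder is skipped without opening any file.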
Iterative query performance improvement process
   1. List business-critical queries, the most frequently run queries, and the slowest queries.
   2. Check the query plans for each of these queries using the EXPLAIN keyword and see the
      amount of data being used at each stage (we will be learning about how to view query
      plans in the later chapters).
   3. Identify the joins or filters that are taking the most time. Identify the corresponding data
      partitions.
   4. Try to split the corresponding input data partitions into smaller partitions, or change the
      application logic to perform isolated processing on top of each partition and later merge
      only the filtered data.
   5. You could also try to see if other partitioning keys would work better and if you need to
      repartition the data to get better job performance for each partition.
   6. If any particular partitioning technique doesn't work, you can explore having more than
      one piece of partitioning logic. For example, you could apply horizontal partitioning
      within functional partitioning, and so on.
   7. Monitor the partitioning regularly to check if the application access patterns are balanced
      and well distributed. Try to identify hot spots early on.
   8. Iterate this process until you hit the preferred query execution time.
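Step 7 above, spotting hot spots, can be approximated with a simple skew check. This is a sketch only: the access counts are made-up sample numbers, the function name `find_hot_partitions` is hypothetical, and the threshold of twice the mean is an arbitrary starting point rather than a recommended value.

```python
def find_hot_partitions(access_counts, threshold=2.0):
    """Return partitions accessed more than `threshold` times the mean."""
    if not access_counts:
        return []
    mean = sum(access_counts.values()) / len(access_counts)
    return sorted(p for p, n in access_counts.items() if n > threshold * mean)


# Hypothetical access counts per monthly partition.
access_counts = {"2023-01": 100, "2023-02": 120, "2023-03": 110, "2023-04": 1500}
hot = find_hot_partitions(access_counts)
```

With these numbers the mean is 457.5, so only `2023-04` exceeds the threshold and is a candidate for splitting into smaller partitions.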
Designing a partition strategy for Azure Synapse Analytics
   A dedicated SQL pool is a massively parallel processing (MPP) system that splits the queries
into 60 parallel queries and executes them in parallel. Each of these smaller queries runs on
something called a distribution. A distribution is a basic unit of processing and storage for a
dedicated SQL pool. There are three different ways to distribute (shard) data among
distributions, as listed here:
      Round-robin tables
      Hash tables
      Replicated tables
Partitioning is supported on all the distribution types in the preceding list. Apart from the
distribution types, a dedicated SQL pool also supports three types of tables: clustered
columnstore, clustered index, and heap tables. Partitioning is supported in all of these types of
tables, too.
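The difference between round-robin and hash distribution can be simulated in a few lines. This is a sketch under stated assumptions: SHA-256 is only a stand-in, because the hash function a dedicated SQL pool uses internally is not described in these notes, and the function names are hypothetical. Replicated tables are not modeled here.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # number of distributions in a dedicated SQL pool


def round_robin_distribution(row_index: int) -> int:
    """Assign rows to distributions in turn, ignoring the row's contents."""
    return row_index % NUM_DISTRIBUTIONS


def hash_distribution(key: str) -> int:
    """Assign a row by hashing its distribution column (SHA-256 as a stand-in)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


# 120 rows that share only 10 distinct customer IDs.
customer_ids = [f"customer-{i % 10}" for i in range(120)]
round_robin_counts = Counter(
    round_robin_distribution(i) for i in range(len(customer_ids))
)
hash_counts = Counter(hash_distribution(c) for c in customer_ids)
```

Round-robin spreads the 120 rows evenly over all 60 distributions. Hash distribution sends every row with the same key to the same distribution, so a column with only 10 distinct values can occupy at most 10 distributions; this is why the choice of distribution column matters.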
In a dedicated SQL pool, data is already distributed across its 60 distributions, so we need to be
careful in deciding if we need to further partition the data. Clustered columnstore tables work
optimally when each partition holds at least 1 million rows in every distribution.
For example, if we plan to partition the data further by the months of a year, we are talking about
12 partitions x 60 distributions = 720 sub-divisions. Each of these divisions needs to have at least
1 million rows; in other words, the table (usually a fact table) will need to have more than 720
million rows. So, we will have to be careful to not over-partition the data when it comes to
dedicated SQL pools.
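The arithmetic above generalizes to any partition count. The sketch below uses the two figures given in the text (60 distributions and a target of 1 million rows per partition per distribution); the function names are hypothetical.

```python
DISTRIBUTIONS = 60           # distributions in a dedicated SQL pool
MIN_ROWS_PER_SLICE = 1_000_000  # target rows per partition per distribution


def min_rows_for_partitions(num_partitions: int) -> int:
    """Rows a table needs so that every partition/distribution slice is full."""
    return num_partitions * DISTRIBUTIONS * MIN_ROWS_PER_SLICE


def max_partitions(total_rows: int) -> int:
    """Largest partition count that still keeps every slice at the target."""
    return total_rows // (DISTRIBUTIONS * MIN_ROWS_PER_SLICE)
```

`min_rows_for_partitions(12)` gives 720,000,000, matching the monthly example, while a table of 100 million rows supports only `max_partitions(100_000_000) == 1` partition.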
Identifying when partitioning is needed in ADLS Gen2
As we have learned in the previous chapter, we can partition data according to our requirements,
such as performance, scalability, security, and operational overhead. But there is another reason
why we might end up partitioning our data: the various I/O bandwidth limits that Azure imposes at
the subscription and storage account levels. These limits apply to both Blob storage and ADLS
Gen2.
The rate at which we ingest data into an Azure Storage system is called the ingress rate, and
the rate at which we move the data out of the Azure Storage system is called the egress rate.
Resource                                                                          Limit
Maximum number of storage accounts with standard endpoints per region per         250 by default, 500 by request
  subscription, including standard and premium storage accounts
Maximum number of storage accounts with Azure DNS zone endpoints (preview)        5000 (preview)
  per region per subscription, including standard and premium storage accounts
Default maximum storage account capacity                                          5 PiB
Maximum number of blob containers, blobs, file shares, tables, queues,            No limit
  entities, or messages per storage account
Default maximum request rate per storage account                                  20,000 requests per second
Default maximum ingress per general-purpose v2 and Blob storage account in        60 Gbps
  the following regions (LRS/GRS): Australia East, Central US, East Asia,
  East US 2, Japan East, Korea Central, North Europe, South Central US,
  Southeast Asia, UK South, West Europe, West US
Default maximum ingress per general-purpose v2 and Blob storage account in        60 Gbps
  the following regions (ZRS): Australia East, Central US, East US, East US 2,
  Japan East, North Europe, South Central US, Southeast Asia, UK South,
  West Europe, West US 2
Default maximum ingress per general-purpose v2 and Blob storage account in        25 Gbps
  regions that aren't listed in the previous row
Default maximum ingress for general-purpose v1 storage accounts (all regions)     10 Gbps
Default maximum egress for general-purpose v2 and Blob storage accounts in        120 Gbps
  the following regions (LRS/GRS): Australia East, Central US, East Asia,
  East US 2, Japan East, Korea Central, North Europe, South Central US,
  Southeast Asia, UK South, West Europe, West US
Default maximum egress for general-purpose v2 and Blob storage accounts in        120 Gbps
  the following regions (ZRS): Australia East, Central US, East US, East US 2,
  Japan East, North Europe, South Central US, Southeast Asia, UK South,
  West Europe, West US 2
Default maximum egress for general-purpose v2 and Blob storage accounts in        50 Gbps
  regions that aren't listed in the previous row
Maximum number of IP address rules per storage account                            200
Maximum number of virtual network rules per storage account                       200
Maximum number of resource instance rules per storage account                     200
Maximum number of private endpoints per storage account                           200
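The ingress and egress limits listed above translate directly into transfer times, which is what makes them a reason to partition data across storage accounts. The following back-of-the-envelope sketch assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a transfer that sustains the full limit, which real workloads rarely do; the function names are hypothetical.

```python
import math


def transfer_hours(data_tb: float, limit_gbps: float) -> float:
    """Best-case hours needed to move data_tb terabytes at limit_gbps."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (limit_gbps * 1e9)
    return seconds / 3600


def accounts_needed(required_gbps: float, per_account_gbps: float) -> int:
    """Storage accounts needed to reach a target aggregate throughput."""
    return math.ceil(required_gbps / per_account_gbps)
```

For example, ingesting 100 TB at the 60 Gbps limit takes roughly 3.7 hours at best, and a pipeline that needs 150 Gbps of ingress would have to spread its data over `accounts_needed(150, 60) == 3` storage accounts.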
Develop data processing (40–45%) (4)
Ingest and transform data (Chapter 8)
Transforming data by using Apache Spark
Apache Spark supports transformations with three different Application Programming
Interfaces (APIs): Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. We will
learn about RDDs and DataFrame transformations in this chapter. Datasets are just extensions of
DataFrames, with additional features like being type-safe (where the compiler will strictly check
for data types) and providing an object-oriented (OO) interface.
What are RDDs?
An RDD is an immutable, fault-tolerant collection of data objects that can be operated on in
parallel by Spark.