Azure Storage Account Data Lake
Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.
Made by: Massdriver
Official: Yes
Tags: azure-storage-account-data-lake
Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.
Use Cases
Hadoop-compatible access
Azure Data Lake Storage is primarily designed to work with Hadoop and all frameworks that use the Apache Hadoop Distributed File System (HDFS) as their data access layer. Hadoop distributions include the Azure Blob File System (ABFS) driver, which enables many applications and frameworks to access Azure Blob Storage data directly. The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.
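As a rough illustration of how the ABFS driver addresses data, the sketch below builds a URI of the form abfs[s]://<container>@<account>.dfs.core.windows.net/<path>. The helper name `abfs_uri` and the account, container, and path values are hypothetical and are not part of this bundle.

```python
def abfs_uri(account: str, container: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS URI for a path inside a Data Lake Storage Gen2 container.

    `account`, `container`, and `path` are placeholder values supplied by the caller.
    """
    # "abfss" selects the TLS-secured variant of the driver scheme.
    scheme = "abfss" if secure else "abfs"
    # Strip leading slashes so the URI has exactly one slash after the host.
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical account and container names, for illustration only.
print(abfs_uri("mydatalake", "raw", "events/2024/01/data.parquet"))
# → abfss://raw@mydatalake.dfs.core.windows.net/events/2024/01/data.parquet
```

A URI like this is what a Hadoop-compatible framework would typically be given as an input or output location.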
Hierarchical directory structure
The hierarchical namespace is a key feature that enables Azure Data Lake Storage Gen2 to provide high-performance data access at object storage scale and price. You can use this feature to organize all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. In other words, your Azure Data Lake Storage data is organized in much the same way that files are organized on your computer.
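To make the directory analogy concrete, here is a minimal sketch that arranges a few object paths into a nested directory tree, much as a file browser would. It is a local, in-memory model only and does not call Azure; the function names and sample paths are made up for illustration.

```python
from pathlib import PurePosixPath


def build_tree(paths):
    """Arrange slash-separated paths into nested dicts (directories and files)."""
    tree = {}
    for p in paths:
        node = tree
        for part in PurePosixPath(p).parts:
            # Create the directory (or file) entry if it does not exist yet.
            node = node.setdefault(part, {})
    return tree


def render(tree, indent=0):
    """Return the tree as indented lines, sorted by name at each level."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines


# Hypothetical paths, for illustration only.
sample = ["raw/2024/a.csv", "raw/2024/b.csv", "curated/c.parquet"]
print("\n".join(render(build_tree(sample))))
```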
Optimized cost and performance
Azure Data Lake Storage is priced at Azure Blob Storage levels. It builds on Azure Blob Storage capabilities such as automated lifecycle policy management and object level tiering to manage big data storage costs.
Performance is optimized because you don't need to copy or transform data as a prerequisite for analysis. The hierarchical namespace capability of Azure Data Lake Storage allows for efficient access and navigation. This architecture means that data processing requires fewer computational resources, reducing both the time and cost of accessing data.
Massive scalability
Azure Data Lake Storage offers massive storage and accepts numerous data types for analytics. It doesn't impose any limits on account sizes, file sizes, or the amount of data that can be stored in the data lake. Individual files can have sizes that range from a few kilobytes (KBs) to a few petabytes (PBs). Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.
Design
Our bundle includes the following design choices to help simplify your deployment:
Redundancy
Azure Storage always stores multiple copies of your data so that it's protected from planned and unplanned events, including transient hardware failures, network or power outages, and massive natural disasters. Redundancy ensures that your storage account meets its availability and durability targets even in the face of failures.
- Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option, but isn't recommended for applications requiring high availability or durability.
- Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region. For applications requiring high availability, Microsoft recommends using ZRS in the primary region, and also replicating to a secondary region.
Best practices
TLS 1.2
Enforcement of TLS 1.2 on public HTTPS endpoints is standard best practice.
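The bundle enforces the minimum version on the service side. For a matching client-side setting, the sketch below uses Python's standard `ssl` module to build a context that refuses anything older than TLS 1.2. The function name is hypothetical, and whether your client library accepts a custom context is an assumption to verify.

```python
import ssl


def tls12_client_context() -> ssl.SSLContext:
    """Return a client context that will not negotiate below TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse TLS 1.0 and 1.1 even if the remote endpoint would allow them.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = tls12_client_context()
print(ctx.minimum_version.name)
```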
Data retention policy
A time-based retention policy stores blob data in a Write-Once, Read-Many (WORM) format for a specified interval. When a time-based retention policy is set, clients can create and read blobs, but can't modify or delete them. After the retention interval has expired, blobs can be deleted but not overwritten.
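The rules above can be modeled in a few lines. This is a simplified sketch of the semantics as described in this section, not the Azure implementation; the function name and the choice to measure the interval from blob creation time are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def allowed_operations(created_at: datetime, retention_days: int, now: datetime) -> set:
    """Return the operations permitted on an existing blob under a time-based policy."""
    expires_at = created_at + timedelta(days=retention_days)
    if now < expires_at:
        # Inside the retention interval: the blob can be read but not modified or deleted.
        return {"read"}
    # After the interval: the blob can be deleted, but still not overwritten.
    return {"read", "delete"}


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(allowed_operations(created, 30, datetime(2024, 1, 15, tzinfo=timezone.utc)))
print(allowed_operations(created, 30, datetime(2024, 3, 1, tzinfo=timezone.utc)))
```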
Security
In order to improve security, we implement a few key safeguards.
Data encrypted in transit
By default, all data in transit is encrypted with Transport Layer Security (TLS).
Data encrypted at rest
Azure Storage uses service-side encryption (SSE) to automatically encrypt your data when it is persisted to the cloud. Azure Storage encryption protects your data and helps you meet your organizational security and compliance commitments.
Trade-offs
- Customer-managed keys (CMKs) are not supported in this bundle.
Variable | Type | Description |
---|---|---|
account.access_tier | string | How frequently will the data be accessed? Hot data is accessed frequently, while cool data is accessed less frequently. Hot data is cheaper to write to, but costs more to store. Cool data is more expensive to write to, but costs less to store. |
account.region | string | The region where the storage account will be created. Cannot be changed after deployment. |
account.tier | string | The performance tier of the storage account. Premium storage accounts do not support geo-replication. Cannot be changed after deployment. |
monitoring.mode | string | Enable and customize storage account metric alarms. |
redundancy.data_protection | integer | Set the number of days to allow data recovery if data is deleted (minimum 1, maximum 365). |
redundancy.replication_type | string | No description |
redundancy.zone_redundancy | boolean | Enable zone redundancy for the storage account. Cannot be changed after deployment. |
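
As a rough sketch of how the constraints in this table might be checked before deployment, the snippet below validates a nested parameter dictionary. The dictionary layout, the function name, and the exact tier strings "Hot" and "Cool" are assumptions for illustration; the bundle's real schema may differ.

```python
def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means the input passed."""
    errors = []
    account = params.get("account", {})
    redundancy = params.get("redundancy", {})

    # Assumed allowed values; the table only names hot and cool data.
    if account.get("access_tier") not in ("Hot", "Cool"):
        errors.append("account.access_tier must be 'Hot' or 'Cool'")

    # The table states a minimum of 1 and a maximum of 365 days.
    days = redundancy.get("data_protection")
    if not isinstance(days, int) or isinstance(days, bool) or not 1 <= days <= 365:
        errors.append("redundancy.data_protection must be an integer from 1 to 365")

    if not isinstance(redundancy.get("zone_redundancy"), bool):
        errors.append("redundancy.zone_redundancy must be true or false")

    return errors


# Hypothetical parameter values, for illustration only.
example = {
    "account": {"access_tier": "Hot"},
    "redundancy": {"data_protection": 7, "zone_redundancy": True},
}
print(validate_params(example))
# → []
```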