{"id":16731,"date":"2020-10-13T16:00:45","date_gmt":"2020-10-13T15:00:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/?p=16731"},"modified":"2020-10-13T15:43:09","modified_gmt":"2020-10-13T14:43:09","slug":"adls-architectural-pattern-for-mls-using-datasets","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/technetuk\/2020\/10\/13\/adls-architectural-pattern-for-mls-using-datasets\/","title":{"rendered":"Using Azure Data Lake Storage (ADLS) and Azure Datasets for Machine Learning"},"content":{"rendered":"
<\/p>\n
Many organisations are now focusing on a single version of the truth for their data, typically via some form of data lake strategy. This brings several benefits, such as a single access point, fewer silos, and an enriched dataset via the amalgamation of data from various sources. Microsoft’s answer to this strategy is Azure Data Lake Storage (ADLS)<\/a>. ADLS brings several benefits to enterprises, such as security, manageability, scalability, reliability and availability.<\/p>\n A typical approach to a data lake strategy that we see being adopted by customers is the hierarchical approach (see Figure 1), where the data is first ingested into a landing layer, typically referred to as the \u201craw data lake\u201d. Data is then processed, filtered, optimised and placed in the \u201ccurated data lake\u201d. It is then further refined and processed according to application- or service-specific logic, and placed in what is referred to as the \u201cproduct data lake\u201d.<\/p>\n Figure 1 – Data Lake Abstraction Strategy<\/p><\/div>\n <\/p>\n A raw data lake provides a single version of the truth for all the data in an organisation and can be seen as the landing zone for all of that data. Data from various sources, whether structured, semi-structured or unstructured, is ingested in its native format. Some optimisation and basic data quality checks, such as verifying the total number of rows ingested, may be applied at this stage.<\/p>\n Once the data has landed, there are several things to consider before moving it to the \u201ccurated data lake\u201d: Who has access<\/a> to this data? How will it be secured<\/a>? How is it going to be processed? What partition strategy should be used? How should it be logged, and what file format and compression techniques should be considered? This layer is typically owned by IT.<\/p>\n <\/p>\n While the raw data lake is targeted at the organisation level, the curated data lake is focused on the division or OU level. 
Each division may define its own business rules and constraints for how the data should be processed, presented and accessed. Data is typically stored in an optimised format and tuned for performance<\/a>\u00a0and is generally cleaned (handling missing values, outliers and so on), aggregated and filtered. Curated data tends to have more structure and is grouped according to specific domains, such as Finance and HR.<\/p>\n <\/p>\n The product data lake tends to be application-specific, and is normally filtered to the specific interests of the application. The application focus in this article is machine learning, which relies on large volumes of data for modelling and for batch or real-time inferencing, with results written back to the product data lake.<\/p>\n These benefits would count for little if we were not able to perform advanced analytics in a way that delivers predictive capabilities. Here we have defined an Azure ML stack infused with a data lake strategy.<\/p>\n Figure 2 – Data Lake Infused in the ML Stack<\/p><\/div>\n The ML stack consists of various layers, which are infused with the different categories of data lake storage. The first layer, \u201cData Prep\u201d, is where raw data is processed according to business logic and domain constraints, typically using Databricks, and is made available to product-specific data lakes.<\/p>\n Layer two consists of the various frameworks to be used for the ML problem at hand, and would typically use the product data lake to carry out feature engineering.<\/p>\n Layer three consists of Azure Machine Learning, for running and tracking experiments on various compute targets, and would use the product data lake as the source of experimentation data. Any write-back would take place in this layer. 
The output of the model or the inferencing results will also be stored in the product data lake.<\/p>\n The core of this stack lies in the ability of Azure Machine Learning to access and interface with the data lake store. The rest of this article will focus on how this can be achieved.<\/p>\n <\/p>\n The Azure Machine Learning Service (AMLS)<\/a> provides a cloud-based environment for the development, management and deployment of machine learning models. The service allows the user to start training models on a local machine, then scale out to the cloud. It fully supports open-source technologies such as PyTorch, TensorFlow and scikit-learn, and can be used for any kind of machine learning, from classical to deep learning, and for both supervised and unsupervised learning.<\/p>\n Figure 3 below shows the architectural pattern that focuses on the interaction between the product data lake and Azure Machine Learning. Until recently, the data used for model training needed either to reside in the default (blob or file) storage associated with the Azure Machine Learning Service workspace, or a complex pipeline needed to be built to move the data to the compute target during training. This meant that data stored in Azure Data Lake Storage Gen1 (ADLSG1) typically needed to be duplicated to the default (blob or file) storage before training could take place. This is no longer necessary with the new dataset feature.<\/p>\n Figure 3 – ADLS Architectural Pattern for MLS<\/p><\/div>\n <\/p>\n The new dataset feature of AMLS makes it possible to work with data files stored in ADLSG1 by creating a reference to the data source location, along with a copy of its metadata. The data remains in its existing location, so no extra storage cost is incurred. 
The dataset is thus identified by name and is made available to all workspace users.<\/p>\n <\/p>\n In the following example we demonstrate how to use Azure datasets with Azure Machine Learning to build a machine learning model using the product data lake. The steps are as follows:<\/p>\n <\/p>\n We will be using an Azure ML Notebook VM<\/a> to demonstrate this example, because it comes pre-built with a Python Jupyter Notebook environment with the AMLS SDK installed. However, if you prefer to use your own IDE, you will need to install the AMLS Python SDK yourself.<\/p>\n The code snippet in Step 1 assumes that the AMLS configuration file is available in the working directory. The file can be downloaded from the Azure portal by selecting \u201cDownload config.json\u201d from the Overview section of the AMLS workspace. Or you can create it yourself:<\/p>\n
Raw Data Lake<\/h2>\n
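One of the considerations above, the partition strategy, is often answered in the raw zone by partitioning on source system and ingestion date. A minimal sketch of that convention (the folder layout and the source name here are illustrative assumptions, not prescribed by ADLS):

```python
from datetime import date

# Hypothetical raw-zone layout: partition by source system and ingestion
# date. The folder convention and the "crm" source name are illustrative.
def raw_path(source: str, ingest_date: date) -> str:
    return f"raw/{source}/{ingest_date:%Y/%m/%d}/"

print(raw_path("crm", date(2020, 10, 13)))  # raw/crm/2020/10/13/
```

Date-based partitions like this make it cheap to reprocess or expire a single day's landing data without touching the rest of the lake.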
Curated Data Lake<\/h2>\n
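The cleaning described for this layer (handling missing values, outliers, aggregation and filtering) might look like the following pandas sketch. The column names, sample values and interquartile-range outlier rule are all assumptions chosen for illustration:

```python
import pandas as pd

# Illustrative curated-layer cleaning; columns and thresholds are assumed.
raw = pd.DataFrame({
    "division": ["Finance", "Finance", "Finance", "Finance",
                 "HR", "HR", "HR", "HR", None],
    "amount": [100.0, 110.0, 90.0, None, 95.0, 105.0, 100.0, 1e6, 50.0],
})

curated = raw.dropna(subset=["division"]).copy()   # drop rows missing the key
curated["amount"] = curated["amount"].fillna(curated["amount"].median())

# Robust outlier fences: keep values within 1.5 * IQR of the quartiles
q1, q3 = curated["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
curated = curated[curated["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Aggregate by domain, ready for the product layer
summary = curated.groupby("division", as_index=False)["amount"].sum()
```

An IQR fence is used rather than a mean/standard-deviation rule because a single extreme value inflates the standard deviation enough to hide itself.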
Product Data Lake<\/h2>\n

Azure Machine Learning<\/h2>\n
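Because the service lets you train on a local machine first and scale out later, a first iteration can be an ordinary local script. A minimal scikit-learn run on synthetic data (purely illustrative, no AMLS-specific code yet):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train locally first; the same script can later target cloud compute.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The same pattern applies to PyTorch or TensorFlow code; AMLS wraps the script in an experiment rather than replacing the framework.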

Azure Dataset<\/h2>\n
Simple Example (Azure ML SDK Version: 1.0.60)<\/h2>\n
\n
\n
Step 1: Register the Product Data Lake as a data store in the AMLS workspace<\/h2>\n
from azureml.core.workspace import Workspace \r\nfrom azureml.core.datastore import Datastore \r\nfrom azureml.core.dataset import Dataset \r\n \r\n#Give a name to the registered datastore \r\ndatastore_name = \"adlsg1\" \r\n \r\n#Get a reference to the AMLS workspace \r\nworkspace = Workspace.from_config() \r\n \r\n#Register the datastore pointing at the ADLS Gen1 store \r\nDatastore.register_azure_data_lake(workspace,\r\n                                   datastore_name,\r\n                                   \"[Name of the Product Datalake]\",\r\n                                   \"[AAD Tenant ID]\",\r\n                                   \"[Service Principal Application ID]\",\r\n                                   \"[Service Principal Secret]\",\r\n                                   overwrite=False)<\/pre>\n
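Once the datastore is registered, a dataset can be created against it by path and registered to the workspace. This is a sketch only: the datastore name matches Step 1, but the file path and dataset name are hypothetical, and running it requires valid Azure credentials and an SDK version with tabular dataset support:

```python
from azureml.core.workspace import Workspace
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, "adlsg1")

# Reference files in the product data lake; only metadata is copied,
# the data itself stays in ADLS Gen1. The path below is hypothetical.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "product/training-data/*.csv"))

# Register by name so all workspace users can retrieve the same dataset
dataset = dataset.register(workspace=workspace,
                           name="product-training-data",
                           create_new_version=True)
```

Other workspace users can then retrieve it with `Dataset.get_by_name(workspace, "product-training-data")` without knowing where the files live.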
{\r\n \"subscription_id\": \"my subscription id\",\r\n \"resource_group\": \"my resource group\",\r\n \"workspace_name\": \"my workspace name\" \r\n}<\/pre>\n
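Alternatively, the same three values can be supplied in code rather than via config.json; a sketch, with the placeholder values standing in for your own subscription, resource group and workspace names:

```python
from azureml.core.workspace import Workspace

# Equivalent to Workspace.from_config(), with the values passed explicitly.
# Requires an interactive or service-principal login to actually run.
workspace = Workspace.get(name="my workspace name",
                          subscription_id="my subscription id",
                          resource_group="my resource group")
```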