{"id":16731,"date":"2020-10-13T16:00:45","date_gmt":"2020-10-13T15:00:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/?p=16731"},"modified":"2020-10-13T15:43:09","modified_gmt":"2020-10-13T14:43:09","slug":"adls-architectural-pattern-for-mls-using-datasets","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/technetuk\/2020\/10\/13\/adls-architectural-pattern-for-mls-using-datasets\/","title":{"rendered":"Using Azure Data Lake Storage (ADLS) and Azure Datasets for Machine Learning"},"content":{"rendered":"
<\/p>\n
Many organisations are now focusing on a single version of the truth for their data, typically via some form of data lake strategy. This brings several benefits, such as a single access point, fewer silos, and an enriched dataset via the amalgamation of data from various sources. Microsoft’s answer to this strategy is Azure Data Lake Storage (ADLS)<\/a>. ADLS brings several benefits to enterprises, such as security, manageability, scalability, reliability and availability.<\/p>\n A typical approach to a data lake strategy that we see being adopted by customers is the hierarchical approach (see Figure 1), where the data is first ingested into a landing layer, typically referred to as the \u201craw data lake\u201d. Data is then processed, filtered, optimised and placed in the \u201ccurated data lake\u201d. It is then further refined and processed according to application- or service-specific logic, and placed in what is referred to as the \u201cproduct data lake\u201d.<\/p>\n Figure 1 – Data Lake Abstraction Strategy<\/p><\/div>\n <\/p>\n A raw data lake provides a single version of the truth for all the data in an organisation and can be seen as the landing zone for all of that data. Data from various sources, whether structured, semi-structured or unstructured, is ingested in its native format. Some optimisation and basic data quality checks, such as verifying the total number of rows ingested, may be applied at this stage.<\/p>\n Once the data has landed, there are several things to consider before moving it to the \u201ccurated data lake\u201d: Who has access<\/a> to this data? How will it be secured<\/a>? How is it going to be processed? What partition strategy should be used? How should it be logged, and what file format and compression techniques should be considered? This layer is typically owned by IT.<\/p>\n <\/p>\n While the raw data lake is targeted at the organisation level, the curated data lake is focused on the division or OU level. 
Each division may define its own business rules and constraints for how the data should be processed, presented and accessed. Data is typically stored in an optimised format and tuned for performance<\/a>\u00a0and is generally cleaned (handling missing values, outliers and so on), aggregated and filtered. Curated data tends to have more structure and is grouped according to specific domains, such as Finance and HR.<\/p>\n <\/p>\n The product data lake tends to be application-specific, and is normally filtered to the specific interests of the application. The application focus in this article is machine learning, which relies on large volumes of data for modelling and for batch or real-time inferencing, with results written back to the product data lake.<\/p>\n These benefits would count for little if we were not able to perform advanced analytics in a way that delivers predictive capabilities. Here we have defined an Azure ML stack infused with a data lake strategy.<\/p>\n Figure 2 – Data Lake Infused in the ML Stack<\/p><\/div>\n The ML stack consists of various layers, which are infused with the different categories of data lake storage. The first layer, \u201cData Prep\u201d, is where raw data is processed according to business logic and domain constraints, typically using Databricks, and is made available to product-specific data lakes.<\/p>\n Layer two consists of the various frameworks to be used for the ML problem at hand, and would typically use the product data lake to carry out feature engineering.<\/p>\n Layer three consists of Azure Machine Learning, for running and tracking experiments on various compute targets, and would use the product data lake as the source of experimentation data. Any write-back would take place in this layer. 
The output of the model or the inferencing results will also be stored in the product data lake.<\/p>\n The core of this stack lies in the ability of Azure Machine Learning to access and interface with the data lake store. The rest of this article will focus on how this can be achieved.<\/p>\n <\/p>\n The Azure Machine Learning Service (AMLS)<\/a> provides a cloud-based environment for the development, management and deployment of machine learning models. The service allows the user to start training models on a local machine, then scale out to the cloud. It fully supports open-source technologies such as PyTorch, TensorFlow and scikit-learn, and can be used for any kind of machine learning, from classical to deep learning, and for both supervised and unsupervised learning.<\/p>\n Figure 3 below shows the architectural pattern that focuses on the interaction between the product data lake and Azure Machine Learning. Until recently, the data used for model training needed either to reside in the default (blob or file) storage associated with the Azure Machine Learning Service workspace, or a complex pipeline needed to be built to move the data to the compute target during training. This meant that data stored in Azure Data Lake Storage Gen1 (ADLSG1) typically needed to be duplicated to the default (blob or file) storage before training could take place. This is no longer necessary with the new dataset feature.<\/p>\n Figure 3 – ADLS Architectural Pattern for MLS<\/p><\/div>\n <\/p>\n The new dataset feature of AMLS makes it possible to work with data files stored in ADLSG1 by creating a reference to the data source location, along with a copy of its metadata. The data remains in its existing location, so no extra storage cost is incurred. 
The dataset is thus identified by name and is made available to all workspace users.<\/p>\n <\/p>\n In the following example we demonstrate how to use Azure datasets with Azure Machine Learning to build a machine learning model using the product data lake. The steps are as follows:<\/p>\n <\/p>\n We will be using an Azure ML Notebook VM<\/a> to demonstrate this example, because it comes pre-built with a Python Jupyter Notebook environment with the AMLS SDK installed. However, if you prefer to use your own IDE, you will need to install the AMLS Python SDK yourself.<\/p>\n The code snippet in Step 1 assumes that the AMLS configuration file is available in the working directory. The file can be downloaded from the Azure portal by selecting \u201cDownload config.json\u201d from the Overview section of the AMLS workspace. Or you can create it yourself:<\/p>\n
Raw Data Lake<\/h2>\n
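One of the considerations above, the partition strategy, is often answered in the raw zone by partitioning on source system and ingestion date. A minimal sketch of that convention (the folder layout and the source name here are illustrative assumptions, not prescribed by ADLS):

```python
from datetime import date

# Hypothetical raw-zone layout: partition by source system and ingestion
# date. The folder convention and the "crm" source name are illustrative.
def raw_path(source: str, ingest_date: date) -> str:
    return f"raw/{source}/{ingest_date:%Y/%m/%d}/"

print(raw_path("crm", date(2020, 10, 13)))  # raw/crm/2020/10/13/
```

Date-based partitions like this make it cheap to reprocess or expire a single day's landing data without touching the rest of the lake.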
Curated Data Lake<\/h2>\n
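The cleaning described for this layer (handling missing values, outliers, aggregation and filtering) might look like the following pandas sketch. The column names, sample values and interquartile-range outlier rule are all assumptions chosen for illustration:

```python
import pandas as pd

# Illustrative curated-layer cleaning; columns and thresholds are assumed.
raw = pd.DataFrame({
    "division": ["Finance", "Finance", "Finance", "Finance",
                 "HR", "HR", "HR", "HR", None],
    "amount": [100.0, 110.0, 90.0, None, 95.0, 105.0, 100.0, 1e6, 50.0],
})

curated = raw.dropna(subset=["division"]).copy()   # drop rows missing the key
curated["amount"] = curated["amount"].fillna(curated["amount"].median())

# Robust outlier fences: keep values within 1.5 * IQR of the quartiles
q1, q3 = curated["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
curated = curated[curated["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Aggregate by domain, ready for the product layer
summary = curated.groupby("division", as_index=False)["amount"].sum()
```

An IQR fence is used rather than a mean/standard-deviation rule because a single extreme value inflates the standard deviation enough to hide itself.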
Product Data Lake<\/h2>\n

Azure Machine Learning<\/h2>\n
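Because the service lets you train on a local machine first and scale out later, a first iteration can be an ordinary local script. A minimal scikit-learn run on synthetic data (purely illustrative, no AMLS-specific code yet):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train locally first; the same script can later target cloud compute.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The same pattern applies to PyTorch or TensorFlow code; AMLS wraps the script in an experiment rather than replacing the framework.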

Azure Dataset<\/h2>\n
Simple Example (Azure ML SDK Version: 1.0.60)<\/h2>\n
\n
\n
Step 1: Register the Product Data Lake as a data store in the AMLS workspace<\/h2>\n
from azureml.core.workspace import Workspace \r\nfrom azureml.core.datastore import Datastore \r\nfrom azureml.core.dataset import Dataset \r\n \r\n#Give a name to the registered datastore \r\ndatastore_name = \"adlsg1\" \r\n \r\n#Get a reference to the AMLS workspace \r\nworkspace = Workspace.from_config() \r\n \r\n#Register the datastore pointing at the ADLS Gen1 store \r\nDatastore.register_azure_data_lake(workspace,\r\n                                   datastore_name,\r\n                                   \"[Name of the Product Datalake]\",\r\n                                   \"[AAD Tenant ID]\",\r\n                                   \"[Service Principal Application ID]\",\r\n                                   \"[Service Principal Secret]\",\r\n                                   overwrite=False)<\/pre>\n
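Once the datastore is registered, a dataset can be created against it by path and registered to the workspace. This is a sketch only: the datastore name matches Step 1, but the file path and dataset name are hypothetical, and running it requires valid Azure credentials and an SDK version with tabular dataset support:

```python
from azureml.core.workspace import Workspace
from azureml.core.datastore import Datastore
from azureml.core.dataset import Dataset

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, "adlsg1")

# Reference files in the product data lake; only metadata is copied,
# the data itself stays in ADLS Gen1. The path below is hypothetical.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, "product/training-data/*.csv"))

# Register by name so all workspace users can retrieve the same dataset
dataset = dataset.register(workspace=workspace,
                           name="product-training-data",
                           create_new_version=True)
```

Other workspace users can then retrieve it with `Dataset.get_by_name(workspace, "product-training-data")` without knowing where the files live.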
{\r\n \"subscription_id\": \"my subscription id\",\r\n \"resource_group\": \"my resource group\",\r\n \"workspace_name\": \"my workspace name\" \r\n}<\/pre>\n
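Alternatively, the same three values can be supplied in code rather than via config.json; a sketch, with the placeholder values standing in for your own subscription, resource group and workspace names:

```python
from azureml.core.workspace import Workspace

# Equivalent to Workspace.from_config(), with the values passed explicitly.
# Requires an interactive or service-principal login to actually run.
workspace = Workspace.get(name="my workspace name",
                          subscription_id="my subscription id",
                          resource_group="my resource group")
```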