{"id":36780,"date":"2020-07-07T15:00:41","date_gmt":"2020-07-07T14:00:41","guid":{"rendered":""},"modified":"2020-07-02T20:43:01","modified_gmt":"2020-07-02T19:43:01","slug":"how-to-operationalise-your-data-lake","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/technetuk\/2020\/07\/07\/how-to-operationalise-your-data-lake\/","title":{"rendered":"How to Operationalise your Data Lake"},"content":{"rendered":"
<p>Data lake operationalisation is a colossal topic with many deliberations on either building the right data lake or defining the right strategy. The five important points that everyone stresses prior to starting the process of building a data lake are:<\/p>\n
<p>This blog provides six mantras for organisations to ruminate on in order to successfully tame the \u201coperationalising\u201d of a data lake after its production release.<\/p>\n
<h2>1. ALWAYS have a North Star Architecture<\/h2>\n
<p>Data lakes are not only about pooling data, but also about dealing with aspects of its consumption. The choice of data lake pattern depends on the masterpiece one wants to paint.<\/p>\n
<h3>Central vs Federated vs Hybrid<\/h3>\n
<p>Depending on the organisation\u2019s needs, you can store all enterprise data in one location closest to the organisation\u2019s headquarters (Central) or, where data sovereignty requirements apply, keep the data stored in the specific subsidiaries (Federated).<\/p>\n
<p>If an enterprise has a global footprint, adopting a Hub and Spoke model (Hybrid), with satellites of local data closer to the reporting countries, would do the trick. Even though this model will have alignment issues (data replication etc.), it will aid performance, regional governance and development (Fig 1).<\/p>\n
<p><em>Figure\u00a01\u00a0\u2013\u00a0Hybrid\u00a0Architecture<\/em><\/p>\n
<h3>Streamed vs. Batch vs. Near Real Time<\/h3>\n
<p>Sample architecture patterns for Data Platform or Cosmos DB Lambda Architecture.<\/p>\n
<h3>Build the right HA-DR: High Availability &amp; Disaster Recovery Strategy<\/h3>\n
<p>High availability strategies handle temporary failure conditions so that the system can continue functioning, while disaster recovery is about recovering from a catastrophic loss of application functionality. For the right DR and HA framework, keep the following scenarios in mind, along with business criticalities: data corruption, accidental data deletion, regional outage, network\/connectivity issues and component failures.<\/p>\n
<p>ADLS Gen2 now supports replication options such as ZRS or GZRS (preview), which improve HA, while GRS and RA-GRS improve DR. Azure Cosmos DB is known for its 99.999% high availability and globally distributed replication.<\/p>\n
<p>Each Azure component addresses most of these scenarios, so I encourage you to look at the product documentation.<\/p>\n
<h2>2. Subscription Model<\/h2>\n
<p>Planning a data lake and then scaling it up requires some contemplation.<\/p>\n
<h3>Technical Limitations<\/h3>\n
<p>Each product in Azure has a few boundary considerations, as well as subscription limits, quotas and constraints. Treading cautiously will avoid hitting product thresholds and limits while scaling. While defining the lambda architecture you can choose your storage; ADLS Gen2 and Cosmos DB both do an exceptional job of overcoming throughput and limit challenges. Environment isolation should also be considered, especially for the resources consumed by laboratory experiments and for testing features and functionality such as firewall rules or life-cycle management.<\/p>\n
<h3>Business Constraints<\/h3>\n
<p>Businesses may want to keep billing separate, or define a chargeback model through different subscriptions for each business layer. They should also consider other influencing factors such as regional legal obligations, regulatory constraints or data sovereignty.<\/p>\n
<p>Costs for Dev\/Test environments can be reduced by seeking out providers such as Microsoft, which offer great discounts on lower environments. It is always advisable to have separate subscriptions for Dev\/Test and Production, split by business function. Choose wisely and save profusely.<\/p>\n
<p>Owing to these constraints, you may want to rethink your North Star architecture and look at Hub and Spoke models if they\u2019re suitable.<\/p>\n
<h2>3. Ingestion<\/h2>\n
<h3>Understand the soul of the Data Sources<\/h3>\n
<p>It is imperative to feel the pulse and the interaction of different source systems, as this can give us a better idea of how to sufficiently hydrate the data lake.<\/p>\n