{"id":53241,"date":"2023-01-25T16:17:00","date_gmt":"2023-01-25T15:17:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/?p=53241"},"modified":"2023-05-04T12:55:11","modified_gmt":"2023-05-04T11:55:11","slug":"how-to-manage-sensitive-data-in-cloud-analytics-platforms","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-gb\/industry\/blog\/technetuk\/2023\/01\/25\/how-to-manage-sensitive-data-in-cloud-analytics-platforms\/","title":{"rendered":"How to manage Sensitive Data in Cloud Analytics Platforms"},"content":{"rendered":"


Managing Sensitive Data is one of the most important challenges organisations face these days. Data breaches are an ever-increasing threat, and every industry needs to protect itself against both internal and external actors. 2021 saw the highest average cost of a data breach in 17 years.<\/p>\n

In this article we will look at the main considerations, the common strategies and an example architecture for managing Sensitive Data in a multi-component Cloud Analytics Platform. As we will see, this is a complex task, but in our post-GDPR days it is absolutely essential for the success of any organisation that deals with Sensitive Data.<\/p>\n

If we need to manage sensitive data in a simple analytics platform with only one central component, for example data stored in an RDBMS like Azure SQL, there are plenty of built-in mechanisms that we can leverage, such as Column Level Encryption, Data Masking, Transparent Data Encryption and others. However, in most cases organisations work with Analytics Platforms that involve more than one component. For example, a Modern Cloud Analytics Data Platform could include Data Lake Storage like ADLS, Azure Synapse, a Spark engine like Synapse Spark or Databricks, etc.<\/p>\n

So how can we build a modern analytics platform that delivers value to the business while properly managing Sensitive Data and complying with strict security and privacy policies, sector regulation and regional laws? Let\u2019s take a look.<\/p>\n

The Data<\/h3>\n

First of all, we need to define what we mean by sensitive data.<\/p>\n

In the context of GDPR: \u201cdata consisting of racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, genetic data, biometric data, data concerning health or data concerning a natural person’s sex life or sexual orientation.<\/em>\u201d<\/p>\n

However, not all sensitive data is personal data. We could be talking about data that is confidential because, for example, it may have an impact on the share price of the organisation.<\/p>\n

Another important consideration is that sensitive data will probably sit within multiple business areas, sometimes on premises as well as in the cloud, in new systems as well as legacy ones, etc.<\/p>\n

So, the first step will be to define what data is sensitive in the context of our organisation and where it is currently stored, taking into consideration all the relevant viewpoints<\/strong>: data protection laws for personal data (of customers but also of employees) and data that is confidential for our organisation, like intellectual property, trade secrets, etc. In essence: any data that must be protected from a privacy standpoint.<\/p>\n

It is very helpful if the organisation already has a Data Classification framework in place, as we will be able to use it as a foundation for our data strategy. In particular, we will need a defined set of sensitivity labels to classify the universe of data available. If the organisation doesn\u2019t have one yet, I would strongly recommend focusing on building one as an initial step. It’s not a simple process, and we will need to involve a broad audience within the organisation, but it will be essential for the success of any analytics platform that manages sensitive data.<\/p>\n

More information on how to build a well-designed Data Classification Framework can be found here<\/a>.<\/p>\n

Once we have defined our Data Classification policies we will need a Unified Data Governance solution<\/strong> like Azure Purview to manage the discovery and classification of our data assets.<\/p>\n


Features like Purview Sensitivity Label Insights (currently in Preview) will enable us to classify and label datasets based on their level of sensitivity as well as to obtain insights on how many assets we have from each category:<\/p>\n

\"graphical<\/p>\n

\"table\"<\/p>\n
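As a rough illustration of what a set of sensitivity labels gives us, here is a minimal Python sketch that counts assets per label and lists the assets at or above a given level. The label names, the asset names and both helper functions are hypothetical and are not taken from Purview or from any particular classification framework.

```python
from collections import Counter
from enum import IntEnum


class Sensitivity(IntEnum):
    # Hypothetical labels, ordered from least to most sensitive.
    PUBLIC = 0
    GENERAL = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3


# Hypothetical catalogue: asset name -> label assigned during classification.
catalogue = {
    'sales.orders': Sensitivity.GENERAL,
    'crm.customers': Sensitivity.CONFIDENTIAL,
    'hr.payroll': Sensitivity.HIGHLY_CONFIDENTIAL,
    'web.press_releases': Sensitivity.PUBLIC,
    'finance.forecast': Sensitivity.CONFIDENTIAL,
}


def assets_per_label(assets):
    # Count how many assets carry each label, including labels with no assets.
    counts = Counter(assets.values())
    return {label.name: counts.get(label, 0) for label in Sensitivity}


def assets_at_or_above(assets, threshold):
    # Names of the assets whose label is at least as sensitive as the threshold.
    return sorted(name for name, label in assets.items() if label >= threshold)


summary = assets_per_label(catalogue)
restricted = assets_at_or_above(catalogue, Sensitivity.CONFIDENTIAL)
```

Ordering the labels makes it possible to express policies such as "everything Confidential and above must be obfuscated" as a simple comparison.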

The Challenge<\/h3>\n

It is essential to have a clear view of the challenges involved in building this type of analytics platform. In some instances, one challenge when building an Analytics Platform that will manage Sensitive Data is the perception of some organisations that it is harder to ensure data privacy and security in the cloud. At Microsoft we have plenty of documentation and white papers that cover this topic in depth, and Azure is leading the industry with more than 90 compliance certifications<\/strong>, including over 50 specific to global regions and countries and more than 35 compliance offerings specific to the needs of key industries, including health, government, finance, education, manufacturing and media. Here are some examples:<\/p>\n

\"logo,<\/p>\n

One of the main challenges from a Data Privacy standpoint is the threat of data exfiltration, in particular around the figure of the malicious or negligent insider<\/strong> (an employee who intentionally or by mistake exfiltrates data outside the organisation, typically financial or personal data). This is one of the most complex scenarios because we are looking at protecting sensitive data from people who actually should have access to the data assets. For example, let’s imagine a scenario where we are storing customer data, including a column with unencrypted credit card information, in a relational database, and we are using a Data Masking mechanism that exposes the last four digits and replaces the rest with a constant prefix in the format of a credit card number (e.g. XXXX-XXXX-XXXX-1234). This would be effective for data consumers querying the data. However, in the unfortunate (and hopefully unlikely) scenario of the database admin being the malicious insider, this mechanism wouldn’t be effective because they would have direct access to the unencrypted data in the database.<\/p>\n
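The masking behaviour described in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the masking feature of any particular database; the function name and the card number are made up.

```python
def mask_card_number(card_number: str) -> str:
    # Keep only the digits so that spaces or dashes in the input do not matter.
    digits = [ch for ch in card_number if ch.isdigit()]
    if len(digits) < 4:
        raise ValueError('expected at least four digits')
    # Constant prefix plus the real last four digits.
    return 'XXXX-XXXX-XXXX-' + ''.join(digits[-4:])


# Made-up value standing in for what is physically stored in the table.
stored_value = '4111 1111 1111 1234'

# What a regular data consumer sees when the mask is applied on read.
masked_view = mask_card_number(stored_value)
```

Note that `stored_value` is unchanged: the mask only affects what is returned to the consumer, which is why it offers no protection against someone who can read the underlying storage directly.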

There are some ways to mitigate this, however it is important to bear in mind that very often when we are talking about data privacy, People and Processes are more effective than Technology<\/strong> controls. Also, it is important for any organisation to define their level of risk appetite when building security and privacy controls for any system that manages data.<\/p>\n

Increased privacy control frequently also means increased complexity. For example, we will see later in this article that one effective strategy to tackle the malicious\/negligent insider threat is Data Obfuscation. However, this often increases the complexity of the architecture, especially in a multi-component analytics data platform, because we need to keep the obfuscation consistent through the whole end-to-end data workflow.<\/p>\n
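To make the consistency requirement more concrete, here is a minimal Python sketch in which two components obfuscate the same identifier with a shared secret key, so the resulting tokens still match and the datasets remain joinable. Keyed hashing (HMAC-SHA256) is used here purely as one possible technique chosen for illustration; the key, the field names and the sample rows are all hypothetical, and in a real platform the key would live in a secrets store rather than in code.

```python
import hashlib
import hmac


def pseudonymise(value: str, key: bytes) -> str:
    # Normalise first so that trivial formatting differences between
    # components do not produce different tokens for the same value.
    normalised = value.strip().lower()
    return hmac.new(key, normalised.encode('utf-8'), hashlib.sha256).hexdigest()


# Hypothetical shared key, hard-coded only to keep the sketch self-contained.
key = b'demo-key-not-for-production'

# The same customer identifier as it appears in two different components.
lake_row = {'customer': pseudonymise('Alice@example.com', key), 'amount': 120}
warehouse_row = {'customer': pseudonymise(' alice@example.com ', key), 'segment': 'retail'}

# Because both components used the same key and normalisation, the tokens match.
joinable = lake_row['customer'] == warehouse_row['customer']
```

If any component used a different key or a different normalisation rule, the tokens would no longer match, which is exactly the kind of inconsistency that makes obfuscation harder to manage across a multi-component platform.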

Common Strategies<\/h3>\n

The next step is to define the data privacy strategy for our platform. The most common strategies for managing sensitive data in an analytics platform are the following:<\/p>\n

\"A<\/p>\n