Neel Ball, Author at Microsoft Industry Blogs - United Kingdom

Unified Data Governance using Azure Purview – preventing Data Lake from becoming a Data Swamp
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2020/12/10/unified-data-governance-using-azure-purview-preventing-data-lake-from-becoming-a-data-swamp/
Thu, 10 Dec 2020 16:15:08 +0000

Today, Data Lake is a strategic investment if you’re in a data-driven organisation. Here’s our shared experience on why Purview should be a key component in your Data Lake architecture.


With Azure Purview, Microsoft launched a unified data governance service that automates data discovery, cataloguing, building lineage, and classifying sensitive data, enabling users to get a holistic view of the data landscape. Today, Data Lake is a strategic investment if you’re in a data-driven organisation. Here’s our shared experience on why Purview should be a key component in your Data Lake architecture.

Most data professionals have worked with tabular data stored in databases because that was the way applications and reporting systems were designed or built.  This meant there were only a finite number of tables/databases to search for relevant information. With the advent of newer technologies such as IoT devices, and a plethora of web and mobile applications, data professionals must not only process and secure structured, semi-structured and unstructured data but also discover the pertinent data from the huge amount of data accumulated by the organisation to solve business problems.

Many organisations are now moving to a more complex architecture that has data stored as flat files in storage such as a Data Lake, unstructured data in non-relational databases, and tabular data in relational databases. Managing the life cycle of data from creation to archival has been extremely challenging and requires massive investment from an organisation, including training people, optimising existing or creating new processes, and adopting new technologies. Despite all these efforts, it remains the nightmare of many CDOs and CSOs. As Gartner suggests, over 60 percent of challenges to the data management practice arise from data quality and supporting data governance and security.

Linking to the real world

Many of our customers have a petabyte-scale Data Lake, and it doesn’t take long for data governance to become extremely difficult. The purpose of a Data Lake is to accelerate data projects by having the entire organisation’s information in one place, yet the average time taken to identify and retrieve the relevant data from the Data Lake is somewhere between five and six weeks. The old adage “time is money” rings true for our customers. The time taken at the beginning of a project to discover relevant information is unacceptable: the project team must be paid during this phase, and the business gets frustrated as no progress is made. In the meantime, a competitor may have shipped their latest product while the project team was still searching for the relevant information to kick-start the project.

The user journey for data discovery and mapping can be summarised as shown below:

The user journey for data discovery and mapping.

What is data governance?

We just covered the complexities of a modern data estate, and the key challenge is the lifecycle management of data. Data governance is not just about data quality but also about metadata, policy, and identity management. Broadly speaking, we can divide the people accessing data into four categories:

  1. Producer – the data stewards/data experts, who understand the business well and help curate the data for consumption by downstream applications.
  2. Consumer – project owners or business users of downstream applications that use the data produced by the producer.
  3. Data security officer – interested in securing the data asset of your organisation.
  4. Chief compliance officer – responsible for data’s compliance to GDPR or any new rules and regulations imposed by the government.

At a high-level, metadata management provides the data about the organisation’s data. For example, it provides the physical location of the data you are trying to find, identifies the different columns in a file and applies appropriate classification for the column within a data asset. Finally, it also provides the data lineage, which indicates the source for the information, the transformation applied on the source information, where it’s getting used etc. The personas interested in metadata management are producers and consumers. The consumers leverage the search functionality to find the relevant data in a huge Data Lake or other data sources.

Data platform image

Policy and identity management is of interest to the data security officer persona. For example, they need to ensure finance information is accessed only by the finance department, define how long data should be retained for compliance purposes and so on. Our colleague, Pratim Das, has previously covered how to build and deliver a data strategy in great depth.

What Is Azure Purview?

You can get more from your data sources with Microsoft Azure Purview. Azure Purview is a fully managed cloud service that serves as a system of registration and a system of discovery for enterprise data sources. It provides capabilities that enable all users – from analysts to data scientists and developers – to register, discover, understand, and consume enterprise data sources.

Azure Purview image

The role of a data catalogue in Data Lake architecture

A data catalogue lifecycle will register the various sources, discover the shape of the ingested datasets, understand and trace the data as it flows through intermediate layers, and finally enable analysts, engineers, and downstream applications to consume that data.

1.      Register

The USP of Data Lake architecture is breaking down data silos and gaining valuable insights from terabytes and petabytes of data, which was not possible five years or even a decade ago. Not only that, a Data Lake expedites getting insights in near real time. As heterogeneous data is ingested into the lake – whether from traditional relational databases, semi-structured JSON and XML-type data, or HDFS-style formats (Avro, Parquet etc.) – it’s more important than ever to have a central metastore.

The ability to connect to diverse data sources is also a vital capability. Azure Purview allows you to register not only cloud-native storage and databases such as Azure Data Lake, Azure Cosmos DB and Azure Synapse, but also on-premises databases such as SQL Server.

New scan rule set image

You can also logically group assets based on a project or an entire data asset in the organisation. An example is shown below:

Image of grouping assets based on project

2.      Discover

Let us take the next step. You have TBs of data in your Data Lake connected to 100+ data sources. Is that good enough? No. Data lying undetected or unexplored is as bad as, if not worse than, not ingesting the data into the analytics domain at all. Hence, the ease of discovering data empowers end users and business users, and this capability will determine the adoption of the Data Lake asset in an organisation.

ContosoPurview image

3.      Understand

Lineage – Single source of truth vs single version of truth

In so many meetings, I have found even the most technology-minded decision makers use “single source of truth” and “single version of truth” interchangeably. But there is a clear difference.

Single source of truth means that the data always comes from one specific source. If you have a CRM system running on SQL Server, the data coming from it is the single source of truth. However, in other scenarios – say, weather information – data is pulled from public datasets as well as from IoT sensor data that lands in your Data Lake. In such cases, there is no single source of truth.

Single version of the truth

Any transactional or temporal data system will change multiple times over its lifecycle, whether through upsert or delete operations. The single version of the truth ensures you get the most consistent version – not necessarily the latest version – of the data available in the Data Lake.

Your organisation might be trying to solve the problem “How to get the single version of the truth on Data Lake”, and the answer is in the lineage.

Why lineage is important

The ability to trace and track the data as it goes out of the data source and then goes through multiple layers in the Data Lake – usually raw, curated, enriched etc – allows an environment of transparency in the data organisation and encourages self-service analytics in the ecosystem. Every time the data is manipulated or transformed, the shape and the meaning of the data changes.

Image of Sales face dashboard

Azure Purview offers the lineage capability all the way from the source to the destination. This feature answers several questions from consumers when considering the data in end-user products.

Examples of these questions are:

  1. Is the data source a trusted one that has been approved by the data owner?
  2. Is it an internal or external data source?
  3. Which KPIs are created in which phase and should I select granular or summary data for downstream applications?
  4. In case the data needs to be validated, which trail should I follow during RCA or support incident issues?

4.      Consume

A glossary acts as the glue between business decision makers and technical engineers, and is a very common component of a data catalogue.

Take a hypothetical situation. You’re talking to your parents and during your conversation you use the terms “YOLO”, “FOMO” etc. Even though these are English abbreviations, your parents may have never heard of them.

Similarly, in the IT world, if you’re from an IT background and use the terms “materialised views”, “data factory” etc, business users may not understand. Also, business users from the supply chain might use functional jargon such as “APS”, “B2C”, “MRP” etc, and IT engineers may not have the knowledge to interpret it.

In addition, different terms mean different things, depending on context. “APS” in the technology world means Analytics Platform System, whereas “APS” in the retail space refers to Advance Planning and Scheduling.

A glossary solves this problem by enriching the catalogue, adding business terms and definitions and linking those terms with the datasets or Data Lake assets.

Image of glossary terms

In Data Lake architecture, the idea is to cross pollinate the data from one business area with another to get a holistic picture.  The datasets contain raw and KPI driven measures that are relevant to a specific business function. However, cross-skilling SMEs, developers and analysts is not always possible. A glossary can act as one place to manage those business terms and give control to the data owner to publish the data and bring consistency across the various business units in the organisation.

Linking to the real world – continued

Earlier, we spoke about how the data discovery phase takes 5-6 weeks. When Azure Purview was first presented as a solution, it was looked upon with scepticism. With this reincarnation of Azure Purview, Microsoft has injected a lot of features into a genuinely stable product that is here to stay. The scan of the entire Data Lake, which was over 1PB, took less than two days to complete. Customers can speed up this operation by scaling the integration runtime from one node to four. These scans can be run on a schedule and as frequently as required.

By default, Microsoft provides a list of classifications against which the data is classified during each scan. Similarly, glossary terms can be associated with assets discovered during the scan. Any existing documentation pertaining to the data on web pages or a SharePoint site can be linked to an asset via the glossary. Many of these activities are automated and can be used as soon as Azure Purview is provisioned. You can also use the Apache Atlas API to extend any existing feature.

After the scan and tagging operation, one of our customers was able to complete the data-discovery journey for various projects in 1-3 weeks rather than 5-6 weeks per dataset. This is a marked improvement, and the catalogue setup itself is easy because it is a PaaS service.

The user journey for data discovery and mapping.

This product is just starting its journey outside private preview and has an exciting vision and roadmap.  So please give us your feedback to help us shape the product in future.

Conclusion

Azure Purview helps to answer the who – what – when – how – where – why of the data present not only in the Data Lake but also in other data sources, and brings the data closer to the user. When your organisation moves from centralised to self-service analytics, the value of the Data Lake is unlocked, and the digital feedback loop empowers employees to make data-driven decisions that will determine success in the coming decade. One of the key services that will unblock self-service analytics in your organisation is Azure Purview. Get started with Azure Purview with our quick-start guide.

The post Unified Data Governance using Azure Purview – preventing Data Lake from becoming a Data Swamp appeared first on Microsoft Industry Blogs - United Kingdom.

Power BI Performance Tuning Workflow – Part 2
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2020/08/04/power-bi-performance-tuning-workflow-part-2/
Tue, 04 Aug 2020 15:00:18 +0000

Is your Power BI report running slow? Nilabja Ball walks us through what you can tune and optimise in the second part of this two-part series.


In Part 1 of the series, you learned how to ingest, transform, and load data, and how to develop and optimise the Power BI data model. This article provides guidance that enables developers to write optimised DAX expressions and design reports, as well as enabling administrators to deploy and manage Power BI solutions.

 

DAX Calculations

Variables vs Repeatable Measures

Let’s take a look at DAX Measures versus DAX with Variables:

DAX Measures

A Dax Measures line of code

DAX with Variables

A Dax with Variables line of code

In the first scenario, the measure Total Rows is executed twice, whereas in the variables scenario it is executed only once. In such scenarios, variables can improve performance significantly.
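The code screenshots above may not render in all readers, so here is an illustrative sketch of the same pattern. The measure and table names ([Sales], 'Date') are hypothetical, not taken from the original screenshots:

```dax
-- Repeated-measure pattern: the prior-year expression is evaluated twice
Sales YoY % (repeated) =
DIVIDE (
    [Sales] - CALCULATE ( [Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) ),
    CALCULATE ( [Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
)

-- Variable pattern: the prior-year expression is evaluated once and reused
Sales YoY % (variable) =
VAR SalesPriorYear =
    CALCULATE ( [Sales], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )
RETURN
    DIVIDE ( [Sales] - SalesPriorYear, SalesPriorYear )
```

The variable version also tends to be easier to read and debug, since intermediate results have names.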

 

Handling Blanks

Sales (No Blank) =
IF (
    ISBLANK([Sales]),
    0,
    [Sales]
)

Note that the measure above converts BLANK to zero – a pattern that is best avoided. It’s recommended that measures return BLANK when a meaningful value cannot be returned, as this design allows Power BI to render reports faster. Returning BLANK is also efficient because report visuals, by default, eliminate groupings whose summarisations are BLANK.

 

SELECTEDVALUE () vs HASONEVALUE ()

A common scenario is to use HASONEVALUE() to check if there is only one value present in a column after applying slicers and filters, and then use the VALUES(column name) DAX function to get the single value.

SELECTEDVALUE () performs both the above steps internally and gets the value if there is only one distinct value present in that column, or returns blank in case there are multiple values available.
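As a sketch, the two approaches look like this (Product[Colour] is a hypothetical column used for illustration):

```dax
-- Verbose pattern: check for a single value, then read it
Selected Colour (verbose) =
IF (
    HASONEVALUE ( Product[Colour] ),
    VALUES ( Product[Colour] )
)

-- Equivalent shortcut: returns the single value, or BLANK if several remain
Selected Colour = SELECTEDVALUE ( Product[Colour] )
```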

 

SELECTEDVALUE () vs VALUES ()

VALUES () will return an error if it encounters multiple values. Users often handle this with error functions, which are bad for performance. Instead, use SELECTEDVALUE (): it is a better approach and returns blank when multiple values are encountered.

 

DISTINCT () vs VALUES ()

Power BI adds a blank value to the column in case it finds a referential integrity violation. For direct queries, Power BI by default adds a blank value to the columns as it does not have a way to check for violations.

DISTINCT (): does not return the blank added due to an integrity violation. It returns blank only if blank is part of the original data.

VALUES (): includes blank, as it’s added by Power BI due to referential integrity violations.

Use whichever of the two functions you choose consistently throughout the whole report. Where possible, use VALUES () throughout so that blank values are not an issue.

 

Avoid FORMAT in measures

FORMAT is executed in the single-threaded formula engine and slows down calculations over large numbers of string values.

 

Optimize Virtual Relationships using TREATAS ()

A virtual relationship is simulated with DAX using column equivalency. A FILTER + CONTAINS pattern executes more slowly than TREATAS.

An example of DAX with column equivalency.
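In case the screenshot doesn't render, here is a sketch of both patterns, using hypothetical Budget and Region tables that share a RegionKey column but have no physical relationship:

```dax
-- Slower: simulate the relationship with FILTER + CONTAINS
Budget Amount (contains) =
CALCULATE (
    SUM ( Budget[Amount] ),
    FILTER (
        ALL ( Budget[RegionKey] ),
        CONTAINS ( VALUES ( Region[RegionKey] ), Region[RegionKey], Budget[RegionKey] )
    )
)

-- Faster: TREATAS applies the visible region keys directly as a filter
Budget Amount (treatas) =
CALCULATE (
    SUM ( Budget[Amount] ),
    TREATAS ( VALUES ( Region[RegionKey] ), Budget[RegionKey] )
)
```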

 

ISBLANK () vs BLANK ()

The built-in function ISBLANK () is faster for checking blank values than the comparison operator “= BLANK ()”.
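A minimal sketch of the two forms, assuming a hypothetical [Sales] measure:

```dax
-- Preferred: dedicated blank test
Has Sales = NOT ISBLANK ( [Sales] )

-- Slower: comparison against BLANK ()
Has Sales (slow) = NOT ( [Sales] = BLANK () )
```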

 

Calculate ratios efficiently

Use (a-b)/b with variables instead of (a/b)-1. Performance is usually the same in both cases, but in the edge case where both a and b are blank, the former returns blank and filters out the row, whereas the latter returns -1 and inflates the query result.
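A sketch of both forms, assuming hypothetical [Sales] and [Sales PY] measures:

```dax
-- Preferred: (a - b) / b with a variable; blank inputs yield blank
Sales Growth % =
VAR SalesPriorYear = [Sales PY]
RETURN
    DIVIDE ( [Sales] - SalesPriorYear, SalesPriorYear )

-- Algebraically equivalent, but returns -1 when both measures are blank,
-- because BLANK - 1 evaluates to -1
Sales Growth % (alt) = DIVIDE ( [Sales], [Sales PY] ) - 1
```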

 

DIVIDE () vs /

The DIVIDE () function has an extra parameter that is returned when the denominator is zero. For scenarios that might have a zero denominator, use DIVIDE (), as it internally checks whether the denominator is zero; it also checks for ISBLANK (). However, if you can guarantee the denominator is non-zero, it is better to use /, because DIVIDE () performs an additional if ( denominator <> 0 ) check.
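For example, with hypothetical [Profit] and [Sales] measures:

```dax
-- Safe: returns BLANK (or an optional alternate result) when [Sales] is 0 or BLANK
Profit Margin = DIVIDE ( [Profit], [Sales] )

-- Faster, but only valid when [Sales] is guaranteed to be non-zero
Profit Margin (raw) = [Profit] / [Sales]
```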

 

Avoid IFERROR () and ISERROR ()

IFERROR () and ISERROR () are sometimes used in measures. These functions force the engine to perform row-by-row execution to check for errors. So wherever possible, replace them with built-in functions that perform the error checking internally.

Example: DIVIDE () and SELECTEDVALUE () perform an error check internally and return expected results.

 

COUNTROWS () vs COUNT ()

When the intention is to count table rows, it is recommended that you always use the COUNTROWS function. It’s more efficient and doesn’t consider BLANKs contained in any column of the table.
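A quick sketch, assuming a hypothetical Sales table with an OrderKey column:

```dax
-- Preferred: counts table rows directly, ignoring BLANKs in any column
Order Count = COUNTROWS ( Sales )

-- Slower alternative: counts non-blank values in one column
Order Count (count) = COUNT ( Sales[OrderKey] )
```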

 

SUMMARIZE () Vs SUMMARIZECOLUMNS ()

SUMMARIZE is used to produce aggregated results by grouping on columns. It is recommended to use the SUMMARIZECOLUMNS () function, which is the optimised version. SUMMARIZE should only be used to get the grouped elements of a table without any measures or aggregations attached.
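As a DAX query sketch (the 'Product', 'Date' and Sales tables are hypothetical):

```dax
-- Grouping only: SUMMARIZE is acceptable here
EVALUATE
SUMMARIZE ( Sales, 'Product'[Colour], 'Date'[Year] )

-- Grouping with aggregation: prefer SUMMARIZECOLUMNS
EVALUATE
SUMMARIZECOLUMNS (
    'Product'[Colour],
    'Date'[Year],
    "Total Sales", SUM ( Sales[Amount] )
)
```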

 

FILTER (ALL (ColumnName)) vs FILTER (VALUES ()) vs FILTER (Table)

Instead of applying a filter to the whole table, apply it to the desired column. Also, use ALL with the FILTER function if there is no specific need to keep the current filter context. To calculate measures while ignoring the filters applied to a column, use ALL (ColumnName) together with FILTER, rather than FILTER over the table or VALUES ().
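A sketch of the two approaches, using a hypothetical 'Product' table and [Sales] measure:

```dax
-- Preferred: filter a single column rather than the whole table
Red Sales =
CALCULATE (
    [Sales],
    FILTER ( ALL ( 'Product'[Colour] ), 'Product'[Colour] = "Red" )
)

-- Avoid: iterating every row of the full table
Red Sales (slow) =
CALCULATE (
    [Sales],
    FILTER ( 'Product', 'Product'[Colour] = "Red" )
)
```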

 

Avoid AddColumns () in measure expression

By default, measure expressions are evaluated iteratively. Adding ADDCOLUMNS converts the evaluation into a nested loop and further degrades performance.

 

Complex Row Level Security

  • Keep the security table as small as possible. If finer security granularity is needed, consider multiple security tables. It is recommended to have relationships between security tables, as this avoids additional LOOKUP calls in filter operations.
  • Avoid applying row-level security filters directly on fact tables.
  • If multiple RLS filters apply to a single fact table, consider a mapping table and implement all the RLS fields on that single table.
  • Avoid bi-directional filters; instead, convert to a single direction in combination with a FILTER DAX expression.
  • Keep filter functions simple, and avoid regular expressions, string manipulation and complex logic.

An example of Row-Level Security.

 

Reports and dashboard design

Limit Number of Visuals

A good starting point is to build a report with few tiles, with a maximum limit of 20 visuals. Every visual generates at least one query against its data source, so a higher number of visuals can throttle the CPU and network.

  • A suitable approach is to segregate high level KPI data and granular line item reports in separate reports.
  • Drill Through to provide granular data.
  • Set default slicer and filters to limit dataset context.
  • Use the bookmark capability to hide visuals and provide subsets of data for analysis.

The selection pane and bookmarks pane switched off in Power BI.

  • Test custom visual performance in isolation. Replace custom visuals with built-in visuals wherever possible. However, if there is a functionality gap, it is recommended to use ones that are validated by Microsoft.
  • Avoid the export of reports with large granular data. Leaf level queries will consume memory and network bandwidth. Limit data export by selecting either export summarised data or none.

The 'Allow end users to export both summarized and underlying data from the service' options checked in Power BI.

Simplify Table or Matrix data

Tables or matrices with thousands of rows, many columns and many measures can be complex and slow. That many rows can also overwhelm users trying to gain good insights.

  • Use TopN filters to limit the initial view of table.
  • Move less critical measures to tooltips so these are only displayed on demand, per row.

Slicers/Filters

  • With a high number of visuals, slicers or cross-filters applied to every other visual by default generate many queries. An option is to edit interactions and disable the meaningless cross-filters.

The Edit Interactions option highlighted in the Power BI visualisations dashboard

  • Avoid slicers with a very large number of values. Each slicer issues two queries: one to populate the list and one to fetch selection details. Instead of such slicers, use filters to force context and limit values.
  • Also, set default values and single/multi-select properties on slicers. This reduces the context, reduces memory load, and fetches less data.
  • Use sync slicers with care.

 

Data Source/Network Latency

The on-premises data gateway (OPDG) acts as a link that provides secure data transfer between on-premises data and Azure data services. Cloud services such as Power BI use the OPDG to connect to data sources that sit on-premises, or within a private VNet.

An example on-premises data gateway (OPDG)

 

Data Gateway Configuration

  • Keep data gateways as close to the data source as possible. Traffic between the gateway and Azure is compressed, so a short distance between source and gateway minimises raw data movement and maximises the benefit of compression on the downstream traffic.
  • Use enterprise gateways instead of personal gateways: enterprise gateways enable centralised gateway and data source management, and support various storage modes.

 

Gateway Parallelism

The on-premises data gateway has settings controlling resource usage on the machine where the gateway is installed. By default, gateways scale these values automatically depending on CPU usage. If refresh performance is poor, there are a few settings to consider:

  • MashupDefaultPoolContainerMaxCount: Maximum container count for Power BI refresh, Azure Analysis Services, and others.
  • MashupDQPoolContainerMaxCount: Maximum container count for Power BI DirectQuery. Set this to twice the number of cores on the gateway machine, or leave it to auto-tuning.

Read more on how to adjust gateway performance based on server CPU on Microsoft Docs.

 

AutoScaling Gateway

Currently, the gateway application utilises resources on the gateway machine based on a pre-configured state. The application can now autoscale to use more or fewer resources depending on the system CPU.

To allow the gateway to scale based on CPU, set the configuration “MashupDisableContainerAutoConfig” to false. Once set, the container-count configurations described above are adjusted automatically based on the gateway CPU.

Read more about this on the Power BI blog.

 

Cluster Load Balancing

  • Gateway admins can throttle the resources of each gateway member to ensure that neither a single member nor the entire gateway cluster is overloaded, causing system failures.
  • If a gateway cluster with load balancing enabled receives a request from one of the cloud services, it randomly selects a gateway member. If the chosen member is already near the throttling limit set for CPU or memory, another member within the cluster is selected. If all members within the cluster are in the same state, the request fails.
  • CPUUtilizationPercentageThreshold – allows gateway admins to set a throttling limit for CPU. The range is 0 to 100; the default value of 0 indicates that this configuration is disabled.
  • MemoryUtilizationPercentageThreshold – allows gateway admins to set a throttling limit for memory. The range is 0 to 100; the default value of 0 indicates that this configuration is disabled.
  • ResourceUtilizationAggregationPeriodInMinutes – the period over which CPU and memory counters of the gateway machine are aggregated before being compared against the threshold limits above. The default value is 5.

 

Use SSD/Fast Storage

Power BI gateway refresh and data movement returns large datasets that are temporarily spooled on the gateway machine. It is recommended to use SSD storage for the spooling layer.

 


The post Power BI Performance Tuning Workflow – Part 2 appeared first on Microsoft Industry Blogs - United Kingdom.

Power BI Performance Tuning Workflow – Part 1
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2020/07/31/power-bi-performance-tuning-workflow-part-1/
Fri, 31 Jul 2020 12:00:28 +0000

Is your Power BI report running slow? Nilabja Ball walks us through what you can tune and optimise in the first part of this two-part series.


This is the first in a two-part blog series covering various tips and tricks for Power BI performance tuning. In this post, the focus is on best practices for data preparation and for designing data models that meet performance requirements.

Power BI is a cloud based analytics platform that provides self-service analytics at enterprise scale, unifying data from many sources to create interactive, immersive dashboards and reports that provide actionable insights and drive business results. 

One of the common issues we hear is that a “Power BI report is running slow”. Broadly, the areas that can be optimised and tuned are as follows:

  • The data source(s) – Is it on-premises or in Azure? Structured or Unstructured?
  • The data model – Does it use a Snowflake or Star model? Is it Normalised or De-normalised?
  • Visualisations – this includes dashboards and Power BI reports. Are you using a table or matrix? Charts or graphs? Custom or built-in visuals?
  • Infrastructure – including capacities, data gateways, and the network.

A table that shows an example Power BI setup

  

Inappropriate use of Direct Query and Import  

Import mode is the most common mode used to develop models. This mode delivers extremely fast performance thanks to in-memory querying and provides complete DAX support. Because of these strengths, it’s the default mode when creating a new Power BI desktop solution. 

A demonstration of an import model

DirectQuery mode is an alternative to Import mode. The model in this mode consists only of metadata defining the model structure. When the model is queried, queries compatible with the underlying data sources are fired off in the background to retrieve data.

An example DirectQuery setup  

Considerations for Import

The Power BI engine offers around 5-10x compression depending on the data types and values. In shared capacity, an import model can store around 1GB of data, and this can scale with Premium capacity, which supports larger datasets. However, you should avoid Import mode in these situations:

  • The dataset is very large and cannot fit in the available capacity.
  • You need real-time insights for reporting. Import works on the refresh model, and will not suffice for data refreshing every few seconds or minutes.

 

Considerations for Live Connection and Direct Query

This is essentially the opposite of the Import model. Whenever you need real-time data for reporting or analytics, or you cannot store your data in the Power BI workspace (shared or Premium, depending on your environment), use DirectQuery.

If your team or organisation already uses a tabular model, you can continue with AAS or SSAS tabular for live connection scenarios.  

There are limitations to a DirectQuery model, including extra network round-trips to retrieve data from the source. DAX coverage is also limited and depends on supportability with respect to the various data sources.

 

Data Refresh

  • Schedule your refreshes for less busy times, especially if your datasets are on Power BI Premium.
  • Keep refresh limits in mind. If the source data changes frequently or the data volume is substantial, consider using DirectQuery/LiveConnect mode instead of Import mode if the increased load at the source and the impact on query performance are acceptable.
  • Verify that your dataset refresh time does not exceed the maximum refresh duration.
  • Use a reliable enterprise data gateway deployment to connect your datasets to on-premises data sources.
  • Use separate data gateways for Import datasets and DirectQuery/LiveConnect datasets so that the data imports during scheduled refresh doesn’t impact the performance of reports and dashboards on top of DirectQuery/LiveConnect datasets. 
  • Configure failure notifications and send emails to your mailbox when any refresh fails.
  • Configure incremental refresh for datasets that are filtered by using Power Query date/time parameters with the reserved, case-sensitive names RangeStart and RangeEnd. This will minimise the memory footprint and reduce refresh time.

 

Use Query Caching

Query caching is a premium feature that provides performance benefits when a dataset is accessed frequently and doesn’t need to be refreshed often. 

A Query Caching dialogue box in Power BI

Query caching can also minimise load on Premium workspaces, because it reuses existing cached data and reduces the overall number of queries.

 

Data Loading

Modelling Improvements

In Power BI models, there can be intermediate tables – used as staging layers or custom queries – that the author hides to prevent access. These hidden tables still consume memory, and one way to improve performance is to disable their load. Don’t confuse this with the “Hide in report view” option: that only removes the table from the view, but the table is still loaded in the model and consumes memory.

A screenshot showing the Enable Load option is enabled.

Don’t use Auto Date/Time  

Auto date/time is a data load option in Power BI Desktop. When date columns are loaded into the model, this feature helps developers create time-based visuals easily.

However, if a data model has many date/time fields, this setting creates several hidden internal date tables and can significantly increase the memory footprint, even of small models.

The Time intelligence option in Power BI, showing that Auto date/time is enabled.
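Instead of relying on Auto date/time, you can build one explicit date table and relate it to every date column. A minimal sketch in Power Query (the start and end dates are placeholder assumptions):

```m
let
    Start = #date(2018, 1, 1),
    End   = #date(2025, 12, 31),
    // Generate one row per day between Start and End.
    Days  = List.Dates(Start, Duration.Days(End - Start) + 1, #duration(1, 0, 0, 0)),
    Tbl   = Table.FromList(Days, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(Tbl, {{"Date", type date}})
in
    Typed
```

Mark the result as the model’s date table, then turn the Auto date/time option off.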

 

GroupKind.Local

GroupKind.Local performs faster than the default setting. When the data is already sorted, so that equal key values appear in contiguous runs, it can speed up grouping operations considerably. You can read more about this on Microsoft Docs.
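For illustration, assuming a sales table already sorted by CustomerID (the table and column names are hypothetical), grouping with GroupKind.Local might look like:

```m
// GroupKind.Local groups only contiguous runs of equal keys,
// so it avoids scanning the whole table to build the groups.
GroupedSales = Table.Group(
    SortedSales,
    {"CustomerID"},
    {{"TotalAmount", each List.Sum([Amount]), type number}},
    GroupKind.Local
)
```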

 

Data models

Star Schema

Star schema is a mature modelling approach widely adopted by relational data warehouses. It requires modellers to classify their model tables as either dimension or fact. 

A well-structured model design should include tables that are either dimension-type or fact-type. Avoid mixing the two types in a single table that contains both dimension fields and additive measures. We also recommend creating the right number of tables with the right relationships in place.

A diagram showing the star schema model.

You can read more about Star Schema on Microsoft Docs.

 

Prefer integers over strings

Integers are a fixed-length datatype that uses run-length encoding, whereas strings use dictionary encoding. Also, if you sort the data on an integer column, the level of compression will be significantly higher. Power BI stores data in segments with a boundary of one million rows.

 

Avoid high precision/cardinality columns

Columns with high precision or high cardinality, such as decimal numbers or date/time values, reduce the compression ratio and increase load times. Wherever possible, find ways to reduce precision without impacting business requirements.

For example, a Date/Time column has a precision of milliseconds. Use a Date column if you only need the date, and a Time column if you only need the time. Also, round values to a lower precision wherever possible.
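As a sketch, a Date/Time column can be split into separate date and time columns in Power Query (these are steps inside a query’s let expression; the `OrderDateTime` column name is an assumption):

```m
// Split the datetime into lower-precision date and time columns,
// then drop the original high-precision column.
WithDate = Table.AddColumn(Source, "OrderDate", each DateTime.Date([OrderDateTime]), type date),
WithTime = Table.AddColumn(WithDate, "OrderTime", each DateTime.Time([OrderDateTime]), type time),
Reduced  = Table.RemoveColumns(WithTime, {"OrderDateTime"})
```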

 

Lean Model

The source tables or views might have many columns, but in your data model the area of interest might be restricted to just a few of them. For example, auditing columns such as Last Modified are not useful for analytics purposes. Remove unwanted tables or columns from the model, as this will reduce the model size and improve refresh times.
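For example, keeping only the columns the model needs can be done early in the query (the table and column names here are hypothetical):

```m
// Select only the analytically useful columns; everything else,
// such as auditing columns, is dropped before load.
LeanTable = Table.SelectColumns(
    Source,
    {"OrderID", "OrderDate", "CustomerID", "Amount"}
)
```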

 

Bi-Directional Relationships

The bi-directional cross-filtering feature is very powerful, and it allows us to solve complex modelling scenarios with ease. However, in a model full of bi-directional filters, any slicing or filtering activity might slow down because of the relationship-propagation chain.

Also, if not modelled correctly, you might see inconsistent behaviour, particularly in a snowflake architecture. 

A screenshot showing the cross filter direction option in the relationship editor in Power BI.

 

Default Aggregations

Not all numeric columns are additive in nature; surrogate keys and primary keys are examples. By default, when a numeric column is placed on a visual, Power BI aggregates it, increasing the report compute time. It’s recommended that you change the default aggregation to none for such columns. Another key point is to ensure that fact-type tables always load data at the correct granularity: too coarse, and you miss detailed information; too fine, and it becomes overly detailed for reports, leaving you with a cost-inefficient, large model.

 

Composite Model and Aggregations

Whenever possible, it’s best to develop a model in Import mode. This mode provides the greatest design flexibility and best performance. However, challenges related to large data volumes, or reporting on near real-time data, cannot be solved by Import models. In either of these cases, you can consider a DirectQuery model, providing your data is stored in a single data source that’s supported by DirectQuery mode. 

Further, you can consider developing a Composite model in the following situations:

  • Your model could be a DirectQuery model, but you want to boost performance. In a Composite model, performance can be improved by configuring the appropriate storage mode for each table. You can also add aggregations to store aggregated fact data in Import mode.
  • You want to combine a DirectQuery model with additional data, which must be imported into the model. Imported data can be loaded from a different data source, or from calculated tables. 
  • You want to combine two or more DirectQuery data sources into a single model. 

You can read more about Desktop Composite Models on Microsoft Docs.

 

Important composite best practices

  • Set the storage mode to DirectQuery when a table is a fact-type table storing large data volumes, or it needs to deliver near real-time results.
  • Set the storage mode to Dual when a table is a dimension-type table, and it will be queried together with DirectQuery fact-type tables based on the same source.

Aggregations

  • You can add aggregations to DirectQuery tables in your Composite model. Aggregations are cached in the model, so they behave as Import tables. Their purpose is to improve performance for “higher grain” queries.  
  • We recommend that an aggregation table follows a basic rule: Its row count should be at least a factor of 10 smaller than the underlying table. 

Be sure to check out part two, which looks at optimisations you can make to certain calculations, row-level security, dashboard design and more!


The post Power BI Performance Tuning Workflow – Part 1 appeared first on Microsoft Industry Blogs - United Kingdom.
