Big Data - Microsoft SQL Server Blog

Modernize your database with the consolidation and retirement of Azure Database Migration tools

Sudhir Raparla — Thu, 12 Sep 2024 15:00:00 +0000

Simplifying Database Migrations with Azure SQL

By migrating their databases to Azure, customers like Ernst and Young are modernizing their data estate and leveraging cutting-edge cloud innovations. However, the migration process can be complex, whether moving within the same database management system (homogeneous) or between different systems (heterogeneous). Microsoft offers a suite of tools for migration to simplify the migration process. To further enhance the user experience, we are streamlining the Azure database migration tools ecosystem. This involves retiring certain overlapping tools to simplify finding the right tool and provide unified migration experiences across all phases of migration. As part of this effort, effective 12/15/2024 we are replacing some tools with unified experiences that offer capabilities across various migration stages in the drive to modernize their data estate and take advantage of innovation in the cloud.

Azure Database Migration Guides

Step-by-step guidance for modernizing your data assets

Learn more

With a refined set of tools, you can confidently plan, assess, and execute your database migration with minimal downtime, ensuring a smooth transition to Azure SQL. Post the 12/15/24, retirement date, Microsoft will stop supporting these tools for any issues that arise and will not issue any bug fixes or further updates. Here is the list of tools that are planned for retirement and Microsoft recommended replacement tools.

Tool	Retirement Date	Recommend replacement
Database Migration Assessment for Oracle (DMAO) is an extension in Azure Data Studio that helps you assess an Oracle workload for migrating to Azure SQL and Azure Database for PostgreSQL.	12/15/2024	For Azure SQL target assessments switch to using assessment and Azure SQL target recommendation capabilities in SQL Server Migration Assistant (SSMA) for performing Oracle to Azure SQL assessments in your migration journey to Azure SQL. For PostgreSQL target assessments switch to using Ora2PG Migration cost assessment capabilities to get Azure PostgreSQL target recommendations.
Database Schema conversion Toolkit (DSCT) is an extension for Azure Data Studio designed to automate database schema conversion between different database platforms.	12/15/2024	Switch to using conversion assessment and converting Oracle Schemas capabilities in SQL Server Migration Assistant (SSMA) for Oracle to Azure SQL conversions in your migration journey to Azure SQL.
Database Experimentation Assistant (DEA) is an experimentation solution for SQL Server upgrades. DEA can help you evaluate a targeted version of SQL Server for a specific workload.	12/15/2024	Use open-source tools like SQLWorkload, which is a collection of tools to collect, analyse and replay SQL Server workloads, on premises and in the cloud.
Data Access Migration Toolkit (DAMT) is a VS Code extension that help users identify SQL code in application source code when migrating from one DB to another and identify SQL compatibility issues. Supported source database backends include IBM DB2, Oracle Database and SQL Server.	12/15/2024	For identifying the SQL queries in source code, our recommendation is to use Regex or parse the application code either manually or with custom-built tools to identify T-SQL embedded in the application code. For identifying compatibility between your source SQL Server and the target Azure SQL, please use assessment capabilities available in SQL Server enabled by Arc or Azure SQL Migration extension for Azure Data Studio or using Azure Migrate SQL Assessment capabilities.

Microsoft recommended tools for migration to Azure SQL

With the retirement of Database Migration Assistant for Oracle (DMAO), Database Schema Conversion Toolkit (DSCT), Data Access Migration Toolkit (DAMT), Database Experimentation Assistant (DEA), the Azure database migration tooling ecosystem is greatly simplified. Here is Microsoft’s recommendation for database migration tools for customers moving to Azure SQL.

Homogenous migrations (SQL Server to Azure SQL)

If the SQL Server that will be migrated is already enabled by Azure Arc, you can use Arc capabilities to perform a migration assessment and get optimal Azure SQL Target recommendations. Additionally, SQL Server enabled by Azure Arc provides multiple Azure benefits to SQL Servers outside Azure like automated backups and patching, Microsoft Defender for SQL, inventory of instances and databases, and Entra ID support. By enabling these Arc features, you can leverage cloud automation and security for Azure SQL Server even before you migrate.

If the SQL Server outside Azure is not inventoried yet, you can use Azure Migrate for discovery, assessment and business case to know the right Azure SQL targets for your on-premises SQL Workloads and to get the projected cost savings of migrating to Azure SQL.

To migrate SQL Server into an Azure Virtual Machine with the same configuration as the source, users can use Azure Migrate to perform lift and shift migrations. SQL Server on Azure Virtual Machines allows you to easily migrate your SQL Server workloads to the cloud, offering SQL Server’s performance and security along with Azure’s flexibility and hybrid connectivity to address urgent business needs. Later you can evaluate one of the Azure SQL PaaS targets (Azure SQL Managed Instance, Azure SQL Database) and modernize to a PaaS service for better cost and workload performance optimizations.

If you have completed an assessment and are ready to move to Azure SQL Managed Instance or Azure SQL Database, you can start your migration journey with Azure Migrate, you can use Azure Database Migration service or Azure SQL Migration extension for Azure Data Studio can be used.

If the SQL Server estate is already inventoried, users can use Azure SQL Migration extension for Azure Data Studio to complete the entire migration journey i.e., perform assessment, get Azure SQL Target recommendations and perform migrations.

Heterogenous migrations (non-SQL Server databases to Azure SQL)

With the availability of Target Assessment and SKU recommendation capabilities in SQL Server Migration Assistant (SSMA) along with existing code conversion and migration capabilities, SSMA becomes a single tool that you need to use to migrate from other source database platforms like Oracle, DB2, SAP ASE, MySQL, Access to Azure SQL or SQL Server.

Learn more about modernizing your databases with Azure

Get step-by-step guidance for modernizing your Oracle and SQL Server databases to Azure in the Azure Database Migration Guides.
Read about the Azure Migrate Assessment for Azure SQL to get an overview of SQL Server assessment process using Azure Migrate.
Know how to complete end to end SQL Server to Azure SQL migration journey with Azure SQL Migration extension for Azure Data Studio.
Find documentation for SQL Server Migration Assistant (SSMA), a tool for assessing and automating migration from Microsoft Access, DB2, MySQL, Oracle, and SAP ASE.

The post Modernize your database with the consolidation and retirement of Azure Database Migration tools appeared first on Microsoft SQL Server Blog.

What’s new with SQL Server Big Data Clusters—CU13 Release

Daniel Coelho — Wed, 06 Oct 2021 15:00:09 +0000

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters which includes important changes and capabilities:

Hadoop Distributed File System (HDFS) distributed copy capabilities through azdata
Apache Spark 3.1.2
SQL Server Big Data Clusters runtime for Apache Spark release 2021.1
Password rotation for Big Data Cluster’s auto-generated Active Directory service accounts during BDC deployment
Enable Advanced Encryption Standard (AES) Optional parameter on the automatically generated AD accounts

Major improvements in this update are highlighted below, along with resources for you to learn more and get started.

HDFS distributed copy capabilities through azdata

Hadoop HDFS DistCP is a command line tool that enables high-performant distributed data copy between HDFS clusters. On SQL Server Big Data Clusters CU13 we are surfacing the capability of distcp through the new azdata bdc hdfs distcp command to enable inter Big Data Clusters distributed data copy. This enables data migration scenarios between SQL Server Big Data Clusters; supporting both secure and non-secure cluster deployment configurations.

For more information, see:

Apache Spark 3.1.2

Up to cumulative update 12, Big Data Clusters relied on the Apache Spark 2.4 line, which reached its end of life in May 2021. Consistent with our continuous improvement commitment to the Big Data and Machine Learning capabilities of the Apache Spark engine, CU13 brings in the current release of Apache Spark, version 3.1.2.

This new version of Apache Spark brings stellar performance benefits on big data processing workloads. Using the reference TCP-DS 10 TB workload in our tests we were able to reduce runtime from 4.19 hours to 2.96 hours, a 29.36 percent improvement achieved just by switching engines while using the same hardware and configuration profiles, no additional application optimizations. The improvement mean of individual query runtime is 36 percent.

Spark 3 is a major release and as such, contains breaking changes. Following the same established best practice in the SQL Server universe, perform a side-by-side deployment of SQL Server Big Data Clusters to validate your current workload with Spark 3 before upgrading. You can leverage the new azdata HDFS distributed copy capability to have a subset of your data needed to validate this workload. For more information, see the following articles to help you assess your scenario before upgrading to the CU13 release:

SQL Server Big Data Clusters runtime for Apache Spark release 2021.1

With this release of SQL Server Big Data Clusters, we doubled down on our commitment of release cadence, binary compatibility, and consistency of experiences for data engineers and data scientists through the SQL Server Big Data Clusters runtime for Apache Spark initiative.

The SQL Server Big Data Clusters runtime for Apache Spark is a consistent versioned block of programming language distributions, engine optimizations, core libraries, and packages for Apache Spark.

Here is a summary of the SQL Server Big Data Clusters runtime for Apache Spark release 2021.1 shipped with SQL Server Big Data Clusters CU13:

Apache Spark 3.1.2
Scala 2.12 for Scala Spark
Python 3.8 for PySpark
Microsoft R Open 3.5.2 for SparkR and sparklyr

For more information on all included packages and how to use it, see:

Password rotation for Big Data Cluster’s Active Directory service accounts

When a big data cluster is deployed with Active Directory integration for security, there are Active Directory (AD) accounts and groups that SQL Server creates during a big data cluster deployment, see auto-generated active directory objects for further information.

When it comes to security-sensitive customers, it is usually required security reinforcement such as setting password expiration policies, allowing the administrator to set user passwords to never expire or expire after a certain number of days. For SQL Server Big Data Cluster deployments it was previously required to manually rotate the password for those auto-generated active directory objects.

With SQL Server Big Data Clusters CU13, we are now releasing the azdata bdc rotate command to rotate passwords for all auto-generated accounts except for the DSA account. In order to update the DSA password for SQL Server Big Data Clusters we are releasing a specific operation notebook.

Enable Advanced Encryption Standard (AES) on the automatically generated AD accounts

Today’s enterprise environments are facing a lot more challenges than it used to be. Using secure and encrypted connections when authenticating with Kerberos will significantly lower the risk to encounter attacks such as Kerberoasting; a type of attack targeting service accounts in Active Directory. Starting with SQL Server Big Data Clusters CU13, we’re enabling the Advanced Encryption Standard (AES) support on the auto-generated AD accounts by allowing users to set an optional boolean parameter in the BDC deployment profile to indicate this AD account supports Kerberos AES 128 bit and 256 bit encryptions.

For more information, see:

Ready to learn more?

Check out the SQL Server Big Data Clusters CU13 release notes to learn more about all the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

Follow the instructions on our documentation page to get started and deploy Big Data Clusters.

The post What’s new with SQL Server Big Data Clusters—CU13 Release appeared first on Microsoft SQL Server Blog.

What’s new with SQL Server Big Data Clusters—CU11 Release

Daniel Coelho — Thu, 08 Jul 2021 16:00:23 +0000

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. BDC extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. BDC is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re announcing the release of the latest cumulative update, CU11, for SQL Server Big Data Clusters, which includes important capabilities:

Encryption at Rest with external key providers, commonly known as “bring your own key” (BYOK).
Several SQL Server PolyBase Hadoop fixes and SQL Server PolyBase additional support to many data sources.

Major improvements in this update are highlighted below, along with resources for you to learn more and get started.

Data Encryption at Rest

SQL Server 2019 CU8 introduced the Encryption at Rest initial feature set, bringing together a system-managed experience across both SQL Server and HDFS components. With each additional release shaped by our community and insightful customer feedback, many features were added. With the release of the latest cumulative update, CU11, we get to a complete Encryption at Rest feature set, with seamless application-level encryption for the SQL Server and HDFS components.

In CU11, we introduced the BYOK functionality with integration with external key providers, such as Hardware Security Modules (HSM) or services like Azure Key Vault or even Hashicorp Vault. With that capability SQL Server Big Data Clusters Encryption at Rest feature set now contains both system-managed and user-managed Encryption at Rest for SQL Server and HDFS components.

To learn more about the complete Encryption at Rest feature set, see the in-depth documentation:

SQL Server Big Data Clusters PolyBase improvements

Consistent with our commitment to continuous improvements of the Data Virtualization and scale-out capabilities, CU11 bring fixes and new support for the following data sources: Hortonworks HDP 3.1, Cloudera CDH 6.1, 6.2, 6.3, Azure Blob Storage (WASB[S]), and Azure Data Lake Storage Gen2 (ABFS[S]).

For more information, see:

Ready to learn more?

Check out the SQL Server Big Data Clusters CU11 release notes to learn more about all the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

Follow the instructions on our documentation page to get started and deploy Big Data Clusters.

The post What’s new with SQL Server Big Data Clusters—CU11 Release appeared first on Microsoft SQL Server Blog.

What’s new with SQL Server Big Data Clusters—CU10 release

Rahul Ajmera — Wed, 07 Apr 2021 15:00:06 +0000

SQL Server Big Data Clusters is a new capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform.

Big Data Clusters is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re announcing the release of the latest cumulative update (CU), CU10, for SQL Server Big Data Clusters, which includes important capabilities:

Upgraded base images from Ubuntu 16.04 to Ubuntu 20.04.
High availability support for Hadoop KMS components.
Additional configurability of SQL Server networking and process affinity settings at the resource-scope.
Resource management for Spark-related containers through cluster-scoped settings.

Major improvements in this update are highlighted below, along with resources for you to learn more and get started.

Upgraded base image versions

SQL Server 2019 CU9 included a software refresh for most of the open source components deployed with Big Data Clusters. Building on this momentum and in line with our commitment to ensure that Big Data Clusters component versions are up to date with those supported, we are now upgrading the base operating system (OS) for all container images from Ubuntu 16.04 to Ubuntu 20.04.

For existing Big Data Clusters deployments, no other action is necessary apart from the regular in-place upgrade to the new CU. The new CU10 images that include the upgraded base OS version will be used when upgrading Big Data Clusters. As a best practice, we recommend upgrading to CU10 to take advantage of new capabilities and improvements and to ensure containers are covered by the Ubuntu support lifecycle.

High Availability support for Hadoop KMS components

Consistent with our commitment to continuous improvements of the Encryption at Rest feature set, CU10 adds High Availability capabilities for Hadoops key management service (KMS) components. After the upgrade, all namenode pods will have a KMS instead of just one namenode pod. The benefits are two-fold, increased high availability and increased performance of encryption operations on encryption zones.

Ready to learn more?

Check out the SQL Server Big Data Clusters CU10 release notes to learn more about all the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, take a look at the documentation page and visit our GitHub repository.

Follow the instructions on our documentation page to get started and deploy Big Data Clusters.

The post What’s new with SQL Server Big Data Clusters—CU10 release appeared first on Microsoft SQL Server Blog.

What’s new with SQL Server Big Data Clusters

SQL Server Team — Tue, 16 Feb 2021 17:00:59 +0000

SQL Server Big Data Clusters (BDC) is a new capability brought to market as part of the SQL Server 2019 release. BDC extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. BDC is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re announcing the release of the latest cumulative update (CU9) for SQL Server Big Data Clusters, which includes important capabilities:

Support to configure BDC post deployment.
Improved experience for encryption at rest.
Ability to install Python packages at Spark job submission time.
Upgraded software versions for most of our OSS components (Grafana, Kibana, FluentBit, etc.) to ensure Big Data Clusters images are up to date with the latest enhancements and fixes.
Miscellaneous improvements and bug fixes.

This announcement highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get started.

Configuring SQL Server Big Data Clusters to meet your business needs

SQL Server Big Data Clusters, a feature released as part of SQL Server 2019, is a data platform for operational and analytical workloads. We are announcing new configuration management functionality as part of today’s CU9 release. Workload requirements are constantly changing and these enhancements will help customers ensure that their Big Data Cluster is always prepared for their needs.

Configuration management is the ability to alter or tune various parts of the Big Data Cluster after deployment and to provide users with clarity into the cluster’s configurations. This allows administrators to configure the Big Data Cluster configurations to meet their workload’s needs. Whether an administrator wants to turn on SQL Agent, define the baseline resources for their organization’s Spark jobs, or even see what settings are configurable at each scope—configuration management is the one-stop solution to meet these needs.

To enable this functionality, we are exposing new commands to the azdata command line interface (CLI). Azdata, an interface to manage a BDC, now includes post-deployment configuration functionality to set, diff, and apply configuration settings. To start, customers can configure settings at the cluster, service, and resource scope and then commit them for change. After applying pending configuration changes, customers can monitor the process through azdata or Azure Data Studio. Once the update is completed, the Big Data Cluster is ready for the next workload.

Learn more and get started with configuration management.

Spark job library management

Data engineers and data scientists often want to experiment with and use a variety of different libraries and packages as part of their workflows. There are separate ways to do this for each language including importing from Maven, installing from Python Package Index (PyPi) or conda, or installing from Microsoft R Application Network (MRAN). Before today, customers could import jars from Maven or reference custom packages stored in Hadoop Distributed File System (HDFS) through Spark job configurations.

Starting in CU9, data engineers and data scientists now have added flexibility for their PySpark jobs through job-level virtual environments. They can easily configure a conda virtual environment and get to work with their favorite Python libraries.

Learn how to configure a job-level Spark environment.

Improving the experience on encryption at rest

In SQL Server Big Data Clusters CU8, we introduced a comprehensive encryption at rest feature set that focused on system-managed keys. This enabled application-level encryption capabilities to all data stored in the platform, on both SQL Server and HDFS. The HDFS experience provided at that time for administrators was centered on usage of Azure Data Studio Notebooks to control all aspects of the feature. Starting with CU9, in addition to expanding the Notebook experience, we are enabling HDFS encryption zones and HDFS key management through azdata. This enables the automation of encryption at rest administrative tasks for HDFS administrators, a much desirable and consistent feature of the SQL Server Big Data Clusters platform.

To learn more about the new notebooks and the new azdata commands, visit the release notes.

Ready to learn more?

Check out the SQL Server CU9 release notes for Big Data Clusters to learn more about all of the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

Follow the instructions on our documentation page to get started and deploy Big Data Clusters.

The post What’s new with SQL Server Big Data Clusters appeared first on Microsoft SQL Server Blog.

Expanding SQL Server Big Data Clusters capabilities, now on Red Hat OpenShift

Mihaela Blendea — Tue, 23 Jun 2020 17:00:15 +0000

SQL Server Big Data Clusters (BDC) is a new capability brought to market as part of the SQL Server 2019 release. BDC extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure and unified data platform. BDC is available exclusively to run on Linux containers, orchestrated by Kubernetes, and can be deployed in multiple-cloud providers or on-premises.

Today, we’re announcing the availability of the latest cumulative update (CU5) for SQL Server 2019, that includes important capabilities for SQL Server and BDC including:

Support for deploying BDC on Red Hat OpenShift Kubernetes platform.
Enabled running applications within BDC as non-root users.
Support for deploying multiple BDCs against the same Active Directory domain.
Enriched data virtualization experiences.
Enhanced and open sourced Spark SQL connector.
Miscellaneous improvements and bug fixes.

This announcement blog highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get you started.

Deploy Big Data Clusters on Red Hat OpenShift Kubernetes platform

Red Hat OpenShift provides an enterprise-grade, commercially-supported distribution of Kubernetes as the foundation of its container platform across hybrid and multi-cloud environments. Through a close partnership with the Red Hat team, today we’re announcing support for SQL Server BDC deployments on OpenShift, for version 4.3 and up, on-premises or in public cloud environments with (ARO). You can now leverage a fully supported stack to operationalize your next unified analytics platform using BDC, ensuring design and development best practices, and enterprise-grade security guidelines that are core to OpenShift.

We have enhanced the security design of BDC to take full advantage of the OpenShift Container Platform. In addition to privileged containers being no longer required, containers are also running as a non-root user by default. This includes enabling enhanced process isolation within a container. The white paper produced in collaboration with SQL Server and Red Hat security teams describes the design in detail, highlighting what and why we require certain security policies when deploying BDC on OpenShift.

The BDC deployment model and experiences were enhanced so that you can follow the prescribed guidance, in an integrated manner, with built-in deployment profiles targeting OpenShift environments or UX enhancements in Azure Data Studio that include OpenShift as a target platform. With containers and Kubernetes powered Red Hat OpenShift, organizations can achieve the desired agility, scalability, flexibility, security, and portability for Big Data Clusters.

Bringing SQL Server and Big Data Clusters to the OpenShift Container Platform has been a real team effort. Red Hat provided our team with valuable help, bootstrapping our initial efforts, as well as providing best practice guidance during implementation. Security and trust are critical for both companies and so we appreciate the valuable input and contributions of Dan Walsh, Senior Distinguished Engineer at Red Hat, and Michael Nelson, Principal Software Engineering Manager at Microsoft, who collaborated on the security design for Big Data Clusters on OpenShift.

For more information on the BDC deployment process on OpenShift, follow the instructions on our documentation page.

Secure by default containers, running as non-root users

As a modern data platform, BDC ensures enterprise-grade secure data access by enabling Active Directory authentication though innovative implementations for applications running in containers. In addition, we are now making the platform implementation safer by ensuring that all container applications running within BDC are started as non-root users by default, on all supported platforms. These capabilities are available for all new deployments using the SQL Server 2019 CU5 corresponding image tag. Existing pre-CU5 BDC deployments will not be impacted, and applications in these clusters will continue to run as root user. Support for migrating these clusters to non-root type configuration will be added in a future cumulative update.

Deploy multiple BDCs against the same Active Directory domain

To complement the above platform enhancements regarding secure big data clusters, we are pleased to announce that we added support for deploying multiple BDCs against a single Active Directory domain. You can now leverage multiple BDC deployments in your secure enterprise environment, to accommodate multiple use cases like development/test, pre-production or production, CI/CD pipelines or HADR.

To learn more about Active Directory integration for BDC and deploying multiple BDCs against the same domain, see the security related topics on our documentation page.

Announcing new data virtualization enhancements

In addition to the improvements above, we have also improved our data virtualization capabilities. Namely, we’ve introduced two new stored procedures, sp_data_source_objects and sp_data_source_table_columns, to support introspection of certain External Data Sources. They can be used by customers directly via T-SQL for schema discovery and to see what tables are available to be virtualized. We leverage these in the External Table Wizard of the Data Virtualization Extension for Azure Data Studio, which allows you to create external tables from SQL Server, Oracle, MongoDB, and Teradata.

For more information on the external table wizard, visit the documentation page.

SQL Server and Azure SQL Connector for Apache Spark Open Sourcing

BDC includes the SQL Server and Azure SQL Connector for Apache Spark. Based on the Apache Spark DataSource V1 APIs and SQL Server Bulk APIs, this connector enables you to read/write to and from any SQL Server using Apache Spark. As part of Microsoft’s commitment to open-source technology, we will be releasing this connector under the ApacheV2 license for anyone to use and contribute to. Stay tuned for more updates once the connector is live!

SQL Server BDC team hears your feedback

If you would like to help make BDC an even better analytics platform, please share any recommendations or report issues through our feedback page. SQL Server engineering team is thoroughly going through the reported suggestions. They are valuable input for us, that is being considered when planning and prioritizing the next set of improvements. We are committed to ensuring that SQL Server enhancements are based on customer experiences, so we build robust solutions that meet real production requirements in terms of functionality, security, scalability, and performance.

Ready to learn more?

With SQL Server 2019 CU5 updates, BDC continues to simplify the security, deployment, and management of your key data workloads. Industry-leading innovative security and compliance features and support for market-leading Kubernetes based platforms like Red Hat’s OpenShift will help our mutual customers achieve the expected agility, scalability, flexibility, and portability to develop and operationalize intelligent applications.

Check out the SQL Server CU5 release notes for BDC to learn more about all the improvements available with the latest update. For a technical deep-dive on Big Data Clusters, read the documentation and visit our GitHub repository.

To get started with deploying BDC on OpenShift, follow the instructions on our documentation page. Make sure to read the Security Best Practices whitepaper to better understand the security requirements.

The post Expanding SQL Server Big Data Clusters capabilities, now on Red Hat OpenShift appeared first on Microsoft SQL Server Blog.

Apache Spark Connector for SQL Server and Azure SQL is now open source

SQL Server Team — Mon, 22 Jun 2020 16:00:23 +0000

Accelerating big data analytics with the Spark connector for SQL Server

We’re happy to announce that we have open–sourced the Apache Spark Connector for SQL Server and Azure SQL on GitHub. Born out of Microsoft’s SQL Server Big Data Clusters investments, the Apache Spark Connector for SQL Server and Azure SQL is a high-performance connector that enables you to use transactional data in big data analytics and persists results for ad-hoc queries or reporting. The connector allows you to use any SQL database, on-premises or in the cloud, as an input data source or output data sink for Spark jobs.

Why use the Apache Spark Connector for SQL Server and Azure SQL

The Apache Spark Connector for SQL Server and Azure SQL is based on the Spark DataSourceV1 API and SQL Server Bulk API and uses the same interface as the built-in JDBC Spark-SQL connector. This allows you to easily integrate the connector and migrate your existing Spark jobs by simply updating the format parameter!

Notable features and benefits of the connector:

Support for all Spark bindings (Scala, Python, R).
Basic authentication and Active Directory (AD) keytab support.
Reordered DataFrame write support.
Reliable connector support for single instance.

Depending on your scenario, the Apache Spark Connector for SQL Server and Azure SQL is up to 15X faster than the default connector. The connector takes advantage of Spark’s distributed architecture to move data in parallel, efficiently using all cluster resources.

Visit the GitHub page for the connector to download the project and get started!

Get involved

The release of the Apache Spark Connector for SQL Server and Azure SQL makes the interaction between SQL Server and Spark even more flawless. We are continuously evolving and improving the connector, and we look forward to your feedback and contributions!

Want to contribute or have feedback or questions? Check out the project on GitHub and follow us on Twitter at @SQLServer.

The post Apache Spark Connector for SQL Server and Azure SQL is now open source appeared first on Microsoft SQL Server Blog.

The ultimate performance for your big data with SQL Server 2019 Big Data Clusters

SQL Server Team — Wed, 29 Jan 2020 17:00:41 +0000

Microsoft SQL Server 2019 Big Data Cluster enables intelligence over all your data and helps remove data silos by combining both structured and unstructured data across the entire data estate. Big Data Clusters integrates Microsoft SQL Server and the best of big data open-source solutions. It is deployed on scalable clusters using Apache Spark, HDFS containers with Kubernetes, and SQL Server. Microsoft SQL Server 2019 Big Data Cluster is the ideal Big Data solution for AI, ML, M/R, Streaming, BI, T-SQL, and Spark.

In October 2019, Microsoft and Intel conducted performance and scalability testing using workloads derived from the TPC-DS schema with very large data sets producing 1TB, 10TB, 30TB, and 100TB worth of raw structured and semi-structured data running on Microsoft SQL Server 2019 Big Data Cluster.

The TPC-DS is the world’s first industry-standard benchmark designed to measure the performance of a decision support system including queries and data maintenance. It’s comprised of 99 queries that scan large volumes of data by utilizing Spark SQL and gives answers to real-world business questions. It challenges the cluster configurations to extract maximum efficiency from CPU, memory, and I/O along with the operating system and the big data solution.

We used 2^nd Gen Intel Xeon Scalable processors for the performance testing. Across infrastructures, Intel® Xeon® Scalable platform is designed for data center modernization to drive operational efficiencies that lead to improved total cost of ownership (TCO) and higher productivity for users.

Results

The Big Data Cluster benchmarks, derived from TPC-DS, demonstrates the scalability and performance of Microsoft SQL Server 2019 Big Data reference Cluster.

Our testing demonstrates that the performance scales linearly from 1TB to 100TB datasets seamlessly and the various system resources are effectively utilized. Microsoft SQL Server 2019 Big Data Cluster leverages the high performance of Intel® Xeon® processors and Intel® SSDs to deliver great performance for complex queries. In addition, the benchmark results demonstrate powerful elasticity and performance of the entire platform.

The combination of Microsoft SQL Server 2019 Big Data Cluster and Intel’s Xeon Scalable platform can address many of your Big Data challenges. You can store and analyze data from multiple sources at scale, in various data formats, with scale-out compute for data processing and machine learning, together with the industry-leading experience of SQL Server.

Here is a link to download the technical white paper that captures detailed steps, configuration, and analysis of the benchmark study for Microsoft SQL Server 2019 Big Data Cluster on Intel’s Xeon Scalable platform.

Microsoft SQL Server 2019 Big Data Cluster performance benchmark technical whitepaper.

Learn more

Read the SQL Server 2019 launch blog here.
Download the SQL Server 2019 e-book.
Learn more about Microsoft SQL Server 2019 Big Data Clusters.
You can now get free big data node cores with your software assurance benefit. Download our licensing data sheet to learn more.

The post The ultimate performance for your big data with SQL Server 2019 Big Data Clusters appeared first on Microsoft SQL Server Blog.

How to deploy SQL Server 2019 Big Data Clusters

SQL Server Team — Tue, 19 Nov 2019 17:00:17 +0000

SQL Server 2019 Big Data Clusters is a scale-out, data virtualization platform built on top of the Kubernetes container platform. This ensures a predictable, fast, and elastically scalable deployment, regardless of where it’s deployed. In this blog post, we’ll explain how to deploy SQL Server 2019 Big Data Clusters to Kubernetes.

First, the tools

Deploying Big Data Clusters to Kubernetes requires a specific set of client tools. Before you get started, please install the following:

azdata: Deploys and manages Big Data Clusters.
kubectl: Creates and manages the underlying Kubernetes cluster.
Azure Data Studio: Graphical interface for using Big Data Clusters.
SQL Server 2019 extension: Azure Data Studio extension that enables the Big Data Clusters features.

Choose your Kubernetes

Big Data Clusters is deployed as a series of interrelated containers that are managed in Kubernetes. You have several options for hosting Kubernetes, depending on your use case, including:

Azure Kubernetes Service (AKS): You can use the Azure portal to deploy Azure Kubernetes Service. Azure Kubernetes Service allows you to deploy a managed Kubernetes cluster in Azure, all you manage and maintain are the agent nodes. You don’t even have to provision your own hardware.
Multiple Linux machines: Kubernetes can also be deployed to multiple Linux machines, physical or virtual. This is a great option if you’re looking for an opportunity to leverage existing infrastructure. You can use the kubeadm tool to create the Kubernetes cluster and a bash script. Visit our documentation to learn how to automate the deployment.

Deploy SQL Server 2019 Big Data Clusters

After configuring Kubernetes, your next step is to deploy Big Data Clusters with the azdata bdc create command. There are several different ways to do this as well:

If you’re deploying to a dev-test environment, consider using one of the default configurations provided by azdata.
To customize a deployment, you can create and use your own deployment configuration files.
If you need to walk away while you’re deploying, an unattended installation will allow you to pass all other settings in environment variables.

Deployment scripts

Deployment scripts can make deployment easier and faster by deploying both Kubernetes and Big Data Clusters in a single step. They also often provide default values for Big Data Clusters settings. However, you aren’t locked into the values defined by the script. Deployment scripts can also be customized, so you can create your own version that configures the Big Data Clusters deployment to your liking.

Two deployment scripts are currently available. The Python script deploys a big data cluster on Azure Kubernetes Service, and the Bash script deploys Big Data Clusters to a single node kubeadm cluster.

Deployment notebooks

There’s one more option for deploying Big Data Clusters, and that’s running an Azure Data Studio notebook. There will also be a UX experience in Azure Data Studio for deployment.

Because SQL Server Big Data Clusters are deployed on Kubernetes, getting up and running is fairly painless. As you can see, you have several options each step of the way, but your path is made clear based on your use case. To learn more about what you can do with Microsoft SQL Server 2019, check out the free Packt guide Introducing Microsoft SQL 2019. If you’re ready to jump to a fully managed cloud solution, check out the Essential Guide to Data in the Cloud.

The post How to deploy SQL Server 2019 Big Data Clusters appeared first on Microsoft SQL Server Blog.

Build an intelligent analytics platform with SQL Server 2019 Big Data Clusters

Mihaela Blendea — Mon, 11 Nov 2019 17:00:24 +0000

In the most recent releases, SQL Server went beyond relational data and enabled support for graph data, R, and Python machine learning, while making SQL Server available on Linux and containers in addition to Windows. At the same time, organizations are challenged with the amount of data stored in different formats, in silos, and the expertise required to extract value out of the data. Through enhancements in data virtualization and platform management, Microsoft SQL Server 2019 Big Data Clusters provides an innovative and integrated solution to overcome these difficulties. It incorporates Apache Spark™ and HDFS in addition to SQL Server, on a platform built exclusively using containerized applications, designed to derive new intelligent insights out of data.

Modernize your data estate with a scalable data virtualization and analytics platform

Data integration strategies are based on extract, transform, and load (ETL) results in data duplication and transformations that diminish data quality, higher maintenance, and security risks. SQL Server 2019 has a new approach to data integration called data virtualization across disparate and diverse data sources, without moving data. Out-of-the-box connectors for data sources like Oracle, Teradata or MongoDB help you keep the data in place and secure, with less maintenance and storage cost. You can now uncover unconsidered perspectives by easily combining all your data, which ultimately leads to better data-driven decisions.

Systems Imagination is using these capabilities in SQL Server Big Data Clusters, eliminating the need to shift or replicate data to gain insights.

“With SQL Server 2019 Big Data Clusters, we can analyze cancer research data coming from dozens of different data sources, mine interesting graph features, and carry out analysis at scale” – Pieter Derdeyn, Knowledge Engineer, Systems Imagination.

In addition, SQL Server 2019 Big Data Clusters provides a comprehensive machine learning and AI platform with all the tools and services required to ingest, store, prepare, and analyze data. With previous versions of SQL Server, you can execute Python and R scripts to clean and prepare data, train, evaluate, or deploy machine learning models within a database. Within Big Data Clusters, you can use the data analysis tools and frameworks of your choice on the same platform where data resides.

In Azure Data Studio, you can submit Apache Spark™ jobs and use the built-in compute context in your preferred language including R, Python, or Scala. Your AI and machine learning lifecycle can benefit from SQL Server’s mission-critical features like performance, security, availability, and scalability. You can also operationalize these models and deploy them as containerized applications running within the platform, side by side with the data. Models are exposed as a REST API for easy integration with your business applications. This set of comprehensive analytics tools is what Dr. Foster, one of the Big Data Clusters early adopters customers, leveraged for their analytics platform:

“Our analysts need access to cutting edge data science technologies and techniques that adhere to strict industry-regulated guidelines. With SQL Server 2019 Big data clusters, we are able to analyze our relational data in the unified data platform, leveraging Apache Spark™, HDFS, and enhanced machine learning capabilities, all while remaining compliant.” – George Bayliffe, Head of Data, Dr. Foster

Built on top of the Kubernetes containers, Big Data Clusters have a built-in management system on any infrastructure

Managing all the services that enable you to run relational and big data workloads in a secure, efficient, and scalable way is challenging. With Big Data Clusters, you can operationalize management and data engineering tasks in an integrated and consistent way with a modern, containers-based architecture built on top of Kubernetes. At the center of this platform is the SQL Server master instance that stores relational data and serves as an entry point to other data sources within or outside the cluster. With additional SQL Server instances in the data pool, you can build a scale-out data mart for ingesting and automatically distribute data resulting in enhanced query performance efficiency. Multiple parallel-processing SQL Server instances in the compute pool and elastically scalable shared storage with SQL Server and HDFS are also included by default in a big data cluster. To further expand your data lake, you can unify your HDFS stores using HDFS tiering, Microsoft’s latest contribution to the Apache HDFS open source project, now available with SQL Server 2019 Big Data Clusters. Along with HDFS, we include Apache Spark™, ideal for data ingestion tasks, preparation, training, and analysis of high data volumes in a scalable and performant way.

The choice of infrastructure is fundamental when it comes to deploying and managing all these components at scale. Kubernetes enables application portability, elastic scalability, and consistency across platforms, allowing SQL Server 2019 Big Data Clusters to ensure a predictable, self-contained, and fast deployment workflow. Balzano recognizes the value of a self-managed, autonomous and flexible platform that allows you to focus on getting valuable insights out of data.

“SQL Server 2019 Big Data Clusters allowed us to accommodate and integrate all aspects from one shared platform for our data scientists and for our software engineers who wire up workflows, security, and scalability. At runtime, our healthcare customers benefit from simple containerized deployment and maintenance while being able to move our solution between on-premises and the cloud easily.” – René Balzano, Founder and CEO, Balzano.

SQL Server’s years long commitment is to support mission-critical applications. In Big Data Clusters, we ensure that management services embedded within the platform provide fast scale and automated upgrade operations, automatic logs and metrics collection, enterprise grade secure access, and high availability. Azure Active Directory authentication is available through innovative implementations for applications running in containers, providing an integrated security model that spans all services, including SQL Server, Apache Spark™, and HDFS. Maintenance tasks like secure container deployment, certificates, and secrets storing and rotation are provided by the platform through tight integration with Azure Active Directory and Kubernetes. Applications running on top of a Kubernetes orchestrator benefit from the platform’s built-in health monitoring, failure detection, and failover mechanisms. In addition, for critical components like the SQL Server master instance, you can enable flagship features like Always On Availability Groups for additional reliability and read scale out capabilities.

Cost effective big data and AI platform

You can start with the Developer Edition at no cost and try the complete set of capabilities of a full-featured deployment. The SQL Server 2019 licensing model was updated to incorporate a new subscription model for Big Data Clusters, and you have the option to use your existing SQL Server software licenses for Big Data Clusters deployments. A new Software Assurance benefit gives you eight Big Data Cluster node core licenses for each of the Enterprise Edition SQL Server master instance cores for free.

Get started

With a unified set of data integration, management, and data analysis tools, Big Data Clusters makes it not just easy, but also affordable for you to build on this platform. SQL Server 2019 Big Data Clusters provides the analytics at scale platform that you can count on for enterprise-grade performance, high availability, security, and manageability. We are very excited to see you use the broad range of scenarios that will help bridge the gap between relational data and big data deployments. You can store and analyze data from multiple sources at scale, in various data formats, with scale-out compute for data processing and machine learning, together with the industry-leading experience of SQL Server.

Start building your new analytics platform today. Here are a few pointers to help you get started:

Read the online documentation for SQL Server 2019 Big Data Clusters
Deploy Big Data Clusters and leverage the new analytics platform powered by Kubernetes
Experiment and try new capabilities of SQL Server 2019 Big Data Clusters using free training workshops
Learn more about all of the enhancements that went in SQL Server 2019 and try the new release of SQL Server today!

The post Build an intelligent analytics platform with SQL Server 2019 Big Data Clusters appeared first on Microsoft SQL Server Blog.