Open sourcing the Java language extension for SQL Server

Nellie Gustafsson — Wed, 22 Apr 2020 17:00:17 +0000

With over 20 years of existence, Java is still one of the most popular programming languages and is used in many enterprise applications.

In SQL Server 2019, we added a Java language extension, which enables secure execution of Java programs in the context of a SQL Server query. This enables a wide range of scenarios, such as performing advanced text and data preparation tasks, calling external APIs to retrieve data, and training and scoring machine learning models.

Today, we’re thrilled to announce that we are open sourcing the Java language extension for SQL Server on GitHub.

This extension is the first example of using an evolved programming language extensibility architecture, which allows integration with a new type of language extension. This new architecture gives customers the freedom to bring their own runtime and execute programs using that runtime in SQL Server while leveraging the existing security and governance that the SQL Server programming language extensibility architecture provides.

Being able to choose the runtime gives you the flexibility to use different distributions of Java, and as newer versions of the Java runtime are released, this architecture will make it easier to upgrade. However, this freedom may raise some questions around support, since enterprises need to have a support contract in place for their Java runtime. The answer, in this case, is that we’ve got you covered. Thanks to a partnership between Microsoft and Azul, all Azure and SQL Server customers can use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. This supported distribution of Java is included in SQL Server out of the box!

Now that support is not an issue, let’s look at what use cases Java can enable inside SQL Server. Bringing Java workloads closer to the data opens a variety of possibilities:

This extends the T-SQL surface area to better handle use cases involving regular expressions, string handling, and NLP support.
This functionality also helps in migration scenarios from Oracle, where applications rely on Oracle Java procs in the database. With the ability to execute Java inside stored procedures in SQL Server, there is now a path for enabling Java application migrations to SQL Server.
Java application development teams that leverage SQL Server as backend storage can now even embed Java code in stored procedures which enables pushing business logic down into the database for better performance.
Furthermore, this will help avoid unnecessary data movement and latency when data must be retrieved from SQL Server and moved into the app tier to do the business logic processing.
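
To make the regular expression use case above more concrete, here is a minimal sketch in plain Java of the kind of row filtering that T-SQL does not handle natively. The class and method names (RegexRowFilter, matching) are hypothetical and chosen for illustration only. The sketch deliberately leaves out the wiring into SQL Server, which depends on the language extension SDK and is covered by the tutorial mentioned later in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the SQL Server Java extension SDK):
// keeps only the rows whose text contains a match for a regular expression.
class RegexRowFilter {

    // Returns the rows that contain at least one match of the given regex.
    static List<String> matching(List<String> rows, String regex) {
        Pattern pattern = Pattern.compile(regex); // compile once, reuse per row
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            // Null rows (for example NULL column values) are skipped.
            if (row != null && pattern.matcher(row).find()) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample data is made up for the example.
        List<String> rows = List.of("order 1234 shipped", "no digits here", "invoice 42");
        System.out.println(matching(rows, "\\d+"));
    }
}
```

In a real deployment, logic like this would sit inside the class that the extension invokes, with the input rows coming from the query's input dataset rather than from a hard-coded list.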

Why Open Source?

The Java language extension leverages the Extensibility Framework API for SQL Server to communicate and exchange data with SQL Server. This API has been publicly documented. The API in combination with the open source code of the Java language extension provides an end to end example implementation of how a programming language extension can be built. This makes it easier for additional programming language extensions to be built for SQL Server by the community. What language extensions would you like to see?

Get started

Whether you are interested in creating your own language extension or just using the Java language extension for SQL Server, here is a tutorial to get you started.

The post Open sourcing the Java language extension for SQL Server appeared first on Microsoft SQL Server Blog.

Unify your data lakes with HDFS tiering in SQL Server Big Data Clusters

Nellie Gustafsson — Thu, 31 Oct 2019 17:00:22 +0000

As the volume and variety of data have risen, it has become more common to store the data in disparate and diverse data sources. A challenge many organizations face today is how to gain insights from all of their data across many different data sources. With SQL Server 2019 Big Data Clusters, through innovative enhancements, we’re extending the data virtualization capabilities even more with a new feature called HDFS tiering.

HDFS tiering allows you to easily integrate and gain insights from all of your data by accessing unstructured data stored on remote data lakes. This can be done by mounting the remote HDFS/S3 compatible data source to your local HDFS data lake.

This new functionality is Microsoft’s latest major contribution to the Apache Hadoop open source project and will be available in the market first in SQL Server 2019 Big Data Clusters.

Before we look closer at HDFS tiering, let’s quickly look at SQL Server Big Data Clusters as a data platform.

SQL Server Big Data Clusters

SQL Server Big Data Clusters is a complete data platform for analytics and AI with a local HDFS data lake built-in for storing high volume and/or unstructured data. In the big data cluster, you can use two different compute engines for querying and machine learning: Apache Spark™ and SQL Server.

Currently in SQL Server Big Data Clusters, you can use HDFS tiering to mount the following storage systems: Azure Data Lake Storage Gen2, AWS S3, Isilon, StorageGRID, and FlashBlade. We are expanding this list to include other major HDFS/S3 compatible storage solutions both on-premises and in the cloud.

Now let’s take a closer look at HDFS tiering.

HDFS tiering

HDFS tiering allows you to mount remote storage to your big data cluster and instantly gain access to the remote data from either Apache Spark™ or SQL Server, seamlessly.

When the mount command is issued, the mount credentials are used to authenticate to the remote storage and copy the remote file and directory metadata including permissions to the local HDFS. This operation is relatively quick since only metadata is copied. There is no data movement!

After completion of the mount operation, you gain immediate access to your remote data. On the first read operation, the data that was read will be cached locally by default. This means that subsequent reads of the same data will experience better performance since the data will be read from the local cache.

The default cache size is set to two percent of the total storage capacity in the local HDFS data lake and the cache for a specific mount will be emptied when a mount is refreshed or deleted.
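
The read-through behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not how HDFS tiering is implemented: the class MountCacheModel and its methods are hypothetical, an in-memory map stands in for the remote storage, and the model ignores eviction when the cache is full. The only figure taken from this post is the default cache size of two percent of local capacity.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, for illustration only) of a mounted remote store:
// the first read of a path goes to the remote store, later reads hit a local cache.
class MountCacheModel {
    private final Map<String, String> remote;                  // stands in for remote storage
    private final Map<String, String> cache = new HashMap<>(); // stands in for the local cache
    private int remoteReads = 0;

    MountCacheModel(Map<String, String> remote) {
        this.remote = remote;
    }

    // Reads a path, caching the data locally after the first remote read.
    String read(String path) {
        String cached = cache.get(path);
        if (cached != null) {
            return cached;                // served locally, no remote access
        }
        remoteReads++;                    // cache miss: fetch from the remote store
        String data = remote.get(path);
        if (data != null) {
            cache.put(path, data);
        }
        return data;
    }

    // Refreshing (or deleting) a mount empties its cache.
    void refresh() {
        cache.clear();
    }

    int remoteReads() {
        return remoteReads;
    }

    // Default cache size: two percent of the total local HDFS capacity.
    static long defaultCacheBytes(long totalLocalCapacityBytes) {
        return totalLocalCapacityBytes / 50;
    }

    public static void main(String[] args) {
        // The path and contents are made up for the example.
        MountCacheModel mount = new MountCacheModel(Map.of("/mounts/sales/2019.csv", "id,amount"));
        mount.read("/mounts/sales/2019.csv"); // first read goes to the remote store
        mount.read("/mounts/sales/2019.csv"); // second read is served from the cache
        System.out.println("remote reads: " + mount.remoteReads());
    }
}
```

The model shows why repeated reads of the same data get faster and why a refresh makes the next read of each file slow again.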

Create a mount with a single command

Creating an HDFS tiering mount in SQL Server Big Data Clusters can be done with one command:

azdata bdc hdfs mount create --remote-uri --mount-path
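
Each of the two options takes a value: the URI of the remote storage and the local HDFS path where it should appear. As a rough sketch, the argument list could be assembled from Java as shown below. The MountCommand class is hypothetical, and the URI and path in main are placeholders rather than real locations. The sketch only builds and prints the command. Running it would additionally require azdata to be installed and logged in to the cluster, with mount credentials configured.

```java
import java.util.List;

// Hypothetical helper that assembles the azdata mount command as an argument list.
class MountCommand {

    // Builds the arguments for: azdata bdc hdfs mount create --remote-uri ... --mount-path ...
    static List<String> build(String remoteUri, String mountPath) {
        return List.of(
                "azdata", "bdc", "hdfs", "mount", "create",
                "--remote-uri", remoteUri,
                "--mount-path", mountPath);
    }

    public static void main(String[] args) {
        // Placeholder values: replace with your own remote URI and mount path.
        List<String> command = build(
                "abfs://<container>@<account>.dfs.core.windows.net/",
                "/mounts/<mount-name>");
        System.out.println(String.join(" ", command));
        // To execute instead of print (assumes azdata is on the PATH):
        // new ProcessBuilder(command).inheritIO().start();
    }
}
```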

Watch this video for a demo of how HDFS tiering can be used in SQL Server Big Data Clusters.

It has never been this easy to gain instant access to remote data and limitless storage in the cloud from your local big data cluster. However, ease of use is not the only value gain with HDFS tiering:

Save costs and reduce data movement

Instead of copying large amounts of data from one data lake to another and maintaining additional integration pipelines for data movement, HDFS tiering allows you to leave the data in cheaper object stores and get faster turnaround time with on-demand reads and caching.

Secure sharing of big data

HDFS tiering makes it easier to securely share your organization’s big data across teams to ensure you get the most value out of your data. Upon mounting, the remote permissions are copied to your local data lake, which means that the remote permissions will always be honored every time the remote data is accessed. In addition to this, HDFS tiering supports secure mount operations using OAuth access keys to authenticate to the remote data source. Azure Active Directory support for mounting against Kerberos and Azure Active Directory joined data sources is coming soon.

Portability across compute engines

Analyzing all your data across different data lakes provides the freedom to use the compute engine that best fits a given use case. In the big data cluster, you can use SQL Server and Apache Spark™ out of the box for your data processing and analysis. HDFS tiering enables both compute engines to process data in your local and mounted data lakes seamlessly.

Join our customers and experience the benefits of HDFS tiering yourself.

“HDFS tiering has saved us lots of time and money in development costs. We have lots of data stored in Azure Data Lake Storage Gen2. With HDFS tiering we can simply mount to the data in those locations without having to create and maintain a separate integration process.” – Lance Milton, Application Management Advisor – Data Integration at ENGIE North America

To learn more about how you can unify your data lakes with HDFS tiering in SQL Server Big Data Clusters, please visit the HDFS tiering documentation. And if you’re interested in the technical details of how we built this new dynamic mounting functionality in HDFS, we encourage you to read more on the Jira page.

The post Unify your data lakes with HDFS tiering in SQL Server Big Data Clusters appeared first on Microsoft SQL Server Blog.