{"id":1023,"date":"2014-06-02T09:00:00","date_gmt":"2014-06-02T16:00:00","guid":{"rendered":"https:\/\/blogs.technet.microsoft.com\/dataplatforminsider\/2014\/06\/02\/polybase-in-aps-yet-another-sql-over-hadoop-solution\/"},"modified":"2024-01-22T22:48:53","modified_gmt":"2024-01-23T06:48:53","slug":"polybase-in-aps-yet-another-sql-over-hadoop-solution","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/sql-server\/blog\/2014\/06\/02\/polybase-in-aps-yet-another-sql-over-hadoop-solution\/","title":{"rendered":"PolyBase in APS – Yet another SQL over Hadoop solution?"},"content":{"rendered":"
This blog post highlights PolyBase\u2019s truly unique approach, focusing on:<\/p>\n
In the recent past, various SQL over Hadoop\/HDFS solutions have been developed, such as Impala, HAWQ, Stinger, SQL-H, and Hadapt, to name just a few. While there are clear technical differences between the various solutions, at a high level they are similar in offering a SQL-like front end over data stored in HDFS.<\/p>\n
So, is PolyBase yet another similar solution competing with these approaches? The answer is yes and no. At first glance, PolyBase is a T-SQL front end that allows customers to query data stored in HDFS. However, with the recently announced Analytics Platform System (APS)<\/a>, we have updated PolyBase with new syntax to highlight our extensible approach. With PolyBase, we bring various Microsoft data management services together and allow appliance users to leverage a variety of Azure services. This enables a new class of hybrid scenarios <\/b>and reflects the evolution of PolyBase into a true multi-data source query engine. It allows users to query their big data \u2013 regardless of whether it is stored in an on-premises Hadoop\/HDFS cluster, Azure storage, Parallel Data Warehouse, or other relational DBMS systems (support coming in a future PolyBase release).<\/p>\n Complete Data Platform with PolyBase as key integrative component<\/i><\/p>\n Openness <\/b><\/p>\n One key differentiator of PolyBase compared to all of the existing competitive approaches is \u2018openness\u2019<\/b>. We do not force <\/b>users to decide on a single solution, as some Hadoop providers do. 
With PolyBase, you have the freedom to use an HDInsight region as a part of your APS appliance, to query an external Hadoop cluster connected to APS, or to leverage Azure services from your APS appliance (such as HDInsight on Azure).<\/p>\n To achieve this openness, PolyBase offers three building blocks.<\/p>\n Building blocks for PolyBase<\/i><\/p>\n<\/p>\n The syntax for using PolyBase is simple and follows familiar T-SQL language constructs.<\/p>\n T-SQL for creating external data sources <\/b>(Azure, external Hadoop cluster, HDI region)<\/p>\n T-SQL for creating external file formats <\/b>(delimited text files and Hive RCFiles)<\/p>\n T-SQL for creating external tables <\/b>(for Azure, external Hadoop cluster, HDI region)<\/b><\/p>\n A user can now create statistics for each of the external tables shown above to improve query performance. We extended SQL Server\u2019s mature statistics framework to work against external tables in the same way it works against regular tables. Statistics are crucial for the PolyBase query engine to generate optimal execution plans and to decide when pushing computation into the external data source is beneficial.<\/p>\n Performance<\/b><\/p>\n While other SQL over Hadoop solutions (e.g. Impala, Stinger, and HAWQ) have improved, they still cannot match the query performance of a mature relational MPP system<\/b>. With PolyBase, the user can import data into PDW in a very simple fashion (through a CTAS statement, see below), use the fast SQL Server column store technology along with the MPP architecture, or let the PDW\/PolyBase query optimizer decide which parts of the query get executed in Hadoop and which parts in PDW. This optimized querying, called split-based query processing, allows parts of the query to be executed as Hadoop MapReduce (MR) jobs that are generated on the fly, completely transparently to the end user. 
In doing so, the PolyBase query optimizer takes into account parameters such as the spin-up time for MR jobs and the generated statistics to determine the optimal query plan.<\/p>\n In general, when it comes to performance, the answer is usually \u2018it depends on the actual use case\/query\u2019. With PolyBase, the user has total freedom<\/b> and can leverage the capabilities of PDW and\/or Hadoop based on their actual needs and application requirements.<\/p>\n PolyBase in APS bridging the gap between the relational world, Hadoop (external or internal) and Azure<\/i><\/p>\n The T-SQL statement below will run across all data sources, combining structured appliance data with un\/semi-structured data in an external Hadoop cluster, an internal HDInsight region, and Azure (e.g. historical data) \u2013<\/p>\n T-SQL SELECT querying external Hadoop, HDInsight & PDW regions, and Azure<\/b><\/p>\n This query example shows how simplicity and performance are combined at the same time. It references three external tables pointing to three different locations plus one regular (distributed) PDW table. While executing the query, the PolyBase\/PDW query engine will decide, based on the statistics, whether or not to push computation to the external data source (i.e. Hadoop).<\/p>\n Rewriting & Migrating existing applications<\/b><\/p>\n Finally, you may have heard that Hadoop is \u2018cheaper\u2019 than more mature MPP DBMS systems. However, what you might not have heard about is the cost associated with rewriting existing applications<\/b> and ensuring continued tool support. This goes beyond simple demos showing that tool \u2018xyz\u2019 works on top of Hadoop\/HDFS.<\/p>\n PolyBase does not require you to download and install different drivers. The beauty of our approach is that external tables appear like regular tables in your tool of choice. The information about the external data sources and file formats is abstracted away. 
Many Hadoop-only solutions are not fully ANSI SQL compliant and do not support various SQL constructs. With PolyBase, however, you don\u2019t need to rewrite your apps because it uses T-SQL and preserves its semantics. This is especially relevant for users coming from a \u2018non-Java\/non-Hadoop world\u2019. You can explore and visualize your data sets either by using the Microsoft BI solutions (initiated on-premises or through corresponding Azure services) or by using the visualization tool of your choice. PolyBase keeps the user experience the same.\u00a0<\/p>\n It\u2019s already a painful reality that many enterprises store and maintain data in different systems that are optimized for different workloads and applications, respectively. Admins spend much of their time moving, organizing, and keeping data in sync.<\/b> This reality imposes another key challenge which we address with PolyBase \u2013 in addition to querying data in external data sources, a user can achieve simpler and more performant ETL (extract, transform, load). Unlike existing connector technologies such as Sqoop, a PolyBase user can use T-SQL statements to either import data from external data sources (CTAS) or export data to external data sources (CETAS).<\/p>\n T-SQL CETAS statement to age out Hadoop & PDW data to Azure<\/b><\/p>\n Combines data from external Hadoop and PDW sources and stores the results in Azure<\/i><\/p>\n Under the covers, the PolyBase query engine not only leverages the parallelism of an MPP system, it also pushes computation to the external data source to reduce the data volume that needs to be moved. The entire procedure remains completely transparent to the user while ensuring a very fast import & export of data that greatly outperforms any connector technology offered today. With the CTAS statement, a user can import data into the relational PDW region, where it is stored as a column store. 
This way, users can immediately leverage the column store technology in APS without any further action.<\/p>\n T-SQL CTAS statement for importing Hadoop data into PDW<\/b><\/p>\n Combines PolyBase with column store \u2013 Imports data from Hadoop into PDW CCI tables<\/i><\/p>\n In summary, PolyBase is more than just another T-SQL front end over Hadoop. It has evolved into a key integrative component that allows users to query, in a simple fashion, data stored in heterogeneous data stores. There is no need to maintain separate import\/export utilities. PolyBase ensures great performance by leveraging the computation power available in external data sources. Finally, the user has freedom in almost every dimension \u2013 whether it\u2019s tuning the system to get the best performance, choosing the tools of their choice to derive valuable insights, or leveraging data assets stored both on-premises and within the Azure data platform.<\/p>\n Watch how APS seamlessly integrates data of all sizes and types here<\/a><\/p>\n
<\/a><\/p>\n\n2.\u00a0\u00a0\u00a0 <\/b>Freedom of Choice <\/b>
\n<\/h2>\n
<\/a><\/p>\n\n
CREATE EXTERNAL DATA SOURCE Azure_DS<\/b> WITH (TYPE = Hadoop, LOCATION = 'wasbs:\/\/<youraccount>.blob.core.windows.net');<\/code><\/li>\nCREATE EXTERNAL DATA SOURCE External_HDP_DS<\/b> WITH (TYPE = Hadoop, LOCATION = 'hdfs:\/\/\u2026', JOB_TRACKER_LOCATION = '\u2026');<\/code><\/li>\nCREATE EXTERNAL DATA SOURCE HDI_DS<\/b> WITH (TYPE = Hadoop, LOCATION = 'hdfs:\/\/\u2026', JOB_TRACKER_LOCATION = '\u2026');<\/code><\/li>\n<\/ul>\n\n
CREATE EXTERNAL FILE FORMAT DelimText1<\/b> WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));<\/code><\/li>\nCREATE EXTERNAL FILE FORMAT DelimText2<\/b> WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = '|', USE_TYPE_DEFAULT = TRUE));<\/code><\/li>\nCREATE EXTERNAL FILE FORMAT HiveRC<\/b> WITH (FORMAT_TYPE = RCFILE, SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe');<\/code><\/li>\n<\/ul>\n\n
CREATE EXTERNAL TABLE [dbo].[Old_SensorData_Azure] (\u2026)
WITH (LOCATION='\/\/Sensor_Data\/May2009\/sensordata.tbl', DATA_SOURCE = Azure_DS, FILE_FORMAT = DelimText2, REJECT_TYPE = VALUE, REJECT_VALUE = 100)<\/code><\/li>\nCREATE EXTERNAL TABLE [dbo].[SensorData_ExternalHDP] (\u2026)
WITH (LOCATION='\/\/Sensor_Data\/May2014\/sensordata.tbl', DATA_SOURCE = External_HDP_DS, FILE_FORMAT = DelimText1, REJECT_TYPE = VALUE, REJECT_VALUE = 0)<\/code><\/li>\nCREATE EXTERNAL TABLE [dbo].[SensorData_HDI] (\u2026)
WITH (LOCATION='\/\/Sensor_Data\/Year2013', DATA_SOURCE = HDI_DS, FILE_FORMAT = HiveRC)<\/code><\/li>\n<\/ul>\n
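A user can then create statistics on these external tables using the familiar CREATE STATISTICS syntax. A minimal sketch (the statistics names are illustrative; the columns are the join and filter columns used in the query example):<\/p>\n<pre>-- Statistics on an external table's join column and filter column\nCREATE STATISTICS MachineKey_Stats ON [dbo].[SensorData_HDI] (MachineKey)\nCREATE STATISTICS Temperature_Stats ON [dbo].[SensorData_HDI] (Temperature)<\/pre>\n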
<\/a><\/p>\n<\/p>\nSELECT machine_name, Machine_Information_PDW.location
FROM Machine_Information_PDW, Old_SensorData_Azure, SensorData_HDI, SensorData_ExternalHDP\nWHERE Machine_Information_PDW.MachineKey = Old_SensorData_Azure.MachineKey\nand Machine_Information_PDW.MachineKey = SensorData_HDI.MachineKey\nand Machine_Information_PDW.MachineKey = SensorData_ExternalHDP.MachineKey\nand SensorData_HDI.Temperature > 80 and Old_SensorData_Azure.Temperature > 80\nand SensorData_ExternalHDP.Temperature > 80<\/pre>\n\n3.\u00a0\u00a0\u00a0 <\/b>Simplified ETL & Fast Insights <\/b>
\n<\/h2>\nCREATE EXTERNAL TABLE Old_Data_2008_Azure\nWITH (LOCATION='\/\/Sensor_Data\/2008\/sensordata.tbl', DATA_SOURCE=Azure_DS, FILE_FORMAT=DelimText2)\nAS SELECT T1.* FROM Machine_Information_PDW T1 JOIN SensorData_ExternalHDP T2\nON (T1.MachineKey = T2.MachineKey) WHERE T2.YearMeasured = 2008<\/pre>\n
CREATE TABLE Hot_Machines_2011 WITH (DISTRIBUTION = HASH(MachineKey),\nCLUSTERED COLUMNSTORE INDEX)\nAS SELECT * FROM SensorData_HDI WHERE SensorData_HDI.YearMeasured = 2011 AND SensorData_HDI.Temperature > 150<\/pre>\n
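Once imported, the Hot_Machines_2011 table above is a regular distributed PDW table backed by a clustered columnstore index, so it can be queried directly without touching Hadoop again. A minimal sketch (the aggregate shown is illustrative):<\/p>\n<pre>-- Query the imported columnstore table like any other PDW table\nSELECT MachineKey, MAX(Temperature) AS MaxTemperature\nFROM Hot_Machines_2011\nGROUP BY MachineKey<\/pre>\n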