{"id":15102,"date":"2016-03-10T08:30:59","date_gmt":"2016-03-10T16:30:59","guid":{"rendered":"https:\/\/blogs.technet.microsoft.com\/dataplatforminsider\/?p=15102"},"modified":"2024-01-22T22:50:25","modified_gmt":"2024-01-23T06:50:25","slug":"mapping-the-universe-with-sql-server","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/sql-server\/blog\/2016\/03\/10\/mapping-the-universe-with-sql-server\/","title":{"rendered":"Mapping the Universe with SQL Server"},"content":{"rendered":"
This blog post was co-authored by Joseph Sirosh, Corporate Vice President, and Rimma V. Nehme, Principal Software Engineer, at the Data Group at Microsoft.<\/em><\/p>\n \u00a0<\/p>\n Figure 1: Visible objects of the Sloan Digital Sky Survey (SDSS) DR7 dataset.<\/span><\/em><\/p>\n Over the last 15 years a database helped revolutionize an entire field of science. Astronomical discovery and sophisticated analyses of properties of the aggregate universe was turbocharged by a vast public mapping effort of the sky, called the Sloan Digital Sky Survey, whose data was served in a public database built with Microsoft SQL Server. This was the first in the field and opened up an entirely new window into the Universe.<\/p>\n Figure 2: The Fourth Paradigm: Data-Intensive Scientific Discovery book dedicated to Jim Gray.<\/em><\/span><\/p>\n Now scientists in every field, from astronomy to zoology, are recognizing that the rate of accumulation of data in their fields are greatly outstripping the rate of accumulation of interpretation, i.e. the rate at which the scientific community can assimilate data into an interpretive framework. And there\u2019s widespread recognition that powerful scientific discoveries lie hidden in such massive data. The Fourth Paradigm of scientific discovery, driven by novel techniques for analyzing massive data, is a driving force in science like never before.<\/p>\n It all started in the early 90\u2019s when Dr. Alex Szalay together with the late Dr. Jim Gray took on a daring endeavor to build what could be called the first \u201cDataScope\u201d \u2013 an efficient data intensive computing infrastructure for astronomers called the Sloan Digital Sky Survey (SDSS)<\/a> using Microsoft SQL Server<\/a> as the back-end database.<\/p>\n Figure 3: Jim Gray, Alex Szalay and other astronomers at Super Computing 2003<\/em><\/span><\/p>\n SDSS had a bold goal \u2013 to create a map of the universe in a database for exploration by all. It is often referred to as the Cosmic Genome Project. A dedicated 2.5-m-diameter telescope in New Mexico used a 120-megapixel camera to image more than one-quarter of the entire night sky, 1.5 square degrees of sky at a time, about eight times the area of the full Moon, both inside and outside of the Milky Way, and helped create a three-dimensional (3D) map of millions of galaxies, quasars and stars.<\/p>\n The SDSS maps sparked a revolution in the way astronomy is practiced. No longer did scientists have to wait months for access to a telescope to learn about the night sky; instead, entire research projects could be accomplished by querying the online database. The SDSS made its entire data set available through SkyServer<\/a> database – an online portal for public use,\u00a0 and invited volunteer contributions to scienti\ufb01c research. Prior to SDSS, only the leading scientists and astronomers had telescopes and instruments to collect data for serious research, with most others largely excluded from direct and active engagement with astronomy. Now, with access to the visual data that SkyServer offers, anyone with Internet access could explore the universe with data just as the top scientists do.<\/p>\n Figure 4: SDSS-IV can view the whole Milky Way<\/em><\/span><\/p>\n SkyServer’s architecture was fairly simple to start with: a front-end IIS web server accepted HTTP requests processed by JavaScript Active Server Pages (ASP). These scripts used Active Data Objects (ADO) to query the backend Microsoft SQL Server database. SQL Server returned record sets that the JavaScript formatted into pages. The website was about 40,000 lines of code and was originally built by two people as a spare-time activity.<\/p>\n While building applications to study the correlation properties of galaxies, Szalay and his team have discovered that many of the patterns in their statistical analysis involved tasks that were much better performed inside the database engine than outside, on flat files. The Microsoft SQL Server gave them high-speed sequential search of complex predicates using multiple CPUs, multiple disks and large main memories. It also had sophisticated indexing and data joining algorithms far outperforming hand-written programs against flat files. Many of the multi-day batch files were replaced with database queries that ran in minutes thanks to the sophisticated query optimizer.<\/p>\n <\/a> The most recent version of the database has a 15TB queryable public dataset, with about 150TB additional raw and calibrated files. A recent scan of the logs showed more than 1.6 billion web hits in the past 14 years and more than four million distinct IP addresses accessing the site. The total number of professional astronomers worldwide is only about 15,000. Furthermore, the multiuser collaborative environment in SDSS called CasJobs<\/a> which allows users to launch extensive analyses has more than 6,820 registered users – almost half of the professional astronomy community.<\/p>\n SDSS has been successful in generating new scienti\ufb01c discoveries, including the measurements of thousands of asteroids, maps of the complicated merger history of the outer Milky Way, and the \ufb01rst detection of the baryon acoustic peak – a measurement of how structure formed from ultra-low frequency standing sound waves in the early universe. These surveys have produced data to support 5,800 papers with more than 245,000 citations.\u00a0 This has made SDSS one of the highest impact projects in the field of astronomy.<\/p>\n The amount of astronomical data in SkyServer is truly unprecedented. When the SDSS began in 1998, astronomers had data for less than 200,000 galaxies. Within five years after SDSS began, SkyServer had data on 200 million galaxies in the database. Today, the SDSS data exceeds 150 terabytes, covering more than 220 million galaxies and 260 million stars. The images alone include 2.5 trillion pixels of original raw data. SkyServer allows users to search for stars at a given position in the sky, or they can search for galaxies brighter than a certain limit. Users can also enter queries to the database in SQL directly, which allows more \ufb02exible and sophisticated searches.<\/p>\n Examples of queries users can ask in SkyServer:<\/p>\n Figure 6: SkyServer portal<\/span><\/em><\/p>\n Another project that SDSS data access has enabled is a \u201ccitizen science\u201d website, called Galaxy Zoo<\/a>, where Internet volunteers have classified galaxies using SDSS images. Typically, astronomers used to classify galaxies by eye. If you have 200 million galaxies, on average at three per minute, classification would take 600 million minutes or 1142 years of 24 hours per day, seven days per week. Galaxy Zoo was the first astronomy crowdsourcing portal which allowed private citizens to look at data by eye, and contribute classifications to scientists in a much shorter time.<\/p>\n Figure 7: Hanny’s Voorwerp. The mass (shown here in green) is a new cosmic object discovered by a Dutch school teacher, an astronomy novice, while using Galaxy Zoo.<\/span><\/em><\/p>\n There have been a number of scientific discoveries using Galaxy Zoo including determination of the relation between the morphology of galaxies and their environment and the discovery by a Dutch school teacher of Hanny\u2019s Voorwerp \u2013 a very rare type of astronomical object called a quasar ionization echo. These discoveries would not have been possible without the participation of thousands of Galaxy Zoo volunteers – between them, they have visually classi\ufb01ed over 40 million galaxies to date.<\/p>\n A new effort called SciServer<\/a>, a descendant from SkyServer, aims to go beyond astronomy and build a long-term, flexible ecosystem for scientists to provide access to the enormous data sets from observations and simulation to enable collaborative research. SciServer aims to meet the challenges of Big Data in scientific world. By building a common infrastructure, the goal is to create data access and analysis tools useful to all areas of science. Led by Alex Szalay, the work on SciServer will deliver significant benefits to the scientific community by extending the infrastructure developed for SDSS astronomy data to many other areas of science.<\/p>\n Figure 8: SciServer: A collaborative research environment for large-scale data-driven science.<\/span><\/em><\/p>\n The approach in designing SciServer is the same as in the SkyServer: bring the analysis to the data. This means that scientists can search and analyze Big Data without downloading terabytes or petabytes of data, resulting in much faster processing times. Bringing analysis to data also makes it much easier to compare and combine datasets allowing researchers to discover new and surprising connections between data and make experiments more reproducible.<\/p>\n To help ease the burden on researchers, the team developed SciDrive<\/a>, a cloud data storage system for scientific data that allows scientists to upload and share data using a Dropbox-like interface. The interface automatically reads the data into a database, and one can search online and cross-correlate with other data sources. SciDrive tries to address the \u201clong tail\u201d of a huge number of small data sets that scientists have. The goal is to try bring many small, seemingly unrelated data to a single place and see if new value emerges. People can simply drag and drop (and share) their data without any metadata required.<\/p>\n SDSS team in collaboration with Jim Gray took on the enormous task of putting all of the astronomy data into SQL Server database, preserving as much provenance as possible, and making the data as accessible and query-able as possible.<\/p>\n Figure 9: Dr. Alex Szalay<\/span><\/em><\/p>\n The processed image data were stored in databases. The logical database design consisted of photographic<\/em> and spectrographic<\/em> objects. They were organized into a pair of snowflake schemas. Sub-setting views and many indices gave convenient and fast access to the conventional subsets (such as stars and galaxies). Procedures and indices were defined to make spatial lookups even more convenient and faster.<\/p>\n SkyServer initially took a simple approach to database design (see Figure 11 below) and it worked right from the beginning. The design counted on the SQL storage engine and the query optimizer to make all the intelligent decisions about data layout and data access. As Alex Szalay put it: \u201cGreat query optimizer made all the difference. Even \u2018the worst\u2019 query plans were actually quite good!\u201d<\/p>\n Figure 11: The photoObj table at left is the center of one star schema describing photographic objects. The specObj table at right is the center of a star schema describing spectrograms and the extracted spectral lines. The photoObj and specObj tables are joined by objectId. Not shown are the dataConstants table that names the photoObj flags and tables that support web access and data loading.<\/span><\/em><\/p>\n To speed up the access, the base tables were heavily indexed (these indices also benefited view access). In addition to the indices, the database design includes a fairly complete set of foreign key declarations to insure that every profile has an object; every object is within a valid field, and so on. The design also insisted that all fields were non-null.\u00a0 These integrity constraints were invaluable tools in detecting errors during loading and they aided tools that automatically navigated the databases.<\/p>\n Beyond the file group striping (to automatically get the sum of the disk bandwidths without any special user effort), SkyServer used, for the most part, all of the SQL Server default values; there was not much special tuning. This is the hallmark of SQL Server \u2013 the system aims to have the out-of-the box performance to be great, and the SkyServer project has been a true testimonial to that goal.<\/p>\n Figure 12: Hierarchical triangular mesh<\/span><\/em><\/p>\n “Spatial was special.\u201d<\/em> Astronomers are particularly interested in executing spatial queries to obtain galactic clustering and large-scale structure of the universe. The common theme in SDSS experience was that it was possible to embed spatial concepts in a relational framework in a very simple manner. To make spatial area queries run quickly, SDSS team integrated the hierarchical triangular mesh (HTM)<\/em> code with the SQL Server, which became a new \u201cspatial access method\u201d in the engine. HTM is a method to subdivide the surface of a sphere into spherical triangles of similar, but not identical, shapes and sizes. It is basically a quad-tree that is particularly good at supporting searches at different resolutions, from arc seconds to hemispheres. The HTM library was an external stored procedure wrapped in a table-valued stored procedure spHTM_Cover(<area>).<\/p>\n So all the users had to do was to simply invoke the procedure call similar to this:\u00a0 select * from spHTM_Cover(\u2018Circle J2000 12 5.5 60.2 1\u2019) which would return the table with four rows, each row defining the start and end of a 12-deep HTM triangle like below.<\/p>\nSloan Digital Sky Survey: The Cosmic Genome Project<\/h2>\n
Why Microsoft SQL Server?<\/h2>\n
Impact<\/h2>\n
\nFigure 5: Dr. Jim Gray in front of the Sloan telescope in Apache Point, NM<\/span><\/em><\/p>\nSkyServer data<\/h2>\n
\n
Galaxy Zoo<\/h2>\n
From SkyServer to SciServer: Big Data infrastructure for science<\/h2>\n
In the heart of it all is SQL Server<\/h2>\n
Database logical design<\/h2>\n
Database physical design<\/h2>\n
\u201cIndexing the Sky\u201d<\/h2>\n
Spatial data access<\/h2>\n