Open Data Collaboration and Sharing

Advancing open data

Microsoft aims to close the data divide and help organizations of all sizes to realize the benefits of data and the new technologies it powers.

Play the video About the campaign
Data for Society

View a collection of open datasets from Microsoft and how they're being used to address societal challenges.

Explore the catalog
Industry Data for Society Partnership

The partnership is committed to making private sector data more open and accessible to address societal challenges.

Learn about the partnership
Open Data for Social Impact Framework

The Open Data for Social Impact Framework is a tool to help leaders put data to work to solve pressing societal issues.

Explore the framework

Our approach

We believe everyone can benefit from opening, sharing, and collaborating around data to make better decisions, improve efficiency, and help tackle some of the world’s most pressing societal challenges.

Set data collaboration principles

We adopted five principles to guide our participation in data collaborations: open, usable, empowering, secure, and private. These principles underpin our participation, and we hope other organizations can build on them to share their data responsibly.

Learn about our principles Building on a year of open data

Engage partnerships and explore projects

We believe success will depend on building deep collaborations with others from industry, government, and civil society around the world. This includes work with leading organizations in the open data movement, such as the Open Data Institute and The GovLab at New York University.

About the Open Data Institute Explore the Open Data Policy Lab

Make data sharing easier

We're committed to investing in the essential assets that will make data sharing easier, including the necessary tools, frameworks, and templates. This is especially important when it comes to opening and collaborating around data to solve important societal issues.

Explore the Data for Society resource center Read the Open Data for Social Impact Framework

Closing the data divide

Access to data is a big challenge. The benefits for organizations of all sizes and the broader community are significant if we can work together to make progress on open data.

Aerial view of people walking on a street.

Industry Data for Society Partnership

Working across industry to make private sector data more open and accessible for societal good.

Learn about the partnership

Line drawing of a neighborhood connected to a cloud with lines.

The open data opportunity

The importance of data sharing, explained

Watch the video

Open data stories

Stories of open data and data sharing driving change

Read the stories

Equity and inclusion Sustainability Health

Colorful banknotes from different countries.

BankNote-Net

Worldwide, millions of people have low or no vision. BankNote-Net is an open dataset for assistive, universal currency recognition, created to help people with low or no vision identify currency in daily tasks.

Explore BankNote-Net on GitHub

Map of the United States color coded by broadband usage per county.

United States broadband usage dataset

Broadband internet access is critical to providing communities with education, employment, and telecare. The broadband usage percentages dataset shows broadband access at the US county level to help address gaps in service availability.

Explore broadband data on GitHub

MS-ASL American Sign Language (ASL) dataset

In the US, over 500,000 people use ASL for communication. This ASL dataset of over 25,000 annotated videos with sign and action recognition can help researchers build machine learning models to advance sign language recognition.

Explore the ASL project

Images of hands in different positions shown on a monitor.

Tagged hands dataset

Development of a rich hand-gesture-based interface is currently a tedious process. This dataset of 3,500 labeled depth frames of various hand poses and 140 gesture clips helps enable easy development of a gesture-based interface.

Explore the hand gestures project

Collage of drawings developing progressively.

Generative Neural Visual Artist (GeNeVA)

Intelligent systems can generate images and video for a range of applications, from education to accessibility. This dataset has sequences of images, associated instructions and linguistic feedback, and a modified version of the Compositional Language and Elementary Visual Reasoning (CLEVR) dataset.

Explore the GeNeVA project Read GeNeVA publication

Learning from analog pen use to improve digital ink experiences

To help researchers understand the gaps between analog and digital pens and improve digital experiences, this dataset contains 493 entries from a diary study with 26 participants using analog pens and 178 entries from 30 participants using digital pens.

Read pen use publication

Collage of colorful question marks on speech bubbles.

Microsoft Machine Reading Comprehension (MS MARCO)

AI and automated assistants need strong machine reading comprehension (MRC) and question answering (QA) capabilities to understand real-world dialog. This dataset contains 1,010,916 questions and 182,669 answers to improve QA and MRC.

Explore the MS MARCO project Read MS MARCO publication

Graphic of "Online risks experienced vary by gender: Total, 2016-2021 average".

Digital Civility Gender Equality Dataset

Microsoft recognizes the importance of advocating for and advancing the release of gender-disaggregated data to realize gender equality and to close the data divide. This dataset can be leveraged by researchers and organizations to advance better gender data policies and solutions.

Explore gender equality dataset on GitHub

Solar farms mapping

The solar farms mapping data can help researchers identify factors driving land suitability for solar projects and help public agencies better plan siting of solar energy development in India.

Explore solar farms on GitHub Read solar farms publication

HKH glacier mapping

Glacier mapping is key to ecological monitoring in the Hindu Kush Himalaya (HKH) region, where climate change poses a risk to those dependent on the health of glacier ecosystems. The HKH glacier mapping dataset includes imagery with locations of glaciers.

Explore glacier mapping on LILA BC Read HKH glacier publication

Satellite view of a city with color coding to show land cover.

Chesapeake land cover

The Chesapeake Conservancy created a land cover dataset for conservation efforts. The same data, which contains high-resolution aerial imagery and land cover labels, can be used to train ML models to map land cover over an even wider area.

Explore Chesapeake land cover data on LILA BC Read land cover publication

Concentrated Animal Feeding Operations (CAFO)

The poultry CAFO GitHub repository contains US-wide datasets of predicted poultry barn locations to help researchers identify CAFOs for conservation groups to address water and air quality issues.

Explore CAFO data on GitHub Read CAFO publication

A multicolored aerial view of an urban area.

TorchGeo

TorchGeo is a PyTorch domain library that includes several geospatial benchmark datasets, such as CDL, Landsat7, and Landsat8, to help support research tasks like image classification, semantic segmentation, object detection, instance segmentation, change detection, and more.

Explore TorchGeo on GitHub Read TorchGeo publication

World map showing spread analysis by country/region.

Bing COVID-19 data

Bing COVID-19 data includes confirmed, fatal, and recovered cases from all regions, updated daily from multiple reliable sources. This data is reflected in the Bing COVID-19 Tracker.

Explore Bing COVID-19 data

Scientist holding DNA gel in laboratory.

NCI-PID-PubMed Genomics KB

The NCI-PID-PubMed Genomics Knowledge Base Completion Dataset is derived from the National Cancer Institute Pathway Interaction Database and contains textual mentions extracted from co-occurring pairs of genes in PubMed abstracts, to help support the cancer research community and others interested in cellular pathways.

Explore the NCI-PID data Read NCI-PID publication

Athletes working out inside and outside a gym.

Exercise recognition from wearable sensors

Exercise is an important part of maintaining good health. This dataset contains accelerometer and gyroscope recordings from over 200 participants performing various gym exercises that can be leveraged by researchers developing exercise devices.

Explore exercise data project Read exercise data publication

Microsoft Nonprofit Innovation Hub

The Nonprofit Innovation Hub is an open-source GitHub repository with lightweight solutions that enable nonprofits to innovate.

Check out the repository

Legal frameworks

Data sharing agreements can take months to draw up, oftentimes deterring organizations from sharing data at all. As a first step toward building better processes and tools, we're sharing a set of data agreements to govern the sharing of data, particularly in the context of training AI models.

CDLA Permissive 2.0

The Community Data License Agreement (CDLA) Permissive 2.0 is an open data agreement designed to make it easier to share and collaborate with open data.

Read the CDLA Get more details

C-UDA 1.0

The Computational Use of Data Agreement (C-UDA) 1.0 is intended for use with datasets that may include material not owned by the data provider but that may have been assembled lawfully from publicly accessible sources.

Read the C-UDA See the annotated agreement Find the agreement on GitHub

DUA-OAI

The Data Use Agreement for Open AI Model Development (DUA-OAI) provides terms to govern the sharing of data by an organization with another for the purpose of allowing that second organization to use the data to train an AI model, where the trained model is open sourced.

Read the DUA-OAI Find the annotated agreement Get the details

DUA-DC

The Data Use Agreement for Data Commons (DUA-DC) can be used by multiple parties who want to share data through a common database that is accessible via an application programming interface (API).

Read the DUA-DC Get the annotated agreement Find out more

Capabilities

Learn more about the tools and practices we employ to enable more secure and streamlined access to data.

Differential privacy

Differential privacy introduces statistical noise (slight alterations) to mask datasets and protect the privacy of individuals.
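
To make the idea of statistical noise concrete, here is a minimal sketch of one common textbook technique, the Laplace mechanism, applied to a simple count. It is an illustration only and is not taken from any Microsoft tool or library; the function names (laplace_noise, private_count), the sample data, and the epsilon values are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw zero-mean Laplace noise with the given scale.

    The difference of two independent exponential samples with
    mean `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()  # illustration only; not a cryptographic source
    ages = [23, 35, 41, 52, 67]  # hypothetical sample data
    # Smaller epsilon means more noise and stronger privacy.
    print(private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng))
```

In this sketch, a smaller epsilon widens the noise and makes any one person's presence harder to infer, at the cost of a less accurate count. A production system would also need a secure randomness source and a way to track the total privacy budget spent across queries.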

Learn about differential privacy

Azure confidential computing

Confidential computing helps to protect sensitive data in the cloud through data-in-use encryption, which adds protection for your data while it's being processed.

Read about Azure confidential computing

Azure Open Datasets

A curated collection of publicly available datasets that are ready to use in machine learning workflows and easy to access from Azure services.

Review the Azure Open Datasets

Researcher tools

Explore a collection of datasets, code, and models from Microsoft Research for the broader academic community to advance state-of-the-art research across all disciplines.

Explore researcher tools
