{"id":430839,"date":"2017-10-05T11:03:40","date_gmt":"2017-10-05T18:03:40","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=430839"},"modified":"2020-03-13T17:08:00","modified_gmt":"2020-03-14T00:08:00","slug":"device-ml-ambient-aware-applications","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/device-ml-ambient-aware-applications\/","title":{"rendered":"On-device ML for Object and Activity Detection"},"content":{"rendered":"

\"drone,<\/h2>\n

Summary

Ambient-aware systems are omnipresent in drones, self-driving cars, body-worn cameras and AR/VR devices. These systems process environmental data to recognize objects, measure physical parameters and determine what actions to take. Intelligent cameras and microphones are key sensors that enable ambient-aware applications. For a good user experience, it is important that these sensors remain always-on and always-connected; thus, they need to collect, transmit and process data with low latency and minimal power consumption. Simple approaches that constantly sample and stream signals to the cloud for processing consume far too much power to be acceptable. One problem we have considered in this project is how to efficiently map machine-learning algorithms onto edge devices and optimize their communication mechanisms so that they can support always-on, always-connected computer vision for ambient-aware applications.

Project Overview

Fundamentally, ambient-aware applications rely on a multi-tiered sensing architecture, where data processing is split between the cloud and the device. Efficient arbitration between these resources is key to achieving low latency and low power consumption. For instance, consider the local frame-filtering approach illustrated in the figure below. Transmitting only selected video frames after lightweight on-device processing substantially lowers communication energy. This in turn extends battery life on the device (8.8 hours with local processing vs. 96 minutes without). Thus, in order to achieve a seamless ambient-aware experience, it is important to communicate as well as compute efficiently on the local device.

[Figure: the local frame-filtering architecture]
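
To make the arithmetic behind this concrete, here is a back-of-envelope battery-life model under a simple linear energy assumption. The battery capacity and per-frame energy constants are placeholders of our own, chosen only so that the output lands near the 96-minute and 8.8-hour figures quoted above; they are not measured values from the project.

```python
# Back-of-envelope model of the frame-filtering energy argument.
# All constants are illustrative placeholders, chosen so the output
# roughly reproduces the 96 min / 8.8 h figures quoted in the text.

BATTERY_J = 5000.0    # hypothetical battery capacity (joules)
E_TX = 0.0723         # hypothetical radio energy per transmitted frame (J)
E_FILTER = 0.0095     # hypothetical on-device filtering energy per frame (J)
FPS = 12              # frame rate used elsewhere in this project
INTERESTING = 0.05    # fraction of frames worth uploading

def battery_hours(energy_per_frame: float) -> float:
    """Battery life if every frame costs energy_per_frame joules."""
    return BATTERY_J / energy_per_frame / FPS / 3600.0

# Stream everything: every frame pays the radio cost.
stream_all = battery_hours(E_TX)
# Filter locally: every frame pays the (much cheaper) filter cost,
# but only the interesting fraction pays the radio cost.
filtered = battery_hours(E_FILTER + INTERESTING * E_TX)

print(f"stream all: {stream_all:.1f} h, filter locally: {filtered:.1f} h")
```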

In this project, we aim to address both of these issues. We have developed efficient implementations of ML algorithms so that they can be mapped to resource-constrained devices. We have also streamlined remote communications by relying on long-range radios and optimally-biased ML classifiers.

Project Details

To process data locally, we have accelerated ML computations via ASICs that incorporate efficient pipelining and parallelism techniques, and we have compressed ML models by scaling their bit precision. To lower communication overheads, we have looked at cascading ML algorithms through the network and developed prediction models that lower the contact frequency with the cloud. Our first candidate computing task for optimization has been object detection in video data for ambient awareness.

Efficient Local Computations

We have addressed the local computation issue via two approaches: architectural and algorithmic. In the architectural approach, we have built hardware that accelerates classification pipelines relying on simple features extracted from video frames. In the algorithmic approach, we have explored different ways of compressing ML models and adjusting their classification behavior at run time.

Architectural approaches: We have considered the problem of image classification, which is a fundamental building block in ambient-aware applications. Our on-device classification accelerator comprises feature-extraction and classification stages. The figure below is a block diagram of our feature extractor, which utilizes a proprietary systolic-array engine and a flexible interest-point detection algorithm. We have developed the hardware IP and tested it in a 45 nm SOI process, achieving a 37x speedup over CPU computations. Through various techniques, we have lowered the power consumption of this IP block to a mere 62.7 mW.

\"block<\/p>\n

Our feature-extraction core can process many video frames per second. Beyond image classification, its potential lies in multi-frame processing (MFP), and the IP core has found its way into the Windows Phone camera-processing pipeline, where it has proven useful for several MFP applications like real-time HDR imaging, panorama creation, de-noising, de-hazing and de-blurring. The module is extensively configurable, achieving multiple levels of data parallelism and interleaving through systolic arrays, and it supports 2-level vector-reduction operations depending on application demands. Compared to other platforms, the performance of this engine was better in terms of both energy efficiency and run time (see figure below). We are continuing to evangelize this technology (especially the hardware sub-systems) with the Azure and HoloLens product teams to accelerate structured ML algorithms like neural networks.

\"illustration<\/p>\n

Furthermore, we have augmented our feature-extraction core with lightweight classifiers, like those based on Fisher Vectors, to achieve minimum-energy operation at a frame rate of 12 fps while consuming a mere 3 mJ of energy per frame. Using a prototype system, we have estimated a 70% reduction in communication energy when 5% of the frames in a video stream are interesting. See our DATE 15 paper for more details. We have also looked at efficiently implementing other classifiers, like RBMs, on FPGAs. Thus, by exploiting the parallelism and computational-scaling properties of neural networks in hardware, we have demonstrated image classification at very low power. Our work is summarized in papers published at ISLPED 15 and TMSCS 16, and our hardware system prototypes have also motivated specs for emerging intelligent cameras.
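
As a reference point for the kind of lightweight classifier front end used here, the following is a standard, simplified Fisher Vector encoding (gradients with respect to the GMM means only, with power and L2 normalization). This follows the textbook formulation rather than the exact DATE 15 pipeline, so treat it as a sketch of the technique, not the project's implementation.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Simplified Fisher Vector: gradients w.r.t. the GMM means only.
    X: (T, D) local descriptors; w: (K,) mixture weights;
    mu: (K, D) means; sigma: (K, D) per-dimension standard deviations."""
    T = X.shape[0]
    diff = (X[:, None, :] - mu[None, :, :]) / sigma    # (T, K, D), whitened
    # Posterior gamma[t, k] of component k given descriptor t.
    log_p = (-0.5 * (diff ** 2).sum(axis=2)
             - np.log(sigma).sum(axis=1) + np.log(w))
    log_p -= log_p.max(axis=1, keepdims=True)          # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)          # (T, K)
    # Gradient block w.r.t. the means, normalized by T and sqrt(w).
    G = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    fv = G.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)           # L2 normalization
```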

Algorithmic approaches: In 2016, we joined forces with a team of researchers to make inference algorithms efficient enough to be mapped to embedded devices. This bigger effort brought together several groups at Microsoft to work on related technologies. As part of this project, we have specifically explored bit-precision scaling in deep neural networks for ambient-aware computing tasks like image classification, object detection and audio processing. Owing to several practical issues, we have found that implementations of binarized neural networks, such as those described in the literature, do not necessarily provide the promised 64x speedups (see figure). Thus, one outcome of our work has been an efficient implementation of XNOR nets (networks with binarized weights and activations) on CPUs, utilizing CNTK-based ML tools from Microsoft.

[Figure: speedups of binarized neural-network implementations]
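
To see where the theoretical 64x figure comes from, consider the core trick in binarized networks: with weights and activations constrained to ±1 and packed 64 per machine word, a 64-element dot product collapses to one XNOR and one popcount. The toy sketch below (ours, not the project's CNTK-based implementation) shows only this arithmetic; the practical shortfall discussed above comes from overheads that this toy version ignores.

```python
# XNOR + popcount dot product for {-1, +1} vectors, packed one bit per
# element. With 64 elements per word this is one XNOR and one popcount
# per word -- the source of the theoretical 64x speedup.

def pack_bits(v):
    """Pack a list of +/-1 values into an int, 1 bit per element (+1 -> 1)."""
    word = 0
    for i, x in enumerate(v):
        if x > 0:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors from their packed bit words."""
    matches = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    pop = bin(matches).count("1")                  # popcount
    return 2 * pop - n  # matches contribute +1, mismatches contribute -1

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == sum(
    x * y for x, y in zip(a, b))
```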

Efficient Communication

We have addressed communication issues in ambient-aware applications via two techniques: utilizing low-energy networking technologies and tweaking the performance of ML algorithms.

Serving clients on low-bandwidth radios: ML on the edge sounds great, but once a trigger has been initiated, what is the best way for edge devices to communicate with the cloud? We have taken the first steps toward answering this question. Specifically, we have spent time and resources studying low-power wide-area network (LPWAN) technologies. Today, cellular communication is the most viable technology for ambient-aware computing; however, it faces a number of shortcomings: (1) signaling overheads are huge, and a single message transmission requires 7-8 packet exchanges; (2) cellular chips are overly complex and not optimized for ambient-aware computing; and (3) message costs are not set up for variable data rates. While investments in LTE-M promise to overcome some of these limitations, the technology is in its infancy, as it requires an over-85% reduction in modem complexity for practical use. Thus, low-cost, proprietary networks such as RPMA, DSSS, mesh networking, narrowband and LoRa are emerging as suitable candidates for ambient-aware computing applications.

\"illustration<\/p>\n

We have explored the use of some of these technologies to enable ultra-low-power, long-range communication. One key shortcoming of these wireless technologies is scale: due to their limited bandwidth, these networks can only support a small number of devices. As part of this project, we have developed constant-velocity prediction models that allow us to lower sampling rates by 33% for the same error-tolerance limit at a transmission range of 100 m. For low data rates, we have developed techniques that allow us to scale from 100 to 150 devices at a single base station. We are continuing to optimize these protocols and are investigating other emerging ones to enable the long-range, higher-data-rate communication necessary for ambient-aware computing applications.
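
A minimal sketch of how a constant-velocity predictor can suppress transmissions follows; the state layout, track and numbers are illustrative assumptions, not the project's protocol. The idea: both the device and the base station extrapolate the last reported position at the last reported velocity, and the device transmits only when the true value drifts outside the error tolerance.

```python
import math

def dead_reckoning(samples, dt, tol):
    """Return indices of samples that must be transmitted so that a
    receiver extrapolating at constant velocity stays within tol."""
    sent = [0]                     # always transmit the first sample
    pos, vel = samples[0], 0.0     # state shared with the receiver
    for i in range(1, len(samples)):
        predicted = pos + vel * dt * (i - sent[-1])
        if abs(samples[i] - predicted) > tol:
            # Prediction drifted too far: update shared state, transmit.
            vel = (samples[i] - pos) / (dt * (i - sent[-1]))
            pos = samples[i]
            sent.append(i)
    return sent

# A walker at ~1.4 m/s with a small wobble: most samples need no radio.
track = [1.4 * t + 0.3 * math.sin(t) for t in range(60)]
sent = dead_reckoning(track, dt=1.0, tol=1.0)
print(f"transmitted {len(sent)} of {len(track)} samples")
```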

Trading ML precision for bandwidth savings: As discussed before, an important part of any ambient-aware application front end is a low-complexity image classifier. Such a classifier helps filter out unnecessary frames before transmission to the cloud. We have identified that our local-processing algorithm should balance precision with recall so as to minimize energy consumption. It is not obvious which metric to maximize, since the choice depends on network conditions and the associated computational costs. In our DAC 15 paper, we have shown how to achieve a balance between precision and recall for arbitrary classifiers. Our approach was based on sample weighting and cascaded classification to manipulate the decision boundary of classifiers. Given this and other emerging approaches, we were able to exploit on-device classification efficiently within a connected network.
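
In the same spirit, though not the exact sample-weighting and cascading machinery of the DAC 15 paper, the snippet below shows on synthetic data how weighting the rare "interesting" class shifts a classifier's decision boundary, trading precision for recall. All data and parameters here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Synthetic stand-in for frame features: 5% of frames are "interesting".
n = 4000
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, 8)) + 0.5 * y[:, None]

# Increasing the weight on the rare positive class shifts the decision
# boundary: recall rises (fewer interesting frames dropped) while
# precision falls (more boring frames uploaded).
for w in (1, 5, 20):
    clf = LogisticRegression(class_weight={0: 1, 1: w}).fit(X, y)
    pred = clf.predict(X)
    print(f"w={w:2d}"
          f"  precision={precision_score(y, pred, zero_division=0):.2f}"
          f"  recall={recall_score(y, pred):.2f}")
```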

\"illustration<\/p>\n

The precision-recall trade-off works as follows (see figure above). We biased a traditional set of classification algorithms B such that they led us to a new "lower-energy" alternative set B*, which has a high true positive rate (potentially close to that of a high-energy, high-accuracy set A) at the cost of a higher false positive rate than B (and A). We have thus demonstrated how to achieve energy scalability at the highest level of performance in ambient-aware systems. For more details, see our DAC 15 and MCVMC 15 papers.


Going Forward

Ambient-aware systems constitute a specific class of IoT devices and form vehicles for technologies like AR, VR and autonomous agents. In this project, we have developed several building blocks and prototypes that show the benefits of always-on, always-connected intelligent cameras. We believe these are the first steps toward achieving true ambient-aware computing. For a practically useful system, information from multiple sensor modalities has to come together, including data from other environmental and IoT sensors. Furthermore, there are issues of optics, audio-video algorithms, artificial intelligence and user experience that remain to be addressed. As we move along, we hope to forge collaborative efforts that target these other pieces of the ambient-computing puzzle.
