November 7, 2019 - November 8, 2019

MSRA Academic Day 2019

Location: Beijing, China

Workshops

  • Speaker: Peng Cheng, Microsoft Research

    Programmable hardware has been used to build high-throughput, low-latency, real-time core AI engines such as BrainWave. Instead of the AI engine itself, we focus on solving AI-platform-related bottlenecks, such as storage and networking I/O, model distribution, synchronization, and data pre-processing in machine learning tasks, with acceleration from programmable hardware. Our proposed system enables direct hardware-assisted device-to-device interconnection with inline processing. We chose FPGA for our first prototype to build a general platform for AI acceleration, since FPGAs have been widely deployed in Azure to achieve high performance at much lower cost. Our system can accelerate AI in many aspects. It currently enables GPUs to fetch training data directly from storage into GPU memory, bypassing costly CPU involvement. As an intelligent hub, it can also perform inline data pre-processing efficiently. More acceleration scenarios are under development, including in-network inference acceleration and a hardware parameter server for distributed machine learning.

  • Speaker: Gunhee Kim, Seoul National University

    In this talk, I will introduce two recent works on NLP from the Vision and Learning Lab of Seoul National University. First, we present our work exploring the problem of audio captioning: generating natural language descriptions for any kind of audio in the wild, which has been surprisingly unexplored in previous research. We not only contribute a large-scale dataset of about 46K pairs of audio clips and human-written text collected via crowdsourcing, but also propose two novel components that improve the audio captioning performance of attention-based neural models. Second, I discuss our work on knowledge-grounded dialogue, in which we address the problem of better modeling knowledge selection in multi-turn knowledge-grounded dialogue. We propose a sequential latent variable model as the first approach to this matter. Our experimental results show that the proposed model improves knowledge selection accuracy and subsequently the performance of utterance generation.

  • Speaker: Xuanzhe Liu, Peking University

    We are in a fast-growing flood of “data” and benefit significantly from the “intelligence” derived from it. Such intelligence heavily relies on the centralized paradigm, i.e., cloud-based systems and services. However, we are also at the dawn of an emerging “decentralized” fashion that makes intelligence more pervasive and even “handy” on smartphones, wearables, and IoT devices, along with collaborations among them and the cloud. This talk discusses some technical challenges and opportunities in building decentralized intelligence, mostly from a software system perspective, covering aspects of programming abstraction, performance, privacy, energy, and interoperability. We also share our recent efforts in building such software systems, along with industrial experiences.

  • Speaker: Jaegul Choo, Korea University

    Despite recent advances in deep learning-based automatic colorization, existing models are still limited when it comes to few-shot learning and require a significant amount of training data. To tackle this issue, we present a novel memory-augmented colorization model that can produce high-quality colorization with limited data. In particular, our model can capture rare instances and successfully colorize them. We also propose a novel threshold triplet loss that enables unsupervised training of memory networks without the need for class labels. Experiments show that our model achieves superior quality in both few-shot and one-shot colorization tasks.

  • Speaker: Xu Tan, Microsoft Research

    Neural network based end-to-end text-to-speech (TTS) has significantly improved the quality of synthesized speech. However, such end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (e.g., voice speed or prosody control). In this work, we propose a novel feed-forward network based on the Transformer to generate mel-spectrograms in parallel for TTS. Experiments show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
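
    As a rough illustration of the parallel generation idea described above, the PyTorch sketch below (hypothetical layer sizes, and omitting the duration-based length regulator used in the actual model) maps a phoneme sequence to mel-spectrogram frames with a stack of Transformer encoder blocks in a single, non-autoregressive forward pass:

      import torch
      import torch.nn as nn

      class FeedForwardTTS(nn.Module):
          # Minimal sketch: phoneme embeddings -> Transformer blocks -> mel frames,
          # with every output frame computed in parallel (no autoregressive decoding).
          def __init__(self, n_phonemes=80, d_model=256, n_heads=4, n_layers=4, n_mels=80):
              super().__init__()
              self.embed = nn.Embedding(n_phonemes, d_model)
              layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024)
              self.encoder = nn.TransformerEncoder(layer, n_layers)
              self.to_mel = nn.Linear(d_model, n_mels)

          def forward(self, phoneme_ids):          # (seq_len, batch)
              h = self.encoder(self.embed(phoneme_ids))
              return self.to_mel(h)                # (seq_len, batch, n_mels), produced in one pass

      mel = FeedForwardTTS()(torch.randint(0, 80, (50, 2)))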

  • Speaker: Rajesh Krishna Balan, Singapore Management University

    Automatic analysis of the behaviour of large groups of people is a key requirement for a large class of applications such as crowd management, traffic control, and surveillance. For example, attributes such as the number of people, how they are distributed, which groups they belong to, and what trajectories they are taking can be used to optimize the layout of a mall to increase overall revenue. A common way to obtain these attributes is to use video camera feeds coupled with advanced video analytics solutions. However, relying solely on video feeds is challenging in high people-density areas, such as a typical mall in Asia, as the high density significantly reduces the effectiveness of video analytics due to factors such as occlusion. In this work, we propose to combine video feeds with WiFi data to achieve better classification of the number of people in an area and their trajectories. In particular, we believe that our approach will combine the strengths of the two sensing modalities, WiFi and video, while reducing the weaknesses of each. This work started fairly recently, and we will present our thoughts and results so far.

  • Speaker: Winston Hsu, National Taiwan University

    We have observed super-human capabilities from current (2D) convolutional networks for images, for both discriminative and generative models. In this talk, we will show our recent attempts at visual cognitive computing beyond 2D images. We will first demonstrate the huge opportunities in augmenting learning with temporal cues, 3D (point cloud) data, raw data, audio, etc., over emerging domains such as entertainment, security, healthcare, and manufacturing. In an explainable manner, we will justify how to design neural networks that leverage these novel (and diverse) modalities, and demystify the pros and cons of these signals. We will showcase a few tangible applications ranging from video QA and robotic object referring to situation understanding and autonomous driving. We will also review the lessons we learned while designing advanced neural networks that accommodate multimodal signals in an end-to-end manner.

  • Speaker: Guolin Ke, Microsoft Research

    Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm that is widely used in real-world applications. We open-sourced LightGBM, which contains many critical optimizations for efficient GBDT training and has become one of the most popular GBDT tools. In this talk, I will introduce the key technologies behind LightGBM.
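
    As a minimal usage sketch of the open-source tool, the example below trains a binary classifier with the public LightGBM Python API (toy data and illustrative parameter values):

      import lightgbm as lgb
      import numpy as np

      # Toy data standing in for a real training set.
      X = np.random.rand(500, 10)
      y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

      train_set = lgb.Dataset(X, label=y)
      params = {
          "objective": "binary",
          "learning_rate": 0.1,
          "num_leaves": 31,   # leaf-wise tree growth is controlled by num_leaves
      }
      booster = lgb.train(params, train_set, num_boost_round=100)
      preds = booster.predict(X[:5])   # predicted probabilities for the first five rows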

  • Speaker: Ting Cao, Microsoft Research

    Deep learning (DL) models are increasingly deployed in real-world applications on mobile devices. However, current mobile DL frameworks neglect CPU asymmetry, and the CPUs are seriously underutilized. We propose MobiDL for mobile DL inference, targeting improved CPU utilization and energy efficiency through novel designs for hardware asymmetry and appropriate frequency setting. It integrates four main techniques: 1) cost-model directed matrix block partition; 2) prearranged memory layout for model parameters; 3) asymmetry-aware task scheduling; and 4) data-reuse based CPU frequency setting. During one-time initialization, MobiDL configures the proper block partition, parameter layout, and an efficient frequency for each DL model. During inference, MobiDL scheduling balances tasks to fully utilize all CPU cores. Evaluation shows that for CNN models, MobiDL achieves 85% performance and 72% energy efficiency improvement on average compared to default TensorFlow. For RNNs, it achieves up to 17.51x performance and 8.26x energy efficiency improvement.
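
    To make the flavor of cost-model directed partitioning and asymmetry-aware scheduling concrete, the following illustrative sketch (not MobiDL itself; the core throughputs are hypothetical) splits the rows of a matrix computation across asymmetric CPU cores in proportion to each core's measured throughput, so that big and little cores finish at roughly the same time:

      # Hedged sketch of asymmetry-aware work partitioning.
      def partition_rows(n_rows, core_throughputs):
          total = sum(core_throughputs)
          shares, start = [], 0
          for i, t in enumerate(core_throughputs):
              end = n_rows if i == len(core_throughputs) - 1 else start + round(n_rows * t / total)
              shares.append((start, end))
              start = end
          return shares

      # Example: two big cores roughly twice as fast as two little cores (hypothetical numbers).
      print(partition_rows(1024, [2.0, 2.0, 1.0, 1.0]))
      # -> [(0, 341), (341, 682), (682, 853), (853, 1024)]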

  • Speaker: Yingce Xia, Microsoft Research

    Dual learning is our recently proposed framework in which a primal task (e.g., Chinese-to-English translation) and a dual task (e.g., English-to-Chinese translation) are jointly optimized through a feedback signal. We extend standard dual learning to multi-agent dual learning, where multiple models for the primal task and multiple models for the dual task evolve together. In this case, the feedback signal is enhanced and we obtain better performance. Experimental results in low-resource settings show that our method works well. In the WMT’19 machine translation competition, we won four top places using multi-agent dual learning.
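
    For intuition only, the sketch below (with a hypothetical model interface) shows the kind of enhanced feedback signal described above: a candidate output of the primal task is scored by how well an ensemble of dual-task models reconstructs the original input, and the primal model is trained to prefer candidates with higher scores:

      # Illustrative sketch, not the actual training code.
      def dual_feedback(x, y, dual_models):
          # x: source sentence, y: candidate translation produced by a primal model.
          # Each dual model exposes a (hypothetical) log_prob(target, given=...) method scoring
          # how likely it is to reconstruct x from y; averaging over the ensemble enhances the signal.
          scores = [m.log_prob(x, given=y) for m in dual_models]
          return sum(scores) / len(scores)

      # The primal models are then updated (e.g., with a policy-gradient style objective)
      # to increase the probability of candidates y that receive higher dual_feedback scores.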

  • Speaker: Jiwen Lu, Tsinghua University

    In this talk, I will overview the trend of multi-view deep learning techniques and discuss how they are used to improve the performance of various visual content understanding tasks. Specifically, I will present three multi-view deep learning approaches: multi-view deep metric learning, multi-modal deep representation learning, and multi-agent deep reinforcement learning, and show how these methods are used for visual content understanding tasks. Lastly, I will discuss some open problems in multi-view deep learning to show how to further develop more advanced multi-view deep learning methods for computer vision in the future.

  • Speaker: Quanlu Zhang, Microsoft Research

    Recent years have witnessed the great success of deep learning in a broad range of applications. Model tuning has become a key step in finding good models. To be effective in practice, a system is needed to facilitate this tuning procedure in terms of both programming effort and search efficiency. Thus, we open-sourced NNI (Neural Network Intelligence), a toolkit for neural architecture search and hyper-parameter tuning, which provides an easy-to-use interface and rich built-in AutoML algorithms. Moreover, it is highly extensible to support various new tuning algorithms and requirements. With high scalability, many trials can run in parallel on various training platforms.
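
    As a minimal usage sketch of the toolkit, an NNI trial script typically receives a hyper-parameter set from the tuner and reports the resulting metric back (the training function below is a placeholder):

      # trial.py: a minimal NNI trial.
      import nni

      def train_and_evaluate(params):
          # Placeholder for real model training; returns a validation metric.
          return 1.0 - abs(params["lr"] - 0.01)

      if __name__ == "__main__":
          params = nni.get_next_parameter()    # hyper-parameters chosen by the NNI tuner
          accuracy = train_and_evaluate(params)
          nni.report_final_result(accuracy)    # fed back to the tuner to guide the next trial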

  • Speaker: Chong Luo, Microsoft Research

    Video-language cross-modal tasks have received increasing interest in recent years, from video retrieval and video captioning to spatio-temporal localization in video by language query. In this talk, we will present the research and application of some of these tasks. We will show how pre-trained single-modality models have made these tasks tractable and discuss the paradigm shift in deep neural network design with pre-trained models. In addition, we propose a universal cross-modality pre-training framework that may benefit a wide range of video-language tasks. We hope that our work will provide inspiration to other researchers in solving these interesting but challenging cross-modal tasks.

  • Speaker: Chuan Wu, University of Hong Kong

    More and more companies and institutions are running AI clouds and machine learning clusters with various ML model training workloads to support AI-driven services. Efficient resource scheduling is the key to maximizing the performance of ML workloads, as well as the hardware efficiency of these very expensive ML clusters. There is large room for improving today’s ML cluster schedulers, e.g., by including interference awareness in task placement and by scheduling not only computation but also communication. In this talk, I will share our recent work on designing deep learning job schedulers for ML clusters, aiming to expedite training and minimize training completion time. Our schedulers decide communication scheduling, the number of workers/PSs, and the placement of workers/PSs for jobs in the cluster, through both heuristics with theoretical support and reinforcement learning approaches.

  • Speaker: Sinno Jialin Pan, Nanyang Technological University

    In fine-grained sentiment analysis, extracting aspect terms and opinion terms from user-generated text is the most fundamental task for generating structured opinion summarization. Existing studies have shown that the syntactic relations between aspect and opinion words play an important role in aspect and opinion term extraction. However, most prior works either relied on pre-defined rules or separated relation mining from feature learning. Moreover, these works focused only on single-domain extraction, which fails to adapt well to other domains of interest where only unlabeled data is available. In real-world scenarios, annotated resources are extremely scarce for many domains and languages. In this talk, I am going to introduce our recent series of works on transfer learning for cross-domain and cross-language fine-grained sentiment analysis based on recursive neural networks.

  • Speaker: Yue Cao, Microsoft Research

    We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone and extends it to take both visual and linguistic embedded features as input. Each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. The model is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset together with a text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure better aligns the visual-linguistic clues and benefits downstream tasks such as visual commonsense reasoning, visual question answering, and referring expression comprehension.
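
    For intuition, the input construction described above can be sketched as follows (a simplified illustration with hypothetical dimensions, not the released VL-BERT code): each input element, whether a word token or an RoI, is the sum of a token embedding, a visual feature embedding, a segment embedding, and a position embedding before being fed to the Transformer backbone:

      import torch
      import torch.nn as nn

      class VisualLinguisticInput(nn.Module):
          # Simplified illustration of mixing word tokens and image RoIs in one input sequence.
          def __init__(self, vocab_size=30522, visual_dim=2048, d_model=768, max_len=512):
              super().__init__()
              self.token_emb = nn.Embedding(vocab_size, d_model)
              self.visual_proj = nn.Linear(visual_dim, d_model)   # projects RoI appearance features
              self.segment_emb = nn.Embedding(2, d_model)         # 0: text element, 1: image RoI
              self.position_emb = nn.Embedding(max_len, d_model)

          def forward(self, token_ids, visual_feats, segment_ids):
              # token_ids, segment_ids: (batch, seq_len); visual_feats: (batch, seq_len, visual_dim),
              # e.g. an RoI feature for image elements and a whole-image feature for text elements.
              pos = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
              return (self.token_emb(token_ids) + self.visual_proj(visual_feats)
                      + self.segment_emb(segment_ids) + self.position_emb(pos))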

  • Speaker: Nan Duan, Microsoft Research

    In this talk, I will introduce our latest work on multi-modal NLP, including (i) multi-modal pre-training, which aims to learn the joint representations between language and visual contents; (ii) multi-modal reasoning, which aims to handle complex queries by manipulating knowledge extracted from language and visual contents; (iii) video-based QA/summarization, which aims to make video contents readable and searchable.

Breakout Sessions

  • Speaker: Lijun Zhang, Nanjing University

    To deal with changing environments, a new performance measure, adaptive regret, defined as the maximum static regret over any contiguous interval, has been proposed in online learning. In the setting of online convex optimization, several algorithms have been developed to minimize the adaptive regret. However, existing algorithms are problem-independent and lack universality. In this talk, I will briefly introduce our two contributions in this direction. The first is to establish problem-dependent bounds on adaptive regret by exploiting the smoothness condition. The second is to design a universal algorithm that can handle multiple types of functions simultaneously.
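
    For reference, the adaptive regret described above can be written (in LaTeX notation, with losses f_t, decisions x_t, and feasible set \mathcal{X}) as the worst static regret over all contiguous intervals:

      \mathrm{A\text{-}Regret}(T) \;=\; \max_{1 \le r \le s \le T}
          \left( \sum_{t=r}^{s} f_t(x_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=r}^{s} f_t(x) \right)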

  • Speaker: Rui Yan, Peking University

    Nowadays, automatic human-computer conversational systems have attracted great attention from both industry and academia. Intelligent products such as XiaoIce (by Microsoft) have been released, while numerous artificial intelligence companies have been established. The technology behind conversational systems is accumulating and is gradually being opened to the public. Thanks to researchers’ investigations, conversational systems are more than science fiction: they have become real. It is interesting to review the recent advances in human-computer conversational systems, especially the significant changes brought by deep learning techniques. It would also be exciting to anticipate future developments and challenges.

  • Speaker: Hongzhi Wang, Harbin Institute of Technology

    Data is the basis of modern artificial intelligence (AI). Efficient and effective AI requires the support of data acquisition, governance, management, analytics, and mining, which brings new challenges. From another perspective, advances in AI provide new opportunities to increase the automation of data processing. Thus, AI and data form a closed loop and promote each other. In this talk, the speaker will demonstrate the mutual promotion of AI and data with some examples and discuss further opportunities to advance both areas.

  • Speaker: Wen-Huang Cheng, National Chiao Tung University

    The fashion industry is one of the biggest in the world, representing over 2 percent of global GDP (2018). Artificial intelligence (AI) has been a predominant theme in the fashion industry and is impacting every part of it, at scales from personal to industrial and beyond. In recent years, my research group and I have been devoted to advanced AI research that helps revolutionize the fashion industry, enabling innovative applications and services with improved user experience. In this talk, I would like to give an overview of the major outcomes of our research and discuss which research subjects we can further work on together with Microsoft researchers to make a new impact in the fashion domain.

  • Speaker: Seung-won Hwang, Yonsei University

    This talk is inspired by a question raised during my talk at the MSRA Faculty Summit last year, where I presented NLP models in which injecting (diverse forms of) knowledge meaningfully enhances accuracy and robustness. Chin-Yew then asked: “Do you think BERT implicitly contains all this information already?” This talk is an extended investigation supporting the short answer I gave at the time. The title is a spoiler.

  • Speaker: Lei Chen, Hong Kong University of Science and Technology

    Recently, AI has become quite popular and attractive, not only in academia but also in industry. The success stories of AI in various applications have raised significant public interest in AI. Meanwhile, human intelligence is turning out to be more sophisticated, and Big Data technology is everywhere, improving our quality of life. The question we all want to ask is “what is next?”. In this talk, I will discuss DHA, a new computing paradigm that combines big Data, Human intelligence, and AI. Specifically, I will first briefly explain the motivation for DHA, then present its challenges, and finally highlight some possible solutions for building such a new paradigm.

  • Speaker: Bohyung Han, Seoul National University

    Label noise is one of the critical sources that significantly degrade the generalization performance of deep neural networks. To handle the label noise issue in a principled way, we propose a unique classification framework that constructs multiple models in heterogeneous coarse-grained meta-class spaces and makes joint inference over the trained models for the final predictions in the original (base) class space. Our approach reduces the noise level by simply constructing meta-classes and improves accuracy via combinatorial inference over multiple constituent classifiers. Since the proposed framework has distinct and complementary properties for the given problem, we can even incorporate additional off-the-shelf learning algorithms to further improve accuracy. We also introduce techniques to organize multiple heterogeneous meta-class sets using k-means clustering and to identify a desirable subset that leads to compact models. Our extensive experiments demonstrate outstanding performance in terms of accuracy and efficiency compared to state-of-the-art methods under various synthetic noise configurations and on a real-world noisy dataset.
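
    As a hedged sketch of the meta-class construction step (illustrative only; the class prototypes and meta-class counts here are assumptions, not the paper's exact recipe), one can cluster base-class prototypes with k-means several times to obtain heterogeneous meta-class spaces:

      import numpy as np
      from sklearn.cluster import KMeans

      def build_meta_class_maps(class_prototypes, meta_class_counts, seed=0):
          # class_prototypes: (num_classes, feat_dim) array, e.g. a mean feature per base class.
          maps = []
          for i, k in enumerate(meta_class_counts):
              km = KMeans(n_clusters=k, random_state=seed + i).fit(class_prototypes)
              maps.append(km.labels_)   # base class -> meta-class index, one map per space
          return maps

      # Each constituent classifier is then trained on the (less noisy) meta-labels of its space,
      # and the final base-class prediction combines the constituent classifiers' outputs.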

  • Speaker: Sinno Jialin Pan, Nanyang Technological University

    Multi-task learning aims to learn multiple tasks jointly by exploiting their relatedness to improve the generalization performance of each task. Traditionally, to perform multi-task learning, one needs to centralize data from all tasks on a single machine. However, in many real-world applications, the data of different tasks is owned by different organizations and geo-distributed over different local machines. Due to the heavy communication caused by transmitting the data and the issues of data privacy and security, it is often impossible to send the data of different tasks to a master machine to perform multi-task learning. In this talk, we present our recent work on distributed multi-task learning, which jointly learns multiple tasks in the parameter-server paradigm without sharing any training data and has a theoretical guarantee of convergence to the solution obtained by the corresponding centralized multi-task learning algorithm.

  • Speaker: Min H. Kim, KAIST

    Traditional snapshot hyperspectral imaging systems include various optical elements: a dispersive optical element (prism), a coded aperture, several relay lenses, and an imaging lens, resulting in an impractically large form factor. We seek an alternative, minimal form factor of snapshot spectral imaging based on recent advances in diffractive optical technology. We thereupon present a compact, diffraction-based snapshot hyperspectral imaging method, using only a novel diffractive optical element (DOE) in front of a conventional, bare image sensor. Our diffractive imaging method replaces the common optical elements in hyperspectral imaging with a single optical element. To this end, we tackle two main challenges: First, the traditional diffractive lenses are not suitable for color imaging under incoherent illumination due to severe chromatic aberration because the size of the point spread function (PSF) changes depending on the wavelength. By leveraging this wavelength-dependent property alternatively for hyperspectral imaging, we introduce a novel DOE design that generates an anisotropic shape of the spectrally-varying PSF. The PSF size remains virtually unchanged, but instead the PSF shape rotates as the wavelength of light changes. Second, since there is no dispersive element and no coded aperture mask, the ill-posedness of spectral reconstruction increases significantly. Thus, we propose an end-to-end network solution based on the unrolled architecture of an optimization procedure with a spatial-spectral prior, specifically designed for deconvolution-based spectral reconstruction. Finally, we demonstrate hyperspectral imaging with a fabricated DOE attached to a conventional DSLR sensor. Results show that our method compares well with other state-of-the-art hyperspectral imaging methods in terms of spectral accuracy and spatial resolution, while our compact, diffraction-based spectral imaging method uses only a single optical element on a bare image sensor.

  • Speaker: Jong Kim, Pohang University of Science and Technology (POSTECH)

    Data management practices by third-party apps have failed in terms of manageability and security, because modern systems cannot provide fine-grained data management and security due to a lack of understanding of stored data. As a result, users suffer from storage shortages, data stealing, and data tampering.

    To tackle the problem, we propose a novel and general data management framework, ContextDM, that sheds light on storage, helping system services and storage aid-apps gain a better understanding of permanent data. Specifically, the framework augments permanent data with metadata that includes contextual semantic information about the importance and sensitivity of the data. Further, we show the effectiveness of our framework by demonstrating ContextDM-based aid-tools that automatically identify important and useless data, as well as sensitive data that has been disclosed.

  • Speaker: Shou-De Lin, National Taiwan University

    Deep neural network based solutions have recently shown promising results in natural language generation. From autoencoders to Seq2Seq models to GAN-based solutions, deep learning models can already generate text that passes the Turing test, making the outputs indistinguishable from human-generated ones. However, researchers have pointed out that the content generated by deep neural networks can be fairly unpredictable, meaning that it is non-trivial for humans to control what is generated. This talk will discuss how to control the outputs of an NLG model and demonstrate some of our recent works along this line.

  • Speaker: Chenhui Chu, Osaka University

    In this talk, we will introduce two of our recent works on multilingual and multimodal processing: cross-lingual visual grounding and multimodal machine translation. Visual grounding is a vision-and-language understanding task that aims to locate a region in an image according to a specific query phrase. We will present our work on cross-lingual visual grounding, which expands the task to different languages. In addition, we will introduce our work on multimodal machine translation, which incorporates semantic image regions with both visual and textual attention.

  • Speaker: Atsuko Miyaji, Osaka University

    The consequences of security failures in the era of the internet of things (IoT) can be catastrophic, as demonstrated by a rapidly growing list of IoT security incidents. As a result, people have begun to recognize the importance and value of bringing the highest level of security to IoT. Traditional wisdom has it that, though technologically superior, public-key cryptography (PKC) is too expensive to deploy in IoT devices and networks. In this talk, we present our cost-effective improvement of elliptic curve cryptography (ECC) in terms of memory and computational resources.

  • Speaker: Huanjing Yue, Tianjin University

    In this talk, I will introduce our team’s work on image (video) denoising and demoiréing.

    Realistic noise, which is introduced when capturing images under high ISO modes or low-light conditions, is more complex than Gaussian noise and is therefore difficult to remove. By exploring spatial, channel, and temporal correlations via deep CNNs, we can efficiently remove noise from images and videos. We construct two datasets to facilitate research on realistic noise removal for images and videos.

    Moiré patterns, caused by aliasing between the grid of the display device and the camera sensor array, greatly degrade the visual quality of recaptured screen images. Considering that the recaptured screen image and the original screen content usually differ greatly in brightness, we construct a moiré removal and brightness improvement (MRBI) database with moiré-free and moiré image pairs to facilitate supervised learning and quantitative evaluation. Correspondingly, we propose a CNN-based moiré removal and brightness improvement method. Our work provides a benchmark dataset and a good baseline method for the demoiréing task.

  • Speaker: Seong-Whan Lee, Korea University

    Recently, deep reinforcement learning (DRL) has enabled real-world applications such as robotics. Here we teach a robot to succeed in curling (an Olympic discipline), a highly complex real-world application in which a robot needs to carefully learn to play the game on a slippery ice sheet in order to compete well against human opponents. This scenario encompasses fundamental challenges: uncertainty, non-stationarity, infinite state spaces and, most importantly, scarce data. One fundamental objective of this study is thus to better understand and model the transfer from simulation to real-world scenarios under uncertainty. We demonstrate our proposed framework and show videos, experiments, and statistics about Curly, our AI curling robot, being tested on a real curling ice sheet. Curly performed well both in classical game situations and when interacting with human opponents.

  • Speaker: Ryo Furukawa, Hiroshima City University

    For effective in situ endoscopic diagnosis and treatment, as well as robotic surgery, 3D endoscopic systems have been attracting many researchers. We have been developing a 3D endoscopic system based on an active stereo technique, which projects a special pattern in which each feature is coded. We believe it is a promising approach because of its simplicity and high precision. However, previous works on this approach have two problems. First, the quality of 3D reconstruction depended on the stability of feature extraction from the images captured by the endoscope camera. Second, due to the limited pattern projection area, the reconstructed region was relatively small. In this talk, we describe our work on a learning-based technique using CNNs to address the first problem, and on an extended bundle adjustment technique, which integrates multiple shapes into a consistent single shape, to address the second. The effectiveness of the proposed techniques compared to previous techniques was evaluated experimentally.

  • Speaker: Masatoshi Yoshikawa, Kyoto University

    Differential privacy (DP) has received increased attention as a rigorous privacy framework. In this talk, we introduce our recent studies on extending DP to spatio-temporal data. The topics include i) DP mechanisms under temporal correlations in the context of continuous data release; and ii) location privacy for location-based services over road networks.
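
    As background for the extensions discussed in the talk, the standard definition (in LaTeX notation): a randomized mechanism M is \varepsilon-differentially private if, for all neighboring databases D and D' and every measurable set S of outputs,

      \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]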

  • Speaker: Jingwen Leng, Shanghai Jiao Tong University

    Despite the enormous success of deep neural networks, there is still no solid understanding of their working mechanism. As such, a fundamental question arises: how should architects and system developers perform optimizations centered on DNNs? Treating them as a black box leads to efficiency and security issues: 1) DNN models require a fixed computation budget regardless of the input; 2) a human-imperceptible perturbation to the input can cause a DNN misclassification. This talk will present our efforts toward addressing these challenges. We recognize an increasing need to monitor and modify a DNN’s runtime behavior, as evidenced by our recent work on effective path and by other researchers’ work on network pruning and quantization. To this end, we present our ongoing effort to build a graph instrumentation framework that provides programmers with a convenient way to achieve these capabilities.

  • Speaker: Yu Zhang, University of Science & Technology of China

    While deep learning researchers are seeking deeper and wider nonlinear networks, there is an increasing challenge in deploying deep neural network applications on low-end GPU devices for mobile and edge computing due to the limited size of GPU DRAM. Existing deep learning frameworks lack effective GPU memory management for different reasons: it is hard to apply effective GPU memory management to dynamic computation graphs, where the global computation graph is unavailable (e.g., PyTorch), while only limited dynamic GPU memory management strategies can be imposed on static computation graphs (e.g., TensorFlow). In this talk, I will analyze the state-of-the-art GPU memory management in existing DL frameworks, present the challenges of GPU memory management faced when running deep neural networks on low-end, resource-constrained devices, and finally share our thoughts.

  • Speaker: Hong-Goo Kang, Yonsei University

    In end-to-end deep learning-based emotional text-to-speech (TTS) systems, such as those using Tacotron networks, it is very important to provide additional embedding vectors to flexibly control the distinct characteristics of the target emotion.

    This talk introduces a couple of methods to effectively estimate representative embedding vectors. Using the mean of embedding vectors is a simple approach, but the expressiveness of the synthesized speech is not satisfactory. To enhance expressiveness, we need to consider the distribution of emotion embedding vectors. An inter-to-intra (I2I) distance ratio-based algorithm recently proposed by our research team shows much higher performance than the conventional mean-based one. The I2I algorithm is also useful for gradually changing the intensity of expressiveness. Listening test results verify that the emotional expressiveness and controllability of the I2I algorithm are superior to those of the mean-based one.

  • Speaker: Min Zhang, Tsinghua University

    Recommender systems play significant roles in our daily life and are expected to be available to any user, regardless of gender, age, or other demographic factors. Recently, there has been growing concern about the bias that can creep into personalization algorithms and produce unfairness issues. In this talk, I will introduce the trending topics and our recent research progress at the THUIR (Tsinghua University Information Retrieval) group on fairness issues in recommender systems, including the causes of unfairness and approaches to handle it. This series of works provides new ideas for building fairness-aware recommender systems and has been published at top-tier international conferences such as SIGIR 2018, WWW 2019, and SIGIR 2019.

  • Speaker: Insik Shin, KAIST

    The growing trend of multi-device ownership creates a need and an opportunity to use applications across multiple devices. However, in general, current app development and usage still remain within the single-device paradigm, falling far short of user expectations. For example, it is currently not possible for a user to dynamically partition an existing live streaming app with chatting capabilities across different devices, such that she watches her favorite broadcast on her smart TV while chatting in real time on her smartphone. In this talk, we present FLUID, a new Android-based multi-device platform that enables innovative ways of using multiple devices. FLUID aims to i) allow users to migrate or replicate individual user interfaces (UIs) of a single app on multiple devices (high flexibility), ii) require no additional development effort to support unmodified, legacy applications (ease of development), and iii) support a wide range of apps that follow the trend of using custom-made UIs (wide applicability). FLUID meets these goals by carefully analyzing which UI states are necessary to correctly render UI objects, deploying only those states on different devices, supporting cross-device function calls transparently, and synchronizing the UI states of replicated UI objects across multiple devices. Our evaluation with 20 unmodified, real-world Android apps shows that FLUID can transparently support a wide range of apps and is fast enough for interactive use.

  • Speaker: Seungyong Lee, Pohang University of Science and Technology (POSTECH)

    In this talk, I will introduce a novel framework for generating a global texture atlas for a deforming geometry. Our approach is distinguished from prior art in two aspects. First, instead of generating a texture map for each timestamp to color a dynamic scene, our framework reconstructs a global texture atlas that can be consistently mapped to a deforming object. Second, our approach is based on a single RGB-D camera, without the need for a multi-camera setup surrounding the scene. In our framework, the input is a 3D template model with an RGB-D image sequence, and geometric warping fields are found using a state-of-the-art non-rigid registration method to align the template mesh to noisy and incomplete input depth images. With these warping fields, our multi-scale approach to texture coordinate optimization generates a sharp and clear texture atlas that is consistent with multiple color observations over time. Our approach provides a handy configuration for capturing a dynamic geometry along with a clean texture atlas, and we demonstrate it with practical scenarios, particularly human performance capture.

  • Speaker: Liwei Wang, Peking University

    Gradient descent finds a global minimum when training deep neural networks despite the objective function being non-convex. This work proves that gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show that the Gram matrix is stable throughout the training process, and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.

  • Speaker: Wei-Shi Zheng, Sun Yat-sen University

    We present a new model for visually assessing the performance of actions in videos by graph-based joint relation modelling. Previous works mainly focused on the whole scene, including the performer’s body and background, yet they ignored detailed joint interactions. This is insufficient for fine-grained and accurate action assessment, because the action quality of each joint is dependent on its neighboring joints. We therefore propose to learn detailed joint motion based on joint relations. We build trainable Joint Relation Graphs and analyze joint motion on them. We propose two novel modules, the Joint Commonality Module and the Joint Difference Module, for joint motion learning. The Joint Commonality Module models the general motion for certain body parts, and the Joint Difference Module models the motion differences within body parts. We evaluate our method on six public Olympic actions for performance assessment. Our method outperforms previous approaches (+0.0912) and the whole-scene model (+0.0623) in terms of Spearman’s rank correlation. We also demonstrate our model’s ability to interpret the action assessment process.

  • Speaker: Jiaying Liu, Peking University

    In this talk, we focus on intelligent action analytics in videos with multi-modal reasoning, which is important but remains underexplored. We first present the challenges of this problem, illustrated with our self-collected PKU-MMD dataset: multi-modal complementary feature learning, noise-robust feature learning, and dealing with tedious label annotation. To tackle these issues, we propose initial solutions with multi-modal reasoning. A modality compensation network is proposed to explicitly explore the relationships between different modalities and further boost multi-modal feature learning. A noise-invariant network is developed to recognize human actions from noisy skeletons by referring to denoised skeletons. To inspire the community, we conclude with possible future work, such as self-supervised learning and language-guided reasoning.

  • Speaker: Chuck Yoo, Korea University

    It is widely believed that commodity operating systems cannot deliver high-speed packet processing, and a number of alternative approaches (including user-space network stacks) have been proposed. This talk revisits the inefficiency of packet processing inside the kernel and explores whether a redesign of kernel network stacks can remedy it. We present a case through a redesign: Kafe – a kernel-based advanced forwarding engine. Contrary to common belief, Kafe can process packets as fast as user-space network stacks. Kafe neither adds any new API nor depends on proprietary hardware features.

  • Speaker: Xueming Qian, Xi’an Jiaotong University

    Fine-grained food recognition is the detailed classification of food that provides more specialized and professional attribute information. It is foundational work for healthy diet recommendation, cooking instructions, nutrition intake management, and cafeteria self-checkout systems. Chinese food appearance lacks structured information, so ingredient composition is an important consideration. We propose a new method for fine-grained food and ingredient recognition that includes an Attention Fusion Network (AFN) and Food-Ingredient Joint Learning. AFN focuses on important regional attention features and generates the feature descriptor. In Food-Ingredient Joint Learning, we propose a balanced focal loss to address the imbalance of multi-label ingredient data. Finally, a series of experiments shows that our results significantly improve on existing methods.
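
    As a hedged sketch of the loss design mentioned above (it follows the standard focal-loss form with a per-class balancing weight; the exact formulation used in the talk may differ), a class-balanced focal loss for multi-label ingredient recognition can look like this:

      import torch

      def balanced_focal_loss(logits, targets, alpha, gamma=2.0):
          # logits, targets: (batch, num_ingredients); alpha: (num_ingredients,) per-class weights,
          # e.g. larger for rare ingredients to counter label imbalance.
          p = torch.sigmoid(logits)
          pt = torch.where(targets == 1, p, 1 - p)          # probability assigned to the true label
          w = torch.where(targets == 1, alpha, 1 - alpha)   # class-balancing term
          return (-w * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))).mean()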

  • Speaker: Yonggang Wen, Nanyang Technological University

    Media-rich applications will continue to dominate mobile data traffic with exponential growth, as predicted by the Cisco Video Index. Improved quality of experience (QoE) for video consumers plays an important role in shaping this growth. However, most existing approaches to improving video QoE are system-centric and model-based, in that they tend to derive insights from system parameters (e.g., bandwidth, buffer time, etc.) and propose various mathematical models to predict QoE scores (e.g., mean opinion score). In this talk, we will share our latest work on developing a unified and scalable framework to transform multimedia communications via deep video analytics. Specifically, our framework consists of two main components. One is a deep-learning-based QoE prediction algorithm that combines multi-modal data inputs to provide a more accurate assessment of QoE in a real-time manner. The other is a model-free QoE optimization paradigm built upon a deep reinforcement learning algorithm. Our preliminary results verify the effectiveness of the proposed framework. We believe that this hybrid approach to multimedia communications and computing will fundamentally transform how we optimize the design and operation of multimedia communication systems.

  • Speaker: Seung Ah Lee, Yonsei University

    Miniaturization of microscopes can be a crucial stepping stone towards realizing compact, cost-effective, and portable platforms for biomedical research and healthcare. This talk reports on implementations of lensless microscopes and lensless cameras for a variety of biological imaging applications in the form of mass-producible semiconductor devices, which transform the fundamental design of optical imaging systems.

  • Speaker: Jaegul Choo, Korea University

    Considering their success in generating high-quality, realistic data, generative adversarial networks (GANs) have the potential to be used for data augmentation to improve prediction accuracy in diverse problems where only a limited amount of training data is available. However, GANs themselves require a nontrivial amount of data for their training, so data augmentation via GANs often does not improve accuracy in practice. This talk will briefly review the existing literature and our ongoing approach based on feature disentanglement. I will conclude the talk with further research issues that I would like to address in the future.

  • Speaker: Hiroki Watanabe, Hokkaido University

    Since auditory perception is a passive sense, we often miss important information and acquire unimportant information. We focus on an earphone-type wearable computer (hearable device) that has not only speakers but also microphones. In a hearable computing environment, microphones and speakers are always attached to the ears. Therefore, we can manipulate our auditory perception using a hearable device. We manipulated the frequency of the input sound from the microphones and transmitted the converted sound from the speakers. Thus, we can acquire sounds that are not audible to normal auditory perception and eliminate unwanted sounds according to the user’s requirements.

  • Speaker: Wenfei Wu, Tsinghua University

    Network functions (NFs) play important roles in improving performance and enhancing security in modern computer networks. More and more NFs are being developed, integrated, and managed in production networks. However, the connection between the development and the operation of network functions has not yet drawn attention, which slows down the development and delivery of NFs and complicates NF network management.

    We propose that building a common abstraction layer for network functions would benefit both development and operation. For NF development, a uniform abstraction layer for describing NF behaviors would make cross-platform development rapid and agile, accelerating NF delivery for NF vendors; we will introduce our recent NF development framework based on language and compiler technologies. For NF operation, a behavior model would ease network reasoning, which can avoid runtime bugs, and, more crucially, the behavior model is guaranteed to reflect the actual implementation; we will introduce our NF verification work based on the NF modeling language. Around this model-centric NF development and operation, we also present other NF modeling works that lay the foundation of the NF modeling language and fill the semantic gap between legacy NFs and NF models.

  • Speaker: Mingkui Tan, South China University of Technology

    Architecture design is one of the key factors behind the success of deep neural networks. Existing deep architectures are either manually designed or automatically searched by some Neural Architecture Search (NAS) methods. However, even a well-searched architecture may still contain many non-significant or redundant modules or operations (e.g., convolution or pooling), which not only incur substantial memory consumption and computational cost but may also deteriorate the performance. Thus, it is necessary to optimize the operations inside the architecture to improve the performance without introducing extra computational cost. However, such a constrained optimization problem is an NP-hard problem and is very hard to solve. To address this problem, we cast the optimization problem into a Markov decision process (MDP) and learn a Neural Architecture Transformer (NAT) to replace the redundant operations with the more computationally efficient ones (e.g., skip connection or directly removing the connection). In MDP, we train NAT with reinforcement learning to obtain the architecture optimization policies w.r.t. different architectures. To verify the effectiveness of the proposed method, we apply NAT on both hand-crafted architectures and NAS based architectures. Extensive experiments on two benchmark datasets, i.e., CIFAR-10 and ImageNet, show that the transformed architecture significantly outperforms both the original architecture and the architectures optimized by the existing methods.

  • Speaker: Gunhee Kim, Seoul National University

    In this talk, I will introduce two recent works on machine learning from Vision and Learning Lab of Seoul National University. First, we present our work in reinforcement learning. We introduce an information-theoretic exploration strategy named Curiosity-Bottleneck (CB) that distills task-relevant information from observation. In our experiments, we observe that the CB algorithm robustly measures the state novelty in distractive environments where state-of-the-art exploration methods often degenerate. Second, we propose novel training schemes with a new set of losses that can prevent conditional GANs from losing the diversity in their outputs. We perform thorough experiments on image-to-image translation, super-resolution and image inpainting and show that our methods achieve a great diversity in outputs while retaining or even improving the visual fidelity of generated samples.

  • Speaker: Hiroaki Yamane, RIKEN AIP & The University of Tokyo

    Numerical common sense (e.g., “a person with a height of 2m is very tall”) is essential when deploying artificial intelligence (AI) systems in society. We construct methods for converting contextual language to numerical variables for quantitative/numerical common sense in natural language processing (NLP).

    We are living in a world where we need common sense. We use common sense when observing objects: a 165 cm human cannot be bigger than a 1 km bridge. The weight of the aforementioned human ranges from 40 kg to 90 kg. If one’s weight is less than 50 kg, they are more likely to be very thin. This can also be applied to money. If the latest Surface Pro is $500, it is quite cheap. There is a need to account for common sense in future AI systems.

    To address this problem, we first use a crowdsourcing service to obtain sufficient data for a subjective agreement on numerical common sense. Second, to examine whether such common sense is captured by current word embeddings, we evaluate the performance of a regressor trained on the obtained data.

  • Speaker: Tadashi Nomoto, The SOKENDAI Graduate School of Advanced Studies

    In this work, we examine whether it is possible to achieve state-of-the-art performance in paraphrase generation with a reduced vocabulary. Our approach consists of building a convolution-to-sequence model (Conv2Seq) partially guided by reinforcement learning and training it on a sub-word representation of the input. Experiments on the Quora dataset, which contains over 140,000 pairs of sentences and corresponding paraphrases, found that with fewer than 1,000 token types we were able to achieve performance that exceeds the current state of the art. We also report that the same architecture works equally well for text simplification, with little change.

  • Speaker: Sung-eui Yoon, KAIST

    In this talk, we discuss a novel, ray tracing based technique for 3D sound source localization for indoor and outdoor environments. Unlike prior approaches, which are mainly based on continuous sound signals from a stationary source, our formulation is designed to localize the position instantaneously from signals within a single frame. We consider direct sound and indirect sound signals that reach the microphones after reflecting off surfaces such as ceilings or walls. We then generate and trace direct and reflected acoustic paths using backward acoustic ray tracing and utilize these paths with Monte Carlo localization to estimate a 3D sound source position. For complex cases with many objects, we also found that diffraction effects caused by the wave characteristics of sound become dominant. We propose to handle such non-trivial problems even with ray tracing, since directly applying wave simulation is prohibitively expensive.

  • Speaker: Tianzhu Zhang, University of Science and Technology of China

    Visual tracking is one of the most fundamental topics in computer vision with various applications in video surveillance, human computer interaction and vehicle navigation. Although great progress has been made in recent years, it remains a challenging problem due to factors such as illumination changes, geometric deformations, partial occlusions, fast motions and background clutters. In this talk, I will first review several recent models of visual tracking including particle filtering, classifier learning for tracking, sparse tracking, deep learning tracking, and correlation filter based tracking. Then, I will review several recent works of our group including correlation particle filter tracking, and graph convolutional tracking.

  • Speaker: Minsu Cho, Pohang University of Science and Technology (POSTECH)

    Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. Previous approaches can be expressed as a form of training the student to mimic the output activations of individual data examples represented by the teacher. We introduce a novel approach, dubbed relational knowledge distillation (RKD), that instead transfers mutual relations of data examples. For concrete realizations of RKD, we propose distance-wise and angle-wise distillation losses that penalize structural differences in relations. Experiments conducted on different tasks show that the proposed method improves educated student models by a significant margin. In particular for metric learning, it allows students to outperform their teachers’ performance, achieving state-of-the-art results on standard benchmark datasets.
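
    To make the distance-wise idea concrete, a hedged PyTorch sketch is shown below (the normalization and loss details follow the general recipe and may differ from the exact formulation in the paper):

      import torch
      import torch.nn.functional as F

      def normalized_pairwise_distances(x):        # x: (batch, dim) embeddings
          d = torch.cdist(x, x, p=2)
          mean = d[d > 0].mean()                    # normalize by the mean pairwise distance
          return d / (mean + 1e-8)

      def rkd_distance_loss(student_emb, teacher_emb):
          # Penalize structural differences: the student's pairwise-distance structure
          # should match the teacher's, rather than matching individual outputs.
          return F.smooth_l1_loss(normalized_pairwise_distances(student_emb),
                                  normalized_pairwise_distances(teacher_emb))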

  • Speaker: Jun Takamatsu, Nara Institute of Science and Technology

    For household robots that work in dynamic everyday environments, computer vision (CV) for recognizing the environment is essential. Unfortunately, CV issues in household robots sometimes cannot be solved by the methods usually proposed in the CV field. In this talk, I present two such examples and invite discussion of their solutions. The first example is CV in learning-from-observation, where it is not enough to recognize the names of actions, such as walk and jump. The second example is the analysis of time usage, which requires recognizing activities at a level such as watching TV or pursuing a hobby.

  • Speaker: Youyou Lu, Tsinghua University

    Non-volatile memory (NVM) and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. Comparatively, software overhead in file systems becomes a non-negligible part of persistent memory storage systems. To achieve an efficient networked memory design, I will present the design choices in Octopus, a distributed file system that redesigns file system internal mechanisms by closely coupling NVM and RDMA features. I will further discuss possible hardware enhancements for networked memory that are being researched in my group.

  • Speaker: Cheng Li, University of Science and Technology of China

    Training DNN models across a large number of connected devices or machines has become the norm. Studies suggest that the major bottleneck in scaling out training jobs is exchanging the huge volume of gradients per mini-batch. Thus, a few compression algorithms, such as Deep Gradient Compression and TernGrad, have been proposed and evaluated to demonstrate their benefits in reducing transmission cost. However, when re-implementing these algorithms and integrating them into mainstream frameworks such as MXNet, we found that they performed less efficiently than claimed in their original papers. The major gap is that the developers of those algorithms did not necessarily understand the internals of the deep learning frameworks. As a consequence, we believe there is a lack of system support for enabling algorithm developers to focus primarily on the innovations of the compression algorithms, rather than on efficient implementations that must account for various levels of parallelism. To this end, we propose a domain-specific language that allows algorithm developers to sketch their compression algorithms, a translator that converts the high-level descriptions into highly optimized low-level GPU code, and a compiler that generates new computation DAGs that fuse the compression algorithms with the proper gradient-producing operators.
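
    To illustrate the kind of algorithm such a system targets, the sketch below shows simple top-k gradient sparsification (send only the k largest-magnitude entries of each gradient tensor); this is a simplified stand-in, not the proposed DSL or its generated GPU code:

      import torch

      def topk_sparsify(grad, ratio=0.01):
          # Keep only the top `ratio` fraction of entries by magnitude.
          flat = grad.flatten()
          k = max(1, int(flat.numel() * ratio))
          _, indices = torch.topk(flat.abs(), k)
          return indices, flat[indices]            # transmit indices + values instead of the dense tensor

      def topk_desparsify(indices, values, shape):
          out = torch.zeros(shape, dtype=values.dtype).reshape(-1)
          out[indices] = values
          return out.reshape(shape)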

  • Speaker: Jun Du, University of Science and Technology of China

    Solving the cocktail party problem is an ultimate goal for machines to achieve human-level auditory perception. Speech separation and recognition are two related key techniques. With the emergence of deep learning, new milestones have been achieved for both speech separation and recognition. In this talk, I will introduce our recent progress and future trends in these areas in the context of the DIHARD and CHiME challenges.

  • Speaker: Yao Guo, Peking University

    In recent years, operating systems have expanded beyond traditional computing systems into the cloud, IoT devices, and other emerging technologies, and will soon become ubiquitous. We call this new generation of OSs ubiquitous operating systems (UOSs). Despite the apparent differences among existing OSs, they all have in common so-called “software-defined” capabilities, namely resource virtualization and function programmability. In this talk, I will present our vision and some recent work toward the development of UOSs.

  • Speaker: Seungmoon Choi, Pohang University of Science and Technology (POSTECH)

    Tangible interaction allows a user to interact with a computer using ordinary physical objects. It substantially expands the interaction space owing to the natural affordance and metaphors provided by real objects. However, tangible interaction requires identifying the object held by the user or how the user is touching the object. In this talk, I will introduce two sensing techniques for tangible interaction that exploit active sensing using mechanical vibration. A vibration is transmitted from an exciter worn on the user’s hand or fingers, and the transmitted vibration is measured using a sensor. By comparing the input-output pair, we can recognize the object held between two fingers or the fingers touching the object. The mechanical vibrations also provide pleasant confirmation feedback to the user. Details will be shared in the talk.

  • Speaker: Rajesh Krishna Balan, Singapore Management University

    I will describe the line of work I am starting on video analytics in crowded spaces, including malls, conference centres, and university campuses in Asia. The goal of this work is to use video analytics, combined with other sensors, to accurately count the number of people in these environments, track their movement trajectories, and discover their demographics and personas.

  • Speaker: Zhou Zhao, Zhejiang University

    Video dialog is a new and challenging task that requires an agent to answer questions by combining video information with dialog history. Unlike single-turn video question answering, the additional dialog history is important for video dialog, as it often includes contextual information for the question. Existing visual dialog methods mainly use RNNs to encode the dialog history as a single vector representation, which can be rough and simplistic. Some more advanced methods utilize hierarchical structure, attention, and memory mechanisms, but still lack an explicit reasoning process. In this talk, we introduce a novel progressive inference mechanism for video dialog, which progressively updates query information based on dialog history and video content until the agent considers the information sufficient and unambiguous. To tackle the multimodal fusion problem, we propose a cross-transformer module, which can learn more fine-grained and comprehensive interactions both inside and between the modalities. Besides answer generation, we also consider question generation, which is more challenging but significant for a complete video dialog system. We evaluate our method on two large-scale datasets, and extensive experiments show its effectiveness.

  • Speaker: Yingcai Wu, Zhejiang University

    With the rapid development of sensing technologies and wearable devices, large volumes of sports data are acquired daily. These data usually carry a wide spectrum of information and rich knowledge about sports. Visual analytics, which facilitates analytical reasoning through interactive visual interfaces, has proven its value in solving various problems. In this talk, I will discuss our research experiences in visual analytics of sports data and introduce several recent studies by our group on making sense of sports data through interactive visualization.

  • Speaker: Shixia Liu, Tsinghua University

    The quality of training data is crucial to the success of supervised and semi-supervised learning. Errors in data have long been known to limit the performance of machine learning models. This talk presents the motivation for and major challenges of interactive data quality analysis and improvement. With that perspective, I will then discuss some of my recent efforts on 1) analyzing and correcting poor label quality, and 2) resolving the poor coverage of training data caused by dataset bias.

  • Speaker: Huamin Qu, Hong Kong University of Science and Technology

    VIS for AI and AI for VIS have become hot research topics recently. On one side, visualization plays an important role in explainable AI. On the other, AI has been transforming the visualization field and automating the whole visualization system development pipeline. In this talk, I will introduce the emerging opportunities of combining AI and VIS to leverage both human and artificial intelligence to solve some of the grand challenges facing both fields and society.

  • Speaker: Winston Hsu, National Taiwan University

    We have observed super-human capabilities from convolutional networks for image learning. It is a natural extension to advance these technologies towards healthcare applications such as medical image segmentation (CT, MRI), registration, detection, and prediction. In the past few years, working closely with university hospitals, we have found many exciting developments in this area. However, we have also learned a lot from working in a cross-disciplinary setup, which requires strong devotion and deep expertise from both the medical and machine learning domains. We would like to take this opportunity to share where we failed and where we succeeded in our attempts at advancing machine learning for medical applications. We will identify promising working models (and the misunderstandings between these two disciplines) derived with the medical experts, and present evidence of the great opportunities to discover new treatment or diagnosis methods across numerous common diseases.