
Microsoft Research Lab – Asia

Netizen-style commenting for fashion photos: autonomous, diverse, and cognitive


Deep neural networks have brought major advances in image captioning. However, current work is deficient in several ways. It generates only “vanilla” sentences that describe the shallow appearance of things in the photo (e.g., color, types), and it typically does not produce a caption with engaging information about the context or the intentions behind the photo, as a human would.

Recently, Professor Winston Hsu from National Taiwan University collaborated with researchers at Microsoft Research Asia (MSRA) to address this challenge in social media photo commenting for user-contributed fashion photos.

Hsu noted, “This idea was developed during our previous collaboration on XiaoIce with Ruihua Song, Principal Data and Applied Science Lead at MSRA, where we empowered a chatbot to compose modern Chinese poems for user-uploaded photos. We were very excited about how much more can be done in photo commenting for user-contributed fashion photos. In this work, we aim to create comments like a “netizen” would actually sound, which reflects the culture in the designated social community and fosters more engagement with the chatbot and between the users. We expect the results have application in social media, customer services, e-commerce, fashion mining, and other areas.”

In their project, Hsu and the MSRA researchers aimed for an autonomous learning process that leverages freely available social media data. They focused on designing robust, fine-grained neural networks that are trained by aligning noisy comments with photos. Their approach handles attention in cluttered photos automatically and avoids costly object-level annotations. Because fresh and diverse comments are desirable, they introduced diversity by marrying a topic discovery model (latent Dirichlet allocation) with the neural networks, coupling generative and discriminative network models in an end-to-end learning framework. They also realized that there are usually intentions behind user-contributed photos. Their work therefore addresses cognition in photos by reasoning about (a) the intention and (b) the context, and by proactively considering both when commenting (e.g., “might be cold tonight; bring a coat with you”).

netizen-style commenting for fashion photos

Legend: Project result examples. Last row: Result from this project. 2nd row: Input by humans. 3rd-5th rows: Results by other state-of-the-art methods. As shown in the last row, comments generated from this project are more vivid and dynamic, and sound more like a human said them.

To generate more human-like online comments for fashion photos, the team compiled a large collection of paired user-contributed fashion photos and comments, called NetiLook, from an online clothing style community. “Our collected NetiLook has 350 thousand photos, with 5 million comments. To the best of our knowledge, it’s the largest fashion comment dataset so far. We have made the dataset public for the research community,” Hsu said. In their experiments on NetiLook, they found that current methods tend to overfit to a general pattern, which makes the captioning results insipid and banal (e.g., “love the …”). To compensate for this deficiency and enrich the diversity of the text, they decided to integrate a style weight derived from a topic discovery model (latent Dirichlet allocation, LDA) with the neural networks, so that the generated comments are more diverse, vivid, and human-like.
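As a rough illustration of the idea, the sketch below mixes per-topic word distributions into a style weight and uses it to rescale a caption decoder's next-word probabilities. It is a toy under stated assumptions, not the team's implementation: the function names, the four-word vocabulary, all of the numbers, and the element-wise product followed by renormalization are hypothetical choices made only for illustration.

```python
def style_weight(topic_probs, topic_word_probs):
    """Mix per-topic word distributions by the topic proportions.

    topic_probs: P(topic | photo), assumed to come from a topic model.
    topic_word_probs: one word distribution per topic (as LDA would learn).
    """
    vocab_size = len(topic_word_probs[0])
    return [
        sum(p_t * topic_word_probs[t][w] for t, p_t in enumerate(topic_probs))
        for w in range(vocab_size)
    ]


def reweight(decoder_probs, weights):
    """Rescale decoder probabilities element-wise, then renormalize to sum to 1."""
    scores = [p * w for p, w in zip(decoder_probs, weights)]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical toy values: a decoder that favors the generic "love the ..." pattern.
vocab = ["love", "the", "coat", "cozy"]
decoder_probs = [0.5, 0.3, 0.1, 0.1]

# Hypothetical topic model output: topic 1 (outerwear) dominates for this photo.
topic_probs = [0.2, 0.8]
topic_word_probs = [
    [0.4, 0.4, 0.1, 0.1],  # topic 0: generic praise words
    [0.1, 0.1, 0.4, 0.4],  # topic 1: outerwear-related words
]

weights = style_weight(topic_probs, topic_word_probs)
adjusted = reweight(decoder_probs, weights)

for word, before, after in zip(vocab, decoder_probs, adjusted):
    print(f"{word:>5}: {before:.3f} -> {after:.3f}")
```

In this toy setting the generic words lose probability mass and the topic-specific words gain it, which is the intuition behind steering a captioner away from a single overused pattern.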

“Diversity is one of the biggest challenges in text generation. This project not only designed brand-new diversity measurements, but also proposed a smart way of marrying topic models with neural networks to make up for the insufficiency of conventional image captioning,” said Song.
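The article does not spell out the project's own diversity measurements, so the sketch below shows a generic, widely used stand-in instead: the distinct-n ratio, which is the share of unique n-grams among all n-grams in a set of generated comments. The function name, the whitespace tokenization, and the example comments are assumptions made for illustration only.

```python
def distinct_n(comments, n=2):
    """Return unique n-grams divided by total n-grams across all comments."""
    total = 0
    unique = set()
    for comment in comments:
        tokens = comment.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Hypothetical outputs: one system repeats a pattern, the other varies its wording.
bland = ["love the coat", "love the dress", "love the shoes"]
varied = ["love the coat", "that scarf is adorable", "perfect outfit for autumn"]

print(f"bland:  {distinct_n(bland):.2f}")
print(f"varied: {distinct_n(varied):.2f}")
```

A system that keeps producing “love the …” scores lower on this ratio than one whose comments share few phrases, which makes the banality described above measurable.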
