Zero-Shot Transfer for Wildlife Bioacoustics Detection

Automatically detecting sound events with Artificial Intelligence (AI) has become increasingly popular in the field of bioacoustics, particularly for wildlife monitoring and conservation. Conventional methods predominantly employ supervised learning techniques that depend on substantial amounts of manually annotated bioacoustic data. However, manual annotation in bioacoustics is tremendously resource-intensive, both in terms of human labor and financial cost, and requires considerable domain expertise, which in turn limits the feasibility of crowdsourced annotation platforms such as Amazon Mechanical Turk. Additionally, the supervised learning framework restricts applications to a closed set of predefined categories. To address these challenges, we present a novel approach leveraging a multi-modal contrastive learning technique called Contrastive Language-Audio Pretraining (CLAP). CLAP allows classes to be defined flexibly at inference time through descriptive text prompts and is capable of performing Zero-Shot Transfer on previously unseen datasets. In this study, we demonstrate that without any fine-tuning or additional training, an out-of-the-box CLAP model can effectively generalize across 9 bioacoustics benchmarks covering a wide variety of sounds that are unfamiliar to the model. We show that CLAP achieves recognition performance comparable to, and in some cases exceeding, that of supervised learning baselines fine-tuned on the training data of these benchmarks. Our experiments also indicate that CLAP can perform tasks previously unachievable with supervised bioacoustics approaches, such as foreground/background sound separation and the discovery of unknown animals. Consequently, CLAP offers a promising foundational alternative to traditional supervised learning methods for bioacoustics tasks, facilitating more versatile applications within the field.
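
The snippet below is a minimal sketch of the zero-shot inference idea described above: candidate classes are specified purely as text prompts, and the audio clip is assigned to the prompt whose embedding is most similar. It assumes the Hugging Face `transformers` CLAP implementation with the publicly available `laion/clap-htsat-unfused` checkpoint; the actual checkpoint, prompt wording, and pre-processing used in this work may differ.

```python
# Hedged sketch of zero-shot bioacoustic classification with CLAP.
# The checkpoint, prompts, and file name below are illustrative assumptions.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

CHECKPOINT = "laion/clap-htsat-unfused"  # assumed general-purpose CLAP checkpoint
model = ClapModel.from_pretrained(CHECKPOINT)
processor = ClapProcessor.from_pretrained(CHECKPOINT)

# Classes are defined at inference time purely through descriptive text prompts.
prompts = [
    "the call of a bird",
    "the call of a frog",
    "the sound of an insect",
    "background environmental noise",
]

# CLAP audio encoders are typically trained on 48 kHz audio.
audio, _ = librosa.load("recording.wav", sr=48_000, mono=True)

inputs = processor(
    text=prompts,
    audios=[audio],
    sampling_rate=48_000,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# Cosine-similarity logits between the audio embedding and each text embedding,
# converted into per-prompt probabilities.
probs = outputs.logits_per_audio.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{p:.3f}  {prompt}")
```

Because the class set lives entirely in the prompt list, the same pattern extends to the other use cases mentioned above, e.g. contrasting a foreground species prompt against a background-noise prompt, without retraining the model.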