Anders R. Bargum

Aalborg University

A.C. Meyers Vænge 15

DK-2450 Copenhagen SV, Denmark

A research- and development-oriented PhD student in the field of audio processing, speech representation learning, deep learning and voice synthesis. I am affiliated with the Multisensory Experience Lab at Aalborg University in Copenhagen and actively collaborate with the industrial partner Heka VR. Project use cases range from virtual reality and audio analysis to creative music production.

I am currently working on alternative deep learning methods and models for real-time voice conversion in virtual therapeutic scenarios (AVATAR Therapy). Within the field of audio AI, I have worked on, developed, and trained models across a wide range of topics, including speech verification, differentiable DSP, and neural audio codecs. I also have extensive experience exporting these models for real-time use, for example via Hugging Face or C++/JUCE-hosted TorchScript models.

I have worked and interned at Native Instruments in Berlin and Neutone AI in Tokyo. I have also hosted several workshops and supervised student groups in the Medialogy and Sound and Music Computing programmes at Aalborg University.

I am always open to collaboration, new insights or a general conversation about audio, speech synthesis and AI. You can reach me at arba@create.aau.dk.

selected publications

Frontiers

Reimagining Speech: A Scoping Review of Deep Learning-based Methods for Non-parallel Voice Conversion

Anders R. Bargum, Stefania Serafin, and Cumhur Erkut

Frontiers in Signal Processing, 2024

Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is gaining increasing popularity. Although many of the works in the field of voice conversion share a common global pipeline, there is considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods included when training voice conversion models can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 628 publications from more than 38 venues between 2017 and 2023, followed by an in-depth review of a final database of 130 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls. We condense the knowledge gathered to identify main challenges, supply solutions grounded in the analysis, and provide recommendations for future research directions.

APSIPA

Unified Timbre Transfer: A Compact Model for Real-Time Multi-Instrument Sound Morphing

Anders R. Bargum, Naotake Masuda, Bogdan Teleaga, and 2 more authors

In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2025


Recent advances in transformer- and diffusion-based deep generative models have significantly impacted the field of music and audio synthesis. However, controllable and real-time interactive models, such as those used for timbre transfer in music production, remain largely dominated by auto-encoders and generative adversarial networks. In pursuit of efficient and flexible timbre morphing and multi-instrument timbre transfer, we propose a simplified modeling approach, utilizing an upsampled two-dimensional timbre space in conjunction with engineered and instrument-dependent excitation signals. Different from many other works, our model enables any-to-many timbre transfer with added control over timbre, pitch, and loudness. We additionally allow for seamless interpolation between instruments, eliminating the need for separate model training. Our evaluation shows performance comparable to specialized models, making it highly relevant for the broader creative audio community.