2026
Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening
Diego Di Carlo, Shoichi Koyama, Arie Aditya Nugraha, Mathieu Fontaine, Yoshiaki Bando, Kazuyoshi Yoshii
in IEEE Transactions on Audio, Speech, and Language Processing,
Vol. ???, Num. ???,
pp. ???, 2026.
This paper investigates continuous representations of steering vectors over frequency and the positions of the microphone and source for augmented listening (e.g., spatial filtering and binaural rendering) with precise control of the sound field perceived by the user. Steering vectors have typically been used for representing the spatial characteristics of the sound field as a function of the listening position. The basic algebraic representation of steering vectors, which assumes an idealized environment, cannot deal with the scattering effect of the sound field. One may thus collect a discrete set of real steering vectors measured in dedicated facilities and super-resolve (i.e., upsample) them. Recently, physics-aware deep learning methods have been effectively used for this purpose. Such deterministic super-resolution, however, suffers from overfitting due to the non-uniform uncertainty over the measurement space. To solve this problem, we integrate an expressive representation based on the neural field (NF) into a principled probabilistic framework based on the Gaussian process (GP). Specifically, we propose a physics-aware composite kernel that models the directional incoming waves and the subsequent scattering effect. A comprehensive comparative experiment showed the effectiveness of the proposed method under data-insufficiency conditions. In downstream tasks such as speech enhancement and binaural rendering using the simulated data of the SPEAR challenge, oracle performance was attained with less than one-tenth of the measurements.
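As a toy illustration of the composite-kernel idea (not the paper's model: the kernels, frequencies, and data below are all illustrative stand-ins), a GP posterior mean under a sum of an oscillatory "wave" kernel and a smooth "scattering" kernel can be computed in a few lines:

```python
import numpy as np

def rbf(x1, x2, ell=0.5, var=1.0):
    # Squared-exponential kernel: a smooth residual (scattering-like) component
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2)

def cosine(x1, x2, freq=2.0, var=1.0):
    # Stationary cosine kernel: a crude stand-in for oscillatory wave behaviour
    d = x1[:, None] - x2[None, :]
    return var * np.cos(2 * np.pi * freq * d)

def composite(x1, x2):
    # A sum of kernels models a sum of independent latent processes
    return cosine(x1, x2) + 0.1 * rbf(x1, x2)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * 2.0 * x_train) + 0.01 * rng.standard_normal(20)

sigma2 = 1e-4  # observation noise variance
K = composite(x_train, x_train) + sigma2 * np.eye(20)
alpha = np.linalg.solve(K, y_train)

x_test = np.linspace(0, 1, 100)
mean = composite(x_test, x_train) @ alpha  # GP posterior mean on a dense grid
```

The same mechanics carry over when the inputs are (frequency, position) tuples and the kernel components are physics-informed.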
2021
Mean absorption estimation from room impulse responses using virtually supervised learning
Cedric Foy, Antoine Deleforge, Diego Di Carlo
in The Journal of the Acoustical Society of America,
Vol. 150, Num. 2,
pp. 1286--1299, 2021.
@article{foy2021mean,
title={Mean absorption estimation from room impulse responses using virtually supervised learning},
author={Foy, C{\'e}dric and Deleforge, Antoine and Di Carlo, Diego},
journal={The Journal of the Acoustical Society of America},
volume={150},
number={2},
pages={1286--1299},
year={2021},
publisher={AIP Publishing}
}
In the context of building acoustics and the acoustic diagnosis of an existing room, this paper introduces and investigates a new approach to estimate the mean absorption coefficients solely from a room impulse response (RIR).
This inverse problem is tackled via virtually supervised learning, namely, the RIR-to-absorption mapping is implicitly learned by regression on a simulated dataset using artificial neural networks.
Simple models based on well-understood architectures are the focus of this work. The critical choices of geometric, acoustic, and simulation parameters, which are used to train the models, are extensively discussed and studied while keeping in mind the conditions that are representative of the field of building acoustics.
Estimation errors from the learned neural models are compared to those obtained with classical formulas that require knowledge of the room's geometry and reverberation times. Extensive comparisons made on a variety of simulated test sets highlight different conditions under which the learned models can overcome the well-known limitations of the diffuse sound field hypothesis underlying these formulas.
Results obtained on real RIRs measured in an acoustically configurable room show that at 1 kHz and above, the proposed approach performs comparably to classical models when reverberation times can be reliably estimated and continues to work even when they cannot.
2021
dEchorate: a calibrated room impulse response dataset for echo-aware signal processing
Diego Di Carlo, Pinchas Tandeitnik, Cedric Foy, Nancy Bertin, Antoine Deleforge, Sharon Gannot
in EURASIP Journal on Audio, Speech, and Music Processing,
Vol. 2021,
pp. 1--15, 2021.
@article{carlo2021dechorate,
title={dEchorate: a calibrated room impulse response dataset for echo-aware signal processing},
author={Di Carlo, Diego and Tandeitnik, Pinchas and Foy, C{\'e}dric and Bertin, Nancy and Deleforge, Antoine and Gannot, Sharon},
journal={EURASIP Journal on Audio, Speech, and Music Processing},
volume={2021},
pages={1--15},
year={2021},
publisher={Springer}
}
This paper presents a new dataset of measured multichannel room impulse responses (RIRs) named dEchorate.
It includes annotations of early echo timings and 3D positions of microphones, real sources, and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling, and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate, and visualize the data as well as baseline methods for echo-related tasks.
2019
Audio-Based Search and Rescue With a Drone: Highlights From the IEEE Signal Processing Cup 2019 Student Competition
Antoine Deleforge, Diego Di Carlo, Martin Strauss, Romain Serizel, Lucio Marcenaro
in IEEE Signal Processing Magazine,
Vol. 36, Num. 5,
pp. 138--144, 2019.
@article{Deleforge2019audio,
author = {Deleforge, Antoine and {Di Carlo}, Diego and Strauss, Martin and Serizel, Romain and Marcenaro, Lucio},
journal = {IEEE Signal Processing Magazine},
number = {5},
pages = {138--144},
publisher = {IEEE},
title = {Audio-Based Search and Rescue With a Drone: Highlights From the IEEE Signal Processing Cup 2019 Student Competition [SP Competitions]},
url = {https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8827999},
volume = {36},
year = {2019}
}
Interest in unmanned aerial vehicles (UAVs), commonly referred to as drones, has grown considerably in recent years. Search-and-rescue scenarios, in which humans in emergency situations must be found quickly in difficult-to-access areas, constitute an important field of application for this technology. Drones have already been used by humanitarian organizations in countries such as Haiti and the Philippines to map areas after a natural disaster using high-resolution embedded cameras, as documented in a recent United Nations report [1]. Although research efforts have focused mostly on developing video-based solutions for this task [2], UAV-embedded audio-based localization has received relatively little attention [3-7]. However, UAVs equipped with a microphone array could be of critical help in localizing people in emergency situations, especially when video sensors are limited by a lack of visual feedback due to bad lighting conditions (such as at night or in fog) or obstacles limiting the field of view (Figure 1).
2026
SIRUP: A diffusion-based virtual upmixer of steering vectors for highly-directive spatialization with first-order ambisonics
Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2026.
@inproceedings{picard2026sirup,
title={SIRUP: A diffusion-based virtual upmixer of steering vectors for highly-directive spatialization with first-order ambisonics},
author={Picard, Emilio and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi},
booktitle={ICASSP 2026},
year={2026}
}
This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with fewer channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of the FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings, conditioned on the FOA data. Experimental results showed that SIRUP achieved a significant improvement compared to FOA systems for steering vector upmixing, source localization, and speech denoising.
2026
Physics-informed Learning Of Neural Scattering Fields Towards Measurement-free Mesh-to-HRTF Estimation
Tancrède Martinez, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Kazuyoshi Yoshii
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2026.
@inproceedings{martinez2026physics,
title={Physics-Informed Learning of Neural Scattering Fields Towards Measurement-Free Mesh-to-HRTF Estimation},
author={Martinez, Tancr{\`e}de and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Yoshii, Kazuyoshi},
booktitle={ICASSP 2026},
year={2026}
}
This paper describes neural simulation of the scattered pressure field from a plane wave around a scattering object in both continuous 2D and 3D domains. This task has typically been treated as a regression problem that aims to train a physics-informed neural network (PINN) using pressure measurements at discrete positions. This approach, however, needs to train the whole network for each incident wave direction. To address this, we propose a measurement-free simulator based on a PINN purely driven by the Helmholtz equation with the Robin boundary condition and the Sommerfeld radiation condition with the aid of the perfectly matched layer (PML) framework. More specifically, we design a physics-informed scattering hypernetwork (PHISK) that can generalize to incident waves from any direction via low-rank adaptation (LoRA) of a PINN trained for a specific configuration. Experiments showed that the proposed method accurately simulated sound scattering around various objects, adapting to unseen incident wave directions with minimal performance loss, and realized reasonable simulation of head-related transfer functions (HRTFs) from complex mesh data of a human head.
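To make the governing constraint concrete: a PINN for this problem penalizes the residual of the Helmholtz equation, ∇²p + k²p = 0. The following sketch (purely illustrative, not PHISK code) verifies numerically that an exact plane-wave solution has a vanishing finite-difference residual:

```python
import numpy as np

k = 2 * np.pi  # wavenumber
h = 1e-3       # grid spacing for finite differences
x = np.linspace(0.1, 0.9, 50)
y = np.linspace(0.1, 0.9, 50)
X, Y = np.meshgrid(x, y)

def p(X, Y, direction=(1.0, 0.0)):
    # Complex plane wave exp(i k d.x), an exact solution of the Helmholtz equation
    dx, dy = direction
    return np.exp(1j * k * (dx * X + dy * Y))

# Five-point-stencil Laplacian via central differences
lap = (p(X + h, Y) + p(X - h, Y) + p(X, Y + h) + p(X, Y - h) - 4 * p(X, Y)) / h**2

# Helmholtz residual: the quantity a PINN drives toward zero over the domain
residual = lap + k**2 * p(X, Y)
```

In the PINN, derivatives come from automatic differentiation rather than finite differences, and the same residual is evaluated at the network's predicted pressure.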
2025
Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives
Haruaki Asano, Ryunosuke Nihei, Yoshiaki Bando, Aditya Arie Nugraha, Diego Di Carlo, Hiroyuki Ueda, Yosuke Ito, Kazuyoshi Yoshii
in IEEE Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC),
2025.
@inproceedings{asano2025visually,
title={Visually-Informed Multichannel Sound Source Separation Based on 3D Gaussian Primitives},
author={Asano, Haruaki and Nihei, Ryunosuke and Bando, Yoshiaki and Nugraha, Aditya Arie and Di Carlo, Diego and Ueda, Hiroyuki and Ito, Yosuke and Yoshii, Kazuyoshi},
booktitle={2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
pages={36--41},
year={2025},
organization={IEEE}
}
This paper proposes visually-informed sound source separation for audio-visual understanding of indoor scenes captured by distributed microphone arrays and cameras. Our approach leverages the 3D information of sound-emitting objects, reconstructed via 3D Gaussian splatting (3DGS), to overcome a limitation of modern blind source separation methods like multichannel nonnegative matrix factorization (MNMF). While adaptable and potentially performant, the iterative optimization of MNMF often converges to poor local minima due to the highly-expressive full-rank spatial covariance matrices (SCMs) of sources. Our key idea is to treat the set of 3D Gaussians representing a sizable sound source object as a collection of sub-sources that share an audio signal but have unique emission weights, both of which are to be estimated jointly from an observed mixture. To enforce this structure, we guide MNMF by regularizing the SCM of each source object at each frequency. Specifically, we use a prior that centers the SCM estimate around a weighted sum of theoretical SCMs, which are analytically derived from the 3D Gaussian positions. Experiments with simulated data, featuring two 3D human models, demonstrated the effectiveness of the proposed method. To our knowledge, this is the first work to use 3D Gaussians as a common primitive for joint audio-visual analysis.
2025
Physically Informed Spatial Regularization for Sound Event Localization and Detection
Haocheng Liu, Diego Di Carlo, Aditya Arie Nugraha, Kazuyoshi Yoshii, Gaël Richard, Mathieu Fontaine
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA),
2025.
@inproceedings{liu2025physically,
title={Physically Informed Spatial Regularization for Sound Event Localization and Detection},
author={Liu, Haocheng and Di Carlo, Diego and Nugraha, Aditya Arie and Yoshii, Kazuyoshi and Richard, Ga{\"e}l and Fontaine, Mathieu},
booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
pages={1--5},
year={2025},
organization={IEEE}
}
Building Sound Event Localization and Detection (SELD) models that are robust to diverse acoustic environments remains one of the major challenges in multichannel signal processing, as reflections and reverberation can significantly confuse both the source direction and event detection. Introducing priors such as microphone geometry or room impulse response (RIR) into the model has proven effective in addressing this issue. Existing methods typically incorporate such priors in a deterministic way, often through data augmentation to enlarge data diversity. However, the uncertainty arising from the complex nature of room acoustics remains largely underexplored in the SELD literature and naturally calls for stochastic modeling of the acoustic prior. In this paper, we propose regularizing deep-learning-based SELD models with a physically constructed spatial covariance matrix (SCM) based on the estimated direction of arrival (DOA) and sound event detection (SED).
2025
SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer
Diego Di Carlo, Mathieu Fontaine, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
in European Signal Processing Conference (EUSIPCO),
2025.
@inproceedings{dicarlo2025shamans,
author = {Di Carlo, Diego and Fontaine, Mathieu and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer},
booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)},
year = {2025},
preprint = {https://arxiv.org/abs/2506.18954}
}
2024
Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising
Yoto Fujita, Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii
in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA),
2024.
@inproceedings{fujita2024runtimeadaptation,
author = {Fujita, Yoto and Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
title = {Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising},
booktitle = {Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)},
year = {2024},
month = dec,
address = {Macau, China},
preprint = {https://arxiv.org/abs/2410.22805}
}
This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is to use neural beamforming that can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used for generating pseudo ground-truth data from a mixture. Based on this idea, a prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
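For readers unfamiliar with the beamformers named above, the common building block is the distortionless-response filter. A minimal numpy sketch of the classical MVDR solution w = R⁻¹d / (dᴴR⁻¹d) (illustrative only; the paper's WPD filter additionally stacks delayed observations for joint dereverberation) is:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4  # number of microphones

# Far-field ULA steering vector toward the target direction (30 degrees)
d = np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(30)))

# A random Hermitian positive-definite noise covariance matrix
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + M * np.eye(M)

# MVDR: minimize output power subject to the distortionless constraint w^H d = 1
Rinv_d = np.linalg.solve(R, d)
w = Rinv_d / (d.conj() @ Rinv_d)
```

The enhanced signal is then `w.conj() @ x` for each multichannel observation `x`; the DNN's role in the paper is to supply the statistics from which `R` (and, for WPD, the joint filter) is estimated.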
2024
RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation
Liam Kelley, Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii
in Annual Conference of the International Speech Communication Association (Interspeech),
2024.
@inproceedings{kelley2024ririnabox,
abbr = {Interspeech},
bibtex_show = {true},
author = {Kelley, Liam and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation},
booktitle = {Proceedings of Annual Conference of the International Speech Communication
Association (Interspeech)},
year = {2024},
month = sep,
pages = {3255-3259},
address = {Kos Island, Greece},
url = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.html},
html = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.html},
pdf = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.pdf},
preprint = {https://telecom-paris.hal.science/hal-04632526},
doi = {10.21437/Interspeech.2024-2053}
}
2024
Joint Audio Source Localization and Separation with Distributed Microphone Arrays Based on Spatially-Regularized Multichannel NMF
Yoshiaki Sumura, Diego Di Carlo, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii
in International Workshop on Acoustic Signal Enhancement (IWAENC),
2024.
@inproceedings{sumura2024jointlocalsep,
abbr = {IWAENC},
bibtex_show = {true},
author = {Sumura, Yoshiaki and Di Carlo, Diego and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {Joint Audio Source Localization and Separation with Distributed Microphone Arrays Based on Spatially-Regularized Multichannel NMF},
booktitle = {Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC)},
year = {2024},
month = sep,
pages = {145-149},
address = {Aalborg, Denmark},
url = {https://ieeexplore.ieee.org/document/10694042},
html = {https://ieeexplore.ieee.org/document/10694042},
doi = {10.1109/IWAENC61483.2024.10694042}
}
This paper describes a statistically principled method that simultaneously localizes and separates multiple sound sources using multiple calibrated microphone arrays distributed in a room. Given the extensive research on direction of arrival (DOA) estimation with a single microphone array, for 3D source localization, one may attempt triangulation based on DOAs separately and egocentrically estimated by multiple arrays. However, in multiple-source scenarios, this cascading approach faces both the inter-array DOA association problem and the error accumulation problem. To solve these problems, we propose a spatially regularized extension of a versatile blind source separation method called multichannel nonnegative matrix factorization (MNMF). Our method treats multiple microphone arrays as a single big array and puts priors on the frequency-wise spatial covariance matrices (SCMs) of each source. These priors are defined using the source DOA computed from the 3D positions of the source and arrays. The power spectral densities (PSDs), SCMs, and positions of multiple sources are jointly estimated under the unified maximum-a-posteriori (MAP) principle. We show the effectiveness of the joint statistical estimation for real data recorded by four five-channel microphone arrays of the Microsoft Azure Kinect.
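The DOA-based prior rests on a standard construction: under a far-field plane-wave model, a DOA implies a rank-1 theoretical SCM d dᴴ built from the steering vector d. A minimal sketch under a uniform-linear-array assumption (parameters are illustrative, not the paper's setup):

```python
import numpy as np

def steering_vector(doa_deg, n_mics=5, spacing=0.05, freq=1000.0, c=343.0):
    # Far-field plane-wave steering vector for a uniform linear array:
    # per-microphone delays converted to phase shifts at the given frequency
    delays = np.arange(n_mics) * spacing * np.sin(np.deg2rad(doa_deg)) / c
    return np.exp(-2j * np.pi * freq * delays)

def theoretical_scm(doa_deg, **kw):
    # Rank-1 spatial covariance matrix d d^H implied by a single plane wave
    d = steering_vector(doa_deg, **kw)
    return np.outer(d, d.conj())

R = theoretical_scm(45.0)
```

In the paper such matrices (one per frequency, computed from the candidate source position relative to each array) center the prior on the full-rank SCMs estimated by MNMF.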
2024
Neural Steerer: Novel Steering Vector Synthesis with a Causal Neural Field over Frequency and Direction
Diego Di Carlo, Aditya Arie Nugraha, Mathieu Fontaine, Yoshiaki Bando, and Kazuyoshi Yoshii
in IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW),
2024.
@inproceedings{dicarlo2024neuralsteerer,
abbr = {ICASSPW},
bibtex_show = {true},
author = {Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {Neural Steerer: Novel Steering Vector Synthesis with a Causal Neural Field over Frequency and Direction},
booktitle = {Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing Workshops (ICASSPW)},
month = apr,
year = {2024},
pages = {740-744},
address = {Seoul, South Korea},
url = {https://ieeexplore.ieee.org/document/10626510},
html = {https://ieeexplore.ieee.org/document/10626510},
preprint = {https://arxiv.org/abs/2305.04447},
doi = {10.1109/ICASSPW62465.2024.10626510}
}
We address the problem of accurately interpolating measured anechoic steering vectors with a deep learning framework called the neural field. This task plays a pivotal role in reducing the resource-intensive measurements required for precise sound source separation and localization, essential as the front-end of speech recognition. Classical approaches to interpolation rely on linear weighting of nearby measurements in space on a fixed, discrete set of frequencies. Drawing inspiration from the success of neural fields for novel view synthesis in computer vision, we introduce the neural steerer, a continuous complex-valued function that takes both frequency and direction as input and produces the corresponding steering vector. Importantly, it incorporates inter-channel phase difference information and a regularization term enforcing filter causality, essential for accurate steering vector modeling. Our experiments, conducted using a dataset of real measured steering vectors, demonstrate the effectiveness of our resolution-free model in interpolating such measurements.
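Neural fields of this kind typically pass their low-dimensional inputs through a sinusoidal positional encoding before a small MLP, so the network can represent high-frequency structure over frequency and direction. A minimal sketch of such an encoding (illustrative; not the Neural Steerer's actual architecture or band choices):

```python
import numpy as np

def fourier_features(x, n_bands=4):
    # Encode each scalar input with sin/cos at geometrically spaced frequencies
    x = np.atleast_2d(x)                     # (batch, dims)
    bands = 2.0 ** np.arange(n_bands)        # frequencies 1, 2, 4, 8
    ang = 2 * np.pi * x[:, :, None] * bands  # (batch, dims, n_bands)
    feat = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feat.reshape(x.shape[0], -1)      # (batch, dims * 2 * n_bands)

# Example: normalized (frequency, azimuth) input pairs
inp = np.array([[0.25, 0.5], [0.75, 0.1]])
enc = fourier_features(inp)
```

The encoded vector is what the MLP consumes; the complex-valued steering vector is produced at the output, with the causality regularizer applied during training.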
2024
Implicit neural representation for change detection
Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci
in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV),
2024.
@inproceedings{naylor2024implicit,
title={Implicit neural representation for change detection},
author={Naylor, Peter and Di Carlo, Diego and Traviglia, Arianna and Yamada, Makoto and Fiorucci, Marco},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
pages={935--945},
year={2024},
doi={10.1109/WACV57701.2024.00098},
html={https://ieeexplore.ieee.org/abstract/document/10483630/}
}
Identifying changes in a pair of 3D aerial LiDAR point clouds, obtained during two distinct time periods over the same geographic region, presents a significant challenge due to the disparities in spatial coverage and the presence of noise in the acquisition system. The most commonly used approaches to detecting changes in point clouds are based on supervised methods, which necessitate extensive labelled data often unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: an Implicit Neural Representation (INR) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. INR offers a grid-agnostic representation for encoding bi-temporal point clouds, with unmatched spatial support that can be regularised to enhance high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset comprising simulated LiDAR point clouds for urban sprawling. This dataset encompasses diverse challenging scenarios, varying in resolutions, input modalities and noise levels. This enables a comprehensive multi-scenario evaluation, comparing our method with the current state-of-the-art approach. We outperform the previous methods by a margin of 10% in the intersection over union metric. In addition, we put our techniques to practical use by applying them in a real-world scenario to identify instances of illicit excavation of archaeological sites and validate our results by comparing them with findings from field experts.
2023
Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning
Aditya Arie Nugraha, Diego Di Carlo, Yoshiaki Bando, Mathieu Fontaine, and Kazuyoshi Yoshii
in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA),
2023.
@inproceedings{nugraha2023gpdkl,
selected = {true},
abbr = {WASPAA},
bibtex_show = {true},
author = {Nugraha, Aditya Arie and Di Carlo, Diego and Bando, Yoshiaki and Fontaine, Mathieu and Yoshii, Kazuyoshi},
title = {Time-Domain Audio Source Separation Based on Gaussian Processes with Deep Kernel Learning},
booktitle = {Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
year = {2023},
month = oct,
pages = {1--5},
address = {New Paltz, NY, USA},
url = {https://ieeexplore.ieee.org/document/10248168},
html = {https://ieeexplore.ieee.org/document/10248168},
preprint = {https://hal.science/hal-04172863},
doi = {10.1109/WASPAA58266.2023.10248168}
}
This paper revisits single-channel audio source separation based on a probabilistic generative model of a mixture signal defined in the continuous time domain. We assume that each source signal follows a non-stationary Gaussian process (GP), i.e., any finite set of sampled points follows a zero-mean multivariate Gaussian distribution whose covariance matrix is governed by a kernel function over time-varying latent variables. The mixture signal composed of such source signals thus follows a GP whose covariance matrix is given by the sum of the source covariance matrices. To estimate the latent variables from the mixture signal, we use a deep neural network with an encoder-separator-decoder architecture (e.g., Conv-TasNet) that separates the latent variables in a pseudo-time-frequency space. The key feature of our method is to feed the latent variables into the kernel function for estimating the source covariance matrices, instead of using the decoder for directly estimating the time-domain source signals. This enables the decomposition of a mixture signal into the source signals with a classical yet powerful Wiener filter that considers the full covariance structure over all samples. The kernel function and the network are trained jointly in the maximum likelihood framework. Comparative experiments using two-speech mixtures under clean, noisy, and noisy-reverberant conditions from the WSJ0-2mix, WHAM!, and WHAMR! benchmark datasets demonstrated that the proposed method performed well and outperformed the baseline method under noisy and noisy-reverberant conditions.
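The classical full-covariance Wiener filter at the core of the method can be sketched directly: given source covariances K₁ and K₂ and noise variance σ², the posterior mean of source i is Kᵢ(K₁+K₂+σ²I)⁻¹x, and the source estimates plus the noise estimate reconstruct the mixture exactly. A toy numpy illustration (stationary hand-set kernels stand in for the learned, time-varying deep kernels of the paper):

```python
import numpy as np

def rbf(t, ell):
    # Stationary squared-exponential kernel over time
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(0)
n = 200
t = np.linspace(0, 1, n)

# Two zero-mean GP sources with different time scales
K1 = rbf(t, 0.02)   # fast-varying source
K2 = rbf(t, 0.2)    # slow-varying source
sigma2 = 1e-3       # observation noise variance

# Sample the sources (small jitter for a stable Cholesky factorization)
L1 = np.linalg.cholesky(K1 + 1e-6 * np.eye(n))
L2 = np.linalg.cholesky(K2 + 1e-6 * np.eye(n))
s1, s2 = L1 @ rng.standard_normal(n), L2 @ rng.standard_normal(n)
x = s1 + s2 + np.sqrt(sigma2) * rng.standard_normal(n)

# Full-covariance Wiener filter: E[s_i | x] = K_i (K1 + K2 + sigma2*I)^{-1} x
alpha = np.linalg.solve(K1 + K2 + sigma2 * np.eye(n), x)
s1_hat, s2_hat = K1 @ alpha, K2 @ alpha
```

In the paper, the kernels are evaluated on DNN-estimated latent variables rather than on time directly, which is what makes the sources non-stationary and separable from a single channel.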
2022
Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization
Mathieu Fontaine, Diego Di Carlo, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, and Kazuyoshi Yoshii
in European Signal Processing Conference (EUSIPCO),
2022.
@inproceedings{fontaine2022alphamusic,
abbr = {EUSIPCO},
bibtex_show = {true},
author = {Fontaine, Mathieu and Di Carlo, Diego and Sekiguchi, Kouhei and Nugraha, Aditya Arie and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {Elliptically Contoured Alpha-Stable Representation for MUSIC-Based Sound Source Localization},
booktitle = {Proceedings of European Signal Processing Conference (EUSIPCO)},
year = {2022},
month = aug,
pages = {26--30},
address = {Belgrade, Serbia},
url = {https://ieeexplore.ieee.org/document/9909944},
html = {https://ieeexplore.ieee.org/document/9909944},
pdf = {https://eurasip.org/Proceedings/Eusipco/Eusipco2022/pdfs/0000026.pdf}
}
This paper introduces a theoretically rigorous sound source localization (SSL) method based on a robust extension of the classical multiple signal classification (MUSIC) algorithm. The original SSL method estimates the noise eigenvectors and the MUSIC spectrum by computing the spatial covariance matrix of the observed multichannel signal and then detects the peaks of the spectrum. In this work, the covariance matrix is replaced with the positive definite shape matrix originating from the elliptically contoured α-stable model, which is more suitable under real noisy, highly reverberant conditions. Evaluation on synthetic data shows that the proposed method outperforms baseline methods under such adverse conditions, while remaining comparable on real data recorded under mild acoustic conditions.
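The baseline pipeline that the paper robustifies is standard MUSIC: sample covariance, eigendecomposition, noise-subspace projection, peak picking. A compact numpy sketch with a single simulated source on a half-wavelength ULA (illustrative; the paper replaces the sample covariance with the α-stable shape matrix, not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, doa_true = 6, 500, 20.0  # mics, snapshots, source angle (degrees)

def steer(theta_deg):
    # Half-wavelength ULA steering vector
    return np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

# One narrowband source in white noise
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
n = 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
X = np.outer(steer(doa_true), s) + n

R = X @ X.conj().T / T        # sample spatial covariance
w, V = np.linalg.eigh(R)      # eigenvalues in ascending order
En = V[:, :M - 1]             # noise subspace (one source assumed)

# MUSIC pseudo-spectrum: large where the steering vector is orthogonal
# to the noise subspace
grid = np.arange(-90.0, 90.5, 0.5)
spec = np.array([1.0 / np.linalg.norm(En.conj().T @ steer(th)) ** 2 for th in grid])
doa_est = grid[np.argmax(spec)]
```

Swapping `R` for a robustly estimated shape matrix leaves the rest of the pipeline unchanged, which is what makes the α-stable extension drop-in.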
2022
Post processing sparse and instantaneous 2D velocity fields using physics-informed neural networks
Diego Di Carlo, Dominique Heitz, Thomas Corpetti
in 20th International Symposium on Application of Laser and Imaging Techniques to Fluid Mechanics (LXLASER),
2022.
@inproceedings{di2022post,
title={Post processing sparse and instantaneous 2D velocity fields using physics-informed neural networks},
author={Di Carlo, Diego and Heitz, Dominique and Corpetti, Thomas},
booktitle={Proceedings of the 20th International Symposium on Application of Laser and Imaging Techniques to Fluid Mechanics},
doi={10.55037/lxlaser.20th.183},
year={2022}
}
This work tackles the problem of resolving high-resolution velocity fields from a set of sparse off-grid observations. This task, crucial in many applications ranging from experimental fluid dynamics to computer vision and medicine, can be addressed with deep neural network models trained under physics-based constraints. This work proposes an original unsupervised deep learning framework involving sub-grid models that improve the accuracy of super-resolved instantaneous and sparse velocity fields of turbulent flows. Python code, dataset, and results are available at https://github.com/Chutlhu/TurboSuperResultion/
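A minimal example of the kind of physics-based constraint such frameworks rely on: for an incompressible 2D flow, the divergence of the velocity field should vanish, so its mean squared value can serve as an unsupervised loss term. The finite-difference evaluation below is only a sketch; a physics-informed network would differentiate its own outputs instead.

```python
import numpy as np

def divergence_penalty(u, v, dx=1.0, dy=1.0):
    """Physics residual usable as an unsupervised loss term: for an
    incompressible 2D flow, du/dx + dv/dy should vanish everywhere.
    Evaluated here with central finite differences on a regular grid."""
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dy, axis=0)
    return np.mean((du_dx + dv_dy) ** 2)

# An analytically divergence-free test field: u = sin(x)cos(y), v = -cos(x)sin(y).
x, y = np.meshgrid(np.linspace(0, np.pi, 64), np.linspace(0, np.pi, 64))
u = np.sin(x) * np.cos(y)
v = -np.cos(x) * np.sin(y)
loss = divergence_penalty(u, v, dx=x[0, 1] - x[0, 0], dy=y[1, 0] - y[0, 0])
# `loss` is close to zero, up to finite-difference error.
```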
2020
BLASTER: An Off-Grid Method for Blind and Regularized Acoustic Echoes Retrieval
Di Carlo, Diego and Elvira, Clement and Deleforge, Antoine and Bertin, Nancy and Gribonval, Rémi
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2020.
2024
RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation
Kelley, Liam and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi
in Annual Conference of the International Speech Communication Association (Interspeech),
2024.
@inproceedings{kelley2024ririnabox,
abbr = {Interspeech},
bibtex_show = {true},
author = {Kelley, Liam and Di Carlo, Diego and Nugraha, Aditya Arie and Fontaine, Mathieu and Bando, Yoshiaki and Yoshii, Kazuyoshi},
title = {RIR-in-a-Box: Estimating Room Acoustics from 3D Mesh Data through Shoebox Approximation},
booktitle = {Proceedings of Annual Conference of the International Speech Communication Association (Interspeech)},
year = {2024},
month = sep,
pages = {3255--3259},
address = {Kos Island, Greece},
url = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.html},
html = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.html},
pdf = {https://www.isca-archive.org/interspeech_2024/kelley24_interspeech.pdf},
preprint = {https://telecom-paris.hal.science/hal-04632526},
doi = {10.21437/Interspeech.2024-2053}
}
This paper describes a method for estimating the room impulse response (RIR) for a microphone and a sound source located at arbitrary positions from the 3D mesh data of the room. Simulating realistic RIRs with purely physics-driven methods often fails to balance physical consistency and computational efficiency, hindering application to real-time speech processing. Alternatively, one can use MESH2IR, a fast black-box estimator that consists of an encoder extracting a latent code from mesh data with a graph convolutional network (GCN) and a decoder generating the RIR from the latent code. Combining these two approaches, we propose a fast yet physically coherent estimator with an interpretable latent code based on differentiable digital signal processing (DDSP). Specifically, the encoder estimates a virtual shoebox room scene that acoustically approximates the real scene, accelerating physical simulation with the differentiable image-source model in the decoder. Our experiments showed that our method outperformed MESH2IR on real mesh data obtained with the depth scanner of a Microsoft HoloLens 2, and can provide correct spatial consistency for binaural RIRs.
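To give a sense of the image-source model that the decoder simulates, here is a deliberately simplified NumPy sketch that sums only the direct path and the six first-order wall reflections of a shoebox room. The paper's model is differentiable and higher order; all numbers here are illustrative.

```python
import numpy as np

def shoebox_rir_first_order(room, src, mic, fs=16000, c=343.0, beta=0.8, length=4096):
    """Toy first-order image-source model for a shoebox room.

    `room` gives the box dimensions (Lx, Ly, Lz); `src` and `mic` are 3D
    positions inside it. Only the direct path and the six first-reflection
    image sources are summed, with nearest-sample delays and 1/r attenuation.
    """
    room, src, mic = (np.asarray(p, float) for p in (room, src, mic))
    images = [(src, 1.0)]
    for axis in range(3):
        for wall in (0.0, room[axis]):       # mirror the source across each wall
            img = src.copy()
            img[axis] = 2.0 * wall - src[axis]
            images.append((img, beta))       # one reflection -> one beta factor
    h = np.zeros(length)
    for pos, gain in images:
        dist = np.linalg.norm(pos - mic)
        n = int(round(dist / c * fs))        # propagation delay in samples
        if n < length:
            h[n] += gain / max(dist, 1e-3)   # spherical spreading attenuation
    return h

h = shoebox_rir_first_order((5.0, 4.0, 3.0), (1.0, 1.0, 1.5), (3.5, 2.0, 1.5))
```

The first nonzero tap corresponds to the direct path; later taps are the wall echoes, some of which may fall on the same sample when image distances coincide.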
2019
MIRAGE: 2D Source Localization Using Microphone Pair Augmentation with Echoes
Di Carlo, Diego and Deleforge, Antoine and Bertin, Nancy
in IEEE International Conference on Acoustics, Speech and Signal Processing,
2019.
@inproceedings{DiCarlo2019mirage,
arxiv = {1906.08968},
author = { Di Carlo, Diego and Deleforge, Antoine and Bertin, Nancy},
booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
doi = {10.1109/ICASSP.2019.8683534},
hal_id = {hal-01909531},
keywords = {Image Microphones,Sound Source Localization,Supervised Learning,TDOA Estimation},
pages = {775--779},
title = {Mirage: 2D Source Localization Using Microphone Pair Augmentation with Echoes},
url = {https://github.com/Chutlhu/MIRAGE},
volume = {2019-May},
year = {2019}
}
It is commonly observed that acoustic echoes hurt the performance of sound source localization (SSL) methods. We introduce the concept of microphone array augmentation with echoes (MIRAGE) and show how estimation of early-echo characteristics can in fact benefit SSL. We propose a learning-based scheme for echo estimation combined with a physics-based scheme for echo aggregation. In a simple scenario involving two microphones close to a reflective surface and one source, we show using simulated data that the proposed approach performs similarly to a correlation-based method in azimuth estimation while also retrieving elevation from two microphones only, an impossible task in anechoic settings.
2018
SEPARAKE: Source Separation with a Little Help from Echoes
Scheibler, Robin and Di Carlo, Diego and Deleforge, Antoine and Dokmanic, Ivan
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2018.
@inproceedings{Scheibler2017separake,
arxiv = {1711.06805},
author = {Scheibler, Robin and Di Carlo, Diego and Deleforge, Antoine and Dokmanic, Ivan},
doi = {10.1109/ICASSP.2018.8461345},
hal_id = {hal-01909531},
journal = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
keywords = {Echoes,Multi-channel,NMF,Room geometry,Source separation},
pages = {6897--6901},
title = {Separake: Source Separation with a Little Help from Echoes},
url = {https://github.com/fakufaku/separake},
year = {2018}
}
It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses nor attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones generated by echoes, and we show how this gives us enough spatial diversity to get a performance boost over the anechoic case. We show improvements for two standard algorithms: one that uses only the magnitudes of the transfer functions, and one that also uses the phases. Concretely, we show that multichannel non-negative matrix factorization aided with a small number of echoes beats the vanilla variant of the same algorithm, and that with magnitude information only, echoes enable separation where it was previously impossible.
2018
Evaluation of an Open-Source Implementation of the SRP-PHAT Algorithm Within the 2018 LOCATA Challenge
Lebarbenchon, Romain and Camberlein, Ewen and Di Carlo, Diego and Deleforge, Antoine and Bertin, Nancy
in LOCATA Challenge Workshop - a satellite event of International Workshop on Acoustic Signal Enhancement (IWAENC),
2018.
2018
Interference reduction on full-length live recordings
Di Carlo, Diego and Liutkus, Antoine and Déguernel, Ken
in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2018.
2017
Gaussian framework for interference reduction in live recordings
Di Carlo, Diego and Déguernel, Ken and Liutkus, Antoine
in AES International Conference on Semantic Audio,
2017.
2016
Gestural Control of Wavefield Synthesis
Grani, Francesco and Di Carlo, Diego and Portillo, Jorge Madrid and Girardi, Matteo and Paisa, Razvan and Banas, Jian Stian and Vogiatzoglou, Iakovos and Overholt, Dan and Serafin, Stefania
in Sound and Music Computing Conference (SMC),
2016.
2014
Automatic music listening for automatic music performance: a grand piano dynamics classifier
Di Carlo, Diego and Rodà, Antonio
in Proceedings of the 1st International Workshop on Computer and Robotic Systems for Automatic Music Performance (SAMP 14),
2014.
2025
Augmented Listening with Physics-Coherent Neural Fields
Telecom Paris, France, 2025, April 25.
2024
Neural Fields for Augmented Listening
Kyoto University, Engineering School, 2024, December.
2024
from Neural Fields to PINNs, ... and beyond.
Prism, CNRS, France, 2024, September.
2024
from Neural Fields to PINNs, ... and beyond.
Telecom Paris, France, 2024, September.
Neural Fields
Kyoto University.
Neural Fields
Strasbourg.
Neural Fields for Urban Change detection
Kyoto University.
2021
Echo-aware Signal Processing for Audio Scene Analysis
Riken AIP Center, Kyoto (Japan), 2021, July.
2019
Hunting Echoes for Auditory Scene Analysis
Bar-Ilan University, Israel, 2019, November.
What is a Hackathon?
Journée Science et Musique, BU Univ Rennes 2, Rennes.
2019
My Pythonic Workflow
Seminaire Au Vert (Team Building Seminar), 2019, August.
2019
Hunting Echoes for Auditory Scene Analysis
Roscoff, France, 2019, July.