Hi,
I’m wondering what the license is for the hydrophone recordings.
I’m compiling a dataset of marine mammal vocalizations from various open-source (Creative Commons) archives. I have several OOI vocalizations, but it is unclear whether I can place them in a single repository with the other CC sources. Ideally, end users could download the dataset from a single source rather than having to navigate several different APIs. That way I could also distribute the recordings in WAV/FLAC format instead of miniSEED.
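For what it’s worth, the conversion itself seems straightforward; here is a rough sketch of what I have in mind, assuming ObsPy and soundfile are installed (the filename is a placeholder):

```python
# Minimal sketch: convert a miniSEED hydrophone file to FLAC.
# "example.mseed" is a placeholder, not a real OOI filename.
import numpy as np
import soundfile as sf
from obspy import read

st = read("example.mseed")          # returns a Stream of one or more Traces
tr = st[0]                          # take the first trace for simplicity
sr = int(tr.stats.sampling_rate)

# Hydrophone samples are stored as integer counts; normalize to [-1, 1]
# so the result is valid audio before writing.
data = tr.data.astype(np.float32)
peak = float(np.max(np.abs(data))) or 1.0
data /= peak

sf.write("example.flac", data, sr, subtype="PCM_24")
```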
Thanks,
Bret
Hi Bret,
There is no specific license for OOI data. The acknowledgement and citation guidelines are found here: How to Use, Acknowledge, and Cite Data - Ocean Observatories Initiative. As long as the data are properly cited, I don’t see any reason why they couldn’t be added to your repository. Implementation of DOIs for OOI data is still in progress, as is an improved audio data distribution interface (which should include WAV and FLAC download options).
Let us know if you have any other questions about citation format or attribution. And we would be very interested to know when your vocalization repository is up and running!
Thanks,
Mike Vardaro
OOI Regional Cabled Array Data Team
University of Washington
Thanks. I’ll note that in the repository. What is the best way to publicise it to OOI users? I’ll be putting my derived datasets on Hugging Face (a common ML data platform with no upload size constraints).
If you put the link here or post a new forum message with the link (and a little information about your project), then we can get it added to the Community Datasets page on the OOI website (https://oceanobservatories.org/community-data-tools/community-datasets/) and maybe draft a short article for the OOI newsletter.
There is also a separate Discourse forum page for Community Tools where you could post the link: Data Tools - OOI Data Users
Thanks!
Thanks,
The data are tentatively available here: https://huggingface.co/datasets/DORI-SRKW/DORI-OOI
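If anyone wants to pull it programmatically, something like this should work with the Hugging Face `datasets` library (a sketch; the split name here is an assumption, so check the dataset card):

```python
# Sketch: stream the DORI-OOI dataset from Hugging Face.
# The "train" split name is an assumption; see the dataset card for specifics.
from datasets import load_dataset

ds = load_dataset("DORI-SRKW/DORI-OOI", split="train", streaming=True)
for example in ds:
    print(example)  # each row should carry an audio field plus metadata
    break
```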
I do not plan to annotate species in this dataset yet, as I am primarily using it as an out-of-domain (OOD) evaluation set for marine mammal detection. On the 2015-2017 data, I was getting OOD performance of up to 49.3% specificity at 95% sensitivity.
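To be concrete about that metric: I pick the detection threshold that reaches 95% sensitivity on the positive class and report the true-negative rate at that threshold. A quick sketch with synthetic scores (all numbers illustrative):

```python
# Sketch: specificity at a fixed 95% sensitivity operating point.
# y_true / y_score are placeholders; in practice they come from the detector.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)              # placeholder labels
y_score = y_true + rng.normal(0.0, 1.0, 1000)  # placeholder detector scores

fpr, tpr, thr = roc_curve(y_true, y_score)
idx = int(np.argmax(tpr >= 0.95))              # first threshold reaching 95% sensitivity
print(f"specificity at 95% sensitivity: {1.0 - fpr[idx]:.1%}")
```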
About the Project: DORI (Dataset for Orca Resident Interpretation) is an effort to curate all open-source Southern Resident killer whale (SRKW) data from public archives. The dataset is intended to be used for unsupervised machine translation from SRKW “dialect” to human languages, to reduce barriers to marine biology research, and to provide unbiased observations of SRKWs (including at night and in inclement weather). Positive-unlabelled machine learning was used to curate 1603 hours of marine mammal data from over 27 years of audio archives at Ocean Networks Canada (CC-BY) and Orcasound (CC-BY-NC-SA). We also include 3 test sets: nearly 400 expert-annotated files from passive acoustic monitoring at Bamfield Inlet and the Strait of Georgia, plus marine mammal presence labels for the 2015-2017 Coastal Endurance hydrophone data from the Ocean Observatories Initiative.
From call annotation and paired field sightings, there are (so far) 478.7 hours of confirmed SRKW chatter, 43.7 hours of Bigg’s killer whale chatter, and 145.3 hours of humpback vocalisations. (I am still in the midst of call annotation and will update these figures once I finish.)
We also provide open-source pre-trained wav2vec-U 2.0 foundation models, fine-tuned Whisper detection models, and our custom Conformer for resource-constrained devices. Our models outperform ANIMAL-SPOT and PAMGuard ROCCA on the ONC and OOI test sets while remaining competitive on the DeepAL ComParE dataset. The presented models are also substantially faster and more power-efficient than ANIMAL-SPOT and PAMGuard ROCCA: using our model, we can process 276 days of audio recordings in a single day on a consumer laptop GPU.
I will update again when I submit the paper preprint, if people are interested in reading it. I am also happy to write a short article for the OOI newsletter once the paper is wrapped up.
This project has been a joint effort among Bohan Yao (UW), Jasper Kanes (ONC/UVic), Jasmine Moore (UCalgary), and me (completed while at UToronto and then UW).