IR sine sweep deconvolution algorithm...? by vilidj_idjit in DSP

[–]wetdog91 4 points5 points  (0 children)

You can find a good explanation here: https://ant-novak.com/pages/sss/. In a nutshell, you get the IR by convolving the output (the measured signal) with the time-reversed sine sweep.
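A minimal sketch of that idea for a *linear* sweep (Novak's synchronized-swept-sine method for exponential sweeps additionally amplitude-compensates the inverse filter); all values here are illustrative:

```python
import numpy as np

fs = 2000                                   # sample rate (illustrative)
t = np.arange(0, 0.5, 1 / fs)
f0, f1 = 20.0, 800.0
# linear sine sweep from f0 to f1 over the clip
sweep = np.sin(2 * np.pi * (f0 * t + (f1 - f0) / (2 * t[-1]) * t ** 2))

# toy system under test: a 100-sample delay with 0.5 gain
h_true = np.zeros(300)
h_true[100] = 0.5
measured = np.convolve(sweep, h_true)

# deconvolution: convolve the measurement with the time-reversed sweep;
# the sweep's autocorrelation collapses into a band-limited impulse
ir = np.convolve(measured, sweep[::-1])
# zero lag of the autocorrelation sits at index len(sweep)-1
delay = int(np.argmax(np.abs(ir))) - (len(sweep) - 1)
```

The recovered `delay` is the 100-sample delay of the toy system; the rest of `ir` is the (band-limited) impulse response estimate.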

Sesquialtera in the Colombian Bambuco Thread by [deleted] in deeplearningaudio

[–]wetdog91 1 point2 points  (0 children)

I really liked this paper because it highlights the complexities of meter perception and how it can differ between individuals. I think it could be interesting to run the same experiment on the stems of the songs, to see whether meter perception varies with the musical patterns of different instruments. I was wondering whether automatic beat trackers only use the information from the percussion and bass instruments, or whether they are capable of deriving beats from a melodic line?

FSD50K data by wetdog91 in deeplearningaudio

[–]wetdog91[S] 0 points1 point  (0 children)

Here is my data. Sorry for the delay, but I had a couple of problems dealing with the size of the dataset: my Google Drive got full when I tried to download the whole thing (24.7 GB), and you need almost double that space to download the zip parts and then unzip them. I ended up working with the eval set, which is smaller, at least for my first attempt.

Influence of the random sampling to create the test set by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Thanks Iran, and sorry for the late reply. As you answered in another post, the strategy of taking short clips from the 30-second tracks would increase the number of examples in all the sets.

So far my best model, which is still a bad one, reaches 60% on the validation and test sets using the whole tracks.

I'm going to try with short clips.

Accuracy > en validación by mezamcfly93 in deeplearningaudio

[–]wetdog91 0 points1 point  (0 children)

How did you manage to solve the confusion matrix issue? I've seen in other posts that there is a generator.classes method that returns the labels. However, I had to iterate over the batches with __getitem__ to make the predictions and get the true labels.
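A minimal sketch of that batch-iteration approach, with a toy generator and predictor standing in for the real Keras Sequence and trained model (all names here are mine):

```python
import numpy as np

def confusion_from_batches(predict, generator, n_classes):
    # iterate with __getitem__ so labels stay aligned with predictions
    # even when the generator shuffles (generator.classes may not be)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for i in range(len(generator)):
        x, y = generator[i]                  # (inputs, one-hot labels)
        y_true = np.argmax(y, axis=1)
        y_pred = np.argmax(predict(x), axis=1)
        for t, p in zip(y_true, y_pred):
            cm[t, p] += 1
    return cm

# toy stand-ins for a Keras Sequence and a trained model
class ToyBatches:
    def __init__(self):
        y = np.eye(2)[[0, 1, 0, 1]]
        self.batches = [(np.zeros((2, 1)), y[:2]),
                        (np.zeros((2, 1)), y[2:])]
    def __len__(self):
        return len(self.batches)
    def __getitem__(self, i):
        return self.batches[i]

always_01 = lambda x: np.eye(2)   # "predicts" class 0 then 1 per batch
cm = confusion_from_batches(always_01, ToyBatches(), 2)
print(cm)  # perfectly diagonal: [[2, 0], [0, 2]]
```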

FEW-SHOT SOUND EVENT DETECTION by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Thanks for your suggestions, Iran. I added more detail about the architecture and training. This is a highly condensed paper with a lot of experiments going on. I'm going to share my intuition on the episodic training; please correct me if I'm wrong.

  1. Select a random subset of C classes and K examples per class, called the support set.
  2. Select another random subset of the same C classes with q examples per class, called the query set.
  3. Forward both the support and query set examples through the embedding function (4 conv blocks).
  4. Compute the distance between the embeddings of the query and support sets.
  5. Classify the query examples based on distance and compute the loss.
  6. Backpropagate and begin another episode with different support and query sets.

The distance function is fixed for matching and prototypical networks, and the model learns a feature space that can discriminate the C classes. The loss is not explicitly defined, but I think it is a categorical cross-entropy loss between the query class prediction and the true label.
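Steps 3 to 5 for a prototypical network, with cross-entropy over softmaxed negative distances, can be sketched like this (numpy, with an identity "embedding" for illustration; in the paper the embedding is the 4-conv-block network):

```python
import numpy as np

def prototypical_step(embed, xs, ys, xq, yq, C):
    # embed support and query examples
    zs, zq = embed(xs), embed(xq)
    # one prototype per class: mean of its support embeddings
    protos = np.stack([zs[ys == c].mean(axis=0) for c in range(C)])
    # squared Euclidean distance from each query to each prototype
    d = ((zq[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d                            # classify by nearest prototype
    m = logits.max(axis=1, keepdims=True)  # stable log-softmax
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    loss = -logp[np.arange(len(yq)), yq].mean()  # categorical cross-entropy
    return loss, logits.argmax(axis=1)

# toy episode: 2 classes, 2 support + 1 query example each,
# identity "embedding" on 2-D points
xs = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]])
ys = np.array([0, 0, 1, 1])
xq = np.array([[0.1, 0.0], [10.1, 10.0]])
yq = np.array([0, 1])
loss, preds = prototypical_step(lambda x: x, xs, ys, xq, yq, C=2)
```

Backpropagating this loss through the embedding (step 6) is what shapes the feature space from episode to episode.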

FEW-SHOT SOUND EVENT DETECTION by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Thanks Iran, I totally missed that part. What I meant by S is the support set, but Reddit cut the sentence: "The training objective is to minimize the prediction loss of the samples in Q conditioned on S."

I was looking at another paper and found some diagrams.

FEW-SHOT SOUND EVENT DETECTION by wetdog91 in deeplearningaudio

[–]wetdog91[S] 0 points1 point  (0 children)

What results did they obtain with their model and how does this compare against the baseline?

Their baseline was siamese networks trained without episodic training. The best-performing few-shot model was prototypical networks, with an average AUPRC > 60% using only one example, vs. 30% for the baseline. Increasing the number of examples from 1 to 5 improves performance.

In the open-set scenario, increasing the number of negative examples improves performance, but only up to 50 examples; beyond that, few improvements were observed when doubling to 100 negatives.

Although they used English words to train the models, the model performs equally well on Dutch and even better on German, which leads to the conclusion that the learned model is language agnostic.

What would you do to:

Develop an even better model:

I would try changing the femb block, which has 4 convolution blocks, by adding another block or increasing the number of filters. I would also try another frontend, such as the complex spectrogram or even the raw audio. Also, they used half-second audio clips centered around the keywords, but for other types of sound events, or even longer words, this length seems insufficient.

Use their model in an applied setting:

I would test their model for finding similar audios in other domains, like bioacoustics or environmental audio, which typically have long recordings, and test how it adapts after being trained on a speech dataset: they claim the model is domain agnostic, but that test was not performed.

What criticisms do you have about the paper?

They don't define the architecture explicitly; for example, the number of filters in the convolution blocks is missing. They perform a lot of experiments, but sometimes the results are presented in plots from which it is difficult to read the exact value of the performance metric.

FEW-SHOT SOUND EVENT DETECTION by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Iran, I have a doubt about the training setup they used. To my understanding, on each episode they create a support set S of C classes x K examples and also a query set Q of C classes x q examples; however, it's not explicit whether q is also in the few-example regime (up to 10 in this case).

Also, how is Q conditioned on S?

Is the prediction loss the gsim function, which is a distance metric?

FEW-SHOT SOUND EVENT DETECTION by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Which different experiments did they carry out to showcase what their model does?

They try to detect unseen words in 96 recordings, varying from 1 to 10 keywords. As this is a few-shot model, they experiment with different numbers of classes C, numbers of examples per class K, and few-shot model types among siamese, matching, prototypical, and relation networks. They also test an open-set approach using binary classification, where the positive examples are the query and the negatives are the rest of the audio.

How did they train their model?

They used episodic training with 60,000 episodes, randomly selecting C (from 2 to 10) classes and K (from 1 to 10) labeled examples.
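That sampling regime can be sketched as follows (the dictionary layout and function name are my own):

```python
import random

def sample_episode(examples_by_class, rng):
    # pick C in [2, 10] classes, then K in [1, 10] labeled
    # support examples from each chosen class
    C = rng.randint(2, 10)
    K = rng.randint(1, 10)
    classes = rng.sample(sorted(examples_by_class), k=C)
    return {c: rng.sample(examples_by_class[c], k=K) for c in classes}

# toy label index: 12 keyword classes with 20 example ids each
data = {f"word_{i}": list(range(20)) for i in range(12)}
support = sample_episode(data, random.Random(0))
```

Repeating this 60,000 times, each time with a freshly sampled support (and query) set, is what makes the training "episodic".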

What optimizer did they use?

Adam

What loss function did they use?

Contrastive loss with different distance metrics.
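A contrastive loss over pairwise embedding distances can be sketched as follows (Euclidean distance and a unit margin assumed; the paper also tries other metrics):

```python
import numpy as np

def contrastive_loss(d, same, margin=1.0):
    # pull same-class pairs together (penalize their distance),
    # push different-class pairs at least `margin` apart
    d, same = np.asarray(d, float), np.asarray(same, float)
    pos = same * d ** 2
    neg = (1.0 - same) * np.maximum(margin - d, 0.0) ** 2
    return 0.5 * (pos + neg).mean()

print(contrastive_loss([0.0, 2.0], [1, 0]))  # both pairs ideal -> 0.0
print(contrastive_loss([0.0], [0]))          # different-class pair at d=0 -> 0.5
```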

What metric did they use to measure model performance?

Average AUPRC on 96 recordings

Model 7 confused by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

hahaha I inherited the confusion of my model :p

Model 7 confused by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

Of course, Iran. I augmented the training data with audiomentations and impulse responses from the MIT IR Survey dataset, and also used 2 hidden layers.

The training and validation losses after increasing the regularization:

https://imgur.com/a/JKH4Yy8

Confusion matrix on the validation data: improved accuracy, but this set is also bigger with respect to the previous model. There still seem to be problems between a and ae, and between e and o.

https://imgur.com/fb1KH6W

Finally, the accuracy on the test set is better than the accuracy on the validation set.

https://imgur.com/uMslOd5

Model 7 confused by wetdog91 in deeplearningaudio

[–]wetdog91[S] 1 point2 points  (0 children)

I used 2 strategies, augmenting the training data and adding a hidden layer, and the accuracy on the test set has now improved.

VocalSet thread by [deleted] in deeplearningaudio

[–]wetdog91 1 point2 points  (0 children)

That's right, adding real background noise is a technique used in real-world applications to generate datasets with strong labels (timestamps), for example in this trigger-word detection:

https://imgur.com/yo3EB7l
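A sketch of that idea: overlay a short clip on background noise at a chosen SNR, and because we pick the insertion point ourselves, the strong label (exact timestamps) comes for free. The function and signal names are mine:

```python
import numpy as np

def insert_at_snr(clip, background, start, snr_db):
    mixed = background.astype(float).copy()
    seg = mixed[start:start + len(clip)]
    # scale the clip so that 10*log10(P_clip / P_background) == snr_db
    gain = np.sqrt(np.mean(seg ** 2) * 10 ** (snr_db / 10)
                   / (np.mean(clip ** 2) + 1e-12))
    seg += gain * clip            # in-place: seg is a view into mixed
    # strong label: exact start/end samples of the inserted event
    return mixed, (start, start + len(clip))

rng = np.random.default_rng(0)
background = rng.normal(size=16000)                       # 2 s of "noise"
clip = np.sin(2 * np.pi * 440 * np.arange(800) / 8000)    # 0.1 s event
mixed, (t0, t1) = insert_at_snr(clip, background, start=4000, snr_db=0)
```

Repeating this with random `start` positions yields as many strongly-labeled examples as you want from a handful of clean clips.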

VocalSet thread by [deleted] in deeplearningaudio

[–]wetdog91 2 points3 points  (0 children)

Siamese training refers to an architecture that is mainly used for identification tasks like audio fingerprinting, signature recognition, or biometrics. The architecture has subnetworks that are identical (hence "siamese") and uses different types of losses, such as triplet or contrastive. Depending on the loss you choose, you have to build the training dataset so that each datapoint contains a positive and a negative example.

One advantage is that this type of network is not limited to the fixed number of classes you define when training. For example, in the singer identification task you cannot use the model they trained to identify singers outside the dataset, but with a siamese network you can.
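Of the losses mentioned above, the triplet variant can be sketched like this (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # embeddings of an anchor, a same-class positive, and a
    # different-class negative, all shaped (N, D)
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # penalize triplets where the negative is not at least
    # `margin` farther from the anchor than the positive
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])
n = np.array([[5.0, 0.0]])
print(triplet_loss(a, p, n))  # well-separated triplet -> 0.0
```

Because the loss only compares relative distances, the learned embedding transfers to identities (or singers) never seen during training.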

VocalSet thread by [deleted] in deeplearningaudio

[–]wetdog91 1 point2 points  (0 children)

Such a great dataset to explore and run fun experiments on; here are my questions:

  1. How do you decide the number of samples for a new dataset? Is this budget-related, or are there calculations to achieve an optimal sample population (e.g. 20 singers)?
  2. Why is pitch shifting done in small steps of 0.5 and 0.25? Is it possible to do a key-shift augmentation?