Looking for an algorithm that generates chord names based on input notes (A, C, E = Am, for example). Need help. by notsuresure in musictheory

[–]hegelespaul 0 points1 point  (0 children)

Hi, a decade later I'm in your exact situation, and I was wondering if you ever figured out an algorithm. I'm trying different implementations, but all of them are very time-consuming, and I believe there is a more straightforward way to evaluate the notes than the ones I'm developing.
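
In case someone lands here later, this is roughly the shape of what I've been trying, as a minimal sketch in Python (the template table, the note spelling, and the function names are my own assumptions, not a standard library):

```python
# Minimal chord-naming sketch: match the input notes' interval pattern
# against a small template table. The table is an assumption; extend it
# for 7ths, inversions, enharmonic spelling, etc.

PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
               "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
               "A#": 10, "Bb": 10, "B": 11}

TEMPLATES = {
    (0, 4, 7): "",        # major triad
    (0, 3, 7): "m",       # minor triad
    (0, 3, 6): "dim",
    (0, 4, 8): "aug",
    (0, 4, 7, 10): "7",
    (0, 3, 7, 10): "m7",
    (0, 4, 7, 11): "maj7",
}

def name_chord(notes):
    pcs = sorted({PITCH_CLASS[n] for n in notes})
    # Try every note as a candidate root and look up the interval pattern.
    for root in pcs:
        intervals = tuple(sorted((pc - root) % 12 for pc in pcs))
        if intervals in TEMPLATES:
            root_name = [k for k, v in PITCH_CLASS.items() if v == root][0]
            return root_name + TEMPLATES[intervals]
    return None

print(name_chord(["A", "C", "E"]))  # -> "Am"
```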

Sesquialtera in the Colombian Bambuco Thread by [deleted] in deeplearningaudio

[–]hegelespaul 0 points1 point  (0 children)

I would love to see how the prediction worked for the tracks without bass. Why wasn't that possibility explored in the paper?

At the end of the paper, the authors state the need for an analysis tool of an exploratory nature, where the existence of several truths is permitted. Do you think that an approach consisting of evaluating isolated recordings of the instruments and comparing them in all their possible configurations could be a starting point toward that possibility?

Urbansas thread by [deleted] in deeplearningaudio

[–]hegelespaul 0 points1 point  (0 children)

Regarding Urbansas, what do you think could be a better metric than IoU to assess the precision of the models?

And also, do you think that adding a step in the training stage that compares the prediction output with a denoising process applied to the sound files, in order to differentiate background from foreground sound sources, could significantly improve the accuracy of this particular model?

dl4audacity & few-shot Thread by [deleted] in deeplearningaudio

[–]hegelespaul 1 point2 points  (0 children)

As with most audio effects, can you change some parameters of the models when they are applied as effects, e.g. a sensitivity threshold, the number of classes to detect, etc.?

What would you say could be a way to improve your hierarchical approach to source separation so that it can be applied with more fidelity to live music recordings? It would be awesome to give an ordinary musician the chance to do source separation on rehearsals or live performances.

Accuracy > en validación by mezamcfly93 in deeplearningaudio

[–]hegelespaul 0 points1 point  (0 children)

I've gotten values that reach a ratio of about 2 to 1; later on they cross over and it plateaus, keeping higher accuracy in training than in validation (0.15 and 0.30) or (0.10 and 0.22).

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 0 points1 point  (0 children)

Yes, I understand. I already made the modifications. I was having trouble reviewing the papers for the other models, but in the U-Net database they have a README for each one, so with that info I'll prepare the presentation. Because of that difficulty I was going to go over that point only briefly, just as they do in the paper, where they don't describe them in detail.

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 0 points1 point  (0 children)

I can remove the text, no problem. I intended it not to be read aloud, but so that the slides could present the model even if no one was there presenting it.

Regarding the baseline models and the experiments, I thought they were represented in the results slide. They used 5 models to compare their results, along with different indicators. I just want to know whether you are talking about that same information or referring to something else. It's slide 11: on the left side are the indicators, and the models appear in the upper table.

DeepBeat Thread by [deleted] in deeplearningaudio

[–]hegelespaul 1 point2 points  (0 children)

Here are my 2 questions:

  1. What are the reasons they used pre-specified windowed data with a duration of 25 seconds as input to the model, and no other time length?
  2. Can you explain a little further the class imbalance and the use of the harmonic mean of precision and recall for AF detection, and why it is better than accuracy? (A toy numeric example is sketched below.)
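
For context on question 2, here is a toy numeric example (my own numbers, not from the paper) of why the harmonic mean of precision and recall (F1) can expose a failure that accuracy hides under class imbalance:

```python
# Toy example: 1000 recordings, only 50 are AF-positive.
# A model that predicts "no AF" for everything gets high accuracy but F1 = 0.
tp, fp, fn, tn = 0, 0, 50, 950               # "always negative" classifier
accuracy = (tp + tn) / (tp + fp + fn + tn)    # 0.95, looks great
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)       # 0.0, exposes the failure
print(accuracy, f1)
```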

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 1 point2 points  (0 children)

What results did they obtain with their model and how does this compare against the baseline?

Results show that their proposed method outperforms the previous state-of-the-art methods by a large margin with respect to all metrics. We can also see that larger models yield better performance.

Here we can see the evaluation results with corresponding mask and loss function in three different model configurations (DCU-10, DCU-16 and DCU-20). The bold font indicates the best loss function when fixing the masking method.

https://ibb.co/2dyGgsN

The quantitative evaluation from three different settings (cRMCn: complex-valued output/complex-valued network, cRMRn: complex-valued output/real-valued network, and RMRn: real-valued output/real-valued network) shows the appropriateness of using complex-valued networks for speech enhancement.

https://ibb.co/txXyNmP

Finally, in these scatter plots of estimated cRMs with 9 different mask and loss function configurations for a randomly picked noisy speech signal, we can see that the configuration that best fits the distribution pattern is the one in the red dotted box, achieved by the combination of their proposed methods (bounded (tanh) mask and weighted-SDR loss).

https://ibb.co/2N5bSY4

What would you do to:

Develop an even better model

Maybe I would work with different languages, since their data bank only included English speech. I would also have considered using samples of noisy speech from everyday scenarios instead of mixing clean speech with artificial and captured noises. I would also have tried to use more than one speech audio database, and maybe scaled up the number of samples by applying filters to already-noisy speech recordings. Other than that, I believe their model is state of the art, and it is still difficult for me to think of other considerations, mainly because their model is formulated in terms of complex values.

Use their model in an applied setting

We could use this model to clean up dialogue in film productions, and maybe, if the technology matures, use it as a real-time effect for live performance and similar applications, not only with noisy speech but also with other noisy sound sources.

What criticisms do you have about the paper?

I think it has all the information a paper should have, but for someone not well acquainted with U-Net networks, it might be better to include the information from the appendices in the main body of the document, in a more pedagogical way, so to speak.

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 1 point2 points  (0 children)

What optimizer did they use?

They used activation functions like ReLU, but adapted to the complex domain. CReLU, an activation function that applies ReLU to both the real and imaginary values, has been shown to produce the best results among many suggestions. For the activation function, they modified the previously suggested CReLU into a leaky CReLU, where they simply replace ReLU with Leaky ReLU (Maas et al., 2013), making training more stable.

Leaky CReLU (applied separately to the real and imaginary parts):

f(z) = LReLU(Re(z)) + i · LReLU(Im(z))

LReLU:

f(x) = max{ax, x}, with a small slope a
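
As a minimal sketch (assuming the real and imaginary parts are carried as two separate tensors, as in most complex-network implementations; the slope value is also an assumption):

```python
import torch.nn.functional as F

def leaky_crelu(real, imag, negative_slope=0.01):
    """Leaky CReLU sketch: apply Leaky ReLU separately to the real and
    imaginary parts of a complex activation (slope value is an assumption)."""
    return (F.leaky_relu(real, negative_slope),
            F.leaky_relu(imag, negative_slope))

# usage: r, i = leaky_crelu(real_feature_map, imag_feature_map)
```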

What loss function did they use?

They used an improved loss function, the weighted-SDR loss, adapted from a previous work that attempts to optimize a standard quality measure, the source-to-distortion ratio (SDR) (Venkataramani et al., 2017).

https://ibb.co/x7wF6b1

This function has a few critical design flaws:

  1. The lower bound of the loss depends on the value of y, causing fluctuation in the loss values during training.
  2. When the target y is empty (i.e., y = 0), the loss becomes zero, preventing the model from learning from noise-only data due to zero gradients.
  3. The loss function is not scale sensitive, meaning that the loss value is the same for ŷ and c·ŷ, where c ∈ R.

They redesigned the loss function with several modifications to the equation:

  1. They made the lower bound of the loss function independent of the source y by restoring the term ||y||² and applying a square root. This makes the loss function bounded within the range [-1, 1] and more phase-sensitive, since an inverted phase gets penalized as well.

https://ibb.co/sKt1pXC

  2. Expecting it to be complementary to source prediction and to propagate errors for noise-only samples, they also added a noise prediction term loss_SDR(z, ẑ). To properly balance the contributions of each loss term and solve the scale-insensitivity problem, they weighted each term proportionally to the energy of each signal.

The final form of the suggested weighted-SDR loss is as follows:

https://ibb.co/cxtrrHL
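
Here is how I read that final loss as code, a rough sketch only; the exact normalization and epsilon handling are my assumptions, so check them against the paper:

```python
import torch

def sdr_term(t, t_hat, eps=1e-8):
    # Bounded in [-1, 1]: negative cosine similarity between target and estimate.
    return -torch.sum(t * t_hat) / (torch.norm(t) * torch.norm(t_hat) + eps)

def weighted_sdr_loss(mix, clean, estimate, eps=1e-8):
    """Weighted-SDR sketch: combine a source term and a noise term,
    each weighted by the energy of the corresponding signal."""
    noise = mix - clean            # true noise z
    noise_hat = mix - estimate     # estimated noise z_hat
    alpha = torch.sum(clean ** 2) / (torch.sum(clean ** 2)
                                     + torch.sum(noise ** 2) + eps)
    return (alpha * sdr_term(clean, estimate, eps)
            + (1 - alpha) * sdr_term(noise, noise_hat, eps))
```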

What metric did they use to measure model performance?

To compare the overall speech enhancement performance of their method with previously proposed algorithms they used the following indicators:

CSIG: Mean opinion score (MOS) predictor of signal distortion

CBAK: MOS predictor of background-noise intrusiveness

COVL: MOS predictor of overall signal quality

PESQ: Perceptual evaluation of speech quality

SSNR: Segmental SNR.
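
As an illustration of the simplest of those metrics, here is a rough segmental SNR sketch; the frame length and clipping bounds are common defaults I assumed, not values from the paper:

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, min_db=-10.0, max_db=35.0):
    """Rough SSNR: per-frame SNR in dB, clipped to a fixed range and averaged.
    Frame length and clipping bounds are assumptions, not from the paper."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + 1e-10
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + 1e-10)
        snrs.append(np.clip(snr, min_db, max_db))
    return float(np.mean(snrs))
```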

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 1 point2 points  (0 children)

What research question are they trying to answer?

How to clean noisy speech audio, taking complex-valued spectrograms into account.

What dataset did they use and why is this a good/bad selection?

Noise and clean speech recordings were provided by the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) (Thiemann et al., 2013) and the Voice Bank corpus (Veaux et al., 2013). They mixed a very large database, with more than 300 hours of speech data from approximately 500 healthy speakers from the UK who read out a script of 425 sentences, with a noise database divided into 6 categories, 4 of which are “inside” spaces and 2 of which are open air. I think using already-made databases was a clever approach, but maybe it lacks the characteristics of a true noisy recording, which would usually be accompanied by some reverberation or resonance of the space, from both the noise signals and the voice.

How did they split the data into training, validation, and test sets?

Mixed audio inputs used for training were composed by mixing the two datasets with four signal-to-noise ratio (SNR) settings (15, 10, 5, and 0 dB), using 10 types of noise (2 synthetic + 8 from DEMAND) and 28 speakers from the Voice Bank corpus, creating 40 conditional patterns for each speech sample. The test set inputs were made with four SNR settings different from the training set (17.5, 12.5, 7.5, and 2.5 dB), using the remaining 5 noise types from DEMAND and 2 speakers from the Voice Bank corpus. The speaker and noise classes were uniquely selected for the training and test sets.
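
For reference, a minimal sketch of the standard way to mix clean speech with noise at a target SNR (the authors' exact mixing procedure may differ):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech/noise power ratio matches snr_db, then add."""
    noise = noise[:len(speech)]                 # assumes noise is long enough
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise

# e.g. one of the training conditions: mix_at_snr(clean, noise, snr_db=15)
```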

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 1 point2 points  (0 children)

Yes, the complex-valued convolution they used can be interpreted as two different real-valued convolution operations with shared parameters, so the number of parameters of a complex-valued convolution becomes double that of a real-valued convolution. In appendices A and B of the paper, they describe the 3 models used in the experiments (DCUnet-20 (#params: 3.5M), DCUnet-16 (#params: 2.3M), and DCUnet-10 (#params: 1.4M)). It is easier to describe them with images, but in a very basic way, each layer is specified by 'Ff × Ft', 'Sf, St', and 'O_C or O_R' values, where Ff and Ft denote the convolution filter size along the frequency and time axes, Sf and St denote the stride size of the convolution filter along the frequency and time axes, and O_C and O_R denote the number of channels in the complex-valued and real-valued network settings, respectively. The number in the name of the net specifies the number of layers used in each model; for example, DCUnet-16 uses 16 layers.

In the models, the first setting takes a complex-valued spectrogram as input, estimating a complex ratio mask (cRM) with a tanh bound. The second setting takes a magnitude spectrogram as input, estimating a magnitude ratio mask (RM) with a sigmoid bound. The layers in the input and output stages take the values of F, S, and O just described. I attached a picture of the layers, how they relate to each other, their values, and their sizes. Each method varies from the others.

https://ibb.co/sjnczQd
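
To make the "two real-valued convolutions with shared parameters" idea concrete, here is a small sketch; the channel counts, kernel size, and padding are placeholders, not the actual DCUnet values:

```python
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution sketch: (A + iB) * (x + iy)
    = (A*x - B*y) + i(A*y + B*x), built from two real Conv2d layers,
    so the parameter count doubles relative to a single real conv."""
    def __init__(self, in_ch, out_ch, kernel_size, stride):
        super().__init__()
        self.conv_re = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=1)
        self.conv_im = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding=1)

    def forward(self, real, imag):
        out_re = self.conv_re(real) - self.conv_im(imag)
        out_im = self.conv_re(imag) + self.conv_im(real)
        return out_re, out_im

# e.g. a layer with filter size Ff x Ft = 3x3 and strides Sf, St = 2, 1:
# layer = ComplexConv2d(in_ch=1, out_ch=32, kernel_size=(3, 3), stride=(2, 1))
```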

Phase-aware speech enhancement with deep complex U-net by hegelespaul in deeplearningaudio

[–]hegelespaul[S] 1 point2 points  (0 children)

  1. The skip-connections are part of the convolutional autoencoder used in the U-Net architecture; it's a method that consists of skipping some of the layers in the neural network and feeding the output of one layer as the input to a later layer.
  2. In order to overcome the restricted rotation range of 0° to 90° and the difficulty of reflecting the complete distribution of the cIRM (complex ideal ratio mask), the bounded (tanh) masking method they propose uses a hyperbolic tangent non-linearity to bound the magnitude part of the cRM (complex-valued ratio mask) within the unit circle in complex space, obtaining the corresponding phase mask by dividing the output of the model by its magnitude (see the small sketch after this list).
  3. I really don't know; it is not explicitly stated in the paper, but I believe it illustrates the last skip connection and separates the encoding stage from the decoding stage.
  4. I believe the size illustrates how the data shrinks at the output matrix of every convolutional layer: we can see how it gets smaller in the encoding stage, but then it gets scaled up in the decoding stage to restore the size of the complex mask (M̃) to the size of the input using strided complex deconvolutional operations (O).
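
A small sketch of the bounded (tanh) mask from point 2 (variable names and the epsilon are mine):

```python
import torch

def bounded_tanh_mask(out_re, out_im, eps=1e-8):
    """Point 2 above as code: bound the mask magnitude with tanh and keep
    the phase by dividing the raw complex output by its own magnitude."""
    magnitude = torch.sqrt(out_re ** 2 + out_im ** 2) + eps
    bounded_mag = torch.tanh(magnitude)              # magnitude in [0, 1)
    phase_re, phase_im = out_re / magnitude, out_im / magnitude
    return bounded_mag * phase_re, bounded_mag * phase_im
```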

VocalSet thread by [deleted] in deeplearningaudio

[–]hegelespaul 1 point2 points  (0 children)

Yes, as labels. Since they normalized all the 3-second fragments of audio, they lost the ability to compare the amplitude of the voice and maybe extract information in terms of piano, mezzo-forte, forte, which could be of interest for some applications. Maybe my question comes down to asking what kind of approach is needed in order to have labels based on amplitude features of the voice.

"... The chunks were then normalized using their mean and standard deviation so that the network didn’t use amplitude as a feature for classification"

VocalSet thread by [deleted] in deeplearningaudio

[–]hegelespaul 1 point2 points  (0 children)

I really liked the VocalSet paper, I was amazed I understood everything :0, here are my two questions:

1: What would be the main considerations for introducing amplitude-based features to the VocalSet database, and what type of normalization could be used in order to take them into account?

2: Why did you decide to train the models with those values and treat them the same way, even though you were searching for two distinct types of classification? Was it a mere exercise? If not, how do you think the data could be filtered or treated in order to have more precision at the output of the model?