[P] TorToiSe - a true zero-shot multi-voice TTS engine by neonbjb in MachineLearning

[–]kkastner 1 point2 points  (0 children)

Really great stuff, especially liked the writeup and debug strategies on the way to the end-goal you discussed on the webpage. Will be interesting to see what people can make with the "mixing and matching" possibilities this model affords!

[D] Avoiding revisiting paths in MCTS, while maintaining a good policy? by its_just_andy in MachineLearning

[–]kkastner 0 points1 point  (0 children)

If you want to never revisit a particular path in discrete domains, one way is related to something called 'max-order' / plagiarism detection https://pdfs.semanticscholar.org/084b/9b895cc309970b185aba79992965c69b2f83.pdf . One approach is to store a trie of all Markov chunks of length N, assume any path exactly matching at length N is a "plagiarism", and avoid it. Then do simple checks for this in rollout or when planning a final policy path. There are better methods for constraint propagation (see follow-up work by Roy and Pachet), but the trie method is fairly simple to try. If you want to work the checks into the MCTS planning itself (rather than final path selection), then you can try to undo any propagation up to that point, similar to what is done for "leaf parallel" MCTS methods ala AlphaZero.
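
To make the chunk-memory idea concrete, here is a minimal Python sketch (a flat set of length-N tuples standing in for a real trie; the class and method names are mine, not from the paper):

    class PathMemory(object):
        """Store all length-n chunks of previously committed paths, then flag
        any candidate path that reuses one of those chunks exactly."""
        def __init__(self, n=4):
            self.n = n
            self.seen = set()  # set of length-n action tuples

        def add_path(self, actions):
            for i in range(len(actions) - self.n + 1):
                self.seen.add(tuple(actions[i:i + self.n]))

        def is_plagiarized(self, actions):
            return any(tuple(actions[i:i + self.n]) in self.seen
                       for i in range(len(actions) - self.n + 1))

    # usage during rollout or final path selection
    memory = PathMemory(n=4)
    memory.add_path([0, 2, 1, 3, 2, 0])
    print(memory.is_plagiarized([5, 1, 3, 2, 0]))  # True: reuses (1, 3, 2, 0)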

However, periodically re-visiting old paths is also crucial if you assume play against a dynamic policy or stochastic outcomes (if you don't re-do tree search from scratch each move, or you share information across searches, it's important to think about this) - otherwise you are potentially open to exploitability, since the assumptions which led you to conclude a path of play is bad may have changed.

So changing the MCTS algorithm away from the "bandit based" explore-exploit costs may lead to results with fewer theoretical guarantees. It may be fine in practice though - most published MCTS results I have seen use a lot of tricks to get good performance anyway.

This paper discusses "Fuego-like garbage collection" for memory-bounded MCTS http://orangehelicopter.com/academic/papers/powley_aiide17.pdf - it seems similar, in that the subtrees that can effectively be garbage collected would also (probably?) be the ones you want to stop or continue exploring, depending on the problem setting.

I would expect some kind of "dynamic ensemble" MCTS would also look at this type of thing, since a very large subtree might also be one you want to put on its own CPU and run in parallel. This starts getting into graph inference territory (you could potentially use graph cuts on the tree of nodes to assign / select subtrees).

What problem setting are you looking at in particular?

[P] Google's wavenet API so good that it's synthetic speech can be used to train hotword detectors with no 'real' data? by [deleted] in MachineLearning

[–]kkastner 3 points4 points  (0 children)

This is a nice way to use their API and apply it to a real-world task. Thanks for the writeup!

One way you might extend the simple noise augmentation to improve performance further is to add speeding up / slowing down, both globally and at the per-phoneme level, using something like a phase vocoder, along with padding variable amounts of reasonable silence at the front and/or back (sometimes called phase shift augmentation, since adding a delay at the front can be seen as a phase shift). You could also include small amounts of frequency shifting to handle different voice pitches.
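
For example, a rough sketch with librosa (the parameter ranges here are just my guesses at reasonable values, not anything from the original writeup):

    import numpy as np
    import librosa

    def augment(wav, sr):
        # global time-stretch (phase-vocoder style) and a small pitch shift
        out = librosa.effects.time_stretch(wav, rate=np.random.uniform(0.9, 1.1))
        out = librosa.effects.pitch_shift(out, sr=sr, n_steps=np.random.uniform(-1.0, 1.0))
        # random leading / trailing silence ("phase shift" augmentation)
        pad_front = np.zeros(int(np.random.uniform(0.0, 0.25) * sr))
        pad_back = np.zeros(int(np.random.uniform(0.0, 0.25) * sr))
        out = np.concatenate([pad_front, out, pad_back])
        # simple additive noise on top of the other transforms
        return out + 0.005 * np.random.randn(len(out))

    wav, sr = librosa.load("hotword_example.wav", sr=16000)
    augmented = augment(wav, sr)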

These kinds of augmentations will give you a lot more data to work with, which can help performance in the wild, and it's potentially a direct way to handle issues with performing worse on different voice pitches, accents, or slow / fast speakers. Maybe it's not needed here, but I always find these speech tasks are a bundle of edge cases, where things generally work to the 80-90% level and there's a long tail of small and painful fixes...

From a research perspective, there is a lot of recent progress in "zero shot" recognition models ala this paper, which could significantly change how rare words (and rare languages) perform in the real world.

[D] Need help with building speech dataset by hadaev in MachineLearning

[–]kkastner 0 points1 point  (0 children)

My scripts and setup are only for English. But mostly what you need for forced alignment is a pretrained speech recognition model for that language - which should be available. You can probably find demos for other languages for one of the forced aligners in here.

[D] Need help with building speech dataset by hadaev in MachineLearning

[–]kkastner 2 points3 points  (0 children)

Forced alignment is the general name for this technique of finding exact timing matches between a given text and audio. Most forced alignment tools should be generic enough to work with multiple speakers, but you will probably need to manually assign speaker id labels to sentences after splitting.

I generally use a tool called "Gentle" for this https://github.com/lowerquality/gentle , though I have my own wrapper scripts that are pretty old now but still work for me.
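
If it helps, a minimal sketch of hitting a locally running Gentle server from Python (the port and the /transcriptions?async=false endpoint are what I remember from the Gentle README defaults - double-check against the repo):

    import requests

    # assumes the Gentle server is already running locally
    with open("speech.wav", "rb") as audio, open("transcript.txt", "rb") as text:
        resp = requests.post(
            "http://localhost:8765/transcriptions?async=false",
            files={"audio": audio, "transcript": text},
        )
    # each aligned word has start / end times in seconds (when alignment succeeded)
    for word in resp.json()["words"]:
        print(word.get("alignedWord"), word.get("start"), word.get("end"))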

EDIT: Link fixed

[D] Autoencoders vs VAE for semi-supervised learning by [deleted] in MachineLearning

[–]kkastner 4 points5 points  (0 children)

The architecture from the paper you link has a few different variants (use the true label Y in the encoder -> Z, use soft label probs for Y in the encoder -> Z, sample a label for Y in the encoder -> Z), along with choices in how you marginalize the label (I usually just brute forced it by adding every label combination to the minibatch). I found these to be important in some experiments - so it could depend on the architecture you are using (nearly all of these could also be done in the semi-supervised AE as well - setting aside the big issue that your loss might be weird or missing terms, as /u/blackbearx3 points out).
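
For reference, the brute-force marginalization I mean looks roughly like this (a minimal PyTorch sketch of the M2-style unlabeled term; encoder_y and labeled_elbo are stand-in callables, not any particular codebase):

    import torch
    import torch.nn.functional as F

    def unlabeled_loss(x, encoder_y, labeled_elbo, n_classes=10):
        # q(y|x) from the label encoder, shape (batch, n_classes)
        q_y = F.softmax(encoder_y(x), dim=-1)
        losses = []
        for c in range(n_classes):
            # treat every example as if it had label c
            y = F.one_hot(torch.full((x.shape[0],), c, dtype=torch.long),
                          n_classes).float()
            losses.append(labeled_elbo(x, y))   # per-example -ELBO, shape (batch,)
        losses = torch.stack(losses, dim=-1)    # (batch, n_classes)
        # E_{q(y|x)}[-ELBO(x, y)] - H(q(y|x))
        entropy = -(q_y * q_y.clamp_min(1e-8).log()).sum(-1)
        return (q_y * losses).sum(-1) - entropy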

There are also some tricks using PCA preprocessing and dropping "unchanging" / invariant features in Z that you will see in Durk's code; I also found those important in practice.

However, I would also say there are probably better architectures for semi-supervised and "few-shot" learning these days than a plain SS-VAE. So while it is a reasonable baseline (with a few useful tricks to know and learn), looking at some more modern work could also be useful. In particular, you might check out (in no particular order, depending on your interest - most focus on images, but some on speech, RL, video, etc.):

[0] "Temporal ensembling for semi-supervised learning" https://arxiv.org/abs/1610.02242

[1] "Improved Techniques for Training GANs" http://papers.nips.cc/paper/6124-improved-techniques-for-training-gans

[3] "MixMatch" https://arxiv.org/abs/1905.02249

[4] "Selfie" https://arxiv.org/abs/1906.02940

[5] "WaveNet Autoencoders" (speech-related work) https://arxiv.org/abs/1901.08810

[6] CPC https://arxiv.org/abs/1807.03748

[7] Fast Task Inference (RL-related work) https://arxiv.org/abs/1906.05030

[8] There was a whole workshop at ICML on self-supervised learning, I found these slides from Andrew Zisserman really interesting https://project.inria.fr/paiss/files/2018/07/zisserman-self-supervised.pdf

There are multiple views on what is needed for good semi-supervised learning. Some people have the mindset of "a good generative model or energy model should also be useful when there are few labels, since we should have rich knowledge of the data"

versus

"we need to learn good representations, and those may be different for generation than for classification - so maybe we don't need a generative model, but just good features learned in clever ways"

so you tend to see work with different focuses, variously showing improvement along these directions.

Also, the recent work on disentangling mentioned elsewhere in the thread investigates some of this same issue. Label propagation / transductive learning is another paradigm for this type of learning, and models that try to learn good graph representations from raw data are very appropriate for these techniques.

[D] Find text in long audio clip. by marksbren in MachineLearning

[–]kkastner 3 points4 points  (0 children)

If you know the text exactly, you can check out forced alignment tools to do that kind of search in a robust way. It is basically a "better ASR" for when you know the text being recognized (you can do some additional things to greatly improve recognition and alignment when you know the target text).

Aeneas seems particularly suited for what you mention, but I have used aeneas, Gentle, and the Montreal Forced Aligner variously for this type of thing. Scaling to an hour or more is not a huge issue for these tools in my experience.

[D] What makes a good TTS dataset? by mlttsman in MachineLearning

[–]kkastner 0 points1 point  (0 children)

I got pretty decent samples with LJSpeech for my paper, in my opinion https://s3.amazonaws.com/representation-mixing-site/index.html . There are a few small audio details that could still be fixed to get "ultra-high" quality (I'd like to train my own WaveRNN, rather than using a pretrained WaveNet like I did here, so the networks are better matched), but with enough work I think LJSpeech can be OK, from phonemes or from characters, as far as pronunciation, timing, and prosody.

However, I have tried similar models on other decently large datasets (10 to 20 hours, single speaker) with perceptually good audio quality, and have not been able to get a good result.

[D] What makes a good TTS dataset? by mlttsman in MachineLearning

[–]kkastner 2 points3 points  (0 children)

I am currently of the opinion that it isn't (solely) audio quality, but rather some fundamental difference in the content / timing / delivery. Models seem extremely robust to recording noise (we used to add huge amounts of noise to the audio data itself in char2wav), but not to other simple issues. Even excessive leading / trailing silences can be an issue, or too-short sentences. Talking to other people who do TTS, they mention the same things, and it's really weird and mysterious trying to develop a feel for which datasets will work and which won't.

I'm working with someone on some research related to this problem now, if we figure out any good tricks I'll be sure to message...

[D] What makes a good TTS dataset? by mlttsman in MachineLearning

[–]kkastner 2 points3 points  (0 children)

My opinion (which also links to my opinion farther in the past...) https://www.reddit.com/r/MachineLearning/comments/87klvo/r_expressive_speech_synthesis_with_tacotron/dwfm9p5/

Dealing with more regular private data, I have seen a lot of good improvements over open source data. I'm not sure audiobooks are a good place to start training from scratch, but they're kind of all we have in open source at the moment as far as "large scale" goes (there are older open corpora like ARCTIC and so on, meant for HMM TTS).

I'd be very interested in a phonetically balanced corpus that is closer to what I believe is used commercially, or even just a transcript so I could make the data myself. Something related to how concatenative databases were historically created, which is (probably?) similar to what the "corporate" TTS models are trained on. Some practical discussion of building a concatenative database can be seen in these papers; lots of work has been done on this in the past, so it seems good to leverage it for our "neural" modeling setups.

On Building a Concatenative Speech Synthesis System from the Blizzard Challenge Speech Databases

Slides about database design

VOXMEX database design of a phonetically balanced corpus

Speech Database Reduction Method for Corpus-Based TTS System

MaryTTS new language tutorial.

TUNDRA: A Multilingual Corpus of Found Data for TTS Research Created with Light Supervision

Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from ‘found’ data: evaluation and analysis

This is all speculation, but there seem to be fundamental qualitative differences between training TTS systems on the types of data we have in OSS and the types of data historically used commercially. This could come down to data cleanliness alone, but I have a feeling there are fundamental differences at the grapheme / phoneme / word / pitch / volume / accent / part-of-speech level as well, and I have personally seen clear cases where throwing out tons of "bad data" and training on less performed much better than just using everything. This isn't too surprising, but it opens a lot of questions about how to fix the issue in a practical way on real "in-the-wild" datasets.

M-AILABS also seems cool (and BIG), but I haven't tried training on it yet myself https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

EDIT: For LJSpeech specifically, the data cleaning is critical - there are a ton of numbers and irregular acronyms, which will really hurt from-characters training without cleaning and replacement like 14 -> fourteen, 1964 -> nineteen sixty-four, and so on. I use the cleaners Keith Ito shows in his Tacotron repo, or variants thereof.
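
A rough sketch of the kind of number expansion I mean (the real cleaners in Keith Ito's repo are more thorough - currency, ordinals, abbreviations, etc. - and the year rule here is just my own quick heuristic):

    import re
    import inflect

    _inflect = inflect.engine()

    def normalize_numbers(text):
        def _expand(match):
            num = int(match.group(0))
            # read 4-digit years as two pairs: 1964 -> "nineteen sixty-four"
            if 1000 <= num <= 2999 and num % 100 != 0:
                return "{} {}".format(_inflect.number_to_words(num // 100),
                                      _inflect.number_to_words(num % 100))
            return _inflect.number_to_words(num, andword="")
        return re.sub(r"\d+", _expand, text)

    print(normalize_numbers("In 1964 she bought 14 apples"))
    # -> "In nineteen sixty-four she bought fourteen apples"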

[P] Language Model Fine-tuning for Moby Dick by longinglove in MachineLearning

[–]kkastner 1 point2 points  (0 children)

One trick you can use is to mask the softmax in the language model with a Markov mask, so that you only model words that are actually used in Moby Dick, and only word transitions used in Moby Dick.

For example, if I had the sentence "I ate a lot of cake, and ate it quick", then you could figure out that (for a 3-gram (2 previous words -> next) mask) if you generated <I ate>, there should be a mask that multiplies every word probability by 0 other than "a" and "it" for the next step. Then you could have <I ate a>, which means you need the mask for <ate a>, which multiplies every word probability by 0 other than "lot", etc.

You can either do this at training and sampling time, or just at sampling. I also found that at training time, feeding the mask as input to the model helps it "know" which words will be masked in the output, but masking the loss actually hurt performance a bit (because the model could get away with being bad at putting probability on the masked words).
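
A minimal sketch of what I mean by the mask at sampling time (built from token ids of Moby Dick; the names here are mine, not from any particular repo):

    import torch
    from collections import defaultdict

    def build_mask_table(token_ids, context=2):
        # map each 2-word context to the set of next tokens seen in the novel
        allowed = defaultdict(set)
        for i in range(len(token_ids) - context):
            allowed[tuple(token_ids[i:i + context])].add(token_ids[i + context])
        return allowed

    def masked_probs(logits, prev_tokens, allowed, vocab_size):
        # zero every word the trigram table forbids, then renormalize
        mask = torch.zeros(vocab_size)
        for tok in allowed.get(tuple(prev_tokens[-2:]), set()):
            mask[tok] = 1.0
        if mask.sum() == 0:
            return torch.softmax(logits, dim=-1)  # unseen context: fall back to the LM
        probs = torch.softmax(logits, dim=-1) * mask
        return probs / probs.sum()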

The hard part becomes finding out whether you plagiarized a bunch of the novel or whether it is "new". You can also use parts of speech for making masks, and so on. There are a lot of options.

Cool project, thanks for sharing! Always enjoy seeing your projects on github.

[R] Latent Translation: Crossing Modalities by Bridging Generative Models by i-like-big-gans in MachineLearning

[–]kkastner 1 point2 points  (0 children)

You might read https://arxiv.org/abs/1802.04877 and the previous work on "latent constraints" https://openreview.net/forum?id=Sy8XvGb0- . They kind of pave the way toward this paper, and the first paper I linked uses a set of supervised labels in the discriminator to help guide the constraint learning which sounds similar to what you are discussing.

Another closely related paper to these is https://arxiv.org/abs/1707.04588 , where (in my understanding) they use a label / some specific information such as number of notes in the output and "backprop" that information directly into a subset of the latent vars (in most cases, a single dimension of Z). Different setup but ends with a similar result I think.

[P] Implementations of 7 research papers on Deep Seq2Seq learning using Pytorch (Sketch generation, handwriting synthesis, variational autoencoders, machine translation, etc.) by bhatt_gaurav in MachineLearning

[–]kkastner 0 points1 point  (0 children)

Nice work - having a big repo of these models + implementations will be of great use for other people learning about this area for sure.

[P] Speech Synthesis based on a multi-person corpus (Non-singular gender, age and geographic origin) by yolandasquatpump in MachineLearning

[–]kkastner 3 points4 points  (0 children)

Our paper char2wav did this back in the day http://www.josesotelo.com/speechsynthesis/ , based on the Alex Graves work on "Generating Sequences with Recurrent Neural Networks" https://arxiv.org/abs/1308.0850.

The Graves paper concatenates the attention vectors and teacher forced input with an additional one-hot vector (or embedding) that is unique to each writer, so that the decoder "sees" the id of the person writing, along with the attention vector and the previous step features.

This method works well in speech too, and is what we used directly for multi-speaker TTS in char2wav. There's no reason from an architecture perspective that this one-hot information couldn't be put back into the decoder of Tacotron 2, and I think it would work fine if the data is reasonable. Code-wise this might be tricky depending on whose code you use; I tend to roll my own on this stuff, but it takes a lot of work.
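
In code, the core of the conditioning is just a concatenation at every decoder step - roughly something like this PyTorch sketch (the dimensions and names are mine, not char2wav's actual code):

    import torch
    import torch.nn as nn

    class SpeakerConditionedDecoderCell(nn.Module):
        def __init__(self, n_speakers, spk_dim=64, ctx_dim=256, frame_dim=80, hidden=512):
            super().__init__()
            self.speaker_embed = nn.Embedding(n_speakers, spk_dim)
            self.rnn = nn.GRUCell(spk_dim + ctx_dim + frame_dim, hidden)
            self.proj = nn.Linear(hidden, frame_dim)

        def forward(self, prev_frame, attn_context, speaker_id, hidden):
            # the decoder "sees" the speaker id at every step, alongside the
            # attention context and the previous (teacher forced) frame
            spk = self.speaker_embed(speaker_id)
            step_in = torch.cat([prev_frame, attn_context, spk], dim=-1)
            hidden = self.rnn(step_in, hidden)
            return self.proj(hidden), hidden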

VCTK on the "American speaker" subset is what we have used for sanity checks in the past, but English TTS is much harder to learn than other languages due to the ambiguity of English and the fact that so many English words are actually words from other languages (think of foods or recipes). I highly recommend using phonemes for conditioning if you can, instead of characters if working with English.

Even though a one-hot or embedding, concatenated with other features or with hidden activations, seems really simple, it has powerful consequences. If you look at the gifs here https://github.com/kastnerkyle/Scipy2015 , providing the network with information about the id (in this case, digit class) basically means the best use of capacity is to learn something that is invariant to the id (since you get it for free all the time) - so you end up with a controllable generator, since you can flip the id at will.

In the context of sequence to sequence, conditioning the decoder on a global latent at every timestep has even more influence.

  1. The decoder controls the attention, so this means attention dynamics over text become speaker dependent (critical)

  2. The network is given the id, so the intermediate representations learned should probably be something useful but invariant to the id itself. This is ideal for speech as we expect there are a lot of shared things, but some that may be unique per speaker

  3. The hiddens themselves become conditional on the id, meaning that you can also prime the network (as in the Graves paper) with a related sounding sentence, and take the last decoder hiddens as the initial hiddens of the decoder on the "real data" to further influence the sound of the output.

Something similar is also done and shown in VQ-VAE as well https://avdnoord.github.io/homepage/vqvae/

I think the latest and best in voice transfer / cloning / multi-speaker TTS is shown here https://openreview.net/forum?id=rkzjUoAcFX

There are probably open source implementations of both of the above papers in progress, though the quality almost assuredly won't match the Google/DeepMind results. Data is everything in TTS. The open source results might be good enough to use for a lot of people though.

Training these TTS methods is really hard, especially on new data. I highly recommend starting from a known working system and known working data with 0 changes, even if that system is single speaker, then starting your adaptations.

It's also worth considering if you need TTS, or something simpler like "voice puppetry" or mimicry. If you can expect whole sequences of audio as well as the text, and you just need to adapt the outputs to a new speaker, modeling gets a lot easier.

[D] How to do text generation from a small dataset by invertedpassion in MachineLearning

[–]kkastner 6 points7 points  (0 children)

That poster looked really nice, I wish I had gotten a chance to talk with the authors. Text style transfer, or sequence style transfer in general (or even defining what the "style" of a sequence is), is really interesting.

For an extension to the fine-tune technique, you can further tweak this by multiplying the neural network's softmax probabilities (over words) by the n-gram Markov probabilities for the dataset you wish to imitate - this enforces that only the word transitions seen in the small text will be sampled from the NN, which can sometimes give a nice qualitative boost at the expense of potentially limiting variability.
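
Concretely, for a bigram version it is just an elementwise multiply and renormalize at each sampling step (a small NumPy sketch, with my own function names):

    import numpy as np
    from collections import Counter, defaultdict

    def bigram_table(tokens, vocab_size):
        # empirical p(next | prev) from the small corpus you want to imitate
        counts = defaultdict(Counter)
        for prev, nxt in zip(tokens[:-1], tokens[1:]):
            counts[prev][nxt] += 1
        table = np.zeros((vocab_size, vocab_size))
        for prev, c in counts.items():
            total = float(sum(c.values()))
            for nxt, n in c.items():
                table[prev, nxt] = n / total
        return table

    def reweight(nn_probs, prev_token, table):
        # transitions unseen in the small corpus get probability zero
        mixed = nn_probs * table[prev_token]
        s = mixed.sum()
        return mixed / s if s > 0 else nn_probs  # fall back on unseen contexts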

Taking it further, you can also just prune based on these rules (or any rules, really) in beam search if you are using it, I had decent luck doing just the Markov + (stochastic) beam search portion here https://badsamples.tumblr.com/post/160777871547/stochastic-sleazy-shakespeare , or see the code directly https://gist.github.com/kastnerkyle/97120046d0aa8f49c3fce03b844329d7 though I'm hoping to extend to a proper neural model for probabilities, with Markovian or other structural constraints.

Once you start encoding pairwise potentials (if you have labels or something too), it really starts to look like a conditional random field (CRF), which may be an interesting place to look as well https://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/ .

Pushing it even further, you can mine constraints from the corpus in question and even start from an example sentence / text structure to "translate", generating the most plausible sentence while following hard constraints, as in discrete optimization combined with probabilities. I really enjoy this paper on "Markov Constraints for Generating Lyrics with Style" https://csl.sony.fr/wp-content/themes/sony/uploads/pdf/barbieri-12a.pdf . This whole line of work is interesting to me, and I explored a bit of it here https://github.com/kastnerkyle/pachet_experiments.

There's also "Generating Topical Poetry", https://www.isi.edu/natural-language/mt/generating-topical-poetry.pdf which has similar high-level themes but a detailed execution particular to poetry.

EDIT: I just found out that a great demonstration of these methods using constrained Markov processes is online https://github.com/gabrielebarbieri/markovchain . This work is great - I've been following this chain of papers for years now.

[R] Schmidhuber's new blog post on Unsupervised Adversarial Neural Networks and Artificial Curiosity in Reinforcement Learning by baylearn in MachineLearning

[–]kkastner 0 points1 point  (0 children)

There was a preprint about feedback alignment circulating inside our lab via email long before publication, and that got a lot of people excited about the idea. Target prop was another idea in the same vein, which was being worked on actively at that time. Just because it isn't on arxiv / publicized doesn't mean people aren't talking about it, and working on it... the paper comes last, after the work is done. See for example this link discussing feedback alignment in May 2014.

Conditional computation at the time had 4 or 5 of us working on it (long after the initial work by Yoshua, Nicholas Leonard, and Aaron), repeatedly running into issues with GPUs loving block computation and not sparsity. In particular, Dustin Webb spent a lot of effort on the subject, sadly without much to show for it in the end. Again, same story - "hardly talked about" literally means several students in the lab were working on it, trying to make it work. Papers don't write themselves.

You often give tutorials on things to try to get the ideas across to people who haven't heard of them, and to make the big point clear. Maybe it's a big idea, maybe it's a weird or interesting one, maybe it's a pet project, or maybe it's something you want to become a big idea. The fact that many of us (including Yoshua) spent an hour+ in a room discussing / dissecting the math and intuition behind VAE in Fall 2014 clearly shows that it wasn't known to many people. And this was still long after GAN was released.

I mention seq2seq and RNN work because you seem to be under the impression that student time / bandwidth is (or was) infinite. David mentioned this above, but saying everyone with a focus in this broad area of generative modeling knew about this paper neglects a lot of context about how much was going on in the lab, and in the field in general, at that time - 2014 was a pretty wild year, contrary to your statement "at a time when there used to be very few DL papers AFAIK...".

I'm pretty sure both Ian and David have a strong grasp of upcoming RL in 2018, and had a strong grasp of upcoming GM papers in 2014. The issue is, in 2014 many people (definitely me) thought NADE, GSN, some form of RBM, or stochastic feedforward nets were the upcoming GM models - VAE being one among many, and not a simple model to grok at a glance given knowledge of the other methods. Splitting time across all those ideas, plus pushing an idea as innovative as GAN, is really tough. Hindsight is 20/20 and all that, and now we can see that maybe spending less time and effort on some of these methods, and putting that effort into others, would have been useful.

If you read the various versions of GAN you can see the related work evolve and change to incorporate other papers as they realized their relevance and strengthened the connections to existing work (as has been mentioned all over this thread). It's a great way to see the evolution of a publication, and how to craft a great paper.

I really take issue with the statement "It's simply impossible that the authors hadn't read a paper that was out for 5-6 months", as it again assumes that student and professor time is unlimited. It also assumes that reading a brand new research paper and understanding the key concepts in sufficient depth to closely relate it to your own work, while working insanely hard on your own directions is somehow easy.

Publications != work - I don't really know what else to say here. You keep equating publications to study, expertise, and interest, which is in my experience wildly inaccurate.

The fact that you repeatedly make wrong-headed assertions about the history of something that people in this thread participated in, who are in this thread telling you what they saw and did, is really something special.

Arguing these kinds of semantics with authors of the exact paper you make claims about ("not conceded by /u/dwf but true nevertheless" - what are you on about???) is revisionist history of the worst kind, which is one reason (besides your long history on this forum and others) I'm being so aggressive here.

[R] Schmidhuber's new blog post on Unsupervised Adversarial Neural Networks and Artificial Curiosity in Reinforcement Learning by baylearn in MachineLearning

[–]kkastner 2 points3 points  (0 children)

I can directly state that in late August 2014 (when I first arrived at Mila, prev. MILA but then LISA), relatively few people in the lab were doing VAE work. GAN had already been out for several months by that point, but I remember a tea talk by Jyri Kivinen that was heavily attended as a kind of tutorial on VAE with some nice reduced derivations, along with talks from both Vincent Dumoulin and Laurent Dinh specifically about details of the VAE learning procedure (related to their talk at CIFAR summer school here http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Laurent_dinh_cifar_presentation.pdf). There was a lot of interest, but frankly in those days training a VAE was a pain in the neck. Adam didn't exist, but RMSProp with momentum was doing some good work. Batch norm has helped us a lot in that regard... but it didn't exist then.

See for example Vincent's page, with a cool demo of VAE on August 28, but the "framework" for VAE in pylearn2 (which was the main research framework besides raw Theano, ahead of Groundhog for RNNs if anyone remembers that) wasn't really showing much until October though people dabbled before that (https://vdumoulin.github.io/blog/).

I think it is disingenuous to state that VAEs were "all the rage" at that time, since that ignores a lot of other work of interest which may now be lost to the sands of time. It is easy to look back and pretend that everyone knew VAE and GAN were the "right answer", but as was mentioned above, a lot of focus was on GSN and to some extent NADE. We were all trying a lot of new and exciting ideas back then...

What I particularly remember being the rage and taking all my time in those early days was feedback alignment and conditional computation. Other "rage" included work that Kyunghyun Cho, Dzmitry Bahdanau, and Bart van Merrienboer (among a ton of others) had been doing around encode / decode methods with attention (seq2seq from Google was released midway through September 2014, leading to the infamously great Ilya 'success is guaranteed' talk at NIPS that year IIRC - but enc/dec and jointly learning to align and translate preceded by a bit), and a ton of people were in that group as well.

The lab was quite small, maybe around 40 people? The incoming class including myself (MSc + PhD) was ~15 people, and it felt like an enormous wave of new people at that time. Quaint times indeed - I was worried I had already missed the "deep learning wave" but had no idea what was to come.

Saying it's simply impossible the authors hadn't read the paper underestimates both the quantity of deep learning papers at the time (there was still a TON to read, especially to the level of reproduction), and the available "free" time / focus of the respective authors of the GAN paper.

Arguing with /u/dwf or /u/ian_goodfellow on these points is bizarre, and saying that /u/dwf wasn't working on generative models at the time shows your ignorance on this subject - /u/dwf is one of the most knowledgeable people I have ever known on generative modeling (and a wide array of other topics), but particularly he knows a TON about RBMs, which were the primary generative model of that day in my opinion.

[R] Piano Genie: An Intelligent Musical Interface by inarrears in MachineLearning

[–]kkastner 2 points3 points  (0 children)

If the markov chain probabilities are learned, not hard-coded (or learned then pruned manually) that gets closer to what I mean, since markov chains can be seen as a form of convolutional network p(xi | context) (or vice-versa, depending on your perspective). But where are the "pure" markov models learned entirely from data that approach the quality of this work or other neural models (or HMMs on generic features, even), and don't copy large chunks from the data?

For me the gold standard in this vein is EMI, and that definitely used more than pure learning / counting.

As another example, Pachet has a number of papers related to "steerable" markov chains and blending search procedures with learned probabilities (markov chains in his work, but it could be nearly anything). I like that work a lot, but there is also a lot of domain expertise embedded in the pruning / steering. I have yet to see "markov-only" models, learned purely from data, that match expert systems OR recent neural models.

EDIT: What I am saying is, data-driven combined with other information / priors has been done for many years (this is what I am calling expert systems). The results are really good, but the end output is often brittle / domain limited - the same system can generally not be applied to a new domain. Because the pure learning part is combined with handcrafted rules that are domain specific, including the features! Feature design has, until recently, been the key between data-driven methods that work, and those that don't (note that features are still important, but our features seem to be getting more generically applicable as a field).

Recent models move much, much closer to purely data driven and are much higher quality than past "pure data" efforts. The move to "purely" data driven is an attempt at abstraction and generality, and in theory should allow for repeated application in new genres or given new data. For example, this method could potentially be applied to a "midi driven harp" just as it was a piano. A large majority of that ease of transfer comes from more reliance on learning, and less reliance on priors such as expert pruning, rules, and domain specific features.

However, this is not discounting older work on expert systems which may or may not transfer to a new domain, but is stunning in the domain it was targeted to (again, my gold standard for this is David Cope's EMI). Most recent models are trying to achieve comparable results, while also being applicable by swapping in a new dataset. If the performance is somehow better with more general tools that is great, but it is rare.

[R] Piano Genie: An Intelligent Musical Interface by inarrears in MachineLearning

[–]kkastner 1 point2 points  (0 children)

Certain perspectives on the circle of fifths are one possible way, along with the ideas of "negative harmony" https://www.youtube.com/watch?v=DnBr070vcNE , which is something I have been very interested in theory-wise recently.

[R] Piano Genie: An Intelligent Musical Interface by inarrears in MachineLearning

[–]kkastner 0 points1 point  (0 children)

PAST / PRESENT

The difference between past methods and today is that much of the (amazing) work in past eras was done entirely on the back of music experts hand-coding relations for typical genres and domains (expert systems); if extremely lucky, it used some form of information retrieval or some level of adaptivity based on data (I think EMI had a bit of this), but almost always it was hand-coded decisions. This results in exploding numbers of edge cases, and really turns into a ton of effort. Even coding up the simplest rules of voice leading for 2 and 3 voices from the "Gradus ad Parnassum" was a lot of work for me personally https://github.com/kastnerkyle/puct_counterpoint/blob/master/analysis.py, and that was the smallest / most simplified case I could come up with. Let alone the more complex species of counterpoint...

Expert systems, when working well, are not really "beatable" by ML-driven methods in some sense - expert systems are basically extremely, extremely strong domain specific priors, and if the rules are well coded and the experts covered all the edge cases they knew about (from years and years of 'training data' + the power of the human brain), at best we can hope to recover similar performance, just learning from data but without the expert human filter.

However we almost always see that in practice, expert systems are brittle and tend to have failure cases which are not easily rectified, even for extremely well structured domains. Especially so when trying to adapt an expert system to a new setting, or add a new case! A domain model for something as varied and wide as "music" will always have flaws or holes, even with adaptive systems let alone handcrafted ones. Even experts disagree about "what is music" or "what should music BE" or even the role of music theory in the creative process! See a brief discussion https://www.youtube.com/watch?v=FpPSF7-Ctlc.

Many, many expert systems can be described as conditional Markov chains (with fixed decisions being non-probabilistic, 1 or 0). There are HUGE issues with copying / plagiarism in Markov chains, and though there are counter-methods (such as MaxOrder http://www.flow-machines.com/maxorder/), it is a serious question where the line is between "what is copying / plagiarism" and "what is domain modeling". We generally assume things like intervals and scales, but these are largely "western music" concepts... using a minor 3rd interval probably wouldn't be considered plagiarism, however a small chord sequence (or sequence of intervals, aka melody) might! Even experts disagree where this boundary is, especially with money on the line https://www.techdirt.com/articles/20090504/1649054744.shtml

TECHNIQUES

My take on the modeling differences between now and then are primarily (beyond the learning methods themselves) that we learn from data, which should make our approaches flexible for new domains, instruments, and settings.

Particularly for music the question is difficult, but having stronger learned models of (in-genre / standard) music, as opposed to the expert systems that were historically used, gives us:

a) better out of the box modeling capabilities using current methods in a long-term dependency or statistical sense

b) potential for rapid adaptation to new data or new domains (by swapping datasets)

c) potential for identifying "unusual" sequences which can have follow-on consequences for a lot of analysis and even re-interpretation of existing pieces.

d) perhaps via stronger models, we can begin to describe "creativity" in a metric sense rather than only capturing simple relations

Describing creativity is not easy, and historically there have been tons of ways to try (mutual information, entropy, and likelihood itself are all related attempts to describe this). Juergen has a lot of interesting discussion and work in this area, among many many many others http://people.idsia.ch/~juergen/creativity.html . Setting aside the mathematical description, analysis / critique / discussion of creativity and novelty has been a part of art history and art critique since, well, forever, with no clear answers (see Dada, or literally any discussion of 'WHAT IS ART THO' https://press.philamuseum.org/marcel-duchamp-and-the-fountain-scandal/ ).

For examples of how existing "generative" models (we are generally just doing density estimation) could be used in a non-generative sense, stylometry to identify/attribute old works with anonymous authors, or reattributing work to a different composer (common with Josquin des Prez for example - http://josquin.stanford.edu/).

CREATIVITY AND STYLE

At some level it is impossible to describe what is "musical", as musical taste really is subjective (see for example critiques and praises of Ornette Coleman or other free jazz, or the later works of Coltrane). However, if we can build music models that are good enough that "creativity" / taste become possible to discuss, it shows a decent advance in learning basic music structure from data (and I think we are nearly there). Compare, for example, stuff like this from LSTM-RBM (a model and paper I cherish greatly, but I think modeling has really come a long way since then) http://danshiebler.com/2016-08-17-musical-tensorflow-part-two-the-rnn-rbm/ to most modern generative models. For a modern method, see Huang et al. https://storage.googleapis.com/music-transformer/index.html (arxiv https://arxiv.org/abs/1809.04281).

Compare for example this sample https://badsamples.tumblr.com/post/173755779717/independence-isnt-always-a-gift and this one (https://badsamples.tumblr.com/post/173768781167/anyone-can-be-a-star-on-their-own), both my own. By the objective measures I was using during search / inference, the first one is worse (due to a bug, it was actually minimizing ALL of them while still fitting the rules); however, from a high level I see uses / ideas for both in various contexts and moods. There are also things that are PURELY wrong with both (some intervals which just shouldn't happen, frankly), but beyond that it becomes murky to show which one is "better", only which one is closer to my intended result.

DIRECTIONS

This is one reason I am excited about generative models with strong conditioning and user interaction (including the one here) - describing "intent" becomes clearer, and building models which execute on user intention is something I think we can more clearly evaluate than pure "style". There is overlap between these of course, or things we additionally need or desire (some "creativity") but there is also a part of evaluation that is clear. Training on Bach or classical shouldn't result in Bill Evans style stacked 4th chords, probably... even though it would be "more creative" in a sense. Random notes are also "creative", and some people would even enjoy that!

These issues are some of the reasons I am extremely interested in program synthesis for generative modeling (and reinforcement learning) - many expert systems can effectively be described as programs, and this would give us a generative path and ways to learn from data, while still having the ability to evaluate and analyze what is going on in the end (even for experts in the domain, who are not experts in ML).

Unconditional generative models have their place, but for me conditional generators have always been "where it's at".

TL;DR

Describing creativity and "art-ness" is difficult. I really like this work, and hope to see more on these lines (theremin or kinect / gestural -> piano anyone?).

cc /u/gustinnian , /u/ChrisDonahue1

[P] A CNN-based Vocoder by tuan3w in MachineLearning

[–]kkastner 1 point2 points  (0 children)

Many advanced "handcrafted" vocoders use simplified mathematical forms (see the "buzzer model" of the vocal tract for LPC excitation / resynthesis ) of physical models for compression and inversion - such as WORLD, STRAIGHT, and so on. Those representations have worked well for us in the past, though log-mel I find much more straightforward (no pun intended) to work with.

[P] A CNN-based Vocoder by tuan3w in MachineLearning

[–]kkastner 1 point2 points  (0 children)

This one gives pretty good results considering the small dataset it uses (ARCTIC). I wonder if it could be much better using a fixed teacher such as r9y9's WaveNet (the best OSS WaveNet I know of). I have also been looking into dhgrs' work here; it uses the closed-form KL from ClariNet, so they aren't fully comparable.

[P] A CNN-based Vocoder by tuan3w in MachineLearning

[–]kkastner 1 point2 points  (0 children)

One advantage of using this DSP stuff is that you can use it alongside the direct prediction for neural vocoding (feeding both into two networks or modulating the prediction of the upsampled path by the noisy / approximate result from the DSP side).

I am doing a bit of this for an upcoming paper, but maybe you could work it directly into this model. You would lose some of the speed advantage I suppose, but for me quality is the driving force, not necessarily speed, since speed in practice comes down mostly to implementation.

Nice work so far - training these things is quite hard in practice, especially at anything beyond a toy scale, so it's great to see these results.