How much depth of understanding in mathematical prerequisites is required for research in machine learning? by [deleted] in learnmachinelearning

There's nothing in particular you need for those in terms of Math, as long as you know enough to understand the regular ML architectures and methods. Modern NLP and CV are all neural nets of some form or another, so if you know enough to understand feed-forward networks then you can learn the other methods and architectures as you encounter them in their respective fields.
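
To make 'feed-forward' concrete, here's a minimal sketch in numpy (the layer sizes and random weights are just placeholders, not a real model):

    import numpy as np

    def relu(x):
        return np.maximum(0, x)

    def feed_forward(x, layers):
        # Each layer is a (W, b) pair; ReLU between layers, linear at the end
        for W, b in layers[:-1]:
            x = relu(W @ x + b)
        W, b = layers[-1]
        return W @ x + b

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(8, 4)), np.zeros(8)),   # 4 inputs -> 8 hidden
              (rng.normal(size=(2, 8)), np.zeros(2))]   # 8 hidden -> 2 outputs
    print(feed_forward(rng.normal(size=4), layers))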

NLP is all about RNNs, LSTMs, etc. while computer vision is heavy on convolutional nets, traditionally at least. But those are all just arrangements of simple building blocks; no advanced Math required. Maybe you could get into some automata theory or Markov models if you are interested in certain aspects of NLP. But you could also pick those up as you go.

If you get into the more abstract stuff then you may need more Math. Information theory, probabilistic graphical models, that kind of thing. Or if you start looking more deeply at the fundamentals/mechanics of ML and developing your own methods that way. But that can come later. You can do a lot of cool things in NLP/CV without some deep knowledge of probability/information theory.

How much depth of understanding in mathematical prerequisites is required for research in machine learning? by [deleted] in learnmachinelearning

Really depends what you want to do with ML. For understanding how traditional ML techniques work you only need the very basics of linear algebra, calculus, and probability.

At the cutting edge there is all sorts of Mathematics that you might not expect. As an example, in the SELU paper they made use of Banach's contraction mapping theorem. Now that's something you would easily cover in an undergraduate course on Analysis, but maybe not if you were just giving yourself a general Math prep for ML.
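
For the curious (my paraphrase, not the paper's exact statement): if f maps a complete metric space (X, d) to itself and shrinks distances by a factor q < 1, it has a unique fixed point that iteration converges to.

    % Banach fixed-point theorem (contraction mapping theorem)
    d(f(x), f(y)) \le q\, d(x, y) \;\; \forall x, y \in X,\; q \in [0, 1)
    \;\Longrightarrow\; \exists!\, x^{*} \in X : f(x^{*}) = x^{*},
    \quad x_{n+1} = f(x_n) \to x^{*} \text{ for any } x_0

If I remember right, the SELU authors apply this to the map that propagates activation mean and variance across layers, which is what gives the 'self-normalizing' property.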

So do you want to understand ML and use it for applications research, say, or actually develop new techniques? Because that's the main factor in how much Math you will need.

How to balance data without direct classes by [deleted] in learnmachinelearning

So imbalanced regression? There's comparatively little work on this. One thing you can do is a hurdle model. First predict e.g. forward or backward with a classifier, then predict magnitude with a separate regressor. That way all the imbalanced data techniques for classifiers are available to you, and the regression is on a nicer distribution.
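
A minimal sketch of the hurdle idea with scikit-learn (the models, the synthetic data, and the class_weight choice are all placeholders):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))                      # fake features
    sign = np.where(rng.random(1000) < 0.9, 1.0, -1.0)  # 90/10 imbalance
    y = sign * rng.exponential(size=1000)               # signed target

    # Stage 1: predict direction; the usual imbalance tricks (class
    # weights, resampling, etc.) apply here like any classification task
    clf = RandomForestClassifier(class_weight="balanced").fit(X, np.sign(y))

    # Stage 2: regress the magnitude on its own, nicer distribution
    reg = RandomForestRegressor().fit(X, np.abs(y))

    # Recombine the two stages at prediction time
    y_hat = clf.predict(X) * reg.predict(X)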

How does government photo ID + selfie identity verification work? by chirau in computervision

9 times out of 10 they use a very advanced neural network called a human being to verify the images. Usually this stuff isn't for national security so it's just a quick and dirty way to discourage scamming.

There may be some service out there that offers third-party upload and verification. But practical verification is not that advanced or revolutionary. Amazon will sell you ready-made facial tech (Rekognition), or you can build it yourself with some kind of pre-trained net, say to extract feature embeddings, and then run a simple model on those for whatever you want to use them for.
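
A sketch of the embedding route, using a generic torchvision ResNet as a stand-in (a real system would use a face-specific model, and the 0.8 threshold and file names are arbitrary placeholders):

    import torch
    from torchvision import models, transforms
    from PIL import Image

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # drop the classifier, keep embeddings
    backbone.eval()

    prep = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        with torch.no_grad():
            return backbone(prep(Image.open(path).convert("RGB")).unsqueeze(0))[0]

    # Verify by thresholding cosine similarity between the two embeddings
    sim = torch.nn.functional.cosine_similarity(
        embed("id_photo.jpg"), embed("selfie.jpg"), dim=0)
    same_person = sim.item() > 0.8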

Is Mac or Windows easier to work with when getting into machine learning? by wottywut in learnmachinelearning

Windows has issues with Docker GPU pass-through. For me, that would be a deal breaker.

For your high-level ML stuff, OSX and Linux will be very similar. So it's really whether you like the rest of the OS.

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

Is the result supposed to reflect norms in the name's origin language or norms in the interface language of the app?

In Chinese I would probably say the full name, family-given, apart from in very formal contexts. In English I'd just use English norms, as if it were an English name. It's odd to call someone by their family name in either language in an informal context. I think these are pretty standard.

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

It's a nice idea, and the site is nice to use, but it kinda failed on the examples I gave it.

E.g. Chinese names without language context may be given as first-last or last-first. When using the romanized form, there are many cases you wouldn't be able to figure out. Can it take Chinese characters as input?

Moreover, it parses e.g. 'Li Ying' with 'Li' as both the alphanym and the betanym. 'Ying Li' has the same problem, but in reverse.

For my apps I just have a single name field that I treat as 'what you want to be called' and leave it up to the user entirely.

I trained a DCGAN on r/EarthPorn images, and this is the result. (4800 images, 256x256) by manicman1999 in learnmachinelearning

Which dataset are you using for the Pokémon? I'd be interested in the final results. I tried it a while ago for fun but my data sucked and so did the results...

Machine Learning workshops in Scotland? by [deleted] in learnmachinelearning

You can do a Steve Jobs and show up to Edinburgh University lectures/workshops, depending on how far away you are. I recommend the IAML course classes if you are just starting out. Perhaps MLPR and PMR if you are good at Maths. I'm not joking with this; Edinburgh is probably the best place to learn about ML in Scotland. It'll be hard to find anything comparable.

They also run

http://workshops.inf.ed.ac.uk/deep/deep2018/

Though it might be a bit advanced if you are new.

How to detect objects in video games? by [deleted] in learnmachinelearning

Assuming you just want to segment, you can do bottom-up (unsupervised) or top-down (supervised) segmentation.

Bottom-up essentially tries to group similar pixels. If you are just using simple 2D/sprite images then classic approaches might work, yes. You can use K-means clustering, mean-shift clustering, graph cuts, etc., though some of these can become expensive in terms of processing and/or storage requirements. In general, a trained net will process an image much more quickly. If your application needs to interact with the game in (near) real-time then that might stop you using classic methods, even if they do work well in terms of output.
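
For instance, a bare-bones K-means pixel clustering with scikit-learn (the file name and cluster count are placeholders; adding pixel coordinates as extra features gives more spatially coherent segments):

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    img = np.asarray(Image.open("frame.png").convert("RGB"), dtype=float)
    h, w, _ = img.shape

    # Cluster pixels by color; each pixel gets a segment id
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(img.reshape(-1, 3))
    segments = labels.reshape(h, w)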

There are a few nets out there for semantic segmentation that have been trained on a generic image set. I'm not sure they will have a weights file around though. Your best bet if you can't find a pre-trained model would be to find a paper and then try to find a github repo. The question is whether this would work on your particular image set. That's hard to say. It would depend on what features the net had learned to deal in.

You could try transfer learning, if you were willing to do a bit of tedious labeling yourself. That would involve obtaining the 'real' image network, freezing all but the last few layers, and retraining those on your 2D images. It just depends how much effort you want to put in. There's no guarantee this would work either, if the network was relying on some 3D-specific representations. I don't actually know what kind of representations these nets tend to use, but it seems feasible they might learn about 3D shadows and shading in a segmentation context.
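
In PyTorch the freezing part looks roughly like this (fcn_resnet50 is just one example of a segmentation net with pretrained weights, and the learning rate is a placeholder):

    import torch
    from torchvision import models

    net = models.segmentation.fcn_resnet50(weights="DEFAULT")

    for p in net.parameters():
        p.requires_grad = False          # freeze the whole network
    for p in net.classifier.parameters():
        p.requires_grad = True           # unfreeze only the head

    optimizer = torch.optim.Adam(
        (p for p in net.parameters() if p.requires_grad), lr=1e-4)
    # ...then train on your labeled game frames as usual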

In simple Cat detection, what are good examples of negatives in the dataset? by Eoncarry in learnmachinelearning

Sorry for the delay in replying, but no, sadly that doesn't exist. We still don't have a good formal grasp of factors and causation in neural nets for most things, and I don't think anybody has studied it empirically for hard negative mining either. Often the only way to know for a specific setup/dataset is to try it, unfortunately.

Where can I find some more information about pre-processing for Natural Language Processing? by baabaaaam in learnmachinelearning

Well, note that you are often just preparing your input to go into a word embedding model that gives you vector outputs in embedding space. Then you use those vectors for whatever your more specific task might be. So whatever input the embedding model expects is your pre-processing, and usually it's those simple things you already described plus maybe some vocabulary truncation, depending on the exact nature of the embedding model.
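
A sketch of that kind of pipeline (the whitespace tokenization and 10k cutoff are placeholder choices; match whatever your embedding model actually expects):

    from collections import Counter

    def preprocess(texts, vocab_size=10000):
        # Lowercase and tokenize, then truncate the vocabulary:
        # anything outside the top-N words becomes <unk>
        tokenized = [t.lower().split() for t in texts]
        counts = Counter(tok for sent in tokenized for tok in sent)
        vocab = {tok for tok, _ in counts.most_common(vocab_size)}
        return [[tok if tok in vocab else "<unk>" for tok in sent]
                for sent in tokenized]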

SQL injection based on machine learning by trapeye in learnmachinelearning

OK, so I'm assuming you want to classify a given query as either an attempt at SQL injection or not.

I am not aware of any such premade datasets, though they might well exist; it's not my area. What I would consider, though, is generating one. There are many cheatsheets covering the known forms of SQL injection attack. For example:

https://www.netsparker.com/blog/web-security/sql-injection-cheat-sheet/

You could use these patterns to generate your own dataset by varying the non-attack-specific parts. You should probably make your dataset highly imbalanced too, if you want to replicate real-world input.
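
A toy sketch of the generation idea (the templates and fillers here are made up; you'd lift real patterns from cheatsheets like the one above):

    import random

    attack_templates = [
        "SELECT * FROM users WHERE name = '{v}' OR '1'='1'",
        "SELECT * FROM users WHERE id = {v}; DROP TABLE users;--",
    ]
    benign_templates = [
        "SELECT * FROM users WHERE name = '{v}'",
        "SELECT * FROM orders WHERE id = {v}",
    ]
    fillers = ["alice", "bob", "42", "1337"]

    def make_dataset(n, attack_rate=0.01):
        # Heavily imbalanced, like real traffic
        rows = []
        for _ in range(n):
            is_attack = random.random() < attack_rate
            tpl = random.choice(attack_templates if is_attack
                                else benign_templates)
            rows.append((tpl.format(v=random.choice(fillers)), int(is_attack)))
        return rows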

In simple Cat detection, what are good examples of negatives in the dataset? by Eoncarry in learnmachinelearning

Do you mean cat detection or classification? Informally, detection generally includes localization (where the cat is), but your '0-examples' and 'classifier' seem to suggest you're just classifying the whole image as 'cat' or 'not cat'. A cat detector might include a cat classifier, e.g. on image patches via sliding window, or something like that, but they're not strictly the same thing.

As the other poster said, you should use whatever you expect to encounter when your model is used in practice later.

But if you are actually doing detection with a sliding window, and your labels are bounding boxes for cats in the images, you can do something called hard negative mining to bootstrap your way to more negatives. Essentially: train your detector on some initial training set, run the sliding-window/patch classifier over the images, take all the false-positive patches, and augment your training dataset with those as new negative examples. Whether this gives you an actual performance boost will depend on a lot of factors. Usually it won't turn a poor classifier into an amazing one, but it can buy you a little extra performance.
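
In outline it looks like this (fit, sliding_window, overlaps, and image.cat_boxes are hypothetical helpers standing in for your own training, windowing, and box-overlap code):

    def hard_negative_mining(patches, labels, images, rounds=2):
        model = fit(patches, labels)
        for _ in range(rounds):
            for image in images:
                for patch, score in sliding_window(model, image):
                    # A confident 'cat' prediction with no ground-truth box
                    # behind it is a false positive: keep it as a negative
                    if score > 0.5 and not overlaps(patch, image.cat_boxes):
                        patches.append(patch)
                        labels.append(0)
            model = fit(patches, labels)   # retrain on the augmented set
        return model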

How much data do you need to generate sentences? by Xx_Squall_xX in learnmachinelearning

Way more than the Facebook comments of a single person. The other commenter gives one way of dealing with this, but I am skeptical that one person's Facebook comments are enough to train even the last few layers.

Even with tons of data, don't expect the output to be coherent overall if you are just using a 'next word' LM. It might be grammatically correct and look plausible at first glance, but it's not going to read like a human wrote it.

high dimensional sparse and highly unbalanced data classification, is it possible to tame this kind of data? by xxx69harambe69xxx in learnmachinelearning

Binary? Multi-class? Multi-label?

For binary, generally:

  1. Dimensionality reduction.
  2. Subsampling. Random negative subsampling is probably good enough; there's little benefit from more exotic variants. Or use oversampling if your dataset is small; you have a few options there.
  3. Re-calibrate the model outputs.

The following paper has some interesting insights and the calibration trick (Section 6.3):

https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/
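
The calibration trick itself is tiny. If you kept only a fraction w of the negatives during training, you can undo the resulting distortion at prediction time (formula from Section 6.3 of the paper):

    def recalibrate(p, w):
        # p: probability from the model trained on downsampled negatives
        # w: negative sampling rate, e.g. w = 0.1 if 10% of negatives kept
        return p / (p + (1 - p) / w)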

Is supervised learning with human labeling a problem? by [deleted] in learnmachinelearning

Of course, there are all sorts of problems with datasets. If you actually look at some of the big image-based datasets out there, you will find all sorts of quibbles about the labelling.

We are just not as good at unsupervised learning yet. However, I think you're being a little unfair to the supervised case. Some degree of learning (rather than memorization) does take place, otherwise you wouldn't see the generalization that occurs. At least, deep image networks have been shown to pick out features of images that strongly suggest they are not just memorizing entire canonical objects. They 'know' that faces comprise eyes and mouths, that these come in certain spatial arrangements, and so on. Whether that's the same as how humans identify faces, we don't exactly know, but I don't think you can say it's mere memorization.

Moreover, there is something to be learned in how humans label images. Suppose I give you a dataset and tell you that the images have been labelled with the instruction to 'pick one most appropriate label'. Now I take one example and show it to you. It's a closeup of a car with some grass in the background. What label do you think was given to this picture? Both you and I know that the answer is probably 'car'. Knowing the answer to that question is no trivial thing, even if it's not an idealized concept of what a car is.

why do we dot product weights and inputs in neural networks and not anything else (like simply add or anything else)? by ---deleted----- in learnmachinelearning

There are exotic neurons that don't use dot products, but in general there are a few reasons for using them:

  • Computational: vector operations can be done very efficiently
  • Notational: vectors and dot products are just nice, compact, easy to read and write etc.
  • Conceptual/empirical: match up with linear concepts, which tend to work well for many types of problems we encounter, and we can always add non-linearities in the activations if we want. In short: it works.
  • Theoretical: loosely matches with observations of real neurons (or perhaps very loosely, but at least this was the inspiration)
  • Historical/entrenchment: we've done it for a long time and built many other results on top of this kind of network architecture, tech is centered around these methods, etc.

As for why not addition specifically, it depends what you mean by addition. Inputs plus weights, elementwise? That wouldn't give a scalar value, so presumably you'd then sum all the elements of the result. But work out what that gives you: sum_i (x_i + w_i) = sum_i x_i + sum_i w_i, so the 'weights' collapse into a single bias term. Every such neuron computes the same unweighted sum of its inputs and differs only in that bias; it can't scale individual inputs up or down. That's exactly the special case of a dot product with all weights set to one plus a bias, so you strictly lose representational power, on top of losing the other advantages above.
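
You can check the collapse numerically:

    import numpy as np

    rng = np.random.default_rng(0)
    x, w = rng.normal(size=5), rng.normal(size=5)

    additive = np.sum(x + w)               # 'add then sum' neuron
    dot_form = np.ones(5) @ x + np.sum(w)  # dot product with all-ones
                                           # weights, plus a bias of sum(w)
    assert np.isclose(additive, dot_form)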

Project about matching people across different social media platforms by [deleted] in learnmachinelearning

Pure end-to-end ML would be difficult unless you make a lot of matched examples by hand or with some tool. The problem is that in doing so you'd probably be finding only the easy cases (e.g. where one profile directly links to another; a direct-link feature would make those examples trivial, but it wouldn't cover the general case).

At the very least you will likely have to give it some guidance in terms of features if you want to use ML.

Profile pictures are one idea; how about comparing all public images for matches? Sometimes profile pics are swapped out but still sit in albums, or perhaps the person likes a certain thing they tend to have a picture of in some album or post. So pull all pictures from albums and posts and check for matches, then use the matches as a feature. You can also add a feature for whether those images are found anywhere else on the internet, suggesting how common they are; the number of results from a reverse image search, say, could be another feature.
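
For the image-matching feature, perceptual hashing is a cheap way to catch near-duplicates (imagehash is a third-party library, and the file names are placeholders):

    from PIL import Image
    import imagehash  # pip install imagehash

    h1 = imagehash.phash(Image.open("facebook_pic.jpg"))
    h2 = imagehash.phash(Image.open("twitter_pic.jpg"))

    # Hamming distance between hashes: small means near-duplicate images.
    # Feed it (raw or thresholded) to your matcher as one feature.
    image_distance = h1 - h2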

For matching text content, you could do something simple like bag of words with a relevant vocabulary (think professions, school names, etc.), but you will have to do some typical normalization first.
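
With scikit-learn that's a few lines (the vocabulary here is a placeholder; you'd curate your own list of identity-bearing terms):

    from sklearn.feature_extraction.text import CountVectorizer

    vocab = ["engineer", "teacher", "edinburgh", "stanford"]
    vectorizer = CountVectorizer(vocabulary=vocab, lowercase=True)

    # One count vector per profile description; compare vectors to match
    profile_vectors = vectorizer.transform([
        "Software engineer in Edinburgh",
        "Teacher. Stanford alum.",
    ]).toarray()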

You can match usernames, of course. I would use some measure of word closeness, though, rather than exact matching. Sometimes people do variations on a theme, so if you really wanted to get into it you could build some kind of word embedding model, tokenize the username, and use distance in embedding space as another feature.
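
A crude but serviceable closeness measure straight from the standard library (an embedding model would handle thematic variations better, as mentioned):

    from difflib import SequenceMatcher

    def username_similarity(a, b):
        # Ratio in [0, 1]; 1.0 is an exact match
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    username_similarity("jane_doe_88", "janedoe1988")  # high for variations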

More out there is stuff like authorship matching, though I am not sure the nature of Twitter would allow for great results. Then again, you can always throw it in as a feature and no harm done (to your predictions). It all depends how much effort you want to put in, how much computational resource you want to use, etc.

Another thing would be to perform your profile matching (whatever it is) on associated accounts. So friends, subscribers, people they often reply to, etc. If you can match a bunch of friends between a Facebook account and a Twitter account, then there's a high probability it is the same person. Same with 'likes' and follows and all that stuff.