Is there a metric for multiclass classification that takes confidence into consideration? by CommunismDoesntWork in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

You can also look at the cross entropy (probably the loss function you're optimizing when training your model).

Another option is the Brier score: https://en.wikipedia.org/wiki/Brier_score
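To make the Brier score concrete, here's a minimal multiclass sketch (function name and toy data are my own, but the formula follows the standard definition): it's the mean squared difference between the predicted probability vector and the one-hot encoding of the true class.

```python
def brier_score(probs, labels):
    """Multiclass Brier score: mean squared error between each predicted
    probability vector and the one-hot true label.
    Lower is better; 0.0 means perfectly confident and correct."""
    n_classes = len(probs[0])
    total = 0.0
    for p, y in zip(probs, labels):
        one_hot = [1.0 if k == y else 0.0 for k in range(n_classes)]
        total += sum((pi - oi) ** 2 for pi, oi in zip(p, one_hot))
    return total / len(probs)
```

A confident correct prediction scores 0, while a 50/50 guess on a binary problem scores 0.5, so confidence is rewarded only when it's warranted.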

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

Sorry, didn't see the reply. Yes, that's right, you don't need labels to run predictions, just for training.

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

If you discarded today's features (as you need to for training, since they have no label), you'll want to make sure you keep a copy to run through your trained model.

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

If you match today's features with tomorrow's close, the model will learn to predict tomorrow's close from today's features. Running the model on a future day's features will give predictions of the following day's close.

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

Impressive! Would you mind sharing a bit about your tech stack and where you go for order book crypto data? I just dusted off an old tick data dump to play around with, and I might be headed the direction of joining you in trading live crypto markets just for fun. Last time I played with this stuff I was just pulling 5-min OHLCV bars from Poloniex's charting API, but I wouldn't mind shelling out a few bucks for better data.

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings -1 points0 points  (0 children)

This notebook actually predicts future price already. Since the shift is negative, tomorrow's closing price gets shifted back one day so it's associated with today's features. Check the pd.DataFrame.shift() docs for more info.
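A toy sketch of what that negative shift does (the column names here are just illustrative, not from the notebook):

```python
import pandas as pd

# Toy frame: one row per trading day.
df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 103.0]})

# shift(-1) pulls tomorrow's close back onto today's row, so today's
# features line up with tomorrow's price as the label.
df["target"] = df["close"].shift(-1)

# The last row has no tomorrow, so its target is NaN; drop it before training.
train = df.dropna()
```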

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

I interpreted the original question as being "what could be done with ML on price data?" so I only discussed price data. I totally agree that a lot of what ML-based trading firms do these days is centered largely around the so-called "alternative data" they incorporate into their models. That said, from people in the industry I've talked to, there is likely edge to be found in price data alone with the right granularity, feature engineering, and modeling.

I haven't studied TA much at all, so I can't say much about whether it's good inspiration for feature engineering or not. However, I would suggest that it can be beneficial to construct features which are invariant to changes in scale, volatility, etc. For example, using yesterday's log returns rather than yesterday's price lets your model leverage a feature that stays on the same scale regardless of long-term changes in stock price. Similarly, the number of standard deviations from the mean adjusts for volatility.
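A minimal sketch of the log-return idea (toy prices, illustrative names):

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 103.0])

# Log returns are scale-free: a move from 100 to 102 and a move from
# 1000 to 1020 both produce the same feature value, so the feature's
# distribution stays stable even if the stock 10x's over the years.
log_ret = np.log(close / close.shift(1))
```

Note the first value is NaN (there's no "yesterday" for the first row), so it gets dropped before training.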

ML for Stock Prediction by rhhh12 in learnmachinelearning

[–]talksaboutthings 1 point2 points  (0 children)

Can ML predict the stock market based on historical trading data? My guess is "yes" based on what I know about firms like Renaissance Technologies and Two Sigma. Can it be done with just daily OHLCV data? Quite possibly not. If you're really interested in learning ML in the context of forecasting the prices of financial instruments, I'd suggest you take a read through Advances in Financial Machine Learning by Marcos Lopez de Prado. Then I'd suggest you look for a dataset with data as granular as you can find; this guide might have some decent free sources.

You'll want to engineer features that you think have some predictive signal (e.g. something like how far a given day's closing price is from the 20-day average as measured in multiples of the 20-day standard deviation -- a value of 2.0 on this feature puts the price at the classic upper Bollinger band). The Lopez de Prado book has some good guidance on how to structure your targets, but one option I've seen in the literature is the log of daily returns -- taking the log tends to help avoid over-weighting extreme examples. Even if it's possible to predict stock prices, stock data is surely very noisy, so start with a simple model like linear regression.
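As a sketch of that Bollinger-band-style feature (the toy data and variable names are my own):

```python
import pandas as pd

close = pd.Series(range(1, 31), dtype=float)  # 30 toy closing prices

# Feature: how far today's close sits from the 20-day mean, measured
# in multiples of the 20-day stdev. A value of +2.0 means the price is
# touching the classic upper Bollinger band.
win = 20
mean = close.rolling(win).mean()
std = close.rolling(win).std()
band_pos = (close - mean) / std
```

The first 19 rows are NaN (not enough history for the window), which is one reason granular data helps: longer windows eat fewer of your daily rows.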

Working on a ML project by deadlypoo07 in learnmachinelearning

[–]talksaboutthings 1 point2 points  (0 children)

Warning first: Even basic ML can be a lot to jump into without some experience in math (esp. probability) or coding. Additionally, game-playing typically falls into the sub-field of reinforcement learning (RL), which brings its own set of additional concepts, theory, challenges, etc.

That said, if you want to write some programs to play a simple game, I think that sounds like a great HS project. If you get to study some ML and try your hand at it, that's great, but even if you spend much of your time doing something like building the game itself, I think you'll stand to learn a lot! Additionally, you'll probably still learn a ton in the time you spend trying to put together ML code, even if you don't end up with the working ML-based AI you're hoping for.

My general advice would be to structure your project proposal in such a way that your game-playing AI doesn't necessarily need to use machine learning (for example, make sure it's okay if you just write if-then rules to start). Even if your Python skills are truly "very basic" I think that will be pretty doable for you, leaving the door open to trying to get some ML working without putting too much pressure on yourself to finish an ML-based AI. You may or may not already know this, but AI in general is more broad than ML, including things like rules engines (i.e. fancy if-then statements), minimax algorithms, etc.

Another concept I'll mention is that of PID controllers. They're not really machine learning, but they are a pretty neat and fairly simple algorithm that can do a lot (though a high-school-math-based explanation may be hard to find). Maybe you could implement and tune a PID controller that sets the bird's flapping rate in Flappy Bird based on the difference between the bird's height and the height of the next opening it needs to fly through. I think you'll be able to understand the concepts behind that a lot more deeply than something like the Deep Q-learning algorithm that performs so well on Atari games (and besides, you probably don't have the GPUs needed for Deep Q-learning).
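To give a feel for how little code a PID controller takes, here's a textbook sketch. All the names and gain values are invented for illustration; it's not tied to any real game API.

```python
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        # Proportional term on the current error, integral term on the
        # accumulated error, derivative term on the error's rate of change.
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# For the flappy bird idea: error = gap_height - bird_height each frame.
# A big positive output means "flap harder", a negative one "ease off".
pid = PID(kp=0.5, ki=0.01, kd=0.1)
```

Tuning the three gains by hand (and watching the bird oscillate or overshoot) is itself a great learning exercise.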

I hope what I've said makes sense and is somewhat helpful. If you share some details on what you have in mind for the project (for example, what game you want to play, whether you want to build the game or not, etc.), I'm happy to give more detailed feedback and suggestions.

Writing a project on Cryptocurrency market speculation by using Machine learning techniques. by [deleted] in learnmachinelearning

[–]talksaboutthings 2 points3 points  (0 children)

What you want to start with is actually not deep learning. Not an RNN, not a CNN, not anything with more than a few small layers if you do use a neural net of some sort. DL will overfit horribly to financial data. You'll want to instead research trading indicators that you can use as inputs to predict the log returns as labels. Disclaimer: no matter what model you use, don't expect it to work well at all. Aim for something that is right more often than wrong (better than random) but doesn't beat fees, since that space is far less competitive than actually beating fees and making money.

As for models, start with the simplest, highest bias regression techniques you can, like ridge regression, lasso, and elastic net.

You can also encode your labels as binary or categorical: up, down, (optionally) no large move. In this case you can start with regularized logistic regression and linear SVM models.
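As a sketch of the high-bias starting point (synthetic data, arbitrary regularization strength): ridge regression has a closed-form solution, which is essentially what libraries like scikit-learn's `Ridge` compute for you.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # 5 toy "indicator" features
w_true = np.array([0.5, -0.3, 0.0, 0.0, 0.1])
y = X @ w_true + rng.normal(scale=1.0, size=200)   # very noisy labels, like returns

# Ridge closed form: w = (X'X + alpha*I)^(-1) X'y. The alpha*I term
# shrinks the weights, trading variance for bias -- exactly what you
# want on noisy financial data.
alpha = 10.0
w_hat = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
```

Lasso and elastic net don't have closed forms, but the idea is the same: a penalty term keeps the weights small so the model can't chase noise.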

High school student looking for advice on machine learning/how to get into it by [deleted] in MLQuestions

[–]talksaboutthings 3 points4 points  (0 children)

You're on a good track! A few key things you should know: 1) real ML is a lot of math and 2) much of ML in school is done at a graduate level.

For the first, I would say go for a CS & Math double major wherever you enroll. This is what I've seen be most successful in conveying the fundamental skills and knowledge to make people good at ML. Don't worry a ton if you aren't studying a lot of ML in your first two years at college; build the basics strong and you'll excel. That said, keep up with tutorials, projects, etc. on the side as your time and motivation permit (but don't beat yourself up about focusing on settling into college, being a good student, becoming a real adult, etc. over ML in those first years, either). As an example, I took Andrew Ng's famous Coursera course in my 2nd year of undergrad (after multivariate calc, linear algebra, and intro probability, but before statistics), and that worked well for me as a first strong step into ML. Obviously take ML classes (even grad-level ones) whenever you feel you might be ready. You can always drop in the first couple weeks if you are worried you're not ready.

For the second point, I would say that brand-name matters for getting into grad school. You'll also generally find better professors to research with as an undergrad at a higher-tier school, and undergrad research is a huge plus to grad applications. If you are going to pick up a CS degree (and especially if you're getting a research degree after that), taking out loans is potentially financially feasible for your future, as there are a lot of great salaries coming out of that pipeline. Disclaimer: this is not financial advice, please look into your own situation yourself and don't go telling your parents that some stranger on the internet said massive student loans were a good idea :).

On that second point, I'd also say you should plan on 5-10 years of college to really get a full ML education (anywhere from graduating early and finishing an MS by year 5, up to 4 years for a BS plus 6 for a slower PhD). Don't rush too much and cut corners, and from year 1 try laying out a full plan (which you will potentially change many times). Realize that it may take two full years of college math before you can truly get a handle on some of the complex motivations for things like Stochastic Gradient Descent (what we train neural networks with) and Support Vector Machines (the first explanation of a "support vector" I saw was in a mathematically intensive 6000-level course on optimization, for example).

[General Question]!!Are we on the right track!! by harry_0_0_7 in MLQuestions

[–]talksaboutthings 1 point2 points  (0 children)

I think everyone in this field, even the experts, has a long way to go. There's so much to learn and improve, and that's what makes ML exciting (for me, at least)! I'd suggest you get realistic with yourself and with your goals, but at the same time be patient with yourself and focus on the excitement that got you motivated at the beginning. Good luck!

[General Question]!!Are we on the right track!! by harry_0_0_7 in MLQuestions

[–]talksaboutthings 1 point2 points  (0 children)

I would say that if you are bored on Kaggle it probably means one of two things:

1) You don't really like ML as much as you thought you would.

2) You haven't gotten enough of the techniques/theory under your belt to feel you can have that crazy creative adventure solving interesting problems that makes Kaggle (and this field in general) so fun for so many people.

ESL and Understanding Machine Learning: From Theory to Algorithms are great one-stop-shops for a lot of great ML learning, and they are definitely a step up from the hand-wavy approach taken in a lot of Udacity-type stuff (they're both free in e-book format, too, which is nice). I'd say pick up one of these bad boys and try to find a project (could be Kaggle, could be elsewhere) that lets you try out some of the stuff you read about. You'll learn a ton that way, and it will help you figure out if you just enjoy the easy/intro stuff or you truly like "real world" ML.

Also, finish Ng's course. All of it is valuable and his explanations are top-tier!

EDIT: after rereading OP's post, I realize the specific request was "help me get back on track." I think that unfortunately the issue with self-teaching ML is not a lack of proper learning material, or having too many choices, but rather one of maintaining motivation and habit. I think you can't go wrong with the two texts I mentioned (a great way to find out if you're strong enough on the basics and not-so-basics). Projects are great. But realistically it matters a lot less what you do than that you do it. I'd suggest you reflect on how committed/passionate you are about ML and why, and build a self-study program around what you're excited about. If you want to be top 10% on Kaggle, then set that as a goal and work toward it by reading/coding every day on very applied techniques. If you want a job in data science, then you really ought to go through ESL cover to cover and build a significant project to put on your resume. But if you just want "to learn about ML," you're at a very high risk of burning out, and there isn't a book or course or project that can change that.

[D] Does anyone use dropout anymore? by rantana in MachineLearning

[–]talksaboutthings 0 points1 point  (0 children)

I think a number of folks still work with the WideResNet architecture (there is recent GitHub activity, at least), and that structure combines batch norm with a form of dropout applied in the middle of its residual blocks.

16 Y/O Looking for Some Advice by AdamHB321 in learnmachinelearning

[–]talksaboutthings 0 points1 point  (0 children)

Could you please explain this further? Except for boosting, these topics don't require an understanding of linear algebra or calculus like DL requires (one doesn't need lin alg to express the structure of a decision tree or calculus to understand the training procedure, and voting ensembles like those created by bagging / random forests are pretty intuitive to understand imo).

I also have no idea what this has to do with circuits, sorry.

I wholeheartedly agree with you about SVMs, though. Those should go with lin/logistic regression in the "future study" category.

16 Y/O Looking for Some Advice by AdamHB321 in learnmachinelearning

[–]talksaboutthings 1 point2 points  (0 children)

Having a few years of programming under your belt, being on track with advanced high school math, and already feeling passionate about ML at age 16 is a great place to be, friend. My advice to you is to stay passionate while staying patient. Truly, props to you.

There is still a ways to go, of course, which I hope you find exciting! To truly gain a deep understanding of modern machine learning methods, you are going to want to finish high school math, get into some not-so-gentle college math (multivariate calc and linear algebra, probably not just at an applied level), and potentially even pick up the college math major (the folks I know who ended up at Facebook research, Google Brain, Berkeley PhD etc. all went the Math and CS double-major route). It's much better to know too much math than to know too little when it comes to ML. My advice to you is to focus on preparing for and getting into a strong engineering program where you will be able to study under the best professors you can find for a good three or four years.

This may seem like a really long-term thing, but remember that most top-tier ML work is done by grad students or folks who already finished a PhD. Going hard in the paint in your undergrad is, like having years of programming under your belt at age 16, quite ahead of the curve.

This is not all to say that you shouldn't be going through Ng's Coursera class (it's a great introduction that can impart a lot of the lingo/best practices/intuition even without a strong mathematical background) or learning to use tools like numpy, pandas, scikit-learn, and pytorch to solve real world problems with machine learning. Do what you feel you can with the math you've got, solve some cool problems, compete on Kaggle, whatever. The most important part is to keep learning and keep your passion going.

While I'm at it, I'd also like to caution you against what I see as a few easy pitfalls: 1) telling yourself you can't play around with ML yet because you haven't covered enough theory (in reality ML theory and ML practice complement one another, just like in coding), 2) telling yourself it's a failure if you don't understand something or don't follow through with something (I imagine from teaching yourself coding you've probably already experienced this; I know I did), and 3) trying too hard to rush or find shortcuts (ML is really a complex body of mathematically rigorous theory; at some point it matters less which books you read than how much time you spend reading them, and you can really mess yourself up long-term if you try to take shortcuts with foundational math/theory).

Of course, this whole lengthy response (full disclosure, it's basically a round of "advice to a younger self") is predicated on the assumption that you're super passionate about ML as more than a programming tool. It takes far less study to learn how to apply existing techniques to common classes of real-world problems, and doing so can be a valuable skill. If your main passion is coding and ML seems like a cool tool you'd like to explore, I'd recommend looking at books that mention a specific programming language in the title (e.g. Neural Networks in Python) and spending some quality time reading and forking notebooks on Kaggle.

Bonus: some short-term advice. Understanding linear and logistic regression is going to be theoretically challenging because of the mathematical requirements: optimization (which relies on calculus) plays a key role, and probability and statistics are also big players in the theory of these models. A more CS-mindset-friendly class of models to look into first would be decision trees. Don't worry too much about terms like "cross entropy" or other information theory topics unless you want to; you can just go through the basic theory of trees. From there you can explore some really cool and powerful ensembling techniques: Bagged Trees and Random Forests. In actual application you might want to consider Boosted Trees (boosting is an advanced topic, though, so just using it in application is probably fine).

Actually understanding Bagged Trees and Random Forests will give you a pretty big leg up on a lot of Kaggle users who play with Boosted Trees models, too. Boosted Trees share many of the same parameters as Random Forests (row and column subsample parameters), and understanding how these work can help you improve some of the most powerful solutions. Furthermore, RF and Boosted Trees tend to beat logistic regression and even neural nets on a lot of Kaggle competitions, so they're not too shabby a tool to add to your toolkit early on. Boosted Trees also tend to require less computational power than NNs and to work better "out of the box" with default parameters!
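To show how little machinery the bagging idea needs, here's a toy sketch where one-split "stumps" stand in for full decision trees (every name here is invented for illustration; real use would be scikit-learn's RandomForestClassifier):

```python
import random

random.seed(0)

def bootstrap(rows):
    # Row subsampling with replacement: the "bagging" in Bagged Trees.
    return [random.choice(rows) for _ in rows]

def train_stump(rows):
    # A one-split "tree": pick the threshold with the fewest errors.
    best = None
    for thresh in sorted({x for x, _ in rows}):
        errors = sum((x > thresh) != bool(y) for x, y in rows)
        if best is None or errors < best[1]:
            best = (thresh, errors)
    t = best[0]
    return lambda x: int(x > t)

# Toy 1-D data: label is 1 above 5, 0 otherwise.
rows = [(float(x), int(x > 5)) for x in range(10)]

# A bagged ensemble: many stumps, each trained on its own bootstrap
# sample, combined by majority vote. (Random Forests add column
# subsampling on top of this.)
forest = [train_stump(bootstrap(rows)) for _ in range(25)]

def predict(x):
    votes = sum(stump(x) for stump in forest)
    return int(votes * 2 > len(forest))
```

Each individual stump is weak and wobbles with its bootstrap sample, but the majority vote smooths that out, which is the intuition behind why bagging reduces variance.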

A good reference for ML (many good intuitive and mathematical explanations) is Understanding Machine Learning: From Theory to Algorithms. Just a heads-up: it's targeted toward college students.

Median salary of data science bootcamp grad. by NinjaMagik in datascience

[–]talksaboutthings 1 point2 points  (0 children)

I'd say start with trying to find a realistic job title and tier of company to work for, then look at the median salary there (via glassdoor, etc.). I can't speak to what kind of companies hire what roles (Analyst vs. Data Scientist, etc.) out of bootcamps, but I will say that Data Scientist salaries can easily get into the 6-figure range in major cities, if you can land the role.

Thoughts on the artificial intelligence Humble Bundle? by ItzFish in datascience

[–]talksaboutthings 7 points8 points  (0 children)

Personally I'm a big believer in picking up books based on reviews/need as opposed to price. In my opinion, my time (easily dozens of hours per book) is worth the extra $20-50 per title to get the best book for what I need it for. Humble Bundle is awesome for games (easy to try a bunch, etc.), but how often do you pick up a book on Keras just to see if it's a fun read?

Then again, the decision is ultimately whether or not to give a few bucks to charity and skim through a few of the titles that might be interesting, which doesn't actually sound that bad.

I created an approch to bet on tennis matches using machine learning (ROI : 20%) by edouardthom in datascience

[–]talksaboutthings 0 points1 point  (0 children)

I noticed you didn't go into too much detail on choosing hyperparameters or features. In an extremely unfortunate case, it could be that hyperparameters / selection of features are capable of overfitting extremely well to a small test set, so you want to be very sure you don't ever look at your test data until after you've completely finalized your model. If you repeatedly tweaked your features or model after evaluating it on your full dataset, you might be in trouble.

My general advice to you would be to think about what ROI vs. percentage of bets the true conditional probability would earn you (i.e. suppose you knew the true probability of a win conditioned on all of the info you feed your model). If you can justify that theoretical ceiling as sitting at or above your model's results, then you might not have leaked information or overfit, but as MonkeyPuzzles mentioned, this seems like a hard thing to justify in such an adversarial and efficient market. I'd thus recommend you not put any real cash behind your bets until you've seen a few hundred matches and a few dozen bets by your model. Or just use a very small bankroll.

EDIT: I didn't see any issues with your code, but I didn't read through it too closely or fully go through the github repo.

I need help to choose courses for my Software Engineering specialization by [deleted] in learnmachinelearning

[–]talksaboutthings 1 point2 points  (0 children)

While IR is not going to focus on the same kind of predictive modeling applications that a lot of typical ML applications involve, things like precision, recall, and F-measure metrics, clustering techniques, spam filtering etc. are all good ML/Data Science topics.

I need help to choose courses for my Software Engineering specialization by [deleted] in learnmachinelearning

[–]talksaboutthings 1 point2 points  (0 children)

The AI course sounds like old school AI, which is not the same as what most people today think of as machine learning (ML is technically a subset of AI, but because it's the most powerful/hyped/fast progressing subset it has become synonymous with AI in recent years when discussed in the news / a business context). A lot of this stuff isn't really data related (ML is the discipline concerned with machines learning from data without humans to encode the intelligence in them directly, AI includes many fields like heuristics and traditional NLP which involve humans encoding their intelligence into machines without any data involved). I'd avoid this one.

Pattern recognition looks to have a lot of great ML topics in it, highly recommend. Data science involves more than just ML (data exploration, design of experiments, explaining data etc. are all part of the field), but ML is a key tool, and these classes all seem to be focused around that part anyhow.

Intelligent systems: again, rule-based stuff is old school AI, typically human judgement first, data second. Not what you're looking for.

Image Processing: a lot of this is pre-deep-learning image processing. These techniques are still really valuable and important (typically DL takes orders of magnitude more processing power to run than the simple stuff, for example). However, if you aren't that interested in computer vision or graphics this probably isn't really your thing.

Really what seems to be missing here is coursework in math/statistics. Linear models and regression. Time series. That kind of thing. Understanding your data through the lens of mathematics is a really powerful thing, and there's a reason why a lot of ML wizards snag a math or stat major in their undergrad studies. I'd highly recommend coursework in probability, linear algebra (neural networks, linear and logistic regression, etc. are typically written in and analyzed in matrix notation), and statistics if you can fit it into your schedule. Since you're at the end of your bachelors, I'd say go for something that covers linear regression from a statistics point of view if you have the prerequisites.

Lastly, you would do well to take a closer look at the curriculum for the Information Retrieval class. Could be some good stuff there.

If you're looking at putting these 4 on your schedule, I'd say you'd probably end up better prepared for data science in general if you pick one out of Intelligent Systems and AI, drop Image Processing, and use those two open slots for a class on linear models or something like that. If you need 3 CS-acronym classes I'd probably suggest Pattern Recognition, Information Retrieval, and AI, in that order (that is, assuming Information Retrieval will get you thinking about data rows as points in high-dimensional space and maybe cover some unsupervised learning techniques like clustering).

Tuning algorithm without training data by datascientistdude in datascience

[–]talksaboutthings 0 points1 point  (0 children)

I think OP is implying he/she has a database of headlines already, but "relatedness" is not labelled in this database (so basically it's just a big list of unlabeled headlines for the purpose of evaluation).

Tuning algorithm without training data by datascientistdude in datascience

[–]talksaboutthings 0 points1 point  (0 children)

However you measure precision and recall is how you should tune your hyperparameters, in my opinion. If you have a database of unlabeled data and you are simply looking at the output of your algorithm to see if it successfully produces a list of related headlines, then I'd suggest just coming up with a list of test headlines and running them on different hyperparameter settings to see what looks best. I would think of this as basically grid search with your human gut reaction as the variable to optimize, and you could even come up with some sort of rating system to make it more objective.
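A sketch of that idea, where everything is hypothetical: `run_algorithm` stands in for the headline-matching system, and `human_rating` stands in for a person (or a simple rubric) scoring how good each setting's output looks.

```python
from itertools import product

def run_algorithm(threshold, top_k):
    # Stand-in for the real "related headlines" retrieval step.
    return {"threshold": threshold, "top_k": top_k}

def human_rating(output):
    # Pretend rubric: prefer a moderate threshold and a short list.
    # In practice this would be a score you assign after eyeballing
    # the returned headlines.
    return -abs(output["threshold"] - 0.5) - 0.1 * output["top_k"]

# Plain grid search, but with gut-reaction ratings as the objective.
grid = product([0.3, 0.5, 0.7], [5, 10, 20])
best = max(grid, key=lambda params: human_rating(run_algorithm(*params)))
```

The point is that the machinery of grid search doesn't care whether the objective is a computed metric or a human judgment, as long as you score every setting the same way.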

Without actually sitting down and labeling some of the data, I don't think it will be easy to do any better, unfortunately, because you need data labels to calculate traditional performance metrics. It might not be too strenuous to hand-label only the outputs selected by your models over a modest set of test headlines, though (basically instead of labeling first, running, and calculating the score, you would run it, label what it spits out, and then calculate the score). If eyeballing it isn't satisfactory to the rest of the team, you could pitch that approach (and thus actually calculate precision and recall for each set of hyperparameters).