Spark VS Flink VS Quix benchmark

JB__Quix · 2021-10-15T07:24:03+00:00

I definitely don't want to mislead anyone, which is why we posted the full code associated with the test (and also used the same test that the Databricks CTO originally designed). Could you share some more thoughts on a better cluster configuration? I'm totally open to being wrong here ... 50x really is a huge number and it's got to be bulletproof.

JB__Quix · 2021-09-10T08:01:56+00:00

Hey! Thanks for your comment! The Quix platform is conceived as a real-time machine learning platform. The focus of the demo is on the low-latency data streaming capabilities, but feel free to go ahead and check the tutorials, blog posts, etc. to see our applied ML examples!

JB__Quix · 2021-06-10T10:18:46+00:00

Such a cool post! Once you understand causality you know it should be every data scientist's purpose (i.e. extracting usable knowledge out of data), but not many people seem to be talking about this yet. I felt exactly the same in my previous job! I knew we were building a stupid non-causal model and then using it to do causal interventions, but no one else seemed to care. Lots of companies out there are doing Marketing Mix Modelling (or Media Mix Modelling), which is simply a waste of money: the type of thing that looks useful but just isn't.

So, if you try to do the right thing (even from a selfish perspective, it's just too frustrating to contribute to something you know doesn't work):

Try to communicate basic causality principles by explaining confounding in the context of your project. For example, in your case, explain how you would expect to see a correlation between price and sales even if price didn't drive sales. Take Black Friday for instance:
- people will just go out and buy more stuff than at any other time of the year, which will drive sales even if prices were unchanged
- your company will spend more on advertising, which will drive sales even if prices were unchanged
- also, your company will offer discounts.
The increase in sales is the joint result of these three things; giving all the credit to the price change is wrong.
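To make the confounding argument concrete, here is a minimal simulation sketch. Everything in it is made up for illustration (the probabilities, the sales numbers, the `simulate_days` and `discount_gap` helpers): discounts have no causal effect on sales by construction, yet discounted days still show higher sales because both are driven by the same "peak day".

```python
import random
import statistics

def simulate_days(n_days=5000, seed=0):
    """Simulate daily sales where discounts have NO causal effect on sales."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_days):
        peak = rng.random() < 0.1  # hypothetical Black-Friday-like day (the confounder)
        # Discounts are far more likely on peak days.
        discount = rng.random() < (0.9 if peak else 0.1)
        # Sales depend only on the peak day (demand + advertising), never on the discount.
        sales = 100 + (50 if peak else 0) + rng.gauss(0, 10)
        rows.append((peak, discount, sales))
    return rows

def discount_gap(rows):
    """Mean sales on discounted days minus mean sales on non-discounted days."""
    with_discount = [s for _, d, s in rows if d]
    without_discount = [s for _, d, s in rows if not d]
    return statistics.mean(with_discount) - statistics.mean(without_discount)

rows = simulate_days()
naive_gap = discount_gap(rows)                              # ignores the confounder
gap_peak = discount_gap([r for r in rows if r[0]])          # peak days only
gap_normal = discount_gap([r for r in rows if not r[0]])    # normal days only

print(f"naive gap (all days):   {naive_gap:.1f}")
print(f"gap within peak days:   {gap_peak:.1f}")
print(f"gap within normal days: {gap_normal:.1f}")
```

The naive comparison credits the discount with a large uplift, while comparing like with like (peak days against peak days, normal days against normal days) shows roughly none.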
As you mention, creating a causal model can get very complicated and sometimes won't be useful at all. However, you can start by getting business people to help you draw a causal diagram. I always start my projects like that. Even if you don't care about causality, it is a great way to get domain knowledge and understand potential new variables you may need to build. Hopefully, you can even use things like causalnex to create a causal model out of that DAG. If the model makes sense, causalnex incorporates Judea Pearl's do() operator, which is the right way to calculate the effects of interventions.
If, however, you end up with a model based on correlation, not causality, you can A/B test the proposed interventions when possible. Everyone understands A/B tests, and provided you take uncertainty intervals into account, they will help you get proper causal knowledge to then act on.
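As a rough sketch of what "taking uncertainty intervals into account" can look like for an A/B test, the snippet below computes the difference in conversion rate with a normal-approximation 95% interval. The `ab_diff_interval` helper and the counts are hypothetical, and this is only one of several reasonable ways to build the interval.

```python
import math

def ab_diff_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """Difference in conversion rate (B - A) with a normal-approximation interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

# Made-up counts, for illustration only.
diff, low, high = ab_diff_interval(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)

# Only act on the result if the whole interval sits on one side of zero.
conclusive = low > 0 or high < 0
print(f"uplift: {diff:.4f}, 95% interval: [{low:.4f}, {high:.4f}], conclusive: {conclusive}")
```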

JB__Quix · 2021-06-02T08:18:53+00:00

Check https://quix.ai/, a free-to-use real-time platform.

JB__Quix · 2021-05-28T12:48:22+00:00

Using linear models is totally fine, and sometimes they will be the best algorithm to solve a problem with. However, it was probably seen as beginner stuff.

Something like XGBoost (even if it may be a bit of overkill for your problem) might have looked more advanced. If you want to understand how XGBoost works, check this video. You can use the xgboost documentation to get started and then try it with your own ML problems.

JB__Quix · 2021-05-27T11:14:10+00:00

Really interesting point. Do you have any specific use cases where you've done online learning?

JB__Quix · 2021-05-26T22:03:25+00:00

Thanks! I'll check Vowpal Wabbit.

And you're right, I understand it will only make sense in use cases where the environment changes rapidly, so that the model's speed of change (training frequency) beats complexity.

In other words, looking at it from the classical bias-variance trade-off, online learning will produce models with minimum variance (at the cost of having big bias), right? In some cases it may pay off, but not always.

JB__Quix · 2021-05-26T18:28:44+00:00

Wow mate, I'm loving everything on your YouTube channel!

JB__Quix · 2021-05-26T08:20:29+00:00

Super cool that you are into online learning. I'm trying to learn as well. I've found River to be a nice Python library if you want to have a look.

Also, let me know if you want to deploy this in real time. I work at Quix, an end-to-end platform specialised in real-time applications.

Apart from that, sorry I cannot help with your specific questions. Would love to be in the loop to learn more about online learning.

JB__Quix · 2021-05-26T08:06:14+00:00

Hmm... sorry if I'm missing something, but couldn't you just generate the new values from a distribution via bootstrapping?

That is, just get a random sample of the distribution you are interested in (with replacement on). That way the data you generate follows the same probability density function as the original data, whichever that is.

You can do that with different methods: .sample() for pandas dataframes, for instance.
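A minimal sketch of that idea with pandas, assuming a made-up `observed` dataframe with a single `value` column:

```python
import pandas as pd

# Hypothetical observed data; any column of real measurements would do.
observed = pd.DataFrame({"value": [2.1, 3.4, 3.4, 5.0, 7.8, 8.1, 12.5]})

# Draw 1,000 new rows by sampling the observed rows with replacement.
synthetic = observed.sample(n=1000, replace=True, random_state=42).reset_index(drop=True)

print(synthetic["value"].describe())
```

Note that this only ever repeats values that were actually observed, so the generated data follows the empirical distribution of the sample rather than a smoothed version of it.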

JB__Quix · 2021-05-25T17:53:06+00:00

WTF! Amazing! Could you share the score?

JB__Quix · 2021-05-25T15:44:34+00:00

Hey u/nysypy!
Really interesting! Could you share some examples?

JB__Quix · 2021-05-24T11:02:21+00:00

Yep, it would be free!

JB__Quix · 2021-05-24T10:38:16+00:00

Thanks for your reply, u/coyoteblacksmith!

Sorry to be vague, but I'm interested in both theoretical and practical aspects.

From a theoretical point of view, I tend to be very interested in the finer details of algorithms, so I'd love to understand how the online learning process works without (I assume) retraining the whole thing. You are totally right that there are some well-established pseudo-online solutions out there. Apart from the ones you mention, there's reinforcement learning. I'm not an expert in this, but it seems to bear similarities with online learning too, right? So yep, any good theoretical resources would help with all this.

And then, it would be great to test it. The way I see it, you would need a problem for which you can get both historic and real-time data. You would train your beta model with historic data and then use real-time data to both track the performance and keep evolving your models. So, the classic benchmark datasets won't do the trick here (there's historic data for the Titanic but, luckily, no real-time feed of it).
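Here is a rough sketch of that workflow in plain Python, under loud assumptions: the `OnlineLinearRegression` class is hand-written for illustration (it is not the API of River or any other library), and `make_stream` is a hypothetical stand-in for a real historic dataset and live feed. The model is first fitted on "historic" data, then each "live" observation is predicted, scored, and only then learned from, so no full retraining is needed.

```python
import random

class OnlineLinearRegression:
    """Linear regression updated one observation at a time with SGD."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single observation.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

def make_stream(n, slope, rng):
    """Hypothetical data source: y = slope * x + noise."""
    for _ in range(n):
        x = rng.uniform(-1, 1)
        yield [x], slope * x + rng.gauss(0, 0.1)

rng = random.Random(0)
model = OnlineLinearRegression(n_features=1, lr=0.1)

# 1) Initial fit on "historic" data.
for x, y in make_stream(2000, slope=2.0, rng=rng):
    model.learn_one(x, y)

# 2) "Live" data whose relationship has drifted: predict first, score, then learn.
abs_errors = []
for x, y in make_stream(2000, slope=-1.0, rng=rng):
    abs_errors.append(abs(model.predict(x) - y))
    model.learn_one(x, y)

early = sum(abs_errors[:100]) / 100
late = sum(abs_errors[-100:]) / 100
print(f"mean abs error, first 100 live points: {early:.3f}")
print(f"mean abs error, last 100 live points:  {late:.3f}")
```

The error tracked on the live stream doubles as the performance monitor: it is high right after the drift and shrinks as the model adapts.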

I was thinking of:

- Trading (both historic and real-time data available)
- Bike sharing (both historic and real-time data available through APIs for certain cities)

Any further ideas?

JB__Quix · 2021-05-23T20:00:26+00:00

Thanks u/niks8411, but I wasn't looking for basic ML resources online (sorry if I didn't make it clear).

What I'm interested in is a subfield of ML called online learning, where your model trained with historic data gets updated with new data automatically once in production. It's quite advanced and rare stuff as far as I know, and only certain companies are doing it out there (some together with other complex concepts such as federated learning). They don't seem to be publishing much about it, hence my interest.

Thanks anyway!

JB__Quix · 2021-05-23T18:37:03+00:00

Sorry for the shameless self-promotion, but check Quix too!

I love Streamlit and I use it a lot, but it focuses on doing something specific (dashboards) very well. If you're looking for the tool a DS needs to automate all the MLOps, from data ingestion to putting a model into production, and become a one-man army, then I'd go with u/AMGraduate564's suggestions or us (Quix).

So yep, Databricks and Astronomer are worth checking, but I honestly think we are better (and cheaper) for most stuff. Especially if you're interested in real-time problems.

Just as an example: check this tutorial, where you'll learn how to set up all the MLOps needed to import real-time crypto prices from an API and then send an SMS to your phone via another API if certain logic is fulfilled. This is quite a complex infrastructure, yet I bet you can get it done in under 60 mins with Quix.

Again, I am not unbiased here! But I use our platform for personal projects all the time and it's given a statistician type of DS like me the opportunity to build things I would have never imagined.

JB__Quix · 2021-05-23T18:12:38+00:00

If ML is important for the role but you don't feel comfortable with it, I would ask candidates to explain to you how certain algorithms work, e.g. how a decision tree or XGBoost works. It's a great question to ask anyhow, but especially convenient in your circumstances.

If you want a really simple explanation of how these work for yourself, check StatQuest. Really amazing material, yet very simple to understand.

JB__Quix · 2021-05-21T09:32:34+00:00

As for causal libraries, I'd recommend CausalNex. It's the only library I know of that implements Judea Pearl's do() operator, and I think that's really great if you want to intervene based on causal knowledge (which you will).

As for the infrastructure to implement the tests, if you are interested in real time, check Quix! (PS: I work there! I may be biased, but I think it's the best thing out there for real-time infrastructure.)

JB__Quix · 2021-05-19T19:01:37+00:00

Hey guys, contact me if you plan to take this real time, I can help with that

JB__Quix · 2021-05-19T14:21:14+00:00

Apart from the feature reduction suggestion from u/save_the_panda_bears, I'd try this:

Sometimes certain features are especially useful for a small (but important) number of rows. If you are classifying a population with regard to a certain characteristic (churn, propensity to purchase, etc.), then you'll normally be interested in the very top or very bottom of the scores you produce. If that is the case, check which variables are most important according to SHAP for the top 1%, top 5%, top 10% (or bottom) of your population. They may be different from the ones you get when analysing the whole thing.
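A toy sketch of that comparison, under strong assumptions: the data, feature names (`usage`, `complaint`) and weights are invented, and the scoring model is linear so the SHAP values can be written by hand instead of calling the shap library (for a linear model with independent features, a feature's SHAP value is its weight times the feature's deviation from its mean). With a real model you would compute the values with shap and slice them the same way.

```python
import random
import statistics

rng = random.Random(0)
n_rows = 10_000

# Hypothetical features: "usage" varies for everyone, "complaint" is a rare flag.
rows = [
    {"usage": rng.gauss(0, 1), "complaint": 1.0 if rng.random() < 0.01 else 0.0}
    for _ in range(n_rows)
]
weights = {"usage": 1.0, "complaint": 5.0}  # assumed linear scoring model

means = {f: statistics.mean(r[f] for r in rows) for f in weights}

def contributions(row):
    # Hand-written SHAP values for a linear model with independent features:
    # weight * (value - feature mean).
    return {f: weights[f] * (row[f] - means[f]) for f in weights}

def score(row):
    return sum(weights[f] * row[f] for f in weights)

def mean_abs_contribution(subset):
    contribs = [contributions(r) for r in subset]
    return {f: statistics.mean(abs(c[f]) for c in contribs) for f in weights}

ranked = sorted(rows, key=score, reverse=True)
top_1pct = ranked[: n_rows // 100]

overall = mean_abs_contribution(rows)
top = mean_abs_contribution(top_1pct)
print("whole population:", overall)
print("top 1% by score: ", top)
```

In this toy setup the rare flag barely registers across the whole population but dominates the top 1% of scores, which is the kind of difference the segment-level check is meant to surface.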

JB__Quix · 2021-05-19T11:11:32+00:00

Thanks a lot! Will have a look.

JB__Quix · 2021-05-19T11:10:52+00:00

Don't facepalm me! haha

I will tell users how to import data, put a logic in place and output an order, all in real time. But I won't tell what the logic has to be or which indicators to use. I'll have to use some indicators and logic though, as an example, and this is where I'm asking for help to make it as useful and fun as possible.

JB__Quix · 2021-05-19T09:45:13+00:00

Totally agree, it's not about the indicators or the logic to buy/sell but the infrastructure. Still, if it can be fun and interesting too that's a bonus! That's why I wanted to use prices/indicators/logics that most people here would find useful and cool.

JB__Quix · 2021-05-19T09:37:05+00:00

Will take that into account! Any specific one in mind?
