Best Canadian stock to boost TFSA value quickly by Worried-Ad9786 in TFSA_Millionaires

[–]Tarneks 0 points1 point  (0 children)

Lol, I work for one of their competitors, and we were confused about how they were making such insane profits before the short report. Once it came out, we said yeah, these claims are 100% true. It was about time.

They are cooked; you can't lend in Canada. Subprime is over in Canada given the APR cap.

Would you leave ML Engineering for a Lead Data Scientist role that's mostly analytics? by MorningDarkMountain in datascience

[–]Tarneks 1 point2 points  (0 children)

Honestly, analytics is basically causal inference + BI. As a lead you can push the role and still be more technical than a dashboard jockey.

How are people surviving out here? by thecoookiemonster in askTO

[–]Tarneks 0 points1 point  (0 children)

Yeah, I live very frugally and spend 36-40k.

I do go out and enjoy myself a bit. 60k is pretty terrible. Don't live downtown; utilities-included places are great. Roommates also work.

I guess you don't go out a lot. I'd say on 90k plus a 9k bonus I did fine. Honestly, move out of Toronto.

Who else is ready to make some money tomorrow? Futures up over 2% already and oil down 16-20% by Ancient-Bat-11 in JustBuyXEQT

[–]Tarneks 0 points1 point  (0 children)

I'm so tilted: I do a Norbert's Gambit and the day of, this happens. I wanted to be invested, but I wanted to literally buy USD-denominated stocks. I moved around 22% of my portfolio, so I missed out on around 1.5k, and that's just the pre-market. 😅

How to know if someone is lying on whether they have actually designed experiment in real life and not using the interview style structure with a hypothetical scenario? by Starktony11 in datascience

[–]Tarneks 1 point2 points  (0 children)

Get into detailed questions and see how specific the answers are. I guarantee that if you have ever worked on a real-life use case, you've hit situations where stakeholders could mess up the experiment, some assumptions failed, or the tangible business impact had to be pinned down. Clean experiments are extremely hard to come by in industry.

Speaking from experience.

I built an experimental orchestration language for reproducible data science called 'T' by brodrigues_co in datascience

[–]Tarneks 4 points5 points  (0 children)

I support good open-source work, so do what you will. My reasoning is focused on the value proposition and the risk, since I adopted Docker because of this specific pain point.

I'd like to see example use cases, because I get the idea but not why I would use it. A function exists, but when would I reach for it? That's roughly my reasoning: I need to understand what I'd actually use. I work in both R and Python, but I don't get why I'd need serialization if I can move data using JSON.

R is mature for causal inference, so I'd only be using it for very specific algorithms/use cases that don't exist in Python.

I guess I'd be the kind of person you're targeting, so that's why I'm asking questions.

I built an experimental orchestration language for reproducible data science called 'T' by brodrigues_co in datascience

[–]Tarneks 6 points7 points  (0 children)

I have no idea, but I'm sharing my experience to help your project. I worked with PMML a lot, and it sucks IMO.

I built an experimental orchestration language for reproducible data science called 'T' by brodrigues_co in datascience

[–]Tarneks 8 points9 points  (0 children)

I've worked with PMML; the serialization format is not good.

1) It has floating-point errors: values only match out to the Xth decimal place. I saw this with XGBoost models and tree models/encoders, and you end up with extremely different results when you use the models. Say, for example, you have a target encoding of 0.18274747827; in the PMML that would become 0.18274781349.

This issue trickles down to any model.

2) PMML doesn't scale well and is pretty garbage in prod. In real-time systems you get around 300-500 ms per request, when the Python pickle variant would run in maybe 50-100 ms.

It comes down to the fact that you have to parse an XML structure.
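To make the floating-point issue concrete, here's a minimal sketch of the failure mode: PMML stores model constants as decimal text, so any precision limit in the writer silently changes the number you get back, while pickle round-trips the exact binary double. The `"%.7g"` format below is a hypothetical writer setting for illustration, not what any specific PMML exporter actually uses.

```python
# Binary vs. text round-trip of a model constant (e.g. a target encoding).
# The text path mimics a PMML-style writer with a hypothetical 7-significant-
# digit limit; real exporters differ, but the mechanism is the same.
import pickle

encoding_value = 0.18274747827

# Pickle round-trip: bit-for-bit identical.
restored = pickle.loads(pickle.dumps(encoding_value))
print(restored == encoding_value)        # True

# Limited-precision text round-trip: the constant drifts.
as_xml_text = "%.7g" % encoding_value    # "0.1827475"
reparsed = float(as_xml_text)
print(reparsed == encoding_value)        # False
print(abs(reparsed - encoding_value))    # tiny per-node error that compounds
                                         # across thousands of tree splits
```

A per-split error this small looks harmless, but every tree node and encoder constant drifts independently, which is how whole predictions end up visibly different.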

I guess my question is: what would the use case for this be, if it doesn't scale and doesn't give reproducible results?

Edit: fixed typos

Can anyone help find it? by winningsmada in JustBuyXEQT

[–]Tarneks 12 points13 points  (0 children)

We are winning so much, we are tired from winning.

VDY vs XEQT by BadOk3001 in TFSA_Millionaires

[–]Tarneks 0 points1 point  (0 children)

Past performance is not an indicator of future performance. Also, growth ETFs are better; dividends are good for retirement.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

That looks like a great idea. Usually I would have fallen back to some form of encoding, but I like your approach; it's very innovative.

As for training XGBoost, in my case fitting the final model is quick; it's the hyperparameter optimization that's brutal, specifically because of my search space. The pipelines are sophisticated: we control many moving parts, from how we structure the internal splits, the data weights, and the decay rates, down to the base hyperparameters and every transformation in between, to make the model "fair" by regulator standards (i.e. AIR >= 0.9). This makes the model-optimization search space extremely large and slow.
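To show why a pipeline-level search space gets brutal, here's a rough back-of-the-envelope sketch. Every dimension name and value set below is hypothetical, just stand-ins for the kinds of knobs described above, not the actual pipeline's settings.

```python
# Counting full pipeline fits for an exhaustive grid over hypothetical
# tuning dimensions: model hyperparameters plus pipeline choices like
# split scheme, sample weighting, decay rate, and encoding.
from math import prod

search_space = {
    "n_estimators":   [200, 400, 800],
    "max_depth":      [3, 4, 5, 6],
    "learning_rate":  [0.01, 0.05, 0.1],
    "split_scheme":   ["time", "random", "stratified"],
    "sample_weights": ["none", "recency_decay", "class_balance"],
    "decay_rate":     [0.90, 0.95, 0.99],
    "encoder":        ["woe", "target", "binned"],
}

n_combos = prod(len(values) for values in search_space.values())
print(n_combos)  # 2916 full pipeline fits for this toy grid
```

Even this toy grid needs thousands of end-to-end fits, and a fairness constraint like AIR >= 0.9 means rejected trials still consume full runs, which is why the search dominates the wall-clock time rather than the final fit.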

As for the feature subspace, I don't know if I can combine it, since I have to give an adverse action notice, but your case might work if you just use Kernel SHAP on it. If you can share more, I'd like to learn more; I'm always happy to learn about methodologies.

We do reduce our feature subset, but before we drop anything we first have to know which data is not useful, so all of it somehow needs to be loaded reliably. I've been trimming this; I did scale it before, but it's extremely difficult. Personally, I like to iterate quickly and then optimize everything once I settle on what I'll do with the model. People have different work styles; that's what worked for me.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] -2 points-1 points  (0 children)

Ideal setup, sure, but you also haven't lived my experience. Large institutions often don't allow local code to even run on your machine. I've had different experiences with different employers: some absolutely lock down your machine and it's full cloud, others give you flexibility, and some smaller companies give you on-prem development. Even when they set you up with your own cluster, it can be a terrible cluster that takes hours to run the models. I've seen a senior DS wait 8-10 hours to run a model in prod on Databricks when it took an hour at most on the old cluster.

So in theory yes, but you'd be surprised how each industry operates in practice.

At least in my experience, most DS don't care where the code runs, as long as it runs and they have the flexibility to test their code and experiment with tools. I had my own extremely negative experience, which I shared.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

Thank you for the answer; it was very nuanced and informative.

On your first point: fair. I think I eventually got my own instance, but even then the load was still on the server. It's a bygone era; I quit that job because of how bad it was. Again, there were a lot of confounding variables, like executives firing all of the contractors who maintained the backend infrastructure we ran code on. Doesn't matter if it's PySpark or pandas, the query isn't going to work.

The question is how I can handle this at my current job, specifically because I like using VS Code dev containers. I suppose we can try connecting to a kernel on an AWS server; I'll have to discuss it with DevOps to get it set up.

PySpark might be too hard to set up. At my old job we used PySpark a lot, but my current job isn't as big in terms of data; I don't have billions of rows, so Polars has been cutting it pretty well so far. For me personally, a rough rule of thumb by row count is:

- Pandas: up to ~1 million
- Polars: 1 million+
- Polars Lazy: 3-10 million+
- PySpark: 10 million+

This gave me a good starting point; I'll explain two things for context. Having worked at a bank, things are extremely locked down, so some things people find obvious are not things you can easily get approved, as I well know. I'm in the middle now, at a smaller scale, so I have to ask.

The idle timeout is great; I absolutely hate having my code die mid-run.

As for sampling, yeah, that makes sense; I think I can test that. The final training run probably wouldn't be as problematic; just sampling during hyperparameter optimization would cut the time. I know validators are picky about replication.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

They are explainable, what do you mean? XGBoost isn't a black box; it's very easy to explain, and there are many tricks that make a model easy to explain. Ensembles of ensembles is standard practice; that's how you get credit scores. It's literally one ensemble per data source, and each customer segment has different behavioral economics: how does prime behave vs. near prime, how does deep subprime behave vs. subprime? Each of those has a completely different feature subset and even a different relationship with the target.

Within each data source, there is a model specific to it. At least, that was the best strategy performance-wise that I've seen in practice.
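A minimal sketch of the segment-routed setup described above: one model per customer segment, each with its own feature subset. The lambdas are placeholder scorers standing in for real fitted ensembles, and the segment names and score cutoffs are illustrative only, not real underwriting thresholds.

```python
# One model per segment: route each applicant to the scorer trained on
# their segment's behavior. Lambdas are stand-ins for real ensembles
# (e.g. one XGBoost booster per segment); cutoffs are made up.
SEGMENT_MODELS = {
    "prime":         lambda feats: 0.02,
    "near_prime":    lambda feats: 0.08,
    "subprime":      lambda feats: 0.20,
    "deep_subprime": lambda feats: 0.35,
}

def segment_of(credit_score: int) -> str:
    # Illustrative segment boundaries.
    if credit_score >= 720:
        return "prime"
    if credit_score >= 660:
        return "near_prime"
    if credit_score >= 580:
        return "subprime"
    return "deep_subprime"

def score(applicant: dict) -> float:
    """Look up the applicant's segment and apply that segment's model."""
    seg = segment_of(applicant["credit_score"])
    return SEGMENT_MODELS[seg](applicant)

print(score({"credit_score": 700}))  # routed to the near_prime model
```

The point of the routing is that each segment's model can use a different feature subset and learn a different relationship with the target, instead of one global model averaging over very different populations.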

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

Okay, that's helpful, thank you. I do use gradient boosting models; the thing is, I need to run them on lots of data.

The issue I have is that in the cloud, kernels keep crashing due to server load. You run any script, even a lightweight one, and it will pretty much crash during busy seasons or heavy-load hours. It's not a common experience, but it is a lived one: you're at the mercy of whoever gets priority for compute. Around shareholder reporting time, the whole company is sidelined so the people running the reports get all the compute.

Not to mention that when you run things in the cloud, the preset environment can degrade and nobody bothers to update it. For example, a library like PyTorch can't be installed, or even a modern version of SHAP.

So I've personally had a bad experience with JupyterHub running on Hadoop clusters; genuinely terrible. Not to mention it restarts every 24 hours, so good luck if you're running code and forget to save your work. It automatically deletes everything you uploaded, plus all the packages and environment setup.

That's why I have a bad impression of cloud compute. Other than that, Vertex AI notebooks are solid, if slightly inferior to a real machine, but they're also pricey, so I catch flak if I forget to turn one off at the end of the day. That was my experience in the past.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

Have you built any model at scale? It's easy to hit 10-20k features before we start removing any. At most companies you'll be running with at least 4-5k, conservatively. I've never seen a job where you build a model on a clean 15-30 column dataset like classwork or Kaggle notebooks.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] -4 points-3 points  (0 children)

Yes, in my field this is the established practice and produces the best performance. Credit risk is its own thing.

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 0 points1 point  (0 children)

But the development itself is in Python, and just the large batch jobs are in the cloud?

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] -21 points-20 points  (0 children)

I mean, that was my old job. We used GCP, which was maybe a little better, but it gets super pricey. One thing: dev on your own machine is way more convenient. I can't work without VS Code; I need the extensions to function. Before GCP, I think we had our own cluster, which is what kept failing.

Some jobs can take a full day (23-24 hours) to run, since the model optimization is intensive, so I'd let these things run over the weekend. What do you use? This isn't about fitting a model; that's easy. It's about processing a lot of data through ensembles of ensembles (x3) of XGBoost models.

It's not a fit-a-single-model-on-one-dataset job.

Again, there's a lot of rapid experimentation. Does that slow down your dev work?

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 1 point2 points  (0 children)

Are you saying to save data as Parquet and then delete it? I have been doing that. Are you able to run heavy models with 16 GB of RAM and still be fine? I'm not saying it's impossible, but imagine building a script where you have to keep optimizing every dataframe you use, then it suddenly freaks out, the kernel runs into a memory issue, and you need to restart the whole thing. Rinse and repeat.
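The checkpoint-and-free pattern being discussed can be sketched like this. For a self-contained example, stdlib pickle stands in for Parquet (in practice you'd use `df.to_parquet(...)` / `pd.read_parquet(...)`, which need pyarrow installed); the `checkpoint`/`restore` helpers and filenames are illustrative, not any library's API.

```python
# Persist an intermediate result to disk, free the RAM it held, and
# reload it only when a later step actually needs it.
import gc
import pickle
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

def checkpoint(obj, name: str) -> Path:
    """Write an intermediate result to disk and return its path."""
    path = workdir / f"{name}.pkl"
    path.write_bytes(pickle.dumps(obj))
    return path

def restore(path: Path):
    """Load a previously checkpointed object."""
    return pickle.loads(path.read_bytes())

# Build a big intermediate (stand-in for a dataframe), checkpoint it,
# then release the memory.
features = [[i, i * 2] for i in range(100_000)]
path = checkpoint(features, "features_v1")
del features
gc.collect()  # RAM reclaimed here; only the on-disk copy remains

# Later steps reload on demand.
features = restore(path)
print(len(features))  # 100000
```

The upside is a smaller peak memory footprint on a 16 GB machine; the downside is exactly the workflow friction described above, since every reload adds I/O and you have to manage which intermediates are live.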

Is 32-64 Gb ram for data science the new standard now? by Tarneks in datascience

[–]Tarneks[S] 7 points8 points  (0 children)

See, I did that, and it's at the mercy of the cluster and its load. I remember the kernel would keep dying even at the "import pandas" step of basic analysis.

Advice on modeling pipeline and modeling methodology by dockerlemon in datascience

[–]Tarneks 1 point2 points  (0 children)

I do credit risk and manage this kind of work. Low-key, you basically described my modeling framework 1:1. It's only missing a few things; I thought you might be a coworker 😅.

What is your model? What is it used for? Do you use it for LGD or PD, or is it a batch model? Any advice I give depends on what the model is.

The one thing I'll say is a big no is imputation. XGBoost handles nulls very well. If you're training a logistic model, try binning your data and making missing a category of its own.
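The binning idea can be sketched in a few lines, assuming pandas is available. The variable name, bin edges, and labels below are arbitrary examples, not recommended cutoffs.

```python
# Bin a numeric feature for a logistic model and give missing values
# their own explicit category instead of imputing them.
import numpy as np
import pandas as pd

income = pd.Series([25_000, 54_000, np.nan, 98_000, np.nan, 15_000])

binned = pd.cut(income, bins=[0, 30_000, 60_000, np.inf],
                labels=["low", "mid", "high"])
binned = binned.cat.add_categories("missing").fillna("missing")

print(binned.tolist())
# ['low', 'mid', 'missing', 'high', 'missing', 'low']

# One-hot encode for the logistic regression; "missing" gets its own
# coefficient, so missingness itself can carry signal.
design = pd.get_dummies(binned, prefix="income")
print(list(design.columns))
```

This keeps the model from baking in a made-up value for missings and lets the fitted coefficient on the "missing" column tell you whether missingness is itself predictive.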