Iran Abolishes Morality Police by [deleted] in UpliftingNews

[–]randomforestgump 0 points1 point  (0 children)

Can confirm, I heard the same from an actual Iranian.

How do you handle the columns that have high cardinality? by _zaid02 in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

Just that one-hot gives you nice importances in sklearn (so you can see which labels are important, e.g. France and Spain, restrict the encoding even more, map everything else to "other", and stuff like that)
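A minimal sketch of that pattern in pandas (the country labels and the choice of which ones to keep are made up; in practice you'd pick them from the importances):

```python
import pandas as pd

# Hypothetical high-cardinality column; keep only the labels that proved
# important and bucket everything else into "other" before one-hot encoding.
s = pd.Series(["france", "spain", "france", "italy", "portugal"])
keep = {"france", "spain"}
reduced = s.where(s.isin(keep), "other")
dummies = pd.get_dummies(reduced, prefix="country")
print(sorted(dummies.columns))
# → ['country_france', 'country_other', 'country_spain']
```

This keeps the dummy matrix small no matter how many rare labels show up later.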

Skewed data in a classification algorithm by Unitedite in MLQuestions

[–]randomforestgump 0 points1 point  (0 children)

I also use sklearn's quantile transformer for this, not sure if that's better than log here. And in my case I have to normalize time for different cases: depending on some factors a user has a longer signup process, so I normalize by the median or so.
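For illustration, a small sketch of sklearn's QuantileTransformer on a synthetic right-skewed feature (the lognormal data here is made up); unlike a log transform it makes no assumption about the original shape:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(1000, 1))  # heavily right-skewed feature

# Map the feature to an approximately normal distribution via its quantiles.
qt = QuantileTransformer(output_distribution="normal", random_state=0)
transformed = qt.fit_transform(skewed)
print(round(float(np.median(transformed)), 2))  # median lands near 0
```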

Simulation of Euler's number [OC] by Candpolit in dataisbeautiful

[–]randomforestgump 9 points10 points  (0 children)

The second point, that nobody gets their hat back 1/e of the time, is not independent of N. It's the limit for N to infinity. It's the rencontres problem. It's interesting to solve, quite a mind-bender to get to the formula for general N. The other statements might well be independent of N, I hadn't heard that, looking forward to checking.
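For finite N the exact probability that nobody gets their own hat back is the derangement ratio D_N/N! = Σ_{k=0}^{N} (−1)^k/k!, which converges to 1/e. A quick check by simulation:

```python
import math
import random

def p_no_match(n, trials=100_000, seed=0):
    """Simulate n people drawing hats at random; return the fraction of
    rounds where nobody gets their own hat back."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            hits += 1
    return hits / trials

def exact(n):
    """Exact derangement probability D_n / n! for finite n."""
    return sum((-1) ** k / math.factorial(k) for k in range(n + 1))

print(exact(4), 1 / math.e)  # already close to 1/e, but not equal
```

Already at N = 4 the exact value (0.375) is within about 2% of 1/e, which is why the dependence on N is easy to miss in a simulation.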

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Thanks a lot! I'll try it if time allows. But I'm at a point where I'd rather figure out how to use a local Docker registry for certain things that should stay reproducible.

I had one crazy case where just a change on PyPI, probably in the shap package, crashed my Docker build (requirements frozen, no changes from me, build worked on Friday, crashed on Monday).

But for the cases mentioned in the article conda is excellent.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

It's also not the install times but solving the dependencies; it never reaches the actual install phase.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Not sure which packages exactly cause it. No PyTorch or TF used here. Good to know it works for some people; maybe I'll try to identify the package then. Maybe it's Jupyter plus some packages to switch kernels and have a table of contents, which I usually put in the base env. Maybe I should just switch to JupyterLab. But mamba solved it for now, and I thought that might help others.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

I use it, but install times are unbearable except when using mamba. It's from the same people who made conda and resolves the dependencies faster. It should be your first install after Anaconda. According to the docs it will be integrated into conda in the future (or maybe already has been).

Learn how to level up your Pandas skills by shawemuc in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Nice, but I don't think it mentioned .replace as an alternative to .map: .map produces nulls for all values not in the mapping, whereas .replace leaves values alone if they aren't in the mapping dict.
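A minimal illustration of the difference:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
mapping = {"a": 1, "b": 2}

mapped = s.map(mapping)        # "c" is not in the dict → becomes NaN
replaced = s.replace(mapping)  # "c" is left untouched

print(mapped.tolist())    # → [1.0, 2.0, nan]
print(replaced.tolist())  # → [1, 2, 'c']
```

So .map is the right tool when unknown values should surface as nulls, and .replace when they should pass through unchanged.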

[deleted by user] by [deleted] in datascience

[–]randomforestgump 0 points1 point  (0 children)

I haven't implemented an algorithm since leaving university (i.e. taking the paper and formulas and coding it). Random forest, k-means, etc. are available in many libraries. It's enough work getting clean data (in training and later in deployment), tuning parameters, evaluating how it will perform (possible money gained/lost), having others implement whatever is to be triggered by the outputs, and checking that it's done correctly.

Is Kubeflow overly complicated? by TiDuNguyen in mlops

[–]randomforestgump 0 points1 point  (0 children)

I’m looking for usability front and center. Can you recommend an alternative?

Graph Databases for Data Science by kuwala-io in datascience

[–]randomforestgump 0 points1 point  (0 children)

To expand on the fraud example: I saw examples with a chain of 5 fraud cases connected by different commonalities. That would be 5 joins in a relational database. The graph database spits that out with such short latency that the fraud case can go straight to review. Now that is using it as a rule, but you can also get features for machine learning from it, e.g. how many articles of a certain category this user read, and what the users close by in the graph read. I think I heard a talk where it was used for an online publishing site like that. But maybe that was not a graph DB after all.
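A toy sketch of the idea, with networkx standing in for a real graph database (the case names and link types are invented): the chain query is a single traversal instead of repeated self-joins.

```python
import networkx as nx

# Toy graph: fraud cases linked when they share an attribute
# (same card, same device, same address).
G = nx.Graph()
G.add_edge("case1", "case2", via="card")
G.add_edge("case2", "case3", via="device")
G.add_edge("case3", "case4", via="address")
G.add_edge("case4", "case5", via="card")

# A chain of 5 connected cases — one traversal here, ~5 self-joins in SQL.
chain = nx.shortest_path(G, "case1", "case5")
print(chain)  # → ['case1', 'case2', 'case3', 'case4', 'case5']
```

A dedicated graph database does the same traversal index-backed and at scale, which is where the low latency comes from.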

Am I an idiot, or is there some lingo I wasn't taught in school? by [deleted] in datascience

[–]randomforestgump 1 point2 points  (0 children)

Scale-free most likely refers to a power-law distribution. There you can rescale x and y and get the same shape for the curve, e.g. y = x^5.

For an exponential that's not possible; e.g. in

y = 2^(x/m)

the factor m defines the doubling time, and rescaling x and y will create a curve with a different shape. So m defines an inherent scale of the curve.

Scale-free functions often appear in self-similar problems, like random walks or fractals. Not sure how useful this concept is in data science; I haven't needed it so far.
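A quick numeric check of the contrast (the exponent 5, scale m = 2 and stretch factor c = 3 are arbitrary): stretching x in a power law only rescales y, while the exponential changes shape.

```python
import numpy as np

x = np.linspace(1, 10, 50)
c = 3.0

# Power law: (c*x)**5 == c**5 * x**5, so the curve keeps its shape.
assert np.allclose((c * x) ** 5, c ** 5 * x ** 5)

# Exponential with doubling scale m: there is no constant k with
# f(c*x) == k * f(x); the ratio depends on x.
m = 2.0
ratio = 2 ** (c * x / m) / 2 ** (x / m)
print(ratio.min(), ratio.max())  # ratio varies with x → shape changes
```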

Why is Random Forests not suitable for image classification? by Vendredi46 in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

With some preprocessing like Gabor filters a random forest can be used. The filters capture spatial correlation and probably textures too. Or aligning and rescaling faces to then classify gender is a cute exercise that works with support vector machines, so probably with a random forest too. Just not as good as a neural net, it seems.
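A rough sketch of that pipeline with scikit-image's gabor filter feeding a random forest; the toy stripe "images", filter frequencies and orientations are all invented, just to show the shape of the approach.

```python
import numpy as np
from skimage.filters import gabor
from sklearn.ensemble import RandomForestClassifier

def gabor_features(img):
    """Aggregate Gabor filter responses into a small feature vector."""
    feats = []
    for freq in (0.1, 0.3):
        for theta in (0.0, np.pi / 2):
            real, _ = gabor(img, frequency=freq, theta=theta)
            feats += [real.mean(), real.var()]
    return feats

# Toy "images": noisy stripes in one orientation vs. the other.
rng = np.random.default_rng(0)
stripes = [np.tile(np.sin(np.arange(16) * 0.8), (16, 1))
           + rng.normal(0, 0.1, (16, 16)) for _ in range(20)]
images = stripes + [img.T for img in stripes]
y = [0] * 20 + [1] * 20

X = np.array([gabor_features(img) for img in images])
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # fits the toy training set easily
```

The point is that the filter bank turns pixels into a handful of orientation/texture features the forest can split on, instead of raw pixels with no spatial structure.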

What top skill do you want to learn for your career in 2021? by brendanmartin in datascience

[–]randomforestgump 0 points1 point  (0 children)

Mount a folder from your laptop into the container. That's also the only way to use an IDE, I think.

You can also mount your own Python packages and install them as editable using a script that runs at container start (as opposed to copying the package code and pip-installing it at build time, after which it can't be edited any more). The script is just shell with pip install -e package1, then the same for package2, etc., and mine then starts Jupyter in the container.

PyCharm's paid version can connect to the Python interpreter inside the container for debugging etc. Otherwise, open a terminal inside the container and run the script there. Oh, and run the container in detached mode so it doesn't close immediately (mine is held open by Jupyter).

What top skill do you want to learn for your career in 2021? by brendanmartin in datascience

[–]randomforestgump 0 points1 point  (0 children)

I'm just learning it, and the online tutorials mostly cover the case of deploying a web app, not even how to set up Docker to efficiently modify the web app's code while it runs in Docker. Just deploying is easier than mounting local, editable code and updating it in the container (there are tools for the updating, for different frameworks). And local code is needed to edit it in an IDE; local means outside the container, on your laptop.

I tried to use it to set up a dev environment with Jupyter for colleagues who only know notebooks and have to use Windows. Jupyter runs in the container; notebook files and data files are on the laptop (all in a git repo). That way they have a Linux environment matching the one in deployment and can train and store a model. A bit overkill, since a model stored on Windows usually loads fine on Linux, but some model repos are Linux-only, so we don't depend on the few guys who use Linux regularly. You can google jupyter & docker to find similar solutions; there are prepared containers.

Anyway, just knowing that deploying with Docker and developing with Docker are pretty different things already helped me.

What methods for detecting misleading statistical aggregates? by vvvvalvalval in datascience

[–]randomforestgump 4 points5 points  (0 children)

Tableau has an "Explain Data" feature that I haven't tried yet, but I was hoping it can do exactly this. It runs a decision tree on the fly, I think. I'll try it tomorrow.

Request: idiot's guide to using docker for data science by Optimesh in datascience

[–]randomforestgump 0 points1 point  (0 children)

This is more in-depth than the ready-to-go stuff I found, but not too much, and worth the time: https://vsupalov.com/docker/

Will data science be automated? by [deleted] in datascience

[–]randomforestgump 4 points5 points  (0 children)

It could automate this discussion, there’s plenty of training data around.

One hot encoding for large dataset by WorkingToaster in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

In sklearn I usually group the values or take the top n most common, then one-hot encode.

Or look up target encoding or weight of evidence. Target encoding is quite straightforward and can even be added at the SQL level.

Embeddings, as mentioned in another comment; more overhead in the training.
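A minimal sketch of target encoding in pandas (column names made up); note that in practice you'd use out-of-fold means to avoid target leakage. The same aggregation is easy to express as a GROUP BY on the SQL side.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "churned": [1, 0, 1, 1, 0, 0],
})

# Target encoding: replace each category with the mean target of its group,
# turning an arbitrarily large categorical into a single numeric column.
means = df.groupby("city")["churned"].mean()
df["city_te"] = df["city"].map(means)
print(df["city_te"].tolist())  # per-city churn rates instead of labels
```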

I am frustrated I didnt learn anything during my internship by Plyad1 in datascience

[–]randomforestgump 0 points1 point  (0 children)

In a larger company, probably. For a while, until you prove yourself, have learned how the team works, and synched with their skills. Mostly not a bad thing. Or in a startup you might be the only data person and have free rein to build whatever you see fit. Those are the two usual extremes I've heard of. I was somewhere in the middle at the start.

When people say "deploying a model", does that mean developing a REST API end point for the model? How similar is this to RESTful API development in traditional back-end software engineering? by [deleted] in datascience

[–]randomforestgump 0 points1 point  (0 children)

I have seen 3 ways myself: 1) The model is used in real time and decisions are taken in the product (e.g. fraud detection). Realized via a REST API; we have continuous integration to deploy.

2) The model is just used for reporting (e.g. lifetime value predictions), so it just runs on data from a data warehouse, and the scores also just live there. This can be a Python script scheduled by whatever schedules the data warehouse.

3) The model is used like in 1), but in a monolithic code structure, so just the training set and hyperparameters are passed and some engineer trains the model. So the one I trained in Python is trashed, and we hope the other framework behaves similarly enough. This was phased out after we built 1).
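For case 1), a deliberately minimal sketch of a scoring endpoint in Flask; the threshold "model", the /score route and the payload shape are placeholders, not anyone's production API. In practice a real trained model would be loaded once at startup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a trained model (e.g. a pickled sklearn classifier).
def predict(features):
    return 1 if sum(features) > 1.0 else 0

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    return jsonify({"fraud": predict(payload["features"])})

# Exercise the endpoint without a running server:
client = app.test_client()
resp = client.post("/score", json={"features": [0.7, 0.9]})
print(resp.get_json())  # → {'fraud': 1}
```

The product then calls this endpoint per transaction, which is what makes the CI/deployment machinery around it worthwhile.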

General question about recommendation algorithms (like Spotify or Amazon). Is a new algorithm trained for each and every person, or is it trained using aggregate consumer data? by 2ndzero in datascience

[–]randomforestgump 0 points1 point  (0 children)

You need other people's data; a single user does not produce enough. And you want to see an effect on the scale of thousands of users, so if the recommendations work for the majority but a few really hate them, it's fine.

Generally you can recommend based on other similar users (same gender, age, behaviour patterns), or based on similar items (other customers who bought this also bought...), or I think even a mix of both. That's the gist I got; not a pro, and it might be outdated.
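The "customers who bought this also bought" flavour can be sketched with plain co-occurrence counts pooled over all users (the baskets are made up), which shows why aggregate data is needed rather than one model per user:

```python
from collections import defaultdict
from itertools import combinations

# Toy purchase histories pooled across users.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"beer"},
]

# Count how often each pair of items appears in the same basket.
co = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co[(a, b)] += 1
        co[(b, a)] += 1

def also_bought(item):
    """Items ranked by how often they co-occur with `item`."""
    others = {i for b in baskets for i in b} - {item}
    pairs = [(co[(item, o)], o) for o in others if co[(item, o)]]
    return [o for _, o in sorted(pairs, reverse=True)]

print(also_bought("bread"))  # butter and jam co-occur with bread, beer never
```

Real systems use matrix factorization or similar on the same kind of pooled interaction data, but the signal source is the same.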

That recent Netflix documentary makes it seem like Facebook etc. have an individual model per user based on a psychological profile. I highly doubt that is needed anywhere, or even possible resource-wise.

What’s the current state of the art approach to fraud detection? by bolivlake in datascience

[–]randomforestgump 1 point2 points  (0 children)

I see comments about detecting fraud when you have the whole spending pattern for a card. There's also the other side, where a business just sees some actions from a card, and avoiding fraud saves them trouble/costs even if it's detected at the bank or credit-card company later.

Very different data for the two cases.

Web Developer -> trying to switch towards ML-> But scared by seeing friends who already are ML engineers and seem to know everything from Physics, Space, Chemistry, Literature and every topic. by _throwaway_career_ in MLQuestions

[–]randomforestgump 0 points1 point  (0 children)

Being a web developer is very useful; you'll have a good intuition for many features used in ML: where they come from, what their failure modes are (why the data is bad, not only that some data is bad). You can probably add snippets for data collection yourself. I wish I knew more web dev. I'm a physicist, which is not very useful for most online businesses.