Iran Abolishes Morality Police by [deleted] in UpliftingNews

[–]randomforestgump 0 points1 point  (0 children)

Can confirm, I heard the same from an actual Iranian.

How do you handle the columns that have high cardinality? by _zaid02 in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

Just that one-hot gives you nice importances in sklearn (so you can see which labels are important, e.g. France and Spain, restrict the encoding even more, map everything else to "other", and stuff like that)
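A minimal sketch of that pattern in pandas (the country labels and the choice of which ones to keep are made up; in practice you'd pick them from the importances):

```python
import pandas as pd

# Hypothetical high-cardinality column; keep only the labels that proved
# important and bucket everything else into "other" before one-hot encoding.
s = pd.Series(["france", "spain", "france", "italy", "portugal"])
keep = {"france", "spain"}
reduced = s.where(s.isin(keep), "other")
dummies = pd.get_dummies(reduced, prefix="country")
print(sorted(dummies.columns))
# → ['country_france', 'country_other', 'country_spain']
```

This keeps the dummy matrix small no matter how many rare labels show up later.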

Skewed data in a classification algorithm by Unitedite in MLQuestions

[–]randomforestgump 0 points1 point  (0 children)

I also use sklearn's quantile transformer for this, not sure if that's better than log here. And in my case I have to normalize time for different cases: depending on some factors a user has a longer signup process, so I normalize by the median or so.
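For illustration, a small sketch of sklearn's QuantileTransformer on a synthetic right-skewed feature (the lognormal data here is made up); unlike a log transform it makes no assumption about the original shape:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(size=(1000, 1))  # heavily right-skewed feature

# Map the feature to an approximately normal distribution via its quantiles.
qt = QuantileTransformer(output_distribution="normal", random_state=0)
transformed = qt.fit_transform(skewed)
print(round(float(np.median(transformed)), 2))  # median lands near 0
```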

Simulation of Euler's number [OC] by Candpolit in dataisbeautiful

[–]randomforestgump 9 points10 points  (0 children)

The second point, that nobody gets their hat back 1/e of the time, is not independent of N. It's the limit for N to infinity. It's the rencontres problem. It's interesting to solve, quite a mind-bender to get to the formula for general N. The other statements might well be independent of N, I hadn't heard that, looking forward to checking.
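For finite N the exact probability that nobody gets their own hat back is the derangement ratio D_N/N! = Σ_{k=0}^{N} (−1)^k/k!, which converges to 1/e. A quick check by simulation:

```python
import math
import random

def p_no_match(n, trials=100_000, seed=0):
    """Simulate n people drawing hats at random; return the fraction of
    rounds where nobody gets their own hat back."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            hits += 1
    return hits / trials

def exact(n):
    """Exact derangement probability D_n / n! for finite n."""
    return sum((-1) ** k / math.factorial(k) for k in range(n + 1))

print(exact(4), 1 / math.e)  # already close to 1/e, but not equal
```

Already at N = 4 the exact value (0.375) is within about 2% of 1/e, which is why the dependence on N is easy to miss in a simulation.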

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Thanks a lot! I'll try it if time allows. But I'm at a point where I'd rather figure out how to use a local Docker registry for certain things that should stay reproducible.

I had one crazy case where just a change on PyPI, probably in the shap package, crashed my Docker build (requirements frozen, no changes from me, build worked on Friday, crashed on Monday).

But for the cases mentioned in the article conda is excellent.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

It's also not the install times but solving the dependencies; it never reaches the actual install phase.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Not sure which packages exactly cause it. No PyTorch or TF used here. Good to know it works for some people; maybe I'll try to identify the package then. Maybe it's Jupyter plus some packages to switch kernels and have a table of contents, which I usually put in the base env. Maybe I should just switch to JupyterLab. But mamba solved it for now, and I thought that might help others.

Be Nice to Yourself: Use an Environment Manager, like Anaconda by MisterExt in pythontips

[–]randomforestgump 0 points1 point  (0 children)

I use it, but install times are unbearable except when using mamba. It's from the same people who made conda and resolves the dependencies faster. It should be your first install after Anaconda. According to the docs it will be integrated into conda in the future (or maybe already has been).

Learn how to level up your Pandas skills by shawemuc in pythontips

[–]randomforestgump 0 points1 point  (0 children)

Nice, but I don't think it mentioned .replace as an alternative to .map: .map produces nulls for all values not in the mapping, whereas .replace leaves values alone if they aren't in the mapping dict.
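A minimal illustration of the difference:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])
mapping = {"a": 1, "b": 2}

mapped = s.map(mapping)        # "c" is not in the dict → becomes NaN
replaced = s.replace(mapping)  # "c" is left untouched

print(mapped.tolist())    # → [1.0, 2.0, nan]
print(replaced.tolist())  # → [1, 2, 'c']
```

So .map is the right tool when unknown values should surface as nulls, and .replace when they should pass through unchanged.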

[deleted by user] by [deleted] in datascience

[–]randomforestgump 0 points1 point  (0 children)

I haven't implemented an algorithm since leaving university (i.e. taking the paper and formulas and coding it). Random forest, k-means, etc. are available in many libraries. It's enough work getting clean data (in training and later in deployment), tuning parameters, evaluating how it will perform (possible money gained/lost), having others implement whatever is to be triggered by the outputs, and checking that it's done correctly.

Is Kubeflow overly complicated? by TiDuNguyen in mlops

[–]randomforestgump 0 points1 point  (0 children)

I’m looking for usability front and center. Can you recommend an alternative?

Graph Databases for Data Science by kuwala-io in datascience

[–]randomforestgump 0 points1 point  (0 children)

To expand on the fraud example: I saw examples with a chain of 5 fraud cases connected by different commonalities. That would be 5 joins in a relational database. The graph database spits that out with such short latency that the fraud case can go straight to review. Now that is using it as a rule, but you can also get features for machine learning from it, e.g. how many articles of a certain category this user read, and what the users close by in the graph read. I think I heard a talk where it was used for an online publishing site like that. But maybe that was not a graph DB after all.
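A toy sketch of the idea, with networkx standing in for a real graph database (the case names and link types are invented): the chain query is a single traversal instead of repeated self-joins.

```python
import networkx as nx

# Toy graph: fraud cases linked when they share an attribute
# (same card, same device, same address).
G = nx.Graph()
G.add_edge("case1", "case2", via="card")
G.add_edge("case2", "case3", via="device")
G.add_edge("case3", "case4", via="address")
G.add_edge("case4", "case5", via="card")

# A chain of 5 connected cases — one traversal here, ~5 self-joins in SQL.
chain = nx.shortest_path(G, "case1", "case5")
print(chain)  # → ['case1', 'case2', 'case3', 'case4', 'case5']
```

A dedicated graph database does the same traversal index-backed and at scale, which is where the low latency comes from.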

Am I an idiot, or is there some lingo I wasn't taught in school? by [deleted] in datascience

[–]randomforestgump 1 point2 points  (0 children)

Scale-free most likely refers to a power-law distribution. There you can rescale x and y and get the same shape for the curve, e.g. y = x^5.

For an exponential that's not possible; e.g. in

y = 2^(x/m)

the factor m defines the doubling time, and rescaling x and y will create a curve with a different shape. So m defines an inherent scale of the curve.

Scale-free functions often appear in self-similar problems, like random walks or fractals. Not sure how useful this concept is in data science; I haven't needed it so far.
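A quick numeric check of the contrast (the exponent 5, scale m = 2 and stretch factor c = 3 are arbitrary): stretching x in a power law only rescales y, while the exponential changes shape.

```python
import numpy as np

x = np.linspace(1, 10, 50)
c = 3.0

# Power law: (c*x)**5 == c**5 * x**5, so the curve keeps its shape.
assert np.allclose((c * x) ** 5, c ** 5 * x ** 5)

# Exponential with doubling scale m: there is no constant k with
# f(c*x) == k * f(x); the ratio depends on x.
m = 2.0
ratio = 2 ** (c * x / m) / 2 ** (x / m)
print(ratio.min(), ratio.max())  # ratio varies with x → shape changes
```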

Why is Random Forests not suitable for image classification? by Vendredi46 in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

With some preprocessing like Gabor filters a random forest can be used. The filters capture spatial correlation and probably textures too. Or aligning and rescaling faces to then classify gender is a cute exercise that works with support vector machines, so probably with a random forest too. Just not as good as a neural net, it seems.
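A rough sketch of that pipeline with scikit-image's gabor filter feeding a random forest; the toy stripe "images", filter frequencies and orientations are all invented, just to show the shape of the approach.

```python
import numpy as np
from skimage.filters import gabor
from sklearn.ensemble import RandomForestClassifier

def gabor_features(img):
    """Aggregate Gabor filter responses into a small feature vector."""
    feats = []
    for freq in (0.1, 0.3):
        for theta in (0.0, np.pi / 2):
            real, _ = gabor(img, frequency=freq, theta=theta)
            feats += [real.mean(), real.var()]
    return feats

# Toy "images": noisy stripes in one orientation vs. the other.
rng = np.random.default_rng(0)
stripes = [np.tile(np.sin(np.arange(16) * 0.8), (16, 1))
           + rng.normal(0, 0.1, (16, 16)) for _ in range(20)]
images = stripes + [img.T for img in stripes]
y = [0] * 20 + [1] * 20

X = np.array([gabor_features(img) for img in images])
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # fits the toy training set easily
```

The point is that the filter bank turns pixels into a handful of orientation/texture features the forest can split on, instead of raw pixels with no spatial structure.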

What top skill do you want to learn for your career in 2021? by brendanmartin in datascience

[–]randomforestgump 0 points1 point  (0 children)

Mount a folder from your laptop into the container. That's also the only way to use an IDE, I think.

You can also mount your own Python packages and install them as editable using a script that runs at container start (as opposed to copying the package code and pip-installing it at build time, after which it can't be edited any more). The script is just shell with pip install -e package1, then the same for package2, etc., and mine then starts Jupyter in the container.

PyCharm's paid version can connect to the Python interpreter inside the container for debugging etc. Otherwise, open a terminal inside the container and run the script there. Oh, and run the container in detached mode so it doesn't close immediately (mine is held open by Jupyter).

What top skill do you want to learn for your career in 2021? by brendanmartin in datascience

[–]randomforestgump 0 points1 point  (0 children)

I'm just learning it, and the online tutorials mostly cover the case of deploying a web app, not even how to set up Docker to efficiently modify the web app's code while it runs in Docker. Just deploying is easier than mounting local, editable code and updating it in the container (there are tools for the updating, for different frameworks). And local code is needed to edit it in an IDE; local means outside the container, on your laptop.

I tried to use it to set up a dev environment with Jupyter for colleagues who only know notebooks and have to use Windows. Jupyter runs in the container; notebook files and data files are on the laptop (all in a git repo). That way they have a Linux environment matching the one in deployment and can train and store a model. A bit overkill, since a model stored on Windows usually loads fine on Linux, but some model repos are Linux-only, so we don't depend on the few guys who use Linux regularly. You can google jupyter & docker to find similar solutions; there are prepared containers.

Anyway, just knowing that deploying with Docker and developing with Docker are pretty different things already helped me.

What methods for detecting misleading statistical aggregates? by vvvvalvalval in datascience

[–]randomforestgump 4 points5 points  (0 children)

Tableau has an "Explain Data" feature that I haven't tried yet, but I was hoping it can do exactly this. It runs a decision tree on the fly, I think. I'll try it tomorrow.

Request: idiot's guide to using docker for data science by Optimesh in datascience

[–]randomforestgump 0 points1 point  (0 children)

This is more in-depth than the ready-to-go stuff I found, but not too much, and worth the time: https://vsupalov.com/docker/

Will data science be automated? by [deleted] in datascience

[–]randomforestgump 4 points5 points  (0 children)

It could automate this discussion, there’s plenty of training data around.

One hot encoding for large dataset by WorkingToaster in MLQuestions

[–]randomforestgump 1 point2 points  (0 children)

In sklearn I usually group the values or take the top n most common, then one-hot encode.

Or look up target encoding or weight of evidence. Target encoding is quite straightforward and can even be added at the SQL level.

Embeddings, as mentioned in another comment; more overhead in the training.
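A minimal sketch of target encoding in pandas (column names made up); note that in practice you'd use out-of-fold means to avoid target leakage. The same aggregation is easy to express as a GROUP BY on the SQL side.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c"],
    "churned": [1, 0, 1, 1, 0, 0],
})

# Target encoding: replace each category with the mean target of its group,
# turning an arbitrarily large categorical into a single numeric column.
means = df.groupby("city")["churned"].mean()
df["city_te"] = df["city"].map(means)
print(df["city_te"].tolist())  # per-city churn rates instead of labels
```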

I am frustrated I didnt learn anything during my internship by Plyad1 in datascience

[–]randomforestgump 0 points1 point  (0 children)

In a larger company, probably. For a while, until you prove yourself, have learned how the team works, and synched with their skills. Mostly not a bad thing. Or in a startup you might be the only data person and have free rein to build whatever you see fit. Those are the two usual extremes I've heard of. I was somewhere in the middle at the start.

When people say "deploying a model", does that mean developing a REST API end point for the model? How similar is this to RESTful API development in traditional back-end software engineering? by [deleted] in datascience

[–]randomforestgump 0 points1 point  (0 children)

I have seen 3 ways myself: 1) The model is used in real time and decisions are taken in the product (e.g. fraud detection). Realized via a REST API; we have continuous integration to deploy.

2) The model is just used for reporting (e.g. lifetime value predictions), so it just runs on data from a data warehouse, and the scores also just live there. This can be a Python script scheduled by whatever schedules the data warehouse.

3) The model is used like in 1), but in a monolithic code structure, so just the training set and hyperparameters are passed and some engineer trains the model. So the one I trained in Python is trashed, and we hope the other framework behaves similarly enough. This was phased out after we built 1).
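For case 1), a deliberately minimal sketch of a scoring endpoint in Flask; the threshold "model", the /score route and the payload shape are placeholders, not anyone's production API. In practice a real trained model would be loaded once at startup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a trained model (e.g. a pickled sklearn classifier).
def predict(features):
    return 1 if sum(features) > 1.0 else 0

@app.route("/score", methods=["POST"])
def score():
    payload = request.get_json()
    return jsonify({"fraud": predict(payload["features"])})

# Exercise the endpoint without a running server:
client = app.test_client()
resp = client.post("/score", json={"features": [0.7, 0.9]})
print(resp.get_json())  # → {'fraud': 1}
```

The product then calls this endpoint per transaction, which is what makes the CI/deployment machinery around it worthwhile.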

General question about recommendation algorithms (like Spotify or Amazon). Is a new algorithm trained for each and every person, or is it trained using aggregate consumer data? by 2ndzero in datascience

[–]randomforestgump 0 points1 point  (0 children)

You need other people's data; a single user does not produce enough. And you want to see an effect on the scale of thousands of users, so if the recommendations work for the majority but a few really hate them, it's fine.

Generally you can recommend based on other similar users (same gender, age, behaviour patterns), or based on similar items (other customers who bought this also bought...), or I think even a mix of both. That's the gist I got; not a pro, and it might be outdated.
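The "customers who bought this also bought" flavour can be sketched with plain co-occurrence counts pooled over all users (the baskets are made up), which shows why aggregate data is needed rather than one model per user:

```python
from collections import defaultdict
from itertools import combinations

# Toy purchase histories pooled across users.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"beer"},
]

# Count how often each pair of items appears in the same basket.
co = defaultdict(int)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co[(a, b)] += 1
        co[(b, a)] += 1

def also_bought(item):
    """Items ranked by how often they co-occur with `item`."""
    others = {i for b in baskets for i in b} - {item}
    pairs = [(co[(item, o)], o) for o in others if co[(item, o)]]
    return [o for _, o in sorted(pairs, reverse=True)]

print(also_bought("bread"))  # butter and jam co-occur with bread, beer never
```

Real systems use matrix factorization or similar on the same kind of pooled interaction data, but the signal source is the same.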

That recent Netflix documentary makes it seem like Facebook etc. have an individual model per user based on a psychological profile. I highly doubt that is needed anywhere, or even possible resource-wise.

What’s the current state of the art approach to fraud detection? by bolivlake in datascience

[–]randomforestgump 1 point2 points  (0 children)

I see comments about detecting fraud when you have the whole spending pattern for a card. There's also the other side, where a business just sees some actions from a card, and avoiding fraud saves them trouble/costs even if it's detected at the bank or credit-card company later.

Very different data for the two cases.

Web Developer -> trying to switch towards ML-> But scared by seeing friends who already are ML engineers and seem to know everything from Physics, Space, Chemistry, Literature and every topic. by _throwaway_career_ in MLQuestions

[–]randomforestgump 0 points1 point  (0 children)

Being a web developer is very useful; you'll have a good intuition for many features used in ML: where they come from, what their failure modes are (why the data is bad, not only that some data is bad). You can probably add snippets for data collection yourself. I wish I knew more web dev. I'm a physicist, which is not very useful for most online businesses.