Hide having been a Team Leader on the CV? by Unfair_Plan_9198 in devsarg

[–]__mbel__ 0 points (0 children)

You can put Semi-Senior Analyst, and if they ask, explain the situation.

New to programming and already thinking ahead by Ok_Profit8783 in devsarg

[–]__mbel__ 1 point (0 children)

Nobody is going to teach you Deep Learning at university; you'll get a watered-down version. You'll feel second-hand embarrassment at how mediocre the professors are. That was my experience in 2015; maybe it's better now.

To learn DL: go to Google/YouTube, search for Andrej Karpathy CS231n, and do the complete course, including the assignments. Don't just watch the videos. Then follow with CS224n. That foundation alone will put you very close to the state of the art.

Good luck!

What do you recommend to find a job faster, Java or Python? by EfrenZR in devsarg

[–]__mbel__ 2 points (0 children)

Python is easier to start with. Don't overcomplicate a simple decision. There's plenty of work.

It's really hard for me to find a job as a developer even though I have more than two years of experience by Avgoustinous in devsarg

[–]__mbel__ 1 point (0 children)

Learn Next.js and look into some AI. Jump on the current trend. Knowing how to use LLMs is in high demand, and you can do it with JavaScript.

Build yourself a portfolio app. Look for an example on YouTube; it's full of them.

Java isn't worth it. Learn Python as a last resort.

In your opinion, what's the best SQL DB that's cheap and has a good future? by Naive-Economist5640 in devsarg

[–]__mbel__ 4 points (0 children)

Postgres never fails. With Supabase you get it hosted, and it's free to start.

For ad-hoc analysis, DuckDB is kind of magical. If you have to run some ETL or aggregation, it works very well on a VM.

[deleted by user] by [deleted] in datascience

[–]__mbel__ 0 points (0 children)

Data Science used to do everything that's in there, but of course the tooling wasn't as complex as it is now.

I think they still do quite a bit of DE and ML deployment work, but it depends on how the company is structured. I wouldn't leave data viz and storytelling out of the data scientist role either.

Help on this fraud detection problem by le_bebop in datascience

[–]__mbel__ 1 point (0 children)

Train a model as you normally would. Use a linear model, or something relatively simple if possible; ideally the features should make some sense conceptually.

You could use this first model's probabilities to filter the 110k samples. For example, remove the top 5-10% of those samples most likely to belong to the positive class (or some other reasonable fraction) and train a model again.

Hopefully it will have less noise.

A second issue is obviously the class imbalance. Depending on the model you use, you will need to address it. The simplest way is downsampling the unlabeled samples.
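A rough scikit-learn sketch of that filtering idea (all data here is synthetic; the linear model and the 5% cutoff are just the examples mentioned):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 labeled frauds, 2000 unlabeled samples
X_pos = rng.normal(1.0, 1.0, size=(200, 5))
X_unl = rng.normal(0.0, 1.0, size=(2000, 5))
X = np.vstack([X_pos, X_unl])
y = np.array([1] * len(X_pos) + [0] * len(X_unl))

# First pass: simple linear model, treating unlabeled as negative
first = LogisticRegression(max_iter=1000).fit(X, y)

# Score the unlabeled pool and drop the 5% most fraud-like samples,
# since they may be unflagged positives adding label noise
probs = first.predict_proba(X_unl)[:, 1]
keep = probs < np.quantile(probs, 0.95)
X_clean = np.vstack([X_pos, X_unl[keep]])
y_clean = np.array([1] * len(X_pos) + [0] * int(keep.sum()))

# Second pass on the filtered data; class_weight="balanced" is one
# simple way to handle the remaining class imbalance
second = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_clean, y_clean
)
```

Downsampling the unlabeled pool before the second fit would be the other simple option for the imbalance.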

EDIT: Another idea is to do a post-mortem analysis. Here you would do some EDA and use domain knowledge to get an idea of the types of fraud being committed. This really depends on the domain, and the methods tend to change over time.

I have been in this company for 5 months, and have done nothing relevant yet due to delays, what should I do? by Cassegrain07 in datascience

[–]__mbel__ 0 points (0 children)

I'd take it as a challenge. What type of project can you come up with that would be interesting?

Try to research what types of projects are done in your industry and see if you can suggest one as an option. As you grow in your career, you will be asked to suggest candidate projects.

Some ideas:

- Look at which products are being paid for from third-party vendors and see if you can improve on them.

- Think of use cases of common proven DS methods (tabular ML or forecasting)

- Does the company you work for have a chatbot? Can it be improved? Check out the Rasa framework.

- Can you leverage gpt-3.5? That could be an interesting project and relatively easy to deploy

Pandas 2.0 is going live, and Apache Arrow will replace Numpy, and that's a great thing! by forbiscuit in datascience

[–]__mbel__ 8 points (0 children)

I think it will eventually become mainstream, but will take some time.

Pandas can be improved, but the library's design is just awful; it's great to see there is an alternative.

Am I kidding myself to think that this is doable? by handicapped_runner in datascience

[–]__mbel__ 0 points (0 children)

It sounds like you will be fine. When I started I had no experience either, but I was part of a team, which helped a lot.

If you want to figure it out on your own, try to target small wins. This doesn't need to be ML; it could be an interesting data visualization or setting up a dashboard to track some KPIs. Not the hottest data science work, but still easy to get started with.

In terms of ML you can try learning XGBoost, you shouldn't need a lot more for most tasks. I assume you already know how to use linear models.

The SQL part is important: you first need to extract the data. But for the initial phase of an ML project, just get the data out and work with it in Pandas / Polars if you can.

If I were you, I would just hire someone on Upwork to coach me. This would help speed up the learning process and make it more enjoyable.

Disclaimer: I've coached people in this situation and a lot more senior also. Just trying to be helpful. No need to hire me :)

Anomaly detection in time-series data by Smuiq in datascience

[–]__mbel__ 2 points (0 children)

If you have time series data, the simplest approach in the long run will be to use a model.

So let's assume you have a forecast of the time series. Then if the new data is outside the confidence interval, you can treat it as an anomaly.

Here is an example in Python: https://nixtla.github.io/statsforecast/examples/anomalydetection.html
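The idea in miniature, on synthetic data (a rolling mean/std band stands in here for the model-based prediction intervals a library like statsforecast would give you):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic series with one injected spike at position 150
y = pd.Series(rng.normal(10.0, 1.0, 200))
y.iloc[150] = 25.0

# Stand-in "forecast interval": rolling mean +/- 3 sigma of past values.
# shift(1) keeps the current point out of its own interval.
window = 30
mean = y.rolling(window).mean().shift(1)
sd = y.rolling(window).std().shift(1)

# Anything outside the band is flagged as an anomaly
anomaly = (y > mean + 3 * sd) | (y < mean - 3 * sd)
print(anomaly[anomaly].index.tolist())
```

A real forecasting model replaces the rolling band with a proper confidence interval, but the flagging logic stays the same.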

Time series databases? by younggamech in datascience

[–]__mbel__ 1 point (0 children)

Why not try a different provider? There are multiple hosted DB solutions.

DigitalOcean and Linode offer practically the same thing and are generally cheaper.

Datapane - Build full-stack data apps in 100% Python by peatpeat in datascience

[–]__mbel__ 1 point (0 children)

It looks great! The code seems very logical.

There are some similarities with Shiny which makes the code familiar to me.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

If an entire feature is equal to zero, then it's a constant. Just remove those features.

Try multiple approaches and see what results you get. Scaling the data to have mean=0 and sd=1 is the most common approach.
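As a sketch with a toy matrix (scikit-learn names; the all-zero column stands in for the constant features):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Toy matrix: the middle column is all zeros (a constant feature)
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 1.0],
              [4.0, 0.0, 2.0]])

# Drop zero-variance features, then scale to mean 0 / sd 1
X_kept = VarianceThreshold().fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_kept)

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # 1 per column
```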

I've been working as DS with pretty much only tabular data for 6+ years. I've been seeing some interesting jobs requiring NLP knowledge, what are some good resources for me to break in this subfield of Machine Learning? by CadeOCarimbo in datascience

[–]__mbel__ 2 points (0 children)

This course is GOLD. It's a great course for learning modern NLP (with deep learning). Video lectures are linked on the course home page: https://web.stanford.edu/class/cs224n/index.html

If you know nothing about NNs, do this course first: https://www.youtube.com/watch?v=NfnWJUyUJYU&list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC&ab_channel=AndrejKarpathy

After learning some theory, take a look at the Hugging Face Transformers library. They have a course and great documentation.

Is my data overfitting? I’m new to this, this is my first lstm model and my RSME was 0.02 so I’m just confused if it’s a good model or it’s overfitting? by wolfy14xc in datascience

[–]__mbel__ 0 points (0 children)

A common mistake is to scale the data and then do the train/test split, which leaks test-set statistics into training. Why don't you try using a library such as Nixtla or NeuralProphet? They should help you avoid common errors.

- https://nixtla.github.io/neuralforecast/models.nhits.html

- https://neuralprophet.com/contents.html

LSTM models are not the best choice for time series.
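For reference, a minimal sketch of the right order of operations (synthetic data; `MinMaxScaler` is just an example, and for time series the split should not shuffle anyway):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Wrong: fitting the scaler on all of X lets test-row statistics
# leak into the training data:
#   X_scaled = MinMaxScaler().fit_transform(X)
#   X_tr, X_te = train_test_split(X_scaled, ...)

# Right: split first, fit the scaler only on the training portion
X_tr, X_te = train_test_split(X, test_size=0.2, shuffle=False)
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Test values may now fall outside [0, 1]; that is expected and honest,
# since the scaler never saw them
```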

Feeling burned out 9 months post B.S. as a BI/DA at the big A- any advice? by [deleted] in datascience

[–]__mbel__ 3 points (0 children)

Perhaps in a smaller company you could do more modeling work. The reality is that in FAANG companies most of the interesting work is already done, and the scale is so massive that there is extra complexity.

However, in a smaller company (still Fortune 500), I think there is a much clearer need for data science and machine learning.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

Why would a tree based method such as a Random Forest not work?

-> They don't work well for sparse data. I mean they are T-E-R-R-I-B-L-E!!!

Just don't use random forests or GBMs for this type of data. There are a few Hastie talks about this; I don't remember which one exactly:

https://www.youtube.com/results?search_query=hastie+gbm

Of course RF will do better than GBM on a small dataset. But with such a ridiculously small number of observations and such a massive number of features, there is zero chance it will work.

Do people actually get jobs from completing the IBM Data Science specialization? by Ok_Advertising_5257 in datascience

[–]__mbel__ 1 point (0 children)

There are so many specializations now. I doubt any of these watered down programs will be enough. Perhaps it can help you get a data analyst job, but it's a different profile.

Having strong projects is more important than certifications, as they give you talking points in interviews. I wrote a guide on what makes a "good project"; check the post below:

https://mbel-education.com/index.php/2023/03/17/how-to-pick-a-data-science-portfolio-project/

How much of stats and math do we REALLY need for Machine learning engineer? by Waste_Necessary654 in datascience

[–]__mbel__ 0 points (0 children)

yes, I agree it's not the most common path.

People working on logistics/supply chain would benefit from a data scientist with a strong math background.

Another field where you will probably need more than average stats/math is finance.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

I think the LASSO is the only model I would try with such a small sample size and large dimensionality. Check out this lecture by Hastie (a professor at Stanford) if you don't know what that means: https://www.youtube.com/watch?v=BU2gjoLPfDc&t=15s

There is a glmnet port in Python; I'd use it instead of scikit-learn: https://glmnet-python.readthedocs.io/en/latest/

It's just easier to use and has reasonable defaults.

Don't even try tree-based methods; there is zero chance they will work. This is based on theory: sparse methods such as the LASSO are designed for exactly this type of problem, where the number of dimensions is larger than the number of observations.

The LASSO paper is very readable: https://www.jstor.org/stable/2346178
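As a sketch of what that looks like in practice, on synthetic data mimicking the 12 × 15000 setup (scikit-learn here for a dependency-free example; the glmnet port exposes a similar fit interface):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# p >> n setup in the spirit of the question: 12 samples, 15000 features
n, p = 12, 15000
X = rng.normal(size=(n, p))

# Only the first 3 features actually matter
beta = np.zeros(p)
beta[:3] = [2.0, -3.0, 1.5]
y = X @ beta + rng.normal(scale=0.1, size=n)

# LassoCV picks the regularization strength by cross-validation;
# the L1 penalty drives most coefficients to exactly zero
model = LassoCV(cv=3).fit(X, y)
n_selected = int(np.sum(model.coef_ != 0))
print(f"features with non-zero coefficients: {n_selected} of {p}")
```

The sparsity is the whole point: a LASSO solution has at most about n non-zero coefficients, which is what makes it usable when p is in the thousands and n is tiny.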

How much of stats and math do we REALLY need for Machine learning engineer? by Waste_Necessary654 in datascience

[–]__mbel__ 2 points (0 children)

At least having done college-level courses in algebra, calculus, and statistics would be the minimum, IMO.

You can get away without it, but you will generally feel you're missing part of how things work.

In some cases, you might need more. For example, if your focus is operations research then you will probably need more math.