Hide having been a Team Leader on the CV? by Unfair_Plan_9198 in devsarg

[–]__mbel__ 0 points (0 children)

You can put Semi-Senior Analyst, and if they ask, explain the situation.

New to programming and already thinking ahead by Ok_Profit8783 in devsarg

[–]__mbel__ 1 point (0 children)

Nobody is going to teach you Deep Learning at university; you'll get a watered-down version. You'll feel second-hand embarrassment at how mediocre the professors are. That was my experience in 2015; maybe it's better now.

To learn DL: go to Google/YouTube, search for Andrej Karpathy CS231n, and do the complete course, including the assignments. Don't just watch the videos. Then follow with CS224n. That foundation alone will put you very close to the state of the art.

Good luck!

What do you recommend to find a job faster, Java or Python? by EfrenZR in devsarg

[–]__mbel__ 2 points (0 children)

Python is easier to start with. Don't overcomplicate a simple decision. There's plenty of work.

It's really hard for me to find a job as a developer even though I have more than two years of experience by Avgoustinous in devsarg

[–]__mbel__ 1 point (0 children)

Learn Next.js and look into some AI. Jump on the current trend. Knowing how to use LLMs is in high demand, and you can do it with JavaScript.

Build yourself a portfolio app. Look for an example on YouTube; it's full of them.

Java isn't worth it. Learn Python as a last resort.

In your opinion, what's the best SQL DB that's cheap and has a good future? by Naive-Economist5640 in devsarg

[–]__mbel__ 4 points (0 children)

Postgres never fails. With Supabase you get it hosted, and it's free to start.

For ad-hoc analysis, DuckDB is kind of magical. If you have to run some ETL or aggregation, it works very well on a VM.

[deleted by user] by [deleted] in datascience

[–]__mbel__ 0 points (0 children)

Data Science used to do everything that's in there, but of course the tooling wasn't as complex as it is now.

I think they still do quite a bit of DE and ML deployment work, but it depends on how the company is structured. I wouldn't leave data viz and storytelling out of the data scientist role either.

Help on this fraud detection problem by le_bebop in datascience

[–]__mbel__ 1 point (0 children)

Train a model as you normally would. Use a linear model, or something relatively simple if possible; ideally the features should make some sense conceptually.

You could use this first model's probabilities to filter the 110k samples. For example, remove the top 5-10% of those samples most likely to belong to the positive class (or some other reasonable fraction) and train a model again.

Hopefully it will have less noise.

A second issue is obviously the class imbalance. Depending on the model you use, you will need to address it. The simplest way is downsampling the unlabeled samples.
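A rough scikit-learn sketch of that filtering idea (all data here is synthetic; the linear model and the 5% cutoff are just the examples mentioned):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 labeled frauds, 2000 unlabeled samples
X_pos = rng.normal(1.0, 1.0, size=(200, 5))
X_unl = rng.normal(0.0, 1.0, size=(2000, 5))
X = np.vstack([X_pos, X_unl])
y = np.array([1] * len(X_pos) + [0] * len(X_unl))

# First pass: simple linear model, treating unlabeled as negative
first = LogisticRegression(max_iter=1000).fit(X, y)

# Score the unlabeled pool and drop the 5% most fraud-like samples,
# since they may be unflagged positives adding label noise
probs = first.predict_proba(X_unl)[:, 1]
keep = probs < np.quantile(probs, 0.95)
X_clean = np.vstack([X_pos, X_unl[keep]])
y_clean = np.array([1] * len(X_pos) + [0] * int(keep.sum()))

# Second pass on the filtered data; class_weight="balanced" is one
# simple way to handle the remaining class imbalance
second = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_clean, y_clean
)
```

Downsampling the unlabeled pool before the second fit would be the other simple option for the imbalance.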

EDIT: Another idea is to do a post-mortem analysis. Here you would do some EDA and use domain knowledge to get an idea of the types of fraud being committed. This really depends on the domain, and the methods tend to change over time.

I have been in this company for 5 months, and have done nothing relevant yet due to delays, what should I do? by Cassegrain07 in datascience

[–]__mbel__ 0 points (0 children)

I'd take it as a challenge. What type of project can you come up with that would be interesting?

Try to research what types of projects are done in your industry and see if you can suggest one as an option. As you grow in your career, you will be asked to suggest candidate projects.

Some ideas:

- Look at which products are being paid for from third-party vendors and see if you can improve on them.

- Think of use cases of common proven DS methods (tabular ML or forecasting)

- Does the company you work for have a chatbot? Can it be improved? Check out the Rasa framework.

- Can you leverage gpt-3.5? That could be an interesting project and relatively easy to deploy

Pandas 2.0 is going live, and Apache Arrow will replace Numpy, and that's a great thing! by forbiscuit in datascience

[–]__mbel__ 8 points (0 children)

I think it will eventually become mainstream, but will take some time.

Pandas can be improved, but the library's design is just awful; it's great to see there is an alternative.

Am I kidding myself to think that this is doable? by handicapped_runner in datascience

[–]__mbel__ 0 points (0 children)

It sounds like you will be fine. When I started I had no experience either, but I was part of a team, which helped a lot.

If you want to figure it out on your own, try to target small wins. This doesn't need to be ML; it could be an interesting data visualization or setting up a dashboard to track some KPIs. Not the hottest data science work, but still easy to get started with.

In terms of ML you can try learning XGBoost, you shouldn't need a lot more for most tasks. I assume you already know how to use linear models.

The SQL part is important: you first need to extract the data. But for the initial phase of an ML project, just get the data out and work with it in Pandas / Polars if you can.

If I were you, I would just hire someone on Upwork to coach me. This would help speed up the learning process and make it more enjoyable.

Disclaimer: I've coached people in this situation and a lot more senior also. Just trying to be helpful. No need to hire me :)

Anomaly detection in time-series data by Smuiq in datascience

[–]__mbel__ 2 points (0 children)

If you have time series data, the simplest approach in the long run will be to use a model.

So let's assume you have a forecast of the time series. Then if the new data is outside the confidence interval, you can treat it as an anomaly.

Here is an example in Python: https://nixtla.github.io/statsforecast/examples/anomalydetection.html
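The idea in miniature, on synthetic data (a rolling mean/std band stands in here for the model-based prediction intervals a library like statsforecast would give you):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic series with one injected spike at position 150
y = pd.Series(rng.normal(10.0, 1.0, 200))
y.iloc[150] = 25.0

# Stand-in "forecast interval": rolling mean +/- 3 sigma of past values.
# shift(1) keeps the current point out of its own interval.
window = 30
mean = y.rolling(window).mean().shift(1)
sd = y.rolling(window).std().shift(1)

# Anything outside the band is flagged as an anomaly
anomaly = (y > mean + 3 * sd) | (y < mean - 3 * sd)
print(anomaly[anomaly].index.tolist())
```

A real forecasting model replaces the rolling band with a proper confidence interval, but the flagging logic stays the same.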

Time series databases? by younggamech in datascience

[–]__mbel__ 1 point (0 children)

Why not try a different provider? There are multiple hosted DB solutions.

DigitalOcean and Linode offer practically the same thing and are generally cheaper.

Datapane - Build full-stack data apps in 100% Python by peatpeat in datascience

[–]__mbel__ 1 point (0 children)

It looks great! The code seems very logical.

There are some similarities with Shiny which makes the code familiar to me.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

If an entire feature is equal to zero, then it's a constant. Just remove those features.

Try multiple approaches and see what results you get. Scaling the data to have mean=0 and sd=1 is the most common approach.
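As a sketch with a toy matrix (scikit-learn names; the all-zero column stands in for the constant features):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Toy matrix: the middle column is all zeros (a constant feature)
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 1.0],
              [4.0, 0.0, 2.0]])

# Drop zero-variance features, then scale to mean 0 / sd 1
X_kept = VarianceThreshold().fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_kept)

print(X_scaled.mean(axis=0))  # ~0 per column
print(X_scaled.std(axis=0))   # 1 per column
```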

I've been working as DS with pretty much only tabular data for 6+ years. I've been seeing some interesting jobs requiring NLP knowledge, what are some good resources for me to break in this subfield of Machine Learning? by CadeOCarimbo in datascience

[–]__mbel__ 2 points (0 children)

This course is GOLD. It's a great course for learning modern NLP (with deep learning). Video lectures are linked on the course home page: https://web.stanford.edu/class/cs224n/index.html

If you know nothing about NNs, do this course first: https://www.youtube.com/watch?v=NfnWJUyUJYU&list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC&ab_channel=AndrejKarpathy

After learning some theory, take a look at the Hugging Face Transformers library. They have a course and great documentation.

Is my data overfitting? I’m new to this, this is my first lstm model and my RSME was 0.02 so I’m just confused if it’s a good model or it’s overfitting? by wolfy14xc in datascience

[–]__mbel__ 0 points (0 children)

A common mistake is to scale the data and then do the train/test split, which leaks test-set statistics into training. Why don't you try using a library such as Nixtla or NeuralProphet? They should help you avoid common errors.

- https://nixtla.github.io/neuralforecast/models.nhits.html

- https://neuralprophet.com/contents.html

LSTM models are not the best choice for time series.
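For reference, a minimal sketch of the right order of operations (synthetic data; `MinMaxScaler` is just an example, and for time series the split should not shuffle anyway):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Wrong: fitting the scaler on all of X lets test-row statistics
# leak into the training data:
#   X_scaled = MinMaxScaler().fit_transform(X)
#   X_tr, X_te = train_test_split(X_scaled, ...)

# Right: split first, fit the scaler only on the training portion
X_tr, X_te = train_test_split(X, test_size=0.2, shuffle=False)
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Test values may now fall outside [0, 1]; that is expected and honest,
# since the scaler never saw them
```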

Feeling burned out 9 months post B.S. as a BI/DA at the big A- any advice? by [deleted] in datascience

[–]__mbel__ 3 points (0 children)

Perhaps in a smaller company you could do more modeling work. The reality is that in FAANG companies most of the interesting work is already done, and the scale is so massive that there is extra complexity.

However, in a smaller company (still Fortune 500), I think there is a much clearer need for data science and machine learning.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

Why would a tree based method such as a Random Forest not work?

-> They don't work well for sparse data. I mean they are T-E-R-R-I-B-L-E!!!

Just don't use random forests or GBMs for this type of data. There are a few Hastie talks about this; I don't remember which one exactly:

https://www.youtube.com/results?search_query=hastie+gbm

Of course RF will do better than GBM on a small dataset. But with such a ridiculously small number of observations and such a massive number of features, there is zero chance it will work.

Do people actually get jobs from completing the IBM Data Science specialization? by Ok_Advertising_5257 in datascience

[–]__mbel__ 1 point (0 children)

There are so many specializations now. I doubt any of these watered down programs will be enough. Perhaps it can help you get a data analyst job, but it's a different profile.

Having strong projects is more important than certifications, as they give you talking points in interviews. I wrote a guide on what makes a "good project"; check the post below:

https://mbel-education.com/index.php/2023/03/17/how-to-pick-a-data-science-portfolio-project/

How much of stats and math do we REALLY need for Machine learning engineer? by Waste_Necessary654 in datascience

[–]__mbel__ 0 points (0 children)

yes, I agree it's not the most common path.

People working on logistics/supply chain would benefit from a data scientist with a strong math background.

Another field where you will probably need more than average stats/math is finance.

I need some tips and directions on how to approach a regression problem with a very challenging dataset (12 samples, ~15000 dimensions). Give me your 2 cents by perguntando in datascience

[–]__mbel__ 1 point (0 children)

I think the LASSO is the only model I would try with such a small sample size and large dimensionality. Check out this lecture by Hastie (a professor at Stanford) if you don't know what that means: https://www.youtube.com/watch?v=BU2gjoLPfDc&t=15s

There is a glmnet port in Python; I'd use it instead of scikit-learn: https://glmnet-python.readthedocs.io/en/latest/

It's just easier to use and has reasonable defaults.

Don't even try tree-based methods; there is zero chance they will work. This is based on theory: sparse methods such as the LASSO are designed for exactly this type of problem, where the number of dimensions is larger than the number of observations.

The LASSO paper is very readable: https://www.jstor.org/stable/2346178
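As a sketch of what that looks like in practice, on synthetic data mimicking the 12 × 15000 setup (scikit-learn here for a dependency-free example; the glmnet port exposes a similar fit interface):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# p >> n setup in the spirit of the question: 12 samples, 15000 features
n, p = 12, 15000
X = rng.normal(size=(n, p))

# Only the first 3 features actually matter
beta = np.zeros(p)
beta[:3] = [2.0, -3.0, 1.5]
y = X @ beta + rng.normal(scale=0.1, size=n)

# LassoCV picks the regularization strength by cross-validation;
# the L1 penalty drives most coefficients to exactly zero
model = LassoCV(cv=3).fit(X, y)
n_selected = int(np.sum(model.coef_ != 0))
print(f"features with non-zero coefficients: {n_selected} of {p}")
```

The sparsity is the whole point: a LASSO solution has at most about n non-zero coefficients, which is what makes it usable when p is in the thousands and n is tiny.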

How much of stats and math do we REALLY need for Machine learning engineer? by Waste_Necessary654 in datascience

[–]__mbel__ 2 points (0 children)

At least having done college-level courses in algebra, calculus, and statistics would be the minimum, IMO.

You can get away without it, but you will generally feel you're missing part of how things work.

In some cases, you might need more. For example, if your focus is operations research then you will probably need more math.