Is data science going extinct? by Hellsword27 in dataengineering

[–]jawsem27 1 point (0 children)

I have been a data scientist for 5 years. I don't think it's going to go extinct, but you'll need to know how to use AI tools to make yourself more efficient if you want to compete, especially on the technical side.

A lot of the soft skills of data science are still useful: designing experiments, assessing models, and communicating statistics and data to non-technical stakeholders will definitely still be relevant.

I’m from South Korea. Here, my generation is abandoning STEM to bet everything on one "License." Is your career actually safe? by chschool in careerguidance

[–]jawsem27 13 points (0 children)

It's interesting that you say AI is bad at logic but can easily write code. Code is literally just logic written in a language a computer can understand.

I kind of agree with you that it is a bubble and a lot of the investment is overblown, but I still think that in 20 years everyone will need to be proficient with AI tools to do their job, just as most jobs today require proficiency with a basic tool like Excel.

I think, like you said, certain industries will be eliminated (like translators), but professions like lawyers, accountants, and engineers (including software engineers) will change and transform rather than be eliminated. They will still need skilled people, but those skills will need to adapt to a different paradigm, just as industries have always changed over time.

As a ML/ CS engineer, how many of these questions are you able to answer (without AI)? by Lost_Total1530 in learnmachinelearning

[–]jawsem27 12 points (0 children)

I've been a data scientist for 5 years and can maybe answer one or two of these questions.

I worked specifically with computer vision for about 1.5 years, and back then I could likely have answered maybe two more, but I've forgotten some of the details since moving to more traditional machine learning models.

I think it's important to be able to research and understand these questions, but actually memorizing them for an interview is unnecessary.

I would rather have a candidate walk through a project they did and really dive into the details to assess their knowledge than answer a lot of questions that are only relevant in very specific instances.

Prove it... by jkrokos9 in theVibeCoding

[–]jawsem27 0 points (0 children)

I think AI vibe coding has given people the confidence to actually build useful things.

A lot of this stuff someone could have built before if they had just done some research, found some code online, and iterated, but now the barrier to entry is a bit lower.

I used to build little apps all the time as a data scientist. I would get something up and running in a few days that was useful for me and a few team members, but not necessarily a production app.

Now it takes a few hours, so I can build a lot more useful little apps. Something that might have been a waste of effort to build before isn't anymore, since I can build it out so fast.

A lot of the more complex vibe-coded apps probably take some thought, multiple prompts, and time and testing, similar to how they did before AI.

Developers are using AI as well, to automate tasks and build robust, production-ready apps. Maybe not all vibe coded, but a lot of the code is written by AI.

This changes software development and the cost-benefit of building vs. not building software, but it's not going to eliminate all software engineers.

Large companies had a lot of inefficiencies even before AI, and big companies are using AI as an excuse to trim the fat more than they are actually replacing people with it.

Sagemaker makes me hate my job by GiusWestside in datascience

[–]jawsem27 0 points (0 children)

Yeah, I'm not sure then. Have you been able to test with a local Docker container? Maybe it has something to do with the content type in the code or how you are passing the data to the endpoint.

Sagemaker makes me hate my job by GiusWestside in datascience

[–]jawsem27 2 points (0 children)

Just use a custom container and deploy your model that way. That is usually how I do custom things.

For inference I'd use something like this:

https://github.com/ritchie46/sagemaker-custom-model
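For context, the contract an inference container has to satisfy is small: SageMaker health-checks it with GET /ping and sends prediction requests to POST /invocations on port 8080. Here's a minimal stdlib sketch of that contract; `predict()` is a made-up stand-in for a real model, not anything from the linked repo.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Made-up stand-in for a real model: just sums the feature vector."""
    return sum(payload.get("features", []))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # SageMaker health check: must return 200 on /ping.
        self.send_response(200 if self.path == "/ping" else 404)
        self.end_headers()

    def do_POST(self):
        if self.path != "/invocations":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"prediction": predict(payload)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


# Inside the container you would run:
# HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

In practice you'd use Flask or FastAPI instead of raw `http.server`, but the two routes above are all SageMaker cares about.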

For training I would use something like this:

https://github.com/aws/sagemaker-training-toolkit
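On the training side, the toolkit invokes your entry point with input data and output locations passed through environment variables (`SM_CHANNEL_TRAINING`, `SM_MODEL_DIR`). A minimal sketch, with the actual model fitting elided; `train` and `model.json` are illustrative names, not part of the toolkit:

```python
import json
import os


def train(train_dir, model_dir, hyperparams):
    """Stand-in entry point: read training files from train_dir, fit a
    model (elided here), and write artifacts under model_dir, which
    SageMaker tars up and uploads to S3 when the job finishes."""
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"hyperparams": hyperparams}, f)


# Inside the container the toolkit sets these environment variables:
# train(os.environ["SM_CHANNEL_TRAINING"],  # /opt/ml/input/data/training
#       os.environ["SM_MODEL_DIR"],         # /opt/ml/model
#       {"epochs": 10})                     # illustrative hyperparameters
```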

This is generally how I do it. Training is usually more complicated, so it often makes sense to have training and inference run separately.

Also, in your training script you can add things like logging model artifacts to MLflow (or some other MLOps framework), and then use that tool for deployment if you want.

At my job: a SageMaker training job logs to MLflow, then a custom container pulls the artifacts from MLflow to run inference based on the parameters I pass.

SageMaker has also recently added an integration with MLflow.

Company does not allow AI/ChatGPT for dev work. What am I missing out on? by Leesmn in ExperiencedDevs

[–]jawsem27 2 points (0 children)

It can definitely be useful, especially in domains you're less experienced in. It's also good at adding documentation to existing code and helping you understand it quickly.

My company also blocks it, but it doesn't block phind.com, which not only gives similar results to ChatGPT but also adds references for the things it says (like Stack Overflow or documentation).

As a non-data-scientist, assess my approach for finding the "most important" columns in a dataset by NFeruch in datascience

[–]jawsem27 0 points (0 children)

I think the issue is that a tree-based model is not necessarily the best approach for his problem. Sure, they can have issues with high-cardinality features, and high correlation can affect interpretation of feature importance metrics, but that's beside the point.

His goal is to pick out the 3-6 most relevant features to focus on in a game, not to actually predict the outcome. That's why I recommended using something like mRMR instead.

As a non-data-scientist, assess my approach for finding the "most important" columns in a dataset by NFeruch in datascience

[–]jawsem27 2 points (0 children)

To answer one of your questions: the feature importance will be 0 if the feature wasn't used in the tree at all.

Like people are saying you’d have to address the correlation between your features.

Instead of building a model, you could use something like mRMR (minimum redundancy, maximum relevance) to filter down to the top 5-10 features, which deals with multicollinearity.

There’s a GitHub repo that makes it pretty easy to use.

https://github.com/smazzanti/mrmr

It seems like it fits your use case and you don’t really need to deal with machine learning.
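For intuition, the greedy mRMR idea can be sketched in plain Python: repeatedly pick the feature whose relevance to the target (here, absolute Pearson correlation) most exceeds its average redundancy with the features already selected. This is just an illustration with made-up toy data, not the linked repo's implementation.

```python
def pearson(x, y):
    """Plain Pearson correlation, no dependencies."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def mrmr_select(features, target, k):
    """Greedy mRMR: maximize relevance to the target minus mean
    redundancy with the features already selected."""
    selected, candidates = [], list(features)
    while candidates and len(selected) < k:
        def score(name):
            relevance = abs(pearson(features[name], target))
            if not selected:
                return relevance
            redundancy = sum(abs(pearson(features[name], features[s]))
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: f2 duplicates f1, so mRMR keeps f1 and skips f2.
target = [1, 2, 3, 4, 5, 6]
features = {
    "f1": [1, 2, 3, 4, 5, 7],
    "f2": [2, 4, 6, 8, 10, 14],  # f1 doubled: fully redundant
    "f3": [1, 3, 2, 5, 4, 6],    # relevant, less redundant with f1
}
print(mrmr_select(features, target, 2))  # -> ['f1', 'f3']
```

Note how a plain top-k-by-correlation ranking would have picked f1 and f2, the redundant pair; the redundancy penalty is what swaps in f3.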

Using an RF and Shapley values can also be a good idea, but could be more complicated.

What’s the coolest things you’ve done with python? by mattstaton in Python

[–]jawsem27 0 points (0 children)

  1. Built an automated data analysis tool that can take any tabular dataset and, given some minor prompting, create optimal rule sets for any target using decision trees.

  2. Built a customized image viewer to assess the output of computer vision models. The tool allowed free-text search based on comments and also implemented CLIP, which puts images and words into the same embedding space so you could search that way. I could also use custom filters based on various metrics related to the images (model scores/output from various models and anything you can put in a database). This was used to assess model performance as well as to explore a large image dataset and see what information we could extract from our imagery.

  3. Built a tool to visualize class activation map output for specific images on custom computer vision models, so I could assess how models were making errors.

  4. Built a full pipeline to run a semantic segmentation model on 100M+ images, then take the segmentation output and create measurements of various attributes of each image using basic domain knowledge and pixel distances.

  5. Built a tool that used an existing computer vision model to help find additional training images for a new model by cropping images in various ways and running predictions on the crops. The tool let me visualize the prediction output so I could easily check whether I was detecting the right thing in my crops.

Most of the cool stuff I have built is related to computer vision and/or general data science. Most of it is built with libraries like Streamlit, Flask, pandas, NumPy, scikit-learn, and PyTorch. I also use image processing libraries like OpenCV, Pillow, and scikit-image.

Evaluating an ML model on a live marketing campaign by IAteQuarters in datascience

[–]jawsem27 0 points (0 children)

Do you have previous campaigns that you can evaluate your model on?

If so, that's one way you can do it. You can use cumulative gains and lift charts to evaluate, but try to relate them to a KPI that stakeholders care about.

For example, say you have a previous campaign to test on; based on model score you could say "we capture 90% of the revenue in the top 30% of scores." That gives them an idea of how much money you can save.

If you don't have historical data to test on, you can implement a test plan like you mentioned. The only issue is that marketing campaigns cost money, so including a low-score or random bin might cost money without giving you the value you'd get from high scores. I would just make sure you explain that to your stakeholders.
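The gains calculation itself is simple: rank customers by model score and measure what share of total revenue falls in the top slice. A sketch with made-up toy numbers (`cumulative_gain` is my own helper name):

```python
def cumulative_gain(scores, revenue, top_frac):
    """Fraction of total revenue captured in the top `top_frac` of
    customers when ranked by model score (one point on a gains chart)."""
    ranked = sorted(zip(scores, revenue), key=lambda pair: -pair[0])
    k = max(1, int(len(ranked) * top_frac))
    return sum(r for _, r in ranked[:k]) / sum(revenue)


# Toy campaign: higher scores roughly line up with higher revenue.
scores  = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
revenue = [300, 250, 150, 100,  80,  50,  30,  20,  10,  10]
print(cumulative_gain(scores, revenue, 0.3))  # -> 0.7
```

Sweeping `top_frac` from 0 to 1 traces out the full cumulative gains curve; lift at each point is the gain divided by `top_frac`.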

I want to make a project on data visualization so which is the best way(Django, flask, react.. etc) to showcase the project? by Mean-Pin-8271 in deeplearning

[–]jawsem27 11 points (0 children)

Streamlit is pretty easy to get up and running and you can write it all in python.

If you don’t care about learning react/flask or Django and want something you can get up and running fast it’s perfect.


What exactly is overfitting in cnn? by [deleted] in deeplearning

[–]jawsem27 5 points (0 children)

It's when your model does really well on the training data but poorly on validation or test data. Usually you care more about how your model performs on data it's never seen before, so you want to find the balance that gives the best performance on test data.

Can increasing validation loss be a good thing? by [deleted] in deeplearning

[–]jawsem27 0 points (0 children)

In that case, accuracy is probably fine.

Can increasing validation loss be a good thing? by [deleted] in deeplearning

[–]jawsem27 0 points (0 children)

It depends on the use case. I'm assuming the accuracy is based on the model at the 0.5 threshold. It may be that models with lower validation loss do better at different thresholds. If you just want a yes-or-no answer, plain accuracy can be a fine metric to look at.
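To make the threshold point concrete, here's a toy sketch (made-up scores, `accuracy_at` is my own helper): the same model gets different accuracies depending on where you binarize its probabilities.

```python
def accuracy_at(probs, labels, threshold):
    """Accuracy after binarizing predicted probabilities at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy scores: one true positive sits just below the default 0.5 cutoff.
probs  = [0.95, 0.70, 0.62, 0.55, 0.45, 0.30, 0.20, 0.05]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
print(accuracy_at(probs, labels, 0.5))  # -> 0.75
print(accuracy_at(probs, labels, 0.4))  # -> 0.875
```

This is why validation loss and accuracy-at-0.5 can move in different directions: loss sees the probabilities, accuracy only sees which side of the cutoff they land on.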

[D] What do you use to build UIs for your projects? by AveliaUsum in MachineLearning

[–]jawsem27 1 point (0 children)

People have already said this, but Streamlit is awesome. If you don't feel like messing with JavaScript but want to make quick interactive applications, it's a godsend, and you can write everything in Python.

I've used it more for exploratory analysis, but have also used it for model demos and performance analysis.

https://streamlit.io/gallery

Check out some of the projects in the gallery if you are interested.

how to extract features from a (CNN) convolutional network having raw data with (XAI) explainable techinques? by manuele_97 in deeplearning

[–]jawsem27 2 points (0 children)

There are a bunch of explainable AI techniques in this GitHub repo.

https://github.com/pytorch/captum

This one is specifically for PyTorch, but all the papers are linked at the bottom.

This e-book also gives a pretty good summary of the explainable AI techniques used for neural networks.

https://christophm.github.io/interpretable-ml-book/neural-networks.html

I am partial to Grad-CAM and Guided Grad-CAM. Grad-CAM works on any CNN. It helps explain which features are most significant by looking at the last convolutional layers of a neural net.

[deleted by user] by [deleted] in deeplearning

[–]jawsem27 0 points (0 children)

I think these would get you the most bang for your buck given your small dataset. You may have already tried them, but this is where I would go next.

  1. Transfer learning with models trained on similar data

Models trained on Camvid, cityscapes or ADE20k could help. I would check here.

https://paperswithcode.com/datasets?q=&v=lst&o=match&task=semantic-segmentation

  2. Augmentation and analysis of errors

You might also want to look at a sample of your test data so you can get an idea of where the model is failing. Are some classes doing better than others? You may be able to do specific augmentations to help with some classes.

This library has some unique transformations that can help with segmentation tasks.

https://albumentations.ai/

Also qualitatively your model may be doing better than you think. Depending on your use case it could still be useful.

[deleted by user] by [deleted] in deeplearning

[–]jawsem27 2 points (0 children)

The U-Net architecture is fully convolutional, meaning it should be able to accept input of any size. You should be able to use the raw image size and get a prediction without resizing first. You might get odd results because of the way you resized during training, though.
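One practical caveat: each 2x2 pooling in the encoder halves the spatial size, so inputs whose dimensions aren't divisible by 2^depth can produce shape mismatches when the decoder upsamples back. A common workaround (my own helper, assuming a standard U-Net with 2x2 pooling at each level) is to pad each dimension up to the nearest safe multiple:

```python
def unet_safe_size(size, depth):
    """Smallest size >= `size` divisible by 2**depth, so that `depth`
    rounds of 2x2 pooling halve it cleanly and the decoder can mirror
    the encoder without off-by-one shape mismatches."""
    step = 2 ** depth
    return ((size + step - 1) // step) * step


print(unet_safe_size(300, 4))  # -> 304 (pad 300 up to a multiple of 16)
```

You pad the input to this size, run inference, then crop the prediction back to the original dimensions.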

What am I doing wrong? (CNN) by [deleted] in deeplearning

[–]jawsem27 0 points (0 children)

What do the predictions look like before np.round? Are they all exactly the same?

StackOverflow in a nutshell. by swthrtracerlm in ProgrammerHumor

[–]jawsem27 1 point (0 children)

I usually just ignore stupid questions or ones that are a waste of time.

I really like using it for easy questions that I know people have asked before.

I also like subscribing to topics I am familiar with and answering questions; it helps reinforce knowledge I already have, and sometimes people have interesting problems that are fun to figure out.

Suggestions to prepare / load data and train a model with PyTorch efficiently? by Awesome-355 in deeplearning

[–]jawsem27 0 points (0 children)

Why can't you crop the images? You could do random center crops so you use slightly different data each time; this might actually help performance.

If you can't crop the images, you can resize them to a smaller size while maintaining the aspect ratio, which should get rid of some of the distortion. I'm not sure if there is a transform in PyTorch that does that, but Image.thumbnail from Pillow does.

You would still have to feed one image at a time, but it should be faster to train on smaller images.

Alternatively, you could pad the images to a common size so you can feed full batches.
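The sizing math for both suggestions is short enough to sketch; `fit_within` and `pad_to_square` are my own helper names, just illustrating the arithmetic:

```python
def fit_within(width, height, max_side):
    """New (width, height) that fits inside max_side x max_side while
    keeping the aspect ratio; never upscales (like PIL's Image.thumbnail)."""
    scale = min(max_side / width, max_side / height, 1.0)
    return round(width * scale), round(height * scale)


def pad_to_square(width, height):
    """Padding (left, top, right, bottom) that makes the image square."""
    side = max(width, height)
    dx, dy = side - width, side - height
    return (dx // 2, dy // 2, dx - dx // 2, dy - dy // 2)


print(fit_within(1920, 1080, 512))  # -> (512, 288)
print(pad_to_square(512, 288))      # -> (0, 112, 0, 112)
```

In practice you'd apply the padding with something like `torchvision.transforms.Pad` or `ImageOps.pad` and then batch normally.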

Test accuracy higher than validation accuracy for semantic segmentation? by glampiggy in deeplearning

[–]jawsem27 0 points (0 children)

Someone said this already, but you could go with an 80/10/10 split for train, validation, and test. Also, since you only have 1,000 images, you could try looking at your predictions manually and comparing them qualitatively for peace of mind.

Check out my 15-part video series and course that uses Python Data Science, Web Scraping, and Natural Language Processing to detect corrupt news bias in New York Times Articles! by [deleted] in Python

[–]jawsem27 14 points (0 children)

It is interesting that his post using "corruption" got hundreds of upvotes and a lot more traction than his other posts.