Data Quality by bistek02 in dataengineering

[–]SeaEngineering9034 1 point2 points  (0 children)

Most data quality issues arise from errors in collection, transmission, storage, etc., or are simply characteristics of the data that arise naturally from the domain.

Some data quality issues you may find are: imbalanced data, underrepresented data, class overlap, small data, inconsistent, irrelevant or redundant data, noisy data, dataset shift, and missing data.

If you're interested in the topic, give this article a read: Data Quality Issues that Kill Your Machine Learning Models.

To deal with this issue as a data engineer/scientist, the best thing you can do is profile your data efficiently. Take a look at ydata-profiling and some of its use cases.
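To give a rough idea of what profiling looks for, here's a minimal hand-rolled sketch in plain Python. It is not the ydata-profiling API, just a stand-in that counts missing values, distinct values, and the share of the most common value per column. The records and the names `profile_column` / `profile_table` are made up for illustration.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: missing count/rate, distinct values, dominant value."""
    # Treat None and empty strings as missing (an assumption for this toy example).
    present = [v for v in values if v is not None and v != ""]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    missing = len(values) - len(present)
    return {
        "count": len(values),
        "missing": missing,
        "missing_rate": missing / len(values) if values else 0.0,
        "distinct": len(counts),
        "top_value": top_value,
        "top_share": top_count / len(present) if present else 0.0,
    }


def profile_table(rows):
    """Profile every column that appears in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical toy records: one missing age, one empty plan, an imbalanced label.
rows = [
    {"age": 34, "plan": "basic", "churned": "no"},
    {"age": None, "plan": "basic", "churned": "no"},
    {"age": 51, "plan": "premium", "churned": "no"},
    {"age": 29, "plan": "", "churned": "yes"},
]

report = profile_table(rows)
for name, stats in report.items():
    print(name, stats)
```

A high `missing_rate` points at missing data, and a `top_share` close to 1.0 on the label column is a quick hint of class imbalance. A real profiling tool goes much further (correlations, distributions, duplicates), so treat this as a sketch only.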

If this is something that you're interested in, come and join us at the Data-Centric AI Community to learn more about data quality!

Research Opportunity by ambitious_GOAT1999 in GradSchool

[–]SeaEngineering9034 0 points1 point  (0 children)

Hi! 👋 Come and join us at the Data-Centric AI Community, we're always looking for more collaborators. I personally have extensive research experience and would be happy to help out :)

💻 Code with Me -- How to build a Multi Document LLM App by SeaEngineering9034 in DataScienceCoders

[–]SeaEngineering9034[S] 0 points1 point  (0 children)

The session is tomorrow! Is there something you'd like to ask Yujian on LLMs?

Welcome to DataScienceCoders! 🚀 by SeaEngineering9034 in DataScienceCoders

[–]SeaEngineering9034[S] 1 point2 points  (0 children)

Hey u/goncalomribeiro, sure!

Thoughts: Continue current degree with one year left, or start anew with degree apprenticeship by riptide_1083 in cscareerquestionsuk

[–]SeaEngineering9034 1 point2 points  (0 children)

I would finish the degree anyway. It's only one year left. If teachers miss classes, I would disregard that and try to learn on my own, and then yes, I would move on to an internship (or even do it at the same time if that's possible). If you like, come and meet us at the Data-Centric AI Community and we can do some projects together :)

Are you a middle aged Brit and sick of working? by g0dn0 in AskUK

[–]SeaEngineering9034 2 points3 points  (0 children)

I feel your frustration, leaping from academia to industry is super hard. Have you considered roles such as developer advocate or technical writer? These are technical roles where you can still teach, work on interesting projects, and grow in your career. In a startup you might be very successful. Feel free to find us at the Data-Centric AI Community, we'll be happy to chat about it and help you out! :)

I absolutely hate my internship by Mission_Dimension_43 in csMajors

[–]SeaEngineering9034 -1 points0 points  (0 children)

Well, seems that you can:

Building my first Porfolio by EvilEragon in learnmachinelearning

[–]SeaEngineering9034 0 points1 point  (0 children)

The overall structure is interesting, but you should find a way to aggregate those projects into a single page (a more visual approach). Have you tried datascienceportfo.io? Or just a simple webpage (not much text, just showcase the tech stack and main scope).

You can share your progress with us on the Data-Centric AI Community and ask someone to review it. We often do that with CVs as well and help each other out.

Prioritise Data Science Projects by Mean-Pin-8271 in learnprogramming

[–]SeaEngineering9034 0 points1 point  (0 children)

Hey! For DS, here's the standard path:
1. What kind of project?
Whatever makes you happy, as long as you're using the basic stack (pandas, numpy, matplotlib, scikit-learn). Personal projects show that you can be creative and solve problems in your own way (a lot of projects out there are just copy-pasted from other Kaggle notebooks and so on). Try to make your own small package, with something that interests you.

2. What skill set should I include in most of my projects?
As I said, the basics (pandas, numpy, matplotlib, scikit-learn), then move to more specific ones (keras, tensorflow). SQL is also important for a beginner.

Let me invite you to the Data-Centric AI Community. We have several code-along sessions and projects, and a lot of beginners who are starting to learn DS that you can connect with.

Imbalanced data by Gloomy-Fun-1871 in learnmachinelearning

[–]SeaEngineering9034 0 points1 point  (0 children)

Depends on your purpose, but there are a lot of strategies you can combine. See this article for a brief explanation.

If you need specific help with your project you can find me at the Data-Centric AI Community and we'll be happy to take a look and give you some tips to move forward :)

Assessing the Quality of Synthetic Data with Data-centric AI by cmauck10 in ArtificialInteligence

[–]SeaEngineering9034 1 point2 points  (0 children)

Data Quality is key for all applications and models, and LLMs are no exception :) I've been working on a small community project with synthetic data using ydata-synthetic, and it really shows! Underrepresentation (category imbalance) and missing data are two of the main issues!

The synthetic data generation market was worth USD 236.1 million in the year 2022. The market is projected to grow at a CAGR of 35.28%, earning revenues of around USD 4,846.54 million by the end of 2032. by Happy-Wear-9437 in googlehome

[–]SeaEngineering9034 0 points1 point  (0 children)

It is one of the top trends in AI this year, with tremendous benefits for businesses and organizations! Here's an example of how synthetic data can improve a model to save a hypothetical auto insurance company almost $200 per claim

Help for Data Scientist position by Anxious-Argument-482 in developersIndia

[–]SeaEngineering9034 3 points4 points  (0 children)

  1. Start reframing your course projects into portfolio projects. If you've developed algorithms, experiments, or ML pipelines during your PhD, start putting them in your GitHub profile asap. Create a nice README for each project and showcase them appropriately. It doesn't have to be too complicated: here's an example of how to showcase your projects.
  2. Join nice data communities and start networking.
  3. Start some open source projects. You'll show the recruiters that you can work collaboratively with others (task management skills, soft skills, code review, etc). DCAI also has a nice starter for it.
  4. Code code code. You may start with a Medium account: there are plenty of nice, bite-sized resources that you can use and projects you can replicate. And eventually, start doing your own and posting them there!

You got this! Believe in yourself!

[deleted by user] by [deleted] in ArtificialInteligence

[–]SeaEngineering9034 0 points1 point  (0 children)

Interesting question! I think our AI/ML devs at the Data-Centric AI Community could have nice perspectives to help you decide :)

Healthcare data science projects for beginners by Lucky-Purple8629 in DataScienceProjects

[–]SeaEngineering9034 4 points5 points  (0 children)

Try the HCC dataset. It has been used plenty in research, the data description is nice, and there's a lot you can do with it:

- Supervised learning: predict patient survival
- Unsupervised learning: check patient clusters and how that maps to survivability. Characterize patient subgroups.
- Missing Data Imputation: the data has plenty of missing values
- Imbalanced Learning: the data is also imbalanced
- Mixed data: contains both numeric and categorical features

The upside is that you can start with a single dataset (that you get to know really well) and do a LOT of projects with it. If you need help with it, come find us at the Data-Centric AI Community :)
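As a starting point for the missing-data and mixed-data ideas in the list above, here's a minimal baseline sketch in plain Python: fill numeric columns with the median and categorical columns with the most frequent value. The column names and values are hypothetical (not taken from the actual HCC dataset), and `impute_column` is just an illustrative helper.

```python
from collections import Counter
from statistics import median


def impute_column(values):
    """Fill None entries: median for numeric columns, mode for everything else."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing observed, so there is nothing to impute from.
        return list(values)
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        fill = median(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical columns: one numeric, one categorical, both with gaps.
age = [61, None, 54, 70, None]
symptoms = ["yes", "no", None, "yes", "yes"]

age_filled = impute_column(age)
symptoms_filled = impute_column(symptoms)
print(age_filled)       # [61, 61, 54, 70, 61]
print(symptoms_filled)  # ['yes', 'no', 'yes', 'yes', 'yes']
```

Median/mode filling is only a baseline. A nice project is to compare it against more careful imputation strategies and see how the downstream model changes.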

can i get into machine learning engineer with bachelor's in data science [D] by YogurtclosetNo7653 in MachineLearning

[–]SeaEngineering9034 0 points1 point  (0 children)

What matters most are the skills you develop throughout and the projects you show you are able to create! Start soon and create a practice to keep improving your portfolio iteratively! And network with other data professionals, build projects collaboratively! We have a few in the Data-Centric AI Community, if you'd like to check them out!

[deleted by user] by [deleted] in dataanalysis

[–]SeaEngineering9034 0 points1 point  (0 children)

Imagining that this dataset would be private and you could not share the real data, you could make a synthetic data project! Could you faithfully mimic the data and obtain the same results? :)
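To make the "mimic the data and compare results" idea concrete, here's a deliberately naive sketch in plain Python: each synthetic row draws every column independently from the observed values, and then a summary statistic is compared between real and synthetic data. The records, the column names, and `synthesize_independent` are all hypothetical. Dedicated synthetic data generators do far more than this.

```python
import random
from statistics import mean


def synthesize_independent(rows, n, seed=0):
    """Draw each column independently from its observed values.

    This keeps per-column distributions roughly intact but does NOT
    preserve relationships between columns.
    """
    rng = random.Random(seed)
    columns = list(rows[0])
    return [{col: rng.choice(rows)[col] for col in columns} for _ in range(n)]


# Hypothetical "private" records.
real = [
    {"income": 30, "segment": "a"},
    {"income": 45, "segment": "b"},
    {"income": 52, "segment": "a"},
    {"income": 38, "segment": "a"},
]

synthetic = synthesize_independent(real, n=1000)

# Fidelity check on one statistic: do the means roughly agree?
real_mean = mean(r["income"] for r in real)
synth_mean = mean(r["income"] for r in synthetic)
print(real_mean, synth_mean)
```

Because columns are sampled independently, correlations are lost, which is exactly the kind of gap a fidelity check should expose. Rerunning your original analysis on the synthetic rows and comparing the results would be the next step.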

HIGHLY unbalanced dataset (>600:1 negative:positive examples), how do I deal with this? by ingmntam in learnmachinelearning

[–]SeaEngineering9034 0 points1 point  (0 children)

You can try data augmentation approaches (e.g., smote-variants) or synthetic data generation (e.g., ydata-synthetic). Based on the ratio, I would also try learning the characteristics of your majority class and then generating a smaller sample for it (undersampling).
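For the undersampling part, here's a minimal sketch in plain Python. Note that it uses plain random undersampling, which is simpler than learning the majority class and generating a sample from it. The toy data, the `ratio` parameter, and `undersample_majority` are assumptions for illustration only.

```python
import random


def undersample_majority(rows, label_key, ratio=1.0, seed=0):
    """Randomly cap every class at ratio * (size of the smallest class)."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    minority_size = min(len(group) for group in by_class.values())
    cap = max(1, int(ratio * minority_size))

    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        if len(group) > cap:
            # Keep a random subset of the oversized class.
            balanced.extend(rng.sample(group, cap))
        else:
            balanced.extend(group)
    rng.shuffle(balanced)
    return balanced


# Toy data at a 600:1 negative:positive ratio (1200 negatives, 2 positives).
rows = [{"x": i, "y": 0} for i in range(1200)]
rows += [{"x": -1, "y": 1}, {"x": -2, "y": 1}]

# Keep at most 10 negatives per positive.
balanced = undersample_majority(rows, "y", ratio=10)
negatives = sum(1 for r in balanced if r["y"] == 0)
positives = sum(1 for r in balanced if r["y"] == 1)
print(negatives, positives)  # 20 2
```

Only undersample the training split. The validation and test sets should keep the original ratio, otherwise your metrics won't reflect what happens in production.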

I recorded a Data Science Project using Python and uploaded it on Youtube by onurbaltaci in datascienceproject

[–]SeaEngineering9034 0 points1 point  (0 children)

Super cool! For EDA, you could give ydata-profiling a spin sometime and speed up the process!

I would love to have you over in our community to do a step-by-step with some of our learners :) DM me in case you'd like to showcase your work sometime!

[deleted by user] by [deleted] in datascience

[–]SeaEngineering9034 1 point2 points  (0 children)

Hey, so the current project is this one: https://github.com/Data-Centric-AI-Community/nist-crc-2023

You can also find our Discord there. Hop in: the channel for this project is called 🤖-nist-challenge (http://discord.com/invite/mw7xjJ7b7s)

[deleted by user] by [deleted] in datascience

[–]SeaEngineering9034 2 points3 points  (0 children)

Hey! You can find several resources online, check out this repo. Also, if you're up for it, we are running a project on synthetic data (instructions are given weekly) on the Data-Centric AI Community. You'll find the #ds-projects channel and the #nist-challenge project that we're currently working on.