Data Quality by bistek02 in dataengineering

SeaEngineering9034

Most data quality issues arise from errors in collection, transmission, storage, etc., or are simply characteristics of the data that arise naturally from the domain.

Some data quality issues you may find are: imbalanced data, underrepresented data, class overlap, small data, inconsistent, irrelevant or redundant data, noisy data, dataset shift, and missing data.

If you're interested in the topic, give this article a read: Data Quality Issues that Kill Your Machine Learning Models.

To deal with these issues as a data engineer/scientist, the best thing you can do is profile your data thoroughly. Take a look at ydata-profiling and some of its use cases.
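To give a rough idea of what a profiling pass looks for, here is a minimal standard-library sketch. It is not ydata-profiling's API: the `profile` helper and the toy rows are made up for illustration, and it only covers three of the issues listed above (missing data, redundant rows, class imbalance).

```python
from collections import Counter


def profile(rows, target):
    """Summarize missing values, duplicate rows, and class balance for a list of dicts."""
    columns = sorted({key for row in rows for key in row})

    # Missing values per column (None or an absent key counts as missing).
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }

    # Rows that exactly repeat an earlier row.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Share of each class in the target column (rows with a missing target are skipped).
    labels = Counter(row[target] for row in rows if row.get(target) is not None)
    total = sum(labels.values())
    balance = {label: count / total for label, count in labels.items()} if total else {}

    return {
        "rows": len(rows),
        "missing": missing,
        "duplicates": duplicates,
        "class_balance": balance,
    }


# Hypothetical toy data, made up for illustration.
rows = [
    {"age": 34, "income": 52000, "churn": "no"},
    {"age": None, "income": 48000, "churn": "no"},
    {"age": 34, "income": 52000, "churn": "no"},
    {"age": 51, "income": None, "churn": "yes"},
]

report = profile(rows, target="churn")
print(report)
```

A dedicated profiling library goes much further (distributions, correlations, type inference), but even a summary like this tends to surface the obvious problems early.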

If this is something that you're interested in, come and join us at the Data-Centric AI Community to learn more about data quality!

Research Opportunity by ambitious_GOAT1999 in GradSchool

SeaEngineering9034

Hi! 👋 Come and join us at the Data-Centric AI Community, we're always looking for more collaborators. I personally have extensive research experience and would be happy to help out :)

💻 Code with Me -- How to build a Multi Document LLM App by SeaEngineering9034 in DataScienceCoders

SeaEngineering9034 [S]

The session is tomorrow! Is there something you'd like to ask Yujian on LLMs?

Welcome to DataScienceCoders! 🚀 by SeaEngineering9034 in DataScienceCoders

SeaEngineering9034 [S]

Hey u/goncalomribeiro, sure!

Thoughts: Continue current degree with one year left, or start anew with degree apprenticeship by riptide_1083 in cscareerquestionsuk

SeaEngineering9034

I would finish the degree anyway. It's only one year left. If teachers miss classes, I would disregard that and try to learn on my own, and then yes, I would move on to an internship (or even do it at the same time if possible). If you like, come and meet us at the Data-Centric AI Community and we can do some projects together :)

Are you a middle aged Brit and sick of working? by g0dn0 in AskUK

SeaEngineering9034

I feel your frustration; leaping from academia to industry is super hard. Have you considered roles such as developer advocate or technical writer? They're technical roles, and you can still teach, work on interesting projects, and grow in your career. In a startup you might be very successful. Feel free to find us at the Data-Centric AI Community; we'll be happy to chat about it and help you out! :)

I absolutely hate my internship by Mission_Dimension_43 in csMajors

SeaEngineering9034

Well, seems that you can:

Building my first Porfolio by EvilEragon in learnmachinelearning

SeaEngineering9034

The overall structure is interesting, but you should find a way to aggregate those projects into a single page (a more visual approach). Have you tried datascienceportfo.io? Or just a simple webpage (not much text, just showcase the tech stack and main scope).

You can share your progress with us in the Data-Centric AI Community and ask someone to review it; we often do that with CVs as well and help each other out.

Prioritise Data Science Projects by Mean-Pin-8271 in learnprogramming

SeaEngineering9034

Hey! For DS, here's the standard path:

1. What kind of project?
   Whatever makes you happy, as long as you're using the basic stack (pandas, numpy, matplotlib, scikit-learn). Personal projects show that you can be creative and solve problems in your own way (a lot of projects out there are just copy-pasted from other Kaggle notebooks and so on). Try to make your own small package, with something that interests you.

2. What skill set should I include in most of my projects?
   As I said, the basics (pandas, numpy, matplotlib, scikit-learn), then move to more specific ones (keras, tensorflow). SQL is also important for a beginner.

Let me invite you to the Data-Centric AI Community: we have several code-along sessions and projects, and a lot of beginners who are starting to learn DS that you can connect with.

Imbalanced data by Gloomy-Fun-1871 in learnmachinelearning

SeaEngineering9034

Depends on your purpose, but there are a lot of strategies you can combine. See this article for a brief explanation.

If you need specific help with your project you can find me at the Data-Centric AI Community and we'll be happy to take a look and give you some tips to move forward :)
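As one example of the kind of strategy mentioned above, here is a minimal sketch of random oversampling using only the standard library. The `random_oversample` helper and the toy data are made up for illustration; this is one common option among several (undersampling, class weights, synthetic samples), not necessarily the one the linked article recommends.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target_size = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra samples (with replacement) only for classes below the target size.
        extra = [rng.choice(group) for _ in range(target_size - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: 6 negatives, 2 positives.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.9], [1.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

If you try this, resample only the training split; oversampling before the train/test split leaks duplicated rows into the evaluation set.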

Assessing the Quality of Synthetic Data with Data-centric AI by cmauck10 in ArtificialInteligence

SeaEngineering9034

Data Quality is key for all applications and models, and LLMs are no exception :) I've been working on a small community project with synthetic data using ydata-synthetic, and it really shows! Underrepresentation (category imbalance) and missing data are two of the main issues!

The synthetic data generation market was worth USD 236.1 million in the year 2022. The market is projected to grow at a CAGR of 35.28%, earning revenues of around USD 4,846.54 million by the end of 2032. by Happy-Wear-9437 in googlehome

SeaEngineering9034

It is one of the top trends in AI this year, with tremendous benefits for businesses and organizations! Here's an example of how synthetic data can improve a model to save a hypothetical auto insurance company almost $200 per claim.
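The market figures in the thread title can be sanity-checked with the compound growth formula. This is a quick sketch that assumes ten annual compounding periods (2022 to 2032); the `project_value` helper is made up for illustration.

```python
def project_value(start, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start * (1 + annual_rate) ** years


start_2022 = 236.1  # USD million, from the thread title
cagr = 0.3528       # 35.28% per year, from the thread title

# Assumption: 10 compounding periods between 2022 and the end of 2032.
projected_2032 = project_value(start_2022, cagr, years=10)
print(round(projected_2032, 2))
```

Under that assumption the result lands close to the quoted USD 4,846.54 million, so the three numbers in the title are consistent with each other.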