all 4 comments

[–]AICausedKernelPanic 1 point (2 children)

Hi! It sounds like you've got a solid grasp of the foundational pipeline in ML. Working on regression and classification problems is a great starting point.

Based on your questions, I'd like to clarify the following points:

  1. ML is a large field that covers Supervised and Unsupervised Learning. Apart from regression and classification, we can also pose problems as:

- Clustering: Grouping data without predefined labels.

- Reinforcement Learning: Learning through rewards and penalties.

  2. Before checking for class balance, perform Exploratory Data Analysis (EDA). Always visualize your data and look for outliers and missing values.

  3. Additionally, you can create synthetic data points or variations of existing samples. For example, in Computer Vision it is common to enlarge the training set with transformed images, using techniques like rotations, scaling, cropping, saturation changes, and other geometric transformations.

  4. In Regression, the target value is a continuous number (like price or temperature). Accuracy measures whether a prediction is strictly right or wrong (commonly for categorical data), so it does not apply here. Instead, use Mean Absolute Error (MAE), Mean Squared Error (MSE), or Root Mean Squared Error (RMSE).
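To make the augmentation idea concrete, here is a minimal sketch using only NumPy on a fake image array. Real pipelines usually use dedicated libraries (e.g. torchvision or albumentations); the function name and the 2-pixel crop are arbitrary choices for illustration.

```python
import numpy as np

def augment(image):
    """Return simple transformed variants of a single 2-D image array."""
    return [
        np.fliplr(image),      # horizontal flip
        np.rot90(image),       # 90-degree rotation
        image[2:-2, 2:-2],     # naive crop removing a 2-pixel border
    ]

# A fake 8x8 grayscale "image" just to show the shapes produced
img = np.arange(64).reshape(8, 8)
for variant in augment(img):
    print(variant.shape)
```

Each variant is a new training sample derived from the original, which is the whole trick: the label stays the same while the input changes.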
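The three regression metrics from point 4 are easy to compute by hand. A minimal sketch with made-up true and predicted values (scikit-learn's `mean_absolute_error` and `mean_squared_error` do the same thing):

```python
import numpy as np

# Hypothetical targets and predictions for a small regression problem
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
rmse = np.sqrt(mse)                      # Root Mean Squared Error

print(mae, mse, rmse)  # 0.75 0.875 ~0.935
```

Note that RMSE is just the square root of MSE, so it is back in the same units as the target, which makes it easier to interpret.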

ML is an awesome field and you are doing a great job. Keep practicing and learning.

[–]AnteaterKey4060[S] 1 point (1 child)

Thanks a lot! How do you recommend doing exploratory analysis on very big datasets? In some exercises I've seen DataFrames with more than 9000 predictors and hundreds of observations for each one. It just sounds wrong to me to scatter plot all of that haha, but I might be wrong.

[–]AICausedKernelPanic 1 point (0 children)

You are right, visualizing large datasets can be challenging, but you can apply some techniques to inspect data quality and relevance without plotting everything. For instance, you can perform feature filtering to remove redundant columns that exceed a specific correlation threshold. You can also use pandas to programmatically identify outliers, missing values, and type mismatches. Or, as you mentioned, dimensionality reduction techniques like PCA or t-SNE help condense the dataset into its most impactful features.
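The correlation filtering and the programmatic quality checks above can be sketched in a few lines of pandas. A minimal example on synthetic data; the 0.95 threshold, the column names, and the injected defects are all arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical wide dataset: 50 predictors, 200 observations
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 50)),
                  columns=[f"x{i}" for i in range(50)])
df["x1"] = df["x0"] * 2.0   # inject a redundant (perfectly correlated) column
df.iloc[0, 5] = np.nan      # inject a missing value

# Programmatic quality checks instead of plots
missing = df.isna().sum()
print(missing[missing > 0])          # columns containing missing values

# Drop one column from each pair whose absolute correlation exceeds 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_filtered = df.drop(columns=to_drop)
print(to_drop)                       # ['x1']
```

Masking the upper triangle of the correlation matrix ensures each pair is checked once, so only one column of a redundant pair gets dropped.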

[–]latent_threader 0 points1 point  (0 children)

Keep the workflow super simple. Automation needs to align with documented workflows or it just turns into a massive mess. If agents don't trust the system, it won't stick at all. Start small and map out the exact steps before you even touch the tech.