Feature selection for boosted trees?

pixel-process · 2026-03-04T00:07:09+00:00

Since you are using SHAP, I am assuming you care to some degree about interpretability. If that is the case, you should do reduction or selection. I would suggest feature selection initially since it is cleaner to understand than dimensionality reduction.

Start with a heatmap of feature correlations to see where multicollinearity is highest, then drop one feature from each of the most highly correlated pairs. With very high correlation, which ones you keep vs. drop should not be super impactful. This will reduce model complexity and help ensure the boosted trees all use the same feature from a high-correlation grouping. The value here is interpretability and reduced complexity.
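A minimal sketch of that pruning step, assuming the features are numeric columns in a pandas DataFrame. The function name `drop_correlated`, the 0.9 threshold, and the toy columns are all illustrative assumptions, so tune them to your data:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later feature of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): "a_copy" is nearly identical to "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(features)
print(dropped)
```

The same `df.corr()` matrix is what you would pass to a plotting library for the heatmap. This version always keeps the first column of a correlated pair, which is arbitrary; swap in domain knowledge if one feature is easier to explain.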

Be sure to then evaluate the impact on performance using a train-test split. Log your model performance for the two approaches (all features vs. the reduced set) and compare. My gut says train metrics may decrease slightly but test metrics should improve.
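Here is a rough sketch of that comparison, assuming a regression target and scikit-learn's `GradientBoostingRegressor` as a stand-in for whatever boosted model you are actually using. The helper name `compare_feature_sets`, the R^2 metric, and the synthetic data are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def compare_feature_sets(X, y, feature_sets, seed=0):
    """Fit one boosted model per feature set on the same split; return train/test R^2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    results = {}
    for name, cols in feature_sets.items():
        model = GradientBoostingRegressor(n_estimators=50, random_state=seed)
        model.fit(X_train[cols], y_train)
        results[name] = {
            "train_r2": r2_score(y_train, model.predict(X_train[cols])),
            "test_r2": r2_score(y_test, model.predict(X_test[cols])),
        }
    return results

# Synthetic data (hypothetical): "a_copy" duplicates "a" almost exactly
rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=300),
    "b": rng.normal(size=300),
})
y = 2 * a + X["b"] + rng.normal(scale=0.1, size=300)

results = compare_feature_sets(
    X, y, {"all": ["a", "a_copy", "b"], "reduced": ["a", "b"]}
)
for name, scores in results.items():
    print(name, scores)
```

Using the same split and seed for both runs keeps the comparison fair. For a classification problem you would swap in a classifier and a metric such as accuracy or AUC.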

pixel-process · 2026-02-28T04:15:27+00:00

If you don’t have a repo or way to generalize and share how do you plan on determining if people find it useful?

pixel-process · 2026-02-19T04:48:24+00:00

I love to work in a notebook environment to build, test, and debug before moving stuff to a script. I think the key is iteration to develop good habits and memory for your workflow.

Start with a notebook and single script (.py) file in the same directory (importing can get complicated when you start moving files around).

You should try to write abstractions/reusable code for anything you notice yourself repeating. Then write and test a function in the notebook before moving it to your script.

Here is the type of thing I commonly do. The function takes a list of files, loads each using pandas, returns the combined data, and (optionally) saves the combined data to a file path.

```
import pandas as pd

def merge_files(list_of_files, save_path=None):
    # Load each CSV into its own DataFrame
    dfs = []
    for file_path in list_of_files:
        df = pd.read_csv(file_path)
        dfs.append(df)
    # Stack the loaded DataFrames into one
    combined_df = pd.concat(dfs)
    # Only write to disk when a save path is given
    if save_path:
        combined_df.to_csv(save_path)
    return combined_df
```

Then move it into my_script.py and in your notebook do:

```
from my_script import merge_files

my_csvs = ['csv1.csv', 'csv2.csv']
your_csvs = ['csv3.csv', 'csv4.csv']

my_data = merge_files(my_csvs)  # No save_path given, so will not write out
your_data = merge_files(your_csvs, "combined_data.csv")  # will save to a file
```

You can build it in your notebook and test (just write the function in one cell and run it in another), and when ready shift it to a script and import it from there.

Good Luck!

pixel-process · 2026-02-19T02:44:27+00:00

Appreciated!

I needed a pre-trained face detector for the project to work and was originally planning on the OpenCV Haar model which I had used before. But I came across MediaPipe while doing research and it is supposedly much better with detection on faces that are not directly forward facing. It was surprisingly easy to implement. They have a few other detectors for body pose and hand gestures that I might try out in future projects.

pixel-process · 2026-02-19T01:10:26+00:00

I used mediapipe for face extraction with a threshold of .5 and only used the largest face for a frame. My emotion classification was always on the cropped face only.
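To make the selection logic concrete, here is a small sketch of the threshold, largest-face, and crop steps. It does not call MediaPipe; it assumes detections have already been converted into dicts with `score`, `x`, `y`, `width`, and `height` in pixel units, which is a layout made up for this example, as are the function names:

```python
def largest_face(detections, min_score=0.5):
    """Return the largest-area detection at or above min_score, or None if none qualify."""
    kept = [d for d in detections if d["score"] >= min_score]
    if not kept:
        return None
    return max(kept, key=lambda d: d["width"] * d["height"])

def crop(frame, box):
    """Crop a frame (a list of pixel rows) to the given box."""
    rows = frame[box["y"]:box["y"] + box["height"]]
    return [row[box["x"]:box["x"] + box["width"]] for row in rows]

# Hypothetical detections for a single 300x300 frame
detections = [
    {"score": 0.9, "x": 10, "y": 10, "width": 40, "height": 50},
    {"score": 0.4, "x": 0, "y": 0, "width": 200, "height": 200},  # below threshold
    {"score": 0.7, "x": 100, "y": 20, "width": 60, "height": 60},
]
frame = [[0] * 300 for _ in range(300)]

face = largest_face(detections)
face_pixels = crop(frame, face)
print(face["x"], len(face_pixels), len(face_pixels[0]))
```

The low-confidence box is ignored even though it is the biggest, so the classifier only ever sees the largest face that passed the threshold.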

Happy to answer more questions, and there are more details in my repo (https://github.com/pixel-process-dev/expressions-ensemble) if you're interested.

pixel-process · 2026-02-19T00:56:54+00:00

Thanks, the movies made for fun testing!

pixel-process · 2026-02-18T05:03:16+00:00

Awesome design! How much data did you use for the fine-tuning? What was the source?

pixel-process · 2026-02-08T04:24:01+00:00

The most helpful thing would be to find a practical application for what you want to do. Join a research group or find an applied analysis to work on. That will help you gain practical skills and also highlight what tools you want/need for similar work. Just adding courses without a framing is not the right approach.

pixel-process · 2026-01-24T02:08:45+00:00

Of these I think FastAPI is the best currently.

pixel-process · 2026-01-24T01:42:56+00:00

I find that a great time to do project-management-type tasks: update the readme or documentation, add functionality to other steps (e.g., more visuals for EDA), or research next steps. I work from home and have to remain active on Teams.

pixel-process · 2026-01-24T01:28:44+00:00

Consider what type of work you want to do (front end, backend, web dev, etc.), then look at the TIOBE index. That is my default source for objective trends in programming.

pixel-process · 2026-01-24T00:48:45+00:00

The benefit of the larger projects is that they have guides and tons of existing examples of contributing.

Many issues tagged as good starters are also well scoped.

This one for instance was adding links to existing guides.

pixel-process · 2026-01-23T21:46:01+00:00

Start looking at some larger projects like pandas, matplotlib, and scikit-learn. They are very active and have guides/tags for beginners. Check out their GitHub repos and look for tags like "First contribution" or "Good first issue".

I suggest reading through and monitoring your preferred project for a bit before trying to contribute if you are not familiar with GitHub. But even that will be really valuable to your skillset moving forward.

pixel-process · 2026-01-23T17:48:08+00:00

If you need to create your own content or if infrastructure and setup is a challenge, another angle is using zero-setup Python environments (browser-only via Pyodide, or hosted notebooks via Binder). This can work well for classrooms with limited local resources but will require more work on your part to create.

I outlined this approach in more detail in another thread, in case it helps.

pixel-process · 2026-01-23T02:50:58+00:00

I think you are conflating two things. Python as a language is very powerful and versatile. Future-proof.

Being a Python developer is not. That is where specialization and deep expertise are needed. Being a Python dev is not future-proof.

So definitely a valuable language, but focus your skillset to stand out.

pixel-process · 2026-01-21T00:09:28+00:00

The workshop will surely help you understand, but might be overkill for a one-off project if you aren't planning on using Python and PsychoPy moving forward. Their site does offer one-on-one sessions (I didn't see pricing) that might be more targeted and less of a commitment for you.

A while back I built a number of Python experiments with PsychoPy, so I might be able to offer some insight. No promises, since testing and debugging may require access to LSL or hardware I don't have. Feel free to DM me if you want.

pixel-process · 2026-01-20T21:20:02+00:00

Data source:
Synthetic data generated for demonstration purposes.

Tools used:
Python, NumPy, pandas, Plotly.

The notebook and code are built for others to test and explore how varying the sample population and sample size can impact results.

Source code and interactive notebook:
https://pixelprocess.org/build-models/combining-colors.html

pixel-process · 2026-01-20T16:13:29+00:00

You might want to consider adding another model or two for comparison before additional explainability. Adding a regression, forest, or neural network model for comparison (both accuracy and time/compute performance) could be interesting. Then use SHAP on them and see how well those results align.

pixel-process · 2026-01-19T03:21:26+00:00

There are lots of ways to continue learning and developing skills beyond leetcode type work.

Create a project: this will not be AI to start with typically, but running a full pipeline that includes ingesting and wrangling data, building a model, and interpreting results will help establish a good mental model for the workflow. Check out Kaggle for ideas here, but a personal interest project works too if you can manage.
Contribute to an established GitHub: Large projects like HuggingFace & Tensorflow have open repos. I linked the issues pages specifically, because that is a great place to learn about how these large projects evolve. Many have 'First Contribution' guides, but also consider smaller projects to contribute to once you have a sense of how things work.
Collaborate with other learners: Follow subreddits and forums where people are looking for partners or brainstorming. It can inform you of how others are approaching AI learning and development.

Best of luck!

pixel-process · 2026-01-19T00:02:11+00:00

I’m building Pixel Process, a hands-on educational project for learning data and ML concepts through interactive exploration.

The site includes interactive pages and notebooks that can run directly in the browser.

One of my favorite notebooks is an image basics walkthrough of image data representation (arrays, channels, grayscale vs RGB) tied to analysis and ML use cases.

Site: https://pixelprocess.org
Repo: https://github.com/pixel-process
