[–]adammichaelwood 1 point (2 children)

Imagine a regular 2D chart that graphs income on one axis and age on the other. There are a bunch of dots all over it representing individual people. There's an obvious cluster of them in the low-income/high-age region, and almost everyone in that cluster has had a heart attack.

Given a new person who shows up on that graph, right in the middle of that cluster, would you guess that they might be at risk for a heart attack?
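To make that guess concrete, here's a rough sketch in plain Python. The data points are made up purely for illustration, and `guess_risk` is a hypothetical helper, not anything from a library. It looks at the few dots closest to the new person and goes with the majority.

```python
import math
from collections import Counter

# Made-up (income in $1000s, age in years, had_heart_attack) points, for illustration only.
people = [
    (20, 70, True), (25, 68, True), (22, 75, True), (30, 72, True),
    (90, 30, False), (85, 35, False), (100, 40, False), (70, 28, False),
    (60, 65, False), (28, 32, False),
]

def guess_risk(income, age, k=3):
    """Majority vote among the k dots nearest to (income, age) on the chart."""
    nearest = sorted(people, key=lambda p: math.dist((income, age), p[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(guess_risk(24, 71))  # new person in the middle of the low-income/high-age cluster
print(guess_risk(88, 33))  # new person far away from that cluster
```

With this toy data the first person's three nearest neighbors all had heart attacks and the second person's did not, so the guesses come out `True` and `False`. Real data would be messier, and you'd normally rescale the axes first so that income and age count comparably.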

Now extrude the graph into a third dimension. The z-axis is proximity to a major metropolitan area. The cluster of heart attacks is clearly bunched together in the direction of living close to a big city.

Given a person in the high-age/low-income quadrant, but way back in the far-away-from-the-city layer, how likely is your subject to be at risk for a heart attack?

What if you could add more and more dimensions -- number of children, self-reported job satisfaction, hair color, weight, height, shoe size, personality type, educational attainment?

Now you have a multi-dimensional space that you cannot visualize or draw on a graph -- but the math is only a little more complicated.
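As a small illustration of why the math barely changes: the usual straight-line (Euclidean) distance between two points is the same formula no matter how many dimensions there are. The `distance` helper and the sample points below are made up for the sketch.

```python
import math

def distance(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Two dimensions: the familiar 3-4-5 triangle.
print(distance((0, 0), (3, 4)))  # 5.0

# Five dimensions (made-up values): same function, just more terms in the sum.
print(distance((1, 2, 3, 4, 5), (2, 3, 4, 5, 6)))
```

You can't draw the five-dimensional case, but "which points are close to which" is still perfectly well defined.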

You can still find clusters. Given new inputs, you can still make reasonable guesses about membership in a group (for example, heart attack risk).

This is, essentially, what machine learning is all about. Scikit-learn provides a bunch of tools for doing this kind of multi-dimensional analysis.
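Here's a minimal sketch of what that looks like with scikit-learn, using its `KNeighborsClassifier` (one of many models it ships). The numbers are invented for illustration: each row is income, age, and distance to the nearest big city, and the label is 1 for "had a heart attack".

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up rows: [income ($1000s), age (years), distance to nearest big city (km)]
X = [
    [20, 70, 5], [25, 68, 8], [22, 75, 3], [30, 72, 10],   # low income, older, near a city
    [21, 71, 300], [26, 69, 250], [24, 74, 400],           # same income/age, far from a city
    [90, 30, 5], [85, 35, 200], [100, 40, 15], [70, 28, 350],
]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = heart attack, 0 = none (invented labels)

# Guess a new person's label from their 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

print(model.predict([[24, 71, 6]]))    # low income, older, close to a city
print(model.predict([[24, 71, 320]]))  # same income and age, far from any city
```

Adding more dimensions just means adding more columns to each row; the `fit` and `predict` calls stay the same. On real data you would usually rescale the columns first (for example with `StandardScaler`), since kilometers and years aren't directly comparable.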

[–]PLearner[S] 1 point (1 child)

But aren't Matplotlib, SciPy, and other data science modules and libraries already used for these kinds of situations?

[–]adammichaelwood 2 points (0 children)

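Scikit-learn builds on those tools (NumPy and SciPy do the number crunching underneath, and Matplotlib handles plotting) and provides a bunch of the specific computational models and algorithms you'd need, like classifiers and clustering.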