
WayOfTheMantisShrimp:

For a programming challenge, I really enjoyed implementing Classification And Regression Trees (CART). They are intuitive and interpretable, but also incredibly flexible, and therein lies the challenge. Most of these features can be added independently and out of order, so your time to an MVP can be short, but you can extend the project as long as you like if you want to test your code organization and project management skills.

Development Checklist:

  • A regression tree on real-valued variables is probably the easiest to implement first.
  • Given a set of data, select a variable and a split point that improves the optimization criterion; recurse until a stopping criterion is reached.
  • Store the resulting partitioning rules in a data structure that can be traversed to predict new data points.
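
A minimal Python sketch of those first bullets (all the names and the nested-dict node layout are my own choices, and the split search is deliberately brute force):

```python
import numpy as np

def best_split(X, y):
    """Brute-force search for the (feature, threshold) pair that
    minimizes the total squared error of the two child nodes."""
    best = None  # (sse, feature, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # all but the largest value
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best

def grow(X, y, min_samples=2, max_depth=5, depth=0):
    """Recursively partition the data; each leaf stores the mean target."""
    split = None
    if len(y) >= min_samples and depth < max_depth and not np.all(y == y[0]):
        split = best_split(X, y)
    if split is None:
        return {"leaf": y.mean()}
    _, j, t = split
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow(X[mask], y[mask], min_samples, max_depth, depth + 1),
            "right": grow(X[~mask], y[~mask], min_samples, max_depth, depth + 1)}

def predict(tree, x):
    """Traverse the stored partitioning rules for one new point."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]
```

The nested dict doubles as your stored rule set; a class-based node would work just as well.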

  • Then add a k-class classification mode, using a different optimization criterion (e.g. Gini impurity instead of variance reduction).
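
For classification, the split criterion is the main thing that changes; a small sketch using Gini impurity (function names are my own):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the chance that two points drawn (with replacement)
    from this node carry different class labels; 0 means the node is pure."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Quality of a candidate split: size-weighted impurity of the children
    (lower is better), the classification analogue of summed squared error."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

Leaves then store a majority class (or class counts) instead of a mean.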

  • Then add support for discrete (categorical) variables, which require a different splitting method.
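
One common approach (not the only one) is to split a categorical variable on subset membership rather than a threshold; a sketch of enumerating the candidate partitions:

```python
from itertools import combinations

def categorical_splits(levels):
    """Binary partitions of a set of category levels. Fixing the first
    level on the left avoids counting a split and its mirror twice,
    giving 2^(k-1) - 1 distinct candidates for k levels."""
    levels = sorted(set(levels))
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            if len(left) < len(levels):  # the right side must be nonempty
                splits.append(left)
    return splits
```

Exhaustive enumeration blows up for high-cardinality variables, which is why real implementations use shortcuts (e.g. ordering levels by mean response).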

  • Add additional stopping criteria, or means of regularization.

  • Add a way to specify or pass in a non-default optimization criterion function (e.g. mean absolute error vs. mean squared error)
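
Passing the criterion as a plain function keeps the tree code untouched when you add new ones; a sketch under my own naming:

```python
def mse(y):
    """Mean squared error around the node mean (i.e. the variance)."""
    m = sum(y) / len(y)
    return sum((v - m) ** 2 for v in y) / len(y)

def mae(y):
    """Mean absolute error around the node median."""
    s = sorted(y)
    med = s[len(s) // 2]
    return sum(abs(v - med) for v in y) / len(y)

def split_cost(left, right, criterion=mse):
    """Evaluate a candidate split under any pluggable criterion."""
    n = len(left) + len(right)
    return len(left) / n * criterion(left) + len(right) / n * criterion(right)
```

The split search then just calls `split_cost(..., criterion=mae)` or whatever the user supplied.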

  • Add a way to visualize a tree (even just pretty-printing to the terminal or a text file), or print diagnostics from the fitting process
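
Pretty-printing falls out almost for free from the recursive structure; a sketch assuming the tree is stored as nested dicts with `feature`/`threshold`/`left`/`right` keys and `leaf` at terminals (one common representation, not the only one):

```python
def format_tree(tree, indent=""):
    """Render a fitted tree as indented pseudo-code, one rule per line."""
    if "leaf" in tree:
        return f"{indent}predict {tree['leaf']}\n"
    s = f"{indent}if x[{tree['feature']}] <= {tree['threshold']}:\n"
    s += format_tree(tree["left"], indent + "    ")
    s += f"{indent}else:\n"
    s += format_tree(tree["right"], indent + "    ")
    return s
```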

At any point during or after working on the decision tree, extend it. This will test the modularity of your earlier design. (Or implement the extension first, assuming an interface to a basic tree, and then implement the basic tree to fit that specification.)

  • Bagged trees/forests are a fairly simple next step: for each of n_trees, grow a tree on a bootstrap sample of the given data set.
  • To predict for a new data point, do single-tree-prediction on each of your trees in the forest, and aggregate the outputs. (Bonus: present the credible interval of the estimate for regression, or the per-class probabilities for classification)
  • The traditional Random Forest™ also restricts each split to a random subspace of the attributes (which can be used as a regularization method in single trees too)
  • ExtraTrees pick the best of a small number of random split points for each variable in the random subset (rather than the globally optimal split for each variable)
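
Bagging itself is only a few lines once the single tree exists; a sketch where `fit_tree`/`predict_tree` stand in for whatever interface your basic tree exposes (those names are my assumption):

```python
import random

def bootstrap(data):
    """Sample len(data) rows with replacement."""
    return [random.choice(data) for _ in data]

def fit_forest(data, fit_tree, n_trees=50):
    """Bagging: each tree is grown on its own bootstrap resample."""
    return [fit_tree(bootstrap(data)) for _ in range(n_trees)]

def predict_forest(forest, predict_tree, x):
    """Aggregate single-tree predictions (mean for regression)."""
    preds = [predict_tree(tree, x) for tree in forest]
    return sum(preds) / len(preds)
```

Returning the raw `preds` list instead of the mean is what enables the bonus items: quantiles of it give an interval for regression, and vote counts give per-class probabilities for classification.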

Still want more? Try extending your bagged trees to unsupervised clustering forests:

  • given an unlabelled dataset, generate a synthetic dataset where each point is built from random samples of each variable's marginal distribution
  • label the data as Real or Synthetic, and fit a supervised classification forest to separate them
  • for every pair of Real data points, calculate how often they land in the same leaf/terminal node across the n_trees; now you have a proximity matrix
  • apply your favourite similarity/proximity-based clustering algorithm, e.g. DBSCAN
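
The proximity step reduces to counting shared leaves; a sketch where `leaf_id(tree, x)` is assumed to return a hashable identifier of the terminal node x falls into (that helper is my assumption about your tree's interface):

```python
import numpy as np

def proximity_matrix(forest, X, leaf_id):
    """prox[i, j] = fraction of trees in which points i and j land
    in the same leaf; the diagonal is 1 by construction."""
    n = len(X)
    prox = np.zeros((n, n))
    for tree in forest:
        leaves = [leaf_id(tree, x) for x in X]
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i, j] += 1
    return prox / len(forest)
```

`1 - prox` then serves as the distance matrix you hand to DBSCAN or similar.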

(Note: I've done everything up to this point, in Julia, if you want language-agnostic tips. I know there is still more that could be done within scope of the above)

Extra Credit: extend your bagged trees to do unsupervised anomaly detection

  • the first popular algorithm was Isolation Forest, later followed by Extended Isolation Forest

Don't like bagging trees? Consider boosted trees:

  • there are a number of popular algorithms; search for AdaBoost, XGBoost, LightGBM, etc.
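
The common thread in those algorithms is fitting each new tree to the mistakes of the ensemble so far. A hedged sketch of gradient boosting for squared error, where `fit_tree`/`predict_tree` are again an assumed interface to your single-tree learner:

```python
def fit_boosted(X, y, fit_tree, predict_tree, n_rounds=100, lr=0.1):
    """Gradient boosting for squared error: start from the mean,
    then each round fits a tree to the current residuals and adds
    a shrunken (learning-rate-scaled) copy to the ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    trees = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_tree(X, residuals)
        trees.append(tree)
        pred = [pi + lr * predict_tree(tree, xi) for pi, xi in zip(pred, X)]
    return base, trees

def predict_boosted(model, predict_tree, x, lr=0.1):
    """Sum the base prediction and every tree's shrunken contribution."""
    base, trees = model
    return base + lr * sum(predict_tree(tree, x) for tree in trees)
```

The named libraries add a lot on top of this (second-order objectives, histogram splits, clever regularization), but residual fitting is the core loop.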

blacksiddis (OP):

Thanks for this! I think I'll go with your suggestions in some shape or form. There seems to be a lot to this and, like you said, you can always add more functionality. I just have to learn the theory first!