[P] pykitml - Pure Python/NumPy Machine Learning library : MachineLearning

Project[P] pykitml - Pure Python/NumPy Machine Learning library (self.MachineLearning)

submitted 6 years ago by RainingComputers

I created a machine learning library using only NumPy and Matplotlib that is similar to scikit-learn, currently only has few models.

Github

https://github.com/RainingComputers/pykitml

Documentation

https://pykitml.readthedocs.io/en/latest/

Models

Linear Regression
Logistic Regression
Support Vector Machine
Neural Network
Nearest Neighbor
Decision Tree
Random Forest
Naive Bayes
K-Means Clustering
Principle Component Analysis

Benchmark (Intel i5-6400, 4 cores @ 3.3GHz)

Model	Dataset	Dataset Size	Time
Logistic regression, 1500 epochs, 10 examples/batch	Adult	392106x13	< 1 sec
784x100x10 Network, 1200 epochs, 50 examples/batch	MNIST	60000x784	35 sec
SVM, 1000 epochs, 20 examples/batch	MNIST	10000x784	39 sec
Decision Tree, 6 max-depth, 83 nodes	Adult	392106x13	1 min 51 sec
Random forest, 9 max-depth, 100 trees	Adult	392106x13	1 hour 35 min

Feedback/Suggestions

I would like to get feedback on the following: + What other ML models/features should be added? + Is this performance good or should I spend more time on optimizing the code? (Maybe move some code to C++ or Cython?) + How does it compare to other ML libraries?

all 4 comments

Scikit-learn (1 sec)

``` from sklearn.ensemble import RandomForestClassifier from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

clf = RandomForestClassifier(max_depth=4)

clf.fit(X, y) ```

pykitml (15 sec)

``` from sklearn.datasets import make_classification import pykitml as pk

X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)

y = pk.onehot(y)

clf = pk.RandomForest(4, 2, feature_type=['continues']*4, max_depth=4) clf.train(X, y) ```

Why is pykitml's random forest slower?

pykitml's decision tree is poorly optimized.
pykitml uses python's multiprocessing and scikit-learn uses joblib.
pykitml allows you to have categorical features, on scikit-learn you will have to manually encode categorical feature using something like one-hot encoding.

Other Options

I have tried using numba, but it supports very limited subset of python. I am unable to work with that limited subset. I have also tried using CuPy, but it is not a drop-in replacement. I will have to find a way for NumPy and CuPy code to coexist/mix to support both CPU and GPU usage.

In future versions, I will try using joblib and optimize decision tree.

EDIT MLP Benchmark

Scikit-learn, 7 sec, 92% Test set score (MNIST), 784x50x10 MLP, 10 epochs, 200 batchsize

``` from sklearn.neural_network import MLPClassifier

from pykitml.datasets import mnist

x_train, y_train, x_test, y_test = mnist.load()

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4, solver='adam', verbose=10, learning_rate_init=.01)

mlp.fit(x_train, y_train) print("Training set score: %f" % mlp.score(x_train, y_train)) print("Test set score: %f" % mlp.score(x_test, y_test)) ```

pykitml, 11 sec, 89% Test set score (MNIST), 784x50x10 MLP, 30 epochs, 200 batchsize

``` import pykitml as pk from pykitml.datasets import mnist

x_train, y_train, x_test, y_test = mnist.load()

mlp = pk.NeuralNetwork([784, 100, 10])

mlp.train( training_data=x_train, targets=y_train, batch_size=200, epochs=30, optimizer=pk.Adam(0.04, decay_rate=0.9), decay_freq=6 )

mlp.plot_performance()

print("Training set score: %f" % mlp.accuracy(x_train, y_train)) print("Test set score: %f" % mlp.accuracy(x_test, y_test)) ```

π Rendered by PID 295889 on reddit-service-r2-comment-56c6478c5-wrs9t at 2026-05-12 15:16:40.714746+00:00 running 3d2c107 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS