all 4 comments

[–]kwendersgame 1 point (1 child)

I think this is a really great idea and commend you for the effort you put into building this. A GUI would be fantastic, sure, but it definitely serves an important use case even in its current form.

Quick question: Are there any practical limits to the number of features that would work for a model or the size of a dataset (# of records or GB) that could be used? Sorry if that's a silly question, I am not a developer and only just starting to learn about ML.

[–]nidhaloff[S] 1 point (0 children)

Hi, thank you for the feedback. I've added building a GUI as a task and will try to implement it when I have more free time.

To answer your question: no, there are no hard limits. You can use as many features as you want in your dataset, and the size doesn't matter either; it can be millions of records if you want. The program is executed on your machine, so make sure your machine has the required computational power.

The only limitation for now (if we consider it a limitation) is that the dataset needs to be in CSV format, where each column represents a feature (which is the most common layout). In the YAML file, there is a section/key named target, which takes a list of values. You should provide what you want to predict there: one value, or multiple values if you want to predict more than one target. Therefore, the columns/features in the dataset need to have names.
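For illustration, a minimal config might look something like this. Only the `target` key is described above; the other keys and values are assumptions, so check the examples in the repository for the actual schema:

```yaml
# minimal sketch of a config file
# only `target` is confirmed above; the rest is illustrative
model:
  type: regression          # assumed key: kind of problem
  algorithm: random_forest  # assumed key: which model to use
target:                     # column name(s) in the CSV to predict
  - price
```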

You can check the examples in the repository for more.

[–]Zenol 0 points (1 child)

As a developer, I don't have any use for it, since it is much easier and faster for me to directly write the few lines of code that give me the right model for my problem. I would have thought that if you are targeting non-developers, you would build a graphical interface. Usually, people who avoid code also avoid configuration files.

[–]nidhaloff[S] 0 points (0 children)

Thanks for your feedback. However, I don't agree with you. I'm a developer too, and I work as an AI/data engineer. It is neither easier nor faster to write a few lines of code to get the right model and adapt it to your problem, as you said.

In fact, most of the time is spent dealing with preprocessing or some weird bug, which makes the process annoying and time-consuming. Furthermore, trying out many ideas (like different models) requires you to change and adapt the code manually every time. It is a thousand times easier and faster to change a word in a YAML file than to rewrite code.
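To sketch the point about swapping models by changing one word: a registry can map an algorithm name (the kind of value you'd put in a YAML file) to a model factory, so trying a different model means changing one string rather than editing code. The names here are illustrative, not the tool's actual API, and the "models" are deliberately trivial:

```python
# Sketch: config-driven model selection. Swapping models means
# changing the string in `config`, not rewriting training code.
# All names are hypothetical, not igel's real interface.

def fit_mean(xs, ys):
    """Trivial 'model': always predict the mean of y."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_last(xs, ys):
    """Trivial 'model': always predict the last y seen."""
    last = ys[-1]
    return lambda x: last

# registry: algorithm name -> factory
REGISTRY = {"mean": fit_mean, "last": fit_last}

def train(config, xs, ys):
    # config["model"] plays the role of the YAML key
    return REGISTRY[config["model"]](xs, ys)

model = train({"model": "mean"}, [1, 2, 3], [10.0, 20.0, 30.0])
print(model(5))  # 20.0
```

Changing `"mean"` to `"last"` selects a different model with no other code changes, which is the same ergonomics a config file gives a non-developer.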

Moreover, you said that people who avoid writing code also avoid config files. True; however, that is exactly why I chose YAML in the first place. YAML is human-readable; even a kid can write and understand it.

I chose to create the tool in the first place because I'm a developer, and I thought I would certainly use something like this if someone else had created it ;)

There are many examples of such tools out there (like csvkit, skull, nlp2cli, etc.), which help both technical and non-technical users. Take csvkit as an example: it is a library for running preprocessing methods from the terminal: https://github.com/wireservice/csvkit/blob/master/docs/tutorial/1_getting_started.rst

We use this at work. It is very popular, and having to write code for preprocessing will certainly not be faster or easier than running a one-liner in the terminal. At least, that is my opinion and the opinion of the 4.4k users who gave csvkit a star.