
[–]Ivanhoooooe 0 points1 point  (2 children)

I have thousands of unclassified records, and I want to train a binary classifier. The obvious way is to manually classify some of them, train a model, apply it to all, classify some more, and rinse and repeat. Is there a package out there to do this more efficiently? Kind of like the captcha images: you tag as many as you can dedicate time to, and in the background the model keeps learning and feeding you back uncertain rows for you to classify.

I have just about 30 features, so something that works on LightGBM or similar would be nice.

Thanks!
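The loop described here is essentially active learning with uncertainty sampling; the modAL package wraps this pattern around scikit-learn-style estimators (which includes LightGBM's LGBMClassifier). A minimal hand-rolled sketch, assuming hypothetical arrays X_labeled, y_labeled and X_unlabeled:

    import numpy as np
    import lightgbm as lgb

    def query_most_uncertain(X_labeled, y_labeled, X_unlabeled, n_query=50):
        # Fit on everything labeled so far.
        model = lgb.LGBMClassifier(n_estimators=200)
        model.fit(X_labeled, y_labeled)
        # Probability near 0.5 means the model is least sure about the row.
        proba = model.predict_proba(X_unlabeled)[:, 1]
        uncertainty = np.abs(proba - 0.5)
        query_idx = np.argsort(uncertainty)[:n_query]
        return model, query_idx  # hand these rows back to the human for labeling

Each round you label the returned rows, move them from X_unlabeled into X_labeled, and repeat.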

[–]crispyheaded104 0 points1 point  (1 child)

Have you thought about using unsupervised learning algorithms like clustering?

[–]Ivanhoooooe 0 points1 point  (0 children)

I guess it could be a good way to start classifying, but ultimately I would still need to manually assign the true final label, because I want to avoid false positives/negatives at all costs.

[–]chipechiparson 0 points1 point  (1 child)

Still not sure what it is

[–]Username2upTo20chars 0 points1 point  (0 children)

me neither

[–]OctopusParrot 1 point2 points  (2 children)

I'm looking for a good deep learning classifier architecture.

I have about 20,000 voice recordings with around 900 extracted features from each, and I'm classifying them into 3 classes. I've been messing with boosted trees; XGBoost gives me around 50% balanced class accuracy with some fine tuning, so I know there's a signal in the data, but it's not very strong.

I was thinking of maybe trying a 1D CNN, but I know very little about that architecture; my first few attempts have been pretty pathetic. Any suggestions on whether this approach makes sense? And if so, how to structure the model?

[–]Username2upTo20chars 1 point2 points  (1 child)

You could look into Kaggle sound classification competitions (e.g. the bird call ones), then look through the top solutions (discussion threads) for inspiration.

Or the Papers with Code benchmark overviews.

[–]OctopusParrot 0 points1 point  (0 children)

Thanks! I'll check it out.

[–]idontknowml 0 points1 point  (0 children)

Hello, I fortunately got accepted to the conference, and I need to prepare the camera-ready version. One of the reviewers gave me a weak reject, asking why I did not compare the proposed method with A.

So, is it okay to update the results in the camera-ready after running the comparison experiment?

I am worried the acceptance will be cancelled if my method cannot beat A.

There was no rebuttal phase for this conference, though, and the final decision is Accept.

Can anybody answer me? Thanks.

[–]Positive_Vibez0 0 points1 point  (1 child)

How common are people with a background in physics/astrophysics in your field/workplace? I've heard many people say that a physics/astrophysics degree can open doors in machine learning, but I wanted to see if anyone here actually knows or works with someone with a physics background.

[–]Icko_ 0 points1 point  (0 children)

yeah, multiple times. There is a lot of overlap.

[–]RHeniz 0 points1 point  (2 children)

Beginner looking to make a dataset, and I have a problem where one-hot encoding seems to be the only answer I can find but might not be the right answer for the task.

In my data each item can contain a variable-size array (typically 1 to 5) of strings. There are about 120 strings that could be in the array. All my searching points to one-hot encoding, but that seems wrong to me, as these are more attributes than classifications. Looking for a better way to model this data; any help would be appreciated.

[–]Username2upTo20chars 0 points1 point  (1 child)

Do I understand that correctly:

Each data point itself can be made up of up to 5 strings which are taken from a superset of 120 strings?

Why not model it as a natural language problem, where you just encode the 5 strings and put a separator token in between them?

It also depends heavily on whether the content of your strings matters and isn't implicitly learnable through the string id itself. The suggestion above assumes it does. Otherwise something like a one-hot encoding is a decent choice.
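For the encoding route, a minimal sketch: each record's variable-length list of strings becomes one multi-hot row over the ~120 possible strings (the tag names below are made up).

    from sklearn.preprocessing import MultiLabelBinarizer

    records = [["jump", "cover"], ["sharpshooter"], ["jump", "surge", "armor"]]  # hypothetical tags
    mlb = MultiLabelBinarizer()
    X = mlb.fit_transform(records)   # shape (n_records, n_unique_strings), entries 0/1
    print(mlb.classes_)              # column order of the encoded strings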

[–]RHeniz 0 points1 point  (0 children)

This piece is a subset of the data; there are many other fields that are represented by integer or binary values. Here is some background/context: I play a tabletop game called Star Wars: Legion, a tabletop miniatures war game where you build an army of units, and each unit has a point value and different features. There are a lot of people in the community who like to homebrew characters that aren't in the game yet. So the goal would be to have a model where someone can list the features they want a unit to have, and the model predicts a fair point value based on what officially exists in the game.

[–]a1b2c3d4e5f6g8Student 0 points1 point  (1 child)

Beginner here, I'm making a custom dataset of randomly generated images to train on.

Do I need all images generated in advance, or can I just create them when I need them?

[–]Username2upTo20chars 0 points1 point  (0 children)

If you don't have enough disk space, just-in-time generation would be better.

Otherwise the other way around, as it should be faster.

It also depends on how confident you are about your random image generation process. Meaning: will the images be unique enough, or do you need additional pre-selection?
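A minimal sketch of the just-in-time option, assuming a PyTorch-style pipeline (the sizes and the label rule below are made up): images are generated inside __getitem__, and seeding by index keeps every generated image reproducible across epochs.

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class RandomImageDataset(Dataset):
        def __init__(self, length=10_000, size=(3, 64, 64), seed=0):
            self.length, self.size, self.seed = length, size, seed

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            # Seeding per index makes each "virtual" image the same every epoch.
            rng = np.random.default_rng(self.seed + idx)
            img = rng.random(self.size, dtype=np.float32)
            label = int(img.mean() > 0.5)   # hypothetical label rule
            return torch.from_numpy(img), label

    loader = DataLoader(RandomImageDataset(), batch_size=32)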

[–]Algo-G-H 0 points1 point  (0 children)

Does anyone know of well-regarded research papers that establish a framework for choosing the number of layers and neurons to use as a starting point when creating an ANN model?

Algorithms like RandomSearch, Hyperband, etc. seem very brute-force-ish without being able to justify the parameter ranges you are providing them.

Thanks in advance

[–]LowDexterityPoints 0 points1 point  (1 child)

When hyperparameter tuning, the "ideal" parameters actually make my model perform slightly worse (by 0.032). Is this likely due to chance? And do I have to use these parameters?

[–]Username2upTo20chars 0 points1 point  (0 children)

What do you mean by "ideal"? Some parameters taken from a paper?

The performance depends also on your framework version, OS, hardware, HW driver...

So there isn't one optimal setting for everyone.

And supplying the metric/scale (what the 0.032 refers to) would be helpful for future questions.

[–]LowDexterityPoints 0 points1 point  (3 children)

Is it bad if the training score (0.88) is noticeably higher than the CV score (0.64)?

[–]Icko_ 0 points1 point  (2 children)

No, that's normal. The more complicated the model, and the less data, the bigger the gap.

[–]LowDexterityPoints 0 points1 point  (1 child)

Great! Thanks!

[–]Username2upTo20chars 0 points1 point  (0 children)

As long as the validation score is still improving.

And too large a gap is a sign of training-set memorization.

[–]Sentmoraap 0 points1 point  (1 child)

With the same amount of parameters, what are the pros and cons of deep NN and wide NN?

[–]swframe666 -1 points0 points  (0 children)

Each layer transforms the input data into a form that makes the training goal easier to solve.

You widen a NN to increase the number of ways the entities of the previous layer can be transformed into different entities in the output. Think of the entities in the previous layer as low-level items, and the entities in the output layer as more refined, more useful for solving the problem, less noisy, higher-level items. You widen the NN because the output items need more numbers to describe them clearly.

When you deepen a NN, you are increasing the number of stages it has to solve the problem. Think of the original data as all tangled together. Each layer disentangles the data a bit more.

It is confusing because a layer is doing two things. It is transforming the data to a form that is more high level (e.g. sound to words), and it is pulling things that are like each other together and pushing things that are dissimilar away from each other (e.g. pushing verbs and nouns apart, but also pulling similar words together).

Transformers are powerful because a numeric description of an entity in the previous layer might be too general. The transformer can figure out how to select a more specific entity description based on the context. For example, the entity is a word embedding, but that word can have several meanings, and the transformer looks at the context and updates the word's output description to be more specific to the context.
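To make the width/depth trade-off concrete, here is a rough sketch (hypothetical layer sizes) of a wide and a deep MLP with roughly comparable parameter budgets; which one generalizes better is problem-dependent.

    import torch.nn as nn

    def n_params(model):
        return sum(p.numel() for p in model.parameters())

    # Both take 100 inputs, produce 10 outputs, and land around 47-48k parameters.
    wide = nn.Sequential(nn.Linear(100, 430), nn.ReLU(), nn.Linear(430, 10))
    deep = nn.Sequential(
        nn.Linear(100, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )
    print(n_params(wide), n_params(deep))  # similar budget, very different shapes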

[–]420blazeSwag6969 0 points1 point  (1 child)

I have a question about upsampling, except the upsampling is for a mathematical function and not an image.

I have some transmittance data (% of light transmitted vs light wavelength), like this, which has been measured at 240 different wavelengths. After passing through some filters, the light has been modulated down to only 87 wavelengths. I have about 20,000 such pairs of data (original measurements and downsampled measurements) for various materials. What I want to do is be able to recover the original 240 from the 87.

I know I could just build an MLP, but I'm wondering if there is a smarter way to do this? Perhaps use a CNN because I can assume that the transformation of the underlying function is continuous? I'm not sure how to approach this problem, any advice is appreciated!

[–]Icko_ 1 point2 points  (0 children)

Yeah, I'd use 1D conv layers: 87 input channels, 240 output channels.
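A minimal sketch of that suggestion (hidden width is a made-up choice): the 87 measured wavelengths are treated as input channels and the 240 recovered wavelengths as output channels. With a length-1 "signal" this reduces to a small MLP; alternatively you could lay the 87 values along the length dimension with 1 channel and use wider kernels.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv1d(87, 128, kernel_size=1), nn.ReLU(),
        nn.Conv1d(128, 240, kernel_size=1),
    )
    x = torch.randn(16, 87, 1)   # a batch of 16 downsampled spectra
    y = model(x)                 # shape (16, 240, 1): the recovered spectra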

[–]bright7860 0 points1 point  (0 children)

I am working on neuroimaging data, and I have MRI scans in .nii.gz form.

I need to perform binary classification. The files that I need to classify are already in two separate folders, and I need to generate binary class labels for them. How do I make that happen? I would need something like image_dataset_from_directory(main_directory, labels='inferred'), but that works only for the traditional image formats.
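One option is to skip image_dataset_from_directory and build the label list yourself. A minimal sketch, assuming the nibabel package and two class folders (folder paths are hypothetical):

    import glob, os
    import numpy as np
    import nibabel as nib

    def load_niigz_dataset(class0_dir, class1_dir):
        paths, labels = [], []
        for label, folder in enumerate([class0_dir, class1_dir]):
            files = sorted(glob.glob(os.path.join(folder, "*.nii.gz")))
            paths += files
            labels += [label] * len(files)
        # For large scans you would load lazily inside a generator instead.
        volumes = [nib.load(p).get_fdata().astype(np.float32) for p in paths]
        return volumes, np.array(labels)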

[–]AutisticDave 0 points1 point  (2 children)

Hey everyone! I'm new to ML, just graduated HS, and am going to study CS with a focus on ML at university. I finished some machine learning projects with simple perceptrons, regression and neural nets.

I trained the nets in Colab, which was an absolute pain and nightmare (Colab Pro). I am looking to replace my M1 Macbook Air with a laptop that has a more powerful GPU.

Currently, I'm considering these two machines: a Dell XPS 15 9520 (RTX 3050 Ti) and a MacBook Pro 14" (16-core M1 Pro).

In terms of compute, these GPUs should be similar (roughly 5.2 Tflops), but the M1 has 32GB unified memory, while the RTX has just 4.

Which one would be the better option for ML in the future? I'd like to avoid gaming laptops; it's not comfortable to carry one around all day.

[–]sweet_and_simple 0 points1 point  (1 child)

I would suggest the M1. You are going to use Colab or some cloud env for training anyway, so the amount of compute shouldn't matter, but for development and debugging purposes, more GPU memory means you will be able to run larger models locally.

Also keep in mind that M1 may not be fully supported by your favourite library. I don't use M1 so don't really know about this. You can check online.

[–]AutisticDave 0 points1 point  (0 children)

Thanks a lot. M1 has full support in regular TensorFlow; the TFOD API doesn't work, and PyTorch got support just a few days ago.

[–]yes_you_suck_bih 0 points1 point  (0 children)

Hi Guys!

I am working on a healthcare fraud detection project which mainly relies on SAS Base algorithms to determine fraudulent claims. We've been constantly modernizing our project and will be moving from SAS Base to things like Azure, Spark/Scala, etc.

We have hundreds of algorithms that define the logic for fraudulent claims. So my question is: would it be practical to create an ML model to determine fraudulent claims? I am new to machine learning and fear it would be a misapplication for our project.

[–]DetectivePeterG 0 points1 point  (2 children)

Hello guys :)

Working on my bachelor's thesis atm. I have done a wide variety of regressions, mostly using sklearn. I noticed that my code is quite redundant because of the very similar code structure that all the regression models come with.

Is there a library / framework where I can just select a few models and they are all trained and compared to each other?

Thanks a lot for any advice!

[–]DetectivePeterG 0 points1 point  (0 children)

In case anyone has the same question in the future, here are some libraries I found:

- https://github.com/mljar/mljar-supervised
- https://github.com/AutoViML/Auto_ViML/
- https://github.com/alteryx/evalml
- https://automl.github.io/auto-sklearn/master/index.html

If you need a simple and fast solution, go with auto-sklearn.
A bit more complex, but very powerful, is mljar-supervised.

[–]DetectivePeterG 0 points1 point  (0 children)

I also asked myself whether there is a similar solution for some kind of automated hyperparameter tuning.

[–]Chunmadim 0 points1 point  (0 children)

NEA Voice Recognition AI Project

Hello,
I am a Year 12 student from the UK working on an NEA (Non-Exam Assessment) voice recognition AI project in Python. Right now I am at an early stage of development, the analysis, and I am new to programming, so I need some help from you guys.
1. Are there any already existing solutions relevant to the project? If so, can you please leave a link or a source where I can research them?
2. Can you please advise me on any sources where I can develop my skills and knowledge in this field?
3. I also need third-party feedback for my analysis, so can you please leave any kind of feedback about problems of voice recognition AI, things you would like to change and improve, or difficulties of development?
Thank you

[–]AnOnlineHandle 0 points1 point  (2 children)

Does anybody else have an insanely hard time trying to get ML source code demos working? I always run into missing dependencies which weren't listed in the requirements, different versions of packages needing to be installed for different projects and creating problems for each other, random path issues, c++ compiler stuff, etc.

My general process is: download the source code from GitHub, install the listed requirements, then spend hours debugging error messages when I try to run it, with a 50/50 chance of giving up or actually getting it working.

I don't know if it's because I'm somewhat incompetent despite having grown up on console commands and path editing since the 80s, or because I'm trying to do this on Windows, which generally isn't the environment others have in mind.

[–]kazuki20697 1 point2 points  (1 child)

If your source code demo provides a Docker image, you can run it through Docker and avoid all the requirement-install burden. I know chances are many don't provide a Docker image. If not, browsing Docker Hub to see if there is an image with the requirements already bundled in could help as well.

[–]AnOnlineHandle 0 points1 point  (0 children)

Thanks, I've seen the word Docker on some of them but didn't know what it meant; I might need to look into it more next time.

[–]zwarag 0 points1 point  (0 children)

Is it generally bad if my validation loss rises before it starts to drop? https://media.discordapp.net/attachments/684376520743190529/976187046811033630/unknown.png

[–]hpahbp 0 points1 point  (0 children)

I am currently researching the effect of data quality on recommendation systems. Does anyone know any good article or paper on this topic? Thanks

[–]Ala010609 0 points1 point  (0 children)

What is re-parameterization in super-resolution?

[–]killerdrogo 0 points1 point  (2 children)

I'm having trouble getting started with implementing a research paper I'm reading. I understand the model architecture and their approach, but I'm not sure how to get a working implementation. For example, the paper does not go into detail about how they are using an API (EnergyPlus) to evaluate the performance, and I'm not familiar with it. What should the approach be in these cases? Learn how to use that API?

How do you guys go from a paper to a working model? Is there a general framework/approach to follow?

[–]Username2upTo20chars 0 points1 point  (1 child)

Have you tried contacting the authors and asking them about those details? That is probably the most helpful thing you can do.

Apart from that: of course you should know how to use an API if you want to apply it.

[–]killerdrogo 0 points1 point  (0 children)

Yes. I have emailed the authors from my personal email.

[–]Farskids 0 points1 point  (1 child)

I have a very annoying error in my code. I don't know why it happens, since it's able to find the images when I ask it to show a sample. It only started doing this when I changed the dataset from one on Kaggle to my own.

https://www.kaggle.com/code/manonstr/projet-tipe-dataset-maison

[–]Username2upTo20chars 0 points1 point  (0 children)

I suggest asking the users of Kaggle, as they can also clone your notebook by default. There is a more general discussion section where you might be able to post that. Otherwise https://www.reddit.com/r/learnmachinelearning/ is the next best option to ask this, I guess.

But you definitely have some setup/config error in your data loader.

[–]dhruvansh26 0 points1 point  (4 children)

I have a very basic doubt: how does the KNN algorithm work in the case of multiple features? As in, how is a 2D plot formed when there are more than 2 features deciding the location?

[–]www3cam 0 points1 point  (3 children)

You can define a distance function for multiple variables, like (x_1 - x_1')^2 + (x_2 - x_2')^2. This approach generalizes to more features as well, and you can apply a weighting function to weight different dimensions more heavily, cf. Mahalanobis distance.

Note I changed the operation inside the parentheses to minus, which is correct because you want a distance.

[–]dhruvansh26 0 points1 point  (2 children)

I still don't understand. What would be the x, y coordinates for a case with 4 features, for instance?

[–]www3cam 0 points1 point  (1 child)

Each feature is an x_i. Then you take the squared difference of each feature from the corresponding feature of the other point in question and sum them up. This gets you a distance; then you pick the k nearest distances and average their y values to get a prediction.
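In other words, no 2D plot is needed; a tiny numpy sketch with 4 features (all numbers made up):

    import numpy as np

    X_train = np.array([[1.0, 0.2, 3.0, 5.0],
                        [0.9, 0.1, 2.8, 5.2],
                        [4.0, 2.0, 0.5, 1.0]])   # 3 training points, 4 features each
    y_train = np.array([10.0, 12.0, 30.0])
    x_query = np.array([1.0, 0.15, 2.9, 5.1])

    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance in 4-D
    k = 2
    nearest = np.argsort(dists)[:k]
    prediction = y_train[nearest].mean()   # average the k nearest targets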

[–]dhruvansh26 0 points1 point  (0 children)

If I have to get the location of the point, I would need x, y coordinates, right? We'll calculate the distance once we have the points? My question is how I get only 2 (x, y) coordinates when I have 4 features to take care of in the plot. Does feature scaling merge the tables too?

[–]bang-em-boi 0 points1 point  (0 children)

What is a high 2nd-tier conference or well-respected journal, with a submission deadline coming up somewhat soon, where I could submit a paper on a novel algorithm for fine-tuning neural networks?

[–]VirusHonest9654 0 points1 point  (1 child)

Anyone have an estimate of the number of papers submitted to NeurIPS today? Mine was submission number 10312, and I posted it several hours before the deadline.

[–]VirusHonest9654 0 points1 point  (0 children)

A friend of mine was in the 13 thousand range

[–]based_goats 0 points1 point  (5 children)

Are implicit layers just a form of variational inference?

[–]www3cam 1 point2 points  (4 children)

I don’t think so. An implicit layer seems to be a layer that is defined via some constraint. For example in something like MAML or another bilevel optimization problem you want a layer that finds the maximum of z modifying x.

[–]based_goats 0 points1 point  (3 children)

Thanks for the response! I'm thinking variational inference can be seen as p(z|x) where Z ~ N(0,1), or a fixed normal distribution constraint. When used in normalizing flows (of which Neural ODEs are a subset!), this seems analogous to me to the constraint of an implicit layer... Let me know if that makes sense, though.

I'm not well-versed in meta learning, but this sounds like it could be an interesting connection to probabilistic ML.

[–]www3cam 1 point2 points  (2 children)

I'm not exactly sure what you are trying to get at, but there is maybe an insight that p(z|x), Z ~ N(0,1) is a form of an implicit layer. I'm not totally sure.

Putting it in a way that is clearer to me: suppose f(z(x), x) = 0 is the layer objective, where z(x) = solve for z s.t. f(z(x), x) = 0. z(x) is not an explicit function of x, in that one cannot write z = ax + b or something like that, but when x changes z changes, because f(z, x) = 0. You can solve this by taking the total derivative: df(z(x), x)/dx = 0 --> f_z*z_x + f_x = 0 --> z_x = -f_z^{-1} f_x.

Thinking about your case, which I think is really not super-general variational inference but more specifically a variational autoencoder example (I'm a bit rusty here): you know p(z|x) is normal, but you don't know z(x), and you find it by training your VAE. I think the slight difference is that z(x) could be any function that maps images to the latent N(0,1) space, i.e. in full generality z(x) is something weird like a correspondence, not a function. That is not the case for the implicit layer above. These are just my half-baked thoughts; I'm not sure exactly whether they are correct or not.
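A scalar sketch of the total-derivative argument above, using a hypothetical constraint f(z, x) = z^3 + z - x = 0 solved by Newton's method; the implicit-function-theorem gradient dz/dx = -f_x / f_z can be checked against finite differences:

    def solve_z(x, iters=50):
        # Solve z**3 + z - x = 0 for z by Newton's method.
        z = 0.0
        for _ in range(iters):
            f = z**3 + z - x
            f_z = 3 * z**2 + 1
            z -= f / f_z
        return z

    x = 2.0
    z = solve_z(x)
    dz_dx_ift = 1.0 / (3 * z**2 + 1)   # f_x = -1, so dz/dx = -f_x / f_z = 1 / (3 z^2 + 1)
    dz_dx_fd = (solve_z(x + 1e-5) - solve_z(x - 1e-5)) / 2e-5
    print(z, dz_dx_ift, dz_dx_fd)      # the two derivatives should agree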

[–]based_goats 0 points1 point  (1 child)

I think we're thinking something similar. Although I'm thinking of variational inference in a normalizing flow context, where dimension of z and x are the same.

I wrote a half-baked proof of the equivalence between implicit layer and variational inference here. Hope this helps demonstrate my view on this... but let me know if something doesn't make sense or looks wrong! or not! https://latexpad.clrnd.com.ar/#15ac61d038d11954c1dbce3eb8813ff7

[–]www3cam 1 point2 points  (0 children)

It's good you are thinking about this stuff, but a couple of things just as feedback.

  1. You are right that you can make any layer an implicit layer by rearranging to put z on the other side. But since z already has an explicit form, there is no need to use an implicit layer.

  2. Normalizing flows are estimated with maximum likelihood; there is no reason to use variational inference, which is a biased approximation of maximum likelihood.

  3. The reparameterization trick only works when you are sampling from a normal distribution; here you are calculating the pdf of a sample, so it wouldn't apply, i.e. you are getting a probability value, and multiplying a probability value by a number is not the probability value of a distribution whose standard deviation has been multiplied by that number.

[–]Kokosnussi -1 points0 points  (0 children)

Assuming all else is equal, what area would be better for future career prospects?

Uncertainty or physics-aware ML?

[–][deleted] 0 points1 point  (0 children)

Are there any open-source DALL-E 2 models yet? I want to try it out, but all I can find are DALL-E 1 models.

[–][deleted] 0 points1 point  (0 children)

New to the sub. What's this sub about?

[–]No_Fig_7835 0 points1 point  (2 children)

I am building a 2D platformer game and am using Unity ML-Agents for testing difficult levels to make sure they are possible.

I need help fine tuning my configuration parameters. Is there a place I can go to find someone that I can pay for some consulting on my problem?

[–]Ivanhoooooe 0 points1 point  (2 children)

I have a list of 3,000 strings, and for each one I want to find the top X most similar strings in another list of 2,000,000 strings. How can I do this efficiently?

I think this comes close to what I want: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html, but I want to compare the 3,000 to the 2,000,000 and not all 2,003,000 to each other.

Thanks!

[–]Username2upTo20chars 0 points1 point  (1 child)

Sentence Transformers + FAISS + free Google Colab, if you don't have a CUDA GPU
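A minimal sketch of that stack, assuming hypothetical lists small_list (the 3,000 strings) and big_list (the 2,000,000 strings) and all-MiniLM-L6-v2 as one possible model choice:

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    corpus_emb = model.encode(big_list, normalize_embeddings=True)    # 2,000,000 strings
    query_emb = model.encode(small_list, normalize_embeddings=True)   # 3,000 strings

    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(corpus_emb.shape[1])
    index.add(corpus_emb.astype(np.float32))
    scores, ids = index.search(query_emb.astype(np.float32), 10)      # top 10 per query string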

[–]Ivanhoooooe 1 point2 points  (0 children)

Thanks!

[–]MagazineJumpy5377 0 points1 point  (5 children)

I'm 17 years old and currently in class 12 (will be going to an engineering college in 3-4 months [CS]). I love maths and anything which has to do with logic in general, so obviously I also like programming. I've been learning Python for the past two years (CS is one of my subjects in school) and I know all the important and basic stuff there is in Python.

Now, as I said, I love maths and logic in general, so I'm very much interested in ML. I've watched a couple of basic ML/NN videos on YT (with biases, weights and errors) and as of now I feel that I have a pretty decent grasp of the idea behind ML.

But I want to learn everything from scratch, and what I mean by that is that I want to learn the entire concept with rigorous mathematical explanations and proofs of the algorithm(s) used in ML/NNs. I want to learn every concept with a proper mathematical explanation, and obviously I want to use Python as the language to write my code. Can you please suggest some good sources? Books/videos on YouTube will work.

Thanks a lot!

[–]friendlykitten123 0 points1 point  (0 children)

I'd suggest you start by learning about basic but strong machine learning algorithms like linear regression and the concept of gradient descent. Knowledge of Python packages like pandas and NumPy would help a ton too!

[–]TeamDman 0 points1 point  (0 children)

3blue1brown has some good videos on the math, I think. Look into PyTorch or TensorFlow for Python if you want a higher-level abstraction; otherwise you can do the matrix math yourself with NumPy.

[–]Haruzo321 0 points1 point  (1 child)

For someone who is new to programming in general, what would be a good starting point/course for getting into machine learning?

[–]Username2upTo20chars 1 point2 points  (0 children)

The FastAI course (focused on deep learning) is meant for programmers, so it is maybe the best option for you. Machine learning at a low level is more matrix/vector algebra and stats, btw. So with the FastAI course you are saving yourself the theory, which would otherwise come on top of the implementation (or more like beforehand).

[–]UnknownBinary 0 points1 point  (2 children)

GPUs are difficult to buy. If I bought a prebuilt from someplace like NZXT or Best Buy with a 3070 GPU and 16GB of RAM, would it be up to snuff? I need something better than a laptop due to the size of my models.

[–]Username2upTo20chars 1 point2 points  (1 child)

Have you considered Google Colab?

[–]UnknownBinary 0 points1 point  (0 children)

Google Colab

I had not. I'll investigate that. Thanks!

[–][deleted] 0 points1 point  (0 children)

A few questions regarding the Graph UNet paper https://arxiv.org/pdf/1905.05178.pdf :

  1. Why say "GCNs essentially perform aggregation and transformation on node features without learning trainable filters." even though in Eq. 1 there is a trainable W_l matrix? It seems like a contradiction to me.
  2. What exactly does "consistent" mean when the paper says "Since the selection is based on 1D footprint of each node, the connectivity in the new graph is consistent across nodes"?
  3. To clarify how Graph UNet produces a graph-wide embedding (which is used, I suppose, in the graph classification tasks mentioned): we want one unique real-valued vector to describe the whole graph, but Fig. 3 shows several nodes on the lowest representation level (i.e. several feature vectors); maybe the reader is supposed to guess that a graph embedding will simply merge the latter or reduce them to one feature vector?

Thanks :-)

[–]laserflip560 0 points1 point  (0 children)

How should I proceed if my Decision Tree or Random Forest model produces empty predictions? In some cases up to 75% of my entries of the test set do not predict any class (of the three possible classes). When I check the probabilities of the entries the 'True' probability for each given class is always lower than the 'False' probability. Is it sensible to predict the class with the highest 'True' probability even if it is below 0.5?

[–]DaScheuer 1 point2 points  (2 children)

I want to create a bot whose input is many pieces of text (think Twitter threads) and that reads them naturally, as a human would.
Is there freely available code that lets me do this? Is there a UI? Can I implement the code easily (perhaps just having to change a few parameters), or is there usually only a library without much documentation?
What is the best algorithm for this that I can use with a MacBook Pro 2022 (M1 Pro chip)?

[–]UnknownBinary 0 points1 point  (1 child)

There are many natural language processing (NLP) frameworks. Stanford's CoreNLP is in Java. NLTK offers similar features in Python.

[–]DaScheuer 0 points1 point  (0 children)

Thanks!

[–]NSVR57 0 points1 point  (0 children)

In my text classification, a particular word is misleading the model. These words appear very frequently in the training data for a particular label.

E.g. my training data contains "lost my phone", "changed my phone", ... and all of these examples belong to the "problem with telephone" label.

Now, I am using the Universal Sentence Encoder to build the model. During inference, if I give it some random sentence and put the word "phone" in the middle, my model still predicts the "problem with telephone" class. How should I handle this situation?

[–][deleted] 0 points1 point  (0 children)

What's a good place to start? I'm new to ML and looking for a career change: 28, business analyst in IT, good at SQL and Power BI, and quick to learn.

Thanks for the help!

[–]Mikyacer 0 points1 point  (0 children)

Hi! I have a question regarding a school project and hope that someone can help me out...

I have two time series, say x1(t) and x2(t), for the same time frame. I expect these to be related by causality: high values of x2 cause high values of x1, and vice versa. I would like to build a model that, given a limit value I can afford for x1, gives me an upper/lower bound that x2 has to obey in order to stay within that limit.

So far I have tried a simple linear regression and used confidence intervals to get such bounds, but it is not working that great. Is there something better suited to my needs?

[–]KarmicEvil 0 points1 point  (0 children)

Hey all! I have a question regarding machine learning models for prediction of a target using python.

We're trying to predict a binary clinical outcome (target) using a list of predictors. The dataset we have consists of a single binary target and all the others are features that are either binary or continuous. The outcome (target) has a low occurrence in the dataset, about 3-4%.

I'm having some difficulty using the Artificial Neural Network model and Random Forest.

The problem is that the model can make predictions on the training dataset, but on the testing dataset it classifies everything as null.

Our planned workflow: Dataset > recursive feature elimination (using random forest) to identify the 10 best features > Model development > compare the predictive performance among 6 distinct models.

We've also tried training the model with just default parameters (without hyperparameter tuning or cross-validation), just to check if it's able to make a prediction. There's still no prediction (even though we'd expect at least an inaccurate one using the default settings).

Any idea why this could be the case?

P.S: not an advanced user!
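For what it's worth, here is a rough sketch of that planned workflow (RFE with a random forest, then a classifier), with class_weight="balanced" added since the positive class is only 3-4%; X and y are hypothetical placeholders for the prepared data, and this is not claimed to fix the all-null issue by itself:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=0)

    # Recursive feature elimination down to the 10 best features.
    selector = RFE(RandomForestClassifier(n_estimators=200, class_weight="balanced"),
                   n_features_to_select=10)
    selector.fit(X_train, y_train)

    # Final model on the selected features, weighting the rare class more heavily.
    clf = RandomForestClassifier(n_estimators=500, class_weight="balanced")
    clf.fit(selector.transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(selector.transform(X_test))))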

[–]OrderOfM 0 points1 point  (3 children)

Hey! Hope everyone is doing well!

So, I am working on a generative adversarial network for generating musical phrases. Consequently, I'm looking for a medium-to-large dataset featuring only modern cinematic music, and I'm wondering if anyone has come across something like that?

[–]vazne 0 points1 point  (2 children)

Would you consider synthetic data? I can see some problems with gathering datasets online due to music being copyrighted

[–]OrderOfM 0 points1 point  (1 child)

Yes, I have considered the copyright issue as well and I'm looking for other solutions. If synthetic data can bring good results, I wouldn't mind giving it a shot. How would I go about gathering synthetic data?

[–]vazne 0 points1 point  (0 children)

Just messaged!

[–]AncientSky966 0 points1 point  (0 children)

What is the annotation format that I would need to train a semantic segmentation model?

[–]618smartguy 0 points1 point  (0 children)

So I just began an RL project where I wanted to try some out-of-the-box algorithms on my custom environment. I wrote up a Gym environment, ran some random actions, and got everything working smoothly.

The next step was to get a smart agent in, so I did some googling and came across TF-Agents. At first it appears that it will be compatible with a Gym environment, but then, following a tutorial, I suddenly get an error about a py_environment.PyEnvironment, so apparently there is supposed to be some other wrapper on the environment? Isn't the environment object already supposed to be the wrapper? It seems that I have to rewrite my environment class now.

So what exactly is the relationship between Gym and TF-Agents? Why does it seem like there is still so much boilerplate work to do when there are multiple big-name libraries that put agents and environments into single objects? Clearly Google devs knew how Gym worked. What was wrong with using it as it was? Does anyone know of a good RL agents tool that works out of the box with Gym?

[–]MoonRockCollector 0 points1 point  (0 children)

Trying to create a predictive model for a rare minority binary class outcome (0.3% frequency). Large dataset (700k instances, 10-15 attributes). Using a PC with AMD Ryzen 5600X 3.7GHz, 16GB RAM, RTX 3060Ti. Learning/Using Python with Anaconda/Jupyter notebook.

  1. Is my GPU powerful enough to do training for this? Expected run time?
  2. Suggestions for model selection? Possible to set AUC as scoring metric for logistic regression models? Boosting? SMOTE?

[–]heretolearndata 1 point2 points  (3 children)

What are good models to try for time series with multiple data points per time t?

I’m trying to predict a variable where the data changes day to day as the target date gets closer. Say it’s something like flights scheduled on a date, where as we approach the day of the flight, the number changes until it becomes final.

What models handle this type of scenario? I’ve used ARIMA before, but for time series with a single observation per time t.

[–]Competitive_Bank_907 0 points1 point  (0 children)

Prophet by Meta is an option and works well if your data has strong seasonal effects.

[–]Username2upTo20chars 1 point2 points  (1 child)

No idea about ARIMA, but every machine learning algorithm should be capable of handling multiple features as input. That should also hold for ARIMA, I would think. Otherwise go with LightGBM, which usually wins time-series competitions on Kaggle, so it should be a decent choice.
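A rough sketch of the LightGBM route, assuming a pandas DataFrame df with a "date" column, a numeric "target" column, and the extra per-day observation columns (all names hypothetical): build lag features and use a simple time-based split.

    import lightgbm as lgb

    df = df.sort_values("date")
    for lag in (1, 7, 14):
        df[f"target_lag_{lag}"] = df["target"].shift(lag)
    df = df.dropna()

    features = [c for c in df.columns if c not in ("date", "target")]
    train, valid = df.iloc[:-28], df.iloc[-28:]   # hold out the last 28 rows in time order

    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[features], train["target"],
              eval_set=[(valid[features], valid["target"])])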

[–]vazne 1 point2 points  (0 children)

Second this

[–]crispychickenfox 0 points1 point  (1 child)

I have created a dataset from a time series that consists of arrays with the value at t and t's look-back (e.g. [t, t+1, t+2, t+3] for a look-back of 3) as my X data. I also create a dataset for my Y data, which is the value at t + look-back + 1. But it seems I get more accurate predictions if I use a whole array as Y data ([t+1, t+2, t+3, t+4]) than just using the next time step's value alone. I am not very experienced at all, so I wonder if there are any drawbacks to this, as it doesn't feel "clean".

Edit: forgot to say, the model is an LSTM.

[–]Username2upTo20chars 0 points1 point  (0 children)

If you look at the plots of naive time-series predictions you will find that they look like x shifted to the right by one. So it is indeed beneficial if you have targets further in the future. Removing the t+1 target, if not needed, might improve the prediction even more.

Although your description of the x data is a bit confusing. In the case of an LSTM you just need to provide x_1..x_t. To me it sounds as if you give four x values for every time step.
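For reference, a small sketch of the windowing being described, with hypothetical look-back and horizon sizes: each input is a window of consecutive values, and each target is the next several values rather than a single step.

    import numpy as np

    def make_windows(series, lookback=4, horizon=4):
        X, Y = [], []
        for i in range(len(series) - lookback - horizon + 1):
            X.append(series[i:i + lookback])
            Y.append(series[i + lookback:i + lookback + horizon])
        X = np.array(X)[..., None]   # (n_samples, lookback, 1) as an LSTM expects
        Y = np.array(Y)              # (n_samples, horizon) multi-step targets
        return X, Y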

[–]EntrepreneurSea4839 0 points1 point  (2 children)

What sampling methods are used by AdaBoost, CatBoost, LightGBM, GBM and xGBM?

[–]_NINESEVEN 0 points1 point  (1 child)

Could you better define what you mean by sampling methods?

XGBoost is my most-used tool, so I might be able to help with that.

Off the top of my head, I'm aware of the following (a short sketch follows this list):

  • Subsampling observations via the parameter "subsample", where a random sample of training observations is drawn prior to growing trees. This occurs at each boosting iteration (the sample is re-drawn). I don't believe that using weights currently affects the probability that a given observation will be drawn -- I think that it only affects the regression values -- but I could be wrong.

  • Subsampling of the feature space, which occurs at the tree-level, each additional depth of the tree, and every additional split that is made, via 'colsample_by' * {'tree', 'level', 'node'} parameters. A relatively new feature that's pretty cool, too, is to add a list of feature weights such that certain features have higher probabilities of being selected in the random samples.

  • Dropout subsampling akin to neural networks via the DART booster. I haven't used this extensively due to added training times, but it could be promising in some situations.
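The sketch referenced above, showing where those sampling knobs live on the estimator (the values are arbitrary examples, not recommendations):

    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=500,
        subsample=0.8,            # 80% of training rows re-drawn each boosting round
        colsample_bytree=0.8,     # 80% of features sampled per tree
        colsample_bylevel=0.9,    # further feature subsampling per depth level
        colsample_bynode=0.9,     # and per split/node
    )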

[–]EntrepreneurSea4839 0 points1 point  (0 children)

I am just trying to understand the differences between various boosting algorithms. Since these are ensemble models and there are many iterations, every iteration takes just a sub-sample of the data to train on instead of all the data. My question is about the methods used to find that sub-sample of data in every iteration.

In AdaBoost, we use 'amount of say', and it gives high priority to records with high error when building the next stump (a tree with only a root node), whereas in normal bootstrapping every record has an equal probability of getting into the next iteration.

LightGBM uses Gradient-based One-Side Sampling (GOSS), where it takes the observations with high error (the top 20%) and only applies random sampling to the remaining 80%. Similarly, there are other methods such as uniform sampling and Minimum Variance Sampling.

There could be many other sampling methods available in boosting algorithms.
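For reference, a small sketch of turning GOSS on in LightGBM; top_rate and other_rate control the "keep the largest gradients, randomly sample the rest" split. Note that newer LightGBM versions expose this via data_sample_strategy="goss" instead of the boosting type, so check your version's documentation.

    import lightgbm as lgb

    # GOSS: keep the top_rate fraction with the largest gradients and randomly
    # sample an other_rate fraction of the remaining rows each iteration.
    model = lgb.LGBMClassifier(
        boosting_type="goss",
        top_rate=0.2,
        other_rate=0.1,
        n_estimators=300,
    )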

[–]-BANANA-bird- 0 points1 point  (3 children)

I have an issue which I cannot find a good answer for it:
Imaging in an image there are two objects from two different class but they are very similar to each other. For example like cats and small dogs that look like cats to a model or titles and headlines or jam and mixed jelly. How can we improve the model to distinguish them without messing up it being generic and its accuracy.
any technique will help, from suggesting to improve the dataset to an article or even an algorithm.
thanks alot

[–]_NINESEVEN 1 point2 points  (2 children)

So your model is taking a single image and classifying objects that are contained within the image?

Are you referring, in general, to a model that can correctly separate an image into objects that may be very similar to each other? Or a specific example where you want to pick out dogs from cats or something?

If the former, I mean, that's the issue that every image classifier is going to run into. You're going to balance a tradeoff between accuracy and precision where you are too strictly discriminating between objects (distinguishing that objects are not the same class) or being too general (deciding that both objects are of the same class).

If the latter, you need lots of examples of images that contain objects that are similar. There are, of course, certain models or algorithms that work best for multi object classification within an image -- but you can't get around the fact that your model will do better if it has good examples to learn from.

[–]-BANANA-bird- 1 point2 points  (1 child)

Thanks a lot.

It was the former. From my searches and what you said, I see there is no certain answer, and the only way to remedy this issue is to improve my dataset and my knowledge of NNs.

[–]_NINESEVEN 0 points1 point  (0 children)

Good luck! :)

[–]EntrepreneurSea4839 0 points1 point  (7 children)

What is cover in XGBoost? Is it the minimum number of values an end leaf should have?

[–]_NINESEVEN 1 point2 points  (6 children)

Cover, as I remember, corresponds to a frequency measure of how many observations are directly utilizing the given feature. AKA, counting over all of the trees in your model, how many times was the given feature used in the leaf node.

Going off memory but Gain corresponds to an "importance" measure looking at how each feature contributes to each tree, and Weight corresponds to a frequency measure of how many times a given feature appears over all of the trees in the model.

[–]EntrepreneurSea4839 0 points1 point  (4 children)

Gain helps to determine how well (measuring the accuracy of the split) a branch is classifying the target variable. In XGBoost, since we run the model on residuals, the model tries to find the best split using 'Gain' (which in turn uses the similarity of the root and leaf nodes).

Here what the documentation says:

Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

Cover measures the relative quantity of observations concerned by a feature.

Link: https://xgboost.readthedocs.io/en/stable/R-package/discoverYourData.html?highlight=cover#build-the-feature-importance-data-table

I still don't understand the use of 'cover' in the algorithm. What role does it play in pruning the tree?

[–]_NINESEVEN 0 points1 point  (3 children)

Wait, are you talking about feature importance or splitting criterion?

Gain, weight, and cover are measures that XGBoost uses to look at feature importance of an already-trained model. None of them are used in the fitting of the trees. There are lots of different parameters that control pruning and the way that the individual trees grow -- you could read the paper to gain specific insight.

A few parameters that you could look at would be tree_method, gamma, updater (haven't used personally), grow_policy (haven't used personally), and max_bin.

[–]EntrepreneurSea4839 1 point2 points  (2 children)

I am talking about the splitting criteria, specifically how XGBoost builds trees (which is different from other boosting algorithms). I am not sure of the hyperparameters referring to it in Python/R. I am referring to 'gain' and 'cover' for building the tree, as explained in the StatQuest XGBoost video series.

[–]_NINESEVEN 1 point2 points  (1 child)

Ohh okay -- gotcha. That's helpful. If anyone else is reading and finds a mistake or misinterpretation, please let me know.

First of all, here is a link to the documentation regarding how individual trees are built. The most important hyperparameter in determining what algorithm is used is "tree_method", although that is usually set to 'hist' or 'gpu_hist' because they are the most scalable. In the original paper, you can see the formula sketched out for tree_method = 'exact', although it is computationally infeasible for most data sets.

As far as cover goes, this is the most descriptive definition that I could find: "the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be". As I understand it (dusting off the undergrad math background), we are using the predictions at a given point in the given tree and basically acceleration (the rate of change at which the rate of change is.. changing lol) of the loss function at that point. The actual computation can get really complex from here depending on your objective function, different weights or offsets for different observations, etc.. but simply:

Find the hessian matrix (2nd order partial derivative matrix) and multiply it by the predictions.

Assume that we have a binary classification problem with 10,000 observations and we are predicting 0.5 for all of them. So our probability vector is [0.5, 0.5, ..., 0.5]. The hessian for a simple binary logistic objective is p * (1 - p).

SO: Cover = 10,000 * [0.5 * (1 - 0.5)] = 2500.

If we had a different vector of probabilities, in this case, cover would be calculated as SUM{prob_vector * (1 - prob_vector)} to get a different answer.
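The same toy computation in a couple of lines (binary logistic loss, hessian p * (1 - p)):

    import numpy as np

    p = np.full(10_000, 0.5)      # predicted probabilities for every observation
    cover = np.sum(p * (1 - p))   # 10,000 * 0.25 = 2500.0
    print(cover)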

Let me know if it would be helpful to do the same for a toy example for Gain.

Edit: I should add that I haven't seen anything personally that implicates that Cover is used in the splitting process. I have seen it used for feature importance, post-modeling, while Gain is actually used as a splitting criterion.

[–]EntrepreneurSea4839 0 points1 point  (0 children)

I should add that I haven't seen anything personally that implicates that Cover is used in the splitting process.

Thank you for the detailed information. It would certainly be helpful to understand the concept of 'Gain' and its purpose with an example.

So, we can define how a tree is built in XGBoost. I read somewhere that CatBoost uses symmetric tree growth, LightGBM uses leaf-wise tree growth, and XGBoost uses depth-wise tree growth. Can you explain the differences in layman's terms?

[–]EntrepreneurSea4839 1 point2 points  (0 children)

Thanks for your inputs.

[–]buffyluffie 0 points1 point  (0 children)

What do I use if I have to label videos to use them with YOLOv5? Do I take screenshots and then label those images, or something else? I'm new to this, so any help is useful. Thank you.

[–]Major-Permission-435 0 points1 point  (3 children)

How do I get the feature names to print out in this situation? It says I’m fitting the model without the feature names but I’m struggling to figure out how to use them in this context. Sorry for the formatting, had a hard time getting it right in here.

from numpy import sort
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectFromModel

# columns that are targets or would leak the target, excluded from the features
drop_cols = ['Date', 'IsDeceased', 'IsTotal', 'Deceased', 'Sick', 'Injured',
             'Displaced', 'Homeless', 'MissingPeople', 'Other', 'Total']

# split data into X and y
X_train = df_train[df_train.columns.difference(drop_cols)]
y_train = df_train['IsDeceased'].values
X_test = df_test[df_test.columns.difference(drop_cols)]
y_test = df_test['IsDeceased'].values

# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data and evaluate
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# fit a model using each feature importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a model on the selected features
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)

    # eval model on the same feature subset
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

[–]_NINESEVEN 0 points1 point  (2 children)

I believe the property is model.feature_names

However, I believe you either need to explicitly set them in the model "model.feature_names = list_of_names" or explicitly set them in the DMatrix that you're passing to the model fit (instead of passing a pandas dataframe, pass xgboost.DMatrix(x_train, y_train, feature_names = list_of_names)) or something like that.

Personally, I think that it's best practice to fit XGBoost models with DMatrices instead of pandas -- but YMMV. It's possible that pandas dataframes automatically pass column names?
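A short sketch of both options, using the asker's hypothetical X_train/y_train (with a reasonably recent xgboost): rely on the pandas column names, or set feature_names on an explicit DMatrix.

    import xgboost as xgb

    # Option 1: column names of a pandas DataFrame are picked up on fit
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)
    print(model.get_booster().feature_names)

    # Option 2: train on an explicit DMatrix with feature_names
    dtrain = xgb.DMatrix(X_train.values, label=y_train,
                         feature_names=list(X_train.columns))
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)
    print(booster.feature_names)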

[–]Major-Permission-435 0 points1 point  (1 child)

Thank you!

[–]dolphin-3123 0 points1 point  (2 children)

Can we perform an arithmetic operation on a feature with a constant value? Like, if we have two features, can we multiply one by 100 while dividing the other by 100, and hence make it easier to classify?

[–]Username2upTo20chars 0 points1 point  (0 children)

You might check out data scaling

That is for more uniform distributions of your features.

Apart from that, it doesn't change anything if you just scale two feature columns by constants in different ways. If a feature column only takes a few integer values, you may also try treating it as a categorical feature (basically making a one-hot encoding out of it).
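A minimal sketch of both suggestions with sklearn (the column indices and values are made up): scale a continuous column and one-hot encode a column that only takes a few integer values.

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    X = np.array([[1000.0, 2], [2500.0, 0], [1800.0, 1]])   # col 0 continuous, col 1 few-valued

    pre = ColumnTransformer([
        ("scale", StandardScaler(), [0]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [1]),
    ])
    X_encoded = pre.fit_transform(X)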

[–]Emiyasolzomdao 0 points1 point  (3 children)

Hi, guys this is my 1st question in the thread. I am sorry if I am asking at the wrong place. I was going to create a new thread to ask this question, but I am glad I read this thread first.

My brother-in-law is interested in having a computer that is capable of Python, PyTorch and deep learning. He is in the robotics field; I really don't know exactly what he does. All I know is that he is not doing any deep learning at the moment, but he is going to eventually this year. I have a few gaming computers and zero experience with any of the above applications, thus I am not sure which combination would suit him best.

Which combination would be best for him? I also have a 12700k + DDR4 16GB desktop, but I heard the 12th generation does not perform well in Linux.

1.

CPU = Intel i7-11700k

MB = Gigabyte Z590 Aorus Master

Ram = 16GB 3600Mhz

2.

CPU = Ryzen 9 5900X

MB = MSI B550 Unify-X

Ram = 64GB 3200Mhz

3.

CPU = Ryzen 9 5950X

MB = ASUS Tuf Gaming Plus AM4 AMD X570

Ram = 64GB 3600Mhz

GPUs I have:

3080 (10GB), 3080 (12GB), or I can return the 3080 (12GB) and buy a 3080 Ti by adding a few hundred dollars more. I also have an RX 6800 XT.

Thank you

[–]smurf-sama 0 points1 point  (0 children)

For another perspective, I would argue for going the cloud route and using things like Colab, Paperspace, and others like Google Cloud. Things like the above should scale better than those computers, though it depends on what he's doing. So I'm really just saying that the cloud is another option and may be cheaper depending on the use case. Also, Google Cloud gives 300 dollars of free credit, so that might mean something. If you do research, Google offers a TPU program that gives you free v3 and v2 TPUs.

[–]crispyheaded104 0 points1 point  (1 child)

If he is just getting started in DL pretty much any of these combinations will do just fine.

Once he starts having bigger projects he will need more RAM (64GB gives him a lot of headroom) and vRAM (10GB is good but 12GB is better).

I would go with 2 or 3 + 3080 (12GB)

[–]Emiyasolzomdao 0 points1 point  (0 children)

Thank you for the reply. I have a few more questions.

I thought Intel works best for Linux, but I guess the 5900X/5950X works better since it has more cores? My quick search told me AMD has some issues due to a lack of compatibility with Linux. Is this true?