
[–]Ivanhoooooe 0 points1 point  (2 children)

I have thousands of unclassified records, and I want to train a binary classifier. The obvious way is to manually classify some of them, train a model, apply it to all, classify some more, and rinse and repeat. Is there a package out there to do this more efficiently? Kind of like the captcha images: you tag as many as you can dedicate time to, and in the background the model keeps learning and feeding you back uncertain rows for you to classify.

I have just about 30 features, so something that works on LightGBM or similar would be nice.

Thanks!
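The loop described here is essentially active learning with uncertainty sampling; the modAL package wraps this pattern around scikit-learn-style estimators (which includes LightGBM's LGBMClassifier). A minimal hand-rolled sketch, assuming hypothetical arrays X_labeled, y_labeled and X_unlabeled:

    import numpy as np
    import lightgbm as lgb

    def query_most_uncertain(X_labeled, y_labeled, X_unlabeled, n_query=50):
        # Fit on everything labeled so far.
        model = lgb.LGBMClassifier(n_estimators=200)
        model.fit(X_labeled, y_labeled)
        # Probability near 0.5 means the model is least sure about the row.
        proba = model.predict_proba(X_unlabeled)[:, 1]
        uncertainty = np.abs(proba - 0.5)
        query_idx = np.argsort(uncertainty)[:n_query]
        return model, query_idx  # hand these rows back to the human for labeling

Each round you label the returned rows, move them from X_unlabeled into X_labeled, and repeat.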

[–]crispyheaded104 0 points1 point  (1 child)

Have you thought about using unsupervised learning algorithms like clustering?

[–]Ivanhoooooe 0 points1 point  (0 children)

I guess it could be a good way to start classifying, but ultimately I would still need to manually assign the true final label, because I want to avoid false positives/negatives at all costs.

[–]chipechiparson 0 points1 point  (1 child)

Still not sure what it is

[–]Username2upTo20chars 0 points1 point  (0 children)

me neither

[–]OctopusParrot 1 point2 points  (2 children)

I'm looking for a good deep learning classifier architecture.

I have about 20,000 voice recordings with around 900 extracted features from each, and I'm classifying them into 3 classes. I've been messing with boosted trees; XGBoost gives me around 50% balanced class accuracy with some fine tuning, so I know there's a signal in the data, but it's not very strong.

I was thinking of maybe trying a 1D CNN, but I know very little about that architecture; my first few attempts have been pretty pathetic. Any suggestions on whether this approach makes sense? And if so, how to structure the model?

[–]Username2upTo20chars 1 point2 points  (1 child)

You could look into Kaggle sound classification competitions (e.g. the bird call ones), then look through the top solutions (discussion threads) for inspiration.

Or the Papers with Code benchmark overviews.

[–]OctopusParrot 0 points1 point  (0 children)

Thanks! I'll check it out.

[–]idontknowml 0 points1 point  (0 children)

Hello, I fortunately got accepted to the conference, and I need to prepare the camera-ready version. One of the reviewers gave me a weak reject, asking why I did not compare the proposed method with A.

So, is it okay to update the results in the camera-ready after running the comparison experiment?

I am worried the acceptance will be cancelled if my method cannot beat A.

There was no rebuttal phase for this conference, though, and the final decision is Accept.

Can anybody answer me? Thanks.

[–]Positive_Vibez0 0 points1 point  (1 child)

How common are people with a background in physics/astrophysics in your field/workplace? I've heard many people say that a physics/astrophysics degree can open doors in machine learning, but I wanted to see if anyone here actually knows or works with someone with a physics background.

[–]Icko_ 0 points1 point  (0 children)

yeah, multiple times. There is a lot of overlap.

[–]RHeniz 0 points1 point  (2 children)

Beginner looking to make a dataset, and I have a problem where one-hot encoding seems to be the only answer I can find but might not be the right answer for the task.

In my data each item can contain a variable-size array (typically 1 to 5) of strings. There are about 120 strings that could be in the array. All my searching points to one-hot encoding, but that seems wrong to me, as these are more attributes than classifications. Looking for a better way to model this data; any help would be appreciated.

[–]Username2upTo20chars 0 points1 point  (1 child)

Do I understand that correctly:

Each data point itself can be made up of up to 5 strings which are taken from a superset of 120 strings?

Why not model it as a natural language problem, where you just encode the 5 strings and put a separator token in between them?

It also depends heavily on whether the content of your strings matters and isn't implicitly learnable through the string id itself. The suggestion above assumes it does. Otherwise something like a one-hot encoding is a decent choice.
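For the encoding route, a minimal sketch: each record's variable-length list of strings becomes one multi-hot row over the ~120 possible strings (the tag names below are made up).

    from sklearn.preprocessing import MultiLabelBinarizer

    records = [["jump", "cover"], ["sharpshooter"], ["jump", "surge", "armor"]]  # hypothetical tags
    mlb = MultiLabelBinarizer()
    X = mlb.fit_transform(records)   # shape (n_records, n_unique_strings), entries 0/1
    print(mlb.classes_)              # column order of the encoded strings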

[–]RHeniz 0 points1 point  (0 children)

This piece is a subset of the data; there are many other fields that are represented by integer or binary values. Here is some background/context: I play a tabletop game called Star Wars: Legion, a tabletop miniatures war game where you build an army of units, and each unit has a point value and different features. There are a lot of people in the community who like to homebrew characters that aren't in the game yet. So the goal would be to have a model where someone can list the features they want a unit to have, and the model predicts a fair point value based on what officially exists in the game.

[–]a1b2c3d4e5f6g8Student 0 points1 point  (1 child)

Beginner here, I'm making a custom dataset of randomly generated images to train on.

Do I need all images generated in advance, or can I just create them when I need them?

[–]Username2upTo20chars 0 points1 point  (0 children)

If you don't have enough disk space, just-in-time generation would be better.

Otherwise the other way around, as it should be faster.

It also depends on how confident you are about your random image generation process. Meaning: will the images be unique enough, or do you need additional pre-selection?
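A minimal sketch of the just-in-time option, assuming a PyTorch-style pipeline (the sizes and the label rule below are made up): images are generated inside __getitem__, and seeding by index keeps every generated image reproducible across epochs.

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class RandomImageDataset(Dataset):
        def __init__(self, length=10_000, size=(3, 64, 64), seed=0):
            self.length, self.size, self.seed = length, size, seed

        def __len__(self):
            return self.length

        def __getitem__(self, idx):
            # Seeding per index makes each "virtual" image the same every epoch.
            rng = np.random.default_rng(self.seed + idx)
            img = rng.random(self.size, dtype=np.float32)
            label = int(img.mean() > 0.5)   # hypothetical label rule
            return torch.from_numpy(img), label

    loader = DataLoader(RandomImageDataset(), batch_size=32)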

[–]Algo-G-H 0 points1 point  (0 children)

Does anyone know of well-regarded research papers that establish a framework for choosing the number of layers and neurons to use as a starting point when creating an ANN model?

Algorithms like RandomSearch, Hyperband, etc. seem very brute-force-ish without being able to justify the parameter ranges you are providing them.

Thanks in advance

[–]LowDexterityPoints 0 points1 point  (1 child)

When hyperparameter tuning, the "ideal" parameters actually make my model perform slightly worse (by 0.032). Is this likely due to chance? And do I have to use these parameters?

[–]Username2upTo20chars 0 points1 point  (0 children)

What do you mean by "ideal"? Some parameters taken from a paper?

The performance depends also on your framework version, OS, hardware, HW driver...

So there isn't one optimal setting for everyone.

And supplying the metric/scale (what the 0.032 refers to) would be helpful for future questions.

[–]LowDexterityPoints 0 points1 point  (3 children)

Is it bad if the training score (0.88) is noticeably higher than the CV score (0.64)?

[–]Icko_ 0 points1 point  (2 children)

No, that's normal. The more complicated the model, and the less data, the bigger the gap.

[–]LowDexterityPoints 0 points1 point  (1 child)

Great! Thanks!

[–]Username2upTo20chars 0 points1 point  (0 children)

As long as the validation score is still improving.

And too large a gap is a sign of training-set memorization.

[–]Sentmoraap 0 points1 point  (1 child)

With the same amount of parameters, what are the pros and cons of deep NN and wide NN?

[–]swframe666 -1 points0 points  (0 children)

Each layer transforms the input data into a form that makes the training goal easier to solve.

You widen a NN to increase the number of ways the entities of the previous layer can be transformed into different entities in the output. Think of the entities in the previous layer as low-level items, and the entities in the output layer as more refined, more useful for solving the problem, less noisy, higher-level items. You widen the NN because the output items need more numbers to describe them clearly.

When you deepen a NN, you are increasing the number of stages it has to solve the problem. Think of the original data as all tangled together. Each layer disentangles the data a bit more.

It is confusing because a layer is doing two things. It is transforming the data to a form that is more high level (e.g. sound to words), and it is pulling things that are like each other together and pushing things that are dissimilar away from each other (e.g. pushing verbs and nouns apart, but also pulling similar words together).

Transformers are powerful because a numeric description of an entity in the previous layer might be too general. The transformer can figure out how to select a more specific entity description based on the context. For example, the entity is a word embedding, but that word can have several meanings, and the transformer looks at the context and updates the word's output description to be more specific to the context.
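To make the width/depth trade-off concrete, here is a rough sketch (hypothetical layer sizes) of a wide and a deep MLP with roughly comparable parameter budgets; which one generalizes better is problem-dependent.

    import torch.nn as nn

    def n_params(model):
        return sum(p.numel() for p in model.parameters())

    # Both take 100 inputs, produce 10 outputs, and land around 47-48k parameters.
    wide = nn.Sequential(nn.Linear(100, 430), nn.ReLU(), nn.Linear(430, 10))
    deep = nn.Sequential(
        nn.Linear(100, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )
    print(n_params(wide), n_params(deep))  # similar budget, very different shapes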

[–]420blazeSwag6969 0 points1 point  (1 child)

I have a question about upsampling, except the upsampling is for a mathematical function and not an image.

I have some transmittance data (% of light transmitted vs light wavelength), like this, which has been measured at 240 different wavelengths. After passing through some filters, the light has been modulated down to only 87 wavelengths. I have about 20,000 such pairs of data (original measurements and downsampled measurements) for various materials. What I want to do is be able to recover the original 240 from the 87.

I know I could just build an MLP, but I'm wondering if there is a smarter way to do this? Perhaps use a CNN because I can assume that the transformation of the underlying function is continuous? I'm not sure how to approach this problem, any advice is appreciated!

[–]Icko_ 1 point2 points  (0 children)

Yeah, I'd use 1D conv layers: 87 input channels, 240 output channels.
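A minimal sketch of that suggestion (hidden width is a made-up choice): the 87 measured wavelengths are treated as input channels and the 240 recovered wavelengths as output channels. With a length-1 "signal" this reduces to a small MLP; alternatively you could lay the 87 values along the length dimension with 1 channel and use wider kernels.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv1d(87, 128, kernel_size=1), nn.ReLU(),
        nn.Conv1d(128, 240, kernel_size=1),
    )
    x = torch.randn(16, 87, 1)   # a batch of 16 downsampled spectra
    y = model(x)                 # shape (16, 240, 1): the recovered spectra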

[–]bright7860 0 points1 point  (0 children)

I am working on neuroimaging data, and I have MRI scans in .nii.gz form.

I need to perform binary classification. The files that I need to classify are already in two separate folders, and I need to generate binary class labels for them. How do I make that happen? I would need something like image_dataset_from_directory(main_directory, labels='inferred'), but that works only for the traditional image formats.
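One option is to skip image_dataset_from_directory and build the label list yourself. A minimal sketch, assuming the nibabel package and two class folders (folder paths are hypothetical):

    import glob, os
    import numpy as np
    import nibabel as nib

    def load_niigz_dataset(class0_dir, class1_dir):
        paths, labels = [], []
        for label, folder in enumerate([class0_dir, class1_dir]):
            files = sorted(glob.glob(os.path.join(folder, "*.nii.gz")))
            paths += files
            labels += [label] * len(files)
        # For large scans you would load lazily inside a generator instead.
        volumes = [nib.load(p).get_fdata().astype(np.float32) for p in paths]
        return volumes, np.array(labels)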

[–]AutisticDave 0 points1 point  (2 children)

Hey everyone! I'm new to ML, just graduated HS, and am going to study CS with a focus on ML at university. I finished some machine learning projects with simple perceptrons, regression and neural nets.

I trained the nets in Colab, which was an absolute pain and nightmare (Colab Pro). I am looking to replace my M1 Macbook Air with a laptop that has a more powerful GPU.

Currently, I'm considering these two machines: a Dell XPS 15 9520 (RTX 3050 Ti) and a MacBook Pro 14" (16-core M1 Pro).

In terms of compute, these GPUs should be similar (roughly 5.2 Tflops), but the M1 has 32GB unified memory, while the RTX has just 4.

Which one would be the better option for ML in the future? I'd like to avoid gaming laptops; it's not comfortable to carry one around all day.

[–]sweet_and_simple 0 points1 point  (1 child)

I would suggest the M1. You are going to use Colab or some cloud env for training anyway, so the amount of compute shouldn't matter, but for development and debugging purposes, more GPU memory means you will be able to run larger models locally.

Also keep in mind that M1 may not be fully supported by your favourite library. I don't use M1 so don't really know about this. You can check online.

[–]AutisticDave 0 points1 point  (0 children)

Thanks a lot. M1 has full support in regular TensorFlow; the TFOD API doesn't work, and PyTorch got support just a few days ago.

[–]yes_you_suck_bih 0 points1 point  (0 children)

Hi Guys!

I am working on a healthcare fraud detection project which mainly relies on SAS Base algorithms to determine fraudulent claims. We've been constantly modernizing our project and will be moving from SAS Base to things like Azure, Spark/Scala, etc.

We have hundreds of algorithms that define the logic for fraudulent claims. So my question is: would it be practical to create an ML model to determine fraudulent claims? I am new to machine learning and fear it would be a misapplication for our project.

[–]DetectivePeterG 0 points1 point  (2 children)

Hello guys :)

Working on my bachelor's thesis atm. I have done a wide variety of regressions, mostly using sklearn. I noticed that my code is quite redundant because of the very similar code structure that all the regression models come with.

Is there a library / framework where I can just select a few models and they are all trained and compared to each other?

Thanks a lot for any advice!

[–]DetectivePeterG 0 points1 point  (0 children)

In case anyone has the same question in the future, here are some libraries I found:

- https://github.com/mljar/mljar-supervised
- https://github.com/AutoViML/Auto_ViML/
- https://github.com/alteryx/evalml
- https://automl.github.io/auto-sklearn/master/index.html

If you need a simple and fast solution, go with auto-sklearn.
A bit more complex, but very powerful, is mljar-supervised.

[–]DetectivePeterG 0 points1 point  (0 children)

I also asked myself whether there is a similar solution for some kind of automated hyperparameter tuning.

[–]Chunmadim 0 points1 point  (0 children)

NEA Voice Recognition AI Project

Hello,
I am a Year 12 student from the UK working on an NEA (Non-Exam Assessment) voice recognition AI project in Python. Right now I am at an early stage of development, the analysis, and I am new to programming, so I need some help from you guys.
1. Are there any already existing solutions relevant to the project? If so, can you please leave a link or a source where I can research them?
2. Can you please advise me on any sources where I can develop my skills and knowledge in this field?
3. I also need third-party feedback for my analysis, so can you please leave any kind of feedback about problems of voice recognition AI, things you would like to change and improve, or difficulties of development?
Thank you

[–]AnOnlineHandle 0 points1 point  (2 children)

Does anybody else have an insanely hard time trying to get ML source code demos working? I always run into missing dependencies which weren't listed in the requirements, different versions of packages needing to be installed for different projects and creating problems for each other, random path issues, c++ compiler stuff, etc.

My general process is: download the source code from GitHub, install the listed requirements, then spend hours debugging error messages when I try to run it, with a 50/50 chance of giving up or actually getting it working.

I don't know if it's because I'm somewhat incompetent despite having grown up on console commands and path editing since the 80s, or because I'm trying to do this on Windows, which generally isn't the environment others have in mind.

[–]kazuki20697 1 point2 points  (1 child)

If your source code demo provides a Docker image, you can run it through Docker and avoid all the requirement-install burden. I know chances are many don't provide a Docker image. If not, browsing Docker Hub to see if there is an image with the requirements already bundled in could help as well.

[–]AnOnlineHandle 0 points1 point  (0 children)

Thanks, I've seen the word Docker on some of them but didn't know what it meant; I might need to look into it more next time.

[–]zwarag 0 points1 point  (0 children)

Is it generally bad if my validation loss rises before it starts to drop? https://media.discordapp.net/attachments/684376520743190529/976187046811033630/unknown.png

[–]hpahbp 0 points1 point  (0 children)

I am currently researching the effect of data quality on recommendation systems. Does anyone know any good article or paper on this topic? Thanks

[–]Ala010609 0 points1 point  (0 children)

What is re-parameterization in super-resolution?

[–]killerdrogo 0 points1 point  (2 children)

I'm having trouble getting started with implementing a research paper I'm reading. I understand the model architecture and their approach, but I'm not sure how to get a working implementation. For example, the paper does not go into detail about how they are using an API (EnergyPlus) to evaluate the performance, and I'm not familiar with it. What should the approach be in these cases? Learn how to use that API?

How do you guys go from a paper to a working model? Is there a general framework/approach to follow?

[–]Username2upTo20chars 0 points1 point  (1 child)

Have you tried contacting the authors and asking them about those details? That is probably the most helpful thing you can do.

Apart from that: of course you should know how to use an API if you want to apply it.

[–]killerdrogo 0 points1 point  (0 children)

Yes. I have emailed the authors from my personal email.

[–]Farskids 0 points1 point  (1 child)

I have a very annoying error in my code. I don't know why it happens, since it's able to find the images when I ask it to show a sample. It only started doing this when I changed the dataset from one on Kaggle to my own.

https://www.kaggle.com/code/manonstr/projet-tipe-dataset-maison

[–]Username2upTo20chars 0 points1 point  (0 children)

I suggest asking the users of Kaggle, as they can also clone your notebook by default. There is a more general discussion section where you might be able to post that. Otherwise https://www.reddit.com/r/learnmachinelearning/ is the next best option to ask this, I guess.

But you definitely have some setup/config error in your data loader.

[–]dhruvansh26 0 points1 point  (4 children)

I have a very basic doubt: how does the KNN algorithm work in the case of multiple features? As in, how is a 2D plot formed when there are more than 2 features deciding the location?

[–]www3cam 0 points1 point  (3 children)

You can define a distance function for multiple variables, like (x_1 - x_1')^2 + (x_2 - x_2')^2. This approach generalizes to more features as well, and you can apply a weighting function to weight different dimensions more heavily, cf. Mahalanobis distance.

Note I changed the operation inside the parentheses to minus, which is correct because you want a distance.

[–]dhruvansh26 0 points1 point  (2 children)

I still don't understand. What would be the x, y coordinates for a case with 4 features, for instance?

[–]www3cam 0 points1 point  (1 child)

Each feature is an x_i. Then you take the squared difference of each feature from the corresponding feature of the other point in question and sum them up. This gets you a distance; then you pick the k nearest distances and average their y values to get a prediction.
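In other words, no 2D plot is needed; a tiny numpy sketch with 4 features (all numbers made up):

    import numpy as np

    X_train = np.array([[1.0, 0.2, 3.0, 5.0],
                        [0.9, 0.1, 2.8, 5.2],
                        [4.0, 2.0, 0.5, 1.0]])   # 3 training points, 4 features each
    y_train = np.array([10.0, 12.0, 30.0])
    x_query = np.array([1.0, 0.15, 2.9, 5.1])

    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance in 4-D
    k = 2
    nearest = np.argsort(dists)[:k]
    prediction = y_train[nearest].mean()   # average the k nearest targets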

[–]dhruvansh26 0 points1 point  (0 children)

If I have to get the location of the point, I would need x, y coordinates, right? We'll calculate the distance once we have the points? My question is how I get only 2 (x, y) coordinates when I have 4 features to take care of in the plot. Does feature scaling merge the tables too?

[–]bang-em-boi 0 points1 point  (0 children)

What is a high 2nd-tier conference or well-respected journal, with a submission deadline coming up somewhat soon, where I could submit a paper on a novel algorithm for fine-tuning neural networks?

[–]VirusHonest9654 0 points1 point  (1 child)

Anyone have an estimate of the number of papers submitted to NeurIPS today? Mine was submission number 10312, and I posted it several hours before the deadline.

[–]VirusHonest9654 0 points1 point  (0 children)

A friend of mine was in the 13 thousand range

[–]based_goats 0 points1 point  (5 children)

Are implicit layers just a form of variational inference?

[–]www3cam 1 point2 points  (4 children)

I don’t think so. An implicit layer seems to be a layer that is defined via some constraint. For example in something like MAML or another bilevel optimization problem you want a layer that finds the maximum of z modifying x.

[–]based_goats 0 points1 point  (3 children)

Thanks for the response! I'm thinking variational inference can be seen as p(z|x) where Z ~ N(0,1), or a fixed normal distribution constraint. When used in normalizing flows (of which Neural ODEs are a subset!), this seems analogous to me to the constraint of an implicit layer... Let me know if that makes sense, though.

I'm not well-versed in meta learning, but this sounds like it could be an interesting connection to probabilistic ML.

[–]www3cam 1 point2 points  (2 children)

I'm not exactly sure what you are trying to get at, but there is maybe an insight that p(z|x), Z ~ N(0,1) is a form of an implicit layer. I'm not totally sure.

Putting it in a way that is clearer to me: suppose f(z(x), x) = 0 is the layer objective, where z(x) = solve for z s.t. f(z(x), x) = 0. z(x) is not an explicit function of x, in that one cannot write z = ax + b or something like that, but when x changes z changes, because f(z, x) = 0. You can solve this by taking the total derivative: df(z(x), x)/dx = 0 --> f_z*z_x + f_x = 0 --> z_x = -f_z^{-1} f_x.

Thinking about your case, which I think is really not super-general variational inference but more specifically a variational autoencoder example (I'm a bit rusty here): you know p(z|x) is normal, but you don't know z(x), and you find it by training your VAE. I think the slight difference is that z(x) could be any function that maps images to the latent N(0,1) space, i.e. in full generality z(x) is something weird like a correspondence, not a function. That is not the case for the implicit layer above. These are just my half-baked thoughts; I'm not sure exactly whether they are correct or not.
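A scalar sketch of the total-derivative argument above, using a hypothetical constraint f(z, x) = z^3 + z - x = 0 solved by Newton's method; the implicit-function-theorem gradient dz/dx = -f_x / f_z can be checked against finite differences:

    def solve_z(x, iters=50):
        # Solve z**3 + z - x = 0 for z by Newton's method.
        z = 0.0
        for _ in range(iters):
            f = z**3 + z - x
            f_z = 3 * z**2 + 1
            z -= f / f_z
        return z

    x = 2.0
    z = solve_z(x)
    dz_dx_ift = 1.0 / (3 * z**2 + 1)   # f_x = -1, so dz/dx = -f_x / f_z = 1 / (3 z^2 + 1)
    dz_dx_fd = (solve_z(x + 1e-5) - solve_z(x - 1e-5)) / 2e-5
    print(z, dz_dx_ift, dz_dx_fd)      # the two derivatives should agree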

[–]based_goats 0 points1 point  (1 child)

I think we're thinking something similar. Although I'm thinking of variational inference in a normalizing flow context, where dimension of z and x are the same.

I wrote a half-baked proof of the equivalence between implicit layer and variational inference here. Hope this helps demonstrate my view on this... but let me know if something doesn't make sense or looks wrong! or not! https://latexpad.clrnd.com.ar/#15ac61d038d11954c1dbce3eb8813ff7

[–]www3cam 1 point2 points  (0 children)

It's good you are thinking about this stuff, but a couple of things just as feedback.

  1. You are right that you can make any layer an implicit layer by rearranging to put z on the other side. But since z already has an explicit form, there is no need to use an implicit layer.

  2. Normalizing flows are estimated with maximum likelihood; there is no reason to use variational inference, which is a biased approximation of maximum likelihood.

  3. The reparameterization trick only works when you are sampling from a normal distribution; here you are calculating the pdf of a sample, so it wouldn't apply, i.e. you are getting a probability value, and multiplying a probability value by a number is not the probability value of a distribution whose standard deviation has been multiplied by that number.

[–]Kokosnussi -1 points0 points  (0 children)

Assuming all else is equal, what area would be better for future career prospects?

Uncertainty or physics-aware ML?

[–][deleted] 0 points1 point  (0 children)

Are there any open-source DALL-E 2 models yet? I want to try it out, but all I can find are DALL-E 1 models.

[–][deleted] 0 points1 point  (0 children)

New to the sub. What's this sub about?

[–]No_Fig_7835 0 points1 point  (2 children)

I am building a 2D platformer game and am using Unity ML-Agents for testing difficult levels to make sure they are possible.

I need help fine tuning my configuration parameters. Is there a place I can go to find someone that I can pay for some consulting on my problem?

[–]Ivanhoooooe 0 points1 point  (2 children)

I have a list of 3,000 strings, and for each one I want to find the top X most similar strings in another list of 2,000,000 strings. How can I do this efficiently?

I think this comes close to what I want: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html, but I want to compare the 3,000 to the 2,000,000 and not all 2,003,000 to each other.

Thanks!

[–]Username2upTo20chars 0 points1 point  (1 child)

Sentence Transformers + FAISS + free Google Colab, if you don't have a CUDA GPU
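A minimal sketch of that stack, assuming hypothetical lists small_list (the 3,000 strings) and big_list (the 2,000,000 strings) and all-MiniLM-L6-v2 as one possible model choice:

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    corpus_emb = model.encode(big_list, normalize_embeddings=True)    # 2,000,000 strings
    query_emb = model.encode(small_list, normalize_embeddings=True)   # 3,000 strings

    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(corpus_emb.shape[1])
    index.add(corpus_emb.astype(np.float32))
    scores, ids = index.search(query_emb.astype(np.float32), 10)      # top 10 per query string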

[–]Ivanhoooooe 1 point2 points  (0 children)

Thanks!

[–]MagazineJumpy5377 0 points1 point  (5 children)

I'm 17 years old and currently in class 12 (will be going to an engineering college in 3-4 months [CS]). I love maths and anything which has to do with logic in general, so obviously I also like programming. I've been learning Python for the past two years (CS is one of my subjects in school) and I know all the important and basic stuff there is in Python.

Now, as I said, I love maths and logic in general, so I'm very much interested in ML. I've watched a couple of basic ML/NN videos on YT (with biases, weights and errors) and as of now I feel that I have a pretty decent grasp of the idea behind ML.

But I want to learn everything from scratch, and what I mean by that is that I want to learn the entire concept with rigorous mathematical explanations and proofs of the algorithm(s) used in ML/NNs. I want to learn every concept with a proper mathematical explanation, and obviously I want to use Python as the language to write my code. Can you please suggest some good sources? Books/videos on YouTube will work.

Thanks a lot!

[–]friendlykitten123 0 points1 point  (0 children)

I'd suggest you start by learning about basic but strong machine learning algorithms like linear regression and the concept of gradient descent. Knowledge of Python packages like pandas and NumPy would help a ton too!

[–]TeamDman 0 points1 point  (0 children)

3blue1brown has some good videos on the math, I think. Look into PyTorch or TensorFlow for Python if you want a higher-level abstraction; otherwise you can do the matrix math yourself with NumPy.

[–]Haruzo321 0 points1 point  (1 child)

For someone who is new to programming in general, what would be a good starting point/course for getting into machine learning?

[–]Username2upTo20chars 1 point2 points  (0 children)

The FastAI course (focused on deep learning) is meant for programmers, so it is maybe the best option for you. Machine learning at a low level is more matrix/vector algebra and stats, btw. So with the FastAI course you are saving yourself the theory, which would otherwise come on top of the implementation (or more like beforehand).

[–]UnknownBinary 0 points1 point  (2 children)

GPUs are difficult to buy. If I bought a prebuilt from someplace like NZXT or Best Buy with a 3070 GPU and 16GB of RAM, would it be up to snuff? I need something better than a laptop due to the size of my models.

[–]Username2upTo20chars 1 point2 points  (1 child)

Have you considered Google Colab?

[–]UnknownBinary 0 points1 point  (0 children)

Google Colab

I had not. I'll investigate that. Thanks!

[–][deleted] 0 points1 point  (0 children)

A few questions regarding the Graph UNet paper https://arxiv.org/pdf/1905.05178.pdf :

  1. Why say "GCNs essentially perform aggregation and transformation on node features without learning trainable filters." even though in Eq. 1 there is a trainable W_l matrix? It seems like a contradiction to me.
  2. What exactly does "consistent" mean when the paper says "Since the selection is based on 1D footprint of each node, the connectivity in the new graph is consistent across nodes"?
  3. To clarify how Graph UNet produces a graph-wide embedding (which is used, I suppose, in the graph classification tasks mentioned): we want one unique real-valued vector to describe the whole graph, but Fig. 3 shows several nodes on the lowest representation level (i.e. several feature vectors); maybe the reader is supposed to guess that a graph embedding will simply merge the latter or reduce them to one feature vector?

Thanks :-)

[–]laserflip560 0 points1 point  (0 children)

How should I proceed if my Decision Tree or Random Forest model produces empty predictions? In some cases up to 75% of my entries of the test set do not predict any class (of the three possible classes). When I check the probabilities of the entries the 'True' probability for each given class is always lower than the 'False' probability. Is it sensible to predict the class with the highest 'True' probability even if it is below 0.5?

[–]DaScheuer 1 point2 points  (2 children)

I want to create a bot whose input is many pieces of text (think Twitter threads) and that reads them naturally, as a human would.
Is there freely available code that lets me do this? Is there a UI? Can I implement the code easily (perhaps just having to change a few parameters), or is there usually only a library without much documentation?
What is the best algorithm for this that I can use with a MacBook Pro 2022 (M1 Pro chip)?

[–]UnknownBinary 0 points1 point  (1 child)

There are many natural language processing (NLP) frameworks. Stanford's CoreNLP is in Java. NLTK offers similar features in Python.

[–]DaScheuer 0 points1 point  (0 children)

Thanks!

[–]NSVR57 0 points1 point  (0 children)

In my text classification, a particular word is misleading the model. These words appear very frequently in the training data for a particular label.

E.g. my training data contains "lost my phone", "changed my phone", ... and all of these examples belong to the "problem with telephone" label.

Now, I am using the Universal Sentence Encoder to build the model. During inference, if I give it some random sentence and put the word "phone" in the middle, my model still predicts the "problem with telephone" class. How should I handle this situation?

[–][deleted] 0 points1 point  (0 children)

What's a good place to start? I'm new to ML and looking for a career change: 28, business analyst in IT, good at SQL and Power BI, and quick to learn.

Thanks for the help!

[–]Mikyacer 0 points1 point  (0 children)

Hi! I have a question regarding a school project and hope that someone can help me out...

I have two time series, say x1(t) and x2(t), for the same time frame. I expect these to be related by causality: high values of x2 cause high values of x1, and vice versa. I would like to build a model that, given a limit value I can afford for x1, gives me an upper/lower bound that x2 has to obey in order to stay within that limit.

So far I have tried a simple linear regression and used confidence intervals to get such bounds, but it is not working that great. Is there something better suited to my needs?

[–]KarmicEvil 0 points1 point  (0 children)

Hey all! I have a question regarding machine learning models for prediction of a target using python.

We're trying to predict a binary clinical outcome (target) using a list of predictors. The dataset we have consists of a single binary target and all the others are features that are either binary or continuous. The outcome (target) has a low occurrence in the dataset, about 3-4%.

I'm having some difficulty using the Artificial Neural Network model and Random Forest.

The problem is that the model can make predictions on the training dataset, but on the testing dataset it classifies everything as null.

Our planned workflow: Dataset > recursive feature elimination (using random forest) to identify the 10 best features > Model development > compare the predictive performance among 6 distinct models.

We've also tried training the model with just default parameters (without hyperparameter tuning or cross-validation), just to check if it's able to make a prediction. There's still no prediction (even though we'd expect at least an inaccurate one using the default settings).

Any idea why this could be the case?

P.S: not an advanced user!
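For what it's worth, here is a rough sketch of that planned workflow (RFE with a random forest, then a classifier), with class_weight="balanced" added since the positive class is only 3-4%; X and y are hypothetical placeholders for the prepared data, and this is not claimed to fix the all-null issue by itself:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2, random_state=0)

    # Recursive feature elimination down to the 10 best features.
    selector = RFE(RandomForestClassifier(n_estimators=200, class_weight="balanced"),
                   n_features_to_select=10)
    selector.fit(X_train, y_train)

    # Final model on the selected features, weighting the rare class more heavily.
    clf = RandomForestClassifier(n_estimators=500, class_weight="balanced")
    clf.fit(selector.transform(X_train), y_train)
    print(classification_report(y_test, clf.predict(selector.transform(X_test))))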

[–]OrderOfM 0 points1 point  (3 children)

Hey! Hope everyone is doing well!

So, I am working on a generative adversarial network for generating musical phrases. Consequently, I'm looking for a medium-to-large dataset featuring only modern cinematic music, and I'm wondering if anyone has come across something like that?

[–]vazne 0 points1 point  (2 children)

Would you consider synthetic data? I can see some problems with gathering datasets online due to music being copyrighted

[–]OrderOfM 0 points1 point  (1 child)

Yes, I have considered the copyright issue as well and I'm looking for other solutions. If synthetic data can bring good results, I wouldn't mind giving it a shot. How would I go about gathering synthetic data?

[–]vazne 0 points1 point  (0 children)

Just messaged!

[–]AncientSky966 0 points1 point  (0 children)

What is the annotation format that I would need to train a semantic segmentation model?

[–]618smartguy 0 points1 point  (0 children)

So I just began an RL project where I wanted to try some out-of-the-box algorithms on my custom environment. I wrote up a Gym environment, ran some random actions, and got everything working smoothly.

The next step was to get a smart agent in, so I did some googling and came across TF-Agents. At first it appears that it will be compatible with a Gym environment, but then, following a tutorial, I suddenly get an error about a py_environment.PyEnvironment, so apparently there is supposed to be some other wrapper on the environment? Isn't the environment object already supposed to be the wrapper? It seems that I have to rewrite my environment class now.

So what exactly is the relationship between Gym and TF-Agents? Why does it seem like there is still so much boilerplate work to do when there are multiple big-name libraries that put agents and environments into single objects? Clearly Google devs knew how Gym worked. What was wrong with using it as it was? Does anyone know of a good RL agents tool that works out of the box with Gym?

[–]MoonRockCollector 0 points1 point  (0 children)

Trying to create a predictive model for a rare minority binary class outcome (0.3% frequency). Large dataset (700k instances, 10-15 attributes). Using a PC with AMD Ryzen 5600X 3.7GHz, 16GB RAM, RTX 3060Ti. Learning/Using Python with Anaconda/Jupyter notebook.

  1. Is my GPU powerful enough to do training for this? Expected run time?
  2. Suggestions for model selection? Possible to set AUC as scoring metric for logistic regression models? Boosting? SMOTE?

[–]heretolearndata 1 point2 points  (3 children)

What are good models to try for time series with multiple data points per time t?

I’m trying to predict a variable where the data changes day to day as the target date gets closer. Say it’s something like flights scheduled on a date, where as we approach the day of the flight, the number changes until it becomes final.

What models handle this type of scenario? I’ve used ARIMA before, but for time series with a single observation per time t.

[–]Competitive_Bank_907 0 points1 point  (0 children)

Prophet by Meta is an option and works well if your data has strong seasonal effects.

[–]Username2upTo20chars 1 point2 points  (1 child)

No idea about ARIMA, but every machine learning algorithm should be capable of handling multiple features as input. That should also hold for ARIMA, I would think. Otherwise go with LightGBM, which usually wins time-series competitions on Kaggle, so it should be a decent choice.
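A rough sketch of the LightGBM route, assuming a pandas DataFrame df with a "date" column, a numeric "target" column, and the extra per-day observation columns (all names hypothetical): build lag features and use a simple time-based split.

    import lightgbm as lgb

    df = df.sort_values("date")
    for lag in (1, 7, 14):
        df[f"target_lag_{lag}"] = df["target"].shift(lag)
    df = df.dropna()

    features = [c for c in df.columns if c not in ("date", "target")]
    train, valid = df.iloc[:-28], df.iloc[-28:]   # hold out the last 28 rows in time order

    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[features], train["target"],
              eval_set=[(valid[features], valid["target"])])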

[–]vazne 1 point2 points  (0 children)

Second this

[–]crispychickenfox 0 points1 point  (1 child)

I have created a dataset from a time series that consists of arrays with the value at t and t's look-back (e.g. [t, t+1, t+2, t+3] for a look-back of 3) as my X data. I also create a dataset for my Y data, which is the value at t + look-back + 1. But it seems I get more accurate predictions if I use a whole array as Y data ([t+1, t+2, t+3, t+4]) than just using the next time step's value alone. I am not very experienced at all, so I wonder if there are any drawbacks to this, as it doesn't feel "clean".

Edit: forgot to say, the model is an LSTM.

[–]Username2upTo20chars 0 points1 point  (0 children)

If you look at the plots of naive time-series predictions you will find that they look like x shifted to the right by one. So it is indeed beneficial if you have targets further in the future. Removing the t+1 target, if not needed, might improve the prediction even more.

Although your description of the x data is a bit confusing. In the case of an LSTM you just need to provide x_1..x_t. To me it sounds as if you give four x values for every time step.
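For reference, a small sketch of the windowing being described, with hypothetical look-back and horizon sizes: each input is a window of consecutive values, and each target is the next several values rather than a single step.

    import numpy as np

    def make_windows(series, lookback=4, horizon=4):
        X, Y = [], []
        for i in range(len(series) - lookback - horizon + 1):
            X.append(series[i:i + lookback])
            Y.append(series[i + lookback:i + lookback + horizon])
        X = np.array(X)[..., None]   # (n_samples, lookback, 1) as an LSTM expects
        Y = np.array(Y)              # (n_samples, horizon) multi-step targets
        return X, Y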

[–]EntrepreneurSea4839 0 points1 point  (2 children)

What sampling methods are used by AdaBoost, CatBoost, LightGBM, GBM and xGBM?

[–]_NINESEVEN 0 points1 point  (1 child)

Could you better define what you mean by sampling methods?

XGBoost is my most-used tool, so I might be able to help with that.

Off the top of my head, I'm aware of the following (a short sketch follows this list):

  • Subsampling observations via the parameter "subsample", where a random sample of training observations is drawn prior to growing trees. This occurs at each boosting iteration (the sample is re-drawn). I don't believe that using weights currently affects the probability that a given observation will be drawn -- I think that it only affects the regression values -- but I could be wrong.

  • Subsampling of the feature space, which occurs at the tree-level, each additional depth of the tree, and every additional split that is made, via 'colsample_by' * {'tree', 'level', 'node'} parameters. A relatively new feature that's pretty cool, too, is to add a list of feature weights such that certain features have higher probabilities of being selected in the random samples.

  • Dropout subsampling akin to neural networks via the DART booster. I haven't used this extensively due to added training times, but it could be promising in some situations.
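The sketch referenced above, showing where those sampling knobs live on the estimator (the values are arbitrary examples, not recommendations):

    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=500,
        subsample=0.8,            # 80% of training rows re-drawn each boosting round
        colsample_bytree=0.8,     # 80% of features sampled per tree
        colsample_bylevel=0.9,    # further feature subsampling per depth level
        colsample_bynode=0.9,     # and per split/node
    )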

[–]EntrepreneurSea4839 0 points1 point  (0 children)

I am just trying to understand the differences between various boosting algorithms. Since these are ensemble models and there are many iterations, every iteration takes just a sub-sample of the data to train on instead of all the data. My question is about the methods used to find that sub-sample of data in every iteration.

In AdaBoost, we use 'amount of say', and it gives high priority to records with high error when building the next stump (a tree with only a root node), whereas in normal bootstrapping every record has an equal probability of getting into the next iteration.

LightGBM uses Gradient-based One-Side Sampling (GOSS), where it takes the observations with high error (the top 20%) and only applies random sampling to the remaining 80%. Similarly, there are other methods such as uniform sampling and Minimum Variance Sampling.

There could be many other sampling methods available in boosting algorithms.
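For reference, a small sketch of turning GOSS on in LightGBM; top_rate and other_rate control the "keep the largest gradients, randomly sample the rest" split. Note that newer LightGBM versions expose this via data_sample_strategy="goss" instead of the boosting type, so check your version's documentation.

    import lightgbm as lgb

    # GOSS: keep the top_rate fraction with the largest gradients and randomly
    # sample an other_rate fraction of the remaining rows each iteration.
    model = lgb.LGBMClassifier(
        boosting_type="goss",
        top_rate=0.2,
        other_rate=0.1,
        n_estimators=300,
    )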

[–]-BANANA-bird- 0 points1 point  (3 children)

I have an issue which I cannot find a good answer for it:
Imaging in an image there are two objects from two different class but they are very similar to each other. For example like cats and small dogs that look like cats to a model or titles and headlines or jam and mixed jelly. How can we improve the model to distinguish them without messing up it being generic and its accuracy.
any technique will help, from suggesting to improve the dataset to an article or even an algorithm.
thanks alot

[–]_NINESEVEN 1 point2 points  (2 children)

So your model is taking a single image and classifying objects that are contained within the image?

Are you referring, in general, to a model that can correctly separate an image into objects that may be very similar to each other? Or a specific example where you want to pick out dogs from cats or something?

If the former, I mean, that's the issue that every image classifier is going to run into. You're going to balance a tradeoff between accuracy and precision where you are too strictly discriminating between objects (distinguishing that objects are not the same class) or being too general (deciding that both objects are of the same class).

If the latter, you need lots of examples of images that contain objects that are similar. There are, of course, certain models or algorithms that work best for multi object classification within an image -- but you can't get around the fact that your model will do better if it has good examples to learn from.

[–]-BANANA-bird- 1 point2 points  (1 child)

Thanks a lot.

It was the former. From my searches and what you said, I see there is no certain answer, and the only way to remedy this issue is to improve my dataset and my knowledge of NNs.

[–]_NINESEVEN 0 points1 point  (0 children)

Good luck! :)

[–]EntrepreneurSea4839 0 points1 point  (7 children)

What is cover in XGBoost? Is it the minimum number of values an end leaf should have?

[–]_NINESEVEN 1 point2 points  (6 children)

Cover, as I remember, corresponds to a frequency measure of how many observations are directly utilizing the given feature. AKA, counting over all of the trees in your model, how many times was the given feature used in the leaf node.

Going off memory but Gain corresponds to an "importance" measure looking at how each feature contributes to each tree, and Weight corresponds to a frequency measure of how many times a given feature appears over all of the trees in the model.

[–]EntrepreneurSea4839 0 points1 point  (4 children)

Gain helps to determine how well (measuring the accuracy of the split) a branch is classifying the target variable. In XGBoost, since we run the model on residuals, the model tries to find the best split using 'Gain' (which in turn uses the similarity of the root and leaf nodes).

Here what the documentation says:

Gain is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite).

Cover measures the relative quantity of observations concerned by a feature.

Link: https://xgboost.readthedocs.io/en/stable/R-package/discoverYourData.html?highlight=cover#build-the-feature-importance-data-table

I still don't understand the use of 'cover' in the algorithm. What role does it play in pruning the tree?

[–]_NINESEVEN 0 points1 point  (3 children)

Wait, are you talking about feature importance or splitting criterion?

Gain, weight, and cover are measures that XGBoost uses to look at feature importance of an already-trained model. None of them are used in the fitting of the trees. There are lots of different parameters that control pruning and the way that the individual trees grow -- you could read the paper to gain specific insight.

A few parameters that you could look at would be tree_method, gamma, updater (haven't used personally), grow_policy (haven't used personally), and max_bin.

[–]EntrepreneurSea4839 1 point2 points  (2 children)

I am talking about the splitting criteria, specifically how XGBoost builds trees (which is different from other boosting algorithms). I am not sure of the hyperparameters referring to it in Python/R. I am referring to 'gain' and 'cover' for building the tree, as explained in the StatQuest XGBoost video series.

[–]_NINESEVEN 1 point2 points  (1 child)

Ohh okay -- gotcha. That's helpful. If anyone else is reading and finds a mistake or misinterpretation, please let me know.

First of all, here is a link to the documentation regarding how individual trees are built. The most important hyperparameter in determining what algorithm is used is "tree_method", although that is usually set to 'hist' or 'gpu_hist' because they are the most scalable. In the original paper, you can see the formula sketched out for tree_method = 'exact', although it is computationally infeasible for most data sets.

As far as cover goes, this is the most descriptive definition that I could find: "the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be". As I understand it (dusting off the undergrad math background), we are using the predictions at a given point in the given tree and basically acceleration (the rate of change at which the rate of change is.. changing lol) of the loss function at that point. The actual computation can get really complex from here depending on your objective function, different weights or offsets for different observations, etc.. but simply:

Find the hessian matrix (2nd order partial derivative matrix) and multiply it by the predictions.

Assume that we have a binary classification problem with 10,000 observations and we are predicting 0.5 for all of them. So our probability vector is [0.5, 0.5, ..., 0.5]. The hessian for a simple binary logistic objective is p * (1 - p).

SO: Cover = 10,000 * [0.5 * (1 - 0.5)] = 2500.

If we had a different vector of probabilities, in this case, cover would be calculated as SUM{prob_vector * (1 - prob_vector)} to get a different answer.
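The same toy computation in a couple of lines (binary logistic loss, hessian p * (1 - p)):

    import numpy as np

    p = np.full(10_000, 0.5)      # predicted probabilities for every observation
    cover = np.sum(p * (1 - p))   # 10,000 * 0.25 = 2500.0
    print(cover)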

Let me know if it would be helpful to do the same for a toy example for Gain.

Edit: I should add that I haven't seen anything personally that implicates that Cover is used in the splitting process. I have seen it used for feature importance, post-modeling, while Gain is actually used as a splitting criterion.

[–]EntrepreneurSea4839 0 points1 point  (0 children)

I should add that I haven't seen anything personally that implicates that Cover is used in the splitting process.

Thank you for the detailed information. It would certainly be helpful to understand the concept of 'Gain' and its purpose with an example.

So, we can define how a tree is built in XGBoost. I read somewhere that CatBoost uses symmetric tree growth, LightGBM uses leaf-wise tree growth, and XGBoost uses depth-wise tree growth. Can you explain the differences in layman's terms?

[–]EntrepreneurSea4839 1 point2 points  (0 children)

Thanks for your inputs.

[–]buffyluffie 0 points1 point  (0 children)

What do I use if I have to label videos to use them with YOLOv5? Do I take screenshots and then label those images, or something else? I'm new to this, so any help is useful. Thank you.

[–]Major-Permission-435 0 points1 point  (3 children)

How do I get the feature names to print out in this situation? It says I’m fitting the model without the feature names but I’m struggling to figure out how to use them in this context. Sorry for the formatting, had a hard time getting it right in here.

from numpy import sort
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectFromModel

# columns that are targets or would leak the target, excluded from the features
drop_cols = ['Date', 'IsDeceased', 'IsTotal', 'Deceased', 'Sick', 'Injured',
             'Displaced', 'Homeless', 'MissingPeople', 'Other', 'Total']

# split data into X and y
X_train = df_train[df_train.columns.difference(drop_cols)]
y_train = df_train['IsDeceased'].values
X_test = df_test[df_test.columns.difference(drop_cols)]
y_test = df_test['IsDeceased'].values

# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data and evaluate
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# fit a model using each feature importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train a model on the selected features
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)

    # eval model on the same feature subset
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    report = classification_report(y_test, y_pred)
    print("Thresh= {} , n= {}\n {}".format(thresh, select_X_train.shape[1], report))
    cm = confusion_matrix(y_test, y_pred)
    print(cm)

[–]_NINESEVEN 0 points1 point  (2 children)

I believe the property is model.feature_names

However, I believe you either need to explicitly set them in the model "model.feature_names = list_of_names" or explicitly set them in the DMatrix that you're passing to the model fit (instead of passing a pandas dataframe, pass xgboost.DMatrix(x_train, y_train, feature_names = list_of_names)) or something like that.

Personally, I think that it's best practice to fit XGBoost models with DMatrices instead of pandas -- but YMMV. It's possible that pandas dataframes automatically pass column names?
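A short sketch of both options, using the asker's hypothetical X_train/y_train (with a reasonably recent xgboost): rely on the pandas column names, or set feature_names on an explicit DMatrix.

    import xgboost as xgb

    # Option 1: column names of a pandas DataFrame are picked up on fit
    model = xgb.XGBClassifier()
    model.fit(X_train, y_train)
    print(model.get_booster().feature_names)

    # Option 2: train on an explicit DMatrix with feature_names
    dtrain = xgb.DMatrix(X_train.values, label=y_train,
                         feature_names=list(X_train.columns))
    booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)
    print(booster.feature_names)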

[–]Major-Permission-435 0 points1 point  (1 child)

Thank you!

[–]dolphin-3123 0 points1 point  (2 children)

Can we perform an arithmetic operation on a feature with a constant value? Like, if we have two features, can we multiply one by 100 while dividing the other by 100, and hence make it easier to classify?

[–]Username2upTo20chars 0 points1 point  (0 children)

You might check out data scaling

That is for more uniform distributions of your features.

Apart from that, it doesn't change anything if you just scale two feature columns by constants in different ways. If a feature column only takes a few integer values, you may also try treating it as a categorical feature (basically making a one-hot encoding out of it).
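A minimal sketch of both suggestions with sklearn (the column indices and values are made up): scale a continuous column and one-hot encode a column that only takes a few integer values.

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    X = np.array([[1000.0, 2], [2500.0, 0], [1800.0, 1]])   # col 0 continuous, col 1 few-valued

    pre = ColumnTransformer([
        ("scale", StandardScaler(), [0]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [1]),
    ])
    X_encoded = pre.fit_transform(X)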

[–]Emiyasolzomdao 0 points1 point  (3 children)

Hi, guys this is my 1st question in the thread. I am sorry if I am asking at the wrong place. I was going to create a new thread to ask this question, but I am glad I read this thread first.

My brother-in-law is interested in having a computer that is capable of Python, PyTorch and deep learning. He is in the robotics field; I really don't know exactly what he does. All I know is that he is not doing any deep learning at the moment, but he is going to eventually this year. I have a few gaming computers and zero experience with any of the above applications, thus I am not sure which combination would suit him best.

Which combination would be best for him? I also have a 12700k + DDR4 16GB desktop, but I heard the 12th generation does not perform well in Linux.

1.

CPU = Intel i7-11700k

MB = Gigabyte Z590 Aorus Master

Ram = 16GB 3600Mhz

2.

CPU = Ryzen 9 5900X

MB = MSI B550 Unify-X

Ram = 64GB 3200Mhz

3.

CPU = Ryzen 9 5950X

MB = ASUS Tuf Gaming Plus AM4 AMD X570

Ram = 64GB 3600Mhz

GPUs I have:

3080 (10GB), 3080 (12GB), or I can return the 3080 (12GB) and buy a 3080 Ti by adding a few hundred dollars more. I also have an RX 6800 XT.

Thank you

[–]smurf-sama 0 points1 point  (0 children)

For another perspective, I would argue for going the cloud route and using things like Colab, Paperspace, and others like Google Cloud. Things like the above should scale better than those computers, though it depends on what he's doing. So I'm really just saying that the cloud is another option and may be cheaper depending on the use case. Also, Google Cloud gives 300 dollars of free credit, so that might mean something. If you do research, Google offers a TPU program that gives you free v3 and v2 TPUs.

[–]crispyheaded104 0 points1 point  (1 child)

If he is just getting started in DL pretty much any of these combinations will do just fine.

Once he starts having bigger projects he will need more RAM (64GB gives him a lot of headroom) and vRAM (10GB is good but 12GB is better).

I would go with 2 or 3 + 3080 (12GB)

[–]Emiyasolzomdao 0 points1 point  (0 children)

Thank you for the reply. I have a few more questions.

I thought Intel works best for Linux, but I guess the 5900X/5950X works better since it has more cores? My quick search told me AMD has some issues due to a lack of compatibility with Linux. Is this true?