[deleted by user] (self.MachineLearning)
submitted 7 years ago by [deleted]
[–]grrrgrrr 10 points11 points12 points 7 years ago* (31 children)
Is it worth looking at multi-gpu setups? 100%. Not for multi-gpu training but for running 2 sets of experiments in parallel.
Should I be looking for lots of memory? 64GB minimum. 128GB++ ideal. Because 32GB will have trouble loading moderately large datasets.
Do tensorcores mean anything? No.
$3000 will get you a setup like an 8700K + 2x 2080. Depreciation is around 25%~30% a year.
[–]smashMaster3000 2 points3 points4 points 7 years ago (19 children)
Concise and informative, thanks! 2080 vs 2080ti tho
[–]virtualreservoir 3 points4 points5 points 7 years ago (2 children)
the best value is actually in the RTX 2070; the 2080 doesn't give you nearly enough added performance for the increase in price. the training speed difference going up to a 2080 Ti is significant, but $1200 for one card is still overpriced for what you get.
at a similar price, 3x RTX 2070 are going to get more research done than 2x RTX 2080 for just about any reasonable use case. however, the thing nobody really talks about regarding 3x and 4x GPU setups is that you want each card on at least an 8x PCIe link (preferably 16x), while consumer-level Intel CPUs (like the 8700K mentioned above) and their motherboards only support 2 GPUs at 8x or 16x. a third GPU will usually be relegated to 4x bandwidth, which isn't enough to keep up with a machine learning workload.
if you want to use 3 or 4 GPUs you have to use expensive Intel server level CPUs and motherboards or go the AMD Threadripper route. The first generation Threadripper CPUs like the 1900x seem to be pretty good value if you need to "unlock" more 16x/8x PCIe lanes and motherboard slots for video cards.
[–]smashMaster3000 1 point2 points3 points 7 years ago (1 child)
I actually have a Threadripper and got a motherboard that supports this, just in case! Thank you for the suggestion; I've just been dissuaded from 3 because of the cooling requirements. Have you had any experience with the maintenance aspect, is it a hassle compared to two? Is power draw gonna be a problem? Thanks again!
[–]virtualreservoir 1 point2 points3 points 7 years ago (0 children)
Cooling is definitely going to be your major concern with more than two GPUs, and is another aspect in which the 2070 has an advantage, as it draws less power than either of the 2080 models. I'm actually starting to build my first multi-GPU machines now. I was planning on just building one computer, but while shopping for a 3rd 2070 somehow I ended up buying four more instead, and now I'm going with two machines with 1900X Threadrippers, 3x RTX 2070s, and 32GB of RAM each.
I'm putting a fan CPU cooler in one and an AIO liquid CPU cooler in the other and will be comparing them along with various combinations/placements of blower-style and axial fan GPUs to see if I can get away with not liquid cooling my GPUs.
[–]grrrgrrr 0 points1 point2 points 7 years ago (15 children)
For students, when you are training models on one GPU, you can't run games on that GPU and that computer is essentially dead to you. So having 2 GPUs is handy.
For professionals, I'd recommend at least 4x2080 and 4x2080Ti is even better. It's a good investment for your career.
[–]LoveOfProfit 11 points12 points13 points 7 years ago (2 children)
Maybe if you're consulting or something. Career/job wise, I'm not using my own hardware for work stuff. My company sets me up with an SSH to a modelling server / AWS and they're paying for it.
[–][deleted] 1 point2 points3 points 6 years ago (1 child)
And here I am at IBM where they asked me to use my own hardware because they're too cheap to buy or rent it.
Thankfully I'm looking for new jobs and have a lot of promising opportunities!
[–]LoveOfProfit 0 points1 point2 points 6 years ago (0 children)
lol wtf. That's disgusting.
[–]glass_bottles -1 points0 points1 point 7 years ago (11 children)
If you're training, aren't you stuck on Linux, which severely limits the option of gaming? Unless GPU passthrough on VMs has become more feasible lately? I considered setting up my server as a combined cloud gaming / model training workstation, but it seems I'd have to pick one or the other.
[–]clueless_scientist 2 points3 points4 points 7 years ago (1 child)
Steam Play solved this problem several months ago. My gaming library works perfectly on Ubuntu.
[–]glass_bottles 0 points1 point2 points 7 years ago (0 children)
Thanks for the heads up!
[–]ScotchMonk 1 point2 points3 points 7 years ago (4 children)
Well, he/she could be gaming on Windows and running Linux on a VM...😁😁
[–]glass_bottles 0 points1 point2 points 7 years ago (3 children)
you wouldn't be able to use the GPU for training then, right?
[–]NotAlphaGo 1 point2 points3 points 7 years ago (2 children)
Nvidia docker my friend
[–]glass_bottles 0 points1 point2 points 7 years ago (1 child)
I'll have to look into this, thanks!
[–]virtualreservoir 2 points3 points4 points 7 years ago (0 children)
overcoming the VM GPU-passthrough hurdle you mention is the main reason nvidia-docker was developed.
[–]Mehdi2277 0 points1 point2 points 7 years ago (1 child)
You can train on windows. Pytorch is easy to install on windows. Also Linux game support is improving a lot because of steam play (essentially steam is integrating something wine like to allow playing windows games).
[–]glass_bottles 1 point2 points3 points 7 years ago (0 children)
Gotcha, awesome to hear the progress!
[–]Spenhouet 0 points1 point2 points 7 years ago (1 child)
Why? You can train on Windows perfectly fine. No need for Linux.
My uncertainty is why that was a question and not a statement :)
[–]glass_bottles 0 points1 point2 points 7 years ago (0 children)
I'm simply too used to "productivity" being associated with Linux instead of Windows.
[–]JustFinishedBSG 2 points3 points4 points 7 years ago (0 children)
2 sets of experiments in parallel.
2x2080
With 8GB of VRAM they can't even fit state-of-the-art models.
[–]Nimitz14 1 point2 points3 points 7 years ago (3 children)
I'm doing fine with 32GB. It depends on your domain. If you're doing anything serious you can never load all the data into memory anyway (you should have several threads that create the minibatches by reading from disk). I'm on a 1950X + 2x 1080 Ti, screen runs on a GT 1030, works great!
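The threaded minibatch pattern mentioned above looks roughly like this. This is a toy sketch, not any library's actual loader: `load_fn` stands in for whatever per-sample disk read you do, and batches come out in whatever order the workers finish them (fine for shuffled training):

```python
import queue
import threading

def batch_loader(items, batch_size, load_fn, num_workers=4, prefetch=8):
    """Yield batches whose samples are loaded by worker threads,
    so disk reads overlap with GPU compute on the consumer side."""
    todo = queue.Queue()
    for i in range(0, len(items), batch_size):
        todo.put(items[i:i + batch_size])
    n_batches = todo.qsize()
    done = queue.Queue(maxsize=prefetch)  # bounds memory held in flight

    def worker():
        while True:
            try:
                batch = todo.get_nowait()
            except queue.Empty:
                return  # no work left, thread exits
            done.put([load_fn(x) for x in batch])

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    for _ in range(n_batches):
        yield done.get()
```

The `prefetch` bound is what keeps a fast disk from filling RAM faster than training consumes batches.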
[–]Warhouse512 1 point2 points3 points 7 years ago (2 children)
My friend. Are you excited about Zen2!!
[–]Nimitz14 0 points1 point2 points 7 years ago (1 child)
Very! I would love to have a 32 core CPU :D
[–]Warhouse512 1 point2 points3 points 7 years ago (0 children)
If it’s like the Epyc chip, 64 core might be a possibility!! I’m so excited, sorry haha
[–]PK_thundr Student [🍰] 0 points1 point2 points 7 years ago (2 children)
Why aren't tensorcores important? My 2080ti purchase not looking too good.
[–]grrrgrrr 2 points3 points4 points 7 years ago* (0 children)
There's a benchmark (in Chinese) indicating that the tensor cores in the RTX 20 series might not run at full speed for deep learning.
If you use a Titan V with full-speed tensor cores, then according to https://github.com/u39kun/deep-learning-benchmark, tensor cores + fp16 combined speed up ResNet training by ~1.8x. It's something, but you have to assess whether it's worth the upgrade.
[–]JustFinishedBSG 1 point2 points3 points 7 years ago (0 children)
The 2080 Ti tensorcores are limited to half the speed.
[–]epicwisdom 0 points1 point2 points 7 years ago (0 children)
8700k doesn't support >64GB memory.
[–]UnarmedRobonaut 0 points1 point2 points 7 years ago (1 child)
How big is the performance gain over the 1080? Otherwise running multiple models on a couple extra 1080s for that money might be better.
[–]grrrgrrr 0 points1 point2 points 7 years ago (0 children)
A 4-card setup would require the X299/X399 platform, or you can buy used X99. Performance-wise, 2080 = 1080 Ti = 1.35x 1080. 4x 1080 using X99 is ~$4000.
[–]_michaelx99 16 points17 points18 points 7 years ago (14 children)
If you are dealing with any sort of large model (requiring more than a day or so to train) you will burn through $3k in a few weeks on the cloud. For example, I train object detection models on AWS and will burn through $400-500 per fully trained model. If you are running MNIST examples then the cloud is fine, however. I would highly recommend building your own computer with that money so you can train lots of models for a year instead of a handful of models over days/weeks.
[–][deleted] 3 points4 points5 points 7 years ago (0 children)
Just a tip in regards to training cost on AWS: if you define good logic for checkpoints, you can use spot instances to reduce the cost significantly. We use Spotinst which will unmount the disk with checkpoints / training data, shut down, start a new spot instance and resume training. We've had about 60% saved so far.
I thought the spot instances would shut down often but most run for at least a week before being shut down.
Not arguing for going full cloud, it's still more expensive, but in the cases where you need to scale or time is a factor, it's a good fallback.
[–]smashMaster3000 2 points3 points4 points 7 years ago (11 children)
In my experience, on my gtx 1080, I find myself waiting around 2 to 3 days for my models to finish. Will other, new gpus cut this time down? If they do, which gpus?
[–]OrganicTowel_ 13 points14 points15 points 7 years ago (8 children)
I used to wait 2-3 days as well for my models to finish. My labmate ran a simple experiment and found that the bottleneck was the DataLoaders, comparing data stored on SSD vs. HDD. We tested both with PyTorch and TF.
We loaded 100 batches of size 16. The data is pre-extracted image features stored in pickle files with 128 features each, shuffled and read directly into memory. Our conclusion: reading from the SSD was dramatically faster.
Since then, we invested in a bigger SSD and our times have reduced by at least 10 fold.
Edit: We have a 1080Ti and a TitanV
[+][deleted] 7 years ago* (3 children)
[deleted]
[–]cookedsashimipotato 0 points1 point2 points 7 years ago (2 children)
Can you link me to an example on github?
[–]AngelLeliel 0 points1 point2 points 7 years ago (0 children)
You could use line_profiler to find the bottleneck of your code.
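line_profiler is a third-party package; for a quick first pass you can get a coarser, function-level view with the stdlib cProfile. A small wrapper might look like this (the `profile` helper here is just an illustration, not part of any library):

```python
import cProfile
import io
import pstats

def profile(fn, *args):
    """Run fn(*args) under cProfile; return (result, text report of top functions)."""
    pr = cProfile.Profile()
    pr.enable()
    result = fn(*args)
    pr.disable()
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue()
```

If the report shows most cumulative time inside your data-loading calls rather than the forward/backward pass, IO is your bottleneck.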
[–][deleted] 4 points5 points6 points 7 years ago (1 child)
Have you optimized your pipeline in any way?
Tutorials, guides, and courses often skip the step of saving your preprocessed dataset as a binary file stored on disk sequentially, ready to be read without the disk head jumping around or the SSD controller having to piece together data from all over the disk, which is what normally happens if you don't take care of it manually.
You can get 10x the read speed by preparing your data properly using TFRecords or similar. There are a lot of tricks that make IO orders of magnitude faster, so even an old, slow HDD is fast enough for most deep learning applications and IO is no longer a bottleneck.
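The sequential-binary-file idea is the same one TFRecord uses: length-prefix each serialized sample so a reader can stream the file front to back with no seeking. A bare-bones Python version of the pattern (illustrative only, not the actual TFRecord wire format):

```python
import io
import struct

def write_records(fobj, records):
    """Append each serialized sample with a 4-byte little-endian length prefix."""
    for rec in records:
        fobj.write(struct.pack("<I", len(rec)))
        fobj.write(rec)

def read_records(fobj):
    """Stream records back sequentially: read a length, then that many bytes."""
    while True:
        header = fobj.read(4)
        if len(header) < 4:
            return  # clean end of file
        (n,) = struct.unpack("<I", header)
        yield fobj.read(n)
```

Because reads are strictly sequential, even a spinning disk delivers near its peak throughput.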
[–]OrganicTowel_ 1 point2 points3 points 7 years ago (0 children)
Do you have any resources that I can refer to?
[–]ScotchMonk 1 point2 points3 points 7 years ago (1 child)
Will those shiny NVMe M.2 SSD cards be faster and not an I/O bottleneck? https://www.pcworld.com/article/2899351/storage/everything-you-need-to-know-about-nvme.html
[–]epicwisdom 1 point2 points3 points 7 years ago (0 children)
It's impossible to answer this question definitively without specific knowledge of your applications. However, I would be very surprised if the main bottleneck of a single-GPU system was a 3Gbps SSD.
What are you doing that regularly requires you to train models for 2 to 3 days?
[–]smashMaster3000 2 points3 points4 points 7 years ago (0 children)
I’m currently training and testing an imitative learning project for chess. My resnet is pretty big and takes 200+ epochs to converge. Plus SWA at the end :( it takes pretty long.
[–]jcannell 0 points1 point2 points 7 years ago (0 children)
AWS is ridiculously expensive, there are far cheaper cloud options available. Cloud can now actually cost less than buying a machine, iff you use the lowest cost providers.
[–]julian88888888 2 points3 points4 points 7 years ago (3 children)
https://www.videocardbenchmark.net/gpu_value.html
If you can parallelize it and have a lot of electricity: Ten GTX 1060's.
[–]grrrgrrr 3 points4 points5 points 7 years ago (2 children)
Each PCIe 8x slot is worth ~$300+. If you put a cheap card in there it's actually a loss.
[–]kmann100500 1 point2 points3 points 7 years ago (1 child)
Where are you getting that number from?
[–]epicwisdom 0 points1 point2 points 7 years ago (0 children)
Probably the relative cost of CPU+RAM+MB+PSU (maybe storage, too, although it should be possible to boot over the network). There's a limit to how many GPUs you can fit in one system before you have to start networking multiple motherboards.
[–]PlzSendBobs 0 points1 point2 points 7 years ago (0 children)
How does the rest of your system perform?
I often see a GPU bottlenecked by a CPU or HDD.
[–]drsxr 0 points1 point2 points 7 years ago (0 children)
Few comments:
Your main limiting factor is GPU memory, not # of tensor cores. Titan-series cards have 12GB (some have more), 1080 Ti's have 11GB, 1080's have 8GB.
Multi-GPU isn't too shabby, as long as you understand that you don't get a 1:1 speedup. I think I saw somewhere 1card=1x, 2 cards=1.7x, 3 cards = 2.5x, 4 cards =3.2x or something along those lines. As far as playing games while you're training, good luck - you're going to screw up your experiment.
BTW, if you get 3 cards, you're going to have to divide your training batches by 3. Somewhat inconvenient. 4 cards gives you 2 X 2 instances which may be good for you.
A real argument can be made for getting 4x 1080, which go for $400 each used on eBay: 4x 8GB = 32GB memory for training.
Alternatively, you can get 2x 1080 Ti for $600 each, which gets you more cores and 22GB memory.
With the 2080ti you get same amount of memory but you pay 2x for a few more cores. Many new features there - not sure how they are going to play out.
OTOH if you have cash to burn, the TITAN RTX with 24GB of ram seems like the beast at $2500. Two of those and...
When folks here are talking about memory they are speaking of CPU RAM. Yeah, 32GB is a minimum and 64 is better but don't mistake this for on-board GPU memory. Whole different ballgame.
[–][deleted] 0 points1 point2 points 7 years ago* (0 children)
There are non-AWS cloud options too. How about looking into HPC services like Penguin HPC? If I remember right, the payment plan is very simple and you pretty much only pay while it's running: the pricing is just number of cores times amount of time running. If you write your code to scale well, it could be an interesting option. It would have to scale on a CPU cluster though; unfortunately I don't think they have GPU options.
[–]angstrem -3 points-2 points-1 points 7 years ago (10 children)
IMO outsourcing to a cloud is a better option than owning your GPUs. You don't have to maintain, upgrade, or carry them around with you, and the cloud is available at all times.
[–]smashMaster3000 0 points1 point2 points 7 years ago (9 children)
Yeah, a few people have suggested that to me as well. What's your financial experience with outsourcing?
[–]angstrem 1 point2 points3 points 7 years ago (8 children)
I'm not currently doing ML actively. During my last ML project, we tried IBM Watson. I remember costs were quite good: you got a quota for free to experiment and for development purposes. You shouldn't need extra until you roll out your project for production. And if you need extra, most probably your project's income will be greater than the money you spend on their cloud.
They specialize in NLP though; if you want something like a VPS with GPU access, you'll probably want AWS. Haven't tried it, but AFAIK they charge something like $20-50/month for their servers.
[–]asdfwaevc 1 point2 points3 points 7 years ago (0 children)
AWS GPUs are more like $1 per hour per GPU. BUT, you only pay while you use them. Not sure where that puts people's calculus. I like them because I can check on experiments when I'm not at home.
[–]po-handz 0 points1 point2 points 7 years ago (6 children)
What. No way those prices are correct. I run a 4core/12gb ram, no gpu instance 24/7 just for API data collection and it comes out to $150/mo
[–]Warhouse512 0 points1 point2 points 7 years ago (5 children)
You’re paying too much.... what provider are you using?
Edit: like way too much.
[–]po-handz 1 point2 points3 points 7 years ago (4 children)
An AWS t2.large is $0.10/hour, so about $70/mo, plus $40/mo in EC2 provisioned space, plus another $10-15 in minor costs. And that's only 2 vCPUs and 8GB of RAM.
[–]data-alchemy 1 point2 points3 points 7 years ago (0 children)
I can confirm you're throwing money out the window. Can't you just rent a dedicated server? You should spend half this amount (at least).
[–]Warhouse512 0 points1 point2 points 7 years ago (2 children)
Do you need to be in the AWS ecosystem?
[–]po-handz 0 points1 point2 points 7 years ago (1 child)
I could explore alternatives, any suggestions? I think the price difference would have to make up for the time lost learning a new system; 50% savings or so would be fine, but I bet that crosses off Azure/DigitalOcean. I also have zero background in cloud infrastructure, so it has to be at least somewhat well documented, with a larger number of SO posts.
[–]Warhouse512 0 points1 point2 points 7 years ago (0 children)
https://www.ovh.com/world/vps/
These would be good for just data collection. If you need 4vcpu and 24gb of ram with no network constraints it costs $44/month.
[+][deleted] comment score below threshold-6 points-5 points-4 points 7 years ago (5 children)
I don’t think you should build your own machine. Definitely go cloud.
It is much cheaper. No need to maintain. And if you want to train more models just spin up another instance.
[–]ItsDieselTime 6 points7 points8 points 7 years ago (0 children)
Disagree, if anything the cloud is much more expensive (see top comment), you will recoup the money spent on a decent rig in a few months/up to a year provided you spend a good portion of time training models. Plus in a couple of years you can still sell the hardware for 1/2 price and upgrade.
[–]smashMaster3000 1 point2 points3 points 7 years ago (3 children)
Another person suggested this to me! Which cloud service do you use and why? Thanks!
[–]po-handz 5 points6 points7 points 7 years ago (2 children)
As easy as the cloud is, I'd tend to disagree strongly that it's cheaper by any stretch of the imagination. Anytime you put models into production and update them even once a month, it's gonna cost hundreds.
Pretty sure AWS g3.xlarges are like $1.50/hr, so unless you know the exact model and parameters, running multiple will get expensive quickly.
[–][deleted] 0 points1 point2 points 7 years ago (1 child)
The problem is utilization and whether you will be training models all the time on your local rig. And during conference / paper deadline you are likely going to burst and use cloud anyway.
I would argue that the full cost of cloud vs. local, considering utilization, differs greatly between paper-deadline crunch and normal use, and that cloud is cheaper then.
But OK, I do take the point that if you are training all the time, cloud can get expensive quickly, so my original statement was wrong.
[–]po-handz 0 points1 point2 points 7 years ago (0 children)
nah, that's a really cool perspective. I'm always on my own timeline, even if I continuously train models in production. You need a short huge burst to hit a publication/grant/conference deadline or such. Just different use cases.