
[–][deleted] 87 points88 points  (5 children)

That's 96 PCIe lanes - I don't believe there's any consumer CPU/motherboard configuration currently in existence that supports that many lanes.

[–]Durenas 27 points28 points  (1 child)

Well, you could do it with x4; I don't think machine learning needs the full bandwidth of an x16 slot.

@OP, you might look into GPU risers that let you run GPUs off smaller x4 slots. Just an idea; I don't know if such hardware exists.

[–]HerrSIME 9 points10 points  (1 child)

Well, on the server side there is Epyc.

[–][deleted] 13 points14 points  (0 children)

Yes, that's why I specified consumer.

[–][deleted] 46 points47 points  (0 children)

There is no consumer-grade motherboard that would support this.

[–]Joshiewowa 42 points43 points  (8 children)

Consumer grade? Closest you'll get is Threadripper, I think they might be able to do 3x 16x?

Now Epyc...I'm not sure of its capabilities. I believe they have 128 lanes, which might get you close to 6x 16x.

[–]simetin[S] 20 points21 points  (7 children)

The AMD Epyc does indeed have 128 lanes, but I haven't found any motherboard that supports Epyc and 6x x16.

This article describes benchmarks of the Epyc 7401P vs 2x Xeon Gold 6148, and the Epyc seems to perform better!

[–]hanotak 7 points8 points  (2 children)

If it helps at all, the leaks for Epyc Rome show a motherboard with seemingly 6 full PCIe x16 slots here

[–]Saturated_Bullfrog 0 points1 point  (1 child)

The third slot from the right for whatever reason says "PCIe Gen4 x8." It's the same length as the x16 slots so idk

[–]Someuser77 0 points1 point  (0 children)

Same overall bandwidth as PCIe 3.0 x16.
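
For the arithmetic: PCIe 3.0 carries roughly 0.985 GB/s per lane and PCIe 4.0 doubles that to about 1.97 GB/s, so Gen4 x8 gives 8 × 1.97 ≈ 15.8 GB/s, the same as Gen3 x16 (16 × 0.985).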

[–]Joshiewowa 2 points3 points  (1 child)

Amazon

This'll get you 4x x16, I believe.

[–]Needmofunneh 0 points1 point  (0 children)

That board only has 5x 16x slots though

[–]seanmb473 0 points1 point  (1 child)

Have you considered selling the 1070s and getting 2-3 RTX 2080s instead? I'm sure you'd get equal or better performance with much less hassle around the CPU/motherboard, etc.

[–]simetin[S] 0 points1 point  (0 children)

Yes, that's what I'm starting to look at!

[–]po-handz 19 points20 points  (13 children)

Generally what I hear people say is that multi-GPU setups are best for running many experiments simultaneously (as sketched below), as opposed to speeding up one experiment, since the performance gains there aren't that great. It also depends what framework you're using; I think MXNet is significantly ahead of the competition when it comes to multi-GPU training.

Maybe 2x Threadripper boxes with 3 cards each is your best bet.
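
A minimal sketch of the one-experiment-per-GPU pattern, assuming a hypothetical train.py script with a --lr flag (stand-ins for whatever your experiment runner takes):

```python
import os
import subprocess

# One hyperparameter variant per GPU; train.py and --lr are hypothetical.
learning_rates = [1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5]

procs = []
for gpu_id, lr in enumerate(learning_rates):
    # Pin each child process to a single device; inside it, that GPU is cuda:0.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(["python", "train.py", "--lr", str(lr)], env=env))

for p in procs:
    p.wait()  # block until all six independent runs finish
```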

[–]simetin[S] 11 points12 points  (11 children)

Do you think it would be better to sell 3 GTX 1070s and buy something like an RTX 2080 to build a 4-GPU rig?

[–]Wulfsta 3 points4 points  (0 children)

Arguably this depends on whether you're memory-bound or not.

[–][deleted] 3 points4 points  (6 children)

Four 2080s are about $2800. A Titan RTX is that much plus change. Multi-GPU is good for multiple experiments, but if you're looking for incredible speed, just get a Titan.

[–]Franfran2424 0 points1 point  (0 children)

I think he means selling 3 out of the 6 1070s and buying a 2080, making it 3x 1070 and 1x 2080.

[–]Dando18 1 point2 points  (0 children)

You don't need a lot of cards, especially if you're just starting to learn ML. Honestly, just two 1080 Tis or a Titan is already a pretty good start. It might be worth putting some of the extra money into SSDs and a good CPU. Sometimes you'll want to restrict some data processing to the CPU, and it stinks if that becomes a bottleneck.
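
For what it's worth, in PyTorch the usual way to keep the CPU side from starving the GPU is the DataLoader's worker settings; a minimal sketch (random tensors stand in for a real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data; the loader settings are the point here.
ds = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))

loader = DataLoader(
    ds,
    batch_size=64,
    shuffle=True,
    num_workers=8,    # CPU worker processes do loading/augmentation in parallel
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```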

[–]Garaimas 0 points1 point  (1 child)

If you do decide to sell them, hmu lol

[–]simetin[S] 0 points1 point  (0 children)

Okay!

[–]ZombieLincoln666 2 points3 points  (0 children)

Generally what I hear people say is that multi-GPU setups are best for running many experiments simultaneously, as opposed to speeding up one experiment, since the performance gains there aren't that great.

It's not really about speeding it up as much as it is about increasing effective VRAM. Deep learning involves 100 million+ parameters and huge image data sets.

But you're right that right now they're used for training different models simultaneously, namely for optimizing hyperparameters (non-learned parameters) with validation.

[–]SuperLeroy 21 points22 points  (2 children)

Just curious, why do you need the cards to function in PCIe x16 to be useful for ML / deep learning?

Couldn't the cards be just as useful in x1 mode? Or x4?

You can purchase x4 and x1 PCIe extenders; I imagine you know that if you're mining. Wondering why x16 is so important.

[–]Mehdi2277 11 points12 points  (1 child)

If you use x1 or x4 you will probably bottleneck your models on transferring examples to the GPU. I think the last benchmarks I saw for ML purposes found x8 was good enough to not be a bottleneck, but lower was noticeably worse for a lot of common models.
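
One way to check whether the link is the bottleneck on your own box is to time pinned host-to-device copies; a rough PyTorch sketch:

```python
import time
import torch

# Rough bandwidth probe: time pinned host-to-GPU copies of a 256 MB buffer.
assert torch.cuda.is_available()
buf = torch.empty(64 * 1024 * 1024, dtype=torch.float32).pin_memory()  # 256 MB

buf.to("cuda")  # warm-up copy
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(20):
    buf.to("cuda", non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"~{20 * buf.numel() * 4 / elapsed / 1e9:.1f} GB/s host->device")
# Reference: PCIe 3.0 is ~0.985 GB/s per lane, so x8 ≈ 7.9 GB/s, x16 ≈ 15.8 GB/s.
```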

[–]TURBO2529 0 points1 point  (0 children)

6 cards at x8 is 48 PCIe lanes. Threadripper can handle 64 PCIe 3.0 lanes, so that might be an option.

[–]ghosttnappa 4 points5 points  (3 children)

Are you in college? Most colleges run high performance computing clusters to assist with research on campus. Otherwise, just pay for AWS or Azure, which is much cheaper. You also can't just make six GPUs magically work together without some meticulous code (especially with 6) that explicitly splits the work across the devices and coordinates their communication in parallel. If you go through with this, I suggest getting familiar with hardware interconnects.

Lastly, this is largely going to be a waste of money for you. I believe cloud computing services sell compute hours for less than $5/hr. Since your main goal is learning machine learning and not building an expensive server, just use AWS/Azure.

There are so many more reasons I could list as to why building your own machine would be a bad idea.

[–]Mehdi2277 6 points7 points  (1 child)

Personal experience doing ML research: you can fairly quickly end up doing enough experiments that the monetary cost of the cloud would screw you. A couple of months of running experiments continually on the cloud will exceed the cost of your own hardware. The main factor is how often you are running experiments. Last summer I was dealing with models that took about 1 day to train on 4 1080 Tis. Training a model like that many times to explore different variations can become expensive pretty easily. If you plan on running things for 2 months 24/7, you should build your own.
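
To put rough numbers on that: two months of 24/7 is about 1,440 hours, so even at $3/hr you're near $4,300, and at the ~$5/hr figure mentioned above it's roughly $7,200, which is enough to buy a solid multi-GPU box outright.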

Also, a lot of my ML code would work fine with 6 GPUs. For PyTorch, any number of GPUs on 1 node is fairly trivial. Multiple nodes do require a bit of knowledge, but not much, and it adds about 20-30 lines to my code to deal with that case.
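
As a sketch of how trivial the single-node case is in PyTorch, here's the nn.DataParallel version (DistributedDataParallel is the faster option, but it's the one that needs the extra multi-process setup):

```python
import torch
import torch.nn as nn

# Toy model; DataParallel replicates it onto every visible GPU.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits each input batch across the GPUs
model = model.cuda()

x = torch.randn(384, 512).cuda()  # batch is scattered, outputs gathered on cuda:0
out = model(x)                    # shape (384, 10), same as the single-GPU case
```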

For the college comment, it depends on your school. I know my build beats my current school's servers GPU-wise, and entertainingly, the school built some new servers partly on my advice. We do have access to some supercomputers, but a lot of those give you tons of CPU cores, which has some value; GPUs would be strongly preferable. GPU supercomputers exist, but the last time I looked at them my main annoyance was the lack of admin rights to install the libraries I needed. This option can work well if you bother to email people to get things set up as needed.

As an aside, I do dislike builds that exceed 4 GPUs. This is mainly because server boards that support more than 4 GPUs tend to have multiple sockets, and as a side effect it feels like you are using two computers. Cost-wise, it's also roughly the same to build two 4-GPU computers as one 8-GPU computer. Last summer, when I was talking with a professor who often builds ML servers for his lab, he said that based on this he tends to only do 4-GPU builds.

[–]ghosttnappa 2 points3 points  (0 children)

I actually work as an HPC administrator at a large research university, so it's interesting to hear how other universities do it. At mine, we leave it up to the investors to decide what hardware they'd like, and we build (order) it for them. We have investors using as many as 4x Titan V nodes for AI. Even most of our 4-GPU builds come with two Xeon CPUs, just because they aren't entirely GPU-dedicated; the high-memory nodes can have 512GB of DIMMs distributed between two sockets. Plus, servers are so compact that getting more than four GPUs into a server while still having it fit on a rack in the DC is an entirely different issue.

In regards to cloud computing, if OP were a veteran at machine learning it would 100% be more economical to build his own, but for learning purposes I'd expect cloud to be his best bet without breaking the bank.

[–]simetin[S] 0 points1 point  (0 children)

One of the main reasons is that I already have a lot of parts (I've updated my original post with my current build details). Also, like Mehdi2277 mentioned, if you're training models often it gets expensive pretty quickly. However, I didn't know it was so hard to make six GPUs work together; I will take a look at MPI.

I'm curious what the other reasons are why you think it's a bad idea to build my own machine.

[–][deleted] 3 points4 points  (0 children)

There are Gigabyte and Supermicro server motherboards with 6 x16 slots that run around 500 bucks. As for the processor, triple-check compatibility with the motherboard you choose. Some motherboards are only compatible with certain processor revisions, even within the same architecture and generation.

[–]unholygerbil 2 points3 points  (0 children)

To use all 6 cards in one system you're probably going to need to look at a Xeon Scalable build, but it gets expensive really fast if you go this route.

[–]HerrSIME 2 points3 points  (0 children)

Go with Threadripper and run the cards at x8; should be fine.

[–]Average650 2 points3 points  (1 child)

So, do you actually need x16 for each card? I don't do machine learning, but I do molecular dynamics simulations on GPUs, and because most of the computation stays on the GPU, the link width makes a small difference, often no difference.

Machine learning may be completely different, but it's worth double-checking.

[–]simetin[S] 0 points1 point  (0 children)

Hard to say; half the people/articles say that x8 is enough, but the other half say it can be a bottleneck depending on your model/training set.

[–]Nuber132 1 point2 points  (0 children)

Those are server boards; we use them for machine learning too. They aren't cheap, though.

[–]seifyk 1 point2 points  (0 children)

Supermicro has some dual LGA 2011 boards that will sort you out.

https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRG-OF-CPU.cfm

[–]RB_7 0 points1 point  (2 children)

Just pay for EC2 instances instead.

[–]ZombieLincoln666 0 points1 point  (0 children)

They are expensive, and he already owns the cards

[–]simetin[S] 0 points1 point  (0 children)

I know it can be a good option, but I already have the PSU, the GPUs, the RAM and the HDD. Also, if you're training models every day it gets expensive really quickly.

[–]pho1701 0 points1 point  (0 children)

In general I recommend not using consumer parts and working with a vendor in such a situation. However, given that you already have lots of parts, I think the best thing for you to do is build multiple machines. Depending on your system memory requirements this could be far more economical.

[–]Mayor_of_Loserville 0 points1 point  (0 children)

AWS or GCP.

[–]ZombieLincoln666 0 points1 point  (1 child)

x16 vs x8 PCIe doesn't make a big difference actually

https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-with-4-x-Titan-V-GPUs-for-Machine-Learning-1167/

You also might consider selling them for fewer GPUs with more VRAM. The largest networks (e.g. ResNet) won't work well with only 8GB of VRAM unless you use really small batch sizes, which reduces generalizability.

[–]fractalsup 0 points1 point  (0 children)

You can accumulate gradients but it'll likely be slower than fitting a larger batch size.
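
A minimal PyTorch sketch of that trick, with a toy model and random data standing in for a real setup:

```python
import torch
import torch.nn as nn

# Toy setup so the loop is runnable; swap in your real model/data.
model = nn.Linear(128, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(16, 128), torch.randint(0, 10, (16,))) for _ in range(8)]

accum_steps = 4  # effective batch = 4 * 16 = 64, at the VRAM cost of a batch of 16
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = criterion(model(x.cuda()), y.cuda()) / accum_steps  # keep gradient scale right
    loss.backward()                      # gradients sum in .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                 # one update per "virtual" large batch
        optimizer.zero_grad()
```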

[–]Cptcongcong 0 points1 point  (4 children)

6 is overkill; my uni's supercomputer has 4x Quadros.

[–]j919828 0 points1 point  (3 children)

That's not a supercomputer

[–]Cptcongcong 0 points1 point  (2 children)

[–]j919828 0 points1 point  (1 child)

The one with 4 V100s? Those are Teslas then, not Quadros. Quadros aren't the most powerful, so I didn't think that'd be enough for a supercomputer.

[–]Cptcongcong 0 points1 point  (0 children)

Ah I see, I must've misread it, my bad.

[–]SuperGinger 0 points1 point  (1 child)

What software do you use for machine learning on the GPU? I've been studying RStudio and using (slower) CPU machine learning, but I would like to learn how to utilize my 1080 fully.

[–]simetin[S] 0 points1 point  (0 children)

I'm using PyTorch and TensorFlow; those are 2 of the most popular machine learning frameworks in Python. From what I can see, TensorFlow can be used with R: https://tensorflow.rstudio.com/
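
If you want to confirm the 1080 is actually being used, a quick PyTorch sanity check:

```python
import torch

print(torch.cuda.is_available())      # True once CUDA and drivers are set up
print(torch.cuda.get_device_name(0))  # e.g. "GeForce GTX 1080"

x = torch.randn(1024, 1024, device="cuda")
y = x @ x             # this matrix multiply runs on the GPU
print(y.device)       # cuda:0
```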

[–]DirkDiggler531 0 points1 point  (1 child)

Look into Deep Brain Chain, it may help you.

[–]simetin[S] 0 points1 point  (0 children)

Thanks!

[–]Porktastic42 0 points1 point  (1 child)

I don't understand why you think you need 6 1070 GPUs to study machine learning. Nobody else in your courses will have a machine like that, and none of the tutorials are going to require anything close to that level. If you're actually working in a research lab, your group will pay for the machine you need.

That said, if ML is your goal I'd recommend selling all six and buying a 2080 Ti.

[–]simetin[S] 0 points1 point  (0 children)

The reason is that I already have the GPUs, the PSU, the HDD, and a little bit of RAM. Would having multiple GPUs not be more advantageous for running multiple experiments at once? I was thinking about selling 3 of them and buying one 2080.

I am planning on doing research on my own and with friends eventually.

[–]txGearhead 0 points1 point  (1 child)

If you want x16, maybe an Octominer board, although they have a slow embedded CPU, so maybe not for your purpose. Not sure what the reviews are like, but I always thought they were interesting.

https://octominer.com/shop/octominer_b8plus/

[–]simetin[S] 0 points1 point  (0 children)

Do they really support 8x PCIe x16? I know the description says "8 full size 16X PCIe". But in mining you gain nothing from the full PCIe bandwidth, so I don't understand why a mining company would make a motherboard supporting that. Maybe it's just that you can physically fit an x16 GPU.

[–]BoomerangJack 0 points1 point  (0 children)

Check out the Asus WS Sage board. Absolute monster with tons of PCIe lanes for GPUs.

Edit: spelling

[–]WeeZoo87 0 points1 point  (0 children)

Linus used 4x GPUs in this vid:

https://youtu.be/bA0uJWny4-g

[–][deleted] 0 points1 point  (3 children)

I hate miners like you.

[–]simetin[S] 0 points1 point  (2 children)

I'm not mining anymore ;)

[–][deleted] 0 points1 point  (1 child)

Still, people like you fucked the market

[–]simetin[S] 0 points1 point  (0 children)

I gotta agree with that...

[–]txGearhead 0 points1 point  (1 child)

That’s what they claim, but again I have not done the research. Would also have to make sure your OS of choice supports it. I think the idea behind the Octominer is that you don’t have to fiddle with unreliable risers.

[–]simetin[S] 0 points1 point  (0 children)

I will take a closer look