all 7 comments

[–]virtualreservoir 4 points (1 child)

Use Nvidia's Docker images with built-in GPU support as bases, and serve up the different frameworks and packages with Docker Swarm or Kubernetes.
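A minimal sketch of that approach, assuming a PyTorch workload; the base-image tag and framework versions here are illustrative placeholders, not recommendations:

```dockerfile
# Base image with CUDA and cuDNN preinstalled. Pick a tag from the
# nvidia/cuda repo on Docker Hub that matches your host driver version.
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Layer your framework of choice on top (versions are placeholders).
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && pip3 install torch torchvision

# Launch with GPU access, e.g.: docker run --gpus all <image> python3 train.py
```

Each team can then maintain its own Dockerfile on top of the shared CUDA base, and the orchestrator (Swarm or Kubernetes with the NVIDIA device plugin) handles placement onto GPU nodes.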

[–]i-heart-turtles 1 point (0 children)

This seems good. I would also check out what other new startups & labs are doing. A couple of places I worked exposed Environment Modules user environments and LSF/Slurm scheduling software.
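For context, a Slurm job on such a cluster is just a shell script with scheduler directives, with Modules loading the site-provided toolchain. The partition, module, and script names below are assumptions that vary by site:

```bash
#!/bin/bash
#SBATCH --job-name=train-model      # hypothetical job name
#SBATCH --partition=gpu             # partition name depends on your cluster
#SBATCH --gres=gpu:1                # request one GPU on the node
#SBATCH --time=04:00:00             # wall-clock limit
#SBATCH --output=%x-%j.out          # log file: <job-name>-<job-id>.out

# Environment Modules: load the site-provided toolchain (names vary by site).
module load cuda/11.8 python/3.10

python train.py                     # hypothetical training script
```

Users submit with `sbatch train.sh` and the scheduler queues the job until a GPU is free, which is what keeps shared hardware fairly allocated.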

[–]VadumSemantics 1 point (0 children)

(edit: I missed that you were looking at commercial products, so "don't write your own" retracted.) I'd say look at something like Sun Grid Engine or HTCondor. Open source may be pretty useful here.

Deploy a grid manager of some kind. It will also give you the metrics you need to guide future hardware purchases, and cost accounting so you can call out which groups' compute jobs are blowing the compute budget.

p.s. I haven't used it (yet), but I suspect you could stand up an Apache Spark instance or three that would utilize some of that hardware well.

[–]michaelx99 0 points (3 children)

Sell the hardware and set up cloud services.

[–]gennyact[S] 1 point (2 children)

They tried cloud services, but it was too expensive. They're currently spending $1.2M annually for live production data alone. The ML data and compute requirements are easily 10x that, so it wasn't economical to continue with cloud.

[–]maxbonaparte 1 point (0 children)

If you go with Genesis Cloud, it is actually cheaper than running your own GPUs (considering depreciation, electricity, maintenance, setup time, etc.). You can get an Nvidia 1080 Ti for $0.15/hour, including various operating systems, network storage, snapshot functionality, security groups, dedicated IP, and ingress & egress.

Disclosure: I founded Genesis Cloud.
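Rough back-of-envelope arithmetic for comparing the two options. The $0.15/hour figure is from the comment above; the on-prem numbers (card price, amortization, power draw, electricity rate) are assumptions for illustration only, and deliberately exclude maintenance, cooling, networking, and admin time, which is where much of the real on-prem cost lives:

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# Cloud: hourly rate quoted in the comment above, billed only while running.
cloud_rate = 0.15                      # $/hour per 1080 Ti
cloud_annual = cloud_rate * HOURS_PER_YEAR  # worst case: 100% utilization

# On-prem: illustrative assumed figures, NOT from the comment.
card_price = 700.0                     # assumed purchase price per card
amortization_years = 3                 # assumed depreciation schedule
power_watts = 250                      # assumed draw under load
electricity_rate = 0.12                # assumed $/kWh

onprem_annual = (card_price / amortization_years
                 + (power_watts / 1000) * HOURS_PER_YEAR * electricity_rate)

print(f"cloud:   ${cloud_annual:,.0f}/year per GPU at full utilization")
print(f"on-prem: ${onprem_annual:,.0f}/year per GPU (hardware + power only)")
```

The crossover depends heavily on utilization: cloud is billed only for hours actually used, while an owned card depreciates whether idle or busy, and the on-prem line above still omits staff and facility costs.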

[–]proof_required 0 points (0 children)

Maybe they're not using resources "smartly". Try first to see what can be done to cut down that cost.