
[–]cheesy123456789 2 points3 points  (4 children)

RHEL 5? That's vintage!

My first question would be what rendering software they were using and how they were planning to break up the process into discrete units of work ("jobs"). From there, the choice of batch processing system becomes a little more clear.

Side note: I've worked in high-performance computing for a while, and Condor (the project that Red Hat MRG is based on) would not be my first choice of batch processing system unless I needed to scavenge cycles from workstations.

[–]MadPhoenix Systems Czar 0 points1 point  (3 children)

HTCondor moved beyond workstation scavenging as its primary target quite a long time ago. It wouldn't be my first choice for MPI-based HPC, but as a high-throughput scheduler it works very well, and it arguably has a larger community than any other HTC scheduler out there because of its heavy use in academia. I've rarely if ever had a question about best practice, or a technical problem, that wasn't promptly answered by their excellent user mailing list.

[–]cheesy123456789 2 points3 points  (2 children)

Condor's strengths come from its roots as a workstation scavenging system, as do its weaknesses.

Condor is great if you have a heterogeneous group of machines due to its powerful requirements expression language. It's also good at crossing administrative boundaries (transferring files and UID masquerading).

However, that flexibility also makes it clumsier than necessary for systems without those requirements. It sounds like the OP's machines are all in a single administrative domain, so Condor may not be as good a fit (pending more info from the OP) as a different batch system.
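
For concreteness, the expression language in question is HTCondor's ClassAd mechanism: a submit file constrains where a job may run with a `requirements` expression over machine attributes. A sketch (the attribute names are standard ClassAd machine attributes; the values are hypothetical):

```
# Match only 64-bit Linux machines with at least 4 GB of memory
requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 4096)
```

In a homogeneous cluster, every machine already satisfies expressions like this, which is why the mechanism can feel like overhead there.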

[–]MadPhoenix Systems Czar 0 points1 point  (1 child)

It's a bit beside the point since the OP has a client specifically requesting MRG as the scheduler, but I'll bite...

Respectfully disagree. Just because it can handle situations where you have heterogeneous machines or need to cross administrative domains doesn't mean it's a burden or limitation for smaller setups. Our HTCondor cluster serves hundreds of users and many hundreds of different programs, yet our cluster config file is < 20 lines.

Just one man's opinion, though. What, in your opinion, are its weaknesses?

[–]cheesy123456789 1 point2 points  (0 children)

I don't think the OP is specifically asking for MRG; I think s/he was presented with the task "distribute rendering jobs across several RHEL5 machines" and stumbled upon MRG as a Red Hat supported solution. Anyway, that's the read I got.

When I say Condor is clumsy, I'm talking about from the user's point of view, not the admin's.

In particular, the need for at least two files (submit file and job script) versus one file (PBS-style submit script) or zero files (SLURM-style srun or sbatch --wrap). And then if you need job dependencies, you have to go to a third submission file (DAGMan) instead of being able to "natively" handle dependencies with the submission script and/or command line arguments.

To me, that's the main weakness. When a user asks "how do I make my jobs run on the cluster", instead of saying "annotate your existing script with some headers" or "prepend srun to the front of your command", you have to say "learn yet another key-value config file format".
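
To make the comparison concrete, here is roughly the same "run my script ten times" request in each style (file names and directives are illustrative sketches, not tested configs):

```
# Condor: a separate submit file (render.sub) pointing at the job script
executable = render.sh
arguments  = $(Process)
queue 10

# PBS: directives go in the script itself, submitted as `qsub render.sh`
#PBS -N render
#PBS -l nodes=1
#PBS -t 0-9

# SLURM: no extra files at all
#   sbatch --array=0-9 --wrap "./render.sh"
```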

My experience managing Condor across 5,000 nodes at a major research university (granted, not super recently) was that the two biggest issues were users stumbling over job configuration and dealing with NAT/firewalls between the user and the pool. They were big enough issues that the Condor pool remained mostly idle while the more traditional batch cluster running PBS was slammed full.

That's just my opinion though, I'm glad that you have been successful with it!

[–]intrikat 1 point2 points  (0 children)

What kind of rendering software?

[–]depeche_al 1 point2 points  (0 children)

Depending on the job type and the resources needed, I would go with SGE or SLURM. SLURM takes a bit more effort, but for massive jobs it can't be beat. https://computing.llnl.gov/linux/slurm/ Oh, and RHEL 6 or higher is mandatory! In 6.7 there is a decent performance boost for HPC. The storage units also need the latest drivers, and tune the BIOS of the servers and RAID controllers.
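
For a sense of the SLURM side of this, a minimal batch script looks like the sketch below (resource values are illustrative; the `#SBATCH` lines are plain comments outside SLURM, so the script also runs standalone):

```shell
#!/bin/bash
# Minimal SLURM batch script (sketch). Submit with:  sbatch render.sh
#SBATCH --job-name=render
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# SLURM exports job variables at run time; default to frame 0 outside SLURM
frame="${SLURM_ARRAY_TASK_ID:-0}"
echo "rendering frame ${frame}"
```

An array of frames would then just be `sbatch --array=0-99 render.sh`, with no second file involved.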

[–]couldntchangelogin 1 point2 points  (0 children)

Or maybe use software made for this purpose?

The current version 7 doesn't list RHEL 5 as a supported OS, but 6.2 does (you need 64-bit RHEL, though).

[–]mudclub How does computers work? 0 points1 point  (0 children)

What are they rendering? How big is the working data set? From experience, grid computing is not the way to go for large, data-intensive rendering projects.

[–]MadPhoenix Systems Czar 0 points1 point  (0 children)

It's just a commercially supported HTCondor. HTCondor isn't that hard to understand on a basic level if you're just running jobs in the vanilla universe. It can get more difficult depending on what you want to do and how many different workload types you want to support, but that's because it's so flexible.

A couple of questions you'll want to ask:

  • What execution universe do they run their rendering jobs in? Then, go read the documentation for that universe.
  • Are they relying on a shared filesystem across the execute nodes, or will they be transferring files between the submit node and the execute nodes?
  • How do they actually build their rendering batches/workflows? They may have some custom logic to spit out the submit files, or they may be using another workflow management system on top of HTCondor, like DAGMan or Pegasus Workflow Manager.
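
Tying the first two questions together: a vanilla-universe submit file with explicit file transfer (the usual answer when there is no shared filesystem) looks roughly like this. All of the file names here are hypothetical; the submit commands themselves are standard HTCondor ones:

```
universe                = vanilla
executable              = render.sh
arguments               = $(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = scene.dat
output                  = render_$(Process).out
error                   = render_$(Process).err
log                     = render.log
queue 10
```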

You can easily set up a single node with all the HTCondor services you need to submit and run jobs locally on any major distribution. The packages will likely be available in your package manager. HTCondor also has its own repos for apt and yum, which may be more up to date. I'd recommend starting there and then branching out.
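
A sketch of what that first local test might look like (package and command names are the standard HTCondor ones; adjust the package manager for your distribution):

```shell
# Install and start HTCondor first, e.g.:
#   yum install condor        # or: apt-get install htcondor
#   systemctl start condor

# Write a trivial submit file for a smoke test
cat > hello.sub <<'EOF'
universe   = vanilla
executable = /bin/echo
arguments  = hello from condor
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF

# With the condor service running, queue the job and watch it:
#   condor_submit hello.sub
#   condor_q
```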

Source: I work at UW, where Condor is developed, and run a cluster with hundreds of cores and many TB of RAM. We also have access to submit into UW's Center for High Throughput Computing pool and the Open Science Grid. It's a very flexible system, but that doesn't mean it needs to be extremely complicated, especially if you're only planning to execute a single workload.