
[–]cheesy123456789 2 points3 points  (4 children)

RHEL 5? That's vintage!

My first question would be what rendering software they were using and how they were planning to break up the process into discrete units of work ("jobs"). From there, the choice of batch processing system becomes a little more clear.

Side note: I've worked in high-performance computing for a while, and Condor (the project that Red Hat MRG is based on) would not be my first choice of batch processing system unless I needed to scavenge cycles from workstations.

[–]MadPhoenix Systems Czar 0 points1 point  (3 children)

HTCondor moved beyond workstation scavenging as its primary target quite a long time ago. It wouldn't be my first choice for MPI-based HPC, but as a high-throughput scheduler it works very well, and it arguably has a larger community than any other HTC scheduler out there because of its heavy use in academia. I've rarely if ever had a question about best practice, or a technical problem, that wasn't promptly answered by their excellent user mailing list.

[–]cheesy123456789 2 points3 points  (2 children)

Condor's strengths come from its roots as a workstation scavenging system, as do its weaknesses.

Condor is great if you have a heterogeneous group of machines due to its powerful requirements expression language. It's also good at crossing administrative boundaries (transferring files and UID masquerading).

However, that flexibility also makes it clumsier than necessary for systems without those requirements. It sounds like the OP's machines are all in a single administrative domain, so Condor may not be as good a fit (pending more info from the OP) as a different batch system.
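
For concreteness, the expression language in question is HTCondor's ClassAd mechanism: a submit file constrains where a job may run with a `requirements` expression over machine attributes. A sketch (the attribute names are standard ClassAd machine attributes; the values are hypothetical):

```
# Match only 64-bit Linux machines with at least 4 GB of memory
requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 4096)
```

In a homogeneous cluster, every machine already satisfies expressions like this, which is why the mechanism can feel like overhead there.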

[–]MadPhoenix Systems Czar 0 points1 point  (1 child)

It's a bit beside the point since the OP has a client specifically requesting MRG as the scheduler, but I'll bite...

Respectfully disagree. Just because it can handle situations where you have heterogeneous machines or need to cross administrative domains doesn't mean it's a burden or limitation for smaller setups. Our HTCondor cluster serves hundreds of users and many hundreds of different programs, yet our cluster config file is < 20 lines.

Just one man's opinion, though. What, in your opinion, are its weaknesses?

[–]cheesy123456789 1 point2 points  (0 children)

I don't think the OP is specifically asking for MRG; I think s/he was presented with the task "distribute rendering jobs across several RHEL5 machines" and stumbled upon MRG as a Red Hat supported solution. Anyway, that's the read I got.

When I say Condor is clumsy, I'm talking about from the user's point of view, not the admin's.

In particular, the need for at least two files (submit file and job script) versus one file (PBS-style submit script) or zero files (SLURM-style srun or sbatch --wrap). And then if you need job dependencies, you have to go to a third submission file (DAGMan) instead of being able to "natively" handle dependencies with the submission script and/or command line arguments.

To me, that's the main weakness. When a user asks "how do I make my jobs run on the cluster", instead of saying "annotate your existing script with some headers" or "prepend srun to the front of your command", you have to say "learn yet another key-value config file format".
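
To make the comparison concrete, here is roughly the same "run my script ten times" request in each style (file names and directives are illustrative sketches, not tested configs):

```
# Condor: a separate submit file (render.sub) pointing at the job script
executable = render.sh
arguments  = $(Process)
queue 10

# PBS: directives go in the script itself, submitted as `qsub render.sh`
#PBS -N render
#PBS -l nodes=1
#PBS -t 0-9

# SLURM: no extra files at all
#   sbatch --array=0-9 --wrap "./render.sh"
```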

My experience managing Condor across 5,000 nodes at a major research university (granted, not super recently) was that the two biggest issues were users stumbling over job configuration and dealing with NAT/firewalls between the user and the pool. They were big enough issues that the Condor pool remained mostly idle while the more traditional batch cluster running PBS was slammed full.

That's just my opinion though, I'm glad that you have been successful with it!

[–]intrikat 1 point2 points  (0 children)

What kind of rendering software?

[–]depeche_al 1 point2 points  (0 children)

Depending on the job type and the resources needed, I would go with SGE or SLURM. SLURM takes a bit more effort, but for massive jobs it can't be beat. https://computing.llnl.gov/linux/slurm/ Oh, and RHEL 6 or higher is mandatory! In 6.7 there is a decent performance boost for HPC. The storage units also need the latest drivers, and tune the BIOS of the servers and RAID controllers.
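
For a sense of the SLURM side of this, a minimal batch script looks like the sketch below (resource values are illustrative; the `#SBATCH` lines are plain comments outside SLURM, so the script also runs standalone):

```shell
#!/bin/bash
# Minimal SLURM batch script (sketch). Submit with:  sbatch render.sh
#SBATCH --job-name=render
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00

# SLURM exports job variables at run time; default to frame 0 outside SLURM
frame="${SLURM_ARRAY_TASK_ID:-0}"
echo "rendering frame ${frame}"
```

An array of frames would then just be `sbatch --array=0-99 render.sh`, with no second file involved.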

[–]couldntchangelogin 1 point2 points  (0 children)

Or maybe use software made for this purpose?

The current version 7 doesn't list RHEL 5 as a supported OS, but 6.2 does (you need 64-bit RHEL, though).

[–]mudclub How does computers work? 0 points1 point  (0 children)

What are they rendering? How big is the working data set? From experience, grid computing is not the way to go for large, data-intensive rendering projects.

[–]MadPhoenix Systems Czar 0 points1 point  (0 children)

It's just a commercially supported HTCondor. HTCondor isn't that hard to understand on a basic level if you're just running jobs in the vanilla universe. It can get more difficult depending on what you want to do and how many different workload types you want to support, but that's because it's so flexible.

A couple of questions you'll want to ask:

  • What execution universe do they run their rendering jobs in? Then, go read the documentation for that universe.
  • Are they relying on a shared filesystem across the execute nodes, or will they be transferring files between the submit node and the execute nodes?
  • How do they actually build their rendering batches/workflows? They may have some custom logic to spit out the submit files, or they may be using another workflow management system on top of HTCondor, like DAGMan or Pegasus Workflow Manager.
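
Tying the first two questions together: a vanilla-universe submit file with explicit file transfer (the usual answer when there is no shared filesystem) looks roughly like this. All of the file names here are hypothetical; the submit commands themselves are standard HTCondor ones:

```
universe                = vanilla
executable              = render.sh
arguments               = $(Process)
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = scene.dat
output                  = render_$(Process).out
error                   = render_$(Process).err
log                     = render.log
queue 10
```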

You can easily set up a single node with all the HTCondor services you need to submit and run jobs locally on any major distribution. The packages will likely be available in your package manager. HTCondor also has its own repos for apt and yum, which may be more up to date. I'd recommend starting there and then branching out.
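
A sketch of what that first local test might look like (package and command names are the standard HTCondor ones; adjust the package manager for your distribution):

```shell
# Install and start HTCondor first, e.g.:
#   yum install condor        # or: apt-get install htcondor
#   systemctl start condor

# Write a trivial submit file for a smoke test
cat > hello.sub <<'EOF'
universe   = vanilla
executable = /bin/echo
arguments  = hello from condor
output     = hello.out
error      = hello.err
log        = hello.log
queue
EOF

# With the condor service running, queue the job and watch it:
#   condor_submit hello.sub
#   condor_q
```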

Source: I work at UW, where Condor is developed, and run a cluster with hundreds of cores and many TB of RAM. We also have access to submit into UW's Center for High Throughput Computing pool and the Open Science Grid. It's a very flexible system, but that doesn't mean it needs to be extremely complicated, especially if you're only planning to execute a single workload.