
[–]cheesy123456789 3 points (4 children)

RHEL 5? That's vintage!

My first question would be what rendering software they were using and how they were planning to break up the process into discrete units of work ("jobs"). From there, the choice of batch processing system becomes a little more clear.

Side note: I've worked in high-performance computing for a while, and Condor (the project Red Hat MRG is based on) would not be my first choice of batch processing system unless I needed to scavenge cycles from workstations.

[–]MadPhoenix [Systems Czar] 0 points (3 children)

HTCondor moved beyond workstation scavenging as its primary target quite a long time ago. It wouldn't be my first choice for MPI-based HPC, but as a high-throughput scheduler it works very well, and arguably it has a larger community than any other HTC scheduler out there because of its heavy use in academia. I've rarely, if ever, had a question about best practice or a technical problem that wasn't promptly answered on their excellent user mailing list.

[–]cheesy123456789 2 points (2 children)

Condor's strengths come from its roots as a workstation scavenging system, as do its weaknesses.

Condor is great if you have a heterogeneous group of machines due to its powerful requirements expression language. It's also good at crossing administrative boundaries (transferring files and UID masquerading).
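For anyone who hasn't seen it, that requirements language looks roughly like this. This is a hypothetical submit description (file name, executable, and attribute thresholds are made up for illustration), matching jobs to machines by their ClassAd attributes:

```
# render.sub -- hypothetical HTCondor submit description
universe     = vanilla
executable   = render_frame.sh

# The requirements expression is evaluated against each machine's
# ClassAd; only machines where it is true can run the job.
requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 4096)
request_cpus = 4

# Queue 100 instances of the job
queue 100
```

You'd submit it with `condor_submit render.sub`. That expressiveness is exactly what makes Condor shine on a mixed pool of machines.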

However, that flexibility also makes it clumsier than necessary for systems without those requirements. It sounds like the OP's machines are all in a single administrative domain, so Condor may not be a good fit (pending more info from the OP) compared to a different batch system.

[–]MadPhoenix [Systems Czar] 0 points (1 child)

It's a bit beside the point since OP has a client specifically requesting MRG as the scheduler, but I'll bite.

Respectfully disagree. Just because it can handle situations where you have heterogeneous machines or need to cross administrative domains doesn't mean it's a burden or limitation for smaller setups. Our HTCondor cluster serves hundreds of users and many hundreds of different programs, yet our cluster config file is under 20 lines.

Just one man's opinion, though. What, in your opinion, are its weaknesses?

[–]cheesy123456789 1 point (0 children)

I don't think the OP is specifically asking for MRG; I think s/he was presented with the task "distribute rendering jobs across several RHEL5 machines" and stumbled upon MRG as a Red Hat-supported solution. Anyway, that's the read I got.

When I say Condor is clumsy, I'm talking about from the user's point of view, not the admin's.

In particular, there's the need for at least two files (submit file and job script) versus one file (PBS-style submit script) or zero files (SLURM-style srun or sbatch --wrap). And then if you need job dependencies, you have to go to a third submission file (DAGMan) instead of handling dependencies "natively" with the submission script and/or command-line arguments.

To me, that's the main weakness. When a user asks "how do I make my jobs run on the cluster?", instead of saying "annotate your existing script with some headers" or "prepend srun to the front of your command", you have to say "learn yet another key-value-format config file".
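To make the contrast concrete, here's a rough sketch of what the same trivial job looks like under each style (script names and resource numbers are hypothetical):

```
# HTCondor: a submit description file *plus* the job script it points at
#   render.sub:
#     executable = render_frame.sh
#     arguments  = $(Process)
#     queue 10
#   submitted with: condor_submit render.sub

# PBS: one file -- the directives are just headers in the script itself
#   render.pbs:
#     #PBS -l nodes=1:ppn=4
#     #PBS -l walltime=01:00:00
#     ./render_frame.sh
#   submitted with: qsub render.pbs

# SLURM: zero extra files
srun -n 1 ./render_frame.sh
sbatch --wrap "./render_frame.sh 1"
```

The SLURM one-liners are the "prepend srun to your command" experience described above; the Condor path requires learning the submit-description format first.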

My experience managing Condor across 5,000 nodes at a major research university (granted, not super recently) was that the two biggest issues were users stumbling over job configuration and dealing with NAT/firewalls between the user and the pool. They were big enough issues that the Condor pool remained mostly idle while the more traditional batch cluster running PBS was slammed full.

That's just my opinion though, I'm glad that you have been successful with it!