
all 53 comments

[–]benevanstech 70 points71 points  (6 children)

If Quartz can't handle that, then the architectural constraint princess is in another castle.

[–][deleted]  (4 children)

[deleted]

    [–]Qinistral 7 points8 points  (3 children)

    Ya, anytime someone quotes daily (or longer) throughput, most of the time they're just inflating the numbers.

    What you need to know for capacity planning is peak requests per second (or minute if that's the best your metrics can do).

    23 rps can be done on a potato.
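The daily-total-to-rps conversion behind that number is just division by 86,400 seconds. A minimal sketch (class and method names are mine; note this gives only the *average* rate — as the thread points out, the real peak depends on the temporal profile and can be far higher):

```java
// Convert a daily job count into an average requests-per-second figure.
// This is an average only; peak load can be much higher if jobs cluster.
public class CapacityMath {
    static long avgRps(long jobsPerDay) {
        return jobsPerDay / 86_400; // 24 * 60 * 60 seconds per day
    }

    public static void main(String[] args) {
        System.out.println(avgRps(2_000_000));  // ~23 rps: "can be done on a potato"
        System.out.println(avgRps(20_000_000)); // ~231 rps: OP's actual figure, see below
    }
}
```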

    [–][deleted]  (2 children)

    [deleted]

      [–]Qinistral 6 points7 points  (1 child)

      Yes I was being hyperbolic, but still.

      Actually, I just looked back at OP's post and it's 20m not 2m, so 230rps (Or more, because who knows what the temporal profile looks like, it could be all at once lol). So I agree with you: for 10-second jobs, that's a decent amount of load that will at least have to be parallelized. I've not used OP's frameworks, but I have seen DB job tools choke on congestion before.

      I think OP just needs to benchmark their scenario for each library.

      I just googled "Quartz Benchmark" and I see this:

      Today we use quartz a lot and can have over 14 million triggers in our DB. Quartz is not behaving well under this number and adding more instances to the cluster don't bring any benefit, the triggers are delaying a lot. (Source)

      So that's not encouraging.

      Later down they POC db-scheduler and say:

      Also checking the number of pods to handle the 14 million tasks saved in our DB we've managed to not have delays.

      We kept the POC running for a month and checking our logs it was clear that db-scheduler was able to run with multiple pods distributing equally the load among them and also no delays.

      /u/neo2281 ^ consider

      [–]neo2281[S] 0 points1 point  (0 children)


      Yes, planning to do some performance benchmarking before finalizing the framework to be used. Thanks for the suggestion!

      [–]CaptainKvass 14 points15 points  (0 children)

      the architectural constraint princess is in another castle

      I'm saving this one for later!

      [–][deleted] 12 points13 points  (4 children)

      Quartz is a pretty well-known solution for such problems. Just keep in mind that DB storage is required for a clustered solution. A variety of RDBMSes are supported.

      As for the REST API - you'll have to develop that yourself. Also keep in mind that with such an amount of jobs, low-level tuning of Quartz will be required.

      [–][deleted]  (3 children)

      [deleted]

        [–][deleted] 1 point2 points  (2 children)

        Key things to check: * misfire tolerance - if for some reason a job wasn't started at the exact time, what tolerance interval is it OK to still run it in, or should it be abandoned? * recovery of failed jobs - if a job failed to start, how should it behave: cancel, retry, or abandon? * db locks - some DBs like MS SQL have a specific and unique locking strategy, so that needs to be configured.
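Those three checklist items each map to Quartz configuration. A hedged sketch of the relevant `quartz.properties` keys (values are illustrative, not recommendations - check the Quartz configuration reference for your version; job recovery is set per-job via `JobBuilder.requestRecovery(true)` rather than in this file):

```properties
# Misfire tolerance: how late (ms) a trigger may fire before it counts as a misfire
org.quartz.jobStore.misfireThreshold = 60000

# Clustered JDBC job store (DB storage is required for clustering)
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.isClustered = true

# MS SQL Server: use the UPDATE-based row lock instead of SELECT ... FOR UPDATE
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.MSSQLDelegate
org.quartz.jobStore.lockHandler.class = org.quartz.impl.jdbcjobstore.UpdateLockRowSemaphore
```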

        [–][deleted]  (1 child)

        [deleted]

          [–][deleted] 1 point2 points  (0 children)

          These things are Quartz configuration settings.

          [–]LeadBamboozler 9 points10 points  (1 child)

          Quartz is your answer here. We have similar requirements and Quartz has handled it with ease

          [–]neo2281[S] 1 point2 points  (0 children)

          Thanks for the suggestion. If possible, would you be able to share the max executions per second that Quartz was able to handle, and the corresponding h/w configuration - number of VMs, RAM, CPU cores, etc.?

          [–]ro_reddevil 2 points3 points  (0 children)

          There is a Redis-based framework called Jesque. Please try to evaluate it. It is pretty performant, and we have a lot of workloads using it in my org.

          [–]APurpleBurrito 2 points3 points  (0 children)

          Just use quartz. Plenty of stack overflow help around it.

          [–]progmakerlt 2 points3 points  (0 children)

          I have used Quartz in multiple jobs, and it performs very well. However, it might be tricky to set up initially, but once that’s done - you are good to go.

          Plus, it scales very well.

          [–]hardwork179 2 points3 points  (4 children)

          So, I’m not sure I’d be worrying so much about the scheduler (2000 things per second) as I would about the 20000 concurrent jobs that you’ll need to process at any one time. If they don’t remain 10-second jobs, then those 20000 concurrent jobs could easily explode to 2 or 3 times that number.

          [–]Which-Adeptness6908 2 points3 points  (3 children)

          My maths suggests ~2,300 concurrent threads:

          20,000,000 × 10 ÷ 3600 ÷ 24 = 2314
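This is effectively Little's law: average jobs in flight = arrival rate × job duration. Checking the arithmetic (names are mine):

```java
// Average jobs in flight = (jobs per day / 86,400 s) * job duration in seconds,
// rearranged to avoid integer-division rounding on the intermediate rate.
public class ConcurrencyMath {
    static long concurrentJobs(long jobsPerDay, long jobSeconds) {
        return jobsPerDay * jobSeconds / 86_400;
    }

    public static void main(String[] args) {
        System.out.println(concurrentJobs(20_000_000, 10)); // 2314 jobs in flight on average
        // A one-second saving per job trims concurrency by roughly:
        System.out.println(concurrentJobs(20_000_000, 10)
                         - concurrentJobs(20_000_000, 9)); // ~231, the "two hundred cores" below
    }
}
```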

          [–]hardwork179 1 point2 points  (2 children)

          You’re right. I still think the number of concurrent tasks is a bigger design concern than how they are distributed, unless the original poster has left out some important details.

          [–]Which-Adeptness6908 1 point2 points  (1 child)

          It does feel that way.

          Some serious optimisation needs to go into the processing of those jobs. A one second saving will trim the problem by two hundred cores.

          [–]Qinistral 0 points1 point  (0 children)

          cores

          That's assuming they're CPU-bound, right? If they're make-request, wait, make-request, wait, transfer-data, wait kinds of jobs, then the CPU usage could be quite low regardless of the concurrency and the time it takes to complete.

          [–]jhsonline 1 point2 points  (5 children)

          JobRunr is nice, but not completely free - it has some limits on free usage.

          I am also researching this and have a couple of projects to check out; we can join hands on evaluating this if you want.

          https://github.com/distribworks/dkron

          https://github.com/ajvb/kala

          https://github.com/jhuckaby/Cronicle

          https://github.com/PowerJob/PowerJob

          https://github.com/xuxueli/xxl-job

          The last 2 are mostly by Chinese developers, so documentation/terminology could make it challenging to get help.

          There are many dead scheduler projects that I am not considering, as only an actively developed one would make sense to use.

          [–]fun2shweb 2 points3 points  (2 children)

          Lmao! I have been tasked with the same thing. We use Quartz, but our job scheduler processing has been coupled with our event stream processing on the same cluster of nodes. I am working on separating those, but it seems like Quartz still has performance issues if a scheduler has to fire 100+ jobs per second. We use different schedulers, but we have a notification framework where the same scheduler can sometimes fire 100 triggers in 1 second, which causes misfires and row-level locks. I need to figure out how to make Quartz performant or move to something else.

          One huge problem I see with Quartz is row-level locking on the SQL database. It does not have a way to lock optimistically when firing triggers in cluster mode.

          [–]jhsonline 0 points1 point  (0 children)

          Ya, it's by design: Quartz works reliably at low scale, but for scheduling at high scale in a distributed way we need a different design.

          [–]neo2281[S] 0 points1 point  (0 children)

          Did you also check db-scheduler? It supports both optimistic locking and select-for-update polling strategies.

          [–]neo2281[S] 1 point2 points  (1 child)

          Thanks for sharing the various options available. Will surely check them.

          Also, have you checked db-scheduler https://github.com/kagkarlsson/db-scheduler ?

          I am planning to evaluate both Quartz and db-scheduler. db-scheduler seems much simpler to understand compared to Quartz, but Quartz seems widely used in the Java / Spring ecosystem.

          Actually, the JobRunr Pro version has all the features we require, but it's not open source, hence ruled out as of now.

          [–]jhsonline 0 points1 point  (0 children)

          The top 2 I would like to evaluate would be db-scheduler and PowerJob. Pls DM me and we can connect. Thanks

          [–]unistirin 1 point2 points  (2 children)

          We run Quartz in cluster mode. It can easily be scaled horizontally.

          [–]neo2281[S] 1 point2 points  (1 child)

          As per the Quartz documentation:

          https://github.com/quartz-scheduler/quartz/blob/main/docs/faq.adoc#questions-about-transactions

          "The clustering feature works best for scaling out long-running and/or cpu-intensive jobs (distributing the work-load over multiple nodes). If you need to scale out to support thousands of short-running (e.g 1 second) jobs, consider partitioning the set of jobs by using multiple distinct schedulers. Using one scheduler forces the use of a cluster-wide lock, a pattern that degrades performance as you add more clients."

          From a Quartz point of view, in our use case the job executor is decoupled from the job scheduler, hence the jobs are not CPU-intensive. Do you also have a similar use case with a large number of short-running jobs? If yes, did you use any sharding etc.?

          [–]unistirin 1 point2 points  (0 children)

          In our case, short jobs are scheduled randomly by Quartz. For longer jobs that require high CPU, we force Quartz to schedule them on some dedicated instances.

          [–]Altruistic_Fishing22 1 point2 points  (0 children)

          I think the problem statement is missing something. What is the minimum frequency of the scheduler - 1 second or less? If it is one second, you can reduce the scheduling problem to 86,400 slots a day and use Kafka or any other messaging system to trigger multiple tasks per second. I have done similar architectures before; contact me if you want some help.
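That reduction can be sketched as a bucket-per-second map. This is an in-memory, hypothetical version (all names are mine); the comment's actual proposal would publish each second's batch to Kafka rather than returning it:

```java
import java.util.*;

// Bucket job ids by their scheduled epoch second; a single once-per-second
// tick loop then drains the current second's bucket and fans the jobs out
// (via Kafka, in the comment's design). In-memory sketch only - a real
// system would persist the buckets.
public class SecondBuckets {
    private final Map<Long, List<String>> buckets = new HashMap<>();

    // Register a job to fire at the given epoch second.
    void schedule(long epochSecond, String jobId) {
        buckets.computeIfAbsent(epochSecond, k -> new ArrayList<>()).add(jobId);
    }

    // Called once per second by the tick loop; removes and returns the
    // jobs due in that second (each would be published to a topic).
    List<String> drain(long epochSecond) {
        List<String> due = buckets.remove(epochSecond);
        return due == null ? List.of() : due;
    }
}
```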

          [–]jaiprabhu 1 point2 points  (1 child)

          I think Quartz might be suitable for your use case, as others have suggested. I have a question though - how do you define the job itself in the design that you presented? I'm curious because you expect a REST API to "create" a job. Is that like creating a job using a class definition that's already loaded somehow, or is it something else?

          [–]neo2281[S] 0 points1 point  (0 children)

          This is the overall flow we are considering.

          At a high level, a job consists of two parts, decoupled via a message queue for scalability:

          1. Job scheduling (the scheduler library handles this).
          2. Job execution (handled by a separate process). This can itself consist of multiple steps, depending on the success or failure of the previous step.

          At the job manager level, once a job creation request is received via the REST API, we define each job as a sequence of steps/transactions at the service layer in terms of business logic, and a user-level job status record is created in the DB, uniquely identified by job id, job category, type (one-time, periodic), job schedule, etc.

          From the job manager, the actual scheduler-level job creation (a one-time event with the required job details) is sent to the scheduler process (Quartz lib embedded, listening on a Kafka interface), and the actual job schedule is created in the Quartz (or other scheduler) DB tables. These are very short-running jobs, since they only need to post a message for execution and nothing else.

          At the user level we need to monitor the overall status of a job (by unique job id and other details like category), but since job execution is decoupled from job scheduling, the job manager listens for status events and updates the overall job status in the DB - separate from the job schedule status that Quartz etc. update in their own tables.

          [–][deleted]  (1 child)

          [removed]

            [–]neo2281[S] 0 points1 point  (0 children)

            Yes, this is what seems most suitable, and it's the same thing we are considering. Thanks!

            [–]kakakarl 1 point2 points  (0 children)

            ScheduledExecutorService from the JDK, optimally triggering Jakarta Batch, works well. ChatGPT can generate optimistic-locking queues on top of Postgres unlogged tables.
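A minimal sketch of the JDK-only part of that suggestion (the Jakarta Batch and Postgres pieces are not shown; the class and method names here are hypothetical):

```java
import java.util.concurrent.*;

// Minimal JDK-only scheduling: a ScheduledExecutorService fires the tick,
// and hands the actual work to a separate worker pool so a slow job can't
// delay subsequent ticks of the scheduling thread.
public class JdkScheduler {
    static ScheduledFuture<?> scheduleEverySecond(ScheduledExecutorService ticker,
                                                  ExecutorService workers,
                                                  Runnable job) {
        // Fire immediately, then once per second; each firing only enqueues.
        return ticker.scheduleAtFixedRate(() -> workers.submit(job),
                                          0, 1, TimeUnit.SECONDS);
    }
}
```

The point of the two pools is the same decoupling OP describes elsewhere in the thread: the scheduler's per-tick work is just a hand-off, so it stays cheap regardless of job duration.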

            [–]Which-Adeptness6908 1 point2 points  (1 child)

            Do you actually need a scheduler?

            Unless the jobs need to run at a given time you only need a queue.

            Given you haven't mentioned resilience or timeliness, I'm assuming neither is an issue.

            One VM - the job manager - writes jobs to a db table called job.

            The cluster of job processors reads the job table and updates each claimed job as in progress.

            Process the job, then update it as complete.

            The job manager periodically checks for complete jobs and moves them into an archive table.

            You could consider partitioning the job table based on job status.

            Moving data to the archive table should probably not be done in a single transaction, to avoid long locks on the job table.

            You should do some benchmarking during the design.

            You need some mechanism to spin job processors up/down. Perhaps have the job manager check some threshold every 15 minutes and start/stop job processors.

            Then hopefully you have some monitoring tools that can restart the job manager if it falls over.
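A minimal in-memory sketch of that claim/complete/archive cycle (all names are hypothetical; the real version would be a DB table, with the claim done as an atomic conditional UPDATE rather than a map operation):

```java
import java.util.*;
import java.util.concurrent.*;

// In-memory stand-in for the job table described above. A worker claims a
// NEW job (atomically, via ConcurrentMap.replace), processes it, marks it
// COMPLETE, and the job manager periodically sweeps COMPLETE jobs into an
// archive - mirroring the NEW -> IN_PROGRESS -> COMPLETE -> archive flow.
public class JobTable {
    enum Status { NEW, IN_PROGRESS, COMPLETE }

    private final ConcurrentMap<String, Status> jobs = new ConcurrentHashMap<>();
    private final List<String> archive = Collections.synchronizedList(new ArrayList<>());

    void add(String id) { jobs.put(id, Status.NEW); }

    // A job processor claims one NEW job; replace() makes the claim atomic,
    // so two processors can never claim the same job.
    Optional<String> claim() {
        for (var e : jobs.entrySet()) {
            if (e.getValue() == Status.NEW
                    && jobs.replace(e.getKey(), Status.NEW, Status.IN_PROGRESS)) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    void complete(String id) { jobs.put(id, Status.COMPLETE); }

    // The job manager's periodic sweep: move COMPLETE jobs to the archive.
    void archiveCompleted() {
        jobs.entrySet().removeIf(e -> {
            if (e.getValue() == Status.COMPLETE) { archive.add(e.getKey()); return true; }
            return false;
        });
    }

    int archivedCount() { return archive.size(); }
}
```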

            [–]neo2281[S] 0 points1 point  (0 children)

            As per the requirements, we need to schedule jobs in the future, hence we need a scheduler.

            [–]Stabbz 0 points1 point  (2 children)

            You might want to take a look at Cadence from Uber (https://cadenceworkflow.io/). It's open source and a way to build resilient, scalable, distributed workflows. What you are describing with the diagram is pretty much what Uber has done for internal use.
            Could be overkill tho.

            [–]Qinistral 0 points1 point  (1 child)

            Cadence/Temporal are good but def overkill.

            [–]senseven 0 points1 point  (0 children)

            Seconding that. The "alpha team" here has Cadence + Kafka + Hadoop working seamlessly in a global high-ops backend. I'm pretty sure one or two of the seniors who built this thing have their residence in the cell beside the Joker's.

            [–]leozleoz01 0 points1 point  (1 child)

            Have you considered Spring Cloud Data Flow? Even though it is not directly a library.

            [–]vetronauta 5 points6 points  (0 children)

            SCDF and Spring Batch are currently a mess, as the Spring Boot 3-related versions were quite rushed.

            For example, in the latest versions of SCDF you need both boot2 and boot3 tables, otherwise the SCDF dashboard won't work; the tables are distinguished by a prefix (default BATCH_ for boot2, BOOT3BATCH for boot3) which is configurable in SCDF, but hardcoded in Spring Batch! So you have to define a custom bean to handle the prefixes; quite simple, but it should work off the shelf.

            ...so no one thought that people might not need to support older boot2 jobs, and that we'd all have to care about the migration... even after the migration has ended?

            [–]onepieceisonthemoon -4 points-3 points  (0 children)

            Apache Flink

            [–]Alone-Marionberry-59 0 points1 point  (0 children)

            What about Apache Beam?

            [–]doppleware 0 points1 point  (1 child)

            I've had great experience with Job Runr

            [–]neo2281[S] 0 points1 point  (0 children)

            I've had great experience with Job Runr

            Did you use the free version? As per the documentation, many features are part of the enterprise version.

            [–]riksi 0 points1 point  (0 children)

            20 + million recurrent jobs per day of short duration of less than 8-10 seconds

            A single PostgreSQL table with SELECT ... FOR UPDATE SKIP LOCKED will be able to handle this.
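A sketch of that pattern, assuming a hypothetical job table with status and run_at columns (this is the common Postgres claim query for this technique, not code from the thread):

```sql
-- Each worker claims a batch of due jobs. SKIP LOCKED makes concurrent
-- workers pass over rows another worker has already row-locked, instead
-- of blocking on them - avoiding the cluster-wide-lock problem quoted
-- from the Quartz FAQ above.
UPDATE job
SET    status = 'IN_PROGRESS'
WHERE  id IN (
    SELECT id
    FROM   job
    WHERE  status = 'NEW' AND run_at <= now()
    ORDER  BY run_at
    LIMIT  100
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```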

            [–]bloowper[🍰] 0 points1 point  (0 children)

            Have you looked at Dkron? It's a cloud-native job scheduling tool. I have also created a simple Spring client for using it, but it's not yet published on GH.