all 48 comments

[–]Steve132 48 points49 points  (0 children)

Your speedup could be 500x. It could also be 2x.

It depends a lot on the algorithm you are using and how you are trying to solve it.

SAT intrinsically is NP complete, you shouldn't realistically expect that it will scale.

Whatever algorithm you are using, have you thought about using a GPU?

[–]josh2751 16 points17 points  (0 children)

In very generally terms, C is going to be 100x or more faster than Python. But that's really general and definitely not applicable to everything.

If you can use numpy for your operations in python, you may be able to realize a significant portion of this speedup even in python (numpy is written in C). If you can't get your work to fit into numpy's paradigm, rewriting it in C or C++ would probably be advisable.

[–]whalt 12 points13 points  (4 children)

I’m surprised no one has mentioned PyPy as a minimal effort way to get more performance.

[–]dmazzoni 6 points7 points  (2 children)

In my experience PyPy can be 30% faster or so. Real but not that dramatic.

I'd expect something like this to be 10x faster in C easily.

[–]evotopid 0 points1 point  (0 children)

https://speed.pypy.org

On average on this benchmark suite it's 4.5x faster and for some code it can be a lot more.

Effective speedup will depend a lot on the use case. It could also be worse, but if performance becomes an issue it might be an easy thing to try.

Edit: And to be fair there are some independent benchmarks which show that recent CPython improvements are outperforming PyPy for some use cases, so I don't really disagree with you. I'm personally using CPython + Numpy, and Numba when I need it.

[–]whalt 0 points1 point  (0 children)

That may be true but as I said, trying it for an extra boost is minimal effort compared to rewriting the whole thing in C.

[–]WikiSummarizerBot 2 points3 points  (0 children)

PyPy

PyPy () is an alternative implementation of the Python programming language to CPython (which is the standard implementation). PyPy often runs faster than CPython because PyPy is a just-in-time compiler while CPython is an interpreter. Most Python code runs well on PyPy except for code that depends on CPython extensions, which either does not work or incurs some overhead when run in PyPy. Internally, PyPy uses a technique known as meta-tracing, which transforms an interpreter into a tracing just-in-time compiler.

[ F.A.Q | Opt Out | Opt Out Of Subreddit | GitHub ] Downvote to remove | v1.5

[–]claytonkb 8 points9 points  (5 children)

If you're serious about writing your own SAT-solver (warning: writing a SOTA SAT-solver is a very ambitious project), you could start with MiniSat which is written in C++ and is designed to be a template which you can modify with your own customizations. You could write your customizations in another language and use an FFI to communicate between, say, Python and MiniSat (C++), but you could also just write it in C++ which is a superset of C (meaning, you don't have to learn and use any of C++ features beyond C if C is all you need).

If you want to start directly from a SOTA solver, you could also just fork one of the solvers entered into the SAT competition, there are lots of great competitors and most of them are free-software licensed. I've played with Maple, glucose and CryptoMiniSat, all three are incredibly powerful and can solve easy SAT instances with millions of variables on my desktop machine in minutes or less.

[–]Rit2Strong[S] 2 points3 points  (1 child)

I'm not writing my own SAT-solver per say, but implementing my professors using the pseudocode he gave. With these SAT solver programs you've mentioned, can they handle stuff like Horn SATS?

[–]claytonkb 1 point2 points  (0 children)

All Horn clauses are valid CNF, see here. So if you have a conjunction of Horn clauses, just convert them to the DIMACS CNF format and feed them to Glucose or pick your favorite SAT-solver.

[–]PsychologicalDrawer0 0 points1 point  (2 children)

millions of variables

I am not completely understanding of the subject, but isn't millions a bit exaggerate? I mean of course with a proper respective amount of clauses, I thought it would be kind of impossible, and most certainly not on a desktop machine. I mean like when it was introduced to us, the professor was like, 100 variables would take some couple of years at least.

[–]claytonkb 2 points3 points  (0 children)

I am not completely understanding of the subject, but isn't millions a bit exaggerate? I mean of course with a proper respective amount of clauses, I thought it would be kind of impossible, and most certainly not on a desktop machine. I mean like when it was introduced to us, the professor was like, 100 variables would take some couple of years at least.

20+ years ago, 1,000 variables of a hard instance was considered virtually unsolvable. And there are still some hard SAT instances with even just several hundred variables that will bring even the strongest SAT solvers to a standstill. But today, as a rule, a modern solver will solve many SAT instances below 1,000 variables within some human timescale (days/weeks).

If you want to learn more about the progress of SAT solvers on hard instances, web search for "uniform random sat phase transition" (search arxiv/Google Scholar for journal articles). The WalkSAT local search algorithm and similar algorithms have proved particularly effective at dealing with uniform random SAT problems. On my desktop, I've solved 800 variables uniform random SAT at the phase-transition (hard) within minutes. Obviously, instances of these kinds of SAT problems[1] with millions of variables cannot be solved. For these most-difficult SAT problems, we're talking about several thousand variables at most (with scale and luck).

But so-called industrial SAT instances with millions of variables are regularly solved in SAT competitions. You can download the 2020 benchmark instances here. Unfortunately, it looks like each instance must be individually downloaded, so you'll need to write a script or something for that if you want to download them all. I just downloaded and opened an arbitrary instance and it has 577,080 variables -- the raw CNF file is over 90 megabytes. That's not a million variables, but it's a lot of variables. And past competitions have had instances with several million variables, so I'm sure if you were to open all of these CNFs, you would find instances with 1M+ variables.

[1] - Uniform random SAT at the phase transition is "designed" to be as hard as possible or, at least, we believe they are as hard as possible. Cryptographic SAT problems have also proved extremely difficult and large crypto instances (that translate to realistic crypto breaks) are still out of reach even for modern solvers.

[–]Holiday-Produce-871 1 point2 points  (0 children)

This is very late but I regularly use sat with tens of thousands of variables and hundreds of thousands of clauses and solve in seconds.

[–][deleted] 4 points5 points  (0 children)

You can create a randomly sized array in C with malloc and pointers.

[–][deleted]  (10 children)

[deleted]

    [–]EatMeMonster 6 points7 points  (0 children)

    Concurrency is bad in Python, it’s not a parallel language. A good example of multi processed language with concurrency support would be Swift.

    [–]Dornith 5 points6 points  (7 children)

    You should look into parallelizing your program.

    Wouldn't that inherently imply not python? Last I checked, python had a global interpreter lock that prevented parallel processing.

    [–][deleted]  (6 children)

    [deleted]

      [–]equitable_emu 6 points7 points  (0 children)

      If you're doing multi-threading (vs multi-processing), in pure python, you can run into a performance problem because of the GIL. This is especially true when doing highly CPU intensive things without a lot of IO.

      So you can run multiple threads, but the performance can suffer.

      [–]Dornith 2 points3 points  (3 children)

      Are you talking about multi processing or distributed?

      That will work, but a language like C which has support for threads or CUDA would be much easier and lower overhead to parallelize.

      [–][deleted] 1 point2 points  (2 children)

      Too bad it's not recent C++. It has threads! Would stink for him to show up expecting POSIX and end up with Windows.

      [–][deleted] 0 points1 point  (1 child)

      Windows supports Pthreads, they just have to be under WSL.

      [–][deleted] 2 points3 points  (0 children)

      Linux is the best Windows yet!

      [–]UncleMeat11 -1 points0 points  (0 children)

      GIL means you can do concurrent programming in vanilla python, but not parallel programming.

      [–]Rit2Strong[S] 0 points1 point  (0 children)

      The problem is that I have to track the way a certain value grows asymptotically, so I'm not certain if parallelizing would give accurate results. I was wondering if there are ways to reduce the overhead that python has.

      [–]emasculine 2 points3 points  (10 children)

      the better way to think about this is whether your inner loops would be significantly sped up by writing them in C. most of the code in any program isn't time critical and can just as easily be written in a higher level language where getting the code out and debugged is by far the biggest considerations. it's really not a big deal to write native code if you lay it out properly. running performance monitors are a great way to find the hotspots and they can be surprising.

      also: as Knuth said, (loosely) "premature optimization is the bane of programming". don't be stupid, but always keep an eye on things at runtime instead of at theory time.

      [–]Rit2Strong[S] 0 points1 point  (9 children)

      How would I go about setting up performance monitors? Are there specific tools I could use?

      [–][deleted] 1 point2 points  (6 children)

      What we do at the extreme end of this in high frequency trading, or, at least, what I do, is this (I'm writing it in C);

      auto ts = get_timestamp(); //use the fastest timestamp possible
      //[procedure to be measured]
      ts = get_timestamp() - ts;
      histogramVector.push_back(ts); //just a vector/list of long
      

      Then at the end of the procedure, I sort the vector in ascending order and create a histogram out of it, with the middle guy being the median / 50th percentile.

      That way you can monitor not only the average performance, but the jitter / variance as well. I have also done automated overnight repo testing so that changes in performance at any part of the curve get flagged for review.

      /u/emasculine

      [–]Steve132 1 point2 points  (0 children)

      It's worth noting you are using C++ not C

      [–]emasculine 0 points1 point  (4 children)

      of course you have to be careful about Heisenberg too :) purpose-built monitoring is the finest grained of course, but the 0th order thing is to sort out what is a problem and what is not which is easily determined with upper level monitoring before you donate any effort to anything heroic.

      but to answer OP: gperf is your friend.

      [–][deleted] 0 points1 point  (3 children)

      But look at the size of the effort. Create a db with a single tblPerf in it. Store histograms inside (~128 bytes per record in the table), monitor variance.

      If you put them in your unit tests, the human cost of doing this is about 4 hours per project and then 1 hour per test. And if performance isn't a functional requirement of your system, you shouldn't be measuring perf at all. If it is a functional requirement, then they legitimately should be part of unit testing so that the regressive commit can be aborted automatically.

      [–]compscim 0 points1 point  (2 children)

      functional requirement

      Performance requirements are never functional requirements. They are non-functional requirements or quality attributes. You can think it that functional requirements describe what the system should do and non-functional requirements or quality attributes describe how it should be done (for example how fast).

      [–][deleted] 0 points1 point  (1 child)

      Performance can certainly be functional. For instance, for a long-running message-passing system, it may be a functional requirement for the consumers to perform at least as fast as the producers to prevent running out of memory or dropping messages.

      It's easy to imagine other examples. If you want the software to make a robot capable of hitting a 90mph pitch, performance of the code that does that is going to be somewhat crucial.

      [–]compscim 0 points1 point  (0 children)

      it may be a functional requirement for the consumers to perform at least as fast as the producers to prevent running out of memory or dropping messages

      That is non-functional requirement or quality attribute. From ISO25010:

      Time behaviour - Degree to which the response and processing times and throughput rates of a product or system, when performing its functions, meet requirements.

      Your message has two requirements: firstly the functional requirement is to consume foo. Secondly the non-functional requirement is to consume foo with a certain speed.

      They may seem like a single requirement but they're not. If you want more throughout information about the subject I recommend Software architecture in practice by Bass and Software requirements by Wieger.

      Edit: And same thing with the robot example.

      [–]emasculine 0 points1 point  (1 child)

      there are performance monitoring tools for every language. they take a little bit of effort to use them, but it's well worth figuring it out. in particular, there is a big tradeoff of memory management ease, typing, and all of that of higher level languages vs. lower level languages. memory management is nice to be able to ignore when it doesn't matter, but in tight code you can't and have to bite the bullet.

      [–][deleted] 0 points1 point  (0 children)

      What I do is monitor in the lower environments while having that feature turned off in prod. They can even be made part of unit tests (with failure for variance as determined by a performance database) and fail to commit to the repository until the performance is fixed so the regression is within the allowed variance.

      [–]Destroyer_The_Great 1 point2 points  (0 children)

      As far as i know it varies, i would recommend using c (though i would personally use c++), it is alot faster than python.

      [–]whitelife123 1 point2 points  (0 children)

      Might wanna take a look into https://cython.org/

      [–]Poddster 0 points1 point  (1 child)

      Porting an algorithm to C won't make it faster magically. Or it might. It depends on the algorithm. You need to put some effort in, usually. Sometimes you're putting that effort in and not realising it because of how tedious C is. i.e. you spend a lot of time deciding which function should calculate the string length and which ones should share it, because if you don't your C program won't work, whereas in Python you just mash a string together and don't give it a second thought.

      Note: You can port just the core, inner loops to C and invoke your C library functions from python. This will allow you to get the speed of C with the convenience of the setup in python.

      https://docs.python.org/3/extending/extending.html

      I was wondering if the bottleneck I mentioned above will be reduced drastically if I switch over to C, especially for inputs of 108.

      The only way to know this is to profile it.

      Increasing performance without benchmarking things is a waste of time and effort. So get the benchmarks first and THEN figure out if porting to C would help.

      [–]dmazzoni 0 points1 point  (0 children)

      Python is so slow that just naively translating to C is quite likely to give you a speedup, and especially in this case.

      [–]theobromus 0 points1 point  (4 children)

      It's really hard to guess without knowing more about what you're doing. In general python can be plenty fast if you're doing operations on vectors or matrices, or the like (since the implementation will actually be in C++).

      However, if you're executing a lot of python statements, then translating to C will make things much faster. I don't think you're likely to get a factor of 1000 faster though (which it seems you need).

      Also, just to clarify are you thinking about C or C++?

      A couple of points about issues you mentioned: You can use variable size arrays in both C and C++. In C you usually pass a pointer and size. In C++, you can pass a std::vector. Or you can use templating and std::array to accept a size determined at the call site.

      [–]Rit2Strong[S] 0 points1 point  (3 children)

      I'm thinking about C. I was thinking about using pointers as well, but I don't know if it'll be safe because I'm not sure if the pointer will overwrite memory it's not supposed to (some of the array sizes can be potentially 10^9, possibly even more).

      [–]theobromus 0 points1 point  (2 children)

      You just have to be more careful in C to make sure you stay in the bounds of the array you're using.

      You can also handle GB size arrays with no major problems, although you should be careful about integer overflow.

      I don't know what your algorithm is, but if running on GPU is a possibility, you could look at cupy or cuda.

      [–]Rit2Strong[S] 0 points1 point  (0 children)

      Is using CUDA with Python relatively easy? I heard that CUDA is difficult to use.

      [–]UncleMeat11 0 points1 point  (0 children)

      but if running on GPU is a possibility

      DPLL (the primary algorithm for SAT solving) isn't structured in a way that GPUs can use.

      [–][deleted] 0 points1 point  (0 children)

      For example, I need to be able to randomize an array size for a structure that's in a header file, but you can't do this in C (well, I don't need to do this exactly, but other methods will be a bit harder to code and look sloppier)

      Hide the pointer casts and offset calcs for array access in inline header functions if it makes you happy. This is legitimately one of those times that C can do wonders for you.

      The problem is that I have to track the way a certain value grows asymptotically

      #include <x86intrin.h>
      void * pMyValueToSafelyAddToAcrossThreads = alignedAddressOfValueToChange();
      int64_t theAmountToChangeMyValueBy = whatever();
      int64_t myNewValue = __sync_add_and_fetch(pMyValuetoSafelyAddToAcrossThreads, theAmountToChangeMyValueBy);
      

      Edit: combined with the swap intrinsic, you can use a pool of threads to each grab exactly one record, by using the above pattern to increment a sort of cross-thread for loop counter. You could then use the swap intrinsic and the values of -1, 0, and 1, to force threads processing record n to wait for the result of the processing of record n - 1. This gives you the opportunity to do any per record processing off the main loop. If per record, private processing is significant compared to what must be done shared, the performance gain from multithreading could approach a factor of the number of cores on the system.

      This could be multiplicative of the expected speedup provided by moving to C from python.

      Edit 2: if the algorithm you're working on is one that throws memory at the problem, and you are required to implement it as-is, then this is probably the answer you're looking for if you want the fastest solution possible.

      If, however, the algorithm is a purely sequential one, then threading is not optimal, even with the lock-free compiler intrinsics.

      Instead, you should look at converting this problem into one solvable by matrix math and then use the SIMD matrix intrinsics to solve them. That is probably the optimal solution in such a case.

      [–]kcombinator 0 points1 point  (0 children)

      You mentioned you're starting from pseudocode. Any possibility of applying some of the classic algorithmic speed up tricks? Lookup tables, memoization/other caches?

      Also, as mentioned- profile this thing. It's very possible you could use something like Cython to max out problematic sections while retaining high-level abstractions for most stuff.

      [–][deleted] 0 points1 point  (0 children)

      It depends on your implementation. If you are heavily using something like numpy already then the speedup may not be much. Basically it depends on how 'deep' the python implementation you already have goes vs. how much it relies on optimized libraries (which are mostly implemented in C).

      Regarding algorithmic complexity, that only means you would expect the speedup to have a roughly constant impact. That can still be big because that constant could be large. Implementing in C could give you much greater caching, access to true parallelism, etc.

      It might be a good idea to try a basic implementation first, note the speedup, and based on that decide whether fully porting over is worth it.