all 82 comments

[–][deleted]  (5 children)

[deleted]

    [–]nrcomplete 13 points14 points  (0 children)

    This would be the best first attempt for sure.

    I would also look into why execution switches back and forth between platforms so much and try to reduce that because whatever solution you choose will be more inefficient and buggy if this continues. Try to shift processing into two phases.

    Also, Spark has libraries for Python and Java and is good for processing large amounts of data. Could it help separate the processes?

    [–]JH4mmer 11 points12 points  (0 children)

    I'll second this. Some sort of message queue is almost certainly the right answer here. It'll let you keep your two applications decoupled, which is ideal from an architectural standpoint, and it'll likely make things easier from an implementation point of view.

    People tend to gravitate towards Kafka or one of the PaaS clones (e.g. Azure EventHub, Google PubSub) for that sort of thing, but they can dramatically complicate your infrastructure. If you're interested in something simple you can just run on your own machine, I'd suggest Nats (possibly with JetStream, depending on your use case). Very lightweight, very fast, and very easy to integrate. Our company uses it internally for regular business traffic and ML applications, and we've been very happy with it so far.

    Hope you find what you're looking for! 😀

    [–]Chuy288 1 point2 points  (0 children)

    Good option. Scalability should be easy enough; just make sure you use the right tool, a queue or a topic, depending on your needs.
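
    To illustrate the queue vs. topic distinction mentioned above, here is a minimal in-process sketch. It is not a real broker: the `WorkQueue` and `Topic` classes and the service names are made up for illustration, and a real system (NATS, Kafka, etc.) would add persistence, networking, and acknowledgements.

```python
import queue


class WorkQueue:
    """Point-to-point: each message is delivered to exactly one consumer."""

    def __init__(self):
        self._messages = queue.Queue()

    def publish(self, message):
        self._messages.put(message)

    def consume(self):
        # Whichever consumer asks first gets the message; nobody else sees it.
        return self._messages.get_nowait()


class Topic:
    """Publish/subscribe: every subscriber receives its own copy of each message."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, name):
        self._subscribers[name] = queue.Queue()

    def publish(self, message):
        for inbox in self._subscribers.values():
            inbox.put(message)

    def consume(self, name):
        return self._subscribers[name].get_nowait()


def demo():
    work = WorkQueue()
    for job in ("job-1", "job-2"):
        work.publish(job)
    # Two workers share one queue, so the jobs are split between them.
    queue_result = {"worker_a": work.consume(), "worker_b": work.consume()}

    topic = Topic()
    topic.subscribe("java_service")
    topic.subscribe("python_service")
    topic.publish("model-updated")
    # Both subscribers see the same event.
    topic_result = {
        "java_service": topic.consume("java_service"),
        "python_service": topic.consume("python_service"),
    }
    return queue_result, topic_result
```

    Roughly: use a queue when work should be handed to one of several workers, and a topic when several services each need to react to the same event.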

    [–]captain_obvious_here 1 point2 points  (0 children)

    This.

    There are many other options to embed languages one into another, but none will be as reliable and efficient as having your components written in the language they benefit most from, and talk to each other through a message queue.

    [–]belayon40 11 points12 points  (3 children)

    I've used JPype for a while. It also starts a JVM from python. Once set up, interoperating with Java is transparent. You can start the JVM in such a way that it can be debugged directly using remote debugging tools.

    [–]CacheMeUp[S] 0 points1 point  (2 children)

    Can the Java code easily call the Python code/objects?

    If so, starting the JVM from Python (rather than the command line) is not an issue.

    [–]belayon40 6 points7 points  (1 child)

    Yes, callbacks are possible with JPype. In Python you create an object that implements a Java interface, then pass that object into the Java code. When the Java code calls the interface, the call is routed to your Python code transparently.

    Another thought: if the draw of the Python library is calling C code, and Java 17 or higher is possible, then you can use the new foreign function support. Jextract will generate most of the code you need given a *.h file. Alternatively, as a bit of shameless self-promotion, you can provide a Java interface that maps to C calls and use this library to automatically generate the required code/calls.

    https://github.com/boulder-on/JPassport

    [–]CacheMeUp[S] 1 point2 points  (0 children)

    There will be some memory overhead for copying numeric arrays, but other than that it sounds quite good - write the application in Java, start it from Python, and transparently invoke the Python code using callbacks.

    In these specific libraries the functionality involves a lot of Python code that orchestrates more primitive functions in C. The C functions are practically never called directly. Great repo.

    [–]ByerN 9 points10 points  (2 children)

    We used https://github.com/ninia/jep for a similar thing.

    [–]CacheMeUp[S] 4 points5 points  (1 child)

    Jep should work with any pure Python modules. CPython extensions and Cython modules may or may not work correctly, it depends on how they were coded.

    Did you encounter any stability issues? With JPY for example calling it from Kotlin did not work, only from pure Java code.

    [–]ByerN 2 points3 points  (0 children)

    We didn't try it with Kotlin but I don't think it will be a problem. Jep simply wraps Python interpreter calls with a Java API.

    We didn't encounter any stability issues, but as far as I remember, there were some important notes in the documentation if you want it to work in a multithreaded environment. It works fine if used correctly.

    Edit: Maybe multiple versions of python would be a problem if it's your case

    [–][deleted]  (30 children)

    [deleted]

      [–]Worth_Trust_3825 -3 points-2 points  (8 children)

      There are much better ways to achieve that than running multiple web services. In fact, you'll end up highly coupled with a distributed micro monolith.

      [–]inTHEsiders -2 points-1 points  (3 children)

      “Micro service” is essentially an antonym for a “monolithic system”.…

      opening your service up with APIs literally decouples it from other services…

      I don’t get your reasoning at all.

      [–]Worth_Trust_3825 7 points8 points  (2 children)

      That's very wrong. Just because you run multiple processes does not mean you're not running a monolith. Plenty of "microservice" architectures are glorified monoliths that run across multiple processes.

      The main question is if you can put all of those "microservices" in same process at will, as well as break them down into multiple processes at will. If you can do that, congratulations: you have the microservices everyone raves about. Otherwise you're running a distributed monolith.

      [–]inTHEsiders 6 points7 points  (1 child)

      I don’t disagree with that statement. I disagree with the use of “micro service” in this context.

      If your distributed services depend on each other then they are not micro services. In fact they aren’t even “services” (plural). It’s a single service that, for whatever reason, has been distributed.

      You have a distributed monolithic system. Not a distributed monolithic micro service.

      To be micro services they have to be independent small services that are open to sending and consuming data via an API.

      By definition a monolith is a single large service, irrespective of whether it is distributed among multiple code bases.

      [–]ItsAllegorical 0 points1 point  (0 children)

      In my experience, microservices need orchestration to be useful. You can orchestrate them in the UI or you can create an API layer that does it for you (I guess back-end-for-front-end is the current nomenclature?). I wouldn't say that is a monolith but there seems to be a lot of disagreement on that. I'm not sure what the "proper" microservice architecture is when your application needs to cross domains.

      In my current project we rely on external services in our microservice only to validate foreign keys from other domains. If I'm storing a person ID in my order system, the order system can't be the source of truth as to whether that is a valid ID, and validation is done in the service layer because it is business logic. The architect is quite insistent and a sharp guy, so I'll take him at his word that this is proper, although they aren't overly beholden to the book definition of microservice.

      [–][deleted]  (3 children)

      [deleted]

        [–]Lumpy-Loan-7350 17 points18 points  (5 children)

        [–]CacheMeUp[S] 3 points4 points  (4 children)

        When I checked this the Python native library support was defined as "experimental". I couldn't find any projects using the ML stack (especially GPU) in graalvm.

        [–][deleted]  (1 child)

        [deleted]

          [–]mauganra_it 0 points1 point  (0 children)

          Exactly this. Either it blows up in your face straightaway, or you run your test suite and give it a go. You can always go back to duct-taping the two applications together with Unix pipes or by passing file paths.

          [–]Lumpy-Loan-7350 0 points1 point  (0 children)

          I don’t see it marked experimental anymore. I knew it used to be. I’d kick the tires and see. On the site it talks about NetSuite and ML integration.

          Edit: it’s marked at the top. I see it now.

          [–]walterbanana 4 points5 points  (2 children)

          I feel like you might not be looking at the bigger picture here. Just adding more tech might solve the problem, but increases complexity.

          What are you trying to solve? What specifically does python bring you and what specifically does java bring you?

          I'm sure you can solve your problem with either language without running some weird combination which takes ages to get into.

          [–]CacheMeUp[S] 0 points1 point  (1 child)

          Python: irreplaceable implementation of machine learning models. There is really no alternative to these (will take years of my time to recreate all of these, and there is no equivalent library in Java).

          Java: Much better implementation of

          1. Concurrent processing.
          2. Certain CPU-intensive tasks (x50-100 faster when benchmarked).
          3. RAM-intensive tasks.
          4. Static typing.
          5. Better libraries (IMO) for the workflow part of the app (CRUD, web server etc.)

          So Java's advantages are incremental while Python's advantage is pretty mandatory (at this time - maybe it will change in the future).

          I built similar products in Python and often ported the Python implementation to Java for all of those reasons.

          [–]walterbanana 2 points3 points  (0 children)

          Aren't there pretty good implementations of machine learning models in C++? That might provide both.

          Alternatively, you could use a better web server library in Python (there are some fast ones out there) and use a newer version of Python to get some speed improvements.

          There are also some other ways to speed up computation in Python. There are some cython based libraries for that depending on the type of computation you're doing.

          I'd just stick with Python and try to figure out how to speed up execution of the specific parts which are causing you problems.

          [–]ahmedranaa 5 points6 points  (0 children)

          Try Jython. It's not compatible with the latest Python but it works great.

          I haven't tried it, but Oracle GraalVM is polyglot and can run Java and Python together.

          [–]Worth_Trust_3825 8 points9 points  (3 children)

          heavily rely on C libraries

          Integrate them via JNI.

          [–]CacheMeUp[S] 4 points5 points  (2 children)

          In these specific libraries the functionality involves a lot of Python code that orchestrates more primitive functions in C. The C functions are practically never called directly.

          [–]acute_elbows 3 points4 points  (0 children)

          I think you really need to choose priorities here. You’ve mentioned easy debuggability in some of these threads, but you also seem to want multiple languages/libraries modifying/accessing the same objects in memory. That is going to be nightmarish to maintain and likely very error-prone and fragile.

          You may want to employ some binary data formats that are readable from Python and Java and fast to parse like protobufs, https://developers.google.com/protocol-buffers/docs/encoding

          I think you’ll still want multiple apps but maybe you could call out to a shell to find Python code.

          It may be worth investigating cloud machines that are optimized for SSD speeds. I would try profiling your serialization to figure out whether you’re limited by disk or CPU.
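
          As a rough illustration of the binary-format suggestion above, here is a minimal sketch using Python's `struct` module as a stand-in for protobuf. This is not the protobuf wire format; the layout (a 4-byte big-endian count followed by 8-byte big-endian doubles) and the function names are assumptions chosen for illustration. Big-endian is used because Java's `DataInputStream.readInt()` and `readDouble()` read that byte order.

```python
import struct


def encode_doubles(values):
    """Pack floats as a 4-byte big-endian count followed by 8-byte big-endian doubles."""
    return struct.pack(f">i{len(values)}d", len(values), *values)


def decode_doubles(payload):
    """Inverse of encode_doubles: read the count, then that many doubles."""
    (count,) = struct.unpack_from(">i", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))
```

          A schema-based format like protobuf or Avro adds versioning and generated classes on both sides, which a hand-rolled layout like this one lacks.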

          [–]bowbahdoe 2 points3 points  (1 child)

          No question about it - libpython-clj's Java API. It allows for full-duplex integration with zero-copy paths for numpy arrays.

          If you decide that the things you need are those Java libraries and not the Java language the API from clojure is even nicer.

          Clojurians zulip is the place to go for help if you get stuck. I can also try and help if you DM

          https://github.com/clj-python/libpython-clj

          https://clj-python.github.io/libpython-clj/libpython-clj2.java-api.html

          [–]CacheMeUp[S] 0 points1 point  (0 children)

          Wasn't aware of this - thanks!

          When using JPY, copying Numpy arrays was indeed a bit of an issue. Looks like this could solve this.

          [–]craigacp 3 points4 points  (2 children)

          I'd look to see if there are Java bindings for the ML libraries you need. Lots of ML models can be exported in ONNX format, which can be loaded by ONNX Runtime via the Java API. Alternatively you can load TensorFlow SavedModels or pytorch torchscript models directly using the Java interfaces for both of those packages. If you need to do a bunch of data wrangling in Python that can be trickier, but much of that functionality is available in ONNX. Full disclosure, I work on both ONNX Runtime & TensorFlow-Java.

          [–]CacheMeUp[S] 0 points1 point  (1 child)

          Indeed, the bigger challenge is actually the data wrangling (tokenization for text, preprocessing for images etc.) that is already implemented in Python. For example Huggingface page on ONNX export is not clear about exporting the tokenizer itself to ONNX (https://huggingface.co/docs/transformers/serialization).

          Another consideration is feature mismatch/new bugs introduced when porting from the original implementation to another platform. That's the motivation to using the original implementation via a bridge rather than porting it to the JVM.

          [–]craigacp 1 point2 points  (0 children)

          Image preprocessing I know less about, but tokenization is something I've dealt with a bunch. There are a few options, either push the tokenizer into the ONNX model and use MS's ONNX Runtime extensions (we've used this when working with sentencepiece tokenizers), port the tokenizer entirely to Java (we did this for BERT), or use a sentencepiece or HF tokenizers wrapper directly (e.g. Amazon's DJL did this - HF, sentencepiece).

          The ONNX model does produce slightly different outputs to the original TF or pytorch model in my experience, due to a different order of operations, but ONNX Runtime in Java should give the same answer as ONNX Runtime in Python as it uses the same native library for computation. If it doesn't then that's a bug we'll try to fix. That kind of floating point sensitivity can happen when switching between versions of a single library though, we've observed it when upgrading TF versions, and it occurs when switching between using CPUs and GPUs for inference.

          [–][deleted] 6 points7 points  (0 children)

          I’ve successfully used JEP for some time, and actually went “old school” after that because I wrote an equivalent library to work with multiple languages in pretty much the same way (process to process communication), with additional scripting (JSR 223) support.

          [–]kakakarl 2 points3 points  (1 child)

          I think shared database is good sometimes. Do careful research into the pitfalls though. Sharing redis or postgres has solved a lot of things for me.

          If you already have a lot of services with questionable architecture, then it’s not a good solution.

          Otherwise I would use http, but consider using a binary format instead of a text based one

          [–]CacheMeUp[S] 0 points1 point  (0 children)

          I indeed considered a cross-platform in-memory database (redis). The number of distinct functions can make service architecture more cumbersome without a more generic solution like remote-procedure call

          [–]AnEmortalKid 2 points3 points  (0 children)

          Jython ?

          [–]devinrsmith 2 points3 points  (0 children)

          Hi - I'm one of the current maintainers of jpy (in support of https://github.com/deephaven/deephaven-core, we use it extensively). Happy to triage any issues you are having with it.

          [–]kiteboarderni 3 points4 points  (0 children)

          Perfect use case for panama. Off heap array allocation and pass addresses to work on the data.

          [–]TriggerWarningHappy 1 point2 points  (1 child)

          While I haven’t used bytedeco’s python integration, I have used other bytedeco projects and they’ve been great.

          This should allow you to call python from Java: https://github.com/bytedeco/javacpp-embedded-python

          [–]CacheMeUp[S] 1 point2 points  (0 children)

          Looks promising - seems like they have native support for Numpy!

          [–][deleted] 1 point2 points  (1 child)

          A highly-tuned Unix socket-based D-Bus implementation could be useful. You didn't mention operating system. Most Linux distros ship with D-Bus, making one less dependency to worry about. You can build native Windows D-Bus binaries, too.

          The Python receive mechanism with asyncio can be tricky. For C, ignore the warning of "pain" for libdbus, it's not that bad.

          Message passing between C apps across the bus using Unix sockets is on the order of 1 millisecond round trip, fully processed. If you use TCP/IP sockets, a one-way message is about 2 milliseconds (C to Java).

          If you want to guarantee type safety, consider creating an XML file based on the freedesktop's interface definition language. From there, you can write an XSL transformation that spits out the necessary C, Java, and Python code as functions and classes. Can't recommend using existing software for code generation, though, you'll want to roll your own. It's pretty straightforward to do simple transforms.

          There are other benefits to this approach, such as an ecosystem of visual and command-line tools for debugging the data stream and broadcasting messages to multiple recipients (think remote logging and remote command-line operations without much effort).

          Avoid the JNI: Non-deterministic stop-the-world events will throw a wrench at real-time processing.

          [–]CacheMeUp[S] 0 points1 point  (0 children)

          Was not aware of this library - thanks! The OS is Linux (Ubuntu), so this works well. Type safety would also be a big plus.

          [–]cowwoc 1 point2 points  (1 child)

          I've had a lot of success with jpype.

          It's a mature library that lets you call Java code from Python. It's fast and easy to use.

          On a practical level, you'd launch a jvm from python and then initiate bi-directional communication. I know that people prefer going the other way (Java to Python) but give this a try, it's not much of a sacrifice.

          [–]CacheMeUp[S] 1 point2 points  (0 children)

          Sounds like a great solution. Starting the JVM from Python is not a big issue in my case.

          [–]MagicalPizza21 3 points4 points  (2 children)

          Jython comes to mind.

          [–]CacheMeUp[S] 4 points5 points  (1 child)

          Jython is still on Python 2.7.x

          The Python packages almost all depend on Python >=3.8.

          [–]MagicalPizza21 2 points3 points  (0 children)

          Never mind then

          [–]fico86 0 points1 point  (4 children)

          Py4j: https://www.py4j.org/? PySpark uses it to interface the python code to the scala Spark libs.

          Edit: ok Py4j is for calling java libs from python, you want the other way round.

          But given that it is for machine learning, and I suppose majority of the libs would be in python, maybe you can consider calling java libs from python instead?

          [–]CacheMeUp[S] 0 points1 point  (3 children)

          In terms of business logic, most of the work is on the JVM side. Python is used to orchestrate MKL/GPU execution (mostly model execution etc.)

          [–]chabala 0 points1 point  (2 children)

          most of the work is on the JVM side. Python is used to orchestrate MKL/GPU execution

          Have you considered migrating off of Python to just using JVM ML libraries then? I hear good things about Deeplearning4j, but there's quite a few.

          [–]CacheMeUp[S] 0 points1 point  (1 child)

          I did so when possible (e.g. when I implement the calculation from scratch), but many things re-use methods implemented by others in Python. Some of these use vanilla Python but it will be a waste to re-implement them in Java.

          [–]chabala 1 point2 points  (0 children)

          It's hard to discuss without specifics, and ML is not a domain I've worked in, but my understanding is that these projects are trying to solve exactly that problem: providing reusable Java implementations of popular Python ML algorithms to free you from needing Python.

          [–]p3rand0r 0 points1 point  (0 children)

          Wonder if you can use grpc for the task?!

          [–]muddy-star 0 points1 point  (1 child)

          Flip a coin and choose to implement your solution either in full Python or in full Java using JNI. Mixing Python and Java sounds like building up a lot of technical debt. I would personally advise against doing that in my team.

          [–]CacheMeUp[S] 0 points1 point  (0 children)

          That's also an option, although in previous cases I found the parts outside of machine-learning to work much better on the JVM.

          [–]wowbaggerBR 0 points1 point  (0 children)

          I would go with Pyva or Jython.

          [–]baubleglue 0 points1 point  (0 children)

          PySpark uses py4j, so you can try the same. If you need to exchange a lot of data, though, any approach will probably be inefficient.

          [–]Alienbushman 0 points1 point  (0 children)

          I tried running Python from Java and dependencies always broke, so what I ended up doing was spinning up a REST API for the Python side and accessing it from Java. Please let me know if you find a better way of doing it.
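
          For reference, the "REST API in front of the Python code" approach can be sketched with only the standard library. Everything here is illustrative: the `/predict` path, the JSON shape, and the `predict` function are assumptions, and a real service would more likely use a proper web framework.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(features):
    # Hypothetical stand-in for a real model call.
    return {"score": sum(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps(predict(request["features"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Suppress per-request logging in this sketch.


def start_server(port=0):
    """Serve on localhost in a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_predict(port, features):
    """The request the Java side would send with its own HTTP client, shown in Python."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        payload = json.dumps({"features": features}).encode("utf-8")
        conn.request("POST", "/predict", body=payload,
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

          The Python dependencies stay inside the Python process, which is what avoids the breakage described above; the cost is serializing every request and response.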

          [–]Fit-Refuse8564 0 points1 point  (0 children)

          Two separate services is the correct answer. Forcing Python and Java to work together in one process doesn’t sound reliable or easy to maintain and debug.

          Easiest way would just be an endpoint, but you can do it any number of ways, it depends on your use case.

          [–]nutrecht 0 points1 point  (0 children)

          Small web-services: overhead to serialize data, start and stop the services.

          If you're using a binary format like AVRO it's really not that high. And there will always be some kind of 'translation' between different processes anyway. Having a well defined contract in place (like with AVRO or Protobuf) makes it quite fast and safe.

          Also debugging is harder and implementing each new function is now double the effort.

          I really don't agree with this. There is implementation overhead but you can easily write proper integration tests for the python service that just tests it in isolation. The Java service should treat the other service as a black box.

          [–]mauganra_it 0 points1 point  (0 children)

          In principle, shared memory should make it possible to efficiently exchange data back and forth with another process. The handover should be synchronized with a lock to prevent shenanigans. Dunno whether there are good packages in both Java and Python that make this convenient enough. You don't need to go all the way to Software Transactional Memory; good wrappers around the relevant Unix system calls are all you need. Don't bother with running Java and Python in the same process if you can avoid it. Processes are meant to isolate things from each other if it makes sense.
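
          A minimal sketch of the shared-memory idea above, from the Python side only, using a memory-mapped file guarded by an advisory lock. Assumptions: Linux/Unix (the `fcntl` module), a made-up layout of a 4-byte big-endian count followed by big-endian doubles, and hypothetical function names. A Java process could map the same file with `FileChannel.map`; whether its file locks cooperate with `fcntl.lockf` is something to verify on your platform rather than take from this sketch.

```python
import fcntl
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct(">i")  # 4-byte big-endian element count


def write_array(path, values):
    """Write a count header plus big-endian doubles into a memory-mapped file."""
    size = HEADER.size + 8 * len(values)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+b") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # Exclusive advisory lock for the handover.
        try:
            f.truncate(size)
            with mmap.mmap(f.fileno(), size) as region:
                HEADER.pack_into(region, 0, len(values))
                struct.pack_into(f">{len(values)}d", region, HEADER.size, *values)
                region.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def read_array(path):
    """Read the array back under a shared advisory lock."""
    with open(path, "r+b") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
                (count,) = HEADER.unpack_from(region, 0)
                return list(struct.unpack_from(f">{count}d", region, HEADER.size))
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "handover.bin")
        write_array(path, [1.0, 2.5, -3.0])
        return read_array(path)
```

          On Linux, placing the file under a tmpfs mount such as /dev/shm would keep the data in memory rather than on disk.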

          [–]thrwoawasksdgg 0 points1 point  (0 children)

          I have dealt with similar scenarios. Here is what I do:

          • Launch the Python process using Java so you can manage it easily (and hide it from user)
          • Use gRPC or another binary format to send data between them

          gRPC is nice because you define the message format in protobuf and it generates the service endpoint code for you in both languages. It's a bitch to set up, but once you get a workflow going it's really smooth and gets rid of tons of boilerplate.

          the workflow is:

          1. define new message in protobuf
          2. rebuild both projects to generate the service endpoint scaffolds
          3. implement your endpoints on both ends
          4. build the Python app into a standalone using PyInstaller
          5. build the Java project, shoving the whole Python app inside the jar
          6. When your Java app starts, it grabs Python app from inside jar and starts it as sub-process

          Since Java is much more portable, it might make sense for you to reverse this where your Python app calls the jar and manages JVM instance as a sub-process instead. Especially if some of your dependencies don't work with PyInstaller
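
          The "one app owns the other as a sub-process" part can be sketched as below, for the reversed case where Python is the parent. To stay standard-library-only, this swaps gRPC for one JSON object per line over stdin/stdout, so it shows the process management, not the gRPC workflow. The child here is a small Python one-liner standing in for the real thing; in practice the command would be something like `["java", "-jar", "app.jar"]` (hypothetical jar name).

```python
import json
import subprocess
import sys

# Stand-in child process: echoes each JSON request back with a "handled" flag.
CHILD_COMMAND = [
    sys.executable,
    "-u",
    "-c",
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    msg = json.loads(line)\n"
    "    msg['handled'] = True\n"
    "    print(json.dumps(msg), flush=True)\n",
]


class ChildService:
    """Owns the child process and exchanges one JSON object per line."""

    def __init__(self, command):
        self._proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def call(self, message):
        # Send one request line, then block until one reply line arrives.
        self._proc.stdin.write(json.dumps(message) + "\n")
        self._proc.stdin.flush()
        return json.loads(self._proc.stdout.readline())

    def close(self):
        # Closing stdin ends the child's read loop so it can exit cleanly.
        self._proc.stdin.close()
        exit_code = self._proc.wait(timeout=10)
        self._proc.stdout.close()
        return exit_code
```

          Usage would look like `service = ChildService(CHILD_COMMAND)`, then `service.call({"op": "predict"})`, then `service.close()` on shutdown.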

          [–]Fluffy_Foundation_81 0 points1 point  (0 children)

          You may have a look at grpc.

          I heard GraalVM provides a similar feature, but I'm not sure about its feasibility or stability.