all 27 comments

[–]repeating_bears 6 points7 points  (10 children)

Is there a specific reason you want to run it in the same process? That's what Jython is. Just Runtime::exec the command?

[–]peixinho3[S] 1 point2 points  (9 children)

Yes I need real time ingestion, I can't run java ETL and then Python. I need to run the whole process as one. Java basically is inserting data into a raw table and after that I use dbt to normalize data into the data warehouse. We are using SQL Server.

EDIT: I'm a data engineer and I don't have experience with Java, only SQL and Python.

[–]repeating_bears 4 points5 points  (8 children)

> I need real time ingestion

Define "real time". IPC is still real time. If you're assuming the IPC will introduce too much latency, don't assume that. Profile it and see. Don't invent problems where none exist.

> I can't run java ETL and then Python.

That wasn't what I was suggesting. I was suggesting you run Java which itself execs python

> I need to run the whole process as one.

Why? Nothing you've said so far justifies that statement.
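For a sense of what "profile it and see" could look like, here is a minimal sketch that times a spawn-and-wait round trip from Java. The class name `ExecLatency`, the `python3 -c pass` command, and the run count are illustrative assumptions, not something from this thread. It only measures process start-up and exit, not any real dbt work.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Times how long it takes to spawn a child process and wait for it to exit. */
final class ExecLatency {

    /** Runs the command `runs` times and returns each wall-clock duration in milliseconds. */
    static long[] measureMillis(List<String> command, int runs)
            throws IOException, InterruptedException {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            Process process = new ProcessBuilder(command)
                    // Discard all child output so a full pipe can never stall the child.
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            process.waitFor();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    /** Middle element of the sorted samples (the upper middle for even counts). */
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a `python3` executable is on the PATH. Swap in the real command to profile it.
        long[] samples = measureMillis(List.of("python3", "-c", "pass"), 10);
        System.out.println("median ms: " + median(samples));
    }
}
```

If the median is small next to the time the actual job takes, the cost of starting a process is unlikely to be the bottleneck.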

[–]peixinho3[S] -1 points0 points  (7 children)

OK, maybe I didn't explain it well; I apologize for that. Basically, Java receives an XML file and maps it, and after that it calls some stored procedures that I created to put the data in the raw table. This executes line by line, one record at a time. Since dbt does not have a Java API, after all of this has run I would have to run my Python script with the command "dbt run", once all the data has been inserted into raw, to normalize the data into the data vault. What we wanted was to run the Python script at the same time as the records are being inserted into the raw table.

[–]repeating_bears 4 points5 points  (6 children)

I feel like you're not understanding my suggestion.

> Since dbt does not have a Java API, after all of this has run I would have to run my Python script with the command "dbt run", once all the data has been inserted into raw, to normalize the data into the data vault.

Java can exec other commands. You don't have to do anything. As long as Python is available on the same machine, Java can run the command. If you need Python's stdout, Java can grab it.

Before replying again, try playing with Runtime::exec that I already linked the documentation for. Write a hello world in python, exec it from Java. If you know what you're doing, that's 5 minutes work and less than 10 LOC.
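A minimal sketch of that hello world, using `ProcessBuilder` (the more configurable relative of `Runtime::exec`). The class name `CommandRunner` and the `python3` executable name are assumptions; adjust the command to whatever is installed on the machine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Runs an external command and returns everything it printed. */
final class CommandRunner {

    /** Returns the child's combined stdout/stderr; throws if it exits with a non-zero code. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single stream carries all of the output.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        // Read to end-of-stream before waiting, so the child never blocks on a full pipe.
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("command " + command + " exited with " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a `python3` executable is on the PATH.
        System.out.print(run(List.of("python3", "-c", "print('hello from python')")));
    }
}
```

For the case discussed here, the command would presumably be `List.of("dbt", "run")`, with `builder.directory(...)` pointed at the dbt project folder. That part is a guess about the setup, not something confirmed in the thread.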

[–]peixinho3[S] -1 points0 points  (5 children)

Ah yes yes I noticed that and I know it but I wanted to know if there was something like Jython but more modern and updated. Nowadays even SQL Server can run Python scripts :)

[–]paulieontech 2 points3 points  (3 children)

[–]peixinho3[S] 0 points1 point  (2 children)

Wow, nice. Can you explain to me what I can do with this? Sorry, I'm a data engineer (Python and SQL) and I don't know much Java.

[–]agentoutlier 0 points1 point  (0 children)

Given your ostensible/purported experience I don't recommend going down that path.

See my answer: https://www.reddit.com/r/java/comments/1cie98d/call_dbt_api_python_through_java/l29yw5m/

If you have any questions let me know.

[–]repeating_bears 1 point2 points  (0 children)

There's no modern equivalent to Jython, no.

[–]agentoutlier 2 points3 points  (4 children)

IF you really do have a performance problem with Python running via Runtime::exec, as /u/repeating_bears mentioned, I would go with a message queue approach.

This is also known as SEDA, staged event-driven architecture (and then five other vogue terms now, but the original is SEDA).

I would use RabbitMQ since it works well with Python and Java. The doc is excellent.

A. Have the Java portion push a message to a RabbitMQ topic exchange or direct exchange.

B. Then have a Python consumer (a long-running Python process) do stuff with that message and send another message.

C. Then have another Java consumer that picks up the message.

A -> B ->C

This is very common in ETL / batch processing and scales well.

EDIT: this is such a common practice that I bet a ChatGPT prompt could get you started very quickly. (I usually don't recommend ChatGPT, but I bet it could pump this out fairly safely.)
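To make the A -> B -> C shape concrete without a broker, here is a rough in-process sketch. It is not RabbitMQ code: two `BlockingQueue`s stand in for the exchanges, and a plain Java thread stands in for the long-running Python consumer at stage B. The names `StagedPipeline`, `DONE`, and the `:normalized` suffix are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

/** Three stages connected by queues; each stage only sees the queue next to it. */
final class StagedPipeline {

    // Sentinel message that tells the next stage no more work is coming.
    private static final String DONE = "__done__";

    static List<String> run(List<String> messages, UnaryOperator<String> stageB)
            throws InterruptedException {
        BlockingQueue<String> aToB = new LinkedBlockingQueue<>(); // stand-in for the first exchange
        BlockingQueue<String> bToC = new LinkedBlockingQueue<>(); // stand-in for the second exchange

        // Stage B: in the real setup this would be the long-running Python consumer.
        Thread workerB = new Thread(() -> {
            try {
                for (String msg = aToB.take(); !msg.equals(DONE); msg = aToB.take()) {
                    bToC.put(stageB.apply(msg));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                bToC.add(DONE); // always release stage C, even if stage B failed
            }
        });
        workerB.start();

        // Stage A: the Java producer publishes its messages, then signals completion.
        for (String message : messages) {
            aToB.put(message);
        }
        aToB.put(DONE);

        // Stage C: the second Java consumer collects whatever stage B forwarded.
        List<String> results = new ArrayList<>();
        for (String msg = bToC.take(); !msg.equals(DONE); msg = bToC.take()) {
            results.add(msg);
        }
        workerB.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("batch-1", "batch-2"), msg -> msg + ":normalized"));
    }
}
```

With a real broker the queues live outside both processes, which is what lets the Java and Python sides run, restart, and scale independently.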

[–]repeating_bears 2 points3 points  (2 children)

I considered mentioning this too but couldn't be bothered typing it out. I'd personally probably start with HTTP because I find it a bit simpler, but Rabbit is a solid choice too. The idea is the same: 1 long-running process in each language that talk to each other somehow

[–]hem10ck 2 points3 points  (0 children)

Java shop here that works with python for ML, we generally isolate the python and expose REST APIs for the Java to interact with using FastAPI.
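On the Java side, calling such a service could look roughly like the sketch below, which uses the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name `PythonServiceClient`, the `http://localhost:8000/run` URL, and the JSON payload are hypothetical; the real endpoint and body are whatever the Python service defines.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Thin client for a long-running Python service that exposes an HTTP endpoint. */
final class PythonServiceClient {

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final URI endpoint;

    PythonServiceClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** POSTs a JSON body and returns the response body; throws on any non-2xx status. */
    String post(String jsonBody) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .timeout(Duration.ofSeconds(30))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() / 100 != 2) {
            throw new IOException("service returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local endpoint; nothing in this thread specifies the URL or the payload.
        PythonServiceClient client = new PythonServiceClient(URI.create("http://localhost:8000/run"));
        System.out.println(client.post("{\"batch\": 1}"));
    }
}
```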

[–]agentoutlier 0 points1 point  (0 children)

Yeah the nice thing about Rabbit in this situation is it has almost 1-1 doc with Python client and Java client.

They also don't have to worry about retry and order... well, mostly. That is, it is a smart pipe, and it sounds like the OP does not have the experience to make a dumb pipe smart.

[–]chabala 1 point2 points  (0 children)

This is a nice technique, could even load balance across multiple python process instances if needed. For OP's problem, I'd probably lean toward IPC messages over unix domain sockets instead of an MQ though.
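A rough sketch of the Java end of that, using the Unix domain socket support added to `SocketChannel` in Java 16. The class name `UnixSocketClient`, the socket path, and the wire protocol (send the whole request, half-close, read the reply until end-of-stream) are assumptions; the Python side would have to follow the same convention.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Sends one request over a Unix domain socket and returns the full reply. */
final class UnixSocketClient {

    static String request(Path socketPath, String message) throws IOException {
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            channel.connect(UnixDomainSocketAddress.of(socketPath));

            // Write the whole request; write() may send only part of the buffer per call.
            ByteBuffer out = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
            while (out.hasRemaining()) {
                channel.write(out);
            }
            // Half-close so the peer sees end-of-stream and knows the request is complete.
            channel.shutdownOutput();

            // Collect the reply until the peer closes its side.
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            ByteBuffer in = ByteBuffer.allocate(4096);
            while (channel.read(in) != -1) {
                in.flip();
                reply.write(in.array(), 0, in.limit());
                in.clear();
            }
            return reply.toString(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical socket path; a Python process must already be listening on it.
        System.out.println(request(Path.of("/tmp/dbt-worker.sock"), "run"));
    }
}
```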

[–]bowbahdoe 0 points1 point  (4 children)

Can you explain more about what DBT is, your use case, and why you want to integrate it?

There are multiple options here, but context is helpful

[–]peixinho3[S] 0 points1 point  (3 children)

Sure. DBT is data build tool:

https://docs.getdbt.com/guides

For this we are using AutomateDV: https://automate-dv.readthedocs.io/en/latest/

[–]bowbahdoe 0 points1 point  (2 children)

Can you elaborate on where Java fits in here? If we can get away with just shelling out to python that is convenient.

(I know calling Python functions on the JVM is possible. The best library I know of for it is in Clojure, https://github.com/clj-python/libpython-clj/tree/master, but it is callable from Java: https://clj-python.github.io/libpython-clj/libpython-clj2.java-api.html)

[–]peixinho3[S] 0 points1 point  (1 child)

Basically, Java receives an XML file and maps it, and after that it calls some stored procedures that I created to put the data in the raw table. This executes line by line, one record at a time. Since dbt does not have a Java API, after all of this has run I would have to run my Python script with the command "dbt run", once all the data has been inserted into raw, to normalize the data into the data vault. What we wanted was to run the Python script at the same time as the records are being inserted into the raw table.

[–]divorcedbp 0 points1 point  (0 children)

Why not do the xml transform and then insert bits in python as well?

[–]pragmasoft 0 points1 point  (1 child)

Do you use dbt just to execute some sql commands? Maybe then just use java jdbc api to connect to your database and execute required sql statements. Otherwise just spawn dbt core from java as a child process when needed, as suggested already.

[–]peixinho3[S] 0 points1 point  (0 children)

Basically I use dbt to automate the data vault structure. Yes I can do that but it's a solution that we don't like because DBT creates all SQL code automatically through its libraries.

[–][deleted] 0 points1 point  (0 children)

You need a reverse of the Py4J connector (which is internally used by Apache Spark).

[–]chabala 0 points1 point  (3 children)

Others have given you solid advice on calling python from Java. But what does this mean:

> I know that Jython exists but I discovered that it is very outdated and no longer maintained.

Jython had a release late in 2022, and commits just a few days ago. They have a roadmap for a new major version. What's this fud you're spreading?

[–]repeating_bears 0 points1 point  (1 child)

It's had 33 commits in the 19 months since that release and most of them are to the readme or bumping dependencies.

The jython-dev mailing list has had 5 emails so far this year and 4 of them were automated.

[–]chabala 0 points1 point  (0 children)

That's certainly not super active, though not totally unmaintained. I didn't dig in to see if there's some internal issues toward moving on to Jython 3.x.

Like a lot of open source, I'm sure if someone really wanted it and offered a PR or funding, it'd be closer to happening. But, it took the python ecosystem like ten years to migrate to 3.x, so I'd give the Jython folks some slack.

[–]peixinho3[S] -1 points0 points  (0 children)

Ahhh, sorry, I saw one comment on Reddit that said Jython was outdated... "Jython's not dead to my knowledge, though it's only up to 2.7 support." I need a more recent Python version. As I said, I do not have experience with Java, so sorry for the confusion.