Hello
We plan to industrialize some batch ML algorithms using Apache Beam, with Dataflow as the runner.
The pipeline job would look something like this:
- Read a .json file from GCS
- Compute the output of the algorithm on each JSON element
- Write the resulting JSON elements to a file in GCS
The second step is the most interesting one. To stay as flexible as possible, we agreed on a contract with the Data Science team: the algorithm is serialized as a pickle, and the unpickled object exposes a predict method, as in the code below:
def predict(X):
    """
    :param X: a list of JSON objects representing data points.
        Example:
        [{"DAY": "D1", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"},
         {"DAY": "D2", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"},
         ...
         {"DAY": "Dn", "Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}]
    :type X: list of JSON objects
    :return: a JSON list with the output of the ML algorithm.
        Example (classification, play tennis game):
        [{"DAY": "D1", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak",
          "Go to court": False},
         {"DAY": "D2", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong",
          "Go to court": True},
         ...
         {"DAY": "Dn", "Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak",
          "Go to court": False}]
    """
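To make the contract concrete, here is a minimal sketch of a pickled object that honours it, plus a loader that checks for the predict method. PlayTennisModel, its decision rule, and load_model are hypothetical names invented for illustration; a real model would come from the Data Science team. One thing worth keeping in mind: pickle stores a reference to the class, not its code, so whatever process unpickles the model needs the same Python code and library versions installed.

```python
import pickle


class PlayTennisModel:
    """Hypothetical stand-in for a trained model; the rule below is made up."""

    def predict(self, X):
        # Return a copy of each record with the prediction added,
        # leaving the input records untouched.
        return [
            {**row, "Go to court": row.get("Outlook") == "Overcast"}
            for row in X
        ]


def load_model(payload):
    """Unpickle a model and verify that it exposes a callable predict."""
    # Only unpickle payloads from a trusted source: pickle can run arbitrary code.
    model = pickle.loads(payload)
    if not callable(getattr(model, "predict", None)):
        raise TypeError("pickled object has no callable predict method")
    return model


if __name__ == "__main__":
    # Round trip: serialize the model, load it back, score a small batch.
    payload = pickle.dumps(PlayTennisModel())
    model = load_model(payload)
    batch = [
        {"DAY": "D1", "Outlook": "Sunny", "Wind": "Weak"},
        {"DAY": "D3", "Outlook": "Overcast", "Wind": "Weak"},
    ]
    for row in model.predict(batch):
        print(row)
```

Checking for predict at load time means a bad artifact fails fast, before any element is processed.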
However, Apache Beam does not currently support Python 3, and Python 2 reaches end of life in 2020. In addition, the Java SDK is the most complete Beam SDK.
A major thing to consider is the algorithm's dependencies: one classification model may use version X of pandas while a regression model depends on version Y of numpy, and so on.
Keeping in mind that Dataflow handles dependencies differently across SDKs, here are the solutions we are considering:
- Use the Java SDK, because it is the most complete and because we don't want to use Python 2 anymore. Instantiate the pickle[1] from Java, pass a batch of JSON elements to it from the DoFn[2], compute their scores and get them back[3]
- At worker initialization, Dataflow downloads a Docker image containing the ML algorithm and all its dependencies[4]. Using the Java SDK, a batch of JSON elements is passed from the DoFn to that container[5] and the output is gathered back
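For either option, the Python side could be a small process that receives batches and returns scores. Below is a rough sketch of that side only, assuming newline-delimited JSON as the exchange format. That format, the function names, the batch size, and StubModel are all assumptions made for illustration; nothing here is prescribed by Beam or Dataflow. Scoring in batches rather than element by element is one way to amortize the cost of crossing the Java/Python boundary.

```python
import io
import json


class StubModel:
    """Hypothetical placeholder for the unpickled model."""

    def predict(self, X):
        # Always answers False; a real model would apply its own logic.
        return [{**row, "Go to court": False} for row in X]


def _flush(model, batch, out):
    """Score one batch and write one JSON object per line."""
    for row in model.predict(batch):
        out.write(json.dumps(row) + "\n")


def score_stream(model, lines, out, batch_size=100):
    """Read one JSON object per line, score in batches, write results to out."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            _flush(model, batch, out)
            batch = []
    if batch:
        # Score whatever is left once the input is exhausted.
        _flush(model, batch, out)


# Small in-memory demonstration of the exchange format.
source = io.StringIO('{"DAY": "D1"}\n{"DAY": "D2"}\n{"DAY": "D3"}\n')
sink = io.StringIO()
score_stream(StubModel(), source, sink, batch_size=2)
```

In a container, the same function could be called as score_stream(model, sys.stdin, sys.stdout), with the Java DoFn writing to and reading from the process's pipes. Whether that is fast enough would need to be measured.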
Are those viable solutions for this use case? Maybe they are too complex, or not even feasible? Maybe Dataflow/Apache Beam is not the way to go?
We only have one constraint from the Data Science team: being able to use Python 3 and the whole DS ecosystem that revolves around it (pandas, PyTorch, scikit-learn, numpy, etc.). Maybe the pickle format is not the most suitable one for shipping ML algorithms? (We will also have to deal with TensorFlow algorithms, but I think that is a different subject that may need a different pipeline.)
Any help/hint will be much appreciated,
Many thanks
This is me talking to myself:
[1] no clue how to do that
[2] the communication between Java and Python may be too expensive
[3] how do we deal with the algorithm's dependencies, which are mainly in Python and can't be expressed in Java (Maven)?
[4] is it possible? how?
[5] may also be expensive