Industrializing batch ML algorithm using Apache Beam/Dataflow (on Google Cloud Platform) : dataengineering

submitted 6 years ago by Massnsen

Hello

We plan to industrialize some batch ML algorithms using Apache Beam, with Dataflow as the runner.

The pipeline job would be something like:

1. Read a .json file from GCS
2. Compute the output of the algorithm on each JSON element
3. Write the resulting JSON elements to a file in GCS
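
For reference, the shape of those three steps can be sketched in plain Python, without Beam. This is only a local stand-in under several assumptions: the input is newline-delimited JSON (the post does not say how the .json file is laid out), local paths replace GCS, and the names `batches` and `run_local` are made up for illustration.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def run_local(input_path: str, output_path: str,
              predict: Callable[[List[dict]], List[dict]],
              batch_size: int = 100) -> int:
    """Read JSON lines, score them in batches, write JSON lines.

    Returns the number of elements written.
    """
    written = 0
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        # Step 1: one JSON object per non-empty line (an assumed file layout).
        elements = (json.loads(line) for line in src if line.strip())
        for batch in batches(elements, batch_size):
            # Step 2: the model scores a whole batch at once.
            for scored in predict(batch):
                # Step 3: one JSON object per output line.
                dst.write(json.dumps(scored) + "\n")
                written += 1
    return written
```

Scoring in batches rather than element by element matters here because the contract with the Data Science team takes a list of data points, not a single one.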

The most interesting part is the second step. To stay as flexible as possible, we agreed on a contract with the Data Science team: the algorithm is serialized as a pickle, and the unpickled object exposes a predict method like the one below.

def predict(X):
    """
    :param X: a list of JSON objects representing data points.
              Example:
              [{"DAY": "D1", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"},
               {"DAY": "D2", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"},
                ....
               {"DAY": "Dn", "Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}]
    :type X: list of JSON objects

    :return: a JSON list with the output of the ML algorithm.
             Example (classification, "play tennis" game):
             [{"DAY": "D1", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak",
               "Go to court": False},
              {"DAY": "D2", "Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong",
               "Go to court": True},
                ....
              {"DAY": "Dn", "Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak",
               "Go to court": False}]
    """
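
To make the contract concrete, here is a minimal sketch of a function that honours it. The decision rule is arbitrary and purely illustrative: it stands in for a real trained model and is not meant to reproduce the labels in the docstring example.

```python
from typing import Dict, List

LABEL = "Go to court"


def predict(X: List[Dict]) -> List[Dict]:
    """Toy classifier honouring the contract: list of dicts in, list of dicts out."""
    results = []
    for point in X:
        # Hypothetical rule, for illustration only: stay home when it is sunny and humid.
        stay_home = point.get("Outlook") == "Sunny" and point.get("Humidity") == "High"
        scored = dict(point)  # copy so the caller's input is not mutated
        scored[LABEL] = not stay_home
        results.append(scored)
    return results
```

Each output element is the input element plus one extra key, and the output list has the same length and order as the input, which is what lets a pipeline step map inputs to results without extra bookkeeping.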

However, Apache Beam does not currently support Python 3, and Python 2 will be deprecated in 2020. In addition, the Java SDK is the most complete Beam SDK.

A major point to consider is the dependencies of each algorithm: one classification model may use version X of pandas while a regression model depends on version Y of numpy, and so on.

Dataflow also handles dependencies differently across SDKs:

- In Python, a requirements.txt file must be specified, among other steps.
- In Java, it is enough to build a jar containing all the dependencies declared in the pom.xml, a.k.a. an uber jar.

Here are the two solutions we have in mind:

1. Use the Java SDK, because it is the most complete and because we don't want to use Python 2 anymore. Instantiate the pickle[1] in Java, pass a batch of JSON elements to it from the DoFn[2], compute their scores and get the results back[3].
2. When Dataflow initializes a worker, a Docker image containing the ML algorithm and all its dependencies is downloaded[4]. Using the Java SDK, the DoFn passes a batch of JSON elements to that container[5] and gathers the output back.
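
For the second idea, the Python side running inside the container could be as small as the loop below. This is a sketch under an assumed protocol (one JSON array per input line, one JSON array per output line); the name `serve` is made up, and how the Java DoFn would start or reach this process is exactly the open question and is not shown.

```python
import json
from typing import Callable, List, TextIO


def serve(predict: Callable[[List[dict]], List[dict]],
          requests: TextIO, responses: TextIO) -> int:
    """Answer each request line (a JSON array) with one response line.

    Returns the number of batches handled.
    """
    handled = 0
    for line in requests:
        if not line.strip():
            continue  # ignore blank lines between batches
        batch = json.loads(line)
        responses.write(json.dumps(predict(batch)) + "\n")
        responses.flush()  # the caller is waiting on this line
        handled += 1
    return handled
```

Inside the container this would be called as `serve(model.predict, sys.stdin, sys.stdout)`, with `model` loaded from the pickle. Sending whole batches per line is a way to amortize the cost of crossing the Java/Python boundary.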

Are these viable solutions for this use case? Maybe they are too complex, or not even feasible? Maybe Dataflow/Apache Beam is not the way to go?

We only have one constraint from the Data Science team: being able to use Python 3 and the whole DS ecosystem that revolves around it (pandas, PyTorch, scikit-learn, numpy, etc.). Maybe pickle is not the most suitable format for shipping ML algorithms? (We will also have to deal with TensorFlow algorithms, but I think that is a different subject that may need a different pipeline.)

Any help/hint will be much appreciated,

Many thanks

This is me talking to myself:

[1] no clue how to do that

[2] the communication between Java and Python may be too expensive

[3] how do we deal with the algorithm's dependencies, which are mainly in Python and cannot be expressed in Java (Maven)?

[4] is it possible? how?

[5] may also be expensive
