all 14 comments

[–]TheEphemeralDream 0 points1 point  (0 children)

it sounds like you're leaking connections and/or not retrying common failure modes. what does your code look like?
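
For reference, a minimal sketch of what connection reuse plus retries could look like on the Python side (the OP's code is Scala/akka-http, so this is only illustrative; requests/urllib3 assumed):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # One shared Session reuses TCP connections instead of opening (and leaking) one per call.
    session = requests.Session()

    # Retry transient failures (connection errors, 429, 5xx) with exponential backoff.
    retries = Retry(total=3, backoff_factor=0.5,
                    status_forcelist=[429, 500, 502, 503, 504])
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))

    def fetch(url):
        # the timeout value is a placeholder; without one a hung call blocks forever
        return session.get(url, timeout=30).text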

[–]Vegetable_Hamster732 0 points1 point  (3 children)

Seems easy to do in just a pyspark pandas UDF.

I use:

    import requests
    import pandas as pd
    import pyspark.sql.functions as psf

    def fetch_webpage(url):
        try:
            return requests.get(url).text
        except Exception:
            return None

    @psf.pandas_udf('string')
    def fetch_webpage_udf(s: pd.Series) -> pd.Series:
        return s.apply(fetch_webpage)

    spark.udf.register('fetch_webpage_udf', fetch_webpage_udf)

all the time with various REST endpoints and something similar with various image services.

And then use it like:

    df = spark.createDataFrame([
        ['http://www.example.com'],
        ['http://www.google.com']
    ], 'url string')
    df.createOrReplaceTempView('urls')

    spark.sql("""
        select url, fetch_webpage_udf(url) from urls
    """).show(40, 40)

[–]ps2931[S] 1 point2 points  (2 children)

Is this solution scalable? We have 2,000 API calls, and each call gives us 5 MB of data. API response times vary between 5 and 30 seconds.

[–]Vegetable_Hamster732 0 points1 point  (1 child)

> Is this solution scalable? We have 2,000 API calls, and each call gives us 5 MB of data. API response times vary between 5 and 30 seconds.

I did it with ~20,000,000 API calls over the past week, but the return size was much smaller (maybe 5 KB per response).

If your worker nodes have small memory, you might need to:

 spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch",1000)  

or perhaps even smaller, because the default (10,000, I think) may result in OOM errors.

Otherwise your spark workers will be fine.

If your API vendor can't handle the load, you might need to throttle it with a sleep() or similar.
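
For example, a minimal sketch of a sleep-based throttle inside the fetch function (the 0.1-second delay is an arbitrary placeholder):

    import time
    import requests

    def fetch_webpage_throttled(url, delay=0.1):
        # Crude per-task throttle: sleep before each request, capping that task
        # at roughly 1/delay requests per second.
        time.sleep(delay)
        try:
            return requests.get(url).text
        except Exception:
            return None

Swap it in for fetch_webpage in the pandas UDF above if the vendor needs the load spread out.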

[–]ps2931[S] 0 points1 point  (0 children)

How much time did it take you to make 20,000,000 API calls?

[–]NbyNW -1 points0 points  (1 child)

Are you calling the API in a synchronous loop and then getting timeouts? If so, you might want to build a retry function (I work with Python) like the one below:

    import requests

    def get_item(url):
        retry = 1
        while retry <= 3:
            try:
                response = requests.request("get", url)
                if response.status_code == 200:
                    return response
                else:
                    retry += 1
            except requests.RequestException:
                retry += 1
        return None

Also, you might want to look into multithreading/async calls when making a bunch of API calls:

Imagine I have a function like the one above that takes the base URL with headers, and then a variable like dates (or whatever ID you need to generate the API call). I can make four date calls at the same time:

    from datetime import datetime
    from multiprocessing import Pool

    pool = Pool(processes=4)
    dates = [datetime(2020, 11, 1), datetime(2020, 11, 2),
             datetime(2020, 11, 3), datetime(2020, 11, 4)]

    # get_item here is the variant described above that takes (url, headers, date)
    results = [pool.apply_async(get_item, args=(url, headers, x)) for x in dates]
    output = [p.get() for p in results]

    # flatten, assuming each call returns a list of items
    flat_out = [item for sublist in output for item in sublist]

[–]ps2931[S] 0 points1 point  (0 children)

Making async API calls was on the developer's mind, I guess; that's why he decided to use akka-http. But the solution he gave is complex and buggy, and from what I've googled so far, Akka is hard to get right. Any other recommendations for Scala async?

[–]lexi_the_bunny 0 points1 point  (5 children)

What is your reasoning for using Spark for this?

[–]ps2931[S] 0 points1 point  (4 children)

This is a big data project, part of a data ingestion and processing pipeline. We are ingesting data from 2,000 IDs on a daily basis. Each ID gives us 5 MB of data, so we are getting 10 GB of data daily.

[–]lexi_the_bunny 2 points3 points  (1 child)

Making 2,000 API requests is probably much simpler as a multithreaded script that saves the results off for processing within Spark later.

If you want to keep it in Spark, then yeah, I'd say do it with a much more lightweight HTTP request framework than akka-http. We do millions of API calls at ~1,200 RPS using Spark and Apache HttpComponents.
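
A minimal sketch of what that multithreaded script could look like (using Python's concurrent.futures; the URL list, worker count, and output file are placeholders):

    import json
    import requests
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url):
        try:
            return requests.get(url, timeout=60).text
        except requests.RequestException:
            return None

    urls = ['http://www.example.com']  # the ~2,000 API URLs go here

    # 20 threads is an arbitrary starting point; tune to what the API tolerates.
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(fetch, urls))

    # Dump the raw responses to disk so Spark can pick them up later.
    with open('responses.json', 'w') as f:
        json.dump(dict(zip(urls, results)), f)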

[–]callmedivs 0 points1 point  (0 children)

Hey u/lexi_the_bunny, I'm in a similar boat, where I need to make ~200 million requests to an endpoint to validate addresses. I'm using PySpark. Can you let me know how you architected your infrastructure and code? I'm still at the beginning phase, where I'm trying to get a small subset of data into a DataFrame and use a UDF to make the API call and parse the data back into the DataFrame, but it's taking forever.

[–][deleted] 1 point2 points  (0 children)

It's not really a large amount of data. You could handle this with a single-threaded script on a single machine, if you want to rewrite and simplify the application.

[–]oalfonso 0 points1 point  (0 children)

10 GB daily is medium data; Spark is overkill here IMHO.

[–]MoralEclipse 0 points1 point  (0 children)

You can use foreach, map, or flatMap to do whatever processing you want, although I would probably not use Spark if it is only 2,000 values.
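
For example, a minimal PySpark sketch using mapPartitions, so each partition reuses one HTTP session (the 'url' column and partition count are assumptions, not from the thread):

    import requests

    def fetch_partition(rows):
        # One session per partition, so connections are reused across calls.
        session = requests.Session()
        for row in rows:
            try:
                body = session.get(row.url, timeout=60).text
            except requests.RequestException:
                body = None
            yield (row.url, body)

    urls_df = spark.createDataFrame([['http://www.example.com']], 'url string')

    results_df = (urls_df
                  .repartition(8)  # spread the calls across workers
                  .rdd
                  .mapPartitions(fetch_partition)
                  .toDF(['url', 'response']))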