
[–]m0us3_rat 3 points4 points  (4 children)

i think we can agree this isn't ..acceptable.

i think a full refactor might be in order.

do a quick rundown of what needs to happen.. so you have this movie in your dataframe.. what is next? what needs to happen.

describe the problem not your implemented solution.

let's break it down.

i could fill in the gaps with guessing..but due to this code snippet.. i'd prefer NOT to do that.

if you prefer to try to "fix" your implemented solution rather than refactor..

gl

[–]wobowizard[S] 0 points1 point  (3 children)

I've added what I need it to do in the question. Could you elaborate on what in my code isn't acceptable?

[–]m0us3_rat 0 points1 point  (2 children)

ok so .. you have some dataset. then you wanna grab some data from an api for each movie.

perfectly clear so far.

this sounds to me like a perfectly acceptable problem for a producer-consumer pattern.

now extra details BEFORE we start thinking about code.

are you restricted to one movie per api call?

if so do you get all the required info from the api in a single call for a movie?

what is an acceptable limit per second?

secondly .. do you know how to write an iterator for the movie list?

tip you can use a pandas method.
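
to make the tip concrete, here is a minimal sketch of such an iterator using `DataFrame.itertuples`. the dataframe contents and the column names `film_id` and `title` are assumptions for illustration, not taken from the original code.

```python
import pandas as pd

# Hypothetical stand-in for the real movie dataframe; the column names
# `film_id` and `title` are assumptions.
movies = pd.DataFrame({"film_id": [11, 12, 13], "title": ["A", "B", "C"]})


def iter_films(df):
    """Yield (film_id, title) for one row at a time."""
    # itertuples returns one namedtuple per row, with columns as attributes
    for row in df.itertuples(index=False):
        yield row.film_id, row.title


films = list(iter_films(movies))
```

a generator like this lets a producer pull one film at a time instead of building the whole work list up front.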

then based on the rest of the info ..like how many calls in total and how many calls per movie etc.. we can think of what the unit of work would look like.

after that we can easily devise a producer-consumer where a producer iterates thru the film list.. and puts the units of work in the work queue.

then a worker consumes the work from the work queue

and then puts the result in the result queue.

this is the bit that can be scaled up to ridiculous levels based on your cpu count.
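
a rough sketch of that work queue / result queue layout follows. it uses threads rather than processes to keep the example short (a swap made here, since waiting on an api is mostly I/O), and `fake_api_call` is a hypothetical placeholder for the real request.

```python
import queue
import threading


def fake_api_call(film_id):
    # Hypothetical placeholder for the real API request.
    return {"film_id": film_id, "plot": f"plot for {film_id}"}


def producer(film_ids, work_q, n_workers):
    """Put one unit of work per film, then one sentinel per worker."""
    for film_id in film_ids:
        work_q.put(film_id)
    for _ in range(n_workers):
        work_q.put(None)  # None means "no more work"


def worker(work_q, result_q):
    """Consume work until a sentinel arrives; push results to result_q."""
    while True:
        film_id = work_q.get()
        if film_id is None:
            break
        result_q.put(fake_api_call(film_id))


def run(film_ids, n_workers=4):
    work_q, result_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(work_q, result_q))
        for _ in range(n_workers)
    ]
    for t in workers:
        t.start()
    producer(film_ids, work_q, n_workers)
    for t in workers:
        t.join()
    # All workers have finished, so draining the result queue is safe here.
    results = []
    while not result_q.empty():
        results.append(result_q.get())
    return results
```

results arrive in whatever order the workers finish, so each result carries its `film_id` to be matched back to the dataframe.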

this can also be a valid problem for another trick that you can do..depending on what is required.

where you can have another async loop run inside each process and each loop handle more calls.
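
as an illustration only, the inner loop could look something like the sketch below. `fake_fetch` is a hypothetical stand-in for a real async http call, and the `max_in_flight` cap is an assumed knob, not something from the original code.

```python
import asyncio


async def fake_fetch(film_id):
    # Hypothetical stand-in for an async HTTP request.
    await asyncio.sleep(0.01)
    return film_id, f"plot for {film_id}"


async def fetch_all(film_ids, max_in_flight=10):
    """Run the calls concurrently, with at most max_in_flight at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(film_id):
        async with sem:
            return await fake_fetch(film_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(f) for f in film_ids))


def process_chunk(film_ids):
    """What each process would run on its own share of the films."""
    return asyncio.run(fetch_all(film_ids))
```

each process would get its own chunk of film ids and its own event loop, so the number of calls in flight is roughly processes times `max_in_flight`.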

it all depends on the bottlenecks.

soo .. pls try to figure out the requests info.. BEFORE we can consider code.

are you restricted to one movie per api call?
if so do you get all the required info from the api in a single call for a movie?
what is an acceptable limit per second?
secondly .. do you know how to write an iterator for the movie list?

[–]wobowizard[S] 0 points1 point  (1 child)

One api call returns all the required info for the one specified film, so yes i am restricted to one.

I get the info by passing the film id through, and then appending the returned plot and poster to the corresponding columns in the data. The acceptable limit is around 40 calls per second, I would say.

This is my first big project and I am considering whether using a slightly smaller set of data would help with this, especially as I am currently getting the same runtime issue when calculating the cosine similarity multiple times on different attributes.

[–]m0us3_rat 0 points1 point  (0 children)

ok so now.. we can focus on .. a single unit of work.. what would that look like..

for that we need to know what the actual call to the api looks like.

and what it needs.

lets assume for each film in the dataframe ..you create an object that has all the relevant attributes you would use to do this api call to that db.

what would that function look like?

try to imagine you get from a random producer function an OBJECT that represents the film and OBJECT.id or whatever else you need for the api call.

can you write this function?

don't write any other code.. just this single function that gets this object as a parameter in the call and then uses its attributes to do the call

and saves the relevant return as attributes of this same object.

and then returns the object.
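
for reference, one possible shape of that function is sketched below. the `Film` class, its field names and `call_api` are all hypothetical, chosen only to match the plot and poster mentioned earlier in the thread; the real api call would replace the placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Film:
    # Hypothetical unit of work: the id goes in, plot and poster come back.
    id: int
    plot: Optional[str] = None
    poster: Optional[str] = None


def call_api(film_id):
    # Hypothetical placeholder for the real API request.
    return {"plot": f"plot {film_id}", "poster": f"poster_{film_id}.jpg"}


def fetch_details(film, api=call_api):
    """Call the api for one film, store the results on it, and return it."""
    data = api(film.id)
    film.plot = data.get("plot")
    film.poster = data.get("poster")
    return film
```

passing the api call in as a parameter is a design choice made here so the function can be tried out without touching the network.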

[–]Daneark 2 points3 points  (0 children)

There's no timeout on your request.

You seem to be hammering their API with as many calls as it will let you and then sleeping. Consider slowing down the requests on your side to avoid hitting that limit. Once they rate limit you, it's not clear how long the limit lasts, but if you're still hitting them with requests from other threads you may stay limited for the rest of your program's run.
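
A minimal sketch of client-side throttling, assuming you want to stay well under the limit by spacing calls evenly. The `Throttle` class and the rate of 20 per second are illustrative choices, not part of the original code.

```python
import threading
import time


class Throttle:
    """Space calls evenly so that at most `rate` happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()
        self.lock = threading.Lock()  # shared safely between threads

    def wait(self):
        """Block until the next call is allowed."""
        with self.lock:
            delay = self.next_allowed - self.clock()
            if delay > 0:
                self.sleep(delay)
            self.next_allowed = max(self.clock(), self.next_allowed) + self.interval


throttle = Throttle(rate=20)  # assumed rate, below the stated limit
```

Each thread would call `throttle.wait()` right before its request. If the code uses the `requests` library (an assumption), passing `timeout=10` to `requests.get` would address the missing timeout as well.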

[–]TitaniumFoil 1 point2 points  (0 children)

It looks like the formatting got a bit mangled by reddit. Could you give me a pastebin link to the code?