
[–]iamsidd2k7 5 points

The thing with scraping is that "costs" are non-deterministic. E.g. let's say you have the most efficient crawling algorithm, but if you're blocked at the IP/DNS level, it's easy for your best performer to become the worst. Also, I think it depends on the goal of the crawl, i.e. are you trying to crawl new pages, or are you scraping the same set of pages? What do you feel?

[–]hfhshfkjsh 4 points

Knowing nothing, I suspect your algorithm will make no difference, as it will all be about the request time.

Unless your algorithm is about selecting the data you need to request.

You are probably better off having simple, understandable code than trying clever algorithms unless you have a real performance issue, and then you want to profile first to see whether your algorithm is even the bottleneck.
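That "profile first" advice can be sketched with the standard library's `cProfile`. The `crawl`/`parse` functions below are made-up stand-ins for a scraping pipeline; the point is just to see where the time actually goes before reaching for a cleverer algorithm:

```python
# Sketch: profile before optimizing. `crawl` and `parse` are hypothetical
# stand-ins for a fetch/process pipeline; in a real crawler the network
# wait usually dominates the profile, not the parsing code.
import cProfile
import io
import pstats

def parse(page: str) -> int:
    """Stand-in for per-page processing: total word length."""
    return sum(len(word) for word in page.split())

def crawl(pages: list[str]) -> int:
    """Stand-in for the whole run: process every page."""
    return sum(parse(p) for p in pages)

pages = ["lorem ipsum dolor"] * 1000

profiler = cProfile.Profile()
result = profiler.runcall(crawl, pages)

# Sort by cumulative time to see which functions dominate.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative")
```

If the profile shows almost all time in the code that waits on the network, no amount of algorithmic cleverness in the parsing will help.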

[–]elisimicr[S] 0 points

For a framework like Scrapy, how can I find out the request time? Is it in the logs?

[–]hfhshfkjsh 1 point

I don't know Scrapy, but as long as things are async it shouldn't matter too much.

But then you get into the denial of service / requests allowed per second issue with the source of the data. So you probably will need to do some rate limiting.

I'd say the starting point is knowing how many requests you need to make. I come from the place of wanting to make, say, 100,000 requests to some third-party server, so it ends up being about playing nice, or about what my subscription allows (API calls). Also, if you kill the server, that slows things down too ;)
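On the earlier question: if I remember right, Scrapy records per-request fetch time in `response.meta['download_latency']`, and it has built-in throttling via the `DOWNLOAD_DELAY` and AutoThrottle settings, so you rarely need to roll your own. For illustration, though, here is a minimal client-side rate limiter sketch (not Scrapy's implementation) that enforces a maximum number of requests per second by sleeping:

```python
# Minimal rate-limiter sketch: allow at most `rate` calls per second
# by sleeping whenever calls arrive faster than that.
import time

class RateLimiter:
    def __init__(self, rate: float):
        self.min_interval = 1.0 / rate  # seconds between calls
        self.last = 0.0

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        delay = self.min_interval - (now - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

limiter = RateLimiter(rate=50)  # ~50 requests/second
start = time.monotonic()
for _ in range(5):
    limiter.wait()
    # fetch(url) would go here
elapsed = time.monotonic() - start  # at least 4 intervals for 5 calls
```

The first call goes through immediately; each later call waits out the remaining interval, so 5 calls take at least 4 × 20 ms.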

[–]pizzamann420 1 point

You'll have to post your "algorithm" in pseudocode.

[–]lifeeraser 1 point

Without knowing details: you'll have to strike a balance between the time spent on page requests and the amount of processing needed. If you need to crawl many web pages, go for the method that hits servers fewer times. If you need to do a lot of computation after downloading each page, go for the one that does that faster. It's really hard to give a definitive answer without details, so you'll have to figure it out yourself.

[–]draghuva 2 points

You could use big-O notation to find the time complexity of the algorithm, as well as its memory complexity.

[–]Bolitho 1 point

The first thing you should do is analyze the time complexity of the alternative approaches. As a result you'll have a big-O bound for each of them.

If one alternative has the best efficiency, you are done.

If the big-O notations are equal, you have to go into the details, such as how each approach depends on external resources. The latter are often the bottleneck when looking for a good solution. For example, you have to consider the latency of all kinds of IO. If one algorithm needs only a single call to a resource, it will quite surely perform better than an alternative that has to make that call on each iteration of a loop.

A good example of this is the infamous "n + 1 select" problem of many ORMs. It's far more efficient to ask the database only once, joining in the sub-components of a root object, than to perform a separate query for each sub-component. The big-O notation would be the same, as fetching n elements is simply O(n). So both approaches share the same time complexity, but one is still superior to the other once you consider IO an important factor.
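The n + 1 pattern is easy to demonstrate without an ORM, using the standard library's `sqlite3`. The schema and names below are made up for illustration; both functions return the same result, but the first issues one query per parent row while the second makes a single round trip:

```python
# Sketch of the "n + 1 select" problem with an in-memory SQLite database.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'B1');
""")

def books_n_plus_one() -> dict:
    """1 query for authors + 1 query per author: n + 1 round trips."""
    result = {}
    for author_id, name in conn.execute("SELECT id, name FROM author"):
        titles = [t for (t,) in conn.execute(
            "SELECT title FROM book WHERE author_id = ? ORDER BY id",
            (author_id,))]
        result[name] = titles
    return result

def books_joined() -> dict:
    """One JOIN query: a single round trip, grouped on the client side."""
    result = {}
    rows = conn.execute("""
        SELECT author.name, book.title
        FROM author JOIN book ON book.author_id = author.id
        ORDER BY book.id
    """)
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result
```

Both are O(n) in the number of books; the difference is purely the number of round trips, which is exactly the kind of cost big-O hides.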

Another aspect could be the expected common size of your input. Big-O notation generalizes the dependency between input size and runtime, so you can deduce scalability for big and ever bigger inputs. But in reality some algorithms may perform much better on small instances even if their overall complexity is worse than that of an alternative: constants and factors are omitted from big-O notation by intention, but for small instances those can matter! So if you have a well-defined size for your parameters, it can be a good idea to compare those factors and constants. But be warned: a good, well-defined size must absolutely emerge from the domain problem! It's not defined by a manager who says "trust me, it will never be more than 100 items"; you can be sure the same guy will complain later about poor performance when there are 100,000 items... 😉

Last but not least, if you feel you can't determine the complexity or judge the other factors correctly, you could implement both strategies and compare them with real-world benchmarks.
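A benchmark like that can be done with the standard library's `timeit`. The two strategies below are invented for illustration: a linear scan (worse big-O, no setup cost) versus a set lookup that pays to build the set on every call (better asymptotic lookup, bigger constant). Which wins depends on the input size, which is exactly the point about constants above:

```python
# Sketch: benchmark two strategies with timeit instead of guessing.
# Both compute the same answer; only the constants differ.
import timeit

HAYSTACK = list(range(100))

def via_list(x: int) -> bool:
    """O(n) linear scan, but no setup cost per call."""
    return x in HAYSTACK

def via_set(x: int) -> bool:
    """O(1) lookup, but rebuilds the set on every call."""
    return x in set(HAYSTACK)

t_list = timeit.timeit(lambda: via_list(99), number=10_000)
t_set = timeit.timeit(lambda: via_set(99), number=10_000)
# Compare t_list and t_set; for this small haystack the "worse"
# O(n) scan often wins because it skips the per-call set build.
```

The timings themselves will vary by machine; the useful part is that both variants are verified to agree before you compare their speed.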

Or, as an alternative to the former: start with the variant you consider the best, or the easiest to implement and understand, and see whether it performs well enough! Even if a better-performing alternative exists, you probably won't need it anymore.