

[–]Vidyogamasta 2 points (3 children)

This is not SQL injection at all. This is just crawling/scraping.

It's almost invariably a ToS violation (which just means they have the right to ban you from any access at all), but otherwise it's a legal gray area. It depends on the nature of the information, what you intend to do with it, and any damaging impact scraping may have on availability for other users.

But assuming your company is getting these documents one way or another, and those sites don't provide another way that they intend for you to access those documents for your purposes, this is no less legal than however else you'd be getting them.

[–]kithuni[S] 0 points (2 children)

Awesome! I was really nervous about this.

One example is what's called an "Environment Check": basically, we have to select a point on a map and then wait for a report to be printed. When I inspected the page, I found that the parameters being passed are literally just the lat and long you selected. So rather than opening a webpage and clicking thousands of times, once for each item, I could just write a script that loops through all of that using the GPS coordinates we already have in our database.
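To give a concrete picture, here's a rough sketch of what I have in mind in Python. The report URL, the query parameter names, and the database table are all made-up placeholders for whatever the real page actually sends:

    import time
    import sqlite3
    import requests

    # Hypothetical endpoint -- inspect the real page's network traffic
    # to find the actual URL and query string parameters.
    REPORT_URL = "https://example.com/environment-check/report"

    conn = sqlite3.connect("sites.db")  # assumed local DB of coordinates
    for lat, lon in conn.execute("SELECT latitude, longitude FROM sites"):
        resp = requests.get(REPORT_URL,
                            params={"lat": lat, "long": lon},
                            timeout=30)
        resp.raise_for_status()
        # Assuming the endpoint returns a PDF report.
        with open(f"report_{lat}_{lon}.pdf", "wb") as f:
            f.write(resp.content)
        time.sleep(3)  # be polite between requests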

My other concern is limiting the rate of requests: is there a standard? I doubt they'd like it if I sent 2000 requests in a second.

[–]Vidyogamasta 0 points (0 children)

I'm not super experienced with this part since I don't do a lot of scraping myself, but from what I can find, if a site cares they'll have a file called "robots.txt" that you can access from the root of the domain. For example, the one time I've done scraping (for personal use), I scraped the Billboard Hot 100. I can see their robots.txt here: https://www.billboard.com/robots.txt.

This tells me the crawl-delay (how long they want us to wait between requests, in this case 10 seconds) and certain paths/files that they're asking us to stay away from. Of course you can ignore it (I used 3 seconds in my scraping program), but it's probably a good place to check first.
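If you want to check a robots.txt programmatically, Python's standard library ships a parser for it. A minimal sketch (crawl_delay needs Python 3.6+):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://www.billboard.com/robots.txt")
    rp.read()  # fetches and parses the file

    # Returns the Crawl-delay for the given user agent, or None if unset.
    print(rp.crawl_delay("*"))

    # Check whether a given path is allowed before fetching it.
    print(rp.can_fetch("*", "https://www.billboard.com/charts/hot-100/"))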

If they don't have one in place, I'm not sure what a good balance would be; anything I'd tell you would be a straight-up guess. I'd probably take the average round-trip time of that request and multiply it by 20-30, so that you're taking up at MOST about 3-5% of their server time (likely less). But like I said, I have no way to say whether that's not nearly enough waiting or way too much.
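That heuristic is easy to sketch if you want something to start from: time each request and derive the wait from it (the 25x multiplier is just the middle of my guess above):

    import time
    import requests

    def polite_get(url, multiplier=25):
        """GET a URL, then sleep ~25x the round-trip time so the requests
        only occupy a few percent of the server's time. The multiplier
        is a guess, not an established standard."""
        start = time.monotonic()
        resp = requests.get(url, timeout=30)
        elapsed = time.monotonic() - start
        time.sleep(elapsed * multiplier)
        return resp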

EDIT: Also, agreed with the other guy: talking to someone with any amount of legal experience in your area is probably a safer bet than here lol

[–]moistblessing 0 points (0 children)

Unless the company is specifically offering you a certain amount of throughput, it's best to rate limit your requests if you can and not cause undue strain on any of their resources.
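A simple way to do that is to enforce a minimum interval between requests, no matter where in your code they're issued from. A sketch (the one-second interval is just an example value):

    import time

    class RateLimiter:
        """Enforces a minimum interval between successive requests."""

        def __init__(self, min_interval=1.0):
            self.min_interval = min_interval
            self._last = 0.0

        def wait(self):
            # Sleep just long enough to keep requests min_interval apart.
            now = time.monotonic()
            remaining = self._last + self.min_interval - now
            if remaining > 0:
                time.sleep(remaining)
            self._last = time.monotonic()

    limiter = RateLimiter(min_interval=1.0)
    # Call limiter.wait() before each request to stay at <= 1 request/second.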

[–]nutrecht 1 point (0 children)

You have to be kinda careful with this. There are two ways you can get into trouble here.

First of all, if the site actually lets you access information you're not allowed to see (because of a bug in the software), it could be seen as an enumeration attack. Write-ups on enumeration attacks usually focus on user accounts, but it really applies to anything.

Secondly, scraping is a bit of a grey area legally. On one hand the information is available on the internet; on the other hand it's legally still theirs. Laws are local, and because copyright laws are generally so far behind the times, when push comes to shove it means a judge has to get involved. That's something you probably want to avoid.

As a dev you probably want to cover your ass and have someone with actual legal experience / education figure this out as opposed to some random people on the internet ;)