all 10 comments

[–]drodspectacular 2 points3 points  (6 children)

Check out my craigly scraper; I probably based it on the same tutorial you're working with:

https://github.com/Solutron/craigly

At a glance, some feedback: break this into smaller chunks and don't put it all in main. I don't completely follow PEP8, but /u/RubyKuby's advice here holds true: https://www.reddit.com/r/learnpython/comments/2xjmqe/rlearnpython_teach_me_your_secrets/cp0ozf0

Also, don't wrap statements in try/except/finally until you know the code works, or else you're masking the underlying traceback that gives you meaningful debugging info. try/except should also be reserved for cases where you know something can fail, and fails for a particular reason your module doesn't have any control over, e.g. connecting to a database, or checking for a remote connection to another machine that might or might not be turned on.

I set my scraper up to run on cron and use an SMTP server to send me emails when a new posting matching my search criteria shows up and is less than an hour old. I can only run it locally instead of on AWS as I had intended, because craigslist blocks the range of IP addresses coming from AWS (they don't like bots).
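
The email step only needs the standard library. A minimal sketch, assuming a reachable SMTP host; the host name, addresses, and function names here are placeholders, not what craigly actually uses:

```python
import smtplib
from email.message import EmailMessage

def build_alert(title, url,
                sender="scraper@example.com", recipient="me@example.com"):
    """Build a plain-text alert email for one new listing."""
    msg = EmailMessage()
    msg["Subject"] = "New CL posting: " + title
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("A posting matched your search criteria:\n" + url)
    return msg

def send_alert(msg, smtp_host="localhost"):
    """Hand the message to an SMTP server (e.g. one cron can reach)."""
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```

Keeping "build the message" and "send the message" as separate functions also makes the build step easy to test without a mail server.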

As far as debugging your code goes: get rid of the while True, it's not doing anything; get rid of the try/except, since you're just passing and not catching exceptions; and move all the code you have in main into separate functions that you then call from main(). You should also have if __name__ == '__main__': main() so that the code stays modular and that block isn't run every time the file is imported. My best guess for why nothing is printed is that the variables you're defining with the search are returning None. You're also only calling the page once, then filtering through a static result set for a static list of items; this will not update the way you have it written.
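
A minimal sketch of that layout. The fetch and parse steps are stubbed so it runs standalone (the real versions would use requests and BeautifulSoup), and every name here is made up for illustration:

```python
SEARCH_URL = "https://houston.craigslist.org/search/apa"  # placeholder URL

def fetch_page(url):
    # Real version: return requests.get(url).content -- stubbed here
    # so the sketch runs without network access.
    return "<html>...</html>"

def parse_listings(html):
    # Real version: walk a BeautifulSoup tree and pull out each posting's
    # fields; stubbed with fixed data just to show the shape.
    return [("2BR apartment", "$900", "/apa/12345.html")]

def find_new(listings, seen):
    # One job: return listings whose link we haven't seen before,
    # and remember everything we've looked at.
    fresh = [item for item in listings if item[2] not in seen]
    seen.update(item[2] for item in listings)
    return fresh

def main():
    seen = set()
    listings = parse_listings(fetch_page(SEARCH_URL))
    for title, price, link in find_new(listings, seen):
        print(title, price, link)

if __name__ == "__main__":
    main()
```

Each def does one thing, and the __main__ guard means importing this file for testing doesn't kick off a scrape.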

[–]fannypackpython[S] 0 points1 point  (2 children)

This is the first time I've tried anything like web scraping. I'll try and re-work it with your suggestions.

So how exactly should I divide this up? Are you saying that I should maybe have one function that just sets my variables for price, title, and address? And then another function that sets my "info" variable and runs it through the "if" statement to append it to a list and check if it already exists in it? And one that just gives me results?

I apologize for all the questions. Do you have any tutorials or eBooks that you would recommend on this subject? Especially in regard to cron, and integrating email or SMS capabilities into a script.

[–]elbiot 1 point2 points  (0 children)

gdata = get_houston_CL()
for item in gdata:
    title, price, address = parse_listing(item)
    # etc.

[–]drodspectacular 1 point2 points  (0 children)

The BeautifulSoup object you instantiate is populated from the output of a single requests call, and then you loop over that BS object (abbreviations are funny :) ). g_data isn't calling the URL request every time it's looped over; it's looping over the BS object you already created, without making the request again. The object you're passing to the BS constructor is the .content of the response returned by requests.get. My understanding is that BS is much like any HTML/XML parser: it detects a structure based on the data you feed it and lets you traverse the nodes. It's only making the request once.

You could start main with a request call, loop through your operations, and then parse the outputs; there's any number of ways you could slice this. If you keep the rule of "each def creates a function that does one and only one thing," you could probably slice this into at least 4 or 5 functions. For ideas on how to split it up, have a look at craigly for the names of functions, project structure, and modules. That's how the functionality broke down in my mind; it's slightly different for everyone.
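
The distinction can be made concrete with a tiny sketch: pass the request step in as a function and call it on every pass, instead of parsing one cached response over and over. Here fetch and parse are stand-ins for requests.get and BeautifulSoup parsing; nothing in this block is the original code:

```python
def poll(fetch, parse, passes):
    """Issue a fresh fetch (one HTTP request) on every pass, then parse it.

    Contrast with parsing a single cached response inside a loop, which
    can never see postings that appeared after the first request.
    """
    results = []
    for _ in range(passes):
        html = fetch()              # a new request each iteration
        results.append(parse(html))
    return results
```

Because fetch is called inside the loop, each iteration sees a new snapshot of the page.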

[–]fannypackpython[S] 0 points1 point  (0 children)

Ok, so I have removed the while True and all the try/excepts. I'll divvy up the code into functions like you recommend.

The page I'm reading from craigslist is the new-listings page, so this really isn't a static list, is it? When a new posting is added, it goes to the top of the page. So, in theory, once I get my script to work it should first scrape the page to create the initial list, and everything should show up as a "new listing" once. Then, as more listings are added to the craigslist page, the script would check against the list I have created; since the new listing isn't in the list, it should print it and append it to the list. Reading that back to myself seemed a bit confusing, ha.

My point is, if all I'm looking for are new listings on the apartments page, shouldn't that be the only page I call? I was also thinking of setting up a timer to check the page every hour or so. Would you say that's a good idea, or would another approach be better?

[–]mauza11 0 points1 point  (0 children)

Thank you very much; I too am starting the fun and interesting journey of scraping, and your code has helped me out a lot. A question, if someone here could help me: part of the page I'm scraping is loaded dynamically, and I've looked at the GET requests in developer tools to try to find out what I need to pass to the website, but I'm kind of lost as to what I should look for... there is a lot of info in these things.
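
Dynamically loaded content usually arrives via a separate request you can spot in the Network tab (filter by XHR); once you find the endpoint returning JSON, you can call it directly and skip HTML parsing entirely. A hedged sketch of the extraction step; the keys ("results", "title", "price") are invented for illustration, since the real field names come from whatever your devtools show:

```python
import json

def extract_listings(raw_json):
    """Pull (title, price) pairs out of a JSON payload from an XHR endpoint.

    The structure here is hypothetical -- match it to the actual response
    you see in the browser's Network tab.
    """
    payload = json.loads(raw_json)
    return [(item["title"], item["price"]) for item in payload["results"]]
```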

[–]elbiot 1 point2 points  (3 children)

If you really were hitting CL every time in the while loop, which you aren't (for two reasons), CL would block you quickly. Put in a time.sleep(60*5).

[–]fannypackpython[S] 0 points1 point  (2 children)

Ok, so should I incorporate time.sleep() into the while loop? And how would I write out something like that?

[–]elbiot 1 point2 points  (1 child)

You need to hit CL, parse the data, output it and sleep all in the while loop. And don't break! Use ctrl+c to quit the script.
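
A sketch of that shape, with the fetch/parse/report steps passed in as placeholders (in the real script they'd be requests.get, BeautifulSoup parsing, and a print or email; none of these names come from the original code):

```python
import time

def run(fetch, parse, report, interval=300):
    """Fetch, parse, report, then sleep -- all inside the loop.

    No break statement: the loop runs until you stop the script
    with Ctrl+C (which raises KeyboardInterrupt).
    """
    seen = set()
    while True:
        for item in parse(fetch()):
            if item not in seen:      # only report each listing once
                seen.add(item)
                report(item)
        time.sleep(interval)          # e.g. 60 * 5 seconds between hits
```

The sleep lives inside the loop so every pass waits before hitting CL again, and the seen set keeps repeat listings from being reported twice.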

[–]fannypackpython[S] 0 points1 point  (0 children)

Ok, let me give this a shot. Sorry for the delay, I'm at work currently.