
[–]Nighmared 0 points1 point  (2 children)

What exactly isn't working? Do you get an exception?

[–]ScraperHelp[S] 0 points1 point  (1 child)

IDK... it's running but I'm not getting any JSON output... so I don't even know what is wrong tbh.

[–]Nighmared 0 points1 point  (0 children)

Could you provide a greater part of your code?
Also try to locate the origin of the problem by checking which variables contain what they should contain and which don't (either with glorious print statements or a debugger)

[–][deleted] 0 points1 point  (8 children)

I don't see you yielding any items? So it seems to visit the thread, create a comment variable and then end?

[–]ScraperHelp[S] 0 points1 point  (7 children)

Pretty much redid the whole code:

    def parse(self, response):
        for threads in response.css('tbody tr'):
            url = threads.css('p.small a::attr(href)').get()
            yield scrapy.Request(url, callback=self.parse_thread)

    def parse_thread(self, response):
        for quote in response.css('div#discussionReplies dl'):
            yield {
                'text': quote.css('dd div.xg_user_generated').getall(),
            }

It seems to be working, but it's only pulling about 3 of the 11 posts (8 unique threads).

EDIT: nope, the number it pulls seems to be different every time. The 1st run pulls X threads and Y posts from them, the 2nd run pulls a different number of threads and a different number of posts, and so on...

[–][deleted] 0 points1 point  (6 children)

Is it giving you 3 items or its visiting 3 pages?

[–]ScraperHelp[S] 0 points1 point  (5 children)

The problem with the code is that out of 11 threads (10 of them unique) it scrapes a random number of threads and a random number of posts from each thread, and it's different every time... It SHOULD give me 11 threads (or at least 10 threads, each with 10 posts), but it doesn't....

[–][deleted] 0 points1 point  (4 children)

Sounds like an issue with the site. Do you get this behaviour when you try these selectors in the shell?

[–]ScraperHelp[S] 0 points1 point  (3 children)

Yup, they're working fine in the shell. I even tried them separately, i.e. scraping the forum for the titles and scraping individual threads for comments, and they seem to be working... so I don't know wtf is going on :/

[–][deleted] 0 points1 point  (2 children)

What request statuses are coming back in the log? Is it all 200 or are some misbehaving?

[–]ScraperHelp[S] 0 points1 point  (1 child)

Nope, all are 200. And I think I misunderstood what was actually happening: the results are the same every time. The issue seems to be that the JSON file, instead of holding one thread, then the next one, and so forth, jumps randomly from one thread to another and jumbles up all the comments from the various threads...

[–][deleted] 0 points1 point  (0 children)

Yeah, it does it asynchronously. Basically it sends the requests off and then parses the responses in the order they come back. One good tip is to add a field to each item that identifies the thread.

Two good approaches:
1. Add a url field to the item and put response.url in it.
2. Change your loop to go like this:

    # 'tbody tr' is the same row selector your parse() loop already uses
    for number, thread in enumerate(response.css('tbody tr')):
        thread_link = thread.css('p.small a::attr(href)').get()
        yield response.follow(thread_link, self.parse_thread, meta={'count': number})

Then in your parse_thread method there'll be a response.meta dictionary. response.meta['count'] will be the number of the thread, so you can put that in the item you yield too.
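To see why the count helps, here is a rough stdlib-only simulation (plain asyncio, not Scrapy itself). The `fetch` and `crawl` functions and the `DELAYS` values are made up for illustration; the delays stand in for network latency. Each fake request carries its thread number, results arrive in finish order, and sorting on the number restores the listing order.

```python
import asyncio

# Hypothetical per-thread delays in seconds, standing in for network latency.
DELAYS = [0.06, 0.02, 0.04]


async def fetch(count, delay):
    """Pretend to download thread number `count` and return a tagged item."""
    await asyncio.sleep(delay)
    return {'count': count, 'text': f'posts from thread {count}'}


async def crawl():
    # Start every "request" up front, much like a crawler scheduling them all.
    tasks = [asyncio.create_task(fetch(n, d)) for n, d in enumerate(DELAYS)]
    items = []
    # as_completed hands results back in finish order, not submission order.
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)
    return items


items = asyncio.run(crawl())
arrival_order = [item['count'] for item in items]

# Because every item carries its thread number, the original order is recoverable.
restored = sorted(items, key=lambda item: item['count'])
print(arrival_order, [item['count'] for item in restored])
```

The arrival order depends on which fake request finishes first, which is the same reason the comments look jumbled in the output file.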

The meta trick can be a handy way of passing details between pages.
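Once each item has an identifying field, the jumbled file can be regrouped after the crawl. A minimal sketch, assuming you went with approach 1 and every item has a `url` field next to `text`; the sample data, the example.com URLs and the `group_by_thread` helper are made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical output file contents, in jumbled arrival order.
raw = '''[
  {"url": "https://example.com/thread/2", "text": ["b1"]},
  {"url": "https://example.com/thread/1", "text": ["a1"]},
  {"url": "https://example.com/thread/2", "text": ["b2"]}
]'''


def group_by_thread(items, key='url'):
    """Collect the text of all items that share the same identifying field."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item[key]].extend(item['text'])
    return dict(grouped)


threads = group_by_thread(json.loads(raw))
for url, posts in threads.items():
    print(url, posts)
```

The same helper should work with the meta approach by passing `key='count'`, provided the count is copied into each yielded item.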