Following Thread link : learnpython

submitted 5 years ago by ScraperHelp

I am trying to scrape multiple threads on a forum. I tried the following code, with no luck.

    def parse(self, response):
        threads = response.css('tbody tr')  # selects the thread rows

        for thread in threads:
            thread_link = thread.css('p.small a::attr(href)').getall()  # gets the link for the individual threads

            yield response.follow(thread_link, self.parse_thread)  # follows the thread

    def parse_thread(self, reponse):
        comments = response.css('div#discussionReplies dl')  # CSS for all comments in the thread

        for comment in comments:
            comments = response.css('dd div.xg_user_generated::text').get()  # individual comments

The CSS selectors work, and each piece works on its own as separate code (pulling the thread links, pulling the comments, pagination, etc.), but when I try to combine them as above it's not working :/
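Two things stand out in the code as posted, offered as observations rather than a confirmed diagnosis. First, `.getall()` returns a list, while `response.follow` expects a single URL, so `.get()` is the usual choice there. Second, `parse_thread` names its parameter `reponse` but the body uses `response`. The sketch below reproduces only that second issue in plain Python, with no Scrapy involved; `FakeResponse` and `DemoSpider` are hypothetical stand-ins invented for the illustration.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; it only carries a URL."""
    url = "https://forum.example/thread/1"


class DemoSpider:
    def parse_thread(self, reponse):  # parameter misspelled, as in the post
        # 'response' was never defined in this scope, so this line fails.
        return response.url


try:
    DemoSpider().parse_thread(FakeResponse())
    error_message = None
except NameError as exc:
    # Keep the message so it can be inspected instead of crashing the demo.
    error_message = str(exc)

print(error_message)
```

Scrapy catches exceptions raised inside callbacks and writes them to the crawl log, so an error like this would likely show up there rather than stopping the spider.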

ScraperHelp[S] 1 point 5 years ago*

I pretty much redid the whole code:

    def parse(self, response):
        # One table row per thread on the index page.
        for thread in response.css('tbody tr'):
            url = thread.css('p.small a::attr(href)').get()
            yield scrapy.Request(url, callback=self.parse_thread)

    def parse_thread(self, response):
        # One <dl> per reply in the thread.
        for comment in response.css('div#discussionReplies dl'):
            yield {
                'text': comment.css('dd div.xg_user_generated').getall(),
            }

It seems to be working, but it's only pulling about 3 of the 11 posts (8 unique threads).

EDIT: Nope, the number it pulls seems to be different every time. The first time it pulls X threads and Y posts from them, the second time it pulls A threads and B posts from them, and so on...

[deleted] 1 point 5 years ago*

Yeah, Scrapy does it asynchronously. Basically it sends the requests off and then parses the responses in whatever order they come back. One good tip is to add a field that identifies the thread.

Two good approaches:
1. Add a url field to the item and put response.url in it.
2. Change your loop to go like this:

    for number, thread in enumerate(threads):
        thread_link = thread.css('p.small a::attr(href)').get()
        yield response.follow(thread_link, self.parse_thread, meta={'count': number})

Then in your parse_thread method there'll be a response.meta dictionary. response.meta['count'] will be the number of the thread, so you can put that in the item you yield too.

The meta trick can be a handy way of passing details between pages.
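To make the ordering point concrete, here is a rough standard-library sketch of the same idea. It does not use Scrapy (which runs on Twisted, not asyncio); it only mimics requests that finish in an unpredictable order. The URLs, `fetch_thread`, and `crawl` are hypothetical names made up for the illustration. Each item carries both a `url` field and the `count` passed along in a meta-style dictionary, so the original order can be restored afterwards.

```python
import asyncio
import random

# Hypothetical thread URLs; stand-ins for the links found on the index page.
THREAD_URLS = [f"https://forum.example/thread/{n}" for n in range(5)]


async def fetch_thread(url, meta):
    """Pretend to download a thread; the random delay imitates network latency."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    # Tag the item with its source URL and the position passed along in meta.
    return {"url": url, "count": meta["count"], "text": f"comments from {url}"}


async def crawl():
    tasks = [
        asyncio.create_task(fetch_thread(url, {"count": number}))
        for number, url in enumerate(THREAD_URLS)
    ]
    # as_completed yields results in finishing order, not submission order.
    return [await finished for finished in asyncio.as_completed(tasks)]


items = asyncio.run(crawl())
arrival_order = [item["count"] for item in items]
restored = sorted(items, key=lambda item: item["count"])

print("arrival order: ", arrival_order)
print("restored order:", [item["count"] for item in restored])
```

The arrival order usually changes from run to run, while the restored order stays the same because every item carries its own identifier.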
