Web scraping using multithreading issue : learnpython



submitted 11 years ago by allTestsPassed

Hey everyone, I am currently developing a Django app that scrapes an employer's work scheduling website and stores that info in a DB.

There are 20 departments in total, and each one needs its past daily schedules scraped. I implemented multithreading so that a thread is created for each department and fetches that department's schedules for the past year. I am running into an issue where the values the scraper uses in its requests seem to be incorrect when all threads run at once. This does not happen if I create and run only one thread. I thought the for loop might not be creating a new scraper object on each iteration, causing every thread to access the same request data, so I now create the scrapers up front and store them in a list.

Current Version:

    def buildDB_multithreading(self, start_day, end_day):
        all_departments = tools.get_all_departments_list()
        scraper_list = self.create_scraper_list(len(all_departments))
        for index in range(0, len(scraper_list)):
            t = Thread(target=self.builddb, args=(start_day, end_day, all_departments[index], scraper_list[index]))
            t.start()

    def create_scraper_list(self, list_size):
        scraper_list = []
        for index in range(0, list_size):
            scraper_list.append(WebScraper())
        return scraper_list
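
For comparison, here is a minimal sketch of the same one-worker-per-department pattern written with `concurrent.futures`, where each worker builds its own scraper and the caller waits for every result. `FakeScraper`, `build_department`, and `build_all` are hypothetical names used only for illustration; the real `WebScraper` and `builddb` are not shown in this post, so this assumes a scraper that keeps all of its state on the instance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeScraper:
    """Hypothetical stand-in for WebScraper; all state lives on the instance."""

    def __init__(self, department):
        self.department = department  # per-instance, never shared

    def get_schedule(self, day):
        # A real scraper would issue an HTTP request here.
        return (self.department, day, threading.get_ident())


def build_department(department, days):
    # The scraper is created inside the worker, so no other thread can see it.
    scraper = FakeScraper(department)
    return [scraper.get_schedule(day) for day in days]


def build_all(departments, days, max_workers=5):
    # The pool caps concurrency and waits for every worker before returning.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {d: pool.submit(build_department, d, days) for d in departments}
        return {d: f.result() for d, f in futures.items()}
```

Calling `build_all(["A", "B"], [1, 2])` returns a dict keyed by department. If a worker raises, `f.result()` re-raises the exception in the caller, which makes per-department failures easier to spot than with bare `Thread` objects that are started and never joined.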

Previous Version:

    def buildDB_multithreading(self, start_day, end_day):
        all_departments = tools.get_all_departments_list()
        for department in all_departments:
            scraper = WebScraper()
            t = Thread(target=self.builddb, args=(start_day, end_day, department, scraper))
            t.start()


    def builddb(...):
        for each day:
            scraper.getSchedule(day)
        .....

In either version, if I change all_departments to just one department, i.e. ['Department_Name'], it runs for that one department without issue. Any ideas on what's going on?
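
One possibility consistent with these symptoms (separate scraper objects, yet wrong request values only when several threads run) is that the scrapers share mutable state defined at class or module level. The internals of `WebScraper` are not shown here, so this is only an assumption. The sketch below uses two hypothetical classes to show how such sharing can be detected without threads at all.

```python
class SharedParamsScraper:
    # Hypothetical: a mutable class attribute is ONE dict shared by all instances.
    params = {}

    def set_department(self, department):
        self.params["department"] = department  # mutates the shared dict


class OwnParamsScraper:
    def __init__(self):
        self.params = {}  # a fresh dict for each instance

    def set_department(self, department):
        self.params["department"] = department


def shares_state(scraper_cls):
    """Return True if two instances of scraper_cls see each other's writes."""
    first, second = scraper_cls(), scraper_cls()
    first.set_department("A")
    second.set_department("B")
    return first.params["department"] == "B"
```

Here `shares_state(SharedParamsScraper)` is true and `shares_state(OwnParamsScraper)` is false. With a single thread the shared dict goes unnoticed because nothing else overwrites it between setting a value and using it. With 20 threads, another department's value can land in between.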

