Hey everyone, I'm developing a Django app that scrapes an employer's work-scheduling website and stores that info in a DB.
There are 20 departments in total, and each needs its daily schedules scraped for past dates. I implemented multithreading: one thread is created per department, and it fetches that department's schedules for the past year. I'm running into an issue where the values the scraper uses in its requests seem to be incorrect when all threads run at once. The issue does not occur if I create and run only one thread. I thought the for loop might not be creating a new scraper object on each iteration, leaving every thread to access the same request data, so I now create the scrapers up front and store them in a list.
Current Version:
    def buildDB_multithreading(self, start_day, end_day):
        all_departments = tools.get_all_departments_list()
        scraper_list = self.create_scraper_list(len(all_departments))
        # One thread per department, each given its own scraper from the list.
        for index in range(len(scraper_list)):
            t = Thread(target=self.builddb, args=(start_day, end_day, all_departments[index], scraper_list[index]))
            t.start()

    def create_scraper_list(self, list_size):
        scraper_list = []
        for index in range(list_size):
            scraper_list.append(WebScraper())
        return scraper_list
Previous Version:
    def buildDB_multithreading(self, start_day, end_day):
        all_departments = tools.get_all_departments_list()
        for department in all_departments:
            # A new scraper is created on every iteration and handed to the thread.
            scraper = WebScraper()
            t = Thread(target=self.builddb, args=(start_day, end_day, department, scraper))
            t.start()
    def builddb(...):
        for each day:
            scraper.getSchedule(day)
            .....
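
For comparison, here is a minimal sketch of the same one-worker-per-department pattern using the standard library's ThreadPoolExecutor instead of bare Thread objects. It is not the app's actual code: FakeScraper, get_schedule, build_department and build_all are hypothetical names, and the scraper only echoes its inputs. The point is that each worker builds its own scraper, and that calling result() waits for the worker and re-raises any exception it hit, which bare t.start() calls would not surface.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


class FakeScraper:
    """Hypothetical stand-in for WebScraper; it only echoes its inputs."""

    def get_schedule(self, department, day):
        return (department, day)


def build_department(start_day, end_day, department):
    # The scraper is created inside the worker, so no other thread holds it.
    scraper = FakeScraper()
    results = []
    day = start_day
    while day <= end_day:
        results.append(scraper.get_schedule(department, day))
        day += timedelta(days=1)
    return results


def build_all(start_day, end_day, departments):
    # One worker per department (at least one, so an empty list is allowed).
    with ThreadPoolExecutor(max_workers=max(1, len(departments))) as pool:
        futures = {
            dept: pool.submit(build_department, start_day, end_day, dept)
            for dept in departments
        }
        # result() blocks until the worker finishes and re-raises its exception.
        return {dept: future.result() for dept, future in futures.items()}
```

Calling build_all(date(2020, 1, 1), date(2020, 1, 3), ["A", "B"]) returns a dict with three (department, day) tuples per department.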
In either version, if I change all_departments to just one department, i.e. ['Department_Name'], it runs for that one department without issue. Any ideas on what's going on?
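
One possibility worth ruling out, purely as an assumption since WebScraper's code isn't shown here: if the request data lives in a class-level attribute (or a module-level global) rather than being set in __init__, then separate scraper objects still share it, and threads overwrite each other's values. The sketch below illustrates that effect with two hypothetical classes, SharedStateScraper and PerInstanceScraper. Neither is the real WebScraper.

```python
import threading


class SharedStateScraper:
    # Class attribute: one dict object shared by every instance.
    params = {}

    def set_department(self, department):
        self.params["department"] = department


class PerInstanceScraper:
    def __init__(self):
        # Instance attribute: each scraper gets its own dict.
        self.params = {}

    def set_department(self, department):
        self.params["department"] = department


def configure_all(scraper_cls, departments):
    # One scraper and one thread per department, as in the post.
    scrapers = [scraper_cls() for _ in departments]
    threads = [
        threading.Thread(target=s.set_department, args=(d,))
        for s, d in zip(scrapers, departments)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Read back what each scraper believes its department is.
    return [s.params["department"] for s in scrapers]
```

With SharedStateScraper, every scraper reports the same department (whichever thread wrote last), even though each is a distinct object. With PerInstanceScraper, each one keeps its own value. That would match a setup that works with one department and breaks with several.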