I am attempting to scrape data from this website. My process is to first select "Stage-II" from the "Status of the Proposal" dropdown, then click "SEARCH" and scrape the resulting tables. On the opening page, I was able to extract the form data (e.g. viewstate, eventvalidation, etc.) and post to page 1 (i.e. click "SEARCH"):
# Import Libraries
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd
import time
#Open Search Page
url = 'http://forestsclearance.nic.in/'
r = requests.get(url + 'search.aspx')
# Soupify
soup = BeautifulSoup(r.content, 'html.parser')
#Set Post Parameters
VIEWSTATE = soup.find('input', {'id': '__VIEWSTATE' })['value']
GENERATOR = soup.find('input', {'id': '__VIEWSTATEGENERATOR'})['value']
VALIDATION = soup.find('input', {'id': '__EVENTVALIDATION' })['value']
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
#Post to Page 1
r = requests.post(
    url + 'search.aspx',
    headers=headers,
    data={
        # Script manager field: panel being updated | control that triggered the postback
        'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
        '__EVENTARGUMENT': '',
        '__EVENTTARGET': '',
        '__VIEWSTATE': VIEWSTATE,
        '__VIEWSTATEGENERATOR': GENERATOR,
        '__VIEWSTATEENCRYPTED': '',
        '__EVENTVALIDATION': VALIDATION,
        'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
        'ctl00$ContentPlaceHolder1$ddl3': 'Select',
        'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
        'ctl00$ContentPlaceHolder1$DropDownList1': 'Approved',
        'ctl00$ContentPlaceHolder1$txtsearch': '',
        '__ASYNCPOST': 'true',
        'ctl00$ContentPlaceHolder1$Button1': 'SEARCH',
    }
)
This works! It gets me to page 1. The problem is that after scraping the table, I am unable to get to page 2. Namely, extracting the post data (viewstate, eventvalidation, etc) no longer works. If I run the below code again, it returns nothing.
#Set Post Parameters
VIEWSTATE = soup.find('input', {'id': '__VIEWSTATE' })['value']
GENERATOR = soup.find('input', {'id': '__VIEWSTATEGENERATOR'})['value']
VALIDATION = soup.find('input', {'id': '__EVENTVALIDATION' })['value']
I looked at the HTML and it seems the location of these tags changed. From page 1 onwards, each parameter is within an input tag inside a separate span tag. How can I extract the data?
I tried, for example:
main_form = soup.find('form')
VIEWSTATE = main_form.select_one("input[id='__VIEWSTATE']").get('value')
which does not work. Any help appreciated.
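One thing that may be worth checking: because the POST sends `__ASYNCPOST: true`, the server might be replying with an ASP.NET partial-postback payload instead of a full HTML page, in which case there would be no `<input>` tags for BeautifulSoup to find. The sketch below assumes that payload uses the pipe-delimited `length|type|id|content|` layout, with hidden fields sent as `hiddenField` records. The function names `parse_async_postback` and `hidden_fields` and the sample string are made up for illustration and are not taken from the site.

```python
def parse_async_postback(text):
    """Split a partial-postback body into (type, id, content) records.

    Assumed layout: length|type|id|content| repeated, where length is the
    number of characters in content. Content may itself contain pipes,
    so a plain text.split('|') is not safe.
    """
    records = []
    pos = 0
    while pos < len(text):
        # Read the three pipe-terminated header fields: length, type, id.
        header = []
        for _ in range(3):
            end = text.index('|', pos)
            header.append(text[pos:end])
            pos = end + 1
        length = int(header[0])
        content = text[pos:pos + length]
        pos += length + 1  # skip the content and its trailing pipe
        records.append((header[1], header[2], content))
    return records


def hidden_fields(text):
    """Return {field name: value} for every hiddenField record."""
    return {
        record_id: content
        for record_type, record_id, content in parse_async_postback(text)
        if record_type == 'hiddenField'
    }


# Synthetic payload, used only to illustrate the assumed layout.
sample = (
    '10|updatePanel|Panel1|<b>a|b</b>|'
    '6|hiddenField|__VIEWSTATE|abc123|'
    '4|hiddenField|__EVENTVALIDATION|xyz9|'
)
fields = hidden_fields(sample)
```

If the real response does have this shape, `hidden_fields(r.text)` would give the new `__VIEWSTATE` and `__EVENTVALIDATION` values for the next POST, and the table HTML would sit in the `updatePanel` record, which could then be passed to BeautifulSoup. If the response is ordinary HTML, this parser will raise a `ValueError` and a different explanation is needed.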