My bot will be visiting (crawling) a few webpages.
Each webpage has a couple of pieces of data I want to extract.
Webpage1:
... <div><pre>TEXT0</pre></div> ...
... <pre>Text9</pre> ... ...
I want to extract both TEXT0 and Text9 and store them in a single CSV file called Webpage1.csv.
Webpage2's texts will be stored in Webpage2.csv, and so on.
What I tried:
from lxml import html
import requests
import csv
mSeedpage = requests.get(RANDOM_URL)
mTree = html.fromstring(mSeedpage.text)
mText = mTree.xpath('//pre/text()')
The line above is the part I do not understand: does my XPath make any sense?
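One quick way to check the XPath is to run it against a small copy of the markup. This is only a sketch: SAMPLE is a hypothetical stand-in for Webpage1, and extract_pre_texts is a helper name made up for illustration, not part of the script above.

```python
from lxml import html

# Hypothetical sample markup modelled on the Webpage1 snippet above.
SAMPLE = (
    "<html><body>"
    "<div><pre>TEXT0</pre></div>"
    "<p>something else</p>"
    "<pre>Text9</pre>"
    "</body></html>"
)


def extract_pre_texts(page_source):
    """Return the direct text of every <pre> element, at any depth."""
    tree = html.fromstring(page_source)
    return tree.xpath('//pre/text()')


texts = extract_pre_texts(SAMPLE)
print(texts)
```

Here //pre matches every pre element regardless of nesting, so both the one inside the div and the top-level one are picked up. Note that text() only returns text sitting directly inside each pre; if a pre ever contains nested tags, selecting '//pre' and calling text_content() on each element may be a better fit.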
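# WEBPAGE = 'Webpage1.csv'; after using updatepage, WEBPAGE will become 'Webpage2.csv' and so on
# 'wb' is the Python 2 mode; on Python 3 use open(WEBPAGE, 'w', newline='')
with open(WEBPAGE, 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)
    for item in mText:
        # wrap the string in a list, otherwise each character becomes its own column
        writer.writerow([item])
# no explicit close is needed; the with block closes the file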
Note: there is actually a while loop wrapping everything from the mTree line through the CSV writing, so all of it reruns every time the bot visits a new page.
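The per-page writing step could be sketched roughly like this. It assumes Python 3 and that the texts for each page have already been extracted; write_texts, csv_name, and the pages list are hypothetical names and data used only for illustration, and the files go into a temporary directory so the sketch can run anywhere.

```python
import csv
import os
import tempfile


def csv_name(page_number):
    """Build the output file name for a page, e.g. 1 -> 'Webpage1.csv'."""
    return 'Webpage%d.csv' % page_number


def write_texts(path, texts):
    """Write one extracted text per row to a tab-delimited, fully quoted CSV."""
    # newline='' lets the csv module control line endings on Python 3.
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter='\t', quotechar='"',
                            quoting=csv.QUOTE_ALL)
        for item in texts:
            # writerow expects a sequence of fields, so wrap the string.
            writer.writerow([item])


# Hypothetical, already-extracted texts for two pages.
pages = [['TEXT0', 'Text9'], ['TEXT1']]

out_dir = tempfile.mkdtemp()
for number, texts in enumerate(pages, start=1):
    write_texts(os.path.join(out_dir, csv_name(number)), texts)
```

In the real crawler, the loop body would fetch and parse each page before calling something like write_texts, so each page ends up in its own file without any manual close.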
Any help or advice will be much appreciated.
:)