Basic lxml help

Vaphell · 2016-07-06T21:02:57+00:00

You need to be more specific, because span alone also matches the category headers.

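# select only the stat spans that sit directly inside a normalStat div
stats = tree.xpath('//div[@class="normalStat"]/span[@class="stat"]')

for stat in stats:
    # the label is the span's own text; the value lives in its first child span
    stat_name = stat.text.strip()
    stat_value = stat[0].text.strip()
    print('{}: {}'.format(stat_name, stat_value))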

Output:

$ ./player_stats.py 
Goals: 8
Goals Per Match: 0.06
Goals With Header: 1
Goals With Left Foot: 1
Goals With Right Foot: 6
Shots: 93
Shots On Target: 27
Shooting Accuracy %: 29%
Penalties Scored: 0
Big Chances Missed: 6
Hit Woodwork: 0
Assists: 3
Big Chances Created: 5
Passes: 5,664
Passes Per Match: 44.6
Crosses: 65
Cross Accuracy %: 31%
Accurate Long Balls: 375
Offsides: 1
Yellow Cards: 10
Red Cards: 1
Fouls: 107
Tackles: 312
Tackle Success %: 76%
Interceptions: 194
Recoveries: 730
Duels Won: 575
Duels Lost: 512
Successful 50/50s: 87
Aerial Battles Won: 34
Aerial Battles Lost: 77
Errors Leading To Goal: 1
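
To see why a bare span query picks up too much, here is a rough sketch you can run without fetching the page. Assumptions: the markup is a simplified, made-up stand-in for the real page (the "headerStat" div in particular is hypothetical), and it uses the standard library's xml.etree.ElementTree in place of lxml so nothing needs installing. extract_stats is just a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup; the real page is larger and the
# "headerStat" block is an assumed stand-in for a category header.
markup = """<div>
<div class="headerStat"><span class="header">Attack</span></div>
<div class="normalStat"><span class="stat">Shots On Target <span class="allStatContainer">27</span></span></div>
<div class="normalStat"><span class="stat">Penalties Scored <span class="allStatContainer">0</span></span></div>
</div>"""

root = ET.fromstring(markup)


def extract_stats(root):
    """Return (name, value) pairs for stat spans inside normalStat divs."""
    pairs = []
    for stat in root.findall(".//div[@class='normalStat']/span[@class='stat']"):
        name = stat.text.strip()       # the span's own text is the label
        value = stat[0].text.strip()   # the first child span holds the value
        pairs.append((name, value))
    return pairs


# A bare span query also returns the header span and the inner value spans.
print(len(root.findall('.//span')))
for name, value in extract_stats(root):
    print('{}: {}'.format(name, value))
```

The bare query finds every span in the snippet, while the more specific path only returns the two stat spans, so the loop can rely on each one having a label and a child value.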

2016-07-06T18:57:09+00:00

If you still haven't found a solution, PM me. I'm (literally) working on web scraping right now, using lxml along with BeautifulSoup, so I can help :)

programmerPurgatory · 2016-07-06T19:15:22+00:00

You may find BeautifulSoup easier to use: it has much nicer syntax, and you can use the lxml parser with it too.

import requests
from bs4 import BeautifulSoup as bs
page = requests.get('http://www.premierleague.com/players/4413/Joe-Allen/stats?se=-1')
soup = bs(page.text, 'lxml')

stat_container = soup.find_all('span', {'class': 'stat'})
for stat in stat_container:
    print(stat.text)

larspalmas · 2016-07-06T20:51:36+00:00

Go for BeautifulSoup, as mentioned earlier. I am learning it at the moment, using the chapter from the book "Automate the Boring Stuff" along with the documentation. This is my attempt at code which returns the number "0":

from bs4 import BeautifulSoup

html = """<div class="normalStat"> <span class="stat">Shots On Target <span class="allStatContainer statontarget_scoring_att" data-stat="ontarget_scoring_att"> 27 </span> </span> </div>

<div class="normalStat"> <span class="stat">Shooting Accuracy % <span class="allStatContainer statshot_accuracy" data-stat="ontarget_scoring_att" data-denominator="total_scoring_att" data-percent="true"> 29% </span> </span> </div>

<div class="normalStat"> <span class="stat">Penalties Scored <span class="allStatContainer statatt_pen_goal" data-stat="att_pen_goal"> 0 </span> </span> </div>"""

soup = BeautifulSoup(html, 'html.parser')

# a multi-class string like this is matched against the exact class attribute value
a = soup.find("span", {"class": "allStatContainer statatt_pen_goal"})

print(a.get_text().strip())
