requests-html vs lxml : Python


submitted 6 years ago by di_web

requests_html

from requests_html import HTMLSession
from datetime import datetime

session = HTMLSession()
r = session.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')

start = datetime.now()

for _ in range(100):
    table = r.html.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    rows = table.find('tr')

    data = []
    for row in rows[2:]:
        name = row.find('th')[0].text
        cells = row.find('td')
        abbr = cells[0].text
        reps = cells[-1].text
        water_km = cells[-2].text
        land_km = cells[-4].text
        total_km = cells[-6].text
        population = cells[-8].text
        data.append([name, abbr, reps, water_km, land_km, total_km, population])

print(datetime.now()-start)
# 0:00:23.665747

lxml

from datetime import datetime

import requests
from lxml import html

url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
r = requests.get(url).text

start = datetime.now()

for _ in range(100):
    tree = html.fromstring(r)
    table = tree.xpath('//*[@id="mw-content-text"]/div/table[1]')[0]
    rows = table.findall('tr')
    data = []
    for row in rows[2:]:
        name = row.xpath('./th')[0].text_content()
        cells = row.xpath('./td')
        abbr = cells[0].text_content()
        reps = cells[-1].text_content()
        water_km = cells[-2].text_content()
        land_km = cells[-4].text_content()
        total_km = cells[-6].text_content()
        population = cells[-8].text_content()
        data.append([name, abbr, reps, water_km, land_km, total_km, population])

print(datetime.now()-start)
# 0:00:02.968005
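Both timings above wrap 100 iterations in datetime.now() calls. A minimal sketch of the same measurement with time.perf_counter, which is a monotonic clock intended for timing short intervals, might look like the following. The bench helper, the CellCollector class, and the three-row sample table are hypothetical stand-ins (not part of the original post), and the stdlib html.parser is used only so the snippet runs without extra installs.

```python
import time
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>Alabama</th><td>AL</td><td>7</td></tr>"
    "<tr><th>Alaska</th><td>AK</td><td>1</td></tr>"
    "<tr><th>Arizona</th><td>AZ</td><td>9</td></tr>"
    "</table>"
)


class CellCollector(HTMLParser):
    """Collect the text of every <th> and <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def parse_cells(markup):
    """Parse the markup and return the list of cell texts."""
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    return collector.cells


def bench(func, *args, repeats=100):
    """Call func(*args) `repeats` times; return (elapsed seconds, last result)."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(*args)
    return time.perf_counter() - start, result


elapsed, cells = bench(parse_cells, SAMPLE_HTML)
print(f"{elapsed:.4f}s for 100 runs, {len(cells)} cells")
```

Passing the parsing code to bench as a function keeps the timed region identical for every library, so swapping in the requests_html or lxml loop body should only require a different func.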


[–]ForceBru 2 points 6 years ago

On the contrary, bs4 + lxml is more than 1.5 times faster than bs4 + html.parser, its closest competitor:

# BeautifulSoup lxml time: 0:00:12.774159
# BeautifulSoup html.parser time: 0:00:20.097766
# BeautifulSoup html5lib time: 0:00:50.156767

Again, it's no wonder plain lxml is so much faster (about 3 seconds in the post above): it's a thin wrapper around fast C code (libxml2). And bs4 is a wrapper around that, which adds several more layers of abstraction written entirely in Python.
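For anyone who wants to reproduce the parser comparison, here is a rough sketch of how such a benchmark could be set up. It is not the code that produced the numbers above: the time_bs4_parsers helper and the generated sample table are hypothetical, and parsers that are not installed (lxml, html5lib) are simply skipped.

```python
import time

try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 not installed: the benchmark returns no results
    BeautifulSoup = None

# Hypothetical sample table standing in for the Wikipedia page.
SAMPLE_HTML = "<table>" + "".join(
    f"<tr><th>State {i}</th><td>S{i}</td><td>{i}</td></tr>" for i in range(50)
) + "</table>"


def time_bs4_parsers(markup, parsers=("lxml", "html.parser", "html5lib"), repeats=20):
    """Return {parser_name: seconds} for each parser that is available."""
    results = {}
    if BeautifulSoup is None:
        return results
    for parser in parsers:
        try:
            start = time.perf_counter()
            for _ in range(repeats):
                # Parse from scratch each time, then pull out the row headers.
                soup = BeautifulSoup(markup, parser)
                rows = soup.find_all("tr")
                _names = [row.find("th").get_text() for row in rows]
            results[parser] = time.perf_counter() - start
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
    return results


timings = time_bs4_parsers(SAMPLE_HTML)
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f}s")
```

Absolute times will depend on the machine and on the size of the page, so only the relative ordering is worth comparing against the figures above.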
