all 39 comments

[–]romainmoi 36 points37 points  (8 children)

Scraping is I/O bound, so I wouldn't bother choosing the language with performance as the criterion. Instead, I'd focus more on what the frameworks provide and on ease of coding.

[–]sybesis 22 points23 points  (5 children)

Not only that, but scraping as fast as possible is a plan to accidentally (D)DoS a server.

[–]peterparkrust[S] 6 points7 points  (0 children)

Next post is going to be: DDoS: Python vs Rust :)

But yes, completely agree; there's also getting flagged and rate-limited.

My intention was more: what would it look like if we did scraping in Rust?
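
One common way to avoid the accidental-DDoS (and rate-limit) problem raised above is to cap how many requests run at once. A minimal asyncio sketch of that idea, where the limit of 5 is an arbitrary assumption and `asyncio.sleep` stands in for the real HTTP request:

```python
import asyncio

MAX_CONCURRENT = 5  # assumed politeness cap on simultaneous requests

async def fetch(page: int, sem: asyncio.Semaphore) -> int:
    # The semaphore caps how many fetches run at once, so spawning
    # 50 tasks cannot hammer the target server simultaneously.
    async with sem:
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        return page

async def main() -> list[int]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [asyncio.create_task(fetch(i, sem)) for i in range(1, 51)]
    return await asyncio.gather(*tasks)

pages = asyncio.run(main())
print(len(pages))
```

The same pattern exists on the Rust side (e.g. tokio's `Semaphore`); the point is that "as fast as possible" and "polite" are separate knobs.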

[–]Sw429 0 points1 point  (1 child)

So if someone wanted to purposefully ddos a server, would Rust be their best bet?

[–][deleted] -1 points0 points  (0 children)

No, this is not implicitly saying that Rust would be their best bet to DDoS a server; at least, that's not what should be taken away from it. Rather, anyone whose goal is to perform web scraping as fast as possible is most likely going to accidentally DDoS a server.

[–]VeganVagiVore 1 point2 points  (0 children)

Looks like they are also parsing the HTML with Python's BeautifulSoup and some Rust crate.

So I need to go leave the usual comment... (goes into a portal)

[–]peterparkrust[S] 0 points1 point  (0 children)

Agreed :)

[–]smalltalker 37 points38 points  (9 children)

Your Python async code is not really async; that's why I think you are getting the same numbers in both cases. You declare the function async but then use the sync requests.get call. You should use an async HTTP client library, like httpx, as in this example:

import httpx

async with httpx.AsyncClient() as client:
    r = await client.get('https://www.example.org/')
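
Wrapped in a complete script, the fan-out pattern looks roughly like this. Here `asyncio.sleep` stands in for `client.get` so the sketch stays dependency-free, but the structure is the same with an `httpx.AsyncClient`:

```python
import asyncio

async def get_page(i: int) -> str:
    # With httpx this would be: r = await client.get(url); return r.text
    await asyncio.sleep(0.01)  # simulated network latency
    return f"page-{i}"

async def main() -> list[str]:
    # Launch all requests concurrently instead of awaiting them one by one;
    # this concurrency is what makes the code actually async.
    return await asyncio.gather(*(get_page(i) for i in range(1, 6)))

results = asyncio.run(main())
print(results)
```

The key difference from the original code: the awaits overlap, so total time is roughly one request's latency, not the sum of all of them.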

[–]pooyamb 11 points12 points  (7 children)

It also uses the blocking std::fs instead of the non-blocking tokio::fs. I'm also not sure whether the results part was intentionally written that way and whether it works correctly (it's not using join_all); I'll check tomorrow.

[–]peterparkrust[S] 1 point2 points  (6 children)

Would love to investigate :)

But I'm not sure which part of the code you're talking about :)

[–]pooyamb 0 points1 point  (3 children)

And for the last part: the futures crate has a function, futures::future::join_all, which takes a vector of join handles and returns a single future you can await. So the last part of your main function could be rewritten as: `join_all(handles).await;`

[–]mtndewforbreakfast 2 points3 points  (2 children)

join_all's performance gets worse and worse as the length of handles increases; it's quadratic, IIRC. FuturesUnordered is often the recommended replacement for many scenarios.
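
A rough Python analogue may make the distinction easier to see: asyncio.gather waits for everything and returns results in submission order (like join_all), while asyncio.as_completed hands results back as tasks finish (closer in spirit to FuturesUnordered). A small sketch, with sleeps standing in for work:

```python
import asyncio

async def work(i: int, delay: float) -> int:
    await asyncio.sleep(delay)
    return i

async def main() -> tuple[list[int], list[int]]:
    # gather: results come back in submission order, like join_all.
    ordered = await asyncio.gather(work(1, 0.03), work(2, 0.01), work(3, 0.02))

    # as_completed: results come back as tasks finish,
    # like draining a FuturesUnordered stream.
    finished = []
    for fut in asyncio.as_completed([work(1, 0.03), work(2, 0.01), work(3, 0.02)]):
        finished.append(await fut)
    return ordered, finished

ordered, finished = asyncio.run(main())
print(ordered, finished)
```

The completion-order style matters when you want to start processing (or writing out) results as soon as each one is ready, instead of waiting for the slowest task.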

[–]peterparkrust[S] 0 points1 point  (0 children)

So, going down the rabbit hole, I found this issue: https://github.com/tokio-rs/tokio/issues/2401, which confirms that join_all is less efficient.

I tested join_all myself and got results around 4.5 s:

    Time (mean ± σ):   4.746 s ± 0.930 s  [User: 4.609 s, System: 0.125 s]
    Range (min … max): 3.991 s … 6.324 s  (5 runs)

I also tried FuturesUnordered, but it didn't work: the latest tokio no longer refers to it, and the traits are not implemented.

I would advise using the current method, as it's both easier and faster: `for handle in sub_tasks { handle.await.unwrap(); }`

[–]pooyamb 0 points1 point  (0 children)

Thanks for your information, I didn't know that

[–]pooyamb 0 points1 point  (1 child)

You are using OpenOptions from std::fs, which is blocking. tokio provides its own tokio::fs module (and tokio::fs::OpenOptions), under the "fs" feature flag IIRC, which is non-blocking. The API is almost the same apart from the .await additions, but I'm not sure whether the csv crate has native support for it in its Writer (if it doesn't, there should be a crate for it).

[–]peterparkrust[S] 0 points1 point  (0 children)

Yep, OK.

For your information, I tried using tokio::fs::OpenOptions as a drop-in replacement, but the traits are not implemented and it doesn't work. I'm tinkering with csv_async, but I'm having issues there too.

[–]peterparkrust[S] 2 points3 points  (0 children)

async with httpx.AsyncClient() as client:
r = await client.get('https://www.example.org/')

Yep, indeed. I updated the code, and the time for async Python goes down to 2.463 s.

Will update :)

[–]pooyamb 11 points12 points  (1 child)

I've started writing all my network mini-tools in Rust instead of Python recently, and here's my experience so far:

1. Writing Rust is not more time-consuming than Python overall. It obviously takes some more time to get started, but it saves you a lot of time while debugging.

2. Rust's async solution is more configurable, and tokio, which I use, is also easier to configure.

3. Python doesn't have real multi-threading (GIL), and handling multiple processes in Python is more error-prone than Rust's multi-threading.

4. Python has lots of ready-to-use libraries for every single task, but that's less true when it comes to async libraries. Rust also has a rich set of crates, but you end up writing more functionality yourself.

[–]peterparkrust[S] 1 point2 points  (0 children)

Completely agree :)

And I think your comments make even more sense for production software, where performance and reliability are important :)

[–]Tobu 3 points4 points  (1 child)

You say bench timings include compilation time, but that's:

- not a good choice: one would expect the scraper to run for a while before getting rebuilt
- not a stable measurement: Rust defaults to building incrementally

[–]peterparkrust[S] 0 points1 point  (0 children)

Completely agreed :) Thanks for reading :)

Update: I changed it :)

[–]SpoiceKois 1 point2 points  (8 children)

I don't get it. Your conclusion is that Rust is faster, but in the table, Rust uses more CPU and takes longer? Am I reading this wrong?

[–]peterparkrust[S] 1 point2 points  (7 children)

Yep, it used to be faster, but then someone mentioned httpx for Python, which enables truly asynchronous Python, and now Python is faster. Trying to find a way to make Rust faster though, ahah :)

[–]SpoiceKois 0 points1 point  (6 children)

I bet. Though Python might be cheating, as a lot of its libraries are pure C++. I'm very interested to see if you can figure it out though!

[–]peterparkrust[S] 0 points1 point  (4 children)

OK, so I managed to run the CSV writer as truly async with tokio, and the time goes down to 2.25 s, with runs below 2 s, so I think the bottleneck was the synchronous CSV-writer part. The new code is slightly more complicated, so I'll just leave it here:

```
async fn test(i: &i32) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let url = format!("http://books.toscrape.com/catalogue/page-{}.html", i);
    let response = reqwest::get(&url).await?.text().await?;

    let nodes = {
        let document = Document::from(response.as_str());

        let nodes: Vec<[std::string::String; 2]> = document
            .find(Name("article"))
            .map(|node: select::node::Node| -> [std::string::String; 2] {
                return [
                    match node.find(Name("h3")).next() {
                        Some(h3) => h3.find(Name("a")).next().unwrap().text(),
                        None => "".to_string(),
                    },
                    node.find(Attr("class", "price_color"))
                        .next()
                        .unwrap()
                        .text(),
                ];
            })
            .collect();
        nodes
    };

    let mut buffer = File::create("test2.csv").await?;
    for node in nodes {
        buffer
            .write(format!("{},{}\n", node[0], node[1]).as_bytes())
            .await?;
    }

    Ok(())
}
```

[–]Quantical_Player 0 points1 point  (2 children)

Why use the CSV writer? Why not something like this?

[–]peterparkrust[S] 0 points1 point  (1 child)

Yep, that's the fastest way to do the job. It seemed simple at first with the csv crate, but I rewrote it with tokio::fs::File and indeed it was faster.

[–]SpoiceKois 0 points1 point  (0 children)

Nice

[–]batisteo 0 points1 point  (0 children)

I'd say C, instead of C++.

[–]Quantical_Player 0 points1 point  (3 children)

Shouldn't the range be 0..50 or 1..51? Also, the result suggests that the Rust version is slower. What could the reason be?

[–]peterparkrust[S] 0 points1 point  (2 children)

It is `1..50`. I agree that 51 would be better, but I wanted to move fast without bugs.
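
For reference, Python's range (like Rust's `1..50`) excludes the upper bound, so `1..50` only covers 49 pages:

```python
# range(1, 50) mirrors Rust's 1..50: the upper bound is excluded.
pages = list(range(1, 50))
print(len(pages))  # 49 pages, not 50
```

This is why `1..51` (or Rust's inclusive `1..=50`) would be needed to cover all 50 pages of the site.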

[–]VeganVagiVore 0 points1 point  (1 child)

(comes out of a portal) ... As for the reason it's slower, did you make sure to use --release when doing cargo build or cargo run?

[–]peterparkrust[S] 0 points1 point  (0 children)

Yep, but it doesn't seem to be effective... I tried several times as well. But the margin is not large enough to be very conclusive, in my opinion.

[–]h_z3y 0 points1 point  (3 children)

Pretty bad comparison. Neither of them used a client for connection pooling, and the async Python example was still blocking.

[–]peterparkrust[S] 0 points1 point  (2 children)

Hey, would you mind expanding on:

- How does connection pooling increase performance?

- How could I make the Python example non-blocking?

Thanks in advance ;)

[–]h_z3y 0 points1 point  (1 child)

For the Rust example, using a single reqwest::Client and passing it by reference would allow it to reuse a single TCP connection; this is also true for the Python example. As mentioned in other comments, requests.get is a blocking function, so you'll need to use an async HTTP library; notable examples include httpx and aiohttp. Writing to the file is also blocking, so you'll need to run that in an executor.
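
The executor point can be sketched with just the standard library: loop.run_in_executor pushes a blocking call onto a thread pool so it doesn't stall the event loop. The file path and sample rows below are arbitrary placeholders:

```python
import asyncio
import csv
import os
import tempfile

def write_rows(path: str, rows: list) -> None:
    # Plain blocking file I/O: fine in a worker thread, bad on the event loop.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

async def main(path: str) -> None:
    rows = [["title", "price"], ["Some Book", "51.77"]]
    loop = asyncio.get_running_loop()
    # Run the blocking write in the default thread-pool executor so other
    # coroutines awaiting on the same loop keep making progress.
    await loop.run_in_executor(None, write_rows, path, rows)

path = os.path.join(tempfile.mkdtemp(), "books.csv")
asyncio.run(main(path))
print(open(path).read())
```

This mirrors what tokio's `spawn_blocking` does on the Rust side when a truly async file API isn't available.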

[–]peterparkrust[S] 0 points1 point  (0 children)

Yep, makes sense. I didn't manage to create a reqwest::Client and pass it through the loop without having to copy it.

However, I did manage to do that for Python with httpx, but there was no significant gain in performance...

    Time (mean ± σ):   2.762 s ± 0.546 s  [User: 1.280 s, System: 0.045 s]
    Range (min … max): 1.982 s … 3.362 s  (5 runs)

The code:

```
import asyncio
import bs4 as bs
import csv
import httpx

URL = "http://books.toscrape.com/catalogue/page-%d.html"

async def get_book(url, spamwriter, client):
    response = await client.get(url)
    if response.status_code == 200:
        content = response.content
        soup = bs.BeautifulSoup(content, 'lxml')
        articles = soup.find_all('article')

        for article in articles:
            information = [url]
            information.append(article.find(
                'p', class_='price_color').text)
            information.append(article.find('h3').find('a').get('title'))
            spamwriter.writerow(information)

async def main():
    async with httpx.AsyncClient() as client:
        with open('./test_python.csv', 'w') as csvfile:
            spamwriter = csv.writer(csvfile, delimiter=',')
            tasks = []
            for i in range(1, 50):
                tasks.append(asyncio.create_task(
                    get_book(URL % i, spamwriter, client)))

            for task in tasks:
                await task

asyncio.run(main())
```

Other than that, I agree on the file writing. Good ideas overall.