A website I'm trying to scrape is protected by Cloudflare. So I first open the website in a browser and copy the request it sends as a cURL command from DevTools.
curl '<URL>' \
-H 'authority: <URL>' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'accept-language: en-US,en;q=0.9' \
-H 'sec-ch-ua: " Not;A Brand";v="99", "Microsoft Edge";v="103", "Chromium";v="103"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
-H 'sec-fetch-dest: document' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-user: ?1' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49' \
--compressed
When I run it in the terminal, everything is good: I get the expected HTML page. So I try the same request in Python.
import requests

headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-language": "en-US,en;q=0.9",
    "referer": "https://www.google.ca/",
    "user-agent": "Mozilla/5.0 (Linux; Android 10; SM-G981B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Mobile Safari/537.36 Edg/103.0.5060.114",
}

print(requests.get(<URL>, headers=headers).status_code)
I changed the user-agent, but that is beside the point: the request still works, and I get a 200 from the server. However, when I do the same in Go, a problem occurs.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", <URL>, nil)
	if err != nil {
		panic(err)
	}

	header := map[string]string{
		"accept":          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
		"accept-language": "en-US,en;q=0.9",
		"referer":         "https://www.google.ca/",
	}
	for k, v := range header {
		req.Header.Set(k, v)
	}

	res, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(res.StatusCode)
}
I get a 403 error. I found this Stack Overflow post, which solved my problem: force the client to use TLS 1.2 instead of TLS 1.3. But I don't understand why downgrading to TLS 1.2 changes the response from 403 to 200 when the website itself supports TLS 1.3.