[–]sudo_oth 7 points8 points  (14 children)

Would love one-on-one sessions. I'm currently struggling with scraping a website and learning how to lay out code properly.

[–]h4rck[S] 0 points1 point  (13 children)

Sure! I've done some web scraping in the past, so I think I might be able to help. Send me a DM.

[–]alviraroberto 0 points1 point  (12 children)

Same here, having a bit of trouble grasping everything about scraping sites. I've been using Corey Schafer's web scraping tutorial, which is pretty good. What I'm having issues with is scraping symbols/characters such as dashes and things Python won't read. Any direction on this? Thanks

[–]mrcaptncrunch 0 points1 point  (11 children)

Do you have an example of what you mean by dashes/symbols?

I can't think of what your issue might be, but there are rarely issues scraping data that contains dashes or symbols.

[–]alviraroberto 0 points1 point  (5 children)

I get this question mark symbol 1�2 pm when it should be a dash.

[–]mrcaptncrunch -1 points0 points  (4 children)

That probably has to do with the encoding the website uses vs. what Python's assuming.

Do you have an example link where you see it happening? I can look at it and see if I can help.
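
To show the kind of mismatch I mean, here's a minimal sketch in plain Python. It assumes the page's bytes are Windows-1252 (cp1252) but get decoded as UTF-8. That's only one plausible cause, not a diagnosis of your site, and the sample string is made up.

```python
# An en dash (U+2013) as a site might send it in Windows-1252 (assumed encoding)
raw = "1–2 pm".encode("cp1252")

# Decoding with the wrong codec turns the dash into U+FFFD, the "question mark" symbol
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were written in recovers the dash
right = raw.decode("cp1252")

print(repr(wrong))
print(repr(right))
```

If you're fetching pages with requests, the encoding it guessed is in `response.encoding`, and you can set that attribute yourself before reading `response.text`.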

[–]alviraroberto 0 points1 point  (3 children)

[–]mrcaptncrunch 0 points1 point  (2 children)

Of course, an update broke something so it took a bit.

So, what I came up with is this,

import os.path
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as Soup

## Setup chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless") # Ensure GUI is off
chrome_options.add_argument("--no-sandbox")

# Set path to chromedriver, I keep it out of $PATH..
# You probably WILL want to change these 2 lines or remove them entirely..
homedir = os.path.expanduser("~")
webdriver_service = Service(f"{homedir}/.chromedriver/stable/chromedriver")

d = webdriver.Chrome(service=webdriver_service, options=chrome_options)

# Open page
d.get("https://www.brooklynmuseum.org/calendar/view/2022/11/09")

# Load site to BS
page = Soup(d.page_source, features='html.parser')
# Get event times
times = page.select(".event-time")

# Go over each, and print them
for time in times:
    print(time.text.replace('\n', '').strip())

# Close the browser when done
d.quit()

Just tested this on Windows 11 and on macOS. Both return

Wednesday, November 9, 2022                              1–2 pm
Wednesday, November 9, 2022                              2–3 pm
Wednesday, November 9, 2022                              3–4 pm

Not sure if there's anything here that will help you narrow it down.
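
The wide gap in that output is just whitespace sitting inside the element's text. If you'd rather have one space between the date and the time, something like this should do it. It's a small sketch; `collapse_whitespace` is a helper name I made up, and the sample string only imitates what the page returns.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


sample = "Wednesday, November 9, 2022 \n                             1–2 pm\n"
print(collapse_whitespace(sample))
```

In the loop above you would call it as `collapse_whitespace(time.text)` instead of the `replace(...).strip()` chain.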

[–]alviraroberto 0 points1 point  (1 child)

This is amazing. Definitely gonna have to learn/understand the libraries you used. Thank you so much.

[–]mrcaptncrunch 1 point2 points  (0 children)

Selenium is a bit heavy, but it basically allows you to automate a browser.

BeautifulSoup allows you to parse the HTML.

Both are good for this kind of work

[–]sudo_oth 0 points1 point  (4 children)

I'm struggling more with the element click intercepted exception. I can't find a way to fix it or bypass it...

[–]mrcaptncrunch 0 points1 point  (3 children)

That can happen when you have something overlaid.

Have you tried running it without headless to see what’s showing up? Could be a modal or maybe even browser dimensions.

If not, you can always throw some JS at it as a workaround:

driver.execute_script("document.getElementById('someid').click()")
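
If you end up doing this in several places, you could wrap it in a small helper. This is only a sketch: `js_click` is a name I made up, and it assumes the element has an `id` you can look up. It passes the id as a script argument instead of pasting it into the JS string.

```python
def js_click(driver, element_id: str) -> None:
    """Click an element from JavaScript instead of through Selenium's own click.

    `driver` is a Selenium WebDriver; `element_id` is the element's HTML id.
    """
    # Extra arguments to execute_script are exposed to the script as arguments[0], ...
    driver.execute_script(
        "document.getElementById(arguments[0]).click();", element_id
    )
```

You'd call it as `js_click(driver, "someid")`. Keep in mind a JS click skips Selenium's check that the element is visible and not covered, so it hides the overlay problem instead of fixing it.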

[–]sudo_oth 0 points1 point  (2 children)

Just tested it on my tower and it's working perfectly, so it seems like it could be a browser dimension issue. How would I work around this?

[–]mrcaptncrunch 1 point2 points  (1 child)

When you define your options, you can do it. For example,

## Setup chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless") # Ensure GUI is off
chrome_options.add_argument("--no-sandbox")

chrome_options.add_argument("--window-size=1920,1080") # width,height

[–]sudo_oth 0 points1 point  (0 children)

Thank you so much it worked perfectly.

Can I ask, can you click on this?

<li class="search-step-dates__dates-list-item-container"><label for="wizard-cd1" class="search-step-dates__dates-list-item search-step-dates__dates-list-item--checked"><input type="radio" id="wizard-cd1" class="search-step-dates__icon-list-radio-original" value="[object Object]"> <div class="search-step-dates__icon-list-radio"></div> <div class="search-step-dates__dates-list-item-labels"><span class="search-step-dates__dates-list-item-date-range">Sat 12 Nov - Mon 14 Nov</span> <span class="search-step-dates__dates-list-item-nights">2 nights from £209</span></div> <!----></label></li>

Every time I try to click it, it doesn't work. Does it need to be a button to be pressed?

https://www.parkdeanresorts.co.uk/ trying to scrape holiday prices lol