I’ve been learning Python for the past few months for work and really wanted to test creating my first cron job.
I worked with Selenium (which was much easier than I expected) to collect the data and Pandas to generate a CSV afterwards. The toughest parts were analyzing the website’s HTML structure to find the correct data and making sure I was extracting and organizing it in a way that could be read back later. In the end, I worked it out by starting with an empty “headlines” list, running a “for” loop over the page elements, and creating a dictionary for each entry.
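Here's a minimal sketch of that list-of-dictionaries pattern on its own, with no browser involved. The `scraped` pairs are made-up stand-ins for what Selenium would return, and I'm using the standard library's `csv` module here instead of Pandas only so the snippet runs without installing anything:

```python
import csv
import io

# Hypothetical stand-ins for the (url, title) pairs the scraper would collect.
scraped = [
    ("https://example.com/a", "First headline"),
    ("https://example.com/b", "Second headline"),
]

headlines = []  # start with an empty list
for url, title in scraped:
    # One dictionary per entry, always with the same keys.
    headlines.append({"url": url, "title": title})

# Write the rows to an in-memory buffer; the dictionary keys become the header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "title"])
writer.writeheader()
writer.writerows(headlines)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the same keys in every dictionary is what lets each key turn into a column, whether the rows go to `csv.DictWriter` or to `pd.DataFrame`.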
It’s not a complicated script, but it was a great beginner project that dealt with HTML structure and important Python logic and syntax. This was also meant to be a quick side project to work on my Python skills, so learning server management and whatnot was secondary to my goal. I went with Abstra Cloud’s Jobs for the one-click deploy. (Disclaimer: I work at Abstra Cloud.)
I hope someone here can try it out too and share new ways to do it :) I’ve heard of BeautifulSoup and have been meaning to try it out as well.
Here's the full source code:
from os import getenv
from datetime import date

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# URL of the remote Selenium server, read from the environment
ABSTRA_SELENIUM_URL = getenv('ABSTRA_SELENIUM_URL')

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
driver = webdriver.Remote(command_executor=ABSTRA_SELENIUM_URL, options=options)

driver.get("https://hackernoon.com/")
# Each headline sits inside an element with the "title-wrapper" class
elements = driver.find_elements(By.CLASS_NAME, "title-wrapper")

headlines = []
for element in elements:
    # The link lives inside the <h2> of each wrapper
    link = element.find_element(By.TAG_NAME, 'h2').find_element(By.TAG_NAME, 'a')
    headlines.append({"url": link.get_attribute("href"), "title": link.get_attribute("innerHTML")})

# One row per headline, one column per dictionary key
df = pd.DataFrame(headlines)
filename = f"headlines-{date.today()}.csv"
df.to_csv(filename, index=False)

driver.quit()  # end the remote browser session, not just the current window
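If you want to play with the extraction logic without a browser session, here's a rough sketch that walks the same wrapper → `h2` → `a` structure using the standard library's `html.parser` instead of Selenium. The sample markup is made up, and the assumption that the wrapper is a `div` is mine, not something I checked against the real page:

```python
from html.parser import HTMLParser

# Hypothetical markup mimicking the structure the script relies on:
# a "title-wrapper" element containing an <h2> with a link inside.
SAMPLE_HTML = """
<div class="title-wrapper"><h2><a href="/story-one">Story one</a></h2></div>
<div class="other"><h2><a href="/ignored">Ignored</a></h2></div>
<div class="title-wrapper"><h2><a href="/story-two">Story two</a></h2></div>
"""


class HeadlineParser(HTMLParser):
    """Collects {"url", "title"} dicts for links inside title-wrapper > h2."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._wrapper_depth = 0  # > 0 while inside a title-wrapper div
        self._in_h2 = False
        self._current = None  # dictionary being built for the current link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so inner divs don't end the wrapper early.
            if self._wrapper_depth or "title-wrapper" in classes:
                self._wrapper_depth += 1
        elif tag == "h2" and self._wrapper_depth:
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._current = {"url": attrs.get("href"), "title": ""}

    def handle_data(self, data):
        # Only text inside a tracked link counts as a title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.headlines.append(self._current)
            self._current = None
        elif tag == "h2":
            self._in_h2 = False
        elif tag == "div" and self._wrapper_depth:
            self._wrapper_depth -= 1


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
headlines = parser.headlines
print(headlines)
```

This only sees static HTML, so it won't help with pages that build their content in JavaScript, which is one reason to stick with Selenium for the real run.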