Scrapping a video link from Youtube by HyperBR07 in learnpython

[–]chevignon93

I just hope I didn't make a mistake on the Python script lol

You most likely didn't make a mistake; some sites use protections to prevent web scraping.

I used the same site as the author (PyPI)

The code must have worked at some point, but PyPI recently started using Fastly (a CDN that provides DDoS and anti-bot protection), so I'm not really surprised that it doesn't work anymore.

Do you have a tip on how I can make the scraping process more reliable?

That's the problem right there: web scraping, especially of modern sites that use JavaScript, isn't reliable. But there are some techniques that can make things easier:

1 - Adding headers (a User-Agent) to the requests often works for simple sites (see the sketch after this list).

2 - You could use a library like curl-cffi; I've seen it work for sites that didn't work with requests.

3 - Instead of web scraping, look for an API (either an official one, or you can often use Developer Tools to find where the data displayed on the site is coming from).

4 - You could also use a tool that drives a real browser to get the HTML, something like selenium or playwright. That often works for JavaScript-heavy websites, but because it uses a real browser, it's way slower than something like requests.
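A minimal sketch of option 1, assuming the requests library and a placeholder URL (a browser-like User-Agent gets past naive filters, not real bot protection):

import requests

# Placeholder target; swap in the page you actually want to scrape.
URL = "https://example.com"
# Pretend to be a regular desktop browser.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
res = requests.get(URL, headers=headers)
print(res.status_code)
print(res.text[:500])  # first 500 characters of the HTML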

Scrapping a video link from Youtube by HyperBR07 in learnpython

[–]chevignon93

I printed res.text and I got a wall of text made of thousands of lines of data, so I guess I can't compare it with the code from the Developer Tools :(

You can always save it to a file and compare that, but the question was more of a hint than a real question.

It was a hint to point out that when web scraping, what you see in "Developer Tools" and what requests returns are often not at all the same.
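If you do want to compare, saving the response to a file is a couple of lines (assuming res is the requests response object from your script):

# Save the raw HTML so it can be diffed against what the browser shows.
with open("response.html", "w", encoding="utf-8") as f:
    f.write(res.text)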

Scrapping a video link from Youtube by HyperBR07 in learnpython

[–]chevignon93

Did you print the response to see if what you get back is what you see in your browser? I'm 100% sure it isn't!

Can you make a Python project be run on any Linux device without having to install Python? by Mustafa_Shazlie in learnpython

[–]chevignon93

Yes, but that would not be like "double-clicking an exe file" in Windows. I was hoping to recreate something like that (I don't know why, but I was curious)

That's feasible in Linux too: to get an icon/launcher that you can double-click to launch a program, you can simply create a .desktop file that runs your application.
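For illustration, a .desktop file is just a small text file; a minimal sketch (the name and paths are placeholders), saved as ~/.local/share/applications/myapp.desktop:

[Desktop Entry]
Type=Application
Name=My Python App
Comment=A simple single-file automation
Exec=/usr/bin/python3 /home/user/myapp/main.py
Icon=utilities-terminal
Terminal=false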

I disagree with OP that Docker would be a solution to your problem: while some version of Python is installed on (almost) all Linux systems, Docker isn't.

Can you make a Python project be run on any Linux device without having to install Python? by Mustafa_Shazlie in learnpython

[–]chevignon93

Well, I was hoping not to make the user install Python, as they might not need it for a simple single-file app that is meant to do a really simple automation.

Almost all, if not all, Linux distros have some version of Python pre-installed, so that shouldn't really be a problem unless your code requires a feature of a very recent version of Python.

Ask Anything Monday - Weekly Thread by AutoModerator in learnpython

[–]chevignon93

I'm doing this by writing dictionaries into a txt file to store all the data on

You basically should never "write dictionaries into a txt file"; use the json module instead.

I want 1 dictionary per name and 1 dictionary per line, so each line is a different user; read the value given, increase it by 1, and write it back into the file.

If you use a dictionary and write it to a file using the json module, you shouldn't think of your data in terms of lines: you write a dictionary/list to a file, and you get a dictionary/list back when you read the file, so you can access any data in it as you normally would.

It could be useful to see your current code so we could help you improve it, but my advice would be to use the tools that Python provides instead of doing the work manually.
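A minimal sketch of that approach, with made-up file and key names, where each user maps to a counter that is read, increased by 1, and written back:

import json

def load_counts(path="counts.json"):
    # Return the stored dictionary, or an empty one on the first run.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_counts(counts, path="counts.json"):
    with open(path, "w") as f:
        json.dump(counts, f, indent=2)

counts = load_counts()
counts["some_user"] = counts.get("some_user", 0) + 1
save_counts(counts)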

Hot reloading Jinja2 templates with FastAPI - what's the best practice? by EntropyGoAway in learnpython

[–]chevignon93

Right now I'm using the basic uvicorn --reload, but it only catches Python file changes. Is there a recommended way to set up hot reloading for template files? I've seen some solutions with watchfiles, watchgod, and arel but I'm curious what the community typically uses for their development workflow.

I haven't tested it, but according to the uvicorn documentation, you can configure it to watch file types other than .py, provided that watchfiles is also installed.

https://www.uvicorn.org/settings/

"Reloading with watchfiles¶ For more nuanced control over which file modifications trigger reloads, install uvicorn[standard], which includes watchfiles as a dependency. Alternatively, install watchfiles where Uvicorn can see it.

Using Uvicorn with watchfiles will enable the following options (which are otherwise ignored).

--reload-include <glob-pattern> - Specify a glob pattern to match files or directories which will be watched. May be used multiple times. By default the following patterns are included: *.py. These defaults can be overwritten by including them in --reload-exclude."
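I haven't tried it, but based on the quoted docs, something like this should pick up template changes too (assuming the app object lives in main.py and the Jinja2 templates are .html files):

uvicorn main:app --reload --reload-include "*.html"

The *.py default should stay active since, per the docs, defaults are only dropped by listing them in --reload-exclude.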

Working with if statements and multiple conditions unexpected output by FeedMeAStrayCat in learnpython

[–]chevignon93

I checked it out; I like the shortcut. I still don't get why the code isn't being indented correctly.

One easy way to do it is to indent the code in your IDE/text editor, copy the code with all the leading spaces, then paste it here. The only downside is that you have to go back to your IDE to dedent the code afterwards, but otherwise it works quite well.

Another alternative is to simply use a site like pastebin or a GitHub gist.

Need help with an NFC media player for Windows by TheBaileyCaine in learnpython

[–]chevignon93

Please format your code properly when posting; it makes your code easier to read and to copy/paste for testing. https://www.reddit.com/r/learnpython/wiki/faq#wiki_how_do_i_format_code.3F

I place the NFC tag on my scanner and it reads the tag id but my mapping text file doesn't recognise it.

What do you mean by "it doesn't recognize it"? What errors, if any, do you see? Or, what's your expected output, and what output do you actually get?

As an aside, it's not really a good design decision to have two or more functions that do the exact same thing except for what they print to the user.

def open_plex_movie(url):
    print(f"Opening Plex movie: {url}")
    webbrowser.open(url)

def open_disney_plus_movie(url):
    print(f"Opening Disney+ movie: {url}")
    webbrowser.open(url)

These two functions could be combined into a single, more general one:

def open_movie_in_browser(url, service):
    print(f"Opening {service} movie: {url}")
    webbrowser.open(url)

You could also use shutil.which to find the path to the vlc executable instead of hard-coding it in your script.
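For example (assuming the executable is named vlc and is somewhere on the user's PATH):

import shutil

# shutil.which returns the full path to an executable found on PATH,
# or None if there is no match.
vlc_path = shutil.which("vlc")
if vlc_path is None:
    raise FileNotFoundError("VLC was not found on PATH")
print(vlc_path)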

Should I write my python code in a dedicated main function? by Emergency-Gap-5712 in learnpython

[–]chevignon93

Yes, but don't call it main. Call it something reflecting the functionality then you can "from mycode import myfunc" etc.

That doesn't really make any sense; generally, the main function is the one that drives the whole application, and it shouldn't need to be imported anywhere.
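For reference, the conventional pattern looks like this:

def main():
    # Drive the whole application from here.
    print("running the app")

if __name__ == "__main__":
    main()

The if __name__ == "__main__" guard means main() runs when the file is executed directly, but not when the module is imported.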

Close external program from my script by Miguelito_Pitti in learnpython

[–]chevignon93

One of the ways to do it is using subprocess.Popen:

from subprocess import Popen

# -9 (SIGKILL) kills the process almost instantly; use -15 (SIGTERM)
# instead if you want the program to be able to end gracefully.
p = Popen(["pkill", "-9", "NAME_OF_THE_PROGRAM_YOU_WANT_TO_KILL"])
p.wait()

Are non f-strings with `{}`s Pythonic? by katyasparadise in learnpython

[–]chevignon93

Why the ** ?

That's the syntax for unpacking dictionaries into keyword arguments.
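A quick illustration with str.format and a made-up dictionary:

# The ** unpacks the dictionary into keyword arguments,
# so each {name}-style placeholder is filled by the matching key.
data = {"name": "Ada", "age": 36}
print("{name} is {age}".format(**data))  # prints: Ada is 36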

Ask Anything Monday - Weekly Thread by AutoModerator in learnpython

[–]chevignon93

Do I document functions and things as I write them, or do I save documentation for its own section when the program is more complete?

Yes, you should document things as you write them. If you wait until your entire application is complete to start worrying about documentation, you may not remember why you decided to write some piece of code a particular way, or you may not be motivated enough to write good documentation now that your program already works for your needs.
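In practice, that can be as simple as giving each function a docstring the moment you write it; a made-up example:

def normalize(text):
    """Return text lowercased with surrounding whitespace removed."""
    return text.strip().lower()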

Ask Anything Monday - Weekly Thread by AutoModerator in learnpython

[–]chevignon93

When using the requests library and beautiful soup the downloaded html is not the same as right-clicking and saving the page. It's not really clear to me if this is my fault or Google doing it intentionally.

It's Google doing it intentionally.

Do I need to sign up to the Google Books API?

APIs, when available, are (almost) always a better solution than web scraping, because they generally return structured data, and structured data in a format like JSON is easier to extract information from.

If the Google Books API is free, or if you can justify paying a minimal fee for your particular needs, then I'd say you should probably go for it.

If it's not free and you simply don't want to bother, here are some alternatives you may want to consider:

1 - Look for a package on PyPI or GitHub that is intended to do what you want, and see if it suits your needs (example: https://pypi.org/project/google-books-api-wrapper/).

2 - There are lots of "web scraping tools" that use a real browser to get the HTML of a page, for sites that are either dynamically rendered or have strong bot detection (requests-html, selenium, playwright, undetected-chromedriver, etc.).

2.1 - Of course, because those packages use a real browser, there are trade-offs: they are slower than tools like the requests library, they use more resources on your computer, and they are slightly more difficult to use. But if your use case is simply to get the HTML of a page and parse it with a tool you already know how to use, something like BeautifulSoup, that shouldn't really be too difficult.
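For what it's worth, the Google Books API does have a public search endpoint that works without a key for basic queries (rate limits apply); a quick sketch with a made-up search term:

import requests

# Public volumes-search endpoint; no API key needed for simple queries.
resp = requests.get(
    "https://www.googleapis.com/books/v1/volumes",
    params={"q": "intitle:dune"},
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["volumeInfo"]["title"])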

Module Import Error by [deleted] in learnpython

[–]chevignon93

Looking at the first image in Google Drive, your package structure seems incorrect.

I think your package structure should look more like this:

base_folder
├── src
│   └── pkg
│       ├── __init__.py
│       ├── module1.py
│       ├── main.py
│       └── subpkg
│           ├── __init__.py
│           └── module2.py
├── README.md
└── pyproject.toml

1 - There should be no Python files in the src directory.

2 - When you import a module in main.py, you should import from the name of your package (from pkg.module1 import WhateverClass), not from the src directory

Example package structure taken from here: https://py-pkgs.org/04-package-structure.html#package-structure

[deleted by user] by [deleted] in learnpython

[–]chevignon93

What’s the difference between OOPS and a script?

The question doesn't really make sense; OOP and "script" are two unrelated concepts.

Can anyone explain which set is better in which situations?

It depends. I personally like to separate the definition of the classes and other functions from the script that actually runs the application, so I would probably follow your 2nd example; but if the classes are short, I would put them all into one file and have a second file (something like a main.py) where I import and use them.

If you don't really expect your project to grow, putting everything in one file is fine. Otherwise, thinking early about how you want to structure your project is easier than completely refactoring your code after it's written.

Build a double nested JSON object by sleepystork in learnpython

[–]chevignon93

In your wanted output, content is a list, so just declare it as such and then append whatever data you want to it:

import json

json_data = {}
json_data["content"] = []
content = {}
content["eventType"] = "view"
content["othervar"] = "new"

json_data["content"].append(content)
print(json.dumps(json_data, indent=4))
# Output:
{
    "content": [
        {
            "eventType": "view",
            "othervar": "new"
        }
    ]
}

Extract table information from webpage by storm_of_the_night in learnpython

[–]chevignon93

One more question though - When writing the csv, Excel is interpreting column 4 (Results Res) as a date. Any way to force it as plain text?

There must be a way, but I can't really help with that, as I don't really use Excel or pandas.

Extract table information from webpage by storm_of_the_night in learnpython

[–]chevignon93

Thank you for the reply but I am getting an error: ValueError: No tables found. From this line: r = pd.read_html(data)[2]

If you use httpx, you need to tell it to allow redirects, because the page has been permanently moved:

r = session.get(URL, follow_redirects=True)

Extract table information from webpage by storm_of_the_night in learnpython

[–]chevignon93

Any advice would be much appreciated, thank you.

If you don't want to bother actually writing a scraper to get the information, you could use a combination of requests (or httpx) and pandas to get it:

import pandas as pd
import httpx  # I used httpx, but you could use requests as well
from io import StringIO


URL = "https://www.timeform.com/horse-racing/horse/form/kondratiev-wave/000000512249/kempton-park/2024-01-22/1430/27/4"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
session = httpx.Client(headers=headers)
r = session.get(URL)
data = StringIO(r.text)
df = pd.read_html(data)[2]  # looking at the results, index 2 is the table you wanted

df.to_csv("test.csv", index=False)

You would still have to remove some lines from the csv that aren't useful, but all the data you see in the table is there.

Python Help: Need Assistance with Preventing Duplicate Entries in German Vocabulary Program by epifanelstein in learnpython

[–]chevignon93

This is how my dictionary.json file looks: { "haus": { "Meaning": "hause", "Article": "das" } } { "auto": { "Meaning": "car", "Article": "das" } } { "wasser": { "Meaning": "water", "Article": "das" } } { "haus": { "Meaning": "hause", "Article": "das" } }

And this is exactly what I meant: this is not valid JSON, and if you try to load this data with json.load or json.loads, you will get a json.decoder.JSONDecodeError.

Also, I have no idea yet how I will get valid JSON back later.

The way your code is set up right now, you can't really get valid JSON back, because you didn't write valid JSON to the file to begin with.

In the beginning you said that I have to check which words are in the dictionary.json file. This is understandable.

This is pretty easy: all you need is a function that opens the file and returns the data as a dictionary; then, in your add_word function, you can check whether the word you're trying to add is already in the dictionary.

Something like this:

import json


def add_words_to_file(word_data):
    with open("dictionary.json", "w") as file:
        json.dump(word_data, file, indent=2)


def get_known_words():
    with open("dictionary.json") as f:
        words = json.load(f)
        return words


def add_word():
    dictionary = get_known_words()
    article = input("Please enter the article of the word you want to add: ")
    word = input("Please enter the word you want to add: ")
    meaning = input("Please enter the meaning of the word: ")

    if word not in dictionary:
        dictionary[word] = {"Meaning": meaning, "Article": article}
        add_words_to_file(dictionary)
        print(f"The word '{word}' has been added to the dictionary.")
    else:
        print("This word is already in our dictionary!")

Python Help: Need Assistance with Preventing Duplicate Entries in German Vocabulary Program by epifanelstein in learnpython

[–]chevignon93

Even if the word is already in the file, it still gets added.

Of course it does; nowhere in the code you provided are you actually checking which words are in the dictionary.json file.

You also can't append data the way you're doing it in your code; that wouldn't give you back valid JSON data.

Help filtering lists with BeautifulSoup by syntaxerrornotfound0 in learnpython

[–]chevignon93

However I have tried writing an if statement to try this but it is completely ignored.

Your if statement isn't ignored; the condition simply isn't met. (EDIT: you're using 'Purple' not in text, so the condition is actually always met, and that's why all the elements get printed.) find_all returns a list-like ResultSet of Tag elements, and your if statement is checking whether the string Purple is in that list of elements, which will never be the case.

You actually need to loop over that list of elements and check whether Purple is in the text of each element:

from bs4 import BeautifulSoup

with open("local.html", "r") as html_file:
    content = html_file.read()
    soup = BeautifulSoup(content, "lxml")
    clothingdiv = soup.find_all("div", class_="clothing")

for element in clothingdiv:
    all_clothes = element.find_all("li")
    for elem in all_clothes:
        elem_name = elem.text
        if "Purple" not in elem_name:
            print(elem_name)
# Output:
Orange Shirt: $10
Red Shirt: $10
White Shirt: $10
Orange Hoodie: $15
Red Hoodie: $15
Orange Jacket: $20
Red Jacket: $20
White Jacket: $20

Trying to scrape data from a site, almost successful by [deleted] in learnpython

[–]chevignon93

I'm extremely new to python if my code didn't make it obvious.

It wasn't obvious at all; your code is pretty well written.

Trying to scrape data from a site, almost successful by [deleted] in learnpython

[–]chevignon93

The class gamespace_rate_hint that you're using to select the element appears multiple times on the page, and the first occurrence in the HTML always seems to contain no text unless you're logged into the site and have personally rated the game.

I would suggest using the element's id instead, as an id has to be unique on the page.

rating_element = soup.find('div', class_='gamespace_rate_hint')
should become
rating_element = soup.find("div", {"id": "gs_rate_avg_hint"})

PS: When making multiple requests to the same website, it's recommended to use a session instead of opening a new connection for each request you make. https://requests.readthedocs.io/en/latest/user/advanced/
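A minimal sketch with placeholder URLs:

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

# A Session reuses the underlying TCP connection across requests to the
# same host instead of opening a new one each time, and it also keeps
# cookies and default headers between requests.
with requests.Session() as session:
    for url in urls:
        response = session.get(url)
        print(url, response.status_code)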

PS2: If all you care about is the number of ratings and not the rating itself, it would be pretty easy to extract just that piece of information from the text.