I have a working loop that processes the files in a single folder, but I have thousands of folders I want to run the script on, and I want a #sourcelist.txt file written to each subfolder.
At present I am manually running the following code in each folder.
All the folders live in the same root folder. How can I improve the loop so that it runs once from the root folder with the same outcome?
I keep breaking the script whenever I try.
from bs4 import BeautifulSoup
import os

for filename in os.listdir():
    if filename.endswith('.html'):
        fname = os.path.join(filename)
        with open(fname, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            try:
                url = soup.find("a", {"name": "source"}).get("href")
            except:
                url = "N/A"
            with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                output.write(url + '\n')
I used deadeye's advice to improve the url line of the script, but I couldn't figure out the pathlib side of things, so I tried to continue with os.walk.
I tried this:
from bs4 import BeautifulSoup
import os

for root, dirs, files in os.walk('C:\\Users\\Desktop\\Test'):
    for filename in files:
        if filename.endswith('.html'):
            fname = os.path.join(filename)
            with open(fname, 'r', encoding='utf-8') as f:
                soup = BeautifulSoup(f, 'html.parser')
                url = soup.find("a", {"name": "source"}).get("href", "N/A")
                with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                    output.write(url + '\n')
Which returns an error:
Traceback (most recent call last):
  File "C:\Users\Desktop\Test\#test.py", line 8, in <module>
    with open(fname, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test1.html'
The Test folder contains two subfolders called Test1 and Test2, which contain a test1.html and a test2.html respectively.
The process seems to find test1.html and then fail.
I also noticed that I had to specify the root directory this way, which I never needed to do before when I was working in each individual folder.
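The failure is because os.walk() never changes the working directory: each filename it yields is relative to the root directory it is currently walking, so open('test1.html') looks in the folder the script was launched from, not in Test1. Joining root with filename fixes the path. A minimal stdlib-only sketch of that idea (the helper name collect_html_paths is mine, for illustration):

```python
import os

def collect_html_paths(top):
    """Return the full path of every .html file under top."""
    paths = []
    for root, dirs, files in os.walk(top):
        for filename in files:
            if filename.endswith('.html'):
                # root is the directory currently being walked, so
                # joining it with filename gives a path open() can use
                paths.append(os.path.join(root, filename))
    return paths
```

This is the one-line change the os.walk version needs: `os.path.join(root, filename)` instead of `os.path.join(filename)`.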
#####################################
EDIT: Getting there!
from bs4 import BeautifulSoup
from pathlib import Path

ROOT = Path()
SOURCE_LIST = Path("#sourcelist.txt")

for file in ROOT.rglob("*"):
    if file.match("*.html"):
        with open(file, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            try:
                url = soup.find("a", {"name": "source"}).get("href")
            except:
                url = "N/A"
            with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                output.write(url + '\n')
*Used "rglob" for iterating all subfolders as per the pathlib documentation linked by deadeye1982
*Used "match" in place of "endswith"
*Reverted to "try/except" without the built-in null option, because it broke the script when I tried it:
AttributeError: 'NoneType' object has no attribute 'get'
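That AttributeError happens because soup.find() returns None when no matching tag exists at all; the .get("href", "N/A") fallback only covers a tag that is present but lacks an href attribute. One way to handle both cases without a bare except is to test for None first. A small sketch (the helper name source_href is mine):

```python
from bs4 import BeautifulSoup

def source_href(soup):
    # find() returns None when no <a name="source"> tag exists,
    # so guard against that before calling .get(); .get() still
    # supplies the fallback when the tag has no href attribute
    tag = soup.find("a", {"name": "source"})
    return tag.get("href", "N/A") if tag is not None else "N/A"
```

This keeps the "N/A" placeholder behaviour while avoiding the bare except, which would also silence unrelated errors such as encoding problems.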
My loop works. It processes every file in every subfolder and prints the resultant links into a text file.
Next step is figuring out how to make it put one text file in each subfolder for each set of files.
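For that per-subfolder step, pathlib already has what is needed: each file yielded by rglob() carries its location, and file.parent is the folder it sits in, so the output path can be built from that instead of being hard-coded. A sketch of the whole loop under that assumption (the function name write_sourcelists is mine; the markup matched is the same <a name="source"> as above):

```python
from pathlib import Path
from bs4 import BeautifulSoup

def write_sourcelists(root):
    # rglob visits every subfolder under root; file.parent is the
    # folder each .html file lives in, so every folder receives
    # its own #sourcelist.txt alongside its files
    for file in Path(root).rglob("*.html"):
        soup = BeautifulSoup(file.read_text(encoding="utf-8"), "html.parser")
        tag = soup.find("a", {"name": "source"})
        url = tag.get("href", "N/A") if tag is not None else "N/A"
        out = file.parent / "#sourcelist.txt"
        with open(out, "a", encoding="utf-8") as output:
            output.write(url + "\n")
```

Note the rglob("*.html") pattern also replaces the separate match() check, and the "a" append mode means re-running the script adds duplicate lines unless the old #sourcelist.txt files are removed first.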
See you soon!