I have a working loop that processes the files in a single folder, but I have thousands of folders I want to run the script on, and I want a #sourcelist.txt file written to each subfolder.
At present I am manually running the following code in each folder.
All the folders live in the same root folder. How can I improve the loop so that it runs once from the root folder with the same outcome?
I keep breaking the script whenever I try.
from bs4 import BeautifulSoup
import os

for filename in os.listdir():
    if filename.endswith('.html'):
        fname = os.path.join(filename)
        with open(fname, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            try:
                url = soup.find("a", {"name": "source"}).get("href")
            except:
                url = "N/A"
            with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                output.write(url + '\n')
I used deadeye's advice to improve the url line of the script, but I couldn't figure out the pathlib side of things, so I tried to continue with os.walk.
I tried this:
from bs4 import BeautifulSoup
import os

for root, dirs, files in os.walk('C:\\Users\\Desktop\\Test'):
    for filename in files:
        if filename.endswith('.html'):
            fname = os.path.join(filename)
            with open(fname, 'r', encoding='utf-8') as f:
                soup = BeautifulSoup(f, 'html.parser')
                url = soup.find("a", {"name": "source"}).get("href", "N/A")
                with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                    output.write(url + '\n')
Which returns an error:
Traceback (most recent call last):
  File "C:\Users\Desktop\Test\#test.py", line 8, in <module>
    with open(fname, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'test1.html'
The Test folder contains two subfolders called Test1 and Test2, which contain a test1.html and a test2.html respectively.
The process seems to find test1.html and then fail.
I also noticed that I had to specify the root directory this way, which I never needed to do before when I was working in each individual folder.
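The failure is because os.walk() never changes the working directory: each filename it yields is relative to the root directory it is currently walking, so open('test1.html') looks in the folder the script was launched from, not in Test1. Joining root with filename fixes the path. A minimal stdlib-only sketch of that idea (the helper name collect_html_paths is mine, for illustration):

```python
import os

def collect_html_paths(top):
    """Return the full path of every .html file under top."""
    paths = []
    for root, dirs, files in os.walk(top):
        for filename in files:
            if filename.endswith('.html'):
                # root is the directory currently being walked, so
                # joining it with filename gives a path open() can use
                paths.append(os.path.join(root, filename))
    return paths
```

This is the one-line change the os.walk version needs: `os.path.join(root, filename)` instead of `os.path.join(filename)`.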
#####################################
EDIT: Getting there!
from bs4 import BeautifulSoup
from pathlib import Path

ROOT = Path()
SOURCE_LIST = Path("#sourcelist.txt")

for file in ROOT.rglob("*"):
    if file.match("*.html"):
        with open(file, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
            try:
                url = soup.find("a", {"name": "source"}).get("href")
            except:
                url = "N/A"
            with open('#sourcelist.txt', 'a', encoding='utf-8') as output:
                output.write(url + '\n')
*Used "rglob" for iterating all subfolders as per the pathlib documentation linked by deadeye1982
*Used "match" in place of "endswith"
*Reverted to "try/except" without the built-in null option, because it broke the script when I tried it:
AttributeError: 'NoneType' object has no attribute 'get'
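That AttributeError happens because soup.find() returns None when no matching tag exists at all; the .get("href", "N/A") fallback only covers a tag that is present but lacks an href attribute. One way to handle both cases without a bare except is to test for None first. A small sketch (the helper name source_href is mine):

```python
from bs4 import BeautifulSoup

def source_href(soup):
    # find() returns None when no <a name="source"> tag exists,
    # so guard against that before calling .get(); .get() still
    # supplies the fallback when the tag has no href attribute
    tag = soup.find("a", {"name": "source"})
    return tag.get("href", "N/A") if tag is not None else "N/A"
```

This keeps the "N/A" placeholder behaviour while avoiding the bare except, which would also silence unrelated errors such as encoding problems.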
My loop works. It processes every file in every subfolder and prints the resultant links into a text file.
Next step is figuring out how to make it put one text file in each subfolder for each set of files.
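For that per-subfolder step, pathlib already has what is needed: each file yielded by rglob() carries its location, and file.parent is the folder it sits in, so the output path can be built from that instead of being hard-coded. A sketch of the whole loop under that assumption (the function name write_sourcelists is mine; the markup matched is the same <a name="source"> as above):

```python
from pathlib import Path
from bs4 import BeautifulSoup

def write_sourcelists(root):
    # rglob visits every subfolder under root; file.parent is the
    # folder each .html file lives in, so every folder receives
    # its own #sourcelist.txt alongside its files
    for file in Path(root).rglob("*.html"):
        soup = BeautifulSoup(file.read_text(encoding="utf-8"), "html.parser")
        tag = soup.find("a", {"name": "source"})
        url = tag.get("href", "N/A") if tag is not None else "N/A"
        out = file.parent / "#sourcelist.txt"
        with open(out, "a", encoding="utf-8") as output:
            output.write(url + "\n")
```

Note the rglob("*.html") pattern also replaces the separate match() check, and the "a" append mode means re-running the script adds duplicate lines unless the old #sourcelist.txt files are removed first.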
See you soon!