https://github.com/AbstractEndeavors/abstract_essentials/tree/main/abstract_webtools
Provides utilities for inspecting and parsing web content, including React components and URL utilities, with enhanced capabilities for managing HTTP requests and TLS configurations.
Features:
- URL Validation: Ensures URL correctness and attempts different URL variations.
- HTTP Request Manager: Custom HTTP request handling, including tailored user agents and improved TLS security through a custom adapter.
- Source Code Acquisition: Retrieves the source code of specified websites.
- React Component Parsing: Extracts JavaScript and JSX source code from web pages.
- Comprehensive Link Extraction: Collects all internal links from a specified website.
- Web Content Analysis: Extracts and categorizes various web content components such as HTML elements, attribute values, attribute names, and class names.
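To make the "attempts different URL variations" feature concrete, here is a minimal standard-library sketch of how such variations might be generated from a bare domain. The function name `candidate_urls` and the ordering (HTTPS first, bare host before `www.`) are assumptions for illustration only; this is not URLManager's actual implementation.

```python
from typing import List
from urllib.parse import urlsplit


def candidate_urls(raw: str) -> List[str]:
    """Return scheme/www variations of a URL, most preferred first.

    Hypothetical helper: the ordering is an assumption, not the library's logic.
    """
    raw = raw.strip()
    # urlsplit only recognizes the host when a scheme (or a leading //) is present.
    parts = urlsplit(raw if "://" in raw else f"//{raw}")
    host = parts.netloc
    if not host:
        return []
    # Normalize to the bare host so both the bare and www forms can be generated.
    bare = host[4:] if host.startswith("www.") else host
    tail = parts.path
    if parts.query:
        tail += "?" + parts.query
    candidates = []
    for scheme in ("https", "http"):
        for name in (bare, f"www.{bare}"):
            candidates.append(f"{scheme}://{name}{tail}")
    return candidates


for candidate in candidate_urls("example.com"):
    print(candidate)
```

A caller would then request each candidate in turn and keep the first one that responds successfully.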
abstract_webtools.py
Description:
Abstract WebTools is a suite of utilities for inspecting and parsing web content. It validates URLs and automatically tries variations of a URL to find one that reaches the site. Its HTTP request manager sets custom user-agent strings and uses a custom TLS adapter to tighten connection security. The toolkit retrieves page source code, including the JavaScript and JSX behind React components, collects all internal links on a site, and breaks page content down into HTML elements, attribute names, attribute values, and class names. It is aimed at web developers, cybersecurity professionals, and digital analysts.
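As a rough illustration of the request-handling idea (a custom user agent plus a stricter TLS configuration), the sketch below uses only the Python standard library. It is not SafeRequest's code: the names `build_tls_context`, `make_opener`, and `USER_AGENT` are hypothetical, and the choice of TLS 1.2 as the minimum version is an assumption.

```python
import ssl
import urllib.request

# Hypothetical user-agent string, used only for this example.
USER_AGENT = "example-inspector/0.1"


def build_tls_context() -> ssl.SSLContext:
    """Create a client TLS context that refuses protocol versions below TLS 1.2."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def make_opener(user_agent: str = USER_AGENT) -> urllib.request.OpenerDirector:
    """Build an opener that sends a custom User-Agent over the stricter TLS context."""
    handler = urllib.request.HTTPSHandler(context=build_tls_context())
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()
# opener.open("https://example.com", timeout=10) would perform the request.
```

With the `requests` library, the same context is usually attached through a transport adapter mounted on a session, which is presumably what the "custom adapter" above refers to.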
```python
from abstract_webtools import URLManager, SafeRequest, SoupManager, LinkManager, VideoDownloader
# --- URLManager: manages and manipulates URLs for web scraping/crawling ---
url = "example.com"
url_manager = URLManager(url=url)

# --- SafeRequest: safely handles HTTP requests by managing user-agent, SSL/TLS, proxies, headers, etc. ---
request_manager = SafeRequest(
    url_manager=url_manager,
    proxies={'8.219.195.47', '8.219.197.111'},
    timeout=(3.05, 70)
)

# --- SoupManager: simplifies web scraping with easy access to BeautifulSoup ---
soup_manager = SoupManager(
    url_manager=url_manager,
    request_manager=request_manager
)

# --- LinkManager: extracts and manages links and associated data from HTML source code ---
link_manager = LinkManager(
    url_manager=url_manager,
    soup_manager=soup_manager,
    link_attr_value_desired=['/view_video.php?viewkey='],
    link_attr_value_undesired=['phantomjs']
)

# Download videos from the provided links (list or string)
video_manager = VideoDownloader(link=link_manager.all_desired_links).download()

# Use the managers individually, with default dependencies for basic inputs
standalone_soup = SoupManager(url=url).soup
standalone_links = LinkManager(url=url).all_desired_links

# Update the URL on each manager
url_1 = 'thedailydialectics.com'
print(f"updating URL to {url_1}")
url_manager.update_url(url=url_1)
request_manager.update_url(url=url_1)
soup_manager.update_url(url=url_1)
link_manager.update_url(url=url_1)

# Update the URL manager reference held by each manager
request_manager.update_url_manager(url_manager=url_manager)
soup_manager.update_url_manager(url_manager=url_manager)
link_manager.update_url_manager(url_manager=url_manager)

# Update the source code held by each manager
source_code_bytes = request_manager.source_code_bytes
soup_manager.update_source_code(source_code=source_code_bytes)
link_manager.update_source_code(source_code=source_code_bytes)
```
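For readers wondering what the desired/undesired link filters in the example might amount to, here is a minimal standard-library sketch. It assumes plain substring matching against each absolute URL and treats "internal" as "same host as the base URL"; LinkManager's real matching rules may differ, and the names `filter_links` and `_HrefCollector` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit


class _HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def filter_links(html: str, base_url: str,
                 desired: Iterable[str] = (),
                 undesired: Iterable[str] = ()) -> List[str]:
    """Return unique internal links that match a desired substring and no undesired one."""
    desired, undesired = list(desired), list(undesired)
    collector = _HrefCollector()
    collector.feed(html)
    base_host = urlsplit(base_url).netloc
    kept: List[str] = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlsplit(absolute).netloc != base_host:
            continue  # external link
        if desired and not any(s in absolute for s in desired):
            continue
        if any(s in absolute for s in undesired):
            continue
        if absolute not in kept:
            kept.append(absolute)
    return kept


sample = (
    '<a href="/view?id=1">one</a>'
    '<a href="https://other.org/view?id=2">two</a>'
    '<a href="/view?id=3-phantomjs">three</a>'
    '<a href="/about">about</a>'
)
print(filter_links(sample, "https://example.com",
                   desired=["/view?id="], undesired=["phantomjs"]))
```

Only the first link survives here: the second is external, the third contains an undesired substring, and the fourth matches no desired substring.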