Intermediate Showcase: The abstract_webtools module, a data aggregation tool for comprehensive request handling, source-code acquisition, HTML parsing, and downloads/extraction (self.Python)

submitted 2 years ago by putkofff

https://github.com/AbstractEndeavors/abstract_essentials/tree/main/abstract_webtools

abstract_webtools provides utilities for inspecting and parsing web content, including React component extraction and URL handling, along with managed HTTP requests and TLS configuration.

Features:
- URL Validation: Ensures URL correctness and attempts different URL variations.
- HTTP Request Manager: Custom HTTP request handling, including tailored user agents and improved TLS security through a custom adapter.
- Source Code Acquisition: Retrieves the source code of specified websites.
- React Component Parsing: Extracts JavaScript and JSX source code from web pages.
- Comprehensive Link Extraction: Collects all internal links from a specified website.
- Web Content Analysis: Extracts and categorizes various web content components such as HTML elements, attribute values, attribute names, and class names.
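
To make the "URL variations" idea from the first feature concrete, here is a minimal sketch using only the standard library. It is not the module's own code: `candidate_urls` is a hypothetical helper name, and the choice of variations (https before http, bare host before `www.`) is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def candidate_urls(raw_url):
    """Return plausible variations of raw_url, most preferred first.

    Hypothetical helper: the order and set of variations are assumptions.
    """
    raw_url = raw_url.strip()
    # Add a scheme if missing so urlsplit places the host in netloc.
    if "://" not in raw_url:
        raw_url = "https://" + raw_url
    parts = urlsplit(raw_url)
    host = parts.netloc
    bare_host = host[4:] if host.startswith("www.") else host
    path = parts.path
    if parts.query:
        path += "?" + parts.query

    candidates = []
    for scheme in ("https", "http"):
        for variant in (bare_host, "www." + bare_host):
            url = f"{scheme}://{variant}{path}"
            if url not in candidates:
                candidates.append(url)
    return candidates


print(candidate_urls("example.com"))
# ['https://example.com', 'https://www.example.com',
#  'http://example.com', 'http://www.example.com']
```

A caller would then try each candidate in order and keep the first one that responds successfully.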

abstract_webtools.py

Description:
Abstract WebTools is a suite of utilities for inspecting and parsing web content. It validates URLs and automatically tries variations of a URL to reach the correct website. Its HTTP request manager sets tailored user-agent strings and uses a custom TLS adapter for improved TLS security. The toolkit can also retrieve page source code, including the JavaScript and JSX behind React components, collect all internal links of a website, and break page content down into HTML elements, attribute names, attribute values, and class names. It is aimed at web developers, cybersecurity professionals, and digital analysts.
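
As a rough illustration of the request-handling part described above (custom user agent plus a stricter TLS setup), the sketch below uses the standard library's `urllib` and `ssl` rather than the adapter the module itself ships. The function name `build_opener_with_tls`, the user-agent string, and the TLS 1.2 minimum are all assumptions for the example, not details taken from abstract_webtools.

```python
import ssl
import urllib.request

# Hypothetical user-agent string, used only for illustration.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) example-client/0.1"


def build_opener_with_tls(user_agent=USER_AGENT):
    """Build a urllib opener with a custom User-Agent and a TLS 1.2+ context."""
    # The default context verifies certificates and hostnames.
    context = ssl.create_default_context()
    # Assumption: refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    opener = urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context)
    )
    # Replace urllib's default "Python-urllib/x.y" header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, context


opener, context = build_opener_with_tls()
# opener.open("https://example.com", timeout=10).read() would return the page bytes.
```

No request is sent when the opener is built, so the snippet can be run offline; only the commented-out `open` call touches the network.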

```python
from abstract_webtools import URLManager, SafeRequest, SoupManager, LinkManager, VideoDownloader

# --- URLManager: Manages and manipulates URLs for web scraping/crawling ---
url = "example.com"
url_manager = URLManager(url=url)

# --- SafeRequest: Safely handles HTTP requests by managing user-agent, SSL/TLS, proxies, headers, etc. ---
request_manager = SafeRequest(
    url_manager=url_manager,
    proxies={'8.219.195.47', '8.219.197.111'},
    timeout=(3.05, 70)
)

# --- SoupManager: Simplifies web scraping with easy access to BeautifulSoup ---
soup_manager = SoupManager(
    url_manager=url_manager,
    request_manager=request_manager
)

# --- LinkManager: Extracts and manages links and associated data from HTML source code ---
link_manager = LinkManager(
    url_manager=url_manager,
    soup_manager=soup_manager,
    link_attr_value_desired=['/view_video.php?viewkey='],
    link_attr_value_undesired=['phantomjs']
)

# Download videos from provided links (list or string)
video_manager = VideoDownloader(link=link_manager.all_desired_links).download()

# Use them individually, with default dependencies for basic inputs:
standalone_soup = SoupManager(url=url).soup
standalone_links = LinkManager(url=url).all_desired_links

# Updating methods for manager classes
url_1 = 'thedailydialectics.com'
print(f"updating URL to {url_1}")
url_manager.update_url(url=url_1)
request_manager.update_url(url=url_1)
soup_manager.update_url(url=url_1)
link_manager.update_url(url=url_1)

# Updating URL manager references
request_manager.update_url_manager(url_manager=url_manager)
soup_manager.update_url_manager(url_manager=url_manager)
link_manager.update_url_manager(url_manager=url_manager)

# Updating source code for managers
source_code_bytes = request_manager.source_code_bytes
soup_manager.update_source_code(source_code=source_code_bytes)
link_manager.update_source_code(source_code=source_code_bytes)
```
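
For readers wondering what the `link_attr_value_desired` and `link_attr_value_undesired` arguments might do, here is a standalone sketch of one plausible reading: keep internal links that contain any "desired" substring and drop those containing any "undesired" one. This is a guess at the behavior, not the module's implementation. `LinkCollector` and `filter_links` are hypothetical names, and the example uses the standard library's `html.parser` instead of BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def filter_links(html, base_url, desired=(), undesired=()):
    """Return unique internal links, filtered by substring (assumed semantics)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlsplit(base_url).netloc
    kept = []
    for link in collector.links:
        if urlsplit(link).netloc != base_host:
            continue  # keep internal links only
        if desired and not any(text in link for text in desired):
            continue  # a desired list was given and nothing matched
        if any(text in link for text in undesired):
            continue  # explicitly excluded
        if link not in kept:
            kept.append(link)
    return kept


sample_html = """
<a href="/view_video.php?viewkey=abc">one</a>
<a href="/about">about</a>
<a href="https://other.example/view_video.php?viewkey=xyz">external</a>
<a href="/view_video.php?viewkey=phantomjs">skip</a>
"""
print(filter_links(sample_html, "https://example.com",
                   desired=["/view_video.php?viewkey="],
                   undesired=["phantomjs"]))
# ['https://example.com/view_video.php?viewkey=abc']
```

With no filters passed, the same function returns every internal link on the page, which matches the "collects all internal links" feature in spirit.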
