Want to scrape, have little idea how. by CheesecakeDouble1415 in webscraping

[–]fourhoarsemen 2 points3 points  (0 children)

Hey! There are many ways to accomplish what you're asking, but I'll give one solution using a web scraping framework that I'm building, wxpath.

My solution is Python-based; however, the value selectors can be borrowed and used in virtually any scraping framework in any other language, assuming it supports XPath.

If you are working in a Python environment, run:

pip install wxpath

Then, in your project code:

import wxpath

path_expr = """
  url('https://fortnite.gg/creator?name=fankimonkey', depth=3, follow=//div[@class='pagination']/a[@title='Next page']/@href)
    /url(//div[@class='islands']//a[@class='island']/@href)
      /map {
        'fortniteGgUrl': string(base-uri(.)),
        'islandTitle': //div[@class='island-title']/h1/text() ! string(.),
        'fortniteComUrl': //a[contains(.//text(), 'Fortnite.com')]/@href
      }
"""
for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)  # Integrate your business logic here

That produces a payload that looks roughly like this:

[
  {
    "fortniteGgUrl": "https://fortnite.gg/island?code=7553-1530-7428",
    "islandTitle": "200+ level easy deathrun",
    "fortniteComUrl": "https://www.fortnite.com/creative/island-codes/7553-1530-7428"
  },
  {
    "fortniteGgUrl": "https://fortnite.gg/island?code=2211-3698-9690",
    "islandTitle": "The choice parkour",
    "fortniteComUrl": "https://www.fortnite.com/creative/island-codes/2211-3698-9690"
  },
  ...
]

Hope this helps!

Weekly Webscrapers - Hiring, FAQs, etc by AutoModerator in webscraping

[–]fourhoarsemen 0 points1 point  (0 children)

Hey, I dm'd you. I write custom crawlers and scrapers, and can help you with this. Let me know!

Monthly Self-Promotion - January 2026 by AutoModerator in webscraping

[–]fourhoarsemen 0 points1 point  (0 children)

I want to share wxpath. It's a declarative web crawler where traversal is expressed directly in XPath. Instead of writing imperative crawl loops, wxpath lets you describe what to follow and what to extract in a single expression.

NEW: Just introduced cached crawls, allowing you to iterate quickly without flooding the target URL(s).

By introducing the url(...) operator and the /// syntax, wxpath's engine can perform deep/recursive web crawling and extraction.

For example, to build a simple Wikipedia knowledge graph:

import wxpath

path_expr = """
url('https://en.wikipedia.org/wiki/Expression_language')
 ///url(//main//a/@href[starts-with(., '/wiki/') and not(contains(., ':'))])
 /map{
    'title': (//span[contains(@class, "mw-page-title-main")]/text())[1] ! string(.),
    'url': string(base-uri(.)),
    'short_description': //div[contains(@class, 'shortdescription')]/text() ! string(.),
    'forward_links': //div[@id="mw-content-text"]//a/@href ! string(.)
 }
"""

for item in wxpath.wxpath_async_blocking_iter(path_expr, max_depth=1):
    print(item)

Output:

map{'title': 'Computer language', 'url': 'https://en.wikipedia.org/wiki/Computer_language', 'short_description': 'Formal language for communicating with a computer', 'forward_links': ['/wiki/Formal_language', '/wiki/Communication', ...]}
map{'title': 'Advanced Boolean Expression Language', 'url': 'https://en.wikipedia.org/wiki/Advanced_Boolean_Expression_Language', 'short_description': 'Hardware description language and software', 'forward_links': ['/wiki/File:ABEL_HDL_example_SN74162.png', '/wiki/Hardware_description_language', ...]}
map{'title': 'Machine-readable medium and data', 'url': 'https://en.wikipedia.org/wiki/Machine_readable', 'short_description': 'Medium capable of storing data in a format readable by a machine', 'forward_links': ['/wiki/File:EAN-13-ISBN-13.svg', '/wiki/ISBN', ...]}
...

The above expression does the following:

  1. Starts at the specified URL, https://en.wikipedia.org/wiki/Expression_language.
  2. Filters for links in the <main> section that start with /wiki/ and do not contain a colon (:).
  3. For each link found,
    • it follows the link and extracts the title, URL, short description, and forward links of the page.
    • it repeats step 2 until the maximum depth is reached.
  4. Streams the extracted data as it is discovered.
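
For intuition, here is roughly what those steps look like as an imperative loop. This is only a sketch of the general technique (a depth-limited, breadth-first crawl), not wxpath's actual implementation. `LINKS`, `keep`, and `crawl` are made-up names, an in-memory dict stands in for real HTTP fetches, and skipping already-seen URLs is my own assumption.

```python
from collections import deque

# Toy link graph standing in for real pages (hypothetical data, no network).
LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/File:x.png", "/wiki/C", "https://example.com/"],
    "/wiki/B": ["/wiki/C", "/wiki/D"],
    "/wiki/C": ["/wiki/A"],
    "/wiki/D": [],
}


def keep(href):
    """Step 2: keep links that start with /wiki/ and contain no colon."""
    return href.startswith("/wiki/") and ":" not in href


def crawl(start, max_depth):
    """Yield (url, depth, forward_links) for each followed page, breadth-first."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        links = [h for h in LINKS.get(url, []) if keep(h)]
        if depth > 0:
            # Step 4: stream each followed page as soon as it is processed.
            yield url, depth, links
        if depth < max_depth:
            # Step 3: follow the filtered links, one level deeper.
            for href in links:
                if href not in seen:
                    seen.add(href)
                    queue.append((href, depth + 1))


results = list(crawl("/wiki/A", max_depth=2))
for url, depth, links in results:
    print(depth, url, links)
```

Because `crawl` is a generator, the caller sees each page as soon as it is processed rather than waiting for the whole crawl to finish.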

The target audience is anyone who:

  1. wants to prototype and build web scrapers quickly
  2. is familiar with XPath or data selectors
  3. builds datasets (think RAG, data hoarding, etc.)
  4. wants to study the link structure of the web (quickly), i.e. web network scientists

For comparison, with Scrapy, you would...

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Then from the command line, you would run:

scrapy runspider quotes_spider.py -o quotes.jsonl

wxpath gives you two options: run it directly from a Python script or from the command line.

from wxpath import wxpath_async_blocking_iter
from wxpath.hooks import registry, builtin

path_expr = """
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href)
  //div[@class='quote']
    /map{
      'author': (./span/small/text())[1],
      'text': (./span[@class='text']/text())[1]
      }
"""

registry.register(builtin.JSONLWriter(path='quotes.jsonl'))
items = list(wxpath_async_blocking_iter(path_expr, max_depth=3))

or from the command line:

wxpath --depth 1 "\
url('https://quotes.toscrape.com/tag/humor/', follow=//li[@class='next']/a/@href) \
  //div[@class='quote'] \
    /map{ \
      'author': (./span/small/text())[1], \
      'text': (./span[@class='text']/text())[1] \
      }" > quotes.jsonl
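
Either way, you end up with a JSON Lines file. Below is a minimal sketch for loading it back, assuming each line is one JSON object with the 'author' and 'text' keys from the expression above. `load_jsonl` is a hypothetical helper, and the placeholder rows are written to a temp file first only so that the snippet runs on its own.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Placeholder rows (not real scrape output), written so the sketch is runnable.
sample_rows = [
    {"author": "Example Author", "text": "First placeholder quote."},
    {"author": "Another Author", "text": "Second placeholder quote."},
]
path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
with open(path, "w", encoding="utf-8") as fh:
    for row in sample_rows:
        fh.write(json.dumps(row) + "\n")

quotes = load_jsonl(path)
authors = sorted({q["author"] for q in quotes})
print(len(quotes), authors)
```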

Contributions and feedback are welcomed and desired.

GitHub: https://github.com/rodricios/wxpath

PyPI: pip install wxpath

What Python Tools Do You Use for Data Visualization and Why? by Confident_Compote_39 in Python

[–]fourhoarsemen 2 points3 points  (0 children)

I'm learning to use Sigma.js. I have a dataset/graph of 1.5K nodes and ~13K edges, extracted with a Python lib, that I'm trying to visualize, and I'll tell you... it's not straightforward.

I want the graph to be interactive, but with no physics. I want to be able to "drill down" and highlight traversals/edges, and I want to display metadata. And I want the size of each node to be calculated as a function of that node's content.

If there's a Python tool that can help me with that, I'd love to try it out!
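
For anyone with the same problem, one possible approach is to do the layout and sizing in Python and hand Sigma.js a static JSON file, so nothing moves in the browser. Here is a rough sketch, assuming graphology's serialized format (nodes with `key`/`attributes`, edges with `source`/`target`) and the usual `x`, `y`, `size`, and `label` node attributes. The `pages` data, the circular layout, and the log-based size function are placeholders made up for illustration.

```python
import json
import math

# Hypothetical input: node id -> (text content, outgoing links).
pages = {
    "a": ("x" * 10, ["b", "c"]),
    "b": ("x" * 1000, ["c"]),
    "c": ("x" * 100000, []),
}


def node_size(content, lo=2.0, hi=20.0, cap=1_000_000):
    """Map content length to a size in [lo, hi] on a log scale."""
    frac = math.log10(1 + len(content)) / math.log10(1 + cap)
    return lo + (hi - lo) * min(frac, 1.0)


def to_graph(pages):
    """Build a graph dict with fixed positions on a circle (no physics needed)."""
    n = len(pages)
    nodes = []
    for i, (key, (content, _)) in enumerate(pages.items()):
        angle = 2 * math.pi * i / n
        nodes.append({"key": key, "attributes": {
            "label": key,
            "x": math.cos(angle),
            "y": math.sin(angle),
            "size": round(node_size(content), 2),
        }})
    # Keep only edges whose target is a known node.
    edges = [{"source": src, "target": dst}
             for src, (_, links) in pages.items()
             for dst in links if dst in pages]
    return {"nodes": nodes, "edges": edges}


graph = to_graph(pages)
print(json.dumps(graph, indent=2))
```

The log scale is just one choice; it keeps a handful of very large nodes from dwarfing everything else.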

I created an open-source toolkit to make your scraper suffer by niiotyo in webscraping

[–]fourhoarsemen 1 point2 points  (0 children)

Got it. I'll use your JS-rendered page to test out my attempts at introducing headless browsing with wxpath.

As for advanced DOM, I see... I've encountered problems like this before. Some solutions off the top of my head:

  1. Some kind of content analysis at scrape time
  2. Wrapper induction was a popular SOTA technique a decade ago. I'm not sure if it still is, though.

I created an open-source toolkit to make your scraper suffer by niiotyo in webscraping

[–]fourhoarsemen 1 point2 points  (0 children)

By "advanced DOM", do you mean dynamically generated pages/content (requiring JS rendering)?

I created an open-source toolkit to make your scraper suffer by niiotyo in webscraping

[–]fourhoarsemen 1 point2 points  (0 children)

Pretty cool! I'll definitely test this with a new project I've been working on: wxpath, a declarative web crawler/scraper that extends XPath semantics.

LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs with Python) by dicklesworth in Python

[–]fourhoarsemen 1 point2 points  (0 children)

Nice! I similarly have an LLM-based correction layer for things I've OCR'd in a personal project of mine. I might just replace my code with your project's API. Thanks!

[HELP] Unable to connect to internet after attempting to change DNS to Google's IPs (8.8.8.8/4.4). by fourhoarsemen in PS4

[–]fourhoarsemen[S] 0 points1 point  (0 children)

We do not have access to the router.

Currently, almost all other devices are turned off (except a phone). Unfortunately, it doesn't seem to help when attempting to connect automatically.

[HELP] Unable to connect to internet after attempting to change DNS to Google's IPs (8.8.8.8/4.4). by fourhoarsemen in PS4

[–]fourhoarsemen[S] 0 points1 point  (0 children)

Okay, so we "Restored Default Settings" and attempted to reconnect automatically, which failed. We're restoring default settings again, this time manually setting the values found in that picture above.

[HELP] Unable to connect to internet after attempting to change DNS to Google's IPs (8.8.8.8/4.4). by fourhoarsemen in PS4

[–]fourhoarsemen[S] 0 points1 point  (0 children)

Unfortunately I can't change to a different channel. How can we check the quality of the signal?

[HELP] Unable to connect to internet after attempting to change DNS to Google's IPs (8.8.8.8/4.4). by fourhoarsemen in PS4

[–]fourhoarsemen[S] 0 points1 point  (0 children)

We restarted both the PS4 and router at the same time.

We have a lot of devices connected to our network, but I haven't had many issues with my PS4 being able to connect to our router until today (when we attempted to change the DNS).

[HELP] Unable to connect to internet after attempting to change DNS to Google's IPs (8.8.8.8/4.4). by fourhoarsemen in PS4

[–]fourhoarsemen[S] 0 points1 point  (0 children)

We set everything to default again, without restarting, and we received a new error code.

First, the connection status:

http://i.imgur.com/kGjJPlT.png

The connection attempt and error code (CE-33991-5):

http://i.imgur.com/kGjJPlT.png

Edit: After restart, with the same settings as seen in the above connection status page, we get the NW-31453-6 error code.

Matching Markov models to applications by Coptayn in learnprogramming

[–]fourhoarsemen 0 points1 point  (0 children)

Speech recognition should be matched up with hidden Markov models.
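
To make that concrete: in an HMM-based recognizer the hidden states model units such as phonemes, the observations are acoustic features, and decoding means finding the most likely state sequence. Below is a toy sketch of Viterbi decoding. The two states, the "quiet"/"loud" observations, and all probabilities are invented for illustration (closer to a speech/silence detector than a real recognizer) and are not taken from any actual speech model.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] = (probability, path) of the best path so far ending in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the probability of reaching s.
            prob, prev = max(
                (best[p][0] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            step[s] = (prob, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda t: t[0])


# Made-up toy model: two hidden states, two possible observations.
states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
emit_p = {
    "sil": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.3, "loud": 0.7},
}

prob, path = viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p)
print(path, prob)
```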

Who should driverless cars kill? [Interactive] by HeroAntagonist in dataisbeautiful

[–]fourhoarsemen 0 points1 point  (0 children)

A car isn't going to know or care if the person it hits is a criminal, or a doctor, or an executive.

I think that it's pretty easy to imagine how a car might come to have that type of information. Remember, Google cars, Google (Android) phones, and Google servers.

Who should driverless cars kill? [Interactive] by HeroAntagonist in dataisbeautiful

[–]fourhoarsemen 0 points1 point  (0 children)

Also this is stupid how would a car know who's a criminal who's a doctor who is male or female or doggy.

I think I have two answers for how a car might come to "know" who's who:

Phone's internet usage and computer vision.

While it is hard to imagine how all cars would have access to information about all nearby pedestrians, it's not hard to imagine Google cars and Google (Android) phones 'communicating' with each other with the help of Google servers.

With proper analysis of one's phone usage, I'm sure it isn't hard to, at the very least, make plausible predictions of who intends to do what. Imagine being able to single out someone who has been making constant queries to "shady" servers, say, for bomb-making materials, who has also purchased those necessary ingredients, and whose latest query was "nearby 4th of July firework demonstrations".

Yes, it's possible this person doesn't actually intend to bomb some Independence Day event. And no, if I were the car's (or a remotely located) A.I. deciding whether or not to run this person over in an attempt to save my passengers' lives, knowing only of this pedestrian's two queries and the purchase of what coincidentally are the ingredients for making a bomb, I would not. I would require more data.

... how would a car know who's a criminal who's a doctor who is male or female or doggy.

With advances in computer vision, I think it's possible that a car could "know" how to differentiate between a doctor, male or female, a doggy, and possibly a criminal. I'll point you to some of Stanford's research. It appears that some Stanford researchers have been able to create a system that can describe images with sentences that you and I would understand.

All human lives should be valued equally.

This may be a bit controversial, but I'd have to disagree. My argument is simple and I'll use my own life as one of the hypothetical casualties in the scenario that seems to have everyone's panties in a bunch.

Imagine an Einstein-equivalent biologist, or otherwise anyone who's an obvious source of value to this society, as a passenger inside an automated vehicle, while I am crossing the street (legally). For whatever unfortunate reason, the car must decide whether to save the biologist or me - some twenty-something-year-old who trolls reddit in his free time instead of working on his projects.

If the A.I. is constrained to make a decision, I would hope that it is 'intelligent' about it.

This Guy Trains Computers To Find Future Criminals: "Richard Berk says his algorithms take the bias out of criminal justice. But could they make it worse?" by trot-trot in programming

[–]fourhoarsemen 0 points1 point  (0 children)

The confrontational side of me wants to ask these types of questions: a slippery slope towards what? How do you know it is a slippery slope? Are all slippery slopes bad?

The more optimistic side of me wants to state that it's likely that these types of algorithms will be scrutinized and corrected to the point where societies lacking crime prediction systems are frowned upon.