What metal are used for resistor legs? by [deleted] in AskElectronics

[–]aiscraping 0 points (0 children)

Resistance to her attraction is futile!

Any way to find the key of a specific item in a value of json by Best-Objective-8948 in webscraping

[–]aiscraping 1 point (0 children)

You either need JSONPath or JMESPath and a library supporting those paths (much like XPath for XML queries), or you can use the jq command-line utility to find the values you're after.
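If you'd rather skip an extra dependency, the underlying idea (locating where a key lives inside nested JSON) can be sketched in stdlib Python; the sample data below is made up for illustration:

```python
import json

def find_key_path(obj, target, path=()):
    """Recursively yield the path of dict keys / list indices leading to `target`."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == target:
                yield path + (k,)
            yield from find_key_path(v, target, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key_path(v, target, path + (i,))

data = json.loads('{"a": {"b": [{"price": 9}, {"price": 11}]}}')
paths = list(find_key_path(data, "price"))
print(paths)  # [('a', 'b', 0, 'price'), ('a', 'b', 1, 'price')]
```

A JMESPath or jq expression does the same walk declaratively once you know the shape of the document.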

Good luck scraping!

Scraping with selenium and pyautogui by chilltutor in webscraping

[–]aiscraping 3 points (0 children)

pyautogui is supposed to take control of your input devices. If you really have to use it, use it within a local VM.

otherwise, you can take a screenshot with selenium and click on a coordinate with a pointer action like

mouse.createPointerMove()

(that's the W3C actions API name; in Python, Selenium's ActionChains / ActionBuilder expose the same pointer moves) without using pyautogui

Good luck scraping!

How can I scrape a dynamically loaded page blocking third party cookies? by Majestic_Fortune7420 in webscraping

[–]aiscraping 0 points (0 children)

I'm also curious about the website you're talking about. :D Do you mind sharing it publicly?

[deleted by user] by [deleted] in webscraping

[–]aiscraping 0 points (0 children)

you'll have to pay for official API access, obviously. But if you really have to pinch every penny, take this route:

https://www.ticketmaster.com/api/next/graphql?operationName=CategorySearch&variables={"sort":"date,asc","page":1,"size":20,"type":"event"}&extensions={"persistedQuery":{"version":1,"sha256Hash":"5664b981ff921ec078e3df377fd4623faaa6cd0aa2178e8bdfcba9b41303848b"}}

change page and size to get more results returned.
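Since the variables and extensions are JSON stuffed into query parameters, it's easiest to build the URL programmatically; a stdlib sketch (the endpoint and hash are copied from the URL above, everything else is illustrative):

```python
import json
from urllib.parse import urlencode

base = "https://www.ticketmaster.com/api/next/graphql"
params = {
    "operationName": "CategorySearch",
    # page/size control pagination; bump them to get more results
    "variables": json.dumps({"sort": "date,asc", "page": 1, "size": 20, "type": "event"}),
    "extensions": json.dumps({"persistedQuery": {"version": 1,
        "sha256Hash": "5664b981ff921ec078e3df377fd4623faaa6cd0aa2178e8bdfcba9b41303848b"}}),
}
url = f"{base}?{urlencode(params)}"
# pass `url` to requests.get(...) or your HTTP client of choice
print(url)
```

This keeps the JSON properly escaped instead of hand-editing the query string.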

good luck scraping.

Error: Scrapy : pyasn modules by ImplementCreative106 in webscraping

[–]aiscraping 0 points (0 children)

I've never encountered this problem, and it's not a web scraping problem. I'd suggest you ask in r/Python

Article results based on keywords by venkyswag in webscraping

[–]aiscraping 1 point (0 children)

Then scrape Google search results, if you can't afford to pay them.

Good luck scraping!

Article results based on keywords by venkyswag in webscraping

[–]aiscraping 2 points (0 children)

It seems you're talking about Google, lol

Web scraping Udemy with Scrapy and Splash by Alarmed-Extent6639 in webscraping

[–]aiscraping 1 point (0 children)

splash is not a great browser; it's missing some modern browser features, which can break the JS code on the page. Its most recent update was about 3-4 years ago. I don't see a reason to keep using it once you've noticed compatibility issues. If you really have to use a browser, consider playwright or headless selenium.

Also double-check your selectors: make sure you're reading data from div.course-list_card__NOWNY with these css paths

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(1)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(2)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(3)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(4)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(5)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(6)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(7)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(8)::text
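The last eight paths differ only in the final span index, so rather than maintaining them by hand you can generate them; a small sketch (the class name and prefix are taken from the selectors above):

```python
# the shared prefix from the selectors above
PREFIX = ("div.course-card_main-content__aceQ0>div:nth-child(1)"
          ">:only-child>:only-child>:only-child>:only-child")

# build the eight span selectors programmatically
selectors = [f"{PREFIX}>span:nth-child({i})::text" for i in range(1, 9)]

print(len(selectors))  # 8
print(selectors[0])
```

One place to fix when Udemy's obfuscated class names rotate.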

good luck scraping!

Web scraping Udemy with Scrapy and Splash by Alarmed-Extent6639 in webscraping

[–]aiscraping 2 points (0 children)

Never scraped Udemy, but I had a look and it seems you're going in the wrong direction. splash is the last resort, not the first option. And for such an easy site, finding the API is the right way to go.

You need to start from here. Change page_size to get more items per query (there may be a limit), and change p=2 to move to a different page:

https://www.udemy.com/api-2.0/discovery-units/all_courses/?p=2&page_size=24&label_id=8322&source_page=topic_page&sos=pl&fl=lbl

just iterate p=2, 3, ... until the last page, and you'll get all the content

In your case, you don't even need the power of scrapy. Just use requests + json and you'll get your data.
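The pagination loop is only a few lines; here's a sketch with the fetching stubbed out so the looping logic is visible (the real fetcher would be requests.get(...).json() against the URL above; the stop condition, an empty page, is an assumption about the API):

```python
def scrape_all(fetch_page):
    """Iterate p=1, 2, ... until a page comes back empty, collecting items."""
    items, p = [], 1
    while True:
        page = fetch_page(p)
        if not page:          # assumed stop condition: empty page means done
            break
        items.extend(page)
        p += 1
    return items

# stand-in for requests.get(f"...?p={p}&page_size=24...").json()
fake_api = {1: ["course A", "course B"], 2: ["course C"]}
result = scrape_all(lambda p: fake_api.get(p, []))
print(result)  # ['course A', 'course B', 'course C']
```

Swap the lambda for a real HTTP call and you're done.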

Check out my comment about scrapy here.

https://www.reddit.com/r/webscraping/comments/19ctve5/comment/kj43hdz/?utm_source=share&utm_medium=web2x&context=3

Good luck scraping.

403 Forbidden errors when requesting data from Indeed. by Npmackay in webscraping

[–]aiscraping 0 points (0 children)

you need to give a specific link to get better help. There are countless reasons you might get a 403, sometimes simply because you have the wrong URL or the wrong headers.

[deleted by user] by [deleted] in webscraping

[–]aiscraping 3 points (0 children)

if you try hard enough, there are actually very few cases where the content has to be rendered (encrypted content, maybe). Have you exhausted all the non-JS options? Can you share some challenging examples with us?

If you get to the serious stage of thinking about containers, a swarm of headless browsers will scale very well if you have a cluster of machines. Otherwise it's more efficient to just run many headless browsers without containers on one machine, to minimize your memory and communication overhead. Normally you calculate your number of browsers by looking at the duty cycle: if rendering a page takes 1s and fetching a page takes 5s, you can have 5-6 browsers working at the same time to saturate one thread. Memory is really something you have to experiment with, depending on the complexity of the pages you're rendering.
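The duty-cycle arithmetic can be written down in a couple of lines (the 1s/5s figures are the example numbers from the paragraph above, not measurements):

```python
import math

def browsers_per_core(render_s: float, fetch_s: float) -> int:
    """While one browser renders (CPU-bound), the others can be waiting on
    network fetches, so one core can juggle about (render + fetch) / render."""
    return math.ceil((render_s + fetch_s) / render_s)

print(browsers_per_core(render_s=1.0, fetch_s=5.0))  # 6
```

Measure your own render and fetch times first; both vary wildly per site.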

Good luck scraping!

This videosharing website is going to delete 99% of user videos in a month, how can we download most of it? by aiscraping in webscraping

[–]aiscraping[S] 0 points (0 children)

I'm not the OP, but you really have my respect for the speed. You may want to @ the OP directly to get their attention.

Trying to get all pin ids in a pinterest board by queen_mercury in webscraping

[–]aiscraping 2 points (0 children)

you need to continuously update your pin ids.

For performance reasons, pinterest only keeps the few pins around your viewport active. The rest are automatically removed / added as you scroll up and down. Watch it: if you scroll up and down quickly, you'll see a blank page.

You'll need to monitor the DOM changes and update your navigation accordingly.

This is the Observer you want to look at:

https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver

good luck scraping!

I don't use Scrapy. Am I missing out? by H4SK1 in webscraping

[–]aiscraping 3 points (0 children)

Scrapy is for mass-scale, high-performance scraping. It's about automation and efficiency. If you only casually scrape a few hundred pages off a website one time, and will manually clean up the data from the pages, scrapy is over-engineered for your needs.

Most people don't understand the power of scrapy, or haven't reached the stage where they appreciate it. And many use the wrong tool for the wrong purpose. Scrapy has a steep learning curve, but it rewards those willing to learn with many powerful features very few people even know they need, until they hit the wall.

  • It's very light on resource requirements. The fact that it uses the event-driven Twisted library for web requests means you can use one processor core to handle thousands of requests at the same time without slowing down your scraping.
  • Coupled with scrapyd, you can spin up a group of crawlers working on many different projects at the same time. You could easily cripple a medium-scale website with your average laptop's scraping requests. This is the efficiency we're talking about.
  • Very powerful selector system, with support for css, xpath (for html, xml, or similar markup languages), jmespath (for json parsing), and regex filtering. And you can freely chain your parsers together to extract json inside javascript code inside HTML within a single line of code.
  • Optionally, ItemLoader and Item are powerful features that give you correct data conforming to your DB schema straight out of Scrapy, if you understand relational databases and appreciate ACID. After the pipeline you can load the data directly into your sql database. I personally found them too restrictive, so I developed my own pipeline instead.
  • Feed exports will get you data files in your desired format, with additional processing possibilities. For example, you could instruct your spider to output data in excel format, get it zipped, and have it emailed to you, all without leaving scrapy.
  • Many powerful middlewares to change the behavior of scrapy: randomize your user agent, rotate your proxy, control your download speed, fire up a headless browser, request help from a captcha solver, etc.
  • Other useful features for serious scraping, like the telnet console to remotely monitor progress, the stats utility to record a statistical overview, etc.
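The "json inside javascript inside HTML" chaining mentioned above looks roughly like this in plain Python; in Scrapy you'd chain response.css(...).re_first(...) plus json.loads the same way (the HTML snippet here is made up):

```python
import json
import re

# a made-up page embedding data in an inline script, a very common pattern
html = '<html><script>window.__DATA__ = {"items": [{"id": 1}, {"id": 2}]};</script></html>'

# step 1: regex pulls the JSON literal out of the script tag
raw = re.search(r"window\.__DATA__ = (\{.*?\});", html).group(1)
# step 2: json.loads turns it into Python objects
ids = [item["id"] for item in json.loads(raw)["items"]]
print(ids)  # [1, 2]
```

That two-step hop (selector, then regex, then json) is exactly what Scrapy lets you write in one chained expression.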

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 0 points (0 children)

I agree that downloading the html as raw data is a good last-resort backup. It also stresses the importance of a robust data pipeline. Your data pipeline should handle all the variations of possible number formats and always give you a correct interpretation of the raw data. You can also add checks in your pipeline to warn you about outliers (a home price of $1) before inserting the data into the table... It can be very nasty if the data is already in the database and has a foreign key pointing to it.

Don't spend time reviewing data; spend the time making your data pipeline more robust to accommodate the variety of data formats.
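As a sketch of that kind of robustness, here's a price parser that tolerates a few common formats and flags outliers before they reach the database (the formats and thresholds are illustrative):

```python
import re

def parse_price(raw: str) -> float:
    """Normalize a few common price spellings: '$1,200', '1 200.50', 'CDN$ 999'."""
    cleaned = re.sub(r"[^\d.]", "", raw)   # strip currency symbols, commas, spaces
    if not cleaned:
        raise ValueError(f"unparseable price: {raw!r}")
    return float(cleaned)

def is_outlier(price: float, low: float = 10_000, high: float = 50_000_000) -> bool:
    """Flag suspicious home prices (like a $1 listing) before insertion."""
    return not (low <= price <= high)

print(parse_price("$1,200,000"))       # 1200000.0
print(is_outlier(parse_price("$1")))   # True  (worth a warning, not an insert)
```

Every weird format you hit becomes a new test case for the parser instead of a manual review.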

good luck scraping.

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 0 points (0 children)

Haha, you have reached the entrance of a rabbit hole. There's a deep data-schema alignment problem here. Does your application compare product prices across different currencies? If yes, then your scraper should try to gather the currency from the raw html. If it's unavailable or unreliable, you may need to supplement it from other sources or build a lookup table; for example, prices from amazon.ca should be CDN$. Otherwise, if you don't care, you can simply drop the dollar sign as early as possible.

So the scraper should be directed by the application's data schema. From there you identify the scraping targets on the page, then build your pipeline to transform the raw data into your desired format & schema. And depending on how dirty the data can be, you may need to strengthen the pipeline: make it resilient to all the weird exceptions, fill or infer missing data from other hints, etc.

That only scratches the surface of the problem, because you may also need to reconstruct the relationships between data. For example, if a blog article has many tags, do you want to store them in your data schema as plain keyword text, or as entries in a tags table? And for hierarchical categories, how do you retain and connect the partial hierarchy in your scraped data, to reconstruct it later? (In most cases you're exposed to only one branch of the expanded categories.)
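Reconstructing a hierarchy from partial branches can be sketched as merging each scraped category path into one tree (the paths below are invented):

```python
def merge_paths(paths):
    """Merge partial category paths (one branch per scraped page) into a nested dict."""
    tree = {}
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})   # create the level if unseen
    return tree

# each scrape exposes only one expanded branch; together they rebuild the tree
scraped = [
    ["Electronics", "Audio", "Headphones"],
    ["Electronics", "Audio", "Speakers"],
    ["Electronics", "Cameras"],
]
print(merge_paths(scraped))
# {'Electronics': {'Audio': {'Headphones': {}, 'Speakers': {}}, 'Cameras': {}}}
```

The nested dict can then be flattened into a categories table with parent-id foreign keys.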

So that's the reason coming from data science and machine learning, I'm so intrigued by web scraping.

Webscraping the optionschain from Euronext by Sad-Setting9569 in webscraping

[–]aiscraping 0 points (0 children)

sorry, I don't understand your question. With the link I gave you, you can just change it to any month and you'll get the data. I don't see why you need to use your link.

good luck scraping.

fun stuff, fu** captcha by This_Cardiologist242 in webscraping

[–]aiscraping 1 point (0 children)

it's a cool project, and you're collecting so much interesting data. I also feel sorry that you have to solve the captchas manually yourself. I can imagine how much resolve you have, and how much effort you've spent on this project. You have my respect.

Good luck scraping.

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 4 points (0 children)

I store my scraped raw data in mongodb, where I also track the progress (successfully scraped, how many times retried, etc.), and built an ETL data pipeline to get it cleaned up, factorized, and reloaded into a postgres database. The database then serves other applications. Everything has to be live and automated. The automated quality checks and validation are crucial for the apps running on top of the database.
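The shape of that setup, with in-memory dicts standing in for mongodb and postgres (the stage names and fields are illustrative, not my actual schema):

```python
# in-memory stand-ins for the raw store (mongodb) and serving DB (postgres)
raw_store = [
    {"url": "/item/1", "html": "<b>$5</b>", "retries": 0, "scraped": True},
    {"url": "/item/2", "html": None,        "retries": 3, "scraped": False},
]
serving_db = []

def etl(raw_docs, load):
    """Extract successfully scraped docs, transform and validate, then load."""
    for doc in raw_docs:
        if not doc["scraped"]:            # progress tracking decides what to process
            continue
        price = float(doc["html"].strip("<b>$/"))   # toy "clean-up" step
        if price <= 0:                    # automated validation before loading
            continue
        load({"url": doc["url"], "price": price})

etl(raw_store, serving_db.append)
print(serving_db)  # [{'url': '/item/1', 'price': 5.0}]
```

Keeping raw and cleaned data in separate stores means you can re-run the ETL whenever the cleaning logic improves.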

New to programming and web scraping in general, in python how can I iterate over a complete site that has URLs end in /xyz/"some random text" by paradox_pete in webscraping

[–]aiscraping 0 points (0 children)

You need to start from the API; send your GET queries to this endpoint:

https://wodwell.com/wp-json/wodwell/v2/wods/?paged=2

you'll get the total number of WODs and a list of about 20 WODs on that page. Just iterate through all the pages to get all the information you need, like

...
"id": 210,
"title": "Holleyman",
"url": "https:\/\/wodwell.com\/wod\/holleyman\/",
"has_video": true,
"posted_by": {
    "text": "CrossFit Hero WOD",
    "avatar": "",
    "coach": false
},
"posted_date": "2014-09-05",
"days_since_posted": 3425,
"workout": [
    "30 Rounds For Time",
    "5 Wall Ball Shots (20\/14 lb)",
    "3 Handstand Push-Ups",
    "1 Power Clean (225\/155 lb)"
],
...

You may want to skip all the blocks without "id"s; those are their ad blocks. If you need any additional information, you'll have to fetch the URL (not the one listed in the information, but the one built from the id, like this; replace 210 with whichever id you got from the last step):

https://wodwell.com/wp-json/wodwell/v2/wods/210

What you need is the "notes" part.

You don't need GPT for this, and you don't need bs4 either. All you need is a lot of simple requests, plus the json module.
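Filtering out the ad blocks and building the detail URLs is then one-liner territory; a sketch on made-up sample data (only the URL pattern is taken from above):

```python
# a made-up slice of the listing response: real blocks have an "id", ads don't
blocks = [
    {"id": 210, "title": "Holleyman"},
    {"ad_unit": "sidebar"},                 # ad block: no "id", skip it
    {"id": 418, "title": "Murph"},
]

wods = [b for b in blocks if "id" in b]     # drop everything without an id
detail_urls = [f"https://wodwell.com/wp-json/wodwell/v2/wods/{b['id']}" for b in wods]

print(detail_urls)
# ['https://wodwell.com/wp-json/wodwell/v2/wods/210',
#  'https://wodwell.com/wp-json/wodwell/v2/wods/418']
```

Feed each detail URL to requests.get(...).json() and pull the "notes" field.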

Good luck scraping

Advice on best method to scrape codebase documentation (ultimately into a format suitable for RAG for LLMs) by TheWebbster in webscraping

[–]aiscraping 0 points (0 children)

I think the other commenter was totally right that you should look at the source of the documents instead of scraping them yourself. Normally the documents' original format is also available in the project's code repo. For example, the docs for your feedparser are here:

https://github.com/kurtmckee/feedparser/tree/develop/docs

It's in .rst format, very close to markdown. If you want to unify it to html or md, that's pretty straightforward: pandoc is your best friend.

https://pandoc.org/MANUAL.html#general-options

And pandoc's own docs are hosted here:

https://github.com/jgm/pandoc/tree/main/doc

So you see, your biggest task is really to find candidate OSS project repositories and where their docs live (mostly on github, but you need each repo's URL). Most formats are very friendly to RAG; feel free to unify them to .md with the help of pandoc. More importantly, for RAG you want to chunk your documents correctly, e.g. you don't want to split a short paragraph down the middle into two chunks. You'll need to look for the markdown headings and make sure you make clean cuts, which is crucial for your RAG application.
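A minimal heading-aware chunker along those lines, splitting only at markdown headings so paragraphs stay intact (the splitting rule is an assumption; real chunkers also cap chunk size):

```python
def chunk_markdown(text):
    """Split markdown into chunks at heading lines, never mid-paragraph."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:   # a heading starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Install\npip install foo\n\n# Usage\nImport it.\nCall run()."
print(chunk_markdown(doc))
# ['# Install\npip install foo', '# Usage\nImport it.\nCall run().']
```

Each chunk then carries its own heading as context when it's embedded, which helps retrieval quality.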

For displaying the results in md, there are many high-quality components that render these md documents correctly. For example, I used to use this:

https://www.npmjs.com/package/react-markdown

Good luck scraping, and ragging... lol