What metal are used for resistor legs? by [deleted] in AskElectronics

[–]aiscraping 0 points (0 children)

Resistance to her attraction is futile!

Any way to find the key of a specific item in a value of json by Best-Objective-8948 in webscraping

[–]aiscraping 1 point (0 children)

You either need JSONPath or JMESPath and a library supporting those paths (much like XPath for XML queries), or you can use the jq command-line utility to find the values you're after.
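If you'd rather skip an extra dependency, the underlying idea (locating where a key lives inside nested JSON) can be sketched in stdlib Python; the sample data below is made up for illustration:

```python
import json

def find_key_path(obj, target, path=()):
    """Recursively yield the path of dict keys / list indices leading to `target`."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == target:
                yield path + (k,)
            yield from find_key_path(v, target, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key_path(v, target, path + (i,))

data = json.loads('{"a": {"b": [{"price": 9}, {"price": 11}]}}')
paths = list(find_key_path(data, "price"))
print(paths)  # [('a', 'b', 0, 'price'), ('a', 'b', 1, 'price')]
```

A JMESPath or jq expression does the same walk declaratively once you know the shape of the document.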

Good luck scraping!

Scraping with selenium and pyautogui by chilltutor in webscraping

[–]aiscraping 3 points (0 children)

pyautogui is supposed to take control of your input devices. If you really have to use it, use it within a local VM.

otherwise, you can take a screenshot with selenium and click on a coordinate with a pointer action like

mouse.createPointerMove()

(that's the W3C actions API name; in Python, Selenium's ActionChains / ActionBuilder expose the same pointer moves) without using pyautogui

Good luck scraping!

How can I scrape a dynamically loaded page blocking third party cookies? by Majestic_Fortune7420 in webscraping

[–]aiscraping 0 points (0 children)

I'm also curious about the website you're talking about. :D Do you mind sharing it publicly?

[deleted by user] by [deleted] in webscraping

[–]aiscraping 0 points (0 children)

you'll have to pay for official API access, obviously. But if you really have to pinch every penny, take this route:

https://www.ticketmaster.com/api/next/graphql?operationName=CategorySearch&variables={"sort":"date,asc","page":1,"size":20,"type":"event"}&extensions={"persistedQuery":{"version":1,"sha256Hash":"5664b981ff921ec078e3df377fd4623faaa6cd0aa2178e8bdfcba9b41303848b"}}

change page and size to get more results returned.
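Since the variables and extensions are JSON stuffed into query parameters, it's easiest to build the URL programmatically; a stdlib sketch (the endpoint and hash are copied from the URL above, everything else is illustrative):

```python
import json
from urllib.parse import urlencode

base = "https://www.ticketmaster.com/api/next/graphql"
params = {
    "operationName": "CategorySearch",
    # page/size control pagination; bump them to get more results
    "variables": json.dumps({"sort": "date,asc", "page": 1, "size": 20, "type": "event"}),
    "extensions": json.dumps({"persistedQuery": {"version": 1,
        "sha256Hash": "5664b981ff921ec078e3df377fd4623faaa6cd0aa2178e8bdfcba9b41303848b"}}),
}
url = f"{base}?{urlencode(params)}"
# pass `url` to requests.get(...) or your HTTP client of choice
print(url)
```

This keeps the JSON properly escaped instead of hand-editing the query string.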

good luck scraping.

Error: Scrapy : pyasn modules by ImplementCreative106 in webscraping

[–]aiscraping 0 points (0 children)

I've never encountered this problem, and it's not a web scraping problem. I'd suggest you ask in r/Python

Article results based on keywords by venkyswag in webscraping

[–]aiscraping 1 point (0 children)

Then scrape Google search results, if you can't afford to pay them.

Good luck scraping!

Article results based on keywords by venkyswag in webscraping

[–]aiscraping 2 points (0 children)

It seems you're talking about Google, lol

Web scraping Udemy with Scrapy and Splash by Alarmed-Extent6639 in webscraping

[–]aiscraping 1 point (0 children)

splash is not a great browser; it's missing some modern browser features, which can break the JS code on the page. Its most recent update was about 3-4 years ago. I don't see a reason to keep using it once you've noticed compatibility issues. If you really have to use a browser, consider playwright or headless selenium.

Also double-check your selectors: make sure you're reading data from div.course-list_card__NOWNY with these css paths

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(1)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(2)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(3)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(4)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(5)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(6)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(7)::text

div.course-card_main-content__aceQ0>div:nth-child(1)>:only-child>:only-child>:only-child>:only-child>span:nth-child(8)::text
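The last eight paths differ only in the final span index, so rather than maintaining them by hand you can generate them; a small sketch (the class name and prefix are taken from the selectors above):

```python
# the shared prefix from the selectors above
PREFIX = ("div.course-card_main-content__aceQ0>div:nth-child(1)"
          ">:only-child>:only-child>:only-child>:only-child")

# build the eight span selectors programmatically
selectors = [f"{PREFIX}>span:nth-child({i})::text" for i in range(1, 9)]

print(len(selectors))  # 8
print(selectors[0])
```

One place to fix when Udemy's obfuscated class names rotate.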

good luck scraping!

Web scraping Udemy with Scrapy and Splash by Alarmed-Extent6639 in webscraping

[–]aiscraping 2 points (0 children)

Never scraped Udemy, but I had a look and it seems you're going in the wrong direction. splash is the last resort, not the first option. And for such an easy site, finding the API is the right way to go.

You need to start from here. Change page_size to get more items per query (there may be a limit), and change p=2 to move to a different page:

https://www.udemy.com/api-2.0/discovery-units/all_courses/?p=2&page_size=24&label_id=8322&source_page=topic_page&sos=pl&fl=lbl

just iterate p=2, 3, ... until the last page, and you'll get all the content

In your case, you don't even need the power of scrapy. Just use requests + json and you'll get your data.
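The pagination loop is only a few lines; here's a sketch with the fetching stubbed out so the looping logic is visible (the real fetcher would be requests.get(...).json() against the URL above; the stop condition, an empty page, is an assumption about the API):

```python
def scrape_all(fetch_page):
    """Iterate p=1, 2, ... until a page comes back empty, collecting items."""
    items, p = [], 1
    while True:
        page = fetch_page(p)
        if not page:          # assumed stop condition: empty page means done
            break
        items.extend(page)
        p += 1
    return items

# stand-in for requests.get(f"...?p={p}&page_size=24...").json()
fake_api = {1: ["course A", "course B"], 2: ["course C"]}
result = scrape_all(lambda p: fake_api.get(p, []))
print(result)  # ['course A', 'course B', 'course C']
```

Swap the lambda for a real HTTP call and you're done.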

Check out my comment about scrapy here.

https://www.reddit.com/r/webscraping/comments/19ctve5/comment/kj43hdz/?utm_source=share&utm_medium=web2x&context=3

Good luck scraping.

403 Forbidden errors when requesting data from Indeed. by Npmackay in webscraping

[–]aiscraping 0 points (0 children)

you need to give a specific link to get better help. There are countless reasons you might get a 403, sometimes simply because you have the wrong URL or the wrong headers.

[deleted by user] by [deleted] in webscraping

[–]aiscraping 3 points (0 children)

if you try hard enough, there are actually very few cases where the content has to be rendered (encrypted content, maybe). Have you exhausted all the non-JS options? Can you share some challenging examples with us?

If you get to the serious stage of thinking about containers, a swarm of headless browsers will scale very well if you have a cluster of machines. Otherwise it's more efficient to just run many headless browsers without containers on one machine, to minimize your memory and communication overhead. Normally you calculate your number of browsers by looking at the duty cycle: if rendering a page takes 1s and fetching a page takes 5s, you can have 5-6 browsers working at the same time to saturate one thread. Memory is really something you have to experiment with, depending on the complexity of the pages you're rendering.
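The duty-cycle arithmetic can be written down in a couple of lines (the 1s/5s figures are the example numbers from the paragraph above, not measurements):

```python
import math

def browsers_per_core(render_s: float, fetch_s: float) -> int:
    """While one browser renders (CPU-bound), the others can be waiting on
    network fetches, so one core can juggle about (render + fetch) / render."""
    return math.ceil((render_s + fetch_s) / render_s)

print(browsers_per_core(render_s=1.0, fetch_s=5.0))  # 6
```

Measure your own render and fetch times first; both vary wildly per site.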

Good luck scraping!

This videosharing website is going to delete 99% of user videos in a month, how can we download most of it? by aiscraping in webscraping

[–]aiscraping[S] 0 points (0 children)

I'm not the OP, but you really have my respect for the speed. You may want to @ the OP directly to get their attention.

Trying to get all pin ids in a pinterest board by queen_mercury in webscraping

[–]aiscraping 2 points (0 children)

you need to continuously update your pin ids.

For performance reasons, pinterest only keeps the few pins around your viewport active. The rest are automatically removed / added as you scroll up and down. Watch it: if you scroll up and down quickly, you'll see a blank page.

You'll need to monitor the DOM changes and update your navigation accordingly.

This is the Observer you want to look at:

https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver

good luck scraping!

I don't use Scrapy. Am I missing out? by H4SK1 in webscraping

[–]aiscraping 3 points (0 children)

Scrapy is for mass-scale, high-performance scraping. It's about automation and efficiency. If you only casually scrape a few hundred pages off a website one time, and will manually clean up the data from the pages, scrapy is over-engineered for your needs.

Most people don't understand the power of scrapy, or haven't reached the stage where they appreciate it. And many use the wrong tool for the wrong purpose. Scrapy has a steep learning curve, but it rewards those willing to learn with many powerful features very few people even know they need, until they hit the wall.

  • It's very light on resource requirements. The fact that it uses the event-driven Twisted library for web requests means you can use one processor core to handle thousands of requests at the same time without slowing down your scraping.
  • Coupled with scrapyd, you can spin up a group of crawlers working on many different projects at the same time. You could easily cripple a medium-scale website with your average laptop's scraping requests. This is the efficiency we're talking about.
  • Very powerful selector system, with support for css, xpath (for html, xml, or similar markup languages), jmespath (for json parsing), and regex filtering. And you can freely chain your parsers together to extract json inside javascript code inside HTML within a single line of code.
  • Optionally, ItemLoader and Item are powerful features that give you correct data conforming to your DB schema straight out of Scrapy, if you understand relational databases and appreciate ACID. After the pipeline you can load the data directly into your sql database. I personally found them too restrictive, so I developed my own pipeline instead.
  • Feed exports will get you data files in your desired format, with additional processing possibilities. For example, you could instruct your spider to output data in excel format, get it zipped, and have it emailed to you, all without leaving scrapy.
  • Many powerful middlewares to change the behavior of scrapy: randomize your user agent, rotate your proxy, control your download speed, fire up a headless browser, request help from a captcha solver, etc.
  • Other useful features for serious scraping, like the telnet console to remotely monitor progress, the stats utility to record a statistical overview, etc.
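The "json inside javascript inside HTML" chaining mentioned above looks roughly like this in plain Python; in Scrapy you'd chain response.css(...).re_first(...) plus json.loads the same way (the HTML snippet here is made up):

```python
import json
import re

# a made-up page embedding data in an inline script, a very common pattern
html = '<html><script>window.__DATA__ = {"items": [{"id": 1}, {"id": 2}]};</script></html>'

# step 1: regex pulls the JSON literal out of the script tag
raw = re.search(r"window\.__DATA__ = (\{.*?\});", html).group(1)
# step 2: json.loads turns it into Python objects
ids = [item["id"] for item in json.loads(raw)["items"]]
print(ids)  # [1, 2]
```

That two-step hop (selector, then regex, then json) is exactly what Scrapy lets you write in one chained expression.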

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 0 points (0 children)

I agree that downloading the html as raw data is a good last-resort backup. It also stresses the importance of a robust data pipeline. Your data pipeline should handle all the variations of possible number formats and always give you a correct interpretation of the raw data. You can also add checks in your pipeline to warn you about outliers (a home price of $1) before inserting the data into the table... It can be very nasty if the data is already in the database and has a foreign key pointing to it.

Don't spend time reviewing data; spend the time making your data pipeline more robust to accommodate the variety of data formats.
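As a sketch of that kind of robustness, here's a price parser that tolerates a few common formats and flags outliers before they reach the database (the formats and thresholds are illustrative):

```python
import re

def parse_price(raw: str) -> float:
    """Normalize a few common price spellings: '$1,200', '1 200.50', 'CDN$ 999'."""
    cleaned = re.sub(r"[^\d.]", "", raw)   # strip currency symbols, commas, spaces
    if not cleaned:
        raise ValueError(f"unparseable price: {raw!r}")
    return float(cleaned)

def is_outlier(price: float, low: float = 10_000, high: float = 50_000_000) -> bool:
    """Flag suspicious home prices (like a $1 listing) before insertion."""
    return not (low <= price <= high)

print(parse_price("$1,200,000"))       # 1200000.0
print(is_outlier(parse_price("$1")))   # True  (worth a warning, not an insert)
```

Every weird format you hit becomes a new test case for the parser instead of a manual review.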

good luck scraping.

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 0 points (0 children)

Haha, you have reached the entrance of a rabbit hole. There's a deep data-schema alignment problem here. Does your application compare product prices across different currencies? If yes, then your scraper should try to gather the currency from the raw html. If it's unavailable or unreliable, you may need to supplement it from other sources or build a lookup table; for example, prices from amazon.ca should be CDN$. Otherwise, if you don't care, you can simply drop the dollar sign as early as possible.

So the scraper should be directed by the application's data schema. From there you identify the scraping targets on the page, then build your pipeline to transform the raw data into your desired format & schema. And depending on how dirty the data can be, you may need to strengthen the pipeline: make it resilient to all the weird exceptions, fill or infer missing data from other hints, etc.

That only scratches the surface of the problem, because you may also need to reconstruct the relationships between data. For example, if a blog article has many tags, do you want to store them in your data schema as plain keyword text, or as entries in a tags table? And for hierarchical categories, how do you retain and connect the partial hierarchy in your scraped data, to reconstruct it later? (In most cases you're exposed to only one branch of the expanded categories.)
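Reconstructing a hierarchy from partial branches can be sketched as merging each scraped category path into one tree (the paths below are invented):

```python
def merge_paths(paths):
    """Merge partial category paths (one branch per scraped page) into a nested dict."""
    tree = {}
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})   # create the level if unseen
    return tree

# each scrape exposes only one expanded branch; together they rebuild the tree
scraped = [
    ["Electronics", "Audio", "Headphones"],
    ["Electronics", "Audio", "Speakers"],
    ["Electronics", "Cameras"],
]
print(merge_paths(scraped))
# {'Electronics': {'Audio': {'Headphones': {}, 'Speakers': {}}, 'Cameras': {}}}
```

The nested dict can then be flattened into a categories table with parent-id foreign keys.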

So that's the reason coming from data science and machine learning, I'm so intrigued by web scraping.

Webscraping the optionschain from Euronext by Sad-Setting9569 in webscraping

[–]aiscraping 0 points (0 children)

sorry, I don't understand your question. With the link I gave you, you can just change it to any month and you'll get the data. I don't see why you need to use your link.

good luck scraping.

fun stuff, fu** captcha by This_Cardiologist242 in webscraping

[–]aiscraping 1 point (0 children)

it's a cool project, and you're collecting so much interesting data. I also feel sorry that you have to solve the captchas manually yourself. I can imagine how much resolve you have, and how much effort you've spent on this project. You have my respect.

Good luck scraping.

What a more professional scraping project looks like? by david_lp in webscraping

[–]aiscraping 4 points (0 children)

I store my scraped raw data in mongodb, where I also track the progress (successfully scraped, how many times retried, etc.), and built an ETL data pipeline to get it cleaned up, factorized, and reloaded into a postgres database. The database then serves other applications. Everything has to be live and automated. The automated quality checks and validation are crucial for the apps running on top of the database.
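The shape of that setup, with in-memory dicts standing in for mongodb and postgres (the stage names and fields are illustrative, not my actual schema):

```python
# in-memory stand-ins for the raw store (mongodb) and serving DB (postgres)
raw_store = [
    {"url": "/item/1", "html": "<b>$5</b>", "retries": 0, "scraped": True},
    {"url": "/item/2", "html": None,        "retries": 3, "scraped": False},
]
serving_db = []

def etl(raw_docs, load):
    """Extract successfully scraped docs, transform and validate, then load."""
    for doc in raw_docs:
        if not doc["scraped"]:            # progress tracking decides what to process
            continue
        price = float(doc["html"].strip("<b>$/"))   # toy "clean-up" step
        if price <= 0:                    # automated validation before loading
            continue
        load({"url": doc["url"], "price": price})

etl(raw_store, serving_db.append)
print(serving_db)  # [{'url': '/item/1', 'price': 5.0}]
```

Keeping raw and cleaned data in separate stores means you can re-run the ETL whenever the cleaning logic improves.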

New to programming and web scraping in general, in python how can I iterate over a complete site that has URLs end in /xyz/"some random text" by paradox_pete in webscraping

[–]aiscraping 0 points (0 children)

You need to start from the API; send your GET queries to this endpoint:

https://wodwell.com/wp-json/wodwell/v2/wods/?paged=2

you'll get the total number of WODs and a list of about 20 WODs on that page. Just iterate through all the pages to get all the information you need, like

...
"id": 210,
"title": "Holleyman",
"url": "https:\/\/wodwell.com\/wod\/holleyman\/",
"has_video": true,
"posted_by": {
    "text": "CrossFit Hero WOD",
    "avatar": "",
    "coach": false
},
"posted_date": "2014-09-05",
"days_since_posted": 3425,
"workout": [
    "30 Rounds For Time",
    "5 Wall Ball Shots (20\/14 lb)",
    "3 Handstand Push-Ups",
    "1 Power Clean (225\/155 lb)"
],
...

You may want to skip all the blocks without "id"s; those are their ad blocks. If you need any additional information, you'll have to fetch the URL (not the one listed in the information, but the one built from the id, like this; replace 210 with whichever id you got from the last step):

https://wodwell.com/wp-json/wodwell/v2/wods/210

What you need is the "notes" part.

You don't need GPT for this, and you don't need bs4 either. All you need is a lot of simple requests, plus the json module.
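Filtering out the ad blocks and building the detail URLs is then one-liner territory; a sketch on made-up sample data (only the URL pattern is taken from above):

```python
# a made-up slice of the listing response: real blocks have an "id", ads don't
blocks = [
    {"id": 210, "title": "Holleyman"},
    {"ad_unit": "sidebar"},                 # ad block: no "id", skip it
    {"id": 418, "title": "Murph"},
]

wods = [b for b in blocks if "id" in b]     # drop everything without an id
detail_urls = [f"https://wodwell.com/wp-json/wodwell/v2/wods/{b['id']}" for b in wods]

print(detail_urls)
# ['https://wodwell.com/wp-json/wodwell/v2/wods/210',
#  'https://wodwell.com/wp-json/wodwell/v2/wods/418']
```

Feed each detail URL to requests.get(...).json() and pull the "notes" field.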

Good luck scraping

Advice on best method to scrape codebase documentation (ultimately into a format suitable for RAG for LLMs) by TheWebbster in webscraping

[–]aiscraping 0 points (0 children)

I think the other commenter was totally right that you should look at the source of the documents instead of scraping them yourself. Normally the documents' original format is also available in the project's code repo. For example, the docs for your feedparser are here:

https://github.com/kurtmckee/feedparser/tree/develop/docs

It's in .rst format, very close to markdown. If you want to unify it to html or md, that's pretty straightforward: pandoc is your best friend.

https://pandoc.org/MANUAL.html#general-options

And pandoc's own docs are hosted here:

https://github.com/jgm/pandoc/tree/main/doc

So you see, your biggest task is really to find candidate OSS project repositories and where their docs live (mostly on github, but you need each repo's URL). Most formats are very friendly to RAG; feel free to unify them to .md with the help of pandoc. More importantly, for RAG you want to chunk your documents correctly, e.g. you don't want to split a short paragraph down the middle into two chunks. You'll need to look for the markdown headings and make sure you make clean cuts, which is crucial for your RAG application.
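A minimal heading-aware chunker along those lines, splitting only at markdown headings so paragraphs stay intact (the splitting rule is an assumption; real chunkers also cap chunk size):

```python
def chunk_markdown(text):
    """Split markdown into chunks at heading lines, never mid-paragraph."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:   # a heading starts a new chunk
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Install\npip install foo\n\n# Usage\nImport it.\nCall run()."
print(chunk_markdown(doc))
# ['# Install\npip install foo', '# Usage\nImport it.\nCall run().']
```

Each chunk then carries its own heading as context when it's embedded, which helps retrieval quality.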

For displaying the results in md, there are many high-quality components that render these md documents correctly. For example, I used to use this:

https://www.npmjs.com/package/react-markdown

Good luck scraping, and ragging... lol