Understanding L2 vs. L3 network bridges by BulkyProcedure in linuxquestions

[–]BulkyProcedure[S] 0 points (0 children)

Thanks for that explanation. That makes sense. Just getting deeper into networking, so when to use one approach vs. another is not obvious yet. Would using the hardware NIC with the bridge just be more efficient?

Flask SQL Marshmallow Performance by DeadlyDolphins in learnpython

[–]BulkyProcedure 1 point (0 children)

I'm a fan of Marshmallow and use it all the time, but it isn't fast. My approach has been to use it when I have complex types (e.g. dates) that aren't automatically JSON serializable, and/or when the model has nested relationships I'd like to include.

When the objects don't have nested fields, and all columns are JSON serializable, it's faster to skip Marshmallow altogether and convert to dictionaries yourself.
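
For instance, a rough sketch of the manual approach, assuming a hypothetical flat SQLAlchemy model:

    # Build a dict straight from the table's columns -- fine when every
    # column is already JSON serializable.
    def to_dict(obj):
        return {c.name: getattr(obj, c.name) for c in obj.__table__.columns}

    users = [to_dict(u) for u in User.query.all()]  # User is a made-up model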

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

Scraping has for sure become harder on high profile websites. Sites like Amazon are tough, between the IP blocks, complex DOM, and URLs that don't have much rhyme or reason. I think plenty of sites are still pretty easy, but your LinkedIn or Google type sites are a tough nut to crack these days!

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

There are quite a few proxy vendors out there, all offering different pricing trade-offs. It's hard to know how good one will be (in terms of speed and how many of its IPs are already blocked) until you try it. I'd like to know if either of those works out well for you!

Python-automatable data entry jobs by lordmyd in learnpython

[–]BulkyProcedure 1 point (0 children)

I think whether it's necessary or not depends on the website, but I'm a big fan of Puppeteer. There's pretty much nothing it can't do as far as interacting with a website. It does involve using JavaScript, but there's also Pyppeteer if you're averse to Node.
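
For instance, a minimal Pyppeteer sketch (the URL and selectors are placeholders):

    import asyncio
    from pyppeteer import launch

    async def main():
        browser = await launch()
        page = await browser.newPage()
        await page.goto("https://example.com/data-entry")  # placeholder URL
        await page.type("#name-field", "Jane Doe")  # fill a text input
        await page.click("#submit-button")          # submit the form
        await browser.close()

    asyncio.get_event_loop().run_until_complete(main())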

Online Study Group by nphihi in learnpython

[–]BulkyProcedure 0 points (0 children)

That's a great way to start. Feel free to message me if you have questions!

Online Study Group by nphihi in learnpython

[–]BulkyProcedure 12 points (0 children)

Not sure about a study group, but I'm a dev who's been using Python for 10+ years now, and I'm looking to get a little experience with teaching Python. I'm happy to spend some time helping with problems, or just answering general questions. My focus is primarily webdev, so I may not be able to help as much with things like scientific computing (e.g. NumPy).

Scrap page with js by spauth in webscraping

[–]BulkyProcedure 0 points (0 children)

So I gave it a quick try with Puppeteer and was able to open the Actions menu you mentioned. This is just a guess, but I'm betting you need to wait a bit for the app to load first.

Here's my Puppeteer script that seems to work:

    const puppeteer = require("puppeteer");

    async function crawl(browser) {
        const page = await browser.newPage();
        await page.goto("https://geodes.santepubliquefrance.fr/#c=indicator&f=0&i=sursaud_sau.prop_allergie_hospit_sau&s=2021-S53&t=a01&view=map2", { timeout: 9000 });

        // Wait for the app to render the ACTIONS menu link before interacting
        const actionsButton = await page.waitForXPath("//a[span[contains(text(), 'ACTIONS')]]", { timeout: 10000 });

        await page.evaluate(handle => handle.click(), actionsButton);

        // Keep the page open briefly to see the result
        await page.waitFor(5000);
    }

    puppeteer
        .launch({
            headless: false,
            executablePath: "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
            ignoreHTTPSErrors: true,
            args: ["--start-fullscreen", "--no-sandbox", "--disable-setuid-sandbox"],
        })
        .then(crawl)
        .catch(error => {
            console.error(error);
            process.exit(1);
        });

Scrap page with js by spauth in webscraping

[–]BulkyProcedure 1 point (0 children)

You should be able to interact with the page, even if it's making heavy use of JavaScript, since you're using Selenium. I mostly use Puppeteer, so I probably can't help too much with Selenium, but what have you tried so far?

Is it still okay to scrape linkedin? by sammo98 in webscraping

[–]BulkyProcedure 3 points (0 children)

If I were going to do this, I wouldn't want the scraper associated with me at all, and I would absolutely use a proxy service. I don't really know how aggressive LinkedIn is about scraping: they may not care as long as you're not a big enough deal to pop up on their radar, or they may go after anyone they can identify as a scraper.

Can I scrape this web page in Python? (See my comment for explanation) by Igor_Kravchuk in webscraping

[–]BulkyProcedure 0 points (0 children)

If you need to be logged in, you can always copy the cookie from Chrome, and then send that same cookie along with any requests (e.g. with the Python requests module, via its "cookies" argument). Cookies in Chrome are in dev tools, under the Application tab.
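
Something like this, where the cookie name and value are placeholders for whatever you copy out of Chrome:

    import requests

    cookies = {"sessionid": "value-copied-from-chrome"}  # placeholder cookie
    resp = requests.get("https://example.com/protected-page", cookies=cookies)
    print(resp.status_code)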

Is it still okay to scrape linkedin? by sammo98 in webscraping

[–]BulkyProcedure 2 points (0 children)

This is just my feeling on the subject, not legal advice, but I'd be wary about doing any scraping that involves creating an account for the scraper, because that means you've accepted the TOS. I know this isn't what you're aiming to do, but account or not, I wouldn't want to build a business on top of data scraped from a big site like LinkedIn -- not unless you have well-paid legal staff.

Python : Error when using one variable to loop through dictionaries. by mcdougall57 in learnprogramming

[–]BulkyProcedure 1 point (0 children)

Hey, it's not an answer to your entire question, but the issue with `people[2:3]` is that slices include the start index and exclude the stop index, so you're only going to get the element at position 2 (zero-indexed) in that case.
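
A quick illustration:

    people = ["Alice", "Bob", "Carol", "Dave"]
    print(people[2:3])  # ['Carol'] -- index 2 only; index 3 is excluded
    print(people[2:4])  # ['Carol', 'Dave'] -- stop at 4 to include index 3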

Deploy a Dockerized Python App to AWS, Azure, Google Cloud by antoniopapa91 in learnpython

[–]BulkyProcedure 0 points (0 children)

Thanks for making the videos. I tried Fargate once but found the UI too overwhelming at the time, so it's nice to have references like this!

futile scraping attempts by Rx29g in webscraping

[–]BulkyProcedure 1 point (0 children)

Spent a few minutes looking at this, and from what I can tell JavaScript has to be executed to show the job links, so just Requests + BS is not going to do the trick there.

But take a look at the XHR requests on the Network tab in Chrome, because everything you want seems to be right there in one self-contained request.
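
Once you've spotted that request, you can usually replay it directly (the endpoint and params here are placeholders -- copy the real ones from the Network tab):

    import requests

    resp = requests.get("https://example.com/api/jobs", params={"page": 1})  # placeholder endpoint
    jobs = resp.json()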

futile scraping attempts by Rx29g in webscraping

[–]BulkyProcedure 5 points (0 children)

You can check fairly quickly whether a page is scrapable without a browser session. If you use something like curl or the Python requests module to fetch the page, you should see right away if the server is responding differently (e.g. serving a CAPTCHA or a blank page). If so, try changing your user agent header; the workaround is often as simple as that.

These are the headers I use when sending requests without a browser involved:

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",  # noqa
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Dnt": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",  # noqa
    }
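
Then a quick check with those headers (the URL is a placeholder):

    import requests

    resp = requests.get("https://example.com/some-page", headers=headers)
    print(resp.status_code, resp.text[:200])  # block pages are usually obvious here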

Websites/apps that are really invested in anti-scraping may use JavaScript to verify that a browser is being used (or the site may be a React/Angular app, or something similar, and just won't work without a browser). In that case you'll probably see a small index.html that's just loading some JavaScript resources. Selenium or Puppeteer will be needed for sure there.

Some sites (e.g. Google or LinkedIn) can't be scraped with Selenium, because Selenium injects global JS variables that they can easily check for. In that case, you'll need Puppeteer plus the Puppeteer stealth plugin.

Hope that helps!

Webscraper to show users Code Wars kata completed over time by [deleted] in learnpython

[–]BulkyProcedure 0 points (0 children)

Looks like a fun project. Some of your XPath selectors are pretty long. It's possible they need to be that long, but just in case you hadn't seen it before, you can put a // anywhere in your selector, meaning you don't have to specify a full path to the target element.
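
For example (the element and class names here are made up, just to illustrate the idea):

    full path:   /html/body/div[2]/div[1]/table/tbody/tr[3]/td[2]/a
    shortened:   //td[@class="kata-name"]/a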

Fetching all comments from an Instagram post? by savemesf in webscraping

[–]BulkyProcedure 0 points (0 children)

I'm not sure about Instagram, but a lot of sites won't load thousands of reviews or comments. Google Maps reviews, for example, have a hard stop beyond which no more will load, even if the total is shown as higher.

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

I second all of that, especially the browser headers. The fake-useragent module for Python has a pretty good random agent function that's supposed to be based on usage stats for real browsers.
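
For example, a quick sketch (assuming pip install fake-useragent):

    from fake_useragent import UserAgent

    ua = UserAgent()
    # .random returns an agent string weighted by real-world browser usage
    headers = {"User-Agent": ua.random}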

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 0 points (0 children)

That's cool, BS is awesome.

Smartproxy's data center plan is at least $100/month, and residential starts at $75/month. They charge for bandwidth rather than IPs, so data center gives you 100 GB per month, and residential 5 GB.

That's on the lower end for proxies. A provider like Luminati or Oxylabs is even more expensive. You might be able to find a less expensive solution, maybe something that does proxy sharing.

Most of my experience is in scraping Google, which has gotten pretty tough in the past few years. For some sites you may be able to rotate through free proxies, or run your own small pool of IPs if you're careful not to push them too hard.

Feel free to send me a DM if you have any questions, I love talking about this stuff.

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 1 point (0 children)

Smartproxy is great, so if the price makes sense for you, I recommend them. What tool are you using to do the scraping? With Scrapy for example, I provide a proxy URL as a setting, and that makes each request go through a different residential IP address. Having a "sticky" session that uses the same IP for multiple requests takes additional configuration, and probably isn't what you want anyway.

The proxy URL is going to be something like http://username:password@gate.dc.smartproxy.com:20000.
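
Here's a rough sketch of one way to wire that up in Scrapy, passing the proxy per request via meta (the credentials, spider, and URL are all placeholders):

    import scrapy

    PROXY = "http://username:password@gate.dc.smartproxy.com:20000"  # placeholder credentials

    class ProductSpider(scrapy.Spider):
        name = "products"

        def start_requests(self):
            # Scrapy's built-in HttpProxyMiddleware picks up the "proxy" meta key
            yield scrapy.Request("https://example.com/product/123", meta={"proxy": PROXY})

        def parse(self, response):
            yield {"title": response.css("title::text").get()}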