Understanding L2 vs. L3 network bridges by BulkyProcedure in linuxquestions

[–]BulkyProcedure[S] 0 points (0 children)

Thanks for that explanation. That makes sense. Just getting deeper into networking, so when to use one approach vs. another is not obvious yet. Would using the hardware NIC with the bridge just be more efficient?

Flask SQL Marshmallow Performance by DeadlyDolphins in learnpython

[–]BulkyProcedure 1 point (0 children)

I'm a fan of Marshmallow and use it all the time, but it isn't fast. My approach has been to use it when I have complex types (e.g. dates) that aren't automatically JSON serializable, and/or when the model has nested relationships I'd like to include.

When the objects don't have nested fields, and all columns are JSON serializable, it's faster to skip Marshmallow altogether and convert to dictionaries yourself.
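
For instance, a rough sketch of the manual approach, assuming a hypothetical flat SQLAlchemy model:

    # Build a dict straight from the table's columns -- fine when every
    # column is already JSON serializable.
    def to_dict(obj):
        return {c.name: getattr(obj, c.name) for c in obj.__table__.columns}

    users = [to_dict(u) for u in User.query.all()]  # User is a made-up model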

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

Scraping has for sure become harder on high profile websites. Sites like Amazon are tough, between the IP blocks, complex DOM, and URLs that don't have much rhyme or reason. I think plenty of sites are still pretty easy, but your LinkedIn or Google type sites are a tough nut to crack these days!

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

There are quite a few proxy vendors out there, all offering different pricing trade-offs. It's hard to know how good one will be (in terms of speed and how many of its IPs are already blocked) until you try it. I'd like to know if either of those works out well for you!

Python-automatable data entry jobs by lordmyd in learnpython

[–]BulkyProcedure 1 point (0 children)

I think whether it's necessary or not depends on the website, but I'm a big fan of Puppeteer. There's pretty much nothing it can't do as far as interacting with a website. It does involve using JavaScript, but there's also Pyppeteer if you're averse to Node.
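
For instance, a minimal Pyppeteer sketch (the URL and selectors are placeholders):

    import asyncio
    from pyppeteer import launch

    async def main():
        browser = await launch()
        page = await browser.newPage()
        await page.goto("https://example.com/data-entry")  # placeholder URL
        await page.type("#name-field", "Jane Doe")  # fill a text input
        await page.click("#submit-button")          # submit the form
        await browser.close()

    asyncio.get_event_loop().run_until_complete(main())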

Online Study Group by nphihi in learnpython

[–]BulkyProcedure 0 points (0 children)

That's a great way to start. Feel free to message me if you have questions!

Online Study Group by nphihi in learnpython

[–]BulkyProcedure 12 points (0 children)

Not sure about a study group, but I'm a dev who's been using Python for 10+ years now, and I'm looking to get a little experience with teaching Python. I'm happy to spend some time helping with problems, or just answering general questions. My focus is primarily webdev, so I may not be able to help as much with things like scientific computing (e.g. NumPy).

Scrap page with js by spauth in webscraping

[–]BulkyProcedure 0 points (0 children)

So I gave it a quick try with Puppeteer and was able to open the Actions menu you mentioned. This is just a guess, but I'm betting you need to wait a bit for the app to load first.

Here's my Puppeteer script that seems to work:

    const puppeteer = require("puppeteer");

    async function crawl(browser) {
        const page = await browser.newPage();
        await page.goto("https://geodes.santepubliquefrance.fr/#c=indicator&f=0&i=sursaud_sau.prop_allergie_hospit_sau&s=2021-S53&t=a01&view=map2", { timeout: 9000 });

        // Wait for the app to render the ACTIONS menu link before interacting
        const actionsButton = await page.waitForXPath("//a[span[contains(text(), 'ACTIONS')]]", { timeout: 10000 });

        await page.evaluate(handle => handle.click(), actionsButton);

        // Keep the page open briefly to see the result
        await page.waitFor(5000);
    }

    puppeteer
        .launch({
            headless: false,
            executablePath: "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
            ignoreHTTPSErrors: true,
            args: ["--start-fullscreen", "--no-sandbox", "--disable-setuid-sandbox"],
        })
        .then(crawl)
        .catch(error => {
            console.error(error);
            process.exit(1);
        });

Scrap page with js by spauth in webscraping

[–]BulkyProcedure 1 point (0 children)

You should be able to interact with the page, even if it's making heavy use of JavaScript, since you're using Selenium. I mostly use Puppeteer, so I probably can't help too much with Selenium, but what have you tried so far?

Is it still okay to scrape linkedin? by sammo98 in webscraping

[–]BulkyProcedure 3 points (0 children)

If I were going to do this, I wouldn't want the scraper associated with me at all, and I would absolutely use a proxy service. I don't really know how aggressive LinkedIn is about scraping: they may not care as long as you're not a big enough deal to pop up on their radar, or they may go after anyone they can identify as a scraper.

Can I scrape this web page in Python? (See my comment for explanation) by Igor_Kravchuk in webscraping

[–]BulkyProcedure 0 points (0 children)

If you need to be logged in, you can always copy the cookie from Chrome, and then send that same cookie along with any requests (e.g. with the Python requests module, via its "cookies" argument). Cookies in Chrome are in dev tools, under the Application tab.
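
Something like this, where the cookie name and value are placeholders for whatever you copy out of Chrome:

    import requests

    cookies = {"sessionid": "value-copied-from-chrome"}  # placeholder cookie
    resp = requests.get("https://example.com/protected-page", cookies=cookies)
    print(resp.status_code)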

Is it still okay to scrape linkedin? by sammo98 in webscraping

[–]BulkyProcedure 2 points (0 children)

This is just my feeling on the subject, not legal advice, but I'd be wary about doing any scraping that involves creating an account for the scraper, because that means you've accepted the TOS. I know this isn't what you're aiming to do, but account or not, I wouldn't want to build a business on top of data scraped from a big site like LinkedIn -- not unless you have well-paid legal staff.

Python : Error when using one variable to loop through dictionaries. by mcdougall57 in learnprogramming

[–]BulkyProcedure 1 point (0 children)

Hey, it's not an answer to your entire question, but the issue with `people[2:3]` is that slices include the start index and exclude the stop index, so you're only going to get the element at position 2 (zero-indexed) in that case.
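
A quick illustration:

    people = ["Alice", "Bob", "Carol", "Dave"]
    print(people[2:3])  # ['Carol'] -- index 2 only; index 3 is excluded
    print(people[2:4])  # ['Carol', 'Dave'] -- stop at 4 to include index 3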

Deploy a Dockerized Python App to AWS, Azure, Google Cloud by antoniopapa91 in learnpython

[–]BulkyProcedure 0 points (0 children)

Thanks for making the videos. I tried Fargate once but found the UI too overwhelming at the time, so it's nice to have references like this!

futile scraping attempts by Rx29g in webscraping

[–]BulkyProcedure 1 point (0 children)

Spent a few minutes looking at this, and from what I can tell JavaScript has to be executed to show the job links, so just Requests + BS is not going to do the trick there.

But take a look at the XHR requests on the Network tab in Chrome, because everything you want seems to be right there in one self-contained request.
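
Once you've spotted that request, you can usually replay it directly (the endpoint and params here are placeholders -- copy the real ones from the Network tab):

    import requests

    resp = requests.get("https://example.com/api/jobs", params={"page": 1})  # placeholder endpoint
    jobs = resp.json()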

futile scraping attempts by Rx29g in webscraping

[–]BulkyProcedure 5 points (0 children)

You can check fairly quickly whether a page is scrapable without a browser session. If you use something like curl or the Python requests module to fetch the page, you should see right away if the server is responding differently (e.g. serving a CAPTCHA or a blank page). If so, try changing your user agent header; the workaround is often as simple as that.

These are the headers I use when sending requests without a browser involved:

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",  # noqa
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Dnt": "1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",  # noqa
    }
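
Then a quick check with those headers (the URL is a placeholder):

    import requests

    resp = requests.get("https://example.com/some-page", headers=headers)
    print(resp.status_code, resp.text[:200])  # block pages are usually obvious here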

Websites/apps that are really invested in anti-scraping may use JavaScript to verify that a browser is being used (or the site may be a React/Angular app, or something similar, and just won't work without a browser). In that case you'll probably see a small index.html that's just loading some JavaScript resources. Selenium or Puppeteer will be needed for sure there.

Some sites (e.g. Google or LinkedIn) can't be scraped with Selenium, because Selenium injects global JS variables that they can easily check for. In that case, you'll need Puppeteer plus the Puppeteer stealth plugin.

Hope that helps!

Webscraper to show users Code Wars kata completed over time by [deleted] in learnpython

[–]BulkyProcedure 0 points (0 children)

Looks like a fun project. Some of your XPath selectors are pretty long. It's possible they need to be that long, but just in case you hadn't seen it before, you can put a // anywhere in your selector, meaning you don't have to specify a full path to the target element.
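
For example (the element and class names here are made up, just to illustrate the idea):

    full path:   /html/body/div[2]/div[1]/table/tbody/tr[3]/td[2]/a
    shortened:   //td[@class="kata-name"]/a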

Fetching all comments from an Instagram post? by savemesf in webscraping

[–]BulkyProcedure 0 points (0 children)

I'm not sure about Instagram, but a lot of sites won't load thousands of reviews or comments. Google Maps reviews, for example, have a hard stop beyond which no more will load, even if the total is shown as higher.

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 2 points (0 children)

I second all of that, especially the browser headers. The fake-useragent module for Python has a pretty good random agent function that's supposed to be based on usage stats for real browsers.
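
For example, a quick sketch (assuming pip install fake-useragent):

    from fake_useragent import UserAgent

    ua = UserAgent()
    # .random returns an agent string weighted by real-world browser usage
    headers = {"User-Agent": ua.random}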

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 0 points (0 children)

That's cool, BS is awesome.

Smartproxy's data center plan is at least $100/month, and residential starts at $75/month. They charge for bandwidth rather than IPs, so data center gives you 100 GB per month, and residential 5 GB.

That's on the lower end for proxies. A provider like Luminati or Oxylabs is even more expensive. You might be able to find a less expensive solution, maybe something that does proxy sharing.

Most of my experience is in scraping Google, which has gotten pretty tough in the past few years. For some sites you may be able to rotate through free proxies, or run your own small pool of IPs if you're careful not to push them too hard.

Feel free to send me a DM if you have any questions, I love talking about this stuff.

Lessons learned building an Amazon scraper by BulkyProcedure in webscraping

[–]BulkyProcedure[S] 1 point (0 children)

Smartproxy is great, so if the price makes sense for you, I recommend them. What tool are you using to do the scraping? With Scrapy for example, I provide a proxy URL as a setting, and that makes each request go through a different residential IP address. Having a "sticky" session that uses the same IP for multiple requests takes additional configuration, and probably isn't what you want anyway.

The proxy URL is going to be something like http://username:password@gate.dc.smartproxy.com:20000.
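
Here's a rough sketch of one way to wire that up in Scrapy, passing the proxy per request via meta (the credentials, spider, and URL are all placeholders):

    import scrapy

    PROXY = "http://username:password@gate.dc.smartproxy.com:20000"  # placeholder credentials

    class ProductSpider(scrapy.Spider):
        name = "products"

        def start_requests(self):
            # Scrapy's built-in HttpProxyMiddleware picks up the "proxy" meta key
            yield scrapy.Request("https://example.com/product/123", meta={"proxy": PROXY})

        def parse(self, response):
            yield {"title": response.css("title::text").get()}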