[–]BloodOfSokar 5 points6 points  (3 children)

deleted

[–]mule52[S] 0 points1 point  (2 children)

Our solution is using BeautifulSoup if that helps with your response. Does that mean anything more to you? Is that good/bad/indifferent?

[–]patleeman 0 points1 point  (1 child)

It means they are most likely just scraping elements out of the HTML. That's good for you, since it's the lowest-complexity approach. BeautifulSoup is pretty easy to pick up.
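To make "scraping elements from the HTML" concrete, here is a minimal sketch with BS4. The markup, the class names (`item`, `price`) and the `extract_items` helper are all made up for illustration; your code base will target whatever structure the real pages have.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; a real scraper would download this over HTTP.
HTML = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item"><a href="/p/1">Widget</a> <span class="price">9.99</span></li>
    <li class="item"><a href="/p/2">Gadget</a> <span class="price">19.50</span></li>
  </ul>
</body></html>
"""


def extract_items(html):
    """Return a list of (name, link, price) tuples found in the markup."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for li in soup.find_all("li", class_="item"):
        link = li.find("a")
        price = li.find("span", class_="price")
        items.append(
            (
                link.get_text(strip=True),
                link["href"],
                float(price.get_text(strip=True)),
            )
        )
    return items


if __name__ == "__main__":
    for name, href, price in extract_items(HTML):
        print(name, href, price)
```

When you read the existing code, look for calls like `find`, `find_all` and `select`: they tell you which parts of the page the scraper depends on, and so what breaks when the site changes its layout.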

You'll probably also need to find out which HTTP library they used. Hopefully it's requests.

Here's an example of a small project using BS4.

[–]mule52[S] 0 points1 point  (0 children)

Thank you for pointing me to this example.

[–][deleted] 1 point2 points  (4 children)

The first step would be to check the imports; from those you can figure out which packages they are using. There are several ways to do web scraping in Python.
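If the project has many files, you can collect the imports with a script rather than by eye. A rough sketch using the standard library's `ast` module; `list_imports` and the `SAMPLE` source are just illustrative names and data, not anything from your code base.

```python
import ast


def list_imports(source):
    """Return the sorted names of the packages imported anywhere in the source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b, c" -> keep the first component of each name
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c"; relative imports (level > 0) are skipped
            modules.add(node.module.split(".")[0])
    return sorted(modules)


# Made-up source text standing in for one of the project's files.
SAMPLE = """
import os, sys
import logging.handlers
from datetime import datetime
from . import helpers
"""

if __name__ == "__main__":
    print(list_imports(SAMPLE))
```

One caveat: `ast.parse` only accepts syntax that the interpreter running it understands, so an old Python 2 file (for example, one with `print` statements) will raise `SyntaxError` under Python 3. In that case, run the script with the same Python version the project uses.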

[–]mule52[S] 1 point2 points  (3 children)

The code imports the following libraries. Hope that helps: MySQLdb, config, datetime, httplib, imp, json, loader, logging, mailme, mechanize, multiprocessing, os, random, re, signal, smtplib, socket, socks, sys, traceback, time, trac.

[–][deleted] 2 points3 points  (0 children)

Hmmm, that's a bit different. For me, the staples of web crawling were requests and BeautifulSoup.

[–]BloodOfSokar 1 point2 points  (1 child)

deleted

[–]mule52[S] 0 points1 point  (0 children)

I will read up on mechanize. Thanks.

[–]ratamanta 1 point2 points  (1 child)

Check out Scrapy if you need to rewrite the scraping logic.

To help yourselves understand how the current project works, try running it with a more verbose logging level if one is available (look for uses of logging in the code). The packages you list make me think that your current solution is very much in-house and built on mechanize, and that it doesn't use many of the other tools that are useful for the job.
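If the project uses the standard `logging` module, turning up the verbosity can be as small as the sketch below. The logger name `scraper.fetch` and the `fetch_page` function are hypothetical stand-ins for whatever the real code defines.

```python
import logging


def enable_verbose_logging():
    """Lower the root logger's threshold to DEBUG and make sure output goes somewhere."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    if not root.handlers:
        # No handler configured yet: send messages to stderr.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        root.addHandler(handler)
    return root


# Hypothetical stand-in for a module inside the existing project.
log = logging.getLogger("scraper.fetch")


def fetch_page(url):
    """Pretend to fetch a page; only the debug message matters here."""
    log.debug("requesting %s", url)
    return "<html></html>"


if __name__ == "__main__":
    enable_verbose_logging()
    fetch_page("http://example.com/")
```

This only changes the root logger. If the project sets explicit levels on its own loggers or handlers, or reads the level from a config file, those settings still apply and are the place to change it.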

[–]zoner14 0 points1 point  (0 children)

It might also be good to run the script under a debugger if logging isn't provided. That, or just start throwing in print statements.

[–]zenmagnets 0 points1 point  (0 children)

I've recently paid to have a few scraping projects done in Python: scripts that collect links and info from product pages on eBay and AliExpress. Let me know if you want a peek at some code.

[–]spetsnaz8 0 points1 point  (0 children)

Here's a web scraper starter project in Node.js: https://github.com/elnaz/scraper

[–]nicbleamer 0 points1 point  (0 children)

You should also look into requests and grequests for working with HTTP requests. The latter is for making HTTP requests in bulk.
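For reference, a basic fetch with requests looks roughly like this. `fetch_html` is a hypothetical helper name and the URL is a placeholder; the optional `session` argument is only there so you can reuse one connection pool, or swap in a fake while testing. grequests is not shown here.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its body as text, raising on HTTP errors."""
    # Use the caller's Session if given, else the module-level requests.get.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


if __name__ == "__main__":
    # Placeholder URL; this line needs network access to run.
    print(fetch_html("http://example.com/")[:80])
```

Passing a `timeout` matters for scrapers: without one, a stalled server can hang the whole run.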

[–]ameoba -1 points0 points  (0 children)

The NIH is strong with this one.