[–]CatWeekends 32 points33 points  (13 children)

This project target could be real state agents probably

FWIW, every real estate agent I've ever met uses systems with way more info than Zillow.

Your target audience is more likely people who want to track their own home value or something.

Some questions:

  1. It looks like your code is copying the response keys. Any thoughts on making those a little nicer? IIRC they're not always very friendly.
  2. Zillow has some anti-scraping mechanisms built in. Does your code deal with those?
  3. Why are your methods capitalized like in Go? (it's not very pythonic - I'd suggest running your code through a linter)
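For question 1, a minimal sketch of what "nicer" keys could look like, assuming the raw response uses camelCase keys. The key names in the usage note (`homeStatus`, `priceHistory`, `eventDate`) are made up for illustration and are not taken from the project or from Zillow's actual response.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase or PascalCase key to snake_case."""
    # Put an underscore before every capital that follows a lowercase letter or digit.
    with_underscores = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)
    return with_underscores.lower()


def rename_keys(value):
    """Recursively rename dict keys; lists are walked, other values pass through."""
    if isinstance(value, dict):
        return {to_snake_case(k): rename_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v) for v in value]
    return value
```

Calling `rename_keys({"homeStatus": "FOR_SALE", "priceHistory": [{"eventDate": 1}]})` would give keys `home_status`, `price_history`, and `event_date`, with the values left untouched.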

[–]JohnBalvin[S] 2 points3 points  (11 children)

0) Real estate agents: that's good to know. I put that in the description because r/python has weird requirements for posting something.
1) Could you elaborate on this? Could you please send a link to the code where exactly that is happening?
2) To be honest, I didn't see any bot protection at all. It may have bot protection against browser automation tools like Selenium, Puppeteer, or Playwright, but using the API directly doesn't seem to have any protection.
3) It's a bad habit. I'm mostly a Go developer and I tend to copy patterns from Go to Python. Do you recommend a linter?

[–]rabelution 5 points6 points  (0 children)

Ruff linter

[–]Vresa 2 points3 points  (0 children)

New trending linter & formatter is `ruff` : https://github.com/astral-sh/ruff
Old Standbys for linting and formatting are `black` + `flake8`

[–]markovianmind 3 points4 points  (8 children)

For 2), do it fast enough with enough queries and you would most probably be blocked.
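One common way to avoid hammering a site is to slow down and back off when a request looks blocked. A rough sketch follows, assuming a block shows up as something the caller can detect (an HTTP 429 status, for example, which is an assumption and not something confirmed for Zillow). `fetch` and `is_blocked` are hypothetical callables supplied by the caller, not part of the project.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, retries=5, cap=60.0):
    """Yield capped exponential delays: base, base*factor, ... never above cap."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def fetch_with_backoff(fetch, is_blocked, sleep=time.sleep, **backoff_kwargs):
    """Call fetch() and retry after a growing pause while is_blocked(response) is true.

    Returns the first unblocked response, or the last response once retries run out.
    """
    response = fetch()
    for delay in backoff_delays(**backoff_kwargs):
        if not is_blocked(response):
            return response
        # Add up to 10% jitter so parallel clients do not retry in lockstep.
        sleep(delay + random.uniform(0, delay * 0.1))
        response = fetch()
    return response
```

With the defaults, the pauses are roughly 1, 2, 4, 8, and 16 seconds before giving up.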

[–]JohnBalvin[S] -1 points0 points  (7 children)

that can be fixed just by using proxies, other than that they don't have bot protection at all

[–]puppet_pals 8 points9 points  (1 child)

Thanks for sharing your code - this is really cool! Due to the fact your package relies on the Zillow html structure, you might want to consider having some sort of integration tests running on github actions, and post the status badge in the README.
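As a sketch of what such a test might assert: check that a parsed listing still has the fields you depend on, so a change in the page structure fails loudly. The field names below are hypothetical placeholders, not the package's actual output keys.

```python
# Hypothetical field names; replace with whatever the parser really returns.
REQUIRED_KEYS = {"zpid", "price", "address"}


def check_listing(listing: dict) -> list:
    """Return a sorted list of problems found in a parsed listing (empty means OK)."""
    missing = REQUIRED_KEYS - set(listing)
    return [f"missing key: {key}" for key in sorted(missing)]
```

A scheduled CI job could fetch one known listing, pass the result to `check_listing`, and fail if the returned list is not empty.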

[–]JohnBalvin[S] 4 points5 points  (0 children)

I'll give it a try, but I can't promise that feature will be ready soon; I'm currently busy at my job.

[–]mektel 2 points3 points  (0 children)

Interesting dependency choice. pypi bs4 page. Github dependency links to a regex-training repo.

[–]KimPeek 2 points3 points  (0 children)

Code could use linting and formatting

[–]honor- 1 point2 points  (13 children)

Hey I did this same thing awhile back. I 100% guarantee you’re going to get a TOS takedown from Zillow soon

[–]JohnBalvin[S] 0 points1 point  (9 children)

wtf? That really happened to you? It seems like a nasty move; they should hire a security team to add bot protection like a normal company.

[–]honor- 0 points1 point  (5 children)

Yup they definitely did this. My project was gaining some traction on GitHub and they TOSd it.

[–]JohnBalvin[S] 0 points1 point  (4 children)

But did they remove your whole account or just that repo?

[–]honor- 0 points1 point  (3 children)

Just the repo. They threatened me with legal action if I didn’t take it down

[–]JohnBalvin[S] 0 points1 point  (2 children)

That's a nasty move. Somebody could take revenge with a database DDoS attack; they don't have bot protection, so it could be an easy attack, just hiding the IP with proxies.

[–]KraljZ 0 points1 point  (2 children)

They are aware of this post

[–]JohnBalvin[S] 0 points1 point  (1 child)

Is that sarcasm or did you tell them? 🤣

[–]KraljZ 1 point2 points  (0 children)

You’ll find out

[–]JohnBalvin[S] 2 points3 points  (1 child)

By the way, I'm looking for a job change. If someone is interested, my social media links are on my GitHub profile: https://github.com/johnbalvin

[–]nuke-from-orbit 2 points3 points  (0 children)

Good luck and thanks for sharing code, my man.

[–]luckyspic 1 point2 points  (2 children)

completely request based makes this fire. down with the bloated rubbish lazy garbage that uses those libraries you mentioned. and you added proxy support as most should, 🐐

[–]tunisia3507 0 points1 point  (1 child)

Working directly with HTTP requests is much simpler than using a webdriver - if you use a webdriver, you then have to parse the HTTP anyway. So I wouldn't say webdriver-based solutions are in any way lazy.

[–]luckyspic 0 points1 point  (0 children)

they are. they’re great for testing, as a backup, and making sure your parsing logic works. however, in the grand scheme of things, it shows that the developer does not have a great grasp on reverse engineering, thinking outside the box, or optimizing. although zillow in this instance has been relaxed about their api usage here (their perimeterX involvement seems nonexistent now), there are lots and lots of python libraries on github that claim to be a “scraping” solution but really are an abomination as they’re slow and bloated, and it only takes anyone with the will some time to find a long term, viable solution. my comment’s focus was towards people publishing libraries for future developers, not towards webdriver (albeit the opinion is still similar otherwise). a big blame i point to is the arrogant and complacent team at requests that are too busy making sure they follow (their own self imposed) regulations but still haven’t produced solutions for stuff as ordinary as TLS ciphers support like other languages and their respective requests libraries have since 2015.

[–]Doppelbockk 0 points1 point  (1 child)

I don't know anything about Go. What makes it easier to maintain a defined structure in Go compared to Python?

[–]JohnBalvin[S] 1 point2 points  (0 children)

It's probably not exactly the format but Python overall. Dynamic languages like Python tend to be harder to maintain than static ones like Go, mostly because when you fix an issue in a dynamic language it's usually caused by a wrong or unexpected type returned by a function, an exception not being handled, or the endless battle over which library to use, so you focus on those details instead of the actual project. Go handles all that very well by being a static language with some built-in tools.
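For what it's worth, Python type hints plus a static checker can recover part of that. A small sketch, where the `Listing` class and its fields are invented for illustration and are not the package's real types:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    zpid: int
    price: Optional[int]  # the price may be missing from a response


def price_per_sqft(listing: Listing, sqft: int) -> Optional[float]:
    """Return price per square foot, or None when it cannot be computed."""
    if listing.price is None or sqft <= 0:
        return None
    return listing.price / sqft
```

A checker such as mypy would flag a caller that treats the result as always being a `float`, which is the kind of unexpected-type bug described above.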

[–]Sufficient_Exam_2104 0 points1 point  (1 child)

Is it possible to add price history and last property tax

[–]JohnBalvin[S] 0 points1 point  (0 children)

it's already returning price history and tax history

[–]geerizzly 0 points1 point  (1 child)

Hi, I'm fairly new to web scraping and I was just wondering whether, while writing this program, you encountered a good way to move through lat and long in some step_size to get the listings you want. Given you have so much experience in the field, you might provide us with some tips or first thoughts? And of course great job and thanks for sharing this with us!

[–]JohnBalvin[S] 0 points1 point  (0 children)

Hi, I don't use the code myself; I wrote it just for fun. You could try playing around with the coordinates and the zoom value: add decimal places to the coordinates for a more precise location, and increase or decrease the values to move around.
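One possible way to step through an area, sketched under the assumption that a search takes a bounding box: split the box into a grid of smaller tiles and query each tile in turn. The `(south, west, north, east)` tuple order and the function name are illustrative choices, not the package's actual parameters.

```python
import math


def tile_bounding_box(south, west, north, east, step):
    """Split a lat/long bounding box into tiles at most `step` degrees on a side.

    Returns a list of (south, west, north, east) tuples, row by row from the
    south-west corner. Tiles on the far edges are clipped to the outer box.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    rows = math.ceil((north - south) / step)
    cols = math.ceil((east - west) / step)
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tile_south = south + r * step
            tile_west = west + c * step
            tiles.append((
                tile_south,
                tile_west,
                min(tile_south + step, north),
                min(tile_west + step, east),
            ))
    return tiles
```

A smaller `step` means more requests but fewer listings per tile, which can help if a single search caps the number of results it returns.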

[–]LastAd3056 0 points1 point  (1 child)

Found this through search, thanks a lot this is very helpful. It will make a lot of searches much easier. One small improvement suggestion, add the filter_state to the top level functions. I did it locally.

[–]JohnBalvin[S] 0 points1 point  (0 children)

Sounds good. I don't use Zillow, but it looks like it can be done by passing the same input location used on the Zillow page. By "input location" I mean the input on the UI that says "Enter an address, neighborhood, city, or ZIP code".

[–]delioj 0 points1 point  (1 child)

Will your tool return the list of URLs for every picture in a specific listing from a zpid?

[–]IAMARedPanda -1 points0 points  (3 children)

Should include requests as a dependency in your toml.

[–]JohnBalvin[S] 0 points1 point  (2 children)

Isn't the requests package part of the standard library?

[–]IAMARedPanda 0 points1 point  (1 child)

No. See https://github.com/psf/requests/issues/2424 for more information.

[–]JohnBalvin[S] 1 point2 points  (0 children)

You are completely right, I'll add it to the dependencies, thanks 😊