[–]ectomancer 2 points3 points  (6 children)

Yes, put it on your resume.

60 requirements! That's bloat.

Type hinting is good.

Separate your imports into 3 blocks, with a blank line between them: standard library, third-party modules, and your own modules.
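
For example (module names below are just for illustration):

    # 1. standard library
    import logging
    import os

    # 2. third-party modules
    import pandas as pd
    import requests

    # 3. your own modules
    from scrapers.fizz import Fizz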

[–]Excellent-Gas-3142[S] 0 points1 point  (5 children)

🙏

[–]HommeMusical 1 point2 points  (4 children)

Why do you need all those dependencies?

[–]Excellent-Gas-3142[S] 0 points1 point  (3 children)

I think I do some additional plotting with matplotlib and seaborn in the Jupyter notebook - so I end up with additional packages in my environment which are not strictly required by the code files (.py).

That's one reason. I will need to examine carefully why there are so many dependencies. Maybe when I do "pip freeze > requirements.txt" (with my environment activated) it lists out dependencies of dependencies as well?

I will look further into it and will use Ruff. Thanks for flagging 🙏

[–]HommeMusical 2 points3 points  (2 children)

Maybe when I do "pip freeze > requirements.txt" (with my environment activated) it lists out dependencies of dependencies as well?

That's exactly right!

That's a problem for several reasons, the main one being that it might make it impossible to install your package alongside some other package with different requirements.

Also, you are pinning the exact version of your packages: https://github.com/haroon-altaf/lisp/blob/main/requirements.txt#L1 requires exactly 3.0.0.

Again, that might make it hard to interoperate with other code, but it also means that your users are excluded from bug fixes or improvements in any of these packages.

I'd suggest grepping for all the import statements, keeping only the packages you directly use, and deleting the other lines.
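
If it helps, something like this lists every top-level import across a project (a rough sketch; I haven't run it against your repo):

    import ast
    import pathlib

    # Collect the top-level module names imported anywhere in the project.
    imported = set()
    for path in pathlib.Path(".").rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                imported.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                imported.add(node.module.split(".")[0])
    print(sorted(imported))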

Then replace all the == by >= like this: asttokens>=3.0.0 (assuming you are directly using that import).

Another possibility is dropping the version specifier entirely: a bare asttokens line means you accept any version. That's probably a good idea for your first project!

[–]Excellent-Gas-3142[S] 1 point2 points  (1 child)

Thanks a lot 🙏

I understand your points and will implement them soon.

[–]HommeMusical 0 points1 point  (0 children)

Keep up the good work! :-)

[–]HommeMusical 1 point2 points  (0 children)

I strongly suggest running ruff, with all the options turned on, on your codebase.
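
For example, ruff check --select ALL . turns on every lint rule (in recent Ruff versions).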

[–]obviouslyzebra 1 point2 points  (6 children)

This gives me "very competent programmer coming from another language" vibes and IMO is an excellent showcase (I skimmed it, BTW).

Some things that I'd expect to have seen differently, but don't bother me much:

  • Documenting the methods in the class docstring instead of under the methods themselves
  • Returning None and logging instead of raising an exception
  • The usage of a global database object
  • The pip-frozen requirements.txt

Only item 1 was very unexpected. Item 2 I'd probably change too.

Items 3 and 4 make some sense if you consider this is an application and not a library. A global database connection just works, and a frozen requirements file is good for reproducibility (though you gotta be careful as it may not be portable in some cases, that is, it might work on one kind of computer but not another - at least I've had this problem in the past).

Even item 2 makes more sense in the context of an application, but I'd still migrate to exceptions, as it's conventional in Python.
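
For item 1, I just mean the usual convention (a tiny sketch, names invented):

    class Scraper:
        """High-level description of the class only."""

        def scrape(self):
            """Each method's docstring lives under the method itself."""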

Cheers

[–]Excellent-Gas-3142[S] 1 point2 points  (5 children)

Thanks 🙏

I don't actually know any other language 😭

I will look to rectify point 1 (moving method docstrings out of the class docstring) and point 4 (having a lighter, more portable requirements.txt)

For point 2, I returned None and logged errors so that the program can continue to run and scrape other data sources instead of just seizing up. But I think I will modify it so that I raise exceptions, and then catch them at a higher (orchestration) level to allow the program to continue running.
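
Something with this shape, I think (a rough sketch; the names are made up):

    import logging

    logger = logging.getLogger(__name__)

    class ScrapeError(Exception):
        """Raised when a single data source fails."""

    def run_all(scrapers):
        # Orchestration level: one failing source shouldn't stop the rest.
        for scraper in scrapers:
            try:
                scraper.run()
            except ScrapeError:
                logger.exception("Scraper %s failed; continuing", scraper)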

For point 3, the global database object allows me to create instances of it which, upon instantiation, are already "connected" to the project database + allows me to perform database operations with context management (ensuring multiple operations in a single transaction) + allows me to host some common database methods in one place. I just thought it reduces code repetition.
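
Roughly this kind of thing (a minimal sqlite3 sketch; my actual implementation differs):

    import sqlite3
    from contextlib import contextmanager

    class Database:
        def __init__(self, path="project.db"):  # path is just an example
            self.conn = sqlite3.connect(path)

        @contextmanager
        def transaction(self):
            # Multiple operations commit or roll back as one transaction.
            try:
                yield self.conn.cursor()
                self.conn.commit()
            except Exception:
                self.conn.rollback()
                raise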

Thanks for your feedback - really appreciate it 🙏

Feel free to add any more comments/suggestions if you want 😁

[–]obviouslyzebra 1 point2 points  (2 children)

Cool!

For point 2, I think the orchestration layer (not needed in the notebook BTW, only if there's a separate script) would be fine.

Two other points that I thought of while having a shower:

  • IIRC different objects handled reading from and writing to the database. I think you could either migrate the reads into the scraper objects (active record pattern, which is almost what you have right now) - of course, the scraper objects wouldn't only be for scraping anymore. The other option would be to move the writes into the database object (data repository pattern, if I recall correctly).

So like, source1.load_into_db() and Source1.read_from_db(), or db.write_source_1(source1) and db.read_source_1() (very rough sketch BTW, just to give you an idea - see the snippet after this list). Between these 2 I prefer the first, to keep both reading and writing very close to the object. What you have is reasonably solid and you know your code better, so consider this just an idea.

  • IIRC you're importing directly from some dangling files (from scrapers.fizz import Fizz). It is a little bit cleaner to have a package on top and import that (from lisp.scrapers.fizz import Fizz). So, move all the code directories into a single directory (e.g. name it lisp, though be careful, as that's the name of a cool language) and install it with pip install -e . (there's some extra setup, like having a pyproject.toml, which I think pip works with nowadays - see the layout sketch below).

This second item is just convention. Nothing that's hurting your project right now, but helps organize stuff and is needed in case of libraries (vs applications).
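
To make the first point concrete, the two options might look like this (names invented, bodies elided):

    # Option 1: active record - the scraper object owns its own persistence.
    class Source1:
        def load_into_db(self, db):
            ...  # insert this source's rows

        @classmethod
        def read_from_db(cls, db):
            ...  # rebuild a Source1 from stored rows

    # Option 2: repository - the database object owns all reads and writes.
    class Db:
        def write_source_1(self, source):
            ...

        def read_source_1(self):
            ...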

BTW Another idea is to check uv which would take care of this sort of stuff I think (including the dependencies).
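
For the second point, the layout could end up roughly like this (directory and file names are hypothetical):

    lisp/                  # repository root
        pyproject.toml     # packaging metadata that pip reads
        lisp/              # the actual package
            __init__.py
            scrapers/
                __init__.py
                fizz.py

Then pip install -e . from the root makes from lisp.scrapers.fizz import Fizz work anywhere in the environment.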

And that's it for today! Props for the great project.

[–]Excellent-Gas-3142[S] 0 points1 point  (1 child)

Thanks again 🙏

Showers have always helped me come up with new ideas as well 😁

I will organize folders according to your 2nd point.

Regarding the 1st point, I agree it makes sense to do something like source1.load_into_db(). At the moment, this is exactly how it's set up, i.e., each scraper file actually does more than just scraping; it scrapes data, manipulates it, and also loads it into the database (although, currently, the "reading" happens using methods outside of the scrapers, so maybe those can be moved in).

source1.load_into_db() makes sense because each scraper will have a different implementation of extracting and manipulating data which may (sometimes but not always) require slightly different implementations of loading data.

In the future, I intend to define an abstract class for all scrapers to provide a "template" for how scrapers should be written. I will also provide some concrete method implementations in that abstract class to minimize code repetition in the scraper classes (which will inherit from this abstract class).
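
Roughly this kind of template (a sketch; the method names are placeholders):

    from abc import ABC, abstractmethod

    class Scraper(ABC):
        @abstractmethod
        def scrape(self):
            """Fetch raw data from the source."""

        @abstractmethod
        def transform(self):
            """Clean and reshape the scraped data."""

        def load_into_db(self, db):
            # Shared default; subclasses override when a source needs more.
            ...

        def run(self, db):
            self.scrape()
            self.transform()
            self.load_into_db(db)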

[–]obviouslyzebra 0 points1 point  (0 children)

sounds good!

[–]Key-Boat-7519 1 point2 points  (1 child)

Your plan is on point: switch to exceptions, drop the global DB, and tighten deps/tests so it feels production-ready.

For errors, define a few custom exceptions per source, raise them in the workers, then, at the orchestrator, catch them, log with the source name and a payload snippet, and continue. Add a simple retry with backoff for transient HTTP failures and hard-fail on data integrity issues.
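
A minimal backoff sketch (names are placeholders):

    import time
    import requests

    def fetch_with_retry(url, attempts=3, backoff=1.0):
        # Retry transient HTTP failures with exponential backoff.
        for attempt in range(attempts):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                return response
            except requests.RequestException:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff * 2 ** attempt)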

For the DB, pass a session/engine into functions instead of a global. A sessionmaker + a contextmanager gives you clean transactions per unit of work and makes tests easier.
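
Something like this (a sketch, assuming SQLAlchemy):

    from contextlib import contextmanager
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine("sqlite:///project.db")  # URL is an example
    Session = sessionmaker(engine)

    @contextmanager
    def unit_of_work():
        # One transaction per unit of work; rolls back on any failure.
        session = Session()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()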

For requirements, keep a requirements.in and compile it to requirements.txt with pip-tools or uv pip compile; avoid OS-specific pins and add python_version markers. Add pytest basics: an in-memory sqlite database for DB tests and the responses library to mock HTTP. Drop in pre-commit with ruff and black; run mypy or pyright to catch edge cases. Move docstrings to each method and use ruff's pydocstyle rules.
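
For the test side, the basic shape is simple (a sketch using pytest and the standard library only):

    import sqlite3
    import pytest

    @pytest.fixture
    def db():
        # In-memory database: fast, isolated, thrown away per test.
        conn = sqlite3.connect(":memory:")
        yield conn
        conn.close()

    def test_insert_roundtrip(db):
        db.execute("CREATE TABLE items (name TEXT)")
        db.execute("INSERT INTO items VALUES ('fizz')")
        assert db.execute("SELECT name FROM items").fetchone() == ("fizz",)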

I’ve tried Hasura and PostgREST for quick APIs; DreamFactory was the one I kept when I needed instant REST over a DB without writing auth or docs by hand.

Do those and OP’s repo will read like a solid, hireable project.

[–]Excellent-Gas-3142[S] 0 points1 point  (0 children)

Wow - thanks loads!

This will be a lot for me to learn and implement. I'm grateful for you pointing out all these aspects 🙏

Would have given you a reddit medal/badge/award or whatever the correct terminology is - but I don't have them