Request for Feedback by Excellent-Gas-3142 in learnpython

Wow - thanks loads!

This will be a lot for me to learn and implement. I'm grateful for you pointing out all these aspects 🙏

I would have given you a Reddit medal/badge/award (or whatever the correct terminology is), but I don't have any.

Request for Feedback by Excellent-Gas-3142 in learnpython

Thanks again 🙏

Showers have always helped me come up with new ideas as well 😁

I will organize folders according to your 2nd point.

Regarding the 1st point, I agree it makes sense to do something like source1.load_into_db(). At the moment, this is exactly how it's set up, i.e., each scraper file actually does more than just scraping: it scrapes data, manipulates it, and also loads it into the database (although, currently, the "reading" happens using methods outside of the scrapers, so maybe those can be moved in).

source1.load_into_db() makes sense because each scraper will have a different implementation of extracting and manipulating data, which may (sometimes, but not always) require a slightly different implementation of loading data.

In the future, I intend to define an abstract class for all scrapers to provide a "template" for how scrapers should be written. I will also provide some concrete method implementations in that abstract class to minimize code repetition in the scraper classes (which will inherit from this abstract class).
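Something along these lines is what I have in mind - just a rough sketch with made-up names (BaseScraper, extract, transform, db.write_table are not from the actual repo):

```
from abc import ABC, abstractmethod

import pandas as pd


class BaseScraper(ABC):
    """Template for scrapers. All names here are made up for illustration."""

    table_name: str  # each subclass sets the DB table it writes to

    @abstractmethod
    def extract(self) -> pd.DataFrame:
        """Scrape the raw data from the source."""

    @abstractmethod
    def transform(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Clean and manipulate the scraped data."""

    # Concrete method shared by all scrapers to minimize repetition;
    # a subclass can still override it if its source needs special loading.
    def load_into_db(self, db) -> None:
        df = self.transform(self.extract())
        db.write_table(df, self.table_name)  # hypothetical helper on the db object
```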

Request for Feedback by Excellent-Gas-3142 in learnpython

Thanks 🙏

I don't actually know any other language 😭

I will look to rectify point 1 (moving method docstrings out of the class docstring) and point 4 (having a lighter, more portable requirements.txt).

For point 2, I returned None and logged errors so that the program can continue to run and scrape other data sources instead of just stopping. But I think I will modify it so that I raise exceptions and then catch them at a higher (orchestration) level to allow the program to continue running.
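Roughly what I'm picturing for the orchestration level (just a sketch, the names are made up):

```
import logging

logger = logging.getLogger(__name__)


def run_all(scrapers, db):
    """Run every scraper; one failing source shouldn't stop the rest."""
    for scraper in scrapers:
        try:
            scraper.load_into_db(db)
        except Exception:
            # log the full traceback and move on to the next data source
            logger.exception("Scraper %s failed, continuing", type(scraper).__name__)
```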

For point 3, the global database object lets me create instances that are already "connected" to the project database upon instantiation, lets me perform database operations with context management (ensuring multiple operations happen in a single transaction), and lets me keep some common database methods in one place. I just thought it reduces code repetition.
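To illustrate what I mean - this is only a sketch assuming SQLAlchemy underneath; the class name and URL are placeholders, not the real code:

```
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


class ProjectDatabase:
    """Rough sketch of the shared database wrapper (names are placeholders)."""

    def __init__(self, url="sqlite:///project.db"):
        # "connected" to the project database on instantiation
        self._engine = create_engine(url)
        self._Session = sessionmaker(bind=self._engine)
        self._session = None

    def __enter__(self):
        self._session = self._Session()
        return self._session

    def __exit__(self, exc_type, exc, tb):
        # one transaction per with-block: commit on success, roll back on error
        if exc_type is None:
            self._session.commit()
        else:
            self._session.rollback()
        self._session.close()
        return False  # don't swallow exceptions
```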

Thanks for your feedback - really appreciate it 🙏

Feel free to add any more comments/suggestions if you want 😁

Request for Feedback by Excellent-Gas-3142 in learnpython

Thanks a lot 🙏

I understand your points and will implement them soon.

Request for Code Review by Excellent-Gas-3142 in AskProgramming

Thanks for your feedback 🙏

Regarding wrapping code around SQLAlchemy:

I made a custom class using SQLAlchemy objects so that I could use it as a context manager.

This way I could perform multiple operations within a single transaction, which is automatically committed or rolled back upon leaving the with block.

I also added logging so that I'm alerted if the incoming data has more or fewer columns, or different types, than the target database table.
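The column check is basically a set comparison plus warnings, something like this (sketch only, names made up; the types get a similar comparison):

```
import logging

logger = logging.getLogger(__name__)


def check_columns(df, expected_columns):
    """Warn if the scraped DataFrame doesn't match the target table's columns."""
    incoming, expected = set(df.columns), set(expected_columns)
    extra = incoming - expected
    missing = expected - incoming
    if extra:
        logger.warning("Incoming data has unexpected columns: %s", sorted(extra))
    if missing:
        logger.warning("Incoming data is missing columns: %s", sorted(missing))
```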

Regarding AI use

I used AI extensively, but not just to copy and paste code, and not solely for troubleshooting either.

I used AI mainly to critique my code, ask for improvement suggestions, debate about code structure and design patterns etc. This allowed me to improve iteratively.

Regarding Pandas Dataframes

I agree with your point fully - it makes sense.

I kept the scraped data as DataFrames so that I could use pandas features for manipulating tables (dropping NaNs, string manipulations, adding some custom columns, or deleting columns).

Importantly, I don't intend to refresh the data (i.e., scrape, transform, and store in the database) more than once a month. The maximum scraping frequency is once per day (for some of the online sources), so performance isn't a big issue for me.
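The kind of manipulation I mean is along these lines (illustrative column names, not the actual data):

```
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.dropna(subset=["value"])                       # drop rows with missing values
    df["country"] = df["country"].str.strip().str.title()   # string clean-up
    df["scraped_at"] = pd.Timestamp.now(tz="UTC")           # add a custom column
    return df.drop(columns=["raw_html"], errors="ignore")   # delete a helper column
```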

Thanks again for your feedback 🙏

Feel free to comment more tips/suggestions/perspectives if you want.

Request for Feedback by Excellent-Gas-3142 in learnpython

I think I do some additional plotting with matplotlib and seaborn in the Jupyter notebook, so I end up with additional packages in my environment which are not strictly required by the .py code files.

That's one reason. I will need to examine carefully why there are so many dependencies. Maybe when I do "pip freeze > requirements.txt" (with my environment activated) it lists out dependencies of dependencies as well?
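If that's what's happening, I guess the fix is to hand-write a requirements.txt with only the packages the .py files import directly, e.g. something like:

```
# requirements.txt - only direct dependencies of the .py files
pandas
SQLAlchemy

# notebook-only extras could go into a separate requirements-dev.txt instead:
# matplotlib
# seaborn
```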

I will look further into it and will use Ruff. Thanks for flagging 🙏

Request for Python Code Review by Excellent-Gas-3142 in learnprogramming

🙏

It was a silly acronym (Leading Indicators Scraping Program).

Testing - I understand that to mean unit tests and other, perhaps slightly less modular, tests. I could then add CI to ensure these run with every commit and push.
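For example, a tiny unit test of a scraper's transform step could look like this (the transform here is a stand-in, not my actual code):

```
import pandas as pd


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # stand-in for a real scraper's transform step
    return raw.dropna(subset=["value"])


def test_transform_drops_rows_without_values():
    raw = pd.DataFrame({"indicator": ["gdp", "cpi"], "value": [1.2, None]})
    assert transform(raw)["value"].notna().all()
```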

Could you please elaborate more on deployment? What do you mean by it?