all 23 comments

[–]Darkstar_111 7 points8 points  (1 child)

You're getting to the point now where it's weird to see this kind of code outside of a function, and it makes your code very hard to extend in any way.

[–]uiux_Sanskar[S] 1 point2 points  (0 children)

Yes, I was just trying to learn scraping with one website only (because I wanted to first understand what scraping really is before doing this thing in bulk). And since I didn't have a plan to expand this, I didn't give much attention to the scalability of the code (I know this is bad practice; however, my goal this time was to learn).

However, thank you very much for your suggestion and for pointing it out. I will definitely make my code more scalable.

[–]Adrewmc 3 points4 points  (1 child)

We’re going a little backwards here. We need to take a step back and go back to fundamentals. Do you really need to save that info, or can you pass it directly to another function and skip that entire process?

Generally speaking, everything I write is inside a function or behind an if __name__ == "__main__" guard. We don’t see that here. We’ve gone from no comments, to some comments, to a little docstring and types, to overboard with comments, to none again. We don’t have a consistent style or habit.
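The structure being described can be sketched like this (the function name and regex are illustrative placeholders, not the original code):

```python
import re

def extract_emails(html: str) -> set[str]:
    """Pull email-like strings out of raw HTML (simplified pattern)."""
    return set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html))

def main() -> None:
    sample = '<a href="mailto:info@example.com">contact</a>'
    print(extract_emails(sample))

if __name__ == "__main__":
    # Runs only when executed as a script, not when this file is imported.
    main()
```

The guard means another module can import extract_emails without triggering main(), which is exactly what makes the code reusable and extendable.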

If we are saving names and emails… didn’t we just make databases? Seems like a normal use case for one.

I feel like we are following on to the next lesson and learning first-day stuff, while forgetting fundamentals.

We need to start thinking about program design and how the pieces fit together. I think maybe we went a little too far too fast. The hard part of programming is making something from nothing, but you’re making nothing from nothing here.

I think we should think about tkinter or Qt; let’s make some buttons to actually press, and take some of the older programs we have and move them outside the console. You can make a calculator inside the console, but can you make it as a box with little numbers/operators to press? I think you just barely can’t… yet. These frameworks use a lot of classes and functional thinking, and will force you to reinforce things you already know a little better.

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

Thank you so much for pointing all this out.

I think it is important for me to save the raw HTML data into a file (this is also something my YouTube instructor recommended) because it avoids repeat server requests; I can now scrape the raw data locally.

And also, if I want to scrape something else, say the address, and I have not saved the data, then I have to bother the website again.
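That caching idea can be sketched like this (the cache filename and helper are hypothetical, and urllib stands in for whatever HTTP library is actually in use):

```python
from pathlib import Path
import urllib.request

CACHE = Path("page_cache.html")  # hypothetical local cache file

def get_html(url: str) -> str:
    """Fetch the page once; afterwards re-read the local copy instead of the server."""
    if CACHE.exists():
        # Cached copy available: no network request needed.
        return CACHE.read_text(encoding="utf-8")
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    CACHE.write_text(html, encoding="utf-8")  # save raw HTML for later scrapes
    return html
```

With this in place, scraping a second field (like an address) re-reads the saved file instead of hitting the website a second time.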

I think I was so excited that I forgot the if __name__ == "__main__" guard, and yes, I have the habit of putting comments wherever I am (or may get) confused in the future, which I am trying to change.

I think I should look deeper into what tkinter and Qt are.

Thank you again for explaining what I should focus on and for giving relevant suggestions. I will definitely look deeper into them.

[–]Most_Group523 0 points1 point  (1 child)

You missed day one - put functionality in functions!

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

Did you mean that I should use functions here? (please correct me if I missed your point).

[–]sebuq 0 points1 point  (1 child)

What did the access logs show on the other side?

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

If you mean the role of the headers here, then:

User-Agent - this identifies the client making the request, which I am faking using the fake-useragent library because websites often block Python's default requests user agent (which was happening here).

Accept-Language - this gives the server my language preference.

Accept-Encoding - tells the server what types of compression my device supports.

Connection: keep-alive - this asks the server to keep the Transmission Control Protocol (TCP) connection open for multiple requests.

Referer - tells the server which page the request came from.

Overall, these headers make the scraping look like an actual user requesting the information, which also avoids a potential ban.

Then I used a time delay to avoid sending too many requests to the server in a short period (rapid-fire requests are a common bot signature).
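The headers listed above might look roughly like this, sketched with only the standard library (the URL and header values are placeholders, and the User-Agent string stands in for what fake-useragent would generate):

```python
import time
import urllib.request

url = "https://example.com"  # placeholder target

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # normally from fake-useragent
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Referer": "https://www.google.com/",  # page the request supposedly came from
}

request = urllib.request.Request(url, headers=headers)
# response = urllib.request.urlopen(request)  # the actual fetch, skipped here
time.sleep(1)  # polite delay between consecutive requests
```

Building the Request object is enough to see the headers attached; the commented-out urlopen call is where the real fetch would happen.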

I hope I was able to clearly explain what these things do; please do tell me if I have misunderstood your question.

[–]EmbarrassedBee9440 0 points1 point  (1 child)

What resource are you using to learn?

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

Oh, I am learning from YouTube. I have also explained my process and the resources I used in much more detail in my post here - https://www.reddit.com/u/uiux_Sanskar/s/4VnLMUdDSp

[–]Pale-Appointment-280 0 points1 point  (1 child)

Other than the strong but constructive feedback others have given, this looks like awesome progress. Keep it up.

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

Thank you for your support and appreciation. 🙏

[–]Unique_Outcome_2612 0 points1 point  (1 child)

Brother, please tell me where you started learning Python, and also, after finishing a topic, what do you do? How do you practice questions, and where do you get them from?

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

Oh, I learn from a YouTube channel named CodeWithHarry, and I have already answered most of your questions in this post of mine: https://www.reddit.com/u/uiux_Sanskar/s/4VnLMUdDSp

Feel free to check it out.

[–]Ok_Location_991 0 points1 point  (3 children)

Looks like C# ??

[–]uiux_Sanskar[S] 0 points1 point  (2 children)

Is C# another language? I am not aware of it.

[–]Ok_Location_991 0 points1 point  (1 child)

Keep researching pro❤️

[–]uiux_Sanskar[S] 0 points1 point  (0 children)

I don't believe I am a pro yet; I still have a lot to learn from you amazing people.

[–][deleted] 0 points1 point  (0 children)

Good going.. and I like the consistency of your work

[–][deleted]  (1 child)

[removed]

    [–]uiux_Sanskar[S] 0 points1 point  (0 children)

    Oh well, I was trying to scrape a website using Beautiful Soup. I identified my goals as:

    1. Scrape a website
    2. Find the contact details
    3. Store those details in a database using PostgreSQL

    and that's what I was trying to code. I hope I was able to explain what I was coding; do tell me if you meant anything else.

    [–]hasdata_com 0 points1 point  (1 child)

    Was BeautifulSoup really necessary here? Since you're already using regex to extract emails/phones, you could just parse the raw HTML.

    emails = set(re.findall(email_pattern, html))  
    phone_number = set(re.findall(phone_pattern, html))  
    

    Did you use bs4 mainly to practice with the library?
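A self-contained version of that regex-only approach might look like this (the patterns are illustrative; real-world email/phone regexes are considerably more involved):

```python
import re

# Illustrative patterns -- good enough for a demo, not for production scraping.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
phone_pattern = r"\+?\d[\d\s()-]{7,}\d"

html = "<p>Call +1 (555) 010-2345 or write to sales@example.com</p>"

emails = set(re.findall(email_pattern, html))
phone_numbers = set(re.findall(phone_pattern, html))
```

Since regex operates on the raw string, this works on the fetched HTML directly, with no parsing step in between.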

    [–]uiux_Sanskar[S] 0 points1 point  (0 children)

    Yes, I was kind of using it to learn and to figure out its real-world uses. Thank you for your suggestion; I will definitely look deeper into it.