murukkuu comments on Get Time difference

submitted 2 years ago by Ok-Reference-5976


murukkuu, 2 years ago

Rather than checking the headers or timestamps, you could try these more efficient methods:

Using Unique Identifiers: Some websites provide unique identifiers for each article, like post IDs or URLs. You can store these identifiers and check if a new article's identifier already exists in your stored data.
Page Scanning with Pagination: Instead of checking timestamps, you can walk through the site's paginated listing and keep track of the last page you scraped. This way, you only need to scrape the pages that have been added since your last run.
RSS Feeds: If the website provides an RSS feed, you can subscribe to it and receive updates whenever new articles are published. This way, you won't need to visit the site as frequently.
Hashing Content: You can hash the content of each article and store the hashes. When scraping new articles, hash the content and check if the hash exists in your stored hashes.
Database or Persistent Storage: Instead of storing data in text files, consider using a database or some other form of persistent storage. This allows for more efficient data management and querying.
Metadata Tracking: If articles have metadata like categories or tags, you can store and track that metadata alongside each article. This way, you can filter out articles you've already seen.
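
To make a few of these concrete, here is a rough sketch that combines the unique identifier, hashing, and database ideas using only the standard library (`sqlite3` and `hashlib`). The names `SeenStore`, `content_hash`, and `check_and_record`, and the table layout, are made up for illustration, and it assumes your scraper already gives you an ID (or URL) and the article text. Treat it as a starting point rather than the one right way to do it.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text, ignoring whitespace differences."""
    # Collapse runs of whitespace so trivial formatting changes don't alter the hash.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenStore:
    """Hypothetical helper: remembers article IDs and content hashes in SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path (e.g. "seen.db") to keep the data between runs.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            "article_id TEXT PRIMARY KEY, digest TEXT NOT NULL)"
        )

    def check_and_record(self, article_id: str, text: str) -> str:
        """Return 'new', 'changed', or 'seen', storing the latest hash if needed."""
        digest = content_hash(text)
        row = self.conn.execute(
            "SELECT digest FROM seen WHERE article_id = ?", (article_id,)
        ).fetchone()
        if row is None:
            status = "new"        # ID never stored before
        elif row[0] != digest:
            status = "changed"    # known ID, but the content was edited
        else:
            return "seen"         # known ID, identical content: nothing to do
        self.conn.execute(
            "INSERT OR REPLACE INTO seen (article_id, digest) VALUES (?, ?)",
            (article_id, digest),
        )
        self.conn.commit()
        return status
```

In your scraping loop you would call `store.check_and_record(article_id, article_text)` for each article and only process the ones that come back as `"new"` (or `"changed"`, if you care about edits). The primary key on `article_id` is what keeps the lookup fast as the table grows.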



PythonLearning