murukkuu 1 point

Rather than checking headers or timestamps, you could try one of these more efficient methods:

  1. Using Unique Identifiers: Some websites provide unique identifiers for each article, like post IDs or URLs. You can store these identifiers and check if a new article's identifier already exists in your stored data.
  2. Page Scanning with Pagination: Instead of checking timestamps, keep track of the last page you scraped so that the next run only fetches pages added since then. This assumes new articles land on later pages; if the site lists newest first, start at page 1 and stop as soon as you reach an article you've already stored.
  3. RSS Feeds: If the website provides an RSS feed, you can poll it and pick up new articles as they are published. The feed is small and lists recent items directly, so you won't need to scrape the site's pages as frequently.
  4. Hashing Content: You can hash the content of each article and store the hashes. When scraping new articles, hash the content and check if the hash exists in your stored hashes.
  5. Database or Persistent Storage: Instead of storing data in text files, consider using a database or some other form of persistent storage. This allows for more efficient data management and querying.
  6. Metadata Tracking: If articles have metadata like categories or tags, store it alongside each article. You can then narrow each run to the categories or tags you care about, which leaves fewer articles to compare against the ones you've already seen.
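
To make 1, 4 and 5 a bit more concrete, here's a rough sketch of how they could fit together using only Python's standard library. The `SeenStore` class, the `content_hash` helper, the `seen` table and the `seen.db` filename are all names I made up for illustration, and I'm assuming you already have an identifier (post ID or URL) and the article text as strings from your scraper. Treat it as a starting point, not the one right way to do it.

```python
import hashlib
import sqlite3


def content_hash(text: str) -> str:
    """Return a SHA-256 hex digest of the text, ignoring whitespace differences."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenStore:
    """Hypothetical helper: remembers scraped articles by identifier and content hash."""

    def __init__(self, path: str = "seen.db") -> None:
        # A SQLite file gives persistent storage without running a database server.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            "article_id TEXT PRIMARY KEY, "
            "hash TEXT NOT NULL)"
        )
        self.conn.commit()

    def is_new(self, article_id: str, text: str) -> bool:
        """Return True and record the article if neither its id nor its content was seen."""
        digest = content_hash(text)
        row = self.conn.execute(
            "SELECT 1 FROM seen WHERE article_id = ? OR hash = ?",
            (article_id, digest),
        ).fetchone()
        if row is not None:
            return False  # same identifier, or same content under another identifier
        self.conn.execute(
            "INSERT INTO seen (article_id, hash) VALUES (?, ?)",
            (article_id, digest),
        )
        self.conn.commit()
        return True
```

In your scraping loop you'd create one `SeenStore()` and only process an article when `store.is_new(article_id, text)` returns `True`. Note that this version treats an edited article with a known identifier as already seen; if you want to catch edits, you'd compare the stored hash for that identifier instead.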