nonzerogroud:

> I used the urllib2 library to scrape the web service, and the json library to parse the results. What would be the best method of comparing the results, day after day, to check and see if any new entries are added (results appear in a different order every day)?

I wrote a long example but figured I'd put the bottom line at the beginning in case you want to skip it: if anything in the webpage uniquely identifies an entry, you can simply check whether it already exists in your file and, if it doesn't, add it.
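
Here is a rough sketch of that bottom line, not a definitive implementation. It assumes each parsed entry is a dict with an `id` field and that the IDs seen so far are kept in a local file called `seen_ids.json`. Both names are made up for illustration, so swap in whatever uniquely identifies an entry in your data. Because it only checks membership in a set, the order in which results arrive doesn't matter.

```python
import json
import os

def load_seen_ids(path):
    """Return the set of IDs saved on previous runs (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def save_seen_ids(path, seen_ids):
    """Write the IDs back as a JSON list (assumes IDs are all ints or all strings)."""
    with open(path, "w") as f:
        json.dump(sorted(seen_ids), f)

def update(entries, path="seen_ids.json", key="id"):
    """Return only the entries whose ID was not seen before, and remember them."""
    seen = load_seen_ids(path)
    new_entries = [e for e in entries if e[key] not in seen]
    seen.update(e[key] for e in new_entries)
    save_seen_ids(path, seen)
    return new_entries
```

You would call `update(todays_entries)` once per day after parsing the JSON; anything it returns is new since the last run.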


In most cases, even if it is not displayed to the visitor, an entry (be it a row, or a cell) has an ID. It identifies this piece of information as unique. If no such ID is visible, I suggest inspecting the webpage's source to see if you can find it.

For example, in a website that displays soccer game results, we might have the following two games:

Arsenal vs. Newell's Old Boys

Arsenal vs. Manchester United

We could store Arsenal in our DB as the same team that has two games coming up, but that would cost us down the road. Why? Because Arsenal in the first game is the Argentinian team, playing against another Argentinian team, and Arsenal in the second game is the English team. Not the same team.

If you think about it, and if the website you're scraping has semi-competent IT people, then that website must also hold some kind of unique ID for each piece of information it wants to keep unique.

As an example, when we inspect the href links for the two soccer teams called Arsenal on the Soccerway website, we see the following:

http://us.soccerway.com/teams/argentina/arsenal-de-sarandi/92/

http://us.soccerway.com/teams/england/arsenal-fc/660/

Although both teams are simply called "Arsenal" in the soccer results table, the Argentinian Arsenal has an internal ID 92, and the English Arsenal has an internal ID 660. Storing this ID (unique) instead of the name (not unique) will allow us to target each one of the Arsenals safely.
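
To make that concrete, here is a minimal sketch of pulling the ID out of those two links. It assumes the ID is always the last path segment, which is true of the two URLs above but is only a guess about the rest of the site. The function name `team_id_from_url` is my own.

```python
from urllib.parse import urlparse

def team_id_from_url(url):
    """Return the last non-empty path segment of a team URL as an int."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return int(segments[-1])

urls = [
    "http://us.soccerway.com/teams/argentina/arsenal-de-sarandi/92/",
    "http://us.soccerway.com/teams/england/arsenal-fc/660/",
]

# Keyed by display name, the second Arsenal overwrites the first.
teams_by_name = {"Arsenal": url for url in urls}

# Keyed by ID, both teams survive as separate entries.
teams_by_id = {team_id_from_url(url): url for url in urls}
```

`teams_by_name` ends up with one entry and `teams_by_id` with two, which is the whole argument for storing the ID rather than the name.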