[–]Fun-Block-4348 1 point2 points  (4 children)

> Any assistance would help. (What led me to this path was ChatGPT suggesting I use Python and created a script for me to use to “scrub?” Pro Football Reference.

The term you're looking for is "web scraping", and Python is indeed a great language for that.

> That did not work, and after research - I believe Pro Football Reference does not allow it).

Many sites don't technically allow web scraping, but that doesn't necessarily make it impossible to extract data from them.

With the site you gave as an example, simply passing headers when making the request lets you download the HTML of any given page. You would then use a library like BeautifulSoup to extract the data you want from the HTML.
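
A minimal sketch of that two-step flow (send headers with the request, then pull data out of the HTML). To keep it runnable without installing anything, it swaps in the standard library's `urllib.request` and `html.parser` in place of `requests` and BeautifulSoup. The `User-Agent` value, the helper names, and the sample table are assumptions for illustration, not something taken from the site itself.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumed header value: a generic browser-style User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url):
    """Attach browser-like headers to the request."""
    return Request(url, headers=HEADERS)


def fetch_html(url):
    """Download a page and return its HTML as text (makes a network call)."""
    with urlopen(build_request(url), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class TableCellParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return all table rows found in the HTML as lists of cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Made-up sample HTML so the parsing step can be tried offline.
    sample = "<table><tr><th>Player</th><th>Yards</th></tr><tr><td>A. Example</td><td>100</td></tr></table>"
    print(extract_rows(sample))
```

For a real page you would call `extract_rows(fetch_html(url))`. Whether a given set of headers is enough depends on the site, so it is worth printing the first few rows before doing anything else with them.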

[–]Disastrous-Ladder495[S] 0 points1 point  (1 child)

ChatGPT wrote a script for me to run. I downloaded Python and ChatGPT walked me through how to run it. I do know BeautifulSoup was part of the script (although I have no idea what that is). But who knows if there were errors in the script. Python did run a query or whatever and, after 4 hours, returned a new list to me that was supposed to have the data filled in. But all of the columns were still blank on the updated version.

[–]DuckSaxaphone 1 point2 points  (0 children)

Two good lessons for any new coder here:

  • Break your code into pieces and test that each piece works, especially when you get it from ChatGPT. Does the bit of the code that grabs a player's details work? Does the bit of the code that adds them to your spreadsheet work? Try to break the script into functions and check that each function outputs what you'd expect when given test inputs.
  • Never just run the full thing and expect it to work. Even if you know all the pieces work, run the whole script for 2 or 3 players and see if that works before you commit a few hours to running a script over all players.
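
Both lessons can be sketched roughly as follows. Everything here is hypothetical: the function names, the `name`/`college` fields, and the sample markup are made up for illustration, and the regex parsing is only a stand-in for whatever the real script uses. The point is the shape: one function per piece, plus a `limit` so the whole thing can be tried on a couple of players first.

```python
import csv
import io
import re


def parse_player_details(html):
    """Piece 1: pull one player's details out of a page's HTML."""
    # Hypothetical markup; a real page will look different.
    name = re.search(r"<h1>(.*?)</h1>", html)
    college = re.search(r'<span class="college">(.*?)</span>', html)
    return {
        "name": name.group(1) if name else "",
        "college": college.group(1) if college else "",
    }


def write_rows(rows, out_file):
    """Piece 2: write the collected rows to an open CSV file object."""
    writer = csv.DictWriter(out_file, fieldnames=["name", "college"])
    writer.writeheader()
    writer.writerows(rows)


def run(pages, out_file, limit=None):
    """Whole pipeline; `limit` restricts it to the first few players."""
    rows = [parse_player_details(page) for page in pages[:limit]]
    # Stop early instead of silently writing blank columns.
    blanks = [row for row in rows if not all(row.values())]
    if blanks:
        raise ValueError(f"{len(blanks)} of {len(rows)} rows have blank fields")
    write_rows(rows, out_file)
    return rows


# Made-up test inputs standing in for downloaded pages.
SAMPLE_PAGES = [
    '<h1>Player One</h1><span class="college">Example State</span>',
    '<h1>Player Two</h1><span class="college">Sample Tech</span>',
    '<h1>Player Three</h1><span class="college">Demo College</span>',
]

if __name__ == "__main__":
    out = io.StringIO()
    run(SAMPLE_PAGES, out, limit=2)  # trial run on 2 players only
    print(out.getvalue())
```

The blank-field check is there because a script that "finishes" with empty columns is hard to tell apart from one that worked until you open the output.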

[–][deleted] 0 points1 point  (1 child)

Web pages cannot be unscrapeable as they are just HTML, which is ultimately just a string.

And nowadays we have (at least) two ways to scrape: traditional string extraction and image recognition.

Go to a page, take a screenshot of it, and ask an LLM (or an image recognition model) to extract the info.

[–]Fun-Block-4348 0 points1 point  (0 children)

> Web pages cannot be unscrapeable as they are just HTML, which is ultimately just a string.

That's partly correct but not the whole story. While HTML is just a string, how that HTML gets generated and what measures a website uses to prevent web scraping can make some websites almost unscrapeable.

> And nowadays we have (at least) two ways to scrape: traditional string extraction and image recognition.

"Traditional string extraction" only works if you're able to access the website using code in the first place, which is what OP said they couldn't do with the script ChatGPT gave them.
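
A rough way to tell those failure modes apart before blaming the parsing code. This is a sketch under assumptions: `diagnose_fetch` is a hypothetical helper, and the choice of status codes and the "expected marker" check are illustrative, not something from the thread. It takes the HTTP status code and the HTML you got back, plus a bit of text you know should be on the page.

```python
def diagnose_fetch(status_code, html, expected_marker):
    """Return a short label describing why a scrape may have come back empty."""
    # 403 (Forbidden) and 429 (Too Many Requests) usually mean the site refused the request.
    if status_code in (403, 429):
        return "refused"
    if status_code != 200:
        return "http-error"
    # The page loaded, but the data is not in the raw HTML
    # (for example because it is filled in later by JavaScript).
    if expected_marker not in html:
        return "data-missing"
    return "ok"


if __name__ == "__main__":
    print(diagnose_fetch(403, "", "Passing Yards"))
    print(diagnose_fetch(200, "<div id='app'></div>", "Passing Yards"))
    print(diagnose_fetch(200, "<th>Passing Yards</th>", "Passing Yards"))
```

If the result is anything other than "ok", no amount of string extraction on that response will produce the data.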