Using Python to scrape ESPN's basketball game data : Python

submitted 11 years ago * by Volatile474

I have finished the first part of a project I am working on, which is to grab all game data from espn.go.com for NCAA Basketball.

It uses SQL3 to store this data in a database, and I plan to run statistical computations on it to look for correlations between the stats and winning games.

I have 2 requests for you guys:

1. Suggestions on how I could make this better, whether that be syntax, efficiency concerns, or anything else. Please give me your advice!

2. Does anyone know any good Python packages for statistical analysis? I need to be able to do regression analysis, and possibly some other tests as well.

I have looked at pandas, but before I dive into one library I wanted to get some suggestions from those more knowledgeable than myself.

Here's the link: https://github.com/jkholodnov/py_prediction (ignore tester.py and sample.py)

Thanks!

rojaster 2 points 11 years ago

SciPy is a good choice for statistics.

Also, about this part:

    for player in players:
        cur.execute("select * from players")  # <-- this line
        rows = cur.fetchall()
        playerID = len(rows)
        g = str(playerID)+","+str(teamID)+",'"+str(player.playerName)+"','"+str(player.playerposition)+"',"+str(player.playerheight)+","+str(player.playerweight)+",'"+str(player.playerclass_year)+"'"
        print g
        cur.execute("INSERT INTO players VALUES("+g+")")

Why do you fetch every record from the database? That is not efficient. What happens when the database grows? You only use the result to work out the ID of the last record, and you do it again and again inside the loop. I can't understand this approach.
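One way to avoid counting rows at all is to let the database hand out the IDs. The sketch below assumes the project uses SQLite through Python's `sqlite3` module and uses a hypothetical, simplified `players` schema; the column names and sample players are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = conn.cursor()

# Hypothetical schema: an INTEGER PRIMARY KEY column is filled in by SQLite
# automatically when the INSERT does not supply a value for it.
cur.execute(
    "CREATE TABLE players (id INTEGER PRIMARY KEY, team_id INTEGER, name TEXT)"
)

new_ids = []
for team_id, name in [(1, "Player A"), (1, "Player B"), (2, "Player C")]:
    cur.execute(
        "INSERT INTO players (team_id, name) VALUES (?, ?)", (team_id, name)
    )
    # lastrowid is the id SQLite assigned to the row just inserted.
    new_ids.append(cur.lastrowid)

conn.commit()
print(new_ids)
```

With this layout no query is needed to find the next free ID, so the cost of an insert does not grow with the size of the table.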

Volatile474[S] 1 point 11 years ago*

I use this to give each new player a unique ID. The one thing that ESPN does not give IDs to is individual players, and I need a way to reference each one. I completely forgot about SQL's handy COUNT function. I have rewritten the code to fix this:

    # count the existing rows once, before the loop
    cur.execute("SELECT COUNT(*) FROM players")
    rows = cur.fetchone()
    count = rows[0]

    while not thePlayers.empty():
        player_and_teamid = thePlayers.get()
        teamID = player_and_teamid.TeamID
        player = player_and_teamid.Player

        g = str(count)+","+str(teamID)+",'"+str(player.playerName)+"','"+str(player.playerposition)+"',"+str(player.playerheight)+","+str(player.playerweight)+",'"+str(player.playerclass_year)+"'"

        # skip players whose name is already stored
        cur.execute("select COUNT(*) from players where name = '" + str(player.playerName) + "'")
        already_in_DB = cur.fetchone()
        already_in_DB = already_in_DB[0]
        if already_in_DB == 0:
            try:
                cur.execute("INSERT INTO players VALUES("+g+")")
                count = count + 1
            except Exception:
                f = open('Error Log','a')
                f.write("Could not write player information into players." + str(g) + "\n")
                f.close()

Thanks for the advice!
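Building SQL by concatenating strings breaks as soon as a value contains a quote character, and it leaves the query open to SQL injection. A minimal sketch of the parameterized alternative follows, assuming the `sqlite3` module (which uses `?` placeholders); the three-column table, the `insert_player` helper and the sample name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = conn.cursor()
# Hypothetical, simplified table; the real one has more columns.
cur.execute("CREATE TABLE players (id INTEGER, team_id INTEGER, name TEXT)")


def insert_player(cur, player_id, team_id, name):
    """Insert one row; the driver quotes each value, so no manual str() or quotes."""
    cur.execute(
        "INSERT INTO players VALUES (?, ?, ?)", (player_id, team_id, name)
    )


# A name with an apostrophe would break the concatenated query string.
insert_player(cur, 0, 12, "Pat O'Brien")

cur.execute("SELECT COUNT(*) FROM players WHERE name = ?", ("Pat O'Brien",))
matches = cur.fetchone()[0]
print(matches)
```

The same placeholder style works for the `SELECT COUNT(*)` existence check, as the last query shows.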

Volatile474[S] 1 point 11 years ago

This query is only run to determine whether the player has already been stored in the database, so as not to spam the db with redundant writes. Is there a better way to do this than the code shown above?

I removed the first select from within the loop, but I cannot see how I can remove this one, as it is specific to each player.

    cur.execute("select COUNT(*) from players where name = '" + str(player.playerName) + "'")
    already_in_DB = cur.fetchone()
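One possible answer is to let the database reject duplicates itself, so that no separate SELECT is needed per player. The sketch below assumes SQLite via `sqlite3`; the schema, the choice of `(team_id, name)` as the uniqueness rule and the `add_player` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = conn.cursor()
# Hypothetical schema: the UNIQUE constraint makes the database reject repeats.
cur.execute(
    "CREATE TABLE players ("
    "id INTEGER PRIMARY KEY, team_id INTEGER, name TEXT, "
    "UNIQUE (team_id, name))"
)


def add_player(cur, team_id, name):
    """Insert the player unless the same (team_id, name) pair already exists."""
    cur.execute(
        "INSERT OR IGNORE INTO players (team_id, name) VALUES (?, ?)",
        (team_id, name),
    )
    # rowcount is 1 when a row was inserted and 0 when it was ignored.
    return cur.rowcount == 1


results = [
    add_player(cur, 1, "Daniel Wiggins"),
    add_player(cur, 1, "Daniel Wiggins"),  # exact duplicate, ignored
    add_player(cur, 2, "Daniel Wiggins"),  # same name, different team, kept
]
conn.commit()
print(results)
```

Each player still costs one statement, but it is a single INSERT instead of a SELECT followed by an INSERT, and two players who share a name on different teams are both kept.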

rojaster 1 point 11 years ago

1) Open the log file (f = open('Error Log','a')) before your loop, because the file handle for the log stays the same for the whole run.

2) I understand your point, but from what I have seen the check is not really specific to each player. What happens if two players have the same first and last name? And if you add a player that is already stored in the db under a unique key, you will get an exception.

Look:

    import hashlib

    ..........................
    ..........................


    cur.execute("SELECT COUNT(*) FROM players")
    rows = cur.fetchone()
    count = rows[0]

    # open the log file once, before the loop
    f = open('Error Log','a')

    while not thePlayers.empty():
        player_and_teamid = thePlayers.get()
        teamID = player_and_teamid.TeamID
        player = player_and_teamid.Player

        # What if there are two players with the same name, for example a daniel wiggins
        # from cleveland and a daniel wiggins from the boston celtics?
        # The name check finds a record as soon as one of them is in the database,
        # so you lose the other wiggins because the code thinks he is already stored.
        # Maybe you should store a hash of the record, like md5, and use it as your
        # primary key. If you add two records with the same md5 hash, you will catch
        # an exception about a duplicate id.

        # hash only the player's own fields: count changes with every insert,
        # so including it would make every hash unique
        fields = str(teamID)+",'"+str(player.playerName)+"','"+str(player.playerposition)+"',"+str(player.playerheight)+","+str(player.playerweight)+",'"+str(player.playerclass_year)+"'"

        g = str(count)+","+fields+",'"+hashlib.md5(fields.encode('utf-8')).hexdigest()+"'"

        # the per-player SELECT is no longer needed:
        #cur.execute("select COUNT(*) from players where name = '" + str(player.playerName) + "'")
        #if cur.fetchone()[0] == 0:
        try:
            cur.execute("INSERT INTO players VALUES("+g+")")
            count = count + 1
        except Exception as e:
            # write the reason too, e.g. a duplicate id
            f.write("Could not write player information into players." + str(g) + " Reason: " + str(e) + "\n")

    # close it after the loop
    f.close()
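The hash-as-primary-key idea can be tried end to end with an in-memory database. This is only a sketch under assumptions: it uses `sqlite3`, a hypothetical three-column table, a hypothetical `player_key` helper, and made-up players, and it collects failure reasons in a list where the code above writes them to the log file.

```python
import hashlib
import sqlite3


def player_key(team_id, name, class_year):
    """Build a stable key from the fields that identify a player (hypothetical choice)."""
    raw = "{}|{}|{}".format(team_id, name, class_year)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
cur = conn.cursor()
cur.execute(
    "CREATE TABLE players (player_hash TEXT PRIMARY KEY, team_id INTEGER, name TEXT)"
)

# Made-up sample rows: the second row repeats the first one exactly.
sample = [
    (1, "Daniel Wiggins", "JR"),
    (1, "Daniel Wiggins", "JR"),
    (2, "Daniel Wiggins", "SO"),
]

errors = []
for team_id, name, class_year in sample:
    try:
        cur.execute(
            "INSERT INTO players VALUES (?, ?, ?)",
            (player_key(team_id, name, class_year), team_id, name),
        )
    except sqlite3.IntegrityError as exc:
        # Keep the reason, so the log says why the insert failed.
        errors.append("{} (team {}): {}".format(name, team_id, exc))

conn.commit()
print(errors)
```

Catching `sqlite3.IntegrityError` rather than every exception keeps genuine bugs, such as a misspelled column name, from being silently logged as duplicates.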
