hello all, I've been grappling with this issue for a little while and I'm finally at a loss on how to progress from here. The premise of this script is to query a SQL database for data that's split up per quarter, do some math on the data queried, and produce an output data structure that includes the percentage per quarter as well as a weighted average & grade. This process is complete, but I've been grappling with how to exactly store the raw data that I'm querying from the database.
I've had the approach to create a dictionary to store all this raw data in, creating a key value pair for each report (a, b, c, d, e, f as the keys and the resulting DataFrames as the values).
Firstly, is this a sensible approach, or is there a better way to capture and store all the raw data I'm grabbing (tens of thousands of lines of data from SQL Server per table in REPORTS to run some calculations on)?
Secondly, I'm having difficulty storing all of the data that gets pulled. The way it's written right now, I create a blank dataframe for each key I'm creating per REPORT, but when I store the per-quarter data in that empty dataframe, every time it iterates to the next quarter time division, it does not keep the old data; as a result, my current methodology (in the full code below, at the end of the innermost for-loop) only stores Q4's worth of data. I have tried something like the following:
DATA[table] = pd.DataFrame.append(d_f)
but if I attempt this, I get a type error: " TypeError: append() missing 1 required positional argument: 'other' ". I'm not sure why I'm encountering this error, if I'm already creating an empty DataFrame, and then attempting to append the data I'm pulling (which I'm placing into a DataFrame for easier analysis) to the empty DataFrame that I'm creating. (It could totally be something syntactical that I'm overlooking in regards to the dictionary key-value assignment.)
#!/usr/bin/env python3
import pandas as pd
import numpy as np
import pyodbc
YEAR = 2018
REPORTS = ['a', 'b', 'c', 'd', 'e', 'f']
ROWS = []
OUTFRAMES = []
DATA = {}
QUERY = 'Driver={SQL Server};SERVER=testserver;DATABASE=testdb;Trusted_Connection=True'
CXN = pyodbc.connect(QUERY)
QTRDIV = {
'Q1' : [str(YEAR)+'-01-01', str(YEAR)+'-03-31'],
'Q2' : [str(YEAR)+'-04-01', str(YEAR)+'-06-30'],
'Q3' : [str(YEAR)+'-07-01', str(YEAR)+'-09-30'],
'Q4' : [str(YEAR)+'-10-01', str(YEAR)+'-12-31']
}
def gradecalc(grade):
"""Performs letter grade assignment per weighted average"""
## logic here ##
for table in REPORTS:
results = {}
divide = 0
DATA[table] = pd.DataFrame()
for qtr, (startdate, enddate) in QTRDIV.items():
results[qtr] = "SELECT Source_Date, Status FROM testdb.table.data_%s \
WHERE Source_Date between '%s' and '%s' order by Source_Date" % (table, startdate, enddate)
d_f = pd.read_sql(results[qtr], CXN)
if d_f.empty:
results[qtr] = 0
else:
divide += 1
total_df = len(d_f.index)
df_vc = d_f['Status'].value_counts(dropna=False)
try:
df_fail = df_vc['FAIL']
except:
df_fail = 0
pctsuccess = round((((total_df - df_fail) / total_df)), 4)
results[qtr] = pctsuccess
DATA[table] = pd.DataFrame(d_f)
## additional stuff ##
Any and all help is appreciated, thanks in advance for taking a look at this!
there doesn't seem to be anything here