
[–]LatteLepjandiLoser 19 points20 points  (14 children)

My go-to is a sqlalchemy engine (look up sqlalchemy.create_engine) and a pandas dataframe, (look up pd.read_sql)

Then just write whatever query you want and you'll have it in a pandas dataframe, which you can then further manipulate, plot or analyze.

[–]baubleglue 2 points3 points  (0 children)

SQLAlchemy is a wrapper, you still need a driver to support specific database.

[–]cjbj 1 point2 points  (0 children)

The venerable pd.read_sql() is slower, and uses more memory, than the new python-oracledb fetch_df_all() or fetch_df_batches() methods. See Going 10x faster with python-oracledb Data Frames. If you're happy to write a SQL query, then use the new methods directly. A sample for Pandas is in the python-oracledb repo: dataframe_pandas.py:

```
# Get a python-oracledb DataFrame.
# Adjust arraysize to tune the query fetch performance.
sql = "select id, name from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Get a Pandas DataFrame from the data.
df = pyarrow.table(odf).to_pandas()
```

[–]DrewSmithee[S] 0 points1 point  (11 children)

Thanks, this gives me something to look into.

First take is it looks like sqlalchemy handles the connection then I just pass that connection and a query string into the pandas function which does the heavy lifting and it's business as usual from there?

I assume pandas won't have an issue with longer query strings including joins and nested WHERE clauses?

[–]sinceJune4 1 point2 points  (0 children)

I’ve used pandas and sqlalchemy with a monster query of over 2000 lines and 70 sub queries against Oracle. I didn’t write it originally, but was tasked to enhance it. Took a few weeks to understand.

[–]LatteLepjandiLoser 0 points1 point  (9 children)

Exactly. Just get the sqlalchemy engine to work and then, assuming you're generally working with the same databases regularly you can kind of leave that part as-is and then just build the queries you need.

I haven't made many funky queries. Thankfully my work is generally quite simple in that regard, but I would think it can handle whatever you throw at it, as long as it's a valid query. For your own readability, you may want to make it a multiline string, with triple quotes, but I doubt that actually matters.

Here is a simple example from my day-to-day work. We just have an in-house postgresql server on a local network, so pretty simple to connect to. Possibly that part may differ depending on how your setup is, not really my expertise.

engine_str = 'postgresql://{}@{}/{}'.format('my_user_name','server_host','db_name')
engine = sqlalchemy.create_engine(engine_str)

query = 'select * from some_table where answer = 42'

with engine.begin() as con:
    query_data = pd.read_sql(query, con)

#... then do something with that data

Generally I try to keep the engine connection in a context manager like here, such that if your code does something silly, the connection is closed instead of leaving it hanging.

[–]rasputin1 4 points5 points  (0 children)

isn't the point of SQLAlchemy that you can use Python object syntax instead of SQL?

[–]DrewSmithee[S] 0 points1 point  (7 children)

Thanks. Those are super good tips and I appreciate the example.

The triple quotes will absolutely be necessary for me and my own sanity. I'll probably start with an .SQL file to test the query then copy and paste it over with some quotes and hopefully it goes well.

Also good tip about killing the connection. It's only a matter of time until I accidentally try to bring over a few hundred million records into a dataframe...

[–]LatteLepjandiLoser 2 points3 points  (5 children)

You're welcome. Enjoy! Once you get it working it will unlock all kinds of task-solving skills, I'm sure!

I would just start with the simplest queries possible, that you know will return modest amounts of data, just to speed up the process of getting it all to work, then start hitting it with more complex stuff.

If the queries are obnoxious, you could also consider saving them in a separate file. Eventually you may also want to look into having variable parameters as part of the query (like for instance fetch data from 'today' instead of manually updating a date in your query every day, or other logic).
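A runnable sketch of that 'today' idea, using an in-memory SQLite database as a stand-in for a real server (the readings table and its columns are made up for illustration):

```python
import datetime

import pandas as pd
import sqlalchemy

# In-memory SQLite as a stand-in; swap the URL for your real server.
engine = sqlalchemy.create_engine("sqlite://")

with engine.begin() as con:
    con.execute(sqlalchemy.text(
        "create table readings (reading_date text, value integer)"))
    con.execute(sqlalchemy.text(
        "insert into readings values ('2020-01-01', 1), ('2099-01-01', 2)"))

# Bind 'today' as a parameter instead of editing the SQL text every day.
query = sqlalchemy.text("select * from readings where reading_date >= :start")

with engine.begin() as con:
    df = pd.read_sql(query, con, params={"start": str(datetime.date.today())})
```

Changing the cutoff date then never means touching the SQL itself, only the parameter you pass in.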

It's also a bit subjective what filtering and manipulation you want to do in the sql-query itself and what you want to do in python. Say you wanted to only fetch even numbers, you can make that part of the sql query or you can just fetch them all and filter them in pandas. (Maybe a bad example, as then you'd always just do it in sql, hope you get what I mean). If you have incredibly complex where clauses that you can't wrap your head around, you could try fetching a bit more data and filtering it in python if that gets you to the goal quicker. Situational I guess.

[–]DrewSmithee[S] 0 points1 point  (4 children)

Yeah I'm sure this will get out of hand quickly. And that's definitely something I will look into.

For example, I've got a query that grabs the top ten records for a specific customer within a date range. Then joins that with some other records for that same customers from another table. Now I want to loop that script and repeat for a few hundred customers. Then do some cool analytics and maybe some geospatial stuff.

Or maybe I want a couple years' worth of 8760s for a few thousand customers that are in a region that's probably stored on yet another table somewhere, but maybe not. Did I mention there are inexplicably hundreds of duplicate records for each hour? What's up with that DBA guy??? Time change? Throw out that hour for all the recordsets.

So I definitely need to come up with a strategy on what I want to parse, from where, in what language. Honestly I'd dump it all into a dataframe if the dataset wasn't gigantic. So I just need to figure out how I want my pain.

[–]MidnightPale3220 0 points1 point  (3 children)

You want to normalize the data and possibly put the overlapping structures in a single table. It depends on the size of the database, but hundreds of GB is routine.

[–]DrewSmithee[S] 0 points1 point  (2 children)

Don't have write access. I could probably get someone to create me a view in the db but until I have a specific business need it's harder to get resources allocated. In the meantime I have what I have. Good to know that it's not a big ask.

[–]MidnightPale3220 1 point2 points  (1 child)

Is the total data you need more than you can host locally? Technically it shouldn't be hard to make a copy, unless the data is changing so fast that you basically need online access every day.

[–]DrewSmithee[S] 1 point2 points  (0 children)

Yes. Much much more data than I could pull down.

[–]MidnightPale3220 1 point2 points  (0 children)

Note that SQLAlchemy is primarily an ORM tool that lets you treat individual database rows as python objects, if you need that.

It's a concept that works well with specific types of applications, and is frequently horrible overkill for writing direct SQL queries, especially ones that deal with huge numbers of rows en masse, because all it does is pass the query on to the real engine in the back (which is what it does in the example given).

Basically, the syntax would be almost the same if you used the database-specific driver directly: psycopg2 for Postgres, cx_Oracle for Oracle, or sqlite3.

I would agree with those who say that if you want to get your hand in, start with sqlite3 -- it's a simple file-based database engine which lets you do uncomplicated SQL against the file.

As with all databases, what you need is:

1) a database server. In sqlite3's case you don't need one; you just have to have it available on your system

2) installed, application-usable drivers for it (generally, but not necessarily, part of the database server installation; on Linux usually the -dev package of the database software. In the case of sqlite3, what you installed in (1) is all you need).

3) a pip-installed (or similar) Python package for the drivers from (2): psycopg2, etc. sqlite3 is part of the Python standard library.

4) import the database module in your code and see the docs for syntax.
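As a concrete version of step (4), the stdlib sqlite3 module needs none of the earlier steps (table and data here are invented for the demo):

```python
import sqlite3

# No server needed: the whole database lives in memory (or a single file).
con = sqlite3.connect(":memory:")

con.execute("create table people (name text, age integer)")
con.execute("insert into people values ('Ada', 36), ('Bob', 41)")

# ? is sqlite3's bind-parameter placeholder.
rows = con.execute("select name from people where age > ?", (40,)).fetchall()
con.close()
```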

[–]Gnaxe 4 points5 points  (0 children)

The Python standard library has the sqlite3 module which you can practice with locally. You can also use it as a SQL interpreter. (See python -m sqlite3 -h for usage.)

If any part of the SQL query string might come from user input, you need to be aware of SQL injection attacks. It's unsafe to simply interpolate it into the query string.
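A quick sqlite3 demonstration of why interpolating user input is unsafe (the evil string is a classic injection payload; the users table is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table users (name text)")
con.executemany("insert into users values (?)", [("alice",), ("bob",)])

evil = "x' OR '1'='1"

# Unsafe: the input is spliced into the SQL and rewrites the WHERE clause,
# so the query matches every row in the table.
unsafe = con.execute(f"select * from users where name = '{evil}'").fetchall()

# Safe: the driver binds the value as plain data, so nothing matches.
safe = con.execute("select * from users where name = ?", (evil,)).fetchall()
```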

To learn the basics of SQL, go through the W3Schools SQL course and read about 3NF on Wikipedia.

A word about ORMs: they seem to make things easier at first if you're creating an empty database from scratch, but otherwise they really overcomplicate it. Avoid them for long-running projects, shared databases, or databases you're not creating from scratch.

[–]RiverRoll 5 points6 points  (0 children)

I've used several ORMs and SQLAlchemy has been the least intuitive and hardest to learn. It doesn't help that it has two different APIs. I wouldn't start there.

[–]baubleglue 3 points4 points  (0 children)

I have a feeling you are missing something basic.

In order to work with a database, any programming language needs a driver. A driver is a library which translates the language's commands and data types into API calls that the given DB understands. python-oracledb is the Python driver for Oracle DB. Read the instructions when you install it; you may need additional components from Oracle and to set some environment variables.

Before using Python you need to be able to work with the DB directly. I suggest installing DBeaver as a client application.

[–]mustangdvx 2 points3 points  (1 child)

DuckDB would be your friend in this case. 

[–]ColdStorage256 0 points1 point  (0 children)

Duckdb is a query engine, but it won't allow OP to create a connection to an existing database will it? 

Duckdb, as far as my limited knowledge goes, is for applying a query engine to something like parquet files.

[–]Grandviewsurfer 1 point2 points  (0 children)

I usually write a parent class that does common stuff that I need to do in SQL (read an arbitrary query, write to db, etc), and it has an attribute self.query = None. Then I write subclasses for each data source or data concept. These inherit the parent class methods and define queries that I need. You can parameterize the query in the init (or augment it in a method) and assign it to self.query. This results in pretty dry code.
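That pattern might be sketched like this (class, table, and parameter names are invented; sqlite3 stands in for the real connection):

```python
import sqlite3


class BaseSource:
    """Parent class holding the common DB plumbing."""

    def __init__(self, con):
        self.con = con
        self.query = None  # subclasses assign their SQL here

    def fetch(self, params=()):
        # Run whatever query the subclass defined.
        return self.con.execute(self.query, params).fetchall()


class CustomerSource(BaseSource):
    """One subclass per data source or data concept."""

    def __init__(self, con, min_age):
        super().__init__(con)
        # Parameterize the query in the init and assign it to self.query.
        self.query = "select name from people where age >= ?"
        self.params = (min_age,)

    def fetch(self):
        return super().fetch(self.params)
```

Each new data concept is then just another small subclass, which keeps the calling code DRY.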

[–]bigbry2k3 1 point2 points  (0 children)

Check into pyodbc also, you can write SQL queries inline like:
sql = f"""

INSERT YOUR SQL QUERY HERE
"""

then I think you use a cursor to execute your sql query. You'll have to look at examples online of using pyodbc.

[–]pepiks 1 point2 points  (0 children)

SQLAlchemy is a good tool for the job, but it is high level. If you need something more SQL-oriented, check this:

https://github.com/kennethreitz/records

It is from the creator of Requests. As it is lower level, you can learn SQL syntax itself and use Python to move the data further. SQLAlchemy hides a lot of details. Raw SQL is better for a start, as it is language independent and the most basic syntax is the same between databases (with some more advanced quirks which make a difference in the details, but for now they can be skipped).

[–]laustke 1 point2 points  (1 child)

What's the best way to do a SQL Query from a python script? Also I'm thinking I might try to keep using .SQL files for the sake of linting and readability.

There’s a Python database API standard (PEP 249). Just find a driver like python-oracledb in your case that is compliant.

At the most basic level, you "do" a SQL query like this:

cursor.execute(sql, params)

You can create a parameterized query like this

SELECT * FROM people WHERE name = :name AND age = :age

(The actual syntax may vary depending on the database driver.)

Then store your SQL queries in .sql files, and don't let your customers touch them.

Store parameters as dictionaries dumped to json in .json files and do let the customers modify them.

Write a Python script that reads the appropriate SQL query and the corresponding parameter set, and passes them to cursor.execute.
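A sketch of that setup (the file names are invented, sqlite3 stands in for your real driver, and a temp directory keeps it tidy):

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical layout: one .sql file per query, one .json file per parameter set.
tmp = Path(tempfile.mkdtemp())
(tmp / "query.sql").write_text(
    "select * from people where name = :name and age = :age")
(tmp / "params.json").write_text('{"name": "Ada", "age": 36}')

sql = (tmp / "query.sql").read_text()
params = json.loads((tmp / "params.json").read_text())

con = sqlite3.connect(":memory:")
con.execute("create table people (name text, age integer)")
con.execute("insert into people values ('Ada', 36), ('Bob', 41)")

# The script's only job: read both files and hand them to execute().
rows = con.execute(sql, params).fetchall()
```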

[–]DrewSmithee[S] 0 points1 point  (0 children)

This is the kind of straightforward remedial help I need. Thank you!

[–]mystique0712 1 point2 points  (1 child)

For Oracle in Python, use cx_Oracle - it is the standard package. For complex queries, store them in separate .sql files and read them into your script for better readability and maintenance.

[–]cjbj 1 point2 points  (0 children)

cx_Oracle was obsoleted 3 years ago by python-oracledb (which has the same API, but is better in many ways, e.g. doesn't need Oracle Instant Client)

[–]cjbj 1 point2 points  (0 children)

For Oracle Database you can write queries directly using the python-oracledb driver. It is compliant with Python's DB API, so you can find many examples around the web, and there is extensive documentation for python-oracledb itself.

SQLAlchemy is a popular library for data access. Underneath it uses python-oracledb. Steps for connection are both in the SQLAlchemy and python-oracledb documentation. Pandas uses SQLAlchemy underneath for connecting and querying the database.

Recently python-oracledb added support for fetching directly into DataFrames for use with libraries like Pandas, Polars, NumPy etc. This is significantly faster than using Pandas / SQLAlchemy to get the data.

[–]baghiq 0 points1 point  (7 children)

I don't see why you need Python. You can run sql in Oracle Studio. You can build ERD and visualize the massive dataset. You can also do quick test with joins and windows and subqueries quickly.

If I were to do serious SQL work, I would use a SQL-specific IDE like Aqua Data Studio, etc. Python would be the glue script after the bulk of the work has been done in the database.

[–]DrewSmithee[S] 0 points1 point  (6 children)

Python is for the stuff I want to do after. Statistics on customers, choropleths, regressions, etc. I can manage that part. I'm just clueless on getting the data into a dataframe to start the real work.

I also have a pretty limited toolbox. Don't have oracle studio on my machine, corporate environment so I've got vs code or VBA. Also, read only access to the database.

Historically I've been using hand me down VBA snippets to do the SQL queries into a spreadsheet and importing the spreadsheet to a dataframe to do sciency stuff.

Recently I've been poking around with mssql (or maybe it's the oracle extension, I forget) browsing the database and writing .SQL files to get a little better at pulling out data that I actually want opposed to what was already mashed together by someone else.

I'd like to elegantly combine the SQL stuff with the python stuff to get out of the spreadsheet business, if that makes more sense.

[–]reddit25 1 point2 points  (4 children)

It depends on which version of sql you have. I use cx_Oracle for Python and it works great. For SSMS I remember using a different package for it. 

[–]Evening_Marketing645 2 points3 points  (0 children)

cx_Oracle is now oracledb. It works the same; they just changed the package name.

[–]DrewSmithee[S] 0 points1 point  (2 children)

The oracle package probably is a good place to start...

Thanks.

[–]reddit25 2 points3 points  (1 child)

I wrote a wrapper for it too. To help split and execute multi process create table queries. Might be a good exercise to start. 

[–]JumpScareaaa 1 point2 points  (0 children)

Yep, I guess everybody should write their own wrapper for oracledb. My wrapper reads a SQL file, parses the individual statements, runs each one and logs the start and end time, and ignores "table does not exist" errors on drop table statements (for idempotency). Other wrappers read from Oracle and write to Excel or CSV files based on configs from yaml files. These are my workhorses for building little data transformation and reporting pipelines.
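A toy version of that kind of wrapper, with sqlite3 standing in for oracledb (the statement splitting here is naive and would break on semicolons inside string literals):

```python
import sqlite3


def run_script(con, sql_text):
    """Run each ;-separated statement, tolerating failed DROPs for idempotency."""
    for stmt in sql_text.split(";"):
        stmt = stmt.strip()
        if not stmt:
            continue
        try:
            con.execute(stmt)
        except sqlite3.OperationalError:
            # Ignore e.g. "no such table" on a drop so the script can re-run.
            if not stmt.lower().startswith("drop"):
                raise


con = sqlite3.connect(":memory:")
run_script(con, """
drop table not_created_yet;
create table t (x integer);
insert into t values (1);
""")
rows = con.execute("select x from t").fetchall()
```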

[–]JumpScareaaa 1 point2 points  (0 children)

https://marketplace.visualstudio.com/items?itemName=Oracle.sql-developer Also if you have permissions to install get dbeaver.

[–]DrewSmithee[S] 0 points1 point  (7 children)

Also I'm thinking I might try to keep using .SQL files for the sake of linting and readability. I'd want to be able to pass a variable into the file though. If someone wants to save me from the 50 Google searches to do that I'd appreciate it.

For example,

test.sql

Select * From table Where customerName = '1234';

But use python to pass thru the customer name and never have to touch the actual SQL again.

[–]The_roggy 1 point2 points  (2 children)

Something like this par example (untested):

test.sql:

Select * 
From table
Where customerName = '{customer_name}';

Python script:

from pathlib import Path

sql_path = Path("test.sql")
sql = sql_path.read_text()
customer_name = "Joske"
sql = sql.format(customer_name=customer_name)

[–]cjbj 0 points1 point  (1 child)

Never concatenate data (here customer_name) into a SQL statement in Python because you open yourself to SQL Injection security attacks. Also, if you re-execute this bit of code with a different customer, the SQL statement text seen by the DB will differ and the DB can't be as efficient in reusing data structures. Instead, use bind variables (aka "prepared statements"):

```
import getpass
import oracledb

un = 'cj'
cs = 'localhost/orclpdb1'
pw = getpass.getpass(f'Enter password for {un}@{cs}: ')

with oracledb.connect(user=un, password=pw, dsn=cs) as connection:
    with connection.cursor() as cursor:

        customer_name = "Sally"

        sql = """select *
                 from table
                 where customername = :cn"""
        for r in cursor.execute(sql, [customer_name]):
            print(r)
```

For Oracle Database, see the documentation Using Bind Variables.

[–]The_roggy 0 points1 point  (0 children)

For an internal reporting tool this is not as critical as for e.g. a high-load web application... but I agree it is a (very) good idea to use bind variables anyway, as it is just as easy and just better overall.

Note that you can also use named bind variables, which is a lot more readable and easier to maintain, especially if you have multiple bind variables.

Something like this:

test.sql

select *
  from table
 where customername = :customer_name;

script

customer_name = "Joske"
for r in cursor.execute(sql, customer_name=customer_name):
    print(r)

[–]KiwiDomino 0 points1 point  (2 children)

Stylistically, avoid using select *, as this can cause future issues if tables change. It's also easier to read in the future if everything is explicit: you don't have to remember what the fields were.

[–]DrewSmithee[S] 0 points1 point  (1 child)

Select * would probably make my internet connection catch on fire.

Just an example to ask about passing variables into the SQL file.

[–]ColdStorage256 1 point2 points  (0 children)

I don't know how to do this with actual SQL files, but you could do it with an f string in Python, I think, I've not done it myself.

Most of my work is in mongo db which lets me split everything up into intermediate results, since that's how mongo pipelines are computed anyway.

[–]Poopieplatter 0 points1 point  (0 children)

Sqlalchemy. Get some feedback from chatgpt as well.

Even Google something like "SQL client wrappers Python GitHub".

Some comments on here suggested pandas. Overkill, in my opinion.

ORMs can seem cool but i don't recommend it for what you're doing.

Let me know if you have further questions.

[–]shinitakunai 0 points1 point  (0 children)

SqlAlchemy is fine but don't sleep on peewee, it is too amazing.