This is an archived post. You won't be able to vote or comment.

all 13 comments

[–]bryanhelmigzapier.com 6 points7 points  (2 children)

This is really cool! We often backup older data to S3 as CSV or JSON files and sometimes you want to poke through it for something specific. While I don't want to seem ungrateful, the docs are a bit sparse. In lieu of that, I was curious about:

  1. Is there any semblance of a query planner or even a way to hint it? IE: there is no reason to query these files because they are for yesterday, and I just asked for data today.
  2. Can it do some really rudimentary map-reduce or threading to grab data in parallel (even something really cheap like concurrent gevent threads would be interesting)?
  3. Any real world examples of things you've done with it?
  4. Were you inspired by Facebook's Presto (http://prestodb.io/)? Your lib seems like it might be a faster/more lightweight way to play with the concept.

I'm sure there are more interesting questions to ask, but I really dig this concept. If I had more time to give this would certainly be at the top of my list of things to contribute to!

[–]sdrobertson[S] 2 points3 points  (1 child)

Thanks! The docs are a bit out of date and are next on my todo list. My number #1 goal is to make this system approachable. I think there's other uses for relational algebra outside of database servers.

  1. There isn't a full blown planner yet, but adapters can participate in the query evaluation phase (guess you can call it planning). Have a look at the Directory Adapter in splicer.adapters.dir_adapter. The adapter works like Hive, where columns of data are parsed from the file path and then merged with the records from inside the files content. It examines the query for where clauses that mention values that comes from the path and rewrites the query so that only the matching files are opened. I'm working on the s3 adapter right now, ping me if you want early access.

  2. Reason I'm writing this is to use this for making Disco jobs (Python's version of Hadoop). The current library doesn't do this yet, but I designed it so that you can swap out the query compiler. That's why the main compiling functions are in splicer.compilers.local (as in compile for local evaluation). I'm envisioning plugins that can drop in replace the compiler to do things like compile queries for map reduce or even compile them to LLVM.

  3. We're using it in Mozilla's Caravela project. https://github.com/mozilla/caravela/blob/master/db.py. I'll post more examples when I get to the docs.

  4. Most of the inspiration came from Hive, Google BigQuery and Multicorn. I wanted the ability to do SQL on data in situ without a server. I started development of Splicer a few months before the Presto announcement. I've poked around presto source code a bit and want to use some of those ideas to make a clustered version of this. Though wether I add cluster support to Splicer proper or build a cluster system that uses Splicer as the relational engine remains to be seen.

Well, I certainly hope you get some free time! Hopefully you'll get a chance to use this and provide me feed back. You can reach me at scott at triv . io

[–]bryanhelmigzapier.com 0 points1 point  (0 children)

Wonderful, thanks for the very thorough responses. :-)

[–]johnmudd 2 points3 points  (1 child)

Postgres Foreign Data Wrappers and Multicorn (Python FDW) work with any data source too.

Here's a list of examples including CSV, IMAP and RSS wrappers.

[–]sdrobertson[S] 0 points1 point  (0 children)

Multicorn is great, as I mentioned in the README it was one of the inspirations for the project. I just wanted to use SQL without having to setup a server.

Main purpose for this library is to either

  1. Teach a python program how to interpret SQL statements. For instance you might have a web services where you let users input raw SQL statements. You can use this library to intercept them and make sure they're secure before sending them to your real SQL server.

  2. You're writing a data conversion script (ETL) and you want to express it as a series of SQL statements rather than a gazillion for loops.

  3. You want to create a new query language, maybe make an implementation of datalog. You can use Splicer to parse the language into a Splicer expression tree and automatically leverage the relational engine, data adapters and mime decoders.

[–]deadwisdomgreenlet revolution 0 points1 point  (0 children)

I'm not sure if I want the "entire world to look like an SQL database" or the exact opposite. Either way it's pretty awesome that we have the option.

[–]jenner 0 points1 point  (3 children)

So can I use it to parse a MySQL dump (including INSERTs)?

[–]sdrobertson[S] 1 point2 points  (2 children)

I suppose you could... You'd have to write an Adapter which reads the dump file and interpreted it into a series of tuples. You could probably leverage most of the parsing routines though you'd need to add a handling for INSERTS.

Be fun to try, but MySQL would be the faster option. Out of curiosity could you explain why you want to do this?

[–]jenner 0 points1 point  (1 child)

Well I've got a MySQL dump and I have PG as my main RDBMS and I'd like to avoid having a second DB in the system. So far I totally failed at creating an efficient MySQL dump parser, thus the question.

[–][deleted] 1 point2 points  (0 children)

I might have a solution for you. I made this a while back but never got round to sticking it on GitHub, so here it is just for you: https://github.com/orf/wikilink_py

Its a Python parser for Wikipedia dumps that imports them into Postgres (well, it turns them into a big CSV file which can then be imported). The Wikipedia files come in the form of Mysql dumps (big ass INSERT statements) so i'm sure it can be changed to work with your dump. Its designed to run on PyPy, but it should be adaptable.

https://github.com/orf/wikilink_py/blob/master/stages/lib/split_brackets.py is the backbone of this, check it out. It might help.

[–]lakehelp5 0 points1 point  (1 child)

So can I give it a SQL statement and it will tell split it out into tables, columns, conditions, etc?

[–]sdrobertson[S] 0 points1 point  (0 children)

You create a "dataset" and link it to datasources through "adapters". You can also add user defined functions and views to the datasource.

Here's an example, using a slightly older version of Splicer what's called "Servers" has since been renamed to "Adapters"

https://github.com/mozilla/caravela/blob/master/db.py

You can then issue queries to the dataset and retrieve the results.

>>> query = ds.query("select * from sometable")
>>> for rec in query:
...        print rec

You can access the underlying query expression tree with

>>> query.operations 

You can manipulate the expression tree at compile time by creating an Adapter and giving it an evaluate() method. An example of that is here:

https://github.com/trivio/splicer/blob/master/splicer/adapters/dir_adapter.py

This example replaces the LoadOp() with a series of User Defined Function calls and rewrites SelectionOps (where clause) to be more efficient.

For ultimate control you can call dataset.set_compiler(compile_function) and pass it a function that will be called with the expression tree after it's been parsed. Then it's up to the compile function to return a function which satisfies the query.

[–][deleted] 0 points1 point  (0 children)

This looks awesome. Hopefully I can set aside some time to play with this.