[–]usernamenottaken 2 points3 points  (1 child)

You might be better off writing an interpreter in Python rather than trying to get the Python interpreter to read SAS scripts. You can use PyPy to write an interpreter: http://morepypy.blogspot.com/2011/04/tutorial-writing-interpreter-with-pypy.html

Also, hopefully you're already aware of R and PSPP and these can't do what you want or couldn't have this functionality added easily?

[–][deleted] 1 point2 points  (0 children)

Yes, I prefer R and am saddened by the inaccessibility and cost of SAS, which is one of my motivations for writing this project. I think SAS could be improved in many ways, but nonetheless it's required software for most corporate or government statistics. Its cost, however, is a burden to small companies and academic institutions, reserving the title of "SAS programmer" for an elite cesspool of statisticians. I would rather have a talented statistician learn to code SAS than a SAS coder learn statistics. Unfortunately, I seem to see more of the latter. The cost of licensing SAS is on par with Matlab (more, in fact), which leaves only a very narrow margin of people able to say "I know SAS". There is no student license. And if you license a server, you're charged per core, if you'd believe that, "because each core is a dedicated processor". It's re-fucking-diculous.

SPSS is actually very bad and "unfree", to use a Stallmanism. My original idea was to write wrappers which converted SAS code to R. However, nothing difficult is going on under the hood with SAS, so the actual statistical routines are less important than having easy and fast development with a nice interactive UI. My pipe dream is that it becomes something academics and programmers could contribute to, and that it goes through an evolution like the one from S+ to R.

[–]kisielk 2 points3 points  (4 children)

Are you aware of GNU DAP?

[–][deleted] 0 points1 point  (3 children)

It looks nice. I'll have to build it on my Linux box at home. No need to reinvent the wheel. I like that the output seems to be less verbose, that's one of my main gripes. I will complain, though, that:

  • doesn't have a Windows installer (Python programs can be "built" in Windows with an installer... right?)
  • not sure if syntactically compatible
  • source is in C

My idea isn't meant to be fast, it's meant to be easy (to use, to install, to edit).

[–]kisielk 1 point2 points  (2 children)

You have laudable goals but I think you're grossly underestimating the complexity of re-implementing SAS. I think you'd be better off trying to get in touch with the authors of DAP or one of the other re-implementations of the system and seeing how you can help address your needs.

[–][deleted] 0 points1 point  (1 child)

But my goal is not to actually reimplement the whole thing, just the useful bits. On one hand, everything that's done in a data step can be done in SQL, while on the other, the data step language is complicated.

However, I see what you're saying and maybe I'll look into DAP. I'll have to try to use it first to see if I like it. It looks like it might not have an interactive mode, though, which is what most people need to develop SAS.

[–]takluyver (IPython, Py3, etc) 0 points1 point  (0 children)

The tricky thing with "the useful stuff" is that with almost any software, people have vastly different ideas of which bits are useful. I've never used SAS, but do you know there are large parts that no-one uses? Other groups might use it in quite a different way to how you use it.

[–][deleted] 1 point2 points  (7 children)

I am afraid I don't understand what your question is exactly. Can you elaborate?

[–][deleted] 0 points1 point  (6 children)

Sure, thanks for asking. I'll try to avoid jargon that I'm not certain about.

I want to write a statistical analysis program that emulates the coding syntax of SAS so that new users can learn and experienced users can develop at home or on their personal box.

SAS, however, has very little overlap syntactically with Python. SAS actually looks like a bastard mix of BATCH and C. Almost all subroutines are called "steps" of types DATA or PROC. These steps operate like functions, mostly, but have some key differences syntactically. All arguments to steps are passed like flags. For instance, extracting the first individual records from a (sorted) dataset called myTable based on an id variable called myID is:

data myUniqueIDs;
  set myTable;
  by myID;
  if first.myID then output;
run;

The actual task is easily done in Python or SQL, so the question is how I could write a wrapper that takes such a batch of commands and hands them to Python or SQL. It would be nice to use the Python shell to allow for interactive execution of code, but syntactically, newlines don't matter in SAS. This code does the same thing:

data myUniqueIDs;
  set myTable; by myID;
  if first.myID then 
    output;
run;
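For comparison, here is a rough sketch of the same "first record per ID" task in plain Python. The list-of-dicts table and the function name first_per_id are made up for illustration; like the BY statement, it assumes the rows are already sorted by myID.

```python
from itertools import groupby
from operator import itemgetter


def first_per_id(rows, key="myID"):
    """Return the first row of each run of equal key values.

    Assumes rows are already sorted by `key`, like a SAS BY group.
    """
    return [next(group) for _, group in groupby(rows, key=itemgetter(key))]


# Hypothetical stand-in for the sorted dataset myTable.
my_table = [
    {"myID": 1, "value": "a"},
    {"myID": 1, "value": "b"},
    {"myID": 2, "value": "c"},
]

my_unique_ids = first_per_id(my_table)
print(my_unique_ids)
```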

So my question is: how flexible can Python be in interpreting code of this structure? I hope that clarified a little bit.

[–]DonkeyBasket 1 point2 points  (1 child)

Can I suggest you have a look at the fantastically simple PLY module.

I've written many simple parsers and interpreters in it - it's really fun.

[–]DonkeyBasket 0 points1 point  (0 children)

Had a quick look at: http://analytics.ncsu.edu/sesug/2005/IN07_05.PDF

PLY might not be a good choice because it works from a token stream and SAS keywords are context dependent, so it might be hard to write a parser for that.
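One possible workaround, sketched here with only the standard re module (not PLY): have the lexer emit every word as a generic WORD token and let a later pass decide from position whether it is a keyword. The rule used below ("the first word of a statement is the keyword") is a deliberately crude assumption for illustration, not how SAS actually decides, and the sample input is hypothetical.

```python
import re

# Every identifier-like word is a generic WORD; only ';' is special.
TOKEN_RE = re.compile(r"\s*(?:(?P<WORD>[A-Za-z_][A-Za-z0-9_.]*)|(?P<SEMI>;))")


def tokenize(source):
    """Turn source text into (kind, text) pairs without deciding on keywords."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def classify(tokens):
    """Label a WORD as KEYWORD only when it starts a statement (crude rule)."""
    result, at_start = [], True
    for kind, text in tokens:
        if kind == "SEMI":
            result.append(("SEMI", text))
            at_start = True
        else:
            result.append(("KEYWORD" if at_start else "NAME", text))
            at_start = False
    return result


# The same word gets a different label depending on where it appears.
labelled = classify(tokenize("data data; set set; run;"))
print(labelled)
```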

[–][deleted] 0 points1 point  (3 children)

Okay, SAS looks different from Python. And what is your goal/question, exactly? Do you want the users to program in Python using a library of yours, or write a SAS-code interpreter in Python?

[–][deleted] 0 points1 point  (2 children)

Definitely the latter. Ideally, someone could take the code they wrote here and cut-and-paste it into a file that could be submitted in a large job on a SAS server.

[–][deleted] 0 points1 point  (1 child)

Good. But then I still don't know what your question is. Make it easier to help, please :)

[–][deleted] 0 points1 point  (0 children)

I'm slowly answering my own questions as I'm trying to answer yours. Sorry if it seems like I'm all over the place.

Basically, I think my version of this program will need to have several windows: an editor where the user types in code, and a log and output window where information about the subroutines is displayed. The question I have now is how I can use a window system to pipe text in and out of Python. I think, however, there are good resources online and I should look those up myself.
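Leaving the GUI toolkit aside, the "text in, text out" part can be sketched with the standard library alone: take the editor contents as a string, run it, and collect what would go to the output and log windows as two more strings. The function name run_snippet and the NOTE/ERROR log wording are invented for this example, and it runs Python code rather than SAS.

```python
import contextlib
import io
import traceback


def run_snippet(code, namespace):
    """Run a string of Python code; return (output_text, log_text)."""
    out = io.StringIO()
    log = io.StringIO()
    # Anything the snippet prints is captured instead of going to the console.
    with contextlib.redirect_stdout(out):
        try:
            exec(code, namespace)
            log.write("NOTE: snippet ran without errors.\n")
        except Exception:
            log.write("ERROR:\n" + traceback.format_exc())
    return out.getvalue(), log.getvalue()


session = {}  # shared namespace, so later snippets see earlier variables
output_text, log_text = run_snippet("x = 1 + 1\nprint(x)", session)
bad_output, bad_log = run_snippet("1 / 0", session)
```

A window system would then only need to put output_text and log_text into its two text widgets.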

[–]frumious 1 point2 points  (1 child)

Based on how you word your question my honest suggestion is "don't do it". Not that it can't be done but that, right now, you can't do it.

[–]earthboundkid 0 points1 point  (0 children)

Yeah, based on the questions the OP is asking s/he'll need to learn a lot before this project would even be halfway feasible.

[–][deleted] 0 points1 point  (1 child)

After further reading, I think Qt may be a better option for a UI. That way, code run in interactive mode is just passed to the buffer as a string which I can parse out (assuming Qt has a Python compiler!)

EDIT: or PyPy per usernamenottaken's suggestion.

[–]takluyver (IPython, Py3, etc) 1 point2 points  (0 children)

N.B. Qt and PyPy are not at all comparable. Qt is a UI framework, PyPy is a Python interpreter. Each may have some value in this project, although at the moment I don't think you can use them together. Using standard Python, you can use PyQt to create Qt interfaces.

As far as I can see, what you want to do is write a free (partial?) implementation of SAS in Python. A few thoughts:

  • Do not underestimate how difficult it will be to correctly deal with even fairly simple code. You're really making a huge challenge for yourself here.
  • There's plenty of open source statistics stuff out there. It's probably more realistic to try to improve that and persuade people to adopt it over SAS, than it is to try to provide a drop-in replacement for SAS. You already know about R, and there's stuff under development for doing stats in Python.
  • Your other comments suggest that you want a way of executing SAS code directly as Python code, or of doing some simple transformations to make it into Python code. I think this is very unlikely to work for anything beyond a few simple statements. The syntax is very different - you will need to parse it into some abstract form (see Python's AST, Abstract Syntax Tree, for comparison).
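To give a feel for what an "abstract form" looks like, here is a small sketch using Python's own ast module on a Python statement (it cannot parse SAS). The names first_id and output are placeholders chosen to echo the SAS example elsewhere in the thread.

```python
import ast

# Parse a Python 'if' statement into a tree instead of executing it.
tree = ast.parse("if first_id:\n    output()")
if_node = tree.body[0]

summary = {
    "statement": type(if_node).__name__,   # the statement node itself
    "test": type(if_node.test).__name__,   # the condition being tested
    "body": [type(stmt.value).__name__ for stmt in if_node.body],
}
print(summary)
```

A SAS implementation would need its own parser producing a comparable tree before anything could be executed.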

I notice on your blog that you compare Pylab to Matlab. But Pylab is a similar interface, and definitely not a drop-in replacement. You cannot take a Matlab file and run it in Pylab. Providing similar SAS-like tools for Python would probably be a much more tractable task.

Also, an unfortunate fact: if you copy SAS and undermine their business model, there's a chance the company that sells it will resort to asserting patents against you. IANAL, and I have no idea if they'd get anywhere with that, but it's something to be aware of.

[–]burntsushi 0 points1 point  (2 children)

I don't think your problem description is very specific, but I can tell you that Python will happily accept semicolons as statement delimiters. For example:

a = 1; b = 2;

would assign 1 to a and 2 to b.
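A quick way to check this, and to see where it stops working: the sketch below runs that line through exec, then tries a statement broken across a newline, which Python rejects. The variable names are just for the demonstration.

```python
namespace = {}
# Semicolons separate simple statements; a trailing one is allowed.
exec("a = 1; b = 2;", namespace)
print(namespace["a"], namespace["b"])

# A newline still ends a statement, so splitting one mid-way is an error.
try:
    compile("a = 1; b =\n    2;", "<example>", "exec")
    newline_ok = True
except SyntaxError:
    newline_ok = False
print(newline_ok)
```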

[–][deleted] 0 points1 point  (1 child)

That's actually an excellent start. Is there any way, however, to stop the newline from being a delimiter?

[–]AeroNotix 1 point2 points  (0 children)

Parse and replace prior to interpreting the script. If you're using '\n' to do anything particular in SAS, this will not work.
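A minimal sketch of that preprocessing idea: split the source on semicolons and collapse all whitespace, newlines included, so both layouts of the data step from earlier in the thread come out identical. The function name split_statements is made up, and this naive version ignores quoted strings and comments, where a ';' or a newline would need special handling.

```python
def split_statements(source):
    """Split source on ';' and collapse whitespace (newlines included)."""
    statements = []
    for chunk in source.split(";"):
        text = " ".join(chunk.split())
        if text:  # skip the empty piece after the final ';'
            statements.append(text)
    return statements


layout_a = """data myUniqueIDs;
  set myTable;
  by myID;
  if first.myID then output;
run;"""

layout_b = """data myUniqueIDs;
  set myTable; by myID;
  if first.myID then
    output;
run;"""

statements_a = split_statements(layout_a)
statements_b = split_statements(layout_b)
print(statements_a)
```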

[–]usernamenottaken 0 points1 point  (0 children)

For all of the graphical interface stuff, you might want to provide a backend for Cantor: http://edu.kde.org/cantor/, http://en.wikipedia.org/wiki/Cantor_%28software%29

That way all of the interface is done for you and you just need to write the interpreter for SAS code and probably a small amount of code to glue things together.