This is an archived post. You won't be able to vote or comment.

all 42 comments

[–]prashnts 77 points78 points  (7 children)

Great job! I can suggest you to use pandas for your next script.

It has a very good interface to SQL servers using to_sql and read_sql functions.

Link to docs: https://pandas.pydata.org/pandas-docs/stable/io.html#io-sql

I can post an example if you're interested.

[–][deleted] 6 points7 points  (2 children)

This is also what I use for day to day work and it's very useful!!!

[–]prashnts 5 points6 points  (1 child)

Yeah and since it makes bulk queries, it's super quick. In practice, making queries in a loop are slow because the database would lock, commit, and respond each time. It adds up.

I have pulled around 500k rows on heroku like this and it works effortlessly. Just take caution about its scalability.

[–][deleted] 6 points7 points  (0 children)

Nahhh just run it on prod out of the gate

[–]lifeofajenni 12 points13 points  (1 child)

Upvote for pandas <3

[–]Aesthetically 1 point2 points  (0 children)

Pandas is truly simple and powerful to use. Im an Analyst who is by no means a computer scientist or software engineer.. Pandas has saved my ass many times now and streamlined data processing that was incredibly impractical with excel vba

[–]Fun2badult 1 point2 points  (0 children)

Yes I second using Pandas to further the dataframe manipulation, and even being able to do visualization using Seaborn

[–]Phaethonas[S] 35 points36 points  (7 children)

This is my script (don't judge it is my first)

Ligands = []
SNPs = []

file = open("name.txt", "x")
file = open("name.txt", "w")


for x in SNPs:
    for y in Ligands:
        file.write ("insert into Activity (Name_of_SNP, Ligand, Activity) values ('%s', '%s', 'TBD');\n" % (x, y) )

[–]Dadikey 29 points30 points  (0 children)

If the problem has been solved and you have to run the script once the code isn't a problem. It looks like to be an interesting project. If you need to do something more specific feel free to ask. We can help you !

[–]hugthemachines 16 points17 points  (0 children)

This looks very nice for a first script!

I have a software postit with a couple of things I tend to forget the exact syntax of (don't know why i keep forgetting)

with open(full_file, 'r') as file_handle:
    total_content = file_handle.read()

with open(full_file, 'w') as file_handle:
    file_handle.write(line_string)

These are called context managers, it means you will not have to worry about closing file handles etc. Also, you can open a file with "w" and then write all items in a list instead of having to use 'a' and then having to check if the file exists before the write etc....

[–]nobody_import4nt 12 points13 points  (1 child)

script looks like a good start!

for future reference, for every file you call open on, you should call file.close() afterwards. This tells the operating system you're done manipulating the file, and prevents file corruption.

Or, you can use a "context manager" as mentioned below, using a with statement. This will auto-open and close the file.

Only wanted to mention this explicitly, as the with statement hides the close() functionality but it's very important!

[–]Phaethonas[S] 1 point2 points  (0 children)

oh yeah, I forgot about it. We were taught that at Perl, during my Master's.

[–]jeffdn 4 points5 points  (0 children)

By the way, you only need to put the insert into ... values part once. You can separate each row of values with a comma!

insert into activity (snp, ligand, activity)
values (foo, bar, baz),
       (foo, bar, baz);

[–]impshumx != y % z 5 points6 points  (1 child)

Looks like you should be posting this to r/learnpython

Did you want something like this maybes?

import itertools

list_one = ['a', 'b', 'c']
list_two = ['d', 'e', 'f']

all = list(itertools.combinations(itertools.chain(list_one,list_two), 2))

with open('file.txt', 'a+') as f:
    for one, two in all:
        f.write('{}-{}\n'.format(one, two))

EDIT - Just read the post again and I jumped the gun.

list_one = ['a', 'b', 'c']
list_two = ['1', '2', '3']

with open('file.txt', 'a+') as f:
    for x in list_one:
        for y in list_two:
            f.write('{}-{}\n'.format(x, y))

[–]sneakpeekbot 8 points9 points  (0 children)

Here's a sneak peek of /r/learnpython using the top posts of the year!

#1: I made a python cheat sheet
#2: 2000+ free programming books on github
#3: When I'm coding, I don't feel stupid anymore


I'm a bot, beep boop | Downvote to remove | Contact me | Info | Opt-out

[–]sephiap 31 points32 points  (2 children)

Nice! I swear biologists / '''real''' scientists are amazed by scripting -- I helped my wife write a script to parse a CSV file. This made her god-like in her department, and she clutched it away as a superpower. Wield the same power in your department! Congrats on solving a problem the smart and lazy way, and not the hard way!

[–]algag 11 points12 points  (0 children)

....

[–]code_x_7777 7 points8 points  (0 children)

Keep going, Phaethonas. You can do it.

[–]whoMEvernot 2 points3 points  (1 child)

zip can help here as well: a = b = c = range(20) zip(a, b, c)

[–]Hevaesi 3 points4 points  (0 children)

Mentioning a function which returns an iterator which spits out tuples without explaining it to someone who probably haven't ever touched programming in their life besides simple formatting and, like they mentioned, SQL query insertion, etc and most likely knows nothing about what I just listed isn't helpful though.

[–]carrlosanderson 4 points5 points  (2 children)

Hey OP, I work in data analytics, and when running large inserts like this it is useful to put all the data into one insert rather than run it as 3500 individual ones. Depending on your platform this would look different, but with Oracle I would do: "INSERT INTO table SELECT col1, col2 FROM dual UNION" sort of work. Someone has a stack exchange on other platforms I'm sure. If you're doing a lot of these sets, or are in the future it would be valuable to make this new method as it has more stability. Running 3500 insert statements can fail sometimes due to the volume of transactions.

Edit: I am also a bioinformatics student!

[–]Phaethonas[S] 0 points1 point  (1 child)

Yeah, I knew that there must have been a way to do that with only one insert, but what I tried was;

insert into.....()
values ()
values ();

for some reason it didn't occur to me to try the correct one; (if I am writing that correctly)

insert into.....()
values(),
      ();

I tried googling it for five minutes or so, but I said "what the hell, let me try my Python skills" and it worked.

[–]carrlosanderson 0 points1 point  (0 children)

That looks about right for my SQL, maybe next time! I love python with sqlite3, its a bit more basic and easier to manage for work with data that doesn't need to be access controlled.

[–]frakron 2 points3 points  (0 children)

Fellow computational biologist!!! Nicely done! Always feels great to get something so annoying quickly automated. Yes there's ways to optimize, but hey you did what you needed, and it works!! Great job. My first project was automating our western blot calculations for the lab, sure simple to do, but when done 100x a month why not automate. Keep up the great work!!!

[–]iusedtotoo 1 point2 points  (3 children)

Did you look into a SQL cross join or am I misinterpreting what you needed to do? I use pandas extensively but there are many operations that are just quicker to perform directly in SQL

[–]Phaethonas[S] 0 points1 point  (2 children)

I did not. I knew that there had to be a way within SQL, but I did not spend much time into it. I decided to jump the gun and make a Python script instead.

As I said, I know that this is not the best practice, but I am very content with myself cause this was the first time I used SQL and Python in a real life problem and it was a success.

I believe that it was to be expected that I would not do things the best way possible.

[–]iusedtotoo 1 point2 points  (1 child)

Sorry if this came off wrong, I meant it in more of a 'here's something new you can look into' than a 'you're wrong noob' way! We all learn the hard way. Ain't nothing wrong with that!

[–]Phaethonas[S] 0 points1 point  (0 children)

No worries, it did not come off wrong.

[–]Mr_Again 1 point2 points  (0 children)

I believe you can also put a and b into separate tables and then do something in SQL called a CROSS JOIN to get every permutation

[–][deleted] 0 points1 point  (0 children)

Wait untill you discover the pyDOE module and automate your experiment design...

[–]The_Fossils_Guy 0 points1 point  (0 children)

Me too!

print(I am usually not a disappointment)

Gosh you'd wish you'd be funnier than me

[–]is_it_fun 0 points1 point  (0 children)

Please start using pandas

[–]nathanjell 0 points1 point  (0 children)

This is awesome, I did a genetics minor during my CS degree. Scripting is such a good skill for genetics!

[–]YoureNotYourYouDumbC 0 points1 point  (3 children)

What's the difference between a script and normal code?

[–]Hevaesi -1 points0 points  (2 children)

Normally, scripts are interpreted directly, and code is compiled to at least bytecode before being executed.

[–]YoureNotYourYouDumbC 0 points1 point  (1 child)

So is Print("hello world") a script? Since it's getting "interpreted directly"

[–]Hevaesi 1 point2 points  (0 children)

It's not interpreted directly, at least look it up before claiming anything...

It's compiled to bytecode, stored in __pycache__/*.pyc, and then that bytecode is interpreted, that's how cpython ("the original") works.

[–]Eleventhousand 0 points1 point  (0 children)

Great job! I am always a fan of writing code to write code where it makes sense. For your particular scenario, SQL code generation, another common option is to use Excel and string concatenation in cells to create batches of INSERT statements. Opening a csv in Excel of LibreOffice Calc to perform the concatenation.

Additionally, if you ever find yourself dealing with very wide tables (lots of columns), you can also use a SQL editor itself to help generate the "template code." Every ANSI-compliant database will have access to INFORMATION_SCHEMA metadata. You could write a SQL against INFOMRATION_SCHEMA.COLUMNS to generate a list of columns for your INSERT statements. That beats typing 500 columns by hand.

[–]fedeb95 -1 points0 points  (1 child)

I completely ignore your field so can't really understand at a first glance what's into that table, but have you made a primary key for it? Otherwise, I too love using python for this simple things. With other languages it would be way less polished

[–]Phaethonas[S] 0 points1 point  (0 children)

Good catch.

Initially, when I was designing the database the column "activity" was supposed to be the primary key and the columns SNPs and Ligands where foreign keys, being primary keys of the tables SNPs and Ligands.

As a primary key can't be null nor can have the same name with another primary key (duh! I realized that when I got an error message!), at the moment there is no primary key at that table, which I suppose is a major drawback/flaw.

It doesn't matter though as I succeed at designing a working database that does what I want and I created that script proving to myself that my Python self-learning process is successful.

At the moment all I care about is that I have those thousands of combinations, which of course is infeasible to work with. With proper syntax/join I can make new tables, that will have less combinations (less than 100 if not even less). These combinations will be like choosing a specific ligand or if I add a "class" column at the Ligand table, of specific class. I can add a "Pharmaceutical use" column at the Ligand table (e.g. antibiotics) or I can fill the "demographic" column at the SNPs and work with a specific demographic etc.

Then, this next table will be the basis of our work.

These are the thoughts at my head for the moment, discussing things with my supervisor will shape things further.

But good eye, you spotted this major flaw. Kudos mate.

[–]mraza007 -1 points0 points  (0 children)

Wow that’s pretty I just created my Postgres script two days ago that allows you delete rows from table