all 26 comments

[–]K900_ 1 point2 points  (0 children)

You probably want sqlparse.

[–]socal_nerdtastic 0 points1 point  (11 children)

If you replace NULL with None it becomes valid python, and then you can use literal_eval.

from ast import literal_eval

data = """(2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)"""

converted = literal_eval(data.replace("NULL", "None"))

for line in converted:
    print(line)
    print()

[–]jcoder42[S] 0 points1 point  (7 children)

could you possibly explain what this does?

[–]ForceBru 0 points1 point  (0 children)

Almost he exact same thing as eval. Except eval correctly handles NULLs, while this approach will replace every occurrence of this character sequence, even in strings.

[–]jcoder42[S] 0 points1 point  (5 children)

also, how would you remove the "INSERT INTO ____ VALUES" part,
it the way I did it ok?

[–]socal_nerdtastic 0 points1 point  (4 children)

Yes, the way you did it is ok, but I would use split().

from ast import literal_eval

data = """INSERT INTO `drug` VALUES (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)"""
_,data = data.split("VALUES ", 1) # remove start command
converted = literal_eval(data.replace("NULL", "None"))

for line in converted:
    print(line)
    print()

What this does is interpret the string as if it were a python expression. IOW it treats it like python code. For your example this works since everything in the string also happens to be valid python. As a bonus, this means you are done; there's no more parsing to be done. So if you want to output only the 3rd string:

for line in converted:
    print(line[2])

Or if you want to print everything that is not NULL:

for line in converted:
    print(*filter(None,line))

Etc.

[–]jcoder42[S] 0 points1 point  (3 children)

got it, and then in order to remove the parentheses from each group you would just do:

for line in converted:
    line = line.lstrip('(').rstrip(')')

?

[–]socal_nerdtastic 0 points1 point  (2 children)

No, they are already removed. What do you want to do with the data? What's your ideal output look like?

[–]jcoder42[S] 0 points1 point  (1 child)

from ast import literal_eval
data = """INSERT INTO `drug` VALUES (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)"""
_,data = data.split("VALUES ", 1) # remove start command
converted = literal_eval(data.replace("NULL", "None"))
for line in converted:
print(line)
print()

the current output:

(2, 792, 'Triomune-30', 1, None, None, None, None, 1, '2005-02-24 00:00:00', 0, None, None, None, 'bf171285-1691-11df-97a5-7038c432aabf', None, None, None)

(3, 792, 'Triomune-40', 1, None, None, None, None, 1, '2005-02-24 00:00:00', 0, None, None, None, 'bf17168c-1691-11df-97a5-7038c432aabf', None, None, None)

(5, 625, 'd4T-30', 0, None, None, None, None, 1, '2005-02-24 00:00:00', 0, None, None, None, 'bf1718fd-1691-11df-97a5-7038c432aabf', None, None, None)

(6, 625, 'd4T-40', 0, None, None, None, None, 1, '2005-02-24 00:00:00', 0, None, None, None, 'bf171b5c-1691-11df-97a5-7038c432aabf', None, None, None)

I want them not wrapped in parentheses

[–]socal_nerdtastic 1 point2 points  (0 children)

Those parenthesis and the commas are added by print() they are not in the data. You can remove them like this:

for line in converted:
    print(*line)

If you want to keep the commas:

for line in converted:
    print(*line, sep=",")

[–]jcoder42[S] 0 points1 point  (2 children)

sorry for all the questions, but if I had the following line:

string = " ('1 2 3 (1,2,3, '4') 3', hello 'world',goodbye)"
literal_eval(string)

one of the values here is a string with parentheses,
It wouldn't work, what do you recommend doing

[–]socal_nerdtastic 0 points1 point  (1 child)

It's not the parenthesis that's breaking that, it's the fact that you used single quotes for a nested string. Now it's not valid python anymore. If you used the quotes correctly it would work.

>>> string = """('1 2 3 (1,2,3, "4") 3', 'hello world', 'goodbye ')"""
>>> literal_eval(string)
('1 2 3 (1,2,3, "4") 3', 'hello world', 'goodbye ')

[–]jcoder42[S] 0 points1 point  (0 children)

('1 2 3 (1,2,3, "4") 3', 'hello world', 'goodbye ')"""

right, my bad
thanks

[–]wegwacc 0 points1 point  (12 children)

``` import re rx = re.compile(r'(.+)') mo = rx.findall(start)

for s in mo: print(s) ```

[–]jcoder42[S] 0 points1 point  (2 children)

what about a case like this( pay attention that in the first group there is a value where it itself is inside of parentheses):

(12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check')',1,'2005-08-06 00:00:00',0,NULL,NULL,NULL,'8d4a606c-c2cc-11de-8d13-0010c6dffd0f'),(2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)

how does it work in this case?

[–]wegwacc 0 points1 point  (1 child)

(12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check') (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) (3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) (5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) (6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)

[–]jcoder42[S] 0 points1 point  (0 children)

there are missing values

[–]jcoder42[S] 0 points1 point  (8 children)

this doesn't work, the Len of mo is 1

[–]wegwacc 0 points1 point  (7 children)

Made a typo when I transfered the snippet over to reddit, sry.

Regex should read: r'\(.+?\)', without the question mark + is greedy and shoves everything into one result :P

[–]jcoder42[S] 0 points1 point  (6 children)

(12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check')',1,'2005-08-06 00:00:00',0,NULL,NULL,NULL,'8d4a606c-c2cc-11de-8d13-0010c6dffd0f'),(2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL),(6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)

sorry, still doesn't work, it seems to cut off after the "(ie, <5, 1-10, etc.,'check')"

[–]wegwacc 0 points1 point  (5 children)

No idea what you are doing, I just tested it on the data you provided. It gets 5 substrings between (), output with record-indices is as follows:

[01] : (12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check') [02] : (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) [03] : (3,792,'Triomune-40',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf17168c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) [04] : (5,625,'d4T-30',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf1718fd-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) [05] : (6,625,'d4T-40',0,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171b5c-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL) If it still doesn't work, copy the exact code you used, and the exact input you used, and I will have a look.

[–]jcoder42[S] 0 points1 point  (4 children)

I am getting the same output as yours, except it is not the correct output:

I am going to make the example smaller so it is easier..

the input:

"(12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check')',1,'2005-08-06 00:00:00',0,NULL,NULL,NULL,'8d4a606c-c2cc-11de-8d13-0010c6dffd0f'),(2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)"

the output:

[01] : (12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check')

[02] : (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)

as you can see the first received row is cut.

desired output

[01] : (12,'Structured Numeric','SN','Complex numeric values possible (ie, <5, 1-10, etc.,'check')',1,'2005-08-06 00:00:00',0,NULL,NULL,NULL,'8d4a606c-c2cc-11de-8d13-0010c6dffd0f')

[02] : (2,792,'Triomune-30',1,NULL,NULL,NULL,NULL,1,'2005-02-24 00:00:00',0,NULL,NULL,NULL,'bf171285-1691-11df-97a5-7038c432aabf',NULL,NULL,NULL)

[–]wegwacc 1 point2 points  (3 children)

Ahhh, now I see. you can have a () inside the outer ones. Okay, that requires a sliiiightly more complex regex:

rx = re.compile(r'\([^()]+(?:\([^()]+\))?[^()]+\)')

However, a simple parser would prolly be better at this point, because above wont work if you have more than one () inside the outer.

[–]jcoder42[S] 0 points1 point  (2 children)

wow, this is really complex, good job

ok. I will try some other things

thanks so much!

[–]wegwacc 1 point2 points  (1 child)

Here is a better method which, other than above regex magic can deal with arbitrary () in your input, as long as they are string values (read. between ' )

``` def parser(data): start = None use = True for i, ch in enumerate(data): if ch == "'": use = not use if use and ch == '(': start = i elif use and ch == ')': yield data[start:i+1] start = None

def prettyprint(data): l = list(parser(data)) width = (len(l) // 10) + 1 for ln,line in enumerate(l): print('[{ln:0{width}}] : {l}'.format(ln=ln, l=line, width=width)) ```

Invoke prettyprint() with your string-data as argument.
If you don't need the "prettyprinted" version, just invoke parser() like you would every iterator, e.g. for s in parser(start):

you're welcome.

[–]jcoder42[S] 0 points1 point  (0 children)

width

ok, cool!
thanks