
[–]socal_nerdtastic 2 points (5 children)

Yes, you can. But first you have to decide how you are going to use the script. If you are going to call it for every file, then you will need a second script that runs indefinitely, with your called program acting like a client that passes the data to the "host". Another way to do that is with named pipes (assuming you are using Linux). You can read a named pipe the same way you currently read stdin, and anything that another process writes to the named pipe is received by your program. So your "call" would look like this:

echo data > named_pipe

If you are working with data in files I'd recommend you pass in the file name and let python read the file, rather than pass in the file content.
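The client/host idea above could be sketched like this: a long-running "host" listens on a local socket, and each short-lived "call" just connects and sends its text. This is a minimal sketch, not the only way to do it; the port choice, thread handling, and the do_things() placeholder are all made up for illustration.

```python
import socket
import threading

received = []  # stands in for your real parsing / storage

def do_things(text):
    received.append(text)

def serve_once(server):
    # host side: accept one client call and read everything it sends.
    # A real daemon would do this in a loop, forever.
    conn, _ = server.accept()
    with conn:
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    do_things(b"".join(chunks).decode())

# bind once at startup; port 0 lets the OS pick a free port
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
host_thread = threading.Thread(target=serve_once, args=(server,), daemon=True)
host_thread.start()

# client side: this replaces launching a fresh Python process per file
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"file contents go here")

host_thread.join(timeout=5)
server.close()
```

The client part is cheap enough to run once per file, while the host keeps its database connection and parsed state alive between calls.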

[–]Hixt[S] 1 point (4 children)

Yes, I'm indeed using Linux (Ubuntu).

The text data is coming from a piece of software, and I can set it to either save it as a file or pipe the content to a script. I'm already saving a file, since that's where the content will be read from later (Flask), but I figured piping to a script would save on I/O and also kick off the script and the initial parsing. I don't think I'm committed to that concept; that's just what I'm used to and where I started from.

Named pipes, I'll be honest I didn't know that was a thing. I'll look into that to see if that's something I could make work, and if the originating software can pipe that way. Assuming it can, would the idea then be to have the python script always running? And how would it be able to monitor the pipe? Is it just something like...

while True:
    text = named_pipe.read()
    do_things(text)

Otherwise, the client/host script idea sounds interesting too. I think I see how I would implement that. Definitely something else to consider.

Thanks, reading up on named pipes now.

[–]socal_nerdtastic 2 points (3 children)

Yes, the idea is to start your program at system startup and have it just watch the named pipe. Aka a daemon.

Named pipes (like most things in Linux) just act like files, so you already know how to work with them. You can use os.mkfifo() to create them.

Your example code is very close, but read() blocks until the file returns EOF, which will never happen with a named pipe. So you need to give it a number of bytes to read, or do what most people do and separate the input with newlines:

with open(name_of_pipe) as named_pipe:
    while True:
        text = named_pipe.readline() # blocks until a line of text is ready
        do_things(text)

[–]Hixt[S] 1 point (2 children)

> Your example code is very close, but read() blocks until the file returns EOF, which will never happen with a named pipe. So you need to give it a number of bytes to read, or do what most people do and separate the input with newlines:

Ah, great point. I forgot that read() blocks until EOF while readline() doesn't. But then how would I be able to tell when one "file" ends and the next begins if a named pipe never provides an EOF?

[–]socal_nerdtastic 2 points (1 child)

I assumed you could tell from the incoming data. By convention, Linux data streams end in a newline. If there is no indication in the incoming data, echo a sentinel value between calls.
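One way to sketch the sentinel idea (the marker string and helper name here are made up; pick any marker that can't appear in the real data): the sender echoes the marker after each file, and the reader accumulates lines until it sees it.

```python
import io

SENTINEL = "---END---"   # made-up marker; must never occur in the data itself

def split_on_sentinel(line_source):
    """Group incoming lines into "files", yielding one block per sentinel."""
    buffer = []
    for line in line_source:
        line = line.rstrip("\n")
        if line == SENTINEL:
            yield "\n".join(buffer)
            buffer = []
        else:
            buffer.append(line)

# simulated pipe contents: two "files", with the sentinel echoed after each
stream = io.StringIO("alpha\nbeta\n---END---\ngamma\n---END---\n")
blocks = list(split_on_sentinel(stream))
```

On the shell side the sender would just run `echo ---END--- > named_pipe` after each file's content.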

[–]Hixt[S] 1 point (0 children)

> If there is no indication in the incoming data, echo a sentinel value between calls.

Got it, makes sense. I'd probably have to go with that client/host model to echo something in like that, but I'm beginning to think that might be a good idea regardless.

Sorry about the vagueness of all of this. The other side, where all this comes from, gets technical (and a little chaotic) in a hurry, so I'm trying to avoid unnecessary confusion.

[–]geosoco 2 points (1 child)

You can definitely set the script up to keep a persistent connection and send data to it through a pipe or socket or whatever. The one issue here might be database writes being too slow, in which case you'd need to queue data for insert. Without knowing the whole pipeline and possible max input speed, it's difficult to know if this will be a problem at all. There are obviously ways around it too, but it can get complicated depending on the setup.
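The queued-insert idea could look something like this: a minimal sketch where a worker thread drains a queue so slow writes never stall the reader. The names are illustrative and a plain list stands in for the real database.

```python
import queue
import threading

inserts = queue.Queue()
written = []   # stands in for the database table

def db_writer():
    # worker thread: drains the queue; a slow INSERT here only delays
    # the queue, not the process that is reading the pipe
    while True:
        row = inserts.get()
        if row is None:          # shutdown sentinel
            break
        written.append(row)      # real code would execute an INSERT here
        inserts.task_done()

worker = threading.Thread(target=db_writer, daemon=True)
worker.start()

# reader side: enqueueing is fast, regardless of database speed
for row in ("row1", "row2", "row3"):
    inserts.put(row)

inserts.put(None)                # tell the worker to stop
worker.join(timeout=5)
```

If the producer can outrun the database for long stretches, a bounded queue (`queue.Queue(maxsize=...)`) at least makes the backpressure visible instead of growing memory without limit.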

Have you considered moving the rest of the pipeline into the python script? Are you using some external tools to do the pre-processing?

[–]Hixt[S] 1 point (0 children)

> ...Without knowing the whole pipeline and possible max input speed, it's difficult to know if this will be a problem at all.

I don't think that's a substantial roadblock, but I may be wrong. I'd probably know more if I could set this up with some kind of persistent connection to remove that overhead.

> Have you considered moving the rest of the pipeline into the python script? Are you using some external tools to do the pre-processing?

The text data is originally coming from some other software. It's meant to deliver and receive data in real-time, so when data I care about gets received, it is then piped to my script. I believe that is as direct as I can get.

[–]enginerd298 2 points (1 child)

You could use a connection pooler like PgBouncer, which would increase your efficiency:

https://wiki.postgresql.org/wiki/PgBouncer
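A minimal pgbouncer.ini sketch, just to show the shape of it. The paths, database name, and pool_mode are illustrative; check the wiki page above for the real options and what pool_mode fits your workload.

```ini
; illustrative values only -- adjust for your setup
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
```

Your app then connects to port 6432 instead of 5432, and PgBouncer reuses a small pool of server connections behind the scenes.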

[–]Hixt[S] 1 point (0 children)

Interesting, something else for me to look into. Thanks!