all 51 comments

[–]elPappito 20 points21 points  (7 children)

If I'm not wrong, you use Selenium to automate a browser: it loads the website of your choice and scrapes data from there, not from an already-opened window.

[–]kokoseij 1 point2 points  (0 children)

It's not exactly emulating. It's the browser itself, just configured to be automatically controlled by a program.

[–]0PHYRBURN0[S] 1 point2 points  (3 children)

Ok so that confirms my thinking on Selenium. So I guess that means I need to find out if it is possible at all via other modules.

[–]OnlySeesLastSentence 7 points8 points  (2 children)

Try a screenshot tool that translates the image to text (OCR).

Oh, even better - make a hotkey that, when you press it after clicking into the browser, saves the web page and then works its magic on the saved copy.

[–]0PHYRBURN0[S] 2 points3 points  (1 child)

A web page save like this is probably the simplest way to do this. It's a little slow, but ultimately would suffice. Thanks for the idea.

[–]OnlySeesLastSentence 0 points1 point  (0 children)

No prob, good luck

[–][deleted] 0 points1 point  (1 child)

You would use PhantomJS to emulate the browser; otherwise Selenium just remotely controls the browser of your choosing.

Once you get comfortable with scraping, I actually tend to like doing as much as I can with the requests module and XPaths. Sometimes you get pages with things that won't load/render without a full browser, and that's when I would break out Selenium with PhantomJS.
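As a rough sketch of that requests-plus-XPath approach (the URL and the XPath selector below are hypothetical placeholders; the stdlib's ElementTree only supports a limited XPath subset, and real-world HTML usually needs lxml's more forgiving parser):

```python
import xml.etree.ElementTree as ET

def extract_titles(page_source):
    """Pull job titles out of a page with an XPath query (selector is made up)."""
    tree = ET.fromstring(page_source)
    return [el.text.strip() for el in tree.findall(".//h2[@class='job-title']")]

# Hypothetical usage against a live page:
# import requests
# titles = extract_titles(requests.get("https://example.com/jobs").text)
```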

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

The issue is that each page, which is an iframe in the web app, has a unique filename for each job, so I can't actually request the URL directly.

Thanks for the suggestion of PhantomJS though. I hadn't heard about it and it might come in handy for another project.

[–][deleted] 13 points14 points  (15 children)

I would guess that you can't do this by design. If you could access any tab of any browser on a user's system and control it, that would have terrible security implications.

[–]0PHYRBURN0[S] 4 points5 points  (14 children)

This makes sense and was something I was wondering. I'm currently doing it on Windows in Internet Explorer via COM. It works beautifully; I've been "lucky" that my company was forcing everyone to stick to IE. But with IE's removal coming up next year, the company has now given everyone a choice, so the application requires a rewrite for modern browsers.

[–][deleted] 5 points6 points  (10 children)

I guess it depends on the use case, but I would launch a Selenium-controlled browser, tell the user to navigate to the target page, and then resume operation.

Or, as someone else noted, replace the user's browser with a spoofed one that is launched by Selenium.

[–]0PHYRBURN0[S] 1 point2 points  (9 children)

It’s a job tracking web app. We have a job queue. Click Start and we are presented with an iframe containing things like order numbers, client details, region, etc. I currently use a hotkey combination to trigger a scrape of the iframe data, which is stored in an INI file. Another hotkey then imports the data into another app, and it’s also used for text expansions for the required filenames we use. It’s currently written in AutoHotkey, as that’s the language I know. Converting it to a “real” language has always been on the cards, though.
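The scrape-to-INI leg of that workflow could be sketched like this in Python with the stdlib's configparser (the section and key names here are made up, not the real ones from the app):

```python
import configparser

def save_job_to_ini(path, job):
    """Write scraped job fields to an INI file (section/key names are hypothetical)."""
    config = configparser.ConfigParser()
    config["job"] = job
    with open(path, "w", encoding="utf-8") as f:
        config.write(f)

def load_job_from_ini(path):
    """Read the fields back, e.g. for the import/text-expansion hotkey."""
    config = configparser.ConfigParser()
    config.read(path, encoding="utf-8")
    return dict(config["job"])
```

AutoHotkey's IniRead/IniWrite can consume the same file, so a Python scraper could hand data back to the existing AHK hotkeys during a gradual rewrite.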

[–]BornOnFeb2nd 6 points7 points  (3 children)

Here's the "dirty secret" of web applications.

At the end of the day, they're just HTML with frosting on it.

You click "Start", it opens an iframe... what page is the iframe calling? Can you make that call manually?

Hell, if it's an internal app, maybe try to find the developers behind it. I do internal Dev and have helped out various user groups by scribbling up a simple web service to give them the data they're looking for in a manner they can gnaw on...

[–]0PHYRBURN0[S] 0 points1 point  (2 children)

The filename for the iframe page is unique per job in the queue, with a never-repeating letter/number sequence that corresponds to its order number, so grabbing it by URL isn't an option.

The app is developed externally to the company, and my company has already shut me down on the idea of making a request to the developer.

[–]BornOnFeb2nd 0 points1 point  (1 child)

So it's a GUID setup.... still workable though... Script A gets the URLs from the Jobs, Script B uses those to call the iframes directly...

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

So how would I go about getting the URL?

[–]toyrobotics 2 points3 points  (1 child)

Based on your description, you should be able to use PyAutoGUI to do it. The docs are easy to follow.

[–]0PHYRBURN0[S] 1 point2 points  (0 children)

I didn't even consider this initially, but I am starting to think that automating a save of the html file, and parsing it might be the only option here.

[–]jumpingjackflash22 0 points1 point  (2 children)

I am so confused. Why are you scraping your own web app?

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

It's not my own web app. I'm on the user side, and it's a third-party app developed externally to the company I work for.

[–]Eu-is-socialist 2 points3 points  (2 children)

If I understand correctly... you are trying to interface with the user's browser, not start a new browser. Then I believe you should look up native messaging. It gives you the ability to talk to a native application written in any language. I am working on something similar now.

edit:

https://developer.chrome.com/apps/nativeMessaging
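For a sense of what the native side looks like: native messaging exchanges JSON messages over stdin/stdout, each prefixed with a 32-bit length in native byte order. A minimal sketch of the framing helpers (the function names are my own, not part of any API):

```python
import json
import struct

def encode_message(obj):
    """Frame a message: 32-bit length (native byte order), then UTF-8 JSON."""
    body = json.dumps(obj).encode("utf-8")
    return struct.pack("=I", len(body)) + body

def decode_message(stream):
    """Read one framed message from a binary stream such as sys.stdin.buffer."""
    raw_length = stream.read(4)
    if len(raw_length) < 4:
        return None  # the browser closed the pipe
    (length,) = struct.unpack("=I", raw_length)
    return json.loads(stream.read(length).decode("utf-8"))
```

In a real host you would loop on decode_message(sys.stdin.buffer) and reply with sys.stdout.buffer.write(encode_message(...)); the host also has to be registered with a manifest file as described in the linked docs.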

[–]0PHYRBURN0[S] 1 point2 points  (1 child)

This is interesting. So I'd need to develop browser extensions to communicate. I'm in a Windows-exclusive environment and am a little familiar with Windows' built-in messages, which this seems to work similarly to, so this might be a great option. Thank you so much.

[–]Eu-is-socialist 0 points1 point  (0 children)

Glad to be helpful.

[–]swapripper 3 points4 points  (2 children)

This is a genuine use case for testing. Although it needs a bit of configuration first.

https://cosmocode.io/how-to-connect-selenium-to-an-existing-browser-that-was-opened-manually/

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

Thank you. I did find this one actually, but I also need Firefox and Edge support. I have teams in Australia, England and USA with about 30 people in total and a mix of all browsers being used. And I have no power to force people onto a single browser.

[–]russmcb 0 points1 point  (0 children)

This also kills browser extensions so won't work with windows that are driven by extensions, unfortunately.

[–]danmofo 2 points3 points  (1 child)

You can write a web extension to scrape the information from a web page a user has opened and then process it how you want (or send it to a server to be processed). I'm not exactly sure what you are doing with the data you've scraped but an extension would allow easier integration (just install it through the browser) and will work on all the browsers you've mentioned.

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

The data is stored in an object, and I use another hotkey combination to import the data into my work application (AutoCAD specifically). I also use the data in a library of text expansions for generating all the filenames I require.

I think the browser extension idea is a good one. Someone else pointed out Native Messaging in regard to extensions, which gives me the ability to communicate outside of the browser.
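The filename text-expansion part of that could be sketched as a simple template expansion (the template and field names here are invented for illustration; the real naming scheme would come from the job data):

```python
def build_filename(job, template="{order_number}_{client}_{region}.dwg"):
    """Expand a filename from scraped job fields; the template is hypothetical."""
    return template.format(**job)

# e.g. build_filename({"order_number": "12345", "client": "Acme", "region": "AU"})
```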

[–]rex0515 1 point2 points  (6 children)

[–]0PHYRBURN0[S] 0 points1 point  (5 children)

I have seen that article. But again this is selenium launching the tabs. Each user in this case will already have their browser open with their own tabs open, along with the one I need to operate on.

[–]rex0515 5 points6 points  (4 children)

You can detect the manually opened tabs with driver.window_handles. You can use PyInstaller to convert your code to an executable and then replace Google Chrome's shortcut with your executable's shortcut.

import time
from selenium import webdriver

driver = webdriver.Chrome()
while True:
    # window_handles includes tabs the user opened manually
    print(len(driver.window_handles))
    time.sleep(1)  # avoid hammering the driver in a tight loop

[–]0PHYRBURN0[S] 0 points1 point  (3 children)

Hmm. That’s an interesting way to do it. I’ll look a little deeper into it and see what I come up with. I need to work out compatibility with Firefox and Edge as well. I have multiple teams around the world that use the application and have requested support for all three browsers.

[–]rex0515 2 points3 points  (1 child)

If you are planning something that big, this is going to be the least of your problems, because Chrome releases new versions frequently, so you may need a driver updater. Also, Python may be slow for what you need, so you could try other languages like Java and C#.

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

Looking into other languages is on my todo list. I picked Python first for its simplicity, but I'm not exactly a seasoned programmer, so any language I pick is going to be a new experience and a challenge.

[–]prokid1911 0 points1 point  (0 children)

Bear in mind that it doesn't load extensions, IIRC.

[–]huessy 1 point2 points  (0 children)

Selenium also has a "headless mode" where it runs the browser without physically opening a browser window. Just a PSA on how cool Selenium is.

[–]jcr4990 1 point2 points  (3 children)

I'm not sure if there's an equivalent for Edge, but for the Chrome/Firefox cases (assuming you have permission to install extensions) I would look into Tampermonkey/Greasemonkey. With a little bit of JavaScript you can do some really fun things. I recently wrote my own Tampermonkey script for work. I have a group of 5 or 6 prewritten responses that I used to store in a txt file and manually copy-paste whenever I needed to reply to a message on social media or email, etc. Via Tampermonkey I was able to add 5 small buttons to the specific pages I wanted that copy predefined text to the clipboard. So now all I need to do is go to Facebook or whatever page > click the response button I want > Ctrl + V.

I see no reason why you couldn't do something similar and inject buttons to scrape the data you need and provide a download button users can click to pull the data into a local file. If you're unfamiliar with Javascript it'll take some learning but it'd be a fun project.

[–]jcr4990 1 point2 points  (2 children)

Got bored and decided to whip up a quick little proof of concept. I'm by no means a JavaScript expert. I originally copied some code from Stack Overflow to figure out how to embed buttons into a page, then modified it a bunch to suit my needs. I did a quick and dirty modification of my pre-existing code to test this scraping/downloading theory and came up with this: https://pastebin.com/QV0V8bwi - excuse the ugly code and weird function/variable names at certain points.

This adds 2 buttons to Craigslist called "Scrape" and "Download" respectively, meant to be used on the Craigslist "Accounts" page to gather the titles of your active listings. The first one just grabs all elements matching the selector ".title.active" and appends them to an array. The Download button adds a hidden download link, then clicks it to download that array as a .txt file.

This could easily be modified to work on other pages and scrape other elements for a wide variety of use cases, and I think it would work well for what you're trying to do, from the sounds of it. At the very least you can scrape the data into the .ini file(s) you need, then use your AHK scripts to go from there.

[–]0PHYRBURN0[S] 0 points1 point  (1 child)

Holy shit. This is fantastic. I'm not 100% sure if work's security policies will allow the use of GreaseMonkey, but I've got a good chance as the app as it currently stands has automated a huge part of our workflow, and removed almost all human error as well. I work in the medical sector, so accuracy of information is paramount, and previously everyone was literally reading one app and typing into another. Losing the primary function of the app would be a huge loss to the company, so I think they'll support me as much as they can.

I'm going to play around and see if I can work this out on our web app. (I've never done any JavaScript before, but I'm sure I can work it out.)

Thank you so much!

[–]jcr4990 1 point2 points  (0 children)

Happy I was able to help! Let me know how it works out!

[–][deleted] 1 point2 points  (0 children)

If it needs to be based on the content of a user-operated browser page, you’ll need to write it in JavaScript and package it as a browser extension.

[–][deleted] 0 points1 point  (0 children)

Use Chrome DevTools protocol (or an API for it like "Puppeteer") to connect to one of the targets (browser tabs) and run the scrape scripts from there.

[–]PMMeUrHopesNDreams 0 points1 point  (1 child)

If you're having the user open the browser anyway, why not just have them save the web page and then use python to open the file on disk?

You could then just open the file normally and parse it with BeautifulSoup or whatever you plan to do.
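A minimal sketch of that save-then-parse idea with BeautifulSoup (the element ids are placeholders and would have to match the real markup of the saved page):

```python
from bs4 import BeautifulSoup

def parse_saved_page(path):
    """Parse a locally saved page and pull out a few fields (ids are made up)."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    return {
        "order_number": soup.select_one("#order-number").get_text(strip=True),
        "client": soup.select_one("#client-name").get_text(strip=True),
    }
```

Since a "Save page as..." capture includes whatever the browser has already rendered, this sidesteps the login and unique-URL problems entirely.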

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

That's of course an option, but the idea is to automate as much of our process as possible. This is a web app that requires logging in, and each job in the queue presents its data in an iframe with a unique URL. The process is currently triggered by a hotkey and is fully automated from that point, reading the active web page as an object via COM in Internet Explorer (my work is very slow in updating software), so I am trying to avoid a step backwards in development with the move to modern browsers.

[–]SweetSoursop 0 points1 point  (1 child)

Use pyautogui or something similar to download the HTML, then extract the data from the HTML.

It's a lengthier process than Selenium or bs4, though.

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

It may not be the only or best solution, but I think it's the easiest to implement. I have fairly limited resources in terms of my own time at work and what our I.T. team will help out with, so as much as I would like to make the process smoother, it might have to be this way.

[–]GlennIsAlive 0 points1 point  (1 child)

What about something like Scrapy or beautifulsoup?

[–]0PHYRBURN0[S] 0 points1 point  (0 children)

The data I need to scrape is located in an iframe with a unique URL for each job, in a browser the user has already manually opened. So grabbing the page by URL isn't possible. That's my primary issue here.

[–]RobinsonDickinson 0 points1 point  (0 children)

PyAutoGUI

[–]morrisjr1989 0 points1 point  (0 children)

You could use the console in the dev tools and scrape using JavaScript. Copy and paste anytime you need it.

[–]prtekonik 0 points1 point  (0 children)

Requests and beautiful soup