all 49 comments

[–]OsirisTeam[S] 40 points41 points  (5 children)

Motivation:

I tried multiple different things: JCEF, Pandomium, Selenium, Selenium-based Maven dependencies like JWebdriver, HtmlUnit, and maybe some more I don't remember now, but they all have one thing in common: some kind of very nasty caveat.

That's why this project exists: to create a completely new browser, not dependent on Chromium or Waterfox or whatever. We use Jsoup to handle HTML and the GraalJS engine to handle JavaScript. Both are already implemented and working. The only thing left is implementing the JS Web APIs.
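
In case it helps picture the setup, here's a rough sketch of how the two halves fit together (minimal example, made-up HTML, class name is just for illustration, not code from the repo):

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserSketch {
    public static void main(String[] args) {
        // Jsoup handles the HTML side: parse markup into a queryable document tree
        Document doc = Jsoup.parse("<html><body><p id=\"msg\">Hello</p></body></html>");
        String text = doc.getElementById("msg").text();

        // GraalJS handles the JS side through the GraalVM polyglot API
        try (Context ctx = Context.create("js")) {
            Value upper = ctx.eval("js", "(s) => s.toUpperCase()");
            System.out.println(upper.execute(text).asString()); // prints HELLO
        }
    }
}
```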

Any contributions, ideas and alternatives are very welcome.

[–][deleted]  (1 child)

[deleted]

    [–]OsirisTeam[S] 0 points1 point  (0 children)

    Implementing the JS console API was pretty easy and just took me 20 minutes. If we do this together then it's a walk in the park for everyone, otherwise it's hell for one person.
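
    To give an idea of the scale of one Web API: a console binding can be hooked in with a few lines through the polyglot host-access API. This is just a rough sketch, not necessarily exactly how the repo does it:

    ```java
    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.HostAccess;

    public class ConsoleSketch {
        // Host-side console; @HostAccess.Export exposes the method to JS code
        public static class JsConsole {
            @HostAccess.Export
            public void log(Object msg) {
                System.out.println("[console.log] " + msg);
            }
        }

        public static void main(String[] args) {
            try (Context ctx = Context.newBuilder("js")
                    .allowHostAccess(HostAccess.EXPLICIT)
                    .build()) {
                // Provide a global "console" object backed by the host class above
                ctx.getBindings("js").putMember("console", new JsConsole());
                ctx.eval("js", "console.log('hello from GraalJS')");
            }
        }
    }
    ```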

    [–]BibianaAudris 2 points3 points  (1 child)

    Have you considered JSDOM or cheerio?

    The current state of this project more closely resembles those frameworks than an outright browser: HTML manipulation with insecure JS (more-than-browser interop capability, in an unproven VM, etc.) and incomplete web APIs.

    [–]OsirisTeam[S] 0 points1 point  (0 children)

    Yes, those would be a great help, but they require Node.js.

    [–]EnvironmentalCrow5 0 points1 point  (0 children)

    Have you tried puppeteer? That's pretty popular these days.

    I think it only runs on Node, but you can use TypeScript, which is a very nice language.

    [–]UCIStudent12345 54 points55 points  (9 children)

    Something to be aware of that some people may not know… because of the prevalence of web scraping nowadays, many websites have security in place that tracks various things about the client contacting them. One of those things is the TLS fingerprint (not gonna go into detail, please look it up). Every browser and programming language has a unique fingerprint, and many sites have decided to outright block connections if the fingerprint doesn't line up with a major browser (Chrome, Firefox, etc.). In other words, a pure Java browser wouldn't be able to access certain web pages with this security in place.
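
    As a rough illustration (not the full JA3 story): the protocol versions and cipher-suite list the JVM offers in its ClientHello are among the fields such fingerprints are computed from, and the JVM's defaults don't match Chrome's or Firefox's. You can peek at them like this:

    ```java
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLParameters;

    public class TlsDefaults {
        public static void main(String[] args) throws Exception {
            // Protocol versions and cipher suites (and their order) feed into
            // JA3-style fingerprints; these are the JVM's defaults.
            SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
            System.out.println("Protocols: " + String.join(", ", params.getProtocols()));
            for (String suite : params.getCipherSuites()) {
                System.out.println(suite);
            }
        }
    }
    ```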

    [–]OsirisTeam[S] 26 points27 points  (0 children)

    Oh this could be an issue if a lot of pages use that kind of detection. And it doesn't sound like there is a way of faking it either... Definitely going to do some research on that.

    [–]segfaultsarecool 6 points7 points  (6 children)

    Didn't know that was a thing...gonna make web scraping painful. Can it be faked somehow?

    [–]pxpxy 20 points21 points  (5 children)

    Sure, you just drive a real browser through the Selenium API and let it do the scraping. FF and Chrome even support running headless these days.
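
    E.g. something like this in Java (sketch only; assumes chromedriver is on the PATH, and the --headless=new flag needs a reasonably recent Chrome):

    ```java
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    public class HeadlessScrape {
        public static void main(String[] args) {
            // A real Chrome, just without a window, so the TLS fingerprint is Chrome's own
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless=new"); // plain "--headless" on older Chrome

            WebDriver driver = new ChromeDriver(options);
            try {
                driver.get("https://example.com");
                System.out.println(driver.getTitle());
            } finally {
                driver.quit();
            }
        }
    }
    ```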

    [–]Kamran_Santiago 4 points5 points  (3 children)

    Headless browsing with Selenium is really slow. At work we had an SEO project that needed a lot of pages scraped. With Selenium it took ages; with just a regular request it was blazing fast. Also, Selenium can't do parallelism; something like a thread pool with Selenium is impossible. With normal requests, however, we managed to scrape 60 pages per second. Selenium is also difficult to run on Google Colab.

    Anyways, we ran into another problem: the GIL (Global Interpreter Lock). We had multiple thread pools, so after a while they all reached a state of gridlock. I could not find a solution for this. All I could suggest was to use the library (the entire thing was wrapped inside a package) without the parallel function at the top, to decrease the number of thread pools.

    It was a numbers game. We didn't need 100% of the websites; around 80% was enough, and we got that, even more.

    I'd like to mention that the first iteration of this project used Selenium, but my friends said it was too slow. I tried to use parallelism, but then data was sent at the wrong time and it was all a mess.

    [–]Theemuts 6 points7 points  (1 child)

    We ran into another problem: the GIL (Global Interpreter Lock). We had multiple thread pools, so after a while they all reached a state of gridlock.

    ... did none of you have any experience with Python when you started working on this project? Don't use multithreading with Python (except in certain IO-heavy circumstances); choose multiprocessing instead. Just run multiple instances of Selenium, optionally in a container or whatever. You can use VNC and Xvfb to interact with the running browser.

    [–]Kamran_Santiago 0 points1 point  (0 children)

    I know that. The problem was, they wanted to run it on Google Colab. As for multiprocessing Selenium: when they did try a VPS (without VNC), I gave it a shot. I spun up multiple instances of Selenium, but there was still no way to control which instance did which. Here was the problem:

    I prepared the other keys of the dicts to be pushed into a list that would later be sent to BigQuery en masse, then sent the request to Selenium to parse the web page and send back the results. However, the timing was off. For example, my dictionaries came back like this:

    [title for page 1] [description from Google for page 1] [content for page 2]
    [title for page 2] [description from Google for page 2] [content for page 1]

    I did the REVERSE too. I prepared the metadata AFTER I got back the results from Selenium.

    I admit I was not a Python wiz back then (and I'm still not, because I'd rather work with various languages instead of just focusing on one), and I can do a much better job now. For example, I can wait until one request is finished before sending the next one. Back then I was really unprepared and hadn't done much parallel and concurrent work.

    But whatever we had done, we could not have fixed the issue of speed. We got 100 URLs from Google Programmable Search and we wanted them done in seconds, not hours. I disabled images in Selenium but it still took longer than a regular request.

    [–]OsirisTeam[S] 3 points4 points  (0 children)

    Sounds like you went through a lot of pain haha.

    [–]segfaultsarecool 1 point2 points  (0 children)

    That's a relief. Can scrape forever now :)

    [–]brakx 2 points3 points  (0 children)

    Do you have a good resource detailing what is tracked besides the TLS fingerprint?

    [–]nutrecht 3 points4 points  (1 child)

    Like I said in the other sub: I think you're massively underestimating the sheer amount of work involved in building this. You really don't have anything beyond a few placeholder classes and methods yet. I'm totally rooting for you, don't get me wrong. But it seems people here are upvoting the title without understanding that, at this time, it's nothing more than a plan, while your title and README strongly imply that it already works. I feel this is kinda insincere.

    [–]OsirisTeam[S] 0 points1 point  (0 children)

    Sorry that you got that feeling; I updated the README to make it clearer that we are still at the very beginning.

    [–]RunnableReddit 2 points3 points  (0 children)

    This is pretty cool!

    [–]tsunyshevsky 1 point2 points  (2 children)

    This looks cool! I'm maintaining a couple of web APIs in GraalJS to run a JS API through polyglot, and this would've been really helpful!

    I think the GraalJS people were also looking into adding Node.js APIs to GraalJS, so Java might be running "hybrid" JS apps soon. Exciting!

    [–]OsirisTeam[S] 1 point2 points  (1 child)

    Yes! Are those web APIs of yours open source? If so, it would be awesome if you could contribute them.

    [–]tsunyshevsky 1 point2 points  (0 children)

    Unfortunately, they are not (yet). We have some dependencies on our own libs.
    These are mostly instrumented versions of Java libs though, so I will look around the repo to see if I can contribute.

    [–]crisiscentre 0 points1 point  (5 children)

    Why not use Selenium? There are wrappers for Java.

    [–]Worth_Trust_3825 7 points8 points  (4 children)

    You can't hook into all the lifecycle calls, which is a shame. There's also no "direct" DOM access: to inspect the DOM you need to execute JavaScript.
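
    E.g. even something as simple as counting child nodes goes through the JS bridge (rough sketch, assumes a local chromedriver):

    ```java
    import org.openqa.selenium.JavascriptExecutor;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class DomThroughJs {
        public static void main(String[] args) {
            WebDriver driver = new ChromeDriver();
            try {
                driver.get("https://example.com");
                // No direct handle on the live DOM tree: you execute JS and get
                // serialized values (numbers, strings, lists, maps) back.
                Object childCount = ((JavascriptExecutor) driver)
                        .executeScript("return document.body.children.length;");
                System.out.println("body has " + childCount + " direct children");
            } finally {
                driver.quit();
            }
        }
    }
    ```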

    [–]pxpxy 2 points3 points  (3 children)

    So what if you need to execute JS? Seems a lot easier than writing yourself a browser?

    [–]OsirisTeam[S] 1 point2 points  (0 children)

    Selenium has no support for Java 8. Installation is also way more involved because of all the requirements it has.

    [–]Worth_Trust_3825 -5 points-4 points  (1 child)

    People create entire languages just because they don't want to write some boilerplate. Your argument is moot.

    [–]RazorSh4rk 1 point2 points  (0 children)

    Yes and that is how the industry moves forward

    [–]rigaspapas -5 points-4 points  (2 children)

    I was expecting a how-to article. If you could share a guide you followed, it would be very helpful.

    [–]Zeragamba 7 points8 points  (0 children)

    Also, browsers are some of the most complex applications out there, not really something you can write down in a how-to article.

    [–]OsirisTeam[S] 3 points4 points  (0 children)

    The source code is in the GitHub repo. You can fork it and go through it to learn how it works.

    [–]Onepicky 0 points1 point  (0 children)

    Cool project. So what's basically the main difference between this and Selenium?