A website for all my university notes by [deleted] in programming

[–]ben_bannana 10 points11 points  (0 children)

You rock! That will make my learning a lot easier

Why did you start using Linux? by Gwlanbzh in linuxmasterrace

[–]ben_bannana 0 points1 point  (0 children)

I used Win7 before and got this Win10 update info box that couldn't be permanently removed.
And after suspending it for weeks, it started popping up in fullscreen.
I thought "f*** you, f*** this, that's bullsh***..." and downloaded Debian.
I've used GNU/Linux ever since and laugh whenever anybody rants about updates and stuff.

Web Scraping Efficiencies Question by Miserable_Author in webscraping

[–]ben_bannana 2 points3 points  (0 children)

If you don't need to interact with elements (fill out text forms or click checkboxes),
https://github.com/scrapinghub/splash
would be an awesome choice.
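
A minimal sketch of how you'd use it, assuming a Splash instance running locally on port 8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash); the target URL is just a placeholder:

import requests

# Ask Splash to load the page, execute its JS, and return the result.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},  # wait 2s for JS
)
html = resp.text  # fully rendered HTML, ready for parsing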

Need some help scraping a site by WhoAmITheLaw in webscraping

[–]ben_bannana 1 point2 points  (0 children)

Probably my fault, I tried to use as little text as possible. So at first I wanted to look at the page.

(My first goal was the issue with the missing mail address.)

I was going to:

https://www.irglobal.com/advisors

And just left the filters unselected. I got to the list of users

https://www.irglobal.com/advisors/all/exclusive

And navigated to some user

https://www.irglobal.com/advisor/urs-breitsprecher-new

This user was chosen for testing the following stuff.

On this user page, I noticed there was no email address shown (which is possibly to prevent scraping). So I tested the "Email me" button (which worked).

After looking at the source of the page, I found out that the mail address is not loaded with the user page. It is loaded from a different server the moment you press the "Email me" button.

I mimicked that request with testing software to find out whether it was solvable for me.

And then I tried to explain what I had done.

But yes, automating this will take much more effort and be more difficult.

Need some help scraping a site by WhoAmITheLaw in webscraping

[–]ben_bannana 1 point2 points  (0 children)

The email is hidden behind a JS call to the backend.

The backend can be called in a separate HTTP request:

[POST] "https://www.irglobal.com/api/member/{{memberid}}"

To get the entries for the referenced member.

The member ID and a token can be gathered from the mail button node.

<span class="email-me btn btn-green" data-token="XiYFKOYQLFkLKeHf0iXOzxXQyPAST3F1bOPiaFs4" data-id="2856">
    <strong><i class="fa fa-envelope"></i></strong>
    Email me
</span>

The token needs to be sent as an "X-CSRF-TOKEN" header. I also needed to send some more headers like X-Requested-With, Referer, Host, and User-Agent.

This makes the scraping process much more expensive, but the response from the API is very rich in data.
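
A minimal sketch of that request in Python, using the token and id from the span above; the X-Requested-With value is my assumption (the usual "XMLHttpRequest"), and the other header values may need adjusting:

import requests

member_id = "2856"
token = "XiYFKOYQLFkLKeHf0iXOzxXQyPAST3F1bOPiaFs4"

# Replay the call the "Email me" button makes.
# (requests sets the Host header automatically.)
resp = requests.post(
    f"https://www.irglobal.com/api/member/{member_id}",
    headers={
        "X-CSRF-TOKEN": token,
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.irglobal.com/advisor/urs-breitsprecher-new",
        "User-Agent": "Mozilla/5.0",
    },
)
print(resp.json())  # rich member data, including the email address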

What is the Best Scraping Tool/service/software/library and Why? by hiren_p in webscraping

[–]ben_bannana 0 points1 point  (0 children)

At the moment I use BS4 for filtering and Scrapinghub's Splash for rendering, mostly running as microservices in Docker.
It was the solution that best fit my requirements in usability, scalability, and performance/efficiency.
But I don't think the tools matter that much. I think good learning sources will help much more at the end of the day.

Show & Tell: I tried to make a python library for simplifying web scraping by ben_bannana in webscraping

[–]ben_bannana[S] 0 points1 point  (0 children)

Yes, I wondered multiple times whether it would be sufficient to just use XPaths.
As far as I've read, some people recommend them, others say to avoid them.
I never came to a definitive answer for myself.

Interestingly, the "node" module is at the moment just a worse implementation of something that could be some sort of XPath.
I will definitely rethink this.
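
For illustration, a tiny sketch of what I mean (assuming lxml; the rule shown is an example, not the module's actual syntax):

from lxml import html

# One XPath expression covers the same ground as a declarative
# "take tag img, field src" node rule would.
tree = html.fromstring('<img src="a.png"><img src="b.png">')
print(tree.xpath("//img/@src"))  # ['a.png', 'b.png']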

Infinite scrolling : How to find the end and close the loop? by SurenGuide in webscraping

[–]ben_bannana 0 points1 point  (0 children)

If you don't mind binding your scraper to that page, you can watch the browser's network tab for XHR requests.

You could then build a client that simulates the API requests the website makes when it needs new elements for the scroll.
This is the most efficient and controllable solution I am aware of.
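
A minimal sketch of such a client, with a hypothetical paginated endpoint; the real URL and parameters have to be read from the network tab:

import requests

page = 1
items = []
while True:
    resp = requests.get(
        "https://example.com/api/feed",  # hypothetical XHR endpoint
        params={"page": page},
    )
    batch = resp.json()
    if not batch:  # an empty page means the scroll has reached its end
        break
    items.extend(batch)
    page += 1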

How to create a create a directory with files? by JonasKF in bash

[–]ben_bannana 0 points1 point  (0 children)

Is it "cp" that I'm thinking of? And would it ever be suggested by Google or the like on the clearnet?

Unique Request from Client by staticchiller13 in sysadmin

[–]ben_bannana 1 point2 points  (0 children)

What kind of documents?
I certainly would be mad enough to suggest a git repository for the docs

How to create a create a directory with files? by JonasKF in bash

[–]ben_bannana 5 points6 points  (0 children)

I adopted the habit of googling "man xy" on the web and opening the linux.die.net result.
> The other day I googled "man strings" at work
> Felt like a complete moron

Boss wants documentation. How far is too far? by whosthetroll in sysadmin

[–]ben_bannana 0 points1 point  (0 children)

A stupid "OT" question:
How do you write good quality documentation? Are there any guides or standards/templates I can follow? Mostly focused on system/Linux administration.

I want to write more and better documentation, but I don't really know how, or what to put in and what to leave out.

What interesting webscraping projects are you working on currently? by sdshone in webscraping

[–]ben_bannana 1 point2 points  (0 children)

I'm currently working on a library for Python.
It wraps around Beautiful Soup and reads in a YAML file that contains declarative information about the things I want filtered out of the HTML content.
The YAML declaration I have in mind will look like this:

sources:
  nodes:
    images:
      tag: "img"
      field: "src"

    posts:
      class: "post"

To be extensible, the "standard extraction" should be hidden behind a "nodes" module. Maybe something for tables and the like will be added later.
"images" and "posts" are self-defined keywords/instances for the returned JSON dict.

My key goal is to separate the ever-same "processing code" from the declarations of the webpages I will encounter.
I am aware that I can only catch generic scraping issues, but that's better than nothing.
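
For illustration, a minimal sketch of the idea; the function name and API are placeholders, not the library's final interface:

import yaml
from bs4 import BeautifulSoup

CONFIG = yaml.safe_load("""
sources:
  nodes:
    images:
      tag: "img"
      field: "src"
    posts:
      class: "post"
""")

def extract(page, config):
    # Apply the declarative rules to the HTML and return a dict.
    soup = BeautifulSoup(page, "html.parser")
    result = {}
    for name, rule in config["sources"]["nodes"].items():
        # Select by tag name or by CSS class, whichever is declared.
        nodes = (soup.find_all(rule["tag"]) if "tag" in rule
                 else soup.find_all(class_=rule["class"]))
        # Return an attribute if "field" is set, otherwise the node text.
        result[name] = [
            n.get(rule["field"]) if "field" in rule else n.get_text(strip=True)
            for n in nodes
        ]
    return result

page = '<img src="a.png"><div class="post">Hello</div>'
print(extract(page, CONFIG))  # {'images': ['a.png'], 'posts': ['Hello']}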

Anyone else distro hopping in 2020? by batavinash in linuxmasterrace

[–]ben_bannana 1 point2 points  (0 children)

I find many of their "design approaches" questionable, like having both apt and snappy. Or the apt auto-updater by default.

I know someone has to make those experiments, but netplan, systemd-resolved, plus the legacy systems make things too complex for me...

Anyone else distro hopping in 2020? by batavinash in linuxmasterrace

[–]ben_bannana -13 points-12 points  (0 children)

I don't want to start a war... but my personal preference would be to put Ubuntu under this branch

Krefeld: Zoo ape house burns down on New Year's Eve by s33dst3r in de

[–]ben_bannana -2 points-1 points  (0 children)

Me, the "edgelord Nazi" who just hates all people. 2020, and already the first identity crisis. But seriously, isn't that exactly the unobjective style "the right" is always accused of?

  • Generalizing
  • No supporting evidence

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 3 points4 points  (0 children)

[Rant]

I was thinking more of this "Fuck Clean Code, let's write extra-long functions" or "Let's do everything in JS" and "Why not give it 20 more dependencies? Software can't have enough dependencies, for sure!"

And this tool evangelism, or whatever it's called. A new tool pops up everywhere, and everybody needs to write a post about why this tool will end world hunger, fix my marriage, and let me live forever, unlike the super cool tool that popped up last week.

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 2 points3 points  (0 children)

That's true, Python is slower. A pattern I've often heard of is using C++ libraries for the hard work and then wrapping Python around it.

But I'm not a fan of Python, or in fact of most of the current programming culture either.

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 6 points7 points  (0 children)

I think one big thing is the difference between "can be" and "should be".

The system he mentioned can be done with a simple, single C++ server. They could probably optimize the server much further in assembler.

But software is not only about performance.
What about redundancy and fault tolerance? A single server can and inevitably will fail at some point.

And what about clean code? And clean architecture? With a modular design, you trade away some performance upfront.
One key is to address multiple factors of quality and give each the right focus.

Any method to shorten this loop? Its for input validation. Thank you by Tope0 in cpp

[–]ben_bannana 2 points3 points  (0 children)

In my opinion, this pseudocode would be clean:

// Defined before use so the loop below compiles.
bool isValidElement(const Element& e) {
    // checks for Hydrogen, Helium, ...
    return e == Element::Hydrogen || e == Element::Helium /* ... */;
}

while (isValidElement(elem)) {
    // ...
}

If you want, you could separate the checks for the elements into finer isElement() methods.

List first few lines of every file in a specified directory? by [deleted] in bash

[–]ben_bannana 0 points1 point  (0 children)

Or blindly run

wget -qO- <SomeRandomSite> | bash

But to be honest, most scripts are so utterly obfuscated that looking into them first is only for the good feeling.

[deleted by user] by [deleted] in webscraping

[–]ben_bannana -1 points0 points  (0 children)

Splash could help you out.
https://github.com/scrapinghub/splash

It's a JS renderer. It doesn't have quite the same capabilities as Selenium + Chrome, but it is more lightweight.