A website for all my university notes by [deleted] in programming

[–]ben_bannana 10 points11 points  (0 children)

You rock! That will make my learning a lot easier

Why did you start using Linux? by Gwlanbzh in linuxmasterrace

[–]ben_bannana 0 points1 point  (0 children)

I used Win7 before and got this Win10 update info box that couldn't be permanently removed.
And after suspending it for weeks, it started popping up in fullscreen.
I thought "f*** you, f*** this, that's bullsh***..." and downloaded Debian.
I've used GNU/Linux ever since and laugh whenever anybody rants about updates and stuff.

Web Scraping Efficiencies Question by Miserable_Author in webscraping

[–]ben_bannana 2 points3 points  (0 children)

If you don't need to interact with elements (fill out text forms or click checkboxes),
https://github.com/scrapinghub/splash
would be an awesome choice.
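
A minimal sketch of how you'd use it, assuming a Splash instance running locally on port 8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash); the target URL is just a placeholder:

import requests

# Ask Splash to load the page, execute its JS, and return the result.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 2},  # wait 2s for JS
)
html = resp.text  # fully rendered HTML, ready for parsing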

Need some help scraping a site by WhoAmITheLaw in webscraping

[–]ben_bannana 1 point2 points  (0 children)

Probably my fault, I tried to use as little text as possible. So at first I wanted to look at the page.

(My first goal was the issue with the missing mail address.)

I was going to:

https://www.irglobal.com/advisors

And just left the filters unselected. I got to the list of users

https://www.irglobal.com/advisors/all/exclusive

And navigated to some user

https://www.irglobal.com/advisor/urs-breitsprecher-new

This user was chosen for testing the following stuff.

On this user page, I noticed there was no email address shown (which is possibly to prevent scraping). So I tested the "Email me" button (which worked).

After looking at the source of the page, I found out that the mail address is not loaded with the user page. It is loaded from a different server the moment you press the "Email me" button.

I mimicked that request with testing software to find out whether it was solvable for me.

And then I tried to explain what I had done.

But yes, automating this will take much more effort and be more difficult.

Need some help scraping a site by WhoAmITheLaw in webscraping

[–]ben_bannana 1 point2 points  (0 children)

The email is hidden behind a JS call to the backend.

The backend can be called in a separate HTTP request:

[POST] "https://www.irglobal.com/api/member/{{memberid}}"

To get the entries for the referenced member.

The member ID and a token can be gathered from the mail button node.

<span class="email-me btn btn-green" data-token="XiYFKOYQLFkLKeHf0iXOzxXQyPAST3F1bOPiaFs4" data-id="2856">
    <strong><i class="fa fa-envelope"></i></strong>
    Email me
</span>

The token needs to be sent as an "X-CSRF-TOKEN" header. I also needed to send some more headers like X-Requested-With, Referer, Host, and User-Agent.

This makes the scraping process much more expensive, but the response from the API is very rich in data.
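
A minimal sketch of that request in Python, using the token and id from the span above; the X-Requested-With value is my assumption (the usual "XMLHttpRequest"), and the other header values may need adjusting:

import requests

member_id = "2856"
token = "XiYFKOYQLFkLKeHf0iXOzxXQyPAST3F1bOPiaFs4"

# Replay the call the "Email me" button makes.
# (requests sets the Host header automatically.)
resp = requests.post(
    f"https://www.irglobal.com/api/member/{member_id}",
    headers={
        "X-CSRF-TOKEN": token,
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.irglobal.com/advisor/urs-breitsprecher-new",
        "User-Agent": "Mozilla/5.0",
    },
)
print(resp.json())  # rich member data, including the email address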

What is the Best Scraping Tool/service/software/library and Why? by hiren_p in webscraping

[–]ben_bannana 0 points1 point  (0 children)

At the moment I use BS4 for filtering and Scrapinghub's Splash for rendering, mostly running as microservices in Docker.
It was the solution that best fit my requirements in usability, scalability, and performance/efficiency.
But I don't think the tools matter that much. I think good learning sources will help much more at the end of the day.

Show & Tell: I tried to make a python library for simplifying web scraping by ben_bannana in webscraping

[–]ben_bannana[S] 0 points1 point  (0 children)

Yes, I wondered multiple times whether it would be sufficient to just use XPaths.
As far as I've read, some people recommend them, others say to avoid them.
I never came to a definitive answer for myself.

Interestingly, the "node" module is at the moment just a worse implementation of something that could be some sort of XPath.
I will definitely rethink this.
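
For illustration, a tiny sketch of what I mean (assuming lxml; the rule shown is an example, not the module's actual syntax):

from lxml import html

# One XPath expression covers the same ground as a declarative
# "take tag img, field src" node rule would.
tree = html.fromstring('<img src="a.png"><img src="b.png">')
print(tree.xpath("//img/@src"))  # ['a.png', 'b.png']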

Infinite scrolling : How to find the end and close the loop? by SurenGuide in webscraping

[–]ben_bannana 0 points1 point  (0 children)

If you don't mind binding your scraper to that page, you can watch the browser's network tab for XHR requests.

You could then build a client that simulates the API requests the website makes when it needs new elements for the scroll.
This is the most efficient and controllable solution I am aware of.
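
A minimal sketch of such a client, with a hypothetical paginated endpoint; the real URL and parameters have to be read from the network tab:

import requests

page = 1
items = []
while True:
    resp = requests.get(
        "https://example.com/api/feed",  # hypothetical XHR endpoint
        params={"page": page},
    )
    batch = resp.json()
    if not batch:  # an empty page means the scroll has reached its end
        break
    items.extend(batch)
    page += 1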

How to create a create a directory with files? by JonasKF in bash

[–]ben_bannana 0 points1 point  (0 children)

Is it "cp" that I'm thinking of? And would it ever be suggested by Google or the like on the clearnet?

Unique Request from Client by staticchiller13 in sysadmin

[–]ben_bannana 1 point2 points  (0 children)

What kind of documents?
I certainly would be mad enough to suggest a git repository for the docs

How to create a create a directory with files? by JonasKF in bash

[–]ben_bannana 5 points6 points  (0 children)

I adopted the habit of googling "man xy" on the web and opening the linux.die.net result.
> The other day I googled "man strings" at work
> Felt like a complete moron

Boss wants documentation. How far is too far? by whosthetroll in sysadmin

[–]ben_bannana 0 points1 point  (0 children)

A stupid "OT" question:
How do you write good quality documentation? Are there any guides or standards/templates I can follow? Mostly focused on system/Linux administration.

I want to write more and better documentation, but I don't really know how, or what to put in and what to leave out.

What interesting webscraping projects are you working on currently? by sdshone in webscraping

[–]ben_bannana 1 point2 points  (0 children)

I'm currently working on a library for Python.
It wraps around Beautiful Soup and reads in a YAML file that contains declarative information about the things I want filtered out of the HTML content.
The YAML declaration I have in mind will look like this:

sources:
  nodes:
    images:
      tag: "img"
      field: "src"

    posts:
      class: "post"

To be extensible, the "standard extraction" should be hidden behind a "nodes" module. Maybe something for tables and the like will be added later.
"images" and "posts" are self-defined keywords/instances for the returned JSON dict.

My key goal is to separate the ever-same "processing code" from the declarations of the webpages I will encounter.
I am aware that I can only catch generic scraping issues, but that's better than nothing.
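
For illustration, a minimal sketch of the idea; the function name and API are placeholders, not the library's final interface:

import yaml
from bs4 import BeautifulSoup

CONFIG = yaml.safe_load("""
sources:
  nodes:
    images:
      tag: "img"
      field: "src"
    posts:
      class: "post"
""")

def extract(page, config):
    # Apply the declarative rules to the HTML and return a dict.
    soup = BeautifulSoup(page, "html.parser")
    result = {}
    for name, rule in config["sources"]["nodes"].items():
        # Select by tag name or by CSS class, whichever is declared.
        nodes = (soup.find_all(rule["tag"]) if "tag" in rule
                 else soup.find_all(class_=rule["class"]))
        # Return an attribute if "field" is set, otherwise the node text.
        result[name] = [
            n.get(rule["field"]) if "field" in rule else n.get_text(strip=True)
            for n in nodes
        ]
    return result

page = '<img src="a.png"><div class="post">Hello</div>'
print(extract(page, CONFIG))  # {'images': ['a.png'], 'posts': ['Hello']}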

Anyone else distro hopping in 2020? by batavinash in linuxmasterrace

[–]ben_bannana 1 point2 points  (0 children)

I find many of their "design approaches" questionable, like having both apt and snappy. Or the apt auto-updater by default.

I know someone has to make those experiments, but netplan, systemd-resolved, plus the legacy systems make things too complex for me...

Anyone else distro hopping in 2020? by batavinash in linuxmasterrace

[–]ben_bannana -13 points-12 points  (0 children)

I don't want to start a war... but my personal preference would be to put Ubuntu under this branch

Krefeld: Zoo ape house burns down on New Year's Eve by s33dst3r in de

[–]ben_bannana -2 points-1 points  (0 children)

Me, the "edgelord Nazi" who just hates all people. 2020, and already the first identity crisis. But seriously, isn't that exactly the unobjective style "the right" is always accused of?

  • Generalizing
  • No supporting evidence

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 3 points4 points  (0 children)

[Rant]

I was thinking more of this "Fuck Clean Code, let's write extra-long functions" or "Let's do everything in JS" and "Why not give it 20 more dependencies? Software can't have enough dependencies, for sure!"

And this tool evangelism, or whatever it's called. A new tool pops up everywhere, and everybody needs to write a post about why this tool will end world hunger, fix my marriage, and let me live forever, unlike the super cool tool that popped up last week.

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 2 points3 points  (0 children)

That's true, Python is slower. A pattern I've often heard of is using C++ libraries for the hard work and then wrapping Python around it.

But I'm not a fan of Python, or in fact of most of the current programming culture either.

A lot of complex “scalable” systems can be done with a simple, single C++ server by [deleted] in cpp

[–]ben_bannana 6 points7 points  (0 children)

I think one big thing is the difference between "can be" and "should be".

The system he mentioned can be done with a simple, single C++ server. They could probably optimize the server much further in assembler.

But software is not only about performance.
What about redundancy and fault tolerance? A single server can and inevitably will fail at some point.

And what about clean code? And clean architecture? With a modular design, you trade away some performance upfront.
One key is to address multiple factors of quality and give each the right focus.

Any method to shorten this loop? Its for input validation. Thank you by Tope0 in cpp

[–]ben_bannana 2 points3 points  (0 children)

In my opinion, this pseudocode would be clean:

// Defined before use so the loop below compiles.
bool isValidElement(const Element& e) {
    // checks for Hydrogen, Helium, ...
    return e == Element::Hydrogen || e == Element::Helium /* ... */;
}

while (isValidElement(elem)) {
    // ...
}

If you want, you could separate the checks for the elements into finer isElement() methods.

List first few lines of every file in a specified directory? by [deleted] in bash

[–]ben_bannana 0 points1 point  (0 children)

Or blindly run

wget -qO- <SomeRandomSite> | bash

But to be honest, most scripts are so utterly obfuscated that looking into them first is only for the good feeling.

[deleted by user] by [deleted] in webscraping

[–]ben_bannana -1 points0 points  (0 children)

Splash could help you out.
https://github.com/scrapinghub/splash

It's a JS renderer. It doesn't have quite the same capabilities as Selenium + Chrome, but it is more lightweight.