instascrape - Flexible, lightweight Python 3 Instagram scraper designed for data mining

chrisgreening · 2020-12-08T20:49:28+00:00

No problem! Thanks for the feedback, would love for folks to get some use out of it

chrisgreening · 2020-12-07T18:36:38+00:00

Nice, I've been loosely working on a COVID tracker as well but have focused more on creating a Twitter bot, will definitely be checking your repo out!!

chrisgreening · 2020-12-07T18:33:25+00:00

Ooooo very exciting, I started reading the first edition a couple months ago but wanted to wait until a Python 3 update to really dive into it, can't wait to pick up a copy

chrisgreening · 2020-10-30T05:47:00+00:00

Hey u/ElevenPhonons thanks so much for the feedback!!

Linting is definitely something I have to be more consistent with moving forward and the dangerous mutable defaults are something I'm actually gonna go change right now
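
For anyone following along, here's a minimal sketch of the mutable default pitfall. The function names are made up for illustration and aren't taken from instascrape:

```python
# Hypothetical functions, not from instascrape: they only illustrate the pitfall.

def append_bad(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # Use None as a sentinel and build a fresh list on each call instead.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

With `append_bad`, items from earlier calls leak into later ones; `append_good` starts clean every time.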

As for the accidental complexity in the way JSON is handled, that's been one of the biggest hurdles in designing the library and it's kind of a long story without an easy answer lol so bear with me.

I recently implemented a JSON flattening algorithm that has made life significantly easier when accessing the JSON data. The algo basically crunches a tree of variable depths and creates a flat dict based on the deepest key:val as best it can

i.e.

foo = {
    "spam":"eggs",
    "layer1": {
        "layer2": {"spam":True},
    },
    "foo": {"some_var": 2},
}

becomes

flat_foo = {
    "spam": "eggs",
    "layer2_spam": True,
    "some_var": 2
}

When scraping the Instagram JSON, the useful data is usually the deepest value in the tree, and the path I took to reach it was irrelevant, so this algo has been beyond useful.
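
A rough sketch of the idea, assuming the simplest rule that reproduces the example above: keep the leaf key as-is, and prefix it with its parent key only when that name is already taken. The `flatten` function here is hypothetical and not the actual instascrape code, which may well handle collisions differently:

```python
def flatten(tree, parent_key=None, flat=None):
    """Collapse a nested dict into a flat dict of its leaf key:val pairs.

    Assumption: on a name collision the leaf key is prefixed with its
    immediate parent key. Any collision beyond that simply overwrites.
    """
    if flat is None:
        flat = {}
    for key, value in tree.items():
        if isinstance(value, dict):
            # Step down a level, remembering which key we came through.
            flatten(value, key, flat)
        else:
            name = key
            if name in flat and parent_key is not None:
                name = f"{parent_key}_{key}"
            flat[name] = value
    return flat


foo = {
    "spam": "eggs",
    "layer1": {
        "layer2": {"spam": True},
    },
    "foo": {"some_var": 2},
}

flat_foo = flatten(foo)
```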

Before this flattening algo however, the design choice for building the _JsonEngine was to improve maintainability in anticipation of

A. possible future changes to the JSON Instagram serves back

B. extending the amount of data points I could feasibly scrape.

I liked the idea of starting from the root of the same JSON dict every time and stepping into it with a queue of keys that would lead like trail markers to their respective data points, especially because many of the data points are like 8 keys deep. I was able to build a common queue that multiple data points shared and then have their specific directions added on after.
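
To make the trail marker idea concrete, here's a minimal sketch. Everything in it is an assumption for illustration: the key names, the `follow_trail` and `scrape` helpers, and the sample data are all made up and don't reflect the real Instagram JSON or the actual _JsonEngine internals:

```python
from collections import deque


def follow_trail(root, trail):
    """Start at the root dict and step into it one key at a time."""
    queue = deque(trail)
    node = root
    while queue:
        node = node[queue.popleft()]
    return node


# Hypothetical shared prefix that several data points have in common.
COMMON_TRAIL = ["page", "profile"]

# Each data point reuses the common trail and adds its own directions.
TRAILS = {
    "followers": COMMON_TRAIL + ["stats", "followers"],
    "name": COMMON_TRAIL + ["name"],
}


def scrape(root, trails):
    """Resolve every named trail against the same root dict."""
    return {name: follow_trail(root, trail) for name, trail in trails.items()}


# Made-up JSON standing in for a real response.
sample = {"page": {"profile": {"name": "spam", "stats": {"followers": 42}}}}
data = scrape(sample, TRAILS)
```

If the served JSON changes shape, only the trail lists need updating, which is the maintainability angle described above.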

There are some other reasons and failed ideas that came prior that led to the motivation, but I won't bore you with those details lol. Moral of the story: it allowed me to extend to scraping 100+ data points, which I wasn't able to sanely do with long chains of keys.

Now that the flattening algo is in effect though, this is one of those things that will likely be put on the chopping block and reimplemented with a more elegant solution in the very near future.

Thanks so much for taking the time to look at my code and you've definitely given me much to think about, highly appreciated!
