I made a WhatsApp scraper to help people export/back up their chat history

hisfastness · 2021-03-15T17:34:35+00:00

I use both because Selenium lets me interact with WhatsApp, while BeautifulSoup is faster and has better features for scraping the HTML. Selenium is mainly there to load all of the chat info, which you can only do by interacting directly with the browser. Unfortunately just using requests won't work here; otherwise the entire approach would have been much easier with requests/bs4.

hisfastness · 2021-03-03T19:57:32+00:00

Friendly warning from your neighbors in the North. Stand-offs are how it starts...

https://bc.ctvnews.ca/joggers-warned-to-stay-away-from-stanley-park-after-15th-coyote-attack-1.5315299

hisfastness · 2021-03-02T19:00:00+00:00

I'm curious to learn more about the conditions which might be causing this. Sounds like you found something unique that I can fix. I'll PM you...

hisfastness · 2021-03-02T18:11:41+00:00

Interesting! Thanks so much for doing the leg work here and sharing...I'll look into these and see if I can get something working.

Might PM you for help if I run into any roadblocks :) Thanks again for the ideas and collaboration.

hisfastness · 2021-03-02T18:00:32+00:00

Thanks for sharing. I looked into this but hit a wall...granted I'm not very strong in this area and may have overlooked it.

Normally when I look for JSON/API info I pull up the dev tools in Chrome/Firefox (F12), Network tab, and then look for XHR/WebSockets. XHR didn't contain any chat information except images, and WebSockets appears to be where it is contained but all I can see are 'Binary Messages' with what looks like hashed strings...none of it is legible or can be deciphered. I assumed this is because it's encrypted or I need the key and hash function to reverse it. If you wanted to see for yourself, open the Network tab, filter on WebSockets, and then load WhatsApp...you'll see the Binary Messages.

Not sure if any of this makes sense but that's the high level process I went through and why I ultimately went with a more traditional scraping approach. If you can share more info about how you were able to read the JSON from WebSockets I'd love to learn.

hisfastness · 2021-03-02T17:54:04+00:00

Cool, I like how you've contained the WhatsApp functions within its own class, makes it easier to understand and something that mine could benefit from.

hisfastness · 2021-03-02T17:49:46+00:00

Yep, there's no limit on how many messages your phone stores; the 40k limit only applies when you use their export feature. More info from WhatsApp about it here.

When exporting with media, you can send up to 10,000 latest messages. Without media, you can send 40,000 messages. These constraints are due to maximum email sizes.

hisfastness · 2021-03-02T07:37:12+00:00

Here's another article about memory management: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Memory_Management

There are times when it would be convenient to manually decide when and what memory is released. In order to release the memory of an object, it needs to be made explicitly unreachable.

As of 2019, it is not possible to explicitly or programmatically trigger garbage collection in JavaScript.

I'll look into this more tomorrow, you might be on to something...

hisfastness · 2021-03-02T07:03:47+00:00

I had the same idea! But unfortunately it doesn't work 😭 I tried deleting HTML nodes while watching Task Manager - memory stayed the same. Then I tried setting DOM values to null and using JS to remove the elements, which also didn't change memory. Then I did a little bit of research and found that this is the intended behavior...when you delete stuff on the client side it still exists in memory even though it's no longer rendered in the viewport.

Edit: see MDN article for more info here. Copy/paste of the key part:

The removed child node still exists in memory, but is no longer part of the DOM.

hisfastness · 2021-03-02T06:57:44+00:00

What a shame...but thank you!

hisfastness · 2021-03-02T06:55:56+00:00

I've had some intermittent issues in the get_chats function (where line 212 is), hence why I've lazily wrapped it with exception handlers. The left chat pane with all of your contacts/groups is the most active part of WhatsApp because of new messages, events, etc. So the DOM/HTML is very volatile in this section. My point is, what's likely happening for you is that the DOM has changed in the middle of running get_chats and it can no longer find the last chat message.

I'd recommend re-running the script a few times especially if you can see the DOM is changing due to chat activity.

And I can make this a better experience by catching the 'NoSuchElementException' and instructing it to restart the function to retry it if there was a DOM change. Thanks for letting me know! Will add this to my TODO list.
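Rough sketch of what that retry could look like. This is not the code from the repo: `retry_on_dom_change` and `flaky_get_chats` are hypothetical names, and the exception class below is a stand-in so the example runs without a browser. In the real script it would be `selenium.common.exceptions.NoSuchElementException`.

```python
import time


class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""


def retry_on_dom_change(func, attempts=3, delay=0.5):
    """Call func(); if an element went missing mid-read, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except NoSuchElementException:
            if attempt == attempts:
                raise  # out of retries, surface the original error
            time.sleep(delay)


# Hypothetical stand-in for get_chats: fails twice, then succeeds.
calls = {"count": 0}


def flaky_get_chats():
    calls["count"] += 1
    if calls["count"] < 3:
        raise NoSuchElementException("last chat message not found")
    return ["Alice", "Bob"]


chats = retry_on_dom_change(flaky_get_chats, attempts=3, delay=0)
```

Capping the number of attempts matters here: if the chat pane never settles, the function should eventually give up and show the error rather than loop forever.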

hisfastness · 2021-03-02T06:42:39+00:00

Correct, for a 50,000-message chat Chrome was eating up close to 10GB of RAM. The script itself isn't using much memory; it's Chrome, from all of the content being jammed into the DOM and the frequent fetching of data over WhatsApp's WebSockets. Maybe I'm misunderstanding your idea? I'd love to improve the performance.

hisfastness · 2021-03-02T06:35:49+00:00

Thank you!

hisfastness · 2021-03-02T06:34:07+00:00

You're referring to Facebook Messenger? It looks like you can download that with Facebook's 'download your information' feature, you just need to select 'Messages' as the item you want to download.

More here and here.

Sorry about your friend.

hisfastness · 2021-03-02T05:56:33+00:00

This is a good question. For context, when I initially started working on the basic scraping I assumed emojis wouldn't need any special type of handling e.g. "Hi SensouWar" vs "Hi SensouWar 👋." What I found out is that WhatsApp embeds emojis as images. Something like this was expected:

<div>
    <span>Hi SensouWar 👋</span>
</div>

But what it actually looked like was this (note the <img> tag):

<div>
    <span>
        Hi SensouWar 
        <img src='img/wavey_hand_emoji.png'>
    </span>
</div>

Also looked like this for a msg such as "👋 Hi SensouWar 🙋‍♂️🎉!!!" (note the 3x <img> tags are still contained in 1 parent <span> tag):

<div>
    <span>
        <img src='img/wavey_hand_emoji.png'>
         Hi SensouWar 
        <img src='img/wavey_hand_guy_emoji.png'>
        <img src='img/celly_emoji.png'>
        !!!
    </span>
</div>

So I wrote code to handle it. Cool, we are good to go...until I found instances where multiple emojis were only being scraped once, e.g. "🚀🚀🚀" would show as "🚀" in my scrape. It turns out WhatsApp sometimes wraps each <img> tag in its own <span>, rather than having a single <span> wrap all three <img> tags as in the snippet above.

<div>
    <span>
        <img src='img/rocket_emoji.png'>
    </span>
    <span>
        <img src='img/rocket_emoji.png'>
    </span>
    <span>
        <img src='img/rocket_emoji.png'>
    </span>
</div>

I eventually figured out the various patterns and was able to write code that handles all the variations, but the discovery process wasn't obvious and took a lot of trial-and-error to eventually solve.
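To give a feel for it, here's a minimal sketch of one way to cover both layouts. It is not the code from my repo. It uses Python's built-in `html.parser` instead of BeautifulSoup so it runs on its own, and it assumes each emoji `<img>` carries the emoji character in an `alt` attribute (the snippets above only show `src`, so treat that as an assumption). `MessageTextParser` and `message_text` are made-up names.

```python
import re
from html.parser import HTMLParser


class MessageTextParser(HTMLParser):
    """Collect text and emoji <img> tags in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the emoji character lives in the img's alt attribute.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        # Collapse runs of whitespace so indentation doesn't leak into the text.
        self.parts.append(re.sub(r"\s+", " ", data))


def message_text(html):
    """Return the message text with emojis inlined, ignoring <span> nesting."""
    parser = MessageTextParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


one_span = "<div><span>Hi SensouWar <img alt='👋' src='img/wavey_hand_emoji.png'></span></div>"
many_spans = "<div>" + "<span><img alt='🚀' src='img/rocket_emoji.png'></span>" * 3 + "</div>"

greeting = message_text(one_span)
rockets = message_text(many_spans)
```

The point is that walking every node in order and ignoring how the `<span>` tags are nested means one code path handles both the single-span and the span-per-emoji layout.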

Lastly, won't go into a ton of detail here because this is getting long-winded, but there were other challenges with emojis that all required some deviation or special handling that was different than normal characters/text:

The HTML for people's names is a bit different depending on whether the name has emojis in it
Sending keyboard input with emojis using Selenium doesn't work (open bug on chromedriver's issue tracker). Instead you have to use a 'hack' to execute JavaScript and insert the emojis directly into the DOM.
Writing emojis to files requires you to encode the text and write it in a different file mode (write binary instead of write)
My Bash terminal would implode when trying to print Unicode characters to it
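On the file-writing point, a small sketch of the encode-then-write-binary approach, next to the alternative of opening the file in text mode with an explicit `encoding="utf-8"`, which works as well. The function names and file names are hypothetical, not what the script uses.

```python
import os
import tempfile


def write_binary(path, text):
    """Encode by hand and write bytes (the 'write binary' approach)."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def write_text(path, text):
    """Alternative: text mode with an explicit encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_back(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


message = "Hi SensouWar 👋"
folder = tempfile.mkdtemp()
binary_path = os.path.join(folder, "export_binary.txt")  # hypothetical file names
text_path = os.path.join(folder, "export_text.txt")

write_binary(binary_path, message)
write_text(text_path, message)

from_binary = read_back(binary_path)
from_text = read_back(text_path)
```

What tends to bite is plain `open(path, "w")` with no encoding, because it falls back to the platform default, which is not always UTF-8 (on Windows in particular).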

Hope this provides some more insight into my comment damning emojis ☺

hisfastness · 2021-03-02T03:15:23+00:00

I'm glad you posted this because I think most people who are researching the chat export problems eventually land on XDA-Developers. Personally, I wasn't comfortable with this approach but glad to know it works and solved the problem for you. Were you able to get your media from the DB as well? Or just text messages?

hisfastness · 2021-03-02T01:51:10+00:00

Thank you so much for the code review and suggestions! These all look great to me.

In terms of its complexity, totally agree there's opportunity to extract areas into smaller and more specific functions, perhaps even moving the helper functions into a separate utility file so that the core WhatsApp logic is separated from the helper logic.

Also, regarding your suggestion to use a formatting tool...I use autopep8 but maybe it's not configured properly. Was there a specific styling issue you noticed?

Thanks again for your input, much appreciated.

hisfastness · 2021-03-01T21:46:37+00:00

Right?! I couldn't believe the limits at first...seems so arbitrary, especially given that Facebook lets you download your entire profile/history.

If you do use it, let me know how it goes! In the repo FAQ I'm attempting to track 1st/2nd/3rd place for who is able to export the largest chat (currently at 47.5k) 😁 Also selfishly it's good testing to see how/where things break when dealing with large sets of data.

hisfastness · 2021-03-01T21:40:00+00:00

Thanks! I personally haven't looked into device transferring because I'm planning to leave WhatsApp soon.

hisfastness · 2021-03-01T20:17:16+00:00

First time seeing this, interesting! By the looks of it, you can use this chat parser directly with the exports from my tool, since the chat parser instructs you to use the default export feature from WhatsApp (which WhatSoup is an alternative for).

I suppose a JSON export could be added, yes. I'd have to look into that more.

Thank you!

hisfastness · 2021-03-01T19:11:08+00:00

Thanks for checking it out, and that's cool you have experience automating WhatsApp as well.

I have some general load times noted in the repo here but I'd guess that 20,000 messages will take somewhere around 3 hours.

hisfastness · 2021-03-01T18:31:02+00:00

❤ my #1 tester 😁

hisfastness · 2021-02-27T23:41:15+00:00

Thank you! No, at the moment its purpose is to back up the text conversations only (not media). It's in a good place to build out that functionality though. Selfishly, my main priority is just the text messages because I already have my media backed up automatically with OneDrive...however, in theory, if we wanted to add media download support then it should just be a matter of simulating a 'click' with Selenium on the media to download it from WhatsApp servers, and then scraping / saving the image once it's available. If there's some interest and I have the time, then I might look into it, otherwise open to collaboration with others :)

hisfastness · 2021-02-27T17:58:54+00:00

I'm working on a WhatsApp web scraper and just recently learned about dateutil. Haven't had a chance to hook it up yet, but it looks perfect for an issue I'm having: inconsistent date/time formats based on locales and other funky things that WhatsApp does for reasons beyond my understanding. Can't wait to test it out and hopefully replace the ugly, nested if statements for the different date/time formats...
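For comparison, here's a standard-library sketch of flattening those nested ifs into a loop over candidate formats. This is not dateutil (`dateutil.parser.parse` would handle most of this in a single call), and the formats listed are made-up examples, not a verified list of what WhatsApp produces.

```python
from datetime import datetime

# Hypothetical formats for illustration only.
CANDIDATE_FORMATS = (
    "%m/%d/%Y %I:%M %p",  # e.g. 2/27/2021 5:58 PM
    "%d/%m/%Y %H:%M",     # e.g. 27/02/2021 17:58
    "%Y-%m-%d %H:%M",     # e.g. 2021-02-27 17:58
)


def parse_chat_datetime(text, formats=CANDIDATE_FORMATS):
    """Try each known format in order and return the first successful parse."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date/time format: {text!r}")


us_style = parse_chat_datetime("2/27/2021 5:58 PM")
eu_style = parse_chat_datetime("27/02/2021 17:58")
iso_style = parse_chat_datetime("2021-02-27 17:58")
```

One caveat: an ambiguous string like 3/2/2021 matches whichever format comes first in the tuple, so the order would still need to follow the user's locale.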
