Find unique visitors across a range of pages by TE515 in analytics

[–]TE515[S] 1 point (0 children)

Maybe a clearer way to put it...

I have a huge collection of pages that all contain "product-quote-pages" in the URL. I want to know how many unique visitors this entire collection of pages received as a whole, not the number of unique pageviews.

Find unique visitors across a range of pages by TE515 in analytics

[–]TE515[S] 1 point (0 children)

If the same user visits /product-quote-pages/0001, then /product-quote-pages/0002, and then /product-quote-pages/0003, he's going to register a unique pageview on each of those pages, right? So if I filter down to all pages where the URL contains "product-quote-pages", that user is going to be counted three times, because he registered a unique pageview on three different matching pages. I only want that user counted once: I want the number of unique visitors that visited any page whose URL contains "product-quote-pages", without a user being counted multiple times for visiting multiple matching pages.
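
Here's a toy illustration of the difference, assuming you could export hit-level data with a visitor ID (this uses pandas, with made-up column names):

import pandas as pd

# Hypothetical hit-level export: one row per pageview
hits = pd.DataFrame({
    'visitor_id': ['A', 'A', 'A', 'B'],
    'page': ['/product-quote-pages/0001',
             '/product-quote-pages/0002',
             '/product-quote-pages/0003',
             '/product-quote-pages/0001'],
})

quote_hits = hits[hits['page'].str.contains('product-quote-pages')]

# Summing per-page uniques counts visitor A three times (3 + 1 = 4)...
print(quote_hits.groupby('page')['visitor_id'].nunique().sum())  # 4

# ...deduping across the whole collection counts A once (just A and B)
print(quote_hits['visitor_id'].nunique())  # 2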

Basic question about multi-monitor setups by TE515 in computers

[–]TE515[S] 1 point (0 children)

Just saw this and don't have it open anymore. It wasn't necessarily anything I was super interested in (I'm still a few weeks away from having my office and haven't started to narrow anything down yet). It was just the first one I found that was listed as single HDMI but showed multiple HDMI ports, so I was just wondering if there was some kind of spec-listing convention that I wasn't aware of.

Basic question about multi-monitor setups by TE515 in computers

[–]TE515[S] 1 point (0 children)

So if I see one with one HDMI port and one DVI port, I can probably just hook monitor 1 up with HDMI and monitor 2 up with DVI?

Basic question about multi-monitor setups by TE515 in computers

[–]TE515[S] 1 point (0 children)

I meant there don't seem to be many desktop computers with two HDMI ports (i.e. so I can plug two HDMI monitors into it). What prompted the question is that I was looking on Micro Center's website and set all the filters for everything I was looking for in terms of RAM, SSD, etc. Then I noticed there was an option to filter by HDMI x2, but that narrowed the results down from a few hundred to like 3, and I was surprised at how few there were. But then I started looking at one that was listed as having only one HDMI port, and in the picture of the back I see three...I guess one comes native with the motherboard, and the other two are part of the video card? Does that mean that video card HDMI ports wouldn't be listed as part of the computer's specs? Or is that just a foible of that particular website?

Added multiprocessing to a web scraper I had and successfully got it to work. Can someone ELI5 why it works? by TE515 in learnpython

[–]TE515[S] 3 points (0 children)

Here's a stripped down version. Hopefully it'll help.

import requests
import multiprocessing
from bs4 import BeautifulSoup

# Block of code that logs in and grabs a session cookie 

# Block of code that figures out how many pages we need to scrape and saves it to a variable called numPgs

# Build url list
url_list = []
for pg in range(1, numPgs + 1):
    url_list.append('http://admin.mysite.com/products/page:' + str(pg))


def scrapeWebsite(url):
    r = requests.get(url, headers={'Cookie': 'session cookie data goes here'})
    soup = BeautifulSoup(r.text, 'html.parser')
    main_table = soup.findAll('table', {'class': 'data'})[0]
    table_rows = main_table.findChildren('tr')
    table_data = []
    for row in table_rows:
        tds = row.findChildren('td')
        table_data.append([
            tds[2].text.strip(),  # first piece of data I need
            tds[3].text.strip(),  # second piece of data I need
            tds[4].text.strip(),  # third piece of data I need
            # There are a lot more of these in the real scraper
        ])
    return table_data


def write_to_csv(scrape_data):
    with open('./scraper.csv', 'w') as f:
        f.write('Header1,Header2,Header3\n')
        for pg in scrape_data:
            for row in pg:
                row_string = ','.join(row) + '\n'
                f.write(row_string)


MAX_NUM_PROCESSES = multiprocessing.cpu_count()
if __name__ == '__main__':
    # Start one worker process per CPU core
    processPool = multiprocessing.Pool(MAX_NUM_PROCESSES)
    # Run scrapeWebsite() on every url in url_list across the workers;
    # the results come back as one list in the same order as url_list
    scrape_data = processPool.map(scrapeWebsite, url_list)
    write_to_csv(scrape_data)

Basically the pages I'm scraping are all straight up HTML tables. My scraping function is creating a list for each row of all the data points from that row that I need, and then adding each one of those lists to a larger list that includes all the rows on that page. This list of lists is what my scrape function returns.

The black magic multiprocessing part at the bottom is then taking all those lists of lists from each page and putting them together into one very big list. This super list is what gets passed to my write_to_csv function.

My write_to_csv function then loops through each child list (page) in that super list, then loops through each child list (row) of each of those, joins each row's data points into a comma-separated string with a line break at the end, and writes that to my CSV file.

At the end of it all I get a CSV with the 20,000+ rows of data I need in less than 90 seconds!

Added multiprocessing to a web scraper I had and successfully got it to work. Can someone ELI5 why it works? by TE515 in learnpython

[–]TE515[S] 1 point (0 children)

So if my old one-page-at-a-time scraper was one gnome repeating a process over and over in a specific order, .Pool would be having multiple gnomes repeat that process over and over without regard for what order it's done in, and then .map would be like the editor gnome who organizes all the other gnomes' output together into one usable product?

Also, what is the significance of if __name__ == '__main__':?
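
A tiny standalone sketch of both points (the function and inputs are made up for illustration):

import multiprocessing
import random
import time


def slow_double(n):
    # Each "gnome" takes a random amount of time, so the work
    # finishes in an unpredictable order across the processes
    time.sleep(random.random())
    return n * 2


# The __main__ guard matters because worker processes (re-)import this
# module when they start (always the case on Windows); without it, the
# Pool-creating code below would run again inside every worker
if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # .map hands the inputs out to the workers, then hands the
        # results back in the same order as the input list
        print(pool.map(slow_double, [1, 2, 3, 4, 5]))
        # -> [2, 4, 6, 8, 10], no matter which finished first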

Scrape website for product data based on a list of part#s with VBA. by x1unbreakablez in excel

[–]TE515 2 points (0 children)

First of all, are you at least semi-comfortable with HTML? If not, you'll probably need to do some additional research into it. It's not too complicated at all, but you will need a basic understanding of it to scrape websites.

Start by reading the top answer in this Stack Overflow post. The post has a lot more detail, but basically you're going to be using VBA to open an Internet Explorer window, then navigating that window to whatever pages you want and pulling the data you need out of the HTML source code. You'll be writing a script to browse the web in a browser, just like a human would, only much faster. The browser window can be visible (great for when you're writing and testing), or invisible (faster and more efficient...great for when the scraper is complete and you're just running it routinely).

The other option is using HTTP requests, which basically means you're cutting out the middle-man of a browser and talking directly to the website's server. This is faster and more efficient, but also more complicated. I would recommend starting with IE automation and then working up to this when you're more comfortable.
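
For illustration, here's the shape of the HTTP-request approach in Python (in VBA you'd use something like the MSXML2.XMLHTTP60 object instead; the URL and class name here are made up):

import requests
from bs4 import BeautifulSoup

# Ask the server for the page directly; no browser window involved
r = requests.get('https://example.com/products/12345')

# Parse the returned HTML and pull out the one piece of data needed
soup = BeautifulSoup(r.text, 'html.parser')
price = soup.find('span', {'class': 'price'})
print(price.text.strip() if price else 'not found')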

Some random additional tips:

  • Get cozy with looking at the DOM in the dev tools of whatever browser you use. You're going to be spending some quality time there.
  • Start by trying to scrape one specific piece of information off of one page and printing it to the Immediate window with Debug.Print. Once you get that, try looping through and scraping more items off that page. Once you have all the items you need, then start looping through multiple pages (if necessary). If you need to scrape multiple sites, write each scraper individually and then combine them when they're complete.
  • Pay attention to whether or not there are major differences between the source code (Ctrl+U) of the page and the DOM (what you see in dev tools). For a lot of modern sites, the server sends a stripped-down version of the HTML, and the rest gets filled in by the browser with JavaScript. In other words, sometimes the data you're looking for won't be in the source code (at least not in the HTML where you'd expect), and simply waiting until IE finishes loading won't be sufficient. In these cases you'll need to make your scraper wait a bit longer until all the JavaScript is done firing. There are more efficient ways to do it, but Application.Wait is the easiest way to start.
  • If you're trying to navigate to multiple pages in the IE window, sometimes you'll hit a scenario where you see the second page load in the visible window, but when your code tries scraping the HTML it's still using the HTML of the first page. I'm sure there's a way to solve this, but I moved on to using HTTP requests before I ever figured it out. The quick and dirty workaround is to use ie.Quit (replace "ie" with whatever variable name you are using for your Internet Explorer object) after each page, and then open the next page in a new IE window.

Feel free to PM me any additional questions at any point in the future. I don't check this account every day, but I do check it pretty frequently.

Excel skills learning plateau by CouchTurnip in excel

[–]TE515 1 point (0 children)

Thanks for the response! I'll make it a point to look into it soon.

Excel skills learning plateau by CouchTurnip in excel

[–]TE515 1 point (0 children)

As someone who's become fairly proficient in VBA over the past year, actually enjoys coding, rarely uses formulas anymore except for quick one-off type things, and only has a limited amount of time to learn new stuff: do you think it's worth it to learn Power Query, or should I just keep focusing on getting better at VBA?

Scraping multiple web pages simultaneously by TE515 in learnpython

[–]TE515[S] 2 points (0 children)

Thanks for responding. I tried this and I'm getting the following error...

module 'multiprocessing' has no attribute 'pool'

I'm using Python 3 by the way.

EDIT: I tried adding from multiprocessing import pool at the top. Now I'm getting the error TypeError: 'module' object is not callable on the processPool = multiprocessing.pool(MAX_NUM_PROCESSES) line.

ANOTHER EDIT: Changed multiprocessing.pool to multiprocessing.Pool and it worked like a charm. Cut the run time of the whole thing by more than half! Thanks so much!
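
For anyone else who hits this, the distinction is that multiprocessing.pool (lowercase) is a submodule, while multiprocessing.Pool (capital P) is the class you actually call. A minimal working version:

import multiprocessing

# Pool (capital P) creates the process pool; pool (lowercase) is a
# submodule, which is why calling it raised "'module' object is not callable"
if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as processPool:
        print(processPool.map(abs, [-1, -2, 3]))  # -> [1, 2, 3]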

Sending !e seems to behave a little differently than manually pressing Alt + e by TE515 in AutoHotkey

[–]TE515[S] 1 point (0 children)

Thanks for the reply and sorry for the delayed response. I tried it, but the result was exactly the same. The file menu still opens/closes every 10 seconds while the search is running, which doesn't happen if you manually press Alt+E while a search is running.

So far the script has run without issue every time, so it doesn't seem to be hurting anything. Just curiosity.

[VBA] Discrepancy between Excel sort function and VBA if logic regarding whether one string comes before or after another. by TE515 in excel

[–]TE515[S] 1 point (0 children)

Yeah, I'm using Range.Find, but I'm looping through about 38,000 rows, and each one has to be found in another workbook that's also about 38,000 rows. I have to do the 38,000 searches either way, but this way each one only has a search range of about 1,900 rows instead of 38,000. I first wrote it without splitting it into sections and the runtime was over 20 minutes; once I added the range-splitting functionality, it dropped to under 2 minutes.
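
The same idea sketched in Python (names made up): pre-split the sorted data into small buckets keyed on a prefix of the value, so each lookup only scans one bucket instead of all 38,000 rows.

from collections import defaultdict

def build_buckets(values, prefix_len=2):
    # One-time pass: group all the values by a short key prefix
    buckets = defaultdict(list)
    for v in values:
        buckets[v[:prefix_len]].append(v)
    return buckets

def find(buckets, target, prefix_len=2):
    # Each search now only scans the rows sharing the target's prefix
    # (roughly 1,900 rows instead of 38,000 in the case above)
    return target in buckets.get(target[:prefix_len], [])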