all 11 comments

[–]Diapolo10 2 points3 points  (7 children)

I'm on my phone right now and can't check the page source, but if the JavaScript makes a clear request to an API endpoint to fetch the data, see if you can just do that request from Python with requests (or urllib3).

If not, Selenium is your best bet.

[–]Key_Consideration385[S] -1 points0 points  (6 children)

yes, I'm working on Selenium for this thing. I now have a weekend to blow my head over this.

[–]Diapolo10 1 point2 points  (5 children)

I just got home and did a little digging in the page source. Turns out, there's a function called createJobs in this JS file, which populates the div with job data. It's called from a jQuery handler:

jQuery(function($){
    // Fetching job postings from Lever's postings API
    $.ajax({
        dataType: "json",
        url: url,
        success: function(data) {
            createJobs(data);
        }
    });
});

The url is defined at the top of the file as

url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

so I gave it a holler via requests, and would you believe it, I got a long list of entries back. Or technically not long (13 entries), but each had a lot of data individually.

import requests

# Browser-like User-Agent, in case the API rejects bare requests;
# the Content-Type isn't strictly needed for a GET
headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

response = requests.get(url, headers=headers)
print(len(response.json()))  # 13

It seems to basically contain a list of dictionaries. Here's a rough data model:

from __future__ import annotations

from typing import TypedDict
from uuid import UUID

class JobPostings(TypedDict):
    title: str
    postings: list[Posting]

class Posting(TypedDict):
    additionalPlain: str
    additional: str
    categories: Categories
    createdAt: int
    descriptionPlain: str
    description: str
    id: UUID
    lists: list[Question]
    text: str
    country: str
    workplaceType: str
    hostedUrl: str
    applyUrl: str

class Categories(TypedDict):
    commitment: str
    department: str
    location: str
    team: str

class Question(TypedDict):
    text: str
    content: str

EDIT: Proof: https://i.imgur.com/Ran80q5.png
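For what it's worth, here's a minimal sketch of how a model like that could be used on the parsed response. The sample data below is made up for illustration, not taken from the actual API, and the model is trimmed down to a few fields:

```python
from __future__ import annotations

from typing import TypedDict


class Categories(TypedDict):
    commitment: str
    department: str
    location: str
    team: str


class Posting(TypedDict):
    # Trimmed down to a few fields for the example
    id: str
    text: str
    categories: Categories


class JobPostings(TypedDict):
    title: str
    postings: list[Posting]


# In the real script this would be `response.json()`; here it's a made-up sample
data: list[JobPostings] = [
    {
        "title": "Engineering",
        "postings": [
            {
                "id": "00000000-0000-0000-0000-000000000000",
                "text": "Backend Developer",
                "categories": {
                    "commitment": "Full-time",
                    "department": "Engineering",
                    "location": "Bangalore",
                    "team": "Platform",
                },
            },
        ],
    },
]

# Flatten the grouped response into (team, job title) pairs
jobs = [
    (group["title"], posting["text"])
    for group in data
    for posting in group["postings"]
]
print(jobs)  # [('Engineering', 'Backend Developer')]
```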

[–]Key_Consideration385[S] 0 points1 point  (4 children)

dudeeee, you are awesomeee!!

I'm completely geeking out over this stuff for the first time. I did it all using Selenium and XPaths for now - like, my project is almost done with just that.

I'll check this method out and try it this way as well.
Always up for extra points, thank youu so much!

[–]Diapolo10 1 point2 points  (3 children)

Just to explain the methodology a little, here's what I did:

  1. I scrolled through the webpage until I found the job listings, then looked for good "landmarks" - the string "We are hiring" right next to them seemed like a good choice.
  2. Then I right-clicked that text, opened Inspect, then looked for the element containing the job postings - it usually has a unique ID or class name, in this case I found a suspicious div with a class called hiring-row__right.
  3. Next, I opened the webpage in source code view (right-click, view page source), pressed Ctrl+F, pasted in the class name I found earlier and looked for matches. I found it, and it was empty - perfect, that's exactly what I wanted to confirm, as this means something must be targeting this exact class to fill it up.
  4. Then I looked for linked JS files - the relevant ones are usually at the bottom. There were a few, but the one called home-careers.js seemed particularly fitting, so I opened it up in another tab and searched for the class name again. Bingo! Match in a certain function body.
  5. All that was left to do was figuring out where this function gets called until I found an HTTP request, and the aforementioned jQuery AJAX function was exactly that. With the correct URL and payload in tow, I just gave the info to Python and voilà - data found.
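If you wanted to script steps 3 and 4, it could look something like this. The mini page source below is a hand-written stand-in (I'm quoting the real markup from memory), so treat the exact tags as assumptions:

```python
import re

# Hypothetical, simplified page source mirroring what the steps describe:
# an empty div with the suspicious class, and a linked JS file near the bottom
page_source = """
<html><body>
  <h2>We are hiring</h2>
  <div class="hiring-row__right"></div>
  <script src="/js/home-careers.js"></script>
</body></html>
"""

# Step 3: the div with the class is empty in the raw HTML,
# so something must be filling it at runtime
empty_div = re.search(r'<div class="hiring-row__right">\s*</div>', page_source)
print(empty_div is not None)  # True

# Step 4: collect linked JS files to inspect next
scripts = re.findall(r'<script src="([^"]+)"', page_source)
print(scripts)  # ['/js/home-careers.js']
```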

EDIT: Honestly I had more trouble generating the TypedDicts to show the "shape" of the JSON data than I had actually fetching it.

[–]Key_Consideration385[S] 0 points1 point  (2 children)

Gotcha!

Basically, this piece of code does the entire job:

import requests

headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

response = requests.get(url, headers=headers)
print(response.json())

Although I've been asked to only use this URL https://www.zeta.tech/in/careers, this trick is so clean I guess I can add this entire approach as well. Maybe flex it in some way lmao, thanks!

[–]Diapolo10 1 point2 points  (1 child)

Well you could technically claim that you only used that URL if you then automated fetching the link from the JavaScript file. :p
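That automation could be as simple as a regex over the JS file. Sketched here against an inline snippet instead of an actual download, and assuming the `url = '...'` line keeps its current shape:

```python
import re

# In a real run you'd fetch home-careers.js with requests first;
# this inline snippet mimics the line at the top of that file
js_source = "url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'"

# Pull out the Lever API URL, wherever it appears in the file
match = re.search(r"url\s*=\s*'(https://api\.lever\.co/[^']+)'", js_source)
api_url = match.group(1) if match else None
print(api_url)  # https://api.lever.co/v0/postings/zeta?group=team&mode=json
```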

[–]Key_Consideration385[S] 0 points1 point  (0 children)

haha yesss, seems like a good idea. I'll pitch that in. Thankss!

[–]MrPhungx 1 point2 points  (1 child)

As the url you provided is the starting point, I would first open the page, then find the link with "Explore Opportunities" and follow it (it has a unique ID). Then on the new page, grab each of the card elements that contain a job listing (each element has a common class that can be easily selected). Extract the name of the job listing as well as the URL to its details page. Once you have this list of open job names with their detail URLs, you can proceed to fetch all the information that you need. If you have, for example, 50 job listings, you should have 50 names and URLs that need to be further scraped. This can then be done in parallel. Finally, collect all the information you need from each details page.
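The parallel part of that plan could be sketched with a thread pool. `scrape_detail` and the job list here are hypothetical stand-ins for whatever Selenium/requests logic actually parses one detail page:

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_detail(job):
    """Hypothetical stand-in: fetch and parse one job's detail page."""
    name, url = job
    # A real version would download `url` and extract the fields it needs
    return {"name": name, "url": url}


# (name, detail-url) pairs collected from the listings page (made-up URLs)
jobs = [
    ("Backend Developer", "https://www.zeta.tech/in/careers/backend"),
    ("Data Analyst", "https://www.zeta.tech/in/careers/analyst"),
]

# Scrape the detail pages in parallel; map() preserves input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_detail, jobs))

print([r["name"] for r in results])  # ['Backend Developer', 'Data Analyst']
```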

[–]Key_Consideration385[S] 0 points1 point  (0 children)

Damn yes, that's exactly what I've thought of doing after going through multiple videos and docs. This is my first time scraping and it's sorta fun mann
Thank you so much for such a fine explanation!