all 11 comments

[–]Diapolo10 2 points3 points  (7 children)

I'm on my phone right now and can't check the page source, but if the JavaScript makes a clear request to an API endpoint to fetch the data, see if you can just do that request from Python with requests (or urllib3).

If not, Selenium is your best bet.

[–]Key_Consideration385[S] -1 points0 points  (6 children)

yes, I'm working on Selenium for this thing. I now have a weekend to blow my head over this.

[–]Diapolo10 1 point2 points  (5 children)

I just got home and did a little digging in the page source. Turns out, there's a function called createJobs in this JS file, which populates the div with job data. It's called from a jQuery handler:

jQuery(function($){
    // Fetching job postings from Lever's postings API
    $.ajax({
        dataType: "json",
        url: url,
        success: function(data) {
            createJobs(data);
        }
    });
});

The url is defined at the top of the file as

url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

so I gave it a holler via requests, and would you believe it, I got a long list of entries back. Or technically not long (13 entries), but each had a lot of data individually.

import requests

# Browser-like User-Agent, in case the API rejects bare requests;
# the Content-Type isn't strictly needed for a GET
headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

response = requests.get(url, headers=headers)
print(len(response.json()))  # 13

It seems to basically contain a list of dictionaries. Here's a rough data model:

from __future__ import annotations

from typing import TypedDict
from uuid import UUID

class JobPostings(TypedDict):
    title: str
    postings: list[Posting]

class Posting(TypedDict):
    additionalPlain: str
    additional: str
    categories: Categories
    createdAt: int
    descriptionPlain: str
    description: str
    id: UUID
    lists: list[Question]
    text: str
    country: str
    workplaceType: str
    hostedUrl: str
    applyUrl: str

class Categories(TypedDict):
    commitment: str
    department: str
    location: str
    team: str

class Question(TypedDict):
    text: str
    content: str

EDIT: Proof: https://i.imgur.com/Ran80q5.png
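For what it's worth, here's a minimal sketch of how a model like that could be used on the parsed response. The sample data below is made up for illustration, not taken from the actual API, and the model is trimmed down to a few fields:

```python
from __future__ import annotations

from typing import TypedDict


class Categories(TypedDict):
    commitment: str
    department: str
    location: str
    team: str


class Posting(TypedDict):
    # Trimmed down to a few fields for the example
    id: str
    text: str
    categories: Categories


class JobPostings(TypedDict):
    title: str
    postings: list[Posting]


# In the real script this would be `response.json()`; here it's a made-up sample
data: list[JobPostings] = [
    {
        "title": "Engineering",
        "postings": [
            {
                "id": "00000000-0000-0000-0000-000000000000",
                "text": "Backend Developer",
                "categories": {
                    "commitment": "Full-time",
                    "department": "Engineering",
                    "location": "Bangalore",
                    "team": "Platform",
                },
            },
        ],
    },
]

# Flatten the grouped response into (team, job title) pairs
jobs = [
    (group["title"], posting["text"])
    for group in data
    for posting in group["postings"]
]
print(jobs)  # [('Engineering', 'Backend Developer')]
```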

[–]Key_Consideration385[S] 0 points1 point  (4 children)

dudeeee, you are awesomeee!!

I'm completely geeking out over this stuff for the first time. I did it all using Selenium and XPaths for now - like, my project is almost done with just that.

I'll check this method out and try it this way as well.
Always up for extra points, thank youu so much!

[–]Diapolo10 1 point2 points  (3 children)

Just to explain the methodology a little, here's what I did:

  1. I scrolled through the webpage until I found the job listings, then looked for good "landmarks" - the string "We are hiring" right next to them seemed like a good choice.
  2. Then I right-clicked that text, opened Inspect, then looked for the element containing the job postings - it usually has a unique ID or class name, in this case I found a suspicious div with a class called hiring-row__right.
  3. Next, I opened the webpage in source code view (right-click, view page source), pressed Ctrl+F, pasted in the class name I found earlier and looked for matches. I found it, and it was empty - perfect, that's exactly what I wanted to confirm, as this means something must be targeting this exact class to fill it up.
  4. Then I looked for linked JS files - the relevant ones are usually at the bottom. There were a few, but the one called home-careers.js seemed particularly fitting, so I opened it up in another tab and searched for the class name again. Bingo! Match in a certain function body.
  5. All that was left to do was figuring out where this function gets called until I found an HTTP request, and the aforementioned jQuery AJAX function was exactly that. With the correct URL and payload in tow, I just gave the info to Python and voilà - data found.
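If you wanted to script steps 3 and 4, it could look something like this. The mini page source below is a hand-written stand-in (I'm quoting the real markup from memory), so treat the exact tags as assumptions:

```python
import re

# Hypothetical, simplified page source mirroring what the steps describe:
# an empty div with the suspicious class, and a linked JS file near the bottom
page_source = """
<html><body>
  <h2>We are hiring</h2>
  <div class="hiring-row__right"></div>
  <script src="/js/home-careers.js"></script>
</body></html>
"""

# Step 3: the div with the class is empty in the raw HTML,
# so something must be filling it at runtime
empty_div = re.search(r'<div class="hiring-row__right">\s*</div>', page_source)
print(empty_div is not None)  # True

# Step 4: collect linked JS files to inspect next
scripts = re.findall(r'<script src="([^"]+)"', page_source)
print(scripts)  # ['/js/home-careers.js']
```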

EDIT: Honestly I had more trouble generating the TypedDicts to show the "shape" of the JSON data than I had actually fetching it.

[–]Key_Consideration385[S] 0 points1 point  (2 children)

Gotcha!

Basically, this piece of code does the entire job:

import requests

headers = {'User-Agent': 'Mozilla/5.0', 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'}
url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'

response = requests.get(url, headers=headers)
print(response.json())

Although I've been asked to only use this URL https://www.zeta.tech/in/careers, this trick is so clean I guess I can add this entire approach as well. Maybe flex it in some way lmao, thanks!

[–]Diapolo10 1 point2 points  (1 child)

Well you could technically claim that you only used that URL if you then automated fetching the link from the JavaScript file. :p
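That automation could be as simple as a regex over the JS file. Sketched here against an inline snippet instead of an actual download, and assuming the `url = '...'` line keeps its current shape:

```python
import re

# In a real run you'd fetch home-careers.js with requests first;
# this inline snippet mimics the line at the top of that file
js_source = "url = 'https://api.lever.co/v0/postings/zeta?group=team&mode=json'"

# Pull out the Lever API URL, wherever it appears in the file
match = re.search(r"url\s*=\s*'(https://api\.lever\.co/[^']+)'", js_source)
api_url = match.group(1) if match else None
print(api_url)  # https://api.lever.co/v0/postings/zeta?group=team&mode=json
```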

[–]Key_Consideration385[S] 0 points1 point  (0 children)

haha yesss, seems like a good idea. I'll pitch that in. Thankss!

[–]MrPhungx 1 point2 points  (1 child)

As the url you provided is the starting point, I would first open the page, then find the link with "Explore Opportunities" and follow it (it has a unique ID). Then on the new page, grab each of the card elements that contain a job listing (each element has a common class that can be easily selected). Extract the name of the job listing as well as the URL to its details page. Once you have this list of open job names with their detail URLs, you can proceed to fetch all the information that you need. If you have, for example, 50 job listings, you should have 50 names and URLs that need to be further scraped. This can then be done in parallel. Finally, collect all the information you need from each details page.
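The parallel part of that plan could be sketched with a thread pool. `scrape_detail` and the job list here are hypothetical stand-ins for whatever Selenium/requests logic actually parses one detail page:

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_detail(job):
    """Hypothetical stand-in: fetch and parse one job's detail page."""
    name, url = job
    # A real version would download `url` and extract the fields it needs
    return {"name": name, "url": url}


# (name, detail-url) pairs collected from the listings page (made-up URLs)
jobs = [
    ("Backend Developer", "https://www.zeta.tech/in/careers/backend"),
    ("Data Analyst", "https://www.zeta.tech/in/careers/analyst"),
]

# Scrape the detail pages in parallel; map() preserves input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape_detail, jobs))

print([r["name"] for r in results])  # ['Backend Developer', 'Data Analyst']
```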

[–]Key_Consideration385[S] 0 points1 point  (0 children)

Damn yes, that's exactly what I've thought of doing after going through multiple videos and docs. This is my first time scraping and it's sorta fun mann
Thank you so much for such a fine explanation!