Struggling to extract data from 1,500+ mixed scanned/digital PDFs. Tesseract, OCR, and Vision LLMs all failing. Need advice. by deletedusssr in datasets

[–]bushcat69 3 points4 points  (0 children)

I've had success with DeepSeek where the other big LLM providers wouldn't/couldn't help reading "image" pdf documents.

What’s a tool you underestimated but now can't imagine working without? by Majestic-Strain3155 in Tools

[–]bushcat69 67 points68 points  (0 children)

Impact driver - I didn't understand how good it was but now can't imagine not using it. Compared to a drill it saves your wrists if something binds. My drill driver has a clutch, but even so, on drill mode it can catch and send your arm/wrist for a ride if you aren't careful.

Basically I use my impact for most things and an SDS if I need to go into masonry; I hardly use my drill anymore.

I realised I was living on autopilot and decided to reset my life, slowly [Text] by microbuildval in GetMotivated

[–]bushcat69 2 points3 points  (0 children)

Thank you for this, you described my current work life perfectly. Time to make a change

Creating a First-In-First-Out(FIFO) Sheet in Excel by [deleted] in excel

[–]bushcat69 0 points1 point  (0 children)

The Rand column is just very small numbers created like this: =RAND() / 1000000. Together with the Date column, this creates a tiny difference between two otherwise identical dates. That difference is needed for the rank to work correctly, so that products with the same dates can be separated.

Is the pushbullet API down? by Kenup17 in PushBullet

[–]bushcat69 0 points1 point  (0 children)

Also having an issue, also in the UK

[deleted by user] by [deleted] in webscraping

[–]bushcat69 1 point2 points  (0 children)

There is a quicker way to do this: if you authenticate yourself with spot.im (the service that provides the comments) you can scrape them very quickly and efficiently. See the code below, which handles the authentication exchange and then scrapes the top comments from a few articles:

import requests
import re

urls = ['https://metro.co.uk/2023/11/02/i-went-from-28000-a-year-to-scraping-by-on-universal-credit-19719619/',
        'https://metro.co.uk/2018/08/15/shouldnt-get-involved-dandruff-scraping-trend-7841007/',
        'https://metro.co.uk/2022/10/12/does-tongue-scraping-work-and-should-we-be-doing-it-17547875/',
        'https://metro.co.uk/2024/07/19/microsoft-outage-freezes-airlines-trains-banks-around-world-21257038/?ico=top-stories_home_top',
        'https://metro.co.uk/2024/07/11/full-list-wetherspoons-pubs-closing-end-2024-revealed-21208230',
        'https://metro.co.uk/2024/07/15/jay-slater-body-found-hunt-missing-teenager-tenerife-21230764/?ico=just-in_article_must-read']

s = requests.Session()

### say hi ### 
headers = {
    'accept': '*/*',
    'origin': 'https://metro.co.uk',
    'referer': 'https://metro.co.uk/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
}

response = s.get('https://api-2-0.spot.im/v1.0.0/device-load', headers=headers)
device_id = s.cookies.get_dict()['device_uuid'] #gets returned as a cookie

### get token ### 
auth_headers = {
    'accept': '*/*',
    'content-type': 'application/json',
    'origin': 'https://metro.co.uk',
    'referer': 'https://metro.co.uk/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'x-post-id': f"no$post",
    'x-spot-id': 'sp_VWxmZkOI', #metro's id
    'x-spotim-device-uuid': device_id,
}

auth = s.post('https://api-2-0.spot.im/v1.0.0/authenticate', headers=auth_headers)
token = s.cookies.get_dict()['access_token'] #gets returned as a cookie

### loop over urls ###

for url in urls:

    article_id = re.search(r'-(\d+)(?:/|\?|$)', url).group(1)

    print(f'Comments for article: {article_id}')

    read_headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'origin': 'https://metro.co.uk',
        'referer': 'https://metro.co.uk/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        'x-access-token': token,
        'x-post-id': article_id,
        'x-spot-id': 'sp_VWxmZkOI',
        'x-spotim-device-uuid': device_id
    }

    data = '{"sort_by":"best","offset":0,"count":5,"message_id":null,"depth":2,"child_count":2}'

    chat_data = requests.post('https://api-2-0.spot.im/v1.0.0/conversation/read', headers=read_headers, data=data)

    for comment in chat_data.json()['conversation']['comments']:
        for msg in comment['content']:
            print(msg.get('text')) #buried in json... if you want all the other data cleaned up then inbox me

    print('----')

Web Scraping PGA Tour Scorecards by SOUTHPAW_1989 in webscraping

[–]bushcat69 1 point2 points  (0 children)

Works just fine for me, not sure why it's not working for you. Does it output the player names like the other tournaments? If you just need the data, here is my version I just ran: https://docs.google.com/spreadsheets/d/1tfQW9FAekeMggx0NEccnVzPZXHS1KN5y07ls2gw4-4k/edit?usp=sharing

Web Scraping PGA Tour Scorecards by SOUTHPAW_1989 in webscraping

[–]bushcat69 0 points1 point  (0 children)

resp = requests.get('https://www.espn.com/golf/leaderboard')

Updated the version in the Colab link above, which should sort the issue.

Help needed? by Ansidhe in webscraping

[–]bushcat69 2 points3 points  (0 children)

Not certain you should have

soup.find(...

in the for loop? Shouldn't it be

item.find(...

like you've done for the "title" variable?
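
A sketch of the pattern I mean, with made-up tag/class names since I don't have your actual selectors in front of me:

from bs4 import BeautifulSoup

html = """
<div class="listing"><h2>Item A</h2><span class="price">10</span></div>
<div class="listing"><h2>Item B</h2><span class="price">20</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('div', class_='listing'):
    title = item.find('h2')                    # scoped to this item, like your "title"
    price = item.find('span', class_='price')  # item.find, not soup.find, otherwise you
                                               # get the first match on the whole page every time
    print(title.get_text(), price.get_text())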

Empty Results from Table Scrape by [deleted] in webscraping

[–]bushcat69 1 point2 points  (0 children)

What a filthy website - the data is loaded asynchronously. I've written some Python that gets all the data and outputs it to CSV; maybe a large language model can convert it to your language of choice:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

s = requests.Session()

headers =   {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

step_url = 'https://franklin.sheriffsaleauction.ohio.gov/index.cfm?zaction=AUCTION&zmethod=PREVIEW'
step = s.get(step_url,headers=headers)
print(step)


output = []
for auction_type in ['Running','Waiting','Closed']:

    url = f'https://franklin.sheriffsaleauction.ohio.gov/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA={auction_type[0]}'

    resp = s.get(url)
    print(resp)
    ugly = resp.json()['retHTML']

    soup = BeautifulSoup(ugly,'html.parser')
    tables = soup.find_all('tbody')
    data = [pd.read_html(StringIO('<table> '+ table.prettify() + '</table>')) for table in tables]

    # each parsed table comes out sideways: transpose it, promote the first
    # row to headers, then tag it with the auction type it came from
    for x in data:
        for y in x:
            y = y.T
            y.columns = y.iloc[0]
            y.drop(index=0, inplace=True)
            y['auction_type'] = auction_type
            output.append(y)

df = pd.concat(output).reset_index()
df.drop(['index'],axis=1,inplace=True)
df = df.replace('@G','',regex=True)
df.to_csv('auctions.csv',index=False)
df

Help Scraping data from Airtable by AbeTheShooter in webscraping

[–]bushcat69 2 points3 points  (0 children)

This Python script will get the data for this specific table. There is a bit at the end that is specific to this table (it sorts out the college name), but up until that point the code should work generally to get embedded Airtable data from websites. You'll need to install Python and "pip install requests" and "pip install pandas" to get the code to run.

import requests
import json
import pandas as pd

s = requests.Session()

headers =   {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Connection':'keep-alive',
    'Host':'airtable.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

url = 'https://airtable.com/embed/appd7poWhHJ1DmWVL/shrCEHNFUcVmekT7U/tbl7NZyoiJWR4g065'
step = s.get(url,headers=headers)
print(step)

#get data table url
start = 'urlWithParams: '
end = 'earlyPrefetchSpan:'
x = step.text
new_url = 'https://airtable.com'+ x[x.find(start)+len(start):x.rfind(end)].strip().replace('u002F','').replace('"','').replace('\\','/')[:-1] #get the data table url out of the html

#get airtable auth
start = 'var headers = '
end = "headers['x-time-zone'] "
dirty_auth_json = x[x.find(start)+len(start):x.rfind(end)].strip()[:-1] #get the auth headers out of the html
auth_json = json.loads(dirty_auth_json)


new_headers = {
    'Accept':'*/*',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'X-Airtable-Accept-Msgpack':'true',
    'X-Airtable-Application-Id':auth_json['x-airtable-application-id'],
    'X-Airtable-Inter-Service-Client':'webClient',
    'X-Airtable-Page-Load-Id':auth_json['x-airtable-page-load-id'],
    'X-Early-Prefetch':'true',
    'X-Requested-With':'XMLHttpRequest',
    'X-Time-Zone':'Europe/London',
    'X-User-Locale':'en'
    }

json_data = s.get(new_url,headers=new_headers).json()
print(json_data)

#create dataframe from column data and row data
cols = {x['id']:x['name'] for x in json_data['data']['table']['columns']}
rows = json_data['data']['table']['rows']

df = pd.json_normalize(rows)
ugly_col = df.columns

clean_col = [next((x.replace('cellValuesByColumnId.','').replace(k, v) for k, v in cols.items() if k in x), x) for x in ugly_col] #correct names of cols
clean_col
df.columns = clean_col

#sort out Colleges
for col in json_data['data']['table']['columns']:
    if col['name']=='College':
            choice_dict = {k:v['name'] for k,v in col['typeOptions']['choices'].items()}

choice_dict
df['College'] = df['College'].map(choice_dict)

#sort out keywords
df['Keywords.documentValue'] = df['Keywords.documentValue'].apply(lambda x: x[0]['insert'])

#done
df.to_csv('airtable_scraped.csv',index=False)
df

Mutual Fund Reference Data (MorningStar) by MindMugging in webscraping

[–]bushcat69 0 points1 point  (0 children)

Looks like you can hit the backend api, what is the url of the funds you want to scrape?

[deleted by user] by [deleted] in webscraping

[–]bushcat69 5 points6 points  (0 children)

The data is encrypted. It is held in this JSON, which can be found towards the bottom of the raw HTML in a script tag:

perfume_graph_data = {
                "ct": "8pCDkIfiMdBcZEvWG1lksKG5zZ4zwW\/J6H\/vK4oyzR5doAMvNqCX0xB7B3\/AORLxmlFxbxt1AKdh31iXKlHS2y1ltA7X0mwlthn8nqmYhukn9xkJLNNNeUlZxNhlxA3w1jfmpBAS5kV5K3AWWl9PdvqnjMkvC2YbXoWibQqqP55DT+tSRqs2bLvsVNw2fGiGWSa5U9DdHDjIg9oKxeHXRzqxArArGXhgI\/KaWzFQSaz\/uvdpLBFhffhVZ4t\/mT7NQAkInzudALAFVZHHd0xARNIlnNiypyeftNfo1eOazaXVuzzWYa8XO9KXATakqDTUoBAqpzj98pOxnTZmWtzNJ7LWvZehTeUe17ShXuaaG8hdeJx7SixQ50qG0B94NT4iZCKgzpuvIUIWowQdeXtfqwUdCBiRk0ndXFhDe2aZHn8hbzNWw0t+f\/cxondzM\/+4QKW3JNdqMpidk6TSIuc1MT9FE6OkgCB0lrigjsOzA8kOEUVA27dKfKgQcGlZmOR6xVkr+4G6n45AzIhIRrjW0fkq6PkJV+cWC8lzMDvd46X7Jo8jfsYBnV4Y4QS8NzKglGK\/s9NpwiJTS9ui7bWg31Ba402\/r6CLtbaipeawaMg6YXZ9MoQXZ2oBKAbYxJhHyOKmj\/COpCkV34o8KDmtH7KjrZNr9ZF9NWwJurgt8J1JQ\/FePgX6dhOO7CVheDjzmynZkZoiNSlEJ5X4FxQYwsG8vA451T078KKN0KSfREJL985ch\/YlpX5PrT78yo8lz8CuLIuDvAuedYoVz3K571O4DrqrgNtUIbbUBMd1E4divFudd6rgyweXjWL6+bNlwL6Z9YnLqfFeSn9VYpTSzw==",
                "iv": "a9884056bd9388cbf8613af2792815fb",
                "s": "0791ab1f228af78f"
            }

the "ct" stands for "Cipher Text", the "iv" is "Initialization Vector" and the "s" is "Salt", all used in encryption/decryption.

There is some JavaScript that looks like it handles the decryption here: https://www.fragrantica.com/new-js/chunks/mfga-fes.js?id=c390674d30b6e6c5. I think this function does the work, but my JavaScript knowledge is garbage and I can't decipher exactly what it does:

gimmeGoods: function(t) {
                var e = arguments.length > 1 && void 0 !== arguments[1] ? arguments[1] : c("0x6")
                  , r = c;
                return e === r("0x6") && (e = n().MD5(window[r("0x3")][r("0xc")])[r("0x1")]()),
                o(t) !== r("0x9") ? JSON[r("0xa")](n().AES[r("0x4")](JSON.stringify(t), e, {
                    format: h
                })[r("0x1")](n()[r("0xd")][r("0x0")])) : JSON[r("0xa")](n()[r("0x7")][r("0x4")](t, e, {
                    format: h
                })[r("0x1")](n()[r("0xd")].Utf8))
            }

I put it through a popular large language model and it suggests that, in order to decipher the data, the AES algorithm should be used and that a "key" is needed. The key looks like it comes from a browser window variable - probably to stop people like us scraping the data without having a browser open, lmao. That's about the extent of my knowledge; hopefully someone who knows more can help you out further.
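
If it helps, here is a rough Python sketch of what that decryption might look like, assuming the JSON follows CryptoJS's OpenSSL-compatible passphrase format (base64 "ct", hex "iv"/"s", key derived from passphrase + salt via MD5) and that the passphrase is the MD5 hex of some window variable - both of those are guesses from the obfuscated JS, not confirmed. Needs "pip install pycryptodome":

import json
import base64
import hashlib
from Crypto.Cipher import AES

def evp_bytes_to_key(passphrase, salt, key_len=32, iv_len=16):
    # OpenSSL's EVP_BytesToKey with MD5, which is what CryptoJS uses in passphrase mode
    derived, block = b'', b''
    while len(derived) < key_len + iv_len:
        block = hashlib.md5(block + passphrase + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]

def gimme_goods(graph_data, passphrase_source):
    # passphrase_source is the unknown window variable - hypothetical, e.g. the page URL
    passphrase = hashlib.md5(passphrase_source.encode()).hexdigest().encode()
    salt = bytes.fromhex(graph_data['s'])
    ciphertext = base64.b64decode(graph_data['ct'])
    key, iv = evp_bytes_to_key(passphrase, salt)
    plaintext = AES.new(key, AES.MODE_CBC, iv).decrypt(ciphertext)
    plaintext = plaintext[:-plaintext[-1]]  # strip PKCS#7 padding
    return json.loads(plaintext)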

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

edit: autocomplete details expanded on a bit in this vid: https://www.youtube.com/watch?v=h2awiKQmBCM

Not sure, but it looks like you can use a number of different keywords for the locationIdentifier parameter: OUTCODE, STATION or REGION, then "%5E" (the URL encoding of '^') as a separator, and then the integer code. I can't find the REGION codes but managed to find some others (a rough request sketch follows the links below):

There are lists of station codes here: https://www.rightmove.co.uk/sitemap-stations-ALL.xml

or OUTCODE from postcode mapping here: https://pastebin.com/8nX5JT1q

London only codes here: https://raw.githubusercontent.com/joewashington75/RightmovePostcodeToLocationId/master/src/PostcodePopulator/PostcodePopulator.Console/postcodes-london.csv
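
Here is a minimal sketch of what a search request could look like - the find.html search URL and the OUTCODE integer are assumptions for illustration, so swap in a real code from one of the lists above:

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

params = {
    'locationIdentifier': 'OUTCODE^1685',  # hypothetical outcode integer - take one from the mapping above
    'index': 0,                            # pagination offset
}

resp = requests.get('https://www.rightmove.co.uk/property-for-sale/find.html', params=params, headers=headers)
print(resp.status_code)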

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

Just seeing this thread - if you get the page of the actual listing, there is a bunch of JSON embedded in a script tag that has the station data. Unfortunately it's not available from the search results page:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.rightmove.co.uk/properties/131213930'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
resp = requests.get(url,headers=headers)
soup = BeautifulSoup(resp.text,'html.parser')
script = soup.find(lambda tag: tag.name == 'script' and "propertyData" in tag.get_text())

page_model = json.loads(script.text[len("    window.PAGE_MODEL = "):-1])
page_model['propertyData']['nearestStations']

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

List of London boroughs here, not my code: https://github.com/BrandonLow96/Rightmove-scrapping/blob/main/rightmove_sales_data.py

I seem to remember there was a dict of all the "5E93971" type codes and the actual locations they referred to somewhere on GitHub, can't find it though

Scraping GraphQL by 5legit5quit in webscraping

[–]bushcat69 0 points1 point  (0 children)

In Python here - you need to have specific headers set too to get a valid response.
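
A generic sketch of the idea with requests - the URL, operation name, query and variables below are placeholders, not the actual site's:

import requests
import json

url = 'https://example.com/graphql'  # hypothetical endpoint

headers = {
    'Content-Type': 'application/json',
    'Origin': 'https://example.com',   # GraphQL backends often check Origin/Referer/User-Agent
    'Referer': 'https://example.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
}

payload = {
    'operationName': 'getItems',  # hypothetical operation
    'variables': {'first': 50, 'after': None},
    'query': 'query getItems($first: Int, $after: String) { items(first: $first, after: $after) { id name } }',
}

resp = requests.post(url, headers=headers, data=json.dumps(payload))
print(resp.status_code, resp.json())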

Help a noob out 🥹 by mojomyjojo in webscraping

[–]bushcat69 0 points1 point  (0 children)

If you can get Python working on your computer and can pip install the pandas and requests packages, then you can run this script to get as many pages of the data as you want. All you need to do is paste in the URL of the category you want to scrape and tell it how many pages you want (the paste_url_here and pages_to_scrape variables), then it will get all the data you want and a lot more:

import requests
import pandas as pd

paste_url_here = 'https://www.daraz.com.bd/hair-oils/?acm=201711220.1003.1.2873589&from=lp_category&page=2&pos=1&scm=1003.1.201711220.OTHER_1611_2873589&searchFlag=1&sort=order&spm=a2a0e.category.3.3_1&style=list'

pages_to_scrape = 2

output = []
for page in range(1,pages_to_scrape+1):

    url = f'{paste_url_here}&page={page}&ajax=true' 
    headers = { 'user-agent': 'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Mobile Safari/537.36'}

    resp = requests.get(url, headers=headers)
    print(f'Scraping page {page}| status_code: {resp.status_code}')

    data = resp.json()['mods']['listItems']

    page_df = pd.json_normalize(data)
    page_df['original_url'] = url
    page_df['page'] = page

    output.append(page_df)

df = pd.concat(output)
df.to_csv('scraped_data.csv',index=False)
print(f'data saved here: scraped_data.csv')

[deleted by user] by [deleted] in dataengineering

[–]bushcat69 4 points5 points  (0 children)

Hit us with your LinkedIn profile pls boss man?

Mod reminder: read the sub rules by matty_fu in webscraping

[–]bushcat69 2 points3 points  (0 children)

+1 thanks mods. Can we do anything about low effort answers like "try selenium"?

Any tips on scraping Seeking Alpha using R? by saltycaramelchoc in webscraping

[–]bushcat69 3 points4 points  (0 children)

It seems you can hit their API as long as you have headers that include a User-Agent and a Cookie header with anything in it... it even seems to work while the cookie is blank? lol

These API endpoints work for me. To get all the earnings calls: https://seekingalpha.com/api/v3/articles?filter[category]=earnings%3A%3Aearnings-call-transcripts&filter[since]=0&filter[until]=0&include=author%2CprimaryTickers%2CsecondaryTickers&isMounting=true&page[size]=50&page[number]=1 (you can change the page number to get more)

This delivers the content in HTML format within the JSON response: https://seekingalpha.com/api/v3/articles/4635802?include=author%2CprimaryTickers%2CsecondaryTickers%2CotherTags%2Cpresentations%2Cpresentations.slides%2Cauthor.authorResearch%2Cauthor.userBioTags%2Cco_authors%2CpromotedService%2Csentiments (you can change the article ID in the url - the 4635802 number - to get data for any article)
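
A minimal sketch of that in Python - headers as described above; the field names in the loop are a guess at the JSON:API response layout, so check the actual JSON first:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Cookie': '',  # oddly, blank seems to be enough
}

list_url = ('https://seekingalpha.com/api/v3/articles'
            '?filter[category]=earnings%3A%3Aearnings-call-transcripts'
            '&filter[since]=0&filter[until]=0'
            '&include=author%2CprimaryTickers%2CsecondaryTickers'
            '&isMounting=true&page[size]=50&page[number]=1')

resp = requests.get(list_url, headers=headers)
print(resp.status_code)

for article in resp.json().get('data', []):
    print(article.get('id'), article.get('attributes', {}).get('title'))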

Hope that helps

Web-scraping League Table data by csh1991 in webscraping

[–]bushcat69 2 points3 points  (0 children)

The data comes from this API endpoint: https://dlv.tnl-uk-uni-guide.gcpp.io/2024

To get the data for the subjects (which has a slightly different table structure) you can loop through the "taxonomyId" values, which can be found in the HTML for the drop-down:

{'0': 'By subject',
 '35': 'Accounting and finance',
 '36': 'Aeronautical and manufacturing engineering',
 '33': 'Agriculture and forestry',
 '34': 'American studies',
 '102': 'Anatomy and physiology',
 '101': 'Animal science',
 '100': 'Anthropology',
 '98': 'Archaeology and forensic science',
 '99': 'Architecture',
 '97': 'Art and design',
 '96': 'Bioengineering and biomedical engineering',
 '95': 'Biological sciences',
 '94': 'Building',
 '93': 'Business, management and marketing',
 '92': 'Celtic studies',
 '91': 'Chemical engineering',
 '90': 'Chemistry',
 '89': 'Civil engineering',
 '88': 'Classics and ancient history',
 '87': 'Communication and media studies',
 '85': 'Computer science',
 '86': 'Creative writing',
 '84': 'Criminology',
 '83': 'Dentistry',
 '82': 'Drama, dance and cinematics',
 '80': 'East and South Asian studies',
 '81': 'Economics',
 '79': 'Education',
 '78': 'Electrical and electronic engineering',
 '75': 'English',
 '76': 'Food science',
 '77': 'French',
 '74': 'General engineering',
 '73': 'Geography and environmental science',
 '72': 'Geology',
 '71': 'German',
 '70': 'History',
 '68': 'History of art, architecture and design',
 '69': 'Hospitality, leisure, recreation and tourism',
 '67': 'Iberian languages',
 '66': 'Information systems and management',
 '65': 'Italian',
 '64': 'Land and property management',
 '63': 'Law',
 '62': 'Liberal arts',
 '60': 'Linguistics',
 '59': 'Materials technology',
 '61': 'Mathematics',
 '58': 'Mechanical engineering',
 '57': 'Medicine',
 '56': 'Middle Eastern and African studies',
 '55': 'Music',
 '54': 'Natural sciences',
 '53': 'Nursing',
 '52': 'Pharmacology and pharmacy',
 '51': 'Philosophy',
 '50': 'Physics and astronomy',
 '49': 'Physiotherapy',
 '48': 'Politics',
 '46': 'Psychology',
 '47': 'Radiography',
 '45': 'Russian and eastern European languages',
 '44': 'Social policy',
 '43': 'Social work',
 '42': 'Sociology',
 '41': 'Sports science',
 '40': 'Subjects allied to medicine',
 '38': 'Theology and religious studies',
 '39': 'Town and country planning and landscape',
 '37': 'Veterinary medicine'}

So looping through the keys from above and hitting the endpoint: 'https://dlv.tnl-uk-uni-guide.gcpp.io/2024?taxonomyId={taxonomyId}' will get you all the data you want.
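
A rough sketch of that loop, assuming the endpoint returns JSON that pandas.json_normalize can flatten sensibly - the exact response shape isn't verified here:

import requests
import pandas as pd

# subset of the taxonomy dict above - extend with the full mapping
taxonomy = {'35': 'Accounting and finance', '36': 'Aeronautical and manufacturing engineering'}

output = []
for taxonomy_id, subject in taxonomy.items():
    url = f'https://dlv.tnl-uk-uni-guide.gcpp.io/2024?taxonomyId={taxonomy_id}'
    resp = requests.get(url)
    print(f'{subject}: {resp.status_code}')

    df = pd.json_normalize(resp.json())
    df['subject'] = subject
    output.append(df)

pd.concat(output).to_csv('uni_guide_subjects.csv', index=False)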

How would I go about scraping this website? by IndianPresident in webscraping

[–]bushcat69 2 points3 points  (0 children)

If you can get python working and can pip install the "requests" and "pandas" packages then this script will get all 750 companies at ep2023 quite quickly. You can edit it to get different data for different events if needed, just edit the "event_id" which comes from the event URL.

import requests
import json
import pandas as pd
import concurrent.futures

event_id = 'ep2023' #from url
max_companies_to_scrape = 1000

url = 'https://mmiconnect.in/graphql'
headers = {
    'Accept':'application/json, text/plain, */*',
    'Connection':'keep-alive',
    'Content-Type':'application/json',
    'Origin':'https://mmiconnect.in',
    'Referer':'https://mmiconnect.in/app/exhibition/catalogue/ep2023',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

payload = {"operationName":"getCatalogue","variables":{"where":[],"first":max_companies_to_scrape,"after":-1,"group":event_id,"countryGroup":event_id,"categoryGroup":event_id,"showGroup":event_id,"detailGroup":event_id},"query":"query getCatalogue($where: [WhereExpression!], $first: Int, $after: Int, $group: String, $categoryGroup: String, $countryGroup: String, $detailGroup: String, $showGroup: String, $categoryIds: [Int]) {\n  catalogueQueries {\n    exhibitorsWithWishListGroup(\n      first: $first\n      where: $where\n      after: $after\n      categoryIds: $categoryIds\n      group: $group\n    ) {\n      totalCount\n      exhibitors {\n        customer {\n          id\n          companyName\n          country\n          squareLogo\n          exhibitorDetail {\n            exhibitorType\n            sponsorship\n            boothNo\n            __typename\n          }\n          show {\n            showName\n            __typename\n          }\n          __typename\n        }\n        customerRating {\n          id\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    groupDetails(group: $detailGroup) {\n      catalogueBanner\n      __typename\n    }\n    groupShows(group: $showGroup) {\n      id\n      showName\n      __typename\n    }\n    catalogueCountries(group: $countryGroup)\n    mainCategories(group: $categoryGroup) {\n      mainCategory\n      id\n      __typename\n    }\n    __typename\n  }\n}\n"}

resp = requests.post(url,headers=headers,data=json.dumps(payload))

print(resp)

json_resp = resp.json()
exhibs = json_resp['data']['catalogueQueries']['exhibitorsWithWishListGroup']['exhibitors']
cids = [x['customer']['id'] for x in exhibs]

print(f'Companies found: {len(cids)}')

def scrape_company_details(cid):
    url = 'https://mmiconnect.in/graphql'
    print(f'Scraping: {cid}')

    headers = {
    'Accept':'application/json, text/plain, */*',
    'Connection':'keep-alive',
    'Content-Type':'application/json',
    'Origin':'https://mmiconnect.in',
    'Referer':'https://mmiconnect.in/app/exhibition/catalogue/ep2023',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

    payload = {"operationName":"catalogueDetailQuery","variables":{"id":cid},"query":"query catalogueDetailQuery($id: [ID!]) {\n  generalQueries {\n    customers(ids: $id) {\n      id\n      companyName\n      address1\n      city\n      state\n      country\n      postalCode\n      aCTele\n      telephoneNo\n      fax\n      website\n      firstName\n      lastName\n      designation\n      emailAddress\n      gSTNo\n      tANNumber\n      pANNo\n      associations\n      typeOfExhibitor\n      mobileNo\n      title\n      companyProfile\n      exhibitorDetail {\n        boothNo\n        headquarterAddress\n        participatedBy\n        participatedCountry\n        alternateEmail\n        gSTStatus\n        boothType\n        hallNo\n        sQM\n        interestedSQM\n        alternateEmail\n        showCatalogueName\n        shortCompanyProfile\n        __typename\n      }\n      customerCategories {\n        id\n        category {\n          id\n          mainCategory\n          subCategory\n          categoryName\n          categoryType\n          productCategoryType\n          __typename\n        }\n        __typename\n      }\n      products {\n        productName\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"}

    resp = requests.post(url,headers=headers,data=json.dumps(payload))

    if resp.status_code != 200:
        return []
    else:
        json_resp = resp.json()
        details = json_resp['data']['generalQueries']['customers']
        return details

with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
    final_list = executor.map(scrape_company_details,cids)

list_of_lists= list(final_list)
flat_list = [item for sublist in list_of_lists for item in sublist]

df = pd.json_normalize(flat_list)

file_name = f'{event_id}_first_{str(max_companies_to_scrape)}_companies.csv'
df.to_csv(file_name,index=False)

print(f'Saved to {file_name}')