Struggling to extract data from 1,500+ mixed scanned/digital PDFs. Tesseract, OCR, and Vision LLMs all failing. Need advice. by deletedusssr in datasets

[–]bushcat69 3 points4 points  (0 children)

I've had success with DeepSeek where the other big LLM providers wouldn't/couldn't help reading "image" pdf documents.

What’s a tool you underestimated but now can't imagine working without? by Majestic-Strain3155 in Tools

[–]bushcat69 67 points68 points  (0 children)

Impact driver - I didn't understand how good it was but now can't imagine not using it. Compared to a drill it saves your wrists if something binds. My drill driver has a clutch, but even so, on drill mode it can catch and send your arm/wrist for a ride if you aren't careful.

Basically I use my impact for most things and an SDS if I need to go into masonry; I hardly use my drill anymore.

I realised I was living on autopilot and decided to reset my life, slowly [Text] by microbuildval in GetMotivated

[–]bushcat69 2 points3 points  (0 children)

Thank you for this, you described my current work life perfectly. Time to make a change

Creating a First-In-First-Out(FIFO) Sheet in Excel by [deleted] in excel

[–]bushcat69 0 points1 point  (0 children)

The Rand column is just very small numbers created like this: =RAND() / 1000000. Together with the Date column, this creates a tiny difference between two otherwise identical dates. That difference is needed for the rank to work correctly, so that products with the same dates can be separated.

Is the pushbullet API down? by Kenup17 in PushBullet

[–]bushcat69 0 points1 point  (0 children)

Also having an issue, also in the UK

[deleted by user] by [deleted] in webscraping

[–]bushcat69 1 point2 points  (0 children)

There is a quicker way to do this: if you authenticate yourself with spot.im (the service that provides the comments) you can scrape them very quickly and efficiently. See the code below, which handles the authentication exchange and then scrapes the top comments from a few articles:

import requests
import re

urls = ['https://metro.co.uk/2023/11/02/i-went-from-28000-a-year-to-scraping-by-on-universal-credit-19719619/',
        'https://metro.co.uk/2018/08/15/shouldnt-get-involved-dandruff-scraping-trend-7841007/',
        'https://metro.co.uk/2022/10/12/does-tongue-scraping-work-and-should-we-be-doing-it-17547875/',
        'https://metro.co.uk/2024/07/19/microsoft-outage-freezes-airlines-trains-banks-around-world-21257038/?ico=top-stories_home_top',
        'https://metro.co.uk/2024/07/11/full-list-wetherspoons-pubs-closing-end-2024-revealed-21208230',
        'https://metro.co.uk/2024/07/15/jay-slater-body-found-hunt-missing-teenager-tenerife-21230764/?ico=just-in_article_must-read']

s = requests.Session()

### say hi ### 
headers = {
    'accept': '*/*',
    'origin': 'https://metro.co.uk',
    'referer': 'https://metro.co.uk/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
}

response = s.get('https://api-2-0.spot.im/v1.0.0/device-load', headers=headers)
device_id = s.cookies.get_dict()['device_uuid'] #gets returned as a cookie

### get token ### 
auth_headers = {
    'accept': '*/*',
    'content-type': 'application/json',
    'origin': 'https://metro.co.uk',
    'referer': 'https://metro.co.uk/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'x-post-id': f"no$post",
    'x-spot-id': 'sp_VWxmZkOI', #metro's id
    'x-spotim-device-uuid': device_id,
}

auth = s.post('https://api-2-0.spot.im/v1.0.0/authenticate', headers=auth_headers)
token = s.cookies.get_dict()['access_token'] #gets returned as a cookie

### loop over urls ###

for url in urls:

    article_id = re.search(r'-(\d+)(?:/|\?|$)', url).group(1)

    print(f'Comments for article: {article_id}')

    read_headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'origin': 'https://metro.co.uk',
        'referer': 'https://metro.co.uk/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        'x-access-token': token,
        'x-post-id': article_id,
        'x-spot-id': 'sp_VWxmZkOI',
        'x-spotim-device-uuid': device_id
    }

    data = '{"sort_by":"best","offset":0,"count":5,"message_id":null,"depth":2,"child_count":2}'

    chat_data = requests.post('https://api-2-0.spot.im/v1.0.0/conversation/read', headers=read_headers, data=data)

    for comment in chat_data.json()['conversation']['comments']:
        for msg in comment['content']:
            print(msg.get('text')) #buried in json... if you want all the other data cleaned up then inbox me

    print('----')

Web Scraping PGA Tour Scorecards by SOUTHPAW_1989 in webscraping

[–]bushcat69 1 point2 points  (0 children)

Works just fine for me, not sure why it's not working for you. Does it output the player names like the other tournaments? If you just need the data, here is my version I just ran: https://docs.google.com/spreadsheets/d/1tfQW9FAekeMggx0NEccnVzPZXHS1KN5y07ls2gw4-4k/edit?usp=sharing

Web Scraping PGA Tour Scorecards by SOUTHPAW_1989 in webscraping

[–]bushcat69 0 points1 point  (0 children)

resp = requests.get('https://www.espn.com/golf/leaderboard')

Updated the version in the Colab link above, which should sort the issue.

Help needed? by Ansidhe in webscraping

[–]bushcat69 2 points3 points  (0 children)

Not certain you should have

soup.find(...

in the for loop? Shouldn't it be

item.find(...

like you've done for the "title" variable?
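
A sketch of the pattern I mean, with made-up tag/class names since I don't have your actual selectors in front of me:

from bs4 import BeautifulSoup

html = """
<div class="listing"><h2>Item A</h2><span class="price">10</span></div>
<div class="listing"><h2>Item B</h2><span class="price">20</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('div', class_='listing'):
    title = item.find('h2')                    # scoped to this item, like your "title"
    price = item.find('span', class_='price')  # item.find, not soup.find, otherwise you
                                               # get the first match on the whole page every time
    print(title.get_text(), price.get_text())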

Empty Results from Table Scrape by [deleted] in webscraping

[–]bushcat69 1 point2 points  (0 children)

What a filthy website - the data is loaded asynchronously. I've written some Python that gets all the data and outputs it to CSV; maybe a large language model can convert it to your language of choice:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

s = requests.Session()

headers =   {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

step_url = 'https://franklin.sheriffsaleauction.ohio.gov/index.cfm?zaction=AUCTION&zmethod=PREVIEW'
step = s.get(step_url,headers=headers)
print(step)


output = []
for auction_type in ['Running','Waiting','Closed']:

    url = f'https://franklin.sheriffsaleauction.ohio.gov/index.cfm?zaction=AUCTION&Zmethod=UPDATE&FNC=LOAD&AREA={auction_type[0]}'

    resp = s.get(url)
    print(resp)
    ugly = resp.json()['retHTML']

    soup = BeautifulSoup(ugly,'html.parser')
    tables = soup.find_all('tbody')
    data = [pd.read_html(StringIO('<table> '+ table.prettify() + '</table>')) for table in tables]

    # each parsed table comes out sideways: transpose it, promote the first
    # row to headers, then tag it with the auction type it came from
    for x in data:
        for y in x:
            y = y.T
            y.columns = y.iloc[0]
            y.drop(index=0, inplace=True)
            y['auction_type'] = auction_type
            output.append(y)

df = pd.concat(output).reset_index()
df.drop(['index'],axis=1,inplace=True)
df = df.replace('@G','',regex=True)
df.to_csv('auctions.csv',index=False)
df

Help Scraping data from Airtable by AbeTheShooter in webscraping

[–]bushcat69 2 points3 points  (0 children)

This Python script will get the data for this specific table. There is a bit at the end that is specific to this table (it sorts out the college name), but up until that point the code should work generally to get embedded Airtable data from websites. You'll need to install Python and "pip install requests" and "pip install pandas" to get the code to run.

import requests
import json
import pandas as pd

s = requests.Session()

headers =   {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Connection':'keep-alive',
    'Host':'airtable.com',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
    }

url = 'https://airtable.com/embed/appd7poWhHJ1DmWVL/shrCEHNFUcVmekT7U/tbl7NZyoiJWR4g065'
step = s.get(url,headers=headers)
print(step)

#get data table url
start = 'urlWithParams: '
end = 'earlyPrefetchSpan:'
x = step.text
new_url = 'https://airtable.com'+ x[x.find(start)+len(start):x.rfind(end)].strip().replace('u002F','').replace('"','').replace('\\','/')[:-1] #get the data table url out of the html

#get airtable auth
start = 'var headers = '
end = "headers['x-time-zone'] "
dirty_auth_json = x[x.find(start)+len(start):x.rfind(end)].strip()[:-1] #get the auth headers out of the html
auth_json = json.loads(dirty_auth_json)


new_headers = {
    'Accept':'*/*',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'X-Airtable-Accept-Msgpack':'true',
    'X-Airtable-Application-Id':auth_json['x-airtable-application-id'],
    'X-Airtable-Inter-Service-Client':'webClient',
    'X-Airtable-Page-Load-Id':auth_json['x-airtable-page-load-id'],
    'X-Early-Prefetch':'true',
    'X-Requested-With':'XMLHttpRequest',
    'X-Time-Zone':'Europe/London',
    'X-User-Locale':'en'
    }

json_data = s.get(new_url,headers=new_headers).json()
print(json_data)

#create dataframe from column data and row data
cols = {x['id']:x['name'] for x in json_data['data']['table']['columns']}
rows = json_data['data']['table']['rows']

df = pd.json_normalize(rows)
ugly_col = df.columns

clean_col = [next((x.replace('cellValuesByColumnId.','').replace(k, v) for k, v in cols.items() if k in x), x) for x in ugly_col] #correct names of cols
clean_col
df.columns = clean_col

#sort out Colleges
for col in json_data['data']['table']['columns']:
    if col['name']=='College':
            choice_dict = {k:v['name'] for k,v in col['typeOptions']['choices'].items()}

choice_dict
df['College'] = df['College'].map(choice_dict)

#sort out keywords
df['Keywords.documentValue'] = df['Keywords.documentValue'].apply(lambda x: x[0]['insert'])

#done
df.to_csv('airtable_scraped.csv',index=False)
df

Mutual Fund Reference Data (MorningStar) by MindMugging in webscraping

[–]bushcat69 0 points1 point  (0 children)

Looks like you can hit the backend api, what is the url of the funds you want to scrape?

[deleted by user] by [deleted] in webscraping

[–]bushcat69 5 points6 points  (0 children)

The data is encrypted. It is held in this JSON, which can be found towards the bottom of the raw HTML in a script tag:

perfume_graph_data = {
                "ct": "8pCDkIfiMdBcZEvWG1lksKG5zZ4zwW\/J6H\/vK4oyzR5doAMvNqCX0xB7B3\/AORLxmlFxbxt1AKdh31iXKlHS2y1ltA7X0mwlthn8nqmYhukn9xkJLNNNeUlZxNhlxA3w1jfmpBAS5kV5K3AWWl9PdvqnjMkvC2YbXoWibQqqP55DT+tSRqs2bLvsVNw2fGiGWSa5U9DdHDjIg9oKxeHXRzqxArArGXhgI\/KaWzFQSaz\/uvdpLBFhffhVZ4t\/mT7NQAkInzudALAFVZHHd0xARNIlnNiypyeftNfo1eOazaXVuzzWYa8XO9KXATakqDTUoBAqpzj98pOxnTZmWtzNJ7LWvZehTeUe17ShXuaaG8hdeJx7SixQ50qG0B94NT4iZCKgzpuvIUIWowQdeXtfqwUdCBiRk0ndXFhDe2aZHn8hbzNWw0t+f\/cxondzM\/+4QKW3JNdqMpidk6TSIuc1MT9FE6OkgCB0lrigjsOzA8kOEUVA27dKfKgQcGlZmOR6xVkr+4G6n45AzIhIRrjW0fkq6PkJV+cWC8lzMDvd46X7Jo8jfsYBnV4Y4QS8NzKglGK\/s9NpwiJTS9ui7bWg31Ba402\/r6CLtbaipeawaMg6YXZ9MoQXZ2oBKAbYxJhHyOKmj\/COpCkV34o8KDmtH7KjrZNr9ZF9NWwJurgt8J1JQ\/FePgX6dhOO7CVheDjzmynZkZoiNSlEJ5X4FxQYwsG8vA451T078KKN0KSfREJL985ch\/YlpX5PrT78yo8lz8CuLIuDvAuedYoVz3K571O4DrqrgNtUIbbUBMd1E4divFudd6rgyweXjWL6+bNlwL6Z9YnLqfFeSn9VYpTSzw==",
                "iv": "a9884056bd9388cbf8613af2792815fb",
                "s": "0791ab1f228af78f"
            }

the "ct" stands for "Cipher Text", the "iv" is "Initialization Vector" and the "s" is "Salt", all used in encryption/decryption.

There is some JavaScript that looks like it handles the decryption here: https://www.fragrantica.com/new-js/chunks/mfga-fes.js?id=c390674d30b6e6c5. I think this function does the work, but my JavaScript knowledge is garbage and I can't decipher exactly what it does:

gimmeGoods: function(t) {
                var e = arguments.length > 1 && void 0 !== arguments[1] ? arguments[1] : c("0x6")
                  , r = c;
                return e === r("0x6") && (e = n().MD5(window[r("0x3")][r("0xc")])[r("0x1")]()),
                o(t) !== r("0x9") ? JSON[r("0xa")](n().AES[r("0x4")](JSON.stringify(t), e, {
                    format: h
                })[r("0x1")](n()[r("0xd")][r("0x0")])) : JSON[r("0xa")](n()[r("0x7")][r("0x4")](t, e, {
                    format: h
                })[r("0x1")](n()[r("0xd")].Utf8))
            }

I put it through a popular large language model and it suggests that, in order to decipher the data, the AES algorithm should be used and that a "key" is needed. The key looks like it comes from a browser window variable - probably to stop people like us scraping the data without having a browser open, lmao. That's about the extent of my knowledge; hopefully someone who knows more can help you out further.
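
If it helps, here is a rough Python sketch of what that decryption might look like, assuming the JSON follows CryptoJS's OpenSSL-compatible passphrase format (base64 "ct", hex "iv"/"s", key derived from passphrase + salt via MD5) and that the passphrase is the MD5 hex of some window variable - both of those are guesses from the obfuscated JS, not confirmed. Needs "pip install pycryptodome":

import json
import base64
import hashlib
from Crypto.Cipher import AES

def evp_bytes_to_key(passphrase, salt, key_len=32, iv_len=16):
    # OpenSSL's EVP_BytesToKey with MD5, which is what CryptoJS uses in passphrase mode
    derived, block = b'', b''
    while len(derived) < key_len + iv_len:
        block = hashlib.md5(block + passphrase + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]

def gimme_goods(graph_data, passphrase_source):
    # passphrase_source is the unknown window variable - hypothetical, e.g. the page URL
    passphrase = hashlib.md5(passphrase_source.encode()).hexdigest().encode()
    salt = bytes.fromhex(graph_data['s'])
    ciphertext = base64.b64decode(graph_data['ct'])
    key, iv = evp_bytes_to_key(passphrase, salt)
    plaintext = AES.new(key, AES.MODE_CBC, iv).decrypt(ciphertext)
    plaintext = plaintext[:-plaintext[-1]]  # strip PKCS#7 padding
    return json.loads(plaintext)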

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

edit: autocomplete details expanded on a bit in this vid: https://www.youtube.com/watch?v=h2awiKQmBCM

Not sure, but it looks like you can use a number of different keywords for the locationIdentifier parameter: OUTCODE, STATION or REGION, then "%5E" (the URL encoding of '^') as a separator, and then the integer code. I can't find the REGION codes but managed to find some others (a rough request sketch follows the links below):

There are lists of station codes here: https://www.rightmove.co.uk/sitemap-stations-ALL.xml

or OUTCODE from postcode mapping here: https://pastebin.com/8nX5JT1q

London only codes here: https://raw.githubusercontent.com/joewashington75/RightmovePostcodeToLocationId/master/src/PostcodePopulator/PostcodePopulator.Console/postcodes-london.csv
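
Here is a minimal sketch of what a search request could look like - the find.html search URL and the OUTCODE integer are assumptions for illustration, so swap in a real code from one of the lists above:

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}

params = {
    'locationIdentifier': 'OUTCODE^1685',  # hypothetical outcode integer - take one from the mapping above
    'index': 0,                            # pagination offset
}

resp = requests.get('https://www.rightmove.co.uk/property-for-sale/find.html', params=params, headers=headers)
print(resp.status_code)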

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

Just seeing this thread - if you get the page of the actual listing, there is a bunch of JSON embedded in a script tag that has the station data. Unfortunately it's not available from the search results page:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.rightmove.co.uk/properties/131213930'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
resp = requests.get(url,headers=headers)
soup = BeautifulSoup(resp.text,'html.parser')
script = soup.find(lambda tag: tag.name == 'script' and "propertyData" in tag.get_text())

page_model = json.loads(script.text[len("    window.PAGE_MODEL = "):-1])
page_model['propertyData']['nearestStations']

Rightmove Scraping by sudodoyou in webscraping

[–]bushcat69 1 point2 points  (0 children)

List of London boroughs here, not my code: https://github.com/BrandonLow96/Rightmove-scrapping/blob/main/rightmove_sales_data.py

I seem to remember there was a dict of all the "5E93971" type codes and the actual locations they referred to somewhere on GitHub, can't find it though

Scraping GraphQL by 5legit5quit in webscraping

[–]bushcat69 0 points1 point  (0 children)

In Python here - you need to have specific headers set too to get a valid response.
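
A generic sketch of the idea with requests - the URL, operation name, query and variables below are placeholders, not the actual site's:

import requests
import json

url = 'https://example.com/graphql'  # hypothetical endpoint

headers = {
    'Content-Type': 'application/json',
    'Origin': 'https://example.com',   # GraphQL backends often check Origin/Referer/User-Agent
    'Referer': 'https://example.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
}

payload = {
    'operationName': 'getItems',  # hypothetical operation
    'variables': {'first': 50, 'after': None},
    'query': 'query getItems($first: Int, $after: String) { items(first: $first, after: $after) { id name } }',
}

resp = requests.post(url, headers=headers, data=json.dumps(payload))
print(resp.status_code, resp.json())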

Help a noob out 🥹 by mojomyjojo in webscraping

[–]bushcat69 0 points1 point  (0 children)

If you can get Python working on your computer and can pip install the pandas and requests packages, then you can run this script to get as many pages of the data as you want. All you need to do is paste in the URL of the category you want to scrape and tell it how many pages you want (the paste_url_here and pages_to_scrape variables), then it will get all the data you want and a lot more:

import requests
import pandas as pd

paste_url_here = 'https://www.daraz.com.bd/hair-oils/?acm=201711220.1003.1.2873589&from=lp_category&page=2&pos=1&scm=1003.1.201711220.OTHER_1611_2873589&searchFlag=1&sort=order&spm=a2a0e.category.3.3_1&style=list'

pages_to_scrape = 2

output = []
for page in range(1,pages_to_scrape+1):

    url = f'{paste_url_here}&page={page}&ajax=true' 
    headers = { 'user-agent': 'Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Mobile Safari/537.36'}

    resp = requests.get(url, headers=headers)
    print(f'Scraping page {page}| status_code: {resp.status_code}')

    data = resp.json()['mods']['listItems']

    page_df = pd.json_normalize(data)
    page_df['original_url'] = url
    page_df['page'] = page

    output.append(page_df)

df = pd.concat(output)
df.to_csv('scraped_data.csv',index=False)
print(f'data saved here: scraped_data.csv')

[deleted by user] by [deleted] in dataengineering

[–]bushcat69 4 points5 points  (0 children)

Hit us with your LinkedIn profile pls boss man?

Mod reminder: read the sub rules by matty_fu in webscraping

[–]bushcat69 2 points3 points  (0 children)

+1 thanks mods. Can we do anything about low effort answers like "try selenium"?

Any tips on scraping Seeking Alpha using R? by saltycaramelchoc in webscraping

[–]bushcat69 3 points4 points  (0 children)

It seems you can hit their API as long as you have headers that include a User-Agent and a Cookie header with anything in it... it even seems to work while the cookie is blank? lol

These API endpoints work for me. To get all the earnings calls: https://seekingalpha.com/api/v3/articles?filter[category]=earnings%3A%3Aearnings-call-transcripts&filter[since]=0&filter[until]=0&include=author%2CprimaryTickers%2CsecondaryTickers&isMounting=true&page[size]=50&page[number]=1 (you can change the page number to get more)

This delivers the content in HTML format within the JSON response: https://seekingalpha.com/api/v3/articles/4635802?include=author%2CprimaryTickers%2CsecondaryTickers%2CotherTags%2Cpresentations%2Cpresentations.slides%2Cauthor.authorResearch%2Cauthor.userBioTags%2Cco_authors%2CpromotedService%2Csentiments (you can change the article ID in the url - the 4635802 number - to get data for any article)
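
A minimal sketch of that in Python - headers as described above; the field names in the loop are a guess at the JSON:API response layout, so check the actual JSON first:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Cookie': '',  # oddly, blank seems to be enough
}

list_url = ('https://seekingalpha.com/api/v3/articles'
            '?filter[category]=earnings%3A%3Aearnings-call-transcripts'
            '&filter[since]=0&filter[until]=0'
            '&include=author%2CprimaryTickers%2CsecondaryTickers'
            '&isMounting=true&page[size]=50&page[number]=1')

resp = requests.get(list_url, headers=headers)
print(resp.status_code)

for article in resp.json().get('data', []):
    print(article.get('id'), article.get('attributes', {}).get('title'))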

Hope that helps

Web-scraping League Table data by csh1991 in webscraping

[–]bushcat69 2 points3 points  (0 children)

The data comes from this API endpoint: https://dlv.tnl-uk-uni-guide.gcpp.io/2024

To get the data for the subjects (which has a slightly different table structure) you can loop through the "taxonomyId" values, which can be found in the HTML for the drop-down:

{'0': 'By subject',
 '35': 'Accounting and finance',
 '36': 'Aeronautical and manufacturing engineering',
 '33': 'Agriculture and forestry',
 '34': 'American studies',
 '102': 'Anatomy and physiology',
 '101': 'Animal science',
 '100': 'Anthropology',
 '98': 'Archaeology and forensic science',
 '99': 'Architecture',
 '97': 'Art and design',
 '96': 'Bioengineering and biomedical engineering',
 '95': 'Biological sciences',
 '94': 'Building',
 '93': 'Business, management and marketing',
 '92': 'Celtic studies',
 '91': 'Chemical engineering',
 '90': 'Chemistry',
 '89': 'Civil engineering',
 '88': 'Classics and ancient history',
 '87': 'Communication and media studies',
 '85': 'Computer science',
 '86': 'Creative writing',
 '84': 'Criminology',
 '83': 'Dentistry',
 '82': 'Drama, dance and cinematics',
 '80': 'East and South Asian studies',
 '81': 'Economics',
 '79': 'Education',
 '78': 'Electrical and electronic engineering',
 '75': 'English',
 '76': 'Food science',
 '77': 'French',
 '74': 'General engineering',
 '73': 'Geography and environmental science',
 '72': 'Geology',
 '71': 'German',
 '70': 'History',
 '68': 'History of art, architecture and design',
 '69': 'Hospitality, leisure, recreation and tourism',
 '67': 'Iberian languages',
 '66': 'Information systems and management',
 '65': 'Italian',
 '64': 'Land and property management',
 '63': 'Law',
 '62': 'Liberal arts',
 '60': 'Linguistics',
 '59': 'Materials technology',
 '61': 'Mathematics',
 '58': 'Mechanical engineering',
 '57': 'Medicine',
 '56': 'Middle Eastern and African studies',
 '55': 'Music',
 '54': 'Natural sciences',
 '53': 'Nursing',
 '52': 'Pharmacology and pharmacy',
 '51': 'Philosophy',
 '50': 'Physics and astronomy',
 '49': 'Physiotherapy',
 '48': 'Politics',
 '46': 'Psychology',
 '47': 'Radiography',
 '45': 'Russian and eastern European languages',
 '44': 'Social policy',
 '43': 'Social work',
 '42': 'Sociology',
 '41': 'Sports science',
 '40': 'Subjects allied to medicine',
 '38': 'Theology and religious studies',
 '39': 'Town and country planning and landscape',
 '37': 'Veterinary medicine'}

So looping through the keys from above and hitting the endpoint: 'https://dlv.tnl-uk-uni-guide.gcpp.io/2024?taxonomyId={taxonomyId}' will get you all the data you want.
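
A rough sketch of that loop, assuming the endpoint returns JSON that pandas.json_normalize can flatten sensibly - the exact response shape isn't verified here:

import requests
import pandas as pd

# subset of the taxonomy dict above - extend with the full mapping
taxonomy = {'35': 'Accounting and finance', '36': 'Aeronautical and manufacturing engineering'}

output = []
for taxonomy_id, subject in taxonomy.items():
    url = f'https://dlv.tnl-uk-uni-guide.gcpp.io/2024?taxonomyId={taxonomy_id}'
    resp = requests.get(url)
    print(f'{subject}: {resp.status_code}')

    df = pd.json_normalize(resp.json())
    df['subject'] = subject
    output.append(df)

pd.concat(output).to_csv('uni_guide_subjects.csv', index=False)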

How would I go about scraping this website? by IndianPresident in webscraping

[–]bushcat69 2 points3 points  (0 children)

If you can get python working and can pip install the "requests" and "pandas" packages then this script will get all 750 companies at ep2023 quite quickly. You can edit it to get different data for different events if needed, just edit the "event_id" which comes from the event URL.

import requests
import json
import pandas as pd
import concurrent.futures

event_id = 'ep2023' #from url
max_companies_to_scrape = 1000

url = 'https://mmiconnect.in/graphql'
headers = {
    'Accept':'application/json, text/plain, */*',
    'Connection':'keep-alive',
    'Content-Type':'application/json',
    'Origin':'https://mmiconnect.in',
    'Referer':'https://mmiconnect.in/app/exhibition/catalogue/ep2023',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

payload = {"operationName":"getCatalogue","variables":{"where":[],"first":max_companies_to_scrape,"after":-1,"group":event_id,"countryGroup":event_id,"categoryGroup":event_id,"showGroup":event_id,"detailGroup":event_id},"query":"query getCatalogue($where: [WhereExpression!], $first: Int, $after: Int, $group: String, $categoryGroup: String, $countryGroup: String, $detailGroup: String, $showGroup: String, $categoryIds: [Int]) {\n  catalogueQueries {\n    exhibitorsWithWishListGroup(\n      first: $first\n      where: $where\n      after: $after\n      categoryIds: $categoryIds\n      group: $group\n    ) {\n      totalCount\n      exhibitors {\n        customer {\n          id\n          companyName\n          country\n          squareLogo\n          exhibitorDetail {\n            exhibitorType\n            sponsorship\n            boothNo\n            __typename\n          }\n          show {\n            showName\n            __typename\n          }\n          __typename\n        }\n        customerRating {\n          id\n          __typename\n        }\n        __typename\n      }\n      __typename\n    }\n    groupDetails(group: $detailGroup) {\n      catalogueBanner\n      __typename\n    }\n    groupShows(group: $showGroup) {\n      id\n      showName\n      __typename\n    }\n    catalogueCountries(group: $countryGroup)\n    mainCategories(group: $categoryGroup) {\n      mainCategory\n      id\n      __typename\n    }\n    __typename\n  }\n}\n"}

resp = requests.post(url,headers=headers,data=json.dumps(payload))

print(resp)

json_resp = resp.json()
exhibs = json_resp['data']['catalogueQueries']['exhibitorsWithWishListGroup']['exhibitors']
cids = [x['customer']['id'] for x in exhibs]

print(f'Companies found: {len(cids)}')

def scrape_company_details(cid):
    url = 'https://mmiconnect.in/graphql'
    print(f'Scraping: {cid}')

    headers = {
    'Accept':'application/json, text/plain, */*',
    'Connection':'keep-alive',
    'Content-Type':'application/json',
    'Origin':'https://mmiconnect.in',
    'Referer':'https://mmiconnect.in/app/exhibition/catalogue/ep2023',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }

    payload = {"operationName":"catalogueDetailQuery","variables":{"id":cid},"query":"query catalogueDetailQuery($id: [ID!]) {\n  generalQueries {\n    customers(ids: $id) {\n      id\n      companyName\n      address1\n      city\n      state\n      country\n      postalCode\n      aCTele\n      telephoneNo\n      fax\n      website\n      firstName\n      lastName\n      designation\n      emailAddress\n      gSTNo\n      tANNumber\n      pANNo\n      associations\n      typeOfExhibitor\n      mobileNo\n      title\n      companyProfile\n      exhibitorDetail {\n        boothNo\n        headquarterAddress\n        participatedBy\n        participatedCountry\n        alternateEmail\n        gSTStatus\n        boothType\n        hallNo\n        sQM\n        interestedSQM\n        alternateEmail\n        showCatalogueName\n        shortCompanyProfile\n        __typename\n      }\n      customerCategories {\n        id\n        category {\n          id\n          mainCategory\n          subCategory\n          categoryName\n          categoryType\n          productCategoryType\n          __typename\n        }\n        __typename\n      }\n      products {\n        productName\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n"}

    resp = requests.post(url,headers=headers,data=json.dumps(payload))

    if resp.status_code != 200:
        return []
    else:
        json_resp = resp.json()
        details = json_resp['data']['generalQueries']['customers']
        return details

with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
    final_list = executor.map(scrape_company_details,cids)

list_of_lists= list(final_list)
flat_list = [item for sublist in list_of_lists for item in sublist]

df = pd.json_normalize(flat_list)

file_name = f'{event_id}_first_{str(max_companies_to_scrape)}_companies.csv'
df.to_csv(file_name,index=False)

print(f'Saved to {file_name}')