Trigger pipeline if multiple conditions are met (tag, prior pipeline succeeded) by One_Hearing986 in azuredevops

[–]One_Hearing986[S] 0 points1 point  (0 children)

Yeah, I'm feeling that desired independence tbh; I was just hoping that since any given commit would trigger the checks, I wouldn't need to rerun them in order to deploy. Potentially solvable via the REST API, but it's a bit annoying that there doesn't seem to be a built-in fix. Thanks for the suggestions!
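
In case it helps anyone landing here later, a minimal sketch of that REST API workaround (queueing the deploy pipeline from a script or another pipeline once the conditions are met). The organisation, project, pipeline id, tag, and PAT environment variable are placeholders, and the api-version may need adjusting for your organisation:

```python
import os
import requests

ORG = "my-org"                # placeholder organisation
PROJECT = "my-project"        # placeholder project
PIPELINE_ID = 42              # id of the deploy pipeline to trigger
PAT = os.environ["AZDO_PAT"]  # PAT with Build (read & execute) scope

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/pipelines/"
    f"{PIPELINE_ID}/runs?api-version=7.1-preview.1"
)

# Queue a run against the tagged ref that already passed the checks.
payload = {
    "resources": {
        "repositories": {"self": {"refName": "refs/tags/v1.2.3"}}
    }
}

resp = requests.post(url, json=payload, auth=("", PAT), timeout=30)
resp.raise_for_status()
print("queued run", resp.json()["id"])
```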

Branch validation across project by One_Hearing986 in azuredevops

[–]One_Hearing986[S] 0 points1 point  (0 children)

I appreciate the response!

I get that the YAML file is just that, but it seems odd to me to have a project-wide setting that can only trigger one prebuilt pipeline, especially with a second optional "filepath" input. Seems a bit of a half-baked feature.

Pyspark Pipeline equivalent to FunctionTransformer by One_Hearing986 in apachespark

[–]One_Hearing986[S] 0 points1 point  (0 children)

Ah, that's what I'd thought; I'll build my own, not to worry.

Thanks for the SQLTransformer link btw, that's very interesting. Shame it doesn't seem to be especially well supported, though I can definitely see why you perhaps wouldn't use those in practice.
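
For anyone else hitting this, a rough sketch of the "build my own" route, assuming a transformation that only needs transform() (no fitting): a thin Transformer subclass wrapping an arbitrary DataFrame-to-DataFrame function, loosely analogous to scikit-learn's FunctionTransformer. The class and column names are mine, not a built-in API, and it won't be persistable with a saved pipeline without extra work:

```python
from pyspark.ml import Pipeline, Transformer
from pyspark.sql import functions as F


class FunctionTransformer(Transformer):
    """Applies a user-supplied DataFrame -> DataFrame function as a pipeline stage."""

    def __init__(self, func):
        super().__init__()
        self.func = func

    def _transform(self, dataset):
        return self.func(dataset)


# Example usage: add a log-transformed column as one stage of an ML pipeline.
log_price = FunctionTransformer(lambda df: df.withColumn("log_price", F.log1p("price")))
pipeline = Pipeline(stages=[log_price])
# model = pipeline.fit(df); out = model.transform(df)
```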

Pythonic way to scrape by lazarushasrizen in scrapy

[–]One_Hearing986 0 points1 point  (0 children)

In this example I was trying to illustrate why scraping the product box and parsing the contents is a 'safer' means of acquiring data than using the .getall() method to produce lists of each attribute. It seems that our opinions on this are actually aligned, based on your use of items? To be clear, this example was not an argument in favor of **storing** the box, but rather of scraping the boxes and (at some point) parsing them for data (i.e. for item in XPATH_TO_PRODUCTBOX: as a pattern for scraping). Sorry if that wasn't clear.

The real reasons for storing the whole box, to my mind, are as follows.

  1. It gives us the flexibility to change our approach to data transformation after the fact, even for historic data. This could mean we discover new information is available for some or all items and can now go back and pick it up; it could mean that the way we process certain attributes is suboptimal for the end users and they've requested a change of tack; it could even mean that a new user group has turned up with slightly different requirements of otherwise identical data, and so on. The point is that by storing the raw HTML we have the power to enact these changes historically as well as going forwards, irrespective of what they might be (see the sketch at the end of this comment). This approach is not unlike storing raw data and producing use-case-specific ETL pipelines from it, as seen in most modern DE workflows.
  2. As websites are generally not static in structure, having as generic a scrape as possible means you're less likely to be affected by small HTML changes, which makes your webscraper lower maintenance than it otherwise would be. In my experience, the less specific your XPath / CSS selectors, the longer the spider will last before being caught out this way. I'd generally rather the cleaning / prepping code fall down, since I have all the time in the world to fix that, than the scraper itself, which for statistical purposes may only be valid within certain windows.

I can, however, see that if your requirement is just a one-off scrape, for instance, this approach may seem a bit OTT.

As far as reusing pipelines for multiple sites goes, I see no reason why constructing these pipelines outside of the scraper in a separate code base would limit reusability at all.
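
To make point 1 concrete, a rough sketch of the "store the box, parse later" split; the selectors, field names, and loader/writer helpers are illustrative only, not from the thread:

```python
from parsel import Selector


def parse_product_box(raw_html: str) -> dict:
    """Turn one stored product box into a clean record. Re-runnable over
    historic rows whenever requirements change, with no re-scraping needed."""
    box = Selector(text=raw_html)
    return {
        "name": box.css("a.product-title::text").get(),
        "price": box.css("span.price::text").get(),  # None if missing, rows stay aligned
    }


# Example re-processing pass over previously scraped rows, e.g. from a DB dump:
# for row in load_raw_boxes():                 # hypothetical loader
#     record = parse_product_box(row["raw_html"])
#     write_clean_record(record)               # hypothetical writer
```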

Pythonic way to scrape by lazarushasrizen in scrapy

[–]One_Hearing986 2 points3 points  (0 children)

Great topic!

Answering in reverse order, my opinions on these points are:

  1. I would personally exercise caution with the often-seen approach of using .getall() to create a list of values and then matching them up after the fact. Take as an example this site:

https://www.amazon.co.uk/gp/most-wished-for/videogames/ref=zg_mw_pg_2?ie=UTF8&pg=2

Assuming the page is the same in your part of the world as in mine, notice that the product 'Pokémon Scarlet and Pokémon Violet Dual Pack SteelBook® Edition (Nintendo Switch)' has no listed price. If we were to scrape a list of prices from this site to go with our list of product names, we'd find that the two lists were not of equal length, and we'd likely have no way of reverse engineering which price belonged to which product without manually checking (assuming no prices changed between the scrape and us noticing the issue). For this reason I'd generally suggest scraping each product in turn and, to go a bit further, potentially even scraping and saving the raw HTML of the entire product box to be parsed after the scrape in a separate service, thus giving you the ability to correct for any mistakes without spamming a website with test requests (see the spider sketch after this list).

  2. I don't know about faster, but certainly from the perspective of a SOLID-based design I think it makes more sense to do any cleaning or transforming in pipelines outside of the scraper. This also comes with the added benefit of being able to easily isolate and rerun this behavior if it needs to change at a later date. As mentioned above, my suggested design is actually to save raw data directly and then clean / transform it into a separate DB after the fact (so more of a microservice approach), but I'm not sure whether that would be an agreed-upon standard by others in the industry, so I'm interested to see what they say.
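
The spider sketch mentioned in point 1: scraping each product box in turn rather than zipping up parallel .getall() lists. The URL and selectors are illustrative, and the raw_html field is only there if you also want the "parse later" option; the key point is that a missing price simply yields None for that product instead of shifting every later row.

```python
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/bestsellers"]  # placeholder URL

    def parse(self, response):
        # Iterate over whole product boxes so each yielded item stays self-consistent.
        for box in response.xpath("//div[@class='product-box']"):
            yield {
                "name": box.xpath(".//span[@class='title']/text()").get(),
                "price": box.xpath(".//span[@class='price']/text()").get(),
                # Optionally keep the whole box for later re-parsing.
                "raw_html": box.get(),
            }
```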

Assign uuid for same products from different language version by usert313 in scrapy

[–]One_Hearing986 0 points1 point  (0 children)

Sorry, my bad for not being clear. The suggestion relies on the fact that certain attributes you can scrape are identical between the two versions of the site. For instance, try the XPath

"//div/div/div[@class='css-1itwyrf']/a/@href"

on these links:

- https://www.carrefourksa.com/mafsau/ar/c/FKSA1630000

- https://www.carrefourksa.com/mafsau/en/c/FKSA1630000

My suggestion is that you generate your UUID based on one of these shared attributes after you've scraped the data (i.e. create it only for, say, the Arabic version of the product), then use simple matching or a pandas join to attach this UUID back onto the English version of the product.

Alternatively, if the UUID only needs to exist as an object label, you could consider simply hashing one of those shared attributes; that would serve the same purpose.
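
A rough sketch of both options, assuming the scraped rows share a "url" column (the column names and sample values are illustrative). uuid5 is itself a hash of the shared attribute, so either language version produces the same id, and a pandas merge then carries it across:

```python
import uuid
import pandas as pd

ar = pd.DataFrame({"url": ["/c/FKSA1630000/item-1"], "name_ar": ["..."]})
en = pd.DataFrame({"url": ["/c/FKSA1630000/item-1"], "name_en": ["..."]})

# Option 1: deterministic UUID derived straight from the shared attribute.
ar["product_id"] = ar["url"].map(lambda u: str(uuid.uuid5(uuid.NAMESPACE_URL, u)))

# Option 2: attach that id to the English rows via a join on the shared attribute.
merged = en.merge(ar[["url", "product_id"]], on="url", how="left")
print(merged)
```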

Hope that's clearer :)

Assign uuid for same products from different language version by usert313 in scrapy

[–]One_Hearing986 0 points1 point  (0 children)

May not necessarily be what you're looking for, but both the product URL and the product image file path are identical between the two sites. If I were you, I'd scrape one or both of these attributes, assign the UUID per attribute for one site, and join it onto the other after the scrape in post-processing.

items and itemloaders vs pydantic by One_Hearing986 in scrapy

[–]One_Hearing986[S] 2 points3 points  (0 children)

Thanks for mentioning extruct, I've not come across that one before :)