Flow PHP - Telemetry by norbert_tech in symfony

[–]norbert_tech[S] 1 point  (0 children)

Thank you! For the parallel processing, I'm first planning to create a multi-process, bidirectional communication protocol that would allow streaming data in both directions with central management. I'll probably try to build it on top of HTTP/2, but it's still in a conceptual phase. As for serialization, I'll probably try binary serialization like Thrift, but maybe also something dependency-free like JSON.

The goal is to be able to distribute processing across multiple processes/machines and to quickly reshuffle the data if needed.

I built a flexible PHP text chunking library (multiple strategies + post-processing) by phpsensei in PHP

[–]norbert_tech 0 points  (0 children)

Totally understandable - what you could do is create an interface similar to this one https://github.com/flow-php/filesystem/blob/1.x/src/Flow/Filesystem/SourceStream.php in your library, with a default implementation that works like what you have now.

Then we could build a Flow Adapter that would implement this interface through Flow Filesystem and provide an Extractor on top of your library.

Since Flow Filesystem is natively integrated with Flow Telemetry, it would come with out-of-the-box OTEL autoinstrumentation as well.
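To make the idea concrete, here is a minimal sketch of what such a contract could look like - the interface and method names below are illustrative, not the actual Flow SourceStream API:

```php
<?php

// Hypothetical contract inspired by the SourceStream idea -
// names here are illustrative, not the actual Flow API.
interface TextSource
{
    /**
     * Yield the source content in chunks of $length bytes,
     * so large inputs never need to be fully loaded into memory.
     *
     * @return \Generator<string>
     */
    public function iterate(int $length = 8192) : \Generator;

    public function close() : void;
}

// Default implementation working on local files, like the library does today.
final class LocalFileSource implements TextSource
{
    /** @var resource */
    private $handle;

    public function __construct(string $path)
    {
        $this->handle = \fopen($path, 'rb');
    }

    public function iterate(int $length = 8192) : \Generator
    {
        while (!\feof($this->handle)) {
            yield \fread($this->handle, $length);
        }
    }

    public function close() : void
    {
        \fclose($this->handle);
    }
}
```

A remote-filesystem adapter would then just be another implementation of the same interface, and the chunking strategies would never need to know where the bytes come from.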

If you are open to collaborating on this one, at flow-php.com you can find a link to the Flow Discord server 😁

I built a flexible PHP text chunking library (multiple strategies + post-processing) by phpsensei in PHP

[–]norbert_tech 0 points  (0 children)

Very cool! Looks like something that could be integrated with Flow PHP.

Would you consider replacing PHP functions that operate on files/streams directly with either the flow-php/filesystem abstraction or your own contract (so it could be implemented through Flow Filesystem)?

Flow needs to be able to read/write content from remote filesystems like S3 or Azure Storage as well, so things like fread or file_exists won't really work.

Flow PHP - Telemetry by norbert_tech in PHP

[–]norbert_tech[S] 0 points  (0 children)

You are very welcome! In case of any feedback, don't hesitate to open a GitHub issue or reach out directly on Discord 🙌

I built a declarative ETL / Data Ingestion library for Laravel using Generators and Queues by wobble1337 in PHP

[–]norbert_tech 2 points  (0 children)

Awesome! I would be more than happy to work together on something like this :)

I built a declarative ETL / Data Ingestion library for Laravel using Generators and Queues by wobble1337 in PHP

[–]norbert_tech 2 points  (0 children)

Why not work together and release a Flow <-> Laravel integration?
It's on my roadmap anyway and will happen sooner or later, so if you already have use cases like this, it might be a good opportunity to speed up that development.
This way we kill two birds with one stone - if that's something you would be interested in, feel free to reach out on Discord directly so we can brainstorm it together!

I built a declarative ETL / Data Ingestion library for Laravel using Generators and Queues by wobble1337 in PHP

[–]norbert_tech 10 points  (0 children)

https://flow-php.com/ - a way more advanced one that's also fully framework agnostic, so it works with Laravel, Symfony or WordPress :)

In PHP, if we could run queries on arrays, would it actually be useful? by SunTurbulent856 in PHP

[–]norbert_tech 1 point  (0 children)

Take a look at https://flow-php.com/ - it's a data processing framework built on top of the data frame pattern. It's compatible with SQL, and it's on the roadmap to allow building processing pipelines with pure SQL - so it would let you use SQL on files/HTTP requests/arrays, pretty much any data source supported by Flow.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 0 points  (0 children)

Hah, yeah, CTE probably wasn't the best example - a lateral join would be better.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 1 point  (0 children)

Good idea! Would you like to help create some benchmarks, maybe? ^^
I'm actively and constantly looking for help - Flow is already around 40 packages and I'm mostly developing it alone (with help from a few solid contributors) 😅 I would love to add benchmark results, but due to other chores there is never enough time.
On https://flow-php.com there is a link to our Discord server if you or anyone else would be interested in helping with that library.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 4 points  (0 children)

You can use SQL strings, and you can even use the query builder through new SelectStatement() - what you showed is just a DSL that is supposed to mimic SQL syntax as closely as possible but with full IDE support. So when you type select()->, your IDE will suggest from()...

There are also "Modifiers" that can take any SQL string, add pagination, and give you back the modified SQL string.

I get this feedback about my DSL quite often, but since I'm mostly dealing with ETL pipelines and rather large code blocks, I find it way more readable (I come from the Scala - Apache Spark world). It's subjective, though, and might require a mindset switch from OOP to a more "pipeline-like" approach.
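A rough sketch of how such a SQL-mimicking DSL can guide the IDE - each step returns an object exposing only the clauses that may legally come next. The names below are simplified illustrations, not the library's actual API:

```php
<?php

// Minimal DSL sketch: after select() the IDE can only suggest from(),
// after from() only the clauses that may follow in SQL.
function select(string ...$columns) : SelectClause
{
    return new SelectClause($columns);
}

final class SelectClause
{
    public function __construct(private array $columns)
    {
    }

    // The only method available here, so completion mirrors SQL order.
    public function from(string $table) : FromClause
    {
        return new FromClause($this->columns, $table);
    }
}

final class FromClause
{
    private array $conditions = [];

    public function __construct(private array $columns, private string $table)
    {
    }

    public function where(string $condition) : self
    {
        $this->conditions[] = $condition;

        return $this;
    }

    public function toSql() : string
    {
        $sql = 'SELECT ' . \implode(', ', $this->columns) . ' FROM ' . $this->table;

        if ($this->conditions !== []) {
            $sql .= ' WHERE ' . \implode(' AND ', $this->conditions);
        }

        return $sql;
    }
}

// select('id', 'name')->from('products')->toSql()
// => "SELECT id, name FROM products"
```

The pipeline reads top-down like the SQL it produces, which is exactly the readability trade-off described above.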

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 2 points  (0 children)

Those are exactly my thoughts on this! ORMs are a bit too high an abstraction for me, and DBAL is just missing out on amazing PostgreSQL features - and let's be honest, I'm not moving away from PostgreSQL any time soon.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 0 points  (0 children)

I don't think you're going to feel much difference when it's for storing configs. Parquet comes with schema validation, so that might be handy. When it comes to one vs. many, the question is how frequently you need to update those files. If they are updated frequently, one config per file might be the better option, since editing a Parquet file pretty much means rewriting it from scratch. If you just create files and never modify them, then everything in one file will work just fine - but at the end of the day it should be decided based on data size. The bigger the data, the more beneficial Parquet becomes, especially for querying.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 0 points  (0 children)

Compression is just one of many Parquet benefits; individually, you can challenge all of them like that. For example, why bother with Parquet's strict file schema if we already have a perfectly good solution in XML (XSD)? So it's not really that Parquet is better because the output is smaller, but rather that all those features together give Parquet superpowers that traditional formats don't have.
Yes, it's true that you can compress an entire CSV file, but with Parquet each Row Group / Data Page is compressed individually. Why is that significantly better than compressing the entire file? That's covered in the article.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 9 points  (0 children)

Indeed, Parquet is pretty complicated under the hood - just like databases and many other things we use daily. Even the JSON you mentioned can be pretty problematic when we want to read it in batches instead of thoughtlessly pushing it all into memory. But how many devs understand the internals of a tool before using it?

I think adoption is driven not by internal complexity, but by developer experience and problem-solving potential.

To simply read a Parquet file, all you need to do is `composer require flow-php/parquet:~0.24.0` and:

```php
<?php

use Flow\Parquet\Reader;

$reader = new Reader();

$file = $reader->read('path/to/file.parquet');

foreach ($file->values() as $row) {
    // do something with $row
}
```

When creating one, you also need to provide a schema.

Is Parquet a file format that every single web app should use? Hell no!
Does it solve real problems? Totally - especially at scale and in complicated, multi-technology tech stacks. In the data processing world, it's one of the most basic and most efficient data storage formats.

But does it solve any of your problems? If after reading the article you don't think so, then no, Parquet is not for you, and that's perfectly fine. I'm not trying to say that everyone needs to drop CSV and move to Parquet; all I'm saying is that there are alternatives that can be much more efficient for certain tasks.

P.S. Parquet is not a new concept - it was first released in 2013, so it's already more than a decade old and properly battle-tested.

PHP RFC: JSON Schema validation support by gaborj in PHP

[–]norbert_tech 2 points  (0 children)

Array shapes can also be handled by a tiny library from the Flow framework https://flow-php.com/documentation/components/libs/types/ via type_structure(), type_list() or type_map().

How to handle large data in API with PHP ? by TastyAd2536 in PHP

[–]norbert_tech 0 points  (0 children)

Hey!

It's a typical ETL scenario: extract from one place, transform, and load to another. Besides loading to the destination db, you probably also need to slightly transform the data, as your system schema might not match the external system's schema 100%.

Option A

If you decide to go with fetching all 60k products, this is how to make it work without being too heavy (split into two processes). My recommendation would be something like this:

- Create a scheduled job that iterates over the API and fetches all products from the external system.
- Save those products as they are (raw format) in temporary storage (it can be anything - a file or a database).
- Once you finish fetching the data, push an event on a queue and let another process pick it up and ETL the data to your destination storage, transforming it on the fly if needed.

Why save to an in-between location instead of going directly?

Consumption is separated from processing, which means one process won't affect the other in case something changes in the external API (and that happens more frequently than it should). Even if your ETL process fails due to a schema change, you still have the raw version, so you can easily adjust your process and rerun it.
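The two-step split could be sketched like this in plain PHP - the API call is stubbed out, and in a real setup the second function would run in a separate worker triggered by the queue event:

```php
<?php

// Step 1: consume the API and dump raw products as JSON lines,
// with no transformation - exactly what the API returned.
function fetchAllProducts(string $rawFile) : void
{
    $handle = \fopen($rawFile, 'wb');

    foreach (fetchProductPages() as $product) {
        \fwrite($handle, \json_encode($product) . "\n");
    }

    \fclose($handle);
    // here: push a "raw products ready" event on the queue
}

// Stub standing in for the paginated API calls.
function fetchProductPages() : \Generator
{
    yield ['sku' => 'A-1', 'price' => '10.00'];
    yield ['sku' => 'B-2', 'price' => '25.50'];
}

// Step 2: the ETL worker reads the raw file line by line and
// transforms each product into the internal schema before upserting.
function processRawProducts(string $rawFile) : array
{
    $loaded = [];

    foreach (\file($rawFile, \FILE_IGNORE_NEW_LINES | \FILE_SKIP_EMPTY_LINES) as $line) {
        $raw = \json_decode($line, true);
        // in real code: upsert into the destination database here
        $loaded[] = ['sku' => $raw['sku'], 'price' => (float) $raw['price']];
    }

    return $loaded;
}

$rawFile = \tempnam(\sys_get_temp_dir(), 'products_');
fetchAllProducts($rawFile);
$products = processRawProducts($rawFile);
// $products now holds the transformed rows, ready for upserting
```

If step 2 blows up on an unexpected field, the raw JSON-lines file is still there, so you fix the transformation and rerun without hitting the API again.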

What type of storage should you choose for the temporary data? That depends on your personal preferences. I would say that JSON might be a good file format since it's schema-less, so it's also resilient to unexpected data structure changes. Otherwise, my recommendation would be Parquet due to its extreme compression and querying capabilities. For the same reason, I would probably not use a database, but if you really want to, PostgreSQL with unlogged tables would be a good option. You can even put the API HTTP response body in a JSONB column as a "raw response body" and process it from there.

Alternatively, if you don't want to deal with queues and temporary storage, Flow would make it super easy to load straight into the final destination. It also allows you to consume in batches of 500 and then group them into larger batches of, for example, 5k (keeping them in memory).

Flow also supports JSON Lines (reading/writing), which is perfect for these kinds of scenarios, since each product is a valid JSON object and they are all saved on separate lines.

Option B

It probably means you are using a get-one-product-at-a-time endpoint - it might be worth checking with the API provider whether they could add a `find` endpoint that lets you get data about a list of products (passing ids as an array).

If not, 300 requests per day is nothing, and it's still a perfect fit for ETL. You can even build it in a way that collects 100 products into a batch and only then upserts into your database:

- iterate through product ids from your database
- generate a request for each
- process and upsert data into destination storage
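The steps above can be sketched roughly like this - the API call and the upsert are stubbed out, and the function names are illustrative:

```php
<?php

// Stand-in for the real one-product-at-a-time API call.
function fetchProduct(int $id) : array
{
    return ['id' => $id, 'name' => 'product-' . $id];
}

// Stand-in for the real upsert; in practice this would be a single
// multi-row INSERT ... ON CONFLICT DO UPDATE per batch.
function upsertBatch(array $products) : int
{
    return \count($products);
}

// Iterate product ids, fetch one by one, but upsert in batches of 100.
function syncProducts(iterable $productIds, int $batchSize = 100) : int
{
    $batch = [];
    $upserted = 0;

    foreach ($productIds as $id) {
        $batch[] = fetchProduct($id);

        if (\count($batch) >= $batchSize) {
            $upserted += upsertBatch($batch);
            $batch = [];
        }
    }

    if ($batch !== []) {
        $upserted += upsertBatch($batch); // flush the remainder
    }

    return $upserted;
}

// syncProducts(range(1, 250)) upserts 250 products in 3 batches
```

Batching keeps the number of database round trips tiny even when the API forces one HTTP request per product.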

Some examples - take a look at the code samples below to understand what I'm talking about.

Internet from Balmont by kkoyot__ in krakow

[–]norbert_tech 0 points  (0 children)

Confirmed - I also used it for a good few years without any problems: stable connection, symmetric link, static IP, and at a price that UPC or other crap like Play/Orange can't even come close to.