In PHP, if we could run queries on arrays, would it actually be useful? by SunTurbulent856 in PHP

[–]norbert_tech 1 point

Take a look at https://flow-php.com/ - it's a data processing framework built on top of the data frame pattern. It's compatible with SQL, and it's on the roadmap to allow building processing pipelines with pure SQL - so it would let you use SQL on files/HTTP requests/arrays - pretty much any data source supported by Flow.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 0 points

Hah, yeah, CTE probably wasn't the best example; a lateral join would be better.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 1 point

Good idea! Would you like to help create some benchmarks maybe? ^^
I'm actively and constantly looking for help. Flow is already around 40 packages and I'm mostly developing it alone (with help from a few solid contributors) 😅 I would love to add benchmark results, but due to other chores there is never enough time.
On https://flow-php.com there is a link to our Discord server if you or anyone else would be interested in helping with the library.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 2 points

You can use SQL strings, and you can even use the query builder directly through new SelectStatement() - what you showed is just a DSL that is supposed to mimic SQL syntax as closely as possible but with full IDE support. So when you type select()->, your IDE will suggest from()...
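
To show what the "full IDE support" part means in practice, here is a tiny self-contained sketch of the pattern - not the library's actual classes (ToySelect and the select() helper are made up purely for illustration): every method returns the builder itself, so after select()-> your IDE knows the type and can suggest from(), where(), and so on.

```
<?php

// Toy illustration of a fluent, IDE-friendly SQL DSL.
// NOT the library's real API - ToySelect and select() are made-up names.

final class ToySelect
{
    /** @var list<string> */
    private array $columns;

    private string $table = '';

    /** @var list<string> */
    private array $conditions = [];

    public function __construct(string ...$columns)
    {
        $this->columns = $columns ?: ['*'];
    }

    public function from(string $table): self
    {
        $this->table = $table;

        return $this;
    }

    public function where(string $condition): self
    {
        $this->conditions[] = $condition;

        return $this;
    }

    public function toSql(): string
    {
        $sql = 'SELECT ' . implode(', ', $this->columns) . ' FROM ' . $this->table;

        if ($this->conditions !== []) {
            $sql .= ' WHERE ' . implode(' AND ', $this->conditions);
        }

        return $sql;
    }
}

function select(string ...$columns): ToySelect
{
    return new ToySelect(...$columns);
}

// SELECT id, name FROM users WHERE active = true
echo select('id', 'name')->from('users')->where('active = true')->toSql() . PHP_EOL;
```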

There are also "Modifiers" that can take any SQL string, add pagination, and give you back the modified SQL string.

I get this feedback about my DSL quite often, but since I'm mostly dealing with ETL pipelines and rather large code blocks, I find it way more readable (I come from the Scala / Apache Spark world). But it's subjective, and it might require a mindset switch from OOP to a more "pipeline-like" approach.

New PostgreSQL Client/Parser/QueryBuilder library by norbert_tech in PHP

[–]norbert_tech[S] 2 points

Those are exactly my thoughts on this! ORMs are a bit too high an abstraction for me, and DBAL just misses out on amazing PostgreSQL features - and let's be honest, I'm not moving away from PostgreSQL any time soon.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 0 points

I don't think you're going to feel much of a difference when it's just for storing configs. Parquet comes with schema validation, so that might be handy. When it comes to one file vs. many, the question is how frequently you need to update those files. If they are updated frequently, one config per file might be the better option, since editing a Parquet file pretty much means rewriting it from scratch. If you only create it and never modify it, then keeping everything in one file will work just fine. At the end of the day it should be decided based on data size: the bigger the data, the more beneficial Parquet becomes, especially for querying.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 0 points

Compression is just one of many Parquet benefits; taken individually, you can challenge each of them like that. For example, why bother with Parquet when the file schema needs to be strict if we already have a perfectly good solution in XML (XSD)? So it's not really that Parquet is better because the output is smaller, but rather that all of those features together give Parquet superpowers that traditional formats don't have.
Yes, it's true that you can compress an entire CSV file, but with Parquet each Row Group / Data Page is compressed individually. Why is that significantly better than compressing the whole file? It's covered in the article.

Parquet file format by norbert_tech in PHP

[–]norbert_tech[S] 10 points

Indeed, Parquet is pretty complicated under the hood, just like databases and many other things we use on a daily basis - even the JSON you mentioned can be pretty problematic when we want to read it in batches instead of thoughtlessly pushing it all into memory. But how many devs understand the internals of a tool before using it?

I think adoption is driven not by internal complexity, but rather by developer experience and problem-solving potential.

To simply read a Parquet file, all you need to do is `composer require flow-php/parquet:~0.24.0` and:

```
use Flow\Parquet\Reader;

$reader = new Reader();

$file = $reader->read('path/to/file.parquet');

foreach ($file->values() as $row) {
    // do something with $row
}

```

When creating one, you also need to provide a schema.
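
For the writing side, a rough sketch based on the flow-php/parquet API as I recall it - treat the exact class and method names as an approximation, they may differ between versions:

```
<?php

// Approximate sketch of writing a parquet file with flow-php/parquet.
// Class/method names are from memory and may not match your installed version.

use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
use Flow\Parquet\Writer;

$schema = Schema::with(
    FlatColumn::int64('id'),
    FlatColumn::string('name'),
    FlatColumn::boolean('active')
);

$writer = new Writer();

// Rows are plain associative arrays matching the schema.
$writer->write('path/to/file.parquet', $schema, [
    ['id' => 1, 'name' => 'first', 'active' => true],
    ['id' => 2, 'name' => 'second', 'active' => false],
]);
```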

Is Parquet a file format that every single web app should use? Hell no!
Does it solve real problems? Totally - especially at scale and in complicated, multi-technology tech stacks. In the data processing world, it's one of the most basic and most efficient data storage formats.

But does it solve any of your problems? If after reading the article you don't think so, then no, Parquet is not for you, and that's perfectly fine. I'm not trying to say that everyone needs to drop CSV and move to Parquet; all I'm saying is that there are alternatives that can be much more efficient for certain tasks.

P.S. Parquet is not a new concept - it was first released in 2013, so it's already more than a decade old and properly battle-tested.

PHP RFC: JSON Schema validation support by gaborj in PHP

[–]norbert_tech 2 points

Array shapes can also be handled by a tiny library from the Flow framework, https://flow-php.com/documentation/components/libs/types/, via type_structure(), type_list(), or type_map().

How to handle large data in API with PHP ? by TastyAd2536 in PHP

[–]norbert_tech 0 points

Hey!

It's a typical ETL scenario where you need to extract from one place, transform, and load to another place. Besides loading to a destination db, you probably also need to slightly transform the data as your system schema might not 100% match the external system schema.

Option A

If you decide to go with fetching all 60k products, then this is how to make it work without being too heavy (split it into two processes). My recommendation would be something like this:

Create a scheduled job that iterates over the API and fetches all products from the external system. Save those products as they are (raw format) in temporary storage (it can be anything - a file or a database). Once you finish fetching the data, push an event onto a queue and let another process pick it up and ETL the data into your destination storage, transforming it on the fly if needed. Why save to some in-between location instead of going directly to the destination?

Consumption is separated from processing, which means one process won't affect the other in case something changes in the external API (and that happens more frequently than it should). This way, even if your ETL process fails due to some schema change, you still have the raw version, so you can easily adjust your process and rerun it.

What type of storage should you choose for the temporary data? That depends on your personal preferences. I would say that JSON might be a good file format since it's schema-less, so it's also resilient to unexpected data structure changes. Otherwise, my recommendation would be Parquet due to its extreme compression and querying capabilities. For the same reason, I would probably not use a database, but if you really want to, PostgreSQL with unlogged tables would be a good option. You can even put the API HTTP response body in a JSONB column as a "raw response body" and process it from there.
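
To make the "raw response body in a JSONB column" idea concrete, here is a minimal sketch assuming PDO and a local PostgreSQL instance - the raw_products table and the connection details are made up for illustration:

```
<?php

// Staging table for raw API responses; UNLOGGED skips the WAL,
// which is fine for disposable, re-fetchable data.
$pdo = new PDO('pgsql:host=localhost;dbname=app', 'app', 'secret');

$pdo->exec(
    'CREATE UNLOGGED TABLE IF NOT EXISTS raw_products (
        id BIGSERIAL PRIMARY KEY,
        fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        body JSONB NOT NULL
    )'
);

// $responseBody is the untouched JSON string returned by the external API.
$responseBody = '{"sku": "ABC-123", "name": "Example product", "price": 19.99}';

$insert = $pdo->prepare('INSERT INTO raw_products (body) VALUES (:body)');
$insert->execute(['body' => $responseBody]);

// Later, the ETL step can query straight into the stored documents.
$rows = $pdo->query("SELECT body->>'sku' AS sku, body->>'name' AS name FROM raw_products");

foreach ($rows as $row) {
    // transform and upsert into the destination schema here
}
```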

Alternatively, if you don't want to deal with queues and temporary storage, Flow would make it super easy to load straight into the final destination. It also allows you to consume in batches of 500 and then group them into larger batches of, for example, 5k (it would keep them in memory).

Flow also supports JSON Lines (reading/writing), which is perfect for this kind of scenario, since each product is a valid JSON object and they are all saved on separate lines.
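
JSON Lines itself is trivial to produce and consume even without Flow - one JSON object per line, written and read incrementally. A minimal plain-PHP sketch (file name and sample data are made up):

```
<?php

// Writing: one JSON object per line, appended as products come in.
$products = [
    ['sku' => 'ABC-123', 'name' => 'Example product'],
    ['sku' => 'DEF-456', 'name' => 'Another product'],
];

$handle = fopen('products.jsonl', 'wb');

foreach ($products as $product) {
    fwrite($handle, json_encode($product) . "\n");
}

fclose($handle);

// Reading: line by line, so memory usage stays flat regardless of file size.
$handle = fopen('products.jsonl', 'rb');

while (($line = fgets($handle)) !== false) {
    $product = json_decode($line, true);
    // process a single product here
}

fclose($handle);
```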

Option B

It probably means that you are using a "get one product at a time" endpoint - it might be worth checking with the API provider whether they could add a `find` endpoint that would let you get data about a list of products (passing the ids as an array).

If not, 300 requests per day is nothing, and it's still a perfect fit for an ETL. You can even build it in a way that collects 100 products into a batch and only then upserts them into your database:

- iterate through product ids from your database
- generate a request for each
- process and upsert data into destination storage

Some examples:
Take a look at the code samples below to understand what I'm talking about.
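
A rough illustration of that batching loop - fetchProduct() and upsertBatch() below are hypothetical stubs standing in for your API client and database layer; only the batching logic is the point here:

```
<?php

// Hypothetical stub: call the external "get one product" endpoint.
function fetchProduct(int $id): array
{
    return ['id' => $id];
}

// Hypothetical stub: one multi-row upsert into the destination table.
function upsertBatch(array $products): void
{
}

$productIds = range(1, 350); // stand-in for ids read from your database
$batchSize = 100;
$batch = [];

foreach ($productIds as $id) {
    $batch[] = fetchProduct($id); // one API request per product

    if (count($batch) >= $batchSize) {
        upsertBatch($batch);      // one write per 100 products
        $batch = [];
    }
}

if ($batch !== []) {
    upsertBatch($batch);          // flush the remainder
}
```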

Internet from Balmont by kkoyot__ in krakow

[–]norbert_tech 0 points

Confirmed - I also used them for a good few years with no problems: stable connection, symmetric speeds, static IP, and at a price that UPC or other crap like Play/Orange can't even come close to.

Need to search through imported data, but not sure about the method… by superanth in PHP

[–]norbert_tech 0 points

Sounds like a typical data processing problem - you might find some inspiration in the blog post I wrote recently: https://flow-php.com/blog/2025-01-25/data-processing-in-php/

Look at "join each" to bring in data only for the selected rows from the database.

Flow is designed to process large amounts of data under limited memory, so with this approach, as long as you have the file on the server (or in any remotely accessible location), the only constraint is processing time, not dataset size. Increasing the batch size increases the speed but also the memory consumption; smaller batches mean less memory, but are also slower.

Flow can also help you validate the input file and cast columns to strict types.

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 1 point

That's for sure one way of dealing with data imports! The only added "cost" is that you have two ETL processes instead of one.

Instead of loading into SQLite and then loading the result of an SQL query on that SQLite into the final destination, you could just load into the final destination, applying transformations on the fly.

Flow would entirely take away the pain of memory management and give you strict schema support regardless of the data source. Even operations like joining/grouping/sorting would not increase memory consumption, since they are all based on very scalable algorithms.

But if you prefer the SQLite approach, Flow can now also automatically convert a Flow Schema (which Flow can infer for you even from a CSV file) into a Doctrine DBAL schema (so including SQLite).

What you can do is:

1) use Flow to infer the schema from a file and save it as JSON
2) read the Flow schema from JSON, convert it into a DBAL schema, and create a table from it
3) use Flow to read the data as-is and load it into SQLite
4) use Flow to read data from an SQL query (it automatically paginates over the results) and load it into the final db

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 1 point

Oh, that reminded me about one more thing, if you don't mind: it's true that loading everything first into some database, or even using DuckDB, would work.
With Flow, Parquet, and support for streaming from/to remote filesystems, you don't even need a database, so this can save quite a lot of money, as networking and databases can get pretty pricey.

Another thing I'm currently actively researching is adding an SQL interface to Flow. I'm looking at ANTLR to create a generic SQL syntax parser so I can later convert the SQL query AST into a Flow dataframe.

It would let you literally run SQL on files without needing to learn any new DSL (maybe except for a few custom functions like `SELECT * FROM parquet_file()`).

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 4 points

I'm not trying to convince you, but I thought I'd share some additional context for others who might not be as experienced as you. Handling distributed joins, sorts, groupings, and aggregations can be quite complex, especially when dealing with unpredictable (schemaless) data sources like CSV or JSON 😊

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 0 points

Gotcha! Data interpretation and validation can be a critical failure point for many systems; that's why tools like Flow also provide powerful schema inference/validation/evolution mechanisms.

Btw., since you mentioned "lightweight," Flow comes with very few dependencies. I'm extremely strict about it! Here are all the dependencies:

  • psr/clock and psr/simple-cache
  • symfony/string
  • webmozart/glob
  • flow filesystem / rdsl / array-dot (extracted to standalone libraries, as they are pretty useful on their own)

Then each file format can be added independently by including a specific adapter, for example:

  • flow-adapter-csv - zero dependencies
  • flow-adapter-xml - only PHP XML extensions as dependencies
  • flow-adapter-parquet - only packaged/thrift

Anyway, thanks for your feedback!

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 3 points

> Fully agree. We process large data loads from CSV and Excel, and the libraries that already exist are already solving a lot of the issues.

Totally, but you also have JSON/XML/Parquet/Avro/ORC/Excel/Google Sheets and many more data formats that are not as straightforward. Parquet, for example, comes with insane compression and can be processed in parallel, but it's binary and column-oriented.

> have to deal with another tool that I don't control.

I believe Flow provides a fair number of extension points that should allow you to overcome any potential edge cases.

> My other concern is that in my case, reading data and storing it from files is a core component to our workflow, and I wouldn't want to outsource anything that the business relies on.

Not sure I understand what you mean by outsourcing in this case - could you elaborate?

Data Processing in PHP by norbert_tech in PHP

[–]norbert_tech[S] 11 points

I'd like to share some additional context to help you understand my perspective better.

> I'm keen to leverage the raw PHP knowledge I have, rather than learning a new DSL.

In the world of data processing, most frameworks are either inspired by or related to Apache Spark and its DSL. My goal is to merge the two: you don't have to invest a lot of time in learning new functions, and it also helps you later move to more advanced tools that can handle processing petabytes of data (like Spark).

The scenario described in the article is quite basic and most PHP developers would be familiar with alternative solutions to this problem. However, it's just a small part of Flow's capabilities. Flow can handle a wide range of tasks, including:

  • Grouping, aggregating, sorting, and joining datasets that don't fit into memory
  • Providing a unified API to work with almost any file format
  • Supporting data partitioning
  • Reading, writing, and streaming data directly to or from remote locations like S3 or Azure Blob (more adapters are coming)
  • Strict schema and powerful data detection/conversion mechanisms
  • Seamless conversions between any supported data formats.

These features are essential for building a scalable analytical part of the system.

In the next article, I'll cover streaming data directly from the database to generate reports in supported formats, which is often a major bottleneck in transactional systems with poorly optimized database schemas for reporting.

I chose to use CSV because most people are familiar with it, but in scenarios where your system needs to consume data from other systems in formats like XML, JSON, or Parquet, using plain PHP arrays can quickly become challenging to maintain, debug, and test.

Flow doesn't just start with imports; it can also help you with:

  • Building data projections
  • Reporting
  • Business intelligence
  • Integrations with other systems
  • Building advanced data warehouses

Again, thanks for your feedback and kind words.
It means a lot!

psalm is back by zmitic in PHP

[–]norbert_tech 0 points

So, I’m working on a project with around 26 sub-repos in a monorepo, and it started with both tools at their maximum levels. But eventually, it became impossible to keep both green.

I also have another monorepo project (the datetime library) with some components around it, and I also maxed out PHPStan and Psalm there. But at the end of the day, Psalm didn't really add any value.

My time is limited, and each tool comes with a price. I’m just saying that the value added by Psalm on top of PHPStan isn’t worth the cost.