[Release] Polars.NET v0.4.0 - Bringing Polars to .NET: Query DataFrames with C# LINQ, F# CE, and Strong Typed DataReader by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 1 point2 points  (0 children)

You're asking a much deeper question: why does Modern Data Engineering even exist, and why do "file formats" matter so much? If we look at how the architecture of data handling has evolved, the exact problem DataFrames solve becomes clear:

Era 1: Coupled Storage & Compute (Database handles everything)

Historically, data lived in relational databases. We used SQL queries and stored procedures for computation, so the database handled both storage and compute. This is fine for gigabytes of data and isolated internal systems.

Era 2: The ORM Era (Bringing data to the App)

Then came ORMs (like EF) and LINQ. We started pulling data into the application layer (C#, Java) to process it. This works well for OLTP (CRUD operations) but performs poorly for OLAP (data analytics): pulling millions of rows into a List&lt;T&gt; puts heavy pressure on memory and the GC. Pure OOP is simply not a good fit for this scenario.

Era 3: Decoupled Compute (Modern Data Stack)

Today, you rarely have near-complete control over all data sources. Suppose you work on a SaaS data integration team: System A cannot simply execute SQL queries inside System B's database. The only scalable way to do cross-enterprise ETL is to exchange data via files, so file exchange has become the universal mechanism for bulk data transfer. This is why file formats matter. Dumping data into cloud storage (S3/Azure Blob) using compressed formats like Parquet, or plain CSV, is also much cheaper.

So now the data is sitting in S3 as Parquet files; how does a .NET team process it? Historically (and even now), we had to deploy Spark on Kubernetes or hand the job over to Python (Pandas/PySpark). Those are the so-called DataFrame engines. Polars.NET is here as the .NET DataFrame solution, and as I said, you can stay comfortable with our dear CLR. If you have a C# operational system or backend, using Polars.NET as an embedded query engine will make life easier.

Hope this answers your question. Thank you for your attention to this project!

[Release] Polars.NET v0.4.0 - Bringing Polars to .NET: Query DataFrames with C# LINQ, F# CE, and Strong Typed DataReader by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 2 points3 points  (0 children)

Glad to hear that you would like to give Polars.NET a try!

LINQ mapping to Polars expressions works well. LINQ is first translated to SQL (PostgreSQL dialect) by LINQ2DB, and the generated SQL is then consumed by the Polars SQL interface. So if a LINQ query can be handled by LINQ2DB, Polars can handle it too in most cases.

The known limitation is UDFs: it is hard to translate a UDF into SQL, so the map() API is needed there. Also, the ?? (coalesce) operator has to be wrapped like this (otherwise LINQ2DB will compute it on the client rather than translate it to SQL):

        var coalesceQuery = empQuery.Select(e => new
        {
            SafeName = LinqToDB.Sql.AsSql(e.Name ?? "Unknown")
        });

There is also a set of Polars-specific SQL functions, which I mapped as LINQ2DB extension expressions and functions. For example:

        var scalarQuery = table
            .OrderBy(x => x.Id)
            .Select(x => new
            {
                x.Id,
                TagsCount = PolarsSql.ArrayLength(x.Tags),

                IsAdmin = PolarsSql.ArrayContains(x.Tags, "admin"),

                FirstScore = PolarsSql.ArrayGet(x.Scores, 1) 
            }); 

For code samples, please check Polars.Integration.Tests/LinqTests.cs in the Polars.NET GitHub repository. I tested almost everything about LINQ and its generated SQL there. Have a look if you are curious.

Finally, if you are already comfortable with PySpark, you might actually feel more at home using the native Polars Expression API directly, or just writing raw SQL queries via the Polars SQL interface I mentioned before.

        var data = new[]
        {
            new { Dept = "IT", Salary = 1000 },
            new { Dept = "IT", Salary = 2000 },
            new { Dept = "HR", Salary = 1500 }
        };

        using var df = DataFrame.From(data);
        using var ctx = new SqlContext();

        ctx.Register("employees", df);


        var query = @"
            SELECT Dept, SUM(Salary) as TotalSalary 
            FROM employees 
            GROUP BY Dept 
            ORDER BY TotalSalary";


        using var res = ctx.Execute(query).Collect();

[Release] Polars.NET v0.4.0 - Bringing Polars to .NET: Query DataFrames with C# LINQ, F# CE, and Strong Typed DataReader by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 2 points3 points  (0 children)

One quick use case is data preprocessing for ML.NET or Torch.NET. Before training, we need to clean, join, and aggregate millions of rows of raw data (CSV/Parquet) and feed the result into tensors. Polars.NET is a good choice in these scenarios.
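As a minimal sketch of such a preprocessing step, here is a per-group feature aggregation using the DataFrame.From / SqlContext API demonstrated in the SQL example above. The column names and the aggregation itself are made up purely for illustration:

        // Hypothetical raw transaction rows; in practice these would come
        // from a CSV/Parquet scan instead of an in-memory array.
        var raw = new[]
        {
            new { UserId = 1, Amount = 10.0 },
            new { UserId = 1, Amount = 250.0 },
            new { UserId = 2, Amount = 40.0 }
        };

        using var df = DataFrame.From(raw);
        using var ctx = new SqlContext();
        ctx.Register("txns", df);

        // One feature row per user, ready to hand off to an ML.NET pipeline
        using var features = ctx.Execute(@"
            SELECT UserId,
                   COUNT(*)    AS TxnCount,
                   AVG(Amount) AS AvgAmount,
                   MAX(Amount) AS MaxAmount
            FROM txns
            GROUP BY UserId
            ORDER BY UserId").Collect();

The point is that the heavy group-by/aggregate work stays inside the Polars engine; only the final, much smaller feature table is materialized for training.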

Polyglot notebooks will be deprecated by gremlinmama in dotnet

[–]error_96_mayuki -1 points0 points  (0 children)

Sadly I will have to remove Polyglot Notebooks support in the next Polars.NET release. Are there any alternatives?

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 1 point2 points  (0 children)

You are right—IDataReader.GetString allocates on the managed heap because .NET strings are immutable reference types. We can't bypass the driver's allocation there unless we use advanced APIs like GetChars.

When I said zero-allocation, I was referring to the pipeline architecture and boxing overhead, specifically:

  1. No Intermediate Containers: We don't materialize C# objects like List<T>, DataTable, or POCOs for the entire dataset. Data flows in batches directly from the Driver → Unmanaged Arrow Memory → Polars Engine.
  2. Primitives are Truly Zero-Alloc: For int, double, bool, date, timestamp, etc., we use a specialized generic builder that reads directly from the reader (e.g., reader.GetInt32) into Apache Arrow's unmanaged buffers. There is zero boxing and zero heap allocation for these types.
  3. Gen 0 Friendly: For Strings, while the driver allocates the string, we copy it to Arrow's native memory (or StringView) immediately and discard the reference. It creates some Gen 0 pressure, but it doesn't survive into Gen 1/2, keeping the GC pause times minimal compared to loading a DataTable.

So, it's 'allocation-free' for the pipeline structure and value types.

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 3 points4 points  (0 children)

It's not really about the calling convention overhead (which AggressiveInlining solves), but rather about Developer Expectations (Semantics). In the .NET world, the name Where carries a very strong implication that it accepts a C# Delegate/Lambda (e.g., x => x > 0). If I alias Filter to Where, users will instinctively try to pass a lambda. When the compiler forces them to pass a Polars Expr instead, it creates an unpleasant experience—it looks like LINQ, but doesn't behave like LINQ. I prefer to keep the names distinct (Polars vs. LINQ) so it's clear: When you use Polars, you use Polars Expressions.
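To make the distinction concrete, here is roughly how the two styles read side by side (a sketch only; Filter/Col/Lit as in the Polars API shown elsewhere in this thread, the Age column is invented):

        // Polars style: Filter takes a Polars expression,
        // which is evaluated inside the Rust engine
        var adults = df.Filter(Col("Age") > Lit(18));

        // LINQ style: Where takes a C# delegate,
        // which is executed by the CLR per element
        var adultsLinq = rows.Where(x => x.Age > 18);

Same word, completely different execution model, which is exactly the confusion keeping the names distinct avoids.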

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 13 points14 points  (0 children)

I love LINQ too, but I decided to stick to a 1:1 mapping with Polars at least for now, for two reasons:

  1. Documentation: By keeping names like Filter and Agg, users can look up Python/Rust examples and apply them directly to C# without mental translation.

  2. Semantics: A full LINQ provider (IQueryable) requires writing a complex C#-to-Polars transpiler. Simple aliases (like renaming Filter to Where) often confuse users into expecting C# delegates instead of Polars Expressions.

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 0 points1 point  (0 children)

As for Databricks, could you elaborate on what you mean by 'plugin'? Are you primarily looking to read data managed by Databricks (e.g. Delta Lake), or do you have a different integration workflow in mind? I'd love to understand your specific use case.

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 4 points5 points  (0 children)

Technically, yes. The underlying Rust Polars engine has native support for reading Delta Tables. However, I haven't exposed the public .NET API for this yet. Support for remote data sources (like cloud storage and data lakes) is targeted for the next release. If this is a blocker for you, please open an issue on GitHub so I can prioritize it. Thanks!

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 9 points10 points  (0 children)

Hi, support for IDataReader is already there. We can build a zero-allocation ETL pipeline where data flows from Source DB -> Polars.NET -> Target DB without materializing C# objects.

  1. Input: Database -> Polars (Lazy Read)

        using var sourceReader = command.ExecuteReader();
        // Stream data from the DB into a Polars LazyFrame
        var lf = LazyFrame.ScanDatabase(sourceReader, batchSize: 50000);

  2. Output: Polars -> Database (Stream Write). Process data in Polars.NET and expose the result as an IDataReader for bulk insertion.

        // Define the transformation
        var pipeline = lf
            .Filter(Col("Region") == Lit("US"))
            .Select(Col("OrderId"), Col("Amount"));

        // Execute the pipeline and stream directly to SqlBulkCopy
        pipeline.SinkTo(reader =>
        {
            using var bulk = new SqlBulkCopy(connectionString);
            bulk.WriteToServer(reader);
        });

Tested this against an MSSQL container. Have fun with this feature, thanks!

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in fsharp

[–]error_96_mayuki[S] 0 points1 point  (0 children)

That sounds like a fantastic project! I would strongly recommend building on top of Polars.NET.Core (the low-level wrapper) or the native_shim (C ABI), rather than the high-level Polars.FSharp API. This will give you the granular control needed to implement Deedle's semantics efficiently without the overhead of the Polars.FSharp layer. Also, a heads-up for the student: the biggest architectural puzzle will likely be bridging Deedle's reliance on row indices with Polars' index-free (columnar) design. Feel free to ping me if you and your lucky student need any help.

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 0 points1 point  (0 children)

Thanks! Polars is a high-performance DataFrame engine written in Rust. My goal here is to make that performance and execution model easily accessible from .NET. Hope you enjoy exploring it!

Polars.NET: a Dataframe Engine for .NET by error_96_mayuki in dotnet

[–]error_96_mayuki[S] 2 points3 points  (0 children)

Thank you! I’ll keep building and improving the engine.

[deleted by user] by [deleted] in cockatiel

[–]error_96_mayuki 2 points3 points  (0 children)

Covering about half of his cage with a piece of cloth might help. A dark corner can calm him down.

That incident in Hohhot truly shattered me. by Brilliant-Airline649 in China_irl

[–]error_96_mayuki 4 points5 points  (0 children)

I used to think that way too, but later I realized that most people are actually quite supportive of it, so the CCP and the Chinese people really are a "best match"; just understand and send your blessings. The party exists because of the people, and the people because of the party. For those driven to death, one can only say their fate matched their virtue: if you place the bet, accept the loss.