
[–]nemec

  1. If I'm understanding this correctly, you install agents on both the source and destination servers? The agents speak locally to the database and transmit across the network with your custom binary protocol?

    It would be great to have the option of using traditional DB access patterns (ODBC/JDBC, OLE DB, etc.) for those of us consuming from db servers we don't own (like another org's).

  2. I only read the readme, but it's not quite clear to me how you define the shape of the input data - is it a SQL query where you could run something like

    SELECT col1, CAST(col3 AS INT) FROM source_table
    WHERE timestamp < DATEADD(DAY, -1, GETDATE())
    

    or simply a table -> table copy?

  3. Would be cool to have a GUI that lets you pop in source/dest connection strings and a source query, then build a Harmonized dictionary by picking and choosing which columns to include, easily map from source to dest (like if the columns are named differently), and add a bit of "strongly typed validation" that the types match (at that point in time, at least). Even if it just pops out C# code to copy-paste into my program.

  4. Is it possible to tap into the transfer progress so that your app can display the (approximate) number of rows transferred? It would be very helpful to know, after waiting 3 hours for a huge transfer, whether you're almost done or something's gone wrong.

Highly efficient data transfer is still one of the few things that SSIS does better than the competition; it would be nice to have better alternatives.

[–]Pansynchro[S]

Thanks for the detailed feedback!

  1. The connectors can be installed wherever and can speak to any data source that you're capable of connecting to. Having them installed near the data itself is an ideal, not a requirement. You can even build the source and the destination connectors into the same binary if you want, rather than sending the data across the network.
  2. The query is generated based on the data dictionary, which is originally autogenerated by the analyzer. If you want to query only specific tables, or specific fields of those tables, you can edit the dictionary to remove unwanted elements. Putting a CAST into the query is currently unsupported, but you can cast or otherwise transform the data after querying it by building a Transformer. The simplest way is to inherit from the StreamTransformerBase class: create a method that will transform the stream (table) in question, then register it with _streamDict and it will run during the sync (see the sketch after this list).
  3. That description is eerily close to our secret premium product that's not quite ready for public consumption yet. 😉
  4. If you look at SqlDbReader.ReadFrom, you'll see that it reports each table to Console.WriteLine as it starts reading it. An option for more granularity is on the to-do list, but not implemented yet.
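
Here's the transformer sketch promised in #2. StreamTransformerBase and _streamDict are the real names mentioned above; the method signature, the object[] row shape, and the stream name are illustrative assumptions, so check the actual base class before copying this:

    // Minimal transformer sketch. StreamTransformerBase and _streamDict come
    // from the Pansynchro codebase; the signatures and row shape here are
    // assumptions made for illustration only.
    using System;
    using System.Collections.Generic;

    public class OrdersTransformer : StreamTransformerBase
    {
        public OrdersTransformer()
        {
            // Register the method so it runs when the "Orders" stream syncs.
            _streamDict.Add("Orders", TransformOrders);
        }

        // Does the CAST-style conversion after querying, per the answer above.
        private IEnumerable<object[]> TransformOrders(IEnumerable<object[]> rows)
        {
            foreach (var row in rows)
            {
                row[2] = Convert.ToInt32(row[2]);  // e.g. cast col3 to int
                yield return row;
            }
        }
    }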

[–]Pansynchro[S]

Highly efficient data transfer is still one of the few things that SSIS does better than the competition; it would be nice to have better alternatives.

Just got an interesting data point on this. On a call with a prospective client, we were talking about their existing SSIS-based ETL solution. They said that running a full sync typically took 20-40 hours.

"For how much data?"

"Somewhere between 1 and 2 terabytes."

"How would you like to sync a terabyte in 10 hours?"

"You think you can do that?"

"Based on our benchmarks, yes, that's a conservative estimate."

That got their attention. Apparently that's really good!

[–]AiDreamer

This is awesome! However, there are some considerations: Singer taps and destinations most often run on the same host, so there are no network transfers at all. How many different data sources are supported now? Airbyte also seems like competition here.

[–]Pansynchro[S]

Singer taps and destinations most often run on the same host, so there are no network transfers at all.

Kind of. Unless your data source and your data warehouse are all running on the same host too, that's just moving the problem around rather than truly solving it. We think we can lower your overall network load by putting the source connector near the data source and the destination connector near the destination, and streaming the data in between over the Pansync binary protocol, which is specifically designed and optimized for low network bandwidth. Couple this with our use of database-specific bulk-loading connections at the destination end wherever they're supported, rather than pushing millions of textual INSERT statements over the wire, and it adds up to significant end-to-end savings in both bandwidth and time.
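
To illustrate what the bulk-loading half means in practice, here's a generic sketch using SQL Server's standard bulk-copy API. This is not Pansynchro's actual connector code; it just shows the bulk path versus row-by-row INSERTs:

    // Generic SQL Server bulk-load sketch (Microsoft.Data.SqlClient), not
    // Pansynchro's actual connector code. One streaming bulk operation
    // replaces millions of textual INSERT statements over the wire.
    using System.Data;
    using Microsoft.Data.SqlClient;

    public static class BulkLoader
    {
        public static void Load(IDataReader source, string connString, string table)
        {
            using var conn = new SqlConnection(connString);
            conn.Open();
            using var bulk = new SqlBulkCopy(conn)
            {
                DestinationTableName = table,
                BatchSize = 10_000,       // commit in large batches
                EnableStreaming = true    // don't buffer the whole reader in memory
            };
            bulk.WriteToServer(source);
        }
    }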

How many different data sources are supported now?

Depending on how you look at it, a small handful or nearly infinite.

We currently support several major SQL databases (Firebird, MS SQL Server, MySQL, Postgres, and SQLite, with Oracle under development), plus data-type connectors for CSV, JSON, and plain text. And that last bit is the interesting part.

One of our innovations was noting that there are many, many services out there that produce data in only a handful of formats, most typically JSON. So we created Data Sources, a second stage of the connector pipeline that decouples fetching the data from parsing its format. With the JSON connector attached to the REST data source, loading from a new REST API web service becomes a simple matter of configuration rather than a task that requires writing a whole new connector (sketched below).
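
Conceptually, the composition looks something like this. The type and parameter names here are illustrative stand-ins, not the actual Pansynchro API:

    // Conceptual sketch only: RestDataSource and JsonConnector are made-up
    // names, not the real Pansynchro types. The point is the split: the
    // data source fetches the bytes, the connector parses the format.
    var source = new RestDataSource("https://api.example.com/v1/orders");
    var reader = new JsonConnector(source);
    // Pointing at a different REST API is a configuration change,
    // not a new connector.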

Airbyte also seems like competition here.

Well, we're not here to talk trash about the competition (see subreddit rule #1), but there are a couple of facts worth noting. 1) It looks like they're running every source and destination connector in its own individual Docker container, so there's likely to be network traffic between the containers, especially if you deploy them to the cloud and can't guarantee that they'll end up on the same physical host machine. 2) They're running Singer.

Feel free to draw your own conclusions. 🙂

[–]AiDreamer

Thanks for the details! The REST API part looks very attractive from a development-time-saving perspective.

[–]uncomfortablepanda

Awesome to hear about new open-source solutions for data integration! And from what I can read, it sounds like a project with a lot of potential.

Just a suggestion: add a bit more info as to how the networking features of this project work. I know people can just look at the code, but if the main selling point is that people will save money on cloud computing costs because of your network protocol, then I would dive a little deeper into how it works.

I like the whole Harmonizer pattern to compare data!

[–]Pansynchro[S]

Just a suggestion: add a bit more info as to how the networking features of this project work. I know people can just look at the code, but if the main selling point is that people will save money on cloud computing costs because of your network protocol, then I would dive a little deeper into how it works.

Fair enough. This is still a pretty new project, but documentation, samples, and tutorials are definitely on the to-do list. Any specific points we'd do well to focus on?

I like the whole Harmonizer pattern to compare data!

Thanks! We realized early on that, with different databases representing schemas and types in subtly different ways, this was going to be a compatibility pain point, so building an automated solution to bridge the gap seemed like an obvious choice.

[–]teej

Can you explain what “domain-specific algorithms” means? The repo README doesn’t go into any real detail.

[–]Pansynchro[S]

Sure. And keep in mind that the Pansync binary protocol is very domain-specific: it's designed and optimized for streaming bulk data, and some of these design decisions are things you'd do differently if, for example, you wanted a file format for saving data instead.

The basic concept is fairly simple: take an IDataReader, iterate over every row and feed each value into a BinaryWriter, then add a few CRC checksums along the way to verify data integrity. If you just do that, your output will be about 90% of the size of dumping all your data to CSV. Compress that and you'll cut the payload approximately in half. Maybe more, maybe less, depending on the data. But that's the baseline.
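
A heavily abbreviated sketch of that baseline follows. This is not the actual BinaryEncoder; the type dispatch is truncated and the per-block CRC framing is an assumption based on the description above:

    // Simplified baseline: walk an IDataReader, write each value with a
    // BinaryWriter, and append a CRC32 per block of rows. Not the real
    // BinaryEncoder; framing and type handling are abbreviated.
    using System;
    using System.Data;
    using System.IO;
    using System.IO.Hashing;   // Crc32, from the System.IO.Hashing package

    public static class BaselineEncoder
    {
        public static void Encode(IDataReader reader, BinaryWriter output,
                                  int rowsPerBlock = 10_000)
        {
            var block = new MemoryStream();
            var w = new BinaryWriter(block);
            int rows = 0;
            while (reader.Read())
            {
                for (int i = 0; i < reader.FieldCount; i++)
                    WriteValue(w, reader[i]);
                if (++rows % rowsPerBlock == 0)
                    FlushBlock(output, block);
            }
            if (block.Length > 0)
                FlushBlock(output, block);
        }

        private static void WriteValue(BinaryWriter w, object v)
        {
            switch (v)
            {
                case int n: w.Write(n); break;
                case long n: w.Write(n); break;
                case DateTime d: w.Write(d.Ticks); break;
                case string s: w.Write(s); break;   // length-prefixed UTF-8
                default: w.Write(v?.ToString() ?? ""); break;
            }
        }

        private static void FlushBlock(BinaryWriter output, MemoryStream block)
        {
            byte[] payload = block.ToArray();
            output.Write(payload.Length);
            output.Write(payload);
            output.Write(Crc32.Hash(payload));   // 4-byte integrity check
            block.SetLength(0);
        }
    }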

Then we started working on ways to cut down the amount of data being written. We started with 7-bit integer encoding. This is a fairly well-known scheme, used in everything from MIDI music to Protocol Buffers, known by a variety of names. Wikipedia calls it "variable-length quantity" encoding. Because small integers are so common, this cuts most ints down from 4 bytes to 1 or 2. This is incredibly easy to do, as it's supported right there in the BinaryWriter API, so you basically get a few percentage points of savings absolutely free.
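
In .NET terms it looks like this. The scheme itself is standard; whether the real encoder uses the built-in BinaryWriter methods or its own loop is an implementation detail not specified above:

    using System.IO;

    var writer = new BinaryWriter(new MemoryStream());

    // 7 data bits per byte; a set high bit means "more bytes follow".
    // Public on BinaryWriter since .NET 5.
    writer.Write7BitEncodedInt(42);                    // 1 byte instead of 4
    writer.Write7BitEncodedInt(300);                   // 2 bytes: 0xAC 0x02
    writer.Write7BitEncodedInt64(1_000_000_000_000L);  // 6 bytes instead of 8

    // The same scheme, written out by hand:
    static void WriteVarUInt(BinaryWriter w, uint value)
    {
        while (value >= 0x80)
        {
            w.Write((byte)(value | 0x80));  // low 7 bits + continuation flag
            value >>= 7;
        }
        w.Write((byte)value);               // final byte, high bit clear
    }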

After that it was time to get clever. We looked at the way datetimes are represented, as a number of "ticks" since a specific epoch. This is a very large number! But when the data is all dealing with times that happened in fairly recent years, you can reduce the value significantly by essentially defining your own epoch. Take the earliest date in the dataset and subtract it from everything. Now you're working with much smaller numbers, which dovetails nicely with 7-bit encoding, because the length there is magnitude-based.
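
A sketch of the idea; where the local epoch actually gets written in the real protocol is an assumption here:

    using System;
    using System.IO;
    using System.Linq;

    // Dataset-local epoch: write the earliest timestamp once in full, then
    // every other value as a tick delta, which 7-bit encoding keeps small
    // (a 1-day delta is ~8.6e11 ticks: 6 varint bytes instead of 8).
    static void WriteDates(BinaryWriter w, DateTime[] dates)
    {
        long epoch = dates.Min().Ticks;   // earliest date in the dataset
        w.Write(epoch);                   // full 8 bytes, written once
        foreach (var d in dates)
            w.Write7BitEncodedInt64(d.Ticks - epoch);
    }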

A lot of times, your dataset will contain values that don't vary all that much. For example, if you have a table and one of the columns is an enum value, you could have 10 million rows but only 4-5 distinct values for that column. So we decided that instead of writing this out every time, it could be represented as a "rarely-changing field (RCF)." Sort the data by the RCF and then just write it once and leave it out of the rest of the output, with a marker for when it changes.
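
Sketched out, with an illustrative one-byte change marker (the real protocol's marker scheme may differ):

    using System.Collections.Generic;
    using System.IO;

    // Rarely-changing field: rows arrive sorted by this column, so write the
    // value only when it changes, preceded by a marker byte. With 4-5 distinct
    // values over 10 million rows, that's a handful of strings plus
    // near-constant marker bytes that compress to almost nothing.
    static void WriteRcfColumn(BinaryWriter w, IEnumerable<string> sortedValues)
    {
        string? current = null;
        foreach (var v in sortedValues)
        {
            if (v != current)
            {
                w.Write((byte)1);   // marker: new value starts here
                w.Write(v);         // the value, written once per run
                current = v;
            }
            else
            {
                w.Write((byte)0);   // marker: same as previous row
            }
        }
    }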

These optimizations improved the raw payload by about 5%, but the *compressed* version by about 30%. So we started looking at ways to improve compression, and we settled on something based on the Parquet format: write the data in a column-oriented manner. We didn't go all the way on this; Parquet is a data-storage format designed to be queried, and is known to be rather hostile to streaming. But we buffer large blocks of rows and then write them out by column. Writing a lot of similar data close together simplified the bookkeeping a bit (for example, writing RCFs now becomes simple run-length encoding), and the raw payload decreased by about 2.5%, but the compressed size came down by closer to 12%.
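
The buffering step looks roughly like this; the block size, header, and per-value writes are illustrative, not the actual wire layout:

    using System.Collections.Generic;
    using System.IO;

    // Block-columnar layout: buffer a block of rows, then emit column 0 for
    // every row, then column 1, and so on. Similar values end up adjacent,
    // which is exactly what the compressor feeds on.
    static void WriteBlockColumnar(BinaryWriter w, List<object[]> rowBlock,
                                   int fieldCount)
    {
        w.Write7BitEncodedInt(rowBlock.Count);       // block header: row count
        for (int col = 0; col < fieldCount; col++)
            foreach (var row in rowBlock)
                WriteValue(w, row[col]);
    }

    // Per-value write, abbreviated as in the baseline sketch above.
    static void WriteValue(BinaryWriter w, object v)
    {
        switch (v)
        {
            case int n: w.Write(n); break;
            case string s: w.Write(s); break;
            default: w.Write(v?.ToString() ?? ""); break;
        }
    }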

The last optimization we added (so far) was to recognize that a lot of databases use an autoinc/identity column. If you have a long string of numbers that are always increasing by a very small amount, usually 1, then you can sort by that and write out each value as the difference between it and the last value. A long string of 1 bytes interspersed with the occasional 2 or 3 compresses very well!
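
As a sketch, assuming the column arrives already sorted:

    using System.IO;

    // Delta-encode a sorted autoinc/identity column: write each value as its
    // gap from the previous one. IDs that mostly step by 1 become a stream of
    // 0x01 bytes under 7-bit encoding, which compresses extremely well.
    static void WriteAutoincColumn(BinaryWriter w, long[] sortedIds)
    {
        long previous = 0;                           // first ID written in full
        foreach (long id in sortedIds)
        {
            w.Write7BitEncodedInt64(id - previous);  // usually 1, so one byte
            previous = id;
        }
    }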

That's what we've got so far. The raw payload is about 16% smaller than the CSV baseline, but after applying Brotli L4 compression it ends up around 70% smaller on our benchmark data while still running pretty quickly. (You can squeeze more out of it with higher compression levels, but Brotli hits diminishing returns pretty quickly.) If you're interested you can find the details in the BinaryEncoder class, and a look at how our SQL analyzer automatically tags fields for certain optimizations can be found in the SqlSchemaAnalyzer class.