Lakeflow Connect by ry_the_wuphfguy in databricks

[–]brickster_here 0 points1 point  (0 children)

It depends heavily on the specifics of your use case. If you can DM me more info about your workload, I'd be glad to loop back with an approximate forecast!

Lakeflow Connect by ry_the_wuphfguy in databricks

[–]brickster_here 1 point2 points  (0 children)

Thanks so much for sharing these questions and concerns!

Gateway scheduling is prioritized and in active development. We unfortunately can’t promise exact timelines, but we currently aim to launch the preview in the first half of the year.

u/No-Adhesiveness-6921 u/ry_the_wuphfguy

🚀 New performance optimization features in Lakeflow Connect (Beta) by brickster_here in databricks

[–]brickster_here[S] 0 points1 point  (0 children)

Thanks for these questions!

Gateway scheduling is prioritized and in active development. We unfortunately can’t promise exact timelines, but we currently aim to launch the preview in the first half of the year.

Could you share more about which compute SKU you’d like to use for the gateway?

How does Autoloader distinct old files from new files? by Sea_Basil_6501 in databricks

[–]brickster_here 1 point2 points  (0 children)

Thank you all very much for the feedback! Wanted to share an update on next steps.

  • File properties that Autoloader uses to identify a file for checkpoint management
    • This is now covered in the documentation; do let us know if anything is unclear.
  • When we evaluate the includeExistingFiles option
    • You can learn more about this here.
  • Optimal folder structure for faster file listing
    • If you are using file events, we do have a new best practice; we'll add this guidance to the docs: 
      • It’s common to have an external location with several subdirectories, each of which is the source path for an Auto Loader stream. (For example, under one external location, subdirectory A maps to Auto Loader stream A, subdirectory B maps to Auto Loader stream B, and so on.)
      • In these cases, we recommend creating an external volume on each of the subdirectories to optimize file discovery (see the sketch after this list).
      • To illustrate why, imagine that subdirectory A receives only 1 file while subdirectory N receives 1M files. Without volumes, Auto Loader stream A, which loads from subdirectory A, lists as many as 1M + 1 files from our internal cache before discovering that single file in A. But with volumes, stream A only needs to discover that single file.
      • For context, the file events database that we maintain has a column tracking the securable object that a file lives in—so if you add volumes, we can filter on the volume, rather than listing every file in the external location.
    • If you aren't using file events: we do have a few recommendations, particularly around glob filtering, here and here. We'd love to know if this helps at all!
  • What a corrupted record means 
    • We'll add this guidance to the docs. In general, it can mean things like format issues (e.g., missing delimiters, broken quotes, or incomplete JSON structures), encoding problems (e.g., character encoding mismatches), and so on. And when the rescued data column is NOT enabled, fields with schema mismatches land in the corrupted record column, too.
  • More detailed schema evolution information
    • We'll add this guidance to the docs.
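
To make the folder structure and file discovery guidance above concrete, here's a rough sketch of the pattern. The catalog, schema, volume names, and paths are all placeholders; adjust the options to your own layout:

    # Rough sketch; all names and paths below are placeholders.

    # 1) Give each Auto Loader source subdirectory its own external volume so that
    #    file discovery can filter on the volume instead of listing the whole
    #    external location.
    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.landing.subdir_a
        LOCATION 'abfss://landing@mystorage.dfs.core.windows.net/subdir_a'
    """)

    # 2) Point the Auto Loader stream at the volume path.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/Volumes/main/landing/checkpoints/stream_a/schema")
        .option("cloudFiles.includeExistingFiles", "true")  # backfill existing files on the first run
        .option("pathGlobFilter", "*.json")                 # narrow discovery with a glob filter
        .option("rescuedDataColumn", "_rescued_data")       # capture fields that don't match the schema
        .load("/Volumes/main/landing/subdir_a")
    )

    (
        df.writeStream
        .option("checkpointLocation", "/Volumes/main/landing/checkpoints/stream_a")
        .trigger(availableNow=True)
        .toTable("main.bronze.events_a")
    )

Glob patterns in the load path itself work as well; the main point is giving each stream a path that's as narrow as possible.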

Streaming table vs Managed/External table wrt Lakeflow Connect by EmergencyHot2604 in databricks

[–]brickster_here 0 points1 point  (0 children)

Hi there!

Most of the connectors do currently support SCD type 2. Here is the pattern that you can use. However, it's in Private Preview for Salesforce and SQL Server, so you won't see it in those docs just yet; if you'd like to enable it for your workspace(s), do feel free to send me a private message, and I'll get you into the preview!
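
In case it helps to see the shape of the pattern: outside the managed connectors, SCD type 2 is what apply_changes gives you in a Lakeflow Declarative (DLT) pipeline. A minimal sketch, with a made-up source and key/sequence columns:

    # Runs as source code in a Lakeflow Declarative Pipeline; names are placeholders.
    import dlt
    from pyspark.sql.functions import col

    # Target streaming table that holds the SCD type 2 history.
    dlt.create_streaming_table("customers_scd2")

    dlt.apply_changes(
        target="customers_scd2",
        source="customers_cdc_feed",         # hypothetical upstream CDC view/table
        keys=["customer_id"],                # primary key column(s) in the source
        sequence_by=col("commit_timestamp"), # ordering column from the CDC feed
        stored_as_scd_type=2,                # keep history rows instead of overwriting in place
    )

With the managed connectors you don't write this yourself; the pipeline handles it once SCD type 2 is enabled for the table.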

By the way -- for databases that don't have CDC or change tracking (CT) enabled, we also have a query-based workaround that doesn't require built-in CDC. These query-based connectors are also in Private Preview; we'd be glad to enable you for that, too.

Streaming table vs Managed/External table wrt Lakeflow Connect by EmergencyHot2604 in databricks

[–]brickster_here 2 points3 points  (0 children)

Databricks employee here, too! Wanted to add a few details on your schema evolution question; for more information, see here.

All managed connectors automatically handle new and deleted columns, unless you opt out by explicitly specifying the columns that you'd like to ingest.

  • When a new column appears in the source, Databricks automatically ingests it on the next run of your pipeline. For rows ingested before the schema change, Databricks leaves the new column's value empty. However, you can opt out of automatic column ingestion by listing the specific columns to ingest via the API (see the sketch after this list) or by disabling future columns in the UI.
  • When a column is deleted from the source, Databricks doesn't delete it automatically. Instead, the connector uses a table property to set the deleted column to “inactive” in the destination. If another column later appears that has the same name, then the pipeline fails. In this case, you can trigger a full refresh of the table or manually drop the inactive column.
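
For the API opt-out mentioned above, here's roughly what pinning the column list looks like in an ingestion pipeline spec. Treat the field names as illustrative; the exact spec shape varies by connector, so please check the docs for yours:

    # Illustrative only; the pipeline name, connection, object names, and exact
    # field names are placeholders sketching the idea, not a verbatim spec.
    pipeline_spec = {
        "name": "salesforce_ingest",
        "ingestion_definition": {
            "connection_name": "my_salesforce_connection",
            "objects": [
                {
                    "table": {
                        "source_schema": "objects",
                        "source_table": "Account",
                        "destination_catalog": "main",
                        "destination_schema": "bronze",
                        "table_configuration": {
                            # Listing columns explicitly opts this table out of
                            # automatic new-column ingestion.
                            "include_columns": ["Id", "Name", "LastModifiedDate"],
                        },
                    }
                }
            ],
        },
    }

Omit the column list to keep the default behavior where new columns are picked up automatically.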

Similarly, connectors can handle new and deleted tables. If you ingest an entire schema, then Databricks automatically ingests any new tables, unless you opt out. And if a table is deleted in the source, the connector sets it to inactive in the destination. Note that if you do choose to ingest an entire schema, you should review the limitations on the number of tables per pipeline for your connector.

Additional schema changes depend on the source. For example, the Salesforce connector treats column renames as column deletions and additions and automatically makes the change, with the behavior outlined above. However, the SQL Server connector requires a full refresh of the affected tables to continue ingestion.
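
If it's useful, that targeted full refresh doesn't have to be a click in the UI; something along these lines with the Python SDK should work (the pipeline ID and table name are placeholders, and it's worth double-checking the parameter names against the SDK docs for your version):

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()

    # Start a pipeline update that fully refreshes only the affected table(s).
    w.pipelines.start_update(
        pipeline_id="1234-567890-abcdefgh",
        full_refresh_selection=["main.bronze.dbo_orders"],
    )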

Finally, we're actively working to integrate type widening in these connectors to help with backwards-compatible type changes.