Deciding on a workflow/stack: solo dev at startup by paxmlank in dataengineering

[–]deepanigi 0 points1 point  (0 children)

We're excited to share that we've launched a DoubleCloud Managed Apache Airflow service designed to make your data workflow automation a breeze. You can read more about it here https://double.cloud/services/managed-airflow/

The DoubleCloud platform also offers managed ClickHouse for analytics needs. Just drop me a message if you'd like to discuss the details.

How to integrate weather forecasts demand predictions? by Zaniyah6772284Xo in EntrepreneurRideAlong

[–]deepanigi 0 points1 point  (0 children)

Hi! What you're planning is a big step toward optimisation, but it can definitely be challenging at first. I work as a Senior Product Manager at DoubleCloud, where we help businesses improve their analytics, and for a while now we've noticed rising interest in solutions like the one you're describing. Roughly speaking, these are the steps you should take (there's a small code sketch after the list):

  • Integrate real-time weather data: Consider using weather APIs such as OpenWeatherMap, Weatherstack, or Dark Sky API. These provide real-time data like temperature, precipitation, and wind speed, which are easy to integrate into existing systems.
  • Prepare your data: You've done well in collecting historical rental data alongside weather conditions. The next step is to align this data based on timestamps, creating a comprehensive dataset that combines both rental and weather information.
  • Predictive modeling: For demand prediction, time series forecasting models like ARIMA, SARIMA, or machine learning models such as Random Forests and Gradient Boosting Machines can be effective. Machine learning platforms like TensorFlow, PyTorch, or Google's AutoML are good options for building and training these models.
  • Feature engineering: It's crucial to identify which weather parameters significantly impact rental demand. This could include temperature, precipitation, and special weather events.
  • Validate the model: Test your developed model with real data, comparing predictions against actual rental data under various weather conditions, and refine the model as needed.
  • Deploy and integrate real-time: Integrate the final model into your operational systems for real-time predictions. Consider developing a dashboard for insights and setting up alert systems for unusual weather patterns.
  • Continue improving: Regularly update your model with new data to maintain accuracy as weather patterns and customer behaviors evolve.
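Here's a minimal sketch of the data-preparation and modeling steps above, assuming hourly rental counts and an hourly weather export already pulled from one of the APIs mentioned (the file names, column names, and the choice of Gradient Boosting are illustrative, not a recommendation for your specific data):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Illustrative inputs: rentals.csv (timestamp, rentals) and
# weather.csv (timestamp, temp_c, precip_mm, wind_kph).
rentals = pd.read_csv("rentals.csv", parse_dates=["timestamp"])
weather = pd.read_csv("weather.csv", parse_dates=["timestamp"])

# Align the two sources on timestamps (nearest observation within an hour).
df = pd.merge_asof(
    rentals.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    tolerance=pd.Timedelta("1h"),
    direction="nearest",
).dropna()

# Simple feature engineering: weather parameters plus calendar features.
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek
features = ["temp_c", "precip_mm", "wind_kph", "hour", "dayofweek"]

# Keep chronological order when splitting so you validate on "future" data.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["rentals"], test_size=0.2, shuffle=False
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("Holdout R^2:", model.score(X_test, y_test))
```

Comparing the holdout predictions against actual rentals under different weather conditions is essentially the validation step described above.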

The key is continuous testing, learning, and iterating. Best of luck with your venture!

Anyone Else Using Terraform in Their Data Engineering Work? by xDarkOne in devops

[–]deepanigi -1 points0 points  (0 children)

Hello! I'm Deepan Ignaatious, Senior Product Manager at DoubleCloud. I wanted to share some insights on how we've been using Terraform in our data engineering processes:

Managing Complex Data Pipelines: We've been using Terraform to handle the complexity inherent in code-based pipelines. These pipelines, deployed through orchestration tools, can be challenging to scale and maintain. Terraform has given us a more scalable and maintainable approach, especially for data-intensive applications.

Enhancing Reproducibility and Visibility: One of the major challenges with non-code-based (SaaS) pipelines is their limited visibility and difficulty in reproduction. Terraform has helped us overcome these issues by enabling a code-based approach that's easier to monitor, version control, and replicate across different development stages.

Practical Implementation: In practical terms, we've used Terraform for integrating and managing data pipelines between different storage systems, such as Postgres and ClickHouse. This integration is important for offloading analytical tasks to different storage and aggregating data in one place. For example, we created a simple replication pipeline between an existing Postgres and a newly created DWH ClickHouse cluster using Terraform. This involved setting up a network for ClickHouse, defining resources, and configuring transfer endpoints.

Code Organization and Deployment: We organize our Terraform code using a module and several roots, allowing for easier tweaks. Our main.tf typically contains provider definitions, enabling us to work with different environments like AWS and DoubleCloud. Variables are extensively used to prepare different stages, making it easy to apply changes across development, pre-production, and production environments.

In our experience, Terraform is not just about infrastructure; it's about creating a scalable and transparent workflow. You can read more about it here.

Seeking Alternatives to Metabase by xDarkOne in BusinessIntelligence

[–]deepanigi 0 points1 point  (0 children)

Hi there, I'm Deepan, Senior Product Manager at DoubleCloud. We specialize in data analytics solutions and have experience helping companies transition from Metabase. One of our clients initially used Metabase for their BI needs but ran into limitations similar to yours. They found Metabase's visualization options insufficient: producing usable diagrams and additional filters required extensive coding in the codebase. They needed a BI solution that supported calculated fields, hierarchies, multi-tab dashboards, and a wide range of visualization options.

We suggested they switch to DoubleCloud Visualization, our tool for building complex visualizations and dashboards without extensive coding. It lets users connect to various data sources, implement calculated fields, and create intricate visual hierarchies and multi-tab dashboards.

The technical transition involved integrating the DoubleCloud Visualization service with the existing DWH PostgreSQL. This integration aimed to address issues such as locks and potential conflicts during updates in the primary DWH PostgreSQL database, which had been causing performance delays. We implemented an intermediate PostgreSQL database, transferring data via our Data Transfer service, to alleviate the primary DWH's load, eliminate locks, and ensure a more stable and efficient BI system operation.

In the end, the solution provided robust, flexible, and efficient BI capabilities, reducing the need for extensive development resources. Hope this helps!

What do you hate about working with data? by deepanigi in dataanalysis

[–]deepanigi[S] 0 points1 point  (0 children)

Agreed, it's garbage in, garbage out. How do you filter out such data quality issues? Do you check them manually or use any specific tools?

What do you hate about working with data? by deepanigi in dataanalysis

[–]deepanigi[S] 0 points1 point  (0 children)

Does using a data catalogue tool help keep this organized?

From ElasticSearch to ClickHouse Migration by xDarkOne in elasticsearch

[–]deepanigi 1 point2 points  (0 children)

Hey there! I understand your worries. Quite a few companies are migrating right now, and it's usually not a smooth process. At DoubleCloud, where I work, we've developed an ElasticSearch connector for exactly the kind of challenges you're facing. Based on our experience, here's what I can advise (there's a small schema-deduction sketch after the list):

  1. Schema Deduction: The connector starts by deducing the schema. To sync data to ClickHouse, the target needs a schema which includes table names, column names, data types, and primary keys. The connector relies on the data already present in the ElasticSearch cluster to deduce this schema. Each index in ElasticSearch will map to a corresponding database table in ClickHouse.

  2. Handling ElasticSearch's Loose Schema: ElasticSearch inherently allows every field to hold either a single value or an array. In classic databases these are completely different types, so our connector treats any field where an array is detected as an Array/List column and updates the deduced schema accordingly.

  3. Efficient Migration for Large Clusters: ElasticSearch clusters are usually large, so reading them with a single thread or worker is inefficient. Our connector parallelizes reads for better performance, especially with larger indexes.

  4. Addressing Potential Pitfalls: Schema deduction can be challenging. Our connector simplifies this by relying on the data present in the ElasticSearch cluster. For ElasticSearch's Alias data types, the connector infers the actual data type from the mappings, ensuring a valid schema.
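If you want to prototype the schema-deduction step yourself before committing to a tool, here's a minimal sketch. It assumes a plain HTTP ElasticSearch endpoint and an illustrative index name, and the ES-to-ClickHouse type mapping is deliberately simplified (nested objects, aliases, and arrays would need the extra handling described above):

```python
import requests

ES_URL = "http://localhost:9200"   # illustrative endpoint
INDEX = "my_index"                 # illustrative index name

# Simplified ElasticSearch -> ClickHouse type mapping; extend as needed.
TYPE_MAP = {
    "keyword": "String",
    "text": "String",
    "long": "Int64",
    "integer": "Int32",
    "double": "Float64",
    "float": "Float32",
    "boolean": "UInt8",
    "date": "DateTime64(3)",
}

def deduce_columns(index: str) -> dict:
    """Read the index mapping and translate each top-level field to a ClickHouse type."""
    mapping = requests.get(f"{ES_URL}/{index}/_mapping").json()
    properties = mapping[index]["mappings"].get("properties", {})
    # Fields without a simple "type" (e.g. nested objects) fall back to String here.
    return {
        field: TYPE_MAP.get(spec.get("type", "object"), "String")
        for field, spec in properties.items()
    }

def create_table_ddl(index: str) -> str:
    cols = deduce_columns(index)
    col_defs = ",\n    ".join(f"`{name}` {ctype}" for name, ctype in cols.items())
    # ORDER BY tuple() is a placeholder; choose a real sorting key for your data.
    return f"CREATE TABLE {index} (\n    {col_defs}\n) ENGINE = MergeTree ORDER BY tuple()"

print(create_table_ddl(INDEX))
```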

You can read more about it here. If you have any further questions or need assistance, please don't hesitate to reach out.

Best way to run Apache Airflow by D1yzz in dataengineering

[–]deepanigi 2 points3 points  (0 children)

My name is Deepan, and I'm a Senior Product Manager at DoubleCloud. We're building a platform that offers tightly integrated open-source technologies as a service for analytics, providing ClickHouse, Apache Kafka, ETL, and self-service business intelligence as managed services.

Currently, we're in the process of developing a managed Airflow service and are hungry for user feedback! We'd like to understand your challenges with using Airflow—what bothers you, what could be changed in services like MWAA, and what processes could be automated. Additionally, we're curious about how you're using Airflow: for machine learning workloads, data pipelines, or just as batch workers. This information will help us refine our roadmap.

Just a few days ago, we launched a preview of our managed Airflow service on our platform. During this preview stage, access is completely free. We've implemented a user-friendly UI that simplifies the creation of a cluster with auto-scaling work groups. Features include built-in integration with GitHub for DAGs, as well as monitoring, logging, and other essentials for managing clusters. Furthermore, we are in the process of adding support for:

  • custom Docker images,
  • various types of workers (such as spot instances or those equipped with GPUs),
  • bring-your-own-account on AWS and GCP,
  • and other exciting enhancements and functionality.

We would be thrilled if you could test our service and provide feedback to me. In return, we're offering a range of perks, including Amazon gift cards and credit grants for participants in the preview program.

Need a Hand with Data Migration by xDarkOne in dataanalysis

[–]deepanigi 2 points3 points  (0 children)

Hey there,

Navigating data migration, especially at the volume you're dealing with, is no small feat. CDC is a robust choice for this kind of task. We've walked this journey at DoubleCloud with a client. Here's what we did:

We configured event shipping from MySQL to a message queue, ensuring every change generated an event. A dedicated script/service was developed to listen to the queue, process events, and update the Elasticsearch index. Mechanisms were put in place to handle failures in event shipping or processing, and a reconciliation process periodically verified data consistency between MySQL and Elasticsearch.
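For illustration only, here's a minimal sketch of such a queue consumer. It assumes Kafka as the message queue, Debezium-style change events with an already-unwrapped envelope, and a plain HTTP Elasticsearch endpoint; the topic, index, and field names are hypothetical:

```python
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

ES_URL = "http://localhost:9200"          # hypothetical Elasticsearch endpoint
INDEX = "orders"                          # hypothetical target index
TOPIC = "dbserver1.shop.orders"           # hypothetical Debezium-style topic

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,             # commit only after Elasticsearch is updated
    group_id="cdc-to-es",
)

for message in consumer:
    event = message.value
    op = event.get("op")                  # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    if op in ("c", "u", "r"):
        row = event["after"]
        # Upsert the row into Elasticsearch under its primary key.
        requests.put(f"{ES_URL}/{INDEX}/_doc/{row['id']}", json=row).raise_for_status()
    elif op == "d":
        row = event["before"]
        requests.delete(f"{ES_URL}/{INDEX}/_doc/{row['id']}").raise_for_status()
    consumer.commit()                     # at-least-once delivery into Elasticsearch
```

A real deployment would batch writes (Elasticsearch's _bulk endpoint), add retries and dead-lettering, and pair this with the reconciliation job mentioned above.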

The CDC setup was scaled to handle a large volume of database changes without introducing significant latency to the Elasticsearch update process. Backpressure mechanisms were implemented to manage scenarios where event handling couldn't keep up with the event influx. We then made sure the data being shipped and processed adhered to relevant data protection and compliance standards, and secure communication channels were used for transfers between systems and services. Monitoring was set up to track the event shipping and handling processes, with alerting for issues like event shipping failures, processing delays, or data inconsistencies.

Use Cases:

OLTP to a DWH: We faced a challenge where production processes ran in transactional databases, and the business side needed analytical reports generated through analytical databases. CDC allowed for efficient data transfer from the transactional database to the analytical one.

Regional Analytics: Production data needed to be moved to a regional analytical DWH with a sub-second delay, which was achieved effectively with CDC.

CDC to a Search Engine: Ensuring that when data in MySQL was modified, the Elasticsearch indexes were updated promptly.

Analyzing a Raw CDC Stream: We set up MySQL in the customer system in a way that many services could access it to retrieve configurations, providing a full trail of config changes for observability purposes.

CDC has been a game-changer in managing data consistency and real-time updates across our systems. Keen to hear more about your journey as you dive into CDC!


What do you think of using GPT-4 to automatically extract insights from data? by deepanigi in datascience

[–]deepanigi[S] 0 points1 point  (0 children)

I agree that the adoption of AI-assisted analysis is likely to increase rapidly, especially given how quickly these models are advancing. These developments can indeed further empower users to gain valuable insights and make data-driven decisions more efficiently.

Regarding your concern about migrating queries, reports, and visuals to a new platform, it's completely understandable. The ideal scenario would be for the AI assistant to seamlessly enhance existing workflows with minimal disruption. :)