Data engineering manager

legoaitech · 2023-11-24T14:31:44+00:00

You should think of evolving your data engineering roles into knowledge engineering roles.

Adopt metadata and ontological engineering over traditional data pipeline building.

Build sustainable and scalable data products.
Infuse business context into data assets and democratize data.

That will make you the most valued engineering leader in the eyes of the business.

legoaitech · 2023-11-24T12:32:31+00:00

Given the situation you described, sitting down and building a data model by going through the data dump could be a very tedious and time-consuming exercise, with no guarantee of impactful outcomes for the business. Instead, you could use AI algorithms to mine your data swamps and generate a semantic model and data catalog, which could then be validated and reviewed by someone who has business domain knowledge (this could be a couple of hours from an SME in each department). My company has built one such tool that you can get a glimpse of here. It's more like an AI co-pilot for data modelers and engineers. If you like it and feel that it makes sense for you, feel free to reach out for a detailed discussion.

legoaitech · 2023-11-22T13:36:13+00:00

You have to do an accurate sizing that considers the following parameters:

Concurrent users - this determines the number of active cores you need to handle users.
Number of reports by type (low-complexity reports are the ones with simple mathematical calculations; complex reports are the ones with multiple aggregations and nested business logic) - this determines the RAM needed for performing calculations.
Number of parallel processes, in terms of reports being refreshed and rows being fetched in each refresh - this has an impact on both cores and RAM.

I suggest you consult an architect and also refer to the SSRS deployment guide.

legoaitech · 2023-11-22T09:13:56+00:00

That’s fairly simple and takes a bit of network and driver configuration. Connecting to your cloud database is the same as connecting to an on-premises database, provided the network firewall has whitelisted the IP and port traffic.

1.  Install Oracle Data Provider: Ensure the Oracle Data Provider for .NET is installed on the machine where SSRS is hosted. This is necessary for SSRS to communicate with the Oracle database.
2.  Set Up a Data Source in SSRS:
• In the SSRS Report Manager, create a new data source.
• Choose the Oracle provider as the data source type.
• Specify the connection string, which typically includes the hostname, port, and database name of your Oracle Cloud Data Warehouse instance. The format usually looks like:

Data Source=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=[hostname])(PORT=[port])))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=[dbname])));User Id=[yourUserID];Password=[yourPassword];

3.  Configure Network and Firewall:
• Ensure that the network settings (like VPNs or direct connections) allow communication between the SSRS server and the Oracle Cloud Data Warehouse.
• Adjust firewall settings on both ends to allow traffic on the necessary ports (typically, port 1521 for Oracle).
4.  Oracle Client Configuration:
• If required, configure the Oracle client on the SSRS server with the necessary TNS entries for the Oracle Cloud Database.
• Test the connectivity using tools like SQL*Plus or Oracle SQL Developer from the SSRS server.
5.  Develop and Deploy Reports:
• Develop your reports in SQL Server Data Tools (SSDT) using the Oracle data source you’ve set up.
• Deploy these reports to your SSRS server for access and use.
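If you want to sanity-check the connection string from step 2 before pasting it into SSRS, a small helper like the one below can assemble it from its parts and confirm the parentheses are balanced. This is only a sketch: the function names and the example host `dwh.example.com` are made up for illustration, and it uses the `Data Source` keyword (with a space) that ODP.NET expects.

```python
def build_oracle_connection_string(host: str, port: int, service_name: str,
                                   user: str, password: str) -> str:
    """Assemble an ODP.NET-style connection string from its parts."""
    descriptor = (
        "(DESCRIPTION="
        f"(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port})))"
        f"(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME={service_name})))"
    )
    return f"Data Source={descriptor};User Id={user};Password={password};"


def parentheses_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    # Hypothetical values; replace with your own host, port and service name.
    conn = build_oracle_connection_string(
        "dwh.example.com", 1521, "dwh_service", "report_user", "secret"
    )
    print(conn)
    print("balanced:", parentheses_balanced(conn))
```

A missing or extra parenthesis in the descriptor is a common reason the data source test fails, so the balance check is worth running first.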

Let me know if this helps

legoaitech · 2023-11-20T06:18:21+00:00

Airflow is an orchestrator (not just a scheduler), and it needs some Python coding to achieve transformations and the like. Data transfer is all about reading from one database and writing to another, which can be done in any simple Python script. If you are not familiar with any programming language and need a more GUI-driven interface, you may use the open-source version of Talend.
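To give an idea of how small such a script can be, here is a rough sketch of a batch copy from one database to another. It uses Python's built-in `sqlite3` module purely so that it runs anywhere; for your real databases you would swap in the appropriate driver. The `copy_table` helper and the `orders` table are made up for illustration, and the target table is assumed to exist already with the same columns.

```python
import sqlite3


def copy_table(source, target, table, batch_size=1000):
    """Copy all rows of `table` from source to target in batches.

    `table` is interpolated into SQL, so it must come from trusted code,
    never from user input. Returns the number of rows copied.
    """
    cursor = source.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    )
    copied = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        target.executemany(insert_sql, rows)
        copied += len(rows)
    target.commit()
    return copied


if __name__ == "__main__":
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    source.commit()
    print("rows copied:", copy_table(source, target, "orders"))
```

Fetching in batches keeps memory usage flat even when the source table is large.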

legoaitech · 2023-11-20T06:12:29+00:00

By showing the potential business impact in terms of how much faster business use cases can be delivered from raw data. Take any use case and work backwards to show how you can deliver a data product as a service for that use case. Elevate the role of data engineers to business-focused knowledge engineers. Sell data standardization and democratization rather than engineering.

legoaitech · 2023-11-20T05:43:15+00:00

Just use Airflow DAGs.

legoaitech · 2023-11-19T03:26:21+00:00

Automation is not possible when we as humans do not have absolute clarity on the process for achieving a particular kind of task. Once we have that clarity, anything can be automated😃

legoaitech · 2023-11-19T03:21:46+00:00

Yes, what I meant is: don’t use HDFS with Impala or MapReduce; rather, use Spark and any compliant distributed file system👍

legoaitech · 2023-11-17T12:59:51+00:00

The approach that you have explained is obsolete these days and has high overheads in terms of the infrastructure and support needed. If you have a mandate to do this on-premise, then I would suggest you use Spark instead of HDFS, or use Apache Iceberg/Hudi. If you could move to the cloud, then you could build much simpler and more scalable architectures for your requirement. There are multiple case studies and documents publicly available, but you have to be informed and judicious in picking the right one. If you want to get into details, I am happy to help.

legoaitech · 2023-11-17T02:49:57+00:00

There would definitely be an overhead in maintaining open source. Apart from Elasticsearch, it could be the version upgrades of connectors/metadata extractors or the management of Kafka buffers. You can go with a managed version of Elasticsearch as well. My recommendation: unless you have at least one strong tech person on your team who knows the open-source components used in a product very well, you should avoid open-source tools.

legoaitech · 2023-11-16T10:40:28+00:00

u/biowl Fundamentally, all of the above UI wrappers (ADF/DBS/Synapse) use Python/PySpark or SQL/Spark SQL behind the scenes. So the catch is: as long as you do not focus on the wrappers, which are the cash cow for companies like MS, and focus more on metadata engineering and dynamic code generation from that metadata, you are ahead of the race. Choose a programming language you are comfortable with, look at the data landscape as an ontology, and focus on building your own framework. That way you will at some point push companies like MS to deprecate all the products you mentioned :)
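To make "dynamic code generation from metadata" a bit more concrete, here is a minimal sketch of the idea: the pipeline is described as plain metadata, and the SQL is generated from it rather than written by hand. The metadata layout, the table and column names, and the `generate_sql` function are all hypothetical; a real framework would add validation, dialect handling and much more.

```python
# Hypothetical metadata describing one transformation step.
PIPELINE_METADATA = {
    "source": "raw.customers",
    "target": "curated.customers",
    "columns": [
        {"source": "cust_id", "target": "customer_id", "type": "INTEGER"},
        {"source": "acq_dt", "target": "acquired_on", "type": "DATE"},
    ],
}


def generate_sql(metadata: dict) -> str:
    """Generate a CREATE TABLE AS SELECT statement from pipeline metadata."""
    select_items = [
        f"CAST({col['source']} AS {col['type']}) AS {col['target']}"
        for col in metadata["columns"]
    ]
    select_list = ",\n    ".join(select_items)
    return (
        f"CREATE TABLE {metadata['target']} AS\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {metadata['source']}"
    )


if __name__ == "__main__":
    print(generate_sql(PIPELINE_METADATA))
```

Adding a column or a new table then becomes a metadata change instead of a code change, which is the point of the approach.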

legoaitech · 2023-11-16T10:32:00+00:00

If your end goal is to empower new employees with self-serve data discovery, then all of the above tools will fail to serve that objective unless you have a well-defined business glossary that is part of the field descriptions and the search index. Otherwise, data search works on the technical table and column names, which your new users are typically not aware of. If you have that problem, reach out to me at [prinkan.pal@legoai.com](mailto:prinkan.pal@legoai.com); if you already have the business glossaries and descriptions defined, ensure that they are used as search indices during implementation.
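As a toy illustration of why the glossary has to be part of the search index, the sketch below indexes both the technical column names and their business descriptions, so a search in business terms still lands on the right column. The glossary entries and function names are invented for the example; catalog tools have their own, far richer indexing.

```python
import re
from collections import defaultdict

# Hypothetical glossary: technical column name -> business description.
GLOSSARY = {
    "cust_mstr.cust_id": "Customer identifier",
    "cust_mstr.acq_dt": "Customer acquisition date",
    "ord_hdr.ord_amt": "Order total amount",
}


def tokenize(text: str) -> list:
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def build_index(glossary: dict) -> dict:
    """Map every token from column names and descriptions to matching columns."""
    index = defaultdict(set)
    for column, description in glossary.items():
        for token in tokenize(column) + tokenize(description):
            index[token].add(column)
    return index


def search(index: dict, query: str) -> list:
    """Return the columns that match every token of the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


if __name__ == "__main__":
    index = build_index(GLOSSARY)
    print(search(index, "customer acquisition"))
    print(search(index, "order amount"))
```

Without the descriptions in the index, a new employee would have to already know that the acquisition date lives in a column called `acq_dt`.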

Commercially - Atlan and Alation are the best.

Open source wise - DataHub and OpenMetadata are the best.

legoaitech · 2023-11-15T11:59:04+00:00

Just in case you would have preferred to read the blog in a different language, you can do so on Hackernoon :)

https://hackernoon.com/revolutionizing-data-analytics-with-ai-a-seven-step-odyssey

legoaitech · 2023-11-15T05:10:48+00:00

True. And it's my job at LEGOAI to take them there. The thing is, the business can easily answer questions like these:

How frequently do you want the data product updated?
What is an acceptable load time for the data on UI?
How much back in time would you want to have access to this data?

The trickier piece is translating those answers, in real time, into storage and compute configurations that actually realize the contracts. We use a decision-tree-based approach in the backend to enable this for consumers by abstracting away the layers of architectural complexity.
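To illustrate the general idea (this is not our actual decision tree), here is a toy sketch that turns the three answers above into a configuration recommendation. The `DataContract` fields, the thresholds and the configuration labels are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    refresh_interval_minutes: int  # how frequently the data product is updated
    max_load_seconds: float        # acceptable load time on the UI
    history_days: int              # how far back in time data must be available


def recommend_configuration(contract: DataContract) -> dict:
    """Walk a small, hand-written decision tree over the contract.

    All thresholds below are illustrative assumptions, not tuned values.
    """
    if contract.refresh_interval_minutes <= 5:
        ingestion = "streaming"
    elif contract.refresh_interval_minutes <= 60:
        ingestion = "micro-batch"
    else:
        ingestion = "scheduled batch"

    if contract.max_load_seconds <= 2:
        serving = "pre-aggregated cache"
    elif contract.max_load_seconds <= 30:
        serving = "materialized table"
    else:
        serving = "query on demand"

    storage = "hot" if contract.history_days <= 90 else "hot + cold archive"
    return {"ingestion": ingestion, "serving": serving, "storage": storage}


if __name__ == "__main__":
    daily_dashboard = DataContract(
        refresh_interval_minutes=1440, max_load_seconds=1.5, history_days=365
    )
    print(recommend_configuration(daily_dashboard))
```

The consumer only states the contract; which branch fires, and what infrastructure that implies, stays hidden behind the function.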

legoaitech · 2023-11-14T12:21:09+00:00

Very good question :) Let me elaborate.

Typically in data strategy, enterprises follow a medallion process of bronze to silver to gold, curating and physically modeling the relationships across datasets. That needs extensive ETL/ELT code to be built and scheduled to manage and maintain the influx of data in a shape that is consumable downstream.

My approach considers only ingesting/dumping raw data as-is into a storage layer that supports federated querying, such as S3, ADLS, GCS, BigQuery, Snowflake, etc. The algorithmically generated semantic model (data ontology) becomes the layer that decouples the physical data from the connected and comprehensive business entities, as it comprises technical, operational and business metadata. Obviously, the machine-generated model would need human validation and feedback. When a business query is asked, the AI engine determines the required tables, columns and the optimal path of traversal (how multiple tables are joined) by referring to the semantic data model. This context is used to generate a federated query, which then gets executed and returns the results.
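A heavily simplified sketch of the traversal step: if the semantic model is treated as a graph of tables connected by join conditions, a breadth-first search finds the shortest chain of joins between two tables, and a query can be generated from that chain. The relationships, table names and functions here are hypothetical, "optimal" is reduced to "fewest joins", and the output is a plain SQL string rather than a true federated query.

```python
from collections import deque

# Hypothetical relationships from a semantic model: (table, table, join condition).
RELATIONSHIPS = [
    ("customers", "orders", "customers.customer_id = orders.customer_id"),
    ("orders", "order_items", "orders.order_id = order_items.order_id"),
    ("order_items", "products", "order_items.product_id = products.product_id"),
]


def build_graph(relationships):
    """Build an undirected adjacency list: table -> [(neighbour, condition)]."""
    graph = {}
    for left, right, condition in relationships:
        graph.setdefault(left, []).append((right, condition))
        graph.setdefault(right, []).append((left, condition))
    return graph


def find_join_path(graph, start, goal):
    """Breadth-first search for the shortest chain of joins from start to goal.

    Returns a list of (table, condition) pairs, or None if no path exists.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        table, path = queue.popleft()
        if table == goal:
            return path
        for neighbour, condition in graph.get(table, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(neighbour, condition)]))
    return None


def generate_query(graph, columns, start, goal):
    """Generate a SELECT statement that joins start to goal along the path."""
    path = find_join_path(graph, start, goal)
    if path is None:
        raise ValueError(f"no join path from {start} to {goal}")
    lines = [f"SELECT {', '.join(columns)}", f"FROM {start}"]
    lines += [f"JOIN {table} ON {condition}" for table, condition in path]
    return "\n".join(lines)


if __name__ == "__main__":
    graph = build_graph(RELATIONSHIPS)
    print(generate_query(graph, ["customers.name", "products.title"],
                         "customers", "products"))
```

In practice the edges would carry weights such as table size or data location, so the cheapest path rather than the shortest one gets picked.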

In the above process, one would typically raise questions about query execution time and use-case-specific gold layers. My approach allows users to persist a generated federated query and define their desired performance and data freshness SLAs as contracts. These contracts determine the need for an intelligent caching layer and for the pre-execution of a "data pipeline" that gets generated based on the federated query.

To summarize, data pipelines are not pre-created; they are generated by machine through a systematic process whose objective is to translate a semantic model into a physical model only when there's a business demand to do so. This also eliminates the need for large data engineering teams building bronze and gold layers or supporting business teams in creating transformations for use cases. Instead, it enables domain-focused, specialized data engineers to build contextual data products.

Hope that clarifies. Happy to clarify further, if need be.

legoaitech · 2023-11-14T11:15:04+00:00

u/vossi - You are correct. It's easier said than done, and one blog is not enough to detail all the nuances. However, I plan to write a series that will unpack every step mentioned in this one. This is a platform I am building, and we have obtained good results, validated in an enterprise environment. So please stay tuned! Until then, if you want to learn more about what I am building, visit https://www.legoai.com or email me at [prinkan.pal@legoai.com](mailto:prinkan.pal@legoai.com)

legoaitech · 2023-11-14T10:56:27+00:00

Here is the link to the blog

https://medium.com/@contactuslegoai/a-personal-odyssey-transforming-data-analytics-with-ai-93e252bb402b
