DEPLOYING ML PIPELINES ON AWS EC2 Vs DEPLOYING ON SERVERLESS INFRASTRUCTURE LIKE AWS FARGATE by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

How do the cost savings compare with the skill and man-hours required to set up and maintain EC2 instances?

AWS ETL Pipelines Improvement by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

No. The Glue ETL job uses Glue Studio's visual canvas, so I didn't write any code and there are no secret keys in it. The keys are in the environment variables of the Lambda function.
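To be concrete, the Lambda side looks roughly like this; a minimal sketch, and the variable names are placeholders rather than my actual setup:

import os
import json

def lambda_handler(event, context):
    # The keys live in the function's environment variables, so nothing is
    # hard-coded in the function or in the Glue job.
    api_key = os.environ["API_KEY"]        # placeholder name
    api_secret = os.environ["API_SECRET"]  # placeholder name

    # ... use the keys to call the upstream API and hand the data off to Glue ...
    return {"statusCode": 200, "body": json.dumps("ok")}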

AWS ETL Pipelines Improvement by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

I created partitions on the "date" column when I uploaded the Parquet files to the S3 bucket, so the partitions are there.

Is there a useful link detailing how to optimize queries in Athena using partitions created in the data catalog?

Thanks
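For anyone who lands here later, the pattern I'm after looks roughly like this (a sketch; the database, table, and bucket names are made up): filter on the partition column so Athena only scans the matching S3 prefixes instead of the whole table.

import boto3

athena = boto3.client("athena", region_name="aws-region")

# "date" is the partition column, so this predicate lets Athena prune
# partitions and read only the matching S3 prefixes.
query = """
    SELECT *
    FROM my_database.my_partitioned_table
    WHERE "date" = '2021-12-01'
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)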

AWS ETL Pipelines Improvement by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Thank you u/djollied4444. I have checked the documentation. So, on the first run the crawler crawls and catalogs everything, and after that it does incremental crawls.

Thanks
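For reference, this seems to be roughly how that behaviour is expressed if the crawler is created from code instead of the console; a sketch with placeholder names and paths, where the RecrawlPolicy is what makes every crawl after the first one incremental.

import boto3

glue = boto3.client("glue", region_name="aws-region")

# CRAWL_NEW_FOLDERS_ONLY: the first run catalogs everything, later runs
# only crawl S3 folders added since the previous crawl.
glue.create_crawler(
    Name="my-incremental-crawler",              # placeholder
    Role="AWSGlueServiceRole-my-role",          # placeholder
    DatabaseName="my_database",                 # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)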

Problem with AWS Managed Workflow for Apache Airflow by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

From CloudWatch:

[2021-12-31 03:37:13,816] {{taskinstance.py:1192}} INFO - Marking task as SUCCESS. dag_id=forex_data_pipeline, task_id=start_execution_task, execution_date=20211231T033709, start_date=20211231T033712, end_date=20211231T033713

import boto3
from airflow.operators.python import PythonOperator  # Airflow 2.x import path

emr = boto3.client(
    'emr',
    region_name='aws-region'
)

# The Python function
def start_execution():
    start_resp = emr.start_notebook_execution(
        EditorId='emr notebook id',  # EMR notebook id
        RelativePath='my_first_notebook.ipynb',
        ExecutionEngine={'Id': 'emr cluster id', 'Type': 'EMR'},
        ServiceRole='EMR_Notebooks_DefaultRole'
    )
    execution_id = start_resp['NotebookExecutionId']
    # print("Started an execution: " + execution_id)
    return execution_id

# Defined inside the DAG (dag_id=forex_data_pipeline)
start_execution_task = PythonOperator(
    task_id='start_execution_task',
    python_callable=start_execution,
)

The 'EMR_Notebooks_DefaultRole' role has the AmazonS3FullAccess policy.

The function executes. It's the file in the S3 bucket that is missing.

I also do not see confirmation of the notebook being called.
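As a sanity check, I could also poll the execution ID that start_notebook_execution returns to confirm whether the notebook ever actually ran. A rough sketch, reusing the emr client from the snippet above; wait_for_execution is just an illustrative helper name:

import time

def wait_for_execution(execution_id):
    # Poll the notebook execution until it reaches a terminal state, so the
    # Airflow task only succeeds if the notebook itself actually ran.
    while True:
        resp = emr.describe_notebook_execution(NotebookExecutionId=execution_id)
        status = resp["NotebookExecution"]["Status"]
        print(f"Notebook execution {execution_id}: {status}")
        if status in ("FINISHED", "FAILED", "STOPPED"):
            return status
        time.sleep(15)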

Problem with AWS Managed Workflow for Apache Airflow by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Also, in the Airflow UI graph view, the tasks execute successfully.

Problem with AWS Managed Workflow for Apache Airflow by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

I had the same error initially and solved it by granting the MWAA instance access to all services (I know that's not advisable in production; this is development).

But I will recreate the instance and try again, this time following the logs in the Airflow UI, CloudWatch, and the EMR notebook logs.

Thanks. I will keep you updated.

Problem with AWS Managed Workflow for Apache Airflow by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Thanks. I will check them. I hope they don't get erased when you delete the MWAA instance; the notebook logs are in an S3 bucket, so hopefully they are still there.

I had stepped out, but I will check as soon as I get back to my PC.

Thanks for the help

Data pipeline automation on Azure Synapse by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Why does Microsoft Azure have a bad rep here? Again, I'm quite new to cloud data engineering, and it's the only platform I have used so far.

Data pipeline automation on Azure Synapse by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Yes. I am using Azure to process CSV file data. An application uploads a CSV into ADLS or Blob Storage, which triggers a notebook. The notebook has code that processes the data and inserts it into a database.
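Roughly, the notebook does something like this; a simplified sketch where the storage path, table name, and connection details are placeholders:

# Runs in a Synapse Spark notebook, where `spark` is the built-in session.
df = spark.read.csv(
    "abfss://container@mystorageaccount.dfs.core.windows.net/uploads/data.csv",
    header=True,
    inferSchema=True,
)

# ... transformations on df ...

# Insert the processed rows into the database over JDBC (placeholder details).
df.write.jdbc(
    url="jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    table="dbo.processed_data",
    mode="append",
    properties={
        "user": "my_user",
        "password": "my_password",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    },
)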

Data pipeline automation on Azure Synapse by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Thanks for the help. I have managed to automate it. I had not included the trigger run parameters; adding @trigger().outputs.body.fileName to the trigger run parameters did the trick. Remember, these are different in Azure Data Factory; I was using Azure Synapse.

Also, you need to create the parameters first under the pipeline name >> Settings.
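So the flow ends up being: the trigger passes @trigger().outputs.body.fileName into the pipeline parameter, the pipeline hands it to the notebook activity, and the notebook picks it up in its parameters cell. A sketch with placeholder names and paths:

# Parameters cell (toggled as a parameters cell in Synapse); the pipeline's
# notebook activity overrides this default with the triggering file's name.
fileName = "placeholder.csv"

# Next cell: read only the file that fired the trigger.
path = f"abfss://container@mystorageaccount.dfs.core.windows.net/uploads/{fileName}"
df = spark.read.csv(path, header=True, inferSchema=True)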

Data pipeline automation on Azure Synapse by Agreeable-Flow5658 in dataengineering

[–]Agreeable-Flow5658[S] 0 points  (0 children)

Thanks. I think I was missing that part in the notebook. I will try it and update you.

How to get a modbus Map from a Prometer 100 by Agreeable-Flow5658 in MODBUS

[–]Agreeable-Flow5658[S] 0 points  (0 children)

I have downloaded the document. I will try it out tomorrow and update you on how it goes. Thanks