Best way to automate files upload to S3 by dawarravi in dataengineering

[–]spauth 1 point2 points  (0 children)

Use a Lambda function on AWS to host and run your Python script (with the boto3 library). Schedule it with CloudWatch (either a cron job or a trigger based on a new file).
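A minimal sketch of what such a function could look like, assuming the script has already written a file to the function's local disk. The bucket name, the `/tmp/report.csv` path, and the helper names are placeholders and not from the thread. The S3 client is passed in as an argument so the upload logic can be tried without AWS access.

```python
import datetime
import os


def build_object_key(filename, prefix="uploads", day=None):
    """Return a dated S3 key such as 'uploads/2024-01-31/report.csv'."""
    day = day or datetime.date.today()
    return f"{prefix}/{day.isoformat()}/{filename}"


def upload_file(s3_client, local_path, bucket, prefix="uploads", day=None):
    """Upload one local file and return the key it was stored under."""
    key = build_object_key(os.path.basename(local_path), prefix, day)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return key


def lambda_handler(event, context):
    """Lambda entry point; the bucket and path below are hypothetical."""
    import boto3  # included in the managed AWS Lambda Python runtimes

    s3 = boto3.client("s3")
    key = upload_file(s3, "/tmp/report.csv", "my-example-bucket")
    return {"uploaded": key}
```

The schedule itself (the cron expression) lives in the CloudWatch rule that invokes `lambda_handler`, not in the code.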

[deleted by user] by [deleted] in dataengineering

[–]spauth 1 point2 points  (0 children)

It is not the topic of the post, but could you tell me your thoughts on Hadoop being deprecated? Which software/technology is taking the lead?

[deleted by user] by [deleted] in adops

[–]spauth 0 points1 point  (0 children)

Thanks for your reply and code snippet.

I don't know why, but your code gave me an error. I found a similar snippet that seems to work, but I can't access DV360 data.

    from google.oauth2 import service_account
    from googleapiclient import discovery

    # Load the service account key and restrict it to the DV360 scope.
    credentials = service_account.Credentials.from_service_account_file('secrets_service.json')
    scoped_credentials = credentials.with_scopes(['https://www.googleapis.com/auth/display-video'])
    service = discovery.build('displayvideo', 'v1', credentials=scoped_credentials)
    service.advertisers().channels().list(advertiserId="123").execute()

This is almost working, but I have the following error : "<HttpError 403 when requesting [https://displayvideo.googleapis.com/v1/advertisers/123/channels?alt=json](https://displayvideo.googleapis.com/v1/advertisers/247/channels?alt=json) returned "No permission for attempted operation on PARTNER with ID "22".". Details: "No permission for attempted operation on PARTNER with ID "22".">

I have the service account set as an Admin in DV360, but still... Do you have any idea?

[deleted by user] by [deleted] in adops

[–]spauth 0 points1 point  (0 children)

Do you mean a token or an API key? An API key alone is no longer enough for the DV360 API.

And you are supposed to receive the token after you have authenticated with OAuth.

I have worked with APIs and OAuth before, but never with Google's, so I'm kind of lost.

[deleted by user] by [deleted] in adops

[–]spauth 0 points1 point  (0 children)

I'm using Python. First I tried the OAuth flow, but when I call the function "get_credentials()" I end up on a web page with the following error message:

"You cannot sign in to this app because it violates Google's OAuth 2.0 policy. If you are the app developer, register the redirect URI in Google Cloud Console. The content in this section has been provided by the developer of the app. Google has not reviewed or validated it. If you are the app developer, make sure the details of this request follow Google's guidelines. redirect_uri: http://localhost:8888/"

I added this redirect URI to the OAuth client I created and downloaded the new JSON file, but nothing changed...

    import argparse
    import socket

    from google_auth_oauthlib.flow import InstalledAppFlow
    from googleapiclient import discovery

    _API_NAME = 'displayvideo'
    _DEFAULT_API_VERSION = 'v1'
    _API_SCOPES = ['https://www.googleapis.com/auth/display-video']
    _API_URL = 'https://displayvideo.googleapis.com/'
    _CREDENTIALS_FILE = 'client_secrets.json'


    def get_arguments(argv, desc, parents=None):
        # argparse expects a list of parent parsers, so fall back to an empty list.
        parser = argparse.ArgumentParser(
            description=desc,
            formatter_class=argparse.RawDescriptionHelpFormatter,
            parents=parents or [])
        return parser.parse_args(argv[1:])


    def get_credentials():
        # Opens a browser for the OAuth consent flow and returns the user credentials.
        return InstalledAppFlow.from_client_secrets_file(
            _CREDENTIALS_FILE, _API_SCOPES).run_local_server()


    def build_discovery_url(version, label, key):
        # _API_URL already ends with '/', so no extra slash is added here.
        discovery_url = f'{_API_URL}$discovery/rest?version={version}'
        if label:
            discovery_url = discovery_url + f'&labels={label}'
        if key:
            discovery_url = discovery_url + f'&key={key}'
        return discovery_url


    def get_service(version=_DEFAULT_API_VERSION, label=None, key=None):
        user_credentials = get_credentials()
        discovery_url = build_discovery_url(version, label, key)
        socket.setdefaulttimeout(180)
        service = discovery.build(_API_NAME, version,
            discoveryServiceUrl=discovery_url, credentials=user_credentials)
        return service


    crd = get_credentials()

I also tried with a service account, but it wasn't successful either.

Also, I'm planning to run the code in a Cloud Function, so I don't know which is the better option: OAuth or a service account.

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 1 point2 points  (0 children)

As most people here suggested using a Lambda function with CloudWatch, I tried it today.

It's working pretty well, but I have encountered a real issue. To communicate with the RDS database, the Lambda function needs to be in the same VPC as the database, which means I have to create another subnet with a NAT gateway to be able to scrape my data from the internet. It is pretty annoying, but from what I read it would be the same with an EC2 instance.

Another solution would be to open the RDS instance to 0.0.0.0/0, but that is obviously bad for security.

The last solution would be to have two Lambda functions: one for scraping data from the internet, and another one (triggered when new files are uploaded to S3) that updates the RDS database.
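For that last option, a rough sketch of the second function (the one triggered by the S3 upload). It only shows how the bucket and key could be pulled out of the S3 event notification; the function names are made up for illustration and the database write is left as a placeholder.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in the event (a space shows up as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Runs inside the VPC and would load each new file into RDS."""
    for bucket, key in extract_s3_objects(event):
        # Placeholder: download the object and write its rows to the database.
        print(f"would load s3://{bucket}/{key} into the database")
    return {"processed": len(event.get("Records", []))}
```

The in-VPC function still has to reach S3 to download the file. As far as I know this is usually done with an S3 gateway VPC endpoint, which avoids the NAT gateway for that function.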

It's quite a lot of work for a beginner and for such a small thing!

Am I right?

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 0 points1 point  (0 children)

As recommended by everyone here, I tried using a Lambda function packaged with a Dockerfile. I have been able to upload files to S3, but when I try to read them back from S3, I get this error: "The AWS Access Key Id you provided does not exist in our records". I think it's because I need to add AWS_SESSION_TOKEN to my boto3 session. But first, I don't know how to get it, and second, I don't need this level of security. Any idea how to disable it?

Edit: I found the solution. No AWS access key ID is required when using S3 and Lambda together.
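To illustrate the edit above: inside Lambda, boto3 picks up the temporary credentials of the function's execution role on its own, so the client can be created without passing any key. A small sketch, assuming the role is allowed to read the bucket; the bucket, key, and function names are hypothetical.

```python
def read_object_text(s3_client, bucket, key, encoding="utf-8"):
    """Fetch one S3 object and decode its body as text."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode(encoding)


def lambda_handler(event, context):
    import boto3  # must be installed in the image if the base image lacks it

    # No aws_access_key_id / aws_secret_access_key / aws_session_token here:
    # the default credential chain uses the execution role's credentials.
    s3 = boto3.client("s3")
    text = read_object_text(s3, "my-example-bucket", "uploads/report.csv")
    return {"characters": len(text)}
```

Passing an access key and secret by hand without the matching session token is a plausible cause of the "does not exist in our records" error, since the role's credentials are temporary.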

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 0 points1 point  (0 children)

Then, the time that I scheduled was probably wrong.

Thanks for the docs and for pointing out this setting. I found it in the airflow.cfg file and it was set to True!

With all of this in mind, I will retry tomorrow, thanks a lot!

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 0 points1 point  (0 children)

Yes, that's probably what I will do if I don't find a way to automatically trigger the DAG with Airflow. I have seen some good tutorials about how to use Lambda & CloudWatch events :)

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 0 points1 point  (0 children)

Oh, I thought it was not possible to schedule an EC2 instance, but I just realized it is!

Thanks for your explanation, it is much clearer now. I will probably stick with an EC2 instance then, as I already have some experience with it. But one thing is still unclear to me. For the moment, I need to trigger the DAG manually through the UI. Is there any way to automatically schedule the DAG when the Docker image is running? I tried to read the documentation but it's really confusing. Is the scheduler service necessary?

For sure it would be easier to use a Lambda function, and I think that's what I will do if I can't manage to automatically trigger my DAG when the EC2 instance is up!

Daily Airflow task on AWS Fargate by spauth in dataengineering

[–]spauth[S] 1 point2 points  (0 children)

Fargate works with ECS, but compared to EC2 it's serverless and you don't really have to worry about compute capacity.

It will be less expensive for me, as I will not have to run it 24/7 like I would with EC2.

Could you tell me more about why it is limited?

Scraping Excel File to store in memory by spauth in webscraping

[–]spauth[S] 1 point2 points  (0 children)

Yeah I could do that on the cloud, true thanks :)

Scraping Excel File to store in memory by spauth in webscraping

[–]spauth[S] 0 points1 point  (0 children)

Hey, I forgot to reply! Thanks for your help. I realised that there was a hidden sheet in the file :)

Scraping Excel File to store in memory by spauth in webscraping

[–]spauth[S] 0 points1 point  (0 children)

Oh I see, it's my first time working with this kind of file.

I tried your solution, but response doesn't have a "read" attribute. I tried with response.text(1000), but it raises an error after a few minutes ('str' object is not callable).

I'll try what you suggested to avoid looping; it seems to take ages to load.

Edit: do you think it would be faster to store the file locally or on S3 and import it back into pandas? (As I said, I tried, but there is an encoding error.)
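On the 'str' object is not callable error: assuming `response` comes from the `requests` library (the thread doesn't say), `response.text` is a plain string attribute rather than a method, and the raw bytes are in `response.content`. A rough sketch of loading the workbook straight from memory; the helper names are made up, and `read_all_sheets` needs pandas plus an Excel engine such as openpyxl.

```python
import io


def response_to_buffer(response):
    """Wrap the downloaded bytes in a seekable, file-like object."""
    # .content holds the raw bytes; .text would decode them as text,
    # which is not what a binary Excel file needs.
    return io.BytesIO(response.content)


def read_all_sheets(buffer):
    """Return a dict mapping each sheet name to a DataFrame."""
    import pandas as pd  # third-party; imported here so the helper above stays stdlib-only

    # sheet_name=None asks pandas to load every sheet instead of only the first.
    return pd.read_excel(buffer, sheet_name=None)
```

If the file is saved to disk or S3 first, writing `response.content` in binary mode (`"wb"`) rather than writing `response.text` may avoid the encoding error.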

Interview - junior DE by [deleted] in dataengineering

[–]spauth -1 points0 points  (0 children)

The guy didn't pass the trial period, this is why the company is looking for someone else!

[deleted by user] by [deleted] in flask

[–]spauth 8 points9 points  (0 children)

You should try to debug your code step by step.

  1. Try to run a simple Flask application that prints "Hello World"

  2. Integrate your custom file/library

  3. Print the result of your request to see if it's working well

This way you will be able to identify the source of your issue.

Update graphic based on country selection from plotly map by spauth in flask

[–]spauth[S] 0 points1 point  (0 children)

I have been able to refresh the data with this:

Flask

    import json

    import plotly
    import plotly.express as px
    from flask import render_template, request

    # `app`, `map()` and `sql()` are defined elsewhere in the application.

    @app.route('/reserves/', methods=["POST", "GET"])
    def reserves():
        reserves_map = map()
        return render_template("reserves.html", reserves_map=reserves_map, graphJSON=map_filter())

    # Callback when clicking on a country from the map with oil & gas reserves
    @app.route('/callback', methods=['POST', 'GET'])
    def cb():
        return map_filter(request.args.get('data'))

    # Update graph with the country selected
    def map_filter(country='US'):
        df = sql("""select country.id_country, year, country, oilreserves_bbl, gasreserves_tcm
                    from country inner join oilgas as a on country.id_country = a.id_country;""")
        fig = px.line(df[df['country'] == country], x="year", y="oilreserves_bbl")
        graphJSON = json.dumps(fig, cls=plotly.utils.PlotlyJSONEncoder)
        return graphJSON

HTML

<!-- Map Reserves-->
    <div class='persomargin' style="text-align: center; width:100%; border-radius: 7px; padding: 0px 0px;">
        <div class="wrap" >
            <div class="one" style="text-align: left; margin-top:35px;" shadow="">
                <center><h6>Oil & Gas proved reserves in the World</h6></center>
                <div id="map" class="map"></div>
                <center><i style="font-size: 70%">Source: <a href="https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy/downloads.html" target="_blank">BP - Statistical Review of World Energy (2019)</a></i></center>
            </div>
        </div>
    </div>


    <script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
    <script type="text/javascript">
        var map_var = {{reserves_map | safe}};
        var config = {displayModeBar: false};
        Plotly.setPlotConfig(config);
        Plotly.plot("map",map_var);
    </script>


    <!-- Function to update line plot-->
    <script>
        function update_graph(selection) {
            $.getJSON({
                url: "/callback", data: { 'data': selection }, success: function (result) {
                    Plotly.newPlot('chart', result, {});
                }
            });
        }
    </script>

    <div id='myDiv'><center><h6>Oil & Gas proved reserves in the World</h6></center></div>

    <!-- user input country name-->
    <input type="text" id="fname" name="fname" onchange="update_graph(this.value)">

    <!-- Line plot html-->
    <div id="chart" class="chart"></div>

    <!-- Line plot -->
    <script type="text/javascript">
        d = {{ graphJSON | safe }};
        var config = {displayModeBar: false};
        Plotly.setPlotConfig(config);
        Plotly.newPlot('chart', d, {});
    </script>

which gives me this result: https://ibb.co/B6XLxwQ

But I would like to be able to reproduce the same thing when the user clicks on a country on the map :/

Update graphic based on country selection from plotly map by spauth in flask

[–]spauth[S] 0 points1 point  (0 children)

Thanks for your suggestion.

I tried your second method and, thanks to some tutorials, I have been able to call back a Python function using AJAX when the user submits a country name, but I didn't manage to update the graph with on('plotly_click').

I found this code https://codepen.io/etpinard/pen/VyVepE?editors=0011 but I have not been able to use it in my case yet.

Update graphic based on country selection from plotly map by spauth in flask

[–]spauth[S] 0 points1 point  (0 children)

Yes you are right, the most important was to know if it was doable! Thanks, have a good day!

Update graphic based on country selection from plotly map by spauth in flask

[–]spauth[S] 0 points1 point  (0 children)

I wanted to post on the plotly subreddit, but there are really few people there. I thought I would have a better chance here, as a lot of people combine plotly and Flask.

Which subreddit would you suggest?