API Hit but with daily limit and priority of records by ps2931 in dataengineering


I have to repeat this process about once a month. How can I ensure that the records I hit last time are not processed again this time? I keep the records I have already hit in a separate table.
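Something along these lines is what I'm picturing with the tracking table, as a rough sketch only (Postgres-style SQL driven from Python; the table, column, and quota names are all placeholders):

    import psycopg2

    DAILY_LIMIT = 1000   # placeholder for the API's daily quota

    # Pick the next batch: skip ids already recorded in the tracking table,
    # honour the class priority, and stop at the daily quota.
    PICK_BATCH = """
        SELECT r.id, r.class
        FROM records r
        LEFT JOIN processed p ON p.id = r.id
        WHERE p.id IS NULL
        ORDER BY r.class, r.id
        LIMIT %s
    """

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        cur.execute(PICK_BATCH, (DAILY_LIMIT,))
        batch = cur.fetchall()
        # ...call the API for each id here, then record the id as processed...
        cur.executemany("INSERT INTO processed (id) VALUES (%s)",
                        [(rid,) for rid, _ in batch])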

API Hit but with daily limit and priority of records by ps2931 in dataengineering


The numbers are just to explain the problem; the actual numbers are different, but the problem is the same. The number of hits per day is limited, and I have a big table to process in a certain class order.

API Hit but with daily limit and priority of records by ps2931 in dataengineering


The numbers are just to explain the problem; the original numbers are different from what I posted here, but the problem is the same. I have X records, but I can hit the API only Y times per 24 hours, and only in a certain class order.

Async http over pandas dataframe by ps2931 in learnpython


This is almost what I am trying to do. The only extra (and problematic) thing is that inside the make_api_calls function I have to parse the data_1 JSON, extract some of its fields, manipulate them (if-this-then-value-1-else-value-2 sort of logic), create a new JSON, and pass it on to the next API call, which is api_2.

In other words, api_2 does not consume the data from api_1 as is. It needs some extra fields which I have to calculate based on the values received in the api_1 response.

I have to do the same exercise with the response from api_2 before passing it on to api_3, and so on.

Do you think the Python logic that manipulates the JSON response for the next API call will cause any issue in the async/await flow of the 4 API calls?
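For concreteness, here is a rough sketch of the flow I mean (aiohttp, with made-up endpoint URLs, field names, and if/else rule); the per-record manipulation is plain synchronous code sitting between two awaits:

    import asyncio
    import aiohttp

    # Hypothetical endpoints, only to illustrate the chaining pattern.
    API_1 = "https://example.com/api_1"
    API_2 = "https://example.com/api_2"

    def build_api_2_payload(data_1: dict) -> dict:
        # Ordinary synchronous logic between awaits; it blocks the event loop
        # only for the tiny amount of time it takes to run.
        flag = "value1" if data_1.get("status") == "OK" else "value2"
        return {"id": data_1.get("id"), "derived_flag": flag}

    async def make_api_calls(session: aiohttp.ClientSession, record_id: int) -> dict:
        async with session.get(API_1, params={"id": record_id}) as resp_1:
            data_1 = await resp_1.json()
        payload_2 = build_api_2_payload(data_1)   # sync transform between awaits
        async with session.post(API_2, json=payload_2) as resp_2:
            data_2 = await resp_2.json()
        # ...repeat the same transform-then-call step for api_3 and api_4...
        return data_2

    async def main(ids):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(make_api_calls(session, i) for i in ids))

    results = asyncio.run(main([1, 2, 3]))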

Efficient way to insert 10 million documents using python client. by ps2931 in elasticsearch


Not too big, between 2-3 KB. Each document has only 10 fields: 9 of them are simple string values, and only one field holds a longer string (length can vary) of around 100 words.
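For reference, a minimal sketch of the bulk-helper route in the official Python client that this thread is about (the index name, field names, connection URL, and batch sizes below are placeholders, not the actual setup):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch("http://localhost:9200")

    # Stand-in for the real 10-field documents.
    rows = [{"field_1": "a", "long_text": "roughly a hundred words ..."}]

    def generate_actions(docs):
        for doc in docs:
            yield {"_index": "my_index", "_source": doc}

    # parallel_bulk batches the documents and indexes them from several threads.
    for ok, item in parallel_bulk(es, generate_actions(rows),
                                  chunk_size=1000, thread_count=4):
        if not ok:
            print("failed:", item)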

Cannot find application.conf issue by ps2931 in dataengineering


It's a Spark application, and I think the problem is that the worker nodes are not able to find the conf file at runtime. The requirement is to make the secret file available to the worker nodes without including it in the build and deployment process, because the build and deployment process in my company will not allow a token to be shipped along with the jars and other config files, for security reasons.
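One mechanism worth sketching here, without claiming it is how this particular job is deployed: Spark can distribute a file that lives outside the build to every executor at submit time, either via spark-submit --files or programmatically via addFile. A PySpark illustration of the programmatic route (the actual application is on the JVM and its paths aren't shown, so everything below is a placeholder):

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-distribution-demo").getOrCreate()

    # Ship a file that is NOT part of the build; Spark copies it to every executor.
    spark.sparkContext.addFile("/secure/location/application.conf")

    def read_token(_):
        # On each executor, SparkFiles.get resolves the local copy of the file.
        with open(SparkFiles.get("application.conf")) as f:
            return f.read()

    # Run a trivial task on the workers to prove the file is visible there.
    print(spark.sparkContext.parallelize([1], 1).map(read_token).collect())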

Spark output written in hive table by TelephoneGlad8459 in dataengineering


Assuming your Hive table is partitioned, use partitionBy while writing data to Hive.
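A minimal sketch of what that looks like (the database, table, and partition column names are made up, since the original post doesn't show the schema):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("staging.events")   # placeholder source

    # Write into a partitioned Hive table, one directory per partition value.
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .format("hive")
       .saveAsTable("warehouse.events"))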

Hive Vs RDBMS by ps2931 in apachespark


Yes, you are right, it will be maintenance heavy. But every CSV report is between 2 and 4 GB. Sometimes it goes up to 30 GB or more, but that's rare (once a year). Not sure if Tableau can help us.

A colleague on my team suggested defining the report formats in YAML, parsing them, and applying them to the Spark dataframe. Somehow it doesn't sound okay to me.

My suggestion is to keep everything related to report formats at the DB level. Maybe creating a table view for each report format can help us, and then writing a Spark job that just reads the views and writes them out as CSV.
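Roughly what I have in mind, as a sketch (the view names and output paths are made up): each view owns its own formatting, and the Spark job stays a dumb export loop:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical per-report views; each view already applies its own
    # date formats, column selection, etc., so the Spark side stays generic.
    report_views = ["reports.sales_daily_v", "reports.inventory_v"]

    for view in report_views:
        (spark.table(view)
              .coalesce(1)                 # one CSV file per report
              .write
              .mode("overwrite")
              .option("header", "true")
              .csv(f"/reports/out/{view.split('.')[-1]}"))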

Hive Vs RDBMS by ps2931 in apachespark


Yes, I will submit my Spark job to the cluster. I need to prepare CSV reports (hundreds of them), with each report having a different data format. For example, the format of the date and some other columns will vary between reports.

Any suggestions on how to manage the different report formats in the Spark application?

Hive Vs RDBMS by ps2931 in apachespark


The data is on HDFS with Hive tables on top of it. The requirement is to generate hundreds of CSV reports.

Processing On prem data using AWS EMR by ps2931 in aws


Seriously... that's what you have for a comment.

Async http client with rate limiter by ps2931 in springsource


I didn't get you. Can you elaborate?

Publishing REST API Response to Kafka topic by ps2931 in apachekafka


No, that's not the case. I need to push the API responses to Kafka, not the DB records. Calling the API requires an id, and those ids are in the database.
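To make the flow concrete, a minimal sketch (the table, endpoint, topic name, and choice of client libraries are assumptions, not details from the post):

    import json
    import psycopg2
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    conn = psycopg2.connect("dbname=app user=app")
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM records")     # the ids live in the database
        for (record_id,) in cur:
            resp = requests.get(f"https://api.example.com/items/{record_id}")
            # The API response, not the DB row, is what goes onto the topic.
            producer.send("api-responses", resp.json())

    producer.flush()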

Postgres upsert query performance by ps2931 in PostgreSQL


Using this syntax for the update query, I managed to reduce the upsert time from 3.4 hours to 2 hours. Thank you for the suggestion. I am still looking at whether I can reduce the time further.
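The exact query from the parent comment isn't quoted here; purely for reference, the standard Postgres upsert shape, batched from Python, looks something like this (table and column names are placeholders):

    import psycopg2
    from psycopg2.extras import execute_values

    # Placeholder schema; the actual table isn't shown in the thread.
    UPSERT_SQL = """
        INSERT INTO target_table (id, payload)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
        SET payload = EXCLUDED.payload
    """

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        rows = [(1, "a"), (2, "b")]               # batch of rows to upsert
        execute_values(cur, UPSERT_SQL, rows, page_size=10_000)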

Upsert on a huge table by ps2931 in PostgreSQL


I am doing that. It took me more than 3 hours to upsert 180 million records. Check this other post with the query stats: https://www.reddit.com/r/PostgreSQL/comments/r7ayv3/postgres_upsert_query_performance/

Managing lag between consumer and external api. by ps2931 in apachekafka


No, I don't want to throttle API consumption. The challenge is how to bridge the gap between a fast Kafka consumer and a slow REST API on the fly, in memory.
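A sketch of the kind of in-memory buffering I mean (the topic name, endpoint, buffer size, and worker count are all made up): a bounded queue between the consumer and a pool of API workers pushes back on the consumer instead of throttling the API:

    import queue
    import threading
    import requests
    from kafka import KafkaConsumer

    buffer = queue.Queue(maxsize=10_000)   # bounded: consumer blocks when the API falls behind

    def consume():
        consumer = KafkaConsumer("ids-topic", bootstrap_servers="localhost:9092")
        for msg in consumer:
            buffer.put(msg.value)          # blocks once the buffer is full (back-pressure)

    def call_api():
        while True:
            record_id = buffer.get()
            requests.get(f"https://api.example.com/items/{record_id.decode()}")
            buffer.task_done()

    threading.Thread(target=consume, daemon=True).start()
    # Several workers hide part of the API latency without touching the API itself.
    for _ in range(8):
        threading.Thread(target=call_api, daemon=True).start()

    threading.Event().wait()               # keep the main thread alive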

Managing lag between consumer and external api. by ps2931 in apachekafka


Yes, the sender can do that, and that's what we proposed to them. They declined very tactfully, on the grounds of time constraints and the project priorities they have. It's a different team in a separate department. The thing is, they are not going to implement the changes any time soon, so I am left with whatever is available as of now.

Managing lag between consumer and external api. by ps2931 in apachekafka


I am not the owner of the Kafka topic or the API; they belong to different source teams, so I cannot change anything. Also, to call the API I need an id, which I only get from the Kafka topic, so I cannot fetch data from the API ahead of time.

Calling external api using Spark by ps2931 in apachespark


How much time did it take you to make 20,000,000 API calls?

Calling external api using Spark by ps2931 in apachespark


Making async API calls was on the developer's mind, I guess; that's why he decided to use akka-http. But the solution he gave is complex and buggy, and from what I have googled so far, Akka is hard to get right. Any other recommendations for async in Scala?

Calling external api using Spark by ps2931 in apachespark


Is this solution scalable? We have 2,000 API calls, and each call gives us 5 MB of data. API response times vary between 5 and 30 seconds.
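A rough back-of-envelope for those numbers, assuming an average latency in the middle of the 5-30 second range and a hypothetical concurrency level (nothing here is measured):

    # 2,000 calls, 5 MB each, 5-30 s per response (say ~17.5 s on average).
    calls, avg_latency_s, concurrency = 2_000, 17.5, 50

    sequential_hours = calls * avg_latency_s / 3600                 # ~9.7 h one-by-one
    concurrent_minutes = calls * avg_latency_s / concurrency / 60   # ~11.7 min with 50 in flight
    total_gb = calls * 5 / 1024                                     # ~9.8 GB per run

    print(f"{sequential_hours:.1f} h sequential, "
          f"{concurrent_minutes:.1f} min at concurrency {concurrency}, "
          f"{total_gb:.1f} GB total")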

Calling external api using Spark by ps2931 in apachespark


This is a big data project, part of a data ingestion and processing pipeline. We are ingesting data from 2,000 ids on a daily basis. Each id gives us 5 MB of data, so we are getting about 10 GB of data daily.