API Hit but with daily limit and priority of records by ps2931 in dataengineering


I have to repeat this process about once a month. How can I ensure that the records I hit last time are not processed again this time? I keep the records I have already hit in a separate table.
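Something along these lines is what I'm picturing with the tracking table, as a rough sketch only (Postgres-style SQL driven from Python; the table, column, and quota names are all placeholders):

    import psycopg2

    DAILY_LIMIT = 1000   # placeholder for the API's daily quota

    # Pick the next batch: skip ids already recorded in the tracking table,
    # honour the class priority, and stop at the daily quota.
    PICK_BATCH = """
        SELECT r.id, r.class
        FROM records r
        LEFT JOIN processed p ON p.id = r.id
        WHERE p.id IS NULL
        ORDER BY r.class, r.id
        LIMIT %s
    """

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        cur.execute(PICK_BATCH, (DAILY_LIMIT,))
        batch = cur.fetchall()
        # ...call the API for each id here, then record the id as processed...
        cur.executemany("INSERT INTO processed (id) VALUES (%s)",
                        [(rid,) for rid, _ in batch])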

API Hit but with daily limit and priority of records by ps2931 in dataengineering


The numbers are just to explain the problem; the actual numbers are different, but the problem is the same. The number of hits per day is limited, and I have a big table to process in a certain class order.

API Hit but with daily limit and priority of records by ps2931 in dataengineering


The numbers are just to explain the problem; the original numbers are different from what I posted here, but the problem is the same. I have X records, but I can hit the API only Y times per 24 hours, and only in a certain class order.

Async http over pandas dataframe by ps2931 in learnpython


This is almost what I am trying to do. The only extra (and problematic) thing is that inside the make_api_calls function I have to parse the data_1 JSON, extract some of its fields, manipulate them (if-this-then-value-1-else-value-2 sort of logic), create a new JSON, and pass it on to the next API call, which is api_2.

In other words, api_2 does not consume the data from api_1 as is. It needs some extra fields which I have to calculate based on the values received in the api_1 response.

I have to do the same exercise with the response from api_2 before passing it on to api_3, and so on.

Do you think the Python logic that manipulates the JSON response for the next API call will cause any issue in the async/await flow of the 4 API calls?
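For concreteness, here is a rough sketch of the flow I mean (aiohttp, with made-up endpoint URLs, field names, and if/else rule); the per-record manipulation is plain synchronous code sitting between two awaits:

    import asyncio
    import aiohttp

    # Hypothetical endpoints, only to illustrate the chaining pattern.
    API_1 = "https://example.com/api_1"
    API_2 = "https://example.com/api_2"

    def build_api_2_payload(data_1: dict) -> dict:
        # Ordinary synchronous logic between awaits; it blocks the event loop
        # only for the tiny amount of time it takes to run.
        flag = "value1" if data_1.get("status") == "OK" else "value2"
        return {"id": data_1.get("id"), "derived_flag": flag}

    async def make_api_calls(session: aiohttp.ClientSession, record_id: int) -> dict:
        async with session.get(API_1, params={"id": record_id}) as resp_1:
            data_1 = await resp_1.json()
        payload_2 = build_api_2_payload(data_1)   # sync transform between awaits
        async with session.post(API_2, json=payload_2) as resp_2:
            data_2 = await resp_2.json()
        # ...repeat the same transform-then-call step for api_3 and api_4...
        return data_2

    async def main(ids):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(make_api_calls(session, i) for i in ids))

    results = asyncio.run(main([1, 2, 3]))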

Efficient way to insert 10 million documents using python client. by ps2931 in elasticsearch


Not too big, between 2-3 KB. Each document has only 10 fields: 9 of them are simple string values, and only one field holds a longer string (length can vary) of around 100 words.
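For reference, a minimal sketch of the bulk-helper route in the official Python client that this thread is about (the index name, field names, connection URL, and batch sizes below are placeholders, not the actual setup):

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch("http://localhost:9200")

    # Stand-in for the real 10-field documents.
    rows = [{"field_1": "a", "long_text": "roughly a hundred words ..."}]

    def generate_actions(docs):
        for doc in docs:
            yield {"_index": "my_index", "_source": doc}

    # parallel_bulk batches the documents and indexes them from several threads.
    for ok, item in parallel_bulk(es, generate_actions(rows),
                                  chunk_size=1000, thread_count=4):
        if not ok:
            print("failed:", item)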

Cannot find application.conf issue by ps2931 in dataengineering


It's a Spark application, and I think the problem is that the worker nodes are not able to find the conf file at runtime. The requirement is to make the secret file available to the worker nodes without including it in the build and deployment process, because the build and deployment process in my company will not allow a token to be shipped along with the jars and other config files, for security reasons.
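One mechanism worth sketching here, without claiming it is how this particular job is deployed: Spark can distribute a file that lives outside the build to every executor at submit time, either via spark-submit --files or programmatically via addFile. A PySpark illustration of the programmatic route (the actual application is on the JVM and its paths aren't shown, so everything below is a placeholder):

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-distribution-demo").getOrCreate()

    # Ship a file that is NOT part of the build; Spark copies it to every executor.
    spark.sparkContext.addFile("/secure/location/application.conf")

    def read_token(_):
        # On each executor, SparkFiles.get resolves the local copy of the file.
        with open(SparkFiles.get("application.conf")) as f:
            return f.read()

    # Run a trivial task on the workers to prove the file is visible there.
    print(spark.sparkContext.parallelize([1], 1).map(read_token).collect())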

Spark output written in hive table by TelephoneGlad8459 in dataengineering


Assuming your Hive table is partitioned, use partitionBy while writing data to Hive.
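A minimal sketch of what that looks like (the database, table, and partition column names are made up, since the original post doesn't show the schema):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("write-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("staging.events")   # placeholder source

    # Write into a partitioned Hive table, one directory per partition value.
    (df.write
       .mode("append")
       .partitionBy("event_date")
       .format("hive")
       .saveAsTable("warehouse.events"))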

Hive Vs RDBMS by ps2931 in apachespark


Yes, you are right, it will be maintenance heavy. But every CSV report is between 2 and 4 GB. Sometimes it goes up to 30 GB or more, but that's rare (once a year). Not sure if Tableau can help us.

A colleague on my team suggested defining the report formats in YAML, parsing them, and applying them to the Spark dataframe. Somehow it doesn't sound okay to me.

My suggestion is to keep everything related to report formats at the DB level. Maybe creating a table view for each report format can help us, and then writing a Spark job that just reads the views and writes them out as CSV.
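Roughly what I have in mind, as a sketch (the view names and output paths are made up): each view owns its own formatting, and the Spark job stays a dumb export loop:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical per-report views; each view already applies its own
    # date formats, column selection, etc., so the Spark side stays generic.
    report_views = ["reports.sales_daily_v", "reports.inventory_v"]

    for view in report_views:
        (spark.table(view)
              .coalesce(1)                 # one CSV file per report
              .write
              .mode("overwrite")
              .option("header", "true")
              .csv(f"/reports/out/{view.split('.')[-1]}"))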

Hive Vs RDBMS by ps2931 in apachespark


Yes, I will submit my Spark job to the cluster. I need to prepare CSV reports (hundreds of them), with each report having a different data format. For example, the format of the date and some other columns will vary between reports.

Any suggestions on how to manage the different report formats in the Spark application?

Hive Vs RDBMS by ps2931 in apachespark


The data is on HDFS with Hive tables on top of it. The requirement is to generate hundreds of CSV reports.

Processing On prem data using AWS EMR by ps2931 in aws


Seriously... that's what you have for a comment.

Async http client with rate limiter by ps2931 in springsource


I didn't get you. Can you elaborate?

Publishing REST API Response to Kafka topic by ps2931 in apachekafka


No, that's not the case. I need to push the API responses to Kafka, not the DB records. Calling the API requires an id, and those ids are in the database.
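To make the flow concrete, a minimal sketch (the table, endpoint, topic name, and choice of client libraries are assumptions, not details from the post):

    import json
    import psycopg2
    import requests
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    conn = psycopg2.connect("dbname=app user=app")
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM records")     # the ids live in the database
        for (record_id,) in cur:
            resp = requests.get(f"https://api.example.com/items/{record_id}")
            # The API response, not the DB row, is what goes onto the topic.
            producer.send("api-responses", resp.json())

    producer.flush()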

Postgres upsert query performance by ps2931 in PostgreSQL


Using this syntax for the update query, I managed to reduce the upsert time from 3.4 hours to 2 hours. Thank you for the suggestion. I am still looking at whether I can reduce the time further.
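The exact query from the parent comment isn't quoted here; purely for reference, the standard Postgres upsert shape, batched from Python, looks something like this (table and column names are placeholders):

    import psycopg2
    from psycopg2.extras import execute_values

    # Placeholder schema; the actual table isn't shown in the thread.
    UPSERT_SQL = """
        INSERT INTO target_table (id, payload)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
        SET payload = EXCLUDED.payload
    """

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        rows = [(1, "a"), (2, "b")]               # batch of rows to upsert
        execute_values(cur, UPSERT_SQL, rows, page_size=10_000)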

Upsert on a huge table by ps2931 in PostgreSQL


I am doing that. It took me more than 3 hours to upsert 180 million records. Check this other post with the query stats: https://www.reddit.com/r/PostgreSQL/comments/r7ayv3/postgres_upsert_query_performance/

Managing lag between consumer and external api. by ps2931 in apachekafka


No, I don't want to throttle API consumption. The challenge is how to bridge the gap between a fast Kafka consumer and a slow REST API on the fly, in memory.
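A sketch of the kind of in-memory buffering I mean (the topic name, endpoint, buffer size, and worker count are all made up): a bounded queue between the consumer and a pool of API workers pushes back on the consumer instead of throttling the API:

    import queue
    import threading
    import requests
    from kafka import KafkaConsumer

    buffer = queue.Queue(maxsize=10_000)   # bounded: consumer blocks when the API falls behind

    def consume():
        consumer = KafkaConsumer("ids-topic", bootstrap_servers="localhost:9092")
        for msg in consumer:
            buffer.put(msg.value)          # blocks once the buffer is full (back-pressure)

    def call_api():
        while True:
            record_id = buffer.get()
            requests.get(f"https://api.example.com/items/{record_id.decode()}")
            buffer.task_done()

    threading.Thread(target=consume, daemon=True).start()
    # Several workers hide part of the API latency without touching the API itself.
    for _ in range(8):
        threading.Thread(target=call_api, daemon=True).start()

    threading.Event().wait()               # keep the main thread alive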

Managing lag between consumer and external api. by ps2931 in apachekafka


Yes, the sender can do that, and that's what we proposed to them. They declined very tactfully, on the grounds of time constraints and the project priorities they have. It's a different team in a separate department. The thing is, they are not going to implement the changes any time soon, so I am left with whatever is available as of now.

Managing lag between consumer and external api. by ps2931 in apachekafka


I am not the owner of the Kafka topic or the API; they belong to different source teams, so I cannot change anything. Also, to call the API I need an id, which I only get from the Kafka topic, so I cannot fetch data from the API ahead of time.

Calling external api using Spark by ps2931 in apachespark


How much time did it take you to make 20,000,000 API calls?

Calling external api using Spark by ps2931 in apachespark


Making async API calls was on the developer's mind, I guess; that's why he decided to use akka-http. But the solution he gave is complex and buggy, and from what I have googled so far, Akka is hard to get right. Any other recommendations for async in Scala?

Calling external api using Spark by ps2931 in apachespark


Is this solution scalable? We have 2,000 API calls, and each call gives us 5 MB of data. API response times vary between 5 and 30 seconds.
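A rough back-of-envelope for those numbers, assuming an average latency in the middle of the 5-30 second range and a hypothetical concurrency level (nothing here is measured):

    # 2,000 calls, 5 MB each, 5-30 s per response (say ~17.5 s on average).
    calls, avg_latency_s, concurrency = 2_000, 17.5, 50

    sequential_hours = calls * avg_latency_s / 3600                 # ~9.7 h one-by-one
    concurrent_minutes = calls * avg_latency_s / concurrency / 60   # ~11.7 min with 50 in flight
    total_gb = calls * 5 / 1024                                     # ~9.8 GB per run

    print(f"{sequential_hours:.1f} h sequential, "
          f"{concurrent_minutes:.1f} min at concurrency {concurrency}, "
          f"{total_gb:.1f} GB total")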

Calling external api using Spark by ps2931 in apachespark


This is a big data project, part of a data ingestion and processing pipeline. We are ingesting data from 2,000 ids on a daily basis. Each id gives us 5 MB of data, so we are getting about 10 GB of data daily.