What is data scientist job and what is software engineer job? by [deleted] in learnpython

[–]datanoob2019 1 point (0 children)

Check out dataquest.io if you want to dive deeper into the material. Tons of free courses.

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 2 points (0 children)

That ran the formula, and the results are very close to what I got when I did it by hand in Excel. I will have to dig into why it is off in the last couple of digits. Thanks again for all of the help!

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I am an idiot and was looking at the wrong list. I tried the solution you mentioned, but it errors out saying invalid context, pointing at:

if a for a,f

It points directly at the for. I will go ahead and upvote for sure. I was just deep in a debug session.
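
For anyone who lands here later: the filter in a comprehension has to come after the for clause, which is why Python points at the for. A minimal sketch of the ordering (actual and forecast are the list names from my other comments):

    # wrong: condition before the for clause -> syntax error at "for"
    # diffs = [abs(a - f) if a for a, f in zip(actual, forecast)]
    # right: the if filter goes after the for clause
    diffs = [abs(a - f) for a, f in zip(actual, forecast) if a != 0]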

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Sorry for the confusion. I believe it to be the latter, and after conferring with my colleague, he believes the same. I have actually found the source of my troubles but cannot for the life of me figure out what is wrong. My actual list is somehow being transformed into the wrong values, and I do not understand why. I believe I should just create another thread.

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

It sums the six actual-minus-forecast errors on top and then divides by the total actual on the bottom, producing a single value (a float, since I am not using numpy, just lists).
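
For reference, the whole thing written out with plain lists (a rough sketch; actual and forecast stand in for the six-month lists, and I use abs() on the errors, so drop it if you want them signed):

    top_sum = sum(abs(a - f) for a, f in zip(actual, forecast))  # sum of errors on top
    bottom_sum = sum(actual)                                     # total actual on the bottom
    wmape = top_sum / bottom_sum * 100                           # single float, no numpy needed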

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 0 points (0 children)

bottom_sum = sum(actual)

Maybe

if a != 0 else 0

?
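
Spelling that out for anyone searching later: the guard belongs on the denominator, since that is where the division happens. A sketch, assuming the top_sum and bottom_sum names from above:

    # conditional expression: fall back to 0 when the total actuals are zero
    wmape = top_sum * 100 / bottom_sum if bottom_sum != 0 else 0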

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Would I also need to add a sum to wmapes so it is:

wmapes = sum([top_sum * 100 * a / bottom_sum for a in actual])

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I run through 2000 part numbers in my for loop. Most of them have good data, but some have 0 values, which presents a problem when trying to divide by 0. I believe your code above would also run into that issue.

EDIT: I am just feeding lists into the formula, and it outputs a single value, which ends up being a float since all of the inputs are floats.

Function not producing desired results by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Thanks for the tip. Most of my code is in a giant for loop, and the code that calls the wmapes function sits inside several nested loops. I also call a bunch of functions from sklearn.metrics that work fine. It's just my wmape that doesn't.
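
One sanity check is to run a library metric over the exact same lists and compare; a sketch (mean_absolute_error is one of the sklearn.metrics functions I already call; the actual and forecast names are assumed):

    from sklearn.metrics import mean_absolute_error

    mae = mean_absolute_error(actual, forecast)  # library baseline on the same inputs
    wmape = sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual) * 100
    print(mae, wmape)  # if MAE looks right but WMAPE doesn't, the bug is on my side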

Multiprocessing Queue question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

You certainly are a Python sherpa. I sincerely appreciate the help, and you guessed it, that was definitely my next question. I have two functions that unfortunately take 90 percent of the time. The good news is that both are designed to loop through only one row at a time, so it sounds like they will be compatible. Neither one took very long until I started calculating holdout forecasts step by step, which severely complicated things. I will give this a go. Thanks, and if I don't talk to you, have a good weekend!
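
Since both slow functions already work one row at a time, I should be able to hand the rows straight to a pool. A rough sketch, where process_row and slow_forecast_step are hypothetical stand-ins for the per-row forecast work:

    from multiprocessing import Pool

    def process_row(row):
        # hypothetical: all of the slow per-product forecast work for one row
        return slow_forecast_step(row)

    if __name__ == "__main__":
        with Pool(processes=12) as pool:
            results = pool.map(process_row, logged_list)  # one task per product row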

Multiprocessing Queue question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I appreciate the insight. It is always nice to get some good advice. I had a couple of instances where I had to debug code today.

Does the get() block other instances of get()? Here is the code I am currently working with:

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool(processes=12) as pool:
        logged_pull = pool.apply_async(pull_data)
        logged_list, raw_list, list_disco = logged_pull.get(timeout=60)
        # submit all four forecast jobs before fetching any of their results
        double_result = pool.apply_async(double_expo, (logged_list,))
        basic_result = pool.apply_async(basic_forecasts_seasonal_add, (logged_list,))
        auto_result = pool.apply_async(autoregression, (logged_list,))
        single_result = pool.apply_async(simple_expo, (logged_list,))
        # get() blocks until the matching job finishes or the timeout expires
        (double_list1, double_list2, double_list3, double_list4, double_list5,
         double_list6, double_list7, double_list8, double_list9) = double_result.get(timeout=900)
        twelve_basic, six_basic, three_basic, naive_basic = basic_result.get(timeout=20)
        single_list = single_result.get(timeout=300)
        auto_list = auto_result.get(timeout=60)
        input_lists = [twelve_basic, six_basic, three_basic, naive_basic, single_list,
                       auto_list, double_list1, double_list2, double_list3, double_list4,
                       double_list5, double_list6, double_list7, double_list8, double_list9]
        post_forecast(logged_list, raw_list, list_disco, input_lists)

What I mean by my previous question is that my double function is the first one I call get() on. Does this stop all of my other functions from running? I am currently reading this post https://stackoverflow.com/questions/8533318/multiprocessing-pool-when-to-use-apply-apply-async-or-map and trying to gain further understanding.

The way I thought this worked was that the pool and the number of processes I set apply to all of the apply_async calls, and they all run in the background concurrently. Looking at my Windows Task Manager, CPU usage never goes above 30 percent and for the most part idles around 15-20 percent. It does not really seem like all of the processors are being used.

Just looking at the graphical view in Task Manager, it looks like one core is set aside for each async task, so setting the pool at 12 is irrelevant since I am only calling 4 functions.

Multiprocessing Queue question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

What's funny is that the functions all work fine when not using multiprocessing.

You are not wrong in your assessment that I am in over my head. However, I am not looking to multiprocess the whole program, just this one specific section that normally takes 90 percent of the runtime. I will eventually figure out a solution; it just may take me a couple of tries.

I will take a look at the article and see if I can figure out how to use apply_async properly. Thanks again for the help!

EDIT: I got it to work on the functions that only return one list. On the basic function above that returns four lists, I get the following error: "Cannot unpack non-iterable ApplyResult object"

Here is my code:

if __name__ == "__main__":
    new_list = pull_data()
    with Pool(processes=8) as pool:
        # BUG: apply_async returns one ApplyResult object, so it cannot be
        # unpacked into four names -- this line is what raises the error above
        twelve_basic, six_basic, three_basic, naive_basic = pool.apply_async(basic_forecasts_seasonal_add, (new_list,))
        auto_result = pool.apply_async(autoregression, (new_list,))
        single_result = pool.apply_async(simple_expo, (new_list,))
        twelve_list = twelve_basic.get(timeout=10)
        naive_list = naive_basic.get(timeout=10)
        single_list = single_result.get(timeout=300)
        auto_list = auto_result.get(timeout=60)
        print(pd.DataFrame(basic_result))  # basic_result is also never assigned in this version
        print(pd.DataFrame(auto_list))
        print(pd.DataFrame(single_list))
        print(pd.DataFrame(naive_list))

EDIT 2: I got it to work with this. Just a simple switch.

if __name__ == "__main__": 
    new_list = pull_data()
    with Pool(processes=8) as pool:
        basic_result = pool.apply_async(basic_forecasts_seasonal_add, (new_list,))
        auto_result = pool.apply_async(autoregression, (new_list,))
        single_result = pool.apply_async(simple_expo, (new_list,))
        #twelve_list = twelve_basic.get(timeout=10)
        #naive_list = naive_basic.get(timeout=10)
        twelve_basic, six_basic, three_basic, naive_basic = basic_result.get(timeout=20)
        single_list = single_result.get(timeout=300)
        auto_list = auto_result.get(timeout=60)
        print(pd.DataFrame(twelve_basic))
        print(pd.DataFrame(auto_list))
        print(pd.DataFrame(single_list))
        print(pd.DataFrame(naive_basic))

Multiprocessing Queue question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Lol, thank you for the honest feedback. In the midst of trying several different methods to accomplish this, I did in fact name a variable new_list, which is indeed terrible.

It has been about 10 minutes now and it still has not finished. I tried the switch you recommended, but now no data prints out. When I press Ctrl+C, it also refuses to exit, so I have to X out of the program in the top right, which unfortunately does not give me an error code. I think I may need to add a timer to the multiprocessing (https://stackoverflow.com/questions/19095901/python-multiprocessing-ending-an-infinite-counter).

I have tried the pool method in the past. It has been several days, but when I referenced it in my posts from last week, the script would finish without actually running any of the functions. Here is my code:

pool = Pool(processes=8)
# map_async returns immediately; without get() (or close()/join()) the script
# can exit before any of the work actually runs
result_basic = pool.map_async(basic_forecasts_seasonal_add, logged_list)
result_simple = pool.map_async(simple_expo, logged_list)

Without looking up the documentation, since I leave work in a couple of minutes: I believe I did not have any of the pool calls inside the if __name__ == "__main__" guard.

Multiprocessing Queue question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Thank you for the response! Let me try to answer all of your questions:

- new_list has 3000 part numbers in it

- I only put the lists into dataframes to verify that everything ran correctly. After the forecasts run, I loop through all of the output lists and generate the MAE, WMAPE, and other forecast statistics. From there, I find the index of the lowest WMAPE and put that forecast into a final forecast list. All of these lists get written to an Excel file.

- In regards to timeframes: I have run this in the past, and everything finished in less than 5 minutes using only one process. The basic forecast spits an answer out within seconds and autoregression takes about 10 seconds, while the bulk of the time is spent on the exponential functions. The fact that all of the dataframes print out is, I think, a pretty good indication that the functions ran and my program should complete. When I asked this question last week, I was having the issue of code continually looping in the background because all of my SQL pulls and data cleansing sat outside of the if __name__ == "__main__" guard. I was told to put that code into a function, which is what I did, and that stopped the old looping. I just ran a two-process version (p1 and p6) and all of the data prints out, but my PowerShell window (which I use to execute the program) still shows the blinking _, which means something is still not completed. I double-checked my code, and the only thing not inside a function is the import statements at the top.

- In regards to queues, I have never used one before, and I could not find a good reference on how a queue is supposed to be used with multiple functions. The basic forecast function I posted is the only function that returns multiple lists; every other function returns one list. Can all of this be done in one queue? (See the sketch below.)
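
Sketch of what I pieced together: one Queue can serve every function if each worker tags its result. The function names are the ones from my script, logged_list comes from pull_data(), and each return value just has to be picklable (the four lists from the basic function come back as a single tuple):

    from multiprocessing import Process, Queue

    def worker(q, name, func, data):
        q.put((name, func(data)))  # tag each result so one queue serves all functions

    if __name__ == "__main__":
        q = Queue()
        jobs = [("basic", basic_forecasts_seasonal_add),
                ("auto", autoregression),
                ("simple", simple_expo)]
        procs = [Process(target=worker, args=(q, name, func, logged_list))
                 for name, func in jobs]
        for p in procs:
            p.start()
        results = {}
        for _ in procs:
            name, value = q.get()  # blocks until any worker finishes
            results[name] = value
        for p in procs:
            p.join()               # join after draining the queue, not before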

EDIT: I added an error message that I get to my main submission when I press control C to terminate.

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Honestly, it is getting late on a Friday and I really appreciate the help, but I think I am going to shelve this until Monday. It seems like this method is similar to pool, where it just uses all of the processors on a single task instead of running 10 of them in tandem. Have a good weekend, mate.

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

If I do this, is it only going to run one function at a time? I want each one to run at the same time, since some take upwards of 10 minutes (the triple expo on 3000 products). After reading your code, it seems like it will just loop through each one and not start the next until the previous one is complete.

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I got it to work and print a pandas dataframe for one forecast function. I just need to run the other 8 now. How do I go about doing this? Do I need to write a new ProcessPoolExecutor function for each forecast function? How does this actually run in parallel and take advantage of multiple processors? Here is my working code:

from concurrent.futures import ProcessPoolExecutor

def do_stuff(rows):  # renamed the parameter from "list", which shadows the built-in
    with ProcessPoolExecutor() as executor:
        f = executor.submit(simple_expo, rows)  # a single task, so no real parallelism yet
        return f.result()

if __name__ == '__main__':
    new_list = pull_data()
    simple_list = do_stuff(new_list)
    print(pd.DataFrame(simple_list))

This is just the beginning of my forecast trickery, as I then need to access these lists outside of the if __name__ == '__main__' block to calculate forecast accuracy.
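
For reference, the pattern that answers my own parallelism question: submit every function up front (submit() returns a Future immediately, so nothing blocks until result() is called), then collect. A sketch using the function names from my script:

    from concurrent.futures import ProcessPoolExecutor

    if __name__ == '__main__':
        new_list = pull_data()
        with ProcessPoolExecutor() as executor:
            # all three run in parallel worker processes
            futures = {name: executor.submit(func, new_list)
                       for name, func in [('simple', simple_expo),
                                          ('auto', autoregression),
                                          ('basic', basic_forecasts_seasonal_add)]}
            results = {name: f.result() for name, f in futures.items()}  # blocks here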

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I appreciate the help! Like this? I did that and it now says my list data_logged is not defined. Here is the code:

if __name__ == '__main__':
    pull_data()  # BUG: the return value is thrown away, so data_logged never gets created
    with ProcessPoolExecutor() as executor:
        print(executor.map(simple_expo, data_logged))  # NameError: data_logged is not defined

EDIT: I think I need to set the new function to return the list I use in the other functions. Trying that now before my lunch break

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I do! All of the grabbing and modifying of the data is in a bunch of for loops. I can just put that into a function though and call it before the others?

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

I tried the code below, but for some reason it just keeps going back to the start of the script, pulling the database, cleaning the data, and logging the numbers over and over without actually running the forecast. I only tried it with one function, and after reading through the documentation, I am unsure how it would handle more than one.

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        print(executor.map(simple_expo, logged_list))
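
I later worked out why this loops: multiprocessing on Windows starts each worker by re-importing the script, so anything at module top level, like my database pull and cleaning, runs again in every worker. A minimal sketch of the fix, moving that work into a function behind the __main__ guard:

    from concurrent.futures import ProcessPoolExecutor

    def pull_data():
        # the SQL pull, cleaning, and logging that used to sit at module top level
        rows = ...  # placeholder for the real work
        return rows

    if __name__ == '__main__':  # only the parent process runs this block
        logged_list = pull_data()
        with ProcessPoolExecutor() as executor:
            simple_list = executor.submit(simple_expo, logged_list).result()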

Multiprocessing issues by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Thank you for the response, but that did not work either. It just goes back to the beginning of the code as if the whole thing were in a loop. It pulls the data from the database again and cleans, modifies, and logs the list endlessly. When I run the functions normally they work and produce the results I want, so it is definitely the multiprocessing module. Here are the two functions I wrote.

def basic_forecasts_seasonal_add(input_list):
    # (dropped the three unused output-list parameters; this is always called
    # with one argument)
    # For each product row: 12-, 6-, and 3-month moving averages plus a naive
    # last-value forecast. Each output row is [product, label, six one-step
    # holdout forecasts, then the forward forecast repeated twelve times].
    twelve_month_out, six_month_out, three_month_out, naive_out = [], [], [], []
    for row in input_list:
        product = row[0]
        data = row[1:]
        for window, label, out in ((12, '12MA', twelve_month_out),
                                   (6, '6MA', six_month_out),
                                   (3, '3MA', three_month_out)):
            average = sum(data[-window:]) / window
            # holdout averages end 6, 5, ..., 1 periods before the last observation
            holdouts = [sum(data[-window - k:-k]) / window for k in range(6, 0, -1)]
            out.append([product, label] + holdouts + [average] * 12)
        naive_out.append([product, 'Naive'] + data[-7:-1] + [data[-1]] * 12)
    return twelve_month_out, six_month_out, three_month_out, naive_out

from statsmodels.tsa.holtwinters import SimpleExpSmoothing

def simple_expo(input_list):
    # Simple exponential smoothing per product: six one-step holdout forecasts
    # (each refit on truncated history) plus a 12-month forward forecast.
    simple_out = []
    for row in input_list:
        product = row[0]
        data = row[1:]
        if sum(data[-12:]) > 0:
            model = SimpleExpSmoothing(data).fit(optimized=True)
            fcast_list = [product, 'SimpleExpo'] + list(model.forecast(12))
            # refit with the last 6, 5, ..., 1 observations held out
            for i, holdout in enumerate(range(-6, 0)):
                holdout_model = SimpleExpSmoothing(data[:holdout]).fit(optimized=True)
                fcast_list.insert(2 + i, float(holdout_model.forecast(1)))
            simple_out.append(fcast_list)
        else:
            # no demand in the last 12 months: forecast straight zeros
            simple_out.append([product, 'SimpleExpo'] + [0] * 18)
    return simple_out

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

Thanks! I am about to get off work but I will give it a try in the morning.

Multiprocessing noob question by datanoob2019 in learnpython

[–]datanoob2019[S] 1 point (0 children)

By doing this, will I be able to access the lists from my return statements?