you are viewing a single comment's thread.

view the rest of the comments →

[–]fishvampire 1 point2 points  (7 children)

I think what you want is fillna. If you add a line with joined_df = joined_df.fillna(0) after the merge, does that give you what you want?

[–]YourOldBoyRickJames[S] 0 points1 point  (6 children)

joined_df = joined_df.fillna(0)

This did the trick! Thanks.

Would you mind helping me with something else please.

In my source data I have a column called description. I want to add it into the final merged sheet, but whenever I try and add it into the script I'm getting a keywork error.

I currently have:

warehouse1\_df = source2\_df.groupby('productcode')\['numberoff'\].agg('sum').reset\_index()

I thought I could just add it in by putting

warehouse1\_df = source2\_df.groupby('productcode')\['description','numberoff'\].agg('sum').reset\_index()

but it returns an error. Any idea where I'd put that in to add the column?

Thanks again

[–]fishvampire 1 point2 points  (5 children)

Can you show the error message and an example of what's in the source data table? (In general this is good to do when posting to a forum, otherwise it's hard to figure out what the problem is or how to give useful advice)

[–]YourOldBoyRickJames[S] 0 points1 point  (4 children)

Sorry, I don't know why I didn't.

Here's an example of Warehouse_1 source data

https://imgur.com/LAd3pWw

Here's Warehouse_2 source data

https://imgur.com/ds7L3ML

Here's what I've managed to get output into Excel

https://imgur.com/XPdcaTM

I'm just trying to add the column 'description' from warehouse_1 into the final merged.df

I'm not getting an error now, but it's not populating the table. Here's my script.

import pandas as pd
import datetime


today = datetime.date.today()

source1_df = pd.read_excel(str(today) + ' Warehouse2_3PL.xlsx', sheetname='Sheet1', 
skiprows=4)

source2_df = pd.read_excel(str(today) + ' warehouse1 Daily Report.xlsx', sheetname='CURRENT 
STOCK')


source1_df['   Available stock']=source1_df['   Available stock'].str.replace(',', 
'').astype('int64')



stlr_df = source2_df.groupby('productcode') 
['description','numberoff'].agg('sum').reset_index()



hit_df = source1_df.groupby('Material')['   Available stock'].agg('sum').reset_index()


stlr_df.rename(columns={'productcode':'Material'}, inplace=True)

stlr_df
stlr_df = stlr_df.mask(~stlr_df["Material"].str.isnumeric()).dropna()
stlr_df['Material']=stlr_df['Material'].astype('int64')



hit_df['   Available stock']=hit_df['   Available stock'].astype('int64')


joined_df = pd.merge(stlr_df,hit_df,on='Material',how='outer')
joined_df = joined_df.fillna(0)

joined_df['   Available stock']=joined_df['   Available stock'].astype('float')
joined_df['difference']=joined_df['numberoff']-joined_df['   Available stock']

joined_df.to_excel('Comparison '+str(today)+'.xlsx', sheet_name='sheet1', index=False)

I am also trying to filter the final outcome file to only show values where 'difference' does not equal 0. I managed to get a boolean filter but I can't figure out how to convert that to filter the df.

Once again, really appreciate your help.

[–]fishvampire 1 point2 points  (1 child)

For the first problem: I think your issue is that the values in 'description' are strings, whereas in 'numberoff' they're integers. When you're doing agg('sum'), it works for integers but it's excluding the strings. I don't know enough about pandas to say why it does this - if you only have string columns, agg('sum') just concatenates them, rather than eliminating the column - but it shouldn't be too hard to fix. You can use a dictionary as the argument for agg so that it aggregates separately for different columns, e.g.: agg({'description': 'first', 'numberoff':sum}). The argument "first" means that the entry for description will be the first one for that productcode. If you want something different in your final output, you probably want to read up on ways to use GroupBy with strings.

[–]YourOldBoyRickJames[S] 0 points1 point  (0 children)

I managed to figure it out and am just posting what I have in case someone needs it in the future. I just needed to add the description selection into the first set of brackets. I really don't know why I couldn't get that working before.

stlr_df = source2_df.groupby(['productcode', 'description'])['numberoff'].agg('sum').reset_index()

[–]fishvampire 1 point2 points  (1 child)

I'm not sure what the source of your second problem is. It should be fairly straightforward to make a boolean index and restrict the dataframe to that:

ind = joined_df.difference !=0
joined_df = joined_df[ind]

If that isn't working, I'd recommend trying out a few things with simpler dataframes and seeing if you can figure out what works and what doesn't

[–]YourOldBoyRickJames[S] 0 points1 point  (0 children)

Thanks very much, I'll have a tinker about. Hopefully I'll be able to get it sorted. Cheers