python - Pandas: Create new dataframe that averages duplicates from another dataframe
Say I have a dataframe my_df with duplicate column names, e.g.
    foo  bar  foo  hello
      0    1    1      5
      1    1    2      5
      2    1    3      5
I would like to create another dataframe that averages the duplicates:
    foo  bar  hello
    0.5    1      5
    1.5    1      5
    2.5    1      5
How can I do this in pandas?
So far I have managed to identify the duplicates:
    import collections

    my_columns = my_df.columns
    # Keep every column label that occurs more than once.
    my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
    print(my_duplicates)
But I don't know how to ask pandas to average them.
You can groupby on the column index and take the mean:
    In [11]: df.groupby(level=0, axis=1).mean()
    Out[11]:
       bar  foo  hello
    0    1  0.5      5
    1    1  1.5      5
    2    1  2.5      5
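Newer pandas releases deprecate the `axis=1` argument to `groupby`, so the call above may warn or fail there. A minimal sketch of the same idea that should work on both old and new versions is to transpose, group on the row index, and transpose back. The sample frame below is rebuilt from the question's data; the names `df` and `averaged` are only illustrative.

```python
import pandas as pd

# Rebuild the question's frame; duplicate labels are passed via `columns`.
df = pd.DataFrame(
    [[0, 1, 1, 5], [1, 1, 2, 5], [2, 1, 3, 5]],
    columns=["foo", "bar", "foo", "hello"],
)

# After transposing, the column labels become the row index, so grouping
# on level 0 collects the rows that share a label. Transpose back at the end.
averaged = df.T.groupby(level=0).mean().T

print(averaged)
```

As with the original call, the resulting columns come out sorted by label (bar, foo, hello) and every value is a float. This only works when all columns are numeric.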
A trickier example is if there is a non-numeric column:
    In [21]: df
    Out[21]:
       foo  bar  foo hello
    0    0    1    1     a
    1    1    1    2     a
    2    2    1    3     a
The above will raise: DataError: No numeric types to aggregate. It's not going to win any prizes for efficiency, but here's a generic method to use in this case:
    In [22]: dupes = df.columns.get_duplicates()

    In [23]: dupes
    Out[23]: ['foo']

    In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
    Out[24]:
       bar hello
    0    1     a
    1    1     a
    2    1     a

    In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
    Out[25]:
       foo
    0  0.5
    1  1.5
    2  2.5

    In [26]: pd.concat([Out[24], Out[25]], axis=1)
    Out[26]:
       foo  bar hello
    0  0.5    1     a
    1  1.5    1     a
    2  2.5    1     a
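`Index.get_duplicates()` has since been removed from pandas, so here is a rough sketch of the same approach using `Index.duplicated()` instead. It is one possible rewrite, not the only way to do it. The string values in `hello` are placeholders standing in for any non-numeric column, and the names `unique_part`, `averaged_part` and `result` are made up for illustration.

```python
import pandas as pd

# Mixed-type frame with a duplicated "foo" label and a non-numeric column.
df = pd.DataFrame(
    [[0, 1, 1, "a"], [1, 1, 2, "a"], [2, 1, 3, "a"]],
    columns=["foo", "bar", "foo", "hello"],
)

# Labels that occur more than once (replacement for get_duplicates()).
dupes = df.columns[df.columns.duplicated()].unique()

# Columns with a unique label are carried over untouched.
unique_part = df.loc[:, ~df.columns.isin(dupes)]

# df[label] returns a DataFrame holding every column that shares the label;
# averaging along axis=1 collapses them into one column per label.
averaged_part = pd.DataFrame(
    {d: df[d].mean(axis=1) for d in dupes}, index=df.index
)

result = pd.concat([averaged_part, unique_part], axis=1)
print(result)
```

Only the duplicated labels are averaged, so the non-numeric column never reaches `mean()`. If a duplicated label itself holds non-numeric data, this sketch would still fail and would need extra handling.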
I think the thing to take away is to avoid column duplicates... or perhaps that I don't know what I'm doing.