How to predict how long it will take for Python to run a script?
I want to make sure I run my program at an optimal time. For example, if it will take 5 hours to complete, I should run it overnight!
I know where the program ends, and theoretically I should be able to base the running time on the size of the data. Here is the actual problem:
I need to open 16 pickled files of pandas DataFrames that add up to a total of 1.5 gigs. Note that I will eventually need DataFrames that add up to 20 gigs, so the answer needs a way of telling how long the following code will take given the total amount of gigs:
import pickle
import os

def picklesave(data, picklefile):
    # Dump the data to disk.
    output = open(picklefile, 'wb')
    pickle.dump(data, output)
    output.close()
    print("file has been saved at %s" % picklefile)

def pickleload(picklefile):
    # Load a pickled object back from disk.
    pkl_file = open(picklefile, 'rb')
    data = pickle.load(pkl_file)
    pkl_file.close()
    return data

directory = '/users/ryansaxe/desktop/kaggle_parkinsons/gps/'
files = os.listdir(directory)
dfs = [pickleload(directory + i) for i in files]
new_file = directory + 'new_file_dataframe'
picklesave(dfs, new_file)
So I need to write a function that does the following:
def time_fun(data_size_in_gigs):
    # some algorithm here
    print("your code will take ___ hours to run")
I have no clue how to approach this, or whether it is even possible. Any ideas?
The execution time is entirely dependent on your system, i.e., hard drive / SSD, processor, etc. No one can tell you upfront how long it will take to run on your computer. The only way you'll be able to get a precise estimate is to run the script on sample files that add up to a small size, such as 100 MB, take note of how long it took, and base your estimations off of that.
def time_fun(data_size_in_gigs):
    benchmark = time_you_manually_tested_for_100mb
    time_to_run = data_size_in_gigs / 0.1 * benchmark
    print("your code will take %s hours to run" % time_to_run)
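To get the benchmark itself, you can time the sample run once by hand; a minimal sketch, where the sample file name is a hypothetical stand-in for your own ~100 MB sample:

import pickle
import time

start = time.time()
with open('sample_100mb.pkl', 'rb') as pkl_file:  # hypothetical ~100 MB sample file
    data = pickle.load(pkl_file)
# hours taken to load 0.1 gigs; plug this in as the benchmark above
benchmark = (time.time() - start) / 3600.0
print("loading 0.1 gigs took %f hours" % benchmark)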
Edit: In fact, you may want to save each benchmark (size, time) pair to a file, and automatically add new entries whenever you run the script. Here in the function, you may for example want to retrieve the 2 benchmarks closest to the data_size you're estimating and estimate off of them, taking the average and making it proportional to the data_size you need. Each adjacent pair of benchmarks defines a different linear slope that is accurate for data near it.
     |                .
     |            .
time |        .
     |     .
     |  .
     |_.___________________
               size
Just avoid saving 2 benchmarks that differ by less than 200 MB, for example, since the actual time may vary, and entries such as (999 MB, 100 minutes) followed by (1 GB, 95 minutes) would ruin the estimation.
The projection of the line defined by the 2 last points will be the closest estimate you have for new all-time-high data sizes.
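A minimal sketch of that bookkeeping, assuming a plain-text benchmark file with one "size_in_gigs hours" pair per line and at least two saved entries (the file format and helper names are my own assumptions, not part of the answer above):

import bisect

def load_benchmarks(benchmark_file):
    # Read "size_in_gigs hours" pairs and return them sorted by size.
    pairs = []
    with open(benchmark_file) as f:
        for line in f:
            size, hours = line.split()
            pairs.append((float(size), float(hours)))
    return sorted(pairs)

def save_benchmark(benchmark_file, size, hours, pairs, min_gap=0.2):
    # Skip entries within ~200 MB of an existing benchmark, so noise like
    # (0.999 gigs, 100 min) next to (1.0 gigs, 95 min) cannot ruin the slopes.
    if all(abs(size - s) >= min_gap for s, _ in pairs):
        with open(benchmark_file, 'a') as f:
            f.write("%f %f\n" % (size, hours))

def time_fun(data_size_in_gigs, pairs):
    # Estimate from the two benchmarks closest to the requested size:
    # interpolate between them, or project the last segment for new highs.
    sizes = [s for s, _ in pairs]
    i = bisect.bisect_left(sizes, data_size_in_gigs)
    if i == 0:
        # Below the smallest benchmark: scale it proportionally.
        size, hours = pairs[0]
        estimate = hours * data_size_in_gigs / size
    else:
        if i >= len(pairs):
            i = len(pairs) - 1  # all-time-high size: extend the last slope
        (s0, h0), (s1, h1) = pairs[i - 1], pairs[i]
        slope = (h1 - h0) / (s1 - s0)
        estimate = h0 + slope * (data_size_in_gigs - s0)
    print("your code will take %.2f hours to run" % estimate)
    return estimate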