parallel processing - Minimising the number of calls to std::max in a nested loop
I'm trying to reduce the number of calls to std::max in my inner loop, as I'm calling it millions of times (no exaggeration!) and that's making my parallel code run slower than the sequential code. The basic idea (yes, this is for an assignment) is that the code calculates the temperature at each gridpoint, iteration after iteration, until the maximum change in any cell is no more than some certain, tiny number (e.g. 0.01). The new temperature of a cell is the average of the temperatures in the cells directly above, below and beside it. Each cell has a different value as a result, and I want to return the largest change in any cell in a given chunk of the grid.
I've got the code working, but it's slow because I'm making a large (excessively so) number of calls to std::max in the inner loop, and it's O(n*n). I have used a 1D domain decomposition.
Notes:

- tdiff doesn't depend on what's in the matrix
- the inputs of the reduction function are the results of the lambda function
- diff is the greatest change in a single cell in the chunk of the grid over one iteration
- the blocked range is defined earlier in the code
- t_new is the new temperature for the grid point, t_old is the old one
    max_diff = parallel_reduce(range, 0.0,
        // lambda function returns the local max
        [&](blocked_range<size_t> range, double diff) -> double {
            for (size_t j = range.begin(); j < range.end(); j++) {
                for (size_t i = 1; i < n_x-1; i++) {
                    t_new[j*n_x+i] = 0.25*(t_old[j*n_x+i+1] + t_old[j*n_x+i-1]
                                         + t_old[(j+1)*n_x+i] + t_old[(j-1)*n_x+i]);
                    tdiff = fabs(t_old[j*n_x+i] - t_new[j*n_x+i]);
                    diff = std::max(diff, tdiff);
                }
            }
            return diff; // return the biggest value of tdiff for this iteration - once per 'i'
        },
        // reduction function - takes in the max diffs from each iteration, picks the largest
        [&](double a, double b) -> double {
            convergence = std::max(a, b);
            return convergence;
        }
    );
How can I make this code more efficient? I want to make fewer calls to std::max, but I need to maintain the correct values. Using gprof I get:
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
     61.66      3.47     3.47  3330884     0.00     0.00  double const& std::max<double>(double const&, double const&)
     38.03      5.61     2.14     5839     0.37     0.96  _zz4mainenkuln3tbb13blocked_rangeimeede_cles1_d
ETA: 61.66% of the time spent executing the code is on std::max calls, which are called over 3 million times. The reduce function is called for every output of the lambda function, so reducing the number of calls to std::max in the lambda function will also reduce the number of calls to the reduce function.
First of all, I would expect std::max to be inlined into its caller, so it's suspicious that gprof points it out as a separate hotspot. Did you maybe analyze a debug configuration?
Also, I do not think std::max is the culprit here. Unless special checks are enabled in the implementation, I believe it should be equivalent to (diff<tdiff)?tdiff:diff. Since one of the arguments to std::max is the variable you update, you can try if (tdiff>diff) diff = tdiff; instead, though I doubt it will give much (and perhaps compilers can do such an optimization on their own).
Most likely, std::max is highlighted as the result of sampling skid; i.e. the real hotspot is in the computations above std::max. That makes perfect sense, due both to more work being done there and to the accesses to non-local data (the arrays), which might have longer latency if the corresponding locations are not in the CPU cache.
Depending on the size of the rows (n_x) in your grid, processing it by rows can be inefficient, cache-wise. It's better to reuse the data of t_old as much as possible while it is in cache. When processing by rows, you either don't re-use a point of t_old until the next row (for the i+1 and i-1 points) or reuse it just once (for the two neighbors in the same row). A better approach is to process the grid by rectangular blocks, which helps to re-use data while it is hot in the cache. With TBB, the way to do that is to use blocked_range2d. It needs minimal changes in your code; basically, you change the range type and the two loops inside the lambda: the outer and inner loops should iterate over range.rows() and range.cols(), respectively.