parallel processing - Minimising number of calls to std::max in nested loop


I'm trying to reduce the number of calls to std::max in my inner loop, since I'm calling it millions of times (no exaggeration!) and that's making my parallel code run slower than the sequential code. The basic idea (yes, this is for an assignment) is that the code calculates the temperature at each gridpoint, iteration after iteration, until the maximum change in any cell is no more than a certain tiny number (e.g. 0.01). The new temperature of a cell is the average of the temperatures in the cells directly above, below and beside it. Each cell ends up with a different change as a result, and I want to return the largest change in any cell for a given chunk of the grid.
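For reference, a minimal sequential sketch of the scheme described above (the names t_old, t_new and n_x follow the question; treating the boundary cells as fixed is my assumption):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One sequential Jacobi sweep: each interior cell becomes the average of its
// four neighbours; returns the largest absolute change over the sweep.
// Boundary cells are left untouched (assumed fixed by boundary conditions).
double sweep(const std::vector<double>& t_old, std::vector<double>& t_new,
             size_t n_x, size_t n_y) {
    double max_diff = 0.0;
    for (size_t j = 1; j < n_y - 1; j++) {
        for (size_t i = 1; i < n_x - 1; i++) {
            t_new[j*n_x+i] = 0.25 * (t_old[j*n_x+i+1] + t_old[j*n_x+i-1]
                                   + t_old[(j+1)*n_x+i] + t_old[(j-1)*n_x+i]);
            max_diff = std::max(max_diff, std::fabs(t_old[j*n_x+i] - t_new[j*n_x+i]));
        }
    }
    return max_diff;
}
```

The outer driver loop (not shown) would swap t_old and t_new and repeat until this returned value drops below the tolerance.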

I've got the code working, but it's slow because I'm making a large (excessively large) number of calls to std::max in the inner loop, and it's O(n*n). I have used a 1D domain decomposition.

Notes: tdiff doesn't depend on what's in the matrix.

The inputs of the reduction function are the results of the lambda function.

diff is the greatest change in a single cell in the chunk of the grid over one iteration.

The blocked range is defined earlier in the code.

t_new is the new temperature for the grid point, t_old is the old one.

    max_diff = parallel_reduce(range, 0.0,
        // lambda function returns the local max
        [&](blocked_range<size_t> range, double diff) -> double
        {
            for (size_t j = range.begin(); j < range.end(); j++)
            {
                for (size_t i = 1; i < n_x-1; i++)
                {
                    t_new[j*n_x+i] = 0.25*(t_old[j*n_x+i+1] + t_old[j*n_x+i-1]
                                         + t_old[(j+1)*n_x+i] + t_old[(j-1)*n_x+i]);
                    tdiff = fabs(t_old[j*n_x+i] - t_new[j*n_x+i]);
                    diff = std::max(diff, tdiff);
                }
            }
            return diff;    // return the biggest value of tdiff for this chunk
        },
        // reduction function - takes the max diffs from each chunk, picks the largest
        [&](double a, double b) -> double
        {
            convergence = std::max(a, b);
            return convergence;
        }
    );

How can I make this code more efficient? I want to make fewer calls to std::max, but I need to maintain the correct values. Using gprof I get:

    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total
     time   seconds   seconds    calls  ms/call  ms/call  name
    61.66      3.47     3.47  3330884     0.00     0.00  double const& std::max<double>(double const&, double const&)
    38.03      5.61     2.14     5839     0.37     0.96  _ZZ4mainENKUlN3tbb13blocked_rangeImEEdE_clES1_d

ETA: 61.66% of the execution time is spent on std::max calls, which are called over 3 million times. The reduce function is called once for every output of the lambda function, so reducing the number of calls to std::max in the lambda function would also reduce the number of calls to the reduce function.

First of all, I would expect std::max to be inlined into its caller, so it's suspicious that gprof points it out as a separate hotspot. Did you perhaps profile a debug configuration?

Also, I do not think std::max is the culprit here. Unless special checks are enabled in the implementation, I believe it should be equivalent to (diff<tdiff)?tdiff:diff. Since one of the arguments to std::max is the variable you update, you can try if (tdiff>diff) diff = tdiff; instead, though I doubt it will gain much (and perhaps compilers can do such an optimization on their own).
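To illustrate the equivalence, here is a minimal sketch comparing the two forms on the same kind of running-maximum loop as in the question (the function names are mine, for illustration only):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Running maximum of absolute changes, written with std::max.
double max_change_std(const std::vector<double>& a, const std::vector<double>& b) {
    double diff = 0.0;
    for (size_t i = 0; i < a.size(); i++)
        diff = std::max(diff, std::fabs(a[i] - b[i]));  // function-call form
    return diff;
}

// The same computation with a plain branch instead of std::max.
double max_change_branch(const std::vector<double>& a, const std::vector<double>& b) {
    double diff = 0.0;
    for (size_t i = 0; i < a.size(); i++) {
        double tdiff = std::fabs(a[i] - b[i]);
        if (tdiff > diff) diff = tdiff;                 // branch form, no call
    }
    return diff;
}
```

Both produce identical results; with optimizations on, a compiler will typically generate essentially the same code for both.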

Most likely, std::max is highlighted as a result of sampling skid; i.e. the real hotspot is in the computations just above std::max, which makes perfect sense, due both to the greater amount of work there and to the accesses to non-local data (the arrays), which might have long latency if the corresponding locations are not in the CPU cache.

Also, depending on the size of the rows (n_x) in the grid, processing it row by row can be inefficient, cache-wise. It's better to reuse the data from t_old as much as possible while they are in cache. When processing whole rows, you either don't re-use a point of t_old at all until the next row (for the i+1 and i-1 points) or reuse it just once (for the two neighbours in the same row). A better approach is to process the grid in rectangular blocks, which helps to re-use data that are hot in cache. With TBB, the way to do that is to use blocked_range2d. It needs minimal changes in your code; basically, you change the range type, and the two loops inside the lambda should iterate over range.rows() and range.cols() for the outer and inner loop, respectively.
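The blocking idea can be sketched without TBB; the sequential version below traverses the grid in square tiles, which is what a blocked_range2d would hand to the lambda. This is a sketch under my own assumptions (tile size, fixed boundaries), not the asker's exact code:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Cache-blocked sweep: traverse the grid in block x block tiles so the rows of
// t_old touched by one tile stay hot in cache while they are reused.
// With TBB, each tile would correspond to one blocked_range2d<size_t> chunk:
// the (jb, ib) tile loops disappear into the partitioner, and the lambda
// iterates over range.rows() and range.cols() instead.
double blocked_sweep(const std::vector<double>& t_old, std::vector<double>& t_new,
                     size_t n_x, size_t n_y, size_t block) {
    double max_diff = 0.0;
    for (size_t jb = 1; jb < n_y - 1; jb += block) {
        for (size_t ib = 1; ib < n_x - 1; ib += block) {
            size_t j_end = std::min(jb + block, n_y - 1);
            size_t i_end = std::min(ib + block, n_x - 1);
            for (size_t j = jb; j < j_end; j++) {       // ~ range.rows()
                for (size_t i = ib; i < i_end; i++) {   // ~ range.cols()
                    t_new[j*n_x+i] = 0.25 * (t_old[j*n_x+i+1] + t_old[j*n_x+i-1]
                                           + t_old[(j+1)*n_x+i] + t_old[(j-1)*n_x+i]);
                    double tdiff = std::fabs(t_old[j*n_x+i] - t_new[j*n_x+i]);
                    if (tdiff > max_diff) max_diff = tdiff;
                }
            }
        }
    }
    return max_diff;
}
```

The arithmetic per cell is unchanged; only the traversal order differs, so the result is identical to the row-by-row version.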

