PI: Simon Scheidegger, MCS
Description: We have a massively parallel code that scales from a single node up to 4k nodes on Piz Daint at the Swiss National Supercomputing Centre (a Cray XC50 platform with NVIDIA P100 GPUs attached to every node). In our research, we ported this framework to KNL (AVX-512, etc.) and would like to compare the single-node performance of KNL against that of hybrid CPU/GPU nodes. The results will be reported in a paper submitted to SCS17.
Testbed: KNL – Xeon Phi Knights Landing (KNL) Cluster
Background: Dynamic stochastic general equilibrium models with heterogeneous agents are routinely used in modern macroeconomics and public finance for counterfactual policy analysis. One particular subclass is the class of overlapping generations (OLG) models. These models are important tools in public finance because they allow for careful modeling of individuals’ decisions over the life cycle and of their interactions with capital accumulation and economic growth.
Over the last 10 to 20 years, deterministic OLG models have been fruitfully applied to the analysis of taxation and fiscal policy in several areas. In particular, large-scale deterministic versions of the model have been applied to the “fiscal gap”, to “dynamic scoring of tax policies”, and to the evaluation of social security reforms. It is clear, however, that uncertainty needs to be included in the basic model in order to address these policy-relevant questions thoroughly. Both uncertainty about economic fundamentals and uncertainty about future policy crucially affect individuals’ savings, consumption, and labor-supply decisions, and introducing uncertainty into the specification of the model can overturn many results obtained in the deterministic model. Both uncertainty about future productivity and uncertainty about future taxes have first-order effects on agents’ behavior.
Unfortunately, once this form of uncertainty is introduced, the model no longer has steady-state equilibria: the stochastic aggregate shocks affect everybody’s return to physical and human capital, and by construction the effects do not cancel out in the aggregate, so the distribution of wealth across generations changes with the stochastic aggregate shock. This feature makes it difficult to approximate equilibria with many agents of different ages and aggregate uncertainty. Realistic calibrations of the model lead to very high-dimensional problems that were so far thought to be unsolvable, which explains why relatively little policy work has been carried out using stochastic OLG models. Building on Scheidegger et al. (2016), we want to show in our research how recent developments in computational mathematics can be combined with heterogeneous supercomputing hardware to compute global solutions to stochastic OLG models with large heterogeneity (i.e. a high-dimensional continuous state space and discrete stochastic states) in relatively short times. This opens the way to the application of stochastic OLG models to quantitative public finance and to a better-founded general equilibrium analysis of tax and transfer policies.
In stochastic dynamic models, individuals’ optimal policies and prices are unknown functions of the underlying high-dimensional states and are solved for by so-called time iteration algorithms. Two major bottlenecks make it difficult to achieve a fast time-to-solution when solving large-scale dynamic stochastic OLG models with this iterative method. In each iteration step, an economic function needs to be approximated, and for this purpose (i) the function value has to be determined at many points in the high-dimensional state space, and (ii) each point involves solving a system of nonlinear equations (around 60 equations in 60 unknowns). We overcome these difficulties by massively reducing both the number of points to be evaluated and the time needed for each evaluation. Issue (i) is resolved by the use of adaptive sparse grids, while issue (ii) is resolved using a hybrid parallelization scheme that uses the Message Passing Interface (MPI) among compute nodes and minimizes interprocess communication by using Intel Threading Building Blocks (TBB) within each node. Furthermore, the scheme relies on AVX or AVX-512 vectorization and partially offloads the function evaluations to GPUs to further speed up computations. This scheme enables us to make efficient use of emerging hybrid high-performance computing (HPC) facilities.
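To make the structure of time iteration concrete, the following is a minimal sketch on a toy problem, not the OLG framework described above. It assumes a deterministic one-sector growth model with log utility and full depreciation, chosen only because its exact policy is known in closed form (k' = αβ·k^α). Each grid point requires solving one nonlinear Euler equation, which stands in for the roughly 60-equation systems of the real model. A uniform one-dimensional grid with linear interpolation stands in for adaptive sparse grids, and bisection stands in for a general nonlinear solver. The parameter values and all function names are hypothetical.

```python
# Toy time iteration (illustrative assumptions throughout; not the project's code).
ALPHA = 0.3   # capital share (hypothetical value)
BETA = 0.95   # discount factor (hypothetical value)


def interp(grid, values, x):
    """Piecewise-linear interpolation on a sorted grid, clamped at both ends."""
    if x <= grid[0]:
        return values[0]
    if x >= grid[-1]:
        return values[-1]
    lo, hi = 0, len(grid) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if grid[mid] <= x:
            lo = mid
        else:
            hi = mid
    w = (x - grid[lo]) / (grid[hi] - grid[lo])
    return (1.0 - w) * values[lo] + w * values[hi]


def euler_residual(k, k_next, grid, policy):
    """Euler equation error at state k for a candidate choice k_next.

    The next-period decision is read off the previous iterate `policy`.
    """
    c = k**ALPHA - k_next
    c_next = k_next**ALPHA - interp(grid, policy, k_next)
    return 1.0 / c - BETA * ALPHA * k_next ** (ALPHA - 1.0) / c_next


def solve_point(k, grid, policy):
    """Solve the nonlinear equation at one grid point by bisection."""
    lo = grid[0]
    hi = min(grid[-1], k**ALPHA - 1e-12)  # keep consumption positive
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if euler_residual(k, mid, grid, policy) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def time_iteration(n_points=50, k_min=0.05, k_max=0.5, tol=1e-10, max_iter=500):
    """Iterate on the policy function until successive iterates agree."""
    step = (k_max - k_min) / (n_points - 1)
    grid = [k_min + i * step for i in range(n_points)]
    policy = [0.0] * n_points  # initial guess: nothing is saved next period
    for iteration in range(1, max_iter + 1):
        # The per-point solves are independent of each other; this loop is
        # the part a parallel implementation would distribute.
        new_policy = [solve_point(k, grid, policy) for k in grid]
        change = max(abs(a - b) for a, b in zip(new_policy, policy))
        policy = new_policy
        if change < tol:
            break
    return grid, policy, iteration


if __name__ == "__main__":
    grid, policy, iters = time_iteration()
    exact = [ALPHA * BETA * k**ALPHA for k in grid]
    err = max(abs(p - e) for p, e in zip(policy, exact))
    print(f"iterations: {iters}, max error vs closed form: {err:.2e}")
```

The two cost drivers named in the text are visible in the inner loop: the number of grid points and the cost of each per-point solve. In this sketch both are trivial, whereas in a high-dimensional model the grid size grows rapidly with the dimension of the state space, which is what motivates sparse grids and parallel evaluation.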