Parallel Newton Step for the SC-OPF problem

CS205 Final Project, Group 4

Srivatsan Srinivasan, Manish Vuyyuru, Aditya Karan, Cory Williams

Introduction and Motivation

DC Security Constrained Optimal Power Flow (DC-SCOPF) is a key problem in power systems operations: determining the optimal operation set points of controllable power injections in the event of possible contingencies. It can be formulated as a constrained optimization problem that minimizes the cost of power injection subject to the physical constraints of the power system. The timescale for these decisions ranges from a day ahead to near-real time. As more contingencies are considered, the scale of the problem grows quickly, so solving DC-SCOPF fast and robustly remains an open challenge. Interior Point Methods (IPMs) have proven effective for DC-SCOPF, taking advantage of the structure of the problem to achieve efficient convergence [[1]](#Chiang2015), [[2]](#PDIP). However, one of the major bottlenecks is the large sparse matrix inversion needed at each Newton step. In this work, we borrow ideas from multigrid solvers, which have been successful in solving PDEs, and develop a hierarchical parallel model with distributed and shared memory parallel architectures to expedite the calculation of the Newton step in DC-SCOPF.

How is our work different from other works that tackle this problem?

Recently, multiple works have used Interior Point Methods to solve the DC-SCOPF optimization problem (a linear program) [[1]](#Chiang2015), [[2]](#PDIP). Because the matrices associated with power systems are sparse, specialized interior point method (IPM) solvers need to be developed for security constrained optimal power flow (DC-SCOPF). Such structure-exploiting IPMs are usually bottlenecked by large sparse matrix inversions. To address this, we explore how to leverage ideas from multigrid solvers to improve the computational efficiency of IPMs for DC-SCOPF. Algebraic multigrid (AMG) has drawn much interest for its scaling performance [[3]](#PARMS). On the other hand, there is little work on multilevel solvers for the sparse matrices arising from power systems, which exhibit connectivity specific to the physical network. In this work, we develop a multilevel solver crafted for the strongly local structure of the DC-SCOPF problem and, at the same time, propose an efficient distributed solving scheme that takes advantage of the inherent local block structure of the system to parallelize the solve effectively.

Problem

Notation and Preliminaries

In this problem, we consider the linearized (DC) formulation of the optimal power flow problem. Consider a multi-area interconnected power network with $n_{bus}$ buses, represented by a graph with up to $n_{bus}^2$ possible edges (transmission lines) connecting the buses; let the actual number of lines be $n_{line}$. Let $\mathbf{f} \in \mathbb{R}^{n_{line}}$ denote the real power flow along every line and let $\mathbf{p} \in \mathbb{R}^{n_{bus}}$ denote the power injection profile, i.e. the net real power injection at each bus. Let $\mathbf{\theta}$ be the vector of voltage phase angles with respect to a reference. The power injections and power flows are linearly related to the phase angles as $$\mathbf{p} = \mathbf{B}_{dc} \mathbf{\theta},$$ $$\;\;\mathbf{f} = \mathbf{D}_{dc} \mathbf{A}_{dc} \mathbf{\theta} $$ where $\mathbf{B}_{dc}$ is the bus-susceptance matrix and $\mathbf{A}_{dc},\mathbf{D}_{dc}$ are the line-bus incidence matrix and the line susceptance matrix respectively; all of these matrices naturally encode the network structure.
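
To make the notation concrete, the following toy numpy sketch evaluates these relations on a hypothetical 3-bus, 3-line network (the susceptances and angles are made-up values, and $\mathbf{B}_{dc}$ is formed with the standard DC construction $\mathbf{A}_{dc}^T\mathbf{D}_{dc}\mathbf{A}_{dc}$); it is purely illustrative and not part of our solver.

```python
# Toy illustration of the DC relations p = B_dc * theta and f = D_dc * A_dc * theta.
import numpy as np

# Line-bus incidence matrix A_dc: one row per line, +1/-1 at its two end buses.
A_dc = np.array([[ 1, -1,  0],
                 [ 0,  1, -1],
                 [ 1,  0, -1]], dtype=float)

# Line susceptances on the diagonal of D_dc (illustrative per-unit values).
D_dc = np.diag([10.0, 5.0, 8.0])

# Bus-susceptance matrix under the DC approximation: B_dc = A_dc^T D_dc A_dc.
B_dc = A_dc.T @ D_dc @ A_dc

theta = np.array([0.00, -0.02, -0.05])   # voltage phase angles (rad)
p = B_dc @ theta                          # net real power injection at each bus
f = D_dc @ A_dc @ theta                   # real power flow on each line
print(p, f)
```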

Contingency : We define a single contingency associated with a line $l$ as the scenario in which line $l$ is inactive. A contingency set can be thought of as a collection of lines that are inactive. In principle, the number of such sets can grow extremely large as the number of lines grows.

In the event of such failures, the power system should remain resilient, and we need to find the optimal power injection profile under these line failures at the minimum possible cost.

DC Power Flow Model

The DC-SCOPF formulation we consider is a linear optimization program with state variables equal to the bus voltage angles $\mathbf{\theta}$ (excluding the reference bus) and control variables set to the power injections $\mathbf{p}$. The equality constraints enforce power balance at each bus, which are linear under the DC model, and the inequality constraints are limits on the power injections and bidirectional power flows.

$$\min_{\mathbf{\theta}^{(0)},\dots,\mathbf{\theta}^{(K)},\mathbf{p}} c_{0}(\mathbf{p}) \\ {\text { s.t. } \mathbf{p}_{min} \leq \mathbf{p} \leq \mathbf{p}_{max}} \\ \text{For each contingency $k =0,1 \dots K$} \\ \mathbf{B}_{dc}^{(k)}\mathbf{\theta}^{(k)}-\mathbf{p}=\mathbf{0} \\ {\mathbf{f}_{min} \leq \mathbf{D}_{dc}^{(k)} \mathbf{A}_{dc}^{(k)} \mathbf{\theta}^{(k)} \leq \mathbf{f}_{max}}$$

The linear program in the equations above can be written in matrix form as follows:

\begin{equation} \begin{split} \min_ {\{\mathbf{\theta}^{(k)}\}_{k=0}^K,\mathbf{p}} & c_0(\mathbf{p}) \\ \text{s.t.} \; \mathbf{G}_p \; \mathbf{p} &\leq \underbrace{\begin{bmatrix} \mathbf{p}_{min}\\ \mathbf{p}_{max} \end{bmatrix} }_{\mathbf{c}_p} \\ \text{For each contingency}&\text{ $k = 0,1,\dots,K$} \\ \mathbf{H}_k \begin{bmatrix} \mathbf{\theta}^{(k)} \\ \mathbf{p} \end{bmatrix} &= 0 \\ \mathbf{G}_k \mathbf{\theta}^{(k)} &\leq \underbrace{\begin{bmatrix} \mathbf{f}_{min}\\ \mathbf{f}_{max} \end{bmatrix}}_{\mathbf{c}_I} \end{split} \end{equation}

The first constraint ensures that the power injection profile $\mathbf{p}$ stays within its operating minimum/maximum. The equality constraint per contingency ensures that the power flow equations hold in every contingency. The second inequality ensures that the real power flow along each line stays within its allowable limit.

The dual variables associated with the inequality constraints on the power injection variables and the power flow variables are $\mathbf{\lambda}_p$ and $\mathbf{\lambda}^{(k)}$, respectively. The dual variables associated with the equality constraints are $\mathbf{\nu}^{(k)}$. With these Lagrangian duals, the Newton step in bordered block diagonal (BBD) form is:

\begin{equation} \begin{bmatrix} \mathbf{M}^{(0)} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{C}^{(0)} \\ \mathbf{0} & \mathbf{M}^{(1)} & \dots & \mathbf{0} & \mathbf{C}^{(1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{M}^{(K)} & \mathbf{C}^{(K)} \\ [\mathbf{C}^{(0)}]^T & [\mathbf{C}^{(1)}]^T & \dots & [\mathbf{C}^{(K)}]^T & \mathbf{C}_p \end{bmatrix} \begin{bmatrix} \triangle \mathbf{z}^{(0)} \\ \triangle \mathbf{z}^{(1)} \\ \vdots \\ \triangle \mathbf{z}^{(K)} \\ \triangle \mathbf{z}^{(p)} \\ \end{bmatrix} = - \mathbf{r}_* \end{equation}

where

\begin{align*} \triangle \mathbf{z}^{(k)} &= \begin{bmatrix} \triangle \mathbf{\theta}^{(k)} \\ \triangle \mathbf{\lambda}^{(k)} \\ \triangle \mathbf{\nu}^{(k)} \end{bmatrix}, \quad \triangle \mathbf{z}^{(p)} =\begin{bmatrix} \triangle \mathbf{p} \\ \triangle \mathbf{\lambda}_p \end{bmatrix} \\[10pt] \mathbf{M}^{(k)} &= \begin{bmatrix} \mathbf{0} & \mathbf{G}_k^T & \mathbf{H}_{k,\theta}^T \\ -D(\mathbf{\lambda}^{(k)})\mathbf{G}_k & -D(\mathbf{G}_k \mathbf{\theta}^{(k)} - \mathbf{c}_I) & \mathbf{0} \\ \mathbf{H}_{k,\theta} & \mathbf{0} & \mathbf{0} \end{bmatrix} \quad k=0,1,\dots,K \\[10pt] \mathbf{C}^{(k)} &= \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \\ \mathbf{H}_{k,p} & \mathbf{0} \\ \end{bmatrix}, \quad \mathbf{C}_p = \begin{bmatrix} \nabla^2c_0(\mathbf{p}) & \mathbf{G}_p^T \\ -D(\mathbf{\lambda}_p)\mathbf{G}_p & -D(\mathbf{G}_p \mathbf{p} - \mathbf{c}_p) \end{bmatrix} \end{align*}
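
To illustrate how a bordered block diagonal (BBD) matrix of this shape is put together, the sketch below assembles one from per-contingency blocks using scipy.sparse.bmat. The block sizes and random entries are arbitrary placeholders; this is a structural illustration, not our production assembly code.

```python
# Schematic assembly of a bordered block-diagonal matrix from sparse blocks.
import scipy.sparse as sp

def assemble_bbd(M_blocks, C_blocks, C_p):
    """Assemble the BBD matrix from diagonal blocks M^(k), couplings C^(k) and C_p."""
    K1 = len(M_blocks)                       # K+1 blocks (k = 0..K)
    rows = []
    for k in range(K1):
        # k-th block row: M^(k) on the diagonal, C^(k) in the border column, zeros elsewhere
        row = [M_blocks[k] if j == k else None for j in range(K1)]
        row.append(C_blocks[k])
        rows.append(row)
    # last block row: the transposed couplings and the power injection block C_p
    rows.append([C.T for C in C_blocks] + [C_p])
    return sp.bmat(rows, format="csr")

# Tiny example with K = 2 contingencies; the dimensions and entries are arbitrary.
M = [sp.random(4, 4, density=0.5, random_state=k) for k in range(3)]
C = [sp.random(4, 2, density=0.5, random_state=10 + k) for k in range(3)]
Cp = sp.random(2, 2, density=0.8, random_state=99)
A = assemble_bbd(M, C, Cp)
print(A.shape)   # (14, 14)
```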

Two-level Serial Multilevel Solver

We use the multilevel solver described in Algorithm 1 as the serial implementation for solving the Newton step once the global system has been assembled with all contingencies. The entire serial procedure involves two steps, and the times we report for serial operations in our results include both of these operations together:

  • Generating the system under contingencies from the base system
  • Multi-Level Solving (Algorithm 1)
ALGO 1 : Multi-Level Solver for the SCOPF Problem
  1. Find an independent set permutation $\mathbf{P}_l$
  2. Apply the permutation: $\hat{\mathbf{A}}_l = \mathbf{P}_l^T\mathbf{A}_l\mathbf{P}_l$
  3. Compute the ILU factorization of $\hat{\mathbf{A}}_l$.
  4. Solve via forward substitution: $\mathbf{L}_l\mathbf{f}_l'=\mathbf{f}_l$ for $\mathbf{f}_l'$
  5. Descend to the "coarser" level system: compute $\mathbf{b}_{l+1} = \mathbf{g}_l - \mathbf{E}_l\mathbf{U}_l^{-1}\mathbf{f}_l'$
  6. If level $l$ equals the terminal level $L$, solve $\mathbf{A}_{l+1}\mathbf{y}_l = \mathbf{b}_{l+1}$
  7. Else, recurse with the same algorithm using $\mathbf{A}_{l+1},\mathbf{b}_{l+1}$
  8. Ascend to the finer level system: compute $\mathbf{f}_l'' = \mathbf{f}_l'-\mathbf{L}_l^{-1}\mathbf{F}_l\mathbf{y}_l$
  9. Solve via backward substitution: $\mathbf{U}_l\mathbf{u}_l=\mathbf{f}_l''$
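
The following is a simplified serial sketch of the two-level pass of Algorithm 1, assuming the permuted system has already been split into blocks $\mathbf{B},\mathbf{F},\mathbf{E},\mathbf{C}$ with right-hand side $(\mathbf{f},\mathbf{g})$. For brevity, an exact sparse LU (scipy's splu) stands in for the ILU factorization used in the real solver, and the forward/backward substitutions of steps 4 and 9 are folded into calls to the factorization's solve routine.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def two_level_solve(B, F, E, C, f, g):
    """Solve [B F; E C][u; y] = [f; g] with one Schur-complement (two-level) pass.
    B, F, E, C are sparse blocks of the permuted system; f, g are 1-D arrays."""
    B_lu = spla.splu(sp.csc_matrix(B))             # step 3: factorize the fine block (exact LU here)
    f_prime = B_lu.solve(f)                        # step 4: solve with the fine block
    Binv_F = B_lu.solve(F.toarray())               # B^{-1} F, kept dense for simplicity
    A_coarse = C.toarray() - E @ Binv_F            # step 5: Schur complement C - E B^{-1} F
    b_coarse = np.asarray(g - E @ f_prime).ravel() #         coarse right-hand side
    y = np.linalg.solve(A_coarse, b_coarse)        # step 6: terminal-level solve
    u = B_lu.solve(f - F @ y)                      # steps 8-9: ascend and back-substitute
    return u, y
```

Because the LU here is exact, the coarse solve returns the exact Schur complement solution; with an ILU, the same structure yields the approximate two-level pass described in Algorithm 1.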

From profiling our code we developed a callgraph (see Figure 1 below) and found three main bottlenecks. The biggest bottleneck was the terminal step; the second was the descent from finer to coarser systems; and the smallest was the generation of the system matrix.

Figure 1: Callgraph showing bottlenecks

Description of Parallel Solution

In this section, we describe our parallelization plan (Figure 3), which aims to tackle three components.

  • Step 1 : Parallelize the generation of the system matrix with contingencies with the base system as input
  • Step 2 : Distribute the global system into local systems and descend from the finer system to the coarser system (Step 5 in Algorithm 1)
  • Step 3 : Parallelize the terminal level solver (Step 6 in Algorithm 1)

Block 1: Generation of the system with contingencies

In our problem, we assume that the base system is provided without incorporating contingencies, i.e. we already know the power system as a graph together with the physical constraints on the bidirectional power flows along the lines and the minimum/maximum power injections possible at each bus. The first step is then to determine the full system under a given number of contingencies. In this case, we additionally need to generate power flow constraints with respect to the new contingencies introduced into the circuit.

In this work, \emph{we limit ourselves to six contingencies $(K=6)$} as a proof of concept, but in practice this could scale to potentially $O(n_{line})$ contingencies. Even with just 6 contingencies, the resulting system is characterized by a matrix $\mathbf{A}$ on the order of $100000\times100000$, so generating a parallel solution for such a system is a solid proof of concept in itself. Since the original system without the contingencies does not have a huge memory requirement, and the contingencies themselves originate independently and are only later coupled through the power flow dynamics of the system, generating them is equivalent to a serial "for" loop whose iterations do not affect one another. Thus, we can generate the entire system for all contingencies in a shared memory architecture using PyMP. The expected theoretical speedup for this step is approximately $K(=6)$ in our case, since the $K$ contingency sub-systems are generated independently. This step of parallelization is a case of weak scaling, where we can increase the number of processes as the problem size, in terms of the number of contingencies, grows. In the process flow diagram shown in Figure 3, this step generates the initial level-0 system $\mathbf{A}^0$ from the base case system matrix $\mathbf{A}$ in a shared memory parallel setup across the $K(=6)$ contingencies, as sketched below.
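
A hedged sketch of this shared memory loop is shown below. `build_contingency_system` is a placeholder name for the routine that applies a line outage to the base case and returns the corresponding constraint blocks; it is not a real function in PyMP or in any library.

```python
import pymp

def generate_contingency_systems(base_system, contingencies, num_procs=6):
    # Shared dict so each worker can deposit its sub-system under its contingency index.
    results = pymp.shared.dict()
    with pymp.Parallel(num_procs) as p:
        for k in p.range(len(contingencies)):
            # Each contingency is independent of the others, so the iterations
            # need no communication beyond the final collection of results.
            results[k] = build_contingency_system(base_system, contingencies[k])  # placeholder routine
    return [results[k] for k in range(len(contingencies))]
```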

Block 2: Independent System Design and Distributing the Global System into Local Systems - Distributed Solution of step 5 in Algorithm 1

Motivation and two-level distributed SC-OPF preliminaries for K contingencies

The reason a general system of equations cannot be inverted effectively in a completely distributed setting is the dependence of variables on many other variables when the system is underpinned by a strongly connected graph. This entails significant communication and synchronization overheads, and any advantage gained from parallelization could be lost. On the other hand, it is well known that power systems have strong intrinsic local sub-structures, as every bus is not expected to be connected to every other bus. In such a setting, any solver we build should take advantage of this fact, and specialized multilevel methods need to be devised. In particular, we design the system with the maximum independence possible when it is distributed into local systems, so that as many variables as possible can be solved locally while minimizing the need for communication and synchronization.

In a distributed setting, we aim to parallelize the global Newton step system described in the previous section by partitioning the variables amongst processors, so that each processor works on a set of local equations and owns a set of variables. The set of unknowns owned by processor i is split into two groups: 1) local unknowns $\mathbf{u}_i$ which are only involved in equations owned by the local processor and 2) interface unknowns $\mathbf{y}_i$ which are involved in both the local equations and equations owned by other processors. Partitioning the variables into local and interface sets induces a $2\times2$ block structure in the local matrix at every processor $i$

$$\mathbf{P}_i^T\mathbf{A}_i\mathbf{P}_i\begin{bmatrix} \mathbf{u}_i \\ \mathbf{y}_i \end{bmatrix} = \begin{bmatrix} \mathbf{B}_i & \mathbf{F}_i \\ \mathbf{E}_i & \mathbf{C}_i \end{bmatrix} \begin{bmatrix} \mathbf{u}_i \\ \mathbf{y}_i \end{bmatrix} = \begin{bmatrix} \mathbf{f}_i \\ \mathbf{g}_i \end{bmatrix} $$

Distribution Plan for global to local system for K contingencies

We choose the permutation so as to minimize both the communication and the cost of the inverse. Since our ultimate goal is to descend using the Schur complement, $\mathbf{C} - \mathbf{E}\mathbf{B}^{-1}\mathbf{F}$, it is advantageous to construct $\mathbf{B}$ so that its inverse can be computed quickly. We also note that, by construction of our BBD matrix, the block associated with the dual variables $\lambda$ is diagonal. Hence we choose an independent set that groups the $\lambda$-associated variables in $\mathbf{B}$ and then groups $\theta$ and $\nu$ in the block $\mathbf{C}$ [[6]](#MLSolver). The resulting permutation is visualized in Figure 2.

Figure 2: Final Permutation structure - note the diagonal matrix for B

This allows us to create an independent set $B$ that can easily be parallelized across processors based on which contingency it belongs to.
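
A small sketch of how such a permutation can be built is given below, assuming we already know which global indices correspond to the $\lambda$ variables and which correspond to the remaining ($\theta$, $\nu$, $p$) variables; the helper name and calling convention are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

def independent_set_permutation(lambda_idx, other_idx):
    """Build a sparse permutation matrix P that orders the lambda indices first."""
    order = np.concatenate([lambda_idx, other_idx])   # new ordering of all n unknowns
    n = order.size
    P = sp.csr_matrix((np.ones(n), (order, np.arange(n))), shape=(n, n))
    return P

# Usage: A_perm = P.T @ A @ P reorders the Newton matrix so that the top-left
# block B (the rows/columns in lambda_idx) is diagonal by construction.
```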

At level $l$, the local and interface variables for processor $i$ are denoted $\mathbf{u}^l_i$ and $\mathbf{y}^l_i$, and we work with the independent set permutation:

\begin{align*} \mathbf{u}^0 &= \begin{bmatrix} \mathbf{u}^0_0 \\ \mathbf{u}^0_1 \\ \vdots \\ \mathbf{u}^0_K \\ \mathbf{u}^0_p \\ \end{bmatrix} \quad \text{where} \;\mathbf{u}^0_i = \triangle \lambda_i, \; \mathbf{u}^0_p = \triangle \lambda_p, \quad \mathbf{y}^0 = \begin{bmatrix} \mathbf{y}^0_0 \\ \mathbf{y}^0_1 \\ \vdots \\ \mathbf{y}^0_K \\ \mathbf{y}^0_p \\ \end{bmatrix} \quad \text{where} \;\mathbf{y}^0_i = \begin{bmatrix} \triangle \mathbf{\nu}_i \\ \triangle \mathbf{\theta}_i \\ \end{bmatrix} \; \mathbf{y}^0_p = \triangle \mathbf{p} \end{align*}

where the subscripts refer to the processor index and the superscripts to the level. The global Newton step seen earlier, reordered according to this independent set, becomes:

$$\mathbf{A}^0 \begin{bmatrix} \mathbf{u}^0 \\ \mathbf{y}^0 \end{bmatrix} = \begin{bmatrix} \mathbf{B}^0 & \mathbf{F}^0\\ \mathbf{E}^0 & \mathbf{C}^0 \end{bmatrix}\begin{bmatrix} \mathbf{u}^0 \\ \mathbf{y}^0 \end{bmatrix}=\begin{bmatrix} \mathbf{f}^0 \\ \mathbf{g}^0 \end{bmatrix}$$

with each component having the following structural form (we do not present the exact values of each sub-component for brevity; one can work out the math from the global Newton step defined in the previous section to identify the exact analytical form of each component):

$$\mathbf{B}^0 = \begin{bmatrix} \mathbf{B}^0_0 & \mathbf{0} & \dots & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{B}^0_1 & \dots & \mathbf{0} & \mathbf{0}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{B}^0_K & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \dots & \mathbf{0} & \mathbf{B}^0_p & \\ \end{bmatrix} \quad \mathbf{E}^0 = \begin{bmatrix} \mathbf{E}^0_0 \\ \mathbf{E}^0_1 \\ \vdots \\ \mathbf{E}^0_K \\ \mathbf{E}^0_p \end{bmatrix} \quad \mathbf{F}^0 = \begin{bmatrix} \mathbf{F}^0_0 \\ \mathbf{F}^0_1 \\ \vdots \\ \mathbf{F}^0_K \\ \mathbf{F}^0_p \end{bmatrix} \quad \mathbf{C}^0 = \begin{bmatrix} \mathbf{C}^0_0 \\ \mathbf{C}^0_1 \\ \vdots \\ \mathbf{C}^0_K \\ \mathbf{C}^0_p \end{bmatrix}$$$$\mathbf{f}^0 = \begin{bmatrix} \mathbf{f}^0_0 \\ \mathbf{f}^0_1 \\ \vdots \\ \mathbf{f}^0_K \\ \mathbf{f}^0_p \end{bmatrix} \quad \mathbf{g}^0 = \begin{bmatrix} \mathbf{g}^0_0 \\ \mathbf{g}^0_1 \\ \vdots \\ \mathbf{g}^0_K \\ \mathbf{g}^0_p \end{bmatrix}$$

In this case, our independent set ordering distributes variables across $K+2$ processors: the base case processor $i=0$, the contingency processors $i=1,2,\dots,K$, and the power injection processor $p$. In this way, the global system can be distributed as local systems across processors. For example, the local system at processor $i$ at level 0 can be written as

$$\mathbf{A}^0_i \begin{bmatrix} \mathbf{u}_i^0 \\ \mathbf{y}_i^0 \end{bmatrix} + \hat{\mathbf{A}}_i^0 \mathbf{y}^0_{i,ext} = \begin{bmatrix} \mathbf{B}_i^0 & \mathbf{F}_i^0 \\ \mathbf{E}_i^0 & \mathbf{C}_i^0 \end{bmatrix}\begin{bmatrix} \mathbf{u}_i^0 \\ \mathbf{y}_i^0 \end{bmatrix}+\begin{bmatrix} \mathbf{0} \\ \sum_{j \in \mathcal{N}_i} \mathbf{E}_{ij}^0\mathbf{y}_j^0 \end{bmatrix} = \begin{bmatrix} \mathbf{f}_i^0 \\ \mathbf{g}_i^0 \end{bmatrix} $$

where, as discussed earlier, $\mathbf{u}_i^0$ refers to local variables that are owned by processor $i$.

Distributed Calculation of Reduced System (Level l=0 -> l=1)

As seen in the previous step, we have reduced the global Newton step and distributed it across $K+2$ processors: the base case processor ($i=0$), the $K$ contingency processors ($i=1,2,\dots,K$) and the power injection processor ($p$). To descend from level 0 to level 1, the fine/local variables are eliminated using LU factorization. The level-1 system $\mathbf{A}^1 = \mathbf{C}^0-\mathbf{E}^0(\mathbf{B}^0)^{-1}\mathbf{F}^0$ is the Schur complement and can be calculated in a distributed way thanks to the sparsity patterns of $\mathbf{B}^0,\mathbf{C}^0,\mathbf{E}^0,\mathbf{F}^0$ seen in the previous step. In a distributed computation, the global system is (with each component evaluated on a different processor and then assembled):

$$\mathbf{A}^1 = \sum_{i}\begin{bmatrix} \mathbf{0} \\ \vdots \\ \mathbf{C^0_i} \\ \vdots \\ \mathbf{0} \end{bmatrix} - \mathbf{M}_i^0 (\mathbf{B}_i^0 )^{-1}\mathbf{F}_i^0 \quad \text{where} \;\; \mathbf{E}^0 = [\mathbf{M}_0^0 \dots \mathbf{M}_K^0,\mathbf{M}_p^0] $$

Due to the sparsity structure of $\mathbf{B}^0,\mathbf{C}^0,\mathbf{E}^0,\mathbf{F}^0$, no communication across the processors is needed during the distributed descent step. Each processor solves for its local variables independently and constructs its own component of the reduced level-1 system. Once the global level-1 matrix $\mathbf{A}^1$ is assembled, we once again use the same independent ordering used when splitting the level-0 system to obtain a distributed local system at level 1 of the general form:

$$\begin{bmatrix} \mathbf{B}_i^1 & \mathbf{F}_i^1 \\ \mathbf{E}_i^1 & \mathbf{C}_i^1 \end{bmatrix}\begin{bmatrix} \mathbf{u}_i^1 \\ \mathbf{y}_i^1 \end{bmatrix}+\begin{bmatrix} \mathbf{0} \\ \sum_{j \in \mathcal{N}_i} \mathbf{E}_{ij}^1\mathbf{y}_j^1 \end{bmatrix} = \begin{bmatrix} \mathbf{f}_i^1 \\ \mathbf{g}_i^1 \end{bmatrix}$$

where the matrices $\mathbf{E}_{ij}^1$ encapsulate all the communication required to solve the terminal level (L1; recall that we build only a two-level system in this work) [[6]](#MLSolver). A hedged sketch of the distributed descent is shown below.
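
The mpi4py sketch below is schematic: it assumes that each rank already holds its local blocks $\mathbf{B}_i^0,\mathbf{E}_i^0,\mathbf{F}_i^0,\mathbf{C}_i^0$ padded to the global interface dimension, so that the per-rank Schur complement contributions can simply be summed with a reduction, and exact LU again stands in for the ILU used in the real solver.

```python
import numpy as np
from mpi4py import MPI
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def distributed_descent(B_i, E_i, F_i, C_i, f_i, g_i, comm=MPI.COMM_WORLD):
    """Form this rank's contribution C_i - E_i B_i^{-1} F_i and assemble A^1, b^1 by reduction."""
    B_lu = spla.splu(sp.csc_matrix(B_i))                              # exact LU stands in for ILU
    local_schur = C_i.toarray() - E_i @ B_lu.solve(F_i.toarray())     # C_i - E_i B_i^{-1} F_i
    local_rhs = np.asarray(g_i - E_i @ B_lu.solve(f_i)).ravel()       # g_i - E_i B_i^{-1} f_i
    # B^0 is block diagonal across ranks, so no communication is needed to form
    # the local pieces; a single reduction assembles the level-1 system.
    A1 = comm.allreduce(local_schur, op=MPI.SUM)
    b1 = comm.allreduce(local_rhs, op=MPI.SUM)
    return A1, b1
```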

Block 3: Distributed Terminal Level Solution for K contingencies - Distributed Solution of step 6 in Algorithm 1

Remember that we are solving the system only up to level 1 (descending from level 0), so we need a terminal solution at this level. We adapt an approximate distributed Schur complement technique [[3]](#PARMS) in order to solve the global level-1 system, distributed as described in the previous section, across the processors. We have already discussed how to reduce the global level-1 system onto the local processors. In total we have $K+2$ processors in our distributed setup: one for each of the $K$ contingencies, one for the base case, and one for the power injection variables.

It is to be noted that the processors fall into two classes. The base case and all the contingency processors form one class, and the processor $p$ that handles the power injection variables forms the other. The former processors have both local variables $u_i^1$ and interface variables $y_i^1$ to solve, whereas the power injection processor has only interface variables $y_p^1$ to operate with. This is because the power injection variables cannot be solved independently without information on the base and contingency constraints; processor $p$ needs to exchange information with each of the contingency processors in order to receive the updates to the equality constraint dual variables. Algorithm 2 details how the terminal system is solved on the base case and contingency processors, and Algorithm 3 details how it is solved on the power injection processor. [[6]](#MLSolver)

In the actual solution, we perform multiple iterations of Algorithms 2 and 3, since these solves are iterative. At a high level, the base and each contingency processor receive the interface variable values from the power injection processor, perform one iteration of the solver and send their interface variables back to the power injection processor. The power injection processor receives the interface variables from all the processors, performs one solver iteration on the interface variables and sends them back to the base case and contingency processors. A hedged sketch of this exchange pattern is given after Algorithm 3.

ALGO 2 : Terminal Level (L1) Solution for the base and contingency (non-power-injection) processors [[3]](#PARMS)

Inputs: $(\mathbf{A}_i^1,[\mathbf{f_i^1},\mathbf{g_i^1}])$

To Solve: $u_i^1, y_i^1$

  1. Compute ILU factorization of $\mathbf{A}_i^1 \approx \mathbf{L}_i^1\mathbf{U}_i^1$
  2. Set $\mathbf{y}_i^1 := 0$
  3. $\mathbf{r} := (\mathbf{L}_i^1\mathbf{U}_i^1)^{-1}\begin{bmatrix} \mathbf{f}_i^1 \\ \mathbf{g}_i^1 \end{bmatrix}$
  4. $\beta = ||\mathbf{r}||_2, \mathbf{v_1} = \mathbf{r}/\beta$
  5. Exchange Interface variables with the power injection processor $y_p^{1}$
  6. Perform a GMRES step and forward/backward substitution to update the local variables $u_i^1$ and the interface variables $y_i^1$, and communicate the latter back to the power injection processor $p$.
ALGO 3 : Terminal Level (L1) Solution for the power injection processor [[3]](#PARMS)

Inputs: $(\mathbf{C}_p,\;\mathbf{g}_p^1)$

To Solve: $y_p^1$

  1. Compute ILU factorization of $\mathbf{C}_p \approx \mathbf{L}_p\mathbf{U}_p$
  2. Set $\mathbf{y}_p^1 := 0$
  3. $\mathbf{r} := (\mathbf{L}_p\mathbf{U}_p)^{-1}\mathbf{g}_p^1$
  4. $\beta = ||\mathbf{r}||_2, \mathbf{v_1} = \mathbf{r}/\beta$
  5. Exchange interface variables with the base and contingency processors $y_i^{1},\; i =0,1,\dots,K$
  6. Perform a GMRES step and forward/backward substitution to update the interface variables $\mathbf{y}_p^1$ and communicate them back to the other processors (base and contingency processors) $i=0,1,\dots,K$
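
The sketch below illustrates the iteration and communication pattern of Algorithms 2 and 3 with mpi4py. scipy's ILU-preconditioned GMRES stands in for the hand-written GMRES step of the algorithms, the choice of the last rank as the power injection processor is a convention of this sketch, and the `E_ext` dictionary of coupling blocks is an assumed data layout rather than our actual implementation.

```python
import numpy as np
from mpi4py import MPI
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gmres_sweep(A_local, b_local, x0=None):
    """One ILU-preconditioned GMRES sweep on a rank's local level-1 block
    (steps 1-4 and 6 of Algorithms 2/3; scipy's GMRES replaces the hand-written step)."""
    ilu = spla.spilu(sp.csc_matrix(A_local))
    M = spla.LinearOperator(A_local.shape, matvec=ilu.solve)
    x, _ = spla.gmres(A_local, b_local, x0=x0, M=M, restart=30, maxiter=1)
    return x

def terminal_solve(A_local, b_local, E_ext, n_interface, n_iters=10,
                   comm=MPI.COMM_WORLD):
    """E_ext: dict {neighbor rank j: sparse E_ij^1} mapping neighbor interface
    values into this rank's equations (an assumed data layout for the sketch)."""
    rank = comm.Get_rank()
    p_rank = comm.Get_size() - 1              # power injection rank (a convention of this sketch)
    x = np.zeros(A_local.shape[0])            # step 2: local + interface unknowns start at zero
    y_ext = {}                                # latest interface values received from neighbors
    for _ in range(n_iters):
        # fold the external interface contributions sum_j E_ij^1 y_j^1 into the rhs
        coupling = sum((E_ext[j] @ y for j, y in y_ext.items()),
                       np.zeros_like(b_local))
        x = gmres_sweep(A_local, b_local - coupling, x0=x)
        y_local = x[-n_interface:]            # this rank's interface unknowns
        # step 5 of Algorithms 2/3: all ranks ship their interface to the
        # power injection rank, which broadcasts its own updated interface back.
        all_y = comm.gather((rank, y_local), root=p_rank)
        y_p = comm.bcast(y_local if rank == p_rank else None, root=p_rank)
        y_ext = dict(all_y) if rank == p_rank else {p_rank: y_p}
        y_ext.pop(rank, None)                 # a rank never couples to itself
    return x
```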

Distributed Back-substitution (Level $l = 1 \rightarrow l=0$): Distributed Solution of step 9 in Algorithm 1

After solving for $\mathbf{u}^1,\mathbf{y}^1$ in the terminal step, we ascend to the level-0 unknowns by back substitution, $\mathbf{B}^0\mathbf{u}^0 = \mathbf{f}^0 - \mathbf{F}^0\mathbf{y}^0$, which is performed in parallel with no communication needed.

Figure 3: Computational graph for the solution

Models and Data

Ours is a compute-intensive application, and our only dependence on external inputs is a set of realizations of real-world power systems with their physical constraints, i.e. the base matrix $\mathbf{A}$.

We used three test cases for our model (sources are cited): 5 buses [[5]](#5), 189 buses [[4]](#189/2224), and 2224 buses [[4]](#189/2224). The contingencies were generated by imposing the appropriate physical and power injection constraints under the assumption that certain randomly selected lines in the power network were tripped, and the resulting contingencies were saved. These two data sources are the inputs to Block 1, which generates the system matrix for the Newton step from the contingencies and base cases in parallel.

Parallel Application and Programming Models

Figure 3 summarizes the parallel application and the programming model that was used. Here, we used PyMP to parallelize the generation of our initial matrix $\mathbf{A}$ from the base system and contingencies (ideally we would use a true shared memory framework such as OpenMP, but this option is not available in Python, our development language). Afterwards, we used distributed memory parallelism with mpi4py for the descent, terminal level solution and back substitution steps. Our parallel solution exhibits both loop-level parallelism (system contingency generation, Block 1) and task-level parallelism (distributed solver, Blocks 2 and 3), where each processor works on a different task.

Our application is strongly compute intensive, and on every processor we bring the data close to the compute, which is the prime focus. Since each processor owns its own set of variables, our programming model is technically Single Program Multiple Data (SPMD), although the communications and instructions run for the terminal solution on the power injection processor differ from those on the rest of the processors in terms of solving and communication requirements (e.g. the power injection processor has no local variables, only interface variables, which need to be communicated to all processors). Our parallel design exhibits weak scaling, as we increase the number of processors with increasing problem size (number of contingencies). We also expect our parallel system to be coarse-grained, since the computation-to-communication ratio is high on each processor: most processors are largely independent because of our choice of independent partitioning set.

Complications, Overheads and Advanced Features in the project

Complications : Communication and Synchronization overheads

The main reason solving linear systems cannot generally be done effectively in a distributed fashion is the dependence of the system variables on one another. In that case, all processors need to communicate all their owned variables to all other processors (in the worst case, every processor needs every variable), making communication and synchronization costs overshoot the gains of parallelization. The same effect would be observed in our distributed multilevel system if we parallelized by arbitrarily dividing the set of variables among processors without considering these overheads.

We minimize these communication overheads by taking advantage of the system's internal structure. Our choice of permutation matrix, which reorients the system so as to partition the variables into base case variables, contingency variables and power injection variables, ensures that strongly locally connected variables live on the same processor; communication and synchronization overheads are thus reduced drastically, providing good overall speedups and allowing us to realize the power of parallelization. This was an example of where naive parallelization alone was not enough and the parallelization strategy had to be adapted to the structure and requirements of the application at hand.

Advanced Features and Models

Beyond mitigating the communication and synchronization overheads, we also had to leverage some MPI modules beyond what was used in the course. Recall that the power injection processor interfaces with all the other processors, communicating and synchronizing the interface variables back and forth, while the rest communicate only with the power injection processor. To make this communication and synchronization possible, we leveraged MPI's gather, broadcast, and scatter collectives. These were used specifically so that the power injection processor continually receives interface information from all the other processors while broadcasting its own interface back to them. In all our MPI implementations we used mpi4py, the Python equivalent of the C-based MPI we learned in class, with minor differences in attributes; a minimal example of the collective pattern is shown below.
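
The following minimal mpi4py example shows the collective pattern in isolation (the interface vectors here are dummy data, and the last rank plays the power injection role purely by convention); run it with something like mpirun -n 8 python collectives_demo.py.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
p_rank = size - 1                              # treat the last rank as the power injection processor

y_local = np.full(3, float(rank))              # stand-in for this rank's interface variables

# Gather: the power injection rank receives one interface vector from every rank.
all_interfaces = comm.gather(y_local, root=p_rank)

# Broadcast: the power injection rank sends its (updated) interface to everyone else.
y_from_p = comm.bcast(y_local if rank == p_rank else None, root=p_rank)

# Scatter: the power injection rank can also send a rank-specific piece back to each processor.
piece = comm.scatter(all_interfaces if rank == p_rank else None, root=p_rank)

if rank == p_rank:
    print("p gathered", len(all_interfaces), "interface vectors")
else:
    print(f"rank {rank} received p's interface {y_from_p} and its scatter piece {piece}")
```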

Performance Evaluation and Overheads

All our experiments were set up as follows: the PyMP-based parallelization used $K$ cores ($K = 1,2,\dots,6$) within one compute node on Odyssey, and the mpi4py parallelization used $K+2$ compute nodes (base, $K$ contingencies, power injection) on Odyssey, while each node effectively used only one core (apart from a few implicit multi-core operations in some scipy modules), despite the 8-32 cores available per node.

As referenced in Figure 3, there are three main blocks of parallelization that we exploited: Python multiprocessing (PyMP) to generate the original matrix $\mathbf{A}$ (Block 1), and MPI distributed memory for the descent between levels (Block 2) and for the terminal level solve (Block 3). Figure 4 shows how these blocks can be used to speed up the program and to improve runtime as the number of contingencies increases.

Figure 4: compute times using various blocks of parallelization

Note that in this figure and the following figures, "K" denotes the number of contingencies. Using Python multiprocessing proved to improve the total runtime only minimally; this is mainly because PyMP was applied to a portion of the program that did not contribute much of the compute time in the first place. Furthermore, using MPI in Block 2 (the non-terminal level) had minimal impact. The majority of the improvement came from utilizing distributed memory at the terminal level, where a significant set of local and interface variables is solved; naturally, parallelizing this portion had the maximal impact on our speedups.

Furthermore, the benefits of parallelism become more visible as the problem size increases. In our case, the problem size can be increased either by increasing the number of buses or the number of contingencies. Figure 5 below shows how the speedup of the program varies with the number of buses and contingencies.

Figure 5: Speedups vs. contingencies with varying problem sizes

Finally, we compare the relative advantages of our parallelized multigrid solver against the serial multigrid solver and scipy's sparse system solver. Figure 6 below indicates that for larger numbers of contingencies, our solver has the lowest runtimes. This result is interesting for two reasons: the scipy solver internally parallelizes operations across cores within the same node, and we found that the residuals from the scipy solver were as bad as those from our distributed solution, if not worse.

Figure 6: Total runtimes using various solvers

Discussion

Effect of different parallel blocks

We have three parallel blocks in our design: generation of the system with contingencies, the distributed descent and the distributed terminal level solution. From the results, we see that the maximum benefit came from distributing the terminal level solver (Block 3). This provides two valuable lessons: parallelizing the dominant bottleneck has the most significant overall value for the speedup, and an effective partitioning scheme that enhances the independence of the processors reduces the communication needed to transfer the interface variables. The other two parallel blocks (1 and 2), while they provided similar speedups in percentage terms as the terminal solver, were not instrumental for the overall speedup, since their serial runtimes were not as high as the terminal step, as one can observe from the serial code profiling graph.

Scaling Model and Analysis of our system

Our experiments also indicate that our model scales reasonably well as the number of contingencies grows from 1 to 6. Recall that each contingency adds several constraints to the linear program and significantly increases the size of the final matrix system we solve. We analyzed our model in terms of iso-time scaling and found that the computation time stayed more or less constant as the number of processors increased with a growing number of contingencies. We also note that our parallel design is an example of weak scaling, as the number of processors grows with the number of contingencies, i.e. with problem size.

Problem Size matters

We also observed the benefits of parallelization only when the system became sufficiently large (e.g. with 6 contingencies and 2224 buses). We believe this is for two reasons: i) for smaller systems, the communication and synchronization overheads trump any speedup that parallelization provides, and ii) larger sparse systems are more likely to be ill-conditioned than smaller ones, leading to poorer convergence. Thus, our experiments show that the need for parallelization depends on the computational workload each problem component warrants, and grows with the problem size.

Notes on Numerical Stability

Notwithstanding the runtime benefits of parallelization, the accuracy of our solution tended to be marginally worse than that of the serial multilevel solver (based on residuals); we attribute this to the poorly conditioned sparse system matrices. Given that the distributed implementation iterates between the contingency and power injection processors, there are multiple iterative solves of the interface variables and multiple significant communication/synchronization steps; an ill-conditioned matrix can lead to accumulating residuals in such an iterative process and could be the reason our residuals are non-trivial. Further work is needed to improve the numerical accuracy of the solution, mostly by investing effort in a more thorough preconditioning scheme. It is worth noting, however, that typical built-in direct solvers (scipy.sparse etc.) also tend to perform fairly poorly; in general, across sparse solvers, getting a better handle on the condition numbers of sparse operations is a standard issue when working with this type of structure.

Note: description of the software and compute infrastructure is in the appendix

Lessons Learned, Future Work, Conclusion

Lessons Learned

The first lesson we learned from implementing a real-world parallel system is that blind parallelism is counterproductive and can make the system slower than the serial version. One needs to think carefully about and analyze the parallel design before implementation. The first step is to identify all possible overheads by profiling and to ensure that the design guarantees as much independence as possible in the system; in a nutshell, we found that more independence between the processors meant that parallelization was more effective and scalable. The next key lesson is that scaling analysis is always an important aspect of performance evaluation. Many times, we found that the system behaved significantly differently for 1 contingency vs. 6 contingencies in terms of overheads, and one needs to ensure that the parallel design is robust enough to scale with an increasing problem size while guaranteeing similar properties in terms of time, efficiency etc. This exercise also helped us develop good practices for parallelizing any real-world application. In the process, we learned and adopted the workflow: profile the base version, identify the inefficiencies and optimize the serial code, profile once again, then find avenues to parallelize, implement a good parallel design, and finally profile again to complete one iteration of the process. Any real-world problem entails multiple iterations of this loop. Beyond the parallelization scope, we also learned a great deal about numerical stability, ill-conditioned matrices, robust solvers and accumulating residuals in iterative solvers while attempting to solve our sparse system.

Future Work

Natural extensions of our project include: scaling to larger real-world power grids with more contingencies and buses on the order of tens of thousands; better preconditioning approaches for our sparse matrices to prevent numerical instability and large accumulating residuals in the iterative procedure; evaluating the trade-offs between synchronous (current) and asynchronous (future work) exchange of interface variables between processors; increasing the number of levels in our solver in order to solve sparser systems more robustly; and finally, attacking other IPM optimization problems with similar network characteristics, such as swarm robotics, cyber-networks and transport grids.

Conclusion

Overall, we realized the benefits of effective parallel design for the Newton step of the SC-OPF problem once we identified the significant communication overheads and devised an efficient partitioning scheme. This scheme demonstrates well-engineered load balancing with minimal communication and synchronization overheads by exploiting the system's sparse structure, and it significantly speeds up the critical Newton step for sufficiently large problems. Further work will explore ways to improve accuracy by employing more thorough preconditioning techniques to improve stability.

Acknowledgement

We thank Prof. Na Li and her lab members for providing several key insights, inspiration and guidance for this work, and we genuinely thank Ariana Minot for her prior work on this problem.

References

[1] Naiyuan Chiang and Andreas Grothey, "Solving security constrained optimal power flow problems by a structure exploiting interior point method" https://doi.org/10.1007/s11081-014-9250-1

[2] A. Minot, "A parallel primal-dual interior-point method for DC optimal power flow" , 2016 Power Systems Computation Conference (PSCC)

[3] Zhongze Li, "pARMS: a parallel version of the algebraic recursive multilevel solver" https://onlinelibrary.wiley.com/doi/abs/10.1002/nla.325

[4] 189 and 2224 buses http://www.nationalgrid.com/uk/Electricity/Codes/gbsqsscode/workinggroups/intgeneration/

[5] F. Li, "Small test systems for power system economic studies", IEEE PES General Meeting

[6] A. Minot, "A Multilevel Solver for Security Constrained OPF"

Appendix

Source Code and Evaluation

Please refer to the Github README for a more in-depth guide to the repo: https://github.com/Srivatsan-Srinivasan/cs205-final-project/blob/master/README.md

We evaluated our code through a check on the residuals of the solution, computed via the Frobenius norm of $A x - b$, where $x$ is the solution proposed by a solver. The implementation can be found in the calculate_residuals function in the utils module of the solvers library (https://github.com/Srivatsan-Srinivasan/cs205-final-project/blob/master/solvers/utils.py), and examples of how to invoke our residual profiler can be found in the example scripts. For example, see the multilevel_check_residuals function (in https://github.com/Srivatsan-Srinivasan/cs205-final-project/blob/master/profile_multilevel_example.py) or the check_residuals functions implemented in the profiler utilities (in https://github.com/Srivatsan-Srinivasan/cs205-final-project/blob/master/profile_baseline_example.py).
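
For reference, a minimal sketch of this kind of residual check is shown below; the actual implementation lives in solvers/utils.py and may differ in details.

```python
import numpy as np
import scipy.sparse as sp

def residual_norm(A, x, b):
    """Return ||A x - b||_2 for a (sparse) system matrix A and a candidate solution x."""
    r = A @ x - b
    return np.linalg.norm(np.asarray(r).ravel())

# Tiny self-check on a trivially solvable system: the residual is zero up to floating point.
A = sp.eye(4, format="csr") * 2.0
b = np.arange(4, dtype=float)
x = b / 2.0
print(residual_norm(A, x, b))
```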

We noticed that our solvers were plagued by non-trivial residuals. Upon further investigation, we determined that the root cause seemed to be driven by system-specific details: a simple direct-solve implementation using purely the scipy library and an implementation of the multigrid solver purely in MATLAB both demonstrated similarly non-trivial residuals.

Software Design and How to Use Our Software

Our software implements several versions of Newton step solvers, which can be run at several different levels of parallelism. All versions of the implemented solvers expose a very similar interface, and their implementations can be found in the 'solvers' folder. Note that all the solvers (the baseline solver, the serial solver with optional parallelism at Block 1, and the parallel solver with optional parallelism at Blocks 1 and 3) implement a common set of functions. Namely, all solvers implement the following functions:

  1. load_data() Provides a unified interface for loading the necessary data sets for all solvers. It loads the data files needed to set up the problem for a given number of buses and contingencies in the power network.
  2. construct_newton() Provides a relatively consistent interface for constructing the Newton matrix for the Newton step solvers. The function takes in the loaded data sets and constructs the Newton matrix. For the solvers that implement Block 1 parallelism, the function optionally takes parameters that define the degree of parallelism (e.g. number of workers for PyMP).
  3. solver() This function is the main workhorse and actually implements the logic of each solver. For example, the scipy solver simply calls the scipy least squares function, while the fully parallelized solver implements the multigrid approach with full parallelization (at Blocks 1, 2, 3). Multilevel parallel solvers that descend to lower levels additionally implement the following function:

    • descend_level() This function implements the logic to descend through block 2 in a distributed fashion and is present only in solvers that support parallelism at block 2.

All available implementations of the solvers are packed into the 'solvers' folder. Common functionality has been abstracted away into two files in this folder:

  • utils.py Generic helper functions for the Newton step problem, such as data loading, are implemented here. The most noteworthy function implemented here is calculate_residuals().
  • solver_utils.py This file implements the vast majority of the underlying logic for the solvers. For example, you'll find the actual implementations of the multilevel solvers here (see multigrid(), multigrid_parallel(), multigrid_full_parallel()).

Additionally, in the parent folder, you'll find the scripts needed to easily run these solvers and to profile them. Notably, for each solver implemented in the 'solvers' folder, you'll find two scripts: one to run the solver and another to profile it. For example, for the fully parallelized multigrid solver (solver/solver_multilevel.py), you can refer to the simple driver script that just runs the solver on some power network (multilevel_example.py) or the driver script that runs the available profilers for the solver (profile_multilevel_example.py). The driver scripts and profiler driver scripts are listed below.

  1. solver/solver_multilevel.py

     1. multilevel_example.py
     2. profile_multilevel_example.py
  2. solver/solver_multilevel_serial.py

     1. serial_multilevel_example.py
     2. profile_serial_multilevel_example.py
  3. solver/solver_baseline.py
     1. baseline_example.py
     2. profile_baseline_example.py

Again, the common profiler utilities have been moved into profiler library file profile_utils.py.

Specifically, for the fully parallelized multilevel solver running with a bus size of 2224 and 6 contingencies with parallelization at Blocks 1, 2 and 3: to just run the solver you would do python multilevel_example.py, and to run different profilers on the script you would edit the profiler driver script and run python profile_multilevel_example.py. For more details, please refer to the docstrings in the code and especially the repo's README. A hypothetical sketch of how the common interface fits together is shown below.
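
For orientation, here is a hypothetical usage sketch of the common interface described above. The function names come from this document, but the import path, keyword arguments and return values are illustrative assumptions only; consult the repo's README and docstrings for the real signatures.

```python
# Hypothetical driver mirroring the interface described above; the keyword
# names (n_buses, n_contingencies, num_workers) and return values are
# illustrative assumptions, not the actual signatures.
from solvers import solver_multilevel as sm

data = sm.load_data(n_buses=2224, n_contingencies=6)   # load the base case and contingency inputs
A, rhs = sm.construct_newton(data, num_workers=6)      # Block 1: assemble the Newton system (PyMP)
solution = sm.solver(A, rhs)                           # Blocks 2 and 3: distributed solve under mpi4py
```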

Technical Details about the Compute Infrastructure

We used the Harvard Odyssey Cluster as our sole compute infrastructure. All experiments and results reported herein were executed on this cluster. The hardware specs are listed below; note that these are the specs available on each node, and you can refer to the Github README for full details.

  1. Model: Harvard Odyssey Cluster
  2. Processors: Intel (R) Xeon (R) CPU E5-2683 v4 @ 2.10Ghz, 32 cores with 1 thread per core
  3. Cache Memory: L1d cache 32K, L1i cache 32K, L2 cache 256K, L3 cache 40960K

Again, please refer to the README for the specific resources that were requested for the experiments. Operating system information can be found below:

  1. Operating System: CentOS Linux Release 7.4.1708 Codename: Core
  2. Kernel Version: Linux Version 3.10.0-693.21.1.el7.x86_64
  3. Operating System Architecture/Machine Architecture/Processor Architecture: x86_64

For compilers and libraries see README for setup instructions.
