1 Introduction

Ordinary differential equation (ODE) models are widely used to characterize the dynamics of complex systems in applications such as infectious disease (Chen and Wu 2008; Liang et al. 2010), gene regulatory networks (Cao and Zhao 2008; Lu et al. 2011) and brain connectivity (Zhang et al. 2015; Wang et al. 2021). A general system of ODEs consisting of p dynamic processes is given by

$$\begin{aligned} \varvec{\theta }^{'}(t{;}\varvec{\gamma })=F(\varvec{\theta }(t{;}\varvec{\gamma }),\varvec{\gamma }), \end{aligned}$$
(1.1)

where \(\varvec{\theta }(t{;}\varvec{\gamma })=(\theta _1(t{;}\varvec{\gamma }),\dots ,\theta _p(t{;}\varvec{\gamma }))^\top \), F is usually a set of unknown functions describing the dependence between \(\varvec{\theta }(t{;}\varvec{\gamma })\) and its derivative \(\varvec{\theta }^{'}(t{;}\varvec{\gamma })\), and \(\varvec{\gamma }\) contains all the unknown parameters defining the system. Dynamic models in the form of ordinary differential equations (1.1) relate the rates of change of the processes \(\varvec{\theta }(t{;}\varvec{\gamma })\) to the processes themselves. For instance, a gene regulatory network model uses the link function F to quantify the regulatory effects of regulator genes on the expression change of a target gene (Wu et al. 2014). Neuronal state equations in a dynamic causal model (Friston et al. 2003) describe how instantaneous changes of the neuronal activities of system components are modulated jointly by the immediate states of other components.

ODE systems can be constructed from domain knowledge, such as Newton's laws of motion in physics and the Lotka-Volterra equations for predator-prey systems in biology. Parameters of an ODE system have scientific interpretations. However, the values of these parameters are often unknown, and there is a strong need to estimate them from noisy observations measured from the dynamic system. Let \({\mathcal {T}}\subset {\mathbb {R}}\) be a compact interval and consider a common set of discrete time points \(t_1,\dots , t_n\in {\mathcal {T}}\) for notational brevity. Denote by \(y_{ij}\) the observation corresponding to the jth latent process \(\theta _j(t)\) at time \(t=t_i\), \(j=1,\dots ,p\). To facilitate flexible modeling, we assume the observations conditional on the latent processes follow an exponential family distribution, i.e., the conditional density function \(f(y_{ij} {\mid }\, \theta _j(t_i))\) is given by

$$\begin{aligned} \exp \left\{ \frac{y_{ij}\theta _j(t_i)-b(\theta _j(t_i))}{a(\phi )} + c(y_{ij},\phi )\right\} , \end{aligned}$$

where \(a>0, b, c\) are known functions, and \(\phi \) is either known or treated as a nuisance parameter. The exponential family covers many continuous and discrete probability distributions, including Gaussian, Bernoulli, binomial, Poisson and negative binomial. See Wood (2017) for a comprehensive review.
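To make the data fidelity concrete, the following minimal sketch (in Python; not from any referenced implementation) evaluates the negative log-likelihood kernel \(-n^{-1}\sum _{i}\{y_{ij}\theta _j(t_i)-b(\theta _j(t_i))\}\) for the three families used later in this paper; the function name and interface are our own illustrative choices.

```python
# A minimal sketch of the exponential-family data-fidelity term
# -(1/n) * sum_i { y_i * theta_i - b(theta_i) }, where b is the
# cumulant function of the assumed family.
import numpy as np

def neg_loglik(y, theta, family="gaussian"):
    """Average negative log-likelihood kernel for one latent process.

    y, theta : arrays of length n (observations and latent values at t_i).
    The dispersion a(phi) and the term c(y, phi) are dropped: they do not
    depend on theta and hence do not affect the estimation of theta.
    """
    if family == "gaussian":      # b(theta) = theta^2 / 2
        b = 0.5 * theta**2
    elif family == "poisson":     # b(theta) = exp(theta), theta = log-intensity
        b = np.exp(theta)
    elif family == "bernoulli":   # b(theta) = log(1 + exp(theta)), theta = logit
        b = np.logaddexp(0.0, theta)
    else:
        raise ValueError(family)
    return -np.mean(y * theta - b)
```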

Parameter estimation for ODEs from noisy observations, also known as system identification, remains a challenging task in the statistical literature. Many methods have been developed, including the nonlinear least squares (Biegler et al. 1986), the two-stage collocation methods (Varah 1982; Liang and Wu 2008; Lu et al. 2011; Brunel et al. 2014; Wu et al. 2014; Dattner and Klaassen 2015; Brunton et al. 2016; Chen et al. 2017; Dai and Li 2021), the parameter cascading methods (Ramsay et al. 2007; Cao and Ramsay 2007; Qi and Zhao 2010; Nanshan et al. 2022), and Bayesian methods (Huang et al. 2006; Zhang et al. 2017). Based on their different treatments of the differential equations, we summarize them into three categories. The first approach requires the latent processes to follow the ODEs exactly. It searches for the optimal ODE parameters by evaluating the fit between the data and the numerical solutions to the ODEs. Chen et al. (2017) term this the gold standard approach. Since the numerical solution to the differential equations is sensitive to the values of the ODE parameters, this approach suffers from numerical instability and intensive computation. The second approach is the two-stage collocation suggested by Varah (1982). The latent processes and their derivatives are estimated via data smoothing procedures at the first stage, followed by treating the estimated derivative as a response variable and minimizing a least squares measure based on the differential equations with respect to the ODE parameters. In contrast to the gold standard approach, in which the latent processes must comply with the ODEs exactly, the two-stage collocation uses the differential equations only as a discrepancy measure. Its performance relies heavily on a satisfactory estimate of the derivatives, which can be challenging in nonparametric smoothing. Recently, Dattner and Klaassen (2015), Chen et al. (2017) and Dai and Li (2021) considered an integrated form of the ODEs to address this issue. The third approach is generalized profiling or parameter cascading (Ramsay et al. 2007; Cao and Ramsay 2007; Nanshan et al. 2022). It provides a nested optimization procedure to iteratively update the estimates of the latent processes regularized by their adherence to the differential equations. It has been shown to provide more efficient parameter estimation with theoretical guarantees (Qi and Zhao 2010).

Motivated by large-scale time-course data analysis, many efforts have been devoted to high-dimensional ODEs under various model assumptions. For example, Lu et al. (2011) considered high-dimensional linear ODEs to identify dynamic gene regulatory networks, while Zhang et al. (2015) studied a similar parametric model but allowed for two-way interactions. Brunton et al. (2016) proposed SINDy to identify nonlinear dynamical systems from a rich library of basis functions, which allows for more general interactions. Henderson and Michailidis (2014), Wu et al. (2014) and Chen et al. (2017) relaxed the parametric assumption by studying sparse additive ODEs. Dai and Li (2021) developed the kernel ODEs with functional ANOVA structure, which allows for pairwise interactions in a nonparametric fashion and gains greater flexibility. Regarding parameter estimation, all the above procedures for high-dimensional ODEs belong to the two-stage collocation category. Its popularity in parameter estimation for ODEs stems from several features. First, the two-stage methods have the so-called decoupled property (Liang and Wu 2008; Lu et al. 2011) that the estimation procedure for the p-dimensional ODE system can be decomposed into p independent one-dimensional problems. The number of ODE parameters to be estimated grows quadratically with the number of differential equations, since each of the p equations involves up to p additive components. In particular, in an additive ODE model, the nonparametric components need to be approximated with basis expansions, which leads to even more model parameters. The decoupled property effectively mitigates the computational burden in the high-dimensional setting. Second, the two-stage collocation methods conveniently incorporate sparsity-inducing penalties when we expect a sparse ODE structure for better model interpretation (Wu et al. 2014; Chen et al. 2017). Although the generalized profiling approach provides efficient ODE parameter estimation by improved latent process smoothing, its extension to high-dimensional ODEs is unclear, partially because of its nested optimization procedures. In summary, it is a distinctive challenge to simultaneously balance latent process smoothing, ODE parameter estimation, and variable selection for high-dimensional ODE models.

In this article, we propose a novel joint estimation approach for generalized sparse additive ODEs (JADE). To simultaneously accomplish the aforementioned tasks, we formulate an objective function consisting of a data fidelity associating the observed data with the latent processes, an ODE fidelity measuring the adherence of the latent processes to the differential equations, and a sparsity regularization controlling the model complexity. As a joint estimation strategy, JADE uses the ODE fidelity to regularize the estimation of latent processes, like the generalized profiling approach (Ramsay et al. 2007), and thus provides more accurate estimation of both the processes and their derivatives. For recovering the sparse structure of the ODE system, JADE approximates the additive functional components with finite basis expansions and naturally integrates group-wise sparse penalties to achieve the selection of individual functional components. Furthermore, under the regularized estimation framework, the objective function of JADE is non-convex and non-differentiable. To tackle this computational challenge, we design a block coordinate descent algorithm to solve the non-convex optimization problem with proper parameter tuning. By comparing with other existing methods via extensive numerical studies, we show that JADE not only provides more stable and accurate latent process smoothing and ODE parameter estimation, but also has good sparse structure recovery performance.

Our main contributions are summarized as follows. First, we allow the noisy observed data to be non-Gaussian, which covers a variety of continuous and discrete distributions of practical interest. Most existing literature on ODEs is based on the nonlinear least squares criterion (Miao et al. 2014), while the extension to a likelihood-based criterion requires more involved computation. Second, we present a unified estimation framework under which JADE is a joint approach, whereas the two-stage collocation methods and the generalized profiling approach can be understood as conditional and profile estimation strategies, respectively. It also justifies the central role played by the ODE fidelity component in bridging the latent processes with the differential equations. Third, due to the non-convex nature of our objective function, the classical iteratively re-weighted least squares algorithm (Wood 2017) for generalized linear modeling performs poorly. Instead, we adopt a gradient-descent method for optimization and achieve great generalization performance. Fourth, we prove the convergence of our block coordinate descent algorithm to a stationary point of the objective function. The success of our algorithm hinges on the block-separable structure of the problem and an inexact line search instead of an exact Hessian calculation. Not only are the gradient-based iterations cheaper, but they also admit stronger global convergence guarantees.

The rest of this paper is organized as follows. We review existing methods and develop our JADE in Sect. 2. Detailed computational procedure and convergence analysis are presented in Sect. 3. We investigate the numerical performance in Sect. 4 and illustrate with data examples in Sect. 5. Section 6 provides some concluding remarks.

2 Joint estimation for sparse additive ODEs

In this section, we first briefly review existing approaches to ODE parameter estimation and then present a unified estimation framework for latent processes, ODE parameters and sparse ODE structure, under which our method is developed from the joint estimation perspective. It is thus dubbed joint estimation for generalized sparse additive ordinary differential equations (JADE).

Under the additive assumption, the ODE system (1.1) takes the form

$$\begin{aligned} \theta ^{'}_j(t) = \gamma _{j0} + \sum _{k=1}^pf_{jk}(\theta _k(t)),\ j= 1,\ldots ,p. \end{aligned}$$
(2.1)

The additive components \(f_{jk}\)’s are usually approximated with basis expansions, for example, a B-spline basis for numerical stability, such that

$$\begin{aligned} f_{jk}(\theta )= \varvec{\gamma }_{jk}^\top \varvec{\phi }(\theta )+\delta _{jk}(\theta ), \end{aligned}$$
(2.2)

where \(\theta \) denotes the value of a generic latent variable, \(\varvec{\phi }(\theta )=(\phi _1(\theta ),\dots ,\phi _L(\theta ))^\top \) is the vector of basis functions, \(\varvec{\gamma }_{jk}\in {\mathbb {R}}^L\) is the vector of basis coefficients, and \(\delta _{jk}\) is the residual function. Therefore, the additive ODE system (2.1) can be written as

$$\begin{aligned} \begin{aligned} \theta ^{'}_j(t)&=\gamma _{j0} + \sum _{k=1}^p \varvec{\gamma }_{jk}^\top \varvec{\phi }(\theta _k(t))\\&\quad +\sum _{k=1}^p \delta _{jk}(\theta _k(t)),\ j= 1,\ldots ,p. \end{aligned} \end{aligned}$$
(2.3)

Denote by \(\varvec{\gamma }_j=(\gamma _{j0}, \varvec{\gamma }_{j1},\dots ,\varvec{\gamma }_{jp})\) the ODE parameters in the jth differential equation and let \(\varvec{\gamma }=(\varvec{\gamma }_1,\dots ,\varvec{\gamma }_p)\) collect ODE parameters from all differential equations.
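As an illustration of the expansion (2.2), the sketch below builds a cubic B-spline basis and evaluates one additive component \(f_{jk}(\theta )\approx \varvec{\gamma }_{jk}^\top \varvec{\phi }(\theta )\). It assumes scipy (version 1.8 or later for BSpline.design_matrix); the knot placement and coefficients are illustrative only.

```python
# Sketch of the basis expansion (2.2): an additive component f_jk is
# approximated by a linear combination of L B-spline basis functions.
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(theta, knots, degree=3):
    """Evaluate the L B-spline basis functions phi_1..phi_L at values theta."""
    return BSpline.design_matrix(theta, knots, degree).toarray()

# Cubic B-splines on [0, 1] with four interior knots:
# L = n_interior + degree + 1 = 4 + 3 + 1 = 8 basis functions.
interior = np.linspace(0.0, 1.0, 6)[1:-1]
knots = np.concatenate((np.zeros(4), interior, np.ones(4)))
theta_vals = np.linspace(0.0, 1.0, 50)
Phi = bspline_design(theta_vals, knots)          # shape (50, 8)
gamma_jk = np.random.default_rng(0).normal(size=Phi.shape[1])
f_jk = Phi @ gamma_jk                            # f_jk(theta) = gamma_jk^T phi(theta)
```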

2.1 Two-stage collocation methods

Collocation with spline bases was first proposed by Varah (1982) for dynamic data fitting problems and has been successfully extended to parameter estimation and network reconstruction for high-dimensional ODE models (Liang and Wu 2008; Lu et al. 2011; Brunel et al. 2014; Henderson and Michailidis 2014; Wu et al. 2014; Dattner and Klaassen 2015; Chen et al. 2017; Dai and Li 2021). It is a two-stage procedure in which the latent processes and their derivatives are first fitted by smoothing methods, followed by estimating the ODE parameters given the smoothed estimates. In particular, it solves the following optimization problems, for \(j=1,\dots ,p\),

$$\begin{aligned} {{\widehat{\varvec{\gamma }}}}_j&= \mathop {{\mathrm{arg\,min}}}\limits _{\begin{array}{c} \gamma _{j0}\in {\mathbb {R}},\\ \varvec{\gamma }_{jk}\in {\mathbb {R}}^L \end{array}}\, \int _{\mathcal {T}} \bigg \{\frac{\mathrm {d}{{\widehat{\theta }}}_j(t)}{\mathrm {d}t}-\gamma _{j0}-\sum _{k=1}^p \varvec{\gamma }_{jk}^\top \varvec{\phi }({{\widehat{\theta }}}_k(t))\bigg \}^2 \,\mathrm {d}t\\&\quad + \sum _{k=1}^p {\text {pen}}_{\lambda _\gamma }(\Vert \varvec{\gamma }_{jk}\Vert _2), \end{aligned}$$

with

$$\begin{aligned} {{\widehat{\theta }}}_j(t) = \mathop {{\mathrm{arg\,min}}}\limits _{\theta \in {\mathcal {H}}} -\frac{1}{n}\sum _{i=1}^n\{y_{ij}\theta (t_i)-b(\theta (t_i))\}, \end{aligned}$$

where \({\text {pen}}_{\lambda _\gamma }(\cdot )\) is a penalty function inducing sparse structure for the ODE system, \({\mathcal {H}}\) is a closed ball in a reproducing kernel Hilbert space, and \({{\widehat{\theta }}}_j\) is known as the exponential family smoothing spline estimator (Wahba et al. 1995; Gu 2013).

More involved two-stage collocation methods have been proposed to improve performance. Wu et al. (2014) developed a five-step variable selection procedure for the sparse additive ODE model (2.1), which refines the two-stage collocation approach with the standard group lasso and adaptive group lasso regularizations. However, considerable care is required in the smoothing step to ensure a satisfactory estimate of the latent process derivatives. Dattner and Klaassen (2015) and Chen et al. (2017) further proposed a more robust ODE parameter estimation strategy for the second stage. Specifically, they used an integrated form of the differential equations to build up an objective function and successfully avoided estimating the derivatives. The two-stage collocation has an attractive decoupled property. Once the latent processes are estimated and fixed, the second-stage procedure can be performed for each differential equation separately (Varah 1982; Wu et al. 2014; Chen et al. 2017; Dai and Li 2021). Computational efficiency together with theoretical guarantees leads to the prevalence of the two-stage collocation methods for parameter estimation and variable selection in high-dimensional ODE models. A schematic sketch of the two-stage idea is given below.
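The following sketch illustrates the two-stage idea for Gaussian observations. The helper design, mapping latent values to stacked basis evaluations, is a hypothetical ingredient of ours, and the group-sparse penalty of the second stage is replaced by plain least squares for brevity.

```python
# Schematic sketch of two-stage collocation: stage 1 smooths each process
# (and its derivative) from data; stage 2 regresses the smoothed derivative
# on basis-expanded smoothed processes, one equation at a time (decoupled).
import numpy as np
from scipy.interpolate import UnivariateSpline

def two_stage(t, Y, design):
    """Y: (n, p) Gaussian observations; design: hypothetical helper mapping
    an (n, p) matrix of latent values to the (n, p*L) stacked basis matrix."""
    n, p = Y.shape
    theta_hat = np.empty((n, p))
    dtheta_hat = np.empty((n, p))
    for j in range(p):                          # stage 1: smoothing splines
        s = UnivariateSpline(t, Y[:, j], k=3)
        theta_hat[:, j] = s(t)
        dtheta_hat[:, j] = s.derivative()(t)
    X = np.column_stack([np.ones(n), design(theta_hat)])
    # stage 2: decoupled per equation; a group-sparse penalty would be
    # added here, plain least squares is used for brevity
    gammas = [np.linalg.lstsq(X, dtheta_hat[:, j], rcond=None)[0]
              for j in range(p)]
    return theta_hat, gammas
```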

2.2 Generalized profiling procedure

Ramsay et al. (2007) proposed the generalized profiling approach for parameter estimation of ODEs. It is based on a generalized data smoothing strategy along with a generalization of profiled estimation. The ODE parameters \(\{\varvec{\gamma }_j\}_{j=1}^p\) are of primary interest, while the latent processes \(\{\theta _j(t)\}_{j=1}^p\) are treated as nuisance parameters. In contrast to the two-stage collocation methods, the generalized profiling approach involves a nested optimization procedure. During the inner optimization of the profiling procedure, an intermediate estimate of the latent processes \({{\widehat{\varvec{\theta }}}}(t{;}\varvec{\gamma })=({{\widehat{\theta }}}_1(t{;}\varvec{\gamma }),\dots ,{{\widehat{\theta }}}_p(t{;}\varvec{\gamma }))\) is obtained by minimizing a combination of data and ODE fidelity criteria

$$\begin{aligned}&-\frac{1}{n}\sum _{j=1}^p\sum _{i=1}^n \left\{ y_{ij}{\theta _j(t_i)}-b(\theta _j(t_i))\right\} \nonumber \\&\quad + \lambda _\theta \sum _{j=1}^p\int _{\mathcal {T}} \bigg \{\frac{\mathrm {d}\theta _j(t)}{\mathrm {d}t}-\gamma _{j0}-\sum _{k=1}^p \varvec{\gamma }_{jk}^\top \varvec{\phi }(\theta _k(t))\bigg \}^2 \mathrm {d}t, \end{aligned}$$
(2.4)

where a larger value of the tuning parameter \(\lambda _\theta >0\) encourages stronger adherence of the latent processes to the differential equation system. A data fitting criterion, such as the log-likelihood, is then optimized with respect to the ODE parameters \(\varvec{\gamma }\) by holding \({{\widehat{\varvec{\theta }}}}(t{;}\varvec{\gamma })\) as a function of \(\varvec{\gamma }\), that is,

$$\begin{aligned} {{\widehat{\varvec{\gamma }}}} = \mathop {{\mathrm{arg\,min}}}\limits _{\varvec{\gamma }} -\frac{1}{n}\sum _{j=1}^p\sum _{i=1}^n \big \{y_{ij}{{{\widehat{\theta }}}_j(t_i{;}\varvec{\gamma })}-b({{\widehat{\theta }}}_j(t_i{;}\varvec{\gamma }))\big \}. \end{aligned}$$

The generalized profiling approach proceeds by solving the inner and outer optimizations iteratively with a non-decreasing sequence of \(\lambda _\theta \). Although the generalized profiling approach provides a more efficient estimation strategy for small-scale ODEs, its extension to high-dimensional ODEs remains challenging. Unlike the two-stage collocation methods, the optimization procedure of the generalized profiling cannot be decoupled for each latent process or differential equation. It is computationally demanding to handle a large number of primary and nuisance parameters. Moreover, when a sparsity-inducing penalty is introduced in the outer optimization to control the model complexity, the objective function becomes non-convex and non-differentiable, so the Gauss-Newton method used by Ramsay et al. (2007) is no longer applicable.

2.3 Joint estimation approach

To facilitate latent process and ODE parameter estimation together with sparse structure identification for sparse additive ODEs, we formulate a unified estimation framework and build our JADE procedure. Combining the ingredients of the two-stage collocation and generalized profiling methods, we define an objective function consisting of three components: a data fidelity associating the observations with the latent processes, an ODE fidelity measuring the adherence of latent processes to the differential equations, and a sparsity regularization controlling the model complexity. Specifically, JADE aims to simultaneously estimate the latent processes \(\varvec{\theta }(t)\) and ODE parameters \(\varvec{\gamma }\) by minimizing

$$\begin{aligned} \begin{aligned}&Q(\varvec{\theta }(t),\varvec{\gamma })\\&\quad =-\frac{1}{n}\sum _{j=1}^p\sum _{i=1}^n \left\{ y_{ij}{\theta _j(t_i)}-b(\theta _j(t_i))\right\} \\&\qquad + \lambda _\theta \sum _{j=1}^p\int _{\mathcal {T}} \bigg \{\frac{\mathrm {d}\theta _j(t)}{\mathrm {d}t}-\gamma _{j0}-\sum _{k=1}^p \varvec{\gamma }_{jk}^\top \, \varvec{\phi }(\theta _k(t))\bigg \}^2 \mathrm {d}t \\&\qquad + \sum _{j=1}^p\sum _{k=1}^p {\text {pen}}_{\lambda _\gamma }(\Vert \varvec{\gamma }_{jk}\Vert _2), \end{aligned} \end{aligned}$$
(2.5)

where \(\lambda _\theta , \lambda _\gamma >0\) are tuning parameters balancing the ODE fidelity and model complexity, respectively. The penalty function \({\text {pen}}_{\lambda _\gamma }(\cdot )\), for example, the group lasso (Yuan and Lin 2006) or its variants, introduces group-wise sparsity by forcing all elements in \(\varvec{\gamma }_{jk}\) to be simultaneously zero or nonzero, and thus allows for identifying significant additive components in the ODE system.

Our JADE procedure views both the latent processes and the ODE parameters as of primary interest. Observe that the latent processes \(\varvec{\theta }(t)\) appear in the first two components in (2.5), while the ODE parameters \(\varvec{\gamma }\) are associated with the last two components. This indicates that the ODE fidelity plays a central role in bridging the latent processes with the ODE structure. When the latent processes are approximated with basis expansions, the basis coefficients and ODE parameters naturally fit into a block optimization architecture. A closer look at \(Q(\varvec{\theta }(t),\varvec{\gamma })\) in (2.5) reveals more insight. On the one hand, for the latent processes, or equivalently their basis coefficients, the likelihood term in Q is convex, but the ODE fidelity is non-convex due to the use of the basis \(\varvec{\phi }\). On the other hand, the expression of the differential equations in the second component of (2.5) is linear in the \(\varvec{\gamma }_j\)’s. Thus, optimizing Q with respect to the ODE parameters is essentially solving group-regularized least squares. Computational details of JADE are presented in Sect. 3. A sketch of evaluating (2.5) numerically is given below.
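The sketch below evaluates Q in (2.5) on a quadrature grid, assuming Gaussian data so that \(b(\theta )=\theta ^2/2\) and a plain group lasso penalty; the interfaces (theta, dtheta and phi as callables) are illustrative assumptions rather than the actual implementation.

```python
# Sketch of the JADE objective (2.5): data fidelity + ODE fidelity + penalty.
import numpy as np

def jade_objective(t_obs, Y, theta, dtheta, t_grid, gamma0, Gamma, phi,
                   lam_theta, lam_gamma):
    """theta(t), dtheta(t): (len(t), p) latent values and derivatives;
    Gamma[j][k]: length-L coefficient blocks; phi: values -> (len, L) basis."""
    n, p = Y.shape
    Th = theta(t_obs)
    data_fit = -(Y * Th - 0.5 * Th**2).sum() / n         # Gaussian: b = theta^2/2
    Tg, dTg = theta(t_grid), dtheta(t_grid)
    ode_fit = penalty = 0.0
    for j in range(p):
        resid = dTg[:, j] - gamma0[j]
        for k in range(p):
            resid = resid - phi(Tg[:, k]) @ Gamma[j][k]  # gamma_jk^T phi(theta_k)
            penalty += lam_gamma * np.linalg.norm(Gamma[j][k])
        ode_fit += np.trapz(resid**2, t_grid)            # integral over T
    return data_fit + lam_theta * ode_fit + penalty
```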

We conclude this section by comparing JADE with other estimation strategies under the unified estimation framework. The two-stage collocation methods first obtain \({{\widehat{\varvec{\theta }}}}(t)\) from the data fidelity component and then minimize the latter two components in (2.5) for an estimate of \(\varvec{\gamma }\) by conditioning on \({{\widehat{\varvec{\theta }}}}(t)\). The estimated latent processes are based only on the observations and are not required to satisfy the differential equations. On the other hand, the generalized profiling approach does not consider sparsity regularization. It treats \(\varvec{\theta }\) as a nuisance parameter, where the inner optimization deduces the dependence of \({{\widehat{\varvec{\theta }}}}(t{;} \varvec{\gamma })\) on \(\varvec{\gamma }\) by minimizing the first two components in (2.5), followed by profiling out \({{\widehat{\varvec{\theta }}}}(t{;} \varvec{\gamma })\) and optimizing the data fidelity with respect to \(\varvec{\gamma }\) iteratively. In summary, from the estimation perspective, the two-stage collocation methods follow a conditional strategy, and the generalized profiling approach is in a profiling fashion.

3 Block coordinate descent algorithm

We develop a block coordinate descent algorithm for our JADE procedure and establish its global convergence. Like collocation-based methods, we use a finite basis to approximate the jth latent process with \({{\widetilde{\theta }}}_j(t) = {\mathbf {c}}_j^\top \varvec{\psi }(t)\), where \({\mathbf {c}}_j \in {\mathbb {R}}^M\), \(\varvec{\psi }(t)=(\psi _1(t),\ldots ,\psi _M(t))^\top \) and M is the number of basis functions. We assume M to be the same for \(j=1,\dots ,p\) without loss of generality. Now the objective function of JADE in (2.5) can be expressed as a function of the basis coefficients of the latent processes and the ODE additive components, denoted by \({\mathbf {c}}=({\mathbf {c}}_1,\dots ,{\mathbf {c}}_p)\) and \(\varvec{\gamma }=(\varvec{\gamma }_1,\dots ,\varvec{\gamma }_p)\), respectively. The parameters to be optimized naturally form 2p blocks, which prompts us to consider a block coordinate descent algorithm.

With some abuse of notation, we use \(Q({\mathbf {c}}, \varvec{\gamma })\) to denote the quantity in (2.5) with \(\varvec{\theta }(t)\) replaced by their basis expansions. We will not distinguish them in the following discussion since the definition is clear from context. According to (2.5), we write

$$\begin{aligned} Q({\mathbf {c}},\varvec{\gamma })=S({\mathbf {c}},\varvec{\gamma })+R(\varvec{\gamma }), \end{aligned}$$

where

$$\begin{aligned}&S({\mathbf {c}},\varvec{\gamma })=-\frac{1}{n}\sum _{j=1}^p\sum _{i=1}^n \left\{ y_{ij}{{\mathbf {c}}_j^\top \varvec{\psi }(t_i)}-b({\mathbf {c}}_j^\top \varvec{\psi }(t_i))\right\} \\&\quad + \lambda _\theta \sum _{j=1}^p\int _{\mathcal {T}} \bigg \{{\mathbf {c}}_j^\top \frac{\mathrm {d}\varvec{\psi }(t)}{\mathrm {d}t}-\gamma _{j0}-\sum _{k=1}^p \varvec{\gamma }_{jk}^\top \, \varvec{\phi }({\mathbf {c}}_k^\top \varvec{\psi }(t))\bigg \}^2\mathrm {d}t \end{aligned}$$

adds up the likelihood and the ODE fidelity terms, and \(R(\varvec{\gamma })=\sum _{j=1}^p\sum _{k=1}^p {\text {pen}}_{\lambda _\gamma }(\Vert \varvec{\gamma }_{jk}\Vert _2)\) regularizes the sparsity of \(\varvec{\gamma }\). We have two key observations for the above decomposition. First, \(S({\mathbf {c}},\varvec{\gamma })\) is a continuously differentiable function of all blocks of parameters. In particular, it is a block-separable quadratic function of the \(\varvec{\gamma }_j\)’s, but is non-convex in the \({\mathbf {c}}_j\)’s due to the additive structure of the ODEs. Second, \(R(\varvec{\gamma })\) encourages group-wise sparsity for \(\varvec{\gamma }\) and thus has a block-separable structure, i.e., \(R(\varvec{\gamma }) = \sum _{j=1}^p R_j(\varvec{\gamma }_j)\).

Sects. 3.1 and 3.2 describe the updating rules for the latent processes and the additive components in the ODE system, respectively. We then address identifiability and parameter tuning issues and analyze the global convergence of the proposed method.

3.1 Update the latent processes

Recall that for \(j=1,\dots ,p\), the jth latent process \(\theta _j(t)\) is approximated by basis expansion \({\mathbf {c}}_j^\top \varvec{\psi }(t)\), and the JADE objective function \(Q({\mathbf {c}},\varvec{\gamma })\) depends on \({\mathbf {c}}_j\) only through \(S({\mathbf {c}},\varvec{\gamma })\). Let \(\nabla _{{\mathbf {c}}_j} S({\mathbf {c}}, \varvec{\gamma })\in {\mathbb {R}}^M\) be the gradient of \(S({\mathbf {c}},\varvec{\gamma })\) with respect to \({\mathbf {c}}_j\). We use it to build a quadratic approximation to \(S({\mathbf {c}}, \varvec{\gamma })\) with respect to \({\mathbf {c}}_j\) and generate an improving direction. More precisely, we choose a positive definite matrix \({\mathbf {H}}_{{\mathbf {c}}_j} \in {\mathbb {R}}^{M\times M}\) and move \({\mathbf {c}}_j\) along the direction \({\mathbf {d}}_{{\mathbf {c}}_j}({\mathbf {c}},\varvec{\gamma })\), where

$$\begin{aligned} \begin{aligned} {\mathbf {d}}_{{\mathbf {c}}_j}({\mathbf {c}},\varvec{\gamma })&=\mathop {{\mathrm{arg\,min}}}\limits _{{\mathbf {d}}\in {\mathbb {R}}^M} \big \{ (\nabla _{{\mathbf {c}}_j} S({\mathbf {c}}, \varvec{\gamma }))^\top {\mathbf {d}}+ \frac{1}{2} {\mathbf {d}}^\top {\mathbf {H}}_{{\mathbf {c}}_j} {\mathbf {d}}\big \}\\&= -{\mathbf {H}}_{{\mathbf {c}}_j}^{-1}\,(\nabla _{{\mathbf {c}}_j} S({\mathbf {c}}, \varvec{\gamma })). \end{aligned} \end{aligned}$$
(3.1)

The first equality above implies that \({\mathbf {d}}_{{\mathbf {c}}_j}({\mathbf {c}},\varvec{\gamma })\) minimizes the quadratic perturbation of \(S({\mathbf {c}},\varvec{\gamma })\) caused by the change of \({\mathbf {c}}_j\). Because \(S({\mathbf {c}},\varvec{\gamma })\) is continuously differentiable in \({\mathbf {c}}_j\), a typical choice of \({\mathbf {H}}_{{\mathbf {c}}_j}\) is the Hessian matrix \(\nabla ^2_{{\mathbf {c}}_j} S({\mathbf {c}}, \varvec{\gamma })\). In this case, the JADE algorithm is equivalent to a Newton-type strategy to minimize \(S({\mathbf {c}}, \varvec{\gamma })\). However, it can be expensive to compute all Hessian matrices for the blocks \({\mathbf {c}}_j\)’s, and the Hessian matrices are not guaranteed to be positive definite because \(S({\mathbf {c}}, \varvec{\gamma })\) is non-convex in \({\mathbf {c}}\). Based on the above considerations, we follow Tseng and Yun (2009) and choose a diagonal Hessian approximation \({\mathbf {H}}_{{\mathbf {c}}_j}\), whose diagonal elements are the same as those of \(\nabla ^2_{{\mathbf {c}}_j} S({\mathbf {c}}, \varvec{\gamma })\) thresholded within a positive interval \([{{\underline{\mu }}},{{\overline{\mu }}}]\) where \(0<{{\underline{\mu }}}\le {{\overline{\mu }}}\). Such a choice satisfies the positive definiteness requirement and allows for cheaper iterations than the exact Hessian matrices.

Once the descent direction \({\mathbf {d}}_{{\mathbf {c}}_j}({\mathbf {c}},\varvec{\gamma })\) is obtained, we implement a line search with the step size \(\alpha _j>0\) selected by the Armijo rule (Burke 1985; Nocedal and Wright 2006; Fletcher 2013); see details in Appendix A. Finally, the basis coefficient of the jth latent process is updated by \({\mathbf {c}}_j + \alpha _j {\mathbf {d}}_{{\mathbf {c}}_j}({\mathbf {c}},\varvec{\gamma })\). The sketch below illustrates one such update.
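In the following minimal sketch, grad (the gradient \(\nabla _{{\mathbf {c}}_j} S\)), hess_diag (the diagonal of \(\nabla ^2_{{\mathbf {c}}_j} S\)) and S_of (the map \({\mathbf {c}}_j\mapsto S\) with the other blocks fixed) are assumed problem-specific callables, and the Armijo constants are illustrative defaults.

```python
# Sketch of one latent-process update: diagonal Hessian approximation with
# eigenvalue thresholding, descent direction (3.1), and Armijo backtracking.
import numpy as np

def update_c_j(c_j, grad, hess_diag, S_of, mu_lo=1e-4, mu_hi=1e4,
               alpha0=1.0, beta=0.5, sigma=1e-4, max_backtrack=30):
    H = np.clip(hess_diag, mu_lo, mu_hi)       # positive-definite diagonal H
    d = -grad / H                              # d = -H^{-1} grad, as in (3.1)
    S0, slope = S_of(c_j), grad @ d            # slope < 0 since H > 0
    alpha = alpha0
    for _ in range(max_backtrack):             # Armijo rule: sufficient decrease
        if S_of(c_j + alpha * d) <= S0 + sigma * alpha * slope:
            break
        alpha *= beta
    return c_j + alpha * d
```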

Before moving forward to update the ODE system, we remark that in (2.2) we use the basis \(\varvec{\phi }(\theta )\) to expand the additive components \(f_{jk}\)’s. It is common to assume \(\varvec{\phi }\) is defined on a compact support \({\mathcal {I}}\). However, when \({{\widetilde{\theta }}}_j(t) = {\mathbf {c}}_j^\top \varvec{\psi }(t)\) is updated for approximating the jth latent process, its range may exceed \({\mathcal {I}}\). To resolve this issue, we exploit a smooth and monotone transformation \(\sigma : {\mathbb {R}} \rightarrow {\mathcal {I}}\) to construct a composite basis \(\varvec{\phi }(\sigma ({{\widetilde{\theta }}}_j))\). Possible choices of \(\sigma (\theta )\) include the sigmoid function class, such as the logistic function \((1+e^{-\theta })^{-1}\), the hyperbolic tangent \(\tanh (\theta )\) and the arctangent function \(\arctan (\theta )\). For notational brevity, we still use \(\varvec{\phi }(\cdot )\) for the basis function composed with the transformation \(\sigma (\cdot )\) without causing confusion.

3.2 Update the additive components in the ODE system

Consider updating the ODE parameters in the jth differential equation \(\varvec{\gamma }_j=(\gamma _{j0},\varvec{\gamma }_{j1},\dots ,\varvec{\gamma }_{jp})\), where \(\gamma _{j0}\) is the constant intercept and \(\varvec{\gamma }_{jk}\) is the vector of basis coefficients when \(f_{jk}\) is expanded with the basis functions \(\varvec{\phi }\), \(k=1,\dots ,p\). Given the latent process estimates \(\{{{\widetilde{\theta }}}_j(t): 1\le j\le p\}\) and the ODE parameters in other differential equations, minimizing \(Q({\mathbf {c}},\varvec{\gamma })\) with respect to \(\varvec{\gamma }_j\) amounts to solving

$$\begin{aligned} \begin{aligned}&\mathop {{\mathrm{arg\,min}}}\limits _{\varvec{\gamma }_j\in {\mathbb {R}}^{pL+1}} \int _{\mathcal {T}} \bigg \{\frac{\mathrm {d}{{\widetilde{\theta }}}_j(t)}{\mathrm {d}t}-\gamma _{j0}-\sum _{k=1}^p\varvec{\gamma }_{jk}^\top \, \varvec{\phi }({{\widetilde{\theta }}}_k(t))\bigg \}^2 \mathrm {d}t \\&\quad +\sum _{k=1}^p{\text {pen}}_{\lambda _\gamma }(\Vert \varvec{\gamma }_{jk}\Vert _2), \end{aligned} \end{aligned}$$
(3.2)

which coincides with a penalized regression problem with group selection. Examples of \({\text {pen}}_{\lambda _\gamma }(\cdot )\) include the group lasso, group SCAD and group MCP. See Huang et al. (2012) and Breheny and Huang (2015) for a comprehensive review of efficient algorithms. The key building block of such algorithms is sketched below.
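For concreteness, the sketch below gives the block soft-thresholding operator underlying proximal algorithms for the group lasso, together with the adaptive weights introduced next; it is a generic illustration, not the specific solver used in our implementation.

```python
# Sketch of group-wise shrinkage for group-lasso-type penalties.
import numpy as np

def group_soft_threshold(g, thr):
    """Proximal map of thr * ||.||_2: shrinks the whole block toward zero
    and sets it exactly to zero when ||g||_2 <= thr."""
    norm = np.linalg.norm(g)
    return np.zeros_like(g) if norm <= thr else (1.0 - thr / norm) * g

def adaptive_weights(pilot_blocks, nu=1.0):
    """w_jk = ||gamma_tilde_jk||_2^{-nu}; an infinite weight permanently
    removes groups zeroed out by the pilot group-lasso fit."""
    norms = np.array([np.linalg.norm(g) for g in pilot_blocks])
    with np.errstate(divide="ignore"):
        return np.where(norms > 0, norms**(-float(nu)), np.inf)
```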

In our JADE algorithm, we choose the adaptive group lasso penalty \({\text {pen}}_{\lambda _\gamma }(\Vert \varvec{\gamma }_{jk}\Vert _2) = \lambda _\gamma \, w_{jk} \Vert \varvec{\gamma }_{jk}\Vert _2\). It has been shown to achieve both estimation efficiency and selection consistency (Wang and Leng 2008). Let \(\{{{\widetilde{\varvec{\gamma }}}}_{jk}, 1\le k\le p\}\) be the regular group lasso estimates. We follow Zou (2006) to set the weight \(w_{jk}=\Vert {{\widetilde{\varvec{\gamma }}}}_{jk}\Vert _2^{-\nu }\) if \(\Vert {{\widetilde{\varvec{\gamma }}}}_{jk}\Vert _2>0\), where \(\nu \) is a positive number, and \(w_{jk}=\infty \) otherwise. A remarkable feature of updating the ODE parameters is that the updates can be performed in parallel due to the block separability of the ODE fidelity and the group-wise sparse penalty in terms of \(\varvec{\gamma }_j\), \(j=1,\dots ,p\). Finally, the additive components in the ODE system are updated by implementing adaptive group lasso regressions in parallel. To summarize, we describe the updating rules of JADE in Algorithm 1.

[Algorithm 1: the JADE updating rules, displayed as a figure in the original]
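For readers' convenience, a high-level sketch of Algorithm 1 follows; update_c_j and solve_group_lasso stand for the subroutines of Sects. 3.1 and 3.2, and the stopping rule based on the relative objective change is an illustrative choice.

```python
# High-level sketch of Algorithm 1 (block coordinate descent for JADE).
def jade_bcd(c, gamma, update_c_j, solve_group_lasso, objective,
             max_iter=50, tol=1e-6):
    p = len(c)
    Q_old = objective(c, gamma)
    for _ in range(max_iter):
        for j in range(p):                     # update latent processes
            c[j] = update_c_j(j, c, gamma)     # gradient step + Armijo rule
        for j in range(p):                     # update ODE parameters; these
            gamma[j] = solve_group_lasso(j, c) # p problems can run in parallel
        Q_new = objective(c, gamma)
        if abs(Q_old - Q_new) <= tol * (1.0 + abs(Q_old)):
            break
        Q_old = Q_new
    return c, gamma
```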

3.3 Identifiability

Because of the additive structure of the ODE system, the ODE parameters are not fully identifiable. One can always add a constant to an additive component \(f_{jk}\) in (2.1) and subtract it from another, and the model remains the same. This is related to the collinearity of the basis functions. To address this issue, we require that each additive component have mean zero in the sense of its basis representation, that is, \(\sum _{i=1}^{m} \varvec{\gamma }_{jk}^\top \, \varvec{\phi }(\theta _k(t_i))= 0\) for \(1\le j,k\le p\), where \(t_1,\dots , t_m\in {\mathcal {T}}\). Once the ODE parameters \(\varvec{\gamma }_j\) are updated, we subtract the empirical mean \(m^{-1}\sum _{i=1}^{m} \varvec{\gamma }_{jk}^\top \, \varvec{\phi }(\theta _k(t_i))\) from the kth additive component and shift the intercept \(\gamma _{j0}\) to \(\gamma _{j0}+\sum _{k=1}^p m^{-1}\sum _{i=1}^{m} \varvec{\gamma }_{jk}^\top \, \varvec{\phi }(\theta _k(t_i))\).
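The centering step can be implemented as in the sketch below, which exploits the partition-of-unity property of B-splines (\(\sum _l \phi _l \equiv 1\) on the base interval), so that subtracting a constant from \(f_{jk}\) amounts to subtracting it from every coefficient; the interface is illustrative.

```python
# Sketch of the identifiability centering step (Sect. 3.3).
import numpy as np

def center_equation(gamma_j0, Gamma_j, Phi_list):
    """Phi_list[k]: (m, L) matrix with rows phi(theta_k(t_i)), i = 1..m.
    B-splines sum to one, so f_jk minus a constant c corresponds to
    subtracting c from each entry of gamma_jk."""
    for k, Phi in enumerate(Phi_list):
        c_bar = np.mean(Phi @ Gamma_j[k])       # mean of f_jk over the grid
        Gamma_j[k] = Gamma_j[k] - c_bar         # center the component
        gamma_j0 = gamma_j0 + c_bar             # absorb into the intercept
    return gamma_j0, Gamma_j
```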

3.4 Tuning parameter selection

Two tuning parameters appear in our JADE procedure. The first one is \(\lambda _\theta \), controlling the adherence of the latent processes to the ODE system. Similar to the generalized profiling approach (Ramsay et al. 2007), we expect a large \(\lambda _\theta \) such that the ODEs are satisfied exactly or approximately. As pointed out by Qi and Zhao (2010) and Carey and Ramsay (2021), there exist many local optima when \(\lambda _\theta \) is small, and \(\lambda _\theta \) should be gradually increased while monitoring the performance of the parameter estimates. However, our numerical experiments show that JADE is not sensitive to \(\lambda _\theta \) as long as \(\lambda _\theta \) is not too small, and we thereby set it as a fixed large constant. A numerical justification is provided in Sect. 4.3.

The second tuning parameter \(\lambda _\gamma \) controls the amount of sparse regularization on the ODE parameters. We propose to search for optimal \(\lambda _\gamma \) by minimizing the following criterion

$$\begin{aligned} \begin{aligned}&\sum _{j=1}^p \log \bigg [ \int _{\mathcal {T}} \bigg \{\frac{\mathrm {d}{{\widehat{\theta }}}_j(t)}{\mathrm {d}t}-{{\widehat{\gamma }}}_{j0} -\sum _{k=1}^p{{\widehat{\varvec{\gamma }}}}_{jk}^\top \, \varvec{\phi }({{\widehat{\theta }}}_k(t))\bigg \}^2 \,\mathrm {d}t \bigg ] \\&\quad + \sum _{j=1}^p \mathrm {nz}({{\widehat{\varvec{\gamma }}}}_{j}) \frac{\log (n)}{n}, \end{aligned} \end{aligned}$$
(3.3)

where \(\mathrm {nz}({{\widehat{\varvec{\gamma }}}}_{j})\) is the number of non-zero elements in the ODE parameter \({{\widehat{\varvec{\gamma }}}}_{j}\). It shares the spirit of the Bayesian information criterion commonly used in the ODE parameter estimation literature (Wu et al. 2014; Chen et al. 2017). The first ODE fidelity term can be understood as a goodness-of-fit measure of how compatible the estimated ODE parameters are with the latent process estimates, while the remaining term in (3.3) penalizes the model complexity. A sketch of this criterion is given below.
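The following sketch computes the criterion (3.3) on a quadrature grid; the array layouts are illustrative assumptions.

```python
# Sketch of the BIC-type criterion (3.3) used to select lambda_gamma.
import numpy as np

def tuning_criterion(t_grid, dtheta_hat, gamma0_hat, Gamma_hat, phi_hat, n):
    """dtheta_hat: (grid, p); phi_hat[k]: (grid, L) basis at theta_hat_k(t);
    Gamma_hat[j][k]: fitted coefficient blocks (possibly exactly zero)."""
    p = dtheta_hat.shape[1]
    score = 0.0
    for j in range(p):
        resid = dtheta_hat[:, j] - gamma0_hat[j]
        nz = 0
        for k in range(p):
            resid = resid - phi_hat[k] @ Gamma_hat[j][k]
            nz += int(np.count_nonzero(Gamma_hat[j][k]))
        score += np.log(np.trapz(resid**2, t_grid))   # log ODE fidelity
        score += nz * np.log(n) / n                   # complexity penalty
    return score
```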

3.5 Global convergence analysis

In our JADE algorithm, the latent processes are updated by a gradient descent method with the step size selected by the Armijo rule. We next show that the updating rule for the ODE parameters \(\varvec{\gamma }_j\) in Sect. 3.2 is equivalent to a gradient descent method with the step size chosen by the Armijo rule.

Recall that the penalty function \(R(\varvec{\gamma })\) is block-separable such that \(R(\varvec{\gamma }) = \sum _{j=1}^p R_j(\varvec{\gamma }_j)\). We define the gradient direction for minimizing \(Q({\mathbf {c}},\varvec{\gamma })\) with respect to \(\varvec{\gamma }_j\) as

$$\begin{aligned} \begin{aligned} {\mathbf {d}}_{\varvec{\gamma }_j}({\mathbf {c}},\varvec{\gamma })&=\mathop {{\mathrm{arg\,min}}}\limits _{{\mathbf {d}}\in {\mathbb {R}}^{pL+1}} \bigg \{ (\nabla _{\varvec{\gamma }_j} S)^\top {\mathbf {d}}+ \frac{1}{2} {\mathbf {d}}^\top (\nabla ^2_{\varvec{\gamma }_j} S) {\mathbf {d}}\\&\quad + R_j(\varvec{\gamma }_j+{\mathbf {d}}) \bigg \}, \end{aligned} \end{aligned}$$
(3.4)

where \(\nabla _{\varvec{\gamma }_j} S\) and \(\nabla ^2_{\varvec{\gamma }_j} S\) are the gradient and Hessian of \(S({\mathbf {c}},\varvec{\gamma })\) with respect to \(\varvec{\gamma }_j\). Since \(S({\mathbf {c}},\varvec{\gamma })\) is quadratic in each block vector \(\{\varvec{\gamma }_j: 1\le j\le p\}\), the group penalized solution in (3.2) amounts to updating \(\varvec{\gamma }_j\) to \(\varvec{\gamma }_j+{\mathbf {d}}_{\varvec{\gamma }_j}({\mathbf {c}},\varvec{\gamma })\) while holding the other block vectors fixed. With a slight abuse of notation, the above gradient update leads to a change of the JADE objective function value given by

$$\begin{aligned} \begin{aligned}&Q({\mathbf {c}}, \varvec{\gamma }_{-j}, \varvec{\gamma }_j+{\mathbf {d}}_{\varvec{\gamma }_j}) - Q({\mathbf {c}}, \varvec{\gamma })\\&\quad =(\nabla _{\varvec{\gamma }_j} S)^\top {\mathbf {d}}_{\varvec{\gamma }_j} + \frac{1}{2}{\mathbf {d}}_{\varvec{\gamma }_j}^\top (\nabla ^2_{\varvec{\gamma }_j} S)\, {\mathbf {d}}_{\varvec{\gamma }_j}\\&\qquad + R_j(\varvec{\gamma }_j + {\mathbf {d}}_{\varvec{\gamma }_j}) - R_j(\varvec{\gamma }_j). \end{aligned} \end{aligned}$$

By the definition of \({\mathbf {d}}_{\varvec{\gamma }_j}({\mathbf {c}},\varvec{\gamma })\) in (3.4), the right-hand side of the above display is always non-positive, and thus we can set the step size to one, which satisfies the Armijo rule (A1) in the Appendix.

To sum up, our JADE algorithm can be understood as a block coordinate gradient descent method with step sizes chosen by the Armijo rule. We need the following assumption for convergence analysis.

Assumption 1

The eigenvalues of the matrices \({\mathbf {H}}_{{\mathbf {c}}_j}\) and the Hessians \(\nabla ^2_{\varvec{\gamma }_j} S\) are bounded away from zero and infinity uniformly over \(j=1,\dots ,p\).

As discussed in Sect. 3.1, the eigenvalues of the \({\mathbf {H}}_{{\mathbf {c}}_j}\)’s are bounded by the thresholding construction. The Hessians \(\nabla ^2_{\varvec{\gamma }_j} S\), \(j=1,\dots ,p\), admit the explicit expression

$$\begin{aligned} 2\lambda _\theta \int _{{\mathcal {T}}} {{\overline{\varvec{\phi }}}}(\varvec{\theta }(t))\,{{\overline{\varvec{\phi }}}}(\varvec{\theta }(t))^\top \, \mathrm {d}t, \end{aligned}$$
(3.5)

where \({{\overline{\varvec{\phi }}}}(\varvec{\theta }(t))\) is the augmented basis function vector concatenating the B-spline bases \(\varvec{\phi }(\theta _1(t)),\ldots ,\varvec{\phi }(\theta _p(t))\). The second part of Assumption 1 is essentially a condition on the basis function \(\varvec{\phi }(\cdot )\), which is also adopted by Chen et al. (2017) to ensure identifiability. In fact, the eigenvalues of (3.5) are upper bounded because of the boundedness of the B-spline basis and the compactness of \({\mathcal {T}}\). When no two of the latent processes are too close to each other over \({\mathcal {T}}\), the eigenvalues of (3.5) are determined by the diagonal blocks \(\int _{{\mathcal {T}}} \varvec{\phi }(\theta _k(t))\,\varvec{\phi }(\theta _k(t))^\top \, \mathrm {d}t \), \(k=1,\dots ,p\), which are positive definite matrices due to the linear independence of B-splines (de Boor 1978). These bounds can also be checked numerically, as sketched below. Finally, the global convergence result follows from Theorem 1 of Tseng and Yun (2009).
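In practice, one can inspect the eigenvalue bounds by forming (3.5) with trapezoid quadrature, as in the following sketch; the augmented design Phi_bar is assumed to be precomputed.

```python
# Sketch: numerical check of Assumption 1 via the explicit Hessian (3.5).
import numpy as np

def hessian_gamma_block(t_grid, Phi_bar, lam_theta):
    """Phi_bar: (m, pL+1) rows of the augmented basis (intercept included)."""
    dt = np.diff(t_grid)
    w = np.zeros(len(t_grid))                  # trapezoid quadrature weights
    w[:-1] += dt / 2
    w[1:] += dt / 2
    H = 2.0 * lam_theta * (Phi_bar.T * w) @ Phi_bar
    return H, np.linalg.eigvalsh(H)            # eigenvalues should stay in (0, inf)
```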

Theorem 1

Let \(\{({\mathbf {c}}^r,\varvec{\gamma }^r)\}\) be the sequence generated by the JADE Algorithm 1 under Assumption 1. If the step sizes \(\{\alpha _j^r: 1\le j\le p\}\) are chosen by the Armijo rule with \(\inf _{1\le j\le p} \alpha _{\mathrm{init},j}>0\) and \(\sup _{j,r} \alpha _j^r<\infty \), then every cluster point of the sequence \(\{({\mathbf {c}}^r, \varvec{\gamma }^r)\}\) is a stationary point of the objective function \(Q({\mathbf {c}}, \varvec{\gamma })\).

4 Simulation studies

In this section, we compare the performance of the JADE procedure with two representative two-stage collocation methods, GRADE (Chen et al. 2017) and SA-ODE (Wu et al. 2014), on sparse additive ODEs.

4.1 Set-up

We consider the example from Chen et al. (2017), where the number of latent processes is \(p=10\). The latent processes \(\{\theta _j(t): 1\le j\le p, t\in {\mathcal {T}}=[0,20]\}\) satisfy the ODE system:

$$\begin{aligned} {\left\{ \begin{array}{ll} \theta '_{2k-1}(t) = \beta _{2k-1,0} + \varvec{\beta }^\top _{2k-1,2k-1}\,{\mathbf {h}}(\theta _{2k-1}(t))\\ \qquad \quad \qquad + \varvec{\beta }^\top _{2k-1,2k}\,{\mathbf {h}}(\theta _{2k}(t)) \\ \theta '_{2k}(t) = \beta _{2k,0} + \varvec{\beta }^\top _{2k,2k-1}\,{\mathbf {h}}(\theta _{2k-1}(t))\\ \qquad \quad \qquad + \varvec{\beta }^\top _{2k,2k}\,{\mathbf {h}}(\theta _{2k}(t)) \end{array}\right. }, \end{aligned}$$
(4.1)

where \(k= 1,\ldots ,5\), and \({\mathbf {h}}(\theta ) = (\theta , \theta ^2, \theta ^3)^\top \) is the cubic monomial basis. Among all \(\varvec{\beta }_{jk}\)’s where \(1\le j,k\le 10\), eight are nonzero, while the remaining ninety-two are zero. The detailed specification of (4.1) is presented in Appendix B. The system (4.1) can be numerically solved using the R package deSolve given initial conditions; the sketch below shows the analogous computation.
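As a sketch, one coupled pair \((\theta _{2k-1},\theta _{2k})\) from (4.1) can be solved with scipy's solve_ivp, the Python analogue of R's deSolve; the \(\beta \) values below are illustrative placeholders, not the ones specified in Appendix B.

```python
# Sketch of numerically solving one coupled pair from system (4.1).
import numpy as np
from scipy.integrate import solve_ivp

h = lambda th: np.array([th, th**2, th**3])       # cubic monomial basis

def pair_rhs(t, theta, b10, b11, b12, b20, b21, b22):
    th1, th2 = theta
    return [b10 + b11 @ h(th1) + b12 @ h(th2),
            b20 + b21 @ h(th1) + b22 @ h(th2)]

# Placeholder coefficients (NOT the values from Appendix B).
beta = dict(b10=0.1, b11=np.array([-0.5, 0.0, 0.0]),
            b12=np.array([0.3, 0.0, -0.1]),
            b20=0.0, b21=np.array([0.2, -0.1, 0.0]),
            b22=np.array([-0.4, 0.0, 0.0]))
sol = solve_ivp(pair_rhs, (0, 20), y0=[1.0, -1.0], args=tuple(beta.values()),
                dense_output=True)
theta_path = sol.sol(np.linspace(0, 20, 200))     # (2, 200) latent trajectories
```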

Given the latent processes, we generate samples from Gaussian, Poisson and Bernoulli distributions, respectively. Let \(t_1,\ldots ,t_n\) be equally spaced over \({\mathcal {T}}\) with various sample sizes \(n=40, 100\), and 200. For the Gaussian distribution, \(y_{ij}\) is drawn from \({\mathcal {N}}(\theta _j(t_i), \sigma ^2)\) with known variance \(\sigma ^2\) at different levels. For the Poisson distribution, we allow for 10 replicated observations at each time point from \({\text {Poisson}}(\lambda _j(t_i))\), where the intensity \(\lambda _j(t_i)=\exp (\theta _j(t_i))\). For the Bernoulli distribution, we allow for 40 replicated observations at each time point with the probability of success being \(p_j(t_i)=\exp \{\theta _j(t_i)\}/(1+\exp \{\theta _j(t_i)\})\).
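A sketch of the three sampling schemes, given latent values theta on the observation grid, is as follows; n_rep would be set to 10 for the Poisson case and 40 for the Bernoulli case.

```python
# Sketch of the sampling schemes in Sect. 4.1; theta has shape (n, p).
import numpy as np
rng = np.random.default_rng(0)

def sample_obs(theta, family, sigma=0.5, n_rep=1):
    if family == "gaussian":
        return rng.normal(theta, sigma)                       # N(theta, sigma^2)
    if family == "poisson":                                   # n_rep replicates
        lam = np.exp(theta)                                   # lambda = exp(theta)
        return rng.poisson(lam, size=(n_rep,) + theta.shape)
    if family == "bernoulli":                                 # n_rep replicates
        prob = 1.0 / (1.0 + np.exp(-theta))                   # logistic link
        return rng.binomial(1, prob, size=(n_rep,) + theta.shape)
    raise ValueError(family)
```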

In the JADE procedure, we initialize with the smoothing spline estimates for the latent processes, using the function ssanova for Gaussian observations and the function gssanova for Poisson or Bernoulli observations in the R package gss. When estimating the additive ODE components, we choose a cubic spline basis with four knots and the logistic transformation function \(\sigma (\theta )=(1+e^{-\theta })^{-1}\). The block coordinate descent algorithm updates the parameters for a maximum of four iterations. GRADE and SA-ODE are implemented with their default configurations.

We evaluate the performance of the methods in two respects. The first is the estimation of the latent processes and the additive ODE components. We use the following two averaged mean squared errors (MSE) to measure the estimation error of the latent processes and their derivatives, respectively,

$$\begin{aligned} {\mathrm {MSE}}({{\widehat{\varvec{\theta }}}}(t))&= \frac{1}{p} \sum _{j=1}^{p} \int _{{\mathcal {T}}} \left\{ {{\widehat{\theta }}}_j(t) - \theta _j(t) \right\} ^2 \mathrm {d}t, \\ {\mathrm {MSE}}({{\widehat{\varvec{\theta }}}}'(t))&= \frac{1}{p} \sum _{j=1}^{p} \int _{{\mathcal {T}}} \left\{ {{\widehat{\theta }}}'_j(t) - \theta ^{'}_j(t) \right\} ^2 \mathrm {d}t. \end{aligned}$$

To better characterize the estimation of the additive ODE components, we assess the estimation error on the active set \({\mathcal {A}}=\{f_{jk}:f_{jk}\not \equiv 0\}\) and the inactive set \({\mathcal {A}}^c\) separately, where the universe set consists of all additive components. Specifically, we consider

$$\begin{aligned}&{\mathrm {MSE}}_{{\text {active}}}({\widehat{f}})= \frac{1}{|{\mathcal {A}}|} \sum _{f_{jk}\in {\mathcal {A}}} \int _{{\mathcal {R}}_k} \left\{ {\widehat{f}}_{jk}(\theta ) - f_{jk}(\theta )\right\} ^2 \mathrm {d}\theta ,\\&{\mathrm {MSE}}_{{\text {inactive}}}({\widehat{f}})= \frac{1}{|{\mathcal {A}}^c|} \sum _{f_{jk}\in {\mathcal {A}}^c} \int _{{\mathcal {R}}_k} \left\{ {\widehat{f}}_{jk}(\theta ) - f_{jk}(\theta )\right\} ^2 \mathrm {d}\theta , \end{aligned}$$

where \(|\cdot |\) denotes the cardinality of a set, and \({\mathcal {R}}_k\) is the range of \(\theta _k(t)\) on \({\mathcal {T}}\). The second aspect is network discovery. True-positive rates and false-positive rates are used to describe the accuracy of identifying nonzero additive ODE components. We repeat each experiment 100 times and present the averaged performance evaluations in Tables 1, 2 and 3. These integrals can be approximated by numerical quadrature, as sketched below.
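The following sketch computes the integrated MSE for the latent processes by trapezoid quadrature; the same helper applies, with the obvious substitutions, to the derivatives and to the additive components over \({\mathcal {R}}_k\).

```python
# Sketch of the averaged MSE metric, computed on a fine grid over T.
import numpy as np

def mse_theta(theta_hat, theta_true, t_grid):
    """theta_hat, theta_true: (grid, p) arrays of fitted and true processes."""
    err = (theta_hat - theta_true)**2
    return np.mean([np.trapz(err[:, j], t_grid) for j in range(err.shape[1])])
```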

4.2 Results

Table 1 Gaussian observation. Comparison of methods in latent process fitting, ODE additive components estimation and network discovery. Standard errors are displayed in the parentheses under the evaluation scores

Table 1 presents the results for Gaussian observations under various sample sizes and signal-to-noise ratios. In terms of latent process and derivative fitting, both GRADE and SA-ODE use the smoothing spline estimates without using any information from the ODE system. As expected, they are outperformed by JADE under all simulation settings, since the estimates by JADE are regularized by the ODE fidelity and achieve substantial improvement. In some cases with a small sample size, JADE reduces \({\text {MSE}}({{\widehat{\varvec{\theta }}}})\) by about \(30\%\) compared with the competing methods. In terms of estimating nonzero ODE additive components, JADE also demonstrates its strength. When the sample size is 40 or 100, JADE yields the smallest \({\text {MSE}}_{{\text {active}}}({\widehat{f}})\), which is natural as JADE uses improved latent process estimates. However, when the smoothing spline estimates are already accurate enough, the advantage of JADE over the two-stage methods becomes less significant. In particular, when the sample size is 200, GRADE beats JADE in two of the three experiments, because it adopts an integrated form of the ODE system. Comparing the methods that use the ODEs directly, JADE performs better than SA-ODE. Concerning the metric \({\text {MSE}}_{{\text {inactive}}}({\widehat{f}})\), GRADE has a dominating performance because it yields fewer false positives, which is consistent with the findings in Chen et al. (2017). For network recovery, all three methods identify at least 80% of the nonzero additive components correctly. Under the scenario with a small sample size and a lower noise level, the true-positive rates of the three methods can all reach 100%. Roughly speaking, JADE and SA-ODE tend to select more nonzero additive components than GRADE.

Table 2 Poisson observation. Comparison of methods in latent process fitting, ODE additive components estimation and network discovery. Standard errors are displayed in the parentheses under the evaluation scores
Table 3 Bernoulli observation. Comparison of methods in latent process fitting, ODE additive components estimation and network discovery. Standard errors are displayed in the parentheses under the evaluation scores

Tables 2 and 3 display the comparison results for Poisson and Bernoulli observations, respectively. Note that GRADE and SA-ODE are originally designed for Gaussian observations only. We extend them under the likelihood-based framework and develop an iteratively re-weighted least squares algorithm (Wood 2017) for efficient computation. Results of both the Poisson and Bernoulli cases deliver a conclusion similar to the Gaussian case. Although we allow for multiple observations at each time point, non-Gaussian data modeling is in general a more challenging task. We notice that the estimation errors and the false-positive rates for network recovery are significantly higher than in the Gaussian case. In terms of latent process and derivative fitting, JADE remains the best because of the use of ODE regularization. Regarding the network recovery, GRADE delivers the best false-positive rates even in the most challenging case with Bernoulli observations, while JADE and SA-ODE are less conservative in selecting nonzero ODE components.

To sum up, GRADE adopts an integrated form of the ODE system and achieves the most reliable network recovery performance. JADE not only delivers competitive performance in network discovery, but is also superior in reducing the estimation errors of the latent processes and additive ODE components.

4.3 Insensitivity of \(\lambda _\theta \)

In Sect. 3.4, we claim that JADE is insensitive to \(\lambda _\theta \) as long as it is not too small. We use the ODE system from Sect. 4.1 and generate 100 Gaussian samples with the signal-to-noise ratio set to 10 for each ODE process. All experiments are repeated 100 times, and the averaged results are given in Table 4. We observe that when \(\lambda _\theta \) is no smaller than 0.1, the performance of the JADE procedure shows only minor variations. When \(\lambda _\theta \) is smaller than 0.1, the estimation of the ODE additive components becomes worse, while the other criteria remain robust. Therefore, we suggest that \(\lambda _\theta \) can be fixed as a reasonable constant, for example, \(\lambda _\theta =1\) in this small simulation.

Table 4 Performance of JADE with different \(\lambda _\theta \)’s. Standard errors are displayed in the parentheses under the evaluation scores

5 Real data examples

5.1 Gene regulatory network in yeast cell cycle

Inferring gene regulatory networks (GRNs) has attracted much research interest (Emmert-Streib et al. 2014). In this study, we focus on the GRN in the yeast cell cycle. It is beneficial to obtain a causal understanding of the regulatory effects among the genes controlled by the yeast cell cycle. The data used in our study are from the open database provided by Spellman et al. (1998). They synchronized the cell cycle using the \(\alpha \)-factor and measured the mRNA expression levels of 6,178 genes with high-throughput microarrays at 7-minute intervals for 119 minutes. We choose 100 representative genes that were identified as meeting the minimum criterion for cell cycle regulation in Spellman et al. (1998). For each gene, the time when the gene expression level reaches its peak is recorded. The peaking time falls into one of the five phases of the cell cycle, G1, S, S/G2, G2/M and M/G1, indicating the period in which the gene displays its primary function.

Fig. 1 The estimated gene regulatory network among 100 genes during the yeast cell cycle, in which 63 isolated nodes are not displayed. Genes are shaded with different colors to indicate the phases where the expression levels reach their peaks

We treat the time-course gene expression levels as Gaussian observations and normalize them to zero mean and unit variance. The mean expression levels of the 100 genes are fitted as the latent ODE processes. The gene regulatory network recovered by JADE is displayed in Fig. 1. Among the 100 genes, 37 are estimated to regulate or be regulated by other genes, while the remaining 63 are regarded as isolated genes and are thus not displayed. Genes are shaded with different colors to indicate the cell cycle phase in which their expression levels peak. The thickness of an edge indicates the strength of the estimated regulatory effect measured by the \(L_2\) norm.

Our result recognizes MFA1 and FLR1 as the hub nodes that regulate the most genes. MFA1, an \(\alpha \)-factor precursor, has been identified to have a strong expression in the \(\alpha \)-factor based experiment (Spellman et al. 1998), which explains why it shows significant regulatory effects. FLR1 controls the production of a major facilitator superfamily that facilitates small solutes across cell membranes, which is also fundamental to biological processes (Nguyen et al. 2001). We also find that some genes with the same peaking phase may form a connected cluster in the regulatory network, such as the cluster with SMC1, MFA1, FKS1 and SPC98, and the cluster with YGR260W, CDC20 and PMP1. These clusters suggest there might be co-regulation among multiple genes during the same period. We also notice that no gene of the S/G2 phase is connected with FLR1, which peaks at the S phase, whereas S/G2 genes account for about 35% of the targets regulated by MFA1. This result implies that the regulatory effects may depend on the peaking phases.

We validate the reconstructed network with the Web-based Gene Set Analysis Toolkit (WebGestalt) (Liao et al. 2019). WebGestalt performs enrichment analysis by first expanding the estimated network with 10 additional candidate genes besides the initial 100 genes and then generating a gene co-expression network, where the edges indicate strong correlations or dependencies in the expression. Details of the enrichment analysis are given in the Supplementary Material. By comparing the JADE and WebGestalt networks, we observe that the adjacent nodes identified by JADE are either directly connected or connected through at most one candidate gene in the WebGestalt network. For example, MFA1 and GIN4 both have co-expression with the candidate gene CCR4 in the WebGestalt network, which is represented by the edge between MFA1 and GIN4 in the JADE network.

Fig. 2 The nonlinear regulatory effects that the hub MFA1 exerts on six different genes. In each panel, the x-axis measures the mean expression level of MFA1, and the y-axis measures the instantaneous effect of MFA1 on the mean expression level of another gene

Table 5 Companies selected in eight sectors for stock price data analysis

Next, we inspect the nonlinear regulatory effects. In Fig. 2, we display the regulatory effects of the hub MFA1 on six genes. All of them exhibit strong nonlinear patterns varying with the gene expression level of MFA1. For example, the regulatory effect on PMI40 is negative when the expression level of MFA1 is low and positive otherwise. This result implies the necessity of using the sparse additive ODE system for GRN modeling. It may help not only to identify the nonlinear regulatory effects, but also to monitor how these effects change with the varying expression levels.

A comparison with the results by GRADE and SA-ODE is available in the Supplementary Material. All three methods identify MFA1 and FLR1 as the hub nodes, confirming their vital regulatory functions. Moreover, the JADE network recognizes that FLR1 has a more influential role in this gene regulatory network. For example, FLR1 is connected to SMC1, which encodes an essential protein involved in chromosome segregation and double-strand DNA break repair (Cherry et al. 1998; Strunnikov et al. 1993). This finding suggests that the expression of SMC1 can be affected by the production of the major facilitator superfamily controlled by FLR1. By contrast, neither the SA-ODE nor the GRADE network has FLR1 connected to SMC1.

5.2 Stock price change direction

The coronavirus pandemic, breaking out at the beginning of 2020, resulted in severe economic disruption worldwide and massive turmoil in the financial markets. To motivate the use of the JADE method in a practical application with Bernoulli observations, we consider modeling the change-direction patterns of stock prices. The objective of this study is to estimate the latent probability processes of price increases and characterize the nonlinear effects among stocks. To this end, we select 40 companies and group them into \(p=8\) sectors according to their major business. The number of companies in each sector is \(L=5\). The list of the companies is presented in Table 5.

Fig. 3 Three estimated nonlinear effects among eight sectors in the stock price data analysis. The x-axis \(p_k, 1\le k\le 8,\) denotes the probability of stock price increase within Sector k, and the y-axis, denoted by \(f_{jk}\), measures the effect of Sector k on Sector j

We collect the stock prices for all trading days \(t_1,\ldots , t_{248}\) during the year 2020. For the lth company from Sector k, where \(1\le l\le 5\) and \(1\le k\le 8\), we encode its stock price change direction on trading day \(t_i\) as a Bernoulli observation. If its closing price on trading day \(t_{i+d}\) is higher than that on trading day \(t_i\), we set \(y_{ikl}=1\), and \(y_{ikl}=0\) otherwise. We choose \(d=3\) to avoid spurious day-to-day fluctuations in price changes. We assume that all stocks from the same sector share a common price-changing pattern, so that the Bernoulli observations are regarded as repeated measurements from a binomial distribution. More precisely, \(y_{ik}=\sum _{l=1}^L y_{ikl}\) follows a binomial distribution with the probability of increase being \(p_k(t_i) = 1/(1+\exp (-\theta _k(t_i)))\), where the \(\theta _k(t)\)'s are latent processes governed by a sparse additive ODE system

$$\begin{aligned} \theta ^{'}_j(t) = \gamma _{j0} + \sum _{k=1}^8 f_{jk}(\theta _k(t)),\qquad j= 1,\ldots ,8. \end{aligned}$$
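A sketch of the Bernoulli encoding and the binomial aggregation is given below, assuming the closing prices are stored in an array of shape (number of trading days, 8 sectors, 5 companies).

```python
# Sketch of the encoding in Sect. 5.2: compare each closing price with the
# price d trading days later, then aggregate within each sector.
import numpy as np

def encode_directions(prices, d=3):
    """prices: (n_days, 8, 5) closing prices."""
    up = (prices[d:] > prices[:-d]).astype(int)   # y_{ikl} = 1 if the price rises
    return up.sum(axis=2)                         # y_{ik}: binomial counts, 0..5
```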

In Fig. 3, we display three estimated nonlinear effects representing the potential interactions between sectors. In each panel, the x-axis measures the probability that stock prices in a sector will increase, and the y-axis, denoted by f, measures how the derivative of the affected sector is influenced. Given the x-coordinate, a positive f value means a higher probability of seeing the stock prices of the affected sector increase, and a negative f suggests the opposite tendency. In the top left panel, we observe that when the probability of an increasing price in Sector 4 (Consumer Services) is greater than 0.5, the f value is negative, which implies slowing growth or a decreasing trend in the stock price of Sector 5 (Online Retail Shopping). Meanwhile, when the probability of a price increase for Sector 4 is less than 0.1, the f value is positive, implying a potential price increase for Sector 5. Such a negative correlation indicates the competitive nature between Consumer Services and Online Retail Shopping. Due to the health concerns brought by the coronavirus pandemic, the number of consumers visiting department stores significantly decreased. At the same time, innumerable online orders boosted the stock prices of e-commerce companies. Similarly, we observe upward momentum exerted by Sector 5 (Online Retail Shopping) on Sector 1 (Information Technology) and Sector 3 (Pharmaceutical). We also apply GRADE and SA-ODE to investigate the interactions among the same set of sectors and display them in the Supplementary Material. The estimated functional components are similar except in some regions. For example, \(f_{54}\) estimated by GRADE wiggles around 0, indicating that Sector 4 (Consumer Services) has no clear effect pattern on Sector 5 (Online Retail Shopping); \(f_{15}\) estimated by SA-ODE has a sharp change from positive to negative in [0.4, 0.6] and then quickly increases from negative to positive in [0.6, 0.8], which seems unrealistic.

Fig. 4 The fitted probabilities that the stock prices will increase in three trading days, varying over time. Each trajectory runs from January 1 to December 31, 2020. The red color indicates the probability is above 0.5, and the green color indicates the opposite. The green dashed lines mark February 20 and the red dashed lines mark November 24

The estimated ODE trajectories for each sector are plotted in Fig. 4. We first observe that all the trajectories dropped to their valleys after the sudden crash of the global stock market on February 20, 2020. During that period, multiple circuit breakers were triggered by the coronavirus pandemic. From then on, several sectors, such as consumer services and the energy industry, kept suffering from long-lasting stock price declines. In the meantime, some other companies seized the opportunity and made huge profits. Besides the online retail companies, information technology companies kept rising amid the growing demand for information services and electronic devices. At the end of 2020, we witnessed rebounds in all sectors. On November 24, the Dow Jones index hit 30,000 during the United States presidential transition.

6 Conclusion

In this article, we propose a new approach called JADE for parameter estimation in generalized sparse additive ODEs. A unified estimation framework is developed by taking into account the data likelihood, the ODE fidelity and the sparsity regularization simultaneously. It covers the existing two-stage collocation methods, the generalized profiling method and our JADE procedure. As a joint approach, JADE has an advantage in estimating the latent processes and additive ODE components by regularizing the smooth fits with the ODE system, while the group lasso penalty helps achieve a sparse estimation of the network structure at the same time.

To address the computational issues, we design a block coordinate descent algorithm with provable global convergence. Compared to the classical iteratively re-weighted least squares algorithm for generalized linear models, JADE adopts a gradient-descent optimization strategy in which an inexact line search is applied instead of exact Hessian calculations. It is simple, highly parallelizable, and achieves great generalization performance. Empirically, JADE is superior in both latent process and ODE parameter estimation, together with reliable sparse network identification. To sum up, JADE is a good alternative for the dynamical modeling of non-Gaussian data with nonlinear ODEs.