This paper describes a systematic approach for system design space exploration through the application of machine learning (ML) methods for advanced system analysis. A demonstration of applying this method for signal integrity analysis, and a case study of 112Gb SerDes systems analysis based on channel operating margin (COM) simulation methodology, are provided. This work was presented at DesignCon 2020.
Design space exploration plays an extremely important role in SerDes system design. The outcome of the design exploration is a set of decisions on possible values and ranges of parameters defining stackups, materials, interconnect geometry and equalization. Covering a design space for all possible configurations and variations may require an enormous number of parameter value combinations. Simple sweeps of design parameters may not be suitable, and expertise-based analysis becomes complex when tens or hundreds of features are involved. A systematic, automated approach is required to address SerDes design solution space coverage for multiple equalization mechanisms and various channel configurations affecting the system performance.
In this paper we demonstrate a practical application of machine learning (ML) based methods for advanced design space exploration. The ML approach we use for design exploration is called feature range analysis, or simply “range analysis” [1].
Applying ML techniques to design space exploration of system performance allows a methodical, automated analysis of the solution space, factoring in a variety of operating conditions, controlled and uncontrolled factors, and multiple system configurations. This analysis provides a feasible way to handle the complexity of Ethernet systems and yields insights on system behavior that are comprehensible to engineers. It can be used as a decision support tool for design choices by the system architect, the signal integrity (SI) designer, the SI engineer, and others.
Here is a brief outline of the three steps in the suggested systematic ML-based methodology, which is explained further in the paper [2].
First, the solution space for a typical serial chip-to-chip (C2C) link (such as the one shown in Fig. 1) is mapped along with its constraints and the multiple channel models that correspond to the cases of interest required to cover this space. The selection of important parameters and their possible variations should be defined by an expert. Models for the solution space are generated with an EM simulator.
Fig. 1. A typical simple serial link under investigation: TP0 to TP5 are locations of ports for analysis with the reference package; TPa to TPb are locations of ports for analysis with the custom package model.
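To give a feel for the first step, the following minimal Python sketch enumerates a small design space and shows how quickly the number of cases grows. It is an illustration only, not the flow used in the paper: the parameter names and most of the levels are assumptions (only the 12 and 31 mm package lengths and the 0.5 in to 12 in PCB length span appear in the text), and in practice each generated case would be passed to an EM simulator to produce a channel model.

```python
import itertools
import random

# Hypothetical parameter levels. Names and most values are illustrative
# assumptions, not the exact set used in the paper.
design_space = {
    "pkg_len_mm": [12, 31],
    "pcb_len_in": [0.5, 2, 4, 6, 9, 12],
    "pcb_imp_ohm": [85, 90, 95, 100],
    "loss_tangent": [0.002, 0.004, 0.006],
}


def full_factorial(space):
    """Return every combination of parameter levels as a list of dicts."""
    names = list(space)
    return [dict(zip(names, combo))
            for combo in itertools.product(*space.values())]


def random_sample(space, n, seed=0):
    """Draw n random configurations (with replacement) from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(levels) for name, levels in space.items()}
            for _ in range(n)]


cases = full_factorial(design_space)
print(len(cases))  # 2 * 6 * 4 * 3 = 144
```

Even four parameters with a handful of levels each give 144 cases, and the count multiplies with every added parameter, which is why a simple exhaustive sweep stops being practical and a sampled subset (as in the hypothetical `random_sample` helper) is often used instead.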
Second, an investigation of the system-level performance is conducted, covering variation within the same design and manufacturing tolerances. In this work, the COM methodology of the IEEE 802.3 standard is used, which enables evaluation of overall system performance as well as channel quality when used with the baseline transmitter and receiver described in the standard, with configurable equalization capabilities. COM is a single number that characterizes the quality of the C2C link, which makes it very convenient for use in an ML algorithm. This method also allows channel designers to gain insight into their expected product quality without the need for proprietary simulators or detailed information regarding their device.
Third, we perform a design/system exploration as follows: given a response variable, we find the parameters (or features, in ML terminology) having the greatest effect on the response. The response variable in this case is the COM metric, with 3 dB as the pass/fail (good/bad) level and 4 dB as the level for excellent performance. We look for combinations (conjunctions) of ranges of numerical features or values of nominal features having the greatest effect on the response variable. The main question we answer using the ML techniques is the following: if the response variable does not satisfy the spec (COM < 3 dB), which parameter combinations and value ranges account for that? The root cause of the failures in the design is identified, and insight on how to optimize the design is then provided.
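The idea of searching for a feature range associated with failures can be sketched in a few lines of Python. This is a deliberately simplified brute-force illustration, not the range analysis algorithm of [1]: the scoring rule (failure rate, ties broken by support), the `min_support` parameter, the function names, and the data are all assumptions. The COM values below are synthetic and are not results from the paper; treating 4 dB and above as "excellent" is also an assumed reading of the thresholds.

```python
def classify_com(com_db):
    """Label a COM value: below 3 dB fails, 4 dB and above is excellent."""
    if com_db < 3.0:
        return "fail"
    if com_db < 4.0:
        return "good"
    return "excellent"


def best_failing_range(samples, feature, min_support=2, fail_below_db=3.0):
    """Brute-force search for the [lo, hi] range of one numeric feature
    with the highest failure rate (ties broken by larger support).

    samples: list of (params_dict, com_db) pairs.
    """
    values = sorted({params[feature] for params, _ in samples})
    best_key, best_range = None, None
    for i, lo in enumerate(values):
        for hi in values[i:]:
            inside = [com for params, com in samples
                      if lo <= params[feature] <= hi]
            if len(inside) < min_support:
                continue
            fail_rate = sum(c < fail_below_db for c in inside) / len(inside)
            key = (fail_rate, len(inside))
            if best_key is None or key > best_key:
                best_key, best_range = key, (lo, hi)
    if best_key is None:
        return None
    return {"feature": feature, "range": best_range,
            "fail_rate": best_key[0], "support": best_key[1]}


# Synthetic toy data (not from the paper): (parameters, COM in dB).
samples = [
    ({"pkg_len_mm": 12, "pcb_len_in": 0.5}, 2.1),
    ({"pkg_len_mm": 12, "pcb_len_in": 4}, 2.8),
    ({"pkg_len_mm": 12, "pcb_len_in": 12}, 1.9),
    ({"pkg_len_mm": 31, "pcb_len_in": 0.5}, 3.2),
    ({"pkg_len_mm": 31, "pcb_len_in": 4}, 4.5),
    ({"pkg_len_mm": 31, "pcb_len_in": 12}, 2.7),
]

result = best_failing_range(samples, "pkg_len_mm")
print(result)
```

On this toy data the search singles out the range [12, 12] for `pkg_len_mm`, where every sample fails. A production method would need a more careful score that balances failure rate against coverage, and would have to scale to many features, which an exhaustive search like this one does not.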
As an example, the results of a 112Gb system case investigation for a simple PCB link shown in Fig. 1 are provided in Table 1 and Table 2 and further illustrated in Fig. 2. The features or parameters in this simple case are package and PCB link lengths, PCB interconnect impedance, dielectric thickness, dielectric constant and loss tangent, conductor roughness parameters, and spacing for differential traces.
Table 1. Important single-range features for systems with package lengths of 12 and 31 mm.
Table 2. Important range-triplet features for systems with package lengths of 12 and 31 mm.
The analysis shows that, surprisingly, the most important single-range feature (Table 1) is not the PCB channel length (s4p_Tx_PCB_L), as commonly thought, but the package length (Pkg_len_Rx). The triplet of characteristics identified as having the strongest impact on performance (Table 2) consists of: (1) package length (Pkg_len_Rx), (2) PCB channel length (s4p_Tx_PCB_L) and (3) PCB channel impedance (s4p_PCB_Imp).
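A range triplet can be read as a conjunction of three range conditions, one per feature. The short Python sketch below shows how such a conjunction could be evaluated against a set of simulated cases. The feature names are taken from the text, but the units, the range bounds, the helper name `conjunction_stats`, and all COM values are assumptions made for illustration; the data is synthetic and does not reproduce Table 2.

```python
def conjunction_stats(samples, ranges, fail_below_db=3.0):
    """Evaluate a conjunction of range conditions.

    samples: list of (params_dict, com_db) pairs.
    ranges:  dict mapping feature name -> (lo, hi), inclusive bounds.
    Returns (support, fail_rate); fail_rate is None if nothing matches.
    """
    matched = [com for params, com in samples
               if all(lo <= params[f] <= hi for f, (lo, hi) in ranges.items())]
    if not matched:
        return 0, None
    fails = sum(com < fail_below_db for com in matched)
    return len(matched), fails / len(matched)


# Synthetic toy data (not from the paper). Units are assumed:
# package length in mm, PCB length in inches, impedance in ohms.
samples = [
    ({"Pkg_len_Rx": 12, "s4p_Tx_PCB_L": 0.5, "s4p_PCB_Imp": 85}, 2.0),
    ({"Pkg_len_Rx": 12, "s4p_Tx_PCB_L": 0.5, "s4p_PCB_Imp": 100}, 2.6),
    ({"Pkg_len_Rx": 12, "s4p_Tx_PCB_L": 6, "s4p_PCB_Imp": 100}, 3.1),
    ({"Pkg_len_Rx": 31, "s4p_Tx_PCB_L": 6, "s4p_PCB_Imp": 100}, 4.2),
]

# A hypothetical triplet: short package AND short PCB link AND any impedance
# between 85 and 100 ohms.
triplet = {
    "Pkg_len_Rx": (12, 12),
    "s4p_Tx_PCB_L": (0.5, 2),
    "s4p_PCB_Imp": (85, 100),
}

support, fail_rate = conjunction_stats(samples, triplet)
print(support, fail_rate)
```

Reporting support alongside failure rate matters here: a conjunction that matches only one or two cases can show a 100% failure rate by chance, so a real analysis would weigh both.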
The effect of PCB link length is further illustrated in Fig. 2, which shows COM vs total loss at the Nyquist frequency separately for 6 PCB link lengths. We can see that links that are too short or too long fail, while mid-length links perform much better. We can also see that links with shorter interconnects in the package fail systematically. An SI expert can easily explain this, but in this case the conclusion was produced by the automated ML-based algorithm, which is the most important outcome of the approach. Details are provided in the paper [2].
Fig. 2. COM vs link total loss at Nyquist frequency for two reference package lengths (both graphs). The right graph shows PCB link length in color from blue (0.5in) to red (12in).
The proposed ML analysis can be performed for complex systems with many system variables and complex output behavior, while bad, good, or excellent performance can be determined by the specification, an expert’s opinion, or relative system performance.
The analysis method can support systems with thousands of input variables and beyond, of all types (continuous, categorical, or ordered categorical), so there is no practical limitation on the size of the systems that we can explore. In contrast, other commonly used methods, such as Bayesian optimization, struggle to converge when there are tens or hundreds of input variables and do not provide any design insight, which is crucial in most applications.
[1] Z. Khasidashvili, A. J. Norman. Range Analysis and Applications to Root Causing. In: 6th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2019.
[2] A. Manukovsky, Y. Shlepnev, Z. Khasidashvili, E. Zalianski, Machine Learning Applications for COM Based Simulation of 112Gb Systems, DesignCon 2020.