With today’s new technologies, rigid security requirements, and the constant emphasis on doing more with less, IT systems have become highly complex and unpredictable.  To complicate matters further, IT services now adapt to business requirements on demand, including automated and dynamic resource allocation.  Making our systems more efficient and better performing can be a very challenging task.

LTG Federal’s approach to this problem is to develop simulation models to study the design and operation of these systems, since many complex engineering, operations, and business systems can be modeled as discrete-event systems.  Once a model is verified and validated against production metrics, it can be leveraged to evaluate multiple alternatives and identify feasible solutions.

Written by Chris Rogan

What Is Discrete Event Simulation?

A discrete system is one in which events occur at separate points in time and the state variables change instantaneously.  An event may change the state of the system.  Discrete-event simulation allows us to observe the system’s behavior over time, not just its end state, and to gain insight into the relationships among its various components.

A few examples of discrete systems are a call center, a car wash station, or a banking system.  A car wash station has two events: car arrival and car departure.  At time T0, a car arrival causes the number of cars waiting (a state variable) to increase by 1.  At time T1, a car departure causes the number of cars waiting to decrease by 1.  Another state variable of this system is the bay status, which changes between busy and idle as cars arrive and depart.
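The two-event car wash above can be sketched as a minimal event loop.  This is an illustrative Python sketch, not ExtendSim; the event engine is a standard future-event list, and the arrival and wash times are made-up values:

```python
import heapq
import random

def car_wash_simulation(sim_time=100.0, mean_interarrival=4.0, wash_time=3.0, seed=1):
    """Minimal discrete-event loop for a single-bay car wash.

    State variables: number of cars waiting and bay status (busy/idle).
    Events: 'arrival' and 'departure', processed in time order.
    """
    random.seed(seed)
    waiting = 0            # state variable: queue length
    bay_busy = False       # state variable: bay status
    events = []            # future-event list, ordered by event time
    heapq.heappush(events, (random.expovariate(1.0 / mean_interarrival), "arrival"))
    max_waiting = 0

    while events:
        clock, kind = heapq.heappop(events)
        if clock > sim_time:
            break
        if kind == "arrival":
            # schedule the next arrival
            heapq.heappush(events,
                           (clock + random.expovariate(1.0 / mean_interarrival), "arrival"))
            if bay_busy:
                waiting += 1          # car joins the queue
            else:
                bay_busy = True       # car enters the bay immediately
                heapq.heappush(events, (clock + wash_time, "departure"))
        else:  # departure
            if waiting > 0:
                waiting -= 1          # next car in the queue enters the bay
                heapq.heappush(events, (clock + wash_time, "departure"))
            else:
                bay_busy = False      # bay goes idle
        max_waiting = max(max_waiting, waiting)
    return max_waiting

print(car_wash_simulation())
```

Because the state only changes at arrival and departure events, the clock jumps from event to event rather than advancing in fixed increments.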

Model Formulation

At LTG Federal, we follow these steps when performing a simulation study:

1 – Study the system and formulate the problem

There are various approaches to building a model.  It is important that we understand the system of interest, because we cannot improve what we do not understand.  Understanding the objectives, the questions to be answered, and the performance measures also allows us to build a model that is detailed enough to predict the system’s behavior and performance, yet not so detailed that it becomes unmanageable and irrelevant to the study.

2 – Collect and analyze data

For an IT system, we need performance- and capacity-related data in four categories: Workload, Infrastructure, Flow, and Resource.

Workload – what work is being requested and its arrival rate

Infrastructure – which components will be used to do this work

Flow – the sequence of steps the work follows through the system

Resource – what resources will be consumed as the work is accomplished

Data is gathered from field instrumentation, SMEs, or log files.  When analyzing performance data, we focus on obtaining a distribution function, in addition to the mean value, by fitting a best-fit curve through the histogram of the collected data.  An example of a distribution function derived from sample data is shown below:

By implementing distribution functions instead of constants for operation times and resource consumption, we can run Monte Carlo simulations.  In the above example, the processing time of Service1 is implemented as a log-logistic distribution.
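As a sketch of how such a fitted distribution is then used in a model, the log-logistic can be sampled by inverting its CDF; the scale and shape values below are illustrative, not the actual fitted Service1 parameters:

```python
import random

def sample_log_logistic(scale, shape, rng=random):
    """Draw one log-logistic (Fisk) variate by inverting the CDF:
    F(x) = 1 / (1 + (x/scale)**(-shape)), so x = scale * (u/(1-u))**(1/shape)."""
    u = rng.random()
    return scale * (u / (1.0 - u)) ** (1.0 / shape)

random.seed(42)
# Draw illustrative processing times for a service (scale and shape are made-up)
times = [sample_log_logistic(scale=2.0, shape=3.5) for _ in range(5)]
print([round(t, 2) for t in times])
```

A useful sanity check: at u = 0.5 the inverse CDF returns exactly the scale parameter, which is the median of the distribution.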

3 – Develop the model

We selected ExtendSim as our tool for developing discrete-event simulation models because of its ease of use, flexibility, and scalability from the simplest to the most complex systems.

There are many factors to consider when developing a model: objectives, performance measures, data availability, and time constraints.

We are also mindful of developing reusable blocks in the model to improve its efficiency and scalability.

4 – Verify and validate the model

It is critical to perform a structured walk-through of the model with SMEs and others who are familiar with the system.  This is the time to confirm the logic and assumptions and to ensure the model is implemented correctly.  If anything is invalid, we update the model with the correct information and repeat the process.  Once verified, the model must be validated against production metrics captured through instrumentation and logs where applicable.
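A validation check of this kind can be sketched as a simple tolerance comparison between simulated output and production measurements; the response-time samples and the 10% tolerance below are hypothetical:

```python
import statistics

def within_tolerance(simulated, observed, rel_tol=0.10):
    """Validation check: does the simulated mean of a metric fall within
    a relative tolerance of the value measured in production?"""
    sim_mean = statistics.mean(simulated)
    obs_mean = statistics.mean(observed)
    return abs(sim_mean - obs_mean) / obs_mean <= rel_tol

# Hypothetical response-time samples (seconds): model output vs. production logs
sim_rt = [1.8, 2.1, 2.0, 1.9, 2.2]
prod_rt = [2.0, 2.1, 1.9, 2.0, 2.0]
print(within_tolerance(sim_rt, prod_rt))  # → True (the two means match here)
```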

5 – Execute Simulation and Analyze Output Data

Once it is confirmed that the model is a valid abstract representation of the system, we can execute various scenarios to explore modifications to the system and its operations.  We then compare the alternatives and identify options for improvement.

Simulation Example

Model Topology

We developed a sample discrete-event simulation model using the ExtendSim tool and recorded the simulation run to illustrate the capabilities of a simulation model in predicting system behavior and identifying system bottlenecks.  We also demonstrate how to use the model to explore alternative system configurations and options for removing a bottleneck.  There are two videos: the model topology and the simulation results.

The left section of the video is the model topology, which contains three main sections:

  • User interface section (Setting): allows the user to set parameters for the model. In this sample, the user can turn queue polling on or off and specify the queue polling time
  • Main section (middle section): the model workflow, or architecture flow, of the system
  • Utility section: provides interaction between the main section, the user interface, and the Excel input file

The model is developed in multiple layers, with details embedded within hierarchical blocks.  The model workflow begins at the Workloads block, where transactions are created with inter-arrival times implemented as an exponential distribution.  Transactions first go through the Enterprise Service Bus and then wait at the Scheduled queue.  When the Services block becomes available, it dequeues transactions from the queue, processes them, and sends them to the Local Service Bus.  The system interacts with vendor applications through the Local Service Bus, where transactions are routed to the proper vendor queues based on the required operations.  Once completed, the vendors send responses back through the Local Service Bus, and eventually the Enterprise Service Bus, where stakeholders retrieve them.

The user can double-click on a block to view the detailed logic of that block.  The right section of the video shows the detailed logic of the Services block, which sends requests to three other services, Service1, Service2, and Service3, sequentially.  A transaction cannot proceed until all three requested services are completed.
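A small sketch of why this sequencing matters: because the three requests are issued one after another, the transaction’s service phase is the sum of the three service times, whereas issuing them in parallel would cost only the slowest of the three.  The mean service times below are assumptions, not values from the sample model:

```python
import random

random.seed(7)

def service_time(mean):
    """One exponentially distributed service time with the given mean."""
    return random.expovariate(1.0 / mean)

# Sequential calls (as in the sample model): the transaction holds until all
# three complete, so its service phase is the SUM of the three times.
t1, t2, t3 = service_time(1.0), service_time(1.5), service_time(0.5)
sequential_total = t1 + t2 + t3

# For comparison, if the three requests could be issued in parallel, the
# transaction would wait only for the slowest one (the MAX).
parallel_total = max(t1, t2, t3)

print(round(sequential_total, 2), round(parallel_total, 2))
```

Exploring a sequential-to-parallel change like this is exactly the kind of alternative a validated model can evaluate cheaply.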

The Service Time block in the middle of the screen is where resources, CPU time or processing time, are consumed by the Services block.  As mentioned above, these parameters are implemented as distribution functions for Monte Carlo simulation.  Monte Carlo analysis allows us to evaluate the system over a range of inputs and determine the statistical probability and confidence interval of the mean.
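A Monte Carlo run of this kind can be sketched as repeated replications whose per-run means yield a confidence interval; the distribution parameters and replication counts below are illustrative:

```python
import random
import statistics

def replicate(n_reps=30, n_transactions=200, seed=0):
    """Monte Carlo replications: each run draws service times from a
    log-logistic distribution (via inverse CDF) and records the run's mean."""
    means = []
    for rep in range(n_reps):
        rng = random.Random(seed + rep)   # independent stream per replication
        # scale=2.0, shape=3.5 are made-up parameters for illustration
        times = [2.0 * (u / (1 - u)) ** (1 / 3.5)
                 for u in (rng.random() for _ in range(n_transactions))]
        means.append(statistics.mean(times))
    return means

means = replicate()
m = statistics.mean(means)
# Approximate 95% confidence interval for the mean across replications
half = 1.96 * statistics.stdev(means) / len(means) ** 0.5
print(f"mean ~ {m:.2f}, 95% CI ~ [{m - half:.2f}, {m + half:.2f}]")
```

Running many independent replications, rather than one long run, is what lets us attach a confidence interval to the estimated mean.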

Simulation Results:

At the beginning of the simulation, we let the system reach its steady state, then increase the workload (a control parameter) by 50%.  Several state variables change at this point:

  • Queue1 length increases
  • CPU utilization of Pod1, Pod2, and Pod3 increases
  • The number of transactions in the system increases
  • Response times increase

All these changes are expected.  The important point we are demonstrating here is the interaction between components and the system’s behavior over time.  For a complex system, we may not be able to predict such impacts without a simulation model of the system.

Next, we want to remove the bottleneck so that the system can handle the increased workload.  From our understanding of the system, Pod1 directly pulls transactions from Queue1, so it is reasonable to update Pod1’s configuration to handle more transactions at a time (concurrent threads) or to reduce its processing time.  Given that Pod1’s utilization is about 80%, we decided to allocate additional CPU to Pod1.  As expected, the system eventually works off the queue and Pod1’s utilization decreases.  However, on closer inspection, this might not be a good solution, as Pod2’s utilization appears to keep increasing.  In other words, we might have just moved the bottleneck downstream.
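The bottleneck shift can be illustrated with simple utilization arithmetic, rho = lambda / (c * mu); the arrival and service rates below are made-up values, chosen so that Pod1 starts near 80% utilization as in the demo:

```python
def utilization(arrival_rate, service_rate, servers):
    """Utilization of a stage: rho = lambda / (c * mu)."""
    return arrival_rate / (servers * service_rate)

lam = 8.0  # transactions/sec after the workload increase (illustrative)
# Hypothetical serial stages: name -> (service rate per server, server count)
pods = {"Pod1": (5.0, 2), "Pod2": (5.5, 2), "Pod3": (6.0, 2)}

before = {p: utilization(lam, mu, c) for p, (mu, c) in pods.items()}
print("before:", {p: round(r, 2) for p, r in before.items()})  # Pod1 is busiest at 0.8

# Add capacity to Pod1 (a third server): Pod1's utilization drops, but every
# transaction still flows through Pod2, which now has the highest utilization.
pods["Pod1"] = (5.0, 3)
after = {p: utilization(lam, mu, c) for p, (mu, c) in pods.items()}
print("after: ", {p: round(r, 2) for p, r in after.items()})
```

The arithmetic makes the demo’s observation concrete: adding capacity to one stage of a serial flow does not raise system throughput beyond the next-busiest stage.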

The goal of our demo is to illustrate how we can leverage a simulation model to assess system performance, predict the impacts of changes in workload, and estimate the capacity required to support system requirements.  This model can also be used to evaluate other alternatives.