Our telecom client found that growing network traffic was causing unbalanced utilization across system resources and a significant drop in performance across the network. To remediate this, they began replacing the aging F5 load balancers with new, more powerful models while adding security and DNS features. However, the upgrade had to be rolled back only a few hours after the initial implementation because the load balancers had lost connectivity to the real servers that make up the pools behind the virtual servers.
They engaged Kalles Group to help identify the source of the issue. This was difficult at first, since the load balancers had been on the network for weeks for configuration and testing, with only customer-facing services disabled. In addition, the lost connectivity seemed to occur under high load.
Kalles Group consultants were challenged to uncover the root cause(s) of the lost connectivity and implement a quick solution.
Both vendors (F5 and Cisco) were immediately engaged to discuss the issue. Kalles Group consultants worked closely with the vendors to verify that network connectivity best practices had been followed and to assist with troubleshooting the issue. However, a second attempt at the upgrade was unsuccessful.
Our client ran the old and new units in parallel to make rolling back easy if needed. The network side of the data center consisted of redundant Nexus 7000 chassis, and the client used port channels with the vPC (virtual port channel) feature for maximum redundancy and performance.
Given the complexity of this setup, Kalles Group consultants decided to move the load balancers to a single standard Cisco IOS switch. Although the issue could not be reproduced in the lab, the same symptoms continued to appear after the move, which prompted the hardware vendor to shift its focus and look deeper into the system instead of the environment.
Using this approach, multiple network traffic captures were taken on the load balancer and on the network. The captures were analyzed internally and then submitted to the vendors. The hardware vendor was ultimately able to identify that the TMM (Traffic Management Microkernel) was dropping packets internally, eventually causing the TMM to crash. Kalles Group's analysis, carried out in partnership with the hardware vendor, led to the development and release of a high-priority hotfix that directly addressed the root cause.
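To illustrate the kind of comparison that capture analysis like this involves, the following is a minimal sketch, not the tooling used on the engagement. It assumes each capture has already been reduced to a list of packet identifiers; the `PacketKey` fields, the function names, and the sample addresses are all hypothetical. Packets seen at one capture point but never at the other are candidates for having been dropped in between.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PacketKey(NamedTuple):
    """Hypothetical fields used to match one packet across two captures."""
    src: str
    dst: str
    ip_id: int


def missing_packets(ingress: Iterable[PacketKey],
                    egress: Iterable[PacketKey]) -> Counter:
    """Return packets seen at the ingress point more often than at egress.

    Counter subtraction keeps only positive counts, so the result holds
    exactly the packets that did not make it to the second capture point.
    """
    return Counter(ingress) - Counter(egress)


def drop_rate(ingress: list[PacketKey], egress: list[PacketKey]) -> float:
    """Fraction of ingress packets that never appeared at egress."""
    if not ingress:
        return 0.0
    return sum(missing_packets(ingress, egress).values()) / len(ingress)


if __name__ == "__main__":
    # Illustrative data only: three packets enter, two come out.
    seen_on_switch = [
        PacketKey("10.0.0.1", "10.0.1.5", 1),
        PacketKey("10.0.0.1", "10.0.1.5", 2),
        PacketKey("10.0.0.1", "10.0.1.5", 3),
    ]
    seen_on_server_side = [seen_on_switch[0], seen_on_switch[2]]
    for key, count in missing_packets(seen_on_switch, seen_on_server_side).items():
        print(f"missing {count}x: {key}")
    print(f"drop rate: {drop_rate(seen_on_switch, seen_on_server_side):.2%}")
```

A comparison of this shape only shows that packets went missing between two capture points. Attributing the loss to an internal component such as the TMM still required the vendor's own analysis.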
As a result of this work, network traffic is now processed efficiently and the load on network and system resources is spread evenly, which has returned performance to nominal levels and reduced the pressure on individual network components.