
In this article, Marius Sandbu, author of the book Implementing NetScaler VPX™ – Second Edition, explains that the purpose of NetScaler is to act as a logistics department: it serves content to different endpoints using different protocols across different types of media, and it can run either as a physical appliance or on top of a hypervisor within a private cloud infrastructure. Since many factors are at play here, there is room for tuning and improvement. Some of the topics we will go through in this article are as follows:

  • Tuning for virtual environments
  • Tuning TCP traffic


Tuning for virtual environments

When setting up a NetScaler in a virtual environment, we need to keep in mind that many factors influence how it will perform, for instance, the underlying CPU of the virtual host, NIC throughput and capabilities, vCPU overallocation, NIC teaming, MTU size, and so on. So it is always important to keep the hardware requirements in mind when setting up a NetScaler VPX on a virtualization host.

Another important factor to keep in mind when setting up a NetScaler VPX is the concept of packet engines. By default, when we set up or import a NetScaler, it comes with two vCPUs. The first of these is dedicated to management purposes, and the second vCPU is dedicated to all the packet processing, such as content switching, SSL offloading, ICA-proxy, and so on.

It is important to note that the second vCPU might appear 100% utilized in the hypervisor's performance monitoring tools; the correct way to check whether it is actually being utilized is the CLI command stat system.
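For example, from the NetScaler CLI we can run the commands below; stat system is the command referred to above, and on the builds we have worked with, stat system cpu breaks the utilization down per CPU (treat the exact output fields as version dependent):

    stat system
    stat system cpu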

Now, by default, VPX 10 and VPX 200 only support one packet engine. This is because, given their bandwidth limitations, they do not require more packet engine CPUs to process the packets. On the other hand, VPX 1000 and VPX 3000 support up to three packet engines. This is, in most cases, needed to process all the packets going through the system if the bandwidth is to be utilized to its fullest.

In order to add a new packet engine, we need to assign more vCPUs and more memory to the VPX. Packet engines also have the benefit of load balancing the processing between them, so instead of having a single vCPU that is 100% utilized, we can even out the load across multiple vCPUs and get better performance and bandwidth. The following table shows the different editions and their support for multiple packet engines:

License/Memory   2 GB   4 GB   6 GB   8 GB   10 GB   12 GB
VPX 10           1      1      1      1      1       1
VPX 200          1      1      1      1      1       1
VPX 1000         1      2      3      3      3       3
VPX 3000         1      2      3      3      3       3

It is important to remember that multiple packet engines are only available on VMware, XenServer, and Hyper-V, not on KVM.

If we plan on using NIC teaming on the underlying virtualization host, there are some important aspects to consider.

Most vendors have guidelines that describe the load balancing techniques available in their hypervisors.

For instance, Microsoft has a guide that describes their features. You can find the guide at http://www.microsoft.com/en-us/download/details.aspx?id=30160.

One of the NIC teaming options, called Switch Independent Dynamic Mode, has an interesting side effect: it replaces the source MAC address of the virtual machine with that of the primary NIC on the host, so we might experience packet loss on a VPX. Therefore, it is recommended in most cases that we use LACP/LAG or, in the case of Hyper-V, the Hyper-V Port distribution mode instead.

Note that features such as SR-IOV or PCI pass-through are not supported for NetScaler VPX.

NetScaler 11 also introduced support for jumbo frames on the VPX, which allows for a much higher payload per Ethernet frame. Instead of the traditional 1,500 bytes, we can scale up to 9,000 bytes of payload. This allows for much lower overhead, since each frame carries more data.

This requires that the underlying NIC on the hypervisor supports the feature and has it enabled as well, and in most cases it only works for communication with backend resources, not for users accessing public resources. This is because most routers and ISPs block such a high MTU.

This feature can be configured at the interface level in NetScaler, under System | Network | Interface; then select the interface and click on Edit. Here, we have an option called Maximum Transmission Unit, which can be adjusted up to 9,216 bytes.
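The same change can also be made from the CLI. As a rough sketch, assuming the interface ID is 1/1 (just an example) and that the -mtu parameter behaves the same on your firmware version:

    set interface 1/1 -mtu 9216
    show interface 1/1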

It is important to note that the NetScaler can communicate with backend resources using jumbo frames and then adjust the MTU when communicating back with clients. It can also use jumbo frames on both paths, in case the NetScaler is set up as a backend load balancer.

It is important to note that NetScaler only supports jumbo frame load balancing for the following protocols:

  • TCP
  • TCP-based protocols, such as HTTP
  • SIP
  • RADIUS

TCP tuning

Much of the traffic going through the NetScaler is based on the TCP protocol, whether it is ICA-proxy, HTTP, or something else.

TCP is a protocol that provides reliable, error-checked delivery of packets back and forth. This ensures that data is successfully transferred before being processed further. TCP has many features for adjusting bandwidth during a transfer, checking for congestion, adjusting segment sizes, and so on. We will delve a little into these features in this section.

We can adjust the way the NetScaler uses TCP with TCP profiles; by default, all services and vServers created on the NetScaler use the default TCP profile, nstcp_default_profile.

Note that these profiles can be found by navigating to System | Profiles | TCP Profiles. Make sure not to alter the default TCP profile without properly consulting the network team, as this affects the way in which TCP works for all default services on the NetScaler.

This default profile has most of the different TCP features turned off; this is to ensure compatibility with most infrastructures. The profile has not been adjusted much since it was first added to NetScaler. Citrix also ships a number of other profiles for different use cases, so we are going to look a bit closer at the options we have here.
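A quick way to compare these profiles is from the CLI; as a small sketch, we can list the built-in profiles and create our own profile to customize instead of touching nstcp_default_profile (the name mytcp_profile is just an example):

    show ns tcpProfile
    add ns tcpProfile mytcp_profile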

For instance, the profile nstcp_default_XA_XD_profile, which is intended for ICA-proxy traffic, differs from the default profile in the following ways:

  • Window Scaling
  • Selective Acknowledgement
  • Forward Acknowledgement
  • Use of Nagle's algorithm

Window Scaling is a TCP option that allows the receiving endpoint to accept more data than the regular TCP window size field allows before an acknowledgement is required. Without scaling, the window size is limited to 65,535 bytes; with window scaling enabled, the advertised window size is, in essence, bit-shifted to represent much larger values. This option needs to be enabled on both endpoints in order to be used, and it is only negotiated during the initial three-way handshake.

Selective Acknowledgement (SACK) is a TCP option that allows for better handling of TCP retransmission. Consider two hosts communicating without SACK, where one of the hosts briefly drops off the network and loses some packets; once it is back online, it keeps receiving newer packets from the other host. Without SACK, the first host can only acknowledge the last packet it received in sequence before it dropped out, so the other host has to resend everything after that point. With SACK enabled, the host reports both the last in-sequence packet it received before dropping out and the later packets it received after coming back online. This allows for faster recovery of the communication, since the other host does not need to resend all the packets.

Forward Acknowledgement (FACK) is a TCP option which works in conjunction with SACK and helps avoid TCP congestion by measuring the total number of data bytes that are outstanding in the network. Using the information from SACK, it can more precisely calculate how much data it can retransmit.

Nagle's algorithm is a TCP feature that tries to cope with the small-packet problem. Applications like Telnet often send each keystroke in its own packet, creating many packets that contain only 1 byte of data, which results in a 41-byte packet for a single keystroke. The algorithm works by combining a number of small outgoing messages into one, thus avoiding that overhead.

Since ICA is a protocol that operates with many small packets, which might create congestion, Nagle is enabled in this TCP profile. Also, since many users will be connecting over 3G or Wi-Fi, which can be unreliable, for instance when switching between channels, we need options that allow clients to re-establish a connection quickly, which is why SACK and FACK are used.

Note that Nagle might have a negative performance impact on applications that have their own buffering mechanism and operate inside a LAN.
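To illustrate how these options map to a profile, the following is a minimal sketch that enables window scaling, SACK, and Nagle on a custom profile; the profile name is hypothetical, and the parameter names (-WS, -SACK, -nagle) should be verified against your firmware version:

    add ns tcpProfile ica_tcp_profile -WS ENABLED -SACK ENABLED -nagle ENABLED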

If we take a look at another profile, such as nstcp_default_lan, we can see that FACK is disabled; this is because the resources needed to keep track of the amount of outstanding data in a high-speed network might be too costly.

Another important aspect of these profiles is the TCP congestion algorithm. For instance, nstcp_default_mobile uses the Westwood congestion algorithm; this is because it is much better at handling large bandwidth-delay paths, such as wireless networks.

The following congestion algorithms are available in NetScaler:

  • Default (Based on TCP Reno)
  • Westwood (Based on TCP Westwood+)
  • BIC
  • CUBIC
  • Nile (Based on TCP Illinois)

It is worth noting here that Westwood is aimed at 3G/4G connections or other slow wireless connections. BIC is aimed at high-bandwidth connections with high latency, such as WAN connections. CUBIC is almost like BIC, but not as aggressive when it comes to fast ramp-up and retransmissions. It is also worth noting that CUBIC has been the default TCP algorithm in the Linux kernel since version 2.6.19.

Nile is a new algorithm created by Citrix, based on TCP Illinois and targeted at high-speed, long-distance networks. It achieves higher throughput than standard TCP and is also compatible with standard TCP.

So, here we can choose the algorithm that is best suited for a service. For instance, if we have a vServer that serves content to mobile devices, we could use the nstcp_default_mobile TCP profile.
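The congestion algorithm is itself a setting on the TCP profile; on the builds we have seen, it is exposed as the -flavor parameter, so a hedged sketch for a wireless-facing profile (the profile name is hypothetical) would be:

    set ns tcpProfile mytcp_profile -flavor Westwood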

There are also some other parameters that are important to keep in mind while working with the TCP profile.

One of these parameters is Multipath TCP (MPTCP). This is a feature for endpoints that have multiple paths to a service, typically a mobile device with both WLAN and 3G capabilities, and it allows the device to communicate with a service on the NetScaler over both channels at the same time. This requires that the device supports communication over both paths and that the service or application on the device supports Multipath TCP.

So let's take an example of what a TCP profile might look like. Say we have a vServer on the NetScaler that serves an application to mobile devices, meaning that the most common way users access this service is over 3G or Wi-Fi. The web service has its own buffering mechanism, meaning that it tries not to send small packets over the link, and the application is Multipath TCP aware.

In this scenario, we can leverage the nstcp_default_mobile profile, since it has most of the defaults for a mobile scenario, but we also want Multipath TCP, so we create a new profile from it, enable MPTCP, and bind the new profile to the vServer; a CLI sketch of this follows.
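As a hedged CLI sketch, assuming a load balancing vServer named mobile_vs already exists, and using a hypothetical profile that mirrors some of the mobile defaults while adding Multipath TCP (verify the -mptcp and -tcpProfileName parameter names against your firmware version):

    add ns tcpProfile mobile_mptcp_profile -WS ENABLED -SACK ENABLED -flavor Westwood -mptcp ENABLED
    set lb vserver mobile_vs -tcpProfileName mobile_mptcp_profile

The same binding can equally be done in the GUI, as described next.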

In order to bind a TCP profile to a vServer, we go to the particular vServer | Edit | Profiles | TCP Profiles.

Note that AOL did a presentation on their own TCP customization on NetScaler; the presentation can be found at http://www.slideshare.net/masonke/net-scaler-tcpperformancetuningintheaolnetwork. It is also important to note that TCP tuning should always be done in cooperation with the network team.

Summary

In this article, we learned about tuning for virtual environments and TCP tuning.
