There is no question that the rendering system of modern graphics devices is complicated. Even rendering a single triangle to the screen engages many of these components, since GPUs are designed for large amounts of parallelism, as opposed to CPUs, which are designed to handle virtually any computational scenario. Modern graphics rendering is a high-speed dance of processing and memory management that spans software, hardware, multiple memory spaces, multiple languages, multiple processors, multiple processor types, and a large number of special-case features that can be thrown into the mix.

To make matters worse, every graphics situation we will come across is different in its own way. Running the same application against a different device, even by the same manufacturer, often results in an apples-versus-oranges comparison due to the different capabilities and functionality they provide. It can be difficult to determine where a bottleneck resides within such a complex chain of devices and systems, and it can take a lifetime of industry work in 3D graphics to have a strong intuition about the source of performance issues in modern graphics systems.

Thankfully, Profiling comes to the rescue once again. If we can gather data about each component, use multiple performance metrics for comparison, and tweak our Scenes to see how different graphics features affect their behavior, then we should have sufficient evidence to find the root cause of the issue and make appropriate changes. So in this article, you will learn how to gather the right data, dig just deep enough into the graphics system to find the true source of the problem, and explore various solutions to work around a given problem.

There are many more topics to cover when it comes to improving rendering performance, so in this article we will begin with some general techniques on how to determine whether our rendering is limited by the CPU or by the GPU, and what we can do about either case. We will discuss optimization techniques such as Occlusion Culling and Level of Detail (LOD) and provide some useful advice on Shader optimization, as well as large-scale rendering features such as lighting and shadows. Finally, since mobile devices are a common target for Unity projects, we will also cover some techniques that may help improve performance on limited hardware.

(For more resources related to this topic, see here.)

Profiling rendering issues

Poor rendering performance can manifest itself in a number of ways, depending on whether the device is CPU-bound, or GPU-bound; in the latter case, the root cause could originate from a number of places within the graphics pipeline. This can make the investigatory stage rather involved, but once the source of the bottleneck is discovered and the problem is resolved, we can expect significant improvements as small fixes tend to reap big rewards when it comes to the rendering subsystem.

The CPU sends rendering instructions through the graphics API, that funnel through the hardware driver to the GPU device, which results in commands entering the GPU's Command Buffer. These commands are processed by the massively parallel GPU system one by one until the buffer is empty. But there are a lot more nuances involved in this process.

The following shows a (greatly simplified) diagram of a typical GPU pipeline (which can vary based on technology and various optimizations), and the broad rendering steps that take place during each stage:

dynamic-graphics-img-0

The top row represents the work that takes place on the CPU, the act of calling into the graphics API, through the hardware driver, and pushing commands into the GPU. Ergo, a CPU-bound application will be primarily limited by the complexity, or sheer number, of graphics API calls.

Meanwhile, a GPU-bound application will be limited by the GPU's ability to process those calls, and empty the Command Buffer in a reasonable timeframe to allow for the intended frame rate. This is represented in the next two rows, showing the steps taking place in the GPU. But, because of the device's complexity, they are often simplified into two different sections: the front end and the back end.

The front end refers to the part of the rendering process where the GPU has received mesh data, a draw call has been issued, and all of the information that was fed into the GPU is used to transform vertices and run through Vertex Shaders. Finally, the rasterizer generates a batch of fragments to be processed in the back end. The back end refers to the remainder of the GPU's processing stages, where fragments have been generated, and now they must be tested, manipulated, and drawn via Fragment Shaders onto the frame buffer in the form of pixels.

Note that "Fragment Shader" is the more technically accurate term for Pixel Shaders. Fragments are generated by the rasterization stage, and only technically become pixels once they've been processed by the Shader and drawn to the Frame Buffer.

There are a number of different approaches we can use to determine where the root cause of a graphics rendering issue lies:

Profiling the GPU with the Profiler
Examining individual frames with the Frame Debugger
Brute Force Culling

GPU profiling

Because graphics rendering involves both the CPU and GPU, we must examine the problem using both the CPU Usage and GPU Usage areas of the Profiler as this can tell us which component is working hardest.

For example, the following screenshot shows the Profiler data for a CPU-bound application. The test involved creating thousands of simple objects, with no batching techniques taking place. This resulted in an extremely large Draw Call count (around 15,000) for the CPU to process, but giving the GPU relatively little work to do due to the simplicity of the objects being rendered:

dynamic-graphics-img-1

This example shows that the CPU's "rendering" task is consuming a large amount of cycles (around 30 ms per frame), while the GPU is only processing for less than 16 ms, indicating that the bottleneck resides in the CPU.

Meanwhile, Profiling a GPU-bound application via the Profiler is a little trickier. This time, the test involves creating a small number of high polycount objects (for a low Draw Call per vertex ratio), with dozens of real-time point lights and an excessively complex Shader with a texture, normal texture, heightmap, emission map, occlusion map, and so on, (for a high workload per pixel ratio).

The following screenshot shows Profiler data for the example Scene when it is run in a standalone application:

dynamic-graphics-img-2

As we can see, the rendering task of the CPU Usage area matches closely with the total rendering costs of the GPU Usage area. We can also see that the CPU and GPU time costs at the bottom of the image are relatively similar (41.48 ms versus 38.95 ms). This is very unintuitive as we would expect the GPU to be working much harder than the CPU.

Be aware that the CPU/GPU millisecond cost values are not calculated or revealed unless the appropriate Usage Area has been added to the Profiler window.

However, let's see what happens when we test the same exact Scene through the Editor:

dynamic-graphics-img-3

This is a better representation of what we would expect to see in a GPU-bound application. We can see how the CPU and GPU time costs at the bottom are closer to what we would expect to see (2.74 ms vs 64.82 ms).

However, this data is highly polluted. The spikes in the CPU and GPU Usage areas are the result of the Profiler Window UI updating during testing, and the overhead cost of running through the Editor is also artificially increasing the total GPU time cost.

It is unclear what causes the data to be treated this way, and this could certainly change in the future if enhancements are made to the Profiler in future versions of Unity, but it is useful to know this drawback.

Trying to determine whether our application is truly GPU-bound is perhaps the only good excuse to perform a Profiler test through the Editor.

The Frame Debugger

A new feature in Unity 5 is the Frame Debugger, a debugging tool that can reveal how the Scene is rendered and pieced together, one Draw Call at a time. We can click through the list of Draw Calls and observe how the Scene is rendered up to that point in time. It also provides a lot of useful details for the selected Draw Call, such as the current render target (for example, the shadow map, the camera depth texture, the main camera, or other custom render targets), what the Draw Call did (drawing a mesh, drawing a static batch, drawing depth shadows, and so on), and what settings were used (texture data, vertex colors, baked lightmaps, directional lighting, and so on).

The following screenshot shows a Scene that is only being partially rendered due to the currently selected Draw Call within the Frame Debugger. Note the shadows that are visible from baked lightmaps that were rendered during an earlier pass before the object itself is rendered:

dynamic-graphics-img-4

If we are bound by Draw Calls, then this tool can be effective in helping us figure out what the Draw Calls are being spent on, and determine whether there are any unnecessary Draw Calls that are not having an effect on the scene. This can help us come up with ways to reduce them, such as removing unnecessary objects or batching them somehow. We can also use this tool to observe how many additional Draw Calls are consumed by rendering features, such as shadows, transparent objects, and many more. This could help us, when we're creating multiple quality levels for our game, to decide what features to enable/disable under the low, medium, and high quality settings.

Brute force testing

If we're poring over our Profiling data, and we're still not sure we can determine the source of the problem, we can always try the brute force method: cull a specific activity from the Scene and see if it results in greatly increased performance. If a small change results in a big speed improvement, then we have a strong clue about where the bottleneck lies. There's no harm in this approach if we eliminate enough unknown variables to be sure the data is leading us in the right direction.

We will cover different ways to brute force test a particular issue in each of the upcoming sections.

CPU-bound

If our application is CPU-bound, then we will observe a generally poor FPS value within the CPU Usage area of the Profiler window due to the rendering task. However, if VSync is enabled the data will often get muddied up with large spikes representing pauses as the CPU waits for the screen refresh rate to come around before pushing the current frame buffer. So, we should make sure to disable the VSync block in the CPU Usage area before deciding the CPU is the problem.

dynamic-graphics-img-5

Brute-forcing a test for CPU-bounding can be achieved by reducing Draw Calls. This is a little unintuitive since, presumably, we've already been reducing our Draw Calls to a minimum through techniques such as Static and Dynamic Batching, Atlasing, and so forth. This would mean we have very limited scope for reducing them further.

What we can do, however, is disable the Draw-Call-saving features such as batching and observe if the situation gets significantly worse than it already is. If so, then we have evidence that we're either already, or very close to being, CPU-bound. At this point, we should see whether we can re-enable these features and disable rendering for a few choice objects (preferably those with low complexity to reduce Draw Calls without over-simplifying the rendering of our scene). If this results in a significant performance improvement then, unless we can find further opportunities for batching and mesh combining, we may be faced with the unfortunate option of removing objects from our scene as the only means of becoming performant again.

There are some additional opportunities for Draw Call reduction, including Occlusion Culling, tweaking our Lighting and Shadowing, and modifying our Shaders. These will be explained in the following sections.

However, Unity's rendering system can be multithreaded, depending on the targeted platform, which version of Unity we're running, and various settings, and this can affect how the graphics subsystem is being bottlenecked by the CPU, and slightly changes the definition of what being CPU-bound means.

Multithreaded rendering

Multithreaded rendering was first introduced in Unity v3.5 in February 2012, and enabled by default on multicore systems that could handle the workload; at the time, this was only PC, Mac, and Xbox 360. Gradually, more devices were added to this list, and since Unity v5.0, all major platforms now enable multithreaded rendering by default (and possibly some builds of Unity 4).

Mobile devices were also starting to feature more powerful CPUs that could support this feature. Android multithreaded rendering (introduced in Unity v4.3) can be enabled through a checkbox under Platform Settings | Other Settings | Multithreaded Rendering. Multithreaded rendering on iOS can be enabled by configuring the application to make use of the Apple Metal API (introduced in Unity v4.6.3), under Player Settings | Other Settings | Graphics API.

When multithreaded rendering is enabled, tasks that must go through the rendering API (OpenGL, DirectX, or Metal), are handed over from the main thread to a "worker thread". The worker thread's purpose is to undertake the heavy workload that it takes to push rendering commands through the graphics API and driver, to get the rendering instructions into the GPU's Command Buffer. This can save an enormous number of CPU cycles for the main thread, where the overwhelming majority of other CPU tasks take place. This means that we free up extra cycles for the majority of the engine to process physics, script code, and so on.

Incidentally, the mechanism by which the main thread notifies the worker thread of tasks operates in a very similar way to the Command Buffer that exists on the GPU, except that the commands are much more high-level, with instructions like "render this object, with this Material, using this Shader", or "draw N instances of this piece of procedural geometry", and so on. This feature has been exposed in Unity 5 to allow developers to take direct control of the rendering subsystem from C# code. This customization is not as powerful as having direct API access, but it is a step in the right direction for Unity developers to implement unique graphical effects.

Confusingly, the Unity API name for this feature is called "CommandBuffer", so be sure not to confuse it with the GPU's Command Buffer.

Check the Unity documentation on CommandBuffer to make use of this feature: http://docs.unity3d.com/ScriptReference/Rendering.CommandBuffer.html.

Getting back to the task at hand, when we discuss the topic of being CPU-bound in graphics rendering, we need to keep in mind whether or not the multithreaded renderer is being used, since the actual root cause of the problem will be slightly different depending on whether this feature is enabled or not.

In single-threaded rendering, where all graphics API calls are handled by the main thread, and in an ideal world where both components are running at maximum capacity, our application would become bottlenecked on the CPU when 50 percent or more of the time per frame is spent handling graphics API calls. However, resolving these bottlenecks can be accomplished by freeing up work from the main thread. For example, we might find that greatly reducing the amount of work taking place in our AI subsystem will improve our rendering significantly because we've freed up more CPU cycles to handle the graphics API calls.

But, when multithreaded rendering is taking place, this task is pushed onto the worker thread, which means the same thread isn't being asked to manage both engine work and graphics API calls at the same time. These processes are mostly independent, and even though additional work must still take place in the main thread to send instructions to the worker thread in the first place (via the internal CommandBuffer system), it is mostly negligible. This means that reducing the workload in the main thread will have little-to-no effect on rendering performance.

Note that being GPU-bound is the same regardless of whether multithreaded rendering is taking place.

GPU Skinning

While we're on the subject of CPU-bounding, one task that can help reduce CPU workload, at the expense of additional GPU workload, is GPU Skinning. Skinning is the process where mesh vertices are transformed based on the current location of their animated bones. The animation system, working on the CPU, only transforms the bones, but another step in the rendering process must take care of the vertex transformations to place the vertices around those bones, performing a weighted average over the bones connected to those vertices.

This vertex processing task can either take place on the CPU or within the front end of the GPU, depending on whether the GPU Skinning option is enabled. This feature can be toggled under Edit | Project Settings | Player Settings | Other Settings | GPU Skinning.

Front end bottlenecks

It is not uncommon to use a mesh that contains a lot of unnecessary UV and Normal vector data, so our meshes should be double-checked for this kind of superfluous fluff. We should also let Unity optimize the structure for us, which minimizes cache misses as vertex data is read within the front end.

We will also learn some useful Shader optimization techniques shortly, when we begin to discuss back end optimizations, since many optimization techniques apply to both Fragment and Vertex Shaders.

The only attack vector left to cover is finding ways to reduce actual vertex counts. The obvious solutions are simplification and culling; either have the art team replace problematic meshes with lower polycount versions, and/or remove some objects from the scene to reduce the overall polygon count. If these approaches have already been explored, then the last approach we can take is to find some kind of middle ground between the two.

Level Of Detail

Since it can be difficult to tell the difference between a high quality distance object and a low quality one, there is very little reason to render the high quality version. So, why not dynamically replace distant objects with something more simplified?

Level Of Detail (LOD), is a broad term referring to the dynamic replacement of features based on their distance or form factor relative to the camera. The most common implementation is mesh-based LOD: dynamically replacing a mesh with lower and lower detailed versions as the camera gets farther and farther away. Another example might be replacing animated characters with versions featuring fewer bones, or less sampling for distant objects, in order to reduce animation workload.

The built-in LOD feature is available in the Unity 4 Pro Edition and all editions of Unity 5. However, it is entirely possible to implement it via Script code in Unity 4 Free Edition if desired.

Making use of LOD can be achieved by placing multiple objects in the Scene and making them children of a GameObject with an attached LODGroup component. The LODGroup's purpose is to generate a bounding box from these objects, and decide which object should be rendered based on the size of the bounding box within the camera's field of view. If the object's bounding box consumes a large area of the current view, then it will enable the mesh(es) assigned to lower LOD groups, and if the bounding box is very small, it will replace the mesh(es) with those from higher LOD groups. If the mesh is too far away, it can be configured to hide all child objects. So, with the proper setup, we can have Unity replace meshes with simpler alternatives, or cull them entirely, which eases the burden on the rendering process.

Check the Unity documentation for more detailed information on the LOD feature: http://docs.unity3d.com/Manual/LevelOfDetail.html.

This feature can cost us a large amount of development time to fully implement; artists must generate lower polygon count versions of the same object, and level designers must generate LOD groups, configure them, and test them to ensure they don't cause jarring transitions as the camera moves closer or farther away. It also costs us in memory and runtime CPU; the alternative meshes need to be kept in memory, and the LODGroup component must routinely test whether the camera has moved to a new position that warrants a change in LOD level.

In this era of graphics card capabilities, vertex processing is often the least of our concerns. Combined with the additional sacrifices needed for LOD to function, developers should avoid preoptimizing by automatically assuming LOD will help them. Excessive use of the feature will lead to burdening other parts of our application's performance, and chew up precious development time, all for the sake of paranoia. If it hasn't been proven to be a problem, then it's probably not a problem!

Scenes that feature large, expansive views of the world, and lots of camera movement, should consider implementing this technique very early, as the added distance and massive number of visible objects will exacerbate the vertex count enormously. Scenes that are always indoors, or feature a camera with a viewpoint looking down at the world (real-time strategy and MOBA games, for example) should probably steer clear of implementing LOD from the beginning. Games somewhere between the two should avoid it until necessary. It all depends on how many vertices are expected to be visible at any given time and how much variability in camera distance there will be.

Note that some game development middleware companies offer third-party tools for automated LOD mesh generation. These might be worth investigating to compare their ease of use versus quality loss versus cost effectiveness.

Disable GPU Skinning

As previously mentioned, we could enable GPU Skinning to reduce the burden on a CPU-bound application, but enabling this feature will push the same workload into the front end of the GPU. Since Skinning is one of those "embarrassingly parallel" processes that fits well with the GPU's parallel architecture, it is often a good idea to perform the task on the GPU. But this task can chew up precious time in the front end preparing the vertices for fragment generation, so disabling it is another option we can explore if we're bottlenecked in this area. Again, this feature can be toggled under Edit | Project Settings | Player Settings | Other Settings | GPU Skinning.

GPU Skinning is available in Unity 4 Pro Edition, and all editions of Unity 5.

Reduce tessellation

There is one last task that takes place in the front end process and that we need to consider: tessellation. Tessellation through Geometry Shaders can be a lot of fun, as it is a relatively underused technique that can really make our graphical effects stand out from the crowd of games that only use the most common effects. But, it can contribute enormously to the amount of processing work taking place in the front end.

There are no simple tricks we can exploit to improve tessellation, besides improving our tessellation algorithms, or easing the burden caused by other front end tasks to give our tessellation tasks more room to breathe. Either way, if we have a bottleneck in the front end and are making use of tessellation techniques, we should double-check that they are not consuming the lion's share of the front end's budget.

Back end bottlenecks

The back end is the more interesting part of the GPU pipeline, as many more graphical effects take place during this stage. Consequently, it is the stage that is significantly more likely to suffer from bottlenecks.

There are two brute force tests we can attempt:

Reduce resolution
Reduce texture quality

These changes will ease the workload during two important stages at the back end of the pipeline: fill rate and memory bandwidth, respectively. Fill rate tends to be the most common source of bottlenecks in the modern era of graphics rendering, so we will cover it first.

Fill rate

By reducing screen resolution, we have asked the rasterization system to generate significantly fewer fragments and transpose them over a smaller canvas of pixels. This will reduce the fill rate consumption of the application, giving a key part of the rendering pipeline some additional breathing room. Ergo, if performance suddenly improves with a screen resolution reduction, then fill rate should be our primary concern.

Fill rate is a very broad term referring to the speed at which the GPU can draw fragments. But, this only includes fragments that have survived all of the various conditional tests we might have enabled within the given Shader. A fragment is merely a "potential pixel," and if it fails any of the enabled tests, then it is immediately discarded. This can be an enormous performance-saver as the pipeline can skip the costly drawing step and begin work on the next fragment instead.

One such example is Z-testing, which checks whether the fragment from a closer object has already been drawn to the same pixel already. If so, then the current fragment is discarded. If not, then the fragment is pushed through the Fragment Shader and drawn over the target pixel, which consumes exactly one draw from our fill rate. Now imagine multiplying this process by thousands of overlapping objects, each generating hundreds or thousands of possible fragments, for high screen resolutions causing millions, or billions, of fragments to be generated each and every frame. It should be fairly obvious that skipping as many of these draws as we can will result in big rendering cost savings.

Graphics card manufacturers typically advertise a particular fill rate as a feature of the card, usually in the form of gigapixels per second, but this is a bit of a misnomer, as it would be more accurate to call it gigafragments per second; however this argument is mostly academic. Either way, larger values tell us that the device can potentially push more fragments through the pipeline, so with a budget of 30 GPix/s and a target frame rate of 60 Hz, we can afford to process 30,000,000,000/60 = 500 million fragments per frame before being bottlenecked on fill rate. With a resolution of 2560x1440, and a best-case scenario where each pixel is only drawn over once, then we could theoretically draw the entire scene about 125 times without any noticeable problems.

Sadly, this is not a perfect world, and unless we take significant steps to avoid it, we will always end up with some amount of redraw over the same pixels due to the order in which objects are rendered. This is known as overdraw, and it can be very costly if we're not careful.

The reason that resolution is a good attack vector to check for fill rate bounding is that it is a multiplier. A reduction from a resolution of 2560x1440 to 800x600 is an improvement factor of about eight, which could reduce fill rate costs enough to make the application perform well again.

Overdraw

Determining how much overdraw we have can be represented visually by rendering all objects with additive alpha blending and a very transparent flat color. Areas of high overdraw will show up more brightly as the same pixel is drawn over with additive blending multiple times. This is precisely how the Scene view's Overdraw shading mode reveals how much overdraw our scene is suffering.

The following screenshot shows a scene with several thousand boxes drawn normally, and drawn using the Scene view's Overdraw shading mode:

dynamic-graphics-img-6

At the end of the day, fill rate is provided as a means of gauging the best-case behavior. In other words, it's primarily a marketing term and mostly theoretical. But, the technical side of the industry has adopted the term as a way of describing the back end of the pipeline: the stage where fragment data is funneled through our Shaders and drawn to the screen.

If every fragment required an absolute minimum level of processing (such as a Shader that returned a constant color), then we might get close to that theoretical maximum. The GPU is a complex beast, however, and things are never so simple. The nature of the device means it works best when given many small tasks to perform. But, if the tasks get too large, then fill rate is lost due to the back end not being able to push through enough fragments in time and the rest of the pipeline is left waiting for tasks to do.

There are several more features that can potentially consume our theoretical fill rate maximum, including but not limited to alpha testing, alpha blending, texture sampling, the amount of fragment data being pulled through our Shaders, and even the color format of the target render texture (the final Frame Buffer in most cases). The bad news is that this gives us a lot of subsections to cover, and a lot of ways to break the process, but the good news is it gives us a lot of avenues to explore to improve our fill rate usage.

Occlusion Culling

One of the best ways to reduce overdraw is to make use of Unity's Occlusion Culling system. The system works by partitioning Scene space into a series of cells and flying through the world with a virtual camera making note of which cells are invisible from other cells (are occluded) based on the size and position of the objects present.

Note that this is different to the technique of Frustum Culling, which culls objects not visible from the current camera view. This feature is always active in all versions, and objects culled by this process are automatically ignored by the Occlusion Culling system.

Occlusion Culling is available in the Unity 4 Pro Edition and all editions of Unity 5.

Occlusion Culling data can only be generated for objects properly labeled Occluder Static and Occludee Static under the StaticFlags dropdown. Occluder Static is the general setting for static objects where we want it to hide other objects, and be hidden by large objects in its way. Occludee Static is a special case for transparent objects that allows objects behind them to be rendered, but we want them to be hidden if something large blocks their visibility.

Naturally, because one of the static flags must be enabled for Occlusion Culling, this feature will not work for dynamic objects.

The following screenshot shows how effective Occlusion Culling can be at reducing the number of visible objects in our Scene:

dynamic-graphics-img-7

This feature will cost us in both application footprint and incur some runtime costs. It will cost RAM to keep the Occlusion Culling data structure in memory, and there will be a CPU processing cost to determine which objects are being occluded in each frame.

The Occlusion Culling data structure must be properly configured to create cells of the appropriate size for our Scene, and the smaller the cells, the longer it takes to generate the data structure. But, if it is configured correctly for the Scene, Occlusion Culling can provide both fill rate savings through reduced overdraw, and Draw Call savings by culling non-visible objects.

Shader optimization

Shaders can be a significant fill rate consumer, depending on their complexity, how much texture sampling takes place, how many mathematical functions are used, and so on. Shaders do not directly consume fill rate, but do so indirectly because the GPU must calculate or fetch data from memory during Shader processing. The GPU's parallel nature means any bottleneck in a thread will limit how many fragments can be pushed into the thread at a later date, but parallelizing the task (sharing small pieces of the job between several agents) provides a net gain over serial processing (one agent handling each task one after another).

The classic example is a vehicle assembly line. A complete vehicle requires multiple stages of manufacture to complete. The critical path to completion might involve five steps: stamping, welding, painting, assembly, and inspection, and each step is completed by a single team. For any given vehicle, no stage can begin before the previous one is finished, but whatever team handled the stamping for the last vehicle can begin stamping for the next vehicle as soon as it has finished. This organization allows each team to become masters of their particular domain, rather than trying to spread their knowledge too thin, which would likely result in less consistent quality in the batch of vehicles.

We can double the overall output by doubling the number of teams, but if any team gets blocked, then precious time is lost for any given vehicle, as well as all future vehicles that would pass through the same team. If these delays are rare, then they can be negligible in the grand scheme, but if not, and one stage takes several minutes longer than normal each and every time it must complete the task, then it can become a bottleneck that threatens the release of the entire batch.

The GPU parallel processors work in a similar way: each processor thread is an assembly line, each processing stage is a team, and each fragment is a vehicle. If the thread spends a long time processing a single stage, then time is lost on each fragment. This delay will multiply such that all future fragments coming through the same thread will be delayed. This is a bit of an oversimplification, but it often helps to paint a picture of how poorly optimized Shader code can chew up our fill rate, and how small improvements in Shader optimization provide big benefits in back end performance.

Shader programming and optimization have become a very niche area of game development. Their abstract and highly-specialized nature requires a very different kind of thinking to generate Shader code compared to gameplay and engine code. They often feature mathematical tricks and back-door mechanisms for pulling data into the Shader, such as precomputing values in texture files. Because of this, and the importance of optimization, Shaders tend to be very difficult to read and reverse-engineer.

Consequently, many developers rely on prewritten Shaders, or visual Shader creation tools from the Asset Store such as Shader Forge or Shader Sandwich. This simplifies the act of initial Shader code generation, but might not result in the most efficient form of Shaders. If we're relying on pre-written Shaders or tools, we might find it worthwhile to perform some optimization passes over them using some tried-and-true techniques. So, let's focus on some easily reachable ways of optimizing our Shaders.

Consider using Shaders intended for mobile platforms

The built-in mobile Shaders in Unity do not have any specific restrictions that force them to only be used on mobile devices. They are simply optimized for minimum resource usage (and tend to feature some of the other optimizations listed in this section).

Desktop applications are perfectly capable of using these Shaders, but they tend to feature a loss of graphical quality. It only becomes a question of whether the loss of graphical quality is acceptable. So, consider doing some testing with the mobile equivalents of common Shaders to see whether they are a good fit for your game.

Use small data types

GPUs can calculate with smaller data types more quickly than larger types (particularly on mobile platforms!), so the first tweak we can attempt is replacing our float data types (32-bit, floating point) with smaller versions such as half (16-bit, floating point), or even fixed (12-bit, fixed point).

The size of the data types listed above will vary depending on what floating point formats the target platform prefers. The sizes listed are the most common. The importance for optimization is in the relative size between formats.

Color values are good candidates for precision reduction, as we can often get away with less precise color values without any noticeable loss in coloration. However, the effects of reducing precision can be very unpredictable for graphical calculations. So, changes such as these can require some testing to verify whether the reduced precision is costing too much graphical fidelity.

Note that the effects of these tweaks can vary enormously between one GPU architecture and another (for example, AMD versus Nvidia versus Intel), and even GPU brands from the same manufacturer. In some cases, we can make some decent performance gains for a trivial amount of effort. In other cases, we might see no benefit at all.

Avoid changing precision while swizzling

Swizzling is the Shader programming technique of creating a new vector (an array of values) from an existing vector by listing the components in the order in which we wish to copy them into the new structure. Here are some examples of swizzling:

float4 input = float4(1.0, 2.0, 3.0, 4.0);  // initial test value

float2 val1 = input.yz;  // swizzle two components

float3 val2 = input.zyx;  // swizzle three components in a different order

float4 val3 = input.yyy;  // swizzle the same component multiple times

float sclr = input.w;
float3 val4 = sclr.xxx  // swizzle a scalar multiple times

We can use both the xyzw and rgba representations to refer to the same components, sequentially. It does not matter whether it is a color or vector; they just make the Shader code easier to read. We can also list components in any order we like to fill in the desired data, repeating them if necessary.

Converting from one precision type to another in a Shader can be a costly operation, but converting the precision type while simultaneously swizzling can be particularly painful. If we have mathematical operations that rely on being swizzled into different precision types, it would be wiser if we simply absorbed the high-precision cost from the very beginning, or reduced precision across the board to avoid the need for changes in precision.

Use GPU-optimized helper functions

The Shader compiler often performs a good job of reducing mathematical calculations down to an optimized version for the GPU, but compiled custom code is unlikely to be as effective as both the Cg library's built-in helper functions and the additional helpers provided by the Unity Cg included files. If we are using Shaders that include custom function code, perhaps we can find an equivalent helper function within the Cg or Unity libraries that can do a better job than our custom code can.

These extra include files can be added to our Shader within the CGPROGRAM block like so:

CGPROGRAM
// other includes
#include "UnityCG.cginc"
// Shader code here
ENDCG

Example Cg library functions to use are abs() for absolute values, lerp() for linear interpolation, mul() for multiplying matrices, and step() for step functionality. Useful UnityCG.cginc functions include WorldSpaceViewDir() for calculating the direction towards the camera, and Luminance() for converting a color to grayscale.

Check the following URL for a full list of Cg standard library functions: http://http.developer.nvidia.com/CgTutorial/cg_tutorial_appendix_e.html.

Check the Unity documentation for a complete and up-to-date list of possible include files and their accompanying helper functions: http://docs.unity3d.com/Manual/SL-BuiltinIncludes.html.

Disable unnecessary features

Perhaps we can make savings by simply disabling Shader features that aren't vital. Does the Shader really need multiple passes, transparency, Z-writing, alpha-testing, and/or alpha blending? Will tweaking these settings or removing these features give us a good approximation of our desired effect without losing too much graphical fidelity? Making such changes is a good way of making fill rate cost savings.

Remove unnecessary input data

Sometimes the process of writing a Shader involves a lot of back and forth experimentation in editing code and viewing it in the Scene. The typical result of this is that input data that was needed when the Shader was going through early development is now surplus fluff once the desired effect has been obtained, and it's easy to forget what changes were made when/if the process drags on for a long time. But, these redundant data values can cost the GPU valuable time as they must be fetched from memory even if they are not explicitly used by the Shader. So, we should double check our Shaders to ensure all of their input geometry, vertex, and fragment data is actually being used.

Only expose necessary variables

Exposing unnecessary variables from our Shader to the accompanying Material(s) can be costly as the GPU can't assume these values are constant. This means the Shader code cannot be compiled into a more optimized form. This data must be pushed from the CPU with every pass since they can be modified at any time through the Material's methods such as SetColor(), SetFloat(), and so on. If we find that, towards the end of the project, we always use the same value for these variables, then they can be replaced with a constant in the Shader to remove such excess runtime workload. The only cost is obfuscating what could be critical graphical effect parameters, so this should be done very late in the process.

Reduce mathematical complexity

Complicated mathematics can severely bottleneck the rendering process, so we should do whatever we can to limit the damage. Complex mathematical functions could be replaced with a texture that is fed into the Shader and provides a pre-generated table for runtime lookup. We may not see any improvement with functions such as sin and cos, since they've been heavily optimized to make use of GPU architecture, but complex methods such as pow, exp, log, and other custom mathematical processes can only be optimized so much, and would be good candidates for simplification. This is assuming we only need one or two input values, which are represented through the X and Y coordinates of the texture, and mathematical accuracy isn't of paramount importance.

This will cost us additional graphics memory to store the texture at runtime (more on this later), but if the Shader is already receiving a texture (which they are in most cases) and the alpha channel is not being used, then we could sneak the data in through the texture's alpha channel, costing us literally no performance, and the rest of the Shader code and graphics system would be none-the-wiser. This will involve the customization of art assets to include such data in any unused color channel(s), requiring coordination between programmers and artists, but is a very good way of saving Shader processing costs with no runtime sacrifices.

In fact, Material properties and textures are both excellent entry points for pushing work from the Shader (the GPU) onto the CPU. If a complex calculation does not need to vary on a per pixel basis, then we could expose the value as a property in the Material, and modify it as needed (accepting the overhead cost of doing so from the previous section Only expose necessary variables). Alternatively, if the result varies per pixel, and does not need to change often, then we could generate a texture file from script code, containing the results of the calculations in the RGBA values, and pulling the texture into the Shader. Lots of opportunities arise when we ignore the conventional application of such systems, and remember to think of them as just raw data being transferred around.

Reduce texture lookups

While we're on the subject of texture lookups, they are not trivial tasks for the GPU to process and they have their own overhead costs. They are the most common cause of memory access problems within the GPU, especially if a Shader is performing samples across multiple textures, or even multiple samples across a single texture, as they will likely inflict cache misses in memory. Such situations should be simplified as much as possible to avoid severe GPU memory bottlenecking.

Even worse, sampling a texture in a random order would likely result in some very costly cache misses for the GPU to suffer through, so if this is being done, then the texture should be reordered so that it can be sampled in a more sequential order.

Avoid conditional statements

In modern day CPU architecture, conditional statements undergo a lot of clever predictive techniques to make use of instruction-level parallelism. This is a feature where the CPU attempts to predict which direction a conditional statement will go in before it has actually been resolved, and speculatively begins processing the most likely result of the conditional using any free components that aren't being used to resolve the conditional (fetching some data from memory, copying some floats into unused registers, and so on). If it turns out that the decision is wrong, then the current result is discarded and the proper path is taken instead.

So long as the cost of speculative processing and discarding false results is less than the time spent waiting to decide the correct path, and it is right more often than it is wrong, then this is a net gain for the CPU's speed.

However, this feature is not possible on GPU architecture because of its parallel nature. The GPU's cores are typically managed by some higher-level construct that instructs all cores under its command to perform the same machine-code-level instruction simultaneously. So, if the Fragment Shader requires a float to be multiplied by 2, then the process will begin by having all cores copy data into the appropriate registers in one coordinated step. Only when all cores have finished copying to the registers will the cores be instructed to begin the second step: multiplying all registers by 2.

Thus, when this system stumbles into a conditional statement, it cannot resolve the two statements independently. It must determine how many of its child cores will go down each path of the conditional, grab the list of required machine code instructions for one path, resolve them for all cores taking that path, and repeat for each path until all possible paths have been processed. So, for an if-else statement (two possibilities), it will tell one group of cores to process the "true" path, then ask the remaining cores to process the "false" path. Unless every core takes the same path, it must process both paths every time.

So, we should avoid branching and conditional statements in our Shader code. Of course, this depends on how essential the conditional is to achieving the graphical effect we desire. But, if the conditional is not dependent on per pixel behavior, then we would often be better off absorbing the cost of unnecessary mathematics than inflicting a branching cost on the GPU. For example, we might be checking whether a value is non-zero before using it in a calculation, or comparing against some global flag in the Material before taking one action or another. Both of these cases would be good candidates for optimization by removing the conditional check.

Reduce data dependencies

The compiler will try its best to optimize our Shader code into the more GPU-friendly low-level language so that it is not waiting on data to be fetched when it could be processing some other task. For example, the following poorly-optimized code, could be written in our Shader:

float sum = input.color1.r;
sum = sum + input.color2.g;
sum = sum + input.color3.b;
sum = sum + input.color4.a;
float result = calculateSomething(sum);

If we were able to force the Shader compiler to compile this code into machine code instructions as it is written, then this code has a data dependency such that each calculation cannot begin until the last finishes due to the dependency on the sum variable. But, such situations are often detected by the Shader compiler and optimized into a version that uses instruction-level parallelism (the code shown next is the high-level code equivalent of the resulting machine code):

float sum1, sum2, sum3, sum4;
sum1 = input.color1.r;
sum2 = input.color2.g;
sum3 = input.color3.b
sum4 = input.color4.a;
float sum = sum1 + sum2 + sum3 + sum4;
float result = CalculateSomething(sum);

In this case, the compiler would recognize that it can fetch the four values from memory in parallel and complete the summation once all four have been fetched independently via thread-level parallelism. This can save a lot of time, relative to performing the four fetches one after another.

However, long chains of data dependency can absolutely murder Shader performance. If we create a strong data dependency in our Shader's source code, then it has been given no freedom to make such optimizations. For example, the following data dependency would be painful on performance, as one step cannot be completed without waiting on another to fetch data and performing the appropriate calculation.

float4 val1 = tex2D(_tex1, input.texcoord.xy);
float4 val2 = tex2D(_tex2, val1.yz);
float4 val3 = tex2D(_tex3, val2.zw);

Strong data dependencies such as these should be avoided whenever possible.

Surface Shaders

If we're using Unity's Surface Shaders, which are a way for Unity developers to get to grips with Shader programming in a more simplified fashion, then the Unity Engine takes care of converting our Surface Shader code for us, abstracting away some of the optimization opportunities we have just covered. However, it does provide some miscellaneous values that can be used as replacements, which reduce accuracy but simplify the mathematics in the resulting code. Surface Shaders are designed to handle the general case fairly efficiently, but optimization is best achieved with a personal touch.

The approxview attribute will approximate the view direction, saving costly operations. halfasview will reduce the precision of the view vector, but beware of its effect on mathematical operations involving multiple precision types. noforwardadd will limit the Shader to only considering a single directional light, reducing Draw Calls since the Shader will render in only a single pass, but reducing lighting complexity. Finally, noambient will disable ambient lighting in the Shader, removing some extra mathematical operations that we may not need.

Use Shader-based LOD

We can force Unity to render distant objects using simpler Shaders, which can be an effective way of saving fill rate, particularly if we're deploying our game onto multiple platforms or supporting a wide range of hardware capability. The LOD keyword can be used in the Shader to set the onscreen size factor that the Shader supports. If the current LOD level does not match this value, it will drop to the next fallback Shader and so on until it finds the Shader that supports the given size factor. We can also change a given Shader object's LOD value at runtime using the maximumLOD property.

This feature is similar to the mesh-based LOD covered earlier, and uses the same LOD values for determining object form factor, so it should be configured as such.

Memory bandwidth

Another major component of back end processing and a potential source of bottlenecks is memory bandwidth. Memory bandwidth is consumed whenever a texture must be pulled from a section of the GPU's main video memory (also known as VRAM). The GPU contains multiple cores that each have access to the same area of VRAM, but they also each contain a much smaller, local Texture Cache that stores the current texture(s) the GPU has been most recently working with. This is similar in design to the multitude of CPU cache levels that allow memory transfer up and down the chain, as a workaround for the fact that faster memory will, invariably, be more expensive to produce, and hence smaller in capacity compared to slower memory.

Whenever a Fragment Shader requests a sample from a texture that is already within the core's local Texture Cache, then it is lightning fast and barely perceivable. But, if a texture sample request is made, that does not yet exist within the Texture Cache, then it must be pulled in from VRAM before it can be sampled. This fetch request risks cache misses within VRAM as it tries to find the relevant texture. The transfer itself consumes a certain amount of memory bandwidth, specifically an amount equal to the total size of the texture file stored within VRAM (which may not be the exact size of the original file, nor the size in RAM, due to GPU-level compression).

It's for this reason that, if we're bottlenecked on memory bandwidth, then performing a brute force test by reducing texture quality would suddenly result in a performance improvement. We've shrunk the size of our textures, easing the burden on the GPU's memory bandwidth, allowing it to fetch the necessary textures much quicker. Globally reducing texture quality can be achieved by going to Edit | Project Settings | Quality | Texture Quality and setting the value to Half Res, Quarter Res, or Eighth Res.

In the event that memory bandwidth is bottlenecked, then the GPU will keep fetching the necessary texture files, but the entire process will be throttled as the Texture Cache waits for the data to appear before processing the fragment. The GPU won't be able to push data back to the Frame Buffer in time to be rendered onto the screen, blocking the whole process and culminating in a poor frame rate.

Ultimately, proper usage of memory bandwidth is a budgeting concern. For example, with a memory bandwidth of 96 GB/sec per core and a target frame rate of 60 frames per second, then the GPU can afford to pull 96/60 = 1.6 GB worth of texture data every frame before being bottlenecked on memory bandwidth.

Memory bandwidth is often listed on a per core basis, but some GPU manufacturers may try to mislead you by multiplying memory bandwidth by the number of cores in order to list a bigger, but less practical number. Because of this, research may be necessary to confirm the memory bandwidth limit we have for the target GPU hardware is given on a per core basis.

Note that this value is not the maximum limit on the texture data that our game can contain in the project, nor in CPU RAM, not even in VRAM. It is a metric that limits how much texture swapping can occur during one frame. The same texture could be pulled back and forth multiple times in a single frame depending on how many Shaders need to use them, the order that the objects are rendered, and how often texture sampling must occur, so rendering just a few objects could consume whole gigabytes of memory bandwidth if they all require the same high quality, massive textures, require multiple secondary texture maps (normal maps, emission maps, and so on), and are not batched together, because there simply isn't enough Texture Cache space available to keep a single texture file long enough to exploit it during the next rendering pass.

There are several approaches we can take to solve bottlenecks in memory bandwidth.

Use less texture data

This approach is simple, straightforward, and always a good idea to consider. Reducing texture quality, either through resolution or bit rate, is not ideal for graphical quality, but we can sometimes get away with using 16-bit textures without any noticeable degradation.

Mip Maps are another excellent way of reducing the amount of texture data being pushed back and forth between VRAM and the Texture Cache. Note that the Scene View has a Mipmaps Shading Mode, which will highlight textures in our scene blue or red depending on whether the current texture scale is appropriate for the current Scene View's camera position and orientation. This will help identify what textures are good candidates for further optimization.

Mip Maps should almost always be used in 3D Scenes, unless the camera moves very little.

Test different GPU Texture Compression formats

The Texture Compression techniques helpe reduce our application's footprint (executable file size), and runtime CPU memory usage, that is, the storage area where all texture resource data is kept until it is needed by the GPU. However, once the data reaches the GPU, it uses a different form of compression to keep texture data small. The common formats are DXT, PVRTC, ETC, and ASTC.

To make matters more confusing, each platform and GPU hardware supports different compression formats, and if the device does not support the given compression format, then it will be handled at the software level. In other words, the CPU will need to stop and recompress the texture to the desired format the GPU wants, as opposed to the GPU taking care of it with a specialized hardware chip.

The compression options are only available if a texture resource has its Texture Type field set to Advanced. Using any of the other texture type settings will simplify the choices, and Unity will make a best guess when deciding which format to use for the target platform, which may not be ideal for a given piece of hardware and thus will consume more memory bandwidth than necessary.

The best approach to determining the correct format is to simply test a bunch of different devices and Texture Compression techniques and find one that fits. For example, common wisdom says that ETC is the best choice for Android since more devices support it, but some developers have found their game works better with the DXT and PVRTC formats on certain devices.

Beware that, if we're at the point where individually tweaking Texture Compression techniques is necessary, then hopefully we have exhausted all other options for reducing memory bandwidth. By going down this road, we could be committing to supporting many different devices each in their own specific way. Many of us would prefer to keep things simple with a general solution instead of personal customization and time-consuming handiwork to work around problems like this.

Minimize texture sampling

Can we modify our Shaders to remove some texture sampling overhead? Did we add some extra texture lookup files to give ourselves some fill rate savings on mathematical functions? If so, we might want to consider lowering the resolution of such textures or reverting the changes and solving our fill rate problems in other ways. Essentially, the less texture sampling we do, the less often we need to use memory bandwidth and the closer we get to resolving the bottleneck.

Organize assets to reduce texture swaps

This approach basically comes back to Batching and Atlasing again. Are there opportunities to batch some of our biggest texture files together? If so, then we could save the GPU from having to pull in the same texture files over and over again during the same frame. As a last resort, we could look for ways to remove some textures from the entire project and reuse similar files. For instance, if we have fill rate budget to spare, then we may be able to use some Fragment Shaders to make a handful of textures files appear in our game with different color variations.

VRAM limits

One last consideration related to textures is how much VRAM we have available. Most texture transfer from CPU to GPU occurs during initialization, but can also occur when a non-existent texture is first required by the current view. This process is asynchronous and will result in a blank texture being used until the full texture is ready for rendering. As such, we should avoid too much texture variation across our Scenes.

Texture preloading

Even though it doesn't strictly relate to graphics performance, it is worth mentioning that the blank texture that is used during asynchronous texture loading can be jarring when it comes to game quality. We would like a way to control and force the texture to be loaded from disk to the main memory and then to VRAM before it is actually needed.

A common workaround is to create a hidden GameObject that features the texture and place it somewhere in the Scene on the route that the player will take towards the area where it is actually needed. As soon as the textured object becomes a candidate for the rendering system (even if it's technically hidden), it will begin the process of copying the data towards VRAM. This is a little clunky, but is easy to implement and works sufficiently well in most cases.

We can also control such behavior via Script code by changing a hidden Material's texture:

GetComponent<Renderer>().material.texture = textureToPreload;

Texture thrashing

In the rare event that too much texture data is loaded into VRAM, and the required texture is not present, the GPU will need to request it from the main memory and overwrite the existing texture data to make room. This is likely to worsen over time as the memory becomes fragmented, and it introduces a risk that the texture just flushed from VRAM needs to be pulled again within the same frame. This will result in a serious case of memory "thrashing", and should be avoided at all costs.

This is less of a concern on modern consoles such as the PS4, Xbox One, and WiiU, since they share a common memory space for both CPU and GPU. This design is a hardware-level optimization given the fact that the device is always running a single application, and almost always rendering 3D graphics. But, all other platforms must share time and space with multiple applications and be capable of running without a GPU. They therefore feature separate CPU and GPU memory, and we must ensure that the total texture usage at any given moment remains below the available VRAM of the target hardware.

Note that this "thrashing" is not precisely the same as hard disk thrashing, where memory is copied back and forth between main memory and virtual memory (the swap file), but it is analogous. In either case, data is being unnecessarily copied back and forth between two regions of memory because too much data is being requested in too short a time period for the smaller of the two memory regions to hold it all.

Thrashing such as this can be a common cause of dreadful graphics performance when games are ported from modern consoles to the desktop and should be treated with care.

Avoiding this behavior may require customizing texture quality and file sizes on a per-platform and per-device basis. Be warned that some players are likely to notice these inconsistencies if we're dealing with hardware from the same console or desktop GPU generation. As many of us will know, even small differences in hardware can lead to a lot of apples-versus-oranges comparisons, but hardcore gamers will expect a similar level of quality across the board.

Lighting and Shadowing

Lighting and Shadowing can affect all parts of the graphics pipeline, and so they will be treated separately. This is perhaps one of the most important parts of game art and design to get right. Good Lighting and Shadowing can turn a mundane scene into something spectacular as there is something magical about professional coloring that makes it visually appealing. Even the low-poly art style (think Monument Valley) relies heavily on a good lighting and shadowing profile in order to allow the player to distinguish one object from another. But, this isn't an art book, so we will focus on the performance characteristics of various Lighting and Shadowing features.

Unity offers two styles of dynamic light rendering, as well as baked lighting effects through lightmaps. It also provides multiple ways of generating shadows with varying levels of complexity and runtime processing cost. Between the two, there are a lot of options to explore, and a lot of things that can trip us up if we're not careful.

The Unity documentation covers all of these features in an excellent amount of detail (start with this page and work through them: http://docs.unity3d.com/Manual/Lighting.html), so we'll examine these features from a performance standpoint.

Let's tackle the two main light rendering modes first. This setting can be found under Edit | Project Settings | Player | Other Settings | Rendering, and can be configured on a per-platform basis.

Forward Rendering

Forward Rendering is the classical form of rendering lights in our scene. Each object is likely to be rendered in multiple passes through the same Shader. How many passes are required will be based on the number, distance, and brightness of light sources. Unity will try to prioritize which directional light is affecting the object the most and render the object in a "base pass" as a starting point. It will then take up to four of the most powerful point lights nearby and re-render the same object multiple times through the same Fragment Shader. The next four point lights will then be processed on a per-vertex basis. All remaining lights are treated as a giant blob by means of a technique called spherical harmonics.

Some of this behavior can be simplified by setting a light's Render Mode to values such as Not Important, and changing the value of Edit | Project Settings | Quality | Pixel Light Count. This value limits how many lights will be treated on a per pixel basis, but is overridden by any lights with a Render Mode set to Important. It is therefore up to us to use this combination of settings responsibly.

As you can imagine, the design of Forward Rendering can utterly explode our Draw Call count very quickly in scenes with a lot of point lights present, due to the number of render states being configured and Shader passes being reprocessed. CPU-bound applications should avoid this rendering mode if possible.

More information on Forward Rendering can be found in the Unity documentation: http://docs.unity3d.com/Manual/RenderTech-ForwardRendering.html.

Deferred Shading

Deferred Shading or Deferred Rendering as it is sometimes known, is only available on GPUs running at least Shader Model 3.0. In other words, any desktop graphics card made after around 2004. The technique has been around for a while, but it has not resulted in a complete replacement of the Forward Rendering method due to the caveats involved and limited support on mobile devices. Anti-aliasing, transparency, and animated characters receiving shadows are all features that cannot be managed through Deferred Shading alone and we must use the Forward Rendering technique as a fallback.

Deferred Shading is so named because actual shading does not occur until much later in the process; that is, it is deferred until later. From a performance perspective, the results are quite impressive as it can generate very good per pixel lighting with surprisingly little Draw Call effort. The advantage is that a huge amount of lighting can be accomplished using only a single pass through the lighting Shader. The main disadvantages include the additional costs if we wish to pile on advanced lighting features such as Shadowing and any steps that must pass through Forward Rendering in order to complete, such as transparency.

The Unity documentation contains an excellent source of information on the Deferred Shading technique, its advantages, and its pitfalls: http://docs.unity3d.com/Manual/RenderTech-DeferredShading.html

Vertex Lit Shading (legacy)

Technically, there are more than two lighting methods. Unity allows us to use a couple of legacy lighting systems, only one of which may see actual use in the field: Vertex Lit Shading. This is a massive simplification of lighting, as lighting is only considered per vertex, and not per pixel. In other words, entire faces are colored based on the incoming light color, and not individual pixels.

It is not expected that many, or really any, 3D games will make use of this legacy technique, as a lack of shadows and proper lighting make visualizations of depth very difficult. It is mostly relegated to 2D games that don't intend to make use of shadows, normal maps, and various other lighting features, but it is there if we need it.

Real-time Shadows

Soft Shadows are expensive, Hard Shadows are cheap, and No Shadows are free. Shadow Resolution, Shadow Projection, Shadow Distance, and Shadow Cascades are all settings we can find under Edit | Project Settings | Quality | Shadows that we can use to modify the behavior and complexity of our shadowing passes. That summarizes almost everything we need to know about Unity's real-time shadowing techniques from a high-level performance standpoint. We will cover shadows more in the following section on optimizing our lighting effects.

Lighting optimization

With a cursory glance at all of the relevant lighting techniques, let's run through some techniques we can use to improve lighting costs.

Use the appropriate Shading Mode

It is worth testing both of the main rendering modes to see which one best suits our game. Deferred Shading is often used as a backup in the event that Forward Rendering is becoming a burden on performance, but it really depends on where else we're finding bottlenecks as it is sometimes difficult to tell the difference between them.

Use Culling Masks

A Light Component's Culling Mask property is a layer-based mask that can be used to limit which objects will be affected by the given Light. This is an effective way of reducing lighting overhead, assuming that the layer interactions also make sense with how we are using layers for physics optimization. Objects can only be a part of a single layer, and reducing physics overhead probably trumps lighting overhead in most cases; thus, if there is a conflict, then this may not be the ideal approach.

Note that there is limited support for Culling Masks when using Deferred Shading. Because of the way it treats lighting in a very global fashion, only four layers can be disabled from the mask, limiting our ability to optimize its behavior through this method.

Use Baked Lightmaps

Baking Lighting and Shadowing into a Scene is significantly less processor-intensive than generating them at runtime. The downside is the added application footprint, memory consumption, and potential for memory bandwidth abuse. Ultimately, unless a game's lighting effects are being handled exclusively through Legacy Vertex Lighting or a single Directional Light, then it should probably include Lightmapping to make some huge budget savings on lighting calculations. Relying entirely on real-time lighting and shadows is a recipe for disaster unless the game is trying to win an award for the smallest application file size of all time.

Optimize Shadows

Shadowing passes mostly consume our Draw Calls and fill rate, but the amount of vertex position data we feed into the process and our selection for the Shadow Projection setting will affect the front end's ability to generate the required shadow casters and shadow receivers. We should already be attempting to reduce vertex counts to solve front end bottlenecking in the first place, and making this change will be an added multiplier towards that effort.

Draw Calls are consumed during shadowing by rendering visible objects into a separate buffer (known as the shadow map) as either a shadow caster, a shadow receiver, or both. Each object that is rendered into this map will consume another Draw Call, which makes shadows a huge performance cost multiplier, so it is often a setting that games will expose to users via quality settings, allowing users with weaker hardware to reduce the effect or even disable it entirely.

Shadow Distance is a global multiplier for runtime shadow rendering. The fewer shadows we need to draw, the happier the entire rendering process will be. There is little point in rendering shadows at a great distance from the camera, so this setting should be configured specific to our game and how much shadowing we expect to witness during gameplay. It is also a common setting that is exposed to the user to reduce the burden of rendering shadows.

Higher values of Shadow Resolution and Shadow Cascades will increase our memory bandwidth and fill rate consumption. Both of these settings can help curb the effects of artefacts in shadow rendering, but at the cost of a much larger shadow map size that must be moved around and of the canvas size to draw to.

The Unity documentation contains an excellent summary on the topic of the aliasing effect of shadow maps and how the Shadow Cascades feature helps to solve the problem: http://docs.unity3d.com/Manual/DirLightShadows.html.

It's worth noting that Soft Shadows do not consume any more memory or CPU overhead relative to Hard Shadows, as the only difference is a more complex Shader. This means that applications with enough fill rate to spare can enjoy the improved graphical fidelity of Soft Shadows.

Optimizing graphics for mobile

Unity's ability to deploy to mobile devices has contributed greatly to its popularity among hobbyist, small, and mid-size development teams. As such, it would be prudent to cover some approaches that are more beneficial for mobile platforms than for desktop and other devices.

Note that any, and all, of the following approaches may become obsolete soon, if they aren't already. The mobile device market is moving blazingly fast, and the following techniques as they apply to mobile devices merely reflect conventional wisdom from the last half decade. We should occasionally test the assumptions behind these approaches from time-to-time to see whether the limitations of mobile devices still fit the mobile marketplace.

Minimize Draw Calls

Mobile applications are more often bottlenecked on Draw Calls than on fill rate. Not that fill rate concerns should be ignored (nothing should, ever!), but this makes it almost necessary for any mobile application of reasonable quality to implement Mesh Combining, Batching, and Atlasing techniques from the very beginning. Deferred Rendering is also the preferred technique as it fits well with other mobile-specific concerns, such as avoiding transparency and having too many animated characters.

Minimize the Material count

This concern goes hand in hand with the concepts of Batching and Atlasing. The fewer Materials we use, the fewer Draw Calls will be necessary. This strategy will also help with concerns relating to VRAM and memory bandwidth, which tend to be very limited on mobile devices.

Minimize texture size and Material count

Most mobile devices feature a very small Texture Cache relative to desktop GPUs. For instance, the iPhone 3G can only support a total texture size of 1024x1024 due to running OpenGLES1.1 with simple vertex rendering techniques. Meanwhile the iPhone 3GS, iPhone 4, and iPad generation run OpenGLES 2.0, which only supports textures up to 2048x2048. Later generations can support textures up to 4096x4096. Double check the device hardware we are targeting to be sure it supports the texture file sizes we wish to use (there are too many Android devices to list here). However, later-generation devices are never the most common devices in the mobile marketplace. If we wish our game to reach a wide audience (increasing its chances of success), then we must be willing to support weaker hardware.

Note that textures that are too large for the GPU will be downscaled by the CPU during initialization, wasting valuable loading time, and leaving us with unintended graphical fidelity. This makes texture reuse of paramount importance for mobile devices due to the limited VRAM and Texture Cache sizes available.

Make textures square and power-of-2

The GPU will find it difficult, or simply be unable to compress the texture if it is not in a square format, so make sure you stick to the common development convention and keep things square and sized to a power of 2.

Use the lowest possible precision formats in Shaders

Mobile GPUs are particularly sensitive to precision formats in its Shaders, so the smallest formats should be used. On a related note, format conversion should be avoided for the same reason.

Avoid Alpha Testing

Mobile GPUs haven't quite reached the same levels of chip optimization as desktop GPUs, and Alpha Testing remains a particularly costly task on mobile devices. In most cases it should simply be avoided in favor of Alpha Blending.

Summary

If you've made it this far without skipping ahead, then congratulations are in order. That was a lot of information to absorb for just one component of the Unity Engine, but then it is clearly the most complicated of them all, requiring a matching depth of explanation. Hopefully, you've learned a lot of approaches to help you improve your rendering performance and enough about the rendering pipeline to know how to use them responsibly!

To learn more about Unity 5, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Unity 5 Game Optimization (https://www.packtpub.com/game-development/unity-5-game-optimization)
Unity 5.x By Example (https://www.packtpub.com/game-development/unity-5x-example)
Unity 5.x Cookbook (https://www.packtpub.com/game-development/unity-5x-cookbook)
Unity 5 for Android Essentials (https://www.packtpub.com/game-development/unity-5-android-essentials)