
Data sizing

The cost of abstractions in terms of data size plays an important role. For example, whether or not a data element can fit into a processor cache line depends directly upon its size. On a Linux system, we can find out the cache line size and other parameters by inspecting the values in the files under /sys/devices/system/cpu/cpu0/cache/.
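As a quick, hedged illustration (Linux-specific; index0 conventionally maps to the first CPU's L1 data cache), the cache line size can be read directly from sysfs:

(require '[clojure.string :as str])

;; a minimal, Linux-only sketch: read the L1 data-cache line size in bytes
(defn cache-line-size []
  (-> "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size"
      slurp
      str/trim
      Long/parseLong))

On most modern x86 hardware, this returns 64.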

Another common concern is how much data we hold in the heap at a time: the more live data on the heap, the harder the garbage collector must work, which has direct consequences on the application's performance. While processing data, we often do not really need all the data we hold on to. Consider the example of generating a summary report of items sold over a period of several months. Once the summary for a subperiod (say, a month) is computed, we no longer need the item details, so it is better to remove the unwanted data as we accumulate the summaries. This is shown in the following example:

(defn summarize [daily-data] ; daily-data is a map
  (let [s (items-summary (:items daily-data))]
    (-> daily-data
        (select-keys [:digest :invoices]) ; keep only the required key/val pairs
        (assoc :summary s))))

;; now inside the report-generation code
(->> (fetch-items period-from period-to :interval-day)
     (map summarize)
     generate-report)

Had we not used select-keys in the preceding summarize function, it would have returned a map containing the summary along with all the other existing keys in the map. Such processing is often combined with lazy sequences, so for this scheme to work, it is important not to hold on to the head of the lazy sequence.
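To make the head-holding pitfall concrete, here is a classic illustration (the numbers are arbitrary); only the order of the two calls differs:

;; holds the head: xs must be retained for the later (first xs) call, so
;; everything realized by last stays reachable -- risks OutOfMemoryError
(let [xs (range 100000000)]
  [(last xs) (first xs)])

;; safe: (first xs) runs before the final use of xs, which the compiler
;; clears (locals clearing), so elements are GC'd as last walks the sequence
(let [xs (range 100000000)]
  [(first xs) (last xs)])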

Reduced serialization

An I/O channel is a common source of latency, and the perils of over-serialization cannot be overstated. Whether we read data from or write data to a data source over an I/O channel, all of it needs to be prepared, encoded, serialized, de-serialized, and parsed before it can be worked on. The less data involved at each step, the lower the overhead. Where no I/O is involved, such as in-process communication, serialization generally makes no sense at all.

A common example of over-serialization shows up when working with SQL databases: catch-all SQL query functions that fetch all columns of a table or relation and are called from the various functions that implement the business logic. Fetching data that we do not need is wasteful and detrimental to performance, for the same reason discussed in the preceding paragraph. While it may seem like more work to write one SQL statement and one database query function per use case, it pays off with better performance. Code that uses NoSQL databases is subject to the same anti-pattern; we have to take care to fetch only what we need, even if that leads to additional code.
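As a hedged sketch of the difference (the orders table and its columns are made up for illustration), compare a catch-all query function with a use-case-specific one:

(require '[clojure.java.jdbc :as jdbc])

;; wasteful: fetches every column, most of which are then discarded
(defn fetch-order [db order-id]
  (first (jdbc/query db ["SELECT * FROM orders WHERE order_id = ?" order-id])))

;; better: one narrow query per use case, fetching only what is needed
(defn fetch-order-status [db order-id]
  (first (jdbc/query db ["SELECT status, updated_at FROM orders WHERE order_id = ?"
                         order-id])))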

There is a pitfall to be aware of when reducing serialization: often, some information needs to be inferred in the absence of the serialized data. In cases where we drop part of the serialized data so that the rest can be inferred, we must weigh the cost of inference against the serialization overhead. This comparison need not be made per operation, but rather for the system as a whole; we can then decide how to allocate resources in order to achieve the required capacities for the various parts of the system.

Chunking to reduce memory pressure

What happens when we slurp a text file, regardless of its size? The contents of the entire file sit in the JVM heap. If the file is larger than the heap capacity, the JVM terminates by throwing OutOfMemoryError. If the file is large, but not large enough to force the JVM into an OOM error, it still leaves relatively little heap space for the application's other operations. A similar situation arises whenever we carry out an operation while disregarding the JVM heap capacity. Fortunately, this can be fixed by reading data in chunks and processing each chunk before reading further.
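For line-oriented text data, this chunked approach might look like the following sketch, where process-line! stands for a hypothetical per-line handler:

(require '[clojure.java.io :as io])

;; process a large file in chunks (here, line by line) instead of slurping it;
;; line-seq is lazy, so only a small buffer of lines is in memory at a time,
;; provided we do not hold on to the head of the sequence
(defn process-large-file [filename process-line!]
  (with-open [rdr (io/reader filename)]
    (doseq [line (line-seq rdr)]
      (process-line! line))))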

Sizing for file/network operations

Let us take the example of a data-ingestion process where a semi-automated job uploads large Comma-Separated Values (CSV) files to a file server via the File Transfer Protocol (FTP), and another, automated job written in Clojure runs periodically to detect the arrival of new files via the Network File System (NFS). After detecting a new file, the Clojure program processes it, updates the result in a database, and archives the file. The program detects and processes several files concurrently. The size of the CSV files is not known in advance, but the format is predefined.

As per the preceding description, one potential problem is that, since multiple files may be processed concurrently, we must decide how to distribute the JVM heap among the concurrent file-processing jobs. Another issue is that the operating system imposes a limit on how many files can be open at a time; on Unix-like systems, the ulimit command can be used to extend that limit. We cannot arbitrarily slurp the CSV file contents: we must limit each job to a certain amount of memory and also cap the number of jobs that can run concurrently. At the same time, we cannot read only a very small number of rows from a file at a time, because that would hurt performance.

(def ^:const K 1024)

;; create the buffered reader using a custom 128K buffer size
(-> filename
    java.io.FileInputStream.
    java.io.InputStreamReader.
    (java.io.BufferedReader. (* K 128)))

Fortunately, we can specify the buffer size when reading from a file, or even from a network stream, to tune memory usage and performance as appropriate. In the preceding code example, we explicitly set the reader's buffer size to do exactly that.
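The other half of the problem, capping the number of concurrently running jobs, can be handled with a counting semaphore. The following is a minimal sketch, where process-file stands for a hypothetical job function:

(import 'java.util.concurrent.Semaphore)

;; allow at most 4 file-processing jobs at a time; each job acquires a
;; permit before starting and releases it when done, even on failure
(defonce permits (Semaphore. 4))

(defn process-with-limit [process-file filename]
  (.acquire permits)
  (try
    (process-file filename)
    (finally
      (.release permits))))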

Sizing for JDBC query results

Java's interface standard for SQL databases, JDBC (which is technically not an acronym), supports a fetch size for retrieving query results via JDBC drivers. The default fetch size depends on the JDBC driver; most drivers keep it low so as to avoid high memory consumption and to allow internal performance optimizations. A notable exception to this norm is the MySQL JDBC driver, which by default fetches and stores all rows in memory.

(require '[clojure.java.jdbc :as jdbc])

;; using prepare-statement directly
;; (we rarely use it directly; shown here just for demo)
(with-open [stmt (jdbc/prepare-statement conn sql
                                         :fetch-size 1000
                                         :max-rows 9000)
            rset (.executeQuery stmt)]
  (vec (resultset-seq rset)))

;; using query
(jdbc/query db [{:fetch-size 1000}
                "SELECT empno FROM emp WHERE country=?" 1])

When using the Clojure Contrib library java.jdbc (https://github.com/clojure/java.jdbc as of Version 0.3.0), the fetch size can be set while preparing a statement as shown in the preceding example.

The fetch size does not guarantee proportional latency; however, it can be used safely for memory sizing.

We must test any latency changes caused by the fetch size at different loads and use cases, for the particular database and JDBC driver in use. Besides :fetch-size, we can also pass the :max-rows argument to limit the maximum number of rows a query returns. However, this means that the extra rows are truncated from the result, not that the database internally limits the number of rows it realizes.

Resource pooling

There are several types of resources on the JVM that are rather expensive to initialize: HTTP connections, execution threads, JDBC connections, and so on. The Java API recognizes this and has built-in support for pooling some of these resources, so that consumer code borrows a resource from the pool when required and simply returns it to the pool at the end of the job. Java's thread pools and JDBC data sources are prominent examples. The idea is to preserve initialized objects for reuse. Even where Java does not directly support pooling for a resource type, you can always create a pool abstraction around your own expensive resources.
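For instance, a fixed thread pool from java.util.concurrent reuses a small set of initialized threads across many tasks. A minimal sketch:

(import '[java.util.concurrent Callable Executors ExecutorService])

;; a pool of 4 reusable threads; spawning a fresh thread per task would
;; repeatedly pay the thread-initialization cost
(defonce ^ExecutorService thread-pool (Executors/newFixedThreadPool 4))

;; Clojure functions implement Callable, so they can be submitted directly;
;; .submit returns a Future whose result is available via .get
(defn run-pooled [f]
  (.submit thread-pool ^Callable f))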

The pooling technique is common in I/O activities, but it can be equally applicable to non-I/O purposes where the initialization cost is high.

Summary

Designing an application for performance should be based on the use cases and on the patterns of anticipated system load and behavior. Measuring performance is extremely important for guiding optimization. Fortunately, there are several well-known optimization patterns to tap into, such as data sizing, reduced serialization, chunking, and resource pooling; in this article, we analyzed performance optimization using these patterns.
