Thursday, August 18, 2016

Why Co-location Does Matter to Gain High-Performance Java EE

Introduction


Various types of JDBC drivers are implemented by various coders, it means the JDBC specification designed by the JCP team only defines the various interfaces, referenced as JDBC API, but each coder/implementer can implement these interfaces in their own ways.

Based on the ways, we can classify the implementations into four types.
  1. JDBC Driver - Type 1 (JDBC ODBC Bridge)

    This is an approach wherein the implemented class in Java makes calls to the code written in Microsoft languages (native), which speaks directly to the database.

  2. JDBC Driver - Type 2 (Part Native Driver)

    This is an approach wherein the implemented class in Java makes calls to the code written by the database provider (native), which speaks directly to the database.

  3. JDBC Driver - Type 3

    This is an approach wherein the implemented class in Java makes calls to the code written from application server providers, which speaks directly to the database. More exploration on the way the Java Driver interacts with the Middleware is required here. This kind of driver is extremely flexible since it requires no code installed on the client and a single driver can actually provide access to multiple databases.

  4. JDBC Driver - Type 4 (Thin Driver)

    This is an approach wherein the implemented by the database provider class in Java speaks directly to the database. In other words, it is a pure Java library that translates a JDBC request directly to a Database specific protocol.

The most used in the distributed world is the JDBC Type 4 driver, but in the z Systems world - the different point is. The purpose of the article is to show how significant the performance of a Java EE application can be improved just by enabling the right JDBC driver.



Benchmark


Let's review a simple benchmark for compare JDBC Type 2 vs JDBC Type 4. Average response time per operation is chosen as a parameter, so the less value is better. JMH is used for workload simulation and average time measuring. JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.

The recommended way to run a JMH benchmark is to use Maven to set up a standalone project that depends on the jar files of your application. This approach is preferred to ensure that the benchmarks are correctly initialized and produce reliable results.

After the build is done, the self-contained executable JAR, which holds the benchmark, will be generated. The following command is used for starting:

$ java -jar benchmarks.jar "DB2ConnectionsBenchmark" -f 4 -t 4 -wi 10 -i 25 -si true -gc true 
-p server=mvz2 -p port=9081

The benchmark is executed on a 6 ways (4 GCP + 2 zIIP) IBM zEnterprise Business Class 12 mainframe in four threads, four sequenced forks of a client JVM with 10 warmup and 25 working iterations for each one will be done.

The source code is available on GitHub.

Workload



The benchmark Java class works as an HTTP client, calls a server and counts the average response time. The server is a java servlet deployed on WebSphere Liberty Profile 9 Beta for z/OS. The Liberty Profile server is connected to a DB2 11 for z/OS database instance lunched on the same z/OS image. Two connection pools are leveraged for establishing connections to the database, the first pool is configured for the JDBC Type 2 driver, while the second pool is for the JDBC Type 4 one, respectively. The configuration is described inside the server.xml file:

<?xml version="1.0" encoding="UTF-8" ?>
<server description="JDBC Type 2 vs Type 4 DataSource definition">
    <featureManager>
        <feature>jdbc-4.0</feature>
        <feature>zosTransaction-1.0</feature>
    </featureManager>

    <nativeTransactionManager shutdownTimeout="20s" 
                              resourceManagerNamePrefix="DEMOS"/>

    <library id="DB2T2LibRef">
        <fileset dir="/usr/lpp/db2b10/jdbc/classes"/>
        <fileset dir="/usr/lpp/db2b10/jdbc/lib"/>
    </library>

    <library id="DB2JCC4Lib">
        <fileset dir="/usr/lpp/db2b10/jdbc/classes" 
                 includes="db2jcc4.jar db2jcc_license_cisuz.jar"/>
    </library>

    <jdbcDriver id="DB2T2" libraryRef="DB2T2LibRef"/>

    <dataSource id="jdbc/t2/test" jndiName="jdbc/t2/test" jdbcDriverRef="DB2T2"
                isolationLevel="TRANSACTION_READ_COMMITTED"
                type="javax.sql.ConnectionPoolDataSource">
        <connectionManager maxPoolSize="20" minPoolSize="8" connectionTimeout="10s" agedTimeout="10m"/>
        <properties.db2.jcc driverType="2" databaseName="DB0A"/>
    </dataSource>

    <dataSource id="jdbc/t4/test" jndiName="jdbc/t4/test" isolationLevel="TRANSACTION_READ_COMMITTED">
        <jdbcDriver libraryRef="DB2JCC4Lib"/>
        <connectionManager maxPoolSize="20" minPoolSize="8" connectionTimeout="10s" agedTimeout="10m"/>
        <properties.db2.jcc databaseName="DB0A" serverName="localhost" portNumber="446"
                            user="pavel" password="xxx"/>
    </dataSource>
</server>

The servlet encapsulates the business logic of the tested application. The logic is very easy, just to get a database connection from a pool, execute an SQL query using the connection, iterate over a result set returned by the query, calculate a size of the result set, return the size into a response, and release the connection. There is a limit for retrieved result sets, and the limit is specified as a value of the s servlet parameter.

public abstract class AbstractConnectionBenchmarkServlet extends HttpServlet {

    private static final long serialVersionUID = -7382506933509679639L;

    protected abstract DataSource getDS();
    
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {        
        
        int size = Integer.parseInt(req.getParameter("s"));
       
        try {            
            resp.setStatus(HttpServletResponse.SC_OK);
            int[] res = getRecordsCount(size);              
            resp.getWriter().write("<p>" + res[1] + " messages have been " + 
                    "retrieved, last hash: " + res[0] + "</p>");
        } catch (Exception ex) {
            resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
            ex.printStackTrace(resp.getWriter());
        } finally {
            resp.getWriter().flush();
        }
    }
    
    private int[] getRecordsCount(int size) throws Exception {
        try (Connection conn = getDS().getConnection();
             PreparedStatement pstmt = conn.prepareStatement(
                     "SELECT * FROM ( " + 
                       "       SELECT ROW_NUMBER() OVER (ORDER BY empno) AS rownumber," + 
                       "       empno, first_name, last_name, salary " + 
                       "       FROM ROOT.EMP2" + 
                       ") AS EMPLOYEE " + 
                       "WHERE rownumber between ? and ? ")) {
            pstmt.setInt(1, 1);
            pstmt.setInt(2, size);
               
            try (ResultSet res = pstmt.executeQuery()) {
                int lastHash = 0; 
                int count = 0;
                while(res.next()) {
                    lastHash = res.getInt(1) * 1000 + res.getString(2).hashCode() 
                               + res.getString(3).hashCode() + res.getString(4).hashCode() 
                               + res.getBigDecimal(5).hashCode();
                    count++;
                }                   
                
                return new int[]{lastHash, count};
            } 
        }
    }
}

The lastHash variable is calculated due to deal with the dead code elimination JIT compiler optimization.

Two versions of the servlet are deployed on Liberty Profile, both are the sub-classes of the AbstractConnectionBenchmarkServlet class and contain the same business logic, but they are bound to different database pools, which used the JDBC Type 2 and JDBC Type 4 drivers respectively.

Data


After successful execution of the golden run, the following results were gone:

# Run complete. Total time: 00:27:32

Benchmark                        (port) (server)  (size)  (zdbc)  Mode  Cnt  Score   Error  Units
DB2ConnectionsBenchmark.message    9081  mvz2      10      t2     avgt  100  0.886 ± 0.010  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2      10      t4     avgt  100  1.139 ± 0.019  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2      25      t2     avgt  100  0.911 ± 0.011  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2      25      t4     avgt  100  1.156 ± 0.009  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2      50      t2     avgt  100  0.948 ± 0.016  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2      50      t4     avgt  100  1.212 ± 0.007  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2     100      t2     avgt  100  1.035 ± 0.007  ms/op
DB2ConnectionsBenchmark.message    9081  mvz2     100      t4     avgt  100  1.363 ± 0.029  ms/op

Here are two variables, which specify the behaviour of the benchmark, zdbc is a type of used driver, and size specifies the number of records retrieved from the database on each iteration.

For better understanding of the results, let's plot a chart:


The performance advantage gained by just leveraging the JDBC Type 2 driver even on this simplest workload is about 30%. An application gains about 30% additional performance just when the right JDBC driver is taken into account. Not so bad!

Analysis


IBM Jinsight is an amazing tool for looking into the code execution. The Jinsight profiler consists of three parts: a JVMTI-based agent running in couple with the application, a control tool, which is used for start and stop profiling, both parts are platform dependent and there are only for z/OS, Linux on z and Linux on x. The agent generates a dump file during their work. For dump visualizing and analysis, there is a GUI-based tool referenced as JinsightLive. The tool is cross-platform and is able to be launched on a Windows machine, for example.

Let's look how exactly the tested application work in two different modes: JDBC T2 and T4. The JDBC T2 execution path is shown on the following picture:


JDBC T4:


As shown on the above pictures, the method 'executeQuery' is very hot and the method includes the socket read operation for JDBC T4, and the 'nativeOpen' operation as opposite for the JDBC T2 kind of drivers.

Jinsight is able to show how much time is taken by a method or a tree of some selected methods. For deeply learning how methods work, this kind of information is more usable than the execution diagram. In the bellow part of the article, actual JDBC Type 2 and corresponding JDBC Type 4 methods are shown for the comparison purposes.

There is a method tree for the JDBC Type 2 option that retrieves 10 records from the database:


There are two hot methods shown on the picture above: the 'executeQuery' and 'next' ones. As opposite, the JDBC Type 4 variant spends more time on the 'executeQuery' method and a little bit less time on the 'next' one:


Let's learn a little bit deeply how the 'executeQuery' method works for JDBC Type 2:


The 'nativeOpen' method is the heart of the 'executeQuery' one. The method reads data from the database using the z/OS Cross Memory services. WebSphere Application Server calls a DB2 address space using a program call (PC) instruction. So, only a single hardware instruction is needed to call a program from another address space, i.e. DB2.

For JDBC Type 4, the method 'executeQuery' contains the 'socketRead' operation, so as expected, the JDBC Type 4 driver retrieves the data from the database leveraging a network connection and as expected it takes a bit more time than a program call through the cross memory services:


The second hottest method for the JDBC Type 2 driver is the 'next' method contains two hottest parts: the 'closeX' method and the 'nextX' one. The 'closeX' method involves the RRS services for transaction management and contains the commit operation ('readLocalCommit'/'nativeCommit').


The most time is spent in these methods, so this fact is a reason why JDBC Type 2 loads only about 30% faster: there is no this kind of overhead on transaction commit for the JDBC Type 4 driver.


By default, the JDBC Type 2 driver retrieves only one row during each 'ResultSet#next()' operation (see the 'nativeFetch' method). Even though one single iteration takes not so much time, when the number of retrieved records is increasing, the method duration grows-up.


There is an option in DB2, called as Multi-row FETCH. This type of operation can provide better performance than retrieving one row with each FETCH statement. When the application retrieves data, the IBM Data Server Driver for JDBC and SQLJ determines whether to use multi-row FETCH, depending on several factors:

  • The setting of the enableRowsetSupport property;
  • The setting of the useRowsetCursor property, for connections to DB2 for z/OS;
  • The type of IBM Data Server Driver for JDBC and SQLJ connectivity that is being used;
  • The version of the IBM Data Server Driver for JDBC and SQLJ.

More information about Multi-row FETCH is on the Multi-row SQL operations in JDBC applications page in IBM Knowledge Center.

Using Jinsight, the following execution diagram could be visualized when the Multi-row FETCH option is taken into account.


The data is retrieved from the database by the 'getNextRowset' method while the ResultSet#next() operation is executed the first time. Each other execution of the 'next' method just checks if the corresponding row is in the current rowset by using the 'rowIsInCurrentRowset' method.

CPU consumption


Let's investigate how much processor resource is needed to support each type of connectivity between an application server and a database. The first step of this analysis is to get the information about the resources consumption. We use the z/OS Workload Manager classification to assign a unique report class for each servlet associated with the corresponding sort of JDBC drivers. The information for the classification is provided by the Liberty server and specified in the wlmClassification element contained in the server.xml configuration file.

<?xml version="1.0" encoding="UTF-8" ?>
<server description="WLM configuration">
    <featureManager>
        <feature>zoswlm-1.0</feature>
    </featureManager>

    <wlmClassification>
        <httpClassification transactionClass="DBMTRAT2" resource="/sync-to-thread/dbtest/t2" />
        <httpClassification transactionClass="DBMTRAT4" resource="/sync-to-thread/dbtest/t4" />
        <httpClassification transactionClass="DBMDTRAN" />
    </wlmClassification>

    <zosWorkloadManager collectionName="CZSR01" />
</server>

The REPBMRT2 and REPBMRT4 report classes are assigned to the corresponding transaction classes specified above. The same service class is assigned to both transaction classes, therefore the same performance goals are taken into account for both servlets.


When the JDBC Type 2 driver is leveraged, a single WLM enclave is associated with the thread that a request is dispatched on. The enclave will be carried to the DB2 database as part of the flow. In this case there is no enclave for the database subsystem, instead, it will use the enclave created by the application server, therefore CPU time used by the dispatched request in the application server as well as the database subsystem will be reported together for the enclave.


When the JDBC Type 4 driver is taken into account, the situation is a bit different. The distributed data facility (DDF) allows client applications that run in a remote environment to access data in DB2 servers. Its own set of enclaves will be created for DDF, so the CPU time used by the dispatched request in the database subsystem will be reported separately.


The PPXSRPTS reports generated by the RMF Postprocessor are used as a source of the following data. The reports were gathered based on 15 minutes measuring intervals while the benchmark was running in four threads retrieving by 50 records from the database. Let's me show a chart based on the APPL % values present in the reports.


The chart demonstrates the JDBC Type 4 driver allows to get a significant advantage from offloading eligible workload to zIIP Special Engines. As we can see, when JDBC Type 4 driver is utilized, the whole Liberty workload is offloaded to zIIP as well as the majority of the database one.

Plain data


All produced during the golden runs data such as run statistics and reports as well as the application server configuration are collected into the corresponding folders inside a repository on GitHub.

Conclusions


Well, it seems the benchmark is an argument to show the advanced WebSphere Liberty Profile for z/OS values. The provided by z/OS inter-process communication mechanism, which is a program call (PC) hardware instructions, demonstrates the significant advantages over any kind of the network interaction, even the loopback interface. Up to 30% per a database query of the additional performance can be gained just by using the right JDBC driver. But on the other hand, this kind of driver pushes the whole workload on the General Processors and eliminates the advantages of zIIP engines utilization, such as the low pricing and not imposing IBM software charges on the capacity provided by. System architects, which develop new Java-based mainframe solutions based on WebSphere Liberty Profile inside CICS for example, should take in mind this trade-off between the gained performance and the level of resource consumption.

Would you like to give a 'Like'? Please follow me on Twitter!