How-To's

Optimizing Load Balancing in PLINQs with ReSharper Ultimate

When filtering large amounts of data, using a PLINQ instead of a sequential LINQ can provide significant performance boosts. Still, optimal parallelization is quite a complex task that depends on a variety of factors, such as the number of CPU cores, the nature of input data, and others. In some cases, a query that seems parallelizable works even slower after being transformed into PLINQ.

Optimizing a PLINQ is, first of all, the task of finding the optimal load balance between worker threads. This usually requires multiple experiments with various input data, loads, and computer configurations. By using the right tools, you can find the best solution faster. In this post we’ll explain why ReSharper Ultimate (or more specifically ReSharper with dotTrace integrated in Visual Studio) is possibly the best candidate and why it is more convenient than the standard approach of measuring overall query time.

Sample application

Suppose we have a large project with many subsystems and we’ve just implemented a new feature that allows searching for a specific string in files stored inside zip packages*. Our application will use it to find particular files in numerous zips.

*This example, as well as the entire post, is inspired by PLINQ performance problems we faced when processing multiple NuGet packages in ReSharper. However, this post is not about optimizing a specific algorithm; it’s more about giving you a basic understanding of how you can analyze PLINQ performance with ReSharper Ultimate.

Our feature is based on the ZipPackage class (representing a zip package) and the FileInZip class (representing a file inside a zip). The FileInZip class provides the ContainsString method that finds a particular string in the file. This allows us to use simple LINQs for searching (e.g., …where file.ContainsString(str)…). To perform the search, a FileInZip instance extracts the corresponding file from a zip package to MemoryStream. We won’t provide the exact class implementation in this post but you can get the entire example on GitHub.

Sample app classes

Creating a run configuration

Now we want to test the feature on some sample zip packages without running the entire application (remember, it’s huge). Should we create an integration test for this? No, much easier. With ReSharper Ultimate we are able to run specific static methods using run configurations (read this post for a lot more details about the feature). In a nutshell, we should write a static method that contains only the code we need. ReSharper Ultimate’s run configurations feature will allow us to run just this method in a separate JetLauncher process. In this way we can run and debug any part of our code by isolating it from the main application. For example, anywhere in our project we can create the following static method:

        public static void TestStringInZipFeature()
    	{
 	        // We prepared 15 sample zips (1.zip - 15.zip) with text files
        	// inside. Some of them contain the string 'test'.      
        	var packages = new List();
        	for (int i = 1; i <= 15; i++)     	
            	packages.Add(new ZipPackage(string.Format("c:\\zips\\{0}.zip", i)));  

        	var str = "test";

        	// LINQ
        	var query = (from zip in packages
            	             from file in zip.Files
                             where file.ContainsString(str)
                             select file).ToList();                                   	

        	Debug.WriteLine("Number of files containing '{0}': {1}", str, query.Count);
    	}

Then we just use Alt+Enter on the method and select Debug | Run:Method's action list

This opens the Run Configuration Properties window. Clicking Save & Execute will run the method (under the JetLauncher process) and save the particular run configuration for future use.

Run configuration properties

Our method writes the result to the debug output. Debug output

OK, the algorithm works as expected, but how can we measure its effectiveness? Let’s profile our method! If you have ReSharper Ultimate (or both ReSharper and dotTrace) installed on your machine, you can profile any run configuration. So, all we need to do is start profiling the run configuration we’ve just created.

Profiling run configuration

To profile a static method, we once again use Alt+Enter on the method and select Profile (<profiling type>)*, where <profiling type> is the type of profiling that will be used by dotTrace. We’ll use the Timeline profiling type as it not only provides time data for method calls but also binds them to the timeline, which is extremely important when analyzing multi-threaded applications.

*The default profiling type can be selected in Performance Snapshots Browser which can be opened via ReSharper | Profile | Show Performance Snapshots… Profile through the action list

This runs profiling and automatically takes a performance snapshot when the process completes its work. Then the collected snapshot is opened in dotTrace Timeline Viewer.

If you’re new to Timeline Viewer, find more details about it here. In a few words, it is a part of dotTrace that allows you to slice and dice the collected snapshot data using filters. For example, select time intervals where a certain event takes place (UI freeze, GC and so on) and find out what methods stand behind that. Or, vice versa, select time intervals where a specific method is executed and see what subsystems are working at this moment.

If we now take a look at the Threads Diagram in Timeline Viewer, we’ll see that the profiled static method utilizes just one Main thread (as expected) and requires about 70 seconds to complete:Threads Diagram showing sequential LINQ

The Top Methods filter shows that the highest own time belongs to the Ionic.Crc.CRC32.SlurpBlock method. This is kind of expected as this method is responsible for zip extraction (the app uses the DotNetZip library to work with zip packages).

Top Methods and Call Tree

As 70 seconds is too long for us, we decide to parallelize our LINQ so that multiple threads can unzip packages at the same time.

Using PLINQ

The most obvious solution is to add the AsParallel method to the input sequence in our LINQ.

var query = (from zip in packages.AsParallel()
            	  from file in zip.Files
                  where file.ContainsString(str)
                  select file).ToList();

After profiling, we see that this solution kind of works – overall execution time has decreased to 39 seconds. Still, that result is far from ideal. The Threads Diagram shows that the data is split among 8 threads and, while some threads do most of the job, others complete their work much earlier.Timeline Diagram for PLINQ with AsParallel

So, why does this happen? Let’s use dotTrace’s filtering capabilities to find out the root cause. Our input data are zip files stored on a disk, so we should take a look at how threads treat these files. To do this, we simply turn on the File I/O filter in Interval Filters. Now, all other filters show the data only for the time intervals where file I/O takes place.

Threads Diagram with applied File IO filter

The File I/O: File Name sub-filter shows how File I/O time is distributed among specific files:

File IO File Name subfilter

The sub-filter states that some files require more time than others. If we now turn on a filter by a specific file, we’ll see that each zip file is processed by exactly one thread:

Threads Diagram with applied File IO File Name filter

Therefore, the problem lies in the zip-per-thread load partitioning. As zip files have different sizes, threads take different amounts of time to process them. This means we need to implement partitioning in a way that lets each thread extract files from any zip package.

Custom partitioner

Our solution is to create a custom partitioner that will split the input data into equal chunks. In our case the partitioner must provide not zip packages to PLINQ, but the files inside these packages. For this purpose, we can try using a “default” partitioner, an instance of the Partitioner class. We just need to prepare the input data for the partitioner (files in zips) and specify whether we want to use dynamic or static partitioning (this is specified in the partitioner’s Create method).

With dynamic partitioning, it will try to balance the load dynamically so files will be provided to threads as they are requested. This can provide a speed boost on computers with many cores. The static partitioner prepares all chunks before the start, resulting in a lower overhead. This approach may be faster on computers with only a few cores.

Let’s try dynamic partitioning:

                
                // Select all files
        	var filesInZips = (from zipPackage in packages
                                   from file in zipPackage.Files
                                   select file).ToList();

        	// Default dynamic partitioner
        	var customPartitioner = Partitioner.Create(filesInZips, true);

        	// PLINQ
        	var query = (from file in customPartitioner.AsParallel()
                     	     where file.ContainsString(str)
                     	     select file).ToList();

After we profile the method, we can see that the load is now much more balanced. The makes the method take only 29 seconds to complete, which means a 10-second (25%) improvement. The default static partitioner on an 8-core computer shows almost the same results:

Threads Diagram for default dynamic partitioner

As we can see on Threads Diagram though, the partitioning is still not perfect as some threads complete their work a little bit earlier than the others. This happens because the “default” partitioner knows nothing about the internal structure of our input data (text files).

The problem is that bigger files require more time for extraction. This means we need to fill the chunks not based on the number of extracted files, but on their size. We can try to solve this task by creating our own static partitioner. Such a partitioner must be based on the Partitioner class and must implement the GetPartitions method, which returns enumerators for the given number of partitions.

On an 8-core machine, our custom partitioner works a little better than the default partitioners. The load is more balanced now with about 10% additional improvement (from 29 seconds down to 26).

Threads Diagram for custom static partitioner

What next? The only way to ensure your PLINQ works as fast as possible is to test it on as many configurations as possible. For example, if we try all the partitioners we’ve just created on a 4-core PC, we’ll see somewhat different results:

Threads Diagrams on a 4-core PC

While the default static and dynamic partitioners are still close, the custom one increases its lead. It takes 30 seconds, which is about a 15% improvement compared to 35-37 seconds. The small difference in absolute time values between 8-core and 4-core PCs may be explained by the large number of garbage collections which block worker threads (as our algorithm allocates a LOT of memory for zip extraction). Excessive GCs would make a good target for optimization too, but that’s a story for another post.

As you can see, PLINQ performance offer rich opportunities for experimenting, and ReSharper Ultimate can help a lot. We hope you find this post helpful and give a try to ReSharper Ultimate features on your own.

image description