Janea Systems Diagnoses Major Matrix Multiplication Operation Performance Issue on PyTorch

PyTorch, hailed as the go-to open-source machine learning library, is trusted worldwide by developers and companies such as OpenAI. The library was initially built for Linux, and since 2022 Microsoft has entrusted Janea Systems with its upkeep on Windows, bolstering our commitment to democratizing access to machine learning and ensuring PyTorch remains an inclusive and robust platform for Windows users.

An open issue on the PyTorch GitHub repository described how a contributor had noticed that matrix multiplication was over five times slower on Windows. Because matrix multiplication is a fundamental operation in machine learning, we stepped in to help. Read on to learn about matrix multiplication and how our Senior Software Engineer, Ionut Manta, investigated the problem.

What is Matrix Multiplication: The Heart of Computational Tasks

Matrix multiplication is a fundamental operation in mathematics and computer science. It's used extensively in fields such as physics, engineering, computer graphics, and machine learning. At its core, multiplying matrices involves combining two two-dimensional arrays of numbers (matrices) to produce a new matrix. This operation is crucial for performing linear transformations, solving systems of linear equations, and manipulating data in algorithms.

Matrices represent datasets, weights, and transformations in neural networks. Multiplying matrices helps in forward propagation and backpropagation algorithms, essentially enabling the training and functioning of neural networks.
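To make the operation concrete, here is a minimal sketch in plain Python. It is not how PyTorch implements matrix multiplication; the function name `matmul` and the sample values are illustrative assumptions. Each entry of the result is the sum of products of a row of the first matrix with a column of the second.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows (illustrative only)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of a must equal rows of b")
    cols = len(b[0])
    # Entry (i, j) is the dot product of row i of a with column j of b.
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


a = [[1, 2], [3, 4], [5, 6]]       # 3 rows, 2 columns
b = [[7, 8, 9], [10, 11, 12]]      # 2 rows, 3 columns
product = matmul(a, b)             # 3 rows, 3 columns
print(product)
```

A (3, 2) matrix multiplied by a (2, 3) matrix yields a (3, 3) matrix: the inner dimensions must match, and the outer dimensions give the shape of the result.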

Benchmarking Matrix Multiplication Operations: Linux vs Windows

PyTorch contributor linssswww benchmarked the time taken to complete a matrix multiplication operation on Windows and on Linux. When the input matrix is large (their example used 1,000,000 rows and 4 columns, shape (1000000, 4)), the time taken on both platforms is similar, at around 0.101 to 0.109 seconds. However, when the input matrix is small, with 1,000 rows and 4 columns (shape (1000, 4)), Linux is roughly 5.5 times faster than Windows (0.00567626953125 s on Linux vs 0.031242 s on Windows).

Impact on Small Matrix Operations

When a machine learning application involves numerous operations on small matrices, even the slightest delays add up, leading to significantly longer processing times. For instance, if a machine learning algorithm needs to perform thousands of these small matrix multiplications as part of its computation—such as during the training phase or when making predictions—the additional time taken for each multiplication can make for a substantial delay.

Examples of the Impact of Slow Matrix Multiplication Operations in Machine Learning

Consider a real-time machine learning application, like a recommendation system on an e-commerce website that relies on matrix operations to predict and instantly display product recommendations as a user browses. This system might use small matrices to represent user and product features, and through matrix multiplication, it computes the likelihood of a user being interested in a particular product.

On Windows, where the operation takes approximately 0.031242 seconds for a small matrix, as opposed to 0.00567627 seconds on Linux, the system would experience latency issues. This difference might be negligible for a single recommendation, but the latency adds up in scenarios where thousands of recommendations are generated simultaneously. This could result in a slower response time, potentially impacting user experience by making the website appear sluggish.
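The arithmetic behind this can be sketched in a few lines of Python. The two timings are the figures reported in the benchmark; the call counts and the assumption that operations run one after another, each costing the reported time, are illustrative simplifications.

```python
windows_s = 0.031242            # reported small-matrix time on Windows
linux_s = 0.00567626953125      # reported small-matrix time on Linux


def accumulated_gap(calls):
    """Extra seconds spent on Windows over `calls` sequential operations."""
    return calls * (windows_s - linux_s)


slowdown = windows_s / linux_s
print(f"slowdown: {slowdown:.1f}x")
for calls in (1, 1_000, 100_000):
    print(f"{calls:>7} calls: {accumulated_gap(calls):.2f} s extra")
```

Under these assumptions, a gap of about 0.026 seconds per operation grows to roughly 25.6 seconds over 1,000 sequential operations.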

Broader Implications

In a broader scope, developers who are running experiments and training models on Windows machines might find that their iteration time is slower due to these performance issues. This can extend the time needed for model development and experimentation. For machine learning engineers who need to quickly train and test multiple models to find the best one, a slower iteration cycle could lead to longer project timelines and increased costs, especially if they are relying on cloud-based Windows machines where compute time is billed by the minute or second.

Investigation and Technical Insight

Ionut Manta, Senior Software Engineer at Janea Systems, set out to reproduce and investigate the issue, boiling it down to a small example. His work revealed that the performance discrepancy stemmed from the initialization phase of matrix multiplication operations. While the first call to the function exhibited similar timings across both platforms, subsequent calls on Windows took almost twice as long. This behavior was particularly noticeable in small-matrix operations, where initialization time became a significant performance factor.
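Because the first call and later calls behaved differently, a benchmark for this kind of problem needs to time them separately. The following is a minimal sketch of that idea, not the script used in the investigation: the helper `time_calls` and the pure-Python stand-in workload are assumptions made for illustration.

```python
import time


def time_calls(fn, repeats=100):
    """Return (first_call_seconds, mean_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    later = (time.perf_counter() - start) / repeats
    return first, later


# Stand-in workload (assumption): a (1000, 4) by (4, 4) product in plain Python.
a = [[float(i + j) for j in range(4)] for i in range(1000)]
b = [[float(i * j) for j in range(4)] for i in range(4)]


def workload():
    return [[sum(r[k] * b[k][j] for k in range(4)) for j in range(4)] for r in a]


first, later = time_calls(workload, repeats=20)
print(f"first call: {first:.6f} s, later calls: {later:.6f} s on average")
```

When the workload runs on a GPU, the timed function would also need to wait for the device to finish before the clock is read (in PyTorch, by calling `torch.cuda.synchronize()`), since CUDA kernels are launched asynchronously.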

Further exploration traced the issue back to specific code segments within PyTorch's CUDA implementation. By creating a minimal example that isolated the initialization phase, Ionut not only confirmed the bug but also provided a clear path for addressing it. From here, Ionut reported the issue to NVIDIA, contributing to a broader understanding of the performance challenges across platforms.

This work started with analyzing the full PyTorch codebase, taking us through loops of benchmarking subsections of the code path that multiplies matrices. Ionut investigated several areas of the code, such as memory allocation, the performance of standard containers, and optimizations done by different compilers. By excluding sections that performed similarly on both platforms, Ionut was able to narrow the issue down to the CUDA section. One major difficulty was that the performance difference did not show on the first invocation, so when the code was reduced to its simplest form, the difference disappeared.
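One simple way to benchmark subsections of a code path is to accumulate wall-clock time per named stage and compare the totals across platforms. The sketch below illustrates the general technique only; the `timed` helper and the stage names are hypothetical and do not correspond to actual PyTorch internals.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the wall-clock time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


# Hypothetical stages standing in for parts of a larger code path.
for _ in range(50):
    with timed("allocate"):
        buffer = [0.0] * 4000
    with timed("fill"):
        for i in range(len(buffer)):
            buffer[i] = i * 0.5
    with timed("reduce"):
        total = sum(buffer)

# Print the stages from slowest to fastest.
for stage, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>8}: {seconds:.6f} s")
```

Running the same instrumented code on both platforms and setting aside the stages whose totals match leaves a shorter list of candidates to examine. The loop matters here: a single pass would only capture first-invocation behavior.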

The Benefit of Matrix Multiplication Performance Enhancement on PyTorch

The issue investigated by Ionut and the team has significant implications:

Efficiency: For applications that frequently perform small matrix multiplications, reducing the initialization time on Windows platforms means computations can be done faster, improving the overall performance.
Resource Optimization: More efficient matrix operations can lead to lower CPU and GPU usage, saving energy and allowing for more computations to be done in parallel.
User Experience: Developers and researchers using PyTorch on Windows for machine learning tasks can expect smoother, more consistent performance, especially in applications requiring rapid computations, such as real-time data analysis or interactive simulations.
Accessibility: By improving performance on Windows, PyTorch becomes more accessible to a broader audience, many of whom use Windows as their primary development environment. This can help democratize access to machine learning tools and resources.

Conclusion

The initiative to diagnose the matrix multiplication performance issue on Windows platforms represents a significant step forward in ensuring PyTorch's reliability and efficiency across operating systems. Improving PyTorch not only enhances the user experience for developers and researchers but also strengthens its foundation as a versatile tool for machine learning and deep learning applications. As we progress, the resolve and collaboration demonstrated by the PyTorch community will continue to drive innovations and optimizations, making advanced machine learning libraries like PyTorch more accessible and efficient for everyone.

Ionut Manta is a Senior Software Engineer at Janea Systems. Learn more about him here.

This particular issue is one of hundreds we've looked into on PyTorch. Find out more about our ongoing work ensuring PyTorch performs optimally on Windows here.