Supercharge ML performance on xPUs with the new XProf profiler and Cloud Diagnostics XProf library

Faster profile loading and tool transitions

Another major change to the TensorBoard XProf plugin is faster loading. We added multithreading support to the plugin, which lets larger profiles load much faster. We also added caching, which makes subsequent loads even faster and lets users move smoothly between the tools in XProf while doing their performance optimization.

Machine options for hosting TensorBoard + XProf with XProfiler

By default, XProfiler create creates a VM of machine type c4-highmem-8. You can change the machine type with the -m flag. If you would rather host TensorBoard on a GKE pod, pass the --GKE flag to XProfiler create. Some customers prefer this because it makes it easier to manage the TensorBoard instance alongside the rest of their workload deployed on GKE.

Hosting TensorBoard on a VM or GKE pod makes loading large profiles and multiple profiles much faster on Google Cloud than with a locally hosted TensorBoard. Based on our benchmarking, profiles on the order of 1 GB load within a few minutes on first load using the default c4-highmem-8 VM. You can choose a different machine type based on your performance and cost needs.

Link for sharing profiles

After you run XProfiler create, you will see a message like the following:

Instance for gs:// has been created.

You can access it via following,

1. https://-dot-us-.notebooks.googleusercontent.com.

2. XProfiler connect -z -l gs:// -m ssh

Instance is hosted at XProf-97db0ee6-93f6-46d4-b4c4-6d024b34a99f VM.

Note: The first option (1) is a link that you can click to view your XProf profiles on TensorBoard. Performance optimization is a highly iterative and collaborative process. To enable this collaboration, the Cloud Diagnostics XProf library creates a link to the TensorBoard instance so that users can easily share their profiles with their teams and with Google engineers helping with performance optimization on Google Cloud. You control who has access to the link through the permissions set on the Cloud Storage bucket that the TensorBoard instance points to.

If the link doesn’t work for some reason, you can also SSH into the TensorBoard instance and view your profiles using the XProfiler connect command.

On-demand profile capture

If you enabled the profiler server in your workload code and want to perform on-demand profiling, you can do this in two ways:

Click the “Capture profile” button in the TensorBoard UI. We support on-demand capture for workloads running on GKE and Compute Engine.
Use XProfiler capture in the CLI, providing the same kind of information as you would through the “Capture profile” button in the TensorBoard UI.

New capabilities of XProf

With the updated XProf, users will see many new features in the most popular tools, which include:

Trace viewer
Memory viewer/memory profile
Graph viewer
HLO Op profile/HLO Op stats
Overview page

Most notably, the memory viewer now shows seven types of memory: HBM (high bandwidth memory) and host memory, plus, on TPUs, SparseCore, VMEM, SMEM, CMEM, and Sync Flags (SFlag).

Many of the links from Trace viewer and HLO Op profile back to Graph viewer now work seamlessly for all Ops. We have also improved source line visibility to cover more Ops.

The most common flow used for finding performance bottlenecks using XProf looks something like the following:

An ML engineer (MLE) opens XProf in TensorBoard and looks at different Ops in Trace viewer or HLO Op profile.
They click on the Op that they are interested in to get more details.
For this Op, they click on the link to Graph viewer to see where the Op sits in their model.
They check Memory viewer to see whether HBM/host/SparseCore memory is used efficiently for the model and for the specific Op.
Once they have determined which Ops they want to optimize, they look at the source code line for those Ops in order to implement any optimizations.

The updated XProf tool makes this entire flow smooth and easy.

New XProf tools

We have also released a few new tools in XProf, including:

Framework op stats: performance statistics of framework-level operations (e.g., JAX or TensorFlow).
Roofline: visually see whether your program is memory-bound or compute-bound, and how close the program’s performance is to the hardware’s theoretical peak performance, represented as a “roofline”.
Megascale stats: analyze multi-slice communication performance of workloads spanning multiple TPU slices that communicate across the Data Center Network (DCN).
GPU kernel stats: performance statistics and the originating framework operation for every GPU-accelerated kernel that was launched during a profiling session.
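
To make the roofline idea concrete, here is a minimal Python sketch of the arithmetic behind a roofline plot. It is not XProf's implementation. The Hardware class, the roofline function, and the peak numbers are hypothetical and chosen only for illustration. An op's attainable throughput is capped either by peak compute or by memory bandwidth multiplied by its arithmetic intensity (FLOPs per byte moved), whichever is lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical accelerator limits (illustrative numbers, not a real chip)."""
    peak_flops: float      # peak compute throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s


def roofline(flops: float, bytes_moved: float, hw: Hardware):
    """Return (arithmetic intensity, attainable FLOP/s, bound) for one op."""
    intensity = flops / bytes_moved              # FLOPs per byte moved
    ridge = hw.peak_flops / hw.mem_bandwidth     # intensity where the two roofs meet
    attainable = min(hw.peak_flops, intensity * hw.mem_bandwidth)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    return intensity, attainable, bound


# Assumed limits for the example: 100 TFLOP/s compute, 1 TB/s memory bandwidth.
hw = Hardware(peak_flops=100e12, mem_bandwidth=1e12)
result = roofline(flops=2e9, bytes_moved=1e8, hw=hw)
print(result)
```

In this example the op performs 20 FLOPs per byte, which is below the ridge point of 100 FLOPs per byte for the assumed hardware, so it is memory-bound and capped at 20 TFLOP/s no matter how fast the compute units are.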

Pallas kernel visibility in XProf

One of the most requested improvements to XProf is better performance visibility for Pallas kernels. Previously, these kernels were displayed in XProf as “custom calls”, and it was hard to see details of the custom call's performance and implementation. We are very happy to announce increased support and visibility for Pallas kernels within XProf. You can now see more details of your Pallas kernel in both HLO Op profile and Graph viewer. For each Pallas kernel custom call, you will see the name of the kernel if it is a common Pallas kernel, and when you click it, you will see performance and other information about the kernel in the side panel. For the side panel to show accurate performance metrics, the kernel author must provide a cost model by passing a pl.CostEstimate object to their pallas_call function. In addition, there is a “custom call text” button where the user can see more details about the Pallas kernel implementation.
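
As an illustration of what such a cost model might contain, the sketch below computes rough numbers for a matrix-multiply kernel in plain Python. The helper matmul_cost and the example shapes are hypothetical, and the byte count assumes each input is read once and the output is written once, which a real kernel may not match. The three keys mirror the fields of pl.CostEstimate (flops, transcendentals, bytes_accessed).

```python
def matmul_cost(m: int, k: int, n: int, itemsize: int) -> dict:
    """Rough cost numbers for an (m, k) @ (k, n) matmul (hypothetical helper)."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    # Assumption: each input is read once and the output is written once.
    bytes_accessed = itemsize * (m * k + k * n + m * n)
    return {"flops": flops, "transcendentals": 0, "bytes_accessed": bytes_accessed}


# Example shapes with 2-byte elements (e.g. bfloat16).
cost = matmul_cost(512, 1024, 256, itemsize=2)
print(cost)

# With JAX installed, the numbers would be handed to the kernel along these lines,
# where `kernel` and `out_shape` stand in for your own kernel and output spec:
#   from jax.experimental import pallas as pl
#   pl.pallas_call(kernel, out_shape=out_shape,
#                  cost_estimate=pl.CostEstimate(**cost))
```

The estimate only needs to be representative: it gives the profiler a basis for reporting utilization for a custom call that it cannot otherwise inspect.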
