.. _cpp-custom-ops-tutorial-sycl:

Custom SYCL Operators
=====================

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
       :class-card: card-prerequisites

       * How to integrate custom operators written in SYCL with PyTorch

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
       :class-card: card-prerequisites

       * PyTorch 2.8 or later for Linux
       * PyTorch 2.10 or later for Windows
       * Basic understanding of SYCL programming

.. note::

  ``SYCL`` serves as the backend programming language for Intel GPUs (device label ``xpu``). For configuration details, see:
  `Getting Started on Intel GPUs <https://docs.pytorch.org/docs/main/notes/get_start_xpu.html>`_. The Intel Compiler, which comes bundled with `Intel Deep Learning Essentials <https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html>`_, handles ``SYCL`` compilation. Ensure you install and activate the compiler environment prior to executing the code examples in this tutorial.

PyTorch offers a large library of operators that work on Tensors (e.g. ``torch.add``, ``torch.sum``).
However, you may wish to bring a new custom operator to PyTorch. This tutorial demonstrates the
best path to authoring a custom operator written in SYCL. Tutorials for C++ and CUDA operators are available in :ref:`cpp-custom-ops-tutorial`.
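
Before building anything, it is worth confirming that PyTorch can actually see your Intel GPU. The following quick check is not part of the tutorial files; it simply verifies that the ``xpu`` device is usable:

.. code-block:: python

    import torch

    # Verify that the XPU backend is available before building the extension.
    print(torch.xpu.is_available())   # should print True

    # Tensors on Intel GPUs use the "xpu" device label.
    x = torch.ones(4, device="xpu")
    print(x * 2 + 1)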

Use the following project structure to create a custom SYCL operator:

.. code-block:: text

  sycl_example/
  ├── setup.py
  ├── sycl_extension
  │   ├── __init__.py
  │   ├── muladd.sycl
  │   └── ops.py
  └── test_sycl_extension.py
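
Here ``setup.py`` drives the build, ``sycl_extension/muladd.sycl`` contains the kernel and operator registration, ``sycl_extension/ops.py`` exposes the Python-facing wrapper, ``sycl_extension/__init__.py`` loads the compiled library, and ``test_sycl_extension.py`` checks the results against a reference implementation.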

Setting up the Build System
---------------------------

To compile **SYCL** code (note that the source files must use the ``.sycl`` extension), use `torch.utils.cpp_extension.SyclExtension <https://docs.pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.SyclExtension>`_.
The setup process closely mirrors that of C++/CUDA extensions; only the compilation arguments need to be adjusted for SYCL.

Using ``SyclExtension`` is as straightforward as writing the following ``setup.py``:

.. code-block:: python

    import os
    import torch
    import glob
    import platform
    from setuptools import find_packages, setup
    from torch.utils.cpp_extension import SyclExtension, BuildExtension

    library_name = "sycl_extension"
    py_limited_api = True

    IS_WINDOWS = (platform.system() == 'Windows')

    if IS_WINDOWS:
        cxx_args = [
            "/O2",
            "/std:c++17",
            "/DPy_LIMITED_API=0x03090000",
        ]
        sycl_args = ["/O2", "/std:c++17"]
    else:
        cxx_args = [
            "-O3",
            "-fdiagnostics-color=always",
            "-DPy_LIMITED_API=0x03090000"
        ]
        sycl_args = ["-O3"]

    extra_compile_args = {
        "cxx": cxx_args,
        "sycl": sycl_args
    }

    assert torch.xpu.is_available(), "XPU is not available, please check your environment"

    # Source files collection
    this_dir = os.path.dirname(os.path.abspath(__file__))
    extensions_dir = os.path.join(this_dir, library_name)
    sources = list(glob.glob(os.path.join(extensions_dir, "*.sycl")))

    # Construct extension
    ext_modules = [
        SyclExtension(
            f"{library_name}._C",
            sources,
            extra_compile_args=extra_compile_args,
            py_limited_api=py_limited_api,
        )
    ]

    setup(
        name=library_name,
        packages=find_packages(),
        ext_modules=ext_modules,
        install_requires=["torch"],
        description="Simple Example of PyTorch Sycl extensions",
        cmdclass={"build_ext": BuildExtension},
        options={"bdist_wheel": {"py_limited_api": "cp39"}} if py_limited_api else {},
    )
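
With this ``setup.py`` in place, the extension can be built from the ``sycl_example/`` directory, for example with ``pip install --no-build-isolation -e .`` (make sure the Intel compiler environment is activated first, as described in the note above).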

Defining the custom op and adding backend implementations
---------------------------------------------------------

First, let's write a SYCL kernel that computes ``mymuladd``.

To use it from PyTorch's Python frontend, we need to register it as a PyTorch
operator using the ``TORCH_LIBRARY`` API; this automatically binds the operator
to Python. The SYCL implementation is then registered for the XPU backend in a
separate ``TORCH_LIBRARY_IMPL`` block:

.. code-block:: cpp

    #include <c10/xpu/XPUStream.h>
    #include <sycl/sycl.hpp>
    #include <ATen/Operators.h>
    #include <torch/all.h>
    #include <torch/library.h>


    #include <Python.h>

    namespace sycl_extension {

    // ==========================================================
    // 1. Kernel
    // ==========================================================
    static void muladd_kernel(
        int numel, const float* a, const float* b, float c, float* result,
        const sycl::nd_item<1>& item) {
        int idx = item.get_global_id(0);
        if (idx < numel) {
            result[idx] = a[idx] * b[idx] + c;
        }
    }

    class MulAddKernelFunctor {
    public:
        MulAddKernelFunctor(int _numel, const float* _a, const float* _b, float _c, float* _result)
            : numel(_numel), a(_a), b(_b), c(_c), result(_result) {}
        void operator()(const sycl::nd_item<1>& item) const {
            muladd_kernel(numel, a, b, c, result, item);
        }

    private:
        int numel;
        const float* a;
        const float* b;
        float c;
        float* result;
    };

    // ==========================================================
    // 2. Wrapper
    // ==========================================================
    at::Tensor mymuladd_xpu(const at::Tensor& a, const at::Tensor& b, double c) {
        TORCH_CHECK(a.sizes() == b.sizes(), "a and b must have the same shape");
        TORCH_CHECK(a.dtype() == at::kFloat, "a must be a float tensor");
        TORCH_CHECK(b.dtype() == at::kFloat, "b must be a float tensor");
        TORCH_CHECK(a.device().is_xpu(), "a must be an XPU tensor");
        TORCH_CHECK(b.device().is_xpu(), "b must be an XPU tensor");

        at::Tensor a_contig = a.contiguous();
        at::Tensor b_contig = b.contiguous();
        at::Tensor result = at::empty_like(a_contig);

        const float* a_ptr = a_contig.data_ptr<float>();
        const float* b_ptr = b_contig.data_ptr<float>();
        float* res_ptr = result.data_ptr<float>();
        int numel = a_contig.numel();

        sycl::queue& queue = c10::xpu::getCurrentXPUStream().queue();
        constexpr int threads = 256;
        int blocks = (numel + threads - 1) / threads;

        queue.submit([&](sycl::handler& cgh) {
            cgh.parallel_for<MulAddKernelFunctor>(
                sycl::nd_range<1>(blocks * threads, threads),
                MulAddKernelFunctor(numel, a_ptr, b_ptr, static_cast<float>(c), res_ptr)
            );
        });

        return result;
    }

    // ==========================================================
    // 3. Registration
    // ==========================================================
    TORCH_LIBRARY(sycl_extension, m) {
    m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
    }

    TORCH_LIBRARY_IMPL(sycl_extension, XPU, m) {
        m.impl("mymuladd", &mymuladd_xpu);
    }

    } // namespace sycl_extension

    // ==========================================================
    // 4. Windows Linker
    // ==========================================================
    extern "C" {
        #ifdef _WIN32
        __declspec(dllexport)
        #endif
        PyObject* PyInit__C(void) {
            static struct PyModuleDef moduledef = {
                PyModuleDef_HEAD_INIT,
                "_C",
                "XPU Extension Shim",
                -1,
                NULL
            };
            return PyModule_Create(&moduledef);
        }
    }
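
Note that ``float`` in the ``m.def`` schema string corresponds to C++ ``double``; this is why the wrapper ``mymuladd_xpu`` takes ``double c`` and casts it to ``float`` before launching the kernel.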


Create a Python Interface
-------------------------

Create a Python interface for our operator in the ``sycl_extension/ops.py`` file:

.. code-block:: python

  import torch
  from torch import Tensor
  __all__ = ["mymuladd"]

  def mymuladd(a: Tensor, b: Tensor, c: float) -> Tensor:
      """Performs a * b + c in an efficient fused kernel"""
      return torch.ops.sycl_extension.mymuladd.default(a, b, c)
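
Assuming the extension has been built and an Intel GPU is available, a quick smoke test of this wrapper looks like the following (the tensor shapes and values here are illustrative only):

.. code-block:: python

  import torch
  import sycl_extension

  a = torch.randn(8, device="xpu")
  b = torch.randn(8, device="xpu")

  # Calls the registered XPU kernel through the Python wrapper.
  out = sycl_extension.ops.mymuladd(a, b, 2.0)
  print(torch.allclose(out, a * b + 2.0))  # expected: True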

Initialize Package
------------------

Create ``sycl_extension/__init__.py`` file to make the package importable:

.. code-block:: python

    import ctypes
    import platform
    from pathlib import Path

    import torch

    current_dir = Path(__file__).parent.parent
    build_dir = current_dir / "build"

    if platform.system() == 'Windows':
        file_pattern = "**/*.pyd"
    else:
        file_pattern = "**/*.so"

    lib_files = list(build_dir.glob(file_pattern))

    if not lib_files:
        current_package_dir = Path(__file__).parent
        lib_files = list(current_package_dir.glob(file_pattern))

    assert len(lib_files) > 0, f"Could not find any {file_pattern} file in {build_dir} or {Path(__file__).parent}"
    lib_file = lib_files[0]


    with torch._ops.dl_open_guard():
        loaded_lib = ctypes.CDLL(str(lib_file))

    from . import ops

    __all__ = [
        "loaded_lib",
        "ops",
    ]
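
Loading the shared library with ``ctypes.CDLL`` runs the static ``TORCH_LIBRARY`` registration compiled into ``muladd.sycl``, which makes ``torch.ops.sycl_extension.mymuladd`` available before ``ops`` is imported. The ``torch._ops.dl_open_guard()`` context manager adjusts the dynamic-loader flags so that the extension can resolve symbols from PyTorch's own libraries while it is loaded.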

Testing the SYCL extension operator
-----------------------------------

Add a simple test in ``test_sycl_extension.py`` to verify that the operator works correctly:

.. code-block:: python

  import torch
  from torch.testing._internal.common_utils import TestCase
  import unittest
  import sycl_extension

  def reference_muladd(a, b, c):
      return a * b + c

  class TestMyMulAdd(TestCase):
      def sample_inputs(self, device, *, requires_grad=False):
          def make_tensor(*size):
              return torch.randn(size, device=device, requires_grad=requires_grad)

          def make_nondiff_tensor(*size):
              return torch.randn(size, device=device, requires_grad=False)

          return [
              [make_tensor(3), make_tensor(3), 1],
              [make_tensor(20), make_tensor(20), 3.14],
              [make_tensor(20), make_nondiff_tensor(20), -123],
              [make_nondiff_tensor(2, 3), make_tensor(2, 3), -0.3],
          ]

      def _test_correctness(self, device):
          samples = self.sample_inputs(device)
          for args in samples:
              result = sycl_extension.ops.mymuladd(*args)
              expected = reference_muladd(*args)
              torch.testing.assert_close(result, expected)

      @unittest.skipIf(not torch.xpu.is_available(), "requires Intel GPU")
      def test_correctness_xpu(self):
          self._test_correctness("xpu")

  if __name__ == "__main__":
      unittest.main()

This test checks the correctness of the custom operator by comparing its output against a reference implementation.
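
Because the test file ends with ``unittest.main()``, it can be run directly with ``python test_sycl_extension.py`` (or through ``pytest``) from the project root once the extension has been built.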

Conclusion
----------

In this tutorial, we demonstrated how to implement and compile custom SYCL operators for PyTorch. We specifically showcased an inference-only ``mymuladd`` operation. For adding backward support or enabling ``torch.compile`` compatibility, please refer to :ref:`cpp-custom-ops-tutorial`.