Restore "Implement IPC-enabled events. (#1145)" #1176
Conversation
This reverts commit bcd40ff.
/ok to test 9dd067e
Greptile Overview
Greptile Summary
This PR restores IPC (Inter-Process Communication) support for CUDA events, previously removed in commit bcd40ff. The implementation adds serialization capabilities to Event objects via IPCEventDescriptor, enabling events to be shared across process boundaries using CUDA's cuIpcOpenEventHandle API. Key changes include:
- adding _ipc_enabled and _ipc_descriptor fields to Event (cuda_core/cuda/core/experimental/_event.pxd)
- enabling IPC by default for events created via Device.create_event()
- implementing pickle support via multiprocessing reducer registration
- reorganizing test helpers into a cuda_core/tests/helpers/ package structure with new utilities (PatternGen, LatchKernel, TimestampedLogger)

The feature enforces constraints that IPC events must be bound (created via Stream.record), cannot enable timing, and cannot be re-recorded.
Critical Issues
1. Multiprocessing reducer breaks non-IPC event pickling (cuda_core/cuda/core/experimental/_event.pyx, lines 294-297)
The global multiprocessing reducer registered for Event unconditionally calls get_ipc_descriptor() when pickling any Event instance, but this method raises RuntimeError if the event is not IPC-enabled. This will break existing code that pickles regular (non-IPC) events through multiprocessing queues/pipes.
```python
def _reduce_event(event):
    return event.from_ipc_descriptor, (event.get_ipc_descriptor(),)  # RuntimeError if not IPC-enabled

multiprocessing.reduction.register(Event, _reduce_event)
```
Solution: Add a conditional check in the reducer to handle both IPC and non-IPC events.
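A minimal sketch of such a conditional reducer, assuming the `is_ipc_enabled` property referenced elsewhere in this PR and that regular events support default reduction:

```python
import multiprocessing.reduction

def _reduce_event(event):
    # IPC-enabled events cross process boundaries via their descriptor.
    if event.is_ipc_enabled:
        return event.from_ipc_descriptor, (event.get_ipc_descriptor(),)
    # Fall back to ordinary pickling for non-IPC events (assumes Event
    # defines or inherits a usable __reduce_ex__).
    return event.__reduce_ex__(2)

multiprocessing.reduction.register(Event, _reduce_event)
```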
2. Critical bug in LatchKernel instantiation (cuda_core/tests/helpers/latch.py, line 50)
The constructor accesses self.busy_wait_flag[0] before the busy_wait_flag property is defined (property defined at lines 61-63). This will cause an AttributeError when instantiating the class, breaking all tests that use this helper.
```python
# Line 50 - property doesn't exist yet
self.busy_wait_flag[0] = 0

# Lines 61-63 - property defined later
@property
def busy_wait_flag(self):
    return ctypes.cast(int(self.buffer.handle), ctypes.POINTER(ctypes.c_int32))
```
Solution: Move the property definition before the __init__ method or directly use the ctypes cast in __init__.
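A sketch of the second option, performing the cast directly in `__init__` (the constructor signature and surrounding class body are assumptions based on the quoted snippet):

```python
import ctypes

class LatchKernel:
    def __init__(self, buffer):
        self.buffer = buffer
        # Cast the buffer's device pointer to an int32 pointer and clear
        # the busy-wait flag without going through the property.
        flag = ctypes.cast(int(self.buffer.handle), ctypes.POINTER(ctypes.c_int32))
        flag[0] = 0
```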
3. Potential memory leak in PatternGen.verify_buffer (cuda_core/tests/helpers/buffers.py, lines 76-82)
The method allocates scratch_buffer via DummyUnifiedMemoryResource.allocate() but never explicitly frees it. Depending on the Buffer lifecycle management in the codebase, this could accumulate memory over repeated test runs.
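One way to make the release deterministic, assuming the codebase's `Buffer` objects expose the usual `close()` method (the constructor arguments here are illustrative):

```python
scratch_buffer = DummyUnifiedMemoryResource(device).allocate(size)
try:
    # ... copy and compare against the expected pattern ...
    pass
finally:
    scratch_buffer.close()  # free the scratch allocation explicitly
```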
4. Removed platform capability checks may cause test failures (cuda_core/tests/memory_ipc/test_errors.py, test_workerpool.py, test_send_buffers.py)
Multiple test files removed the supports_ipc_mempool() guard that previously skipped tests on platforms where the driver rejects IPC-enabled mempool creation (e.g., certain WSL configurations). Tests will now fail instead of skip on incompatible platforms.
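Restoring the guard might look like this (helper name taken from the text above; the exact call signature is an assumption):

```python
import pytest

def test_ipc_event_roundtrip(device):
    if not supports_ipc_mempool(device):
        pytest.skip("driver rejects IPC-enabled mempool creation on this platform")
    # ... rest of the test ...
```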
5. Inconsistent device_id handling for IPC-imported events (cuda_core/cuda/core/experimental/_event.pyx, lines 193-194)
IPC-imported events set device_id = -1 and ctx_handle = None with "??" comments, indicating uncertainty about the correct values. The .device and .context properties will return None for these events, which may break downstream code expecting valid objects.
Confidence Score: 2/5
The multiprocessing reducer bug and LatchKernel instantiation error are critical and will cause immediate failures. The IPC event implementation needs review of its pickle integration strategy and proper handling of device/context for imported events before merging.
20 files reviewed, 4 comments
/ok to test cbeebf3
65975ab to 7cbcd0c (force-push)
/ok to test dfc96c3
```python
if not IS_WINDOWS and not IS_WSL:
    # On any sort of Windows system, checking the memory before stream
    # sync results in a page error.
    log("checking target == 0")
    assert compare_equal_buffers(target, zeros)
```
This was the cause of the CI crash. To my surprise, accessing unified memory that is currently queued in a CUDA stream operation results in a page error on Windows. On Linux it works as expected (to me, anyway).
/ok to test 51e90e0
Important: Review skipped. Auto reviews are disabled on this repository; check the settings in the CodeRabbit UI or the repository's CodeRabbit configuration file.

Note: CodeRabbit detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
/ok to test 6bbfe18
```python
from cuda.core.experimental import Buffer, MemoryResource
from cuda.core.experimental._utils.cuda_utils import driver, handle_return

if sys.platform.startswith("win"):
```
We should migrate this to a common place so that this pattern doesn't get replicated in multiple files. It would be a maintenance headache if we had to update it in multiple places.
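For example, a hypothetical shared helper module (say, `tests/helpers/platform.py`) could own the flags used throughout the IPC tests; the `/proc/version` heuristic for WSL detection is an assumption about how detection would be done:

```python
import sys

IS_WINDOWS = sys.platform.startswith("win")

def _detect_wsl():
    # WSL kernels typically report "microsoft" in /proc/version.
    try:
        with open("/proc/version") as f:
            return "microsoft" in f.read().lower()
    except OSError:
        return False

IS_WSL = not IS_WINDOWS and _detect_wsl()
```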
```cuda
    }
    // Avoid 100% spin.
    __nanosleep(10000000);  // 10 ms
```
10 ms is a lot for a polling loop like this. You could probably get away with 1 ms, and as a nit, if you don't need the granularity of __nanosleep you could use std::this_thread::sleep_for as an alternative.
| log(f"got event ({hex(e.handle)})") | ||
| stream2.wait(e) | ||
| log("enqueuing copy on stream2") | ||
| buffer.copy_from(twos, stream=stream2) |
Shouldn't we verify that the buffer we got in the child process contains "1" before we copy "2" on top of it?
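A sketch of that check, reusing names from the quoted snippet (the `ones` reference buffer is an assumption about what the parent filled in):

```python
# Confirm the parent's "1" pattern is visible before overwriting it.
assert compare_equal_buffers(buffer, ones)
buffer.copy_from(twos, stream=stream2)
```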
```python
        return self._ipc_descriptor

    @classmethod
    def from_ipc_descriptor(cls, ipc_descriptor: IPCEventDescriptor) -> Event:
```
Q: it doesn't seem that this public API is tested?
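A round-trip test sketch for this public API (the single-process shortcut is an assumption; a real test would open the descriptor in a child process):

```python
from cuda.core.experimental import Device, Event

def test_from_ipc_descriptor_roundtrip():
    device = Device()
    device.set_current()
    stream = device.create_stream()
    event = stream.record()  # bound and IPC-enabled by default per this PR
    desc = event.get_ipc_descriptor()
    imported = Event.from_ipc_descriptor(desc)  # normally done in another process
    imported.sync()
```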
```python
self._busy_waited = ipc_descriptor._busy_waited
self._ipc_enabled = True
self._ipc_descriptor = ipc_descriptor
self._device_id = -1  # ??
```
I think when we send over the handle we should also send the device id
```python
self._ipc_enabled = True
self._ipc_descriptor = ipc_descriptor
self._device_id = -1  # ??
self._ctx_handle = None  # ??
```
This should be (lazily? not sure if it's possible) initialized in the child process (to the specified device's current context).
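A sketch combining the two suggestions (the `_device_id` descriptor field is an assumption): send the exporting device's id along with the handle, then resolve the context lazily in the importing process:

```python
# In from_ipc_descriptor (sketch):
self._device_id = ipc_descriptor._device_id  # shipped from the parent process
self._ctx_handle = None  # resolved lazily on first use

def _ensure_ctx_handle(self):
    # Bind to the specified device's current context on first access.
    if self._ctx_handle is None:
        from cuda.core.experimental._device import Device  # avoid circular import
        self._ctx_handle = Device(self._device_id).context._handle  # assumed accessor
```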
```python
from cuda.core.experimental._device import Device  # avoid circular import

return Device(self._device_id)
if self._device_id >= 0:
```
ditto, then this awkward check can be removed.
```python
def context(self) -> Context:
    """Return the :obj:`~_context.Context` associated with this event."""
    return Context._from_ctx(self._ctx_handle, self._device_id)
    if self._ctx_handle is not None and self._device_id >= 0:
```
ditto
```python
elif event.is_ipc_enabled:
    raise TypeError(
        "IPC-enabled events should not be re-recorded, instead create a "
        "new event by supplying options."
    )
```
Imagine we are doing ping-pong between two processes. We should be able to reuse the same event in the parent and the child. Does the driver actually disallow this?