You can also encounter thread divergence due to loops. Here’s a somewhat contrived example. We have some pre-loop code in our kernel. All the threads are going to run this code, and then they’re going to reach this for loop. And the way I’ve constructed it, each thread goes through the loop a number of times determined by its thread index. So thread 0 will execute the loop code once, thread 1 will execute it twice, thread 2 will execute it 3 times, and so on. And then eventually they’re all going to exit the loop and proceed to do some post-loop stuff.
So what does this look like? Here’s a bunch of threads, and they’re all in the same thread block. I’ve just color coded them so you can see what they do more easily. They’re all going to execute this pre-loop code, and then they’re going to reach the loop. Thread 0 is going to execute the loop code once and then keep going. Thread 1 is going to execute the loop code, execute it again, and keep going. Thread 2 will execute the loop code 3 times and keep going. And thread 3 will execute the loop code 4 times.
Now let’s think about these threads a little differently, in terms of what they’re doing over time. The first thread executes the pre-loop code, then goes ahead and executes the loop code, and then it really just kind of sits around. It doesn’t have anything to do for a while because, in the meantime, the second thread has executed the pre-loop code and the loop code and is executing the loop code again. The third thread executes the pre-loop code, then the loop code 3 times. And the final thread executes the pre-loop code and then the loop code 4 times. Finally, all the threads can go ahead and proceed with the post-loop code. When you draw the diagram like this, it gives you a sense of why loop divergence is a bad thing, why it slows you down.
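The code from the slide isn’t reproduced in this transcript, so here is a minimal sketch of what a kernel like this might look like in CUDA. The names divergentLoop, iterations, and runDivergentLoop are made up for illustration, and the loop bound i <= tid is an assumption, chosen so that thread 0 runs the loop once, thread 1 twice, and so on. The loop body is deliberately trivial, and a compiler may well simplify it away; it is only there to show the shape of the control flow.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel (not the original slide code): each thread runs the
// loop body (threadIdx.x + 1) times, so threads in the same block leave
// the loop at different iterations.
__global__ void divergentLoop(int *iterations)
{
    int tid = threadIdx.x;
    int count = 0;                      // pre-loop code: every thread does this

    for (int i = 0; i <= tid; ++i) {    // assumed bound: trip count depends on tid
        count += 1;                     // loop code
    }

    iterations[tid] = count;            // post-loop code: every thread does this
}

// Hypothetical host helper: launches one block of numThreads threads and
// returns how many times each thread executed the loop body.
// On a CUDA allocation failure it returns a vector of zeros.
std::vector<int> runDivergentLoop(int numThreads)
{
    std::vector<int> host(numThreads, 0);
    int *device = nullptr;
    std::size_t bytes = numThreads * sizeof(int);

    if (cudaMalloc(&device, bytes) != cudaSuccess) {
        return host;
    }
    divergentLoop<<<1, numThreads>>>(device);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device);
    return host;
}
```

On a machine with a CUDA-capable GPU, calling runDivergentLoop(4) should give back the per-thread loop counts for the four threads in the walkthrough: 1, 2, 3, and 4.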
It turns out that the hardware likes to run these threads together, and as long as they’re all executing the same code, it has the ability to do that. But in this case, the blue thread proceeds for a while, and then, because it’s not going to do the loop again, it just ends up waiting around while the other threads do. Then the red thread waits for a little while, and the green thread waits a little bit. Only the purple thread was executing at full efficiency the whole time. So you can imagine that if the hardware gets some efficiency out of running all 4 of these threads at the same time, then that efficiency has been lost during this portion of the loop.
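To put a rough number on that loss, here is a back-of-the-envelope calculation for the four threads above. It assumes an idealized model that the lecture does not state explicitly: the threads run in lockstep, every loop iteration costs the same, and a thread that has left the loop simply idles until the slowest thread finishes.

```latex
\begin{align*}
\text{useful thread-iterations} &= 1 + 2 + 3 + 4 = 10 \\
\text{available thread-iterations} &= 4 \text{ threads} \times 4 \text{ iterations} = 16 \\
\text{idle thread-iterations} &= 3 + 2 + 1 + 0 = 6 \\
\text{utilization during the loop} &= \frac{10}{16} = 62.5\%
\end{align*}
```

Under those assumptions, the blue thread idles for 3 iterations, the red for 2, the green for 1, and the purple for none, so a bit more than a third of the hardware’s capacity goes unused while the loop runs.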