Could a simple change in the layout of a data structure in memory result in a 20% reduction in performance? Recently in my Digital Systems Organization and Design class, we have been discussing data caches in modern processors. As a toy example of the ability to explicitly prefetch data into the cache, Professor Milo Martin presented this C code which performs a depth-first traversal of a binary tree, adding the values at every node:

(Full source available at https://gist.github.com/946002 — Fork it! Modify it! Let me know!)

```c
int tree_add(treenode* t) {
    if (t == NULL)
        return 0;
    __builtin_prefetch(t->left);
    __builtin_prefetch(t->right);
    return t->value + tree_add(t->right) + tree_add(t->left);
}
```

Needing an exercise in C, and curious about the effects of explicit prefetching, I decided to test this out. First I generated a tree in a depth-first fashion to try the algorithm on:

```c
treenode* buildtree(int height) {
    if (height < 0)
        return NULL;
    treenode* t = malloc(sizeof(treenode));
    if (t == NULL) {
        printf("Could not allocate memory");
        exit(1);
    }
    t->value = rand() % 10;
    --height;
    t->right = buildtree(height);
    t->left = buildtree(height);
    return t;
}
```
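As a quick sanity check on the builder (and to plug the memory leak the snippet above ignores), here is a sketch of my own, not part of the gist: a node counter to confirm that a full tree of height *h* holds 2^(h+1) − 1 nodes, plus a post-order `freetree`. The `treenode` layout is assumed from the gist.

```c
#include <assert.h>
#include <stdlib.h>

/* treenode layout assumed from the gist. */
typedef struct treenode {
    int value;
    struct treenode* left;
    struct treenode* right;
} treenode;

static treenode* buildtree(int height) {
    if (height < 0)
        return NULL;
    treenode* t = malloc(sizeof(treenode));
    if (t == NULL)
        exit(1);
    t->value = rand() % 10;
    t->right = buildtree(height - 1);
    t->left = buildtree(height - 1);
    return t;
}

/* Count nodes: a full tree of height h should hold 2^(h+1) - 1. */
static int count(treenode* t) {
    if (t == NULL)
        return 0;
    return 1 + count(t->left) + count(t->right);
}

/* Post-order free: children first, then the node itself. */
static void freetree(treenode* t) {
    if (t == NULL)
        return;
    freetree(t->left);
    freetree(t->right);
    free(t);
}
```

For height 20 this formula gives the 2,097,151 nodes used in the timings below.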

When I tested 10,000 runs of the `tree_add` algorithm on a tree of height 20 (2,097,151 nodes) with and without the explicit `__builtin_prefetch` calls, I found nearly zero difference (372.257728 seconds without prefetching vs. 372.901536 seconds with prefetching). Professor Martin suggested this might be an artifact of the fact that I was generating the tree in depth-first fashion. So I generated a breadth-first tree:

```c
treenode* newnode() {
    treenode* t = malloc(sizeof(treenode));
    if (t == NULL) {
        printf("Could not allocate memory");
        exit(1);
    }
    t->value = rand() % 10;
    t->left = NULL;
    t->right = NULL;
    return t;
}

treenode* buildtreeBFS(int height) {
    if (height < 0)
        return NULL;
    treenode* t = newnode();
    list* l = malloc(sizeof(list));
    if (l == NULL) {
        printf("Could not allocate memory");
        exit(1);
    }
    addfront(l, t);
    int nodes = pow(2, height + 1) - 2;
    printf("%d\n", nodes);
    while (nodes > 0) {
        treenode* node = removeback(l);
        node->left = newnode();
        node->right = newnode();
        addfront(l, node->left);
        addfront(l, node->right);
        nodes -= 2;
    }
    while (removeback(l) != NULL)
        ; // clear out the list
    free(l);
    return t;
}
```

This is where things get surprising. Running the depth-first traversal on the breadth-first generated tree takes significantly longer than on the depth-first tree! Again, the explicit prefetching seems to have little effect (446.019808 seconds without prefetching vs. 447.025984 seconds with prefetching). We take 20% longer to perform the `tree_add` function on the breadth-first tree than we did on the depth-first tree. Though these are functionally the same data structure, I hypothesize that their layout in memory, along with the effects of spatial locality on caching data, accounts for the performance difference.

Since modern computers use cache blocks that can hold many of our `treenode` structs, if we are accessing nodes that are lined up sequentially in memory, we will have far fewer cache misses as we traverse the tree. When we generated the tree in the same depth-first order in which we later traversed it, we exploited this locality by accessing sequentially allocated heap locations. With the breadth-first tree, we jumped around through addresses in the heap, likely incurring a much higher cost for memory accesses.
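One way to test this hypothesis directly (a sketch of my own, not something from the gist) is to take the allocator out of the picture: carve every node from one contiguous array, handed out in the same depth-first order the traversal visits them, so that adjacent nodes are guaranteed to share cache blocks regardless of what `malloc` would have done.

```c
#include <stdlib.h>

typedef struct treenode {
    int value;
    struct treenode* left;
    struct treenode* right;
} treenode;

/* Simple bump allocator: hands out nodes from one contiguous block,
   so consecutive allocations sit at consecutive addresses. */
typedef struct {
    treenode* nodes;
    size_t next, cap;
} pool;

static treenode* pool_alloc(pool* p) {
    if (p->next >= p->cap)
        exit(1); /* caller must size the pool to 2^(height+1) - 1 nodes */
    return &p->nodes[p->next++];
}

/* Build depth-first out of the pool, right child before left, so the
   memory layout matches the order tree_add visits the nodes. */
treenode* buildtree_pool(pool* p, int height) {
    if (height < 0)
        return NULL;
    treenode* t = pool_alloc(p);
    t->value = rand() % 10;
    t->right = buildtree_pool(p, height - 1);
    t->left = buildtree_pool(p, height - 1);
    return t;
}
```

If the spatial-locality explanation is right, a breadth-first build drawn from the same kind of pool, but in BFS order, should reproduce the slowdown even though every node came from one allocation.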

I find it fascinating that in a realistic scenario we see such a large performance difference using data structures with so many apparent similarities. It really proves the usefulness of hardware classes even for someone who never intends to design computer hardware. Thank you Professor Martin!