Profiling without a Profiler
You want to know why your program is running slowly. One way to find out is to run it inside a profiler. This will sample your program's state as a function of time, and report how often it is inside each function. However, what if your program is complex and some parts are time-critical? The extra overhead of a profiler may give spurious results. We need a way to profile only the parts of the code we are interested in.
What we would like is the possibility of adding a single token to the functions we wish to investigate. So how can this work? There can be multiple ways to exit a function, so the obvious trick of adding something at the start and end doesn't really work. Fortunately, the extensions that gcc provides are powerful enough for us to construct a simple profiling harness. By using the "cleanup" attribute, we can create a variable with a destructor, even in C code.
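As a standalone illustration (separate from the profiler, with made-up names), the attribute arranges for a handler to be called with a pointer to the variable whenever it goes out of scope, on every exit path:
#include <stdio.h>

/* The cleanup handler receives a pointer to the variable being destroyed */
static void on_scope_exit(int *value)
{
    printf("leaving scope, value was %d\n", *value);
}

static int demo(int x)
{
    __attribute__((cleanup(on_scope_exit))) int marker = x;

    if (x < 0) return -1;   /* the handler still runs on this early return */
    return x * 2;           /* ...and on this one */
}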
Applying this to our profiling macro, we obtain something like:
#define DECL_PROF_FUNC \
    __attribute__((cleanup(ll_prof_func_exit))) struct ll_prof_data ll_func_prof = {ll_readtsc(), __func__, __LINE__}
This magic define allows us to place the token DECL_PROF_FUNC at the start of each function we wish to profile. It declares a variable called ll_func_prof that is a structure. The structure is filled with some information, and then when the function exits, the function ll_prof_func_exit is called with a pointer to that structure. By storing the information about the current execution of the function within ll_func_prof, we can work out how much time is spent within it.
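Usage then looks something like the following, where sum_squares() is just a hypothetical function we have decided to profile:
/* Hypothetical example: the single token at the top is all that is needed */
static int sum_squares(const int *array, int len)
{
    DECL_PROF_FUNC;

    int i, total = 0;
    for (i = 0; i < len; i++)
    {
        total += array[i] * array[i];
    }
    return total;
}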
So firstly, we need some way of determining time. The obvious choice, gettimeofday(), is too slow, as is clock_gettime(). Instead, we will use the low-level performance counter of the processor. This is not without its difficulties. On multi-core machines, the different cores may have unsynchronized counters. This means that if the process we are profiling is suspended by the operating system and resumed on a different core, the timing will be spurious. In addition, modern processors can tick at different rates as power-saving kicks in. This can change the measured rate as time passes, also giving spurious results.
Fortunately, if the time taken is small (and we are assuming it is), then both problems are manageable. We simply ignore the clock going backwards, and ignore time periods that are too long. This means that the profiler only works on small, fast functions. However, if we really care about large, slow functions, we can easily move to a coarser but more monotonic timer.
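A sketch of what such a replacement might look like, assuming a POSIX system where clock_gettime() with CLOCK_MONOTONIC is available (on Linux, CLOCK_MONOTONIC_COARSE trades precision for an even cheaper read):
#include <time.h>

/* Hypothetical slower, but monotonic, alternative to the cycle counter.
   Returns nanoseconds since some unspecified fixed starting point. */
static unsigned long long ll_readmono(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);

    return (unsigned long long) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}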
To read the tsc, we need to use some inline asm, which unfortunately also depends on whether your code is 64-bit or 32-bit:
#ifdef __x86_64__
static __inline__ unsigned long long ll_readtsc(void)
{
    unsigned hi, lo;
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) lo) | (((unsigned long long) hi) << 32);
}
#else
static __inline__ unsigned long long ll_readtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc" : "=A" (x));
    return x;
}
#endif
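Alternatively, if your compiler ships the x86 intrinsics header, the same counter can be read without hand-written assembly; a sketch, assuming gcc or clang with <x86intrin.h>:
#include <x86intrin.h>

/* Equivalent to the inline asm above, via the __rdtsc() intrinsic */
static __inline__ unsigned long long ll_readtsc_intrin(void)
{
    return __rdtsc();
}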
So, to finish off the definition of DECL_PROF_FUNC, we need to define the structure it stores into. Basically, we want to record the time of entry into the function, the name of that function, and the line on which DECL_PROF_FUNC is written. We will use this line number as a key into another data structure that records the global profiling results. Thus we have, for the profiling data:
struct ll_prof_data
{
    unsigned long long time;
    const char *func_name;
    unsigned line;
};
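To make the mapping concrete, for a hypothetical function foo() annotated on line 42, DECL_PROF_FUNC expands to roughly:
/* Approximate expansion: the initializers fill time, func_name and line in order */
__attribute__((cleanup(ll_prof_func_exit))) struct ll_prof_data ll_func_prof =
    {ll_readtsc(), "foo", 42};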
And for the global state:
#define MAX_PROF_LINE 65535
struct ll_prof_entry
{
    unsigned long long time;
    const char *name;
    int err_count;
    int func_count;
};
static struct ll_prof_entry ll_prof[MAX_PROF_LINE] = {{0, NULL, 0, 0}};
When a profiled function's execution ends, the cleanup attribute calls ll_prof_func_exit(). This function can then collate the information into the global state:
#define MAX_DELTAT (1ULL << 32)
static void ll_prof_func_exit(struct ll_prof_data *pd)
{
    unsigned long long x = ll_readtsc();
    struct ll_prof_entry *pe = &ll_prof[pd->line];

    /* Debugging test for problems - enable if an annotation could sit beyond MAX_PROF_LINE */
    //if (pd->line >= MAX_PROF_LINE) exit(0);

    __sync_add_and_fetch(&pe->func_count, 1);

    /* Save name for later */
    pe->name = pd->func_name;

    /* Ignore time reversing, or progressing too quickly */
    if ((x < pd->time) || (x > pd->time + MAX_DELTAT))
    {
        __sync_add_and_fetch(&pe->err_count, 1);
        return;
    }

    /* Collate time */
    __sync_add_and_fetch(&pe->time, x - pd->time);
}
Notice how, in the above, we use synchronized (atomic) additions rather than plain ones to count the number of times the function ran and how long it took. This makes the profiling thread-safe. However, these atomic additions are slower than the non-bus-locked ones, so if your code doesn't use threads, then consider replacing them with their conventional alternatives.
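For instance, in a single-threaded build the exit handler could be simplified to something like the following sketch (the _st suffix is just to distinguish it here):
static void ll_prof_func_exit_st(struct ll_prof_data *pd)
{
    unsigned long long x = ll_readtsc();
    struct ll_prof_entry *pe = &ll_prof[pd->line];

    /* Plain, non-atomic updates: fine without threads */
    pe->func_count++;
    pe->name = pd->func_name;

    /* Ignore time reversing, or progressing too quickly */
    if ((x < pd->time) || (x > pd->time + MAX_DELTAT))
    {
        pe->err_count++;
        return;
    }

    pe->time += x - pd->time;
}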
So our program has run its test sequence, and now we have all the profiling data we could ever want stored inside the ll_prof array. How shall we get at it? Of course, we could call some function at program exit to print it out, but there are many ways a program can complete. We don't want to bother wrapping every call to exit() or abort(), and adding something at the bottom of main(). Fortunately, gcc saves us work yet again. We can use the destructor function attribute to make a function run after main() exits.
Inside this destructor function, we can sort the results, and print them out:
#include <stdio.h>
#include <stdlib.h>

/* Sort entries by decreasing total time */
static int ll_cmp_prof(const void *p1, const void *p2)
{
    const struct ll_prof_entry *pt1 = p1;
    const struct ll_prof_entry *pt2 = p2;

    if (pt1->time > pt2->time) return -1;
    return (pt1->time < pt2->time);
}
static __attribute__((destructor)) void ll_print_prof(void)
{
    int i, j = 0;
    unsigned long long total = 0;

    /* Compact entries */
    for (i = 0; i < MAX_PROF_LINE; i++)
    {
        if (!ll_prof[i].name) continue;

        ll_prof[j] = ll_prof[i];
        total += ll_prof[j].time;
        j++;
    }

    /* Sort them */
    qsort(ll_prof, j, sizeof(struct ll_prof_entry), ll_cmp_prof);

    printf("Profile results:\n");
    for (i = 0; i < j; i++)
    {
        /* Time is scaled down by 2^20 (roughly millions of cycles) to keep the numbers readable */
        printf("%s:\t\t%llu\t%d\t%d\n", ll_prof[i].name,
            ll_prof[i].time >> 20, ll_prof[i].func_count,
            ll_prof[i].err_count);
    }
}
So using the above we can create a one-header-file profiler for C. We can scatter DECL_PROF_FUNC inside the functions we care about, and the profiler will nicely print out the results. However, there is one minor issue. The above code is designed for profiling a single .c file at a time. If you try to profile multiple files, it is a little more tricky. You need to make sure that the global state doesn't get corrupted if two files just happen to have annotations on the same line, or use functions with the same name. However, since those problems are rare and trivial to work around, we will ignore them. Such a simple profiler is too useful to burden with extra complexity that will rarely be needed.
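One possible work-around, assuming the profiling state is shared between translation units rather than kept static in each file, is to reserve a separate range of the ll_prof table per source file with a per-file offset. LL_PROF_OFFSET below is a hypothetical addition, not part of the header described above:
/* Hypothetical per-file offset so that line-number keys from different
   files cannot collide; each file picks its own slice of ll_prof[]. */
#ifndef LL_PROF_OFFSET
#define LL_PROF_OFFSET 0
#endif

#undef DECL_PROF_FUNC
#define DECL_PROF_FUNC \
    __attribute__((cleanup(ll_prof_func_exit))) struct ll_prof_data ll_func_prof = \
        {ll_readtsc(), __func__, __LINE__ + LL_PROF_OFFSET}

/* In a second source file, before including the profiler header:
   #define LL_PROF_OFFSET 8192 */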