Jim Durbin, Rich Gossweiler, Randy Pausch
University of Virginia,
School of Engineering
Department of Computer Science
[durbin | rich |
pausch]@Virginia.edu, (804) 982-2289
Table of Contents
Amortizing
3D Graphics Optimization Across Multiple Frames
Cover
Page
Submission Category: Conference
Paper
Title: Amortizing 3D Graphics
Optimization Across Multiple
Frames
Contributors:
Jim Durbin (804) 982-2292 durbin@virginia.edu
Rich
Gossweiler (804) 982-2289 rich@Virginia.edu
Randy Pausch (804) 982-2211
pausch@Virginia.edu
University of Virginia, School of
Engineering
Department of Computer Science
Charlottesville, VA
22903
Contribution: Description of
optimization mechanism
Primary Contact: Rich
Gossweiler (rich@virginia.edu)
Keywords:
three-dimensional graphics, interactive graphics, real-time, optimization,
rendering, virtual environments
100 word abstract:
This paper describes a mechanism for improving rendering rates
dynamically during runtime in an interactive three-dimensional graphics
application. Well-known techniques such as transforming hierarchical geometry
into a flat list and removing redundant graphics primitives are often performed
off-line on static databases, or continuously every rendering frame. In
addition, these optimizations are usually performed over the whole database. We
observe that much of the database remains static for a fixed period of time,
while other portions are modified continuously (e.g. the camera position), or
are repeatedly modified during some finite interval (e.g. during user
interaction). We have implemented a runtime optimization mechanism which is
sensitive to repeated, local database changes. This mechanism employs timing
strategies which optimize only when the cost of optimization will be amortized
over a sufficient number of frames. Using this optimization scheme, we observe a
rendering speedup of roughly 2.5 in existing applications. We discuss our
initial implementation of this mechanism, the improved timing measurements, the
issues and assumptions we made, and future improvements.
However, for the rendering engine, a hierarchical data structure
is not optimal. To keep the highly pipelined architecture of modern graphics
hardware from stalling, we would prefer to make few, if any calculations while
traversing the graphics database. To that end, a flat list of graphics
primitives is preferable, because it requires no combination of transformations
(e.g. matrix multiplications). Flat lists also offer the opportunity to perform
compiler-like peephole optimizations (e.g. removing redundancies), and is
efficient for pipelining graphics.
The use of optimization techniques
to improve rendering speeds is well-established, in both research (e.g. the
Berkeley and UNC Walkthrough projects [7][1][2])
and commercial systems (e.g. SGI Performer [11]).
These two systems exemplify both ends on the spectrum of when to
optimize. The Berkeley Walkthrough assumes that the database is static, and can
perform aggressive off-line optimization to restructure the database for
improved runtime performance. Performer, on the other hand, supports very
dynamic environments and cannot make assumptions about the static nature of the
database. Performer employs global database optimization for every frame.
Since optimization takes time, Performer users prefer to execute these
mechanisms on an auxiliary CPU to minimize impact on the rendering CPU. Both
techniques apply optimization globally over the database.
In this paper, we present optimizations that occurs after higher-level
techniques such as culling objects which are not in view, or using multiple
representations of objects stored at various levels of detail [12].
We have observed that in many applications
some portions of the database remain unchanged during the entire run, and other
portions change repeatedly, but during brief intervals (e.g. when the user is
directly manipulating the object), and still other portions change continuously
(e.g. the camera location).
This paper describes a mechanism for
improving rendering rates dynamically during runtime in an interactive
three-dimensional graphics application. Well-known techniques such as
transforming hierarchical geometry into a flat list and removing redundant
graphics primitives are often performed off-line on static databases, or
continuously every rendering frame. In addition, these optimizations are usually
performed over the whole database. We observe that much of the database remains
static for a fixed period of time, while other portions are modified
continuously (e.g. the camera position), or are repeatedly modified during some
finite interval (e.g. during user interaction). We have implemented a runtime
optimization mechanism which is sensitive to repeated, local database changes.
This mechanism employs timing strategies which optimize only when the cost of
optimization will be amortized over a sufficient number of frames. Using this
optimization scheme, we observe a rendering speedup of roughly 2.5 in existing
applications. We discuss our initial implementation of this mechanism, the
improved timing measurements, the issues and assumptions we made, and future
improvements.
three-dimensional graphics,
interactive graphics, real-time, optimization, rendering, virtual
reality
In 1976 Clarke suggested that a
hierarchical data structure would have several characteristics which are useful
for manipulating and rendering graphics objects [4].
One powerful advantage of a tree structure for the application programmer is
that matrices at each node in the tree can represent individual coordinate
systems. When the programmer manipulates a graphics object, its children will
inherit the changes implicitly [13].
This paper is currently submitted to ACM UIST `95.
Since we have constructed a rendering system which is application independent
[8],
we, like Performer, cannot make assumptions that a given portion of the database
will remain static. We must analyze the database during runtime. Instead of
attempting to transform the entire database during each frame, our mechanism
records the frequency with which portions of the database change, and uses this
information to predict how frequently the object will be changed in the near
future. The mechanism determines or predicts the cost of performing an
optimization and uses a simple utility function to determine the value of
optimizing those portions of the database. The algorithm collects data about the
average frequency for which each object gets altered. If the object does not get
altered frequently, then the cost to optimize the object may be worth the
savings recouped over later frames. However, if the object is constantly
changing, then there is less incentive to perform the optimization, as the
benefits will be fleeting. Our approach is a hybrid, as shown in Figure 1:
Figure
1: When to optimize
We implemented our optimization
mechanism as part of the Alice graphics system [10].
Alice supports a variety of applications ranging from immersed virtual
environments, to rapid exploration of three-dimensional interfaces, to a
teaching tool for graphics classes. The rendering subsystem [8]
is based on a hierarchical data structure. All graphical objects in the
rendering sub-system are represented as nodes in the tree. Nodes contain
transformation matrices and may also contain geometry consisting of polygons,
polylines or polypoints.
When rendering from this data structure, there are two basic forms of
inefficiency:
1) matrix multiplications to combine the implicitly chained transformations
(in order to render a leaf node, we must first apply all the transformations
from the root node to that leaf).
2) redundant calls to set state in the graphics pipeline; for example, in
most objects, a large number of polygons are the same color. Repeatedly calling
the graphics setColor() subroutine (the OpenGL glColor3f() call) [9]
disrupts the pipeline. In addition, even using a local if statement in the inner
rendering loop also disrupts the hardware pipeline [5],
degrading rendering performance. This is partially because the if statement to
compare with, for example, the current normal vector, needs to compare three
floating point values. The most efficient mechanism is to produce the list of
graphics calls which will be streamlined into the graphics pipeline, and to
remove the redundant calls. This is exactly what Performer does on an extra CPU
for each frame.
Our technique involves several
optimization phases and a separate mechanism for analyzing the benefit of
performing the optimizations. The phases are:
- streamlining each individual node in the tree by creating an array of
graphics calls without any intervening computations. This is done for each
node in the tree, so that if the node is altered, only that particular node's
streamlined array needs to be re-constructed
- peephole optimization to the streamlined list, removing redundancies
- flattening the hierarchical tree structure -- coalescing nodes so that
each subtree node has a single cache list representing all of its children at
one transformed level. We call this list the streamlined array.
- peephole optimization of the flattened list for the entire subtree
Since invoking these optimization mechanisms requires time, we
need a guideline to determine if it is worth performing these operations. The
algorithm must evaluate the time needed to perform the optimizations, how long
the object has remained unaltered and how much of a predicted improvement will
be achieved.
Converting Each Node into a Rendering List
Each node in the tree may be
thought of as a container holding a set of properties describing the object or
sub-object at that level. For example, in a hierarchical model of a bunny, the
bunny's foot may be a child of the bunny's leg. As a node, the foot contains
geometry, color information and inherits the transformation matrix from its
parent (the leg). When traversing the foot's node, the rendering engine may have
to perform several conditional statements to determine how to draw the foot. The
color may be inherited from the parent (the leg), set for all of the polygons,
be specified for each individual polygon, or specified for each individual
vertex of each polygon. Evaluating these conditional statements every frame
disrupts the pipeline. If the graphics calls are the same from frame to frame,
these calls can be cached into an array. Then the rendering engine can iterate
through the array rather than making repeated conditional evaluations. While it
is surprising that seemingly minor conditional tests have a strong impact on
rendering performance, this is due to the highly pipelined nature of high-end
graphics hardware. To quote the authors of SGI Performer:
"The data structures used to represent geometry and the code which
transfers that data to the graphics hardware very often make or break an
immediate-mode graphics application." [11]
Peephole Optimization
When rendering an object, there are many polygons
which are the same color. Instead of resetting the color for every vertex or for
every polygon, the color call can be factored out and made only once.
This redundancy factoring can occur for many of the properties which
characterize the object (e.g. the color, the vertex normal vectors, and the
textures). For the initial implementation, our optimization removes redundant
color and normal information. This operation is performed during the
construction of the streamlined array.
Early peephole compilers used a similar method to track the source of a
register's value and remove redundant load commands [6].
More sophisticated peephole optimizers perform increasingly complex pattern
matching and even rearrange code in an attempt to minimize the number of
instructions and register manipulations; we hope to use a number of these
techniques to further optimize rendering.
Flattening the Tree
Since the optimization mechanism is traversing the
hierarchical structure and creating cache arrays for each node, the cost to
accumulate the transformation matrices is trivial. This allows the optimization
mechanism to effectively flatten subtrees into a single node with a longer cache
array.
Figure
2: Tree flattening
Note that the original hierarchy is not discarded, since alterations to the
nodes in the tree require that the cache be invalidated and the original
hierarchy be available.
Flattening the hierarchy into a rendering list is an explicit function call
in Performer, and therefore makes it the responsibility of the application
programmer to perform this operation. Our optimization mechanism engages
automatically when optimization is cost-effective.
The important distinction of our
mechanism is how it determines when optimization should be done, and to which
portions of the database. Performer does optimization globally to the entire
hierarchy every frame. Our approach is to perform a utility measurement,
comparing the cost/benefit of flattening subtrees and factoring out redundant
calls. We measure how long an object remains unaltered, how long it takes to
optimize, and the value of performing the optimizations.
Wall-clock Time v. Frame Time
Because we are interested in how many
times an object is modified versus how many times it is rendered, it is more
appropriate to use rendering frame counts rather than wall-clock time. For
example, if we used wall-clock time, and the system pauses momentarily (perhaps
due to other users on the machine) a single frame could take arbitrarily long to
render. The ratio of how frequently the object is modified to how frequently it
is rendered is what is important.
While frame-counting is necessary to determine when to optimize, wall-clock
time is necessary to compute the effectiveness of the optimizations. We measure
the rendering times of objects in both their unoptimized and optimized states.
Important Information for the Algorithm
AUT -- Average Unaltered Time:
the ratio of the number of frames where an object is not altered to the total
number of frames (This is a running total average, but other options are
discussed later).
TTRU -- Time to Render Unoptimized: the amount of time to render an object
without any optimizations.
TTRO -- Time To Render Optimized: the amount of time to render an object once
it has been optimized.
OT -- Optimization Time: the time it takes to perform the optimization on the
object.
Computing Variables
Since there already is a penalty for loading an
object from the disk we expend slightly more time to obtain timing information.
We load the object and render it several times without performing a swapbuffer()
call. During this time we gather the TTRU, the TTRO and the OT. We also create
the streamlined array for each node. The performance cost of gathering this
information is not observable, as it is dominated by the time to read the object
from the disk.
Dynamically Created Objects
During the execution of the program,
subtrees may be reparented, effectively creating new objects, and the timing
variables cease to be representative. In this case, we use a simple, coarse
predictive function based on the number of polygons in a subtree.
Based on timings taken on a large number of subtrees on an SGI Reality
Engine2 [3],
we find for an N polygon subtree, the representative times are:
EstimatedTTRU = (0.0000127 * N) seconds
EstimatedTTRO = (0.0000038 * N) seconds
EstimatedOT = (0.0000879 * N) seconds
The Utility Function
For each node, we store the TTRU, TTRO, and OT. The
utility function computes the amount of savings we will obtain over the average
unaltered time (AUT) and adds the cost of performing the optimization. The total
is compared to the cost of rendering unoptimized over the same AUT:
if (TTRO*AUT + OT < TTRU*AUT) then optimize...
This simple algorithm does not prevent arbitrarily bad hitches. For example,
if the object has not changed for a very long time, but the optimization time
(OT) is five seconds, the algorithm would perform the optimization, and the
system would stall for five seconds. This problem can be solved by placing a
hard upper limit on the allowable OT.
Since our system
supports a dynamic tree, we must invalidate the caches when an object's
characteristics change (e.g. color, translucency, texture), when the object's
matrix undergoes a transformation, and when reparenting occurs. The caches are
marked dirty and the rendering engine traverses the unoptimized hierarchy
instead. Since each node maintains its own streamlined array, any unaltered node
keeps its streamlined array:
Figure
3: invalidating the optimizations
The effectiveness of this optimization
mechanism depends on how frequently an application alters an object. If the
object is altered continuously, then no optimizations are performed. If an
object is never altered, then it is optimized once and remains in the optimized
state for the duration of the application.
To establish that these optimizations produce a worthwhile savings, we
measured the time to render a variety of objects when they are optimized and
when they are unoptimized. The results are shown in Table 1. The graphical
objects in the table represent several contrived and actual objects to explore
variations on tree configurations and redundancy removal. Note that these
speedups are greater than 2.5; actual day-to-day use achieves roughly a 2.5
speedup because of constant cost operations per frame such as clearing and
swapping the frame buffer, which dilutes the speedup somewhat.
Note to the reviewers: the enclosed video tape shows the amusement park
simulation rendering at 12 frames per second without optimizations and 29 frames
per second with the optimizations engaged.
As future work we would like to gather statistics about how frequently
objects are optimized and how long they remain optimized.
The graphics database is accessed by both
the application program and the rendering engine, but the usage patterns and
frequency of access dictate very different internal representations.
Transforming a subtree into a render-optimized list steals rendering time which
may be recouped over future frames. The question of utility becomes, "Is it
worth the time to optimize now, to speed up the application for the future?" By
observing that some parts of the tree are active for discrete periods of time,
we have implemented a runtime optimization mechanism which optimizes only local
portions when the utility of optimization appears worthwhile. The results are
dependent on usage patterns, but initial timing experiments indicate that this
mechanism is useful for improving runtime efficiency. Our initial measurements
indicate a factor of 2.5 increase in rendering speeds when the optimization
mechanism is employed.
We have two high level
observations based on this implementation. The first is that preliminary results
of employing this mechanism look promising. Alice was recently used in a
graphics class project where ten students built an amusement park (each student
built individual rides). This was a fair test, involving an application written
by a ten person team, unaware of the underlying optimization techniques. The
optimization mechanism improves the rendering by a factor of 2.5.
Our second observation is that trees are typically constructed with very
little depth -- on the order of five levels or less.
The following is a brief list of issues we
intend to consider as we continue this work.
The rendering engine, by computing
the bounding volumes of subtrees, knows which objects in the tree are located
near each other after the matrix transformations. This may not be identical to
the way the user has constructed the parent-child relationships of the tree.
For portions of the database which do not change over long periods of time,
the rendering engine could flatten the application tree and then reconstruct a
different tree based on spatial locality. This would allow for more efficient
high-level culling.
We currently compute all variables
for the optimizations at runtime. There may be some benefit in storing the
timing data across runs. This information might be useful when determining the
utility of flattening and may serve as a second-order statistic about the nature
of the object (e.g. the "lamp" object is used by many applications as static
decoration).
We currently compute the Average
Unaltered Time (AUT) over the entire run. An alternate approach is to weight the
timings, so that over time the more recent alterations influence the average
more than alterations having occurred a long time ago. The general problem is
similar to the page replacement problem in operating systems.
It might be
prudent in some cases to allow the application to explicitly express good
moments to perform optimizations. For example, if the application knows that the
user has paused or gone into another mode, then that might be a good time to
optimize.
For example, if we flatten a
bunny object which has children nodes of head, body and arms, and then we rotate
the head, the entire head tree will be unflattened. If, instead, we
progressively flatten each subtree into larger and larger lists, we can
unflatten only the portions which have been altered, rather than the whole
subtree.
Using this
optimization mechanism does not preclude off-line optimization. For example,
with a static database such as the Berkeley Walkthrough, off-line optimization
can provide a great deal of effective visibility culling. During runtime a given
"cell of visibility" may still have a large number of objects. Our mechanism may
then be effective for improving the cell.
- 1.
- John M. Airey, John H. Rholf and Frederick Brooks Jr., Towards Image
Realism with Interactive Update Rates in Complex Virtual Building
Environments, ACM SIGGRAPH Special Issue on 1990 Symposium on
Interactive 3D Graphics, 24 (2), 1990, pp. 41-50.
- 2.
- Frederick Brooks Jr, Walkthrough - A Dynamic Graphics System for
Simulating Virtual Buildings, Proceedings of the 1986 Workshop on
Interactive 3D Graphics.
- 3.
- Kurt Akeley, Reality Engine Graphics, ACM Annual Proceedings of
SIGGRAPH `93 (Anaheim California), August 1-6, 1993, Addison-Wesley, pp.
109-116.
- 4.
- James H. Clarke, Hierarchical Geometric Models for Visible Surface
Algorithms, Communications of the ACM, 19(10), October 1976, pp.
547-554.
- 5.
- Sharon Clay, member of SGI Perfomer Group. Personal conversation, April
25, 1995.
- 6.
- Jack W. Davidson and Christopher W. Fraser, Register Allocation and
Exhaustive Peephole Optimization, Software -- Practice and
Experience, 14(9), John Wiley & Sons, September 1984, pp. 857-865.
- 7.
- Thomas A. Funkhouser, Carlo Séquin and Seth J. Teller, Management of
Large Amounts of Data in Interactive Building Walkthroughs, 1992
Symposium on Interactive 3D Graphics, ACM SIGGRAPH, Cambridge Mass. April
1992, pp. 11-20.
- 8.
- Rich Gossweiler, Chris Long, Shuichi Koga, Randy Pausch, DIVER: A
Distributed Virtual Environment Research Platform, IEEE Symposium on
Research Frontiers in Virtual Reality, October 25-26, 1993, San Jose, CA,
pp. 10-15.
- 9.
- Jackie Neider, Tom Davis and Mason Woo, Open GL Programming Guide,
Addison-Wesley, Reading, Mass, 1993.
- 10.
- Randy Pausch, Tommy Burnette, A.C. Capehart, Matthew Conway, Dennis
Cosgrove, Rob DeLine, Jim Durbin, Rich Gossweiler, Shuichi Koga, Jeff White,
Alice: A Rapid Prototyping System for 3D Graphics, IEEE Computer
Graphics and Applications, 15(3), May 1995, pp. 8-11.
http://www.cs.virginia.edu/~alice/
- 11.
- John Rohlf and James Helman, IRIS Performer: A High Performance
Multiprocessing Toolkit for Real-Time 3D Graphics, ACM Annual
Proceedings of SIGGRAPH `94, (Orlando Florida), July 24-29, 1994,
Addison-Wesley, pp. 381-393.
- 12.
- Bruce Schachter (Ed.), Computer Image Generation, John Wiley and
Sons, New York, NY, 1983.
- 13.
- Andries van Dam, et al. PHIGS+ Functional Description 3.0,
Computer Graphics, 22 (3), July, 1988, pp. 124-218.