Visualization of Large-Scale Distributed Data


INTRODUCTION
The primary goal of visualization is insight. An effective visualization is best achieved
through the creation of a proper representation of data and the interactive manipulation and
querying of the visualization. Large-scale data visualization is particularly challenging because
the size of the data is several orders of magnitude larger than what can be managed on an average
desktop computer. Data sizes range from terabytes to petabytes (and soon exabytes) rather than a
few megabytes to gigabytes. Large-scale data can also be of much greater dimensionality, and
there is often a need to correlate it with other types of similarly large and complex data.
Furthermore, the need to query data at the level of individual data samples is superseded by the
need to search for larger trends in the data. Lastly, while interactive manipulation of a derived
visualization is important, it is much more difficult to achieve because each new visualization
requires either re-traversing the entire dataset, or compromising by only viewing a small subset of
the whole. Large-scale data visualization therefore requires the use of distributed computing.
The individual components of a data visualization pipeline can be abstracted as data storage, data filtering / mining, rendering, and display.
The degree to which these individual components are distributed or collocated has historically
been driven by the cost to deploy and maintain infrastructure and services. Early in the history of
scientific computing, networking bandwidth was expensive and therefore scarce. Consequently
early visualization pipelines tended to minimize the movement of data over networks in favor of
collocating data storage with data processing. However, as the amount and variety of data
continued to grow at an exponential pace, it became too costly to maintain full replicas of the data
for each individual that needed to use it. Instead, by leveraging the widespread expansion of the
Internet and other national and international high-speed network infrastructure such as the
National LambdaRail, Internet2, and the Global Lambda Integrated Facility, data and service
providers began to migrate toward a model of widespread distribution of resources.
In this chapter we will first introduce the various instantiations of the visualization pipeline
and the historic motivation for their creation. We will then examine individual components of the
pipeline in detail to understand the technical challenges that must be solved in order to ensure
continued scalability. We will discuss distributed data management issues that are specifically relevant to large-scale visualization. We will also introduce key data rendering techniques and
explain through case studies approaches for scaling them by leveraging distributed computing.
Lastly we will describe advanced display technologies that are now considered the “lenses” for
examining large-scale data.
Collocated Data, Filtering, Rendering and Display Resources
Most visualization software packages have a pipeline architecture where raw data comes in at
one end of the pipeline from disk or the network, moves through a sequence of filters that process
the data on the CPU and generate computer graphics primitives (e.g. lines, triangles, splats,
pixels) which are rendered on the GPU, and displayed on a monitor at the other end of the
pipeline. Some filters deal with accessing data or generating data. Other filters convert data from
one form to another. Finally there are filters that deal with the creation of computer graphics.
Each filter has an explicit input and output format allowing compatible filters to be linked
together to move data through the pipeline, and new filters can be added to the library of filters to
support new data formats, graphics formats or rendering techniques.
For example, in 2008 the NASA-funded ENDURANCE project sent an autonomous
underwater vehicle (AUV) into ice-covered West Lake Bonney in Antarctica. The AUV took
measurements of temperature, conductivity, pH, and five other values at over 18,000 points
within the lake. If a scientist wants to see overall temperature trends in the lake then the pipeline
could read in the data, and plot each temperature value as a box at the appropriate X, Y, Z
location where each box is colored based on a transfer function that maps a given temperature
value to a particular color. To look inside the lake, the scientist may want to see the temperature
values only on a plane slicing through the lake. Here the pipeline could add a filter to interpolate
raw temperature values across the surface of the plane. The scientist might also want to compare
multiple values (e.g. temperature, conductivity, and pH) by plotting all three simultaneously.
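To make the pipeline concrete, the following sketch builds a minimal version of the lake example using the Python bindings of the Visualization Toolkit (VTK, mentioned later in this chapter): a source of sample points, a filter that turns them into renderable vertices, a color transfer function, and a renderer. The synthetic data, the temperature range, and the point count are illustrative assumptions, not values taken from the ENDURANCE dataset.

```python
import vtk
import random

# Hypothetical stand-in for the ~18,000 AUV samples: (x, y, z, temperature).
points = vtk.vtkPoints()
temps = vtk.vtkFloatArray()
temps.SetName("temperature")
for _ in range(18000):
    points.InsertNextPoint(random.uniform(0, 800),
                           random.uniform(0, 300),
                           random.uniform(-40, 0))
    temps.InsertNextValue(random.uniform(-2.0, 6.0))  # degrees C, assumed range

polydata = vtk.vtkPolyData()
polydata.SetPoints(points)
polydata.GetPointData().SetScalars(temps)

# Filter: turn bare points into renderable vertex glyphs.
verts = vtk.vtkVertexGlyphFilter()
verts.SetInputData(polydata)

# Transfer function: map temperature to color (blue = cold, red = warm).
lut = vtk.vtkLookupTable()
lut.SetHueRange(0.667, 0.0)
lut.Build()

mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(verts.GetOutputPort())
mapper.SetLookupTable(lut)
mapper.SetScalarRange(-2.0, 6.0)

actor = vtk.vtkActor()
actor.SetMapper(mapper)
actor.GetProperty().SetPointSize(3)

renderer = vtk.vtkRenderer()
renderer.AddActor(actor)
window = vtk.vtkRenderWindow()
window.AddRenderer(renderer)
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(window)

window.Render()
interactor.Start()
```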
The advantages of this kind of pipeline/layered approach were described in the 1987 report
from the first workshop on Visualization in Scientific Computing (McCormick, 1988) that in
many ways marked the official beginning of the field of Scientific Visualization. These
advantages include:

Easy adaptation to a variety of data sources and display devices;

Software that is tailorable to the needs of diverse applications;

The ability to add more techniques and algorithms;

Layers that can be distributed to different computers to better leverage their capabilities
and minimize cost;

Affording economically viable development of a shared and interoperable set of tools;

Software that can accommodate different budgets and equipment preferences of a broad
community of users;

Being able to address the demands of a broad scientific community in coherent unison
rather than on a case-by-case basis.
Rendering can produce geometric primitives such as points, lines, or surfaces made of
triangles that can be cached on the graphics card. When the user changes their viewpoint on the
visualization, these primitives can be quickly re-rendered without re-running the data access, data
conversion, and rendering filters in the pipeline. While adequate for point, vector, or surface data,
these polygonal representations are less appropriate for volumetric data. An alternative is direct
rendering where there is no geometric representation and individual data points are drawn as
‘splats’, or volumes are rendered through raytracing each pixel on the screen. Direct rendering
can give much better views of volumetric data, however as there are no graphics primitives to be
cached, the entire scene must be re-rendered when the viewpoint changes.
In addition to rendering the dataset itself, supplementary graphics are often needed to put the
data into context (e.g. in our example of the lake above, it might be helpful to see an outline of
the lake, or a North pointing arrow, or latitude and longitude lines) and to supply information
about the visualization (depth scale, legend, color bars, etc.). Filters in the pipeline generate these
supplemental graphics or read them from disk. These also must be re-drawn when the
visualization is re-rendered but supplementary graphics are usually very lightweight compared to
the data being visualized.
Rendering is usually done locally on the display machine. These visualizations can be viewed
in 2D on a typical monitor screen or projector, or in 3D using a stereoscopic monitor, CAVE©
(Figure 2) (Cruz-Neira, 1992), or other stereo projection setup. Rendering in 3D may make it
easier to disambiguate complex multi-dimensional datasets but the creation of the 3D version
requires rendering the dataset from two viewpoints – one for each eye. For users lacking a 3D
display, the ability to quickly rotate a visualization, and the ability to view the dataset from
multiple angles simultaneously can help disambiguate complex scenes. While this text focuses on
rendering data visually there is also a body of work focusing on ‘visualizing’ data using
sonification or haptics to augment visualization through the use of other senses. In these cases
most of the pipeline remains the same except that there is additional rendering and display needed
for these other senses.
The user interacts with the visualization by directly or indirectly modifying the filters in the
pipeline – either by adding or removing filters, or changing the parameters of the existing filters.
One of the major advantages of a pipeline architecture is that the entire pipeline does not need to
be re-run when a change is requested. If intermediate results are stored at each of the filters then
only the filters downstream from the affected filter need to be recomputed. This saves time at the
cost of needing more memory to hold both the original dataset and its intermediate forms. Once
the pipeline is set up correctly, similar datasets (e.g. multiple timesteps) can be fed through the
pipeline without changing the filter settings.
The user can interact with the visualization in several ways. In the simplest case the user can
change their point of view, or rotate and scale the visualization to see the visualization as an
object from the outside looking in, or from within the visualization looking around. The user will
also typically want to turn individual visualizations or supplementary graphics on and off, change
the time-step in an animated sequence, change the color transfer function, or change the
representation of the visualization (points, surfaces, glyphs). More sophisticated users will want to
dynamically change the pipeline itself. Interaction with the pipeline is typically through a
graphical user interface that is linked into the pipeline, allowing the user to interact with familiar
interface elements - buttons, sliders, check boxes that affect the filter parameters behind the
scenes. On desktop systems interaction is typically performed with a keyboard and mouse.
Immersive environments with head-tracking allow a user to more naturally change their point of
view by simply moving their head, but these environments typically use a hand-held pointer
instead of a mouse and keyboard, making it harder to change parameters quickly and accurately.
Given the large amounts of data in multi-dimensional datasets, the amount of data must be
reduced at several points prior to display; there will often be more raw data points than there are
pixels on the screen to display them, or space in main memory to hold them. For interactive
visualization, datasets often need to be reduced so they will fit on disk at the local site, further
reduced to fit in main memory at the local site, and further reduced to fit into memory on the local
graphics card(s). Visualization systems often store data at multiple resolutions to allow the user to
have fast interactive access to a lower resolution version of the visualization that can remain on
the graphics card. This allows a user to quickly manipulate the low-resolution version, for
example rotating a wireframe box showing the space the data takes up, or a visualization of only
every 100th data item, while still seeing enough landmarks to do the manipulation accurately.
Once the user stops manipulating the low-resolution version of the visualization, the pipeline can
take the time to generate the full-resolution version, replacing the low-resolution version.
For real-time interaction with this kind of visualization there are limited filtering
opportunities. Interaction is very fast if the data being visualized fits in the graphics card, for
example changing the viewpoint on a polygonal surface representation or on volume data that fits
in the card’s 3D texture memory. If the user changes the visualization parameters on an existing
visualization (e.g. generating a new polygonal iso-surface or changing the viewable subset of
volume data) then various filters are going to need to be re-run. The CPU will need to use data
from main memory to compute the new visualization and send it to the graphics card. If the
necessary data are not currently stored in main memory then the CPU needs to move it in from
disk or from the network, and the user must wait for that data to be fetched before the pipeline
can execute and display the results. Prefetching data to local disk or main memory can be used to
reduce this wait time, but there is only so much storage space on the graphics card, in main
memory, and on local disk, so that space usage must be optimized for a given visualization. This
often means that an initial preprocessing step is needed to turn large data files into a hierarchy of
fixed-size blocks for faster transfer between the various caches.
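As a concrete illustration of this preprocessing step, the following sketch cuts a volume into fixed-size bricks at several resolutions; the brick size, level count, and synthetic volume are assumptions chosen for illustration.

```python
import numpy as np

def build_brick_hierarchy(volume, brick=64, levels=3):
    """Return {level: {(i, j, k): brick_array}}; level 0 is full resolution."""
    hierarchy = {}
    data = volume
    for level in range(levels):
        bricks = {}
        nz, ny, nx = data.shape
        for k in range(0, nz, brick):
            for j in range(0, ny, brick):
                for i in range(0, nx, brick):
                    bricks[(i // brick, j // brick, k // brick)] = \
                        data[k:k + brick, j:j + brick, i:i + brick]
        hierarchy[level] = bricks
        # Halve the resolution in each dimension for the next, coarser level.
        data = data[::2, ::2, ::2]
    return hierarchy

# Example: a synthetic 256^3 volume; while the user interacts, the renderer
# would draw the coarsest level and swap in level 0 once interaction stops.
volume = np.random.rand(256, 256, 256).astype(np.float32)
hierarchy = build_brick_hierarchy(volume)
for level, bricks in hierarchy.items():
    print(f"level {level}: {len(bricks)} bricks")
```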
With CPUs moving towards an increasing number of cores rather than each core becoming
increasingly faster, being able to parallelize the work of the visualization pipeline on a single
machine, as well as using clusters to parallelize the work across multiple machines is becoming
more important in decreasing computation time for very large datasets. Shading languages such
as GLSL and Cg are being used to speed up rendering on the GPUs. Languages like CUDA
are being used on GPUs to speed up computation by moving some of the computation off of the
CPU and onto the GPU. As with multiple CPUs, bus bandwidth can become
another limiting factor: if data cannot be moved into the processor fast enough then its
processing power cannot be fully exploited.
Non-Interactive, Offline Rendered Visualization
When computational simulations produce extremely large datasets, and/or when raw data
collected by scientific instruments, such as remote sensors, fill up storage systems, interactive
exploration of the results becomes extremely difficult. A common and effective approach has
been to produce animations and video files from the data that the scientist can explore at his or
her home institution and can share with colleagues.
Long running simulations such as weather or earthquake simulations, protein folding, high-
energy physics or astronomical computations fall in this category. They require large amounts of
data as initial conditions (the starting point of the simulation) and produce greater amounts of
data, often through a time series of results (a complete solution for each time step during the
length of the simulation).
When faced with such a scenario, scientists often find that the data are too large
to be gathered and shipped back to their office computers for analysis and visualization. The
data have to remain within the supercomputing facility that hosts the computational simulation or
stores the data in the first place.
One solution then is to schedule batch (non-interactive) jobs that produce images based on the
simulation results. An image can be generated for each time step of the simulation, or an
animated camera can follow a predetermined path for a given dataset. More complex
visualizations can combine animations of numerous parameters (camera path, time series, change
in representation, selection of variables to be visualized, etc.). While the time to render one frame
can be rather large, the common approach is to apply an image-based parallelization where each
processor renders a distinct complete frame. This is relatively simple to implement and fits well
in the supercomputing infrastructure: a number of processor nodes can access a large commonly
shared file system hosting the complete simulation results. If rendering jobs access different time
step results, each processor accesses unique files, which maximizes I/O throughput. If rendering
jobs access the same data, the read-only nature of the process also maximizes file system usage.
The end result of the rendering job is a series of frames, usually kept in a lossless file
format, such as TIFF, to maintain data integrity. Because the process is non-interactive, high-
resolution images, including stereoscopic 3D, can be produced with high-quality,
photo-realistic rendering techniques such as raytracing. Frames can be combined in various
movie formats of different resolution and quality for multiple purposes (result dissemination,
education and outreach, movie production, etc.).
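The following sketch illustrates this image-based parallelization under the assumption of an MPI batch job (via mpi4py); render_timestep is a hypothetical placeholder for whatever renderer reads a time step from the shared file system and writes a frame.

```python
from mpi4py import MPI

def render_timestep(step, output_dir="frames"):
    # Placeholder: load the time-step file, apply the predefined camera path and
    # transfer function, and write e.g. f"{output_dir}/frame_{step:05d}.tif".
    pass

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
num_timesteps = 1000          # assumed length of the simulation output

# Round-robin assignment: each processor renders distinct, complete frames,
# so reads of the shared results are spread across unique files.
for step in range(rank, num_timesteps, size):
    render_timestep(step)
```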
The software used for such approaches varies greatly from very specific home-brewed
visualization software for a very narrow purpose (such as seismic wave propagation for
earthquake visualization), to scientific rendering toolkits (such as the ‘Visualization Toolkit’,
vtk.org, for medical or engineering applications), finally up to general-purpose rendering
packages (such as Autodesk Maya or Pixar’s Renderman). If available, graphics hardware can be
used to speed up such intensive rendering jobs.
Given the non-interactive characteristics of this process, the problem resides in specifying the
visualization and rendering parameters (variable selection, camera path, shading, lighting, etc.).
Scientists have devised different approaches: for a well-defined visualization technique such as
volume rendering, the main parameter to be selected is the transfer function that defines the
mapping from data values into color and transparency. This function can be selected off-line, on a
small dataset and then applied during full-scale rendering. A desktop user-interface or a web
portal can be used to define rendering job characteristics (data input, image size, quality, etc.) and
visualization parameters (viewpoint, transfer functions, etc.). More advanced systems let the user
preview the visualization in low resolution and quality but give full access to all the parameters of
the visualization (for instance Autodesk Maya). Even more advanced systems allow the user to be
immersed in the visualization by using a VR device such as the CAVE©, where the scientists can
interact with a proxy (reduced dataset) of the visualization and ‘choreograph’ the rendering to be
done. Donna Cox, leading the Advanced Visualization Laboratory at the National Center for
Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, has
pioneered such an approach with the ‘Virtual Director’ system (Thiebaux, 2000).
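As an illustration of how such a transfer function can be specified off-line and then reused by a batch job, the sketch below defines color and opacity control points with VTK's Python bindings; the data range and control points are arbitrary examples.

```python
import vtk

# Color transfer function: maps data values to RGB.
color_tf = vtk.vtkColorTransferFunction()
color_tf.AddRGBPoint(0.0, 0.0, 0.0, 1.0)   # low values  -> blue
color_tf.AddRGBPoint(0.5, 1.0, 1.0, 1.0)   # mid values  -> white
color_tf.AddRGBPoint(1.0, 1.0, 0.0, 0.0)   # high values -> red

# Opacity transfer function: maps data values to transparency.
opacity_tf = vtk.vtkPiecewiseFunction()
opacity_tf.AddPoint(0.0, 0.0)              # low values fully transparent
opacity_tf.AddPoint(1.0, 0.8)              # high values mostly opaque

volume_property = vtk.vtkVolumeProperty()
volume_property.SetColor(color_tf)
volume_property.SetScalarOpacity(opacity_tf)

# The control points alone are enough to hand to a batch job (for example via a
# web portal's job description) and re-create the same mapping at full scale.
control_points = {
    "color": [(0.0, (0.0, 0.0, 1.0)), (0.5, (1.0, 1.0, 1.0)), (1.0, (1.0, 0.0, 0.0))],
    "opacity": [(0.0, 0.0), (1.0, 0.8)],
}
```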
With the advent of petascale computing, where the data produced is so large that it overflows
even the largest storage infrastructure, the approach of generating movies and animations directly
inside the supercomputing facility is gaining renewed interest, taking the model in different
directions:

Compared to petascale sized datasets, even the highest resolution uncompressed
rendering is becoming attractive for its predictable and manageable size. A format
gaining traction is the so-called 4K format (4096x2160 pixel resolution) adopted by the
Digital Cinema Initiatives as the next-generation movie distribution format. At four times
the resolution of high-definition television, it becomes possible to portray a simulation
dataset with exquisite detail.

While simulations can run for an extremely long time, it is often beneficial to couple the
simulation process with the visualization process, and generate early rendering results as
soon as possible while the simulation is still running. This technique, referred to as
computational steering if the user can interactively control simulation parameters,
enables early detection of problems such as programming bugs or incorrect initial
conditions in the simulation. The simulation-visualization coupling can be done through a
data checkpoint on storage, or through network communication between a simulation
cluster and a visualization rendering resource.

If a tight coupling between simulation and visualization is achieved by running both
programs on the same computational resource, in situ visualization can be performed,
removing the need for any storage or data transmission, hence speeding up the generation
of visual results.

Using high-speed networking, the rendering need not be stored for later examination,
but can be streamed in full quality to the scientist at his or her desk.
The benefits of running the complete visualization pipeline all at once and generating
animations or video streams are:

Asynchronous work (“submit and forget”) for the scientists;

Non-interactive jobs, where known batch optimizations can be applied, and predictable
usage of resources make the visualization process efficient and acceptable to the scientist
and the supercomputing community;

Early rendering as the simulation is running can be extremely beneficial;

The end product is much smaller than the simulation data (even for high-resolution
uncompressed rendering), and can still be cut and edited into a movie production pipeline
for education, dissemination and outreach.
The drawbacks of the approach are:

Non-interactive visualization, so the scientist needs to know what to look for, which
limits the visual exploration of simulation results;

May require work on proxy data beforehand to determine the visualization and rendering
parameters;

Data have to be reloaded and re-rendered if modifications are needed.
Examples of Offline Rendered Visualizations

The San Diego Supercomputer Center and the Southern California Earthquake Center’s
TeraShake is the largest and most detailed simulation of what happens during a major
earthquake on the southern San Andreas fault, computed at 3000x1500x400 resolution.
Various volume renderings are used to produce movies. The parameters of visualization
are submitted by scientists through a web portal, which in turn schedules and launches
numerous jobs on large compute resources such as the TeraGrid. The end results are time-
series animations for various earthquake scenarios.

The National Center for Supercomputing Applications’ classic tornado simulations
calculated the birth of a tornado, starting with data from weather conditions that produced
a powerful tornado in South Dakota. The results are choreographed, cinematic
renderings of this complex simulation data; combined artistic and scientific expertise
translated the data into a dynamic, high-definition animated visualization. Data
representations (proxies) are loaded into software such as Maya, visualization parameters
are selected interactively in pre-visualization, and then jobs are launched for rendering on
a powerful cluster computer.

Accurately modeling the dynamics and structure of the Earth’s interior requires massively
distributed computational techniques - the use of high-performance PC clusters
generating terabytes of data which are then visualized through distributed volumetric
rendering methods to produce high-quality monoscopic and stereoscopic visualizations.
Global seismic wave propagation data are generated with software from the Theoretical
& Computational Seismology group at the California Institute of Technology. This
software produces highly accurate simulations of wave propagation. Global models for
each time-step are combined to create a global model of seismic wave propagation over
time. A one-hour seismic wave field divergence simulation of the 1994 earthquake in
Bolivia highlighting positive amplitude compression (P) waves at progressive stages
during propagation takes 21 hours to compute. The simulation data are redistributed
across the cluster according to the requirements of the specific visualization algorithm.
The 169 time steps of the simulation are sampled uniformly throughout model time,
making the total time for gathering and interpolating the data 33.8 hours and producing a
512 x 512 x 512 regular grid at 16-bit resolution per time step (about 40 GB of data). An interactive tool is
used for staging and rendering monoscopic and stereoscopic 3D volumetric animations.
The approach allows the user to interactively select datasets, setup camera positions, pick
color, opacity and gradient transfer functions, and specify movie properties. The
rendering was performed on a TeraGrid cluster at the San Diego Supercomputer Center
on up to 64 nodes. The table below shows rendering times and speedups for a low-
resolution stereoscopic rendering and high-resolution monoscopic rendering of the 169
time steps of the simulation: it shows significant acceleration, where the stereoscopic
animation takes only 2 minutes and 30 seconds to generate on 64 processors, and the
high-resolution animation showing all the simulation details took only 28 minutes. The
end products, as shown in the following pictures, are time-series animations showing
three different views of the wave propagation through the earth.
Fully Distributed Visualization
As the amount and variety of available data on the Internet grows, so will the means by which
the data can be filtered or mined, visualized and displayed. There is no single data mining or
visualization technique that will satisfy the needs of every application; the choice of technique is
highly dependent on the fundamental question being asked about the data. The power of the
Internet is maximally leveraged when storage, filtering / mining, visualization and display
capabilities are presented as independent services that can be combined in myriad different
ways, distributing the visualization pipeline around the planet. Google Maps is an example of how
the coupling of a map service with a data service can provide wholly new forms of information
and insight.
A key technological enabler for a distributed service-oriented paradigm (sometimes referred to
as Grid Computing or Cloud Computing) is an abundance of high-speed networking.
Traditionally, networking was considered expensive and it was therefore impractical to move
large quantities of data over the network. As a result, data filtering and mining algorithms were
collocated with the source of the data. This made it difficult for new filtering and mining
algorithms to emerge as it was not possible for the original data providers to anticipate all the
different ways in which their users might want to use the data. However with the continual
decrease in networking costs, a fully distributed model is quickly becoming practical.
In 2002, the National Science Foundation’s OptIPuter project sought to envision how
unlimited networking bandwidth could change fundamental notions of computing and the way
scientists conduct basic research (Smarr, 2003). In the OptIPuter model, all computing resources
were fully distributed, and the networks that tied the resources together were thought of as wide-
area system busses rather than networks. Therefore a cluster of computers with massive amounts
of storage was collectively considered a single virtual disk drive, and a cluster of computers with
high performance graphics cards was thought of as a single graphics card. The project resulted in
the creation of numerous innovations which include: ultra-high-speed network transport protocols
that could routinely move data over 1-10 Gigabit/s networks between individual compute nodes
at close to the full line rate of the network (Vishwanath, 2007; He, 2002; Xiong, 2005); ultra-
high-speed national and international networks that could be provisioned directly by the
applications through the allocation of dynamic light paths (DeFanti, 2003); wide-area RAMDisk
caches that could improve I/O-bound scientific applications by five-fold (Vishwanath, 2008a);
and new middleware and hardware approaches to drive network-enabled displays with hundreds
of megapixels of resolution (Jeong, 2006). The OptIPuter project members were able to
successfully demonstrate that these individual systems could be daisy-chained to produce a
planetary-scale distributed computer (Zhang, 2003).
DATA MANAGEMENT FOR SUPPORTING DISTRIBUTED VISUALIZATION
Visual exploration of multi-terabyte and petabyte-sized datasets is critical to gain insight in
scientific domains, such as astrophysics, earth sciences and biosciences. When a hurricane is
about to make landfall, scientists involved in disaster management need to visualize, in real time,
climate models running on geographically distributed clusters and overlay these models with high-
resolution map data, sensor data and traffic information in order to make informed decisions regarding
the evacuation of people. The requirements of these visualizations include access to multiple 2D,
3D, 4D and higher dimensional datasets that are geographically distributed. However, the critical
challenges in achieving high performance are the access latencies of storage systems and wide-
area networks, the need for rapid access to large multi-dimensional datasets, and the need for
transparent access to both local and remote data repositories.
Scientists have traditionally copied remote datasets to local repositories in order to visualize
them. However, data replication is no longer a viable solution due to the enormous data-set sizes,
cost of the additional local storage, data consistency issues and real-time requirements. Prior
work enabling visualization for distributed data was limited to accessing a single multi-
dimensional dataset (Gao, 2005 and Ding, 2003), data present in a specific format (Prohaska,
2004) or data present on a specific storage system (Bethel, 2000 and Ding 2003). Additionally,
these solutions were demonstrated on 1Gbps wide-area networks using TCP as the transport
protocol to transfer data (Bethel, 2000, Ding, 2003 and Benyon, 2002). However, TCP fails to
scale to 10 Gbps and faster wide-area networks, and novel transport protocols are needed to fully
exploit the large available bandwidth. As visualization applications require access to data in
multiple scientific formats stored on various filesystems across distributed data repositories, a
cohesive data management scheme is of paramount importance: one that provides rapid and
transparent access to datasets independent of their formats and storage systems while fully
exploiting the available bandwidth for low-latency access.
Case Study : LambdaRAM, a Distributed Data Cache
LambdaRAM (Vishwanath, 2009) is an approach to address the data management needs of
large distributed data visualization. LambdaRAM is a high-performance, multi-dimensional,
distributed cache that harnesses the memory of cluster nodes in one or more clusters that are
interconnected by ultra-high-speed networks, providing data-intensive scientific visualization
applications with rapid and transparent access to both local and remote data. LambdaRAM
employs novel latency mitigation heuristics based on the access patterns of an application,
including presending (a push-based mechanism), prefetching (a pull-based mechanism), and hybrid
combinations of the two, proactively fetching data before an application needs it. It uses high-
performance data transfer protocols, including Celeritas (Vishwanath, 2008b), designed to fully
exploit the large available bandwidth of wide-area optical networks. LambdaRAM employs
multi-dimensional distributed data structures, including multi-grids and octrees, to manage multi-
dimensional datasets. This enables support for 2-D, volume, time-varying volume, and parallel
coordinate based visualization. LambdaRAM efficiently distributes and manages the data across
the nodes of the clusters. The data distribution in LambdaRAM is configurable and can easily be
tuned to the needs of an application. Scientific visualization requires access to datasets present in
formats including TIFF, HDF4, NetCDF, raw and binary data. Additionally, datasets could be
located on local storage, parallel filesystems or distributed filesystems. LambdaRAM has an
extensible design that aids in the design of plug-ins to interface visualization applications with
multiple scientific data formats and filesystems. Additionally, as a scientific dataset is typically
composed of multiple files, LambdaRAM presents an intuitive API that lets applications access and
manipulate the dataset as a whole without having to worry about data management. The API allows
visualization applications to easily subsample, subset and interact with the data. LambdaRAM
can manage multiple-datasets simultaneously. This is critical for visualization applications using
multi-resolution techniques wherein each resolution is managed as a dataset in LambdaRAM.
LambdaRAM provides memory quality of service guarantees necessary in interactive data
visualization to ensure that the low-resolution data are always cached in memory. Additionally,
as we scale towards petascale and future exascale systems, software testing is no longer
sufficient to guarantee reliable performance in mission critical environments. Formal verification
techniques have been applied to prove the safety and progress properties of LambdaRAM
(Vishwanath, 2008a). This helped identify a deadlock condition in the memory management of
LambdaRAM.
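The sketch below illustrates the general idea behind pull-based prefetching with a simple sequential-access heuristic; it is a generic illustration and not LambdaRAM's actual implementation or API.

```python
import concurrent.futures

class PrefetchingCache:
    """Generic sketch: when block i is requested, block i+1 is fetched early."""
    def __init__(self, fetch_block):
        self.fetch_block = fetch_block                      # e.g. a remote read
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self.pending = {}                                   # block id -> future

    def get(self, block_id):
        future = self.pending.pop(block_id, None)
        data = future.result() if future else self.fetch_block(block_id)
        # Assume a sequential access pattern and pull the next block early,
        # hiding the one-way request latency behind the application's work.
        self.pending[block_id + 1] = self.pool.submit(self.fetch_block, block_id + 1)
        return data

cache = PrefetchingCache(fetch_block=lambda i: f"block {i}")   # stand-in fetcher
for step in range(3):
    print(cache.get(step))
```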
Figure 4 depicts a parallel visualization application using LambdaRAM to access the 600TB
multi-dimensional National Aeronautics and Space Administration (NASA) Modern-Era
Retrospective analysis for Research and Applications (MERRA) data distributed on a local parallel
filesystem. In this case, LambdaRAM harnesses the memory of the nodes of the cluster.
LambdaRAM can harness the memory of the nodes of multiple clusters interconnected by ultra-
high-speed networks to provide low-latency access to remote data. Figure 5 depicts a parallel data
visualization application running at NASA Ames using LambdaRAM to rapidly and transparently
access the MERRA data repository at NASA Goddard over 10 Gbps high-speed networks.
LambdaRAM helps data visualization scientists spend their efforts on visualization algorithms
that facilitate scientific discovery instead of on data management issues
including subsampling and subsetting the dataset, memory management and copying data
between sites. LambdaRAM enables time-critical, high-performance data collaboration over both
the local and wide-area for data-intensive applications. Its extensible design enables high-
performance highly productive scientific visualization of large distributed data.
Distributed Data Access and Visualization of NASA’S MERRA Data for
Climate Analysis using LambdaRAM over Wide-Area Networks
The MERRA time period covers the modern era of remotely sensed data for the entire earth,
from 1979 through the present, and is key to understanding climate variability. The data are
published in the HDF4 format and consist of multiple 2D, 3D and 4D variables. The dataset size
is approximately 600TB and is stored at NASA Goddard. Given the size of the dataset,
replicating it at other sites incurs a heavy cost. Additionally, given the researchers’ need to
modify the dataset, replication could lead to data consistency issues. NASA would like to enable
their researchers and collaborators at various sites to seamlessly access the data located at
Goddard and use it in their weather analysis and visualizations. The ability to rapidly stride over
the multidimensional MERRA data using LambdaRAM would enable earth scientists to make
more informed weather predictions.
The performance of a climate analysis and visualization application accessing NASA’s
MERRA data using LambdaRAM over wide-area networks was evaluated. For this experiment,
MERRA data for 1979 was stored on an ultra-fast storage system at Chicago. The dataset is
approximately 1 Terabyte (TB) in size, and consists of 1440 data files in HDF4 format. The
storage system consists of a dual-processor dual-core AMD Opteron system with 4GB of RAM,
2TB of storage and a 10GE PCI-e based network interface card (NIC). The 2TB storage was
configured using eight 300 GB SATA-II disks on a PCI-e based 8-port 3ware RAID controller
using RAID 0. The data analysis and visualization was performed on a node at NASA Goddard.
This node consists of a dual-core dual-processor AMD Opteron with 4GB RAM and a 10GE PCI-
X based Intel NIC. A dedicated 10 Gbps network was provisioned for the experiment to
interconnect the two nodes. The network testbed is depicted in Figure 6 and consisted of the
CAVEWave network between Chicago and McLean, Virginia, and the DRAGON network
between McLean, VA and NASA Goddard. The total round trip latency for this network was
around 18ms.
Figure 7 depicts the times to access, compute and visualize the average ozone thickness of
MERRA data for 1979. From the figure, one can see as much as a 50% improvement in
performance with a single LambdaRAM server and a 100% improvement using two
LambdaRAM servers as compared to a typical ultra-fast storage system. On local storage,
accessing multidimensional data involves accessing multiple noncontiguous regions on disk
which incurs significant access overhead and leads to performance degradation. Improved
performance was observed using presending over prefetching, as presending does not incur the one-way
request latency of prefetching. Presending yielded an 8.3% performance improvement over
prefetching using a single LambdaRAM server and 21.6% using two LambdaRAM servers. Also
observed was a 27% performance improvement from using Celeritas, an application-layer UDP-based
reliable transport protocol, in comparison to TCP over the wide-area network. Celeritas sustained
an average throughput of 3.5Gbps in the case of two LambdaRAM servers and 2.1Gbps using a
single LambdaRAM server, and played a key role in achieving high performance over the wide-
area networks.
Particularly noteworthy was that it was faster to stride, compute and visualize the average
ozone thickness of the remote MERRA data using LambdaRAM than using a local ultra-fast
storage node. Additionally, parallel data analysis applications have demonstrated up to a five-fold
performance improvement by using LambdaRAM to access multidimensional data on a PVFS2
parallel filesystem in comparison to using MPI-IO to directly access the data. This is due to the
multidimensional data management and latency mitigation heuristics in LambdaRAM. At
Supercomputing 2004 and 2005, interactive visualization of large 2-D imagery using
LambdaRAM to access data over transatlantic networks was demonstrated.
LambdaRAM’s model of aggressively using network bandwidth to overcome latency will be
crucial to ensuring the scalability of distributed computing and visualization applications whose
greatest bottleneck will be in getting the data to the processing units.
DATA RENDERING FOR SUPPORTING DISTRIBUTED VISUALIZATION
So far, we have seen a number of complete examples demonstrating different instantiations of
the visualization pipeline, both collocated and distributed, interactive and batch. The prior section
emphasized the early stages of the pipeline, managing and staging data. We now move further
down the visualization pipeline to the rendering step, again with a variety of architectures and
with an emphasis on large, distributed, multivariate data. Before presenting more examples,
though, we begin with a brief summary of basic methods and algorithms performed serially on a
single machine. Next, we examine how these methods must be modified when datasets grow in
size and complexity and architectures grow in parallelism and heterogeneity. We conclude with
two case studies in parallel volume rendering – one in hardware on a rendering cluster, the other in
software on a parallel supercomputer.
Rendering Techniques
Rendering methods for scientific visualization are organized according to the type of data
values and grid containing those values. Data values constitute a field over the grid, F(G), where
G ⊂ ℜⁿ. Most often, G is two-dimensional (n = 2) and F = F(x, y), or three-dimensional (n = 3)
and F = F(x, y, z). In the case of time-varying data, time can be treated as a fourth dimension,
where F = F(x, y, z, t). Alternatively, a time-varying dataset can be represented as a sequence of
steady-state 3D datasets such that F = {F₀(x, y, z), F₁(x, y, z), …, Fₜ(x, y, z)}, where time is in
discrete steps from 0 to t. Each Fₜ is a separate time step that may be stored in a separate file on
disk. G can be structured or unstructured, and F can consist of multiple scalar, vector, or tensor
values at each grid point. Different rendering techniques are used depending on various
combinations of F and G.
Field Types
Scalar Fields: A scalar field contains single-valued data at each grid cell, for example,
temperature. Multiple scalar values can exist at each location, for example, temperature, pressure,
and density. Scalar fields can be rendered by selecting a particular value (isovalue) and displaying
all of the locations where the field has that isovalue. The union of these locations forms a surface
(isosurface), and surface rendering techniques such as triangle meshes are used to render
isosurfaces (Lorensen, 1987; Gao, 2001). See Figure 8.
Rather than extracting an isosurface, the entire volume (assuming n = 3 or n = 4) can be
visualized directly using volume rendering techniques (Levoy, 1998; Ma, 2000). Volume
rendering creates an image of the entire 3D space using varying color and transparency. This
allows partial or complete visibility of the inside regions of the volume. Colors and opacities are
mapped to the data values via a user-defined transfer function. See Figure 8.
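As a small illustration of isosurface extraction, the following sketch runs marching cubes over a synthetic scalar field using scikit-image; the field and the isovalue are arbitrary choices, not data from the chapter's examples.

```python
import numpy as np
from skimage import measure

# Synthetic scalar field F(x, y, z): distance from the volume center.
z, y, x = np.mgrid[-1:1:64j, -1:1:64j, -1:1:64j]
field = np.sqrt(x**2 + y**2 + z**2)

# Extract the isosurface F = 0.5 as a triangle mesh (vertices and faces),
# which can then be handed to any surface renderer.
verts, faces, normals, values = measure.marching_cubes(field, level=0.5)
print(f"{len(verts)} vertices, {len(faces)} triangles")
```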
Vector Fields: A vector field contains vector-valued data at each grid cell, for example,
velocity (vx, vy, vz). The collection of velocity vectors represents the flow of particles in space and
in time. When the flow is steady-state, a set of seed points can be numerically integrated over the
velocity field to construct a set of streamlines. When the flow is time-varying, we call the
resulting locus pathlines. An alternative construction for time-varying flows is streaklines,
constructed by regenerating a constant set of seed points over time.
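The following sketch shows the numerical integration step for streamlines in a steady-state flow, using fourth-order Runge-Kutta over a toy analytic velocity field; the field, step size, step count, and seed points are illustrative assumptions.

```python
import numpy as np

def velocity(p):
    """Toy steady-state 2D vector field: a simple rotation about the origin."""
    x, y = p
    return np.array([-y, x])

def integrate_streamline(seed, dt=0.01, steps=500):
    """Trace one streamline from a seed point using fourth-order Runge-Kutta."""
    points = [np.asarray(seed, dtype=float)]
    for _ in range(steps):
        p = points[-1]
        k1 = velocity(p)
        k2 = velocity(p + 0.5 * dt * k1)
        k3 = velocity(p + 0.5 * dt * k2)
        k4 = velocity(p + dt * k3)
        points.append(p + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return np.array(points)

# One polyline per seed; each polyline is then rendered as lines or tubes.
streamlines = [integrate_streamline(seed) for seed in [(1.0, 0.0), (0.5, 0.5)]]
```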
For rendering, each point in a streamline, pathline, or streakline can be connected to its
neighbor with a line or a higher order curve. To reduce aliasing and increase visibility, these lines
can be rendered as tubes or cylinders, or even semitransparent surfaces. To further improve
clarity, lighting and texturing can be applied (Johnson, 2004; vonFunck, 2008). See Figure 8.
There are direct methods of rendering a vector field without tracing particles through it,
including line integral convolution (LIC), glyphs, and topological methods. LIC renders the result
of applying an anisotropic convolution filter directly over the field (Cabral, 1993). Glyphs are
representations such as arrows, cones, etc., that are placed within the field and oriented tangent to
it (de Leeuw, 1993). Topological methods do not visualize the field itself, but rather the topology
of the flow by evaluating ridges, saddles, and critical points in the field (Gyulassy, 2008).
Tensor Fields: Tensors are matrices; for example 3x3 tensors are common in materials science
and medical imaging. Tensors that arise in science can be characterized by their eigenvalues and
eigenvectors. To render such a tensor field, a set of ellipsoidal glyphs is commonly used, where
the principal directions of the ellipsoid correspond to the eigenvectors, and the major and minor
ellipsoid radii reflect the maximum and minimum eigenvalues. Topological methods can also be
applied to tensor fields (Tricoche, 2008).
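A small sketch of this eigen-decomposition step is shown below; the tensor values are arbitrary, and the eigenvalues and eigenvectors are mapped directly to ellipsoid radii and axes.

```python
import numpy as np

# One symmetric 3x3 tensor (illustrative values).
tensor = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.0, 0.1],
                   [0.0, 0.1, 0.5]])

# eigh is appropriate for symmetric tensors; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(tensor)
radii = eigenvalues          # scale factors along each principal axis of the glyph
axes = eigenvectors.T        # rows are the principal directions

print("ellipsoid radii:", radii)
print("principal axes:\n", axes)
```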
Grid Types
Structured Grids: A structured grid follows a regular pattern such that the coordinates of
individual grid cells are not stored; positional and topological information is derived from a few
parameters such as minimum, maximum, and spacing. Perhaps the most familiar example is a
Cartesian grid, but others such as polar, cylindrical, and spherical grids are common. Rectilinear and
curvilinear grids are generalizations of these, where spacing varies across the extent of the grid
(Schroeder, 2006).
A special case of variable-spacing grid is the adaptive mesh refinement (AMR) grid (Ma,
1999; Weber, 2001a; Weber, 2001b). In an AMR grid, block shaped regions corresponding to
various data frequencies have different spatial resolution. These resolutions differ by a constant
factor, for example, half-spacing in each dimension. Many combinations and varieties of
structured grids exist, depending on the data requirements of the underlying simulation.
Structured grids are attractive because the regularity of grid locations saves storage space and
speeds access to data values. Figure 9 shows examples of structured grids.
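The storage saving comes from deriving coordinates on the fly rather than storing them, as in the following sketch (the origin and spacing values are arbitrary examples):

```python
def cell_coordinate(i, j, k, origin=(0.0, 0.0, 0.0), spacing=(1.0, 1.0, 1.0)):
    """Coordinates of cell (i, j, k) on a regular Cartesian grid, derived from
    a few parameters instead of stored per cell."""
    return (origin[0] + i * spacing[0],
            origin[1] + j * spacing[1],
            origin[2] + k * spacing[2])

print(cell_coordinate(10, 5, 2, origin=(-1.0, -1.0, 0.0), spacing=(0.1, 0.1, 0.5)))
```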
Unstructured Grids: Some computational problems exist over a data domain that cannot be
effectively modeled by a structured grid without wasting a great deal of resources. Consider
simulating the airflow over an airplane. Most of the grid cells in a structured grid enveloping such
an irregular shape would represent air. This is appropriate if the objective is to understand the
airflow patterns surrounding the airplane. However, if the objective is to model the stress on the
fuselage, then most of the grid cells would be wasted.
In such cases, each grid cell of interest is listed in an indexed structure. The grid coordinates
of each point must be stored because they cannot be determined algorithmically. Also,
topological information must be stored. In particular, links from each cell to the neighboring cells
are necessary in order to traverse from one cell to an adjacent one. In 2D unstructured grids, cells
are usually triangular or rectangular in shape. The 3D analogues of these cell shapes are tetrahedral
and hexahedral grid cells (Nielson, 1997). See Figure 9.
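The following sketch shows a minimal explicit representation of a small 2D triangular grid, including the cell-to-cell links needed for traversal; the coordinates and connectivity are toy values.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # x, y per vertex
cells = np.array([[0, 1, 2], [1, 3, 2]])                             # vertex indices per triangle

def build_neighbors(cells):
    """Map each cell to the cells sharing an edge with it (topological links)."""
    edge_to_cells = {}
    for c, cell in enumerate(cells):
        for i in range(3):
            edge = tuple(sorted((cell[i], cell[(i + 1) % 3])))
            edge_to_cells.setdefault(edge, []).append(c)
    neighbors = {c: set() for c in range(len(cells))}
    for owners in edge_to_cells.values():
        for a in owners:
            for b in owners:
                if a != b:
                    neighbors[a].add(b)
    return neighbors

print(build_neighbors(cells))   # {0: {1}, 1: {0}}
```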
Data Intensive Rendering
In computational science, simulations run for hours, days, weeks, and sometimes months at a
time across thousands or tens of thousands of compute nodes. These runs produce hundreds of
terabytes, even petabytes of results. Often these datasets are time varying and multivariate. The
scope of modern datasets poses new challenges for rendering; we will use the term data intensive
rendering when discussing approaches to meet these challenges, but we emphasize that this is not
a special case anymore. Eventually most if not all rendering of scientific computations will be
data intensive (Mount, 2004; Ross, 2008).
Data intensive rendering can affect the visualization pipeline; in particular, overall
performance depends on the successful use of parallelism and heterogeneity in the visualization
workflow. The computational performance of individual CPUs and GPUs is limited by electrical
consumption and heat dissipation, so parallelism is one alternative in order to continue to improve
performance. This is true regardless of whether parallelism is inter-core with shared memory or
inter-node with distributed memory. Heterogeneity occurs when different architectures must
cooperate in parallel. Combining today’s multicore chips in parallel induces different types of
parallelism in the same rendering algorithm. The presence of multiple GPUs within the same
system is another example. Ultimately, we must use all of the architectural components in
parallel, not just the homogeneous ones, in order to raise the performance level of data intensive
rendering.
Most strategies used to parallelize the rendering process follow a similar form: data are first
decomposed and distributed to processing units for rendering after which the output of each task
is combined to form the final output image. Parallelization strategies are commonly categorized
based on how data are decomposed across processing units. Parallel direct volume rendering
techniques are usually categorized as either object-order or image-order techniques.
Parallel object-order techniques assign a processing unit to a section of data regardless of
where the data appears in the final output image. After each section of data is rendered, a
compositing step based on the theory described by Porter and Duff (Porter, 1984) constructs the
final image. Much effort has been devoted to developing efficient compositing methods including
Hsu’s direct send method (Hsu, 1993), Camahort and Chakravarty’s projection method
(Camahort, 1993), the binary-swap method (Ma, 1994) and the divided-screenwise hierarchical
compositing method (Ino, 2003). Object-order implementations have been developed for
distributed-memory clusters (Elvins, 1992; Gribble, 2004; Müller, 2006), and specialized
hardware has been developed that implements the binary-swap compositing method (Lombeyda,
2001; Muraki, 2003; Frank, 2005).
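The compositing step itself reduces to repeated application of the Porter-Duff "over" operator to the partial images, as in the following sketch (the image sizes and contents are synthetic placeholders):

```python
import numpy as np

def over(front, back):
    """Porter-Duff 'over' for premultiplied-alpha RGBA images of shape (H, W, 4)."""
    alpha_front = front[..., 3:4]
    return front + (1.0 - alpha_front) * back

def composite(partial_images):
    """Blend per-node partial images, ordered front to back along the view direction."""
    result = partial_images[0]
    for image in partial_images[1:]:
        result = over(result, image)
    return result

def random_partial(height=256, width=256):
    """Stand-in for one node's rendered subdomain (premultiplied RGBA)."""
    rgb = np.random.rand(height, width, 3).astype(np.float32)
    alpha = (np.random.rand(height, width, 1) * 0.5).astype(np.float32)
    return np.concatenate([rgb * alpha, alpha], axis=-1)

# Example: four nodes each rendered their own data subdomain.
final_image = composite([random_partial() for _ in range(4)])
```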
Parallel image-order rendering techniques take advantage of the fact that a contribution to a
pixel in the output image is independent of other pixels in the image. These techniques break the
output image into disjoint regions and assign a processing unit to render everything in that region.
Image-order implementations have been developed for shared-memory systems (Palmer, 1998) as
well as distributed-memory systems (Bajaj, 2000).
Object-order methods distribute input data across processing units before rendering begins.
Data remains stationary even when the view transformation changes. However, object-order
methods require an additional compositing phase to assemble the final output image. The cost of
this phase grows with the number of processing units. Image-order methods avoid the additional
compositing phase but require redistribution of some portion of the input data when the viewpoint
changes. The optimal choice of parallelization strategy is heavily dependent on the
implementation architecture. We will see examples of both strategies later in the section.
A common theme in data intensive rendering is heterogeneity. Heterogeneity may exist within
a compute node, as in the Roadrunner architecture (Barker, 2008), or it can take the form of a
rendering hierarchy, similar to a memory or storage hierarchy. We can consider three levels in
this hierarchy: supercomputer, graphics cluster, and workstation. Because computing power
decreases and interactivity increases as we move to different levels in the hierarchy, one may
expose different system characteristics at each level by mapping different visualization
algorithms and stages of the algorithm to each machine architecture and thereby optimize the end-
to-end performance.
For example, in a vector flow field visualization, the supercomputer can compute particle
traces and transfer finished particles to a visualization cluster for rendering. There, graphical
effects such as semitransparency and lighting can be added more effectively at the GPU hardware
level than in software at the supercomputer. A subset of the rendered surfaces can be transferred
to the scientist’s workstation for local interaction. This type of hierarchy has yet to be realized at
large scales in practice, but heterogeneity of this type will likely play a role in future data
intensive rendering.
Some rough guidelines for mapping algorithms to architectures are:
• Supercomputer: Algorithms such as ray casting that either have a software equivalent or
originated in software; large scale data-bound problems where the cost of moving
data is prohibitive; high quality or large image size rendering where rendering time is
long; complex mesh problems where raw rendering time is not the bottleneck;
preprocessing, filtering, transforming, or numerical analysis operations performed prior
to rendering
• Graphics cluster: Algorithms such as marching cubes that originated as triangle
rendering methods; algorithms that rely heavily on triangles or textures for which the
GPU rendering pipeline is optimized; algorithms that benefit from multithread
parallelism that GPUs can offer, data domains that fit in GPU memory, and interactive
applications.
• Graphics workstation: Versions of graphics cluster algorithms operating on smaller data
subdomains; applications with even higher levels of interactivity; and algorithms that
make local approximations to the higher levels of the hierarchy, for example, image-
based rendering (Chen, 1995).
Finally, a common theme in data intensive rendering is that many algorithm design decisions
are governed by the high cost of reading, writing, and transporting data. Overall performance is
usually bound by the cost of data movement, rather than computation. This data movement can
take the form of transporting files across a grid or the internet, reading and writing to and from
storage systems, and communicating between the nodes of a cluster or supercomputer. We will
see the cost of data movement reappear several times in the following examples.
Case study: Parallel Volume Rendering on a Graphics Cluster
As the spatial size of volume data increases due to advances in scientific instrumentation, so
too does the need to visualize such large data. High-resolution displays allow scientists to see
their spatially large data at or nearer its native resolution. Most often, scalable high-resolution
displays are composed of an array of liquid-crystal displays (LCDs) or projectors and a
commodity distributed-memory cluster of computers with accelerated graphics hardware.
Schwarz and others have shown a distributed image-order volume rendering approach for
high-resolution displays that scales as the output resolution of the display and the number of
cluster nodes increases. This approach preprocesses data into a hierarchical structure which is
distributed across the local storage of a distributed-memory cluster. The cluster is equipped with
graphics cards capable of hardware accelerated 3D texture-mapping. Rendering is aborted and
restarted at the lowest level in the hierarchical data structure if user interaction occurs before the
highest-resolution level is completed.
The successful image-order parallelization of the rendering task relies on a data management
scheme that uses a multi-level cache and distributed shared-memory system. The primary responsibility of the multi-level cache is to keep the most recently used data bricks as close to the
graphics hardware as possible. The distributed shared-memory system keeps track of all data
bricks loaded in memory across cluster nodes. Data bricks are transferred between cluster nodes
when required via the cluster’s high-speed backplane. If a data brick is not in memory on one of
the nodes it is loaded from disk. Performance results show that the system scales as output
resolution and cluster size increases.
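The "keep recently used bricks close to the graphics hardware" policy can be sketched as a chain of fixed-capacity least-recently-used caches, as below; the class, capacities, and two-level structure are illustrative assumptions rather than the actual implementation described by Schwarz and others.

```python
from collections import OrderedDict

class BrickCache:
    def __init__(self, capacity, lower=None, name="cache"):
        self.capacity = capacity        # number of bricks this level can hold
        self.lower = lower              # next (slower, larger) level, or None
        self.name = name
        self._bricks = OrderedDict()

    def get(self, key, load_from_disk):
        if key in self._bricks:
            self._bricks.move_to_end(key)          # mark as most recently used
            return self._bricks[key]
        # Miss: fall through to the next level, or ultimately to disk.
        brick = self.lower.get(key, load_from_disk) if self.lower else load_from_disk(key)
        self._bricks[key] = brick
        if len(self._bricks) > self.capacity:
            self._bricks.popitem(last=False)       # evict least recently used
        return brick

# Two-level hierarchy: a small "GPU texture" cache backed by a larger main-memory cache.
main_memory = BrickCache(capacity=1024, name="RAM")
gpu_texture = BrickCache(capacity=64, lower=main_memory, name="GPU")
brick = gpu_texture.get((3, 1, 2), load_from_disk=lambda key: f"brick {key} from disk")
```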
The Purkinje dataset in Figure 10 is shown on a six panel, twenty-four megapixel tabletop
display. The display is run by a three node cluster where each node is attached to two 2,560 x
1,600 LCDs. Each cluster node has one AMD Athlon 64 FX-60 Dual Core processor, 2 GB of
RAM, and a PCI-E nVidia GeForce 7900 GT graphics card with 256 MB of texture-memory. The
cluster is connected via a 10 Gbps Ethernet backplane with the MTU size set to 9,000 bytes.
The original raw data comprises a 2,048 x 4,096 x 128 volume of 16-bit voxels. The real spatial
extent of the data is about 80 µm x 80 µm x 15 µm. Purkinje neurons exist in the cerebellar cortex
and are responsible for fine motor control. The study of Purkinje neurons may lead to treatments
for genetic mutations, such as autism, and neurodegenerative diseases, such as sporadic ataxias.
The rat kidney dataset in Figure 11 is shown on a fifty-five panel, 100 megapixel display. The
display is run by a twenty-eight node cluster where all but one node is attached to two 1,600 x
1,200 LCDs. Each cluster node has two AMD Opteron 246 processors, 4 GB of RAM, 500 GB of
local storage space, and an 8x AGP nVidia Quadro FX 3000 graphics card with 256 MB of
texture-memory. The cluster is connected via a 1 Gbps Ethernet backplane with the MTU size set
to 9,000 bytes.
The original raw data comprises a 32,768 x 32,768 x 128 volume of 24-bit samples. The real
spatial extent of the data is about 8mm x 5mm x 1.5mm. Scientists are particularly interested in
cell nuclei, which are between 5 µm and 10 µm in length.
Figure 12 shows the average time to render a single frame of the Purkinje neuron dataset.
Average rendering results for the rat kidney dataset are given in Figure 13. These results reflect
the average time taken to render and display all data in all levels of the tree for a given view. The
results show that as the output resolution increases along with the corresponding number of
rendering nodes, the time taken to render each dataset decreases.
Case Study: Parallel Volume Rendering on the IBM Blue Gene
In contrast to the previous case study, Peterka et al. (Peterka, 2008) asked whether and
how parallel supercomputers such as the IBM Blue Gene and Cray XT can operate as parallel
visualization engines. Even though these machines do not have graphics hardware, under certain
conditions of data size, algorithm, number of cores, and output display resolution, they found that a
software rendering solution running in parallel on a supercomputing platform can actually be an
effective alternative to the hardware-accelerated graphics cluster approaches that are the de facto
rendering method today.
Running visualization algorithms on a supercomputer also opens the door to in situ
visualization, where visualizations are rendered simultaneously with a running simulation (Ma,
2007; Tu, 2006; Yu, 2004). This permits scientists to “see into” their computations as they
proceed, even to steer them dynamically. There are many problems yet to be solved before in situ
visualization becomes a commonly used technique, but the ability to execute parallel
visualizations on the same architecture as simulations is one step in that direction.
Peterka and others are researching parallel visualization algorithms on leadership-class
systems, using the IBM Blue Gene/P system as a test environment. To date, a parallel volume
rendering algorithm has been successfully applied to over 32,000 BG/P compute cores. Datasets
of up to 335 billion elements were successfully volume rendered on BG/P to produce images up
to 16 million pixels.
Figure 14 shows an output image from volume rendering the entropy within a core-collapse
supernova using this approach. Through computational simulation, Blondin and Mezzacappa
(Blondin, 2003) are studying the physics of the shock wave that forms during the death of some
of the largest stars.
Figure 15 shows the end-to-end performance and scalability for a range of data, image, and
system sizes. Because the data are time-varying, each time step must be read from storage prior to
rendering. The size of a single time step can be gigabytes or even hundreds of gigabytes, and
Figure 15 shows that most of the total frame time (time to render one time step from start to
finish) is spent on I/O, reading the data from storage. Various optimizations, discussed earlier, can help to hide this high cost (Peterka, 2008).
Figure 16 shows that, after I/O, compositing is the next most expensive operation.
Compositing is a many-to-many communication step among all of the compute cores to exchange
partial results and merge these into a final image. Like I/O, this is a data movement operation.
This example demonstrates that large scale visualization performance is dominated by the cost of
data transfers, both from storage and across the interconnect between compute cores.
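At the heart of the compositing step is the front-to-back "over" operator applied to the partial images that the compute cores exchange. The following sketch (Python with NumPy) illustrates the operator and the merge loop only; the many-to-many exchange itself, whether direct-send or binary-swap, is abstracted away as an already-ordered list of partial images, and the function names are illustrative rather than taken from the actual implementation.

import numpy as np

def over(front_rgba, back_rgba):
    """Front-to-back 'over' operator on premultiplied RGBA images."""
    alpha_front = front_rgba[..., 3:4]
    return front_rgba + (1.0 - alpha_front) * back_rgba

def composite(partials_front_to_back):
    """Merge the partial images produced by the cores, ordered front to back
    along the viewing direction. In a real run the partials arrive through a
    many-to-many message exchange; here they are simply a list of arrays."""
    result = partials_front_to_back[0].astype(float)
    for partial in partials_front_to_back[1:]:
        result = over(result, partial.astype(float))
    return result

In a real run each core holds only a subset of these partials and the merge proceeds in communication rounds, which is precisely why compositing, like I/O, is dominated by data movement.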
Returning to the original question, whether parallel volume rendering can be done within such
a supercomputing architecture, the answer is yes. The strengths of Argonne’s IBM Blue Gene are
a large capacity parallel storage system, low-latency high-bandwidth interconnection network,
and a large number of relatively low power compute nodes. Since the visualization algorithm (in
this case volume rendering) is bound by data movement and not rendering speed, the algorithm
properties and machine characteristics align. By performing visualization directly on the supercomputer, data movement between the supercomputer and a separate visualization cluster is eliminated, and the door is opened to future in situ visualization.
ADVANCED DISPLAYS FOR SUPPORTING DISTRIBUTED VISUALIZATION
Traditionally, data rendering and display occur together; that is, the same computer that renders the data also displays the result. In distributed visualization, especially given the sheer size of the data that needs to be rendered, it is ultimately more economical to separate rendering from display. This Thin Client model is already widely used in practice today: every time we query Google, the "heavy lifting" of generating the result or visualization is performed by massive compute clusters at Google, and the viewer simply sees the result on their desktop computer, laptop, or smartphone. This model is attractive because the large-scale
(and therefore expensive) computing infrastructure that supports such services can be shared
amongst large populations of users, and can be located near reliable and renewable power
sources.
As the types of problems we tackle grow in scale and complexity our notion of what
constitutes a Thin Client must change. The new Thin Client is a large ultra-high-resolution
display interfaced with an ultra-high-speed network. Increased screen real-estate and resolution
enables users to display and juxtapose more data simultaneously and thereby enhances the users’
ability to derive insight from data.
Until the mid-90s image creation for large displays (often called Power Walls) primarily used
Cathode Ray Tube-based projectors. The main advantages of CRT projectors are that they maintain good brightness for up to 10,000 hours, can generate very high resolution images (up to 1920x1200), and can refresh at high rates. In the mid to late 90s, LCD and DLP projectors
were rapidly declining in cost as well as physical size. This was driven largely by the business
graphics market which required low cost projectors for projecting presentation slides in board
rooms. Tilings of LCD or DLP projectors were used to create larger walls (Funkhouser, 2000).
While these projectors were relatively inexpensive it was difficult to align the geometry and color
of the screens. Techniques have been developed for automatic alignment through the use of
computer vision. As a result, commercial turnkey solutions are now available, but at a relatively high cost. Today, projectors such as Sony's SXRD are capable of 8-megapixel resolution (4096x2160) and are used largely in movie theaters. Still in prototype form is NHK's Super Hi-Vision project (Kanazawa, 2003), which is capable of 7680x4320 resolution.
Currently the most economical way to build scalable ultra-high-resolution displays is by using
tilings of LCD panels rather than projectors due to the LCD’s long life, low power consumption,
high resolution and low cost. Other benefits include the fact that LCD panels are quite well color
calibrated and apart from the physical alignment of the panels during tiling, require no further
alignment, unlike projectors. Tilings of these have been used to create extremely high resolution
walls such as the 100 Megapixel LambdaVision display (Renambot, 2005). The main drawback
of LCD panels is that they have mullions that prevent them from producing truly seamless
displays. Mullions do not adversely impact the viewing of an image if they are taken into account in the rendering of the image (as sketched below); however, they do make text difficult to read when they occlude entire words or sentences. Driving such a tiling of displays requires a cluster of computers: each computer typically drives between one and four displays, and a master computer coordinates the entire cluster. By building these tiled displays with small, low-cost computers equipped with gigabit networking, such as Mac Minis, we can leverage the Thin Client paradigm and use remote shared computing infrastructure to generate and stream the visualizations to the tiled display.
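The mullion compensation mentioned above amounts to treating the wall as one continuous virtual canvas in which the bezels cover, but do not compress, the underlying pixels. The sketch below (Python, with hypothetical panel and mullion dimensions) shows how each panel's viewport can be offset so that features remain geometrically continuous across the gaps.

def panel_viewport(col, row, panel_w, panel_h, mullion_x, mullion_y):
    """Return the region of the virtual canvas (in pixels) shown by the panel
    at grid position (col, row). Mullions are counted as pixels that exist in
    the scene but are never displayed. All dimensions here are hypothetical."""
    x0 = col * (panel_w + mullion_x)
    y0 = row * (panel_h + mullion_y)
    return (x0, y0, x0 + panel_w, y0 + panel_h)

# Example: a 2 x 2 wall of 2,560 x 1,600 panels with bezels roughly 90 pixels wide.
for row in range(2):
    for col in range(2):
        print((col, row), panel_viewport(col, row, 2560, 1600, 90, 90))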
The major challenges in supporting tiled display environments, then, are how to scale graphics
rendering algorithms to be able to handle the exponentially growing data volumes that are
accumulated in scientific research, and how to best enable users to interact with information and
visualizations in these environments.
Middleware for Scalable Displays
The traditional model for scalable rendering has assumed that faster graphics cards will be
more than capable of supporting rendering to keep pace with the exponential growth of data size.
Middleware such as WireGL, and its enhanced version Chromium, manages a high-resolution scene by distributing polygons only to the computers responsible for rendering the particular viewport into the overall scene (Humphreys, 2001). Termed "sort-first rendering," the advantage of this scheme is that if an image fills a significant portion of the tiled display, much of the geometry is distributed evenly across all the computers. However, if the image falls on only a small portion of the wall, a load imbalance results and the overall rendering rate plummets.
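A minimal sketch of the sort-first idea follows, assuming that primitives carry a precomputed screen-space bounding box; the function names are illustrative and are not taken from WireGL or Chromium.

def overlapping_tiles(bbox, tiles):
    """Return the indices of the display tiles that a primitive's screen-space
    bounding box touches. Rectangles are (xmin, ymin, xmax, ymax) in pixels."""
    xmin, ymin, xmax, ymax = bbox
    hits = []
    for i, (tx0, ty0, tx1, ty1) in enumerate(tiles):
        if xmax >= tx0 and xmin <= tx1 and ymax >= ty0 and ymin <= ty1:
            hits.append(i)
    return hits

def sort_first_assign(primitives, tiles):
    """Bucket primitives per tile; each bucket is sent only to the computer
    rendering that tile. A primitive here is a (primitive_id, bbox) pair."""
    buckets = {i: [] for i in range(len(tiles))}
    for primitive_id, bbox in primitives:
        for tile in overlapping_tiles(bbox, tiles):
            buckets[tile].append(primitive_id)
    return buckets

If every bounding box lands on a single tile, that tile's computer receives all of the geometry while the others sit idle, which is exactly the load imbalance described above.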
An alternative approach replicates all the data across all the computers and simply uses raw
graphics power to process the data to generate the images for the individual viewports of the
display tiles. Used in middleware such as Conduit (by VRCO) and the Cross-Platform Cluster
Graphic Library (CGLX), the advantage of this approach is that many existing OpenGL
applications can be run with little or no modification of the code. The primary disadvantage is
that the amount of data that can be rendered is limited by the capabilities of the individual
graphics card.
Evolving out of the Thin Client paradigm, a third and most scalable approach delegates the
rendering to a remote cluster of computers and instead treats the tiled display simply as a large
frame buffer connected via a high speed network. This approach was first pioneered in SAGE
(the Scalable Adaptive Graphics Environment) (Jeong, 2006). The scheme has greater scalability than previous approaches because it allows multiple cluster computers, each potentially rendering a different and very large data set, to work concurrently and stream visualizations to be displayed on the wall as individual windows, as if on an enormous desktop. Users can arbitrarily position and resize these windows on the wall, thereby working with multiple visualizations simultaneously, in effect creating a Cyber-Mashup (Leigh, 2007).
When the pixels of a visualization are streamed in parallel from a cluster of rendering nodes to
SAGE, they are intelligently routed to the correct display cluster nodes that are managing the
respective portions of the tiled display. As a window moves from one portion of the tiled display
to another, SAGE correctly re-routes the pixels in much the same way a network router routes
network packets. This approach therefore enables users to run multiple visualizations on
distributed rendering clusters and display them in a tiled display environment each in its own
separate window that can be independently controlled.
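The routing decision itself reduces to intersecting a movable window rectangle with the fixed tile rectangles of the wall. The sketch below (Python) illustrates this; it is not the SAGE API, and the coordinate conventions are assumptions.

def route_window(window, tiles):
    """Split a window rectangle into the pieces each display tile must show.
    Rectangles are (x0, y0, x1, y1) in wall coordinates; the result maps a
    tile index to the sub-rectangle of the window that tile displays."""
    wx0, wy0, wx1, wy1 = window
    routes = {}
    for i, (tx0, ty0, tx1, ty1) in enumerate(tiles):
        ix0, iy0 = max(wx0, tx0), max(wy0, ty0)
        ix1, iy1 = min(wx1, tx1), min(wy1, ty1)
        if ix0 < ix1 and iy0 < iy1:          # keep only non-empty overlaps
            routes[i] = (ix0, iy0, ix1, iy1)
    return routes

# Moving or resizing the window simply means calling route_window again with
# the new rectangle and re-targeting the pixel streams accordingly.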
By contrast, the techniques used in WireGL/Chromium and CGLX require that the entire wall
be used to display only one visualization application at a time. This is acceptable for small tiled
displays but becomes impractical for very large display walls, especially those that may, in the future, cover all the walls of a room. The SAGE model also has the advantage that far less powerful graphics cards and computers can be used to drive the display walls, thereby making them more affordable to build and maintain. These display walls can then connect into global
high performance networks to take advantage of large-scale computing resources that are
deployed at supercomputing centers around the world.
Supporting Distance Collaboration in Scalable Display Environments
Today, remote application sharing and video conferencing using tools such as WebEx, Skype,
and iChat are making it routine for distributed users to work with each other and with data from their desktop computers. These solutions, however, do not scale to ultra-high-resolution
display environments, which need to be able to share high-resolution visualizations and high-
definition video and audio in real-time to facilitate group-to-group collaboration.
Visualcasting is a novel image broadcasting scheme that extends SAGE to support distance
collaboration among multiple end-points without the need for traditional router-based multicasting, which has always been difficult to deploy over wide-area networks (Renambot,
2009). Visualcasting is a scalable real-time image replication service for multiple high-resolution
tiled displays. A visualization application streamed to multiple Visualcasting end-points (tiled
displays) can have different window layouts on each tiled display. The window operations
(moving or resizing) on each tiled display can occur independently. As windows on the tiled
display are resized or repositioned, SAGE performs the non-trivial task of reconfiguring the
multiple streams from the rendering source to the PC nodes that drive potentially differing tiled
displays.
Visualcasting is realized via the SAGE Bridge, a high-speed intelligent bridging system that
duplicates and splits pixel streams received from rendering clusters for each end-point. This
allows each rendering node to stream whole images without having to be cognizant of the
window layouts and tiled display configurations of multiple end-points. A single SAGE Bridge
node is built from a high-performance PC equipped with 10-gigabit network interfaces. Multiples
of these can be ganged together to facilitate the sharing of even higher resolution content or more
varieties of content. For example, one bridge could be dedicated to high-definition audio/video
conferencing while another could be dedicated to sharing visualizations.
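A rough sketch of the bridging step, reusing the route_window helper from the earlier routing sketch, is shown below. The send callable and the frame layout (an image array indexed by row and column) are assumptions for illustration, not the SAGE Bridge interface.

def bridge_forward(frame, window_rects, endpoint_tiles, send):
    """Hypothetical bridging step: 'frame' is one whole image received from a
    render node; each end-point has its own window rectangle and tile layout,
    so the same frame is cropped and routed differently per end-point.
    send(endpoint, tile_index, sub_image) stands in for the network stream."""
    for endpoint, window in window_rects.items():
        wx0, wy0, _, _ = window
        for tile_index, (x0, y0, x1, y1) in route_window(window, endpoint_tiles[endpoint]).items():
            sub = frame[y0 - wy0:y1 - wy0, x0 - wx0:x1 - wx0]   # crop in window-local coordinates
            send(endpoint, tile_index, sub)

Because the cropping and routing happen at the bridge, each rendering node can stream whole images without knowing anything about the window layouts or tile configurations of the individual end-points, as described above.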
SAGE also supports audio streaming to facilitate distance collaboration. The design goal has
always been to treat sound data in SAGE in a similar fashion to pixel data, potentially allowing sounds to be spatially collocated with the displays that show the visualizations. Sound buffers are
captured at various sources (microphones, high-definition video capture card, audio files, or
application-provided buffers), then streamed in an uncompressed form to minimize latency, and
finally sent to a sound card for playback.
Figure 18 and Figure 19 show a Visualcasting trial conducted over an international network
testbed called the Global Lambda Integrated Facility. In the trial, ultra-high-resolution tiled displays at the SC08 conference in Austin, Texas (Figure 18), the Electronic Visualization Laboratory at the University of Illinois at Chicago, the School of Information at the University of Michigan, and Masaryk University in the Czech Republic (Figure 19) were linked to Visualcasting servers in Chicago so that they could all share a 4K (4096x2048) visualization stream as well as communicate over high-definition video conferencing.
CONCLUSION
The World Wide Web has been a transformative technology that has afforded its users the
ability to publish and retrieve an unprecedented amount of data. Making data massively available
and providing facilities for combining it stimulates the production of new insight. A crucial component of the derivation of insight is visualization, the production of a meaningful visual representation of data.
As data size continues to grow exponentially, the long-established technique of first
downloading data to a desktop computer and then creating a plot on a spreadsheet has given way
to distributed models of data retrieval, visualization and analysis. The majority of techniques used
on the Web today have their roots in the field of High-Performance Computing (HPC). For
almost two decades, the HPC community has developed innovative approaches for deriving
insight from large-scale distributed data.
Large-scale data visualization is conducted via a pipeline consisting of data retrieval, filtering,
rendering, and finally display. The motivation to collocate or distribute the steps of the pipeline is
driven largely by the economics of meeting user demand and managing scalability. Traditionally
when data was small, all the steps of the pipeline could reside on the same computer.
Visualization was interactive and users could adjust the visualizations in real-time. As data sizes
grew, visualization on single computers had to expand to utilize parallel processing on multiple
computers in order to keep pace. However, parallel computing was still largely collocated
computing because high speed wide-area networks were scarce. Furthermore, parallel computers
often worked in batch mode, and therefore the results of parallel visualization algorithms
consisted mainly of animated movies rather than interactive graphics. Viewing the visualization
from a different point of view was therefore not possible without time consuming re-computation.
As data volumes continued to grow, the economics of providing data management and data
processing services necessitated greater wide-area distribution of resources. Serendipitously, this
led to an explosion of new applications beyond scientific computing, including e-commerce and even video gaming.
The key challenge in data visualization today is how to provide users with the interactive
experience that they have grown accustomed to on their desktop computers, while offering the
scalability to handle the exponential explosion of data. Experiments show that large-scale
visualization is an Input/Output-bound problem. That is, the rate at which visualizations can be
generated is bound by the rate at which data can be sent through the visualization pipeline. If one
cannot move the data through the pipeline fast enough then one cannot process the data fast
enough. When interactivity is required, data access latency becomes another major problem, especially when the data is distributed over wide-area networks. However, as high-speed networks
such as the National LambdaRail and Internet-2 grow in capacity, the limitation of bandwidth
between distributed resources diminishes. Furthermore, through the aggressive use of bandwidth
to presend and prefetch data, it becomes possible to significantly reduce data access times to the
point where, surprisingly, it can be faster to access data from a remote computer than from one’s
local disk drive. Some researchers would argue that there will never be enough wide-area
bandwidth available to keep pace with both the growth of data and the multi-core and many-core
systems that are needed to process the data. In this case, researchers are examining the notion of
in situ rendering, where the parallelized visualization and computational algorithms are collocated
so that rendering the data can be done during the generation of the data. In in situ visualization,
the computation and visualization could be either collocated on the same supercomputer or run on
separate specialized resources that are collocated at the same site. In the first case, data transfer between computers is reduced, though at the cost of sharing the supercomputer's compute resources with visualization. In the latter approach, a visualization cluster, typically a GPU-
based cluster, needs to be interconnected to the supercomputer via high-speed local network
interconnects.
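As a minimal illustration of the latency hiding alluded to above, the sketch below (Python) overlaps the fetch of the next time step with the rendering of the current one; fetch and render are placeholders for whatever remote read and rendering routines a given system provides.

from concurrent.futures import ThreadPoolExecutor

def render_sequence(timesteps, fetch, render):
    """Overlap remote data access with rendering: while time step t is being
    rendered, time step t+1 is already being fetched in the background.
    'fetch' and 'render' are placeholder callables."""
    if not timesteps:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, timesteps[0])
        for i, t in enumerate(timesteps):
            data = pending.result()                            # wait for the prefetched data
            if i + 1 < len(timesteps):
                pending = pool.submit(fetch, timesteps[i + 1]) # start prefetching the next step
            render(t, data)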
The notion of in situ visualization, however, is not a silver bullet. The rate at which a frame of
visualization is generated is bound by the rate of a simulation timestep, which can be on the order of seconds to hours. Therefore, in situ visualization does not lend itself to interactive
visualization. In situ visualization’s greatest value is in allowing the user to catch a glimpse of
how the computation is evolving in order to determine whether it is worth allowing it to continue
for days to weeks. This approach can enable computational steering, in which the user directly modifies the parameters of the computation. Then, once the computation is complete, techniques for
low-latency real-time distributed visualization can be brought to bear to provide users with a
means to interactively explore the results.
ACKNOWLEDGMENTS
This publication is based on work supported in part by Awards CNS-0935919 and 0959053 from the National Science Foundation, and by the U.S. Department of Energy under Contract DE-AC02-06CH11357.
REFERENCES
Bajaj, C., Ihm, I., Park, S., & Song, D. (2000). Compression-based ray casting of very large volume data in distributed environments. Proceedings of the Fourth International Conference on High-Performance Computing in the Asia-Pacific Region (HPC00), 2, 720–725.
