Jim Gray Research Interests

The World-Wide Telescope: Building the virtual astronomy observatory of the future (4/5/2002)
Astronomers are collecting huge quantities of data, and they are starting to federate it. The community held a Virtual Observatory conference in Pasadena to discuss the scientific and technical aspects of building a virtual observatory that would give anyone, anywhere access to all the online astronomy data. My contribution (ppt, doc, pdf) was a computer science technology forecast. The Virtual Observatory will create a "virtual" telescope on the sky with excellent response time: information at your fingertips for astronomers, and for everyone else. A single-node prototype is at http://skyserver.sdss.org/. More recently, Tanu Malik, Tamas Budavari, Ani Thakar, and Alex Szalay have built a 3-observatory SkyQuery federation (http://SkyQuery.net/) using .Net web services (I helped a little).

Alex and I have been writing papers about this. We wrote a "general audience" piece on the World-Wide Telescope for Science Magazine, V. 293, pp. 2037-2038, 14 Sept 2001 (MSR-TR-2001-77: word or pdf). More recently we wrote two papers describing the SkyServer. The first, "The SDSS SkyServer - Public Access to the Sloan Digital Sky Survey Data," describes how the SkyServer is built and how it is used. A second paper (read it only if you loved the first one) goes into gory detail about the SQL queries we used in data mining: MSR-TR-2002-01, "Data Mining the SDSS SkyServer Database." I have been giving lots of talks about this.
Tom Barclay, Alex Szalay, and I gave an overview talk at the Microsoft Faculty Summit that sketches this idea. I also gave a talk on computer technology, arguing for online disks (rather than nearline tape), cheap processor and storage CyberBricks, and heavy use of automatic parallelism via database technology. The talk's slides are in PowerPoint (330KB), and an extended abstract of the talk is available in Word (330KB) and pdf (200KB). The genesis of my interest in this is documented in the paper "Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey," Alexander S. Szalay, Peter Kunszt, Ani Thakar, Jim Gray: MS Word (220KB) or PDF (230 KB).
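
To make the federation idea concrete, here is a minimal sketch of the fan-out-and-merge pattern behind a SkyQuery-style cone search. Everything in it (the archive_a/archive_b stand-ins, the object fields, and federated_cone_search itself) is illustrative; the real SkyQuery uses .Net web services and proper cross-matching.

```python
# A toy sketch of the federation idea: each "archive" below stands in for a
# remote observatory web service. This only shows the fan-out-and-merge shape.
def archive_a(ra, dec, radius):          # pretend optical archive
    return [{"source": "A", "ra": ra + 0.001, "dec": dec, "mag": 18.2}]

def archive_b(ra, dec, radius):          # pretend infrared archive
    return [{"source": "B", "ra": ra, "dec": dec - 0.002, "mag": 15.7}]

ARCHIVES = [archive_a, archive_b]        # real services would be HTTP endpoints

def federated_cone_search(ra, dec, radius):
    """Ask every archive for objects near (ra, dec) and merge the answers."""
    merged = []
    for service in ARCHIVES:
        merged.extend(service(ra, dec, radius))
    return merged

print(federated_cone_search(ra=180.0, dec=0.0, radius=0.01))
```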

The Future of Super-Computing and Computers (Gordon Bell was the principal author) (8/1/2001)
Gordon Bell assesses the state of supercomputing every five years or so. This time I helped and argued with him a bit. The discussion focuses on technical computing, not AOL or Google or Yahoo! or MSN, each of which would be in the top 10 of the Top500 if they cared to enter. After 50 years of building high-performance scientific computers, two major architectures exist: (1) clusters of “Cray-style” vector supercomputers, and (2) clusters of scalar uni- and multi-processors. Clusters are in transition from (a) massively parallel computers and clusters running proprietary software, to (b) proprietary clusters running standard software, and (c) do-it-yourself Beowulf clusters built from commodity hardware and software. In 2001, only five years after its introduction, Beowulf has mobilized a community around a standard architecture and tools. Beowulf’s economics and sociology are poised to kill off the other two architectural lines, and will likely affect traditional supercomputer centers as well. Peer-to-peer and Grid communities provide significant advantages for embarrassingly parallel problems and for sharing vast numbers of files. The Computational Grid can federate systems into supercomputers far beyond the power of any current computing center; the centers will become super-data and super-application centers. While these trends make high-performance computing much less expensive and much more accessible, there is a dark side: clusters perform poorly on applications that require large shared memory. Although there is vibrant computer-architecture activity on microprocessors and on high-end cellular architectures, we appear to be entering an era of supercomputing monoculture. Investing in next-generation supercomputer software and hardware architectures is essential to improve the efficiency and efficacy of these systems.

Digital Immortality: doc or pdf (10/1/2000)
Gordon and I wrote a piece on the immortality spectrum: at one end, one-way immortality, passing knowledge on to future generations; at the other, two-way immortality, where part of you moves to cyberspace and continues to learn, evolve, and interact with future generations. It is a thought-piece for a "special" CACM issue.

A River System (Tobias Mayr of Cornell) (12/14/2000)
Data rivers are a good abstraction for processing large numbers (billions) of records in parallel. Tobias Mayr, a PhD student from Cornell who visited BARC in the fall of 2000, designed and started building a river system. This small web site describes the current status of that work.
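
As a rough illustration of the abstraction (my sketch, not Tobias's system): records flow from many producers through a shared river to many consumers, so neither side needs to know how the other is partitioned.

```python
# A toy illustration of the "data river" idea: many producers pour records
# into a shared stream, many consumers drain it in parallel.
import threading, queue

river = queue.Queue(maxsize=1000)   # the shared record stream
DONE = object()                     # end-of-river marker

def producer(records):
    for r in records:
        river.put(r)                # pour records into the river

def consumer(results):
    while True:
        r = river.get()
        if r is DONE:
            break
        results.append(r * 2)       # stand-in for real per-record work

if __name__ == "__main__":
    results = []
    producers = [threading.Thread(target=producer, args=(range(i, 1000, 4),))
                 for i in range(4)]
    consumers = [threading.Thread(target=consumer, args=(results,))
                 for _ in range(4)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        river.put(DONE)             # one end-marker per consumer
    for t in consumers:
        t.join()
    print(len(results), "records processed")
```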

The 10,000$ Terabyte, and IO studies of Windows2000 (with Leonard Chung) (6/2/2000)
Leonard Chung (an intern from UC Berkeley) and I studied the performance of modern disks (SCSI and IDE) in comparison to Erik Riedel's 1997 study. The conclusions are interesting: IDE disks (with their controllers) deliver good performance at less than half the price of SCSI. One can package them in servers (8 to a box) and deliver very impressive performance. Using 40 GB IDE drives, we can deliver a served Terabyte for about 10,000$ (packaged, powered, and networked). RAID costs about 2x more; that is approximately the cost of an un-RAIDed SCSI terabyte. The details are at IO Studies.
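
As a sanity check on that figure, the arithmetic is roughly as follows; the per-drive and per-box prices are my round year-2000 assumptions, not the measured numbers on the IO Studies page.

```python
# Back-of-envelope check of the "10,000$ served terabyte" claim.
# Prices below are rough year-2000 assumptions, not the measured figures.
DRIVE_GB        = 40      # 40 GB IDE drives
DRIVE_PRICE     = 250     # assumed $ per drive, including controller share
DRIVES_PER_BOX  = 8       # packaged 8 to a box
BOX_PRICE       = 1000    # assumed $ per server box (CPU, RAM, power, network)

drives = 1000 // DRIVE_GB               # 25 drives per terabyte
boxes  = -(-drives // DRIVES_PER_BOX)   # ceil(25 / 8) = 4 boxes
total  = drives * DRIVE_PRICE + boxes * BOX_PRICE
print("drives: %d  boxes: %d  total: about $%d" % (drives, boxes, total))
# prints roughly $10,250, i.e. about a 10,000$ served terabyte
```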

The 1,000$ Terabyte is here with TeraScale Sneakernet. This work continues with our plans to rebuild the TerraServer with SATA CyberBricks.

4 PetaBumps (2/15/2000)
In February 2000, U. Washington (Steve Corbato and others), ISI-East (Terry Gibbons and others), Qwest, the Pacific Northwest Gigapop, DARPA's SuperNet, and Microsoft (Ahmed Talat, Maher Saba, Stephen Dahl, Alessandro Forin, and I) collaborated to set a "land speed record" for tcp/ip (it won the first Internet2 Land Speed Record). The experiment connected two workstations with SysKonnect Gigabit Ethernet via 10 SuperNet hops (Arlington, NYC, San Francisco, Seattle, Redmond). The systems delivered 750 Mbps in a single tcp/ip stream (28 GB sent in 5 minutes) and about 900 Mbps when a second stream was used. This was over a distance of 5600 km, and so gives a metric of about 4 PetaBumps (petabit meters per second). It was "standard" tcp/ip, but with two settings: "jumbo" frames in the routers (4470 bytes rather than 1500 bytes), which give the endpoints fewer interrupts, and a 20 MB window size (with a 97 ms round-trip time, the window must hold all the bits in flight; a back-of-envelope calculation follows the links below). The details are described in the submissions to the Internet2 committee:
The single-stream submission: Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
The multi-stream submission: Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
The code: speedy.htm, speedy.h, speedy.c
A PowerPoint presentation about it: Windows2000_WAN_Speed_Record.ppt (500KB)
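
The 20 MB window and the PetaBumps figure both fall out of simple arithmetic; here is a back-of-envelope version using the numbers quoted above (rounded, my calculation rather than the official submission figures).

```python
# Bandwidth-delay product: the TCP window must cover all the bits in flight,
# and the "PetaBumps" metric is just bandwidth times distance.
link_bps = 1.0e9                 # roughly gigabit line rate
rtt_s    = 0.097                 # 97 ms round-trip time

bits_in_flight = link_bps * rtt_s            # ~97 Mbit in the pipe
window_mb      = bits_in_flight / 8 / 1e6    # ~12 MB, so 20 MB gives headroom
print("in flight: %.0f Mbit, minimum window: %.1f MB"
      % (bits_in_flight / 1e6, window_mb))

petabumps = 750e6 * 5600e3 / 1e15            # 750 Mbps over 5600 km
print("%.1f petabit-meters per second" % petabumps)   # about 4 PetaBumps
```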

This was an extension of work we did the previous fall (0.5 PetaBumps). With U. Washington, Research TV, Windows 2000, Juniper, Alteon, SysKonnect, NTON, DARPA, Qwest, Nortel Networks, Pacific Northwest GigaPOP, and SC99, we demonstrated 1.3 Gbps (gigabits per second) desktop-to-desktop end-user performance over a LAN, MAN (30 km), and WAN (300 km) using commodity hardware and software, standard WinSock + tcp/ip, and 5 tcp/ip streams. Here are the press release, the white paper in Word (210KB) or PDF (780KB), and a PowerPoint presentation (500KB).

(12/20/99) Rules of Thumb in Data Engineering
A paper with Prashant Shenoy, titled "Rules of Thumb in Data Engineering," revisits Amdahl's laws and Gilder's laws, and investigates the economics of caching disk and internet data.
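
For reference, Amdahl's balanced-system rules of thumb say a system needs about one bit of I/O per second and about one byte of memory for every instruction per second. A quick sketch applying them to a hypothetical 1-GIPS processor (my illustration, not a result from the paper):

```python
# Amdahl's balanced-system rules of thumb, applied to a hypothetical CPU.
# The paper revisits how well these classical ratios have held up; the
# numbers below are only an illustration.
cpu_ips = 1.0e9                    # a 1-GIPS processor (instructions/second)

io_bits_per_sec = cpu_ips * 1.0    # ~1 bit of I/O per instruction per second
memory_bytes    = cpu_ips * 1.0    # ~1 byte of memory per instruction per second

print("balanced I/O:    %.1f Gbit/s" % (io_bits_per_sec / 1e9))
print("balanced memory: %.1f GB"     % (memory_bytes / 1e9))
```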

(12/15/99) Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS
Wrote, with Bill Devlin, Bill Laing, and George Spix, a short piece trying to define a vocabulary for scalable systems: geoplexes, farms, clones, partitions, packs, RACS, and RAPS. The paper defines each of these terms and discusses the design tradeoffs of using clones, partitions, and packs.

Large Spatial Databases: I have been investigating large databases like the TerraServer, which is documented in two Microsoft technical reports:

  • Microsoft TerraServer: A Spatial Data Warehouse, Tom Barclay, Jim Gray, Don Slutz, MSR-TR-99-29, describes our 1999 design after operating the TerraServer for a year.
  • The Microsoft TerraServer, Tom Barclay et al., MSR-TR-98-17, was written just as the TerraServer was going online and describes the original design.
  • We have been operating the TerraServer (http://TerraService.Net/) since June 1998. At this point we have served over 4 billion web hits and 20 terabytes of geospatial images. We are working with Alex Szalay of Johns Hopkins on a similar system to make the Sloan Digital Sky Survey images available on the web as they arrive over the next six years. Our research plan for handling this 40 Terabytes of data over the next five years is described in the report Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey, Alexander S. Szalay, Peter Kunszt, Ani Thakar, Jim Gray, MSR-TR-99-30.

In addition, here are some interesting air photos of the Microsoft Redmond Campus.

This is a 1995 proposal for an Alternate Architecture for EOS DIS (the 15 PB database NASA is building). Here is my PowerPoint summary of the report (250KB).

Windows Clusters: I believe you can build supercomputers as clusters of commodity hardware and software modules. A cluster is a collection of independent computers that is as easy to use as a single computer. Managers see it as a single system, programmers see it as a single system, and users see it as a single system. The software spreads data and computation among the nodes of the cluster. When a node fails, other nodes provide the services and data formerly provided by the missing node. When a node is added or repaired, the cluster software migrates some data and computation to that node.
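
A toy sketch of that idea: data partitions are spread over the live nodes and reassigned when a node fails or rejoins. This is only my illustration of the concept; real cluster services are far more involved.

```python
# A toy sketch of the cluster idea above: partitions are spread over nodes
# and reassigned when a node fails or (re)joins.
def assign(partitions, nodes):
    """Spread partitions round-robin over the currently live nodes."""
    return {p: nodes[i % len(nodes)] for i, p in enumerate(partitions)}

partitions = ["P%d" % i for i in range(8)]
nodes = ["node1", "node2", "node3", "node4"]

placement = assign(partitions, nodes)
print("initial:", placement)

nodes.remove("node2")                 # node2 fails...
placement = assign(partitions, nodes) # ...its partitions move to survivors
print("after failure:", placement)

nodes.append("node2")                 # node2 is repaired and rejoins
placement = assign(partitions, nodes) # ...some partitions migrate back
print("after repair:", placement)
```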

My personal (1995) research plan is contained in the document Clusters95.doc. It has evolved into a larger enterprise involving many groups within Microsoft and many of our hardware and software partners. My research is a small (and independent) fragment of the larger NT clusters effort led by Rod Gamache in the NT group; see Wolfpack_Compcon.doc (500KB) and a PowerPoint presentation of Wolfpack Clusters by Mark Wood, WolfPack Clusters.ppt (7/3/97). That effort is now called Microsoft Cluster Services and has its own web site. Researchers at Cornell University, the MSCS team, and the BARC team wrote a joint paper summarizing MSCS for the Fault Tolerant Computing Symposium. Here is a copy of that paper: MSCS_FTCS98.doc (144KB).

We demonstrated SQL Server failover on NT Clusters: SQL_Server_Availability.ppt (3MB). The Windows NT failover time is about 15 seconds; SQL Server failover takes longer if the transaction log contains a lot of undo/redo work. Here is a white paper describing our design: SQL_Server_Clustering_Whitepaper.doc.

In 1997, Microsoft showed off many scalability solutions: a one-node terabyte geo-spatial database server (the TerraServer) and a 45-node cluster doing a billion transactions per day. SAP + SQL + NT-Cluster failover demos, a 50 GB mail store, a 50k-user POP3 mail server, a 100 million-hits-per-day web server, and a 64-bit addressing SQL Server were also shown. Here are some white papers related to that event (5/24/97):

A 1998 revision of the SQL Server Scalability white paper is SQL_Scales.doc (800 KB) or the zip version: SQL_Scales.zip (300 KB).

There is much more about this at the Microsoft site http://www.microsoft.com/ntserver/ProductInfo/Enterprise/scalability.asp

I wrote a short paper on storage metrics (joint with Goetz Graefe) discussing optimal page sizes, buffer pool sizes, and DRAM/disk tradeoffs, to appear in SIGMOD Record: 5_min_rule_SIGMOD.doc (.3MB Office97 MS Word file).
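
The heart of the five-minute rule is a break-even interval: keep a page in DRAM if it will be re-referenced within (pages per MB of RAM ÷ disk accesses per second per drive) × (disk price ÷ RAM price per MB). A small worked example with illustrative late-1990s prices (round numbers of mine, not the paper's exact figures):

```python
# The five-minute rule: cache a disk page in DRAM if it will be re-read
# within the break-even interval below. Prices are rough late-1990s
# ballpark figures for illustration.
PAGE_KB             = 8                      # 8 KB database pages
PAGES_PER_MB_RAM    = 1024 // PAGE_KB        # 128 pages per MB of DRAM
DISK_ACCESSES_PER_S = 64                     # random accesses/second per disk arm
DISK_PRICE          = 2000.0                 # $ per disk drive (illustrative)
RAM_PRICE_PER_MB    = 15.0                   # $ per MB of DRAM (illustrative)

break_even_s = (PAGES_PER_MB_RAM / DISK_ACCESSES_PER_S) * (DISK_PRICE / RAM_PRICE_PER_MB)
print("break-even interval: %.0f seconds (%.1f minutes)"
      % (break_even_s, break_even_s / 60))   # ~267 s, roughly five minutes
```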

Erik Riedel of CMU, Catharine van Ingen, and I have been investigating the best ways to move bulk data on an NT file system. Our experimental results and a paper describing them are at Sequential_IO. (7/28/98) You may also find the PennySort.doc (400 KB) paper interesting -- how to do IO cheaply!

Database Systems: Database systems provide an ideal application to drive the scalability and availability techniques of clustered systems. The data is partitioned and replicated among the nodes. A high-level database language gives a location-independent programming interface to the data. If there are many small requests, as in transaction processing systems, then there is natural parallelism within the computation. If there are a few large requests, then the database compiler can translate the high-level database program into a parallel execution plan: see CacmParallelDB.doc.
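
As a toy illustration of that last point, the same aggregate can run as one local plan per partition plus a cheap merge step. This sketch shows the shape of partitioned parallelism, not a real database executor:

```python
# Partitioned parallelism in miniature: run the same aggregate on every
# partition in parallel, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

# Pretend each partition lives on a different cluster node.
partitions = [list(range(i, 1_000_000, 4)) for i in range(4)]

def local_plan(rows):
    """The per-partition piece of SELECT COUNT(*), SUM(x)."""
    return len(rows), sum(rows)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(local_plan, partitions))

# The merge step combines the partial aggregates.
total_count = sum(c for c, _ in partials)
total_sum   = sum(s for _, s in partials)
print("COUNT =", total_count, "SUM =", total_sum)
```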

Performance: I helped define the early database and transaction processing benchmarks (TPC A, B, and C). I edited the Benchmark Handbook for Databases and Transaction Processing, which is now online as a website at http://www.benchmarkresources.com/handbook/, managed by Brian Butler. (12/12/98) I am an enthusiastic follower of the emerging database benchmarks of the Transaction Processing Performance Council. I am the webmaster for the Sort-Benchmark web site. For 1998, Chris Nyberg and I did the first PennySort benchmark: PennySort.doc (400 KB).

Transaction Processing: Andreas Reuter and I wrote the book Transaction Processing: Concepts and Techniques. Here are the errata for the 5th printing: TP_Book_Errata_9.htm (17KB) or in Word: TP_Book_Errata_9.doc (50KB) (5/20/2001). I am working with Microsoft's Viper team, which built distributed transactions into NT. Andreas and I taught two courses from the book at Stanford Summer Schools (with many other instructors). The course notes are at WICS99 and WICS96. I helped organize the High Performance Transaction Processing Workshop at Asilomar (9/5/99). Its web site makes interesting reading.