Raw notes by Mike Lesk
Hans Schek - new infrastructure for information space - this is thousands or millions of databases and computational services that do multimedia and text classification, term extraction, an architecture with lots of services and I call the new architecture "hyperdatabase" - databases can sit on all our devices. Analogy - a database is a platform for developing applications on shared data - hyperdatabase is a platform for developing services; instead of indexing we need feature extraction. Research area not just for us.
Mike Franklin - working on suite of projects about query processing in strange and interesting environments - Telegraph with Joe - stream processing: how to do this with lots of sharing, adaptive processing, and sensor networks - how to push this out into the network - lots more interesting work to be done. Errors, lost messages, nature of sensing the environment. XML broker - how to process large numbers of Xpath and Xquery queries - 10s of thousands or 100s of thousands. Also applying query processing techniques to the Grid. Make it more interactive and more easily programmable.
Bruce Croft - talk from recent ICDE conference - developing new probabilistic model of retrieval - applied to cross-language retrieval, image retrieval, and tighter integration with speech recognition, and MT. Just started working on pushing that to do retrieval in semi-structured IR domain. Can we provide IR API for semistructured database?
Joe Hellerstein - bringing data independence to networking - network is not just moving packets like post office- trying to write intelligent programs on top of a very volatile system - programs must be robust against nodes coming and going - we're doing sensor networks and peer to peer processing - convergence between graph algorithms and query optimization - we need adaptive algorithms - I enjoy both algorithms and building systems - having more fun than in several years.
Jeff Ullman - group should address TIA problem - linked discovery or chains of discovery in multiple databases - the technical problem excites me. Query optimization is not the same for streams as for traditional databases - the new stuff with XML is not just the same as SQL optimization.
Rick Snodgrass - methodological basis for our field - now hampered by our methodology - in science knowledge is encoded in theories - scientific theories are testable and make predictions - our basis is twofold (a) how about it, (b) if you need better performance we'll put something on. We test on a few data points. We don't have scientific models. We need a list of needed scientific models. I have 4 suggestions which I sent out in email - can't do this in a few seconds. Each model is predictive and testable.
Avi Silberschatz - many years ago I had a dream - my laptop was a database machine but universal access to all data in the same way - some people in Stony Brook had the same goal - the database would sit below the operating system. Don’t want to have to remember things like "lpq". All the data in the world will sit in some form of database with universal access to it. Lots of research issues- would be great to accomplish this.
Mike Carey - stopped doing research a few years ago - I'm in industry waiting for problems to come to me - working on XML - adding workflow to Xquery so it can do data transformations - now using XML schemas to think about your data and Xquery to express integration - want to integrate services and data.
Alon Halevy - my goal is to get people to stop complaining about semantic heterogeneity - want to automatically match between objects in different databases - experts use names and values - but over time they see lots of schemas and they get good at this. We're using a big corpus to learn things - e.g. typical attributes for a field named "student" - and using patterns of this sort to match between different schemas and reformulate queries for a database we don't know anything about. This is part of our idea of how to do Google of 10K databases; we reformulate your query. More generally cross the "structure chasm" between IR and DB world - make it easier for people to author and query data.
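[A minimal sketch of the kind of name-based matching Halevy describes, assuming simple token overlap plus evidence from a corpus of previously seen schemas; the attribute names, corpus, and threshold below are invented for illustration, not from the notes.]

```python
# Sketch: corpus-assisted attribute matching. Score candidate pairs by
# token overlap between attribute names, boosted when a corpus of
# previously seen schemas shows the shared tokens co-occurring.
# All names and the threshold are illustrative.
from collections import Counter
from itertools import combinations

def tokens(name):
    return set(name.lower().replace("_", " ").split())

# toy "corpus" of attribute names harvested from schemas seen before
corpus = ["student_name", "student_id", "course_title", "instructor_name"]
cooccur = Counter()
for a, b in combinations(corpus, 2):
    for t in tokens(a) & tokens(b):
        cooccur[t] += 1

def score(attr_a, attr_b):
    shared = tokens(attr_a) & tokens(attr_b)
    base = len(shared) / max(len(tokens(attr_a) | tokens(attr_b)), 1)
    return base + 0.1 * sum(cooccur[t] for t in shared)   # corpus evidence

schema_a = ["student_name", "student_id", "major"]
schema_b = ["name_of_student", "id", "department"]

for a in schema_a:
    best = max(schema_b, key=lambda b: score(a, b))
    if score(a, best) > 0.2:                               # illustrative threshold
        print(f"{a} <-> {best}  (score {score(a, best):.2f})")
```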
Rakesh Agrawal - how can we make our data systems more privacy aware - can we design info systems which will be sensitive to privacy and data ownership but not impede flow of information. Two primary drivers - technology is too invasive in wrong hands - build in antidotes; and new business models that require cooperation plus national security requiring need-to-know for sharing; and the underlying technologies for these are likely to be similar.
Mike Pazzani - I'm here to listen - want to hear things that will increase the NSF budget in this area - so one mission is to ensure progress of science, and self-generated issues from the DB community might lead to 5% increases; this year it might be 15% - to the extent you find interesting publishable problems which help people in the geosciences or biosciences there may be double digit increases - from $9M this year - take things like making it easier to get data into a database if it is not text - if it is a time series or images or chemical models - people have no idea how to do this and they're using undergraduates; toolkits to make this easy might lead to increased funding. Second, semantic heterogeneity is important - the Atkins report says some scientists spend 75% of their time moving data from one format to another and if you helped with that the scientists would love you. I coordinate our relations with homeland security but TIA is not that different from finding all data about SARS or
Jennifer Widom - an extremely specific problem which will force us to think about some interesting things - keyword search over XML and have it give the right answer. (she was willing to stop there) IR-DB thing. A challenge to deal with semistructured data - deal with metadata - it all comes out if you do IR-like search over XML which has data/metadata mixed together and we don't know what the data really mean - if you go to XML and give keywords this also has to do with ranking, probabilistic reasoning, and it all comes together. (Schek some people in
Stefano Ceri - most of the web systems publish some information which comes from databases and publish some dynamic data. I work on principles which let you build some of these systems - I envisage a future where we could teach how to build web applications. Up to our community.
Stan Zdonik - stream processing systems - addition of quality of service makes them different and unique. By having quality of service it becomes more difficult. Mike Franklin and I worked for a long time - used profiles - tried to understand user needs and application needs - if we want to move to autonomic systems we need to understand how workloads categorize different kinds of information, and the QoS part of stream processing helps there.
Dave Maier - stream stuff is 1/3 of my time, 1/3 is putting your own structure on information, e.g. re-using attention, applications to personal data management, 1/3 is looking at data product management - from scientific data - observed and simulated data, and the fourth 1/3 is distributed query processing, data theory hybrids, trade latency for need to maintain distributed state, have to have some way to talk about catalog and coverage information - not just what kinds of data sources there are but what coverage they have.
Dieter Gawlick - stream processing, started with conventional databases, 90% of our work is done. We have sophisticated use of information distribution so people can get a response when they are not around. We did some interesting stuff - expressionless data? Demand analysis - you have something, who wants to have it. Oracle Streams. All in products. As we go forward we have some ideas; I'm looking for business. As I do updates in a DB I get a stream. We don't start from events; we think of everything as a history. The sweet spot for a query is when something comes into the history - just what is on the brink of history. Doing this demand analysis leads to a different model. I subscribe to a publication and it appears.
Gerhard Weikum - working on the French woman query and others from this morning - working with DB technology plus machine learning; ontologies are also an asset we would like to exploit.
Martin Kersten - battle between multimedia DB and database kernel guys; the media guys have thousands of hours of media and they run their jobs for hours - they need array based functionality. The database kernel people are battling the hardware - we can't get to the data fast enough; the CPU is idle 90% of the time, so what can we throw out - we toss out hashing, random access, go to streamed processing - a new generation of DB kernels will be 10X faster.
Abiteboul - web is a large knowledge base - worked on that - but the web is not yet XML - We should contribute to turning the Web into a large knowledge base. Precise questions get precise answers and not list of documents -- second problem is mixing XML docs with web services - exchange information that is static or dynamic web services – super-excited bringing XML with embedded service calls, active data bases, query processing, tree structured data, and something very important is recursive query processing. to give you a flavor when you use Kazaa to find records you are doing recursive query processing on the web - not efficient - beautiful technology, now we have a use for it.
Timos Sellis - formalize ETL by picking the right operators - if you have documents in thematic hierarchies - where does this doc sit, and when you find it ranking should take account of this - ontology information, semantics; the technical part of the work is managing catalogs - building new thematic directories - extend search to use structural properties on top of keywords
Dave DeWitt - I used to do optimization for Xquery but I think that's impossible - at the end of my query I'm interested in a few hard core issues - what do we do with terabyte disk drives which make parallelism hard and bandwidth per drive is poor - we have to treat them as tape drives. Optimization: queries are too complex, statistics are not good enough - look at what is in the buffer pool, what pages or tuples, push adaptive optimization. I'm hard core DB; the web stuff is interesting but not for me.
me (Lesk) - the Vannevar Bush dream and
Hector Garcia-Molina - you have lots of systems interacting but they are autonomous - why will they cooperate - what is the incentive to share or cooperate, even forward messages let alone queries or results. have to think about how systems get incentivized and why we should trust them. lots of interesting problems there.
Laura Haas - I don't really do research any more - very interested in large scale systems - integrate information from diverse systems - range from federated to big warehouses using ETL - space with caching and replication in the middle - interesting physical DB design issue - also crosses organizational boundaries - data placement problems - also need systems to be more dynamic - now they are statically configured and an expert sets them up - want to automatically map to different data sources - the whole grid thing - service interfaces - potential for us - how we can use the kinds of services that are provided by the grid-op-sys people - as they provide security services or accounting/billing our systems should open up and do things like that.
Mike Stonebraker - my grand vision in 1974 was System R. Over the last 25 years we have added (a) code in databases - Postgres added code (b) spatial data, arrays and new data types - another add-on (c) active databases - glued on triggers, they are second class citizens, (d) text - object relational systems we put on text but there is no ranking or probabilistic stuff (e) queue management added but not enough for streams (f) parallel databases (g) distributed databases especially web ones and heterogeneous, and everyone is putting an O?DP wrapper around what's there. We took a 1974 idea and glued on all this crap - we should start with a clean sheet of paper; then we would not get architectures that said DB2 has the data and BEA WebLogic has the code. That's a dumb distribution in a lot of ways. We have a gazillion of these things - they don't work together - rethink from a blank sheet - new query languages, interface, architecture. Sad that Ted Codd just died. New mandate needed to do this better.
Naughton - meta-level comment - if you do what is helpful for scientists it is of enormous benefit for non-scientists - look at the web or computers. It will happen again. Every interesting technical challenge exists in the scientific arena - personal DB, data streams, web databases, all in more manageable ways. And we can experiment on scientists more easily than on corporations. And funding agencies will support this. If we keep looking for motivation in problems that have already perceived commercial value, industry will be doing it already.
Yannis Ioannidis - 2 things (a) if you want to buy a used car you can go on the web and bid; I see a need to buy data products - the value of a product may be a dollar price or in conjunction with how fast you get it or how complete the data is or how reliable - I want e-commerce techniques for query optimization - it's not just buying one thing; pieces of a query may run in different places; Mariposa+++ - multidimensional optimization using e-commerce techniques, lots of theoretical issues also. (b) personalization in databases - I should not get the same answer to a query as Ullman gets - much personalization in the Web but not in DB.
------------------------------------------------------------------------
[I think I heard a lot about streaming, about scientific data, about shared out architectures, and scattered other topics. few people are doing what they did for their PhD]
DeWitt - nobody said they were working on Xquery - (Franklin, Carey are doing some). Do you guys believe lots of stuff is going to be done?
Stonebraker - nobody mentioned concurrency control or access methods. we are all out on the periphery. Bernstein: there are people doing this. work in multiversion access methods.
Schek - there is some work on transactions in a more general sense.
Bernstein - a dozen PhDs are working on XML query optimization - just not in research
Widom - people are doing optimization under the label "Xpath queries under streaming XML".
Hellerstein - few people needed storage.
Stonebraker -
a) Vote on the organizer's view of what the important problems are - you each get 3 votes on the topics. From yesterday's.
b) Tonight's session will be different - 1 hr on personal DB led by Gray & Weikum; 1 hr on "vision" - why did the last 4 reports have little influence - so we need this as a vision thing - just a brainstorming session want to end early today. Everybody is encouraged to stop when the discussion seems to have petered out, not just use all the time. The chairs will also do that.
Avi - who is audience for the report? Stonebraker - the audience is researchers picking research directions and also funding agencies. If we don't write something $1B goes to supercomputer centers and not us.
Hans - what did we present 15 years ago? Is that still in the vision?
DeWitt will do the diff.
What is IR?
70s-80s research focused on document retrieval
90s TREC reinforced the IR==document retrieval view
first bib records, then full text as time went on.
now doc retrieval is important - turned into web search; other topics:
question answering - finding short segments with particular info
cross lingual retrieval
distributed retrieval - now big
topic detection and tracking
multimedia retrieval - images, video, annotating them - starting up now,
learning and labeling images and video with text.
summarization
IR & databases
differentiated by unstructured/structured data
what about marked up text and semi-structured data?
text has tended always to have at least a few fields
recent database papers on nearest neighbor and similarity search
distributed peer to peer search
Web search
info extraction
text data mining
boundaries getting fuzzier
IR integrated with databases
many such proposals - now in XML context - go back to 70s
e.g. combine ranked search and the specificity of user queries
supporting a probabilistic framework is the key
integration vs. cooperation: do we really want one giant system? Or should we still have separate systems & separate capabilities but they work together?
semantic web - "if you made the web a database" - this is making the web into a knowledge base and that won't happen - we've had a debate for decades about manual vs. automatic representations of what documents "mean"; both together work better than either one, but creating the manual versions is very hard. That's the lesson from the IR work.
go for knowledge or statistics?
Stonebraker - every
Gray - why haven't you mentioned KDD (Knowledge discovery)? The field is very fragmented. Every product has a text retrieval bolt-on to their database.
Croft - anyone who talks about text data mining - that is similar to IR - those work well together. Data mining in structured data - numbers -
Hellerstein - it's all based on clustering, etc. Same as machine learning - there is a common set of technologies.
Croft - IR people like NL - want to understand how to describe and satisfy an information need in an unstructured world. That gets us excited. Yes, we built inverted file technology for large data but we focused on NL and the DB people have different needs.
Stonebraker - If I ask Google "what is the temperature in
Croft - We are working on that in the question answering world. You do want some context - you want more than just "73" as an answer (did it come from Bob's home page or where?) DB retrieval is fact retrieval so there is overlap. Some people work on extracting tables from text.
Stonebraker - this is similar to the first time I heard the discussion 20 years ago. The communities should cooperate and they don't.
Hellerstein - Not true! There has been a lot of overlap, now forced by the Web - the database community feels weak on text - and then we found that the IR stuff isn't that hard. Cohera and Whizbang are companies that had combined products. This is a healthy area.
Mike Franklin - How many people have been to SIGIR conferences? (few)
Mike Carey? We are organized into stacks. We should have a conference on a problem - not by community.
DeWitt- We could organize a conference on a topic. I like that idea.
(Martin Kersten?) - We don't need anything new - just join to attack a problem.
Croft - We do need something different than "you come to our conferences and vice versa".
Timos Sellis - A few more applications?
Croft - Want to have an NL query and not think about it.
Ullman - Re semantic web - you talk about semantics but when you have to do something you do syntax. If you take the temperature in
Croft - This history so far is that focus on deep understanding and semantics has not produced benefits in effectiveness. Learning patterns has been useful - applied probabilistically - the little words don't help.
Gray - People use mostly nouns and verbs - they can throw away the rest – a telegraphic interface.
Ullman - Temperature is special because you can't crawl the web and get temperature in
Gerhard Weikum - NL is something that is not that great for queries - you need to understand the text that is there.
Ullman - Google works because it is simple.
Pazzani - The Google answer to "what is the temperature in
Serge Abiteboul - The problem is that if you have some info you can put it into plain text, and that's ridiculous. You have meta information, and if you start publishing meta-info it makes it much easier to avoid NL understanding.
Hellerstein - You can make schemas and just make things harder to use.
Abiteboul - Disagree
Bernstein - It's not just how you say things but how you learn - it must not be a manual activity to attach metadata.
Gerhard Weikum - takes over leading observations on DB, IR
business data is boring
action is e-science, e-culture, and entertainment
absolute facts are a myth created by accountants
uncertainty is fact, ambiguity is fact
hope for precise semantics based on universally agreed upon ontologies and perfect metadata
IR:
similarity search with ranking is the best approximation to semantic search
DB:
can still leverage context (metadata, ontologies, multivariate distributions)
agree with Ullman - no such thing as pure semantics.
killer queries where Google and DBMSs fail:
Find gene expression data and regulatory paths related to Barrett tissue in the esophagus.
what are the most important results in percolation theory?
Are there any theorems isomorphic to my new conjecture?
Find information about public subsidies for plumbers
Where can I download an open source implementation of the ARIES recovery algorithm (needs to be decomposed into several pieces).
Which professors from D are teaching DBS and have research projects on XML
Who was president of the (can't do the decomposition and linking again)
"Who was the French woman that I met at the PC meeting where Peter Gray was PC chair?"
a) go through email archives and find which program committees I was on
b) then look to find the chairs of those committees
c) then having found that this was VLDB 95, get the list of the members and see that Sophie... came from Inria, Paris.
d) know that
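[A toy sketch of the decomposition Weikum walks through in a)-d), assuming small in-memory stand-ins for the email archive, PC rosters, and affiliations; every name and field here is invented.]

```python
# Toy sketch of the "French woman at the PC meeting" decomposition:
# email archive -> committees I served on -> chairs -> the matching
# meeting -> member affiliations. All data is made up.

emails = [
    {"from": "Peter Gray", "subject": "VLDB 95 PC meeting agenda"},
    {"from": "Jane Doe",   "subject": "SIGMOD 97 PC assignments"},
]
pc_chairs = {"VLDB 95": "Peter Gray", "SIGMOD 97": "Jane Doe"}
pc_members = {
    "VLDB 95":   [{"name": "Sophie ...", "affiliation": "Inria, Paris"}],
    "SIGMOD 97": [{"name": "John Smith", "affiliation": "MIT"}],
}
french_affiliations = {"Inria, Paris"}          # stand-in for step (d)

# a) find the committees mentioned in my email archive
my_committees = {e["subject"].rsplit(" PC", 1)[0] for e in emails}

# b)+c) keep the committee whose chair was Peter Gray
meeting = next(m for m in my_committees if pc_chairs.get(m) == "Peter Gray")

# d) scan that committee's members for a French affiliation
answer = [p["name"] for p in pc_members[meeting]
          if p["affiliation"] in french_affiliations]
print(meeting, answer)     # -> VLDB 95 ['Sophie ...']
```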
Garcia-Molina - you are working in AI.
Weikum: Looks AI complete but you can do this with dumb things
Croft - Finding isomorphic theorems is the hardest one
Weikum - There is an "open math" project.
Croft - For question answering actually TREC does fairly well on that. There are a lot of factoid questions and the current systems are finding 70% of the right answers in the top one or two. But these are not factoid questions. ARDA is now sponsoring AQUAINT which looks at things like this in the intelligence domain. They want to find authoritative docs.
Weikum - People expect to type a few words at Google and get the answer; the goal should be to minimize human time - you learn how to rephrase the query.
Agrawal - Some queries will get money: "what websites accept Visa/Mastercard but not Amex" - Amex will pay for that; but many queries people won't pay much for an answer. We need to understand which queries have to be cheap and which can be expensive.
Weikum - Not sure money invested in the right things.
Timos - What is missing from DB ?
Weikum - Knowing which database to look in for the "gene expression data related to Barrett tissue" - there are many gene databases on the web – and each has its own schema.
Halevy - How much is understanding the query and how much is mapping it to formal SQL?
Silberschatz – Some I see how to map into a DB; the one about math I can't.
Weikum - There is the open math activity - suppose you have high school math text books and we have codified them into logic. Some inferencing capabilities in that - you can then mimic this. Pattern matching on XML.
Croft - What are the drivers for integrating IR & DBMS? You could build special purpose systems for each of your examples - or you could try to do this as IR. But where do you have to unify the systems or make them communicate?
Mike Franklin - That’s the key question
Stonebraker - If you want metadata - e.g. super-duper UDDI - that's what we bring to the table.
Weikum: Shouldn't we formulate this as a meta-query - not SQL.
Halevy-The fundamental problem mixing the two worlds is that we have a subquery in some formal world and we go to a repository and all we have is text. How do we come back with an answer to do joins?
Weikum - You could XML all the data you see on the web; but not sure which tags are important. Asked students to do researcher home pages and grossly underestimated the difficulty. And it still doesn't handle ambiguity
IR strengths
methodologically rich - statistics, probability, logic, NLP
appreciation and experience with machine learning
awareness of cognitive models for end-user intention and behavior
DB strengths
integrity, scalability, availability, manageability
system engineering
resource optimization - caching, memory mgt, query opt, physical design,
scheduling
Mike Franklin - Databases allow manipulation - update, summarize, aggregate. This is more than IR does. What about "find average salary of a
Croft: IR and Google are not synonymous.
Stonebraker – It’s easy to express your query once you have a table; the hard part is putting together the table.
Hellerstein - You don't even know what the breakdown should be. It is harder than you suggest.
Maier: Human attention is scarce resource. Where do you apply it? Writing metadata? Google harnesses this a little bit.
Weikum - DB & IR: issues & non-issues
issues
exploit collective human input
use ML & ontologies
flexible ranking to XQuery
use ML to convert Web to XML
extend Google to deep web
break google monopoly
acquire broader skills
non-issues - we can do these
crawl structured data
simple IR on XML
polish XQuery and implement efficiently
homepage.xml schema
Again we need probabilities. As a special case we do traditional DB with result certainty 1.
Google is popular because of ranking and coverage
Ullman: No, they were popular when they had less coverage.
Weikum - Afraid of Google having a monopoly. Want to have a peering system that spreads out queries.
Mike Franklin - Purest merger of DB & IR is in annotated scientific databases and this problem is important today. You need both DB & IR.
Mike Franklin - Info shadow is a problem. I look for Canyon Creek development near me and it is buried under a lot of stuff about the same name in Texas.
Gray - We need spatial search and also time – this pushes to a schematized metadata search – not just flat text.
Lesk - also proper names
Bernstein - Yahoo does categories.
Pazzani - Google had a student contest for new feature and the winner was geographic search.
Lesk - we need to think about video
Hellerstein - Sensors and sensor fusion generate lots of info there that is somewhat structured.
Snodgrass - We need results on impossibility. Which IR tasks aren't worth trying.
Lesk: IR doesn't do much of that.
Croft - We try to categorize queries. One TREC category is web search. We are learning about queries and what we can do, and which will work.
Gray - DB wants schematized search. IR doesn't. Syntax has lost to statistical search in IR. Is there a place where syntax works?
Lesk - OPACs - but Amazon seems to do better.
Gray - We should make the Web available as a study item for linguists, sociologists, etc.
Garcia-Molina – Our group at Stanford is doing that.
Stonebraker - I have a 6th grade daughter - her teachers ask her to look up dinosaurs. It is hard to find things appropriate - journal articles worthless.
Lesk – Search should do Flesch score and picture/text ratio.
Gray - Spam - learn what is not interesting and what is interesting in user context. Profile people
Croft - Contextual IR is an active area.
Move to what DB flavor in e-sciences
Hans - You said you wanted convergence - should ask about mergers
Bernstein - We focus on the query. The IR stuff is on preprocessing - organizing. He says MeSH/UMLS got them from 50% to 80% performance - likes thesauri.
Lesk - Disagree - schema helps very little. Described how the Internet Archive works. What could databases add?
Gray - It would run a lot faster – it is unusable now.
Dave Maier - Tobacco docs - comparing with open lit - query by a smoke chemist is "what in these docs contradicts the journal literature?" It is hard to do that.
Croft - That's info extraction
Lesk - see Futrelle doing that.
Weikum - What is the purpose of the query e.g. insider trading. That's text and numbers. They organize ahead of time. In your case you didn't know in advance how the documents would be used (about smoke chemistry.)
Hans - What services do the sides provide? IR can do text categorization, DB can do engineering
Final vote show of hands: 3-1 for "I couldn't find it" over "it wasn't online"
XML
a boring research topic?
a new frontier?
a means to keep standards people busy?
rapidly adopted by industry
format for exchange of small/medium pieces of data; when archived it grows to large volumes
a data model - for a wide range of kinds of data. not relational - permissive typing, full-text search
the database community should be involved and perhaps concerned
XML issues
storage of XML
native vs. XML-relational
lesson from OODB - this is a business issue but the vendors are not trying to block it
efficient representation and compression
key issue is interface - not clear whether it should be like a DB (DOM, SAX) or a query language - needs work
revisiting old topics
database design
integrity constraints
concurrency control
access control
reinventing the world
universal query language for XML
problems with Xquery - promoted by W3C
focus on complex queries; need simple filters, IR-style search
too complex, ambitious, too much politics
can you really go from documents to data
people want to do what they did in SQL and others want doc search - this is hard
can we undermine Xquery with something better?
thinks we need a small core OQL plus plug-ins
running late - we need a standard now
This direction deactivated by XQuery
scientific: is Xquery good or bad from a scientific viewpoint
politics: should we push for it
Weikum: SQL can be segmented.
Stonebraker - You can't talk about Xquery without talking about schema. That is what has to be subset. Big tension between what the doc guys would like and XML. XQuery does everything. It makes IMS look simple.
Gray - We had identified Google with IR and now we are identifying Xquery and Xschema with everything. We should think from a blank sheet of paper. OODB did not fail; it made object-relational possible. An approach here is that the train has left the station. We can't do much - there is an alternate path which is a much simpler query language. You should pursue that if you have a better idea.
Query optimization
for subsets of the language
tree structure is a new ball game - new index structures, cost models, etc.
depends on storage
revisit distributed query processing and view maintenance
everything being studied
Foundations
lots of work on semi-structured data
first-order logic and relational languages: strong
OQL/functional languages: reasonable
full-text search: messy
typing
much more complex than in relational world
not settled
query type checking, type inferencing, update consistency
very active area - people from DB theory, functional programming, etc.
all this again is active, but problems not simple, need more work.
real frontier: world is changing
old vs. new data management
Old -> New
closed world -> openness
client/server -> P2P
distributed db -> web-scale data
query/answer -> subscription queries, stream queries
active db -> active databases + web services, service discovery
QBE interface -> new interfaces
research must focus on new issues - not single site data
beyond XML: the semantic web
e.g. putting music on the internet was a very nice problem and the solution was elegant (Kazaa), even though the lawyers disagree - uses little traditional technology
Widom - When did you add semantic web? I'm not responsible for that.
Abiteboul - All this is syntax. Makes Ullman happy; the most fundamental difference from relational DB to web is that you don't know the semantics.
Ullman: A high order bit for the report is "is querying XML too important to be left to W3C".
Stonebraker: A simple thing to say is that Xquery is a pile of crap and XML schemas are a pile of crap and we can't influence that. If we had a clean sheet of paper and wanted to do something right, we would focus on merging doc world and structured data. No standards body can do this.
Widom: People are implementing this; it's too late.
Lesk- So what is an XML success story?
Abiteboul - Newspaper articles - All were in separate formats. Now they all use XML, particularly NewsML. Now we can merge 5 newspapers. You have parsers and editors and you can publish with very little effort.
Maier - The tools are very important. I studied data interchange formats and found that people agree on what things mean, and without tools they weren't used. Some things left behind like array data.
Gray: Another plug for code+data; HTML started and people wanted to send script and when you send me XML I don't know what it is, just a bunch of tags, you have to send me the methods as well.
Abiteboul - Before methods you need metadata; then you provide code. We should be more active - things like UDDI are dirty. We should be helping here.
Gray - Dave Clark has a nice model for standards. There is a period when it's too early for standards and a period when it's too late - research and production phases. You need to be in between. I do not think we are at the standards phase with our ideas yet. We still need more prototypes.
Abiteboul - We're working on Active XML - XML with embedded calls.
Stonebraker - You said we have to worry about views and updates – everything that came along with the relational model. It will be more complex – this is what collapsed IMS.
Widom - You can write lots of papers.
Stonebraker You're too optimistic
Abiteboul We have a lot of models. In a distributed session you probably will do some integration of things that are very relational -- integrating at the tuple level.
Stonebraker - Part of the IMS difficulties were restrictions on views.
Abiteboul - OODB also had trees.
Ceri - If there's a lot of XML data out there we don't have the luxury of not dealing with it. Even if hierarchical is the wrong way to start from scratch, we can't ignore it.
Gray - The IMS data model was designed by blue-collar programmers, no theory. Don't postulate that there is no good hierarchical data model because IMS failed 30 years ago. Nobody has ever tried properly.
Bernstein - We can count on incremental forward progress. All the relational products are making big investments in XML. The data capture is inherently semi-structured. e.g. there is always a "comment" field.
Widom - It would be absurd not to bless the area.
Bernstein - But people do think it is boring, the same areas as ten years ago.
Hellerstein - We should focus on more IR things with XML and here is a list of plausible real problems (the "new" in Abiteboul's last slide).
Widom - We spent all morning moaning about structured data and IR. This is a chance to do something about it. The next language should be more IR-ish. What went wrong with Xquery?
Alon - Too many politics.
Lesk - Look at
Maier - The manuals are too thick - but SQL is no better.
Kersten - Query sessions are missing from this discussion - not just one query in isolation.
Yannis - Database people know queries. Actual users explore in unstructured ways and this often finds the most interesting things. Queries are important; but other things are too. Context, personalized stuff, other modes of interaction.
Lesk - ranking, visualization
Hans - processes, flows, combinations of services .
Ceri - I want similarity based browsing.
Snodgrass - We don't know if algebra is better than calculus.
Yannis – It is not an issue of calculus vs. algebra. Declarative vs. procedural is more important. I did a study: for simple stuff declarative is fine; for more complex stuff procedural is needed. I don't know what kind of interface to give people. But none of this has to do with XML.
Maier – Why is there no XML on the web? Are we doing anything to help with XML that is streaming?
Abiteboul - Two questions a) not much public XML but lots in industry b) how do you handle changing data?
Hellerstein - If you take queries over streams and add distributed databases you get routing which is a big topic in the networking area.
Pazzani - In a startup XML is being used as an interchange language and then it gets dumped into relational DBs. Also used as an intermediary for different screens, etc. Not much going on in XML data bases.
Bernstein - Quite a lot going on. Talk to vendors; our product people can list many big-time customers with lots of XML data.
DeWitt - Is it simple or complex?
Bernstein - They want to do queries. There is a wide range of tasks. We can't move fast enough.
Widom - There is no relational data on the web either. We don't ignore RDBMSs.
Bernstein - Research on XML as a data model also has room for innovation. Don't be negative about lines of traditional database research that can be applied to XML
Widom - The conferences are 1/3 XML now. It is not a problem that there is not enough work.
Stonebraker - If you do research that competes with the vendors, that's not research. A big problem we have is that a lot of what we do is too close-in. Vendors will do this. We should do something Oracle is not doing.
Widom – For example, query optimization for XML is not for researchers
Stonebraker - Yes. Don't do that. Leapfrog to the next data model. XML stinks; it is too complex.
Widom - XML and XML Schema are different; the schemas are too complex.
Hellerstein - Our CS colleagues won't fund us to work on XML query optimization, but many other things would sound better.
Agrawal – This is not a firm statement but anecdotal info is that XML being stored right now is very simple. A relational tuple or other simple structure. The complexity of schemas that are coming is justified.
Widom – For example, an airline record has a few structured fields and the comment field; that does not need all of XML.
DeWitt - We should take a stand. We're going to get blamed for Xquery and Xschema. People will say it came from the DB community. Ullman said we should repudiate any association with Xquery.
Widom - We can't do that; we are already associated with it.
Croft - As an outsider reading DB papers I do blame you for Xquery.
Stonebraker This is easy we can say it is commercially important but we can do better.
Maier - What should we do as a data model if our goals are openness, peering, and so on?
Lesk: Whatever you do put <> around it and call it xml.
Widom - Nobody has a beef with just XML
Abiteboul - XML is just simple markup; then the schemas came and made the problem.
Widom - Why did everything get so complex?
Mike Carey - what is our purpose as a community?
1 - produce great new ideas: ie write off Xschema and forget it
2 - structure the field (credits to Jim and Phil)
3 - educate the workforce - building industrial strength software (claim: Paradise better than DB2)
Gray - Some of us - Dieter & me - work at companies with hundreds of PhDs who are doing the "how to make XML work" part. The community is working in this area, but where should the research work, not advanced development, go?
Carey - If we focused entirely on research many of the
DeWitt - Should we focus on Xquery optimization so you're educating the work force for the current jobs?
Gray - the academic community completely ignored SQL. They said it was brain dead. That was fine, it happened anyway. I think we are in a similar state re XML-XSD-XQuery today.
Stonebraker calls time. Ten minutes to lunch.
results of the poll on the gong show
federated, heterogeneous 13
querying the internet 10
personal db 8
open source 5
privacy 5
visualization/new interfaces 5
probabilistic 5
autonomic 5
db tools/cybertools 4
experiment management 4
So how to adjust the schedule: add querying the internet?
Hellerstein – No, we've done that.
Bernstein - We discussed visualization in 1989 - it never goes away.
Dave Maier –I would do experiment management.
Agreed to add that.
Stonebraker will do visualization, interfaces; frustrated that no one in this room is working on better UIs.
[Aside: Abiteboul is working with BnF on archiving the web; they are changing the law to get legal deposit on French (country) websites.]
Carey
Brief history of federation
Multibase @1980.
many attempts since - every few years with new model
functional, relational, object-oriented, logic-based, XML
still not solved. last night we all brought it up again
will we ever solve it?
Haas
top ten reasons against federation - I get whines about all of them
10. Robustness: Systems fail, sources unavailable, more pieces mean more failures, so robustness suffers. (objections: DeWitt - Google; Hellerstein - peer2peer; Stonebraker - your company is selling "sysplexes" which are single systems of things that can fail; one piece of big iron will do better than 500 Linux systems - sort of anti-federating.)
9. Security: different systems have different security mechanisms, hard to have a coherent view of permissions; more points of failure, harder to make guarantees; and data is sometimes the "corporate jewels" and needs to be protected. Schek - look at e-health: would you trust that to a federation?
8. Updates: recording change is not always an update. Sources may not be databases; may have to go through an application API to do an update. ACIDity - not all data sources support ACID properties – transaction semantics not always possible, e.g. our current system doesn't support 2-phase commit.
7. Configurability: hard to set up - too many architectures possible; many choices, little guidance. Lots of code to install and lots of connections to support.
6. Administration: hard to keep up - monitoring is hard; not all sources have tracking facilities; tuning is difficult; repairing is painful; need distributed debugging and you have to deal with different vendors.
5. Semantic Heterogeneity: hard to identify commonalities - same terms, different meanings (but this is also a problem in a single system with the same data)
4. Insufficient metadata: all sources have different metadata with no uniform standard
3. Performance (data movement): need to move data, geographic distribution is common and the WAN is slow; large data volumes common and you can't just cache because changes can be frequent and hard to track, plus storage is not unlimited.
2. Performance (complexity): decision-support applications do complex queries and choices give big differences in performance. Some sources may not have enough CPU power and you need expensive functions of data.
1. Performance (path length): simple queries - even OLTP-like - have huge overheads; simple queries are common - easier to write, automatically produced. Should use one big query for performance but that's not how they're written.
Mike Carey – Q: we have had these problems for 20 years so why will federated succeed? A: It has to: integration is a top IT issue and not going away; the alternatives are expensive and/or painful - write it by hand with 10 different APIs, EAI/workflow solutions, consolidation - warehouse, data marts.
Maier - How do you know about the data?
Bernstein - You do this in big meetings. Also simple scenarios exist - may not need high security or robustness for some applications. Customers know the data; need is great and compromise is possible.
Progress being made - 20 years of distributed query processing. Plumbing is in place; connectivity there. Reliable messaging. XML is now sort of basic agreement on how to exchange data. XML schema is a way of describing data. So we're getting closer.
What would we do if it worked?
retire? integrate the web - data google? p2p database?
Is research warranted? what are the most important topics?
Bernstein - The piece of this where we're making progress is semantics.
Maier - Look at blame allocation - be able to write down expectations of what the pieces should do and then be able to see what is happening.
Ullman - When you have enormous amounts of data you have to be uniform in your dealings. You can't write code for every 100 bytes. Once you have declarative languages you have to use query optimization.
Stonebraker - One thing Cohera found out that you didn't mention is that semantic heterogeneity nearly always involves dirty data - and cleaning data is better done in bulk.
Maier - In health data they want to get something going. Federated is easier and if that doesn't work fast enough they might try to put it in a warehouse.
Haas - We are doing a service integration system based on db2.
Croft - Does federation include resource discovery? Does it include schema?
Haas - Federation includes metadata - I didn't consider resource discovery separately.
Halevy - To feel better about what we have done we need to focus on who our customers are. If people can put things in a warehouse they will do so. We need to go after the people who can't do this, who must put data in a warehouse.
Mike Franklin - Semantic heterogeneity not so bad. Security is more serious. They won't let people into their systems.
Carey - Sometimes all you have is a minimal interface
Stonebraker - You often have a non-relational interface which you have to wrap and then try to federate at a relational level - You might be better off at web level.
Garcia-Molina - Why didn't anyone else vote for workflow? Distributed workflow is similar.
Hellerstein - On the topic of reliability, there is lots of exciting work in networking. You can find a key's value in a logarithmic number of links - p2p networks. The DB community doesn't talk to these people.
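[A rough, single-process sketch of the "logarithmic number of links" claim: a simplified Chord-style ring with power-of-two finger tables. No churn, failures, or real networking - the node count and hashing are invented.]

```python
# Toy Chord-style lookup on a simulated ring: each node keeps "fingers"
# at power-of-two distances, so a key's owner is reached in roughly
# log2(#nodes) hops. Purely illustrative, single process.
import hashlib

M = 16                        # identifier bits; ring of size 2**M
RING = 2 ** M

def h(s):                     # stable hash onto the ring
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % RING

def between(x, a, b):         # is x in the arc (a, b], going clockwise?
    return (a < x <= b) if a < b else (x > a or x <= b)

nodes = sorted({h(f"node{i}") for i in range(64)})

def successor(ident):         # first node at or after ident (with wrap)
    return next((n for n in nodes if n >= ident), nodes[0])

fingers = {n: [successor((n + 2 ** k) % RING) for k in range(M)] for n in nodes}

def lookup(key, start):
    ident, node, hops = h(key), start, 0
    while not between(ident, node, successor((node + 1) % RING)):
        # hop to the finger furthest along the arc toward ident
        node = max((f for f in fingers[node] if between(f, node, ident)),
                   key=lambda f: (f - node) % RING)
        hops += 1
    return successor(ident), hops

owner, hops = lookup("some key", nodes[0])
print(f"key owned by node {owner}, reached in {hops} hops")   # hops ~ log2(64)
```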
DeWitt - Distributed hash tables are not going to solve the world's problems.
Hans - You have underemphasized the problems of security and reliability. We can't live with low standards of accuracy - again see electronic patient record.
DeWitt - So what is the message? Laura says it's impossible and Stonebraker says it's done.
Lesk – The intelligence community tells me you only get a keyhole into db - they refuse to federate.
Agrawal - They want "need to know" information sharing - minimal information to be delivered. We have a paper coming out.
Stonebraker - Two great success stories & one great failure. (1) Airlines have been federating for years - very successfully. When you have only half a dozen elephants and a huge incentive it works. (2) Both Dell & Wal-Mart have federated their supply chains. One big enough elephant. (3) RosettaNet - electronics community trying to federate their supply chain. No big enough elephant and so it is not working. There is the same problem in autos.
Laura - Will work in specialized cases. we should solve some of these problems.
Hellerstein – Tools are good. We won't solve all of these - we need to deliver tools to content managers.
Ullman – I am the only CS person who says in public favorable things about TIA. The DARPA John Poindexter & AI community project. On 9/11 you had four guys with visible Al-Qaeda connections who went to 4 different flight schools with no connection to an airline. If you could integrate all these records, you could have asked the right query. This happens at two levels. a) How many al-Qaeda guys have been to flight schools? b) Even more ambitious - What strange things are going on? But how rare was this?
This is an interesting problem; locality-sensitive hashing to focus on connections. We need to find just a few events that are the most interesting. The technology is not there yet but it is an interesting problem.
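[A minimal sketch of one way locality-sensitive hashing could "focus on connections": MinHash signatures banded so that records with heavily overlapping attribute sets collide without comparing every pair. The records, band layout, and hash seeds are invented; this is one common LSH family, not necessarily the one Ullman has in mind.]

```python
# Sketch: MinHash-based LSH. Records whose attribute sets collide in any
# band become candidate "connections" to inspect, without comparing all
# pairs. Records, band layout, and hash seeds are invented.
import hashlib
from collections import defaultdict

def minhash(items, num_hashes):
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in items)
            for seed in range(num_hashes)]

records = {
    "A": {"flight school FL", "visa B1", "paid cash", "one-way ticket"},
    "B": {"flight school FL", "visa B1", "paid cash", "rental car"},
    "C": {"library card", "season ticket", "gym membership"},
}

BANDS, ROWS = 8, 2                     # 8 bands of 2 hash values each
buckets = defaultdict(set)
for rid, items in records.items():
    sig = minhash(items, BANDS * ROWS)
    for b in range(BANDS):
        band = tuple(sig[b * ROWS:(b + 1) * ROWS])
        buckets[(b, band)].add(rid)

# pairs sharing any band are candidates; A and B overlap heavily, C doesn't
candidates = {frozenset(ids) for ids in buckets.values() if len(ids) > 1}
print(candidates)
```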
Gray - The license plate of the guys who were the Washington sniper was looked up 18 times in a few weeks. Nobody noticed this large number of lookups (and all were in the vicinity of one of the shootings) - because of different systems.
Ullman - You need Bayesian theory to tell you how unlikely something is.
Agrawal - Data Mining - Potentials and Challenges
observations
some transfer of data mining research into products
most in vertical applications
horizontal tools - SAS Enterprise Miner, DB2 Intelligent Miner
data mining in non-conventional domains
new challenges because of security/privacy concerns
DARPA initiative to fund data mining research
identifying social links using association rules
crawled about 1M pages and found Arabic names and charted links to make a social network. The most popular name was Al Gore - they blew the Arabic name identifier.
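[A rough sketch of the page co-occurrence idea: names that appear together on enough crawled pages become edges, scored with association-rule style support and confidence. The pages, names, and threshold are invented.]

```python
# Sketch: mining social links from crawled pages. Names that co-occur on
# enough pages become edges, scored with support and confidence.
from itertools import combinations
from collections import Counter

pages = [
    {"Ahmed X", "Bilal Y", "Al Gore"},
    {"Ahmed X", "Bilal Y"},
    {"Ahmed X", "Carlos Z", "Al Gore"},
    {"Bilal Y", "Al Gore"},
]

name_count = Counter(n for page in pages for n in page)
pair_count = Counter(frozenset(p) for page in pages
                     for p in combinations(sorted(page), 2))

MIN_SUPPORT = 2                          # illustrative threshold
for pair, support in pair_count.items():
    if support >= MIN_SUPPORT:
        a, b = sorted(pair)
        conf = support / name_count[a]   # confidence of rule a -> b
        print(f"{a} -- {b}: support={support}, conf({a}->{b})={conf:.2f}")
```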
Hellerstein - Why not use a graph clustering algorithm?
Agrawal – We are using association rules.
Ullman - You need a strength measure.
Agrawal - website profiling using classification. training on labels like "Islamic leaders", etc.
Discovering trends using sequential patterns and shape queries - trends in patents: heat removal, emergency cooling, zirconium alloy, feed water. You look for a shape in the graph of % mentioned vs. year for those words. You sketch a "resurgence" in this case - V-shape - drop and then come back.
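[A toy sketch of a "resurgence" shape query: flag terms whose yearly mention curve drops and then recovers (a V-shape). The per-year percentages and thresholds are invented.]

```python
# Sketch: a "resurgence" shape query over yearly mention frequencies -
# flag terms whose curve drops and then recovers (a V-shape).

series = {
    # term: % of patents mentioning it, by year
    "zirconium alloy": [3.0, 2.4, 1.5, 0.8, 0.9, 1.8, 2.7],
    "feed water":      [1.0, 1.1, 1.2, 1.3, 1.2, 1.3, 1.4],
}

def resurgence(values, min_drop=0.5, min_recovery=0.5):
    dip = min(range(len(values)), key=values.__getitem__)   # index of the low point
    drop, recovery = values[0] - values[dip], values[-1] - values[dip]
    if 0 < dip < len(values) - 1 and drop >= min_drop and recovery >= min_recovery:
        return dip
    return None

for term, values in series.items():
    dip = resurgence(values)
    if dip is not None:
        print(f"{term}: resurgence, dip at year index {dip}")
```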
They are discovering microcommunities - tightly coupled bipartite graphs – e.g. Japanese elementary schools, Australian fire brigades - you find tight graphs and then you manually label the areas.
new challenges
privacy preserving data mining
randomizing the data in a way that destroys individual data but not the summarizing stuff
cryptographic approach
privacy preserving discovery of association rules
data mining over compartmentalized databases
frequent traveler rating model - with demographics, credit ratings, criminal records, etc.
TIA
was going to build a giant warehouse and got flack
perhaps one could use randomized data shipping or local computation.
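[A minimal sketch of the value-randomization idea mentioned above: perturb each individual value with zero-mean noise before release, so no single record can be trusted but the aggregate is still recoverable. The real work reconstructs full distributions; this only shows the mean, and all parameters are invented.]

```python
# Sketch: randomize individual values before they leave the source, so
# no single record is trustworthy, yet the aggregate survives because
# the noise averages out. Parameters are invented.
import random

random.seed(0)
true_ages = [random.randint(20, 70) for _ in range(100_000)]

NOISE = 25                                       # +/- range of the perturbation
released = [a + random.uniform(-NOISE, NOISE) for a in true_ages]

# Any single released value may be off by up to 25 years...
print("one record: true", true_ages[0], "released", round(released[0], 1))
# ...but the mean survives, since the noise has zero mean.
print("true mean", sum(true_ages) / len(true_ages))
print("estimated mean", sum(released) / len(released))
```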
Croft - System to return a probability that it can return relevant data and then you go get permission.
Stonebraker - My discomfort is that in theory all warehouses are built for data mining but in fact nobody is doing any of it and the vendors are going broke. The people I talked to were doing fairly simple things. No statistical expertise on their staff.
Agrawal - Lots of leading companies are doing this.
Weikum? - The field is approaching saturation. Interesting research but it is not for 10 years. It's incremental.
Silberschatz: If we solve TIA in 10 years I would be surprised.
Ullman - even if you give me everything in the world integrated I still can't ask the right question. even more mundane - what is a gene.
Agrawal
some hard problems
past poor predictor of future
abrupt changes; wrong training examples
actionable patterns
how do we find what is surprising?
over-fitting vs. not missing the rare nuggets
how to ensure not overfitting - still hard
richer patterns
in medical domain - you need DAGs
simultaneous mining over multiple data types
text, voice, and structured data
when to use which algorithms
avoid "everything looks like a nail to a man with a hammer"
automatic selection of algorithm parameters
CMU is now offering a degree in data mining (Tom Mitchell running program).
Pazzani - Management schools have been doing some of this for decades
Hellerstein - Many of us don't understand statistics - we should be educating ourselves. The undergraduates should be taught a bit more.
Gray - There is a popular book by Jiawei Han that is a nice intro and course. The challenge is that SAS and other tools are chauffeur driven. We have to make it easier. The science community has a size problem. Business has 1000s or 10ks of records or can subset and use quadratic or cubic algorithms. Science users have very large datasets (billions). They need log-n or linear heuristics. GenBank is about 40 GB right now - fairly small.
Hellerstein - We have an area that overlaps with statistical AI. We need to talk about what we contribute. people tell us our math skills are not up to the job.
Discussion
Is data mining "rich" querying? Is it "deeply" integrated with database systems? Most current work makes little use of database functionality.
should analytics be an integral concern of database systems
issues in datamining over heterogeneous data repositories.
Weikum: Should data mining be linked to data quality? Biomed people very anxious about this.
Agrawal - yes.
Pazzani: DB community could teach machine learning about data that doesn't fit in main memory. You must avoid things that take 10 passes over the data.
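[A small illustration of the single-scan point: Welford's online update computes mean and variance in one pass over a stream, so data that doesn't fit in main memory is never revisited. Purely illustrative.]

```python
# One-pass illustration of Pazzani's point: compute mean and variance
# with Welford's online update - a single scan, nothing held in memory.

def one_pass_stats(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:                 # each value is seen exactly once
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)     # accumulates sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# Works over a generator, so the "table" is never materialized.
rows = (float(i % 100) for i in range(10_000_000))
print(one_pass_stats(rows))
```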
Snodgrass - Perhaps we should focus on summarization, visualization, then let people make deductions.
Ullman - I agree, this is one aspect but if all you have is visualization you need help. Suppose you have 10-D data and you have to know which are the most interesting dimensions.
Ceri - What about semi-structured data?
Ullman - I've seen it but it's derivative.
Abiteboul - I've also seen it.
Stonebraker - this is boring, what to do?
Density of incrementalism to insight is high.
Gray & Lesk: Suggested tossing the agenda and asking if anyone was passionate about anything other than selling your own research.
Schek: - We just have too few breaks; people want fresh air (1/2 the group had left after the break).
Maier - So what? Should we plan the wake for DB?
Gray - In previous meetings there has been conflict - relational vs OO; logic programming, XML.
Stonebraker: I'm happy to present a controversial vision statement. What's the purpose of this meeting? In previous cases there were research branches - right now I don't hear the controversy - we are all working away - not at a turning point.
Gray - Why are we here? It's a 5 year interval - no specific agenda. It was not that the field is in crisis. Last time we said text was going to be important but we have not done squat.
Schek - Other people did the work.
Ullman - I proposed 1 hr ago that the DB community should take charge of TIA. Use a systems approach. the spirit of TIA today is an AI spirit. Describe a wonderful vision with no idea how to do it. I'd rather work on version 1.0.
Croft: - Enumerate research issues in TIA
Ullman - Make clear it is a database rather than an AI issue.
Snodgrass - If you look at the last reports they state 30-40 year goals at a high level and of course we haven't reached them.
Agrawal: We should have some nearer term goals.
Croft - So what have we done in the last five years? (Xquery?)
Stonebraker -It looks to us like we're dead on our feet.
Gray - I'm excited but it's applications, and I'm filling in gaps.
Stonebraker – OS people have quit doing that work - perhaps DB is a mature field and we should also drop things like query optimization. So I propose- we morph after dinner - 3 or 4 people to present visions of some sort that can't be achieved in ten years and listen to that.
Agrawal - One thing that would focus or excite us is some interesting application and TIA might be that thing It has database issues.
Gray - I have political problems with that. TIA has a big-brother overtone.
Stonebraker - This evening is anyone can get 15 minutes to say something that can't be accomplished in the next decade. No restrictions other than that.
Ullman - I understand the political issues about TIA - but it needs to be done. Just as city dwellers 5,000 years ago needed walls around their cities. It is a national need. The government gives guns to 1.5M people and relies on them not to invade your home. The political problem is to create analysts who get information and don't abuse things.
Stonebraker - This is a subset of heterogeneous federation and data mining.
Lesk: Three challenges e-science, TIA, personal memex [we've now killed 20 mins without getting anywhere]
Stonebraker: Integrating the deep web.
Gray: we have 24 hours left. Is the field really stagnating? Should we look for other careers?
Stonebraker - This discussion is very similar to the one 5 years ago.
Abiteboul - In 1981 people told me databases were dead.
Gray - What has been discussed so far is incremental. Oracle, IBM working in the mainstream. What should the researchers be doing?
Abiteboul But those guys don't publish so we need to do the same work.
Gray - They write a lot of papers.
Croft - Other areas are defining testbeds - so people could compare techniques. E.g. MT recently - it was moribund and then a new measure was defined 1.5 years ago and excitement is way up (overlap of n-grams).
Ullman: When you define a measure of progress people make it increase.
Croft - You have to come up with good measures
Maier - Alon was saying for semantic integration what if we found something for people to try - a corpus of 1000 large databases.
Garcia Molina - Why is it bad to have the same list as 5 years ago. These are hard problems - should we only work on things we can solve in a year or two?
Bernstein: - It would be a problem if we had only the same solutions and were making no progress.
Gray - What progress have we made in the last seven years? Lots of things in data mining, cubes, auto-tuning, materialized views. In 1996? 1976? Don Slutz was sending queries to DB2, SQL systems - 90% of the time he got the answer and the rest of the time he got a crash. Today you can use database systems and that is a result of research. Research in QA, fixing query optimizers.
Garcia-Molina: Is Google an accomplishment of last five years?
Silberschatz: Do we teach Google in the DB community? (general: yes) I have a lot of data on my desktop and I don't use any database tools to manage it.
Bernstein - People use Outlook to manage their contacts (1/3 of the room?)
Hellerstein - the failure with the gong show is that we talked about other people's work. (Laura had said this earlier).
Q: Should we just repeat the last report? Say it was the right program.
Croft: how do we move ahead? A number of people said this was a really exciting time - so much data around and people care about it.
Lesk - Get people to do their own queries, just like IR; that's what made it exciting.
Maier - We have a lot of people who were at Laguna. Many of us are on our last research project. I can't do something which is ten years out. Maybe we have the wrong people.
Hellerstein - Disagree completely; wisdom has value. Phil, for example, can take risks at this stage of his career.
Garcia-Molina - The world is knocking on our door. There is a threat from terrorists and are we going to say there is nothing to do.
Maier: Who's bored with their current work? (only Ullman puts his hand up) Carey and Halevy were the chairs of the two main conferences - what are the big issues?
Halevy - We had a lot of data mining papers and all but one were rejected.
Stonebraker - I can summarize as "in the past there has been a sea change" and in 1997 it was the web. Now we're just plodding along.
Gray - Webservices are a sea change. People can now publish info on the Internet, not just html.
Abiteboul - Deep web.
Franklin - Instead of a gong show we go around and you get 30 seconds for what excites you.
Stonebraker - we will spend 1 hr after dinner giving 2 mins to each person to say what you're excited about or to present a grand challenge.
applications
real-time enterprise
financial data feeds
supply chain management
sensors
environmental monitoring
RFID - radio frequency IDs - E-ZPass type - Gillette just ordered 500M at 10 cents each.
Network monitoring
the sensors are the things that have triggered the big interest
what are the issues?
quality of service?
what's wrong with existing technology?
issues
push+latency: the data just comes but it ages fast
dbms - system controls data flow and optimizes throughput
sdms - sources control data delivery and you optimize latency
update followed by query - not fast enough
overload is possible - rate-based processing
DeWitt - I see no evidence that optimizing for latency & throughput are different. If you take a standard DBMS and forget about persistence it's the same.
Gray - Standard systems have response time thresholds and try to answer as much as possible. It is the same thing.
Croft - We also need different architectures to do 100K profiles against news wires.
Gray - In databases you treat queries as records and it works.
Maier - Is there always duality like that - queries and data invert?
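A rough sketch of the "queries as records" idea raised here: index the standing profiles by term so each arriving news item probes the index instead of scanning all 100K profiles. The profile contents and the all-terms matching rule are invented purely for illustration.

    # Sketch: treating standing queries as data (hypothetical profile matcher).
    # Each "profile" is a set of required terms; we index profiles by term so an
    # incoming news item probes the index instead of looping over every profile.
    from collections import defaultdict

    profiles = {                       # hypothetical standing queries
        "p1": {"earthquake", "chile"},
        "p2": {"merger", "oracle"},
        "p3": {"earthquake", "insurance"},
    }

    index = defaultdict(set)           # term -> profiles mentioning it
    for pid, terms in profiles.items():
        for t in terms:
            index[t].add(pid)

    def matches(item_text):
        """Return profiles whose every term appears in the item."""
        words = set(item_text.lower().split())
        candidates = set().union(*(index.get(w, set()) for w in words)) if words else set()
        return [pid for pid in candidates if profiles[pid] <= words]

    print(matches("Earthquake strikes Chile coast"))   # ['p1']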
adaptivity
loads change - so can not do a static plan
adaptive optimization issues
scheduling, load shedding, distributed bandwidth-aware optimization
correctness
semantics may not be deterministic
approximation, independent streams not synchronized
transactions do not seem central
update in place not the norm
overlap of answer arrival with query processing
mix queue-based processing with traditional storage
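A minimal sketch of the windowed, queue-based processing listed above: tuples arrive with timestamps, old tuples expire out of a sliding window, and a query reads whatever is currently in the window. The window size and the sample data are assumptions made only for illustration.

    # Sketch: queue-driven processing with a sliding time window (illustrative,
    # not any particular system). Tuples arrive on a queue; we keep only those
    # from the last WINDOW seconds and answer an aggregate over that window.
    from collections import deque

    WINDOW = 60.0                       # seconds (assumed window size)

    class WindowedAvg:
        def __init__(self):
            self.buf = deque()          # (timestamp, value) pairs, oldest first

        def push(self, ts, value):
            self.buf.append((ts, value))
            self._expire(ts)

        def _expire(self, now):
            while self.buf and self.buf[0][0] < now - WINDOW:
                self.buf.popleft()      # drop tuples that fell out of the window

        def query(self, now):
            self._expire(now)
            if not self.buf:
                return None
            return sum(v for _, v in self.buf) / len(self.buf)

    w = WindowedAvg()
    for ts, temp in [(0, 20.0), (30, 22.0), (90, 24.0)]:
        w.push(ts, temp)
    print(w.query(now=100))             # only the reading at t=90 survives -> 24.0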
Silberschatz - At Lucent we worked on real-time billing - you append the call record in the database - and later you ask about the database.
Hellerstein - The only fun here is when you do distributed - push processing into the routers.
Ceri - You have all these queries coming and you want to combine them – need better ways to do that.
Carey - need a benchmark for streams (Widom says it's being done).
Stonebraker - we are writing a linear road benchmark and running on stream systems as well.
QoS - quality of service during overload conditions
overload -> degrade answer
who needs it? same for all?
admission control (turn away) or priority control (delay)
on requests; on data
degrade operators:
smaller window size; approximate operators
need to know what the use is -- a billing system isn't going to throw away the billing records but a sensing system might well toss inputs.
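A rough sketch of "overload -> degrade answer": once the input queue passes a threshold, keep only a fraction of the incoming tuples so latency stays bounded. The threshold and keep-probability are invented; as the note above says, whether shedding is acceptable depends entirely on the application.

    # Sketch: probabilistic load shedding under overload (illustration only).
    import random

    MAX_QUEUE = 1000          # assumed capacity before shedding kicks in
    SHED_KEEP_PROB = 0.1      # keep ~10% of tuples while overloaded

    def maybe_admit(queue, tup):
        """Append tup to queue, shedding probabilistically under overload."""
        if len(queue) < MAX_QUEUE or random.random() < SHED_KEEP_PROB:
            queue.append(tup)
            return True
        return False          # dropped; a billing system could never do this

    queue = []                # nothing drains the queue in this toy example
    admitted = sum(maybe_admit(queue, i) for i in range(5000))
    print(f"admitted {admitted} of 5000 tuples")   # roughly 1000 + 10% of the rest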
Gray - We can overprovision. You need to worry about fault handling not overloading.
Stonebraker If a missile is coming in you drop everything else on the floor.
Gray No you don't you have dedicated missile tracking hardware.
Stonebraker That is not what they do. I have talked to these people
Zdonik - When something is coming in you get too many pictures.
Maier:
What's wrong with existing technology?
MQSeries + WebSphere + DB2
performance, performance, performance
scaling: speed, volume, # of requests
too many boundary crossings (between these three systems)
linear road benchmark intends to prove this
second order effects
data model is wrong
triggers don't scale
No QoS
[losing track of this]
Q - Is the Linear Road benchmark the OO1 of streams technology - the benchmark that showed relational was 1000X slower than object-oriented?
Maier - Can we build such a benchmark? Benchmarks can be drivers of innovation like TPC - expanded database capabilities.
Franklin - The reason OO guys liked OO-1 was caching - Relational guys still have problems doing that. Now we are working on putting big caches in front of relational databases.
Stonebraker - We started storing data then we added procedures, then triggers, then queues.
Gray And then text.
Stonebraker - Would you do better starting with a clean sheet of paper?
Gray - Postgres and MySQL and the OODB guys started with a clean sheet of paper. What do you think of the results? It takes a long time just to get to where the state of the art is.
Hellerstein - There is a problem making the data fit - sensor fusion – e.g. calibrating temperature measurements.
Abiteboul - Applications of streaming - e.g. in security, people watching streams of everything involved in intrusions. Processing the output of web crawlers - you have to do things on extremely fast streams. He wants research on what can be done with a given amount of memory, etc.
Ceri – There might be kinds of queries you can do on the fly - they are monotonic and don't aggregate
Agrawal - worried about the security impact.
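One minimal illustration of the bounded-memory point above: reservoir sampling, a classic technique that keeps a uniform random sample of an unbounded stream in fixed space. This is only a sketch of what fixed-memory stream processing can look like, not anything presented here.

    # Sketch: reservoir sampling - k uniformly chosen items from a one-pass stream.
    import random

    def reservoir_sample(stream, k):
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = random.randint(0, i)        # replace with probability k/(i+1)
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), 5))    # 5 items, memory stays O(k)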
------------------------------------------------------------------------
Stonebraker - dinner at 6pm - some room on the first floor - after dinner we will come back here for gong show 2: 2 minutes of passionate discussion from each person on what they think is a great vision or challenge problem.
mine:
personal memex
universal knowledge
individuals use data themselves
tomorrow
30 min vis
60 min pdb
60 no knobs
30 min on trust
that's the morning
------------------------------------------------------------------------
Dave DeWitt & Hans Schek - Laguna Beach report from Feb. 1988
Bernstein, Dayal, DeWitt, Gawlick, Gray, Jarke, Lindsay, Lockermann,
Maier, Neuhold, Reuter, Rowe, Schek, Schmidt, Schreffl, Stonebraker,
Ullman (joke)
this was a controversial report - there was even a counter-report
Future DB applications it suggested
CASE (software - ugh), CIM (manufacturing), Images (yes), Spatial (yes), IR (yes)
future h/w environment
continue to consume hardware resources as fast as they became available
special purpose DB machines were a dead end - (right)
future s/w environment
DB & OS types would continue to do battle
"we'll be stuck with current OS interfaces just as our clients are stuck with SQL" –
the context was MVS
4GLs would solve the PL/DBMS impedance mismatch - UGH
(protests - some people use things like Tomcat)
PROLOG+DBMS will yield nothing of lasting value
too many groups doing the same thing
(objection: bought three nice houses in
extendible DB managers
big debate on OR vs. Toolkit (no conclusion)
heated debate on OODBMS
"misguided" or "highly significant"
lacking a clear definition of the approach
active databases and rule triggers
strongly endorsed DBMS support for triggers, alerters, constraints with
high performance
no need for general recursive queries
(analogy drawn between its fate and the fate of Nth normal form) –
this was most controversial statement
end user interface
universal agreement that we needed better user interfaces
lamented lack of research
difficult to publish papers
reviewers hostile because they lacked graphs and equations
need to "demo or die"
lack of toolkits (e.g. X11)
single site db technology
hardware trends would require rethinking optimization, execution, run time
concurrency control dead as a research topic
support for parallel DBMS research
stop doing new access methods (except spatial)
distributed DBMS
enough research - commercialization about to come
only problem was administration of a system with 50,000 nodes (got this wrong)
miscellaneous topics
no-knobs physical db design - including index selection and load balancing
tables across disk arms
better logical design tools
support for continuous streams of data
no more data models please
data translation was a solved research problem
better support for information exchange
(most people liked the report –
Ullman objected that if you're going to crap on theory you should invite someone from the theory community)
Hans Schek.
has the original foils of the presentations
everyone had 4 topics to recommend and 2 topics people should not work on.
I picked those from the people who are here - Bernstein DeWitt Gawlick Gray Maier Schek Stonebraker
Bernstein
pro distributed sys admin
TP application schemas
auto data translation
active databases
con database machines
real extensible database systems (ORDB) - not research
Dave DeWitt
pro dbms for scientific apps
CASE support by DBMS
optimization of queries over complex object hierarchies
active DBMS - whatever that means
con general recursive queries
hardware sorters, filters
concurrency control
object oriented dbms that mention encapsulation
Dieter Gawlick
pro productivity & operations
technology for transaction processing
interdisciplinary communication
access patterns
Gray
pro procedures in db systems
automatic db physical design
disaster recovery - data and application replicated
10 years continuous operation
large or exotic db, 10^12 recs video fax sound
specialty databases - case, cim, geo
Maier
Pro single object constraints
physical representation language
update semantics for logic DBs
constructive type theory
con storing DML as strings in DB - not compositional (anti Postgres)
behavior-only object models (Maier says he hasn't a clue what this means)
Schek
pro systematics on semantic data models, knowledge rep, ...
optimization mapping to kernel operations
host language coupling with external operations
tight DBMS cooperation with applications
con more recursive query processing - missing applications
Stonebraker
pro integration of 4GLs prog lang and DBMSs
1000 node distributed DBMS
abolition of IR systems as one-offs - efficient text in general-purpose DBMS
end user usable application development environment
con recursive query processing
interface between prolog and DBMS
(at the time - 1988 - we were just seeing the end of the 5th generation hype)
Abiteboul - have seen good applications of constraint logic programming - don't throw out everything.
Ullman - logic has had some impact but not in query languages
Gray - Let's just modernize this and ship it.
Yannis - What has been accomplished in the 15 years that was a consequence of that report, on the positive side?
Gray - no-knobs stuff, disaster recovery.
Silberschatz: Did we envisage web in 1988? data mining?
Maier - most prescient is data streams
The Tioga system has developed into something sold by Rocket Software and shown as an example of user interface - maps with colors reflecting certain properties. Also, like PAD++, it has zoom-in on data: dots turning into company icons turning into company financials and stock history. The name is Visionary. Pan and zoom over geography. Canvases can be nested and you can put holes in a canvas to see what's behind it. Demo for dairy farmers showing meters as well as geography for milk quality. Web-like; you can refresh but it doesn't dynamically follow the data coming in. (Stonebraker is running on the laptop DB2, Access, SQLServer, and I think one other piece of DB software.) Uses ODBC connections to database tables - has wizards to help write SQL - can display results in layouts like chart, control, form, hierarchy, map or pattern. E.g. hierarchy - you pick "superid" and "empid" to link supervisors to their employees - various choices to pick colors and the like - says the dairy farmer app was created in half a day (by an expert). Other than Ben Shneiderman, why don't we do things like this? (A few people objected that they also did UI.) Larry Rowe asked why GUIs were 2nd class citizens in 1988 and they still are. Visualization is still done as scientific viz and not database viz. We're missing an opportunity.
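A minimal sketch of the supervisor/employee linkage the hierarchy wizard implies: join "empid" to "superid" on a single table. The table name, columns, and data are hypothetical, not Visionary's actual schema.

    # Sketch: the self-join behind a "hierarchy" layout (hypothetical schema).
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE emp (empid INTEGER PRIMARY KEY, name TEXT, superid INTEGER)")
    con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                    [(1, "Ann", None), (2, "Bob", 1), (3, "Carla", 1), (4, "Dev", 2)])

    rows = con.execute("""
        SELECT s.name AS supervisor, e.name AS employee
        FROM emp e JOIN emp s ON e.superid = s.empid
        ORDER BY supervisor, employee
    """).fetchall()
    for sup, emp in rows:
        print(f"{sup} -> {emp}")        # Ann -> Bob, Ann -> Carla, Bob -> Dev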
Croft - Same situation in IR - sprinkling of visualization papers but it is difficult to do research in this area because hard to evaluate - people show nice examples and the paper got rejected. Also in IR very high dimensional data - hundreds of thousands of points - you can get galaxies of points but they are useless for finding things - hard to find a powerful visualization.
Gray - from Terraserver - we did prototype and then got in a graphic artist and the difference was stunning. There are three areas -- data storage & retrieval; graphics; and programming; and they're separate.
Stonebraker - the Tioga papers all got turned down.
DeWitt - Suppose no papers - submitted the interface - at the talk you do a demo. Would that help?
Serge - There is a man-machine interface community with their own conferences, and they're better qualified than SIGMOD to evaluate this stuff.
Stonebraker - We can keep opting out of this work but we're drawing a smaller circle around our work.
Maier - 1983 SIGMOD in San Jose -
Stonebraker - but we do storage prototypes.
Ceri - my system is a graphical interface for the Web - to publish on the web you need something like this - we enabled people to use an existing tool. Not our responsibility to do the layouts but to use existing tools. We shouldn't do this but help others.
Maier - another opportunity for visualization - talked to people who do volume rendering - the main memory stuff doesn't work - their papers rediscover how to lay out data on disk to do rendering and zooming well in large data sets. You have to partner with people who do visualization algorithms.
Stonebraker - ditto for data mining. We could make a contribution if we chose and we choose not to.
Gray - There is a mindset of "it's not in main memory" - 4 GB of RAM costs $1K and most people's data are smaller than that. I am working with a student who complained to me that he got 900K records/sec from the DB and he expected 1M. In main memory things go fast. If you don't use databases you can make vis very hard - indices help.
Lesk: Why did you take up text but not interface?
Hellerstein - (1) we have colleagues in HCI we should talk to them about how to evaluate this. the IR folks are better at infecting the HCI conferences than we are. we have to learn their methodologies. (2) we lost a really important topic which is language spec - that can be done visually better than with languages - people want to point at things and show example calculations - we got papers into VLDB - we had an interesting spreadsheet - the problem is visual languages and for schema heterogeneity this is the place to start - people have to understand it.
Sellis - Suppose you had to pull in data from multiple applications on different machines? What are the problems?
Stonebraker - This opens an ODBC connection to anything you want anyplace you want. Does no distributed joins except on the screen. There are 50,000 line segments in the
Ceri - there are web services experts - but we should keep some of this work inside our community because the HCI people don't understand the paradigm of easily specifying joins and data navigation. This can be part of our field; but web rendering is not something we should touch.
Yannis - there is an advanced visual interfaces conference - not mainstream DB but it does bring the DB and HCI people together.
Stonebraker - I got pissed in 1993 and wanted to start in this area - I may be world's worst graphical artist but we got somewhere just from the idea of wormholes and moving closer to data.
Alon - there is also NLDB - why not also talk about natural language interfaces?
Stonebraker - we used to do front ends - 4GLs - that was respectable in SIGMOD once upon a time and now it isn't.
Gerhard Weikum
Simple case: only look at what I have on my PC
basis: email archives, memos, programs you wrote or ran, photos taken, Web pages ever visited - for last 30 years.
scenarios:
who was the French woman I met at a program committee meeting where Peter Gray was chair?
when did I think of the idea for this paper and how can I prove that?
which book did I read while on the flight to my first VLDB 20 years ago?
difficulties/opportunities
substitute data sources for your own memory
query by associative memory - approx time, place, person, institution, anecdotal events
transitive closure over combination of similarity predicates
lazy and sloppy info organization (e.g. folder names)
automatic annotation; named entity recognition
automatic and evolving classification
very long time span with changing terminology & interpretation
classification, authority, etc. needs to be time-aware
understand that "stream" might be environmental at one time, sensors later
personalized interpretation and bias of terminology evolving over very long time span.
continuous learning & re-learning of preferences & biases from user interactions
[this is heavily about browsing more than detailed searching]
storing all this stuff is a solved issue.
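A rough sketch of "query by associative memory": rank archive items by a mix of soft predicates on approximate time, people, and place. The items, weights, and scoring rule below are invented purely to illustrate the idea of combining similarity predicates.

    # Sketch: ranking personal-archive items by fuzzy associative predicates.
    from datetime import date

    items = [
        {"what": "PC meeting notes", "when": date(1998, 6, 10),
         "people": {"peter gray", "marie dupont"}, "where": "edinburgh"},
        {"what": "VLDB trip photos", "when": date(1983, 9, 1),
         "people": set(), "where": "florence"},
    ]

    def score(item, about_year=None, person=None, place=None):
        s = 0.0
        if about_year is not None:                      # soft time match
            s += 1.0 / (1 + abs(item["when"].year - about_year))
        if person and any(person in p for p in item["people"]):
            s += 1.0
        if place and place == item["where"]:
            s += 1.0
        return s

    hits = sorted(items, key=lambda it: score(it, about_year=1999, person="gray"),
                  reverse=True)
    print(hits[0]["what"])     # "PC meeting notes" ranks first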
Maier - You missed format conversion: “I went back looking for a 1994 PowerPoint for my algorithms class - I have troff documents linked to particular files.” It's not just saving the software but reliance on libraries and scripts and so on. How do you capture the knowledge - it is application-owned. Having it around in the future is a problem.
Weikum: Two standard answers - migration and emulation.
Agrawal - Ray Lorie has a project to do this - keep things around 100 years.
Lesk - Why don't indexing file systems catch on?
Agrawal - is there a study for Microsoft Exchange? we looked at Lotus Notes - more than 30% just leave their email in their inbox.
Weikum - Users are lazy
Lesk - why shouldn't you leave the searching to the computer?
Gray - Personal Memex.
Gordon Bell wrote a paper, "Dear Appy: my programs and data won't talk to each other any more." Appy writes back and says you have to put your stuff in a gold standard - which means something that has $1B behind it - ASCII, TIFF, PDF are OK, HTML is possible, Postscript is not. And you have the media problem - floppy disk or paper tape.
Vannevar Bush - memex paper - imagine trying to fill a terabyte in a year
300 KB JPEG: 3.6M items/TB - 9,800 items/day
1 MB doc: 1M items/TB - 2,900 items/day
1 hr MP3 at 256 kb/sec: 9.3K items/TB - 26 items/day
1 hr video: 4
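A quick check of the arithmetic above (assuming 1 TB = 2^40 bytes, a 365-day year, and the item sizes listed; the slide's figures round slightly differently):

    # Sketch: items/TB and items/day for the memex "fill a terabyte" argument.
    TB = 2**40
    for label, size_bytes in [("300 KB JPEG", 300_000),
                              ("1 MB doc", 2**20),
                              ("1 hr MP3 at 256 kb/s", 256_000 // 8 * 3600)]:
        per_tb = TB // size_bytes
        print(f"{label}: {per_tb:,} items/TB, {per_tb // 365:,} items/day")
    # ~3.7M items/TB and ~10,000/day for JPEGs, ~1.0M and ~2,900/day for docs,
    # ~9.5K and ~26/day for MP3 - in line with the figures above.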
Gordon Bell MyLifeBits:
20K pages of TIFF - 1 GB
2K tracks of music - 7 GB
13K photos - 2 GB
10 hrs video - 3 GB
3K docs (ppt, wd) - 2 GB
100K mail messages - 3 GB
18 GB total
Now recording everything - all conversations, etc. He named everything on the way in - big folders. He couldn't find anything. Gray tried to build a database app but he refused. What has worked is to simply put everything in one database - You search on the text in the annotations. Google does not organize the internet as a collection of folders.
There is a timeline view, showing thumbnails - also parallel screens of personal and work timelines.
They don't do the face recognition - just the text inversion and searching. They OCR the papers. They can do the search and render the result fast – only thousands of things retrieved, usually.
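A toy version of "text inversion and searching" over one big store of annotations; the annotation text and item ids are made up, and no MyLifeBits internals are implied.

    # Sketch: a tiny inverted index over annotation text.
    from collections import defaultdict

    annotations = {                      # hypothetical item id -> annotation text
        17: "lunch with Gordon at the lab, discussed MyLifeBits demo",
        42: "scanned receipt, new disk drive",
        99: "photo from Asilomar meeting, database futures discussion",
    }

    index = defaultdict(set)
    for item_id, text in annotations.items():
        for word in text.lower().replace(",", " ").split():
            index[word].add(item_id)

    def search(query):
        """Items whose annotation contains every query word."""
        sets = [index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

    print(search("asilomar database"))   # {99}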
AskMSR - automatic question answering. mines the web
'when was Abraham Lincoln shot' ok
'what color is grass' ok
'why is sky blue' ok
'what is the meaning of life' ok
'why did aliens first land on earth' (they crash landed)
'where is Osama bin Laden'
(in the mountains of
Text retrieval is doing a very good job.
images, scientific data - not doing well. SQL is our strategy for wandering through scientific data.
Weikum - Wouldn't XML be even better - e.g. to tell Gordon Bell apart from a ringing bell.
Gray - The text people are doing this - and the image people are starting to learn how to annotate.
Weikum - So we should team up with people from the other community.
Gray - We can build the plumbing to help them.
Hellerstein - Most people don't have scientific data in their files. This is well in hand - aside from questions like why don't Microsoft apps talk to Apple apps.
Bernstein - System engineering of very large components - consider annotation - We have people whose career is reading large volumes of text and turning it into something formal, one sentence at a time - this is a career - Here we are engineering something which brings together many of those components - The visualization is also part of this - Most of the folks I talk to are enthusiastic about seeing technology applied - The machine learning guys won't stop to work on schema matching or helping with annotation. We are the systems engineers for anything that involves information management.
Gray - We built a database photostore and gave it to the graphics guys and they did nothing - Once we built this they saw way to use face recognition.
Stonebraker - if you enlarge the sandbox a little bit - my wife breaks her pick on financial records across Schwab, bank, limited partnerships, Quicken, etc. - We are not overwhelmed by Gordon's problem but by personal financial management.
Hellerstein - the elderly need this more, including in particular their health records.
Kersten - organic databases to support an ambient world
disappearing DBMS - picture of IBM old tape drives
now we see that there are small gadgets not big iron
He got a call from Philips - planning more gadgets - dream of an ambient world - The gadgets are all hidden - the TV is in the wall and the controls are in the basement. But all the remote controls now have knowledge - how do we do the data management? We need appliances with knowledge. We now have light fixtures that know the date and ambient light. We have a bathroom mirror that will display stock prices for me and a cartoon for a kid.
The database system is gone
data management can be left to the individual applications
no need for scaled-down SQL DBMS
data management doesn't need a DBMS.
Phase 1
DBMS is hidden in the wall. big server in basement. no good – Philips does not sell PCs, it sells toothbrushes, TVs, etc.
Phase 2
every product has its own data. what sensors does it need? how does it communicate?
how do you get multi-year backward compatibility?
characteristics needed
self-descriptiveness: outsider can access, interpret, re-use the schema
resource requirements are explicitly stated
software version trail is available to let you time travel
the code base of the manager is part of the store itself
Lesk - does Philips really want to let other people access their schemas?
Martin - yes -
self-organizing
can split into subsystems with minimal synchronization requirements
systems can fuse easily - user resolves conflicts
roll-forward over schema updates, storage optimization
database can migrate or move to another location.
self-repair
runnable system can be obtained on a new platform with minimal bootstrap
new toothbrush picks up data from old one or somewhere
software set up so that a bug can be "fixed" by locking out part of the code without losing all the rest of the functionality
replicated storage/indexing to recover from failure
manage a trail of database versions
self-aware
security aware - authenticate environment (toothbrush recognizes your fingerprint on the button)
location aware
time-aware. should be able to manually back up in time.
grand challenge for the 21st century
an organic database management system which can be embedded in a wide collection of hardware and is autonomous, self-descriptive, self-organizing, self-repairing, self-aware, and provides stable data storage and recall functionality.
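A toy illustration of the self-descriptiveness requirement above: every stored object carries its own schema and a version trail so an outsider (or a future device) can interpret it without the original application. The format and field names are invented; nothing Philips-specific is implied.

    # Sketch: a self-describing record a gadget might persist.
    import json

    record = {
        "schema": {"fields": {"ts": "iso8601", "lux": "int"},
                   "owner": "lightfixture",
                   "version_trail": ["1.0", "1.1"]},
        "data": {"ts": "2003-05-04T21:15:00", "lux": 120},
    }

    blob = json.dumps(record)            # what would actually be stored

    def read(blob):
        rec = json.loads(blob)
        # An outsider can discover the fields and their types before using them.
        for name, ftype in rec["schema"]["fields"].items():
            print(f"{name} ({ftype}) = {rec['data'][name]}")

    read(blob)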
Garcia Molina: Sony, Toshiba, etc can't agree on DVD-R standard. Will they agree to this?
Martin: Philips is the only company which has to agree to this. There may have to be some bargains with Sony but they are optional. Now thinking about sensors in watches which measure blood pressure and temperature which go to a device that figures out whether you are stressed and changes the music on your radio.
Stonebraker - it is not fair to underestimate the heterogeneous human integration problem - my thermostat and home security system already are these things - just doing it for Philips will not work.
Martin: but it points where to go.
Hellerstein: This was great - we are missing from Ubicomp and that's a shame.
Bernstein - There are areas that have come and gone - heterogeneous databases looks like a career that will still be here in 20-30 years.
Ullman - some years ago I was on an ISAT study that sounded like this - was a bunch of AI people who wanted distributed active objects – The study went nowhere but they did not see it as a database problem.
Martin - The Philips problem is not all DB. But we have a place – supporting the data needed for people.
Franklin - This is happening even more in cars.
Avi - The networking guys want an IP address for every device.
Schek – This is a very important area - looking into applications of the hyperdatabase in the medical environment. The medical people are already here. We have to support them.
Weikum - Not such a big problem of data integration - paradigm is e-services sending messages to devices - they have to share some data but not that much.
auto tuning
dynamic allocation of existing resources
integrate with failure reaction
low variance in response time
advisors for need/impact of new resources
single/multiple databases
two computer centers, a few blade computers in each
in recent years computers have gotten cheaper by a factor of 10, so people now minimize the number of computer centers - two is the right number.
most engineering -
all the vendors are working on this - keeping some knobs, but their use is optional
Bernstein: - Look at how tree balancing went away when B-trees were invented - the knobs just vanished.
auto tuning
what to do with a bad query/program - reports are not the solution
80% case -- how do we know that we are at 80% of the benefit for 20% of the effort? maybe I am only at 0.1% of optimum.
tradeoffs
which application/task/user is most important
business value of application - can't do this automatically
rules and regulation
do we have high level abstractions for these?
selecting functionality
which features? which preferences? how much security?
need to link to the tradeoff policies.
discovery
low level discovery - broken disks, new disks, blades, etc.
high level discovery - semantic transformations
Stonebraker - Why is this so slow to arrive? Vendors seem slow.
Gray - Simplicity is a feature, and it fights other features.
Dieter - Of the illities - reliability, security etc. - simplicity is not the top priority. But now it is getting attention.
Weikum - Workloads are more dynamic now.
Garcia Molina: We need a benchmark - "no knobs 10" means a ten year old that can install the database system.
Gray: Patterson has a group measuring how long it takes to repair a failure injected into a RAID system. they measure % success and time.
Hellerstein - They did the same thing for databases - those were pretty good at surviving failures.
Bernstein: - one problem with making systems simple enough to be auto-tunable is that the systems are so tunable - even the experts don't understand the implications of new features. It's not just a matter of good tuning algorithms but of a way to model the system well. Engineers must understand the tuning implications of new features they add. We don't model systems well.
Gawlick - I agree. we don't put in any feature that breaks reliability or security. What are the impacts of things like streaming? We didn't know well enough what this would mean.
Gray - Cost of managing computers vastly exceeds the cost of computers. In the MSN area the staff is reluctant to manage a database; they understand files, they don't understand databases. We're struggling with making databases as simple as files - it requires a new model.
Lesk: there are two choices. My first color screen had 27 adjustment knobs on it - the vendors learned how to do self-adjust. My unix systems used to give me a choice of file block size. That went away because the performance hit is now acceptable. You're only looking at the first answer, not the second (ignore the choice).
Haas - We are now doing that. Watch and our knobs will go away.
Bernstein - $4/GB/month to do storage device management, and this is 5 times cheaper than what outside providers charge. Much more than the disk costs.
Gawlick - Recovery, security, and drive down cost.
Stonebraker - Jim Gray had Tandem numbers 10-12 years ago and 80-90% of crashes were operator error - goal is get rid of operator - you shouldn't leave the knobs in. Customers would rather have reliability than to allow computer jockeys to turn knobs.
Gawlick - That was a different time. All of this stuff is gone. You can no longer delete a table and be unable to get it back.
Gray - There is an operations phase and an administrative phase. We can automate operations, but not administration - we need wizards and profiles.
Ceri - Goal-oriented tuning?
Gawlick - Yes. Set an objective: response time not more than X seconds. Set a policy; something else tunes.
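A minimal sketch of goal-oriented tuning as a feedback loop: fix a response-time objective and let the system adjust a single knob until the objective is met. The knob and the stand-in workload model are invented; no vendor's mechanism is implied.

    # Sketch: feedback-loop tuning toward a response-time objective.
    def tune(measure_response_time, target_s=1.0, knob=64, steps=10):
        """Grow/shrink one knob (e.g. MB of cache) until the goal is met."""
        for _ in range(steps):
            rt = measure_response_time(knob)
            if rt <= target_s * 0.9:
                knob = max(1, knob // 2)     # comfortably met: give back resources
            elif rt > target_s:
                knob *= 2                    # goal missed: add resources
            else:
                break                        # within the objective band
        return knob

    # Stand-in workload model: response time shrinks as the knob grows.
    print(tune(lambda k: 256 / k, target_s=1.0, knob=64))   # settles at 256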
whose privacy? whose security?
individuals, organization, government, or society?
traditionally
access control
views (need to know)
roles, not people
but now add:
serious adversaries: MIT students bought used disks and used shareware disk recovery - one was from an ATM machine, another from a pharmacy
long timescales - in 25 years will your data be private? will you still remember your password?
scale - lots of people, with rights and access, many info gatherers
cross-source data integration: 1+1 can be >>2. even if Census has rules and CDC has rules, together they may not be strong enough
people care a lot, more in recent years
issues
managing data use
trust relationships
transparency
incentives
mechanisms
goals, metrics
primary & secondary use
Prozac fiasco - you got reminders to tell you to take your drug - then the marketing folks sent out a mailing about a new offer - users felt their privacy had been breached
traffic light cameras for red-light runners detecting speeders - raised a problem.
specification of purpose for data, and how enforce?
trust & relationships
two sorts of trust
policy adherence trust - enforced or audited
relationship trust - may only be loosely related to policies
changes in relationships are a problem - merger/acquisition
transparency
of use
is the policy crisp and comprehensible
of disclosure - do you know what you gave up? is the info on the magstripe on your drivers license the same as what is printed on the front?
of extraction - how do I know what is taken? e.g. swiping a card may prove you are >21 but it can time and location stamp you. how do I know that the voting booth did the right thing?
of data destruction - can you promise this? some people said it's just too hard to ensure bits are gone
incentives
Economic - may make sense, graduated rather than boolean
shopping cards - claim people don't care.
privacy is not fungible - my privacy is worth more to me than it is to you
costs of privacy
dollar costs - claim black market value of identity is $60/person.
frictional costs to business
cost vs. usability: people in human rights work in foreign countries, whose lives may be in danger, often don't have encryption tools
mechanisms
authorization vs. accountability
enforcement in the computing science sense vs. law enforcement (if somebody does this you catch them)
accountability - catching the bad guys - claimed to scale better
graceful degradation
should you avoid a single point of failure that leaks all your data forever
would you prefer loss to leakage
human factor - biggest hole is human
human leaks
key management
long timescales
goals and metrics
store my data forever
not necessarily - as long as I want it and no longer
enforce my policy forever -
well, if I’ve been in a car accident perhaps my medical records should be available
ease of use - but how measure
problem statements here tricky
chart: target user (individual - org - gov - soc) vs. approach (legislation, incentives, enforcement mechanisms).
we haven't done a lot of work about economic incentives.
Stonebraker - Identity theft is an unbelievable problem - I was a victim - your life is changed forever - Huge hassle - If there is less privacy identity theft will be harder - There are social consequences of privacy which are not a good idea. Right now if every agency that granted credit had my fingerprints identity theft would be hard, but they don't. This kind of thing is resisted on privacy grounds but commerce implications are serious and fall back to individuals.
Hellerstein - Identifiers are one piece of the problem.
Lesk: contrast between laws in
Hellerstein - It is long and boring.
Weikum -
Bernstein: - large institutions face problem of knowing what they have and what they are allowed to do with it. Big metadata problem.
Hellerstein - we have this thing called P3P but it's complicated.
Avi - some people say there is no privacy so why worry.
Hellerstein - book "transparent society" - if everyone knew everything we'd be better off, like a small town.
Stonebraker - your phone records are owned by the phone company - no law prevents them from doing whatever they want with them.
Avi - there must be - I can't get access even when I worked for AT&T. What about
Stonebraker - Right. Utility bills belong to gas company. None of this is covered by restrictions.
Pazzani - phone companies can not use records of who you called to market things to you.
Hellerstein - Rakesh filled three days with this. My panel was targeted at: what are the research topics?
Ceri - how can I change the policy? or get an exception.
Croft - CSTB has produced a couple of reports about these issues - the mechanisms are around. What should the DB chapter say? Most of these issues are what CS people talk about in general.
Hellerstein - We have declarative interfaces - queries - I don't know how the problems change when you secure objects, not collections. For example identifying individuals in aggregation queries. we own some of the most important systems.
Garcia Molina - do you have a report from Rakesh's workshop?
Agrawal - Google “Almaden Institute 2003” - that will be the website – most talks are online. First day was policy makers - then people from industry including Kaiser, State Farm, MSN, and then we had somebody from Newsweek. Larry Lessig gave an evening talk. Thus getting requirements. 2nd day was technology day - Diffie, Schneier - current snapshot of technology - what is happening in data systems, lessons to learn; Christos talked. 3rd day had 3 workgroups that tried to summarize: (1) requirements, (2) technology, (3) what are the Hilbert problems in this space? It is also all on video. Perhaps we can give out DVDs.
Gray - The digital rights management issue drives policy of how you can use info; there is a huge economic incentive to control this - it dovetails with this - They have clear ideas about what kind of policy they want - They want very fine control. If they get their way there will be a mechanism to control your information too.
Hellerstein - This is about copying - one dimension - there are other problems.
Rakesh - at WWW conference same discussion - entertainment industry would prefer not to let people make computers.
Gray - Disappearing ink company - you can only read your mail at their website. Genre here is kill off copies. Revoke keys. You can not do this transitively.
Hellerstein - record companies don't get this - they focus on initial digital version.
Gray - watermarking technology.
Lesk - Make things easy to use - social engineering
Garcia-Molina - I wish people knew more about me so they didn't send me irrelevant spam.
Hellerstein - Are there mechanisms to put in the report - points in problem space.
Agrawal - This is the IBM-centric viewpoint - absolute privacy has too high a cost - makes our lives harder - Some people only use cash - that fraction is coming down. About 20%-25% of people don't care. A large fraction are interested - care a bit but are pragmatic - e.g. personalizing email is fine but that means I have to give out some information. IBM wants to design for this pragmatic majority. The purpose is also not static -- this is a declarative specification problem we have to be good at. Query modification problem. The way it works is that a policy is stored in the database as metadata. Each query consults this metadata as well as user preferences. Authorization is done through query rewriting rather than access control. What to do if the disk can be hijacked? We can encrypt it but then range queries run very slowly. Some negative results - you can't do arithmetic and comparison with safe encryption. People focus on trying for encryption schemes that allow arithmetic but they missed allowing aggregation, which can allow indexing over encrypted data. Notion of retention - do we keep the data forever? Or get rid of things when we don't need them. There are a whole lot of interesting problems - how to do compliance. How do you give the user faith that you're following the policy? What if somebody files a lawsuit, how do you prove you followed the policy? What if the data are exported? If we can push on the data side we might help people like Larry Lessig coming from the legal side. (His vision paper was in VLDB last year.)
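A toy sketch of the authorization-by-query-rewriting idea just described: the policy lives in metadata and a purpose-dependent predicate is appended to each query before execution. Table, column, and purpose names here are hypothetical.

    # Sketch: purpose-based query rewriting driven by policy metadata.
    policy = {  # (table, purpose) -> predicate enforcing consent/policy
        ("patients", "treatment"): "consent_treatment = 1",
        ("patients", "marketing"): "consent_marketing = 1",
    }

    def rewrite(sql, table, purpose):
        pred = policy.get((table, purpose))
        if pred is None:
            raise PermissionError(f"no policy allows purpose '{purpose}' on {table}")
        glue = " AND " if " where " in sql.lower() else " WHERE "
        return sql + glue + pred

    q = "SELECT name, phone FROM patients WHERE zip = '02139'"
    print(rewrite(q, "patients", "marketing"))
    # -> ... WHERE zip = '02139' AND consent_marketing = 1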
Bernstein - see Lessig's books - they show how the definition of software defines the possibilities in terms of the legalities that can be enforced or expressed - programmers are making law when they set up a service or a database.
Garcia Molina -yes.
Ullman - Some queries we can't do in encrypted form. I think there is more hope than you imagine. Suppose you appear in different databases as J. Hellerstein or Joe Hellerstein or Joey Hellerstein - in plain text we can write a little routine that folds these - perhaps locality sensitive hashing will enable some of this - but you can get some leverage in the encrypted space. Not as secure but there may be a tradeoff.
Hellerstein - Locality-sensitive hashing is dangerous - you can insert things and see where they fall and zero in.
Agrawal - if you have order-preserving encryption you're leaking info. We need to describe: what is the security model?
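A minimal sketch of the plain-text name folding Ullman describes: collapse variant spellings to a common key before matching across databases. The nickname table and key choice are invented; real entity resolution is much harder.

    # Sketch: fold name variants to one matching key (illustrative only).
    NICKNAMES = {"joe": "joseph", "joey": "joseph", "j": "joseph"}  # assumed map

    def fold(name):
        parts = name.lower().replace(".", "").split()
        first, last = parts[0], parts[-1]
        first = NICKNAMES.get(first, first)
        return (first[0], last)          # e.g. ('j', 'hellerstein')

    names = ["J. Hellerstein", "Joe Hellerstein", "Joey Hellerstein"]
    print({fold(n) for n in names})      # one key -> all three fold together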
Gray - Use Palladium - it is in the clear in the CPU but not in the memory or on the disk. The CPU sees the world as if it were clear but the cache lines are encrypted.
Agrawal - IBM mainframes support encryption and the DB people have not been able to figure out how to use it.
Stonebraker - we could sketch the report - use lunch to refine this. Regroup after lunch.
Context:
we have to write one
we can't say "we got it right in 1997" (or 98)
proposal
make up a 50 year challenge with a 5-10 year milestone
"highest pole in the tent"
tone
information management focus
networks and cpus are unimportant tools
reasonable tents
deep web
personal dbms
query the internet
example
"integrate the deep web"
structured data on the web
your home security system and your toothbrush are on the web
new data model and query language for text, streams, structured data, and triggers.
heterogeneous federated DB
a million sites
semi-automatic wrappers
"finding" problem - who has this?
dynamic federation - Martin talked about sites coming and going
schema choosing - Alon wants to add things like picking my friend's schema
self repair, self-adapt, new GUIs, privacy
superset of TIA problem
"tell me something interesting"
short term goals
get a benchmark/test harness
have a serious pow-wow with IR
play out the separate stream system debate
design next version of Xstuff
get somewhere on "tell me something interesting" - challenge data mining
get somewhere on wrapper technology - how build all these automatically
Hans Schek - you left out multimedia.
Stonebraker - have everyone spend lunch improving this.
Jennifer - can you switch the top level label - it doesn't have much to do with deep web. I like the bullets.
Bernstein - Get rid of the top grand challenge bullet - there are grand challenges and they have a common info mgt problem. Our problem is the common part (75%) of everyone else's grand challenge. Everyone likes this.
Gray - We should reduce this to 25% - people won't swallow 75%.
Bernstein - We're going to have a lot of grand challenges over the next 5 years in CS - we shouldn't be one in the pile but common to all.
Stonebraker - Phil & I will write report.
Pazzani - federated data bases sounds like a 15-year old term - if you mean semantic heterogeneity it will sound better. Distributed databases sounds like bigger iron - this is something different.
Stonebraker – Lets spend 40 minutes on brainstorming. I will write a complete draft and then Phil will fix it. Then we'll circulate and you can whine about it.
Bernstein - a) We need to explain the DB field - we have some core competencies in access methods and query optimization, transactions, schema management. We are driven by bigger disks and faster networks. Related technologies - machine learning, graphics, etc. - can make major contributions; we are the integrators and learn how to apply them to info management. Yannis was talking about raising the level of abstraction by putting those three components together (?what 3). b) This is the fifth such report; we need to look back at earlier reports - not just what changed, but which problems are long term and which have been solved or are under control.
DeWitt - do you want to review the past recommendations and say how this fits or what?
Bernstein - mostly thinking about the way some problems recur and some have dropped off the road map. We are uniquely positioned - many people in grand challenge mode think we're in good shape and can look back and talk about what we've learned from 20 years of this.
Marten - We should start with 98 report. Repeat at least one recommendation.
Bernstein - Previous report was information utility - that's almost a satisfactory label.
Maier - Federation then might have been n=3 and is now n=10k.
Schek - Huge change in technology - ubiquitous computing - where is the data now - we need a new grand challenge.
Marten - The hardware is disappearing.
Hellerstein - Perhaps do the history separately - might distract too much and take too much space.
Gray - Phil only wanted two paragraphs. The really new thing is concern about privacy.
Marten - It was a hot topic in the early 80s.
Gray - did not make it into our reports.
Bernstein - in 1980s we knocked off good problems one after another - replication, optimization; heterogeneous data was on the list then too. Basic transaction management has gone.
Croft - Talk about how the process can be facilitated.
Stonebraker - we need some short term milestones. The annual text exercise has been good for IR
Marten - TREC-like thing e.g. 2004 "year of privacy in databases" and have a conference.
Bernstein - Rick proposed a journal of data sets. produce challenge examples.
Maier: the thousand database corpus.
Gray - don't do 1000 - do 10 - it is hard to curate them - 1000 will diffuse focus. 10 will be better.
Ullman - we wanted to take all the CS dept databases - but there was a big privacy problem. I wonder if we could get that.
Stonebraker - every CS dept has public databases - e.g. spring schedule of classes.
Ullman - what is easy to get is all web data. if that's the entire benchmark it might not be enough. You really want some spreadsheets and files.
Gray - we want a dozen or so. don't design it here.
Croft - TREC comes from a group getting together - the top web page says "how to suggest a new track". It helps that you have funders. NIST administers it. There are 8 or 9 tracks now. (Some terminate - there may be a dozen total).
Pazzani -
Lesk - Something related to 9/11
Jeff - Query optimization for distributed heterogeneous queries.
Gray - What is the milestone?
Jeff - Hard to phrase as percentages of improvement.
Stonebraker - Test harness - want a single performance metric. Jim's benchmarks all come with a single number; the Sequoia benchmark has 20 queries and progress on them is measurable. Astronomy is like that.
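One hedged way to roll a multi-query benchmark into a single number is a geometric mean of the per-query times (shown only as an illustration; not necessarily what Sequoia or Linear Road actually specify).

    # Sketch: single benchmark metric as a geometric mean of per-query times.
    import math

    def single_metric(times_s):
        """Geometric mean of per-query elapsed times, in seconds."""
        return math.exp(sum(math.log(t) for t in times_s) / len(times_s))

    print(round(single_metric([0.5, 2.0, 8.0]), 3))   # 2.0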
Jeff - another 5 yr milestone is a new language which integrates structured, semi-structured, and text data.
Marten - Another is an open source database kernel.
Widom - What's wrong with MySQL?
Franklin - A lot of people see open source databases - If we are not involved it hurts our community.
Stonebraker - IBM owns illustra/postgres code line - and has no use for it - can you guys get PR by making this open source.
Laura - I asked that and got an answer which makes me think we don't own it.
Gray - when you publish your data you will publish in some database - we need an open source way to do that.
Maier - Why do I need the database?
Gray - What, just comma separated values? I publish Sloan stuff on a disk with procedures. I propose that if we publish databases and the methods that encapsulate the objects, you need an open source database to do that.
Hellerstein - What about just XML dumps of the data?
Gray - How do I get the procedures? There is no standard source procedure language.
Stonebraker - the wrong way to do this testbed is to publish some data sets - each of us have to donate a machine or a piece of a machine that is accessible and running something we can get at. so we need 15 volunteers.
Hellerstein - Intel will volunteer to help. PlanetLab community.
Gray - Grid guys will too.
Serge - presenting as a challenge a data model and a query language for semi-structured data doesn't seem like it is from this century - last century. We need sensors, updates, etc.
Pazzani - try to understand what's changed. Semantic heterogeneity has been around for a while but some people say they are making more progress. There is a lot more scientific data around.
Zdonik(?) - 5 year goal involving outreach to other communities.
Croft - We ship IR testbeds or toolkits - we never use a database system.
Bernstein - At least 3 versions of outreach - db technology for others, more collaboration databases plus technologies, or actual applications. The scientific database area is an expanding trend. If this report is to be read by outsiders statements that we are moving this way would help.
Franklin - in five years we'll say we missed bioinformatics.
Stonebraker - politically and technically researchers should find app areas and help them solve it. We pay lip service to that; Jim has been doing it.
Bernstein - Lots of people are doing it. Dave Maier and I were at NLM and a lot of people at Michigan and Penn are doing this - it is not just biologists learning databases
Alon - How to make us upward compatible with all the grand challenges – How make this concrete - we provide some services to all of these challenges - streaming data, federating data, etc. Define as services you can expect from our community.
Lesk - The “find Tony Blair” example - get his face & voice off the web. Show we noticed 9/11.
Ceri - pose a query - find the sources - they may change depending on context and then all these issues of caching come to play. Exporting particular expertise with optimization to the global infrastructure or Hans' hyperdatabase is measurable and will happen in 5 years.
Gray - Archiving - come up with representations that are likely to last for a century. Privacy - follow up on the template Hellerstein laid out – policies and enforcement - tools for enforcers and policymakers.
Maier - Asilomar 2103 should be able to read the slides from this meeting.
Rakesh - One major problem is given a database find what is personally identifiable. What fields matter.
DeWitt - Either as a milestone or challenge we should make progress on getting scientists to stop using filesystems. (too negative) So: get them using databases.
Marten - We need younger people at this meeting - sliding window.
Yannis - Antiprivacy direction - In 5 years we should be able to have personalized behavior from databases - you access a public database and it behaves differently.
Croft - Database courses are morphing into things from other disciplines - should we endorse this.
Jeff - I don't believe it - it's not happening except at Berkeley – Database systems were a long trek getting to where a database course was a serious piece of a CS program and I don't want to give that up.
Gawlick - secure database - 5 year challenge.
Avi - one issue we should worry about is that when we combine data bases how do we know the result is meaningful. We have no idea about the places we are getting data from - we need to do some math about the reliability of results.
Bernstein - if we believe that bringing in other information management fields is important we should endorse additional support for a broader range of topics in DB research conferences. Do people like what's happening at VLDB?
Hellerstein - VLDB has been saying really weird things but a lot of papers at SIGMOD and VLDB have little bits of statistics or fractals. People are sensitized to this. The VLDB extra tracks are e-commerce apps and that would not be in a curriculum.
DeWitt - This is the year of XML. (VLDB this year has more than 10 XML papers)
Agrawal - In five years it will be a good test to say "tell me something interesting from your data".
Stonebraker The meeting is winding down. Send your slides to Jim. I will send notes around.
Jim will have a website for the workshop.