The development of Cyc was a very long-term, high-risk gamble that has begun to pay off. Begun as a research project in 1984, Cyc is now a working technology with applications to many real-world business problems. Cyc's vast knowledge base enables it to perform well at tasks that are beyond the capabilities of other software technologies.
This page is intended to give an indication of Cyc's potential by describing some tasks to which Cyc is currently being applied and by suggesting others to which Cyc might be applied in the future.
- Applications Currently Available or in Development
- Potential Applications
- Online brokering of goods and services
- "Smart" interfaces
- Intelligent character simulation for games
- Enhanced virtual reality
- Improved machine translation
- Improved speech recognition
- Sophisticated user modelling
- Semantic data mining
You may also be interested in "Ideas for Applying Cyc", a paper co-authored by Cyc project founder Doug Lenat. The paper is now several years old but nevertheless does a good job of conveying the range of uses to which Cyc could be put. Some of the ideas have already been realized.
Current database systems are information-rich but knowledge-poor. They have a very flat structure that incorporates little or no semantic-level knowledge. A personnel database may know that "Fred Utz" is a "professor" who works for "Harvard University", but it does not know what these strings mean, what a person is, what a professor is, or what a university is. Cyc, however, does know.
To leverage Cyc's repository of everyday knowledge about the world, we give Cyc a description of the content of each database we want to use, using CycL as a universal description language for database schemas. We tell Cyc, for example, that in table XYZ, the field CITY contains text strings which can be mapped directly to instances of the Cyc term #$City, like #$NewYorkCity. This process need be performed only once for each database. Once the description of the database is in the Cyc knowledge base, that knowledge can be used in all sorts of ways.
Let's say, for example, that we have two tables. The first is a table of people with three fields: the person's name, the person's job title, and the name of the person's employer. The second is a table of employers with two fields: the employer's name and the name of the state where the employer is located.
Now imagine that we want to use these two tables to answer the following query: "Show me people who hold an advanced degree and live in New England." But wait! The tables don't say anything about degrees people hold or where they live, and they don't mention New England. But Cyc knows that doctors, lawyers, and professors hold advanced degrees, that people generally live near their place of work, and that New England consists of six specific states. So Cyc converts this query into a query for doctors, lawyers, and professors whose employer is located in one of those six states. Cyc automatically generates the necessary SQL, talks to the relevant databases, and converts the results back into CycL where appropriate. If Fred Utz is listed as a professor working at Harvard, then Utz will be found to satisfy the original query.
The only way to get the same result using existing database tools would be to formulate the SQL query by hand, leveraging your own knowledge of both the world and, more importantly, of the structure and content of the available tables. With Cyc, on the other hand, you can issue a query in plain English (using Cyc's NL capabilities), and you don't need to know anything about the tables, or even which tables are available. The process is fully automatic.
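The rewriting described above can be illustrated with a minimal Python sketch. It is not Cyc's implementation: the two constant lists stand in for knowledge that would live in the Cyc KB, the table and column names (`people`, `employers`, `job_title`, `state`) are assumptions, and every row other than Fred Utz is invented sample data.

```python
import sqlite3

# Hypothetical background knowledge, standing in for assertions in the Cyc KB.
ADVANCED_DEGREE_JOBS = ["doctor", "lawyer", "professor"]
NEW_ENGLAND_STATES = ["Connecticut", "Maine", "Massachusetts",
                      "New Hampshire", "Rhode Island", "Vermont"]

def build_sql(jobs, states):
    """Rewrite the high-level query as a join over the two tables."""
    job_marks = ", ".join("?" for _ in jobs)
    state_marks = ", ".join("?" for _ in states)
    return (
        "SELECT p.name FROM people AS p "
        "JOIN employers AS e ON p.employer = e.name "
        f"WHERE p.job_title IN ({job_marks}) AND e.state IN ({state_marks}) "
        "ORDER BY p.name"
    )

# Build the two tables from the text in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, job_title TEXT, employer TEXT)")
conn.execute("CREATE TABLE employers (name TEXT, state TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", [
    ("Fred Utz", "professor", "Harvard University"),
    ("Ann Lee", "clerk", "Harvard University"),   # invented row: no advanced degree
    ("Bo Ray", "lawyer", "Acme Legal"),           # invented row: not in New England
])
conn.executemany("INSERT INTO employers VALUES (?, ?)", [
    ("Harvard University", "Massachusetts"),
    ("Acme Legal", "Texas"),
])

sql = build_sql(ADVANCED_DEGREE_JOBS, NEW_ENGLAND_STATES)
rows = conn.execute(sql, ADVANCED_DEGREE_JOBS + NEW_ENGLAND_STATES).fetchall()
matches = [name for (name,) in rows]
print(matches)
```

Only Fred Utz satisfies both expanded conditions. In the sketch the expansion is hard-coded; the point of the approach in the text is that Cyc derives it from general knowledge rather than from query-specific code.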
Cyc's database tools handle a query in three phases: the interface phase, the planner phase, and the executor phase.
In the interface phase, the user constructs a query using an interface tool. In some versions of the database application, the interface is a simple template form which the user fills out and submits; in others, the user enters a plain English query, which is converted by Cyc's NL module into a CycL expression with free variables. So, for example, "Professors living in New England" might get converted to
(#$and
(#$isa ?x #$Professor)
(#$residesInGeographicRegion ?x #$NewEngland-USRegion))
In the planner phase, the Cyc inference engine uses assertions in the knowledge base to backchain from that expression to expressions which refer to descriptions of database columns. One such expression might be:
(#$and
(#$isa ?x #$Professor)
(#$employees ?y ?x)
(#$residenceOfOrganization ?y #$Massachusetts-State))
Next, these expressions are converted into an intermediate representation format called CSQL, which represents logical database queries at a high level (that is, without reference to the specific location or format of the underlying data).
In the executor phase, Cyc uses knowledge about the physical aspects of the underlying databases to convert the (logical) CSQL queries into (physical) SQL queries. It is at this stage, for example, that logical terms (such as #$Massachusetts-State) are translated into terms appropriate for the specific database (such as "MA" or "22" or whatever). The executor also handles various data type coercion issues between tables during joins.
When the results of the SQL query come back from the database, Cyc translates objects it knows about back into CycL (e.g., "MA" back into #$Massachusetts-State) and leaves the rest as raw data.
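The term translation performed in the executor phase, and its reversal when results come back, might be sketched as follows. This is a toy illustration under stated assumptions: the database names `hr_db` and `sales_db` and the lookup-table layout are hypothetical, and only the "MA" and "22" encodings come from the text.

```python
# Hypothetical per-database encodings of the same logical term.
PHYSICAL_CODES = {
    "hr_db": {"#$Massachusetts-State": "MA"},
    "sales_db": {"#$Massachusetts-State": "22"},
}

def to_physical(db, term):
    """Translate a logical CycL term into the code a given database uses."""
    return PHYSICAL_CODES[db][term]

def to_logical(db, value):
    """Translate a database value back into CycL when it is a known code."""
    reverse = {code: term for term, code in PHYSICAL_CODES[db].items()}
    # Values with no known mapping pass through unchanged, as raw data.
    return reverse.get(value, value)

print(to_physical("sales_db", "#$Massachusetts-State"))
print(to_logical("hr_db", "MA"))
print(to_logical("hr_db", "Fred Utz"))
```

Keeping the mapping per database is what lets the same logical query run against sources that encode states differently.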
During 1995, Cycorp developed an application along the lines described above for a major pharmaceuticals company. The application is used to integrate tens of gigabytes of data from over 60 tables. The databases used are Oracle databases, but the Cyc database technology is compatible with any other SQL database. In fact, it would be quite easy to modify the executor to communicate using ODBC or OpenDoc.
Related Applications
Cycorp has also developed two other applications which build directly on the database technology described above.
The Working Tables Application
The "Working Tables" application allows relatively unsophisticated users to construct a data warehouse quickly and easily from a set of underlying databases by capitalizing on Cyc's understanding of the extant data sources. The user can construct a set of profile queries, that is, a set of interesting attributes of various types of objects represented in the underlying data.
For example, if the set of available databases includes information on doctors, the user might construct a doctor profile format:
- Where does the doctor work?
- Where did the doctor get his/her degree?
- What drugs has the doctor been prescribing?
- Etc.
Cyc stores the user's desired profile template as a high-level CycL query, with no mention of how or where the data is stored. Thus, Cyc permits the user to quickly and automatically create a warehouse of business objects which can later be processed by the planner and the executor to generate SQL queries.
Because the "Working Tables" application leverages Cyc's knowledge of the underlying databases, the details of those databases are transparent to the user. They may contain overlapping, redundant information, and their structure and content can even be changed right under the user's feet, without forcing the user to reconstruct his or her profile queries.
The Meta-Query Browser
The "Meta-Query Browser" also leverages Cyc's understanding of the extant data sources. It allows users to browse the structure of available databases by using the Cyc ontological hierarchy.
For example, let's say the user wants to know which available databases talk about doctors and their credentials. Cyc shows the user which databases refer to those concepts, even if the reference is merely implicit. Perhaps a table is available which talks about doctors and the medical degrees they hold. Because Cyc knows that a medical degree is one type of credential, it is able to correctly identify the table as matching the user's query.
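The matching step in this example can be sketched in a few lines of Python. The hierarchy fragment, the concept names, and the table descriptions below are all hypothetical stand-ins for what the Cyc ontology and the database descriptions would provide.

```python
# Hypothetical fragment of a generalization hierarchy (child -> parent).
GENLS = {
    "MedicalDegree": "AcademicDegree",
    "AcademicDegree": "Credential",
    "Doctor": "Person",
}

# Hypothetical table descriptions: table name -> concepts its columns refer to.
TABLE_CONCEPTS = {
    "physician_degrees": {"Doctor", "MedicalDegree"},
    "office_supplies": {"Stapler"},
}

def is_a_kind_of(concept, ancestor):
    """Walk up the hierarchy from concept, looking for ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = GENLS.get(concept)
    return False

def tables_about(*wanted):
    """Return tables that mention every wanted concept or a specialization of it."""
    return sorted(
        table for table, concepts in TABLE_CONCEPTS.items()
        if all(any(is_a_kind_of(c, w) for c in concepts) for w in wanted)
    )

print(tables_about("Doctor", "Credential"))
```

The table never mentions "Credential" directly; it matches because a medical degree is reachable from it through the hierarchy, which is the implicit reference the text describes.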
Big organizations frequently possess large libraries of "opaque" information sources; that is, data which do not lend themselves to traditional text-based searching methods. A news agency may possess a library of thousands of news photos; a movie studio thousands of film clips; a software help desk thousands of text articles too unwieldy to index directly. When such libraries must be searched, a common solution is to attach to each item a short text caption describing its contents. Thus a news photo of a soldier holding a gun to a woman's head might be captioned "a soldier holding a gun to a woman's head", plus a few tags for time and place, and then could be retrieved by querying for "soldier" or "gun".
This solution, while certainly adequate, is far from ideal. It would be nice if the photo could also be retrieved by queries for "someone in danger", or "a frightened person", or "a man threatening a woman". Such an achievement, however, lies far beyond the abilities of even the most sophisticated of traditional text-searching tools, all of which are fundamentally based on simple string matching and synonyms. Most search tools lack the ability to handle natural-language queries, and even those that do have some NL capability lack the background of commonsense knowledge required to make a connection between having a gun to one's head and feeling frightened.
Cyc is not crippled by such a liability. Cyc knows that guns shoot bullets and are designed to kill people; that having a gun to one's head therefore threatens one's life; that those whose lives are threatened feel fear; and that the vast majority of soldiers are men. Cyc can therefore conclude that the image in question is, in all likelihood, a good match for each of the queries above.
A major focus of the Cyc team's efforts in 1994 was the creation of an image-retrieval application for a major corporate partner which possessed a library of hundreds of thousands of captioned images, but no fully satisfactory way to search them. Building on the Cyc inference engine and Cyc's NL capabilities, we developed a system along the lines described above.
First, the content of each image in the library is described to Cyc by converting the English captions to CycL and adding the resulting formulas to the knowledge base. Cyc's NL tools permit the English-to-CycL translation to be mostly automatic; human intervention is required only in unusual cases.
Once the target images have been described to Cyc, the system is ready to accept queries, which can be issued in plain English. Cyc begins by converting the English queries to CycL, again using Cyc's NL tools. For example, the English query "a frightened person" might be parsed as:
(#$and
(#$isa ?x #$Person)
(#$feelsEmotion ?x #$Fear #$High))
After asking the user to confirm its parse (or, occasionally, to choose one of two or more equally valid parses), Cyc begins to backchain from the query expression, using the image descriptions and other knowledge in the KB. When it is able to unify all the free variables with the elements of a picture (in our example, ?x would unify with the woman in the picture), Cyc knows it has found a match.
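To make the matching idea concrete, here is a toy Python sketch. Note that it derives consequences by forward chaining over the caption facts, which is simpler to show than the backchaining the text describes; the predicate names, the two rules, and the individual names `soldier1` and `woman1` are all hypothetical.

```python
# Caption facts for one hypothetical image, as (predicate, subject, object) triples.
CAPTION = {
    ("isa", "soldier1", "Person"),
    ("isa", "woman1", "Person"),
    ("holdsGunToHeadOf", "soldier1", "woman1"),
}

def infer(facts):
    """Apply two toy commonsense rules until nothing new is derived."""
    facts = set(facts)
    while True:
        new = set()
        for pred, subj, obj in facts:
            # Holding a gun to someone's head threatens that person's life.
            if pred == "holdsGunToHeadOf":
                new.add(("threatensLifeOf", subj, obj))
            # Someone whose life is threatened feels fear.
            if pred == "threatensLifeOf":
                new.add(("feelsEmotion", obj, "Fear"))
        if new <= facts:
            return facts
        facts |= new

def frightened_people(facts):
    """Find every ?x that is a person and feels fear."""
    derived = infer(facts)
    return sorted(
        x for (pred, x, kind) in derived
        if pred == "isa" and kind == "Person"
        and ("feelsEmotion", x, "Fear") in derived
    )

print(frightened_people(CAPTION))
```

The caption says nothing about fear, yet the query variable binds to the woman in the picture once the two rules have been applied.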
Cycorp has generalized this approach to image retrieval to extend it to other opaque-information-source retrieval applications. For example, document retrieval from large libraries of text documents described by short abstracts (analogous to image captions), is a task nearly identical in structure to that of retrieving captioned images. In fact, any database of captions, summaries, or abstracts could be handled similarly, whether the corresponding library contained images, sounds, video, text, or anything else.
GIST allows users to import and simultaneously manage and integrate multiple industry-specific and other thesauri. This tool is the most advanced multi-thesaurus manager available; it allows any combination of thesauri to be "active", it has machine-guided "conceptual merging" of thesaurus terms based on the Cyc ontology, and it is designed to work with various existing thesaurus tools.
Within the next few years, Cyc may be installed at sites where even thousands of users need to harness its power simultaneously. An important question is: what should the architecture of such an installation look like?
While one Cyc server can easily support a dozen clients at the same time (we do it in Austin all the time), handling operations from thousands of clients would bring a Cyc server to its knees. How do we address this scalability issue?
The obvious alternative is an architecture in which each user (or workgroup) runs its own local Cyc server, with its own copy of the Cyc knowledge base. (This is the arrangement we currently have in our Austin office, which is home to about a dozen Cyc servers.) But this scenario does little to alleviate the scalability problem. Now, each user is saddled with the responsibility of hosting a knowledge base which already contains nearly half a million assertions, and could easily grow into the millions as more and more specialized knowledge is added. Moreover, each user must tolerate the computational and communicational overhead associated with keeping each of the thousands of copies of Cyc in sync with each other.
A great deal of the redundancy and inefficiency of such an architecture can be eliminated by adapting Cyc to work in a distributed fashion. In fact, this is how humans work: we specialize and collaborate. Although most humans share a core of common knowledge which allows them to communicate, no one human possesses all the knowledge of the human race. Instead, we have experts in law, medicine, construction, automotive repair, and software design. When a human lacks the knowledge to solve a problem, he can often find a solution by collaborating with an expert. When I go to the doctor with a persistent headache, the doctor's knowledge of medicine combines with my knowledge of my symptoms to produce a solution that neither of us could have reached alone.
In a distributed Cyc architecture, the network is populated with Cyc agents, each of which shares a common core of knowledge, but possesses in addition one or more additions to the core knowledge base that extend its expertise into new domains. Most importantly, the various Cyc agents are endowed with the ability to communicate with each other and to perform inferencing in a collaborative fashion. The inter-agent communication can be handled flexibly by using KQML or some other knowledge-sharing protocol, or it can be implemented using a more efficient Cyc-specific protocol.
Cyc agents in a distributed architecture share knowledge describing how they can be reached (via which network address, port, and protocol) and what their areas of expertise are. During inferencing, when an agent tries to expand a formula whose content lies outside its area of expertise, it determines (by consulting its own knowledge base or consulting with a central knowledge-broker agent) whether another agent is available that might be able to help. If so, the agent sends a message to its remote counterpart asking it to help expand the formula in question. The answer may be a complete set of bindings, but more commonly it is simply a partial result, still containing unbound variables. The local agent incorporates the result into its own proof tree just as if it had done the work itself, and continues its task. In many cases, the local agent may need to consult the remote agent, or several remote agents, multiple times in the course of its inferencing process.
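The delegation step can be sketched as below. This is a deliberately simplified illustration, not the actual protocol: the `Agent` class and its methods are hypothetical, a plain list plays the role of the knowledge-broker, and answers are always complete bindings rather than the partial results the text mentions.

```python
class Agent:
    """A toy agent that advertises which predicates it can answer."""

    def __init__(self, name, expertise, facts):
        self.name = name
        self.expertise = set(expertise)   # predicates this agent can answer
        self.facts = facts                # predicate -> list of (arg1, arg2)

    def ask(self, predicate, registry):
        """Answer locally if possible, otherwise delegate to an expert."""
        if predicate in self.expertise:
            return self.facts.get(predicate, [])
        for other in registry:
            if other is not self and predicate in other.expertise:
                return other.ask(predicate, registry)
        return []

geo = Agent("GeoAgent", {"northOf"},
            {"northOf": [("Britain", "Equator")]})
pol = Agent("PolAgent", {"headOfGovernmentOf"},
            {"headOfGovernmentOf": [("John Major", "Britain")]})
registry = [geo, pol]   # stands in for the shared knowledge about agents

print(geo.ask("headOfGovernmentOf", registry))
print(geo.ask("northOf", registry))
```

A fuller version would also carry each agent's network address, port, and protocol in the registry, as the text describes.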
Cycorp has cooperated with the Computer Science Department of the University of Maryland, Baltimore County (UMBC) to develop a demo of such a distributed architecture. In the demo, three Cyc agents communicate with each other using KQML. While all three agents possess the same core knowledge base, each possesses additional knowledge about an additional domain in which it is considered to be an expert: the GeoAgent in geography, the PolAgent in politics, and the EcoAgent in economics. Working together, they can answer queries that no one of them could have answered alone.
For example, suppose a user asks the GeoAgent for "elected heads of government of countries north of the equator". This might be represented as:
(#$and
(#$headOfGovernmentOf ?x ?y)
(#$hasAttributes ?x #$Elected)
(#$northOf ?y #$Equator))
The GeoAgent is able to find bindings for the third clause by using its own knowledge of the geography domain:
- Britain is in Europe.
- Europe is in the northern hemisphere.
- The northern hemisphere is north of the equator.
- If region A is part of region B, and region B is north of region C, then region A is north of region C.
- Therefore, Britain is north of the equator.
But to find bindings for the first two clauses, the GeoAgent must enlist outside help. It sends these as queries to the PolAgent, which is able to find bindings for them using its knowledge of the politics domain:
- Heads of government of democratic countries are elected.
- Great Britain is a democratic country.
- John Major is the head of government of Britain.
- Therefore, John Major is the elected head of government of Great Britain.
When the GeoAgent receives this answer back from the PolAgent, it combines it with its own partial answer to produce the final result: John Major.
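The two chains of reasoning and their combination can be traced in a small Python sketch. The data structures and function names are hypothetical, the agents are reduced to plain functions in one process, and "Southland" and "Pat Example" are invented entries added only so that the geography check has something to filter out.

```python
# GeoAgent's hypothetical geography facts: region -> enclosing region.
PART_OF = {
    "Britain": "Europe",
    "Europe": "NorthernHemisphere",
    "Southland": "SouthernHemisphere",   # invented region
}
NORTH_OF_EQUATOR = {"NorthernHemisphere"}

# PolAgent's hypothetical politics facts.
HEAD_OF_GOVERNMENT = {"Britain": "John Major", "Southland": "Pat Example"}
DEMOCRATIC = {"Britain", "Southland"}

def geo_north_of_equator(region):
    """A region is north of the equator if it, or a region containing it, is."""
    while region is not None:
        if region in NORTH_OF_EQUATOR:
            return True
        region = PART_OF.get(region)
    return False

def pol_elected_heads():
    """Heads of government of democratic countries are elected."""
    return {(leader, country)
            for country, leader in HEAD_OF_GOVERNMENT.items()
            if country in DEMOCRATIC}

def geo_answer():
    """GeoAgent joins PolAgent's bindings with its own geography check."""
    return sorted(leader for leader, country in pol_elected_heads()
                  if geo_north_of_equator(country))

print(geo_answer())
```

Neither function alone can answer the query: the politics side has no notion of the equator, and the geography side has no notion of elections.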
An important research focus of the Cycorp/UMBC project has been optimizing the implementation of collaborative inferencing so that the costs of managing the necessary communications overhead are significantly outweighed by the increase in scope and efficiency which results from sharing the inferencing workload. We have been sufficiently pleased with the results that we plan to convert the network in our Austin office to a distributed architecture during the coming year.
(For more details on the Cycorp/UMBC demo of the distributed Cyc architecture, see "The Cycic Friends Network: getting Cyc agents to reason together", a paper describing the project which was presented at the 1995 CIKM conference. Also see The Cyc KQML Project, a page at UMBC describing our work.)
It should also be pointed out that, while the description above assumes that all the agents participating in the distributed architecture are Cyc agents, this is not a requirement. Cycorp has defined a very simple protocol for implementing cooperative inferencing, and any agent which has been taught to adhere to this protocol, and which possesses knowledge of interest to other agents, can meaningfully participate in such an architecture. For instance, the role normally played by a Cyc agent in a distributed architecture could equally well be filled by a gateway to
- an expert system implemented in Prolog
- an SQL database
- a special-purpose inferencing tool (e.g. for spatial inferencing)
- a human expert
- etc.
In fact, the WWW Information Retrieval application, described next, builds on the distributed Cyc architecture by filling that role with a gateway to a WWW information source.
The explosion of the World Wide Web during the last two years has created a tremendous opportunity. The WWW is home to a vast quantity of information, much of which could, in principle, be used to make Cyc more intelligent, while shortcutting the laborious process of manual knowledge entry.
This could happen in one of two ways: either online information could be extracted, converted to CycL, and incorporated directly into the knowledge base, or Cyc could be taught to treat external information sources as extensions of the KB, without directly incorporating their contents.
Cycorp explored the first approach during 1996. Currently Cyc is nearing the critical mass required for the reading and assimilation of online texts (news stories, encyclopedia articles, etc.). In this scenario, Cyc's natural language tools are used to process online texts, converting them to CycL for inclusion in the knowledge base.
Cycorp is also pursuing the second approach. In this scenario, we write gateways which disguise WWW information sources as Cyc agents (with a very limited domain of expertise), available to operate in a distributed Cyc architecture. There is a large and ever-growing number of information sources on the WWW which might fill this role. All, however, share the following characteristics:
- They embody a large corpus of knowledge,
- they can respond to HTTP queries for specific knowledge, and
- they present their knowledge in an HTML format which, while generally not as regularly structured as a database, is sufficiently structured that it can be parsed by a relatively simple algorithm.
An example is the Internet Movie Database, a truly stupendous compendium of movie knowledge. A quick browse of the IMD demonstrates that it embodies virtually everything there is to know about movies; that it can respond to queries for particular actors, movies, etc.; and that it displays the results in a format, which, while it varies somewhat from one actor, movie, etc. to another, depending on what information is available, is nevertheless fairly regular.
We can effectively annex the contents of the IMD to the Cyc KB by creating a gateway which, on the one hand, interacts with Cyc agents exactly as a Cyc agent operating in a distributed Cyc architecture would, and, on the other hand, simulates the interaction of a WWW browser with the IMD HTTP server.
This gateway is advertised to Cyc as an expert in the movie domain, so that whenever Cyc receives a query in that domain, it turns to the gateway for assistance. For example, let's say a user asks Cyc, "What movies did Ronald Reagan act in?". This might be represented in CycL as:
(#$actedInMovie #$RonaldReagan ?x)
Cyc hands this CycL query to the gateway, which understands enough about movie-related CycL vocabulary to translate this query into an HTTP request to the IMD server for a page on Ronald Reagan. When the IMD server returns the page, the gateway parses the HTML, extracts a list of the movies in which Reagan appeared, converts the result to CycL, and then constructs a suitable reply to the Cyc agent making the request. (For reasons not worth explaining, the reply contains more than just the answer.)
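The parsing half of such a gateway might look like the following Python sketch. The page markup in `SAMPLE_PAGE` is an assumption (the text does not give the server's actual HTML layout), the HTTP request itself is omitted, and the function and class names are hypothetical. Representing each movie as a quoted string in the resulting CycL is also a simplification.

```python
from html.parser import HTMLParser

# Hypothetical page markup; the real server's HTML layout is not given in the text.
SAMPLE_PAGE = """
<h1>Ronald Reagan</h1>
<ul>
  <li>Kings Row</li>
  <li>Bedtime for Bonzo</li>
</ul>
"""

class FilmListParser(HTMLParser):
    """Collect the text of every list item on the page."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.titles.append(data.strip())

def answer_acted_in_movie(actor_term, page_html):
    """Turn the parsed film list into one CycL-style formula per movie."""
    parser = FilmListParser()
    parser.feed(page_html)
    return [f'(#$actedInMovie {actor_term} "{title}")'
            for title in parser.titles]

for formula in answer_acted_in_movie("#$RonaldReagan", SAMPLE_PAGE):
    print(formula)
```

Because the parser depends only on the page being "sufficiently structured", a change in the site's layout would require updating the gateway but nothing in the KB.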
To the user interacting with Cyc, this transaction is entirely transparent. It appears to the user as if Cyc now knows everything there is to know about movies, and yet the KB still fits on the hard disk!
Enhancing Cyc's cinematic erudition may be a fairly frivolous application of the techniques described above, but the WWW contains plenty of large sources of semi-structured information on weightier topics (stock quotes, company profiles, WWW indexes, resumes, the CIA World Factbook, etc.). The ability to effectively incorporate vast portions of the WWW into a "virtual" knowledge base is a compelling possibility. Not only would it greatly expand the effective scope of Cyc's knowledge, but it would do so at little cost to the Cyc development team.