At True Knowledge we're building a Universal Answer Engine: a computer system designed to answer users' questions on any subject, directly and automatically.
In the process, we've learned a great deal about what it takes to build such a system, and in this blog post I’m going to try to distil this knowledge into Ten Principles, all of which, we suggest, are vital for success.
Principle 1. Knowledge not Code
All the world's information systems are developed in much the same way: the designers of the system decide what knowledge the system needs to cover, what kinds of queries it will respond to and what the responses to those queries will be.
Database schemas are then designed and database tables populated with data. Software engineers then write large quantities of computer code to read the data from these tables and present them to users in the desired format. After lots of elapsed time has gone by and lots of money has been spent, a new computer system exists which supports the types of queries it was designed for.
This method works well for individual vertical areas. Multiple verticals can be supported on the same system by doing a similar process for each vertical: multiplying up the amount of code and the number of database tables. However, this approach cannot be scaled to a truly open-domain system which has to support an unlimited number of verticals. The reason is that time and money are finite resources. The maintenance requirements of code and ever more complex database schemas also scale non-linearly.
Because of this, True Knowledge has two basic design requirements:
- All knowledge has a single structured representation that is unrelated to its meaning. The database schema is of fixed size and supporting new types of knowledge is done by adding more knowledge into this universal representation: facts about facts.
- The Knowledge Engine (the query-processing heart of our platform) is knowledge neutral: it contains no program code specific to any knowledge area.
It isn't possible to escape domain specific code entirely, however. For example, code is needed to know that the string “the 23rd of October 2007” denotes a date and to calculate square roots. However, in our system, such code is isolated from the core query processing system. It is attached in a soft way to specific inference rules (which are also soft). Our platform even supports this happening with externally supplied scripts which can be read from a database.
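To make the two design requirements more concrete, here is a minimal sketch of a fixed-shape fact store. It is not our actual schema: the names `Fact` and `FactStore`, the identifier format and the relation strings are all assumptions made for illustration. The point it tries to show is that a new kind of knowledge arrives as new rows, and that a fact can itself be the subject of another fact.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, Optional


@dataclass(frozen=True)
class Fact:
    """One row in a fixed schema: every kind of knowledge uses the same shape."""
    fact_id: str
    subject: str
    relation: str
    obj: str


class FactStore:
    """Hypothetical store: a new knowledge area adds rows, not tables or code."""

    def __init__(self) -> None:
        self._facts: dict[str, Fact] = {}
        self._ids = count(1)

    def add(self, subject: str, relation: str, obj: str) -> Fact:
        fact = Fact(f"fact:{next(self._ids)}", subject, relation, obj)
        self._facts[fact.fact_id] = fact
        return fact

    def match(self, subject: Optional[str] = None,
              relation: Optional[str] = None,
              obj: Optional[str] = None) -> Iterator[Fact]:
        """Yield the facts that agree with every field that is not None."""
        for fact in self._facts.values():
            if ((subject is None or fact.subject == subject)
                    and (relation is None or fact.relation == relation)
                    and (obj is None or fact.obj == obj)):
                yield fact


store = FactStore()
capital = store.add("paris", "is the capital of", "france")
# A fact about a fact: the subject is the identifier of the first fact.
store.add(capital.fact_id, "has source", "example-source")
# Knowledge about a relation is stored the same way as everything else.
store.add("is the capital of", "is an instance of", "relation")

about_capital = list(store.match(subject=capital.fact_id))
```

Nothing in `FactStore` knows about capitals, sources or relations; that neutrality is the property the second requirement asks of the query engine.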
Principle 2. You can’t do it without understanding
In the True Knowledge system no answer to a question is attempted unless we’ve first managed to translate it into a language-independent machine-processable query: a full semantic interpretation of the user’s intent.
Although it is possible to build question answering systems by directly matching text questions to pre-written answers, we don’t believe this approach can be made to scale. Every natural language question has huge numbers of variants (thousands in some cases), and working out that they are the same question requires semantic processing.
This step also underpins our ability to disambiguate queries and throw away interpretations which are unlikely to be what the user was intending.
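The target of this translation step can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not our translation system: it uses a handful of hypothetical regular-expression templates where a real system needs full semantic processing, and the query shape `(relation, left, right)` is invented for the example. What it does show is several surface variants collapsing to one machine-processable query, and an unrecognised question producing no query at all.

```python
import re
from typing import Optional

# Hypothetical templates: each maps one surface phrasing to the same query shape.
# The second and third elements name the regex groups used as left and right.
TEMPLATES = [
    (re.compile(r"^is (?P<a>.+) older than (?P<b>.+)$"), ("is_older_than", "a", "b")),
    (re.compile(r"^was (?P<a>.+) born before (?P<b>.+)$"), ("is_older_than", "a", "b")),
    (re.compile(r"^is (?P<b>.+) younger than (?P<a>.+)$"), ("is_older_than", "a", "b")),
]


def interpret(question: str) -> Optional[tuple[str, str, str]]:
    """Return a canonical (relation, left, right) query, or None if not understood."""
    text = question.lower().strip().rstrip("?").strip()
    for pattern, (relation, left, right) in TEMPLATES:
        found = pattern.match(text)
        if found:
            return (relation, found.group(left), found.group(right))
    return None


variants = [
    "Is Alice older than Bob?",
    "Was Alice born before Bob?",
    "Is Bob younger than Alice?",
]
queries = {interpret(v) for v in variants}
```

Here `queries` ends up holding a single query, so one answer path serves all three phrasings.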
Principle 3. You can’t do it without inference
Our experience with True Knowledge is that only a small percentage of questions can be directly answered by looking up the answer from a static source. Most questions require one or more logical steps or calculations to generate the answer the user wants. This is the case no matter how big the information source – it even applies to vast sources like the billions of web pages indexed by search engines such as Google.
For example, True Knowledge will happily answer questions like “Is Lisa Rinna older than Cindy Crawford?” and infer the answer from both their dates of birth. We currently know about more than half a million people, so for questions of exactly this form alone we would need 250 billion facts to do this without inference (one for each ordered pair of people). Now consider simple distance questions like “How far is it from Chicago to Madingley in miles?” When such a question can be asked about any pair of fixed points on the globe (of which there are millions), in any unit of distance, you can appreciate the scale of the problem.
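The arithmetic and the two kinds of inference mentioned here can be sketched in a few lines. This is an illustration under stated assumptions rather than our implementation: the people and dates are placeholders, the Earth is treated as a sphere with a mean radius of 6371 km, and the haversine formula is one standard way to get a great-circle distance, not necessarily the method our platform uses.

```python
import math
from datetime import date

# Illustrative placeholder data, not real dates of birth.
DATE_OF_BIRTH = {
    "person_a": date(1960, 5, 1),
    "person_b": date(1965, 9, 30),
}


def is_older(first: str, second: str) -> bool:
    """Infer 'older than' from two stored facts instead of storing the answer."""
    return DATE_OF_BIRTH[first] < DATE_OF_BIRTH[second]


# Stored facts needed for n people, with and without inference.
n_people = 500_000
facts_with_inference = n_people          # one date of birth per person
facts_without_inference = n_people ** 2  # one per ordered pair: 250 billion

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius
KM_PER_UNIT = {"km": 1.0, "miles": 1.609344, "metres": 0.001}


def distance(lat1: float, lon1: float, lat2: float, lon2: float,
             unit: str = "km") -> float:
    """Great-circle distance between two points (degrees), in the given unit."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 so rounding error cannot push asin outside its domain.
    km = 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))
    return km / KM_PER_UNIT[unit]
```

Two coordinates per place and one conversion factor per unit are enough to answer the distance question for every pair of places in every unit, which is the same saving as in the age example.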
In True Knowledge, inference is also used to extend knowledge in smaller ways, e.g. by including the CEO’s name in a list of people who ‘run’ a named business, derived from the knowledge that chief executives are part of the management team.
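A small sketch of this kind of rule-driven extension follows. Everything in it is hypothetical: the people, the business, the relation names and the two rules are invented, and simple forward chaining stands in for whatever our inference system does internally. The rules are held as data, in keeping with Principle 1.

```python
# Stored facts as (subject, relation, object) triples; all names are illustrative.
FACTS = {
    ("alice", "is the chief executive of", "example_corp"),
    ("bob", "is on the management team of", "example_corp"),
}

# Rules held as data: if "X <premise> Y" is known, "X <conclusion> Y" also holds.
RULES = [
    ("is the chief executive of", "is on the management team of"),
    ("is on the management team of", "runs"),
]


def infer(facts: set, rules: list) -> set:
    """Forward-chain the rules until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for subject, relation, obj in list(known):
                derived = (subject, conclusion, obj)
                if relation == premise and derived not in known:
                    known.add(derived)
                    changed = True
    return known


def who_runs(business: str) -> list:
    """List everyone inferred to 'run' the business, in alphabetical order."""
    return sorted(s for s, r, o in infer(FACTS, RULES)
                  if r == "runs" and o == business)
```

Calling `who_runs("example_corp")` lists both people even though no stored fact says that either of them runs the business.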
Another way of thinking about inference is that it allows relatively small knowledge bases to punch well above their weight. The 160 million (and growing) facts that True Knowledge currently knows allow it to answer trillions of questions. Without a general inference system, each fact would only answer one possible question.
Principle 4. The only truly scalable way to learn everything is by allowing users to contribute
One of the biggest success stories on the internet is Wikipedia. It is vastly bigger than any other encyclopaedia and one of the most trafficked websites on the internet. It was built almost entirely from the unpaid efforts of internet users and is kept up-to-date by thousands of volunteers.
True Knowledge automatically sources facts from Wikipedia, harnessing this user-generated source. It also has vast amounts of knowledge that have been imported from databases and added by our own staff.
We have also developed tools that allow users to add knowledge directly to our knowledge base and to vote to correct knowledge that is believed to be incorrect. Sometimes this knowledge is directly prompted for by the system. Our internal metrics show that this knowledge, although a small percentage of what we know, is disproportionately valuable in answering other users’ questions.
One of the difficulties of this is that external users (unlike staff) are untrusted. However, a truly effective system needs to have ways of dealing with untrusted knowledge sources. Understanding the knowledge is a big advantage here as we can automatically suppress knowledge the system believes is incorrect (see Principle 2). The multiple source approach (see Principle 8) is also a big help.
Principle 5. Silence is way better than getting it wrong (when you have a decent backfill)
True Knowledge is designed to reliably know when it doesn’t understand or doesn’t know. When it can produce a good direct answer, it does. When it can’t, it stays silent and some other kinds of results can be presented to the user instead – perhaps standard internet search. Producing a wrong answer to a question or finding an interpretation of the user’s request that isn’t what they intended are equally bad as the user hasn’t got what they wanted and valuable screen space has been taken up with bad data.
A great example of this principle in action is the browser plugin we launched yesterday, which seamlessly passes your standard search engine queries through our platform and inserts our answers into the results page when appropriate. Here the backfill could hardly be better: it’s Google. When the plugin fires inappropriately or when it can’t add value to the results, it takes up valuable real estate at the top of the page. Staying silent in these cases is exactly what is required. However, when it can produce a perfect direct answer, it does so, saving the user the effort of searching through the links and improving on the results page that would otherwise be there.
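The control flow behind this principle is short enough to sketch. The code below is illustrative only: `results_page`, `toy_engine` and `toy_search` are hypothetical names, and the two stand-in functions take the place of a real answer engine and a real search engine.

```python
from typing import Callable, Optional


def results_page(
    question: str,
    answer_engine: Callable[[str], Optional[str]],
    backfill: Callable[[str], list[str]],
) -> list[str]:
    """Put a direct answer above the backfill results, or add nothing at all."""
    results = backfill(question)
    answer = answer_engine(question)
    if answer is None:
        # Not understood or not known: stay silent and leave the page untouched.
        return results
    return [f"Answer: {answer}"] + results


# Stand-ins for illustration only.
def toy_engine(question: str) -> Optional[str]:
    known = {"what is the capital of france?": "Paris"}
    return known.get(question.lower().strip())


def toy_search(question: str) -> list[str]:
    return [f"search result for: {question}"]


answered = results_page("What is the capital of France?", toy_engine, toy_search)
silent = results_page("cheap flights", toy_engine, toy_search)
```

The design choice worth noting is that the engine returns `None` rather than a best guess, so the unanswered case costs the user no screen space.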
Principle 6. Model the universe the same way as your users do
(or communication is only possible between equals)
The True Knowledge platform contains a comprehensive ontology mapping all the things in the world into hundreds of thousands of classes (people, places, animals, substances etc.). This knowledge underpins the system that translates users’ questions into queries corresponding to what they mean. It is also used to disambiguate. Without this ontology and commonsense knowledge, it would be far harder to respond to users' questions in a sensible way.
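As a hedged illustration of how class knowledge can help with disambiguation, consider the toy ontology below. The classes, the candidate entities and the idea of attaching an expected class to each relation are all assumptions made for the example; our real ontology is vastly larger and the real disambiguation logic is more involved.

```python
from typing import Optional

# Hypothetical fragment of an ontology: class -> parent class.
PARENT = {
    "city": "place",
    "place": "thing",
    "human": "animal",
    "animal": "thing",
}

# Candidate entities for an ambiguous name, with the class each belongs to.
CANDIDATES = {
    "paris": [("paris_the_city", "city"), ("paris_the_person", "human")],
}

# The class each relation expects for its argument (an assumption of this sketch).
EXPECTED_CLASS = {"population of": "place", "date of birth of": "animal"}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """True if cls equals ancestor or descends from it in PARENT."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def disambiguate(relation: str, name: str) -> list[str]:
    """Keep only the candidates whose class fits what the relation expects."""
    expected = EXPECTED_CLASS[relation]
    return [entity for entity, cls in CANDIDATES.get(name, [])
            if is_a(cls, expected)]
```

Asking for the population of "paris" keeps only the city, while asking for a date of birth keeps only the person.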
Principle 7. Lexical knowledge is just another kind of knowledge
(or language independence is achievable)
As discussed in Principle 1, True Knowledge represents all knowledge in the same basic way. This knowledge includes what English words correspond to what entities. The various grammatical forms of various English words are also facts like any other.
This means that both the core technology and the knowledge representation are free of anything tied to the English language, and expanding to other languages is achievable essentially just by adding soft knowledge.
In future implementations of True Knowledge, users will be asking questions in multiple languages yet having their questions answered from a shared knowledge source.
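A minimal sketch of lexical knowledge stored alongside ordinary facts is shown below. The triple layout, the `lang:word` key format and the `denotes` relation are assumptions chosen for brevity, not our representation. The lookup code contains nothing specific to any one language; supporting another language here means adding rows.

```python
from typing import Optional

# World knowledge and lexical knowledge share one list of triples.
FACTS = [
    ("entity:italy", "has capital", "entity:rome"),
    ("en:italy", "denotes", "entity:italy"),
    ("de:italien", "denotes", "entity:italy"),
    ("en:rome", "denotes", "entity:rome"),
    ("de:rom", "denotes", "entity:rome"),
]


def entity_for(word: str, lang: str) -> Optional[str]:
    """Find the entity a word denotes in the given language."""
    key = f"{lang}:{word.lower()}"
    for subject, relation, obj in FACTS:
        if subject == key and relation == "denotes":
            return obj
    return None


def word_for(entity: str, lang: str) -> Optional[str]:
    """Find a word in the given language that denotes the entity."""
    for subject, relation, obj in FACTS:
        if relation == "denotes" and obj == entity and subject.startswith(f"{lang}:"):
            return subject.split(":", 1)[1]
    return None


def capital_of(country_word: str, lang: str) -> Optional[str]:
    """Answer in the asker's language from the shared, language-neutral fact."""
    country = entity_for(country_word, lang)
    if country is None:
        return None
    for subject, relation, obj in FACTS:
        if subject == country and relation == "has capital":
            return word_for(obj, lang)
    return None
```

Both `capital_of("Italy", "en")` and `capital_of("Italien", "de")` are answered from the same single `has capital` fact.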
Principle 8. All facts need sources and these need to be available to the user
In True Knowledge the facts used to answer a question are shown to the user after the answer and the sources for those individual facts are available with a single mouse click. Multiple sources can point at a single fact and the history of users endorsing or disagreeing with a fact is also visible. This history is also used for automatic assessment of a fact's truth.
We believe this approach is significantly better than a system which is just a black box and prints out an answer without the user being able to explore where the answer came from.
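One way to picture a fact that carries its sources and its endorsement history is the sketch below. The class name, the field names and especially the `believed_true` rule are assumptions made for illustration; the real assessment of a fact's truth is more sophisticated than counting votes.

```python
from dataclasses import dataclass, field


@dataclass
class SourcedFact:
    """A fact together with where it came from and how users reacted to it."""
    statement: str
    sources: list[str] = field(default_factory=list)
    history: list[tuple[str, bool]] = field(default_factory=list)  # (user, endorsed)

    def add_source(self, source: str) -> None:
        if source not in self.sources:
            self.sources.append(source)

    def endorse(self, user: str) -> None:
        self.history.append((user, True))

    def dispute(self, user: str) -> None:
        self.history.append((user, False))

    def believed_true(self) -> bool:
        """Toy rule (an assumption): at least one source and no net disagreement."""
        endorsements = sum(1 for _, endorsed in self.history if endorsed)
        disputes = len(self.history) - endorsements
        return bool(self.sources) and endorsements >= disputes


fact = SourcedFact("paris is the capital of france")
fact.add_source("example-encyclopaedia")
fact.add_source("example-database")
fact.endorse("user_1")
fact.dispute("user_2")
fact.endorse("user_3")
```

Because the sources and the history are ordinary fields on the fact, they can be shown to the user next to the answer rather than hidden inside the engine.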
Principle 9. Get it working scalably on cheap hardware
If engineered correctly, modern web platforms should be able to run on servers which are cheap and available through cloud computing vendors enabling capacity to be turned on and off at a moment’s notice. At a launch or in a situation of high demand, capacity can be increased without having to buy more servers or make long term financial commitments.
Similarly, by making use of modern open source software, systems can be built in a way that avoids licence fees and avoids tying the business to any particular commercial supplier.
True Knowledge follows these principles to the letter, being cloud-based and not tied to any commercial software.
Principle 10. It has to work fast
Modern search engines work lightning fast and this has become the expectation of users. Highly complex computing tasks can be done on modern hardware in a tiny fraction of a second and even highly complex AI systems should be no exception. At True Knowledge we believe strongly in this principle and significant resources are being spent to live up to it fully.