It feels great to have a big fat knowledge base full of facts but sometimes you can have too much of a good thing.
For example, we know about 19 different places containing Madingley, UK. This is great, but it does make it hard for the system to give a concise text-based answer to a question like "Where is Madingley?"
What to do? Should we give all 19 results, including "the 01223 telephone code area", "US postal group 5", and "the world"? Obviously not. How about only the smallest place containing it? Again, no. This is quite likely to be some obscure electoral district that no-one's ever heard of.
The ideal answer is probably something like "Cambridgeshire, England" or "Cambridgeshire, England, UK". We need some way of deciding which places are interesting enough to use as answers. Obviously countries are all interesting, but where else is? If I ask where Micronesia is then I'd be interested to get "the Southern hemisphere" in the answer, but we probably don't want the hemisphere as part of the answer for Madingley. Does that make it OK to say that the Southern hemisphere is interesting enough to use in answers and not the Northern one, or would that be just too borealocentric?
So far we've labelled all the countries, US states and English counties as 'place is interesting'. We'll be putting in more over the next few days, so come along and join in to help get questions about your town answered well.
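As a rough illustration of the labelling idea, here is a minimal Python sketch. The `Place` type, the `interesting` flag, the approximate area figures and the sample data are all assumptions made up for this example; they are not how the knowledge base actually stores or ranks places.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    area_km2: float    # approximate size, used only for ordering (illustrative)
    interesting: bool  # hypothetical 'place is interesting' label


def where_is(containers):
    """Build a concise answer from the interesting containers, smallest first."""
    chosen = sorted(
        (p for p in containers if p.interesting),
        key=lambda p: p.area_km2,
    )
    return ", ".join(p.name for p in chosen)


# A handful of the places that contain Madingley (sample data, not the real 19).
containers = [
    Place("the world", 510_000_000, False),
    Place("the 01223 telephone code area", 1_000, False),
    Place("England", 130_000, True),
    Place("Cambridgeshire", 3_400, True),
    Place("UK", 243_000, True),
]

print(where_is(containers))
```

With this sample data the unlabelled places drop out and the rest are ordered from smallest to largest, which gives an answer of the "Cambridgeshire, England, UK" shape discussed above.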
I would have thought that this is a question whose answer depends on the context of the person asking the question. Labelling English counties as interesting is good for someone in England, or maybe an ex-pat Brit (who still has a knowledge of England). But for someone in China, English counties are probably irrelevant.
If you had some way of knowing about the user, then you could tailor the answer to them. For instance, you could provide an appropriate amount of context to describe answers, and also rank answers so that, say, the nearest place comes first. Could this be an optional feature of TK accounts?
Posted by: Nomlas | 09 March 2009 at 09:18 AM
You need to up the number of facts by two or three orders of magnitude. Hand‐entering them is simply not a feasible way of proceeding. I recommend you write two sets of automated web crawlers. The first would target specific sites such as Wikipedia and use knowledge about things like info boxes to import lots of data automatically. The second would be a more general crawler that spots common patterns of fact assertion on any website with a high trust value (PageRank?).
You can exclude some assertion patterns too, such as those that begin with "your mother is". They could be held in a clearing house until they can be verified if the usage is suspect, or trivial—"[Nicholas Shanks] would like [a pepperoni pizza for lunch] was true from 9 March 2009 onwards" is a pretty useless fact that you might find online, and can be automatically filtered out.
Other importers would also be useful, such as GEDCOM and FOAF for people/relationships/genealogies, and a general RDF ontology parser/importer for accumulating random data. RDF triples are well suited to how your system works.
Posted by: Nicholas | 09 March 2009 at 10:59 AM
We already do automatically import facts. If you look through the recent activity log you'll see lines like:
15:58 yesterday true knowledge Added 34603 facts sourced from Freebase about human beings
I sure didn't type all those humans in myself!
Posted by: Beth | 09 March 2009 at 11:13 AM
I'd guess that you need a reference to the computer asking the question and its origin or IP address.
Thus: to a person in Cambridge (CB, UK), Madingley is just a few miles north.
To someone in Cambridge (Glos) it's in Cambridgeshire, and
to someone in Cambridge (Mass), Madingley is in England.
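
The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `describe`, the container chains and the asker locations are all hypothetical and hard-coded, and working out where the asker is (for example from an IP address) is not attempted here.

```python
def describe(target_chain, asker_places):
    """Pick the largest container of the target that the asker is not also in.

    target_chain: places containing the target, smallest first.
    asker_places: set of places containing the asker.
    Returns None when the asker shares every container, i.e. is local,
    in which case a distance or direction would be the better answer.
    """
    unshared = [p for p in target_chain if p not in asker_places]
    return unshared[-1] if unshared else None


# Hypothetical containers for Madingley, smallest first.
madingley = ["Cambridgeshire", "England"]

# Hypothetical asker locations.
cambridge_uk = {"Cambridgeshire", "England"}
cambridge_glos = {"Gloucestershire", "England"}
cambridge_mass = {"Massachusetts", "United States"}

for asker in (cambridge_uk, cambridge_glos, cambridge_mass):
    print(describe(madingley, asker))
```

The local asker gets `None` (so the system would fall back to something like a distance), the asker in Gloucestershire gets "Cambridgeshire", and the asker in Massachusetts gets "England".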
Posted by: Martin Griffies | 16 April 2009 at 02:57 PM