Information Retrieval and Machine Learning

OVERVIEW

  • The Information Retrieval and Machine Learning group develops core Information Retrieval, statistical Natural Language Processing and Machine Learning technologies to help solve complex and challenging business problems.

    Information Retrieval

    We are interested in core relevance in Information Retrieval (IR) systems: determining the set of documents that are most relevant to a given query, using just the query and the content of the documents. This core relevance work can be incorporated into learning-to-rank systems, which use Machine Learning to find the best function for ranking future search results. It can also be incorporated into results diversity models, which try to show a variety of kinds of relevant documents in response to a query. We are looking at how to use different kinds of signals in learning-to-rank systems for enterprise and e-commerce search.
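
    As a rough illustration of content-only relevance, the sketch below ranks documents against a query with the standard Okapi BM25 formula. It is not the group's own method: the function names, the whitespace tokenizer, and the parameter defaults (k1=1.2, b=0.75) are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices ordered from most to least relevant.

    Scores use only the query and the document text (Okapi BM25).
    Ties are broken by the original document order.
    """
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: the number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))

    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = [
    "cheap laptop deals",
    "email search at scale",
    "laptop battery replacement guide",
]
ranking = bm25_rank("email search", docs)
```

    A score of this kind is one possible input signal to a learning-to-rank model, alongside other features.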

    We are concerned with problems of scale in IR systems: how can systems be built and distributed so that we can search billions of documents in real time? We also investigate how search can be exploited in application-specific contexts such as email search.

    Statistical Natural Language Processing

    We use techniques from the field of Statistical Natural Language Processing (NLP) to do text mining: extracting structured information from unstructured data. We are investigating applications of Statistical NLP such as named entity recognition, where we extract the names of entities like people, places, organizations, and products from text. Once we have extracted a set of entities, we can perform coreference resolution, where we try to determine whether possible mentions of an entity really refer to the same entity, and entity linking, where we try to link a mention of an entity to a particular entry or set of entries in a structured knowledge base. We can then consider tasks like relationship extraction, where we try to find out how the entities in a document are connected. For example, we can try to learn that the person Karl Haberl is the manager of the organization The Information Retrieval and Machine Learning Group.

    We are also interested in summarizing information in large knowledge bases. To that end, we are investigating topic modelling, where we break a set of documents into a number of topics and then display the topics to users to help them understand what the documents are about.

    Machine Learning

    We also have interests in more fundamental aspects of Machine Learning. We're interested in scalable learning and inference techniques for the graphical models that drive our statistical NLP work. We are also concerned with semi-supervised and unsupervised techniques for Machine Learning, since in many situations it is difficult and expensive to obtain the training labels needed for supervised learning. We're also investigating active and cost-sensitive learning techniques, so that we can make the best use of limited resources when acquiring information during the learning process.
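
    To make the idea of active learning concrete, the sketch below shows uncertainty sampling, one common strategy for spending a limited labeling budget. It is an illustrative assumption rather than a description of the group's techniques: the function name and the input format (predicted positive-class probabilities from some binary classifier) are hypothetical.

```python
def uncertainty_sample(probabilities, budget):
    """Pick the unlabeled items a binary classifier is least sure about.

    probabilities: predicted positive-class probability for each unlabeled item.
    budget: how many items we can afford to send for labeling.
    Returns item indices, most uncertain (closest to 0.5) first.
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: (abs(probabilities[i] - 0.5), i),
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled documents.
predicted = [0.95, 0.52, 0.10, 0.45]
to_label = uncertainty_sample(predicted, budget=2)
```

    Here the two documents with predictions nearest 0.5 are selected, on the assumption that their labels would be the most informative to acquire next.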

    Our goal is to use techniques from Machine Learning to find scalable, generalizable solutions to the problems faced by Oracle and its customers.

PUBLICATIONS