
Attribute Extraction from Noisy Text Using Character-based Sequence Tagging Models

Pallika Kanani, Michael Wick, Adam Pocock

06 December 2015

Attribute extraction is the problem of extracting structured key-value pairs from unstructured data. Similar entity recognition problems are usually solved as sequence labeling tasks in which the elements of the sequence are word tokens. While word tokens are suitable for newswire, they are problematic for many other types of data, from social media text to product descriptions, because simple regular-expression-based word tokenizers cannot accurately tokenize text that is inconsistently spaced. Instead, we propose a character-based sequence tagging approach that jointly tokenizes and tags tokens. We find that the character-based approach is surprisingly accurate both at tokenizing words and at inferring labels. We also propose an end-to-end system that uses pairwise entity linking models to normalize the extracted values.
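To make the idea of jointly tokenizing and tagging concrete, here is a minimal sketch of how per-character tags could be decoded into attribute values. It assumes a BIO-style label scheme, and the function name `decode_char_tags`, the labels `SIZE` and `COLOR`, and the sample string are all hypothetical illustrations; the abstract does not specify the label set or the model used in the paper. The tags are written by hand here and stand in for the output of a trained tagger.

```python
from typing import List, Optional, Tuple


def decode_char_tags(text: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Turn one BIO tag per character into (label, value) pairs.

    Assumed scheme (not taken from the paper): "B-X" starts a value of
    type X, "I-X" continues it, and "O" marks characters outside any value.
    """
    if len(text) != len(tags):
        raise ValueError("need exactly one tag per character")

    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    chars: List[str] = []

    for ch, tag in zip(text, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open span; the boundary comes from the tag itself,
            # not from whitespace in the text.
            if label is not None:
                spans.append((label, "".join(chars)))
            label, chars = tag[2:], [ch]
        elif tag.startswith("I-"):
            chars.append(ch)
        else:  # "O": outside any attribute value
            if label is not None:
                spans.append((label, "".join(chars)))
            label, chars = None, []

    if label is not None:
        spans.append((label, "".join(chars)))
    return spans


# Hypothetical product snippet with no space between "8GB" and "ram".
text = "8GBram black"
tags = (
    ["B-SIZE", "I-SIZE", "I-SIZE"]   # "8GB"
    + ["O"] * 4                      # "ram" and the space
    + ["B-COLOR"] + ["I-COLOR"] * 4  # "black"
)
print(decode_char_tags(text, tags))
```

Because the boundaries are carried by the tags, the value `8GB` is separated from `ram` even though a whitespace tokenizer would treat `8GBram` as a single token.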

Venue: NIPS Workshop on Machine Learning for e-Commerce 2016.

External Link: https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxuaXBzMTVlY29tbWVyY2V8Z3g6MTQ0ZWRkZDQxZTFmYjcxYQ
