SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Kavli Affiliate: John Richardson

| First 5 Authors: Taku Kudo, John Richardson

| Summary:

This paper describes SentencePiece, a language-independent subword tokenizer
and detokenizer designed for neural text processing, including Neural
Machine Translation (NMT). It provides open-source C++ and Python
implementations for subword units. While existing subword segmentation tools
assume that the input is pre-tokenized into word sequences, SentencePiece can
train subword models directly from raw sentences, which makes a purely
end-to-end and language-independent system possible. We perform a validation
experiment of NMT on English-Japanese machine translation, and find that
direct subword training from raw sentences achieves comparable accuracy. We
also compare the performance of subword training and segmentation with various
configurations. SentencePiece is available under the Apache 2.0 license at
https://github.com/google/sentencepiece.
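
To make the idea of a tokenizer and detokenizer that work on raw sentences more concrete, here is a minimal sketch in plain Python. It is not the SentencePiece implementation: the greedy longest-match segmenter and the tiny vocabulary are assumptions chosen for illustration, whereas the real library learns its vocabulary with subword algorithms such as BPE or a unigram language model. The one detail borrowed from SentencePiece is the meta symbol "▁" (U+2581) that stands in for whitespace, so that the original sentence can be recovered from the pieces without language-specific rules.

```python
# Toy illustration only: NOT the SentencePiece implementation.
# The vocabulary and the greedy matching strategy are hypothetical.

WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def escape(text):
    """Replace spaces with the meta symbol so segmentation is reversible."""
    return text.replace(" ", WS)


def tokenize(text, vocab):
    """Greedy longest-match segmentation over the escaped raw string.

    Any character not covered by a vocabulary piece is emitted on its own,
    so every input can be segmented without a pre-tokenization step.
    """
    s = escape(text)
    max_len = max([1] + [len(piece) for piece in vocab])
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(s), i + max_len), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


def detokenize(pieces):
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


toy_vocab = {"Hello", WS + "wor", "ld", "."}  # hypothetical vocabulary
pieces = tokenize("Hello world.", toy_vocab)
print(pieces)              # ['Hello', '▁wor', 'ld', '.']
print(detokenize(pieces))  # Hello world.
```

Because whitespace is kept as an ordinary symbol inside the pieces, detokenization is a plain concatenation followed by one replacement. This sketch assumes the input does not already contain the "▁" character.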

| Search Query: ArXiv Query: search_query=au:"John Richardson"&id_list=&start=0&max_results=10
