
Session 11: Tokenization
Presenters
Tokenization
In the 11th session of "AI Talks," we explored tokenization in NLP, starting with an overview of its role as the process of breaking text into smaller units called tokens, the first step in most NLP pipelines. We discussed four main types of tokenization: word, subword, character, and sentence. Word tokenization, the most intuitive method, splits text on spaces and punctuation, but it struggles with out-of-vocabulary (OOV) words and with languages that lack explicit word boundaries, such as Chinese and Japanese. Subword tokenization, which segments words into smaller meaningful units, was highlighted as a solution to the OOV problem, using techniques like Byte Pair Encoding (BPE) and SentencePiece to build a compact yet expressive vocabulary.
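To make the word-level approach and its OOV weakness concrete, here is a minimal sketch (not from the session slides; the regex and toy vocabulary are illustrative assumptions):

```python
import re

# Minimal word tokenizer: runs of word characters, or single punctuation marks.
def word_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

# A tiny fixed vocabulary, standing in for whatever a model saw during training.
vocab = {"the", "cat", "sat", "on", "mat", "."}

tokens = word_tokenize("The cat sat on the tokenizer.")
oov = [t for t in tokens if t.lower() not in vocab]
print(tokens)  # ['The', 'cat', 'sat', 'on', 'the', 'tokenizer', '.']
print(oov)     # ['tokenizer'] -- unseen at training time, so the model has no entry for it
```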
Character and sentence tokenization were also discussed in depth. Character tokenization treats each character as a token, which eliminates OOV problems entirely and handles morphologically rich languages well, but it produces long token sequences and requires deeper models to recover word-level meaning. Sentence tokenization, on the other hand, breaks text into sentences based on punctuation and language-specific rules, making it useful for tasks like summarization and translation. However, it is language-dependent and prone to errors when punctuation is ambiguous, for example treating the period in an abbreviation like "Dr." as a sentence boundary.
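A quick sketch (illustrative, not from the session materials) contrasting the two and showing the punctuation pitfall:

```python
import re

text = "Dr. Smith arrived. He was late."

# Character tokenization: every character is its own token, so nothing is
# ever out-of-vocabulary, but the sequence is much longer than the word count.
char_tokens = list(text)
print(len(char_tokens))  # 31 tokens for a six-word text

# Naive sentence tokenization: split after ., !, or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
# ['Dr.', 'Smith arrived.', 'He was late.']
# The period in the abbreviation "Dr." is misread as a sentence boundary.
```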
Finally, we delved into the mechanics of subword tokenization algorithms, focusing on Byte Pair Encoding (BPE) and SentencePiece. BPE starts from individual characters and iteratively merges the most frequent pair of adjacent symbols to build subword units, which makes it effective at handling OOV words with a compact vocabulary, though the vocabulary is fixed once training ends. SentencePiece, developed by Google, was introduced as a more flexible approach that treats text as a raw stream and does not rely on whitespace pre-tokenization, making it suitable for diverse languages and domains. However, it comes with some computational overhead and requires careful handling of special tokens. Overall, the session provided a comprehensive understanding of tokenization methods, their strengths and weaknesses, and their impact on modern NLP applications.
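A minimal sketch of the BPE training loop on a toy corpus, in the style of the reference example from Sennrich et al.'s BPE paper (the corpus, frequencies, and `</w>` end-of-word marker are illustrative):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge the pair only where it appears as two whole, space-separated symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker and corpus counts.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

for step in range(10):  # the number of merges controls the final vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```

Each printed merge becomes a new vocabulary entry, so frequent fragments like "est" emerge as subwords. SentencePiece offers a comparable train/encode workflow via the `sentencepiece` Python package (`spm.SentencePieceTrainer.train(...)`, then `SentencePieceProcessor.encode(...)`), though training there requires a corpus file rather than an in-memory dictionary.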
Slides: https://www.canva.com/design/DAGcMaPHMIw/UvUmGWoLjV542gC7vEY1zw/view?utm_content=DAGcMaPHMIw&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=h2e4f058483