SDS 626: Subword Tokenization with Byte-Pair Encoding — with @JonKrohnLearns

Topics covered
Popular Clips
Episode Highlights
Word Tokenization
introduces word tokenization as a foundational step in natural language processing (NLP). This method involves breaking down text into individual words using spaces as delimiters, which is straightforward but has limitations. If a word is not frequently present in the training data, it becomes an unknown token during model deployment, potentially ignoring important words 1.
A big drawback with such word level tokenization is that if a word didn't show up enough times in our training data, then when the NLP model encounters that word in production, there's no way to handle it.
---
To address this, character-level tokenization is introduced, allowing models to represent words by their constituent characters, though it requires more tokens and can lead to suboptimal performance 1.
Character Tokenization
Character tokenization breaks text into individual characters, offering a solution to the unknown token problem in word tokenization. This method allows models to represent any word by its characters, ensuring no word is ignored due to vocabulary limitations 1. However, it requires a large number of tokens and lacks the semantic richness of words, which can hinder model performance.
Unfortunately, character level tokenization also has its own drawbacks. For one, it requires a large number of tokens to represent a sequence of text.
---
Despite these drawbacks, character tokenization is used in techniques like ELMo, which leverages character embeddings to improve NLP tasks 1.
Subword Tokenization
Subword tokenization, particularly through byte-pair encoding, offers a balanced approach by combining the strengths of word and character tokenization. This method involves splitting words into meaningful subwords, which can be recombined to represent out-of-vocabulary words efficiently 2.
The upshot is that byte pair encoding is indeed so powerful that it is a crucial component behind many of the leading NLP models of today, such as Bertin, GPT-3 and Excelnet.
---
By using byte-pair encoding, NLP models can handle new words by leveraging known subwords, enhancing their ability to understand and process language 1.
Related Episodes

SDS 506: Supervised vs Unsupervised Learning — with Jon Krohn
Answers 383 questions
SDS 446: Getting Started in Machine Learning — with Jon Krohn
Answers 383 questions
SDS 554: @JonKrohnLearns's Deep Learning Courses
Answers 383 questions
SDS 556: @JonKrohnLearns's Machine Learning Courses
Answers 383 questions
SDS 558: @JonKrohnLearns's Answers to Questions on Machine Learning
Answers 383 questions
SDS 620: OpenAI Whisper: General-Purpose Speech Recognition — with @JonKrohnLearns
Answers 383 questions
SDS 476: Peer-Driven Learning — with Jon Krohn
Answers 383 questions
SDS 456: The Pomodoro Technique — with Jon Krohn
Answers 383 questions
SDS 484: Algorithm Aversion — with Jon Krohn
Answers 383 questions
SDS 624: Imagen Video: Incredible Text-to-Video Generation — with @JonKrohnLearns
Answers 383 questions
SDS 510: Deep Reinforcement Learning — with Jon Krohn
Answers 383 questions
SDS 474: The Machine Learning House — with Jon Krohn
Answers 383 questions
SDS 568: PaLM: Google's Breakthrough Natural Language Model — with Jon Krohn
Answers 383 questions
SDS 576: Tech Startup Dramas — with Jon Krohn
Answers 383 questions
SDS 468: The History of Data — with Jon Krohn
Answers 383 questions
