Token

1. Definition of Token

  • A token is the fundamental unit of text in Natural Language Processing (NLP): a tokenizer breaks raw text into segments that a model can represent and process.
  • Different models and tokenizers may split the same text into tokens in different ways, as the sketch below illustrates.
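
A minimal comparison sketch, assuming the Hugging Face transformers library is installed and the public bert-base-uncased and gpt2 tokenizer files can be downloaded; the exact token lists depend on each tokenizer's vocabulary, so the printed output will differ between the two:

```python
# Compare how two different tokenizers split the same sentence.
# Assumes: `pip install transformers` and network access to fetch the
# bert-base-uncased and gpt2 tokenizer files (both are public checkpoints).
from transformers import AutoTokenizer

text = "Hello, world!"

for name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(text)   # human-readable token strings
    ids = tokenizer.encode(text)        # integer IDs actually fed to the model
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
    print(f"{name}: ids -> {ids}")
```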

2. Types of Tokens

  • Word-level Token: Each word is treated as a token. For example, the sentence "Hello, world!" would be split (with punctuation dropped) into two tokens: ["Hello", "world"].
  • Character-level Token: Each character is treated as a token. For example, "Hello" would be split into ["H", "e", "l", "l", "o"].
  • Subword-level Token: Words are broken down into smaller subword units. For example, "unhappiness" might be decomposed into ["un", "happiness"]; see the sketch after this list.
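
The three granularities can be illustrated with plain Python. The subword step below uses a toy hand-made vocabulary and a greedy longest-match rule purely for demonstration; it is not the algorithm of any specific tokenizer such as BPE or WordPiece:

```python
import re

text = "Hello, world!"

# Word-level: keep alphabetic words, dropping punctuation.
word_tokens = re.findall(r"[A-Za-z]+", text)
print(word_tokens)            # ['Hello', 'world']

# Character-level: every character (here, of one word) is its own token.
char_tokens = list("Hello")
print(char_tokens)            # ['H', 'e', 'l', 'l', 'o']

# Subword-level (toy example): greedily match the longest known piece
# from a tiny hand-made vocabulary.
vocab = {"un", "happiness", "happi", "ness"}

def toy_subword_tokenize(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # try the longest remaining prefix first
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # unknown character: emit it as-is
            pieces.append(word[i])
            i += 1
    return pieces

print(toy_subword_tokenize("unhappiness"))   # ['un', 'happiness']
```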

3. Role of Tokens

  • Model Input: Language models take tokens as input; each token is mapped to an integer ID and then to a vector representation (embedding).
  • Computational Resources: The number of tokens directly affects the model's computational load and memory usage. More tokens mean more computational resources are consumed.
  • Cost Calculation: In some AI services, costs are calculated based on the number of tokens used. For example, OpenAI's API charges are based on the total number of input and output tokens; a token-counting sketch follows this list.
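
A hedged sketch of token-based accounting, assuming the tiktoken library is installed; the per-token prices here are placeholder values chosen for illustration, not actual OpenAI pricing:

```python
# Estimate token counts and cost for one prompt/completion pair.
# Assumes: `pip install tiktoken`. The prices below are HYPOTHETICAL
# placeholders; check the provider's current price list before billing anything.
import tiktoken

PRICE_PER_INPUT_TOKEN = 0.000001   # assumed placeholder, in USD
PRICE_PER_OUTPUT_TOKEN = 0.000002  # assumed placeholder, in USD

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several OpenAI chat models

prompt = "Summarize the plot of Hamlet in one sentence."
completion = "A Danish prince avenges his father's murder and dies in the process."

input_tokens = len(enc.encode(prompt))
output_tokens = len(enc.encode(completion))

cost = input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN
print(f"input tokens:  {input_tokens}")
print(f"output tokens: {output_tokens}")
print(f"estimated cost: ${cost:.6f}")
```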

4. Limitations of Tokens

  • Context Length: Most AI models have a maximum context length (e.g., 4096 tokens); input beyond this limit must be truncated or otherwise shortened (see the truncation sketch after this list).
  • Efficiency: Longer token sequences increase the model's computation time (self-attention in standard Transformers scales quadratically with sequence length), reducing response speed.
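
A minimal truncation sketch, again assuming tiktoken is installed; the 4096-token limit is just the example figure from above, and real applications usually chunk or summarize long input rather than cutting it blindly:

```python
# Truncate text so it fits a model's context window, keeping only the
# first MAX_TOKENS tokens. Assumes: `pip install tiktoken`.
import tiktoken

MAX_TOKENS = 4096                      # example limit from the text above
enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_limit(text: str, max_tokens: int = MAX_TOKENS) -> str:
    token_ids = enc.encode(text)
    if len(token_ids) <= max_tokens:
        return text                    # already fits, nothing to cut
    # keep the first max_tokens tokens and decode them back into a string
    return enc.decode(token_ids[:max_tokens])

long_text = "word " * 10000            # deliberately longer than the limit
short_text = truncate_to_limit(long_text)
print(len(enc.encode(long_text)), "->", len(enc.encode(short_text)))
```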