Tokenization
Text Tokenization
Splitting text into meaningful units (tokens) such as words, sentences, or subword pieces for processing.
Technical detail
Tokenization operates on sequences of Unicode code points, where each character's properties (category, script, case, directionality) are defined by the Unicode standard. Text processing in the browser uses the TextEncoder/TextDecoder APIs for encoding conversion and Intl.Segmenter for locale-aware word and sentence boundary detection. Understanding the distinction between bytes, code units, code points, and grapheme clusters is essential for correct text manipulation.
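The distinctions above can be sketched with `Intl.Segmenter` for locale-aware word segmentation, and with plain string operations to contrast UTF-16 code units, code points, and grapheme clusters. This is a minimal sketch assuming a runtime with full ICU support (modern browsers, or Node.js 16+); the sample strings are illustrative.

```javascript
// Locale-aware word segmentation (UAX #29 word boundaries)
const wordSeg = new Intl.Segmenter('en', { granularity: 'word' });
const words = [...wordSeg.segment('Hello, world! Tokens 2024.')]
  .filter(part => part.isWordLike) // drop punctuation and whitespace segments
  .map(part => part.segment);
console.log(words); // ['Hello', 'world', 'Tokens', '2024']

// Bytes vs. code units vs. code points vs. grapheme clusters
const emoji = '👍🏽'; // thumbs-up + skin-tone modifier
console.log(emoji.length);        // 4 UTF-16 code units (two surrogate pairs)
console.log([...emoji].length);   // 2 Unicode code points
const graphemeSeg = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log([...graphemeSeg.segment(emoji)].length); // 1 grapheme cluster
```

A user-perceived character can thus span several code points and even more code units, which is why slicing strings by `.length` indices can split a character in half.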
Example
```javascript
// Tokenization: naive whitespace tokenization example
const input = 'Sample text for processing';
const result = input
  .trim()
  .split(/\s+/)      // split on runs of whitespace
  .filter(Boolean);  // drop any empty strings
console.log(result); // ['Sample', 'text', 'for', 'processing']
```