Is White Space Tokenization enough?

Example Sentences Used

  1. I can’t believe we finished the long-distance race in under an hour!
  2. Let’s brainstorm some birthday party ideas for our friend’s upcoming celebration.
  3. Sarah accidentally left her homework at home, so she’ll need to ask the teacher for an extension.
  4. Don’t forget to water the houseplants; they need moisture to survive.
  5. Despite the malfunctioning GPS sending us miles off course, we’ll never forget the breathtaking mountain scenery we stumbled upon!
  6. She couldn’t resist the temptation of grabbing a double-scoop ice cream cone before heading to the beach.

TreebankWord Tokenizer

Split the word can’t in an unintuitive way: (ca)(n’t). Treated all punctuation as separate tokens.
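A minimal regex sketch of those two behaviors (an approximation, not NLTK's actual `TreebankWordTokenizer` implementation) might look like:

```python
import re

def treebank_style(text):
    # Approximation only: split "n't" contractions the Treebank way,
    # so "can't" becomes ("ca", "n't")
    text = re.sub(r"(\w+)(n['’]t)\b", r"\1 \2", text)
    # Detach punctuation into separate tokens
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(treebank_style("I can't believe we finished the race!"))
# ['I', 'ca', "n't", 'believe', 'we', 'finished', 'the', 'race', '!']
```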

WordPunct Tokenizer

Separated words that are connected by hyphens. Handled contractions by splitting them: (can)(’)(t). Also handled punctuation as a separate entity.
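This behavior can be sketched with a single alternation between runs of word characters and runs of punctuation (NLTK's `WordPunctTokenizer` is regex-based in much the same spirit, though this is a simplified stand-in):

```python
import re

def wordpunct_style(text):
    # Alternating runs of word characters and punctuation;
    # "can't" becomes ("can", "'", "t"), "double-scoop" becomes
    # ("double", "-", "scoop")
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_style("She couldn't resist the double-scoop cone."))
```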

PunktWord Tokenizer

Separated contractions by splitting them into two parts: (can)(’t). Handled punctuation as a separate entity.
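A rough regex stand-in for that contraction behavior (an approximation only; the Punkt word tokenizer itself has been removed from recent NLTK releases) could be:

```python
import re

def punkt_word_style(text):
    # Approximation: keep the apostrophe with the suffix,
    # so "can't" becomes ("can", "'t")
    return re.findall(r"['’]\w+|\w+|[^\w\s]", text)

print(punkt_word_style("Don't forget to water the houseplants;"))
```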

White Space Tokenizer

Separated the text at the spaces between words. Each token kept any punctuation attached to it.
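In Python, plain white space tokenization is just `str.split()`, which illustrates the punctuation-stays-attached behavior directly:

```python
sentence = "I can't believe we finished the long-distance race in under an hour!"

# Splitting on white space keeps punctuation and hyphens attached:
# "can't", "long-distance", and "hour!" each remain single tokens
tokens = sentence.split()
print(tokens)
```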

Pattern Tokenizer

Split select contractions as (ca)(n’t) or (do)(n’t), while others were split as (we)(’ll). Punctuation was treated as a separate entity.
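That mixed behavior can be sketched with two separate rules (hypothetical regexes for illustration, not the Pattern library's actual implementation):

```python
import re

def pattern_style(text):
    # "n't" contractions split before the n: "can't" -> ("ca", "n't")
    text = re.sub(r"(\w+)(n['’]t)\b", r"\1 \2", text)
    # Other contractions split at the apostrophe: "we'll" -> ("we", "'ll")
    text = re.sub(r"(\w+)(['’](?:ll|re|ve|d|m|s))\b", r"\1 \2", text)
    # Detach punctuation into separate tokens
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    return text.split()

print(pattern_style("We'll never forget it, but they can't come."))
```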

For this small sample, white space tokenization may be enough, but splitting only at white space raises issues with compound words, which can be written as separate words or hyphenated. The white space tokenizer kept “long-distance” as one token, which makes sense, but it split “ice cream” into two tokens because of the space, and that split can lose the meaning of the phrase, since “ice” and “cream” mean different things on their own.
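The hyphen-versus-space distinction shows up directly when the two compounds sit in the same string:

```python
phrase = "long-distance race and ice cream"

# The hyphenated compound survives as one token;
# the spaced compound is broken apart
tokens = phrase.split()
print(tokens)  # ['long-distance', 'race', 'and', 'ice', 'cream']
```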
