
Tokenization: language issues

  • Arabic (or Hebrew) is written right to left, but certain items, such as numbers, are written left to right

  • Words are separated, but the letters within a word join into complex ligatures

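  • [Figure: an Arabic sentence with arrows marking reading direction: right to left overall, starting at the right edge, with the two numbers read left to right]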

  • ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’

  • With Unicode, the surface presentation is complex, but the stored form is straightforward
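The last point can be illustrated with a minimal Python sketch. The sample string (the Arabic word for "year" followed by the number 1962) is an assumption chosen for illustration and is not taken from the slide. The sketch relies only on the standard-library unicodedata module, whose bidirectional() function returns the Unicode bidirectional class of a character.

```python
import unicodedata

# Hypothetical sample, not from the slide: the Arabic word for "year"
# (seen, noon, teh marbuta) followed by a space and the number 1962.
# The code points are stored in logical (reading) order.
text = "\u0633\u0646\u0629 1962"


def bidi_classes(s):
    """Return the Unicode bidirectional class of each character in s."""
    return [unicodedata.bidirectional(ch) for ch in s]


# Arabic letters carry class 'AL' (displayed right to left), the space is
# 'WS', and the digits are 'EN' (European numbers, displayed left to right).
classes = bidi_classes(text)

# Because the stored form is in logical order, plain whitespace splitting
# yields the word first and the number second, regardless of how a
# renderer lays the text out on screen.
tokens = text.split()

print(classes)
print(tokens)
```

The display direction is decided by the renderer from these character classes, so a tokenizer working on the stored string does not need to handle direction changes itself.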

