Reuters RCV1 statistics

  symbol   statistic                                        value
  N        documents                                        800,000
  L        avg. # tokens per doc                            200
  M        terms (= word types)                             400,000
           avg. # bytes per token (incl. spaces/punct.)     6
           avg. # bytes per token (without spaces/punct.)   4.5
           avg. # bytes per term                            7.5
           non-positional postings                          100,000,000
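
A rough back-of-the-envelope check of what the table implies, as a minimal sketch. The variable names are illustrative, and the results are estimates from the rounded averages in the table, not measured values.

```python
# Values copied from the RCV1 statistics table.
N = 800_000            # documents
L = 200                # avg. # tokens per doc
M = 400_000            # terms (= word types)
BYTES_PER_TOKEN = 6    # avg. bytes per token, incl. spaces/punct.
BYTES_PER_TERM = 7.5   # avg. bytes per term

# Estimated total number of tokens in the collection.
total_tokens = N * L

# Estimated raw text size: every token occurrence costs about 6 bytes.
collection_bytes = total_tokens * BYTES_PER_TOKEN

# Estimated size of the term strings alone: each term is stored once.
dictionary_bytes = M * BYTES_PER_TERM

print(f"tokens:            {total_tokens:,}")
print(f"collection bytes:  {collection_bytes:,}")
print(f"term string bytes: {dictionary_bytes:,.0f}")
```

The estimated token count is larger than the number of non-positional postings because a term that occurs several times in one document contributes only one non-positional posting for that document.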

    4.5 bytes per word token vs. 7.5 bytes per word type: why?
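
One plausible way to explore the question is to compare the two averages on a small example. The sketch below uses a made-up sentence (not RCV1 data) and whitespace tokenization, both assumptions for illustration only. The token average counts a word once per occurrence, while the type average counts each distinct word once.

```python
from collections import Counter

# Toy text, invented for illustration; not taken from RCV1.
text = ("the cat sat on the mat and the dog sat on the "
        "extraordinarily comfortable sofa")

tokens = text.split()      # every occurrence counts
counts = Counter(tokens)   # keys are the distinct word types

# Average length over occurrences: frequent words are counted many times.
avg_token_len = sum(len(t) for t in tokens) / len(tokens)

# Average length over the vocabulary: each word type is counted once.
avg_type_len = sum(len(w) for w in counts) / len(counts)

print(f"avg. length per token: {avg_token_len:.2f}")
print(f"avg. length per type:  {avg_type_len:.2f}")
```

In this toy text the most frequent words ("the", "on", "sat") are short, so they pull the per-token average down, while long words that occur once weigh just as much as any other type in the per-type average.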

