Tech Info 155: HELIOS UB64 PDF full-text indexing performance considerations

HELIOS Tech Info #155

Tue, 28 Jan 2014

Starting with the HELIOS UB64 release, PDF full-text indexing for Spotlight-compatible searches is included in every installation via HELIOS Base UB64. The PDF text extraction included in HELIOS Base is sufficient for many customers. However, the PDF HandShake option, which uses Callas PDF Toolbox, includes a more advanced PDF text extractor that is faster and extracts text more precisely.

Please note that HELIOS Index Server full-text extraction requires many times more computing and disk performance than file-name indexing, because every file must be analyzed and processed for metadata and text extraction. The simple file-name-only indexing via the desktop database is many times faster because it only scans the file system and reads the resource information to index file ID, file name, and parent ID. The desktop database is required for every volume, whereas the index database is optional and must be turned on per volume in HELIOS Admin.

Here are the performance results for a sample volume with 3,700 files, mainly PDF documents, on a server with a quad-core Intel i7 CPU. On average, each PDF document has about 10 pages:

Indexing System             HELIOS Base UB64                HELIOS PDF HandShake UB64
--------------------------  ------------------------------  ------------------------------
.Desktop rebuild            10 seconds
(volume file index)         (about 370 files per second)

.DesktopIndex rebuild       32 minutes                      10 minutes
(volume Spotlight index)    (about 2 PDFs per second)       (about 6 PDFs per second)

The table shows that the desktop database rebuild is very fast, processing about 370 files per second, which is roughly 190 times faster than PDF full-text indexing with HELIOS Base UB64 (about 2 PDFs per second). Full-text indexing with the optional PDF HandShake is about 3 times faster than with the extractor included in Base UB64 (10 minutes versus 32 minutes).
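The throughput figures and ratios can be re-derived from the measured times. The following is a minimal sketch that uses only the numbers reported in this Tech Info; the variable and function names are illustrative and not part of any HELIOS tool.

```python
# Figures reported in this Tech Info for the sample volume.
FILES = 3700                  # files on the sample volume, mainly PDFs
DESKTOP_SECONDS = 10          # .Desktop rebuild (volume file index)
BASE_SECONDS = 32 * 60        # .DesktopIndex rebuild with HELIOS Base UB64
HANDSHAKE_SECONDS = 10 * 60   # .DesktopIndex rebuild with PDF HandShake UB64


def files_per_second(files, seconds):
    """Average throughput in files per second."""
    return files / seconds


desktop_rate = files_per_second(FILES, DESKTOP_SECONDS)
base_rate = files_per_second(FILES, BASE_SECONDS)
handshake_rate = files_per_second(FILES, HANDSHAKE_SECONDS)

print(f".Desktop rebuild:        {desktop_rate:.0f} files/s")
print(f"Base UB64 full-text:     {base_rate:.2f} files/s")
print(f"PDF HandShake full-text: {handshake_rate:.2f} files/s")
print(f".Desktop vs. Base full-text: {desktop_rate / base_rate:.0f}x")
print(f"PDF HandShake vs. Base:      {handshake_rate / base_rate:.1f}x")
```

The rates in the table are rounded, so ratios computed from the rounded rates (for example 6 / 2) differ slightly from ratios computed from the measured times.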

Please note:
The desktop database rebuild performance depends heavily on the seek performance and the transaction rate of the disk storage system. It is very slow on RAID systems, much faster on single disks, and fastest on SSDs.
The Spotlight database rebuild performance depends heavily on the number of CPUs available, because extracting text and metadata from images, Office, and PDF files is highly CPU intensive. Indexing spawns as many parallel processes as the server has CPUs, so doubling the number of CPU cores roughly doubles the indexing speed. The number of CPUs used can be limited via the MaxProc Index Server preference.
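The scaling rule for the Spotlight database rebuild can be turned into a rough planning estimate. The sketch below assumes ideal linear scaling with the number of worker processes and assumes that the measured run used four processes on the quad-core server; real results will also depend on disk performance. The function and its `max_proc` parameter are hypothetical and only model the effect of a process limit; they do not show the actual syntax of the MaxProc preference.

```python
def estimate_rebuild_minutes(measured_minutes, measured_cores, target_cores, max_proc=None):
    """Estimate full-text rebuild time on another server.

    Assumes ideal linear scaling with the number of worker processes.
    max_proc is a hypothetical cap that models limiting the number of
    processes; None means one process per CPU core.
    """
    workers = target_cores if max_proc is None else min(target_cores, max_proc)
    if workers < 1:
        raise ValueError("at least one worker process is required")
    # Total work in core-minutes, divided among the available workers.
    return measured_minutes * measured_cores / workers


# 32 minutes measured with Base UB64 on four cores (assumption: four processes).
print(estimate_rebuild_minutes(32, 4, 8))              # eight cores, no limit
print(estimate_rebuild_minutes(32, 4, 8, max_proc=2))  # eight cores, limited to two processes
```

Under these assumptions, the same volume would take about 16 minutes on eight cores and about 64 minutes if indexing were limited to two processes.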


References

Index Server manual:
www.helios.de/support/manuals/indexsrvUB2-e

New Features in HELIOS Base UB64:
www.helios.de/web/EN/products/New/UB64/base.html