Table of Contents

Name

tesseract - command-line OCR engine

Synopsis

tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...

Description

tesseract(1) is a commercial-quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.

In/Out Arguments

FILE

The name of the input file. This can either be an image file or a text file.

Most image file formats (anything readable by Leptonica) are supported.

A text file lists the names of all input images (one image name per line). The results will be combined in a single file for each output file format (txt, pdf, hocr, xml).

If FILE is stdin or - then the standard input is used.
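The list-file form of FILE described above can be produced with a few lines of Python. This is a minimal sketch; the helper name write_image_list and the file names in the usage note are hypothetical and not part of Tesseract.

```python
from pathlib import Path


def write_image_list(image_paths, list_path):
    """Write one image name per line, the list-file layout described for FILE."""
    lines = [str(path) for path in image_paths]
    # A trailing newline keeps the last entry a complete line.
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return str(list_path)
```

For example, write_image_list(["page1.png", "page2.png"], "images.txt") creates a file that could then be passed as FILE.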

OUTPUTBASE

The basename of the output file (to which the appropriate extension will be appended). By default the output is a text file with .txt added to the basename, unless one or more parameters are set that explicitly specify the desired output.

If OUTPUTBASE is stdout or - then the standard output is used.
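As an illustration of the argument order given in the synopsis, the following Python sketch assembles a command line and, optionally, runs it. The helper names build_command and ocr_to_string are hypothetical, and running the second one assumes a tesseract binary is available on PATH.

```python
import subprocess


def build_command(input_file, output_base, options=(), configs=()):
    """Return argv in synopsis order: tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]..."""
    return ["tesseract", input_file, output_base, *options, *configs]


def ocr_to_string(image_path):
    """Run tesseract with OUTPUTBASE set to stdout and return the captured text.

    Assumes tesseract is installed; raises CalledProcessError on a non-zero exit.
    """
    cmd = build_command(image_path, "stdout")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

build_command("-", "-") yields a command that reads the image from standard input and writes the result to standard output.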

Options

-c CONFIGVAR=VALUE

Set value for parameter CONFIGVAR to VALUE. Multiple -c arguments are allowed.
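Because -c may be repeated, a mapping of parameters can be expanded into one -c argument per entry. The sketch below is illustrative only: config_var_args is a hypothetical helper, and the parameter names in the usage note are examples that do not come from this section.

```python
def config_var_args(variables):
    """Expand a {CONFIGVAR: VALUE} mapping into repeated '-c CONFIGVAR=VALUE' arguments."""
    args = []
    for name, value in variables.items():
        # Each parameter gets its own -c flag, as the option description allows.
        args += ["-c", f"{name}={value}"]
    return args
```

For example, config_var_args({"tessedit_char_whitelist": "0123456789", "preserve_interword_spaces": 1}) produces four arguments: two -c flags, each followed by its NAME=VALUE pair.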

--dpi N

Specify the resolution N in DPI for the input image(s). A typical value for N is 300. Without this option, the resolution is read from the metadata included in the image. If an image does not include that information, Tesseract tries to guess it.

-l LANG, -l SCRIPT

The language or script to use. If none is specified, eng (English) is assumed. Multiple languages may be specified, separated by plus characters. Tesseract uses 3-character ISO 639-2 language codes (see LANGUAGES AND SCRIPTS).
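The plus-separated form can be built from a list of codes. This is a small sketch with a hypothetical helper name, language_args; the codes eng and deu are used only as examples and must match language data that is actually installed.

```python
def language_args(languages):
    """Return ['-l', 'a+b+...'] for the given codes, or [] to keep the default (eng)."""
    if not languages:
        return []
    # Multiple languages or scripts are joined with plus characters.
    return ["-l", "+".join(languages)]
```

language_args(["eng", "deu"]) gives ["-l", "eng+deu"].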

--psm N

Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD or OCR. (not implemented)
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
11 = Sparse text. Find as much text as possible in no particular order.
12 = Sparse text with OSD.
13 = Raw line. Treat the image as a single text line,
     bypassing hacks that are Tesseract-specific.
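The numeric modes above are easier to read in calling code when given names. The following sketch mirrors the list with a Python enumeration; the member names and the psm_args helper are illustrative choices, not identifiers defined by this manual page.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Names for the --psm values listed above (names are illustrative)."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2            # listed as not implemented
    AUTO = 3                 # default
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10
    SPARSE_TEXT = 11
    SPARSE_TEXT_OSD = 12
    RAW_LINE = 13


def psm_args(mode):
    """Validate a page segmentation mode and return ['--psm', 'N']."""
    mode = PageSegMode(mode)  # raises ValueError for values outside 0..13
    if mode is PageSegMode.AUTO_ONLY:
        raise ValueError("--psm 2 is listed as not implemented")
    return ["--psm", str(int(mode))]
```

psm_args(PageSegMode.SINGLE_LINE) returns ["--psm", "7"], which suits an image containing a single line of text.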

--oem N

Specify OCR Engine mode. The options for N are:

0 = Original Tesseract only.
1 = Neural nets LSTM only.
2 = Tesseract + LSTM.
3 = Default, based on what is available.
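A caller can check the engine mode against the four values listed above before passing it on. This is a sketch only; OEM_MODES and oem_args are hypothetical names.

```python
# Descriptions copied from the --oem list above.
OEM_MODES = {
    0: "Original Tesseract only",
    1: "Neural nets LSTM only",
    2: "Tesseract + LSTM",
    3: "Default, based on what is available",
}


def oem_args(mode=3):
    """Return ['--oem', 'N'] after checking that N is one of the listed modes."""
    if mode not in OEM_MODES:
        raise ValueError(f"unknown OCR engine mode: {mode!r}")
    return ["--oem", str(mode)]
```

Whether a given mode works in practice depends on the installed language data, which this check does not inspect.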

--tessdata-dir PATH

Specify the location of the tessdata directory.

--user-patterns FILE

Specify the location of the user patterns file.

--user-words FILE

Specify the location of the user words file.

CONFIGFILE

The name of a config to use. The name can be a file in tessdata/configs or tessdata/tessconfigs, or an absolute or relative file path. A config is a plain text file which contains a list of parameters and their values, one per line, with a space separating parameter from value.
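The config file layout described above (one parameter per line, a space between name and value) is simple enough to generate and read back. The sketch below uses hypothetical helper names, and skipping blank lines is an assumption of this example rather than documented behaviour; the parameter names in the usage note are examples only.

```python
def format_config(params):
    """Render a {name: value} mapping as config text, one 'NAME VALUE' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


def parse_config(text):
    """Parse config text back into a dict of strings, splitting on the first space."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption of this sketch: blank lines are ignored
        name, _, value = line.partition(" ")
        params[name] = value.strip()
    return params
```

format_config({"tessedit_char_whitelist": "0123456789"}) yields a single line that could be saved to a file and named as CONFIGFILE on the command line.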

Interesting config files include:

