tesseract  5.0.0
pdfrenderer.cpp
1 // File: pdfrenderer.cpp
3 // Description: PDF rendering interface to inject into TessBaseAPI
4 //
5 // (C) Copyright 2011, Google Inc.
6 // Licensed under the Apache License, Version 2.0 (the "License");
7 // you may not use this file except in compliance with the License.
8 // You may obtain a copy of the License at
9 // http://www.apache.org/licenses/LICENSE-2.0
10 // Unless required by applicable law or agreed to in writing, software
11 // distributed under the License is distributed on an "AS IS" BASIS,
12 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 // See the License for the specific language governing permissions and
14 // limitations under the License.
15 //
17 
18 // Include automatically generated configuration file if running autoconf.
19 #ifdef HAVE_CONFIG_H
20 # include "config_auto.h"
21 #endif
22 
23 #include "pdf_ttf.h"
24 #include "tprintf.h"
25 
26 #include <allheaders.h>
27 #include <tesseract/baseapi.h>
28 #include <tesseract/renderer.h>
29 #include <cmath>
30 #include <cstring>
31 #include <fstream> // for std::ifstream
32 #include <locale> // for std::locale::classic
33 #include <memory> // std::unique_ptr
34 #include <sstream> // for std::stringstream
35 #include "helpers.h" // for Swap
36 
37 /*
38 
39 Design notes from Ken Sharp, with light editing.
40 
41 We think one solution is a font with a single glyph (.notdef) and a
42 CIDToGIDMap which maps all the CIDs to 0. That map would then be
43 stored as a stream in the PDF file, and when flat compressed should
44 be pretty small. The font, of course, will be approximately the same
45 size as the one you currently use.
46 
47 I'm working on such a font now. The CIDToGIDMap is trivial: you just
48 create a stream object which contains 128k bytes (2 bytes per possible
49 CID and your CIDs range from 0 to 65535) and where you currently have
50 "/CIDToGIDMap /Identity" you would have "/CIDToGIDMap <object> 0 R".
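
As a rough illustration of the stream described above, the following
sketch builds the uncompressed table with every CID pointing at one
glyph. The helper name MakeConstantCidToGidMap is hypothetical and is
not part of Tesseract; the only fact assumed from the PDF format is
that each entry is stored high-order byte first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds an uncompressed CIDToGIDMap in which every
// CID (0..65535) maps to the same glyph ID. Each entry takes 2 bytes,
// high byte first, so the table is 2 * 65536 = 131072 bytes long.
std::vector<std::uint8_t> MakeConstantCidToGidMap(std::uint16_t gid) {
  const std::size_t kNumCids = 1u << 16;
  std::vector<std::uint8_t> map(2 * kNumCids);
  for (std::size_t cid = 0; cid < kNumCids; ++cid) {
    map[2 * cid] = static_cast<std::uint8_t>(gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<std::uint8_t>(gid & 0xFF);  // low byte
  }
  return map;
}
```

Because the table is one two-byte pattern repeated 65536 times, it
compresses very well with Flate, which is why the stream stays small.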
51 
52 Note that if, in future, you were to use a different (i.e. not 2 byte)
53 CMap for character codes you could trivially extend the CIDToGIDMap.
54 
55 The following is an explanation of how some of the font stuff works.
56 This may be too simple for you, in which case please accept my
57 apologies; it's hard to know how much knowledge someone has. You can
58 skip all this anyway, it's just for information.
59 
60 The font embedded in a PDF file is usually intended just to be
61 rendered, but extensions allow for at least some ability to locate (or
62 copy) text from a document. This isn't something which was an original
63 goal of the PDF format, but it's been retro-fitted, presumably due to
64 popular demand.
65 
66 To do this reliably the PDF file must contain a ToUnicode CMap, a
67 device for mapping character codes to Unicode code points. If one of
68 these is present, then this will be used to convert the character
69 codes into Unicode values. If it's not present then the reader will
70 fall back through a series of heuristics to try and guess the
71 result. This is, as you would expect, prone to failure.
72 
73 This doesn't concern you of course, since you always write a ToUnicode
74 CMap. Because you are writing the text in text rendering mode 3, it
75 would seem that you don't really need to worry about this. But in the
76 PDF spec you cannot have an isolated ToUnicode CMap; it has to be
77 attached to a font, so in order to get even copy/paste to work you
78 need to define a font.
79 
80 This is what leads to problems: tools like pdfwrite assume that they
81 are going to be able to (or even have to) modify the font entries, so
82 they require that the font being embedded be valid, and to be honest
83 the font Tesseract embeds isn't valid (for this purpose).
84 
85 
86 To see why, let's look at how text is specified in a PDF file:
87 
88 (Test) Tj
89 
90 Now that looks like text but actually it isn't. Each of those bytes is
91 a 'character code'. When it comes to rendering the text a complex
92 sequence of events takes place, which converts the character code into
93 'something' which the font understands. It's entirely possible via
94 character mappings to have that text render as 'Sftu'.
95 
96 For simple fonts (PostScript type 1), we use the character code as the
97 index into an Encoding array (256 elements), each element of which is
98 a glyph name, so this gives us a glyph name. We then consult the
99 CharStrings dictionary in the font, that's a complex object which
100 contains pairs of keys and values, you can use the key to retrieve a
101 given value. So we have a glyph name, we then use that as the key to
102 the dictionary and retrieve the associated value. For a type 1 font,
103 the value is a glyph program that describes how to draw the glyph.
104 
105 For CIDFonts, it's a little more complicated. Because CIDFonts can be
106 large, using a glyph name as the key is unreasonable (it would also
107 lead to unfeasibly large Encoding arrays), so instead we use a 'CID'
108 as the key. CIDs are just numbers.
109 
110 But.... We don't use the character code as the CID. What we do is use
111 a CMap to convert the character code into a CID. We then use the CID
112 to key the CharStrings dictionary and proceed as before. So the 'CMap'
113 is the equivalent of the Encoding array, but it's a more compact and
114 flexible representation.
115 
116 Note that you have to use the CMap just to find out how many bytes
117 constitute a character code, and it can be variable. For example you
118 can say if the first byte is 0x00->0x7f then it's just one byte, if it's
119 0x80->0xf0 then it's 2 bytes and if it's 0xf0->0xff then it's 3 bytes. I
120 have seen CMaps defining character codes up to 5 bytes wide.
121 
122 Now that's fine for 'PostScript' CIDFonts, but it's not sufficient for
123 TrueType CIDFonts. The thing is that TrueType fonts are accessed using
124 a Glyph ID (GID) (and the LOCA table) which may well not be anything
125 like the CID. So for this case PDF includes a CIDToGIDMap. That maps
126 the CIDs to GIDs, and we can then use the GID to get the glyph
127 description from the GLYF table of the font.
128 
129 So for a TrueType CIDFont, character-code->CID->GID->glyf-program.
130 
131 Looking at the PDF file I was supplied with we see that it contains
132 text like :
133 
134 <0x0075> Tj
135 
136 So we start by taking the character code (117) and look it up in the
137 CMap. Well you don't supply a CMap, you just use the Identity-H one
138 which is predefined. So character code 117 maps to CID 117. Then we
139 use the CIDToGIDMap, again you don't supply one, you just use the
140 predefined 'Identity' map. So CID 117 maps to GID 117. But the font we
141 were supplied with only contains 116 glyphs.
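
The failure walked through above can be sketched in a few lines. This
is a hypothetical stand-alone illustration, not code from Tesseract or
from any PDF consumer; IdentityLookup is an invented name, and both
mappings are assumed to be the predefined identity ones.

```cpp
#include <cstdint>

// Hypothetical walk through the lookup chain for a TrueType CIDFont that
// uses the predefined Identity-H CMap and the predefined Identity
// CIDToGIDMap. Returns the glyph ID, or -1 if the font has no such glyph.
int IdentityLookup(std::uint16_t char_code, int num_glyphs) {
  const int cid = char_code;  // Identity-H: the CID equals the character code
  const int gid = cid;        // Identity map: the GID equals the CID
  if (gid >= num_glyphs) {
    return -1;  // the GID falls outside the font's glyph table
  }
  return gid;
}
```

With a 116-glyph font, IdentityLookup(0x0075, 116) reports a missing
glyph, because valid GIDs only run from 0 to 115.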
142 
143 Now for Latin that's not a huge problem, you can just supply a bigger
144 font. But for more complex languages that *is* going to be more of a
145 problem. Either you need to supply a font which contains glyphs for
146 all the possible CID->GID mappings, or we need to think laterally.
147 
148 Our solution using a TrueType CIDFont is to intervene at the
149 CIDToGIDMap stage and convert all the CIDs to GID 0. Then we have a
150 font with just one glyph, the .notdef glyph at GID 0. This is what I'm
151 looking into now.
152 
153 It would also be possible to have a 'PostScript' (i.e. type 1 outlines)
154 CIDFont which contained 1 glyph, and a CMap which mapped all character
155 codes to CID 0. The effect would be the same.
156 
157 It's possible (I haven't checked) that the PostScript CIDFont and
158 associated CMap would be smaller than the TrueType font and associated
159 CIDToGIDMap.
160 
161 --- in a followup ---
162 
163 OK there is a small problem there, if I use GID 0 then Acrobat gets
164 upset about it and complains it cannot extract the font. If I set the
165 CIDToGIDMap so that all the entries are 1 instead, it's happy. Totally
166 mad......
167 
168 */
169 
170 namespace tesseract {
171 
172 // If the font is 10 pts, nominal character width is 5 pts
173 static const int kCharWidth = 2;
174 
175 // Used for memory allocation. A codepoint must take no more than this
176 // many bytes, when written in the PDF way. e.g. "<0063>" for the
177 // letter 'c'
178 static const int kMaxBytesPerCodepoint = 20;
179 
180 /**********************************************************************
181  * PDF Renderer interface implementation
182  **********************************************************************/
183 TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir, bool textonly)
184  : TessResultRenderer(outputbase, "pdf"), datadir_(datadir) {
185  obj_ = 0;
186  textonly_ = textonly;
187  offsets_.push_back(0);
188 }
189 
190 void TessPDFRenderer::AppendPDFObjectDIY(size_t objectsize) {
191  offsets_.push_back(objectsize + offsets_.back());
192  obj_++;
193 }
194 
195 void TessPDFRenderer::AppendPDFObject(const char *data) {
196  AppendPDFObjectDIY(strlen(data));
197  AppendString(data);
198 }
199 
200 // Helper function to prevent us from accidentally writing
201 // scientific notation to an HOCR or PDF file. Besides, three
202 // decimal places are all you really need.
203 static double prec(double x) {
204  double kPrecision = 1000.0;
205  double a = round(x * kPrecision) / kPrecision;
206  if (a == -0) {
207  return 0;
208  }
209  return a;
210 }
211 
212 static long dist2(int x1, int y1, int x2, int y2) {
213  return (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1);
214 }
215 
216 // Viewers like evince can get really confused during copy-paste when
217 // the baseline wanders around. So I've decided to project every word
218 // onto the (straight) line baseline. All numbers are in the native
219 // PDF coordinate system, which has the origin in the bottom left and
220 // the unit is points, which is 1/72 inch. Tesseract reports baselines
221 // left-to-right no matter what the reading order is. We need the
222 // word baseline in reading order, so we do that conversion here. Returns
223 // the word's baseline origin and length.
224 static void GetWordBaseline(int writing_direction, int ppi, int height, int word_x1, int word_y1,
225  int word_x2, int word_y2, int line_x1, int line_y1, int line_x2,
226  int line_y2, double *x0, double *y0, double *length) {
227  if (writing_direction == WRITING_DIRECTION_RIGHT_TO_LEFT) {
228  std::swap(word_x1, word_x2);
229  std::swap(word_y1, word_y2);
230  }
231  double word_length;
232  double x, y;
233  {
234  int px = word_x1;
235  int py = word_y1;
236  double l2 = dist2(line_x1, line_y1, line_x2, line_y2);
237  if (l2 == 0) {
238  x = line_x1;
239  y = line_y1;
240  } else {
241  double t = ((px - line_x2) * (line_x2 - line_x1) + (py - line_y2) * (line_y2 - line_y1)) / l2;
242  x = line_x2 + t * (line_x2 - line_x1);
243  y = line_y2 + t * (line_y2 - line_y1);
244  }
245  word_length = sqrt(static_cast<double>(dist2(word_x1, word_y1, word_x2, word_y2)));
246  word_length = word_length * 72.0 / ppi;
247  x = x * 72 / ppi;
248  y = height - (y * 72.0 / ppi);
249  }
250  *x0 = x;
251  *y0 = y;
252  *length = word_length;
253 }
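
/*
A minimal sketch of the two geometric steps used above, written as
stand-alone helpers. Point, ProjectOntoLine and PixelsToPdf are
hypothetical names, not Tesseract API. The sketch parameterizes the line
from its first endpoint, whereas the function above starts from
(line_x2, line_y2); the projected point is the same either way.

```cpp
struct Point {
  double x;
  double y;
};

// Hypothetical helper: orthogonally projects p onto the line through a
// and b. If a and b coincide, the line is degenerate and a is returned.
Point ProjectOntoLine(Point p, Point a, Point b) {
  const double dx = b.x - a.x;
  const double dy = b.y - a.y;
  const double len2 = dx * dx + dy * dy;
  if (len2 == 0.0) {
    return a;
  }
  const double t = ((p.x - a.x) * dx + (p.y - a.y) * dy) / len2;
  return {a.x + t * dx, a.y + t * dy};
}

// Converts a pixel position to PDF points (1/72 inch) and flips the y
// axis: image rows grow downward, PDF's origin is at the bottom left.
Point PixelsToPdf(Point p, double ppi, double page_height_pts) {
  return {p.x * 72.0 / ppi, page_height_pts - p.y * 72.0 / ppi};
}
```

For a 300 ppi scan of a page 792 points tall, the pixel (300, 300) lands
at (72, 720) in PDF coordinates.
*/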
254 
255 // Compute coefficients for an affine matrix describing the rotation
256 // of the text. If the text is right-to-left such as Arabic or Hebrew,
257 // we reflect over the Y-axis. This matrix will set the coordinate
258 // system for placing text in the PDF file.
259 //
260 // RTL
261 // [ x' ] = [ a b ][ x ] = [-1 0 ] [ cos sin ][ x ]
262 // [ y' ] [ c d ][ y ] [ 0 1 ] [-sin cos ][ y ]
263 static void AffineMatrix(int writing_direction, int line_x1, int line_y1, int line_x2, int line_y2,
264  double *a, double *b, double *c, double *d) {
265  double theta =
266  atan2(static_cast<double>(line_y1 - line_y2), static_cast<double>(line_x2 - line_x1));
267  *a = cos(theta);
268  *b = sin(theta);
269  *c = -sin(theta);
270  *d = cos(theta);
271  switch (writing_direction) {
272  case WRITING_DIRECTION_RIGHT_TO_LEFT:
273  *a = -*a;
274  *b = -*b;
275  break;
276  case WRITING_DIRECTION_TOP_TO_BOTTOM:
277  // TODO(jbreiden) Consider using the vertical PDF writing mode.
278  break;
279  default:
280  break;
281  }
282 }
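
/*
As a sketch of the matrix described in the comment above, the helper
below returns the four coefficients by value. TextMatrix and
BaselineMatrix are hypothetical names; a plain bool stands in for the
writing-direction enum.

```cpp
#include <cmath>

struct TextMatrix {
  double a, b, c, d;
};

// Hypothetical helper: rotation matrix for a baseline running from
// (x1, y1) to (x2, y2) in image coordinates, optionally reflected for
// right-to-left text.
TextMatrix BaselineMatrix(int x1, int y1, int x2, int y2, bool right_to_left) {
  // Image y grows downward, so an upward-sloping baseline has y1 > y2.
  const double theta =
      std::atan2(static_cast<double>(y1 - y2), static_cast<double>(x2 - x1));
  TextMatrix m{std::cos(theta), std::sin(theta), -std::sin(theta), std::cos(theta)};
  if (right_to_left) {
    m.a = -m.a;  // reflect the first row so that text advances leftward
    m.b = -m.b;
  }
  return m;
}
```

A perfectly horizontal left-to-right baseline yields the identity
matrix; the same baseline for right-to-left text only flips the sign of a.
*/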
283 
284 // There are some really awkward PDF viewers in the wild, such as
285 // 'Preview' which ships with the Mac. They do a better job with text
286 // selection and highlighting when given perfectly flat baseline
287 // instead of very slightly tilted. We clip small tilts to appease
288 // these viewers. I chose this threshold large enough to absorb noise,
289 // but small enough that lines probably won't cross each other if the
290 // whole page is tilted at almost exactly the clipping threshold.
291 static void ClipBaseline(int ppi, int x1, int y1, int x2, int y2, int *line_x1, int *line_y1,
292  int *line_x2, int *line_y2) {
293  *line_x1 = x1;
294  *line_y1 = y1;
295  *line_x2 = x2;
296  *line_y2 = y2;
297  int rise = abs(y2 - y1) * 72;
298  int run = abs(x2 - x1) * 72;
299  if (rise < 2 * ppi && 2 * ppi < run) {
300  *line_y1 = *line_y2 = (y1 + y2) / 2;
301  }
302 }
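
/*
The threshold test above reads more easily once the units are spelled
out: a baseline is flattened when it drifts vertically by less than
2 points while spanning more than 2 points horizontally. The predicate
below is a hypothetical stand-alone restatement of that condition, not a
function that exists in Tesseract.

```cpp
#include <cstdlib>

// Hypothetical predicate: true when the baseline from (x1, y1) to
// (x2, y2), given in pixels, is tilted so slightly that it would be
// snapped to horizontal. Both sides are scaled by 72 to stay in integers.
bool ShouldFlatten(int ppi, int x1, int y1, int x2, int y2) {
  const int rise = std::abs(y2 - y1) * 72;  // vertical drift, in pixels * 72
  const int run = std::abs(x2 - x1) * 72;   // horizontal span, in pixels * 72
  return rise < 2 * ppi && 2 * ppi < run;
}
```

At 300 ppi, 2 points is a little over 8 pixels, so a 1000-pixel line
that drifts by 5 pixels is flattened while one that drifts by 20 is not.
*/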
303 
304 static bool CodepointToUtf16be(int code, char utf16[kMaxBytesPerCodepoint]) {
305  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
306  tprintf("Dropping invalid codepoint %d\n", code);
307  return false;
308  }
309  if (code < 0x10000) {
310  snprintf(utf16, kMaxBytesPerCodepoint, "%04X", code);
311  } else {
312  int a = code - 0x010000;
313  int high_surrogate = (0x03FF & (a >> 10)) + 0xD800;
314  int low_surrogate = (0x03FF & a) + 0xDC00;
315  snprintf(utf16, kMaxBytesPerCodepoint, "%04X%04X", high_surrogate, low_surrogate);
316  }
317  return true;
318 }
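
/*
To make the surrogate arithmetic above concrete, here is a stand-alone
variant that returns a std::string instead of filling a caller-supplied
buffer. Utf16BeHex is a hypothetical name used only for this sketch.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-alone variant of the encoding step: returns the
// UTF-16BE hex digits for one code point, or an empty string if the code
// point is a surrogate or lies beyond U+10FFFF.
std::string Utf16BeHex(std::uint32_t code) {
  if ((code > 0xD7FF && code < 0xE000) || code > 0x10FFFF) {
    return "";
  }
  char buf[16];
  if (code < 0x10000) {
    std::snprintf(buf, sizeof(buf), "%04X", static_cast<unsigned>(code));
  } else {
    const std::uint32_t a = code - 0x10000;              // 20 significant bits
    const unsigned high = 0xD800 + ((a >> 10) & 0x3FF);  // top 10 bits
    const unsigned low = 0xDC00 + (a & 0x3FF);           // bottom 10 bits
    std::snprintf(buf, sizeof(buf), "%04X%04X", high, low);
  }
  return buf;
}
```

The letter 'c' (U+0063) becomes "0063", while U+1F600, which lies
outside the Basic Multilingual Plane, becomes the pair "D83DDE00".
*/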
319 
320 char *TessPDFRenderer::GetPDFTextObjects(TessBaseAPI *api, double width, double height) {
321  double ppi = api->GetSourceYResolution();
322 
323  // These initial conditions are all arbitrary and will be overwritten
324  double old_x = 0.0, old_y = 0.0;
325  int old_fontsize = 0;
326  tesseract::WritingDirection old_writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
327  bool new_block = true;
328  int fontsize = 0;
329  double a = 1;
330  double b = 0;
331  double c = 0;
332  double d = 1;
333 
334  std::stringstream pdf_str;
335  // Use "C" locale (needed for double values prec()).
336  pdf_str.imbue(std::locale::classic());
337  // Use 8 digits for double values.
338  pdf_str.precision(8);
339 
340  // TODO(jbreiden) This marries the text and image together.
341  // Slightly cleaner from an abstraction standpoint if this were to
342  // live inside a separate text object.
343  pdf_str << "q " << prec(width) << " 0 0 " << prec(height) << " 0 0 cm";
344  if (!textonly_) {
345  pdf_str << " /Im1 Do";
346  }
347  pdf_str << " Q\n";
348 
349  int line_x1 = 0;
350  int line_y1 = 0;
351  int line_x2 = 0;
352  int line_y2 = 0;
353 
354  const std::unique_ptr</*non-const*/ ResultIterator> res_it(api->GetIterator());
355  while (!res_it->Empty(RIL_BLOCK)) {
356  if (res_it->IsAtBeginningOf(RIL_BLOCK)) {
357  pdf_str << "BT\n3 Tr"; // Begin text object, use invisible ink
358  old_fontsize = 0; // Every block will declare its fontsize
359  new_block = true; // Every block will declare its affine matrix
360  }
361 
362  if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
363  int x1, y1, x2, y2;
364  res_it->Baseline(RIL_TEXTLINE, &x1, &y1, &x2, &y2);
365  ClipBaseline(ppi, x1, y1, x2, y2, &line_x1, &line_y1, &line_x2, &line_y2);
366  }
367 
368  if (res_it->Empty(RIL_WORD)) {
369  res_it->Next(RIL_WORD);
370  continue;
371  }
372 
373  // Writing direction changes at a per-word granularity
374  tesseract::WritingDirection writing_direction;
375  {
376  tesseract::Orientation orientation;
377  tesseract::TextlineOrder textline_order;
378  float deskew_angle;
379  res_it->Orientation(&orientation, &writing_direction, &textline_order, &deskew_angle);
380  if (writing_direction != WRITING_DIRECTION_TOP_TO_BOTTOM) {
381  switch (res_it->WordDirection()) {
382  case DIR_LEFT_TO_RIGHT:
383  writing_direction = WRITING_DIRECTION_LEFT_TO_RIGHT;
384  break;
385  case DIR_RIGHT_TO_LEFT:
386  writing_direction = WRITING_DIRECTION_RIGHT_TO_LEFT;
387  break;
388  default:
389  writing_direction = old_writing_direction;
390  }
391  }
392  }
393 
394  // Where is word origin and how long is it?
395  double x, y, word_length;
396  {
397  int word_x1, word_y1, word_x2, word_y2;
398  res_it->Baseline(RIL_WORD, &word_x1, &word_y1, &word_x2, &word_y2);
399  GetWordBaseline(writing_direction, ppi, height, word_x1, word_y1, word_x2, word_y2, line_x1,
400  line_y1, line_x2, line_y2, &x, &y, &word_length);
401  }
402 
403  if (writing_direction != old_writing_direction || new_block) {
404  AffineMatrix(writing_direction, line_x1, line_y1, line_x2, line_y2, &a, &b, &c, &d);
405  pdf_str << " " << prec(a) // . This affine matrix
406  << " " << prec(b) // . sets the coordinate
407  << " " << prec(c) // . system for all
408  << " " << prec(d) // . text that follows.
409  << " " << prec(x) // .
410  << " " << prec(y) // .
411  << (" Tm "); // Place cursor absolutely
412  new_block = false;
413  } else {
414  double dx = x - old_x;
415  double dy = y - old_y;
416  pdf_str << " " << prec(dx * a + dy * b) << " " << prec(dx * c + dy * d)
417  << (" Td "); // Relative moveto
418  }
419  old_x = x;
420  old_y = y;
421  old_writing_direction = writing_direction;
422 
423  // Adjust font size on a per word granularity. Pay attention to
424  // fontsize, old_fontsize, and pdf_str. We've found that for
425  // Arabic, Tesseract will happily return a fontsize of zero,
426  // so we make up a default number to protect ourselves.
427  {
428  bool bold, italic, underlined, monospace, serif, smallcaps;
429  int font_id;
430  res_it->WordFontAttributes(&bold, &italic, &underlined, &monospace, &serif, &smallcaps,
431  &fontsize, &font_id);
432  const int kDefaultFontsize = 8;
433  if (fontsize <= 0) {
434  fontsize = kDefaultFontsize;
435  }
436  if (fontsize != old_fontsize) {
437  pdf_str << "/f-0-0 " << fontsize << " Tf ";
438  old_fontsize = fontsize;
439  }
440  }
441 
442  bool last_word_in_line = res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD);
443  bool last_word_in_block = res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD);
444  std::string pdf_word;
445  int pdf_word_len = 0;
446  do {
447  const std::unique_ptr<const char[]> grapheme(res_it->GetUTF8Text(RIL_SYMBOL));
448  if (grapheme && grapheme[0] != '\0') {
449  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(grapheme.get());
450  char utf16[kMaxBytesPerCodepoint];
451  for (char32 code : unicodes) {
452  if (CodepointToUtf16be(code, utf16)) {
453  pdf_word += utf16;
454  pdf_word_len++;
455  }
456  }
457  }
458  res_it->Next(RIL_SYMBOL);
459  } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
460  if (res_it->IsAtBeginningOf(RIL_WORD)) {
461  pdf_word += "0020";
462  pdf_word_len++;
463  }
464  if (word_length > 0 && pdf_word_len > 0) {
465  double h_stretch = kCharWidth * prec(100.0 * word_length / (fontsize * pdf_word_len));
466  pdf_str << h_stretch << " Tz" // horizontal stretch
467  << " [ <" << pdf_word // UTF-16BE representation
468  << "> ] TJ"; // show the text
469  }
470  if (last_word_in_line) {
471  pdf_str << " \n";
472  }
473  if (last_word_in_block) {
474  pdf_str << "ET\n"; // end the text object
475  }
476  }
477  const std::string &text = pdf_str.str();
478  char *result = new char[text.length() + 1];
479  strcpy(result, text.c_str());
480  return result;
481 }
482 
483 bool TessPDFRenderer::BeginDocumentHandler() {
484  AppendPDFObject("%PDF-1.5\n%\xDE\xAD\xBE\xEB\n");
485 
486  // CATALOG
487  AppendPDFObject(
488  "1 0 obj\n"
489  "<<\n"
490  " /Type /Catalog\n"
491  " /Pages 2 0 R\n"
492  ">>\nendobj\n");
493 
494  // We are reserving object #2 for the /Pages
495  // object, which I am going to create and write
496  // at the end of the PDF file.
497  AppendPDFObject("");
498 
499  // TYPE0 FONT
500  AppendPDFObject(
501  "3 0 obj\n"
502  "<<\n"
503  " /BaseFont /GlyphLessFont\n"
504  " /DescendantFonts [ 4 0 R ]\n" // CIDFontType2 font
505  " /Encoding /Identity-H\n"
506  " /Subtype /Type0\n"
507  " /ToUnicode 6 0 R\n" // ToUnicode
508  " /Type /Font\n"
509  ">>\n"
510  "endobj\n");
511 
512  // CIDFONTTYPE2
513  std::stringstream stream;
514  // Use "C" locale (needed for int values larger than 999).
515  stream.imbue(std::locale::classic());
516  stream << "4 0 obj\n"
517  "<<\n"
518  " /BaseFont /GlyphLessFont\n"
519  " /CIDToGIDMap 5 0 R\n" // CIDToGIDMap
520  " /CIDSystemInfo\n"
521  " <<\n"
522  " /Ordering (Identity)\n"
523  " /Registry (Adobe)\n"
524  " /Supplement 0\n"
525  " >>\n"
526  " /FontDescriptor 7 0 R\n" // Font descriptor
527  " /Subtype /CIDFontType2\n"
528  " /Type /Font\n"
529  " /DW "
530  << (1000 / kCharWidth)
531  << "\n"
532  ">>\n"
533  "endobj\n";
534  AppendPDFObject(stream.str().c_str());
535 
536  // CIDTOGIDMAP
537  const int kCIDToGIDMapSize = 2 * (1 << 16);
538  const std::unique_ptr<unsigned char[]> cidtogidmap(new unsigned char[kCIDToGIDMapSize]);
539  for (int i = 0; i < kCIDToGIDMapSize; i++) {
540  cidtogidmap[i] = (i % 2) ? 1 : 0;
541  }
542  size_t len;
543  unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
544  stream.str("");
545  stream << "5 0 obj\n"
546  "<<\n"
547  " /Length "
548  << len
549  << " /Filter /FlateDecode\n"
550  ">>\n"
551  "stream\n";
552  AppendString(stream.str().c_str());
553  long objsize = stream.str().size();
554  AppendData(reinterpret_cast<char *>(comp), len);
555  objsize += len;
556  lept_free(comp);
557  const char *endstream_endobj =
558  "endstream\n"
559  "endobj\n";
560  AppendString(endstream_endobj);
561  objsize += strlen(endstream_endobj);
562  AppendPDFObjectDIY(objsize);
563 
564  const char stream2[] =
565  "/CIDInit /ProcSet findresource begin\n"
566  "12 dict begin\n"
567  "begincmap\n"
568  "/CIDSystemInfo\n"
569  "<<\n"
570  " /Registry (Adobe)\n"
571  " /Ordering (UCS)\n"
572  " /Supplement 0\n"
573  ">> def\n"
574  "/CMapName /Adobe-Identify-UCS def\n"
575  "/CMapType 2 def\n"
576  "1 begincodespacerange\n"
577  "<0000> <FFFF>\n"
578  "endcodespacerange\n"
579  "1 beginbfrange\n"
580  "<0000> <FFFF> <0000>\n"
581  "endbfrange\n"
582  "endcmap\n"
583  "CMapName currentdict /CMap defineresource pop\n"
584  "end\n"
585  "end\n";
586 
587  // TOUNICODE
588  stream.str("");
589  stream << "6 0 obj\n"
590  "<< /Length "
591  << (sizeof(stream2) - 1)
592  << " >>\n"
593  "stream\n"
594  << stream2
595  << "endstream\n"
596  "endobj\n";
597  AppendPDFObject(stream.str().c_str());
598 
599  // FONT DESCRIPTOR
600  stream.str("");
601  stream << "7 0 obj\n"
602  "<<\n"
603  " /Ascent 1000\n"
604  " /CapHeight 1000\n"
605  " /Descent -1\n" // Spec says must be negative
606  " /Flags 5\n" // FixedPitch + Symbolic
607  " /FontBBox [ 0 0 "
608  << (1000 / kCharWidth)
609  << " 1000 ]\n"
610  " /FontFile2 8 0 R\n"
611  " /FontName /GlyphLessFont\n"
612  " /ItalicAngle 0\n"
613  " /StemV 80\n"
614  " /Type /FontDescriptor\n"
615  ">>\n"
616  "endobj\n";
617  AppendPDFObject(stream.str().c_str());
618 
619  stream.str("");
620  stream << datadir_.c_str() << "/pdf.ttf";
621  const uint8_t *font;
622  std::ifstream input(stream.str().c_str(), std::ios::in | std::ios::binary);
623  std::vector<unsigned char> buffer(std::istreambuf_iterator<char>(input), {});
624  auto size = buffer.size();
625  if (size) {
626  font = buffer.data();
627  } else {
628 #if !defined(NDEBUG)
629  tprintf("Cannot open file \"%s\"!\nUsing internal glyphless font.\n", stream.str().c_str());
630 #endif
631  font = pdf_ttf;
632  size = sizeof(pdf_ttf);
633  }
634 
635  // FONTFILE2
636  stream.str("");
637  stream << "8 0 obj\n"
638  "<<\n"
639  " /Length "
640  << size
641  << "\n"
642  " /Length1 "
643  << size
644  << "\n"
645  ">>\n"
646  "stream\n";
647  AppendString(stream.str().c_str());
648  objsize = stream.str().size();
649  AppendData(reinterpret_cast<const char *>(font), size);
650  objsize += size;
651  AppendString(endstream_endobj);
652  objsize += strlen(endstream_endobj);
653  AppendPDFObjectDIY(objsize);
654  return true;
655 }
656 
657 bool TessPDFRenderer::imageToPDFObj(Pix *pix, const char *filename, long int objnum,
658  char **pdf_object, long int *pdf_object_size,
659  const int jpg_quality) {
660  if (!pdf_object_size || !pdf_object) {
661  return false;
662  }
663  *pdf_object = nullptr;
664  *pdf_object_size = 0;
665  if (!filename && !pix) {
666  return false;
667  }
668 
669  L_Compressed_Data *cid = nullptr;
670 
671  int sad = 0;
672  if (pixGetInputFormat(pix) == IFF_PNG) {
673  sad = pixGenerateCIData(pix, L_FLATE_ENCODE, 0, 0, &cid);
674  }
675  if (!cid) {
676  sad = l_generateCIDataForPdf(filename, pix, jpg_quality, &cid);
677  }
678 
679  if (sad || !cid) {
680  l_CIDataDestroy(&cid);
681  return false;
682  }
683 
684  const char *group4 = "";
685  const char *filter;
686  switch (cid->type) {
687  case L_FLATE_ENCODE:
688  filter = "/FlateDecode";
689  break;
690  case L_JPEG_ENCODE:
691  filter = "/DCTDecode";
692  break;
693  case L_G4_ENCODE:
694  filter = "/CCITTFaxDecode";
695  group4 = " /K -1\n";
696  break;
697  case L_JP2K_ENCODE:
698  filter = "/JPXDecode";
699  break;
700  default:
701  l_CIDataDestroy(&cid);
702  return false;
703  }
704 
705  // Maybe someday we will accept RGBA but today is not that day.
706  // It requires creating an /SMask for the alpha channel.
707  // http://stackoverflow.com/questions/14220221
708  std::stringstream colorspace;
709  // Use "C" locale (needed for int values larger than 999).
710  colorspace.imbue(std::locale::classic());
711  if (cid->ncolors > 0) {
712  colorspace << " /ColorSpace [ /Indexed /DeviceRGB " << (cid->ncolors - 1) << " "
713  << cid->cmapdatahex << " ]\n";
714  } else {
715  switch (cid->spp) {
716  case 1:
717  if (cid->bps == 1 && pixGetInputFormat(pix) == IFF_PNG) {
718  colorspace.str(
719  " /ColorSpace /DeviceGray\n"
720  " /Decode [1 0]\n");
721  } else {
722  colorspace.str(" /ColorSpace /DeviceGray\n");
723  }
724  break;
725  case 3:
726  colorspace.str(" /ColorSpace /DeviceRGB\n");
727  break;
728  default:
729  l_CIDataDestroy(&cid);
730  return false;
731  }
732  }
733 
734  int predictor = (cid->predictor) ? 14 : 1;
735 
736  // IMAGE
737  std::stringstream b1;
738  // Use "C" locale (needed for int values larger than 999).
739  b1.imbue(std::locale::classic());
740  b1 << objnum
741  << " 0 obj\n"
742  "<<\n"
743  " /Length "
744  << cid->nbytescomp
745  << "\n"
746  " /Subtype /Image\n";
747 
748  std::stringstream b2;
749  // Use "C" locale (needed for int values larger than 999).
750  b2.imbue(std::locale::classic());
751  b2 << " /Width " << cid->w
752  << "\n"
753  " /Height "
754  << cid->h
755  << "\n"
756  " /BitsPerComponent "
757  << cid->bps
758  << "\n"
759  " /Filter "
760  << filter
761  << "\n"
762  " /DecodeParms\n"
763  " <<\n"
764  " /Predictor "
765  << predictor
766  << "\n"
767  " /Colors "
768  << cid->spp << "\n"
769  << group4 << " /Columns " << cid->w
770  << "\n"
771  " /BitsPerComponent "
772  << cid->bps
773  << "\n"
774  " >>\n"
775  ">>\n"
776  "stream\n";
777 
778  const char *b3 =
779  "endstream\n"
780  "endobj\n";
781 
782  size_t b1_len = b1.str().size();
783  size_t b2_len = b2.str().size();
784  size_t b3_len = strlen(b3);
785  size_t colorspace_len = colorspace.str().size();
786 
787  *pdf_object_size = b1_len + colorspace_len + b2_len + cid->nbytescomp + b3_len;
788  *pdf_object = new char[*pdf_object_size];
789 
790  char *p = *pdf_object;
791  memcpy(p, b1.str().c_str(), b1_len);
792  p += b1_len;
793  memcpy(p, colorspace.str().c_str(), colorspace_len);
794  p += colorspace_len;
795  memcpy(p, b2.str().c_str(), b2_len);
796  p += b2_len;
797  memcpy(p, cid->datacomp, cid->nbytescomp);
798  p += cid->nbytescomp;
799  memcpy(p, b3, b3_len);
800  l_CIDataDestroy(&cid);
801  return true;
802 }
803 
804 bool TessPDFRenderer::AddImageHandler(TessBaseAPI *api) {
805  Pix *pix = api->GetInputImage();
806  const char *filename = api->GetInputName();
807  int ppi = api->GetSourceYResolution();
808  if (!pix || ppi <= 0) {
809  return false;
810  }
811  double width = pixGetWidth(pix) * 72.0 / ppi;
812  double height = pixGetHeight(pix) * 72.0 / ppi;
813 
814  std::stringstream xobject;
815  // Use "C" locale (needed for int values larger than 999).
816  xobject.imbue(std::locale::classic());
817  if (!textonly_) {
818  xobject << "/XObject << /Im1 " << (obj_ + 2) << " 0 R >>\n";
819  }
820 
821  // PAGE
822  std::stringstream stream;
823  // Use "C" locale (needed for double values width and height).
824  stream.imbue(std::locale::classic());
825  stream.precision(2);
826  stream << std::fixed << obj_
827  << " 0 obj\n"
828  "<<\n"
829  " /Type /Page\n"
830  " /Parent 2 0 R\n" // Pages object
831  " /MediaBox [0 0 "
832  << width << " " << height
833  << "]\n"
834  " /Contents "
835  << (obj_ + 1)
836  << " 0 R\n" // Contents object
837  " /Resources\n"
838  " <<\n"
839  " "
840  << xobject.str() << // Image object
841  " /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
842  " /Font << /f-0-0 3 0 R >>\n" // Type0 Font
843  " >>\n"
844  ">>\n"
845  "endobj\n";
846  pages_.push_back(obj_);
847  AppendPDFObject(stream.str().c_str());
848 
849  // CONTENTS
850  const std::unique_ptr<char[]> pdftext(GetPDFTextObjects(api, width, height));
851  const size_t pdftext_len = strlen(pdftext.get());
852  size_t len;
853  unsigned char *comp_pdftext =
854  zlibCompress(reinterpret_cast<unsigned char *>(pdftext.get()), pdftext_len, &len);
855  long comp_pdftext_len = len;
856  stream.str("");
857  stream << obj_
858  << " 0 obj\n"
859  "<<\n"
860  " /Length "
861  << comp_pdftext_len
862  << " /Filter /FlateDecode\n"
863  ">>\n"
864  "stream\n";
865  AppendString(stream.str().c_str());
866  long objsize = stream.str().size();
867  AppendData(reinterpret_cast<char *>(comp_pdftext), comp_pdftext_len);
868  objsize += comp_pdftext_len;
869  lept_free(comp_pdftext);
870  const char *b2 =
871  "endstream\n"
872  "endobj\n";
873  AppendString(b2);
874  objsize += strlen(b2);
875  AppendPDFObjectDIY(objsize);
876 
877  if (!textonly_) {
878  char *pdf_object = nullptr;
879  int jpg_quality;
880  api->GetIntVariable("jpg_quality", &jpg_quality);
881  if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize, jpg_quality)) {
882  return false;
883  }
884  AppendData(pdf_object, objsize);
885  AppendPDFObjectDIY(objsize);
886  delete[] pdf_object;
887  }
888  return true;
889 }
890 
891 bool TessPDFRenderer::EndDocumentHandler() {
892  // We reserved the /Pages object number early, so that the /Page
893  // objects could refer to their parent. We finally have enough
894  // information to go fill it in. Using lower level calls to manipulate
895  // the offset record in two spots, because we are placing objects
896  // out of order in the file.
897 
898  // PAGES
899  const long int kPagesObjectNumber = 2;
900  offsets_[kPagesObjectNumber] = offsets_.back(); // manipulation #1
901  std::stringstream stream;
902  // Use "C" locale (needed for int values larger than 999).
903  stream.imbue(std::locale::classic());
904  stream << kPagesObjectNumber << " 0 obj\n<<\n /Type /Pages\n /Kids [ ";
905  AppendString(stream.str().c_str());
906  size_t pages_objsize = stream.str().size();
907  for (const auto &page : pages_) {
908  stream.str("");
909  stream << page << " 0 R ";
910  AppendString(stream.str().c_str());
911  pages_objsize += stream.str().size();
912  }
913  stream.str("");
914  stream << "]\n /Count " << pages_.size() << "\n>>\nendobj\n";
915  AppendString(stream.str().c_str());
916  pages_objsize += stream.str().size();
917  offsets_.back() += pages_objsize; // manipulation #2
918 
919  // INFO
920  std::string utf16_title = "FEFF"; // byte_order_marker
921  std::vector<char32> unicodes = UNICHAR::UTF8ToUTF32(title());
922  char utf16[kMaxBytesPerCodepoint];
923  for (char32 code : unicodes) {
924  if (CodepointToUtf16be(code, utf16)) {
925  utf16_title += utf16;
926  }
927  }
928 
929  char *datestr = l_getFormattedDate();
930  stream.str("");
931  stream << obj_
932  << " 0 obj\n"
933  "<<\n"
934  " /Producer (Tesseract "
935  << tesseract::TessBaseAPI::Version()
936  << ")\n"
937  " /CreationDate (D:"
938  << datestr
939  << ")\n"
940  " /Title <"
941  << utf16_title.c_str()
942  << ">\n"
943  ">>\n"
944  "endobj\n";
945  lept_free(datestr);
946  AppendPDFObject(stream.str().c_str());
947  stream.str("");
948  stream << "xref\n0 " << obj_ << "\n0000000000 65535 f \n";
949  AppendString(stream.str().c_str());
950  for (int i = 1; i < obj_; i++) {
951  stream.str("");
952  stream.width(10);
953  stream.fill('0');
954  stream << offsets_[i] << " 00000 n \n";
955  AppendString(stream.str().c_str());
956  }
957  stream.str("");
958  stream << "trailer\n<<\n /Size " << obj_
959  << "\n"
960  " /Root 1 0 R\n" // catalog
961  " /Info "
962  << (obj_ - 1)
963  << " 0 R\n" // info
964  ">>\nstartxref\n"
965  << offsets_.back() << "\n%%EOF\n";
966  AppendString(stream.str().c_str());
967  return true;
968 }
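
/*
The cross-reference loop above depends on each in-use entry being exactly
20 bytes: a 10-digit zero-padded offset, a 5-digit generation number, the
letter n, and a two-character line ending. The helper below is a
hypothetical sketch of that formatting; XrefEntry is not a Tesseract
function.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: formats one in-use cross-reference entry for an
// object that starts at the given byte offset, with generation number 0.
std::string XrefEntry(long offset) {
  std::ostringstream out;
  out.imbue(std::locale::classic());  // no locale-specific digit grouping
  out << std::setw(10) << std::setfill('0') << offset << " 00000 n \n";
  return out.str();
}
```

An object at byte offset 15 is written as "0000000015 00000 n " followed
by a newline.
*/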
969 } // namespace tesseract