PDFlib TET: extract text, images, and metadata from PDF documents. Official latest version as a free download, licensed purchase, and online documentation support - Huidu


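PDFlib TET (product no. 10596)

PDFlib TET is software that reliably extracts text from any flavor of PDF document.

Tags: PDF

Developer: PDFlib

Current version: v5.4

Product type: component

Product function: document management

Platforms and languages: ActiveX & COM | .NET | Java | C++/MFC | other

Source availability: source code not provided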

The classification and description of this product are for reference only; the vendor's website is authoritative. For questions, call 023-68661681.

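Text and Image Extraction Toolkit

Accepts all PDF input

Works with all of the world's writing systems

Multiple licensing programs available

PDF products trusted worldwide

PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images, and metadata from PDF documents. TET delivers the text contents of a PDF as Unicode strings, along with detailed color, glyph, and font information and the position on the page. Raster images are extracted in common image formats. TET can optionally convert PDF documents to an XML-based format called TETML, which contains text and metadata as well as resource information. TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns, identifying table structures, and removing redundant items such as shadow text.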


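PDFlib TET features

Implement a PDF indexer for a search engine
Repurpose the text and images in PDFs
Convert the contents of PDFs to other formats
Process PDFs based on their contents, e.g. split them based on headings (requires PDFlib+PDI in addition to TET)
Check whether a specific location on the page is empty, e.g. for placing a barcode or stamp
TET also includes the pCOS interface for querying details about a PDF document, such as document info fields and XMP metadata, font lists, page size, etc. (see the pCOS product description and the pCOS Cookbook)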

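Why choose TET for text extraction?

Dehyphenation

TET detects hyphenated words that span a line break, removes the hyphen, and combines the parts into a complete word. This is important so that searches for the full word succeed even though the document contains only the hyphenated parts. Dashes (as opposed to hyphens) are treated separately because they must not be removed.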
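The idea can be illustrated with a minimal Python sketch. It is not TET's algorithm: the function name `dehyphenate` and the rule "hyphen at a line end followed by a lowercase letter" are assumptions made for illustration only, and the rule would wrongly join genuine compounds such as "well-" / "known".

```python
import re

# Assumed rule: letter + (hyphen or soft hyphen) + line break + lowercase
# letter marks a word that was split for layout reasons.
_SPLIT_WORD = re.compile(r"(?<=\w)[-\u00ad]\n(?=[a-z])")


def dehyphenate(text: str) -> str:
    """Rejoin words hyphenated across a line break; leave dashes untouched."""
    return _SPLIT_WORD.sub("", text)


if __name__ == "__main__":
    # The hyphenated word is rejoined.
    print(dehyphenate("reliable text extrac-\ntion from PDF"))
    # An en dash (U+2013) at the line end is not a hyphen and is kept.
    print(dehyphenate("pages 10 \u2013\n20"))
```

Keeping the dash rule separate from the hyphen rule mirrors the distinction drawn above: only the hyphen is layout residue, while a dash carries meaning.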

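Shadow and bold text detection

TET's patented shadow detection algorithm identifies and removes redundant text instances so that no excess text is extracted. While other software extracts every instance of shadow or artificially bolded text, TET correctly removes the redundant copies. An extra copy of a whole word would still produce search engine hits, but if the text is duplicated character by character, searches no longer find it.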
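A simplified way to picture duplicate removal is to drop any glyph that repeats the same character at almost the same position as one already kept. The sketch below is an assumption-laden illustration, not TET's patented algorithm: the `Glyph` type, the function name, and the 1.0 point tolerance are all hypothetical.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float  # horizontal position on the page, in points
    y: float  # vertical position on the page, in points


def drop_shadow_duplicates(glyphs: List[Glyph], tolerance: float = 1.0) -> List[Glyph]:
    """Keep the first glyph of each group of same-character, near-identical positions."""
    kept: List[Glyph] = []
    for g in glyphs:
        is_duplicate = any(
            k.char == g.char
            and abs(k.x - g.x) <= tolerance
            and abs(k.y - g.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(g)
    return kept


if __name__ == "__main__":
    # "TET" drawn with a shadow copy of the first two glyphs, offset by 0.6 points.
    page = [
        Glyph("T", 100.0, 700.0), Glyph("T", 100.6, 699.4),
        Glyph("E", 110.0, 700.0), Glyph("E", 110.6, 699.4),
        Glyph("T", 120.0, 700.0),
    ]
    print("".join(g.char for g in drop_shadow_duplicates(page)))
```

Without the filter the naive extraction of this page would read "TTEET", which a search for "TET" would miss.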

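Accented characters

In many languages, accents and other diacritical marks are placed near other characters to form combined characters. Some typesetting programs (most notably TeX) emit two characters separately, the base character and the accent, to create a combined character. For example, to create the character ä, the letter a is first placed on the page, and then the dieresis character ¨ is placed on top of it. TET detects this situation and recombines the two characters to form the appropriate combined character.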
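The recombination step itself can be sketched with Python's standard `unicodedata` module. This assumes the base letter and the accent have already been identified as belonging together, which is the hard part and is not shown; the mapping table and the function name `recombine` are hypothetical.

```python
import unicodedata

# Spacing accents as they may appear on the page, mapped to the combining
# marks that Unicode composition expects (a small illustrative subset).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # dieresis
    "\u00b4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
}


def recombine(base: str, accent: str) -> str:
    """Compose a base letter and a separately placed accent into one character."""
    combining = SPACING_TO_COMBINING.get(accent, accent)
    # NFC composes base + combining mark into a precomposed character if one exists.
    return unicodedata.normalize("NFC", base + combining)


if __name__ == "__main__":
    print(recombine("a", "\u00a8"))  # the example from the text: a + dieresis
```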

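Ligatures

A ligature combines two or more characters in a single glyph. The most common ligatures are used for the combinations fi, fl, and ffi; less common ligatures are used for Th, sp, ct, st, and many other combinations. When extracting text from digital documents, ligatures must be analyzed and split into their constituent characters for proper text processing. TET detects ligatures and delivers the two or more characters as appropriate.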
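For ligatures that have their own Unicode code points, the decomposition can be sketched as a simple table lookup. This is only an illustration of the expected result: the table below covers five Latin ligatures from the Alphabetic Presentation Forms block, and the function name is hypothetical. Ligature glyphs without a Unicode code point need font analysis, which is not shown.

```python
# Unicode ligature characters and the letter sequences they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_TABLE = str.maketrans(LIGATURES)


def split_ligatures(text: str) -> str:
    """Replace each ligature character with its constituent letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(split_ligatures("\ufb01le o\ufb03ce"))
```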

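Drop caps

A drop cap is a large initial character at the start of a paragraph: the top of the initial is aligned with the top of the line, and the rest of the character descends several lines. Drop caps emphasize the start of a paragraph. If they are not handled properly, the first word is extracted in two parts: the single initial character and the remainder of the word. TET correctly extracts the complete word.

Unicode mapping

TET's patented Unicode mapping algorithm implements a cascade that uses all available information to determine Unicode values. For many problematic documents, TET extracts proper Unicode text where other products deliver only unusable garbage.

Bidirectional text with Arabic and Hebrew

PDF does not encode logical text; it is merely a container for glyphs on the page. Text in the Arabic and Hebrew scripts runs from right to left. Since it often contains left-to-right inserts (such as numbers or names in Western languages), the text must be interpreted in both directions, hence the term "bidirectional". TET reorders the visual mix of right-to-left and left-to-right text to create proper logical text output.

Repair of damaged PDF documents

PDF documents may be damaged due to transmission errors or other problems. TET's repair mode recovers many kinds of damaged PDF. Sometimes a PDF document is damaged so badly that its pages cannot even be displayed in Acrobat. Even in such extreme cases, TET often still delivers the page contents of the document.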

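Why choose TET for image extraction?

Color spaces and compression

Raster image data in PDF can be encoded in combinations of 11 color spaces and 9 compression filters, but common image file formats such as JPEG and TIFF support only a subset of these combinations. TET's image engine strikes a balance between the characteristics of the PDF image and the capabilities of the image output format. Regardless of the internal structure of a PDF image, the pixel image is extracted in one of the common image file formats.

Spot colors

TET creates TIFF output with additional spot color channels. This suits applications that need excellent color fidelity and cannot accept any color conversion. If an image with DeviceN color contains only a subset of the usual CMYK process colors, the missing process channels are added so that pure CMYK output can be created. However, some applications cannot handle spot color channels and are limited to plain TIFF output. In this case TET can be instructed to emit the individual spot color channels as grayscale TIFFs for easy processing.

Merging fragmented images

In many PDF documents, images are broken into small fragments by the software that generated the PDF. What appears as a single image on the page may actually consist of many small pieces. For example, Microsoft Office applications and TeX often produce heavily fragmented images comprising hundreds or thousands of small fragments. Adobe InDesign often splits images into fragments of varying size. TET detects fragmented images and merges them to form a usable larger image. Fragmented images can be reasonably reused only after merging.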
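The geometric side of merging can be pictured with a small sketch that fuses vertically stacked strips into one bounding box. It is a deliberately narrow illustration under stated assumptions: tiles are plain `(llx, lly, urx, ury)` tuples, only strips with the same horizontal extent are merged, and the pixel data itself is ignored. The function name and tolerance are hypothetical, and this is not how TET is documented to work internally.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (llx, lly, urx, ury) in points


def merge_strips(tiles: List[Box], tol: float = 0.5) -> List[Box]:
    """Merge tiles that share a horizontal extent and touch top-to-bottom."""
    merged: List[Box] = []
    # Sorting groups tiles of the same column and orders them bottom to top.
    for llx, lly, urx, ury in sorted(tiles, key=lambda t: (t[0], t[2], t[1])):
        if merged:
            m = merged[-1]
            same_extent = abs(m[0] - llx) <= tol and abs(m[2] - urx) <= tol
            touches = abs(m[3] - lly) <= tol  # previous top meets this bottom
            if same_extent and touches:
                merged[-1] = (m[0], m[1], m[2], ury)  # grow the box upward
                continue
        merged.append((llx, lly, urx, ury))
    return merged


if __name__ == "__main__":
    strips = [(0, 20, 100, 30), (0, 0, 100, 10), (0, 10, 100, 20), (200, 0, 250, 10)]
    print(merge_strips(strips))
```

Three strips of one picture collapse into a single box, while the unrelated image at x = 200 stays separate.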

TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text. Using the integrated pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc.

With PDFlib TET you can:
Implement the PDF indexer for a search engine
Repurpose the text and images in PDFs
Convert the contents of PDFs to other formats
Process PDFs based on their contents, e.g. splitting based on headings (requires PDFlib+PDI in addition to TET)

Accepted PDF input

TET supports all relevant flavors of PDF input:

All PDF versions up to Acrobat 9, including ISO 32000-1
Protected PDFs which do not require a password for opening the document
Damaged PDF documents will be repaired

Unicode

Since text in PDF is usually not encoded in Unicode, PDFlib TET normalizes the text in a PDF document to Unicode:

TET converts all text contents to Unicode. In C and other non-Unicode aware languages the text is returned in the UTF-8 or UTF-16 formats, and as native strings in Unicode-capable programming languages.
Ligatures and other multi-character glyphs are decomposed into a sequence of the corresponding Unicode characters.
Glyphs without appropriate Unicode mappings are identified as such, and are mapped to a configurable replacement character in order to avoid misinterpretation.
TET implements various workarounds for problems with specific document creation packages, such as InDesign and TeX documents or PDFs generated on mainframe systems.

Content analysis and word detection

TET includes advanced content analysis algorithms:

Patented algorithm for determining word boundaries which is required to retrieve proper words
Recombine the parts of hyphenated words (dehyphenation)
Remove duplicate instances of text, e.g. shadow and artificially bolded text
Recombine paragraphs in reading order
Correctly order text which is scattered over the page

Page Layout and Table Detection

The page content is analyzed to determine text columns. Tables are detected, including cells which span multiple columns. This improves the ordering of the extracted text. Table rows and the contents of each table cell can be identified.

Geometry

TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
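Once word boxes are available, including or excluding page areas is simple rectangle arithmetic. The sketch below assumes words have already been extracted into a hypothetical `Word` record with PDF-style coordinates (origin at the lower left); it does not call TET, and the function names are illustrative.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    llx: float
    lly: float
    urx: float
    ury: float


def words_in_area(words: List[Word], llx: float, lly: float, urx: float, ury: float) -> List[Word]:
    """Keep only words whose box lies entirely inside the rectangle."""
    return [
        w for w in words
        if w.llx >= llx and w.lly >= lly and w.urx <= urx and w.ury <= ury
    ]


def area_is_empty(words: List[Word], llx: float, lly: float, urx: float, ury: float) -> bool:
    """True if no word box overlaps the rectangle, e.g. before placing a stamp."""
    return not any(
        w.llx < urx and w.urx > llx and w.lly < ury and w.ury > lly
        for w in words
    )


if __name__ == "__main__":
    page = [
        Word("Header", 50, 800, 120, 812),
        Word("Body", 50, 400, 90, 412),
        Word("7", 290, 20, 296, 32),  # page number in the footer
    ]
    # Ignore a header and footer band on an A4 page (595 x 842 points).
    print([w.text for w in words_in_area(page, 0, 60, 595, 780)])
    print(area_is_empty(page, 400, 50, 550, 150))
```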

Image Extraction

Images on PDF pages can be extracted as TIFF, JPEG, or JPEG 2000 files. Precise geometric information (position, size, and angles) is reported for each image. Fragmented images are combined into larger images to facilitate repurposing. Image fidelity is guaranteed since no downsampling or color space conversion occurs. This ensures the highest possible image quality.

PDF Analysis

The TET library includes the pCOS interface for querying details about a PDF document, such as document info and XMP metadata, font lists, page size, and many more.

Configuration Options for problematic PDF

TET contains special handling and workarounds for various kinds of PDF where the text cannot be extracted correctly with other products. In addition, it includes various configuration features to improve processing of problem documents:

Unicode mapping can be customized via user-supplied tables for mapping character codes or glyph names to Unicode.
PDFlib FontReporter is an auxiliary tool for analyzing fonts, encodings, and glyphs in PDF. It works as a plugin for Adobe Acrobat. This plugin is freely available for Mac and Windows.
Embedded fonts are analyzed to find additional hints which are useful for Unicode mapping. External font files or system fonts are used to improve text extraction results if a font is not embedded.

Unicode Postprocessing

TET supports various Unicode postprocessing steps which can be used to improve the extracted text:

Foldings preserve, remove or replace characters, e.g. remove punctuation or characters from irrelevant scripts.
Decompositions replace a character with an equivalent sequence of one or more other characters, e.g. replace narrow, wide or vertical Japanese characters or Latin superscript variants with their respective standard counterparts.
Text can be converted to all four Unicode normalization forms, e.g. emit NFC form to meet the requirements for Web text or a database.
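The effect of such postprocessing steps can be approximated with Python's standard `unicodedata` module. The sketch below operates on text that has already been extracted; it does not use TET's own option syntax, and the helper names are hypothetical. Note that the punctuation folding shown here also removes hyphens and apostrophes.

```python
import unicodedata


def fold_punctuation(text: str) -> str:
    """Folding example: drop characters whose general category is punctuation (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def to_standard_variants(text: str) -> str:
    """Decomposition example: NFKC maps wide and superscript variants to standard characters."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(fold_punctuation("Hello, world!"))
    # Fullwidth "PDF" (U+FF30 U+FF24 U+FF26) becomes ASCII "PDF".
    print(to_standard_variants("\uff30\uff24\uff26"))
    # The four normalization forms of the same input can differ in length.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(form, len(unicodedata.normalize(form, "e\u0301")))
```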

Document Domains

PDF documents may contain text in other places than the page contents. While most applications will deal with the page contents only, in many situations other document domains may be relevant as well. TET extracts the text from all of the following document domains:

page contents
predefined and custom document info entries
XMP metadata on document and image level
bookmarks
file attachments and PDF portfolios can be processed recursively
form fields
comments (annotations)
general PDF properties can be queried, such as page count, conformance to standards like PDF/A or PDF/X, etc.

XMP Metadata

TET supports XMP metadata in several ways:

Using the integrated pCOS interface, XMP metadata for the document, individual pages, images, or other parts of the document can be extracted programmatically.
TETML output contains XMP document and image metadata if present in the PDF.
Images extracted in the TIFF or JPEG formats contain image metadata if present in the PDF.

TETML represents PDF Contents as XML

TET optionally represents the PDF contents in an XML flavor called TETML. It contains a variety of PDF information in a form which can easily be processed with common XML tools. TETML contains the actual text plus optionally font and position information, resource details (fonts, images, colorspaces), and metadata.

TETML is governed by a corresponding XML schema to make sure that TET always creates consistent and reliable XML output. TETML can be processed with XSLT stylesheets, e.g. to apply certain filters or to convert TETML to other formats. Sample XSLT stylesheets for processing TETML are included in the TET distribution.

The following fragment shows TETML output with glyph details:

<Word>
  <Text>PDFlib</Text>
  <Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
    <Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
    <Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
    <Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
    <Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
    <Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
    <Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
  </Box>
</Word>
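Because TETML is ordinary XML, a fragment like the one above can be read with any XML parser. The following sketch uses Python's standard `xml.etree.ElementTree` on that exact fragment. It assumes the element names carry no namespace, as in the fragment; a complete TETML file may declare an XML namespace, in which case the tag names in the lookups would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# The TETML fragment from the text, embedded so the example is self-contained.
FRAGMENT = """<Word>
<Text>PDFlib</Text>
<Box llx="111.48" lly="636.33" urx="161.14" ury="654.33">
<Glyph font="F1" size="18" x="111.48" y="636.33" width="9.65">P</Glyph>
<Glyph font="F1" size="18" x="121.12" y="636.33" width="11.88">D</Glyph>
<Glyph font="F1" size="18" x="133.00" y="636.33" width="8.33">F</Glyph>
<Glyph font="F1" size="18" x="141.33" y="636.33" width="4.88">l</Glyph>
<Glyph font="F1" size="18" x="146.21" y="636.33" width="4.88">i</Glyph>
<Glyph font="F1" size="18" x="151.08" y="636.33" width="10.06">b</Glyph>
</Box>
</Word>"""

word = ET.fromstring(FRAGMENT)
text = word.findtext("Text")          # the word as a Unicode string
box = word.find("Box")                # bounding box of the whole word
box_width = float(box.get("urx")) - float(box.get("llx"))

# One (character, x position, width) tuple per glyph, in document order.
glyphs = [
    (g.text, float(g.get("x")), float(g.get("width")))
    for g in box.iter("Glyph")
]

if __name__ == "__main__":
    print(text, round(box_width, 2))
    for char, x, width in glyphs:
        print(char, x, width)
```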

TET Connectors

TET connectors provide the necessary glue code to interface TET with other software. The following TET connectors make PDF text extraction functionality available for various software environments:

TET connector for the Lucene Search Engine
TET connector for the Solr Search Server
TET connector for Oracle Text
TET connector for MediaWiki
TET PDF IFilter for Microsoft products is available as a separate product. It extracts text and metadata from PDF documents and makes it available to search and retrieval software on Windows.

TET Cookbook

The TET Cookbook is a collection of programming examples which demonstrate the use of TET for various text and image extraction tasks. Several Cookbook samples show how to combine the TET and PDFlib+PDI products in order to enhance PDF documents, e.g. add bookmarks or links based on the text on the page.

Updated: 2023-07-13 15:00:44 | Entered: 2006-01-18 11:46:00 | Editor: 胡濤
