With TestComplete, you can read data from PDF files and compare the files.
To extract text contents of PDF files, TestComplete uses optical character recognition (OCR).
Requirements
-
Your TestComplete version must be 14.20 or later.
-
Your computer must have access to the ocr.api.dev.smartbear.com web service.
If you have firewalls or proxies running in your network, they should allow your computer to access the web service. This web service is used to recognize the text content of PDF files.
-
Your firewall must allow traffic through port 443.
-
You need an active license for the TestComplete Intelligent Quality add-on.
-
The Intelligent Quality add-on must be enabled in TestComplete.
You can enable the add-on during TestComplete installation. If you did not, you can enable it at any time: select File > Install Extensions from the TestComplete main menu and enable the Intelligent Quality > Intelligent Quality Core plugin in the resulting dialog.
-
PDF to Text support must be enabled in TestComplete.
By default, it is enabled automatically if you enable the Intelligent Quality add-on during TestComplete installation.
If you experience issues with PDF support in your tests, select File > Install Extensions from the TestComplete main menu and make sure the PDF to Text plugin is enabled (you can find it in the Intelligent Quality group). If the plugin is disabled, enable it. In the confirmation message, click the link to read a third-party license agreement. If you agree to the license terms, click Enable OCR.
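Before you run PDF tests, it can help to confirm that the machine can open a connection to the recognition service on port 443. The standalone Python sketch below (meant to be run outside TestComplete) is one way to do that. The helper name can_reach and the timeout value are assumptions made for illustration, and a successful TCP connection does not guarantee that a proxy will let HTTPS requests through.

```python
import socket

def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and connects over TCP
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

For example, can_reach("ocr.api.dev.smartbear.com") returns False if the host name cannot be resolved or the connection is blocked.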
How TestComplete recognizes PDF contents
TestComplete sends PDF files to the ocr.api.dev.smartbear.com web service by SmartBear. This web service forwards the files to Google Vision API and transfers the recognition results back to TestComplete.
In your test, you get the text of an entire PDF file as a string and work with it as your needs dictate. For example, you can compare it against some expected values. See below for more information.
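Because the text comes from OCR, line breaks and spacing in the returned string may differ from the original layout. A minimal sketch, assuming you want to check that a few expected values occur in the recognized text regardless of whitespace and letter case. The helper names normalize and missing_values are hypothetical and are not part of TestComplete:

```python
import re

def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def missing_values(recognized_text, expected_values):
    """Return the expected values that do not occur in the recognized text."""
    haystack = normalize(recognized_text).lower()
    return [v for v in expected_values if normalize(v).lower() not in haystack]

# In a TestComplete script, recognized_text would be the string
# returned by PDF.ConvertToText(path).
```

An empty result means every expected value was found.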
Extract data
In keyword tests
To get text contents of a PDF file, use the PDF to Text operation. To work with the recognized text afterwards, you can, for example, store it to a variable by using the Set Variable Value operation.
To get a fragment of the recognized text:
-
write a script that will obtain the needed fragment, and
-
call that script routine from your keyword test by using the Run Script Routine, Run Code Snippet, or Run Test operation.
In script tests
To get text contents of a PDF file, use the PDF.ConvertToText(…) method. It receives the PDF file name as a parameter, processes the file, and returns a string that contains the recognized text:
JavaScript, JScript
var path = "C:\\work\\sample.pdf";
var contents = PDF.ConvertToText(path);
Python
path = "C:\\work\\sample.pdf"
contents = PDF.ConvertToText(path)
VBScript
path = "C:\work\sample.pdf"
contents = PDF.ConvertToText(path)
DelphiScript
var path, contents;
…
path := 'C:\work\sample.pdf';
contents := PDF.ConvertToText(path);
C++Script, C#Script
var path = "C:\\work\\sample.pdf";
var contents = PDF["ConvertToText"](path);
To get a fragment of the recognized text, use methods and properties of the aqString object: Contains(…), StrMatches(…), SubString(…), and others. With aqString.StrMatches(…), you can search and validate the string with regular expressions.
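In Python tests, the standard re module can serve the same purpose as the aqString methods. The sketch below assumes the recognized text contains a label followed by a value on the same line (for example, Order number: 12345). value_after_label is a hypothetical helper name:

```python
import re

def value_after_label(text, label):
    """Return the rest of the line that follows `label`, or None if absent."""
    # re.escape keeps characters such as ':' or '.' in the label literal;
    # '.' does not match '\n', so the group stops at the end of the line
    pattern = re.escape(label) + r"[ \t]*(.*)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None
```

The call to strip() also removes a trailing carriage return if the text uses Windows line endings.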
Examples
Extract a Value From a PDF File
The sample code below shows how to get all date values from a PDF file:
JavaScript, JScript
function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile.Exists(path)) && (aqFileSystem.GetFileExtension(path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF.ConvertToText(path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;
      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents.match(regEx);
      if (r != null && r.length > 0)
      {
        for (var i = 0; i < r.length; i++)
          Log.Message(r[i]);
      }
    }
  }
}
Python
import re

def GetDateValuesFromPDF():
  # Get the path to the tested PDF file
  path = "C:\\work\\sample.pdf"
  if ((path != "") and (aqFile.Exists(path)) and (aqFileSystem.GetFileExtension(path) == "pdf")):
    # Get the entire file contents
    contents = PDF.ConvertToText(path)
    if (contents != ""):
      # This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      regEx = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
      # Post all date values that match the specified pattern
      # to the test log
      r = regEx.findall(contents)
      for value in r:
        Log.Message(value)
VBScript
Sub GetDateValuesFromPDF
  ' Get the path to the tested PDF file
  path = "C:\work\sample.pdf"
  If path <> "" And aqFile.Exists(path) And aqFileSystem.GetFileExtension(path) = "pdf" Then
    ' Get the entire file contents
    contents = PDF.ConvertToText(path)
    If contents <> "" Then
      ' This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      Set regEx = New RegExp
      regEx.Pattern = "\d{1,2}\/\d{1,2}\/\d{2,4}"
      regEx.Global = True
      ' Post all the date values that match the specified pattern
      ' to the test log
      Set r = regEx.Execute(contents)
      For Each m In r
        Log.Message(m.Value)
      Next
    End If
  End If
End Sub
DelphiScript
procedure GetDateValuesFromPDF;
var path, contents, regEx;
begin
  // Get the path to the tested PDF file
  path := 'C:\work\sample.pdf';
  if ((path <> '') and (aqFile.Exists(path)) and (aqFileSystem.GetFileExtension(path) = 'pdf')) then
  begin
    // Get the entire file contents
    contents := PDF.ConvertToText(path);
    if contents <> '' then
    begin
      // Extract a value
      regEx := HISUtils.RegExpr();
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      regEx.Expression := '\d{1,2}/\d{1,2}/\d{2,4}';
      // Post all the date values that match the specified pattern
      // to the test log
      if regEx.Exec(contents) then
        repeat
          Log.Message(regEx.Match[0]);
        until not regEx.ExecNext;
    end;
  end;
end;
C++Script, C#Script
function GetDateValuesFromPDF()
{
  // Get the path to the tested PDF file
  var path = "C:\\work\\sample.pdf";
  if ((path != "") && (aqFile["Exists"](path)) && (aqFileSystem["GetFileExtension"](path) == "pdf"))
  {
    // Get the entire file contents
    var contents = PDF["ConvertToText"](path);
    if (contents != "")
    {
      // This expression specifies a date pattern: mm/dd/yy or mm/dd/yyyy
      var regEx = /\d{1,2}\/\d{1,2}\/\d{2,4}/gim;
      // Post all the date values that match the specified pattern
      // to the test log
      var r = contents["match"](regEx);
      if (r != null && r["length"] > 0)
      {
        for (var i = 0; i < r["length"]; i++)
          Log["Message"](r[i]);
      }
    }
  }
}
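The pattern used above also matches strings that look like dates but are not, for example 13/45/2020. If that matters for your test, one option is to parse every match and keep only the real calendar dates. A Python sketch under that assumption (parse_date and extract_valid_dates are hypothetical helper names, and two-digit years are interpreted by the rules of Python's strptime):

```python
from datetime import datetime
import re

# Same mm/dd/yy or mm/dd/yyyy shape as in the samples above
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")

def parse_date(candidate):
    """Return a datetime for mm/dd/yy or mm/dd/yyyy text, or None if invalid."""
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue
    return None

def extract_valid_dates(text):
    """Find date-like strings and keep only those that are real dates."""
    return [c for c in DATE_PATTERN.findall(text) if parse_date(c) is not None]
```

In a TestComplete script, you would pass the string returned by PDF.ConvertToText(…) to extract_valid_dates and post each item to the log.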
Extract Section Contents From a PDF File
This sample shows how to get the contents of a section from a PDF file (the text between two subheaders):
JavaScript, JScript
function Main()
{
  // Specifies the tested PDF file
  var path = "C:\\work\\sample.pdf";
  // Set the name of the section
  // whose contents you want to get
  var section1Name = "Section 1 Name";
  // Set the name of the section
  // that follows the target section
  // (to get the contents of the last section, leave this value empty)
  var section2Name = "Section 2 Name";
  // Get the section contents
  var contents = GetSectionContents(path, section1Name, section2Name);
  // Post the contents to the test log
  if (contents != null && contents != "")
    Log.Message("View the contents of the section \"" + section1Name + "\" in the Details panel", contents);
}

// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2)
{
  if (aqFile.Exists(aPath) && (aqFileSystem.GetFileExtension(aPath) == 'pdf'))
  {
    // Get the entire text content of the PDF file
    var str = PDF.ConvertToText(aPath);
    if (str != "")
    {
      // Create a regular expression that will get the text between the section headers
      var w = aSection1 + "[\\r\\n]*([^]*)" + aSection2;
      var regEx = new RegExp(w, "gim");
      var r = regEx.exec(str);
      if (r != null && r.length > 0)
      {
        return r[1];
      }
      else
      {
        Log.Warning("Failed to get the section contents");
        return "";
      }
    }
  }
}
Python
import re

def Main():
  # Specifies the tested PDF file
  path = "C:\\work\\sample.pdf"
  # Set the name of the section
  # whose contents you want to get
  section1Name = "Section 1 Name"
  # Set the name of the section
  # that follows the target section
  # (to get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name"
  # Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name)
  # Post the contents to the test log
  if (contents != None and contents != ""):
    Log.Message("View the contents of the section \"" + section1Name + "\" in the Details panel", contents)

# Use a regular expression
# to get the section contents
def GetSectionContents(aPath, aSection1, aSection2):
  if (aqFile.Exists(aPath) and (aqFileSystem.GetFileExtension(aPath) == 'pdf')):
    # Get the entire text contents of the PDF file
    # (the variable is named text to avoid shadowing the built-in str)
    text = PDF.ConvertToText(aPath)
    if (text != ""):
      # Create a regular expression that will get text between section headers
      w = aSection1 + "[\\r\\n]+([\\w\\W\\s\\S\\r\\n]*)" + aSection2
      regEx = re.compile(w)
      r = regEx.search(text)
      if (r and len(r.groups()) > 0):
        return r.groups()[0]
      else:
        Log.Warning("Failed to get the section contents")
        return ""
VBScript
Sub Main
  ' Specifies the tested PDF file
  path = "C:\work\sample.pdf"
  ' Set the name of the section
  ' whose contents you want to get
  section1Name = "Section 1 Name"
  ' Set the name of the section
  ' that follows the target section
  ' (to get the contents of the last section, leave this value empty)
  section2Name = "Section 2 Name"
  ' Get the section contents
  contents = GetSectionContents(path, section1Name, section2Name)
  ' Post the contents to the test log
  If contents <> "" Then
    Call Log.Message("View the contents of the section '" & section1Name & "' in the Details panel", contents)
  End If
End Sub

' Use a regular expression
' to get the section contents
Function GetSectionContents(aPath, aSection1, aSection2)
  If aqFile.Exists(aPath) And aqFileSystem.GetFileExtension(aPath) = "pdf" Then
    ' Get the entire text content of the PDF file
    str = PDF.ConvertToText(aPath)
    If str <> "" Then
      ' Create a regular expression that will get the text between the section headers
      w = aSection1 & "[\r\n]*([\w\W\s\S\r\n]*)" & aSection2
      Set regEx = New RegExp
      regEx.Pattern = w
      regEx.Global = True
      ' Run the search and check whether anything matched
      Set r = regEx.Execute(str)
      If r.Count > 0 Then
        For Each m In r
          GetSectionContents = m.SubMatches(0)
        Next
      Else
        Log.Warning("Failed to get the section contents")
        GetSectionContents = ""
      End If
    End If
  End If
End Function
DelphiScript
// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2);
var str, regEx, w;
begin
  if (aqFile.Exists(aPath) and (aqFileSystem.GetFileExtension(aPath) = 'pdf')) then
  begin
    // Get the entire text content of the PDF file
    str := PDF.ConvertToText(aPath);
    if str <> '' then
    begin
      // Create a regular expression that will get the text between the section headers
      aSection1 := aqString.Replace(aSection1, ' ', '[\s\t\r\n]+');
      aSection2 := aqString.Replace(aSection2, ' ', '[\s\t\r\n]+');
      w := aSection1 + '(.*)' + aSection2;
      regEx := HISUtils.RegExpr();
      regEx.Expression := w;
      if regEx.Exec(str) then
        // Return the text that is between the specified section headers
        result := regEx.Match[1]
      else
      begin
        Log.Warning('Failed to get the section contents');
        result := '';
      end;
    end;
  end;
end;

procedure Main();
var path, section1Name, section2Name, contents;
begin
  // Specifies the tested PDF file
  path := 'C:\work\sample.pdf';
  // Set the name of the section
  // whose contents you want to get
  section1Name := 'Section 1 Name';
  // Set the name of the section
  // that follows the target section
  // (to get the contents of the last section, leave this value empty)
  section2Name := 'Section 2 Name';
  // Get the section contents
  contents := GetSectionContents(path, section1Name, section2Name);
  // Post the contents to the test log
  if (contents <> '') then
    Log.Message('View the contents of the section "' + section1Name + '" in the Details panel', contents);
end;
C++Script, C#Script
function Main()
{
  // Specifies the tested PDF file
  var path = "C:\\work\\sample.pdf";
  // Set the name of the section
  // whose contents you want to get
  var section1Name = "Section 1 Name";
  // Set the name of the section
  // that follows the target section
  // (to get the contents of the last section, leave this value empty)
  var section2Name = "Section 2 Name";
  // Get the section contents
  var contents = GetSectionContents(path, section1Name, section2Name);
  // Post the contents to the test log
  if (contents != null && contents != "")
    Log["Message"]("View the contents of the section \"" + section1Name + "\" in the Details panel", contents);
}

// Use a regular expression
// to get the section contents
function GetSectionContents(aPath, aSection1, aSection2)
{
  if (aqFile["Exists"](aPath) && (aqFileSystem["GetFileExtension"](aPath) == 'pdf'))
  {
    // Get the entire text content of the PDF file
    var str = PDF["ConvertToText"](aPath);
    if (str != "")
    {
      // Create a regular expression that will get the text between the section headers
      var w = aSection1 + "[\\r\\n]*([^]*)" + aSection2;
      var regEx = new RegExp(w, "gim");
      var r = regEx["exec"](str);
      if (r != null && r["length"] > 0)
      {
        return r[1];
      }
      else
      {
        Log["Warning"]("Failed to get the section contents");
        return "";
      }
    }
  }
}
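The samples above insert the section names into the regular expression without escaping them, so names that contain characters such as parentheses or periods may be interpreted as pattern syntax. A possible variant, sketched in Python, escapes the names and stops at the first occurrence of the next header. It works on the string returned by PDF.ConvertToText(…). The function name and the None return value for a missing section are assumptions of this sketch:

```python
import re

def get_section_contents(text, section_name, next_section_name=""):
    """Return the text between two section headers, or None if not found."""
    start = re.escape(section_name)
    if next_section_name:
        # Non-greedy group: stop at the first occurrence of the next header
        pattern = start + r"\s*(.*?)\s*" + re.escape(next_section_name)
    else:
        # No following header: take everything up to the end of the text
        pattern = start + r"\s*(.*)"
    # DOTALL lets '.' span line breaks; IGNORECASE tolerates case differences
    match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None
```

Leaving next_section_name empty returns everything after the first header, which corresponds to the "last section" case described in the sample comments.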
Validate PDF files
PDF Checkpoints
To verify contents of an entire PDF file, use PDF checkpoints. They compare contents of the PDF file with the expected contents you store in the Stores > Files collection of your project. You can create checkpoints both during test recording and at design time. In the checkpoint properties, you can set the allowed difference between files. For more information, see PDF Checkpoints.
Custom Verification
To validate a fragment of a PDF file, or to create custom verifications, you need to create a keyword test or write a script.
In keyword tests
-
Use the PDF to Text operation to get file contents. Follow this operation with the Set Variable Value operation to save the extracted text (the last operation result) to a variable.
-
Use the If…Then operation to check the variable value, or write a script routine that will perform the needed verification, and then call it from your keyword test by using the Run Script Routine, Run Code Snippet, or Run Test operation.
In script tests
-
Call PDF.ConvertToText(…) to get the text of your PDF file:
JavaScript, JScript
var path = "C:\\work\\sample.pdf";
var contents = PDF.ConvertToText(path);
Python
path = "C:\\work\\sample.pdf"
contents = PDF.ConvertToText(path)
VBScript
path = "C:\work\sample.pdf"
contents = PDF.ConvertToText(path)
DelphiScript
var path, contents;
…
path := 'C:\work\sample.pdf';
contents := PDF.ConvertToText(path);
C++Script, C#Script
var path = "C:\\work\\sample.pdf";
var contents = PDF["ConvertToText"](path);
-
Use script objects and methods that TestComplete provides for working with strings, for example, aqString, to perform the operation you need. See the example below.
Example
The example below shows how to validate the contents of a PDF file ignoring some part of it.
JavaScript
function Main()
{
  let path1 = "C:\\work\\baseline.pdf";
  let path2 = "C:\\work\\report.pdf";
  if (ComparePDF(path1, path2))
    Log.Message("The text contents of specified PDF files are the same");
}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile.Exists(path1)) && (aqFileSystem.GetFileExtension(path1) == "pdf"))
    && ((path2 != "") && (aqFile.Exists(path2)) && (aqFileSystem.GetFileExtension(path2) == "pdf")))
  {
    // Get the text contents of PDF files
    let str1 = PDF.ConvertToText(path1);
    let str2 = PDF.ConvertToText(path2);
    // Use the regular expression
    // to replace the date/time stamp
    let regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;
    str1 = str1.replace(regEx, "<ignore>");
    str2 = str2.replace(regEx, "<ignore>");
    // Compare the resulting contents
    return equal(str1, str2);
  }
  else
    return false;
}
JScript
function Main()
{
  var path1 = "C:\\work\\baseline.pdf";
  var path2 = "C:\\work\\report.pdf";
  if (ComparePDF(path1, path2))
    Log.Message("The text contents of specified PDF files are the same");
}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile.Exists(path1)) && (aqFileSystem.GetFileExtension(path1) == "pdf"))
    && ((path2 != "") && (aqFile.Exists(path2)) && (aqFileSystem.GetFileExtension(path2) == "pdf")))
  {
    // Get the text contents of PDF files
    var str1 = PDF.ConvertToText(path1);
    var str2 = PDF.ConvertToText(path2);
    // Use the regular expression
    // to replace the date/time stamp
    var regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;
    str1 = str1.replace(regEx, "<ignore>");
    str2 = str2.replace(regEx, "<ignore>");
    // Compare the resulting contents
    return (str1 == str2);
  }
  else
    return false;
}
Python
import re

def Main():
  path1 = "C:\\work\\baseline.pdf"
  path2 = "C:\\work\\report.pdf"
  if (ComparePDF(path1, path2)):
    Log.Message("The text contents of specified PDF files are the same")

def ComparePDF(path1, path2):
  if (path1 != "" and aqFile.Exists(path1) and aqFileSystem.GetFileExtension(path1) == "pdf" and
      path2 != "" and aqFile.Exists(path2) and aqFileSystem.GetFileExtension(path2) == "pdf"):
    # Get the text contents of PDF files
    str1 = PDF.ConvertToText(path1)
    str2 = PDF.ConvertToText(path2)
    # Use the regular expression
    # to replace the date/time stamp
    # (str.replace works only with literal text, so re.sub is used here)
    regEx = re.compile(r"\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}", re.IGNORECASE | re.MULTILINE)
    str1 = regEx.sub("<ignore>", str1)
    str2 = regEx.sub("<ignore>", str2)
    # Compare the resulting contents
    return (str1 == str2)
  else:
    return False
VBScript
Sub Main
  path1 = "C:\work\baseline.pdf"
  path2 = "C:\work\report.pdf"
  If ComparePDF(path1, path2) Then
    Log.Message("The text contents of specified PDF files are the same")
  End If
End Sub

Function ComparePDF(path1, path2)
  If path1 <> "" And aqFile.Exists(path1) And aqFileSystem.GetFileExtension(path1) = "pdf" _
    And path2 <> "" And aqFile.Exists(path2) And aqFileSystem.GetFileExtension(path2) = "pdf" Then
    ' Get the text contents of PDF files
    str1 = PDF.ConvertToText(path1)
    str2 = PDF.ConvertToText(path2)
    ' Use the regular expression
    ' to replace the date/time stamp
    Set regEx = New RegExp
    regEx.Pattern = "\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}"
    regEx.IgnoreCase = True
    regEx.Global = True
    str1 = regEx.Replace(str1, "<ignore>")
    str2 = regEx.Replace(str2, "<ignore>")
    ' Compare the resulting contents
    ComparePDF = (str1 = str2)
  Else
    ComparePDF = False
  End If
End Function
DelphiScript
function ComparePDF(path1, path2);
var str1, str2;
var regEx;
begin
  if ((path1 <> '') and (aqFile.Exists(path1)) and (aqFileSystem.GetFileExtension(path1) = 'pdf'))
    and ((path2 <> '') and (aqFile.Exists(path2)) and (aqFileSystem.GetFileExtension(path2) = 'pdf')) then
  begin
    // Get the text contents of PDF files
    str1 := PDF.ConvertToText(path1);
    str2 := PDF.ConvertToText(path2);
    // Use the regular expression
    // to replace the date/time stamp
    regEx := HISUtils.RegExpr;
    regEx.Expression := '\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}';
    str1 := regEx.Replace(str1, '<ignore>');
    str2 := regEx.Replace(str2, '<ignore>');
    // Compare the resulting contents
    result := (str1 = str2);
  end
  else
    result := false;
end;

procedure Main();
var path1, path2;
begin
  path1 := 'C:\work\baseline.pdf';
  path2 := 'C:\work\report.pdf';
  if ComparePDF(path1, path2) then
    Log.Message('The text contents of specified PDF files are the same');
end;
C++Script, C#Script
function Main()
{
  var path1 = "C:\\work\\baseline.pdf";
  var path2 = "C:\\work\\report.pdf";
  if (ComparePDF(path1, path2))
    Log["Message"]("The text contents of specified PDF files are the same");
}

function ComparePDF(path1, path2)
{
  if (((path1 != "") && (aqFile["Exists"](path1)) && (aqFileSystem["GetFileExtension"](path1) == "pdf"))
    && ((path2 != "") && (aqFile["Exists"](path2)) && (aqFileSystem["GetFileExtension"](path2) == "pdf")))
  {
    // Get the text contents of PDF files
    var str1 = PDF["ConvertToText"](path1);
    var str2 = PDF["ConvertToText"](path2);
    // Use the regular expression
    // to replace the date/time stamp
    var regEx = /\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}/gim;
    str1 = str1["replace"](regEx, "<ignore>");
    str2 = str2["replace"](regEx, "<ignore>");
    // Compare the resulting contents
    return (str1 == str2);
  }
  else
    return false;
}
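When the comparison fails, it is often useful to see which lines differ. The Python sketch below masks the date/time stamp in the same way as the example and then builds a line-by-line diff with the standard difflib module. masked_diff is a hypothetical helper, not a TestComplete API:

```python
import difflib
import re

# Same date/time stamp pattern as in the example above
TIMESTAMP = re.compile(
    r"\d{1,2}.\d{1,2}.\d{2,4}\s\d{1,2}:\d{2}:\d{2}\s\w{2}", re.IGNORECASE)

def masked_diff(baseline_text, actual_text, pattern=TIMESTAMP):
    """Mask volatile fragments, then return unified-diff lines (empty if equal)."""
    a = pattern.sub("<ignore>", baseline_text).splitlines()
    b = pattern.sub("<ignore>", actual_text).splitlines()
    return list(difflib.unified_diff(a, b, "baseline", "actual", lineterm=""))
```

An empty list means the texts match after masking. In a TestComplete script, you could join the returned lines with line breaks and pass them as the additional information parameter of Log.Error.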