Platfora Installation Guide

Version 4.5
For On-Premise Hadoop Deployments
Copyright Platfora 2015
Last Updated: 10:14 p.m. June 28, 2015
Contents
Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: Installation Overview (On-Premise)
    On-Premise Hadoop Deployments
    Master vs Worker Node Installations
    Preinstall Check List
    High-Level Install Steps
Chapter 2: System Requirements (On-Premise)
    Platfora Server Requirements
    Port Configuration Requirements
        Ports to Open on Platfora Nodes
        Ports to Open on Hadoop Nodes
    Supported Hadoop and Hive Versions
    Hadoop Resource Requirements
    Browser Requirements
Chapter 3: Configure Hadoop for Platfora Access
    Create Platfora User on Hadoop Nodes
    Create Platfora Directories and Permissions in Hadoop
    HDFS Tuning for Platfora
        Increase Open File Limits
        Increase Platfora User Limits
        Increase DataNode File Limits
        Allow Platfora Local Access
    MapReduce Tuning for Platfora
    YARN Tuning for Platfora
Chapter 4: Install Platfora Software and Dependencies
    About the Platfora Installer Packages
    Install Using RPM Packages
        Install Dependencies RPM Package
        Install Optional Security RPM Package
        Install Platfora RPM Package (Master Only)
    Install Using the TAR Package
        Create the Platfora System User
        Set OS Kernel Parameters
        Install Dependent Software
        Install Platfora TAR Package (Master Only)
        Install PDF Dependencies (Master Only)
Chapter 5: Configure Environment on Platfora Nodes
    Install the MapR Client Software (MapR Only)
    Configure Network Environment
        Configure /etc/hosts File
        Verify Connectivity Between Platfora Nodes
        Verify Connectivity to Hadoop Nodes
        Open Firewall Ports
    Configure Passwordless SSH
        Verify Local SSH Access
        Exchange SSH Keys (Multi-Node Only)
    Synchronize the System Clocks
    Create Local Storage Directories
    Verify Environment Variables
Chapter 6: Configure Platfora for Secure Hadoop Access
    About Secure Hadoop
    Configure Kerberos Authentication to Hadoop
        Obtain Kerberos Tickets for a Platfora Server
        Auto-Renew Kerberos Tickets for a Platfora Server
    Configure Secure Impersonation in Hadoop
Chapter 7: Initialize Platfora Master Node
    Connect Platfora to Your Hadoop Services
        Understand How Platfora Connects to Hadoop
        Obtain Hadoop Configuration Files
        Create Local Hadoop Configuration Directory
    Initialize the Platfora Master
        Configure SSL for Client Connections
        Configure SSL for Catalog Connections
        About System Diagnostic Data
    Troubleshoot Setup Issues
        View the Platfora Log Files
        Setup Fails Setting up Catalog Metadata Service
        TEST FAILED: Checking integrity of binaries
Chapter 8: Start Platfora
    Start the Platfora Server
    Log in to the Platfora Web Application
    Add a License Key
    Change the Default Admin Password
    Load the Tutorial Data
Chapter 9: Initialize a Worker Node
Appendix A: Command Line Utility Reference
    setup.py
    hadoop-check
    hadoopcp
    hadoopfs
    install-node
    platfora-catalog
        platfora-catalog ssl
    platfora-config
    platfora-export
    platfora-import
    platfora-license
        platfora-license install
        platfora-license uninstall
        platfora-license view
    platfora-node
        platfora-node add
        platfora-node config
    platfora-services
        platfora-services start
        platfora-services stop
        platfora-services restart
        platfora-services status
        platfora-services sync
    platfora-syscapture
    platfora-syscheck
Appendix B: Glossary
Preface
This guide provides information and instructions for installing and initializing a Platfora® cluster. This
guide is intended for system administrators with knowledge of Linux/Unix system administration and
basic Hadoop administration.
This on-premise installation guide is for data center environments (either physical or virtual data centers)
that have a permanent, managed Hadoop cluster. Platfora is installed in the same network as your
Hadoop cluster.
Document Conventions
This documentation uses certain text conventions for language syntax and code examples.
Convention: $
Usage: Command-line prompt. Precedes a command to be entered in a command-line terminal session.
Example: $ ls

Convention: $ sudo
Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
Example: SUM(page_views)

Convention: italics
Usage: Italics indicate a user-supplied argument or variable.
Example: SUM(field_name)

Convention: [ ] (square brackets)
Usage: Square brackets denote optional syntax items.
Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
Usage: An ellipsis denotes a syntax item that can be repeated any number of times.
Example: CONCAT(string_expression[,...])
Contact Platfora Support
For technical support, you can send an email to:
support@platfora.com
Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and
product tips.
http://support.platfora.com
To access the support portal, you must have a valid support agreement with Platfora. Please contact
your Platfora sales representative for details about obtaining a valid support agreement or with questions
about your account.
Copyright Notices
Copyright © 2012-15 Platfora Corporation. All rights reserved.
Platfora believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH
RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any Platfora software described in this publication requires an
applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™,
and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache
Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the
property of their respective owners.
Embedded Software Copyrights and License Agreements
Platfora contains the following open source and third-party proprietary software subject to their
respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid
Chapter 1: Installation Overview (On-Premise)
This section provides an overview of the Platfora installation process for environments that will use an on-premise deployment of Hadoop with Platfora.
Topics:
• On-Premise Hadoop Deployments
• Master vs Worker Node Installations
• Preinstall Check List
• High-Level Install Steps
On-Premise Hadoop Deployments
An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your
data center (either a physical data center or a virtual private cloud).
Platfora connects to the Hadoop cluster managed by your organization, and the majority of your
organization's data is stored in the distributed file system of this primary Hadoop cluster.
For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware
co-located in the same data center as your Hadoop cluster. A data center can be a physical location with
actual hardware resources, or a virtual private cloud environment with virtual server instances (such as
Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least
1 Gbps connectivity to the Hadoop nodes.
Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora
master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource
Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes
directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the
firewall as your Hadoop cluster.
Platfora software can run on a wide variety of server configurations, from a single server to a cluster
scaled across multiple servers. Since Platfora runs best with all of the active lenses readily available in RAM,
Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.
Master vs Worker Node Installations
If you are installing Platfora for the very first time, you begin by installing, configuring, and initializing
the Platfora master node. Once you have the master node up and running, you can then add worker
nodes as needed.
All nodes in a Platfora cluster (master and workers) must meet the minimum system requirements and
have the required prerequisite software installed. If you are using the RPM installer packages, you can
use the base installer package to install the required software on each Platfora node. If you are using the
TAR installer packages, you must manually install the required software on each Platfora node.
However, you only need to install the Platfora server software on the master node. Platfora copies the
server software from the master to the worker nodes during the worker node initialization process.
All nodes in a Platfora cluster also require a network environment configured so that the nodes can
communicate with each other and with the Hadoop cluster nodes. If you are adding worker
nodes to an existing Platfora cluster, make sure to follow the instructions for installing dependencies
and configuring the environment. You can skip any tasks denoted as 'Master Only'; these tasks are only
required for first-time installations of the Platfora master node.
Preinstall Check List
Here is a list of items and information you will need in order to install a new Platfora cluster with an on-premise Hadoop deployment. Platfora must be able to connect to Hadoop services during setup, so you
will also need information from your Hadoop installation.
Platfora Checklist
This is a list of things you will need in order to install Platfora nodes.
What You Need, followed by its description:
Platfora License
    Platfora Customer Support must issue you a license file. Trial period licenses are
    available upon request for pilot installations.
Platfora Software
    A Platfora customer support representative can give you the download link to the
    Platfora installation package for your chosen operating system and Hadoop distribution
    and version. Platfora provides both RPM and TAR installer packages.
(MapR Only) MapR Client Software
    If you are using a MapR Hadoop cluster with Platfora, you will need the MapR client
    software for the version of MapR you are using. The MapR client software must be
    installed on all Platfora nodes.
Hadoop Checklist
This is a list of things you will need from your Hadoop environment in order to install Platfora.
What You Need, followed by its description:
Hadoop Distribution and Version Number
    When you install Platfora, you need to specify what Hadoop distribution you have
    (Cloudera, Hortonworks, MapR, etc.) and what version you are running.
Hadoop Hostnames and Ports
    You will need to know the hostnames and ports of your Hadoop services (NameNode,
    Resource Manager or JobTracker, Hive Server, DataNodes, etc.).
Hadoop Configuration Files
    Platfora requires local versions of Hadoop's configuration files. It uses these files to
    connect to Hadoop services:
    • core-site.xml and hdfs-site.xml for HDFS
    • mapred-site.xml and yarn-site.xml for data processing
    • hive-site.xml for the Hive metastore
    The location of these files varies based on your Hadoop distribution.
Platfora Data Directory Location in HDFS
    Platfora requires a directory location in HDFS to store its library files and output
    (lenses).
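As a quick pre-install check, the following sketch reports which of the configuration files named in the checklist are missing from a local directory. The helper name missing_hadoop_conf and the /etc/hadoop/conf path in the usage line are illustrative assumptions, not part of the Platfora tooling; substitute the directory your distribution actually uses.

```shell
# Print the name of each Hadoop configuration file from the checklist
# that is missing from the given directory. Prints nothing if all exist.
# missing_hadoop_conf is a hypothetical helper, not a Platfora utility.
missing_hadoop_conf() {
    conf_dir=$1
    for f in core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml hive-site.xml; do
        [ -f "$conf_dir/$f" ] || echo "$f"
    done
}

# Example call. The directory is an assumption; it varies by distribution.
missing_hadoop_conf /etc/hadoop/conf
```

An empty result means every listed file was found. It does not confirm that the files describe the cluster you intend to connect to.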
High-Level Install Steps
This section lists the high-level steps involved in installing Platfora to work with an on-premise Hadoop
cluster. Note that there are different procedures if you are installing a new Platfora cluster versus adding
a worker node to an existing Platfora cluster.
New Platfora Installation
When installing Platfora for the first time, you begin by installing and configuring the Platfora master
node. After the master node is installed, initialized, and connected to the Hadoop services it needs,
you can use the master node to add worker nodes to the cluster.
These are the high-level steps for installing Platfora for the first time:
1. Make sure your systems meet the minimum System Requirements.
2. Configure Hadoop for Platfora Access.
3. Install Platfora Software and Dependencies.
4. Configure Environment on Platfora Nodes.
5. (Secure Hadoop Only) Configure Platfora for Secure Hadoop Access.
6. Obtain a Copy of Your Hadoop Configuration Files.
7. Configure Access to Your Hadoop Services.
8. Initialize the Platfora Master.
9. Start Platfora.
10. Log in to the Platfora Application.
11. Install the License File.
12. (Optional) Load the Tutorial Data (as a quick way to test that everything works).
13. Add Worker Nodes.
Additional Worker Node Installation
Once you have a Platfora master node up and running, you can use it to initialize additional worker
nodes. Before you can initialize a worker node, however, you must make sure that it has the required
dependencies installed.
These are the high-level steps for adding a worker node to an existing Platfora cluster:
1. Install only the prerequisite software directly on the worker node.
• If using the RPM installer packages, Install Dependencies RPM Package.
• If using the TAR installer packages, you must manually Create the Platfora System User, Set OS
Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.
3. (Secure Hadoop Only) Configure Kerberos Authentication to Hadoop.
4. Add Worker Node to Platfora Cluster.
Chapter 2: System Requirements (On-Premise)
The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start,
and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an
existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a
compatible web browser client. This section describes the system requirements for on-premise deployments of
the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.
Topics:
• Platfora Server Requirements
• Port Configuration Requirements
• Supported Hadoop and Hive Versions
• Hadoop Resource Requirements
• Browser Requirements
Platfora Server Requirements
Platfora recommends the following minimum system requirements for Platfora servers. For multi-node
installations, the master server and all worker servers must run the same operating system (OS) and
have the same system configuration (same amount of memory, CPU, etc.).
64-bit Operating System or Amazon Machine Image (AMI)
    CentOS 6.2-6.5 (7.0 is not supported)
    RHEL 6.2-6.5 (7.0 is not supported)
    Scientific Linux 6.2
    Amazon Linux AMI 2014.03+
    Oracle Enterprise Linux 6.x
    Ubuntu 12.04.1 LTS or higher
    Security-Enhanced Linux 6.2 [1]
    [1] If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for
    installation instructions.
Software
    Java 1.7
    Python 2.6.8, 2.7.1, or 2.7.3 through 2.7.6 (3.0 is not supported)
    PostgreSQL 9.2.1-1, 9.2.5, 9.2.7, or 9.3 (master only)
    OpenSSL 1.0.1 or higher [2]
Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget
Memory
    64 GB minimum, 256 GB recommended
    The server needs enough memory to accommodate actively used lens data. Additionally,
    it needs 1-2 GB reserved for normal operations and the lens query engine workspace.
CPU
    8 cores minimum, 16 recommended
Disk
    All Platfora nodes (master or worker) require 300 MB for the Platfora installation.
    Every node requires high-speed local storage and a local disk cache configured as a
    single logical volume. Hardware RAID is recommended for the best performance.
    All nodes combined require appropriate free space for aggregated data structures
    (Platfora lenses). At a minimum, you will need twice as much disk space as
    system memory.
    The Platfora master node requires approximately 700 MB of additional space for the
    metadata catalog (dataset definitions, vizboard and visualization definitions, lens
    definitions, etc.).
Network
    1 Gbps reliable network connectivity between the Platfora master server and the
    query processing servers
    1 Gbps reliable network connectivity between the Platfora master server and the
    Hadoop NameNode and JobTracker/ResourceManager node
    Network bandwidth should be comparable to the amount of memory on the Platfora
    master server
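The memory and CPU minimums above can be checked on a candidate server with a short script. The following is a minimal sketch for Linux, not a Platfora utility; the helper names meets_minimum and total_mem_gb are hypothetical.

```shell
# Minimums from the requirements above: 64 GB of memory and 8 CPU cores.
MIN_MEM_GB=64
MIN_CORES=8

# meets_minimum ACTUAL REQUIRED: succeed if ACTUAL >= REQUIRED (integers).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Total memory in whole GB, read from /proc/meminfo (Linux only).
# MemTotal excludes memory reserved by the kernel, so a server with
# exactly 64 GB installed can report slightly less. Treat a near miss
# as a prompt to check by hand, not as a hard failure.
total_mem_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

if [ -r /proc/meminfo ]; then
    meets_minimum "$(total_mem_gb)" "$MIN_MEM_GB" ||
        echo "WARNING: less than ${MIN_MEM_GB} GB of memory reported"
    # getconf counts online logical processors, which may exceed the
    # number of physical cores when hyper-threading is enabled.
    meets_minimum "$(getconf _NPROCESSORS_ONLN)" "$MIN_CORES" ||
        echo "WARNING: fewer than ${MIN_CORES} processors reported"
fi
```

Run the script on every node, since the master and all workers are expected to have the same configuration.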
Port Configuration Requirements
You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster
communications. You also must open ports within your Hadoop cluster to allow access from Platfora.
This section lists the default ports required.
[2] Only required if you want to enable SSL for secure communications between Platfora
servers.
Ports to Open on Platfora Nodes
Your Platfora master node must allow HTTP connections from your user network. All nodes must allow
connections from the other Platfora nodes in a multi-node cluster.
On Amazon EC2 instances, you must configure the port firewall rules on the
Platfora server instances in addition to the EC2 Security Group Settings.
Platfora Service                          Default Port   Allow connections from
----------------------------------------  -------------  ----------------------------------------------------------------
Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
Master Server Management Port             8002           Platfora worker servers, localhost
Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003           Platfora worker servers, localhost
Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432           Platfora worker servers, localhost
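On servers that use iptables as the host firewall, the default ports above translate into one accept rule per port. The sketch below only prints candidate rules for review and does not change the firewall. It assumes iptables is the firewall in use, the helper name print_platfora_rules is hypothetical, and the subnet in the usage line is a placeholder.

```shell
# Default Platfora ports from the table above.
PLATFORA_PORTS="8001 8443 8002 8003 5432"

# print_platfora_rules [SOURCE]: print (do not run) one iptables rule per
# port. If SOURCE (an address or CIDR range) is given, the rules accept
# connections from that source only.
print_platfora_rules() {
    src=${1:-}
    for port in $PLATFORA_PORTS; do
        if [ -n "$src" ]; then
            echo "iptables -I INPUT -p tcp -s $src --dport $port -j ACCEPT"
        else
            echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
        fi
    done
}

# Example: rules limited to a placeholder cluster subnet.
print_platfora_rules 10.0.0.0/24
```

The table restricts most ports to other Platfora nodes, so a single source range is a simplification. Only the web services ports need to be reachable from the user network.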
Ports to Open on Hadoop Nodes
Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop
services Platfora needs to access and the default ports for those services.
Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments
in a virtual private cloud, not to Amazon Elastic MapReduce (EMR).
Default ports by Hadoop distribution:
Hadoop Service                       CDH, HDP, Pivotal    Apache Hadoop   MapR    Allow connections from
-----------------------------------  -------------------  --------------  ------  ----------------------------------
HDFS NameNode                        8020                 9000            N/A     Platfora master and worker servers
HDFS DataNodes                       50010                50010           N/A     Platfora master and worker servers
MapRFS CLDB                          N/A                  N/A             7222    Platfora master and worker servers
MapRFS DataNodes                     N/A                  N/A             5660    Platfora master and worker servers
MRv1 JobTracker                      8021                 9001            9001    Platfora master server
MRv1 JobTracker Web UI               50030                50030           50030   External user network (optional)
YARN ResourceManager                 8032                 8032            8032    Platfora master server
YARN ResourceManager Web UI          8088                 8088            8088    External user network (optional)
YARN Job History Server              10020                10020           10020   Platfora master server
YARN Job History Server Web UI       19888                19888           19888   External user network (optional)
HiveServer Thrift Port               10000                10000           10000   Platfora master server
Hive Metastore DB Port [3]           9083, 9933 (HDP2)    N/A             9083    Platfora master server
[3] If connecting to Hive directly using JDBC
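Before initializing Platfora, it can help to confirm from a Platfora node that these Hadoop ports are actually reachable. The following sketch pairs a lookup of a few defaults from the table with a TCP probe. The function names, the distribution and service keywords, and the host name in the commented usage line are all hypothetical. The probe assumes bash (built with /dev/tcp support) and the coreutils timeout command are installed.

```shell
# default_port DISTRO SERVICE: print a default port from the table above.
# DISTRO is one of cdh, hdp, pivotal, apache, mapr (keywords chosen for
# this sketch). Only a few services are covered.
default_port() {
    case "$1:$2" in
        cdh:namenode|hdp:namenode|pivotal:namenode) echo 8020 ;;
        apache:namenode) echo 9000 ;;
        mapr:cldb) echo 7222 ;;
        *:resourcemanager) echo 8032 ;;
        *:hiveserver) echo 10000 ;;
        *) return 1 ;;
    esac
}

# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example (host name is a placeholder):
# port_open namenode.example.com "$(default_port cdh namenode)" \
#     && echo "NameNode reachable" || echo "NameNode NOT reachable"
```

If your cluster uses non-default ports, pass the configured port to port_open directly instead of using the lookup.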
Supported Hadoop and Hive Versions
This section lists the Hadoop distributions and versions that are compatible with the Platfora installation
packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the
version of Hadoop you are using.
Hadoop Distro            Version         Hive Version   M/R Version   Platfora Package
-----------------------  --------------  -------------  ------------  -------------------------
Cloudera 5               CDH5.0          0.12           YARN          cdh5
                         CDH5.1          0.12           YARN          cdh5
                         CDH5.2          0.13           YARN          cdh52
                         CDH5.3          0.13.1         YARN          cdh52
                         CDH5.4          1.1            YARN          cdh54
Hortonworks              HDP 2.1.x       0.13.0         YARN          hadoop_2_4_0_hive_0_13_0
                         HDP 2.2.x       0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
MapR                     MapR 4.0.1      0.12.0         YARN          mapr4
                         MapR 4.0.2      0.13.0         YARN          mapr402
                         MapR 4.1.0      0.13.0         YARN          mapr402
Pivotal Labs             PivotalHD 3.0   0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x)   Hadoop 2.4.0    0.13.1         YARN          hadoop_2_4_0_hive_0_13_0
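When scripting an installation, the version table can be encoded as a lookup so that the wrong installer package is caught early. The sketch below is one way to do that. The function name platfora_package and the version labels it accepts (for example CDH5.2 or HDP2.1) are conventions invented for this example; only the package names come from the table. The Amazon EMR row is left out.

```shell
# platfora_package VERSION: print the Platfora package name that the
# table above lists for the given Hadoop version label. Fails with a
# message on stderr for versions that are not in the table.
platfora_package() {
    case "$1" in
        CDH5.0|CDH5.1)        echo cdh5 ;;
        CDH5.2|CDH5.3)        echo cdh52 ;;
        CDH5.4)               echo cdh54 ;;
        HDP2.1|HDP2.1.*)      echo hadoop_2_4_0_hive_0_13_0 ;;
        HDP2.2|HDP2.2.*)      echo hadoop_2_6_0_hive_0_14_0 ;;
        MapR4.0.1)            echo mapr4 ;;
        MapR4.0.2|MapR4.1.0)  echo mapr402 ;;
        PivotalHD3.0)         echo hadoop_2_6_0_hive_0_14_0 ;;
        *) echo "not in the supported-versions table: $1" >&2; return 1 ;;
    esac
}

platfora_package CDH5.3
```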
Hadoop Resource Requirements
Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions
and resources in the Hadoop source system. This section describes the Hadoop resource requirements for
Platfora.
Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and
as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as
a data source.
Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to
succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.
DFS Disk Space
    Platfora requires a designated persistent storage directory in the remote distributed
    file system (DFS) with appropriate free space for Platfora system files and data
    structures (lenses). The location is configurable.
DFS Permissions
    The platfora system user needs read permissions to source data directories and files.
    The platfora system user needs write permissions to Platfora's persistent storage
    directory on DFS.
MapReduce Permissions
    The platfora system user needs to be added to the submit-jobs and administer-jobs
    access control list (or added to a group that has these permissions).
DFS Resources
    Minimum Open File Limit = 5000
MapReduce Resources
    Minimum Memory for Task Processes = 1 GB
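The open file limit can be compared against the 5000 minimum with a few lines of shell. This is a minimal sketch: open_files_ok is a hypothetical helper, and the check reports the limit of the shell it runs in, so run it on a Hadoop node as the user whose limit you want to verify.

```shell
# Minimum open file limit from the requirements above.
MIN_OPEN_FILES=5000

# open_files_ok LIMIT: succeed if LIMIT (as printed by `ulimit -n`) is
# "unlimited" or at least MIN_OPEN_FILES.
open_files_ok() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$MIN_OPEN_FILES" ]
}

current=$(ulimit -n)
if open_files_ok "$current"; then
    echo "open file limit OK: $current"
else
    echo "open file limit too low: $current (minimum $MIN_OPEN_FILES)"
fi
```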
Browser Requirements
Users can connect to the Platfora web application using the latest HTML5-compliant web browsers.
Platfora supports the latest releases of the following web browsers:
• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not
supported)
Platfora supports these web browsers on desktop machines only.
Chapter 3: Configure Hadoop for Platfora Access
Before initializing and starting Platfora for the first time, you must make sure that Platfora can connect to
Hadoop and access the directories and services it needs. The tasks in this section are performed in your Hadoop
environment, and apply to on-premise Hadoop installations only (not to Amazon EMR).
Topics:
• Create Platfora User on Hadoop Nodes
• Create Platfora Directories and Permissions in Hadoop
• HDFS Tuning for Platfora
• MapReduce Tuning for Platfora
• YARN Tuning for Platfora
Create Platfora User on Hadoop Nodes
Platfora requires a platfora system user account on each node in your Hadoop cluster. The Platfora
server uses this system user account to submit jobs to the Hadoop cluster and to access the necessary
files and directories in the Hadoop distributed file system (HDFS).
Creating a system user requires root or sudo permissions.
1. Create the platfora user:
$ sudo useradd -s /bin/bash -m -d /home/platfora platfora
2. Set a password for the platfora user:
$ sudo passwd platfora
Create Platfora Directories and Permissions in Hadoop
Platfora requires read and write permissions to a designated directory in the Hadoop file system where
it can store its metadata and MapReduce output. Platfora connects to HDFS as the platfora user and
also runs its MapReduce jobs as this same user.
Platfora Installation Guide - Configure Hadoop for Platfora Access
Create a data directory for Platfora and set the platfora system user as its owner. In the example
below, the Hadoop file system has a user called hdfs, the directory is called /platfora and the
Platfora server is running as the platfora system user:
$ sudo -u hdfs hadoop fs -mkdir /platfora
$ sudo -u hdfs hadoop fs -chown platfora /platfora
$ sudo -u hdfs hadoop fs -chmod 711 /platfora
Note that for MapR, run the command as the mapr user:
$ sudo -u mapr hadoop fs -mkdir /platfora
$ sudo -u mapr hadoop fs -chown platfora /platfora
$ sudo -u mapr hadoop fs -chmod 711 /platfora
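The two command sets above differ only in the file system superuser (hdfs or mapr). As a minimal sketch, the helper below prints the commands for a given superuser and directory so they can be reviewed before being run. The function name platfora_dfs_setup_cmds is hypothetical and is not part of Platfora or Hadoop.

```shell
# Hypothetical helper: print (do not run) the directory setup commands
# for a given file system superuser and target directory.
platfora_dfs_setup_cmds() {
  superuser=$1
  dir=${2:-/platfora}
  echo "sudo -u $superuser hadoop fs -mkdir $dir"
  echo "sudo -u $superuser hadoop fs -chown platfora $dir"
  echo "sudo -u $superuser hadoop fs -chmod 711 $dir"
}

# Print the commands for a cluster whose superuser is hdfs.
platfora_dfs_setup_cmds hdfs /platfora
```

Printing the commands first keeps the sketch safe to run on any machine; pipe the output to sh only after confirming the superuser and path are correct for your cluster.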
The platfora system user needs access to the location where MapReduce writes its staging files.
Depending on your Hadoop distribution, the location of the staging area is different. In Cloudera, MapR,
Pivotal, and Hortonworks, the default location is /user/username. In Apache, the location is
/tmp/xxx/xxx/username.
Make sure this location exists and is writable by the platfora system user. For example, on Cloudera:
$ sudo -u hdfs hadoop fs -mkdir /user/platfora
$ sudo -u hdfs hadoop fs -chown platfora /user/platfora
For example, on MapR:
$ sudo -u mapr hadoop fs -mkdir /user/platfora
$ sudo -u mapr hadoop fs -chown platfora /user/platfora
During lens build processing, the platfora system user needs to be able to write to the intermediate
and log directories on the Hadoop nodes. Check the following Hadoop configuration properties and
make sure the specified locations exist in HDFS and are writable by the platfora system user.
| Property | Hadoop Configuration File | Description |
|---|---|---|
| mapreduce.cluster.local.dir | mapred-site.xml | Tells the MapReduce servers where to store intermediate files for a job. |
| mapreduce.jobtracker.system.dir | mapred-site.xml | The directory where MapReduce stores control files. |
| mapreduce.cluster.temp.dir | mapred-site.xml | A shared directory for temporary files. |
| mapr.centrallog.dir (MapR Only) | mapred-site.xml | The central job log directory for MapR Hadoop. |
The platfora system user also needs to be added to the submit-jobs and
administer-jobs access control lists (or added to a group that has these
permissions).
The platfora system user also needs read permissions to the source data
directories and files that you want to analyze in Platfora.
HDFS Tuning for Platfora
Platfora opens files on the Hadoop NameNode and DataNodes as it does its work to build the lens. This
section describes how to ensure your Hadoop cluster has file limits that support lens build operations.
Increase Open File Limits
Platfora opens files on the Hadoop NameNode and DataNodes as it builds the lens. For multiple lens
builds or for lenses that have a lot of fields selected, a lens build can cause your Hadoop nodes to exceed
the maximum open file limit. When this limit is exceeded, Platfora lens builds will fail with a "Too
many open files..." exception.
Linux operating systems limit the number of open files and connections a process can have. This
prevents one application from slowing down the entire system by requesting too many file handles.
When an application exceeds the limit, the operating system prevents the application from requesting
more file handles, causing the process to fail with a "Too many open files..." error.
Verify that the file limits are adequate on each Hadoop node, and increase them on any node
where the limits are too low. There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)
You can check the global limit by running the command:
$ cat /proc/sys/fs/file-nr
This should return a set of three numbers like this:
704 0 294180
The first number is the number of allocated file handles. The second number is the number of
allocated but unused file handles. The third number is the maximum number of file handles for the
whole system. The maximum should be at least 250000.
To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:
fs.file-max = 294180
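The global limit check can be scripted. The following is a minimal sketch that reads the third field of a /proc/sys/fs/file-nr line and compares it with the 250000 minimum given above. The function name file_max_ok is hypothetical.

```shell
# Hypothetical helper: succeeds if the system-wide maximum (third field of
# a /proc/sys/fs/file-nr line) is at least the given minimum.
file_max_ok() {
  nr_line=$1
  minimum=${2:-250000}
  # Split the line on whitespace; the third field is the maximum.
  set -- $nr_line
  [ "$3" -ge "$minimum" ]
}

# Check the live value when running on a Linux node.
if [ -r /proc/sys/fs/file-nr ]; then
  if file_max_ok "$(cat /proc/sys/fs/file-nr)"; then
    echo "global open file limit OK"
  else
    echo "global open file limit is below 250000"
  fi
fi
```

Passing the line as an argument, rather than reading the file inside the function, makes it possible to check output collected from several Hadoop nodes on one machine.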
Increase Platfora User Limits
You can check the per-user process limit by running the command:
$ ulimit -n
This should return the file limit for the currently logged in user, for example:
1024
This limit should be at least 5000 for the platfora system user (or whatever user runs Platfora lens
build jobs).
To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the
* increases the limit for all system users):
* hard nofile 65536
* soft nofile 65536
root hard nofile 65536
root soft nofile 65536
You must reboot the server whenever you change OS kernel settings.
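After the reboot, the per-user limit can be verified with a short script. This is a minimal sketch that assumes it is run while logged in as the platfora system user; the function name nofile_limit_ok is hypothetical, and the 5000 default is the minimum stated in this section.

```shell
# Hypothetical helper: succeeds if an open-file limit meets the minimum.
# "unlimited" (a value ulimit can report) is treated as sufficient.
nofile_limit_ok() {
  limit=$1
  minimum=${2:-5000}
  [ "$limit" = "unlimited" ] || [ "$limit" -ge "$minimum" ]
}

# Run as the platfora user so that ulimit reports that user's limit.
if nofile_limit_ok "$(ulimit -n)"; then
  echo "open file limit OK"
else
  echo "open file limit is below 5000; check /etc/security/limits.conf"
fi
```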
Increase DataNode File Limits
A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time.
In your Hadoop configuration, make sure the DataNodes are tuned to have an upper bound of at least
5000 by setting the following properties in the hdfs-site.xml file (located in the conf directory on
your Hadoop NameNode):
| Framework | hdfs-site.xml Property | Minimum Value |
|---|---|---|
| MapReduce v1 | dfs.datanode.max.xcievers | 5000 |
| YARN | dfs.datanode.max.transfer.threads | 5000 |
Allow Platfora Local Access
If the platfora system user is not able to make HDFS calls during lens build processing, lens build
jobs in Platfora will stall at 0% progress. To prevent this, make sure your hdfs-site.xml files
contain the dfs.block.local-path-access.user parameter and that its value includes the
platfora system user. For example:
<property>
<name>dfs.block.local-path-access.user</name>
<value>gpadmin,hdfs,mapred,yarn,hbase,hive,platfora</value>
</property>
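To confirm the setting on a node, the property value can be extracted from hdfs-site.xml and searched for the platfora user. The sketch below assumes the value element is on the line directly after the name element, as in the example above; it is a line-based check, not a real XML parser. The function names and the default file path are hypothetical.

```shell
# Hypothetical helper: print the comma-separated user list configured for
# dfs.block.local-path-access.user in the given hdfs-site.xml file.
# Assumes <value> is on the line immediately following <name>.
local_access_users() {
  grep -A1 '<name>dfs.block.local-path-access.user</name>' "$1" |
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Hypothetical helper: succeeds if the given user appears in that list.
user_has_local_access() {
  local_access_users "$1" | tr ',' '\n' | grep -qx "$2"
}

# Assumed location; adjust to the conf directory of your Hadoop installation.
HDFS_SITE=${HDFS_SITE:-/etc/hadoop/conf/hdfs-site.xml}
if [ -r "$HDFS_SITE" ]; then
  if user_has_local_access "$HDFS_SITE" platfora; then
    echo "platfora has local block access"
  else
    echo "platfora is missing from dfs.block.local-path-access.user"
  fi
fi
```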
MapReduce Tuning for Platfora
It is common in Hadoop to customize configuration file properties to suit a specific MapReduce
workload. This section lists the mapred-site.xml properties that Platfora needs for its lens builds.
Platfora can pass in certain properties at runtime for its lens build jobs. Other properties must be set on
the Hadoop nodes themselves.
Runtime properties can be set in the Platfora local copy of the mapred-site.xml file, and are then
passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in
your Hadoop environment directly.
Consult your Hadoop vendor's documentation for recommended memory configuration settings for
Hadoop task/container nodes. These settings depend on the node hardware specifications, and can vary
for each environment.
Required Properties for MapReduce v1 Hadoop Clusters
These properties must be set in order for lens build jobs to succeed. You can set these in the local
mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime.
| Property | Recommended Value | Default Value | Runtime? |
|---|---|---|---|
| mapred.child.java.opts | At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node. | -Xmx200m | YES |
| mapred.job.shuffle.input.buffer.percent | 0.30 | 0.70 | YES |
Required Properties for YARN Hadoop Clusters
These properties must be set in order for lens build jobs to succeed. You can set the runtime properties in
the local mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime.
Non-runtime properties must be configured in your Hadoop environment directly.
| Property | Recommended Value | Default Value | Runtime? |
|---|---|---|---|
| mapreduce.map.java.opts, mapreduce.reduce.java.opts | At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node. | -Xmx200m | YES |
| mapreduce.map.shuffle.input.buffer.percent | 0.30. The percentage of total JVM heap size to allocate to storing map outputs during the shuffle phase. | 0.70 | YES |
| mapreduce.reduce.shuffle.input.buffer.percent | 0.30. The percentage of total JVM heap size to allocate to storing reduce outputs during the shuffle phase. | 0.70 | YES |
| mapreduce.map.memory.mb | The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024. | 512 | NO |
| mapreduce.reduce.memory.mb | The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024. | 512 | NO |
| mapreduce.framework.name | yarn. Make sure this is set to yarn to prevent jobs from running in local mode. | local | NO |
Optional Sort Tuning Properties
These properties increase the number of streams to merge at once when sorting files and set a higher
memory limit for sort operations. If the sort phase can fit the data in memory, performance will be better
than if it spills to disk. You may decide to increase these if you notice that records are spilling when you
look at the lens build job details. However, setting this too high can result in job failures. If too much of
the JVM is reserved for sorting, then not enough will be left for other task operations.
The following optional mapred-site.xml properties apply to both MapReduce v1 and YARN
Hadoop clusters.
| Property | Recommended Value | Default Value | Runtime? |
|---|---|---|---|
| io.sort.factor | 100 | 10 | YES |
| io.sort.mb | 25-30% of the *.java.opts values. For example, if the java.opts properties are set to 1024MB, this should be about 256MB. | 100 | YES |
| io.sort.record.percent | 0.15 | 0.05 | YES |
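The io.sort.mb guideline is a simple percentage of the task JVM heap. As a minimal sketch, the hypothetical helper below computes a starting value from the heap size in MB, defaulting to the lower end (25%) of the recommended range. Integer arithmetic rounds the result down.

```shell
# Hypothetical helper: suggest io.sort.mb as a percentage of the task heap.
# $1 = heap size in MB (the -Xmx value), $2 = percentage (default 25).
sort_mb_for_heap() {
  heap_mb=$1
  percent=${2:-25}
  echo $(( heap_mb * percent / 100 ))
}

sort_mb_for_heap 1024      # prints 256
sort_mb_for_heap 2048 30   # prints 614
```

Treat the result as a starting point only; as noted above, reserving too much of the JVM for sorting can cause job failures.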
YARN Tuning for Platfora
This configuration is only required for Hadoop MapReduce v2 clusters with YARN. This section
lists the yarn-site.xml properties that Platfora needs for its lens builds. Platfora can pass in
certain properties at runtime for its lens build jobs. Other properties must be set on the Hadoop nodes
themselves.
Runtime properties can be set in the Platfora local copy of the yarn-site.xml file, and are then
passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in
your Hadoop environment directly.
Consult your Hadoop vendor's documentation for recommended memory configuration settings for
Hadoop task/container nodes. These settings depend on the Hadoop node's hardware specifications, and
can vary for each environment.
Required Properties for YARN Hadoop Clusters
Tuning these properties properly on your Hadoop nodes will optimize Platfora lens build jobs.
| Property | Recommended Value | Default Value | Runtime? |
|---|---|---|---|
| yarn.nodemanager.resource.memory-mb | The total memory size for all containers on a node (in MB). Should be the total amount of RAM on the node, minus 15-20% for reserved system memory space. | 8192 | NO |
| yarn.scheduler.minimum-allocation-mb | The minimum memory size per container. Depends on the amount of total memory on a node: 512 MB (on nodes with 4-8 GB total RAM), 1024 MB (on nodes with 8-24 GB total RAM), 2048 MB (on nodes with more than 24 GB total RAM). | 1024 | YES |
| yarn.scheduler.maximum-allocation-mb | The maximum memory size per container. Same as yarn.nodemanager.resource.memory-mb. | 8192 | YES |
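The sizing guidance in this table can be expressed as two small calculations. The sketch below is illustrative only: the function names are hypothetical, the reserve defaults to 20% (the upper end of the 15-20% range), and a node with exactly 8 GB is placed in the 1024 MB tier, which is an assumption because the documented ranges overlap at that value.

```shell
# Hypothetical helper: yarn.nodemanager.resource.memory-mb as total RAM
# minus a reserved percentage. $1 = total RAM in MB, $2 = reserve % (default 20).
nodemanager_memory_mb() {
  total_mb=$1
  reserve_pct=${2:-20}
  echo $(( total_mb * (100 - reserve_pct) / 100 ))
}

# Hypothetical helper: yarn.scheduler.minimum-allocation-mb for a node with
# the given total RAM in GB. Fails for nodes with less than 4 GB, which the
# table does not cover.
min_allocation_mb() {
  ram_gb=$1
  if [ "$ram_gb" -lt 4 ]; then
    return 1
  elif [ "$ram_gb" -lt 8 ]; then
    echo 512
  elif [ "$ram_gb" -le 24 ]; then
    echo 1024
  else
    echo 2048
  fi
}

nodemanager_memory_mb 65536   # 64 GB node, 20% reserved
min_allocation_mb 64
```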
Determine Maximum Reduce Tasks for Platfora
In addition to these YARN settings in Hadoop, you will need to determine the maximum number
of MapReduce reduce tasks allowed for a Platfora lens build job. This number is then configured in
Platfora after you initialize the Platfora master by setting the Platfora server configuration property:
platfora.reduce.tasks.
The number of reducer tasks can be determined using the following formula:
(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb) *
number_of_hadoop_nodes
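As a worked example of this formula, the minimal sketch below computes the value for a cluster described by three numbers. The function name max_reduce_tasks is hypothetical. The division is performed first with integer arithmetic, so any partial container per node is rounded down before multiplying by the node count.

```shell
# Hypothetical helper implementing the formula above.
# $1 = yarn.nodemanager.resource.memory-mb
# $2 = mapreduce.reduce.memory.mb
# $3 = number of Hadoop nodes
max_reduce_tasks() {
  node_memory_mb=$1
  reduce_memory_mb=$2
  nodes=$3
  echo $(( node_memory_mb / reduce_memory_mb * nodes ))
}

# 8192 MB per node, 1024 MB per reduce container, 10 nodes.
max_reduce_tasks 8192 1024 10   # prints 80
```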
Chapter 4: Install Platfora Software and Dependencies
This section describes how to provision a Platfora node with the required prerequisites and Platfora
software. If you are installing a new Platfora cluster, the master node needs everything (prerequisites and
Platfora software). Worker nodes only need the prerequisite software installed prior to initialization.
Most of the tasks in this section require root permissions. The example commands in the
documentation use sudo to denote the commands that require root permissions.
Topics:
• About the Platfora Installer Packages
• Install Using RPM Packages
• Install Using the TAR Package
About the Platfora Installer Packages
Platfora provides RPM or TAR installer packages that are specific to the Hadoop distribution you are
using. Platfora Customer Support can provide you with the link to download the installer packages for
your environment.
Make sure to download the correct Platfora installer packages for your Hadoop distribution and version.
See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your
chosen Hadoop distribution.
RPM Packages
If you plan to install Platfora on a Linux operating system that supports the RPM package manager,
such as RedHat or CentOS, Platfora recommends using the RPM packages to install Platfora and its
required dependencies.
The platfora-base RPM package includes all the prerequisite software that Platfora needs, plus
automates the OS configurations needed by Platfora. This package should be installed on all Platfora
nodes (master and workers).
Platfora Installation Guide - Install Platfora Software and Dependencies
The platfora-server package includes the Platfora software only, which only needs to be installed
on the master node. The Platfora software is copied to the worker nodes during initialization or upgrade,
so you don't need to install it on the worker nodes ahead of time.
TAR Package
If you plan to install Platfora on a Linux operating system that does not support the RPM package
manager, such as Ubuntu, you have to use the TAR package. You may also use the TAR package if you
just want to install and manage the dependent software that is installed in your environment yourself.
The TAR package contains the Platfora server software only, which only needs to be installed on the
master node.
The TAR package does not contain the prerequisite software that Platfora needs. You must manually
install the required prerequisite software and do the required OS configurations on all Platfora nodes
prior to installing and initializing Platfora.
Install Using RPM Packages
Follow the instructions in this section to install the Platfora dependencies and server software using the
RPM packages. Install the platfora-base RPM package on all Platfora nodes, and the
platfora-server RPM package on the master node only.
Install Dependencies RPM Package
The platfora-base RPM package contains all of the dependent software required by Platfora, and
also automates several OS configuration tasks. This package is installed on all Platfora nodes.
This task requires root permissions. Commands that begin with sudo denote root
commands.
The platfora-base RPM package does the following:
• Creates a /usr/local/platfora/base directory containing Platfora's third-party dependencies.
• Creates the platfora system user. The platfora user has no password set.
• Generates an SSH key for the platfora system user and adds the key to the user's
authorized_keys file.
• Adds the platfora system user to the sudoers file. This allows you to execute commands as root
while logged in as the platfora user.
• Ensures the OS kernel parameters are appropriate for Platfora and sets them if they are not.
• Creates a .bashrc file for the platfora system user.
The platfora-base package uses the following file naming convention, where version-build
is the version and build number of the base package only, and x86_64 is the supported system
architecture. The base and Platfora server packages use different versioning schemes.
platfora-base-version-build-x86_64.rpm
The base package is not updated every Platfora release. It is only updated when
the Platfora dependencies change, which is not as often. When upgrading Platfora,
check the release notes to see if upgrade of the base package is required.
1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the base package. For
example:
$ wget http://downloads.platfora.com/release/platfora-base-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-base-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/base directory was created.
$ sudo ls -a /usr/local/platfora/base
Install Optional Security RPM Package
The platfora-security RPM package contains SSL-enabled PostgreSQL and the OpenSSL
package it depends on. This package is only needed if you plan to enable SSL communications between
the Platfora worker nodes and the Platfora metadata catalog database.
This task requires root permissions. Commands that begin with sudo denote root
commands.
The platfora-security package is installed after the platfora-base package. The
platfora-security RPM package does the following:
• Creates a /usr/local/platfora/security directory containing the SSL-enabled version of
PostgreSQL.
• Checks if OpenSSL version 1.0.1 or later is installed, and if not downloads and installs the openssl
package dependency from the OpenSSL public repo.
• Edits the .bashrc file for the platfora system user and changes the PATH environment variable
so that secure PostgreSQL is listed before the default PostgreSQL installed by the platfora-base
package.
The platfora-security package uses the following file naming convention, where
version-build is the version and build number of the security package only, and x86_64 is the
supported system architecture. The base, security and Platfora server packages use different versioning
schemes.
platfora-security-version-build-x86_64.rpm
The security package only needs to be upgraded when the base package is
upgraded, which is not every release. When upgrading Platfora, check the release
notes to see if upgrade of the base and security packages is required.
1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the security package. For
example:
$ wget http://downloads.platfora.com/release/platfora-security-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-security-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/security directory was created.
$ sudo ls -a /usr/local/platfora/security
Install Platfora RPM Package (Master Only)
The platfora-server RPM package contains the Platfora server software. This package is installed
on the Platfora master node only.
The platfora-server RPM package creates a /usr/local/platfora/platfora-server
directory containing the Platfora software.
The platfora-server package uses the following file naming convention, where hadoop_distro
corresponds to the Hadoop distribution you are using, version-build is the version and build number
of the Platfora software, and x86_64 is the supported system architecture.
platfora-server-hadoop_distro-version-build-x86_64.rpm
Make sure to download the correct Platfora installer packages for your Hadoop
distribution and version. See Supported Hadoop and Hive Versions if you are not
sure which Platfora package to use for your chosen Hadoop distribution.
This task requires root permissions. Commands that begin with sudo denote root commands.
1. Log on to the machine on which you are installing the Platfora master.
2. Using the download link provided by Platfora Customer Support, download the Platfora server
package. For example:
$ wget http://downloads.platfora.com/release/platfora-server-hadoop_distro-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-server-hadoop_distro-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/platfora-server directory was created.
$ sudo ls -a /usr/local/platfora/platfora-server
Install Using the TAR Package
Follow the instructions in this section to install the Platfora dependencies and server software using
the TAR packages. The TAR package contains the Platfora server software only. You must install all
dependencies yourself.
For the Platfora master node, do all the tasks described in this section.
For a Platfora worker node, do all the tasks described in this section except for:
• Install PostgreSQL
• Install Platfora TAR Package
• Install PDF Dependencies
Create the Platfora System User
Platfora requires a platfora system user account to own the Platfora installation and run the Platfora
server processes. This same system user must be created on all Platfora nodes.
This task requires root permissions. Commands that begin with sudo denote root commands.
(MapR Only) If you are using MapR as your Hadoop distribution with Platfora, make sure to follow the
additional steps for MapR. The platfora system user must exist on all Platfora nodes and all MapR
nodes. The UID/GID must also be the same on the MapR nodes as on Platfora nodes.
1. Create the platfora system user:
$ sudo useradd -s /bin/bash -m -d /home/platfora platfora
2. Set a password for the platfora user:
$ sudo passwd platfora
3. (MapR Only) Check the /etc/passwd file on your MapR CLDB node, and find the entry for the
platfora user. Note the user and group id numbers that are used.
For example:
platfora:x:1002:1002::/home/platfora:/bin/bash
4. (MapR Only) Check the /etc/passwd file on your Platfora master node. If the user and group id
numbers for the platfora user are different, update them so that they are the same as on the MapR
nodes.
For example:
$ sudo usermod -u 1002 platfora
$ sudo groupmod -g 1002 platfora
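Comparing the IDs by eye across several nodes is error-prone. As a minimal sketch, the hypothetical helper below prints the uid:gid pair for a user from a passwd-format file, so the output from the MapR CLDB node and from each Platfora node can be compared directly.

```shell
# Hypothetical helper: print "uid:gid" for a user from a passwd-format file.
# $1 = user name, $2 = path to the passwd file.
# Prints nothing if the user has no entry in the file.
uid_gid_of() {
  awk -F: -v user="$1" '$1 == user { print $3 ":" $4 }' "$2"
}

# Run on each node; the output should be identical everywhere.
uid_gid_of platfora /etc/passwd
```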
Configure sudo for the platfora User
This is an optional task. Configuring sudo access for the platfora system user is a convenient way to
run commands as root while logged in as the platfora user.
If you do not configure sudo access for the platfora user, then you must change to the root user to
execute the system commands that require root permissions.
This documentation assumes that you have sudo access configured. If you do not, every time you see
sudo at the beginning of a command, it means you need to be root to run the command.
1. Edit the /etc/sudoers file using the visudo command.
$ sudo visudo
2. Add a line such as the following in this file:
# User privilege specification
platfora ALL=(ALL:ALL) ALL
3. Save your changes and exit the visudo editor.
Generate and Authorize an SSH Key
Generating and authorizing an SSH key for the platfora system user on the localhost is required
by the Platfora management utilities. This task should be performed on all Platfora nodes.
The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote
system in the Platfora cluster without a password prompt). Even in single-node installations, you must
exchange SSH keys for the localhost.
1. Make sure that SELinux is disabled using either the sestatus or getenforce command.
$ sestatus
If SELinux is enabled, disable it using the recommended procedure for the node's operating system.
2. Make sure you are logged in to the Platfora server as the platfora system user.
$ su - platfora
3. Go to the ~/.ssh directory (create it if it does not exist):
$ mkdir -p ~/.ssh
$ cd ~/.ssh
4. Generate a public/private key pair that is NOT passphrase-protected.
Press the ENTER or RETURN key for each prompt:
$ ssh-keygen -C 'platfora key for node 0' -t rsa
Enter file in which to save the key (/home/platfora/.ssh/id_rsa): ENTER
Enter passphrase (empty for no passphrase): ENTER
Enter same passphrase again: ENTER
5. Append the public key to the ~/.ssh/authorized_keys file (this allows SSH access from the
current host to itself):
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6. Make sure the home directory, .ssh directory, and the files it contains have the correct permissions:
$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*
7. Test that you can SSH to localhost without a password prompt.
If prompted to add localhost to the list of known hosts, enter yes :
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be
established...
Are you sure you want to continue connecting (yes/no)? yes
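If the passwordless login still prompts for a password, incorrect permissions are a common cause. The sketch below checks the modes set in step 6 for the .ssh directory, the private key, and the authorized_keys file, and defines a non-interactive login test. Both function names are hypothetical; BatchMode is a standard OpenSSH client option that makes ssh fail instead of prompting.

```shell
# Hypothetical helper: succeeds if the directory is mode 700 and the
# id_rsa and authorized_keys files inside it are mode 600.
ssh_perms_ok() {
  ssh_dir=$1
  [ -n "$(find "$ssh_dir" -maxdepth 0 -perm 700 2>/dev/null)" ] || return 1
  for f in authorized_keys id_rsa; do
    [ -n "$(find "$ssh_dir/$f" -maxdepth 0 -perm 600 2>/dev/null)" ] || return 1
  done
}

# Hypothetical helper: succeeds only if SSH to the host works without any
# prompt (BatchMode disables password and passphrase prompts).
can_ssh_without_prompt() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "${1:-localhost}" true
}

if ssh_perms_ok "$HOME/.ssh"; then
  echo "SSH key permissions OK"
else
  echo "check permissions on ~/.ssh, id_rsa and authorized_keys"
fi
```

Run can_ssh_without_prompt only after localhost has been added to the list of known hosts in step 7, because in batch mode ssh cannot ask for confirmation of an unknown host key.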
Set OS Kernel Parameters
This section has the Linux OS kernel settings required for Platfora. You must have root or sudo
permissions to change kernel parameter settings. Changing kernel settings requires a system reboot in
order for the changes to take effect.
Kernel ulimit Setting
Linux operating systems set limits on the number of open files and connections a process can have. For
some applications, such as Platfora and Hadoop, having a large number of open file handles during
processing is normal. Having the limit set too low can cause Platfora lens builds to fail.
There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)
You must have root or sudo permissions to change OS ulimit settings.
You can check the global limit by running the command:
$ cat /proc/sys/fs/file-nr
This should return a set of three numbers like this:
704 0 294180
The first number is the number of allocated file handles. The second number is the number of
allocated but unused file handles. The third number is the maximum number of file handles for the
whole system. This limit should be at least 250000.
To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:
fs.file-max = 294180
You can check the per-user process limit by running the command:
$ ulimit -n
This should return the file limit for the currently logged in user, for example:
1024
This limit should be at least 20000 for the platfora user (or whatever user runs the Platfora server).
To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the *
increases the limit for all system users):
* hard nofile 65536
* soft nofile 65536
root hard nofile 65536
root soft nofile 65536
Reboot the server for the changes to take effect.
$ sudo reboot
Kernel Memory Overcommit Setting
Linux operating systems allow memory to be overcommitted, meaning the OS will allow an application
to reserve more memory than actually exists within the system. Allowing overcommit prevents the OS
from killing processes when a process requests more memory than is available.
If you are using a version 1.6 Java Runtime Environment (JRE), you must configure your OS to allow
memory overcommit. If you are using a version 1.7 JRE, overcommit is not necessary.
You must have root or sudo permissions to change kernel memory overcommit settings.
1. Check your version of Java.
$ java -version
If you are running a 1.6 version, proceed to the next steps. If you are running a 1.7 version, you do
not need to make any further changes.
2. Edit the /etc/sysctl.conf file.
$ sudo vi /etc/sysctl.conf
3. Set the following value:
vm.overcommit_memory=1
4. Save and close the file.
5. Reboot your system for the change to take effect:
$ sudo reboot
Kernel Shared Memory Settings
Some default OS installations have the system shared memory values set too low for Platfora. You may
need to increase the shared memory settings if they are set too low.
You must have root or sudo permissions to set the system shared memory parameters.
1. In /etc/sysctl.conf, make sure the shared memory parameters have the minimum values or
higher.
If your settings are lower than these minimum values, you will need to change them. If they are
higher than the minimum, leave them as is.
kernel.shmmax=17179869184
kernel.shmall=4194304
2. If you made changes to /etc/sysctl.conf, reboot the server for the changes to take effect.
$ sudo reboot
Install Dependent Software
If using the TAR installation package to install Platfora, you must install all of the dependencies
yourself. This section provides instructions for manually installing the prerequisite software on a
Platfora node.
If you are provisioning a Platfora master node, you must install all dependencies.
If you are provisioning a Platfora worker node, you can skip the task for installing PostgreSQL.
PostgreSQL is only needed on the Platfora master node.
Confirm Linux OS Utilities
Platfora requires several standard Linux utilities to be installed on your system and in your environment
PATH. Check your system for the required utilities before installing Platfora.
Most Linux operating systems already have these utilities installed by default.
• rsync
• ssh
• scp
• tail
• tar
• cp
• wget
• ntp
• sysctl (/usr/sbin must be in your PATH)
To verify that a utility is installed and can be found in the PATH, you can check its location using the
which command. For example:
$ which rsync
$ which tar
$ which sysctl
If a utility is not installed, you will need to install it before installing Platfora. Check your OS
documentation for instructions on installing these utilities.
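Checking each utility by hand can be replaced with a short loop. This is a minimal sketch using the POSIX command -v builtin; the function name missing_utils is hypothetical. ntp is left out of the example call because it is usually installed as a service rather than as a command of that name, so verify it separately using your OS package manager.

```shell
# Hypothetical helper: print the name of every utility in the argument
# list that cannot be found in the current PATH.
missing_utils() {
  for util in "$@"; do
    command -v "$util" >/dev/null 2>&1 || echo "$util"
  done
}

# Any names printed here need to be installed or added to the PATH.
missing_utils rsync ssh scp tail tar cp wget sysctl
```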
Install Java
The Platfora server requires a Java Runtime Environment (JRE) version 1.7 or higher. Platfora
recommends installing the full Java Development Kit (JDK) for access to the latest Java features and
diagnostic tools.
The instructions in this section are for installing version 1.7 of the Open Java Development Kit
(OpenJDK).
You must have root or sudo permissions to install Java.
1. Check if Java 1.7 or higher is already installed.
$ java -version
If java is not found, you will need to install it.
2. Install OpenJDK using your OS package manager.
On Ubuntu Systems:
$ sudo apt-get install openjdk-7-jdk
On RedHat/CentOS Systems:
$ su -c "yum install java-1.7.0-openjdk"
3. Set the JAVA_HOME environment variable in the platfora user’s profile file. For example, where
java_directory is the versioned directory where Java is installed:
$ echo 'export JAVA_HOME=/usr/lib/jvm/java_directory/jre' >> /home/platfora/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
4. Make sure JAVA_HOME is set correctly for the platfora user:
$ su - platfora
$ echo $JAVA_HOME
Confirm Python Installation
The Platfora management utilities require Python version 2.6.8, 2.7.1, or 2.7.3 through 2.7.6. Python
3.x is not supported. Most Linux operating systems already have Python installed by default, but
you need to make sure the version is compatible with Platfora.
To check if the correct version of Python is installed:
$ python -V
If Python is not installed (or you have an incompatible version of Python), you will need to install,
upgrade, or downgrade it before installing Platfora. Check your OS documentation for instructions on
installing one of the supported 2.x versions listed above.
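As an illustration, the supported version list above can be encoded as a shell pattern match. This is a sketch only: the function names are hypothetical, and it assumes that python is the interpreter the platfora user will run and that it prints "Python x.y.z" (Python 2 writes this to stderr).

```shell
# Hypothetical helper: print the version reported by `python -V`.
installed_python_version() {
    python -V 2>&1 | awk '{print $2}'
}

# Hypothetical helper: succeed only for the versions this guide lists as
# supported (2.6.8, 2.7.1, and 2.7.3 through 2.7.6).
python_version_supported() {
    case "$1" in
        2.6.8|2.7.1|2.7.[3-6]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, python_version_supported "$(installed_python_version)" || echo "Unsupported Python version" flags an incompatible interpreter.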
Install PostgreSQL (Master Only)
Platfora stores its metadata catalog in a PostgreSQL relational database. PostgreSQL version 9.2 or 9.3
must be installed (but not running) on the Platfora master server before you start Platfora for the first
time. Platfora worker nodes do not require a PostgreSQL installation.
You must have root or sudo permissions to install PostgreSQL.
Install PostgreSQL 9.2 on Ubuntu Systems
These instructions are for installing PostgreSQL 9.2 on Linux Ubuntu operating systems.
1. Install the dependent libraries:
$ sudo apt-get install libpq-dev
2. Add the PostgreSQL repository to your system configuration:
$ sudo add-apt-repository ppa:pitti/postgresql
$ sudo apt-get update
3. Install PostgreSQL 9.2:
$ sudo apt-get install postgresql-9.2
4. Stop the PostgreSQL service.
$ sudo service postgresql stop
5. Remove the PostgreSQL automatic start-up scripts:
$ sudo rm /etc/rc*/*postgresql
6. Create and change the ownership on the directory where PostgreSQL writes its lock files:
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
7. Update the platfora user’s PATH environment variable to include the PostgreSQL executable
directory and /usr/sbin:
$ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
Install PostgreSQL 9.2 on RedHat/CentOS Systems
These instructions are for installing PostgreSQL 9.2 on RedHat Enterprise Linux (RHEL) or CentOS
operating systems.
1. Download the appropriate PostgreSQL 9.2 YUM repository for your operating system.
Go to the PostgreSQL yum repository website, copy the URL link for the appropriate YUM
repository configuration, and download it using wget.
For example, to download the YUM repository configuration for PostgreSQL 9.2 on a 64-bit RHEL 6
operating system:
$ wget http://yum.pgrpms.org/9.2/redhat/rhel-6-x86_64/pgdg-redhat92-9.2-7.noarch.rpm
2. Add the PostgreSQL YUM repository to your system configuration:
$ sudo rpm -i pgdg-redhat92-9.2-7.noarch.rpm
3. Install PostgreSQL:
$ sudo yum install postgresql92 postgresql92-server
4. If it is enabled, disable the PostgreSQL automatic start-up.
Each operating system has its own technique for auto starting PostgreSQL. If your system uses
chkconfig to manage init scripts, you can remove PostgreSQL from the chkconfig control using
the following command:
$ sudo chkconfig --del postgresql
For some operating systems, the PostgreSQL start.conf file configures the auto-start of a
specific PostgreSQL cluster.
5. Create and change the ownership on the directory where PostgreSQL writes its lock files:
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
6. Update platfora user’s PATH environment variable to include the PostgreSQL executable
directory and /usr/sbin:
$ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
Confirm OpenSSL Installation
Platfora uses OpenSSL for secure communications between the Platfora worker nodes and the metadata
catalog database. Enabling SSL for the Platfora catalog is optional. If you decide to enable it, you
will need OpenSSL version 1.0.1 or higher on your Platfora nodes, and you will need to have:
• SSL-enabled PostgreSQL. If using the RPM installation packages, Platfora provides an optional
platfora-security package that contains SSL-enabled PostgreSQL. If using the TAR
installation packages, the packages provided in the PostgreSQL public repo come with SSL enabled.
• OpenSSL. If using the RPM installation packages, Platfora provides an optional platfora-security
RPM package that pulls this dependency from the public repo. If using the TAR
installation packages, you will have to install the openssl package yourself.
Many Linux operating systems already have OpenSSL installed by default, but you need to make sure
the version is compatible with the version that PostgreSQL uses.
1. Check that OpenSSL version 1.0.1 or higher is installed.
$ openssl version
2. If OpenSSL is not installed (or you have an incompatible version) you will need to install or upgrade
it before enabling SSL for the Platfora catalog. Check your OS documentation for instructions on
installing or upgrading the openssl package.
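The comparison in step 1 can be sketched as a small script. The helper names version_ge and openssl_version_ok are hypothetical, and the sketch assumes that openssl version prints a line of the form "OpenSSL 1.0.1e-fips 11 Feb 2013"; letter suffixes such as "e" are ignored in the comparison.

```shell
# Hypothetical helper: succeed when dotted numeric version $1 >= $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Hypothetical helper: succeed when the `openssl version` output in $1
# reports version 1.0.1 or higher.
openssl_version_ok() {
    # Drop the "OpenSSL " prefix, then everything after the numeric version.
    v=$(printf '%s' "$1" | sed 's/^OpenSSL //; s/[^0-9.].*$//')
    [ -n "$v" ] && version_ge "$v" 1.0.1
}
```

For example, openssl_version_ok "$(openssl version)" || echo "Upgrade OpenSSL before enabling SSL" reports an outdated installation.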
Install Platfora TAR Package (Master Only)
The TAR installation package contains the Platfora server software only. You only need to install this
package on the Platfora master node. You can skip this task if you are provisioning a Platfora worker
node.
The Platfora TAR package uses the following file naming convention, where version-build.num is
the version and build number of the Platfora software and hadoop_distro corresponds to the Hadoop
distribution you are using.
platfora-version-build.num-hadoop_distro.tgz
Make sure to download the correct Platfora installer package for your Hadoop
distribution and version. See Supported Hadoop and Hive Versions if you are not
sure which Platfora package to use for your chosen Hadoop distribution.
This task requires root permissions. Commands that begin with sudo denote root commands.
1. Log on to the machine on which you are installing the Platfora master.
2. Create a Platfora installation directory and ensure that it is owned by the platfora system user.
For example:
$ sudo mkdir /usr/local/platfora
$ sudo chown -R platfora /usr/local/platfora
3. Log in as the platfora user and go to the installation directory that you just created:
$ su - platfora
$ cd /usr/local/platfora
4. Download the 4.5.0 release package and checksum file using the URLs provided by Platfora
Customer Support.
Make sure to download the correct packages for your Hadoop distribution version. For example:
$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz
$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz.sha
5. After downloading the package and checksum file, make sure the package is valid using the shasum
command.
For example:
$ shasum -c platfora-version-build.num-hadoop_distro.tgz.sha
If the package is valid, you should see a message such as:
platfora-version-build.num-hadoop_distro.tgz: OK
6. Unpack the package within the installation directory.
For example:
$ tar -zxvf platfora-version-build.num-hadoop_distro.tgz
7. Create a symbolic link named platfora-server that points to the actual installation directory.
For example:
$ ln -s platfora-version-build.num-hadoop_distro platfora-server
8. Set the PLATFORA_HOME environment variable for the platfora system user.
$ echo "export PLATFORA_HOME=/usr/local/platfora/platfora-server" >> $HOME/.bashrc
9. Set the PATH environment variable for the platfora system user.
The PATH should include /usr/sbin, $PLATFORA_HOME/bin, and the PostgreSQL executable
directories. If your system has more than one version of PostgreSQL installed, make sure that 9.2 is
listed first in the PATH of the platfora user.
Use single quotes so that $PLATFORA_HOME and $PATH are expanded when .bashrc is read, not when the
line is written (PLATFORA_HOME is not yet set in the current shell).
For example (Ubuntu):
$ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
$ source $HOME/.bashrc
For example (RedHat/CentOS):
$ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
$ source $HOME/.bashrc
10. Make sure the JAVA_HOME environment variable is set (if it's not, see Install Java).
$ echo $JAVA_HOME
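Several of the steps above append export lines to the platfora user's .bashrc file, so repeating a step adds the same line again. One way to avoid duplicates is a small helper that appends a line only when it is not already present. This is a sketch; append_once is a hypothetical name and is not provided by Platfora.

```shell
# Hypothetical helper: append the exact line $1 to file $2 unless the file
# already contains that line.
append_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}
```

For example, append_once 'export PLATFORA_HOME=/usr/local/platfora/platfora-server' "$HOME/.bashrc" can be run any number of times and leaves a single export line in the file.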
Install PDF Dependencies (Master Only)
One feature of Platfora is the ability to save a vizboard as a PDF document. In order for the Platfora
server to render PDFs, it needs PhantomJS and the OpenSans font to be installed on the Platfora master
node. You can skip this task if you are provisioning a Platfora worker node.
The PhantomJS installation relies on several fonts that ship with the Platfora software. For this reason,
the PhantomJS installation must be done after installing the Platfora software.
To install PhantomJS, do the following:
1. Log into the Platfora master node as the platfora user.
2. Install the PhantomJS dependencies.
On Ubuntu
$ sudo apt-get install fontconfig
$ sudo apt-get install libfreetype6
$ sudo apt-get install libfontconfig1
$ sudo apt-get install libstdc++6
On Redhat/CentOS
$ sudo yum install fontconfig
$ sudo yum install freetype
$ sudo yum install libfreetype.so.6
$ sudo yum install libfontconfig.so.1
$ sudo yum install libstdc++.so.6
3. Download the compiled PhantomJS executable.
$ sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
4. Extract the files.
$ sudo tar xjf phantomjs-1.9.7-linux-x86_64.tar.bz2
5. Copy the PhantomJS binary to an accessible bin directory.
You should choose a bin directory that is common to most user environments.
$ sudo cp phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/bin
6. Verify the phantomjs command is accessible to the platfora user.
$ which phantomjs
/usr/local/bin/phantomjs
If the command is not found, add the bin directory to the platfora user's environment:
$ echo 'export PATH=/usr/local/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
7. Install the OpenSans font for use by the PDF feature.
a) Make a directory to contain the typeface.
$ sudo mkdir -p /usr/share/fonts/truetype
b) Copy the font to the truetype directory.
$ sudo cp -r $PLATFORA_HOME/server/webapps/proton/dist/fonts/OpenSans /usr/share/fonts/truetype
c) Refresh the font cache.
$ sudo fc-cache -f
After installing, you'll want to verify the installation is running correctly. One easy way to do this is
using examples that came with the PhantomJS tarball:
$ phantomjs phantomjs-1.9.7-linux-x86_64/examples/hello.js
Hello, world!
You can also output a PDF to verify that the fonts were installed correctly. To output to PDF, choose
Share > Prepare PDF for Download from an open vizboard. In the example PDF output, the left
side shows the output when the fonts are installed. The right side was rendered without the proper fonts
installed.
[Figure: example PDF output with the fonts installed (left) and without the fonts installed (right)]
Chapter 5: Configure Environment on Platfora Nodes
This section describes how to configure a Platfora node's operating system and network environment. You
should perform these tasks on every node in the Platfora cluster (master and workers) after you have installed the
Platfora dependencies and software, but before you initialize Platfora (or initialize a new worker node).
Topics:
• Install the MapR Client Software (MapR Only)
• Configure Network Environment
• Configure Passwordless SSH
• Synchronize the System Clocks
• Create Local Storage Directories
• Verify Environment Variables
Install the MapR Client Software (MapR Only)
If you are using MapR as your Hadoop distribution, you must install the MapR client software on all
Platfora nodes (master and workers). If you are not using MapR with Platfora, you can skip this task.
Platfora uses the MapR client to submit MapReduce jobs and file system commands directly to the
MapR cluster. For more information about the MapR client, see the MapR documentation.
If you use MapR 4.1, Platfora requires that you install the MapR 4.0.2 client
software.
You must have root or sudo permissions to install the MapR client.
Installing the MapR Client on Ubuntu
1. Add the following line to the /etc/apt/sources.list file:
deb http://package.mapr.com/releases/version/ubuntu/ mapr optional
Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.
Platfora Installation Guide - Configure Environment on Platfora Nodes
2. Update the repository and install the MapR client:
$ sudo apt-get update
$ sudo apt-get install mapr-client
3. Configure the MapR client where clusterName is the name of your MapR cluster and cldbhost
is the hostname and port of the MapR CLDB node:
$ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:7222
4. Check if the /opt/mapr/hostname file exists on the node.
$ sudo ls /opt/mapr
If the file doesn't exist, create it:
$ sudo hostname -f | sudo tee /opt/mapr/hostname
5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your
version of the MapR client):
$ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc
Installing the MapR Client on RedHat/CentOS
1. Create the file /etc/yum.repos.d/maprtech.repo with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/version/redhat/
enabled=1
gpgcheck=0
protect=1
Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.
2. Install the MapR client. For example, on 64-bit operating systems:
$ sudo yum install mapr-client.x86_64
3. Configure the MapR client where clusterName is the name of your MapR cluster and
cldbhost:port is the hostname and port of the MapR CLDB node:
$ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:port
4. Check if the /opt/mapr/hostname file exists on the node.
$ sudo ls /opt/mapr
If the file doesn't exist, create it:
$ sudo hostname -f | sudo tee /opt/mapr/hostname
5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your
version of the MapR client):
$ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc
Configure Network Environment
A Platfora node needs to be able to connect to other Platfora nodes over the network, and to the Hadoop
services it needs. This section describes how to check the network connections between nodes, and make
sure the required ports are open to connections from a Platfora node.
Configure /etc/hosts File
The /etc/hosts file is a system file that identifies the hostnames and IP addresses of other machines
in the network so that they can find each other.
Platfora uses the /etc/hosts system file to find other nodes over the network. This means that each
node in a Platfora cluster must have the same entries. When you add, change, or remove a node, you
should update the /etc/hosts file on all Platfora nodes. For on-premise Hadoop installations, you
will also need to specify the address of your Hadoop NameNode.
A typical /etc/hosts file on a Platfora node might look something like this:
# Platfora IP   Hostname           Alias
127.0.0.1       localhost
10.202.123.45   ip-10-202-123-45   platfora-master
10.202.123.46   ip-10-202-123-46   platfora-worker-1
10.202.123.47   ip-10-202-123-47   platfora-worker-2
10.202.123.48   ip-10-202-123-48   platfora-worker-3
# Hadoop IP     Hostname           Alias
10.202.123.55   ip-10-202-123-55   hadoop-namenode
Platfora relies on the IP address associated with a node's network interface.
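Because every node needs the same entries, it can help to check a hosts file for the names you expect. The following sketch uses a hypothetical helper, missing_hosts, that prints each name that has no entry in the hostname or alias columns of the given file; the node names in the usage example are the illustrative ones from the sample file.

```shell
# Hypothetical helper: print each name in "$2 ..." that does not appear in the
# hostname/alias fields of the hosts file "$1". Returns non-zero if any name
# is missing.
missing_hosts() {
    hosts_file=$1
    shift
    status=0
    for name in "$@"; do
        # Ignore comment lines; fields 2..NF hold the hostname and aliases.
        if ! awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "$name"
            status=1
        fi
    done
    return $status
}
```

For example, missing_hosts /etc/hosts platfora-master platfora-worker-1 hadoop-namenode prints nothing when all three names have entries on the node where it is run.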
Host File Configuration on Amazon EC2 Instances
If you are installing your Platfora nodes on Amazon EC2 instances, the entries in the /etc/hosts file
should use the Amazon internal IP addresses and hostnames.
If you are using standard EC2 instances, the internal IP address is associated with the network interface
of the instance. When you stop or restart a standard EC2 instance, its internal IP address and hostname
change. This means that whenever you stop and restart an instance, you'll need to update the /etc/hosts
files to reflect the new internal IP addresses and hostnames that are assigned to the instance.
Platfora recommends using virtual private cloud (VPC) EC2 instances to run your Platfora nodes. EC2-VPC
instances maintain their assigned internal IP address and hostname through restarts.
Amazon Elastic MapReduce (EMR) Hadoop instances are launched on EC2-VPC instances by default.
You do not need to put Hadoop node entries in your Platfora node /etc/hosts files if you are using
EMR as your Hadoop distribution. You only need Hadoop entries if you are running your own managed
Hadoop cluster on EC2.
Verify Connectivity Between Platfora Nodes
The Platfora master and worker nodes must be able to accept incoming network connections from each
other on the ports designated for Platfora intra-node communications. This section explains how you
can test network connectivity between Platfora nodes and verify that the required ports are open.
In a multi-node Platfora cluster, all of the nodes must be able to connect to each other over the network.
Platfora services use certain ports for intra-node communications. Before you initialize Platfora, you
should decide what ports to use for these services, and make sure that they are open to connections from
other Platfora nodes.
The following table shows the default Platfora ports:
Platfora Service                         Default Port  Allow connections from…
Master Web Services Port (HTTP)          8001          External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)  8443          External user network, Platfora worker servers, localhost
Master Server Management Port            8002          Platfora worker servers, localhost
Worker Server Management Port            8002          Platfora master server, other Platfora worker servers, localhost
Master Data Port                         8003          Platfora worker servers, localhost
Worker Data Port                         8003          Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port          5432          Platfora worker servers, localhost
One way to verify that these ports are open to connections from another Platfora node is to use the
telnet command.
For example, to test if port 8002 was open on a remote node with the IP address 10.10.10.9, you
could run the following command to test the connection:
$ telnet 10.10.10.9 8002
If a connection is not allowed, you will need to configure the firewall on your Platfora nodes to open the
appropriate ports and allow incoming connections from the other Platfora nodes.
On Amazon EC2 instances, you may need to configure the port firewall rules on the Platfora server
instances in addition to the EC2 Security Group Settings.
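If telnet is not installed, or you want to test several ports at once, the check can be scripted. The sketch below is one possible approach, not a Platfora utility: port_open and check_ports are hypothetical names, and the sketch assumes that bash (built with /dev/tcp redirection support) and the coreutils timeout command are available on the node.

```shell
# Hypothetical helper: succeed if a TCP connection to host $1, port $2 can be
# opened within 3 seconds.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Hypothetical helper: print one "host:port open" or "host:port closed" line
# for each port listed after the host name.
check_ports() {
    host=$1
    shift
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

For example, check_ports 10.10.10.9 8001 8002 8003 5432 tests the default Platfora ports on a remote node. A "closed" result covers refused, filtered, and timed-out connections alike.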
Verify Connectivity to Hadoop Nodes
The Platfora master and worker nodes must be able to connect to certain Hadoop services. This topic
explains how you can test network connectivity between Platfora nodes and an on-premise Hadoop
installation to verify that the required ports are open.
The following table shows the default Hadoop service ports that Platfora needs open:
Default ports by Hadoop distro (CDH/HDP/Pivotal, Apache Hadoop, MapR):
Hadoop Service                  CDH, HDP, Pivotal  Apache Hadoop  MapR   Allow connections from…
HDFS NameNode                   8020               9000           N/A    Platfora master and worker servers
HDFS DataNodes                  50010              50010          N/A    Platfora master and worker servers
MapRFS CLDB                     N/A                N/A            7222   Platfora master and worker servers
MapRFS DataNodes                N/A                N/A            5660   Platfora master and worker servers
MRv1 JobTracker                 8021               9001           9001   Platfora master server
MRv1 JobTracker Web UI          50030              50030          50030  External user network (optional)
YARN ResourceManager            8032               8032           8032   Platfora master server
YARN ResourceManager Web UI     8088               8088           8088   External user network (optional)
YARN Job History Server         10020              10020          10020  Platfora master server
YARN Job History Server Web UI  19888              19888          19888  External user network (optional)
HiveServer Thrift Port          10000              10000          10000  Platfora master server
Hive Metastore DB Port [4]      9083, 9933 (HDP2)  N/A            9083   Platfora master server
[4] If connecting to Hive directly using JDBC
To determine the interfaces and ports your particular Hadoop cluster is using for its file system and data
processing services, look at the core-site.xml and mapred-site.xml or yarn-site.xml
configuration files on your Hadoop NameNode (typically located in Hadoop's conf directory).
One way to verify that these ports are open to connections from a Platfora node is to use the telnet
command.
For example, to test if port 8020 was open on the Hadoop NameNode with the IP address
10.10.10.9, you could run the following command to test the connection:
$ telnet 10.10.10.9 8020
If a connection is not allowed, you will need to configure the firewall on your Hadoop nodes to open the
appropriate ports and allow incoming connections from the Platfora nodes.
Also, make sure your Hadoop services are actually up and running.
Note for Amazon Users: If you are using Amazon Elastic MapReduce as your Hadoop cluster, the
EC2 Security Group Settings are sufficient to allow connectivity between Platfora instances on EC2 and
the EMR instances. If you are running your own Hadoop cluster on designated Amazon EC2 instances,
you may need to configure the port firewall rules on the Hadoop server instances in addition to the EC2
Security Group Settings.
Open Firewall Ports
If using firewall software in your network, you must open the required ports in the firewall software to
allow incoming connections from the other servers in your Platfora and Hadoop clusters. On Amazon
EC2 clusters, this is in addition to configuring your EC2 security group settings.
For a list of the default Platfora and Hadoop ports, see Port Configuration Requirements.
The process to open a firewall port depends on your server's operating system.
For RedHat / CentOS Servers:
Add a line to the /etc/sysconfig/iptables file to open the required port. For example (for port 8001):
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8001 -j ACCEPT
Restart the firewall for your changes to take effect. For example:
$ sudo /etc/init.d/iptables restart
For Ubuntu Servers:
Open the required port in the firewall. For example (for port 8001):
$ sudo ufw allow 8001
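When several ports need to be opened on a RedHat/CentOS server, the rule lines can be generated rather than typed by hand. This sketch only prints rules in the same form as the example above; iptables_rules is a hypothetical helper name, and the sketch does not modify the firewall configuration itself.

```shell
# Hypothetical helper: print one iptables ACCEPT rule, in the format used in
# /etc/sysconfig/iptables, for each TCP port given as an argument.
iptables_rules() {
    for port in "$@"; do
        printf '%s\n' "-A INPUT -m state --state NEW -m tcp -p tcp --dport $port -j ACCEPT"
    done
}
```

For example, iptables_rules 8001 8002 8003 5432 prints four lines that can be reviewed and then added to /etc/sysconfig/iptables. Rules are evaluated in order, so they need to appear before any REJECT rule in that file.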
Configure Passwordless SSH
The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote
system in the Platfora cluster without a password prompt). Even in single-node installations, you must
exchange SSH keys for the localhost.
Verify Local SSH Access
This task confirms that local SSH access was set up correctly during installation. If it wasn't, then you'll
have to configure it before initializing Platfora.
If you installed Platfora using the RPM packages, the package installer creates the platfora user,
generates an SSH key, and authorizes it for the localhost. If you installed using the TAR package, you
should have done these steps manually as part of installing the dependencies.
Test that you can SSH to localhost without a password prompt.
$ su - platfora
$ ssh localhost
If prompted to add localhost to the list of known hosts, enter yes:
The authenticity of host 'localhost (127.0.0.1)' can't be established...
Are you sure you want to continue connecting (yes/no)? yes
If you get a password prompt, see Generate and Authorize an SSH Key.
Exchange SSH Keys (Multi-Node Only)
In multi-node installations of Platfora, each Platfora node must have the public SSH key for itself and
all other nodes in the Platfora cluster in its list of authorized keys. This task applies only when adding a
worker node to a Platfora cluster.
You must exchange SSH keys between all Platfora nodes as the platfora user (master and all worker
nodes). This procedure should be executed from each new worker node that you add to the Platfora
cluster.
Before you can exchange an SSH key, you have to generate and authorize it. If you installed Platfora
using the RPM packages, the installer should have done this for you automatically. See Verify Local SSH
Access to confirm this was set up correctly.
If you installed using the TAR package, you should have done this prior to installing the Platfora
software. See Generate and Authorize an SSH Key.
1. Make sure you are logged in to the Platfora worker node as the platfora system user.
$ su - platfora
2. Copy the public key of the current worker node to the other Platfora nodes in the cluster (master and
other worker nodes).
If you have password authentication enabled between the Platfora hosts, you can add the public key
to each of the remote hosts as follows:
$ ssh-copy-id platfora@master_hostname
$ ssh-copy-id platfora@worker1_hostname
$ ssh-copy-id platfora@worker2_hostname
If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in to
each remote server in a separate terminal session and copy/paste the public key of the current worker
host into each remote server's authorized_keys file.
3. Copy the public keys from all other Platfora nodes to the current worker node (master and other
worker nodes).
One way to do this is to copy the entire contents of the master’s authorized_keys file (which
should have the keys of all nodes in the cluster) into the current node’s authorized_keys file.
If you have password authentication enabled between the Platfora hosts, you can copy the master's
authorized_keys file to the current node's authorized_keys file as follows:
$ scp platfora@master_hostname:/home/platfora/.ssh/authorized_keys ~/.ssh/authorized_keys
If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in
to the master server in a separate terminal session, copy the contents of its authorized_keys file,
and paste into the ~/.ssh/authorized_keys file of the current node.
4. Test that you can ssh to the other Platfora nodes without a password prompt.
For example (if prompted to add the other host to the list of known hosts, enter yes):
$ ssh worker_hostname
The authenticity of host 'worker_hostname (110.123.4.5)' can't be
established...
Are you sure you want to continue connecting (yes/no)? yes
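The test in step 4 can be repeated for every node with a short loop. The sketch below uses the standard OpenSSH options BatchMode (never prompt for a password) and ConnectTimeout. The function name check_passwordless_ssh and the SSH_CMD override variable are hypothetical and only illustrate the idea. Note that in batch mode a host that is not yet in known_hosts is also reported as FAILED, so accept the host keys interactively first.

```shell
# Hypothetical helper: for each host, try to run a no-op command over SSH
# without allowing any password or confirmation prompt.
# SSH_CMD (hypothetical) lets you substitute a different ssh binary or wrapper.
check_passwordless_ssh() {
    status=0
    for host in "$@"; do
        if "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
                "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: OK"
        else
            echo "$host: FAILED"
            status=1
        fi
    done
    return $status
}
```

For example, running check_passwordless_ssh master_hostname worker1_hostname worker2_hostname as the platfora user prints one result line per node.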
Synchronize the System Clocks
Platfora uses NTP (Network Time Protocol) to keep the system clocks on the Platfora servers synchronized
and accurate.
Accurate system clocks are important for consistent timestamps in your Platfora log files and for
accurate scheduling of lens builds. See www.ntp.org for more information about using NTP.
Synchronizing the system clock involves installing the NTP software, making sure all Platfora servers
are using the same list of NTP time servers (as configured in the /etc/ntp.conf), and starting the
NTP daemon (ntpd).
1. Install the NTP software.
On RedHat/CentOS
$ sudo yum install ntp
On Ubuntu
$ sudo apt-get install ntp
2. Verify that NTP is configured to use the correct time server for your network in /etc/ntp.conf.
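3. Start the NTP daemon service. The service is named ntpd on RedHat/CentOS and ntp on Ubuntu.
On RedHat/CentOS
$ sudo service ntpd start
On Ubuntu
$ sudo service ntp start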
Create Local Storage Directories
The Platfora server needs local file system locations for its data files and configuration files. These must
be the same locations on all Platfora servers. When you add a worker node, the locations used on the
master are created on the worker node for you (provided the platfora system user has write access to
these locations). If not, you'll have to create these locations on the worker nodes ahead of time.
Create the Platfora Data Directory
Each Platfora server needs a location where it can store its data and work files. This location should
have enough disk space to accommodate the Platfora server log files, the metadata catalog database, and
materialized lens data. This directory must be writable by the platfora system user.
For example:
$ mkdir /data/platfora_data
Set the PLATFORA_DATA_DIR environment variable for the platfora system user, for example:
$ echo "export PLATFORA_DATA_DIR=/data/platfora_data" >> $HOME/.bashrc
Create the Platfora Configuration Directory
Each Platfora server needs a location where it can store its configuration files. This directory must be
writable by the platfora system user. For example:
$ mkdir /home/platfora/platfora_conf
Set the PLATFORA_CONF_DIR environment variable for the platfora system user, for example:
$ echo "export PLATFORA_CONF_DIR=/home/platfora/platfora_conf" >> $HOME/.bashrc
Source the ~/.bashrc file.
$ source $HOME/.bashrc
Verify Environment Variables
The Platfora installation uses several system environment variables which are typically set during the
installation process. These environment variables are used by the Platfora software to determine the
location of various directories and files.
Verify the platfora user environment by looking at the .bashrc file in the platfora user's home
directory.
Variable Name                    Description
PLATFORA_HOME                    Location of the Platfora installation files.
PLATFORA_DATA_DIR                Location of the Platfora data directory containing the metadata
                                 catalog, lens data, and work files.
PLATFORA_CONF_DIR                Local directory where Platfora stores its configuration files.
HADOOP_CONF_DIR                  Location of the local Hadoop configuration files that Platfora uses
                                 to connect to the various Hadoop services.
JAVA_HOME                        Location of the Java installation on your system.
PATH                             Locations of system executables.
LD_LIBRARY_PATH                  Locations of system library files. If you use data compression, make
                                 sure that LD_LIBRARY_PATH also contains the paths to the compression
                                 libraries you are using.
PLATFORA_HADOOP_LIB (MapR Only)  Location of the MapR client library files for Hadoop. Only needed
                                 if you are using MapR.
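In addition to reading the .bashrc file, you can check the variables in a login shell for the platfora user. The following sketch reports any variable from the table that is not exported in the current environment; report_missing_vars is a hypothetical helper name.

```shell
# Hypothetical helper: print "NAME is not set" for each named variable that is
# missing or empty in the exported environment. Returns non-zero if any
# variable is missing.
report_missing_vars() {
    status=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "$name is not set"
            status=1
        fi
    done
    return $status
}
```

For example, report_missing_vars PLATFORA_HOME PLATFORA_DATA_DIR PLATFORA_CONF_DIR HADOOP_CONF_DIR JAVA_HOME prints nothing when all five variables are set. Add PLATFORA_HADOOP_LIB to the list only if you are using MapR.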
Chapter 6: Configure Platfora for Secure Hadoop Access
This section describes how to configure a Platfora node to authenticate to a Hadoop cluster that has been
configured to run in secure mode. If you are not using Kerberos-protected secure Hadoop services, or if you are
using Amazon EMR, you can skip the tasks in this section.
Topics:
• About Secure Hadoop
• Configure Kerberos Authentication to Hadoop
• Configure Secure Impersonation in Hadoop
About Secure Hadoop
By default Hadoop runs in non-secure mode, meaning users and clients can connect to Hadoop services
without providing authentication credentials. If you have configured your Hadoop cluster to run in
secure mode, each client connection needs to be authenticated by Kerberos in order to use Hadoop
services.
The Hadoop services use Kerberos to authenticate users on all remote procedure calls (RPCs). Group
resolution is performed on the Hadoop NameNode, JobTracker, and ResourceManager. Tasks run as the
user account that submitted the job.
The Platfora master node accesses Hadoop services when:
• Connecting to Hadoop data sources.
• Defining datasets in the Platfora data catalog.
• Submitting and monitoring lens build jobs.
The Platfora worker nodes access the Hadoop file system when:
• Copying lens data output to Platfora.
Platfora acts as a client of the Hadoop file system and data processing services. It connects to these
services using the platfora system user account. In order for Platfora to access a secure Hadoop
cluster, this platfora user must be authenticated by Kerberos.
Platfora Installation Guide - Configure Platfora for Secure Hadoop Access
Consult your Hadoop vendor documentation for enabling secure Hadoop. After secure Hadoop is
enabled, Platfora is just another Kerberos client that you add to your secure Hadoop environment.
Configure Kerberos Authentication to Hadoop
Platfora supports Kerberos authentication to secure Hadoop services. To enable access to secure
Hadoop, you must configure each Platfora server as a Kerberos client in the same realm as your secure
Hadoop services.
Obtain Kerberos Tickets for a Platfora Server
To configure Kerberos authentication between Platfora and Hadoop, you will need to request a Kerberos
ticket as the system user that runs the Platfora server (i.e. the platfora user). You can configure the
Kerberos client software to request a ticket for this user at login.
This will allow Platfora to access Kerberos-protected Hadoop services once the platfora system user
has successfully logged in to the operating system.
This guide does not provide instructions for installing and configuring the Kerberos client software. See
your Linux operating system documentation for detailed instructions. Guides for CentOS and Ubuntu
can be found online.
See your Hadoop vendor's documentation for creating Kerberos principals and keytabs for Hadoop client
services. You should follow the same procedure to create a Kerberos service principal name and keytab
file for each Platfora node.
Auto-Renew Kerberos Tickets for a Platfora Server
In addition to the standard Kerberos client software, Platfora recommends also installing the kstart
package on the Platfora server machines, and using the k5start utility to start a daemon process to
maintain the Kerberos ticket cache for the Platfora server principal. Otherwise, the Platfora server will
be denied access to Kerberos-enabled Hadoop services whenever its issued Kerberos ticket expires.
To enable automatic renewal of the Kerberos ticket for the Platfora server:
1. Install the kstart package.
2. Before starting the Platfora server, run the k5start utility.
For example, use the keytab file to obtain a ticket granting ticket (TGT) for the principal
platfora/myrealm.com (the principal name as specified in the keytab file). The lifetime is 10
hours and the program wakes up every 10 minutes to check if the ticket is about to expire:
$ sudo k5start -f keytab -K 10 -l 10h platfora/myrealm.com
If a ticket expires and is re-issued in the middle of a lens build job, the Platfora
System page may show a Kerberos authentication failure. Failed authentication
attempts are always retried, however, and Hadoop usually completes the job as
expected despite the initial authentication failure.
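One way to confirm that the ticket cache is still valid, for example before starting the Platfora server, is sketched below. It assumes the MIT Kerberos klist utility, whose -s option prints nothing and sets the exit status according to whether the cache holds unexpired credentials. The ticket_status function and the KLIST override variable are illustrative names only; they are not part of Platfora or kstart.

```shell
#!/bin/sh
# Assumption: MIT Kerberos klist is installed. KLIST can be overridden
# to point at a different binary.
KLIST="${KLIST:-klist}"

# Print "valid" if the credential cache holds an unexpired ticket,
# otherwise print "missing-or-expired".
ticket_status() {
    if "$KLIST" -s; then
        echo "valid"
    else
        echo "missing-or-expired"
    fi
}

# Only report when klist is actually available on this machine.
if command -v "$KLIST" >/dev/null 2>&1; then
    echo "Kerberos ticket cache: $(ticket_status)"
fi
```

Run the check as the platfora system user, because the ticket cache is per user.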
Configure Secure Impersonation in Hadoop
If your Hadoop cluster runs in secure mode, you can do additional configuration in Hadoop to enable
secure impersonation. Secure impersonation allows a given Hadoop superuser to submit jobs or access
files on behalf of another user. Secure impersonation is used in conjunction with Platfora's HDFS
Delegated Authorization feature.
Secure impersonation is not required to access a secure Hadoop cluster. You can configure Platfora
to authenticate to a secure Hadoop cluster without using secure impersonation. All tasks initiated by
Platfora are performed as the platfora system user in that case.
Secure impersonation is required to use Platfora's HDFS Delegated Authorization feature. This allows
the platfora system user to submit tasks on behalf of another user. The Platfora server uses its
Kerberos credentials to authenticate to Hadoop. However, file system accesses and tasks are authorized
as the user who is logged in to the Platfora application.
To use Platfora's HDFS Delegated Authorization feature, you must do the following to enable secure
impersonation in your Hadoop environment:
• Add the platfora system user to the HDFS supergroup on all Hadoop nodes.
• Create a /user/username directory in HDFS for each proxied user that is owned by that user.
• Grant read access on the appropriate source data files and directories in HDFS to the proxied user
groups.
• Enable the secure impersonation properties for the platfora superuser in the core-site.xml file on your Hadoop nodes. For example:
<property>
<name>hadoop.proxyuser.platfora.groups</name>
<value>marketing,sales</value>
<description>Allow the superuser 'platfora' to impersonate any
users in the groups named 'marketing' or 'sales'
These groups should map to the LDAP groups registered
in Platfora.
</description>
</property>
<property>
<name>hadoop.proxyuser.platfora.hosts</name>
<value>*</value>
<description>The superuser 'platfora' can connect from any host
to impersonate a user</description>
</property>
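The per-user home directories mentioned in the list above could be prepared along the following lines. This is a sketch only: it prints the HDFS commands for review rather than running them, it assumes the hdfs dfs command-line client (older distributions may use hadoop fs instead), and the user names alice and bob are hypothetical.

```shell
#!/bin/sh
# Print (do not run) the HDFS commands that create a /user/<name>
# directory owned by each proxied user.
# Usage: proxied_home_cmds USER [USER...]
proxied_home_cmds() {
    for user in "$@"; do
        echo "hdfs dfs -mkdir -p /user/$user"
        echo "hdfs dfs -chown $user /user/$user"
    done
}

# alice and bob are placeholder user names.
proxied_home_cmds alice bob
```

After reviewing the output, the commands can be run by a user with HDFS superuser privileges.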
Chapter 7: Initialize Platfora Master Node
This section describes how to set up a new Platfora cluster by initializing the master node. Once the Platfora
master node is up and running, you will have a fully functioning single-node Platfora cluster. You can then use
the master node to add the worker nodes into the cluster.
Topics:
• Connect Platfora to Your Hadoop Services
• Initialize the Platfora Master
• Troubleshoot Setup Issues
Before you initialize the Platfora master, make sure you have done all the tasks described in Install Platfora
Software and Dependencies and Configure Environment on Platfora Nodes.
Connect Platfora to Your Hadoop Services
In order to initialize a new Platfora cluster, the master node must be able to connect to the Hadoop
services it needs. This section explains how to configure Platfora to connect to your Hadoop file system
and data processing services. This process is different depending on the type of Hadoop deployment you
have.
Understand How Platfora Connects to Hadoop
The Platfora servers use native Hadoop protocols to connect to Hadoop services using remote procedure
calls (RPC). Platfora is a client of Hadoop, and uses the standard Hadoop configuration files to connect
to its services.
These configuration files must be in a local directory on the Platfora master node. You can either
obtain a copy of the files from your Hadoop environment or recreate them with the minimum
required properties.
Platfora Installation Guide - Initialize Platfora Master Node
If you are using Amazon Elastic MapReduce (EMR) as your primary Hadoop distribution, you only
need the core-site.xml file to connect to Amazon S3. You then set Platfora configuration properties
to connect to EMR for data processing services.
Hadoop File | Description | Connects to...
core-site.xml | Platfora uses the core-site.xml configuration file to connect to the distributed file system service for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS for MapR, or S3 for Amazon EMR. | On-Premise Hadoop: HDFS NameNode. Amazon EMR: S3 Bucket.
hdfs-site.xml | Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is stored in the remote Hadoop distributed file system (HDFS). | On-Premise Hadoop: HDFS NameNode. Amazon EMR: not used.
mapred-site.xml | Platfora uses the mapred-site.xml configuration file to connect to the MapReduce JobTracker service and to pass in runtime properties for lens build MapReduce jobs. This file is required for Hadoop deployments using MapReduce v1 or YARN. | On-Premise Hadoop: MapReduce JobTracker. Amazon EMR: not used.
yarn-site.xml | Platfora uses the yarn-site.xml configuration file to connect to the YARN ResourceManager service and to pass in runtime properties for map and reduce task containers. This file is required for Hadoop deployments using YARN. | On-Premise Hadoop: YARN ResourceManager. Amazon EMR: not used.
hive-site.xml | You only need to configure a hive-site.xml file if you plan to use Hive as a data source for Platfora. | Hive Metastore
Obtain Hadoop Configuration Files
The easiest way to supply the configurations that Platfora needs is to obtain a copy of your configuration
files from your Hadoop installation and place them in the local Platfora Hadoop configuration directory.
You can then change any configurations as needed for Platfora. This task only applies to on-premise
Hadoop deployments, not Amazon EMR deployments.
Platfora requires local versions of the core-site.xml, hdfs-site.xml, and mapred-site.xml
files. If your Hadoop distribution supports YARN, you must also include a local yarn-site.xml
file. Finally, if you choose the option to use the Hive metastore as a Platfora data source, you must also
provide a hive-site.xml file.
You can copy the files directly from your Hadoop servers. The location of the Hadoop configuration
files varies depending on your Hadoop installation, but they can typically be found in one of the
following locations on your Hadoop NameNode:
• /etc/hadoop/conf
• $HADOOP_INSTALL/hadoop/conf
• /opt/mapr/hadoop/hadoop-version/conf
Downloading Configuration Files in Cloudera and Hortonworks
If you are using Cloudera Manager or Hortonworks Ambari Server, you can download a zip file
containing your Hadoop configuration files. For example, in the Cloudera Manager Admin
Console:
1. Go to Cluster/Services > Actions > Download Client Configuration.
2. Select Service > All Services.
3. Under cluster-level Actions, click Client Configuration URLs.
4. Choose the configuration files for the services needed by Platfora (HDFS, MapReduce, YARN,
Hive) and download to your local system.
Create Local Hadoop Configuration Directory
This section describes the minimum Hadoop configuration properties that Platfora needs as a client of
Hadoop's services.
The Platfora master node machine requires a local directory where it can find copies of the standard
Hadoop configuration files. When you initialize the Platfora master, you must provide the location of a
local Hadoop configuration directory.
1. Create a configuration directory location owned by the platfora system user.
$ su - platfora
$ mkdir /home/platfora/hadoop_conf
2. Set the HADOOP_CONF_DIR environment variable for the platfora system user, for example:
$ echo "export HADOOP_CONF_DIR=/home/platfora/hadoop_conf" >> $HOME/.bashrc
3. In this directory, copy or recreate the Hadoop configuration files needed for your Hadoop
distribution.
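Once the files are in place, a quick check such as the following sketch can list any required file that is still missing from the directory. The missing_conf_files function is a hypothetical helper, not a Platfora utility; the optional yarn and hive arguments add yarn-site.xml and hive-site.xml to the required set.

```shell
#!/bin/sh
# List the Hadoop client configuration files missing from a directory.
# Usage: missing_conf_files DIR [yarn] [hive]
missing_conf_files() {
    dir=$1
    shift
    required="core-site.xml hdfs-site.xml mapred-site.xml"
    for opt in "$@"; do
        case $opt in
            yarn) required="$required yarn-site.xml" ;;
            hive) required="$required hive-site.xml" ;;
        esac
    done
    for f in $required; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}

# Check the directory used in this guide's examples for a YARN cluster.
missing_conf_files "${HADOOP_CONF_DIR:-/home/platfora/hadoop_conf}" yarn
```

No output means every required file was found.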
core-site.xml (HDFS / MapRFS)
Platfora uses the core-site.xml configuration file to connect to the distributed file system service
for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS for MapR.
Apache/Cloudera/Hortonworks with MapReduce v1
Platfora requires the following minimum property where namenode_hostname is the DNS
hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode_hostname:hdfs_port</value>
</property>
</configuration>
Apache/Cloudera/Hortonworks/Pivotal with YARN
Platfora requires the following minimum property where namenode_hostname is the DNS
hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode_hostname:hdfs_port</value>
</property>
</configuration>
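If you prefer to generate this file rather than edit it by hand, the substitution of namenode_hostname and hdfs_port can be scripted. The sketch below prints a minimal core-site.xml for a YARN-based cluster; the core_site_xml function name, the host nn01.example.com, and port 8020 are placeholder assumptions to replace with your own values.

```shell
#!/bin/sh
# Print a minimal core-site.xml for a YARN-based cluster.
# Usage: core_site_xml NAMENODE_HOSTNAME HDFS_PORT
core_site_xml() {
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$1:$2</value>
  </property>
</configuration>
EOF
}

# nn01.example.com and 8020 are placeholder values.
core_site_xml nn01.example.com 8020
```

Redirect the output to core-site.xml in your local Hadoop configuration directory. For MapReduce v1 distributions, the property name is fs.default.name instead.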
MapR with MapReduce v1
Platfora requires the following minimum properties, where cldbhost is the DNS hostname
of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you
must also specify the compression libraries you are using.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>maprfs://cldbhost:7222</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec
</value>
</property>
</configuration>
MapR with YARN
Platfora requires the following minimum properties, where cldbhost is the DNS hostname
of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you
must also specify the compression libraries you are using.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>maprfs://cldbhost:7222</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec
</value>
</property>
</configuration>
hdfs-site.xml
Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is stored in the
remote Hadoop distributed file system (HDFS).
HDFS
This file should have at least the following content. To enable Hadoop replication for
Platfora lens data, increase the dfs.replication value from 1 to a higher number.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- required -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- required for Cloudera 5.3 and later with HDFS Encryption
enabled -->
<property>
<name>dfs.encryption.key.provider.uri</name>
<value>kms://http@hadoop_name_node:16000/kms</value>
</property>
</configuration>
mapred-site.xml
Platfora uses the properties in its local mapred-site.xml file to connect to the Hadoop JobTracker
service, and pass in client-side configuration options for Platfora-initiated MapReduce jobs.
Any Hadoop MapReduce runtime properties can be passed along by Platfora with a lens build job
configuration. See MapReduce Tuning for Platfora for a description of the required and recommended
properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local
Platfora mapred-site.xml file instead of on the Hadoop cluster.
Apache/Cloudera/Hortonworks/MapR with MapReduce v1
Platfora requires the following minimum properties in its local mapred-site.xml file for
MapReduce v1 distributions.
If you are using the high-availability (HA) JobTracker feature in your Hadoop
cluster, use the HA JobTracker properties in Platfora's mapred-site.xml file instead of just the mapred.job.tracker property.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- required -->
<property>
<name>mapred.job.tracker</name>
<value>jobtracker_hostname:jt_port</value>
</property>
<!-- should be at least 1024m, but may be more based on memory on
your Hadoop nodes -->
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1024m</value>
</property>
<!-- required -->
<property>
<name>mapred.job.shuffle.input.buffer.percent</name>
<value>0.30</value>
</property>
<!-- optional -->
<property>
<name>io.sort.record.percent</name>
<value>0.15</value>
</property>
<!-- optional -->
<property>
<name>io.sort.factor</name>
<value>100</value>
</property>
<!-- optional -->
<property>
<name>io.sort.mb</name>
<value>256</value>
</property>
</configuration>
Apache/Cloudera/Hortonworks/Pivotal with YARN
Platfora requires the following minimum properties in its local mapred-site.xml file for YARN
distributions.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.jobhistory.address</name>
<value>yarn_rm_hostname:port</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>yarn_rm_hostname:web_port</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- should be at least 1024m, but may be more based on memory on
your Hadoop nodes -->
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<property>
<name>mapreduce.job.user.classpath.first</name>
<value>true</value>
</property>
<!-- Needed For Hortonworks 2.2 Only -->
<property>
<name>hdp.version</name>
<value>2.2.0.0-2041</value>
</property>
<!-- Needed For Pivotal 3.0 Only -->
<property>
<name>stack.version</name>
<value>3.0.0.0-249</value>
</property>
<property>
<name>stack.name</name>
<value>phd</value>
</property>
</configuration>
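Because the JVM -Xmx option accepts k, m, and g suffixes, it is easy to set a heap size that is far smaller than intended. The following sketch converts a -Xmx value to megabytes so that it can be compared against the 1024 MB minimum noted in the comment above. The xmx_to_mb function is a hypothetical helper, not part of Hadoop or Platfora.

```shell
#!/bin/sh
# Convert a JVM heap option such as -Xmx1024m to whole megabytes.
# Usage: xmx_to_mb -Xmx<size>[k|m|g]
xmx_to_mb() {
    value=${1#-Xmx}
    number=${value%[kKmMgG]}
    case $value in
        *[kK]) echo $((number / 1024)) ;;
        *[mM]) echo "$number" ;;
        *[gG]) echo $((number * 1024)) ;;
        *)     echo $((value / 1024 / 1024)) ;;  # no suffix means bytes
    esac
}

# Compare a few sample settings against the 1024 MB minimum.
for opt in -Xmx1024m -Xmx2g -Xmx1024k; do
    mb=$(xmx_to_mb "$opt")
    if [ "$mb" -ge 1024 ]; then
        echo "$opt: ok ($mb MB)"
    else
        echo "$opt: too small ($mb MB)"
    fi
done
```

Note that a value with the k suffix, such as -Xmx1024k, requests only 1 MB of heap.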
MapR with YARN
Platfora requires the following minimum properties in its local mapred-site.xml file for MapR
distributions using YARN.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapr.host</name>
<value>yarn_rm_hostname</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>yarn_rm_hostname:port</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>yarn_rm_hostname:web_port</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapr.centrallog.dir</name>
<value>${hadoop.tmp.dir}/logs</value>
</property>
</configuration>
yarn-site.xml
Platfora uses the properties in its local yarn-site.xml file to connect to the Hadoop
ResourceManager service, and pass in client-side configuration options for Platfora-initiated YARN
jobs.
Any Hadoop YARN runtime properties can be passed along by Platfora with a lens build job
configuration. See YARN Tuning for Platfora for a description of the required and recommended
properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local
Platfora yarn-site.xml file instead of on the Hadoop cluster.
All Hadoop Distributions with YARN
Platfora requires the following minimum properties in its local yarn-site.xml file for Hadoop
distributions using YARN.
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>yarn_rm_hostname:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>yarn_rm_hostname:8088</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>yarn_rm_hostname:8033</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>yarn_rm_hostname:8031</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>yarn_rm_hostname:8030</value>
</property>
<property>
<name>mapreduce.job.hdfs-servers</name>
<value>hdfs://yarn_rm_hostname:8020</value>
</property>
<!-- Adjust these properties based on available Hadoop memory resources -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
</configuration>
hive-site.xml
Platfora uses a local hive-site.xml configuration file to connect to the Hive metastore service. You
only need a local hive-site.xml file if you plan to use Hive as a data source for Platfora.
There are two ways to configure how clients connect to the Hive metastore service in your Hadoop
environment. You can set up the HiveServer or HiveServer2 Thrift service, which allows various remote
clients to connect to the Hive metastore indirectly. This is called a remote metastore client configuration,
and is the configuration recommended by the Hadoop vendors. If you add a Hive data source through the
Platfora web application, you can connect to the Hive Thrift service without the need for a Platfora copy
of the hive-site.xml file.
Optionally, you can connect directly to the Hive metastore database using a JDBC connection. This
requires that you have the login credentials for the Hive metastore database. This is called a local
metastore configuration because you are connecting directly to the metastore database rather than
through a service. If you want to connect to the Hive metastore database directly using JDBC, then you
must specify the connection information in a hive-site.xml.
Platfora can only connect to a single Hive instance via a remote or a local metastore configuration.
Remote Metastore (Thrift) Server Configuration
If you are using the Hive Thrift remote metastore, in addition to the URI, you may want to include the
following performance properties:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hostname:hiveserver_thrift_port</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>120</value>
<description>
Number of seconds to wait for the client to retrieve all of
the objects (tables and partitions) from Hive. For tables
with thousands of partitions, you may need to increase this value.
</description>
</property>
<property>
<name>hive.metastore.batch.retrieve.max</name>
<value>100</value>
<description>
Maximum number of objects to get from metastore in one batch.
A higher number means fewer round trips to the Hive metastore
server, but may also require more memory on the client side.
</description>
</property>
</configuration>
Local JDBC Configuration
Connecting Platfora directly to a local JDBC metastore requires additional configuration on the
Platfora servers. Each Platfora server requires a hive-site.xml file with the correct connection
information, as well as the appropriate JDBC driver installed. Here is an example hive-site.xml
to connect to a MySQL local metastore:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hive_hostname:metastore_db_port/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive_username</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>120</value>
</property>
<property>
<name>hive.metastore.batch.retrieve.max</name>
<value>100</value>
</property>
</configuration>
The Platfora server also needs the MySQL JDBC driver installed in order to use this
configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to
install them (requires a restart of the Platfora server).
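The copy step might look like the following sketch. The install_jdbc_driver function is a hypothetical helper, and creating the extlib directory when it does not yet exist is an assumption rather than documented Platfora behavior.

```shell
#!/bin/sh
# Copy a JDBC driver .jar into the Platfora extlib directory.
# Usage: install_jdbc_driver /path/to/driver.jar
install_jdbc_driver() {
    jar=$1
    extlib="${PLATFORA_DATA_DIR:?PLATFORA_DATA_DIR is not set}/extlib"
    if [ ! -f "$jar" ]; then
        echo "not found: $jar" >&2
        return 1
    fi
    # Assumption: create extlib if it is not already present.
    mkdir -p "$extlib" && cp "$jar" "$extlib/"
}
```

For example, run install_jdbc_driver with the path to the MySQL connector .jar you downloaded, then restart the Platfora server so that it picks up the driver.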
Initialize the Platfora Master
The Platfora setup utility (setup.py) verifies your operating system environment, configures the
Platfora software, and initializes the platfora metadata catalog database. You must run this setup
utility successfully before starting Platfora for the first time.
To run the setup utility:
$ $PLATFORA_HOME/setup.py
The setup.py utility prompts you for the following information about your environment.
Information Requested | Description
Platfora Configuration Directory | This is the local $PLATFORA_CONF_DIR directory location that you created earlier where Platfora will store its configuration files. For example: /home/platfora/platfora_conf.
Hadoop Distribution and Version | This tells Platfora what distribution of Hadoop you are using. Choose the number that corresponds to your Hadoop distribution and version.
Platfora Web Services Port | This sets the port number for the Platfora web application server. This is the port used for HTTP client connections to the Platfora application. Defaults to 8001.
Platfora Server Management Port | This sets the port number for TCP management connections between Platfora servers. This is the port used for server-to-server heartbeat and management utility connections. Defaults to 8002.
Platfora Data Transfer Port | This sets the port number for TCP data connections between Platfora servers. This is the port used for server-to-server data transfers during query processing. Defaults to 8003.
Hadoop Configuration File Directory | This is the local directory containing your Hadoop configuration files that you created earlier. For example: /home/platfora/hadoop_conf.
Platfora Data Directory | This is the local $PLATFORA_DATA_DIR directory location that you created earlier where Platfora will store its metadata catalog database, lens data, and log files. For example: /data/platfora_data.
Platfora Catalog Database Port | This is the port of the PostgreSQL database server instance where the Platfora metadata catalog database will be initialized. Defaults to 5432.
Company Name | Used as an identifier for system diagnostic bundles. If you encounter issues or problems, Platfora Support may request that you generate a system diagnostic bundle. Enter the correct company name to aid possible troubleshooting in the future.
Setup Platfora for Secure Connection | If yes, configures Platfora to use HTTPS for secure communications between the Platfora master server and web browser clients. If no, uses regular HTTP connections. See Configure SSL for Client Connections.
Send Metrics to Platfora | If yes, configures the Platfora server to send anonymous system diagnostic data to Platfora over an HTTPS connection. See About System Diagnostic Data for details.
Remote DFS Data Directory | This is the remote data directory location in the configured Hadoop file system. Setup will make sure that Platfora has write permissions to this location before proceeding.
Maximum Java Virtual Machine (JVM) Memory | The maximum JVM size allocated to the Platfora server process. On a dedicated machine, this should be about 80 percent of total system memory. Setup will use this guideline to suggest a default (M=megabytes, G=gigabytes).
Relative Platfora Disk Cache Size | When a lens is built in Hadoop, lens data files are copied over to Platfora local disk in order to improve the performance of lens queries. This sets the maximum amount of local disk space on the Platfora server to use for storing lens data. The limit is determined by taking a percentage of the total disk space capacity on the Platfora server. The default is 0.8 or 80% of total disk space.
After running the setup.py command, run the hadoop-check command to check your Hadoop
settings. The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora.
It also collects system information from the Hadoop cluster environment. If you are using MapR, you
should run this utility as it collects important information, but be aware that it can report misleading
configuration information.
Configure SSL for Client Connections
When running setup, you have the option to configure SSL connections between the Platfora master
server and browser clients. If you do not have your own certificate, you can have the setup utility
generate a self-signed certificate for you.
Data sent over an HTTPS connection will be encrypted regardless of whether the server certificate
is CA-signed or self-signed. However, most web browsers will only trust certificates signed by a
trusted certificate authority (CA) and will display security warnings when presented with self-signed
certificates.
For production installations, you may want to replace the self-signed certificate with one signed by a
trusted CA.
If you choose to configure SSL during setup, the setup utility will ask the following additional questions
to configure SSL:
Information Requested at Setup | Description
Platfora Secure Connection | If yes, configures Platfora to use HTTPS for secure communications between the Platfora master server and web browser clients. If no, uses regular HTTP connections. If you choose to enable secure communications, you can use your own server certificate (if you have one), or you can have Platfora generate one for you.
TCP Port for HTTPS Connections | Enter the TCP port the Platfora master server should use when web browsers connect to the Platfora web application using HTTPS. Note that when telemetry is enabled, the Platfora server uses this port to securely send telemetry data to Platfora using HTTPS. Default is 8443.
KeyStore Location, Password and Type | The keystore contains the master server’s private key, and its certificate with the corresponding public key. The keystore is used to provide credentials. Location: Accept the default location if you want Platfora to generate a keystore for you, otherwise enter the path to your own keystore. Password: If using your own keystore, enter the password to access that keystore, otherwise set a password for the keystore that Platfora will create. Type: If using your own keystore, enter the type of keystore format you are using. Allowed types are JKS (Java Keystore) or PKCS12 (Public Key Cryptography Standards #12 Keystore). If you plan to have Platfora generate the keystore for you, use JKS (the default).
Generate Self-Signed SSL Certificate | Enter Y if you do not have a server certificate and want Platfora to generate one for you. Enter N if you already have a certificate that you want to use.
TrustStore Location, Password and Type | A truststore contains certificates to trust. The truststore is used to verify certificate authority (CA) credentials. Location: By default, Platfora uses the default truststore that comes with your Java installation. This default truststore is already configured to trust all of the recognized certificate authorities (Verisign, Symantec, Thawte, etc.). If you have your own truststore, you can enter the path to that truststore instead. Password: If using the default truststore that ships with Java, the default password is changeit. If you have changed this password or are using your own truststore, enter the correct password for the truststore. Type: If using your own truststore, enter the type of truststore format you are using. Allowed types are JKS or PKCS12. If you use the default truststore that comes with your Java installation, use JKS (the default).
When SSL is enabled, ensure that the keystore and truststore passwords entered
in Platfora always match the passwords configured at the keystore and truststore
locations. Changing the passwords in Platfora does not change the passwords at
the keystore or truststore locations. If the passwords entered in Platfora do not
match the passwords at the keystore and truststore locations, the Platfora server
fails to start.
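Before entering a password in Platfora, you may want to confirm that it actually opens the keystore or truststore. The sketch below uses the JDK keytool utility for this; the store_password_ok function and the KEYTOOL override variable are illustrative names, and any path or password you pass in is your own.

```shell
#!/bin/sh
# Assumption: the JDK keytool utility is on the PATH. KEYTOOL can be
# overridden to point at a specific JDK installation.
KEYTOOL="${KEYTOOL:-keytool}"

# Exit 0 if the store opens with the given password, non-zero otherwise.
# Usage: store_password_ok STORE_PATH PASSWORD [TYPE]
store_password_ok() {
    "$KEYTOOL" -list -keystore "$1" -storepass "$2" \
        -storetype "${3:-JKS}" >/dev/null 2>&1
}
```

For example, store_password_ok /path/to/keystore "$STORE_PASSWORD" JKS && echo "password matches". Be aware that a password passed on the command line can be visible to other users in the process list while the command runs.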
Configure SSL for Catalog Connections
For added security, you can encrypt the communications between the Platfora worker nodes and the
metadata catalog on the Platfora master node.
If you decide to enable SSL for the Platfora catalog, you must have SSL-enabled PostgreSQL installed
on your Platfora master node, and OpenSSL version 1.0.1 or higher installed on all Platfora nodes
(master and worker nodes).
If you are enabling this optional security feature at installation time, you would do so after running
setup.py but before starting the Platfora servers.
1. On the Platfora master node, log in as the platfora system user.
$ su - platfora
2. Make sure the Platfora servers are not running.
$ platfora-services stop
3. Run the platfora-catalog ssl utility to configure secure connections to the catalog. For
example, if using a self-signed server certificate and private key:
$ platfora-catalog ssl --enable --self
About System Diagnostic Data
During setup, you have the option to enable collection of system diagnostic data. This collects
anonymous statistics about product usage and performance, and will help Platfora improve the product
in future releases. Sending system diagnostic data to Platfora is optional. A system administrator can
choose to enable or disable diagnostic data collection at any time by running setup.py.
What Data is Collected?
Platfora respects your privacy and security. We do not collect any business data,
only diagnostic system metrics. The Platfora server sends system metrics data
over the configured SSL port.
System diagnostic data is completely anonymous. Platfora does not collect any names (data source,
dataset, lens, vizboard, or user names), permissions used, or any personally identifiable information.
Here is a list of some of the diagnostic data that Platfora does collect:
• Actions taken in the UI
• Dataset size
• Lens size (estimated and actual)
• Build duration (how long did a lens build take)
• Scheduled lens build times
• Client browser type
• Screen resolution
• Page load times
• REST API call duration (how long did an API call take to return)
• Help files viewed
• User metrics (number of users and groups in Platfora)
• Number of logins
• Server startup time (how long did it take for the Platfora server to start)
• Permissions performance metrics (how many times the system used cached permissions versus
having to look up permissions in the catalog)
To see a sample of the data collected, you can look at the system diagnostic logs in
$PLATFORA_DATA_DIR/telemetry.
How to Configure System Diagnostic Collection
If you decide to enable the collection of system diagnostic data, Platfora will log usage information and
send the log files to Platfora Customer Support every 15 minutes by default. If an attempt to send the
data fails, Platfora will only keep the logs for an hour by default to conserve disk space. The following
server configuration properties can be used to configure the system diagnostics feature. Changing any of
these properties requires a system restart.
• platfora.support.identifier - The name used to identify a system diagnostic bundle sent
to Platfora support.
• platfora.telemetry.aggregate.frequency - The number of times per send interval that
Platfora attempts to aggregate and send the logs to Platfora's telemetry server. Default is 1.
• platfora.telemetry.enabled - Whether or not to send collected diagnostic data to Platfora.
• platfora.telemetry.file.lifespan - The number of seconds to keep historical diagnostic
data between send attempts. The default is 3600 (one hour).
• platfora.telemetry.send.frequency - The number of seconds between send intervals. The
default is 900 (15 minutes).
• platfora.telemetry.url - The URL of Platfora's telemetry server where the diagnostic data is
sent. Default is https://telemetry.platfora.com.
• platfora.telemetry.logparser.enabled - Turns on the ability to parse the log files using
the Platfora application.
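As a rough illustration of how the two timing defaults relate, the following sketch computes how many send intervals fit within the file lifespan. The function name is hypothetical and is not part of Platfora; only the default values (900 and 3600 seconds) come from the property descriptions above.

```shell
#!/bin/sh
# Hypothetical helper (not a Platfora utility): given the values of
# platfora.telemetry.send.frequency and platfora.telemetry.file.lifespan
# (both in seconds), print how many send intervals fit in the lifespan.
send_intervals_per_lifespan() {
    send_frequency=$1
    file_lifespan=$2
    if [ "$send_frequency" -le 0 ]; then
        echo "send frequency must be greater than zero" >&2
        return 1
    fi
    echo $(( file_lifespan / send_frequency ))
}

# With the documented defaults (900 and 3600), this prints 4.
send_intervals_per_lifespan 900 3600
```

With the defaults, undelivered logs are therefore kept for about four send intervals before they are discarded.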
Troubleshoot Setup Issues
This section describes typical errors encountered during installation and setup, and how to resolve them.
View the Platfora Log Files
If you encounter errors when initializing, starting or running Platfora, check the Platfora log files. The
logs can provide more information about the cause of the error. You can view the logs on the Platfora
master or in the Platfora web application.
The Platfora master server log file is located at $PLATFORA_DATA_DIR/logs/platforaserver.log.
If the Platfora server is running, you can also access the Platfora server log file in the browser. Go to:
http://hostname:port/debug/view-log/125
Where hostname:port is the Platfora server hostname and web application port (8001 is the default
port) and 125 is the number of kilobytes (thousands of bytes) of the log to display (the default is 50,
or 50,000 bytes).
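The URL pattern above can be assembled with a small shell function. This is a minimal sketch: the function name and the example hostname are assumptions for illustration, and only the URL path and the defaults (port 8001, 50 thousand bytes) come from this guide.

```shell
#!/bin/sh
# Hypothetical helper: build the log viewer URL described above.
# Arguments: hostname, web port (default 8001), thousands of bytes (default 50).
platfora_log_url() {
    host=$1
    port=${2:-8001}
    kbytes=${3:-50}
    echo "http://${host}:${port}/debug/view-log/${kbytes}"
}

# Example: print the URL for the last 125,000 bytes of the log on a host
# named "platfora-master" (an assumed hostname).
platfora_log_url platfora-master 8001 125
```

Open the printed URL in a browser that is logged in to the Platfora web application.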
Setup Fails Setting up Catalog Metadata Service
Platfora uses a PostgreSQL database to store its metadata catalog. If PostgreSQL is already running,
setup will fail when it tries to start PostgreSQL. You must stop PostgreSQL, clean up the Platfora data
directory, and then try again. You may also get an error if the /var/run/postgresql directory is
missing or has the wrong file permissions.
When this error occurs, you might see errors such as the following from setup.py:
Command failed due to: Error occurred command: /usr/lib/postgresql/9.2/bin/pg_ctl start -D
In the PostgreSQL log file (located in $PLATFORA_DATA_DIR/logs/pg.log), you may see an error
such as:
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432?
In the PostgreSQL log file, you may also see an error such as this:
FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory
This means that the platfora system user does not have permission to write to the location where
PostgreSQL writes its lock files. Make sure to create /var/run/postgresql and give ownership
to the platfora user. Note that a system reboot sometimes clears /var/run, so you may need to
recreate this directory if you have rebooted your server.
1. Check if the PostgreSQL data process is running.
$ ps ax | grep postgres
2. If it is running, kill the process.
$ kill process_id
3. Make sure you have removed the automatic startup scripts for PostgreSQL, otherwise you will
probably hit this error again.
On RedHat/CentOS:
$ sudo rm /etc/init.d/postgresql-9.2
On Ubuntu:
$ sudo rm /etc/rc*/*postgresql
4. Clean out the Platfora data directory location before trying setup.py again. For example:
$ rm -rf /data/PLATFORA_DATA/*
5. Make sure the /var/run/postgresql directory exists and has the correct permissions.
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
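Before retrying setup.py, you can confirm that the lock file directory is usable. The following is a minimal sketch of such a pre-check. The function name and its output messages are hypothetical and are not produced by any Platfora utility.

```shell
#!/bin/sh
# Hypothetical pre-check (not a Platfora utility): report whether the
# PostgreSQL lock file directory exists and is writable by the current user.
check_lock_dir() {
    dir=${1:-/var/run/postgresql}
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "NOT WRITABLE: $dir"
        return 2
    fi
    echo "OK: $dir"
}

# Run as the platfora system user. The trailing "|| true" keeps the
# script going so the message is informational only.
check_lock_dir /var/run/postgresql || true
```

Because a reboot can clear /var/run, it may be worth repeating this check after restarting the server.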
TEST FAILED: Checking integrity of binaries
When you run the setup.py utility (which runs the platfora-syscheck utility by default), it does a
checksum of all of the files in the installation package to make sure the package is not corrupt. If you
add, remove, or change any files inside the Platfora installation directory, this checksum test will fail.
When this error occurs, you might see an error such as the following when you try to initialize or
upgrade Platfora using setup.py (or run a system verification check using platfora-syscheck):
Verifying System Requirements
Checking integrity of binaries......
-=-=-=-=-=-=-=-=-=-=- TEST FAILED -=-=-=-=-=-=-=-=-=-=-
Reason: ....
To avoid this error, you should not add, remove, or modify any files inside $PLATFORA_HOME after
you have downloaded and unpacked the installation package. If you have not made any changes to the
Platfora installation files, this error means that the package you downloaded may be corrupt. Contact
Platfora Customer Support to obtain a new installation package.
If you have intentionally made changes to your Platfora installation and want to bypass this check when
running setup.py (and you have successfully run platfora-syscheck in the past), you can skip
the system checks using the --skip_syscheck option. For example:
$ setup.py --skip_syscheck
Chapter 8: Start Platfora
After installing and initializing the Platfora master server, you are ready to start Platfora. After Platfora is started,
log in to the Platfora web application, upload your license, and change the default administrator password.
Optionally, you may want to load the tutorial data to make sure everything is working as expected.
Topics:
• Start the Platfora Server
• Log in to the Platfora Web Application
• Add a License Key
• Change the Default Admin Password
• Load the Tutorial Data
Start the Platfora Server
After you have successfully completed setup, you are ready to start the Platfora server for the first time.
Starting the Platfora server also starts the metadata catalog service (PostgreSQL).
Before you can start Platfora, make sure your Hadoop services are up and running. Platfora will not start
if it cannot connect to the Hadoop file system and data processing services you have configured.
PostgreSQL must be installed and in your PATH, but not running.
To start the Platfora server:
$ $PLATFORA_HOME/bin/platfora-services start
To confirm the master server has started correctly (it should be Enabled, Available, and Running):
$ $PLATFORA_HOME/bin/platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
----------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002  Enabled  Available  Running
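If you want to script this confirmation, the status output can be checked for a Master row whose last three columns read Enabled, Available, and Running. The sketch below assumes the column layout shown above. The function name is hypothetical, and the sample input stands in for real platfora-services status output.

```shell
#!/bin/sh
# Hypothetical helper: read "platfora-services status" output on stdin and
# succeed only if the Master row ends with Enabled, Available, Running.
master_is_up() {
    awk '$2 == "Master" && $(NF-2) == "Enabled" && $(NF-1) == "Available" && $NF == "Running" { ok = 1 }
         END { exit (ok ? 0 : 1) }'
}

# On a Platfora master you would run:
#   $PLATFORA_HOME/bin/platfora-services status | master_is_up
# Here a sample of the output is used so the sketch runs anywhere.
master_is_up <<'EOF' && echo "Master is up"
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
0   Master  ip-10-xxx-xxx-xxx  8002  Enabled  Available  Running
EOF
```

Matching on the last three columns rather than fixed positions keeps the check working when the status output includes extra port columns.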
Log in to the Platfora Web Application
After Platfora is started, you can open a web browser, and go to the URL of the Platfora master server
process. To log in for the first time, use admin and admin as the username and password.
Enter the following URL in your browser location field, where hostname is the IP address or public
DNS hostname of the Platfora master server and port is the HTTP web services port entered during
setup (the default port is 8001):
http://hostname:port
If SSL is enabled, the Platfora web server redirects the browser to use the HTTPS port instead (8443 by
default).
When prompted for a username and password, use admin and admin to log in for the first time. These
are the default credentials for the Platfora System Administrator account.
After logging in for the first time, you will be prompted to accept the Platfora license agreement. You
must accept the license agreement to continue.
Add a License Key
When the Platfora software is in an unlicensed state, a system administrator must upload a valid license
key to activate the product functionality.
1. Go to the System > License page.
2. Click Upload.
3. Navigate to the license key file stored on your local machine and select the license key file.
4. Click OK in the message window after the license is successfully installed.
Change the Default Admin Password
After logging in to the Platfora web application for the first time, it is a good idea to change the default
Platfora System Administrator password from admin to something more secure.
You can change the default System Administrator (admin) user's password and profile picture in
the Platfora web application.
1. In the top right corner of the page header, open the System pull-down menu and select User
Profile.
2. In the user profile dialog, click Change Password.
3. Enter a new password. Type carefully (there is no password confirmation).
4. Click Update Password.
Load the Tutorial Data
Platfora installs with some sample data that you can load to see examples of how datasets and lenses are
created. Loading the sample data is also a good way to test that Platfora is working correctly with your
configured Hadoop implementation. The Platfora server has a client load utility that you can run via the
command-line to automatically load the sample data. This client utility creates four sample datasets and
one sample lens in the Platfora web application.
If you have not received a valid license file from Platfora Customer Support, and
enabled it within the Platfora web application, you will not be able to load the
tutorial data. You must have a valid license in order to create datasets and lenses.
Log in to the Platfora master server in a terminal session, and run the following command:
$ $PLATFORA_HOME/client/bin/run_python \
$PLATFORA_HOME/client/examples/flights/load_flights.py -u admin -p admin -s localhost:8001
If you have changed the default Platfora administrator password (admin) or web
server port (8001), you will need to alter the load command to supply the correct
connection information for your Platfora server.
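For example, the load command can be assembled from the values that apply to your server. The following sketch only prints the command for review and does not run it. The function name and the sample password are hypothetical; the script paths and the -u, -p, and -s options are the ones shown in the load command above.

```shell
#!/bin/sh
# Hypothetical wrapper: print the tutorial load command for a given admin
# password and server address instead of the defaults (admin, localhost:8001).
tutorial_load_command() {
    password=${1:-admin}
    server=${2:-localhost:8001}
    printf '%s %s %s\n' \
        '$PLATFORA_HOME/client/bin/run_python' \
        '$PLATFORA_HOME/client/examples/flights/load_flights.py' \
        "-u admin -p ${password} -s ${server}"
}

# Example: the admin password was changed to "S3cret" (an assumed value)
# and the web server port to 8080.
tutorial_load_command S3cret localhost:8080
```

If your password contains spaces or shell metacharacters, quote it when you paste the printed command into a terminal.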
The command-line does not return until the lens build job completes, which can take several minutes.
In the meantime, you can access the Platfora application in a web browser using the following URL
(replace hostname with the actual IP or hostname of your Platfora master server):
http://hostname:8001
Chapter 9: Initialize a Worker Node
Worker nodes are initialized and added to a Platfora cluster by running a utility on the Platfora master node.
Before you can initialize a worker node, make sure you have provisioned and configured the worker node
machine.
Before you initialize a Platfora worker, you must do the following tasks on the worker node machine:
1. Install the prerequisite software directly on the worker node.
• If using the RPM installer packages, Install Dependencies RPM Package.
• If using the TAR installer packages, you must manually Create the Platfora System User, Set OS
Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.
After the worker node has been correctly provisioned, you can add it in to the Platfora cluster from the
master. The platfora-node add utility will copy the Platfora software and configurations from the
master over to the worker node, start it, and bring the node into the Platfora cluster.
1. On the master node, add the worker node to the cluster:
$ platfora-node add --host worker_hostname
2. After the command completes, check the status of the cluster.
When the new child node is Enabled and Available then it is ready to serve viz queries. For example:
$ platfora-services status
ID  TYPE    HOST               MGMT_PORT  WEB_PORT  ENABLED  STATUS     PROCESS
--------------------------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running
1   Child   ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running
A newly added node may have a status of Not Ready until it is finished copying the lens data blocks it needs
over from the Hadoop file system.
Appendix A: Platfora Utilities Reference
The Platfora command-line management utilities are located in $PLATFORA_HOME/bin of your Platfora server
installation. All utility commands should be executed from the Platfora master node.
Topics:
• setup.py
• hadoop-check
• hadoopcp
• hadoopfs
• install-node
• platfora-catalog
• platfora-config
• platfora-export
• platfora-import
• platfora-license
• platfora-node
• platfora-services
• platfora-syscapture
• platfora-syscheck
setup.py
Initializes a new Platfora instance or upgrades an existing one. Can also be used to reset bootstrap
system configuration properties.
Synopsis
setup.py [-h] [-q] [-v] [-V]
setup.py [--hadoop_conf path] [--platfora_conf path] [--datadir path]
[--dfs_dir dfs_path] [--port admin_port] [--data_port data_port]
[--websvc_port http_port] [--ssl_port https_port] [--jvmsize jvm_size]
[--hadoop_version string] [--extraclasspath path] [--extrajavalib path]
[--skip_checks] [--skip_syscheck] [--skip_sync] [--skip_setup_ssl]
[--skip_setup_dfscachesize] [--skip_setup_telemetry] [--upgrade_catalog]
[--nochanges] [--verbose]
Description
The setup.py utility is run on the Platfora master node after installing the Platfora software, but
before starting the Platfora server for the first time.
For new installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop
distribution.
• Prompts for bootstrap configuration information, such as port numbers, directory locations, memory
resources, secure connections, and diagnostic data collection.
• Verifies that the supplied ports are open and that permissions and disk space are sufficient on both
the local and remote DFS file systems.
• Initializes the Platfora metadata catalog database in PostgreSQL.
• Creates the default System Administrator user account.
• Copies setup files to the Platfora storage location in the configured Hadoop DFS.
For upgrade installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop
distribution.
• Displays your current bootstrap configuration settings and prompts if you want to make changes.
• Upgrades the Platfora metadata catalog database in PostgreSQL if necessary.
• Copies any updated library files to the Platfora storage location in the configured Hadoop DFS.
• Synchronizes the Platfora software and configuration files on the worker nodes in a multi-node
installation.
Required Arguments
No required arguments.
Optional Arguments
-c | --hadoop_conf path
This is the local directory containing your Hadoop configuration files (such as core-site.xml and
mapred-site.xml). Platfora uses the information in these files to connect to your Hadoop cluster.
-C | --platfora_conf path
This is the local directory where Platfora will store its configuration files. Defaults to
$PLATFORA_CONF_DIR if set.
-d | --datadir path
This is the local directory where Platfora will store its metadata catalog database, lens data, and log files.
Defaults to $PLATFORA_DATA_DIR if set.
--data_port data_port
This is the data transfer port used during query processing on multi-node Platfora clusters. By default,
uses the same port number as the master node.
--db_port port
This is the port of the PostgreSQL database instance where the Platfora metadata catalog database
resides. The default PostgreSQL port is 5432.
--db_dump_path path
This is the path where the backup SQL file of the Platfora metadata catalog database will be created
prior to upgrading the catalog. Defaults to the current directory.
-g | --dfs_dir dfs_path
This is the remote directory in the configured Hadoop distributed file system (DFS) where Platfora will
store its library files and MapReduce output (lens data).
-j | --extraclasspath path
This is the path where the Platfora server will look for additional custom Java classes (.jar files), such as
those for Hive JDBC connectors, custom Hive SerDes, or user-defined functions. These classes are not
included in lens builds in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.
-l | --extrajavalib path
This is the path where the Platfora server should look for native Java libraries. These libraries are not
included in lens builds in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.
-n | --nochanges
On upgrade, do not prompt the user if they want to make changes to their current Platfora bootstrap
configuration settings.
-p | --port admin_port
This is the server administration port used for management utility and API calls to the Platfora server.
This is also the port that multi-node Platfora servers use to connect to each other. The default is 8002.
-s | --jvmsize jvm_size
The maximum amount of Java virtual machine (JVM) memory allocated to a Platfora server process. On a
dedicated machine, this should be about 80 percent of total system memory. You can specify size using
M for megabytes or G for gigabytes.
--skip_checks
Do not do safety checks, such as verifying ports, disk space, and file permissions.
--skip_setup_dfscachesize
Do not prompt to configure the maximum local disk space utilization for storing lens data. If
this question is skipped, Platfora will set the maximum to 80 percent of the available space in
$PLATFORA_DATA_DIR. When this limit is reached, lens builds will fail during the pre-fetch stage.
--skip_setup_ssl
Do not prompt to configure secure connections (SSL) between browser clients and the Platfora server. If
these questions are skipped, the default is no (do not use SSL).
--skip_sync
Do not sync the installation directory to the worker nodes.
--skip_syscheck
Do not run the platfora-syscheck utility prior to setup.
--skip_setup_telemetry
Do not prompt to disable/enable diagnostic data collection. If these questions are skipped, the default is
yes (enable diagnostic data collection), and the company name is set to default (anonymous).
-t | --hadoop_version version_string
The version string corresponding to the Hadoop distribution you are using with Platfora. Valid values
are cdh5 (Cloudera 5.0.x and 5.1.x), cdh52 (Cloudera 5.2.x and 5.3.x), cdh54 (Cloudera 5.4.x), mapr4
(MapR 4.0.1), mapr402 (MapR 4.0.2 and 4.1.x), emr3 (Amazon Elastic Map Reduce), HDP_2.1
(Hortonworks 2.1.x), HDP_2.2 (Hortonworks 2.2.x), pivotal_3 (PivotalHD 3.0).
--upgrade_catalog
Automatically upgrade the metadata catalog schema if necessary. The catalog update check is run by
default.
-v | --verbose
Runs in verbose mode. Show all output messages.
-w | --websvc_port http_port
This is the HTTP listener port for the Platfora web application server. This is the port that browser
clients use to connect to Platfora. The default is 8001.
-W | --ssl_port https_port
This is the HTTPS listener port for the Platfora web application server. This is the SSL port that browser
clients use to connect to Platfora. The default is 8443.
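The --jvmsize guidance above (about 80 percent of total system memory on a dedicated machine) can be turned into a concrete value with simple arithmetic. The following is a minimal sketch. The function name is hypothetical, and reading MemTotal from /proc/meminfo assumes a Linux host.

```shell
#!/bin/sh
# Hypothetical helper: convert total memory in kB (as reported by the
# MemTotal line of /proc/meminfo on Linux) to a --jvmsize value of about
# 80 percent of that memory, rounded down to whole megabytes.
jvmsize_from_kb() {
    total_mb=$(( $1 / 1024 ))
    echo "$(( total_mb * 80 / 100 ))M"
}

# On a Linux host you could pass the real value:
#   jvmsize_from_kb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
# Example: a machine reporting 16777216 kB (16 GB) prints 13107M.
jvmsize_from_kb 16777216
```

The result uses the M (megabytes) suffix that --jvmsize accepts. Treat it as a starting point and adjust it if other processes share the machine.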
Examples
Run setup without doing the prerequisite checks first:
$ setup.py --skip_syscheck
Run initial setup without any prompts using the specified bootstrap configuration settings (or use the
default settings when not specified):
$ setup.py --hadoop_conf /home/platfora/hadoop_conf \
--platfora_conf /home/platfora/platfora_conf \
--datadir /data/platfora --dfs_dir /user/platfora --jvmsize 12G \
--hadoop_version cdh5 \
--skip_setup_ssl --skip_setup_dfscachesize --skip_setup_telemetry
Run upgrade setup without any prompts and keep all previous configuration settings:
$ setup.py --upgrade_catalog --nochanges
hadoop-check
Checks the Hadoop cluster connected to Platfora to make sure it is not misconfigured. Collects
information about the Hadoop environment for troubleshooting purposes.
Synopsis
hadoop-check [-h] [-v] [-vv] [-V]
Description
The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora. It also
collects system information from the Hadoop cluster environment. You must complete setup.py
before running this utility.
Output from this utility is logged in $PLATFORA_DATA_DIR/logs/hadoop-check.log.
It performs the following checks:
• Root DFS Test. This test makes sure that Platfora can connect to the configured Hadoop file system,
and that file permissions are correct on the directories that Platfora needs to write to. It also makes
sure that any jar files that have been placed in $PLATFORA_DATA_DIR/extlib have the correct
file permissions.
• File Codec Test. This test makes sure that Platfora has the codecs (file compression libraries) it
needs to recognize and read the compression types supported in Hadoop. If Hadoop is configured
to support a compression type that Platfora does not recognize, then this test will fail. You can put
the jar files for any additional codecs in $PLATFORA_DATA_DIR/extlib of the Platfora server
(requires a restart).
• Hadoop Host Configuration Test. This test runs a small MapReduce job on the Hadoop cluster
and reports back information from the Hadoop environment. It makes sure that memory is not oversubscribed on the Hadoop MapReduce cluster. These tests assume that all nodes in the Hadoop
cluster have the same resource configuration (same amount of memory, CPU cores, etc.).
The check returns a return code (RC). A return code of 0 means all tests passed. A return code of 1 means
one or more tests failed.
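In a script, the return code can be read from the shell's $? variable immediately after the utility exits. The sketch below maps the two documented return codes to messages. The function name and message text are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: translate a hadoop-check return code into a message.
describe_hadoop_check_rc() {
    case "$1" in
        0) echo "all tests passed" ;;
        1) echo "one or more tests failed" ;;
        *) echo "unexpected return code: $1" ;;
    esac
}

# Typical use on the Platfora master (commented out so the sketch runs
# on a machine without Platfora installed):
#   hadoop-check
#   describe_hadoop_check_rc $?
describe_hadoop_check_rc 0
```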
Root DFS Test
This test is skipped if Platfora is configured to use Amazon S3.
Tests the DFS file system and returns the following information:
Total
The total disk space in the Platfora storage directory on the
Hadoop file system.
Used
The used disk space in the Platfora storage directory on the
Hadoop file system.
Available
The available disk space in the Platfora storage directory on
the Hadoop file system.
Permissions on the Platfora DFS Directory
Whether the platfora system user has write permissions to the
Platfora storage directory on the Hadoop file system
(PASSED or FAILED).
File Codec Test
Codecs Installed
The file compression libraries that are installed in Hadoop.
Output compression in Hadoop Conf
Checks whether the mapred-site.xml property
mapred.output.compress is enabled and, if so,
makes sure that the compression library specified in
mapred.output.compression.codec is also installed in
Platfora.
Hadoop Host Configuration Test
JobTracker Status (ResourceManager for YARN)
Ensures the server is up and running.
Black Listed Tasktrackers (NodeManagers for YARN)
Lists the number of servers marked unavailable in the
Hadoop cluster.
Total Cluster Map Tasks
Total number of map task slots available. This is the
value of mapred.tasktracker.map.tasks.maximum in the
JobTracker for pre-YARN distributions. This is the value
of mapreduce.tasktracker.map.tasks.maximum in the
ResourceManager for YARN distributions.
Map Tasks Occupied
The number of map task slots that were occupied at the
time of the test.
Total Cluster Reduce Tasks
Total number of reduce task slots available. This is the
value of mapred.tasktracker.reduce.tasks.maximum in the
JobTracker for pre-YARN distributions. This is the value
of mapreduce.tasktracker.reduce.tasks.maximum in the
ResourceManager for YARN distributions.
Reduce Tasks Occupied
The number of reduce task slots that were occupied at the
time of the test.
Job Submission Took
How long it took for Platfora to submit the test MapReduce
job.
Hadoop Host
The host name of the JobTracker, or of the
ResourceManager node for YARN distributions.
Hadoop Version
The version of Hadoop that is running.
CPUs
Number of CPUs per TaskTracker node, or per
NodeManager in YARN distributions.
RAM
The available memory per TaskTracker, or per
NodeManager in YARN distributions.
Map Slots
Maximum map task slots available.
Reduce Slots
Maximum reduce task slots available.
Hadoop Configured Memory
The configured amount of memory available to
MapReduce processes. Looks at maximum JVM size
per task (mapred.child.java.opts) times the total
number of task slots. The total number of task slots is
equal to mapred.tasktracker.map.tasks.maximum plus
mapred.tasktracker.reduce.tasks.maximum for pre-YARN
distributions. The total number of task slots is equal
to mapreduce.tasktracker.map.tasks.maximum plus
mapreduce.tasktracker.reduce.tasks.maximum on YARN
distributions.
This test will fail if the Hadoop configured memory exceeds
available RAM.
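The Hadoop Configured Memory calculation described above can be restated as a short worked example. This sketch is not the hadoop-check implementation: the function name, argument order, and message format are assumptions, and all memory values are taken to be whole megabytes.

```shell
#!/bin/sh
# Hypothetical re-statement of the check: configured memory is the per-task
# JVM size multiplied by the total number of task slots (map + reduce).
# Arguments: JVM MB per task, map slots, reduce slots, available RAM in MB.
hadoop_memory_check() {
    jvm_mb_per_task=$1
    map_slots=$2
    reduce_slots=$3
    ram_mb=$4
    configured_mb=$(( jvm_mb_per_task * (map_slots + reduce_slots) ))
    if [ "$configured_mb" -gt "$ram_mb" ]; then
        echo "FAILED: configured ${configured_mb} MB exceeds ${ram_mb} MB RAM"
        return 1
    fi
    echo "PASSED: configured ${configured_mb} MB within ${ram_mb} MB RAM"
}

# Example: 1024 MB per task, 8 map slots and 4 reduce slots, 16384 MB RAM.
# Configured memory is 1024 * 12 = 12288 MB, which does not exceed the RAM.
hadoop_memory_check 1024 8 4 16384
```

Doubling the per-task JVM size to 2048 MB in the same example gives 24576 MB, which exceeds the available RAM and would count as oversubscribed.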
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-v | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
-vv
Runs in extra verbose mode.
Examples
Test and collect information from the Hadoop cluster that Platfora is configured to use:
$ hadoop-check
hadoopcp
Copies a file from one location in the configured DFS to another location in the configured DFS with the
ability to transcode files.
Synopsis
hadoopcp source_dfs_uri destination_dfs_uri
Description
The hadoopcp utility allows you to copy a file residing in the remote Hadoop DFS from one location
to another and optionally transcode the file.
File paths must be specified in URI format using the appropriate DFS file system protocol. For example,
hdfs:// for Cloudera, Apache, or Hortonworks Hadoop, maprfs:// for MapR, s3n:// for
Amazon S3.
This command executes as the currently logged in system user (the platfora user, for example). The
target directory location must exist, and this user must have write permissions to the directory.
Required Arguments
source_dfs_uri
The source location in a remote Hadoop file system in URI format. For example:
hdfs://hostname[:port]/dfs_path
destination_dfs_uri
The target location in a remote Hadoop file system in URI format. For example:
hdfs://hostname[:port]/dfs_path
Optional Arguments
-h
Shows the command-line syntax help and then exits.
Examples
Copy the file /mydata/foo.csv residing in HDFS to the same location in HDFS but transcode it to a gzip
compressed file:
$ hadoopcp hdfs://localhost/mydata/foo.csv hdfs://localhost/mydata/foo.csv.gz
hadoopfs
Executes the specified hadoop fs command on the remote Hadoop file system.
Synopsis
hadoopfs -command
Description
The hadoopfs utility allows you to run Hadoop file system commands from the Platfora server. This is
analogous to running the specified hadoop fs command on the Hadoop NameNode server.
The command executes as the currently logged in system user (the platfora user, for example). This
user must have sufficient Hadoop file system permissions to perform the command.
Required Arguments
-command
A Hadoop file system shell command. See the Hadoop Shell Command Documentation for the list of
possible commands.
Optional Arguments
No optional arguments.
Examples
List the contents of the /platfora/uploads directory in the configured Hadoop file system:
$ hadoopfs -ls /platfora/uploads
Remove the file /platfora/uploads/test.csv in the configured Hadoop file system:
$ hadoopfs -rm /platfora/uploads/test.csv
install-node
Copies the Platfora software and configuration directories from the current node to the specified remote
node(s).
Synopsis
install-node --host hostname | --hostsfile filename [-h] [-q] [-v] [-V]
Description
The install-node utility copies the $PLATFORA_HOME directory from the current node to the
specified remote nodes. It also synchronizes the configuration files in the $PLATFORA_CONF_DIR
directory. You can use the install-node utility to copy a Platfora software installation to a remote
node that has not yet been added to your Platfora cluster configuration.
This utility is also called indirectly by the platfora-services sync, platfora-node add,
platfora-node sync, and setup.py upgrade utilities. Platfora recommends using these utilities
when adding new nodes or upgrading existing nodes in your Platfora cluster configuration.
Files are copied to the remote node as the currently logged in system user. The $PLATFORA_HOME and
$PLATFORA_CONF_DIR directory locations must exist on the remote node, and the current system
user must have sufficient file system permissions to write to these locations.
Required Arguments
One of either --host or --hostsfile is required.
--host hostname
Copies the Platfora software and configuration directories to the specified host name or IP address.
--hostsfile filename
Copies the Platfora software and configuration directories to the host names or IP addresses specified in
the named file, one host per line.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Install the Platfora software on the remote host named myremotehost by copying the Platfora
installation from the local host:
$ install-node --host myremotehost
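When several nodes need the same installation, the --hostsfile argument takes a file that lists one host per line. The following is a minimal sketch of preparing such a file. The write_hostsfile helper, the file name, and the host names are hypothetical and are not part of Platfora.

```shell
#!/bin/sh
# Hypothetical helper (not a Platfora utility): writes one host per line,
# which is the format described for the --hostsfile argument.
write_hostsfile() {
    [ "$#" -ge 2 ] || { echo "usage: write_hostsfile file host [host ...]" >&2; return 1; }
    file=$1
    shift
    printf '%s\n' "$@" > "$file"
}

# Example usage on the node that already has Platfora installed
# (file and host names are placeholders):
#   write_hostsfile workers.txt platfora-worker-1 platfora-worker-2
#   install-node --hostsfile workers.txt
```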
platfora-catalog
Manages the Platfora metadata catalog database in PostgreSQL.
Synopsis
platfora-catalog [-h] [-q] [-v] [-V] init | start | stop | status | backup
| restore | upgrade | pswd | keygen | ssl [sub-command options]
Description
Use the platfora-catalog utility to manage the Platfora metadata catalog database in PostgreSQL.
When you first install and initialize Platfora using setup.py, it initializes a PostgreSQL database
instance using the default PostgreSQL port (5432) and creates a platfora database in the
$PLATFORA_DATA_DIR location. You run this utility by passing one of its subcommands either directly
or indirectly through the setup.py and platfora-services utilities. You can call the following
subcommands directly.
Subcommand
Description
backup
Dumps the contents of the platfora catalog database to a backup file.
restore
Restores the platfora catalog database using a backup file.
pswd
Creates a new encrypted superuser password for the platfora
metadata catalog database. Platfora encrypts the stored password using
128-bit AES encryption. This command is called by setup.py during
new installations (in 4.1.3 and later releases). You must run
platfora-services stop before running this command.
keygen
Generates a new key that is used to encrypt the password used to
access the platfora metadata catalog database and re-encrypts the
password using the new key. You must run platfora-services stop
before running this command.
ssl
Controls whether or not worker nodes use an SSL connection to
communicate with the metadata catalog database. You must run
platfora-services stop before running this command.
These subcommands are called indirectly, but you can also call them directly:
Subcommand
Description
init
Initializes a new Platfora metadata catalog database. This command is
called by setup.py during new installations.
start
Starts the PostgreSQL database server. This command is called by
platfora-services start.
stop
Stops the PostgreSQL database server. This command is called by
platfora-services stop.
status
Shows the status of the PostgreSQL database server process. This
command is called by platfora-services status.
migrate
Migrates individual elements in the platfora metadata catalog
database from one DFS location to another.
upgrade
Upgrades the schema in the platfora catalog to the latest installed
Platfora version. This command is called by setup.py during upgrade.
Required Arguments
Requires one of the following sub-commands: init, start, stop, status, backup, restore,
pswd, keygen, ssl, or upgrade. To see the arguments available with a sub-command, enter the
following command-line string:
platfora-catalog sub-command --help
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-catalog ssl
Controls whether or not worker nodes use an SSL connection to communicate with the metadata catalog
database in PostgreSQL.
Synopsis
platfora-catalog ssl [-h] [--enable] [--disable] [--self] [--manual]
[--cert_file certificate_file] [--key_file private_key_file]
Description
The platfora-catalog ssl command controls whether or not worker nodes use an SSL
connection to communicate with the metadata catalog database in PostgreSQL. By default, SSL
connections are not enabled. Note that the Platfora server must be stopped to run this command.
To enable SSL connections between worker nodes and the metadata database on
the master node, the PostgreSQL database that Platfora uses must support SSL and
have it enabled.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--enable
Specifies that worker nodes should use an SSL connection when communicating with the metadata
database on the master node. When enabled, Platfora distributes the server certificate to the worker
nodes every time the server starts. Enabling SSL may increase lens build times. Platfora only
recommends enabling this feature if your organization's security requirements deem it necessary.
--disable
Disables SSL connections between worker nodes and the metadata database on the master node.
--self
Specifies to use a self-signed server certificate and key when enabling SSL connections. When you use
this argument, Platfora generates and signs its own server certificate and private key.
--manual
Specifies to use a server certificate and private key uploaded to Platfora, typically generated by a
certificate authority (CA). You must specify the certificate and private key using the --cert_file and
--key_file arguments.
--cert_file certificate_file
The path and file name of the server certificate to use.
--key_file private_key_file
The path and file name of the server private key to use.
Examples
Use SSL connections between worker nodes and the PostgreSQL database using a self-signed server
certificate and private key:
$ platfora-catalog ssl --enable --self
Use SSL connections between worker nodes and the PostgreSQL database using a server certificate and
private key generated by a certificate authority (CA):
$ platfora-catalog ssl --enable --manual --cert_file file.crt \
--key_file file.key
Disable SSL connections between the worker nodes and the PostgreSQL database:
$ platfora-catalog ssl --disable
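Because the Platfora server must be stopped before running platfora-catalog ssl, the change is usually wrapped in a stop and a start. The following is a minimal sketch of that sequence. The function name and the RUN dry-run variable are hypothetical, and restarting immediately afterward is an assumption rather than a documented requirement.

```shell
#!/bin/sh
# Dry-run by default: each step is printed rather than executed.
# Set RUN="" in the environment to execute the commands on the Platfora master.
RUN="${RUN-echo}"

# Stop Platfora, switch the catalog to SSL with a self-signed certificate,
# then start Platfora again. Stops at the first step that fails.
enable_catalog_ssl_self_signed() {
    $RUN platfora-services stop &&
    $RUN platfora-catalog ssl --enable --self &&
    $RUN platfora-services start
}

enable_catalog_ssl_self_signed
```

Running the sketch unchanged only prints the three commands in order, which makes it safe to review before executing anything on a live cluster.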
platfora-config
Displays the current settings of Platfora configuration properties, and allows you to update property
settings. Requires one of the following sub-commands: get, reset, list, set, load, server,
get_dfs_path, or set_dfs_path.
Synopsis
platfora-config [-h] [-q] [-v] [-V] get | reset | list | set |
load | server | get_dfs_path | set_dfs_path [sub-command options]
Description
The platfora-config command is used to manage Platfora server configuration properties. The
Platfora server does not need to be running to use this utility. After resetting a property, you must restart
Platfora for your changes to take effect.
platfora-config must be run with one of the following sub-commands:
• get - Display all configuration properties and their current settings on the Platfora master or on the
specified worker node.
• reset - Reset a configuration property to its default value on the Platfora master or on the specified
worker node.
• list - Display all configuration properties and their current settings on the Platfora master or on the
specified worker node. Same functionality as get.
• set - Change the value of the specified configuration property.
• load - Sets the properties specified in a configuration file on the specified Platfora worker node.
• server - List the client-side Hadoop configuration property settings.
• get_dfs_path - Get the current URI path of the given datasource in the remote file system.
• set_dfs_path - Update the URI path of the given datasource in the remote file system.
Required Arguments
Requires either --help or one of the following sub-commands: get, reset, list, set, load, server,
get_dfs_path, or set_dfs_path.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Show all configuration properties and their currently set values:
$ platfora-config get
Set a configuration property:
$ platfora-config set --key platfora.license.expirationwarningdays --value 30
Update the datasource path of the Uploads and System data sources when you are migrating Platfora to a
new Hadoop NameNode:
# To get the old paths
$ platfora-config get_dfs_path --datasource System
$ platfora-config get_dfs_path --datasource Uploads
# To set the new paths
$ platfora-config set_dfs_path --datasource System \
--old_path 'protocol://old_namenode_host:port/platfora/system' \
--new_path 'protocol://new_namenode_host:port/platfora/system'
$ platfora-config set_dfs_path --datasource Uploads \
--old_path 'protocol://old_namenode_host:port/platfora/uploads' \
--new_path 'protocol://new_namenode_host:port/platfora/uploads'
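The two set_dfs_path calls differ only in the data source name and the final path segment, so they can be generated from the old and new NameNode URI prefixes. The following is a minimal sketch under the assumption that the System and Uploads data sources live under /platfora/system and /platfora/uploads, as in the example above. The function name and the RUN dry-run variable are hypothetical.

```shell
#!/bin/sh
# Dry-run by default: each command is printed rather than executed.
# Set RUN="" in the environment to execute the commands for real.
RUN="${RUN-echo}"

# $1 and $2 are the old and new URI prefixes, in the form
# protocol://namenode_host:port (no trailing slash).
migrate_dfs_paths() {
    old_base=$1
    new_base=$2
    for pair in System:system Uploads:uploads; do
        datasource=${pair%%:*}   # data source name, for example System
        dir=${pair#*:}           # assumed directory name under /platfora
        $RUN platfora-config set_dfs_path --datasource "$datasource" \
            --old_path "$old_base/platfora/$dir" \
            --new_path "$new_base/platfora/$dir" || return 1
    done
}

# Example usage (the URIs are placeholders):
#   migrate_dfs_paths hdfs://old-namenode:8020 hdfs://new-namenode:8020
```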
platfora-export
Exports Platfora object metadata from the catalog database to one JSON file per object.
Synopsis
platfora-export [-h] [-q] [-v] [-V] --username username
--password password [--server server_name] [--port port]
[--protocol http|https] [--all] [--namespace namespace_name]
[--export-datasources data_source_name [...]]
[--export-datasets dataset_name [...]] [--export-lenses lens_name [...]]
[--export-vizboards vizboard_title [...]] [--export-users user_name [...]]
[--export-groups group_name [...]] [--include-referenced-datasources]
[--include-referenced-datasets] [--include-referenced-lenses]
[--include-referenced-segments] [--include-permissions] [--lazy-fetch]
[--skip-objects-by-name object_name]
Description
The platfora-export command exports Platfora object metadata from the catalog database. You
can export one or more object types. Specify each object by its name or, for vizboards, by its
title. Enclose names or titles that contain spaces in quotes. You can also export multiple objects of
each type. Separate the objects with spaces, or use an * (asterisk) to export everything of that type.
The command exports objects to .json files in subdirectories of the current directory, one
subdirectory per object type. Exported file names are URL-encoded and include the exported
object's current version. For example, if you export the Web Logs data source, the command
creates the file datasources/Web%20Logs%20.json. If a particular filename already exists, the
command silently overwrites it.
Vizboards are the exception. Vizboard names need not be unique. For this reason,
the export utility appends a unique identifier to the exported vizboard filename.
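To anticipate which file an export may overwrite, it can help to approximate the URL-encoded form of an object name. The following is a rough sketch only: Platfora's exact encoding rules are not documented here, so the helper assumes standard percent-encoding of everything except letters, digits, and the characters . _ ~ -. It does not reproduce the version component or the vizboard unique identifier. The urlencode function is hypothetical and handles ASCII input only.

```shell
#!/bin/sh
# Hypothetical helper: percent-encode every character except letters,
# digits, '.', '_', '~' and '-'. ASCII input only.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Example usage:
#   urlencode "hive on cdh1"
```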
When using one of the --include arguments to export referenced objects of a particular type, you
must include all object types in between. For example, if you export a vizboard and want to include
data sources (--include-referenced-datasources), you must also include lenses and datasets.
If you forget to provide the proper includes, the command produces the exported object(s) you requested
but none of the objects referenced by them.
Required Arguments
--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to export.
For example, to export an object, the user must be able to view the object in the web application.
--password password
Password for the specified user account.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.
--port port
Port for the Platfora master node. Defaults to 8001.
--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when
the port ends with 443, otherwise defaults to http.
--namespace namespace_name
Export objects from the specified namespace. You can only export objects from one namespace in a
single call. Defaults to default.
--export-datasources data_source_name [...]
Export the specified data source. You can list multiple names to export multiple objects. Enclose names
in double quotes if they contain spaces or other special characters.
--export-datasets dataset_name [...]
Export the specified dataset. Use this flag to export segments, which are a special kind of dataset.
Segments have two supporting lenses: segment members and segment refresh prerequisites. Include
these using the --include-referenced-lenses flag. Enclose names in double quotes if they contain
spaces or other special characters.
--export-lenses lens_name [...]
Export the specified lens. You can list multiple names to export multiple objects. Enclose names in
double quotes if they contain spaces or other special characters.
--export-vizboards vizboard_title [...]
Export the specified vizboard by title. A vizboard title is the name users assign to the vizboard in the
Platfora web application. You can list multiple titles to export multiple objects. Enclose titles in double
quotes if they contain spaces or other special characters. Vizboard titles are not necessarily unique. If
multiple vizboards use the same title, all vizboards with that title are exported, and each one is assigned
a unique identifier.
--export-users user_name [...]
Export one or more users. Only administrators can export users and groups.
--export-groups group_name [...]
Export one or more groups. Only administrators can export users and groups.
--include-permissions
Export all permissions for all exported Platfora objects such as lenses or datasets. Users and groups do
not have permissions.
--include-referenced-datasources
Use this argument to export all data sources referenced by a specified object.
--include-referenced-datasets
Use this argument to export all datasets referenced by a specified object.
--include-referenced-lenses
Use this argument to export all lenses referenced by a specified object. This option applies to lens
references from segments and vizboards. This option does not support following references from
datasets to the lenses that use them.
--include-referenced-segments
Use this argument to export all segment datasets and segment lenses referenced by a specified vizboard
object. This argument only works when exporting vizboards.
--include-permissions
Use this argument to export all permissions for all exported objects.
--lazy-fetch
When exporting significantly fewer objects than the total number of objects in the
catalog, use this argument to improve export performance. Defaults to false.
--skip-objects-by-name object_name [...]
When exporting multiple objects, use this argument to skip exporting objects with the specified names.
This applies to exporting all objects with the * wildcard as well as referenced objects when using one
of the --include-referenced-* arguments. By default, this command does not export the following
system-created objects:
Object Type     Excluded by Default
group           Everyone
user            system
                admin
data sources    System
                Uploads
datasets        Date
                Time
                Latitude, Longitude with Name
                Latitude, Longitude
To override the defaults, provide an * (asterisk) or specify an object name to skip.
--all
Export the entire catalog. This flag's behavior is equivalent to:
• --export-datasources "*"
• --export-datasets "*"
• --export-lenses "*"
You must explicitly export permissions, users, and groups.
Examples
Export the "event log" vizboard and the lenses, data sources, and datasets that are used by that vizboard:
$ platfora-export --username admin --password password \
--export-vizboards "event log" --include-referenced-lenses \
--include-referenced-datasets --include-referenced-datasources
Export vizboards together with their permissions:
$ platfora-export -vvv --username admin --password admin \
--export-vizboards "o_viz" --include-permissions
Export all data sources and the datasets used by those data sources:
$ platfora-export --username admin --password password \
--export-datasources "*" --include-referenced-datasets
Export several datasets together with the data sources they reference. The command output follows:
$ platfora-export --username admin --password admin --server localhost \
--export-datasets "airports" "batting" "Carriers" --include-referenced-datasources
About to export datasets: [airports, batting, Carriers]
Exporting Dataset: "airports" to file: "datasets/airports.json"
Exporting Dataset: "batting" to file: "datasets/batting.json"
Exporting datasource: "hive on cdh1" to file: "datasources/hive%20on%20cdh1.json"
Exporting Dataset: "Carriers" to file: "datasets/Carriers.json"
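The --protocol default described above (https when the port ends with 443, otherwise http) can be expressed as a one-line pattern match. The following is a minimal sketch of that rule for use in wrapper scripts. The default_protocol function is hypothetical and is not how Platfora implements the check.

```shell
#!/bin/sh
# Hypothetical helper mirroring the documented default:
# https when the port ends with 443, otherwise http.
default_protocol() {
    case "$1" in
        *443) echo https ;;
        *)    echo http ;;
    esac
}

# Example usage:
#   default_protocol 8001
#   default_protocol 8443
```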
platfora-import
Imports Platfora object metadata from one or more JSON files into the catalog database.
Synopsis
platfora-import [-h] [-q] [-v] [-V] --username username
--password password [--server server_name] [--port port]
[--protocol http|https] [--import-files file_name [...]]
[--handle-conflicts reuse|fail] [-s] [-m]
Description
The platfora-import command is used to import Platfora object metadata from one or more JSON
formatted files into the catalog database. You can obtain these files using the platfora-export
command. Only import objects from files that were exported from the same minor release. Platfora does
not support importing objects exported from a different minor release.
If your system uses HDFS Delegated Authorization, the importing user must
have READ permission on the underlying DFS data. If the user does not have this
permission, the catalog import succeeds but the Platfora instance is unable to
access the underlying data.
Each JSON file should contain a single object definition. After importing an object, the object owner is
assigned the username given in the --username argument. If an object exists in both the catalog and
in one of the imported JSON files, then the --handle-conflicts argument determines whether the
import fails or uses the object in the catalog instead of importing the object from the JSON file.
When importing an object that references another object, the referenced object must exist either in one
of the imported JSON files or in the Platfora catalog. If any referenced object doesn't exist in either
location, the entire import fails.
Vizboards are a special case. They have both a title visible through the user interface (UI) and a unique
name that is used only internally and is not visible in the UI. When you import a vizboard, the system
assigns the vizboard a unique name and keeps the visible title unchanged. Vizboard permissions are tied
to the unique name Platfora uses internally. Therefore, if you want to ensure that imported vizboards
keep the same object permissions they had in the original Platfora catalog, you must export both
the vizboards and their permissions. Then you must import both the vizboards and their permissions in a
single call using platfora-import.
Required Arguments
--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to import.
For example, to import an object the user must have Own or Edit permission on the object type.
--password password
Password for the specified user account.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.
--port port
Port for the Platfora master node. Defaults to 8001.
--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when
the port ends with 443, otherwise defaults to http.
--handle-conflicts reuse|fail
Specifies how to handle objects that already exist in the catalog with the same name. Choose reuse to
keep the existing object in the catalog and ignore the imported object with the same name. Choose fail to
stop the import process without importing any object. Defaults to fail.
--import-files file_name [...]
Import the object in the file. You can list multiple names to import multiple objects. When listing
multiple objects, the order does not matter.
Always import users and groups together, because the two object types are interdependent.
Importing groups fails if all the members do not also exist. Similarly, users are not imported unless their
corresponding group exists.
--skip_checks
Skips version checks between imported data and the Platfora instance. Set this when importing JSON
without metadata fields.
-s | --run-as-super-admin
Run the import job in Super Administrator mode. The specified --username must be eligible to switch to
Super Administrator mode.
-m | --skip-objects-with-missing-references
Skips importing any objects that reference other objects that cannot be found. Platfora lists which objects
were not imported because they reference objects that can't be found. Search for "Warning: Removing"
in the command response to find the objects that were not imported. This does not apply to users and
groups.
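To review which objects were skipped, the command response can be captured to a file and searched for the "Warning: Removing" text mentioned above. The following is a minimal sketch. The list_skipped_objects function and the log file name are hypothetical, and capturing both STDOUT and STDERR is an assumption because the output stream used for these warnings is not stated here.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that report objects skipped because
# of missing references. $1 is a file holding the captured command response.
# Returns a non-zero status when no such lines are found.
list_skipped_objects() {
    grep 'Warning: Removing' "$1"
}

# Example usage (credentials and file names are placeholders):
#   platfora-import --username admin --password password -m \
#       --import-files datasets/* > import.log 2>&1
#   list_skipped_objects import.log
```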
Examples
Import the lens in the flights_lens.json file. If a lens with the same name already exists, then
keep the existing lens:
$ platfora-import --username admin --password password \
--import-files flights_lens.json --handle-conflicts reuse
Import vizboards and the permissions associated with them in a single call:
$ platfora-import -vvv --username admin --password admin \
--import-files vizboards/* permissions/vizboards/*
platfora-license
Installs, uninstalls, or views a Platfora license. Requires one of the following sub-commands: install,
uninstall, or view.
Synopsis
platfora-license [-h] [-q] [-v] [-V] install | uninstall | view [sub-command options]
Description
The platfora-license command is used to manage the license on Platfora. The Platfora server
must be running to use this utility.
Required Arguments
Requires one of the following sub-commands: install, uninstall, or view.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-license install
Installs a Platfora license by uploading a license key file.
Synopsis
platfora-license install [--license license_file] [-h]
Description
The platfora-license install command is used to upload a license key file to Platfora. The
Platfora server must be running to use this utility.
Required Arguments
--license license_file
The path and license key file name to upload to the Platfora server. If no directory is specified, Platfora
looks for the file in the current directory.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Upload the license key file named licensekey.license to the Platfora server:
$ platfora-license install --license licensekey.license
platfora-license uninstall
Uninstalls the current Platfora license.
Synopsis
platfora-license uninstall [-h]
Description
The platfora-license uninstall command is used to uninstall the license currently installed on
Platfora. The Platfora server enters the unlicensed state after you run this command. The Platfora
server must be running to use this utility.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Uninstall the current license from the Platfora server:
$ platfora-license uninstall
platfora-license view
Displays the details of the currently installed license.
Synopsis
platfora-license view [-h]
Description
The platfora-license view command is used to view the details of the currently installed
Platfora license. The Platfora server must be running to use this utility.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
View the current Platfora license:
$ platfora-license view
platfora-node
Adds, starts, stops, restarts, checks, updates, disables, enables, or removes a worker node in a
multi-node Platfora cluster. Requires one of the following sub-commands: add, remove, start, stop,
restart, status, sync, config, enable, or disable.
Synopsis
platfora-node [-h] [-q] [-v] [-V] status | enable | stop | sync | remove | start
| add | disable | config | restart [sub-command options]
Description
The platfora-node utility is used to manage worker nodes in a multi-node Platfora cluster, and is
always executed from the Platfora master. It must be run with one of the following sub-commands:
• add - Adds and initializes a new worker node to a Platfora cluster.
• remove - Removes an existing worker node from a Platfora cluster.
• start - Starts the Platfora server process on the designated worker node(s).
• stop - Stops the Platfora server process on the designated worker node(s).
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server process on the designated worker node(s).
• disable - Takes a worker node out of operation. Disabled nodes remain in the cluster
configuration, but are not available to process queries. Typically you would disable a node to do
server maintenance, and then enable it again after maintenance is complete. When a node is disabled,
other nodes in the cluster will take over the lens data and processing work it was responsible for
serving.
• enable - Brings a disabled worker node back into operation. When a node comes up, it must
retrieve the latest lens data it is responsible for serving before it will be fully available to work on
queries.
• sync - Copies the Platfora software binaries from the master to the designated worker node(s).
• config - Configures the management port and host name of an existing node.
You can also use the platfora-services utility to run the start, stop, restart, status,
sync, and config commands on all nodes at once. The platfora-node utility is mainly used for adding
new worker nodes, or taking nodes in and out of the cluster for server maintenance.
A node is identified by its unique node ID. This corresponds to the order that the node was added to the
Platfora cluster configuration. Usually the master node is 0, the first worker node added is 1, the second
worker node added is 2, and so on. You can run platfora-services status to see the IDs of all
nodes in the Platfora cluster.
Finally, this utility ensures that the clocks on remote worker nodes are within the acceptable tolerance
of the master node clock. The tolerance is 60 seconds. If a node is not within the acceptable tolerance, the
utility logs an error and, depending on the context, the node is not started/added/enabled.
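To check clock skew before adding or starting a node, two epoch timestamps can be compared against the 60-second tolerance. The following is a minimal sketch and is not Platfora's own check. The within_clock_tolerance function is hypothetical, and treating a difference of exactly 60 seconds as acceptable is an assumption.

```shell
#!/bin/sh
TOLERANCE_SECONDS=60

# Hypothetical helper: succeeds when two epoch timestamps (in seconds)
# differ by no more than TOLERANCE_SECONDS.
# Assumption: a difference of exactly 60 seconds counts as acceptable.
within_clock_tolerance() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -le "$TOLERANCE_SECONDS" ]
}

# Example usage from the master (the host name and the use of ssh are placeholders):
#   within_clock_tolerance "$(date +%s)" "$(ssh platfora-worker-1 date +%s)" ||
#       echo "clock skew exceeds ${TOLERANCE_SECONDS}s" >&2
```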
Required Arguments
Requires one of the following sub-commands: add, remove, start, stop, restart, status,
sync, config, enable, or disable.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-node add
Adds a new worker node to a Platfora cluster configuration.
Synopsis
platfora-node add --host hostname [--port admin_port] [--data_port data_port]
[--websvc_port http_port] [--datadir path] [--disabled] [--skip_syscheck]
| [-h]
Description
The platfora-node add command checks the remote node for the required software, registers a
new worker node in the Platfora metadata catalog, copies the Platfora installation files to the remote
node, starts the Platfora server on the new node, and enables the node to begin serving query requests.
This command is run from the Platfora master.
Before you can add a node to the Platfora cluster, the remote server has to be
correctly provisioned with the required prerequisite software and OS configuration
settings. See the Provisioning a Platfora Server section of the Platfora Installation
Guide for more information.
Required Arguments
--host hostname
The host name or IP address of the new worker node to add to the Platfora cluster.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The local directory where Platfora will store lens data and log files on the worker node. By default, uses
the same $PLATFORA_DATA_DIR location as the master node.
--data_port data_port
This is the data transfer port used during query processing on multi-node Platfora clusters. By default,
uses the same port number as the master node.
--disabled
Adds the node to the cluster configuration but in a disabled state. The node will not participate in query
processing until it is enabled.
--port admin_port
This is the server administration port used for management utility and API calls to the Platfora server.
This is also the port that multi-node Platfora servers use to connect to each other. By default, uses the
same port number as the master node.
--skip_syscheck
Do not run the platfora-syscheck utility prior to adding the node.
--websvc_port web_service_port
The web service port of the Platfora application server. By default, uses the same port number as the
master node.
Examples
Add a new worker node with the host name of platfora-worker-1 to the Platfora cluster:
$ platfora-node add --host platfora-worker-1
platfora-node config
Changes the configured host name and/or server administration port for a Platfora worker node.
Synopsis
platfora-node config --id number [--host hostname] [--port admin_port]
| [-h]
Description
The platfora-node config command changes the configured management port and/or host name of an
existing Platfora worker node.
Required Arguments
--id number
The node ID number in the Platfora catalog database. Usually the master node is 0, the first worker node
added is 1, the second worker node added is 2, and so on. You can run platfora-services status to
see the IDs of all nodes in the Platfora cluster.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--host hostname
The updated host name or IP address of the worker node.
-p | --port admin_port
The updated server administration port.
Examples
Update the port of the worker node ID number 2:
$ platfora-node config --id 2 --port 8004
platfora-services
Starts, stops, restarts, or checks the status of Platfora server processes. Can also be used to synchronize
Platfora software and configuration files in multi-node installations. Requires one of the following
sub-commands: start, stop, restart, status, or sync.
Synopsis
platfora-services [-h] [-q] [-v] [-V] start | stop | restart | status | sync
[sub-command options]
Description
The platfora-services utility is used to manage Platfora server processes. It must be run with one
of the following sub-commands:
• start - Starts the Platfora server processes. In multi-node installations, starts the master server first
and then the worker servers in sequential order.
• stop - Stops the Platfora server processes. In multi-node installations, sequentially stops the worker
servers first and then the master server.
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server processes.
• sync - Copies the Platfora software binaries and global configuration settings from the master to the
worker nodes.
The following sub-commands are issued internally by the platfora-services utility. DO NOT
USE without explicit directions from Platfora customer support.
• watchdog - Starts the watch dog daemon for the Platfora server process.
• launch - Includes the specified Java class in the Platfora environment.
Finally, this utility ensures the clock on the master node is not skewed. The tolerance is 60 seconds. If
this master node is not within the acceptable tolerance, the utility logs an error and, depending on the
context, the node process is not started/added/enabled.
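The tolerance check described above can be illustrated with a short Python sketch. This is not Platfora's implementation: the function name within_skew_tolerance and the way the two clock readings are obtained are assumptions made for illustration. Only the 60-second tolerance comes from this guide.

```python
from datetime import datetime, timedelta

# Tolerance stated in this guide; everything else here is illustrative.
SKEW_TOLERANCE = timedelta(seconds=60)


def within_skew_tolerance(reference_time, node_time, tolerance=SKEW_TOLERANCE):
    """Return True if two clock readings differ by no more than the tolerance."""
    return abs(node_time - reference_time) <= tolerance


# Hypothetical clock readings used only to exercise the check.
reference = datetime(2015, 6, 28, 22, 14, 0)
print(within_skew_tolerance(reference, reference + timedelta(seconds=45)))
print(within_skew_tolerance(reference, reference - timedelta(seconds=90)))
```

In practice, keeping all nodes synchronized against the same time source avoids this error.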
Required Arguments
Requires one of the following sub-commands: start, stop, restart, status, or sync.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-services start
Starts the Platfora server processes. In multi-node installations, starts the master server first and then the
worker servers in sequential order.
Synopsis
platfora-services start [-h] [-d [DEBUG]] [--hadoop_conf path]
[--platfora_conf path] [--logdir path] [--profile] [-n node_id]
[-p management_port] [-w web_port] [--datadir path] [-P pid_path]
[-s jvm_size] [--no_watchdog] [--nowait] [--heapdump] [--gc] [--gclogging]
[--jvmopts options]
Description
The platfora-services start command starts the Platfora server processes. If you do not
specify any arguments, the command uses the configuration information specified during setup.
This configuration is stored in the Platfora metadata catalog. To view your current configuration, see
your Global Settings in Platfora or the platfora.properties configuration file located in the
$PLATFORA_CONF_DIR.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-d | --debug
Starts the Platfora server with the Java debugger listener enabled.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to
$PLATFORA_DATA_DIR or what was specified during setup.
--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the
env.platfora.hadoopconf property.
--heapdump
Enables the JVM to provide a heap dump to $PLATFORA_DATA/log/platfora-heapdump.hprof when
an out of memory error occurs.
--gc G1|SmallHeap
Sets the garbage collection algorithm.
--gclogging
Enables JVM garbage collection logging.
--jvmopts options
Adds additional JVM options to the server process.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
-n | --nodeid node_id
Starts the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.
--nowait
Do not wait for the server startup tasks to complete before returning the command prompt.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
--platfora_conf path
The directory that contains the Platfora server configuration files. Defaults to $PLATFORA_CONF_DIR
or what was specified during setup.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property.
--profile
Starts the server with the Java profiler listener enabled.
-s | --jvmsize jvm_size
The amount of memory to allocate to the Java Virtual Machine (JVM) that runs the Platfora server process
(M=megabytes, G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property.
Examples
Start the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services start
Start the Platfora server on worker node 3 only with an 8 GB JVM:
$ platfora-services start -n 3 -s 8G
platfora-services stop
Stops the Platfora server processes.
Synopsis
platfora-services stop [-h] [--datadir path] [--logdir path] [--master_only]
[-n node_id] [-P pid_path] [--no_watchdog] [--force]
Description
The platfora-services stop command is used to stop the Platfora server processes. If no
arguments are given, it uses the configuration information specified during startup.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to
$PLATFORA_DATA_DIR or what was specified during setup.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
--master-only
Stop the master server process only.
-n | --node node_id
Stops the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
--no_watchdog
Does not stop the watch dog daemon process that monitors and restarts the server process if needed.
--force
Stop all nodes in the cluster immediately without waiting for processes to finish gracefully. This is
similar to the kill -9 UNIX command.
Examples
Stop the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services stop
Stop the Platfora server on worker node 3 only:
$ platfora-services stop -n 3
platfora-services restart
Stops the Platfora server processes and then immediately starts them again.
Synopsis
platfora-services restart [-h] [-d [DEBUG]] [--hadoop_conf path]
[--profile] [-n node_id] [-p management_port] [-w web_port] [-P pid_path]
[-s jvm_size] [--no_watchdog]
Description
The platfora-services restart command restarts the Platfora server processes. If you do
not specify any arguments, the command uses the configuration information specified during setup.
This configuration is stored in the Platfora metadata catalog. To view your current configuration, see
your Global Settings in Platfora or the platfora.properties configuration file located in the
$PLATFORA_CONF_DIR.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-d | --debug
Starts the Platfora server with the Java debugger listener enabled.
--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the
env.platfora.hadoopconf property.
-n | --node node_id
Restarts the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property.
--profile
Starts the server with the Java profiler listener enabled.
-s | --jvmsize jvm_size
The amount of memory to allocate to the Java Virtual Machine (JVM) that runs the Platfora server process
(M=megabytes, G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property.
Examples
Restart the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services restart
Restart the Platfora server on worker node 3 only:
$ platfora-services restart -n 3
platfora-services status
Shows the status of the Platfora server processes.
Synopsis
platfora-services status [-h] [-P pid_path] [-n node_id]
[-p management_port] [-w web_port] [--logdir path] [--datadir path]
Description
The platfora-services status command is used to query the status and availability of the
Platfora server processes. If no arguments are given, it uses the configuration information specified at
startup. It reports the following information about the servers in a Platfora cluster:
Information  Description
ID           The system assigned node ID.
Type         The type of node: Master or Child (worker).
Host         The host name of the node.
Port         The management port of the node.
Enabled      The cluster status of the node: Enabled or Disabled or Not Ready.
             A node is disabled when an administrator takes it offline from
             query processing.
Status       The network status of the node: Available, Unavailable, or Not
             Ready. A node is unavailable when it cannot be reached by the
             master or is not responding. A node is not ready when it has been
             newly added or re-enabled, but has not yet finished copying the
             data blocks it needs to answer queries.
Process      The status of the Platfora server process on a node: Running
             or Stopped or Unhealthy. A node is Unhealthy if Platfora cannot
             determine the process status. For example, a node is Unhealthy if
             the server is Running but not processing ping messages.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults
to $PLATFORA_DATA_DIR or what is specified for the platfora.data.dir property in Platfora's Global
Settings.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
-n | --node node_id
Shows the status of the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property in $PLATFORA_CONF_DIR/platfora.properties.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property in $PLATFORA_CONF_DIR/platfora.properties.
Examples
Check the status of all nodes in a Platfora cluster:
$ platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
---------------------------------------------------------------------------------
0   Master  ip-10-212-123-456  8002  Enabled  Available  Running
1   Child   ip-10-212-123-567  8002  Enabled  Available  Running
2   Child   ip-10-212-123-678  8002  Enabled  Not Ready  Running
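If you want to act on this report from a script, the following Python sketch shows one way to find nodes that need attention. It assumes the columns in the captured output are separated by two or more spaces, which this guide does not specify, and the helper names parse_status and nodes_needing_attention are hypothetical. The sample text mirrors the example above.

```python
import re

# Sample report text; the exact spacing of real output is an assumption.
SAMPLE = """\
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
---------------------------------------------------------------
0   Master  ip-10-212-123-456  8002  Enabled  Available  Running
1   Child   ip-10-212-123-567  8002  Enabled  Available  Running
2   Child   ip-10-212-123-678  8002  Enabled  Not Ready  Running
"""


def parse_status(text):
    """Parse the report rows into dictionaries keyed by the header names."""
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and set(ln.strip()) != {"-"}  # skip blanks and the rule line
    ]
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", ln.strip()))) for ln in lines[1:]]


def nodes_needing_attention(rows):
    """Return the IDs of nodes that are not Available or not Running."""
    return [
        row["ID"] for row in rows
        if row["STATUS"] != "Available" or row["PROCESS"] != "Running"
    ]


rows = parse_status(SAMPLE)
print(nodes_needing_attention(rows))
```

Splitting on runs of two or more spaces keeps a value such as Not Ready together as one field.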
platfora-services sync
Synchronizes the Platfora software binaries and global configuration settings of the master to the worker
nodes.
Synopsis
platfora-services sync [-h]
Description
The platfora-services sync command is used to push software and configuration file settings
from the master node to the worker nodes.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Push configuration file changes and software binaries from the master to the workers:
$ platfora-services sync
platfora-syscapture
Captures the Platfora log files, configuration files, metadata catalog, and system environment
information needed by Platfora Customer Support to troubleshoot issues.
Synopsis
platfora-syscapture [--all | --last "number time_units"] [ --hostsfile
filename [--child] ] [--tempdir] [--with-catalog] [-h] [-q] [-v] [-V]
Description
The platfora-syscapture utility captures files needed by Platfora Customer Support, and creates
a compressed tar file in the current directory. It captures the following information from your Platfora
installation:
• The Platfora server log files. By default only master log files from the past 7 days are captured.
• The Platfora configuration files.
• The Hadoop configuration files used by Platfora.
• The OS settings on the master host (provided that /sbin/sysctl is in your PATH).
• System resource information such as memory and CPU.
• The version of Java you are using.
• Optionally, a database dump of the Platfora metadata catalog database.
• Optionally, the list of files in the Platfora directory of DFS.
• Optionally, your Platfora data directory.
Required Arguments
No required arguments.
Optional Arguments
--all
Captures all log files. By default, only log files that have changed within the past 7 days are captured.
--last number time_units
Captures only the log files within the specified time period (relative to now). By default, only log files
that have changed within the past 7 days are captured. Allowed time units are weeks, days, hours,
or minutes.
--outfile filename
The file where you want to store the syscapture data.
--child
If --hostsfile is used, also captures worker node configuration files in addition to the log files.
--tempdir
Specifies a temporary directory for writing interim results. Defaults to the $PLATFORA_DATA_DIR
directory. The utility automatically cleans up the temporary directory upon success and failure.
--with-catalog
Captures the contents of the Platfora metadata catalog database.
--with-telemetry
Captures the telemetry data for your Platfora instance.
--with-dfs-ls
Includes a DFS directory listing with the capture.
--datadir
Captures the contents of the Platfora data directory.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Capture log files on the Platfora master for the last 2 days:
$ platfora-syscapture --last "2 days"
Capture log files for the last 36 hours on the Platfora master and on the worker nodes (as named in the
hosts file):
$ platfora-syscapture --last "36 hours" --hostsfile /home/platfora/worker_nodes.txt
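To see how a --last value such as "36 hours" maps to a time window, consider the following Python sketch. It is an illustration only, not the utility's own parsing logic: the function name parse_last is hypothetical, and the sketch accepts only the four plural time units listed for the --last option.

```python
from datetime import timedelta

# Time units listed for the --last option in this guide.
ALLOWED_UNITS = ("weeks", "days", "hours", "minutes")


def parse_last(value):
    """Convert a string such as "36 hours" into a timedelta."""
    number, unit = value.split()
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unsupported time unit: {unit}")
    # timedelta accepts weeks, days, hours, and minutes as keyword arguments.
    return timedelta(**{unit: int(number)})


print(parse_last("2 days"))
print(parse_last("36 hours"))
```

Log files changed more recently than the current time minus this window would fall inside the capture period.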
platfora-syscheck
Checks the operating system on the master and worker nodes.
Synopsis
platfora-syscheck [--skipdb] [-h] [-q] [-v] [-V]
Description
The platfora-syscheck utility verifies that the operating system environment on each Platfora node
(master and workers) meets the requirements needed to run the Platfora server software. It performs the
following checks:
• Verifies that the installation package is not corrupt by doing a checksum of the files in $PLATFORA_HOME.
• Verifies that the required Unix OS utilities are installed and can be found in the $PATH.
• Verifies that ulimit is sized appropriately.
• Verifies that ssh keys were correctly configured. This checks against the local host, the fully qualified
domain name, and the hostname.
• Verifies that a compatible Java Runtime Environment (JRE) is installed.
• Verifies that a compatible version of PostgreSQL is installed, and the system shared memory settings
are sized appropriately for PostgreSQL.
• Reports the amount of free disk space in the configured environment variable
$PLATFORA_DATA_DIR. If $PLATFORA_DATA_DIR is not set, checks the disk space of the current
user's home directory.
The utility does not check disabled nodes.
Required Arguments
No required arguments.
Optional Arguments
--skipdb
Skips the database-related checks. This option can be used when verifying the operating system
environment of a Platfora worker node, since the PostgreSQL database software is only required on the
Platfora master.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Run a system check on the Platfora master:
$ export PLATFORA_DATA_DIR=/home/platfora/PLATFORA_DATA
$ platfora-syscheck
cmd line: /usr/local/platfora/current/bin/platfora-syscheck
Verifying System Requirements
Checking integrity of binaries......[SUCCESS]
Checking unix utilities......[SUCCESS]
Checking file and directory permissions......[SUCCESS]
Checking ssh to localhost......[SUCCESS]
Checking java version......[SUCCESS]
Checking postgres version......[SUCCESS]
Checking shared memory settings......[SUCCESS]
System Resources:
Platfora Data Directory fs has xx GB free space.
System Memory total: xxMB used: xxMB free: xxMB
Appendix B: Glossary
The glossary defines Platfora product terminology and concepts.
Topics:
• aggregate lens
• aggregation
• Amazon EMR
• Amazon S3
• categorical data
• column
• computed field
• CSV
• data catalog
• dataset
• data source
• derived dataset
• dimension dataset
• dimension
• distributed file system
• drill down
• elastic dataset
• entity-centric data model
• event
• event series lens
• expression
• fact dataset
• fact-centric data model
• field
• filter
• focus
• funnel
• geographic analysis
• geo map
• geo reference
• granularity
• Hadoop
• HDFS
• Hive
• key
• location field
• lens
• MapReduce
• measure
• quantitative data
• reference
• regular expressions
• ROLLUP measure
• row
• segment
• visualization (viz)
• vizboard
aggregate lens
An aggregate lens contains a selection of measure and dimension fields chosen from the focal point of a
single transactional (or fact) dataset. A completed or built lens can be thought of as a table that contains
aggregated measure data values grouped by the selected dimension values. An aggregate lens can be
built from any dataset. There are no special data modeling requirements to build an aggregate lens.
aggregation
An aggregation is the result of a function that takes all values of a numeric column, and returns a single
value of more significant meaning or measurement. An aggregate function groups the values of multiple
rows together based on some defined input expression.
Examples of aggregate functions include SUM, COUNT, DISTINCT, MIN, MAX, and STDDEV. In
Platfora, measure fields are always the result of an aggregation.
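As a rough illustration of the idea, the following Python sketch groups rows by one field and reduces the values of another field to a single value per group. The sample data, the field names, and the aggregate helper are all hypothetical. Python's built-in sum, len, and max stand in for SUM, COUNT, and MAX, and the sketch does not show how Platfora evaluates measures.

```python
from collections import defaultdict

# Hypothetical sales rows used only for illustration.
sales = [
    {"product": "widget", "amount": 10.0},
    {"product": "gadget", "amount": 25.0},
    {"product": "widget", "amount": 5.0},
]


def aggregate(rows, group_field, value_field, func):
    """Group rows by group_field and reduce each group's values with func."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {key: func(values) for key, values in groups.items()}


print(aggregate(sales, "product", "amount", sum))  # SUM per product
print(aggregate(sales, "product", "amount", len))  # COUNT per product
print(aggregate(sales, "product", "amount", max))  # MAX per product
```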
Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is a Hadoop framework hosted by Amazon Web Services
(AWS). It utilizes Amazon Elastic Compute Cloud (Amazon EC2) for compute resources and Amazon
Simple Storage Service (Amazon S3) for data storage.
Platfora can be configured to use Amazon EMR as its backend Hadoop processing framework, and
Amazon S3 as its primary data source and storage system.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is a data storage service provided by Amazon Web
Services (AWS).
It is a distributed file system hosted by Amazon where you pay a monthly fee for storage space and data
transfer bandwidth. Data transfer is free between S3 and Amazon Elastic Compute Cloud (EC2) clusters,
making S3 an attractive choice for users who run Hadoop clusters on AWS or utilize the Amazon EMR
service.
Hadoop supports two S3 file system protocols as an alternative to HDFS: S3 Native File System (s3n)
and S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.
categorical data
Categorical data is data with unconnected data points that can be represented in a visualization as
a categorical grouping or individual data point. Categorical data is countable and often finite (for
example, the number of products sold or the number of people in a city). In Platfora, categorical values
in a visualization are evenly spaced by sort order. By default, dimension fields in a visualization are
categorical, but numeric or datetime dimensions can be changed to quantitative. Categorical data is
sometimes referred to as discrete data.
column
A column is a set of data values of a particular data type, with one value for each row in the dataset.
Columns provide the structure for composing a row. The terms column and field are often used
interchangeably, although many consider it more correct to use field to refer specifically to the single
item that exists at the intersection of one row and one column.
computed field
A computed field generates its values based on a calculation or condition, and returns a value for
each input row. Values are computed based on expressions that can contain values from other fields,
constants, mathematical operators, comparison operators, or built-in row functions.
Computed fields are useful for deriving meaningful values from base fields (such as calculating
someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping
similar values together or substituting one value for another), or for computing new data values based
on a number of input variables (such as calculating a profit margin value based on revenue and costs).
A computed field that does an aggregate calculation is called a measure, which is a special kind of
computed field in Platfora.
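The profit margin case mentioned above can be sketched in Python as a row-level calculation that returns one value per input row. The field names, the sample orders, and the add_computed_field helper are hypothetical, and the code is not Platfora expression syntax.

```python
# Hypothetical order rows used only for illustration.
orders = [
    {"order_id": 1, "revenue": 200.0, "cost": 150.0},
    {"order_id": 2, "revenue": 80.0, "cost": 100.0},
]


def profit_margin(row):
    """Row-level expression: (revenue - cost) / revenue."""
    return (row["revenue"] - row["cost"]) / row["revenue"]


def add_computed_field(rows, name, expression):
    """Return new rows, each extended with the computed value for that row."""
    return [{**row, name: expression(row)} for row in rows]


with_margin = add_computed_field(orders, "profit_margin", profit_margin)
for row in with_margin:
    print(row["order_id"], row["profit_margin"])
```

Averaging profit_margin across rows would then be an aggregate calculation rather than a row-level one.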
CSV
Comma-separated values (CSV) is a plain text file format for describing tabular data. CSV, in general,
refers to any file that is plain text (typically ASCII or Unicode characters), has one record per line, has
records divided into fields separated by delimiters (typically a comma), and has the same sequence of
fields for every record.
Within these general constraints, there are many variations of CSV in use. For example, some CSV
formats use quotation marks around field values, some use delimiters other than a comma (such as a
tab or a semi-colon), and some reserve the very first line of the file as a header of field names. Platfora
supports the typical CSV formatting conventions, and allows for some configuration to support different
variations.
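The variations described above (quoted values, alternative delimiters, and a header line) can be seen with Python's standard csv module. This sketch only illustrates the file format and is unrelated to how Platfora parses source files. The sample text and the read_records helper are hypothetical.

```python
import csv
import io

# Comma-delimited text with a quoted value that itself contains a comma.
comma_text = 'name,city\n"Smith, Ann",Oakland\n'
# The same kind of data using a tab as the delimiter.
tab_text = "name\tcity\nAnn Smith\tOakland\n"


def read_records(text, delimiter=",", has_header=True):
    """Read delimited text; use the first line as field names if has_header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if has_header:
        header, data = rows[0], rows[1:]
        return [dict(zip(header, record)) for record in data]
    return rows


print(read_records(comma_text))
print(read_records(tab_text, delimiter="\t"))
```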
data catalog
The data catalog is a collection of data items available and visible to Platfora users. Data administrators
build the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop.
When users request data from a dataset, that request is materialized in Platfora as a lens. The data
catalog shows all of the datasets (data available for request) and lenses (data that is ready for analysis)
that have been created by Platfora users.
dataset
A dataset is a collection of external data files residing in a data source that can be described in table
form (rows and columns). Source data is mapped into Platfora by creating a dataset definition.
A dataset definition describes the rows and columns, the base fields and their associated data types,
computed fields, measure aggregations, and references (or joins) to other related datasets. The collection
of dataset definitions make up the data catalog (the data items available to Platfora users).
data source
A data source is a connection to a mount point or directory on an external data server, such as a file
system or database server. Platfora currently provides data source adapters for Hive, HDFS, Amazon S3,
and MapR FS.
Platfora has one default data source named Uploads (for data files that you upload from your local file
system). This default data source resides in the distributed file system (DFS) that the Platfora server is
configured to use as its primary data source.
derived dataset
A dataset whose underlying data is produced from the results of a Platfora lens query or visualization.
There are two types of derived datasets -- static (lens query results are saved to a static file) or dynamic
(lens query results are refreshed each time the lens is rebuilt).
A derived dataset allows you to save the query results from a lens as a new dataset in Platfora. Once a
derived dataset is saved, you can use it as you would any other dataset in Platfora - you can edit it, add
additional computed fields, and join it by reference to other datasets in the Platfora data catalog.
dimension dataset
A type of dataset that has a primary key and contains attributes (additional dimension fields) that
describe some aspect of a fact or event record (such as a person, item, date, etc.). Dimension datasets are
referenced by a fact dataset.
dimension
A dimension is a type of field (or a collection of fields) that allows you to analyze a measure from
different perspectives to derive meaning from the data. Dimensions are used to summarize, filter,
categorize, and group quantitative measure data in order to answer business questions.
For example, a product dimension can help you understand which products generate the most revenue
for your business. A date dimension can show you the breakdown of sales by year, quarter, month, or
day.
Dimension fields can be character-type data (such as product categories), datetime-type data (such as
months, days, or hours), or categorical numeric-type data (such as customer ratings on a scale of 1-10).
distributed file system
A distributed file system (DFS) is any file system that allows access to files from multiple hosts over
a computer network. It makes it possible for multiple machines and users to share files and storage
resources. HDFS is the primary distributed file system for Hadoop; however, Hadoop supports other
distributed file systems as well, such as Amazon S3.
drill down
Drill down (or drill up) is a data analysis technique for navigating from the most summarized to the most
detailed categorization of a particular dimension.
Drill down allows exploration of multi-dimensional data by moving from one level of detail to the next.
A drill-down path is defined by specifying a hierarchy of categories for a dimension or between related
dimensions. For example, a date dimension might have categories defined for year, quarter, month,
week, day, and so on. A product dimension might have categories defined for division, type, and model.
Drill-down levels depend on the granularity of the fields available in the source data.
elastic dataset
Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are
used to consolidate unique key values from other datasets into one place for the purpose of defining
segments or event series lenses. They are elastic because the data they contain is dynamically generated
at lens build time.
Elastic datasets are not backed by source files like regular datasets. Instead, they consolidate the unique
foreign keys from any dataset that points to them via a reference. Because they do not contain any records of
their own, elastic datasets cannot be used as the focus for an aggregate lens or event.
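Conceptually, the consolidation of unique foreign keys resembles a set union across the referencing datasets. The Python sketch below shows only that idea. The dataset names, the customer_id field, and the consolidate_keys helper are hypothetical, and the way Platfora generates elastic dataset contents at lens build time is not shown.

```python
# Hypothetical datasets that each reference a customer by key.
web_visits = [{"customer_id": "c1"}, {"customer_id": "c2"}, {"customer_id": "c1"}]
purchases = [{"customer_id": "c2"}, {"customer_id": "c3"}]


def consolidate_keys(datasets, key_field):
    """Collect the unique key values found across all referencing datasets."""
    keys = set()
    for rows in datasets:
        keys.update(row[key_field] for row in rows)
    return sorted(keys)


print(consolidate_keys([web_visits, purchases], "customer_id"))
```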
entity-centric data model
An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular
dimension (or entity). Modeling the data in this way allows you to do event series analysis and segment
analysis in Platfora.
For example, modeling different fact datasets around a central customer dataset allows you to analyze
different aspects of a customer's behavior. Instead of asking "how many customers visited
my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once
a day?" (entity-centric).
event
An event is similar to a reference, but the direction of the join is reversed. An event joins the primary
key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, plus
designates a timestamp field for ordering the event records.
event series lens
An event series lens contains a selection of dimension fields chosen from the focal point of a single
entity dataset, including any fields from event datasets associated with that entity. A completed or built
lens can be thought of as a table that contains individual event records partitioned by the primary key of
the entity dataset, and ordered by time.
An event series lens can only be built from datasets that have at least one event reference defined in
them. It contains non-aggregated event records of various types, partitioned by some common entity
(typically a user id), and sorted by the time the events occurred. Choose this lens type if you specifically
want to do funnel analysis.
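The shape of the resulting data, event records partitioned by entity and sorted by time, can be sketched in Python as follows. The sample events, the field names, and the event_series helper are hypothetical, and this is not how Platfora builds a lens.

```python
from collections import defaultdict

# Hypothetical event records; ISO 8601 timestamps sort correctly as strings.
events = [
    {"user_id": "u2", "time": "2015-06-01T10:05:00", "type": "page_view"},
    {"user_id": "u1", "time": "2015-06-01T09:30:00", "type": "purchase"},
    {"user_id": "u1", "time": "2015-06-01T09:00:00", "type": "page_view"},
]


def event_series(records, key_field, time_field):
    """Partition records by entity key, then order each partition by time."""
    series = defaultdict(list)
    for record in records:
        series[record[key_field]].append(record)
    for partition in series.values():
        partition.sort(key=lambda r: r[time_field])
    return dict(series)


series = event_series(events, "user_id", "time")
print([e["type"] for e in series["u1"]])
```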
expression
An expression computes or produces a value by combining fields (or columns), constant values,
operators, and functions.
An expression's result can be any data type, such as numeric, string, datetime, or Boolean (true/false)
values. Simple expressions can be a single constant value, field (or column), or a function call. You can
use operators to join two or more simple expressions into a complex expression.
fact dataset
In multi-dimensional data models, a fact dataset (or table) contains records (or rows) that each represent a
single real-world event that has occurred (such as a sales transaction, a page view, a user registration, an
airline flight, and so on).
A fact record contains the quantitative measure data (such as the dollar amount of a sale), and several
descriptive attributes (or dimensions) that give the measure context (such as the date, the customer, the
product, and so on). Facts are stored at a uniform level of detail (or grain) within a fact dataset.
fact-centric data model
A fact-centric data model is centered around a particular real-world event that has happened, such as
web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of
an analysis, and dimension datasets are referenced to provide more information about the fact. In data
warehousing and business intelligence (BI) applications, this type of data model is often referred to as a
star schema.
For example, you may have web server logs that serve as the source of your central fact data about pages
viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact
to provide more in-depth analysis opportunities.
field
A field is an atomic unit of data that has a name, a value, a data type, and a role of either dimension or
measure. When working with visualizations, fields are the same thing as the dimensions and measures
used to analyze the data.
Fields describe a single aspect of a record (or row) in a dataset. An order record, for example, might
contain an order date field, a product name field, a quantity field, and so on. All records in a dataset have
exactly the same fields, although the values in each field vary from record to record.
filter
A filter is a field value or expression used as a condition for limiting the data that is selected from a lens
and shown in a visualization. A filter can be applied to a visualization to exclude (or include) data that
meets the filter criteria.
For example, if you wanted to show only the sales for the US west coast, you could use the state field
as a filter and just include the values for California, Oregon and Washington. All of the other values for
state would then be filtered out (not shown in the visualization).
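The west coast example above can be sketched in a few lines of Python. This is only an illustration of include-style filtering on a field value; the sales rows and the apply_include_filter helper are hypothetical and are not part of Platfora.

```python
# Hypothetical sales rows; in Platfora these would come from a lens.
sales = [
    {"state": "California", "amount": 120.0},
    {"state": "Texas", "amount": 80.0},
    {"state": "Oregon", "amount": 45.5},
    {"state": "Washington", "amount": 60.0},
    {"state": "New York", "amount": 200.0},
]

WEST_COAST = {"California", "Oregon", "Washington"}


def apply_include_filter(rows, field, allowed_values):
    """Keep only the rows whose field value is in allowed_values."""
    return [row for row in rows if row[field] in allowed_values]


west_coast_sales = apply_include_filter(sales, "state", WEST_COAST)
west_coast_total = sum(row["amount"] for row in west_coast_sales)
```

An exclude filter would be the same test with `not in` instead of `in`.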
focus
A focus sets the central topic for a data exploration and analysis. You set a focus by choosing a single
dataset from the data browser.
For example, if you wanted to explore the characteristics of users who registered on your web site in the
past month, you might choose the user dataset or the registration dataset as the focus of your analysis.
Choosing a focus allows you to find or build a lens of optimized data to work with in a visualization.
funnel
A funnel is a visual analysis type that tracks users' (entities') behavior across a sequence of events, with
each step in the sequence defined as a stage.
Each funnel stage shows progressively decreasing proportions of the original set of users. The first stage
has 100% of the original group of users by definition.
A funnel is always based on an event series lens. The users in the funnel are from the focus dimension
dataset in the lens, and their behaviors are from one or more fact datasets in the lens. The funnel
analyzes their behaviors performed sequentially and counts the number of users that meet the criteria
defined in each stage.
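As a rough illustration of the counting described above, the following Python sketch walks each user's events in order and counts how many users reach each stage. The event names, the events_by_user structure, and the funnel_counts function are assumptions made for this example; they do not reflect how Platfora implements funnels internally.

```python
def funnel_counts(events_by_user, stages):
    """Count the users who reach each stage, with stages matched in order."""
    counts = [0] * len(stages)
    for events in events_by_user.values():
        stage_index = 0
        for event in events:  # events are assumed to be sorted by time
            if stage_index < len(stages) and event == stages[stage_index]:
                counts[stage_index] += 1
                stage_index += 1
    return counts


# Hypothetical event histories, one list per user.
events_by_user = {
    "u1": ["visit", "add_to_cart", "purchase"],
    "u2": ["visit", "add_to_cart"],
    "u3": ["visit"],
    "u4": ["add_to_cart", "visit"],  # out of order: only "visit" counts
}
stages = ["visit", "add_to_cart", "purchase"]

counts = funnel_counts(events_by_user, stages)
# Express each stage as a percentage of the first stage (100% by definition).
percentages = [100.0 * c / counts[0] for c in counts]
```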
geographic analysis
Geographic analysis is a type of data analysis that involves understanding the role that location plays in
the occurrence of other factors. By looking at the geo-spatial distribution of data on a map, analysts can
see how location impacts different variables.
In Platfora, geographic analysis is performed in a geo map viz type.
geo map
A geo map is a viz type that allows analysts to perform geographic analysis on a lens that contains
location data. It includes the Geography drop zone that places positions (using a location field) on a
map background.
Geo map visualizations can be made from an aggregate lens that has at least one location field included.
geo reference
A geo reference is a special type of reference to a dataset that contains a location field.
In addition to at least one location field, the dataset referenced in a geo reference typically contains
primarily location data. For example, this might include population, voting district information, or
government data.
granularity
The granularity of data refers to the fineness with which data fields are subdivided, and the level of
detail at which data is stored within a dataset or lens. For example, a postal address can be recorded with
low granularity, with the entire address in one field (address=123 Main St. San Mateo, CA 94403),
or with higher granularity, with the fields broken out (address=123 Main St., city=San Mateo, state=CA,
zipcode=94403).
Hadoop
Hadoop is an open-source software framework designed for storing and processing large amounts of
complex, structured, and semi-structured data. It is a distributed system, meaning it runs on a collection
of commodity, shared-nothing servers. Hadoop consists of two key services: HDFS for data storage and
MapReduce for parallel data processing.
HDFS
Hadoop Distributed File System (HDFS) is the primary storage system for Hadoop applications. It is a
distributed file system, meaning it runs on a collection of commodity servers.
An HDFS cluster usually consists of a NameNode (the metadata management node that manages access
to files and directories) and multiple DataNodes (the storage nodes where file data resides). HDFS
creates multiple replicas of a file's data storage blocks and distributes them throughout the cluster to
enable extremely fast data processing. Platfora can be configured to use HDFS as its primary data
source.
Hive
Hive is an execution engine for Hadoop that lets you write data queries in an SQL-like language called
Hive Query Language (HQL). Hive allows you to create tables by describing the structure of files
residing in HDFS.
Platfora can use a Hive metastore server as a data source, and map a Hive table definition to a Platfora
dataset definition. Platfora uses the Hive table definition to obtain metadata about the source data, such
as which files to process, the parsing logic for rows and columns, and the field names and data types
contained in the source data. It is important to note that Platfora does not execute queries through Hive;
it only uses Hive tables to obtain the metadata needed for defining datasets. Platfora generates and runs
its own MapReduce jobs directly in Hadoop.
key
A key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a
primary key in a relational database. A dataset must have a key defined to be the target of a reference.
location field
A location field is a dataset field encoded with a complex datatype that includes geo coordinate
information (latitude and longitude) and a label that associates a location name with the coordinates.
Location fields are defined in the dataset. When defining a location field in a dataset, you can
optionally use the values in an existing dataset field as the location field label. If no label is defined,
Platfora creates a unique string from the coordinates as the label name (for example @(122.33063°W,
37.541886°N)). Use a location field in a geo map viz to place positions on a map.
lens
A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and
processing engine to build and store its lenses. Once a lens is built, this prepared data is copied to
Platfora, where it is available for analysis. A lens can be thought of as a dynamic, on-demand data mart
purpose-built for a specific analysis project.
Platfora generates MapReduce jobs to pull the requested data from the Hadoop source system, and
prepares the data for fast, ad-hoc visual analysis. As users build visualizations, lens data is loaded into
memory on a column-by-column basis as it is needed.
Platfora has two types of lenses you can build: an aggregate lens or an event series lens. The type of lens
you build determines what kinds of visualizations you can create and what kinds of analyses you can
perform when using the lens in a vizboard.
MapReduce
MapReduce is a data-flow programming model for processing large amounts of data on a cluster of
commodity servers. It passes data items from one stage of processing to the next using user-defined
criteria (or jobs).
The MapReduce engine acts as an abstraction, allowing programmers to focus on their desired data
computations. The details of parallelism, distribution, load balancing and fault tolerance are all handled
by the MapReduce framework. Platfora defines and runs MapReduce jobs on the source data in Hadoop
based on the dataset and lens definitions created by Platfora users. The output of the MapReduce jobs
executed by Platfora is stored both in HDFS and in Platfora.
MapReduce jobs typically start with a large data file that is broken down into smaller pieces called
splits, which are similar to database rows. Each split is parsed into key/value pairs (similar to fields)
and processed by the user-defined map criteria. The output of the map processing stage is then passed to
the reduce processing stage, which does final grouping and aggregation. Each stage of processing uses
parallelism to enable many map and reduce tasks to run at the same time on multiple machines.
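The map, group, and reduce stages described above can be imitated in plain Python. This is a single-process sketch for illustration only: it does not use the Hadoop API, the input lines and function names are hypothetical, and a real MapReduce framework would run the map and reduce tasks in parallel across machines.

```python
from collections import defaultdict


def map_phase(split):
    """Parse each line of a split and emit (key, value) pairs."""
    for line in split:
        state, amount = line.split(",")
        yield state, float(amount)


def shuffle(pairs):
    """Group the mapped values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all of the values emitted for one key."""
    return key, sum(values)


# Two hypothetical splits of one input file ("state,amount" per line).
splits = [["CA,10.0", "OR,5.0"], ["CA,2.5", "WA,1.0"]]

mapped = [pair for split in splits for pair in map_phase(split)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```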
measure
A measure is a numeric value representing an aggregation of some dataset metric (such as total dollars
sold, average number of users, and so on). To create measures, you add computed fields to a dataset or a
lens.
When a lens is built, the build calculates any measures and stores them in the lens. In a visualization,
measures provide the basis for quantitative analysis.
Measures represent a set of real-world events (or facts) and typically answer "how" questions about data,
such as "how many?" or "how long?" If you are familiar with SQL, measure values come from aggregate
functions such as SUM(), COUNT(), MAX(), and MIN(). Measure fields are typically derived from numeric
fields in a dataset, and their values are always the result of an aggregation (average, count, sum, min,
max, and so on).
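For readers who know SQL, the comparison above can be made concrete with Python's built-in sqlite3 module. The sketch uses standard SQL aggregate functions in SQLite, not Platfora's expression language, and the sales table and its values are invented for the example.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("widget", 10.0), ("widget", 15.0), ("gadget", 7.5)],
)

# Each aggregate column plays the role of a measure; product is the dimension.
measure_rows = conn.execute(
    "SELECT product, SUM(amount), COUNT(*), MAX(amount), MIN(amount) "
    "FROM sales GROUP BY product ORDER BY product"
).fetchall()
conn.close()
```

Each result row holds one dimension value followed by its aggregated measure values.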
quantitative data
Quantitative data can be characterized as a sequence or progression of values with connected data points
that can be represented as an unbroken line in a visualization. Quantitative fields usually have values
that can be shown in ordered progression, such as height, speed, or duration measurements. Quantitative
values are placed on a continuous axis, always displayed from low to high. In Platfora, measure data
is always quantitative, but numeric or datetime dimensions can be either quantitative or categorical.
Quantitative data is sometimes referred to as continuous data.
reference
A reference allows two datasets to be joined together on one or more fields that they share in common.
A reference creates a link from a field in one dataset to the primary key of another dataset.
Reference fields are typically created in a fact dataset, and point to a dimension dataset. Creating a
reference allows the datasets to be joined when building lenses or segments, similar to a foreign key to
primary key relationship in a relational database.
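The foreign key to primary key relationship can be sketched as a lookup from a fact row into an index built on the dimension dataset's key. The datasets, field names, and helper functions below are hypothetical and only illustrate the idea; they are not Platfora APIs.

```python
def build_key_index(rows, key_field):
    """Index dimension rows by key, rejecting duplicate key values."""
    index = {}
    for row in rows:
        key = row[key_field]
        if key in index:
            raise ValueError(f"duplicate key value: {key!r}")
        index[key] = row
    return index


def follow_reference(fact_rows, ref_field, dimension_index):
    """Join each fact row to the dimension row that its reference field points to."""
    joined = []
    for fact in fact_rows:
        dimension_row = dimension_index.get(fact[ref_field], {})
        joined.append({**fact, **dimension_row})
    return joined


# Hypothetical dimension dataset (key: customer_id) and fact dataset.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
orders = [
    {"order_id": 100, "customer_id": 2, "amount": 25.0},
    {"order_id": 101, "customer_id": 1, "amount": 40.0},
]

customer_index = build_key_index(customers, "customer_id")
joined_orders = follow_reference(orders, "customer_id", customer_index)
```

The duplicate check mirrors the requirement that the target of a reference has a key that uniquely identifies each row.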
regular expressions
Regular expressions, also referred to as regex or regexp, are a standardized collection of special
characters and constructs used for matching strings of text. They provide a flexible and precise language
for matching particular characters, words, or patterns of characters.
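As a small illustration, the following sketch uses Python's re module to find a five-digit ZIP code in an address string. The pattern and variable names are examples only, and syntax details can differ between regular expression engines.

```python
import re

# \b marks a word boundary; \d{5} matches exactly five digits.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")

address = "123 Main St. San Mateo, CA 94403"
zip_codes = ZIP_PATTERN.findall(address)
```

The street number 123 is not matched because it has fewer than five digits.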
ROLLUP measure
ROLLUP is a modifier to an aggregate expression that allows you to define complex measure
expressions, such as windowed, partitioned, or adaptive measure expressions. This is useful when you
want to compute an aggregation for a subset of rows within the overall result of a viz query. It allows
you to compute things such as running totals, moving averages, benchmark comparisons, rank ordering,
percentiles, and so on.
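Two of the computations mentioned above, a running total and a moving average, can be sketched in plain Python. This does not show Platfora's ROLLUP syntax; the daily_sales values and the moving_average helper are assumptions used only to show what a windowed aggregation computes.

```python
from itertools import accumulate

# Hypothetical aggregated values, one per day, in date order.
daily_sales = [10, 20, 30, 40]

# Running total: each position holds the sum of all values up to that day.
running_total = list(accumulate(daily_sales))


def moving_average(values, window):
    """Average each value with up to (window - 1) preceding values."""
    result = []
    for i in range(len(values)):
        window_values = values[max(0, i - window + 1): i + 1]
        result.append(sum(window_values) / len(window_values))
    return result


two_day_average = moving_average(daily_sales, 2)
```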
row
A row represents a single object or record in a dataset. A dataset or lens consists of rows of columns
(or fields).
Each row represents a set of related data, and every row has the same structure. For example, in a dataset
that represents customers, each row would represent a single customer. Columns might represent things
like customer name, email address, gender, age, and so on.
segment
A segment is a special type of dimension field that you can create to group together members of a
population that meet some defined common criteria. A segment is based on members of a dimension
dataset (such as customers) that have some behavior in common (such as purchasing a particular
product).
In Platfora, a segment is always based on a dimension (or referenced) dataset, and must include at
least one condition from a fact or event dataset. For example, customers who are female would not be
considered a valid segment, but customers who are female and made a purchase would be. The members
of a segment do not just share common attributes; they also share a common behavior or
action.
Behind the scenes, segments are saved as a special type of lens that can be used and updated
independently of the lens that they were created from. For example, you can create a segment from a
customer purchases lens but then use that segment in a different customer support calls lens. As long as
the lenses have a conforming dimension in common (such as customer), then segments can be used to
compare behaviors of a group of individuals across multiple fact or event datasets.
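The rule that a segment combines a dimension attribute with a behavior from a fact dataset can be sketched as follows. The customers and purchases data and all names are hypothetical; this only shows the selection logic, not how Platfora stores segments.

```python
# Hypothetical dimension dataset.
customers = [
    {"customer_id": 1, "gender": "F"},
    {"customer_id": 2, "gender": "M"},
    {"customer_id": 3, "gender": "F"},
]
# Hypothetical fact dataset of purchase events.
purchases = [
    {"customer_id": 1, "product": "widget"},
    {"customer_id": 2, "product": "widget"},
]

# Behavior condition: the customer appears in the purchase facts.
buyers = {purchase["customer_id"] for purchase in purchases}

# Segment: attribute condition (female) plus behavior condition (made a purchase).
female_buyers = [
    customer["customer_id"]
    for customer in customers
    if customer["gender"] == "F" and customer["customer_id"] in buyers
]
```

Customer 3 matches the attribute but has no purchase, so only customer 1 is in the segment.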
visualization (viz)
A visualization (or viz for short) is a graphical representation of certain data fields chosen from the
perspective of a single Platfora lens. It is a query of lens data that is visually rendered based on the types
of fields chosen (measure or dimension), their order and placement in the Builder drop zones, and the
various appearance encodings applied to the data (color, size, shape, and so on).
A viz shows aggregated measure data grouped and filtered by the chosen dimensions. A chart in Platfora
can best be described as a recipe of dimension and measure fields, plus axis placement (X-axis and
Y-axis), plus appearance encodings (Color, Size, Shape, Opacity, Labels), plus mark type (Point, Line, Bar,
Area, and so on).
vizboard
A vizboard is the starting point for data analysis, and can be thought of as a dashboard or project
workspace. The vizboard is the canvas for discovering and sharing data insights.
A vizboard contains one or more pages of visualizations that together are meant to tell a data story. The
individual visualizations on a vizboard page can be related (use the same underlying data), or unrelated
(use completely different data). A vizboard can be saved, versioned, and shared with others.