Data Ingest Guide
Version 4.5
Copyright Platfora 2015
Last Updated: 10:15 p.m. June 28, 2015

Contents

Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: About the Platfora Data Pipeline
   FAQs - Platfora Data Pipeline
   About the Data Workflow
Chapter 2: Manage Data Sources
   Supported Data Sources
   Add a Data Source
      Connect to a Hive Data Source
      Connect to an HDFS Data Source
      Connect to an S3 Data Source
      Connect to a MapR Data Source
      Connect to Other Data Sources
      About the Uploads Data Source
   Configure Data Source Security
   Delete a Data Source
   Edit a Data Source
Chapter 3: Define Datasets to Describe Data
   FAQs - Dataset Basics
   Understand the Dataset Workspace
   Understand the Dataset Creation Process
   Understand Dataset Permissions
   Select Source Data
      Supported Source File Formats
      Select a Hive Source Table
      Select DFS Source Files
      Edit the Dataset Source Location
   Parse the Data
      View Raw Source Data Rows
      Update the Dataset Sample Rows
      Update the Dataset Source Schema
      Parse Delimited Data
      Parse Hive Tables
      Parse JSON Files
      Parse XML Files
      Parse Avro Files
      Parse Web Access Logs
      Parse Other File Types
   Prepare Base Dataset Fields
      Confirm Data Types
      Change Field Names
      Add Field Descriptions
      Hide Columns from Data Catalog View
      Default Values and NULL Processing
      Bulk Upload Field Header Information
   Transform Data with Computed Fields
      FAQs - Dataset Computed Fields
      Add a Dataset Computed Field
      Add Binned Fields
   Add Measures for Quantitative Analysis
      FAQs - Dataset Measures
      The Default 'Total Records' Measure
      Add Quick Measures
      Add Computed Measures
   Prepare Date/Time Data for Analysis
      FAQs - Date and Timestamp Processing
      Cast DATETIME Data Types
      About Date and Time References
      About the Default 'Date' and 'Time' Datasets
   Prepare Location Data for Analysis
      FAQs - Location Data and Geographic Analysis
      Understand Geo Location Fields
      Add a Location Field to a Dataset
      Understand Geo References
      Prepare Geo Datasets to Reference
      Add a Geo Reference
   Prepare Drill Paths for Analysis
      FAQs - Drill Paths
      Add a Drill Path
   Model Relationships Between Datasets
      Understand Data Modeling in Platfora
      Add a Reference
      Add an Event Reference
      Add an Elastic Dataset
      Delete or Hide a Reference
      Update a Reference
   Define the Dataset Key
Chapter 4: Use the Data Catalog to Find What's Available
   FAQs - Data Catalog Basics
   Find Available Datasets
   Find Available Lenses
   Find Available Segments
   Organize Datasets, Lenses and Vizboards with Labels
Chapter 5: Define Lenses to Load Data
   FAQs - Lens Basics
   Lens Best Practices
   About the Lens Builder Panel
   Understand the Lens Build Process
      Understand Lens MapReduce Jobs
      Understand Source Data Input to a Lens Build
      Understand How Datasets are Joined
   Create a Lens
      Name a Lens
      Choose the Lens Type
      Choose Lens Fields
      Define Lens Filters
      Allow Ad-Hoc Segments
   Estimate Lens Size
      About Dataset Profiles
      About Lens Size Estimates
   Manage Lenses
      Edit a Lens Definition
      Update Lens Data
      Delete or Unbuild a Lens
      Check the Status of a Lens Build
      Manage Lens Notifications
      Schedule Lens Builds
   Manage Segments - FAQs
Chapter 6: Export Lens Data
   Export an Entire Lens as CSV
   Export a Partial Lens as CSV
   Query a Lens Using the REST API
   FAQs - Lens Export Basics
Chapter 7: Platfora Expressions
   Expression Building Blocks
      Functions in an Expression
      Operators in an Expression
      Fields in an Expression
      Literal Values in an Expression
   PARTITION Expressions and Event Series Processing (ESP)
      How Event Series Processing Works
      Best Practices for Event Series Processing (ESP)
   ROLLUP Measures and Window Expressions
      Understand ROLLUP Measures
      Understand ROLLUP Window Expressions
   Computed Field Examples
   Troubleshoot Computed Field Errors
   Write a Lens Query
   FAQs - Expression Basics
   Expression Language Reference
      Expression Quick Reference
      Comparison Operators
      Logical Operators
      Arithmetic Operators
      Conditional and NULL Processing
      Event Series Processing
      String Functions
      URL Functions
      IP Address Functions
      Date and Time Functions
      Math Functions
      Data Type Conversion Functions
      Aggregate Functions
      ROLLUP and Window Functions
      User Defined Functions (UDFs)
      Regular Expression Reference
Appendix A: Expression Language Reference
   Expression Quick Reference
   Comparison Operators
   Logical Operators
   Arithmetic Operators
   Conditional and NULL Processing
      CASE
      COALESCE
      IS_VALID
   Event Series Processing
      PARTITION
      PACK_VALUES
   String Functions
      CONCAT
      ARRAY_CONTAINS
      FILE_NAME
      FILE_PATH
      EXTRACT_COOKIE
      EXTRACT_VALUE
      INSTR
      JAVA_STRING
      JOIN_STRINGS
      JSON_ARRAY_CONTAINS
      JSON_DOUBLE
      JSON_FIXED
      JSON_INTEGER
      JSON_LONG
      JSON_STRING
      LENGTH
      REGEX
      REGEX_REPLACE
      SPLIT
      SUBSTRING
      TO_LOWER
      TO_UPPER
      TRIM
      XPATH_STRING
      XPATH_STRINGS
      XPATH_XML
   URL Functions
      URL_AUTHORITY
      URL_FRAGMENT
      URL_HOST
      URL_PATH
      URL_PORT
      URL_PROTOCOL
      URL_QUERY
      URLDECODE
   IP Address Functions
      CIDR_MATCH
      HEX_TO_IP
   Date and Time Functions
      DAYS_BETWEEN
      DATE_ADD
      HOURS_BETWEEN
      EXTRACT
      MILLISECONDS_BETWEEN
      MINUTES_BETWEEN
      NOW
      SECONDS_BETWEEN
      TRUNC
      YEAR_DIFF
   Math Functions
      DIV
      EXP
      FLOOR
      HASH
      LN
      MOD
      POW
      ROUND
   Data Type Conversion Functions
      EPOCH_MS_TO_DATE
      TO_CURRENCY
      TO_DATE
      TO_DOUBLE
      TO_FIXED
      TO_INT
      TO_LONG
      TO_STRING
   Aggregate Functions
      AVG
      COUNT
      COUNT_VALID
      DISTINCT
      MAX
      MIN
      SUM
      STDDEV
      VARIANCE
   ROLLUP and Window Functions
      ROLLUP
      DENSE_RANK
      NTILE
      RANK
      ROW_NUMBER
   User Defined Functions (UDFs)
      Writing a Platfora UDF Java Program
      Adding a UDF to the Platfora Expression Builder
   Regular Expression Reference
      Regex Literal and Special Characters
      Regex Character Classes
      Regex Line and Word Boundaries
      Regex Quantifiers
      Regex Capturing Groups
Appendix B: Lens Query Language Reference
   SELECT Statement
      DEFINE Clause
      WHERE Clause
      GROUP BY Clause
      HAVING Clause
      Example of Lens Queries

Preface

This guide provides information and instructions for ingesting and loading data into a Platfora® cluster. It is intended for data administrators who are responsible for making Hadoop data accessible to business users and data analysts. Knowledge of Hadoop, data processing, and data storage is recommended.

Document Conventions

This documentation uses the following text conventions for language syntax and code examples.

$ -- A command-line prompt precedes a command to be entered in a command-line terminal session. Example: $ ls

$ sudo -- A command-line prompt for a command that requires root permissions (such commands are prefixed with sudo). Example: $ sudo yum install open-jdk-1.7

UPPERCASE -- Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (they can be written in upper or lower case). Example: SUM(page_views)

italics -- Italics indicate a user-supplied argument or variable. Example: SUM(field_name)

[ ] (square brackets) -- Square brackets denote optional syntax items. Example: CONCAT(string_expression[,...])

... (ellipsis) -- An ellipsis denotes a syntax item that can be repeated any number of times. Example: CONCAT(string_expression[,...])

Contact Platfora Support

For technical support, you can send an email to support@platfora.com, or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips: http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright © 2012-15 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license.

Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software, subject to their respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp 1.0.1
• Unboundid

Chapter 1: About the Platfora Data Pipeline

Got questions about how Platfora enables self-service access to raw data in Hadoop? Want to know what happens to the data on the way to those stunning, interactive visualizations? This section explains how data flows from Hadoop to Platfora, and what happens to the data at each step in the workflow.

Topics:
• FAQs - Platfora Data Pipeline
• About the Data Workflow

FAQs - Platfora Data Pipeline

This section answers the most frequently asked questions (FAQs) about Platfora's Interest Driven Pipeline™ and the data workflow.

What does 'Interest Driven Pipeline' mean?

The traditional data pipeline is mainly an 'operations driven pipeline' -- IT pushes the data to the consumers, rather than the consumers pulling the data that interests them. In a traditional data pipeline, data is pre-processed, modeled into a relational schema, and loaded into a data warehouse. Then it is optimized (or pre-aggregated) to make it possible for BI and reporting tools to access it. All of this work to move and prepare the data happens regardless of the immediate needs of the business users.

The idea behind an 'interest driven pipeline' is to not move or pre-process the data until somebody wants it. Platfora's approach is to catalog all of the data that's available, then allow business users to discover and request the data that interests them. Once a request is made, Platfora pulls the data from Hadoop, cleanses and processes it, and optimizes it for analysis. Having the entire data pipeline managed in a single application allows for more agile data projects.

How does Platfora access data in Hadoop?

When you first install Platfora, you provide connection information for your Hadoop cluster services. Then you define a Data Source in the Platfora application to point to a particular location in the Hadoop file system. Once a data source is registered in Platfora, the data files in that location are visible to Platfora users.

Can I control who sees what data?

Yes. Platfora provides role-based security so you can control who can see data coming from a particular data source. You can also control access at a more granular per-dataset level if necessary. You can either control data access within the Platfora application, or configure Platfora to inherit the file system permissions from HDFS.

Does all the source data have to be in Hadoop?

Yes (for the most part). Platfora primarily works with data stored in a single distributed file system -- typically HDFS for on-premise Hadoop deployments, or Amazon S3 for cloud deployments.
However, it is also possible to develop custom data connectors to access smaller datasets outside of Hadoop. For example, you may have customer data in a relational database that you want to use in conjunction with log data stored in HDFS. Data connectors can be used to pull relatively small amounts of external data over to Hadoop on demand.

How does Platfora know the structure of the data and how to process it?

You tell Platfora how your data is structured by defining a Dataset. A dataset points to a set of files in a data source and describes the structure of the data, as well as any processing logic needed to prepare the data for consumption. A dataset is just a metadata description of the data -- it contains all of the data about the data -- plus a small sampling of raw rows to facilitate data discovery.

How does Platfora handle messy, complicated data?

Platfora's dataset workspace has a number of tools to help you cleanse and transform data into a structured format. There are built-in data parsers for common file formats, such as Delimited Text, CSV, JSON, XML, and Avro. For unstructured or semi-structured data, Platfora has an extensive library of built-in functions that you can use to define data processing tasks.

When Platfora processes the data during a lens build, it also logs any problem rows that it could not process according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings. Platfora administrators can investigate these warnings to determine the extent of the problem.

How does Platfora deal with multi-structured data?

Defining a dataset in Platfora overlays structure on the data as a lightweight metadata layer. The actual data remains in Hadoop in its raw form until it is requested by a Platfora user. This allows datasets with very different characteristics to exist together in Platfora, described in the unifying language of the Platfora dataset.

If two datasets have fields that can be used to join them together, then the logic of that join can also be described in the dataset as a Reference. Modeling references between datasets within Platfora allows you to quickly combine multi-structured data without having to move or pre-process the data up front.

How do I find available data?

Every dataset that is defined in Platfora is added to the Data Catalog. You can search or browse the data catalog to find datasets of interest. The Platfora data catalog is the one place where you capture all of your organizational knowledge about your data. It is where non-technical users can discover and request the data they need.

How do I request data?

Once you find the dataset you want in the data catalog, you create a Lens in Platfora to request data from that dataset. A lens is a selection of fields, taken from the focal point of a single dataset, that points to data in Hadoop.

How does data get from Hadoop into Platfora?

Users bring data into Platfora by kicking off a Lens Build. A lens build runs a series of processing jobs in Hadoop to pull, process, and optimize the requested data. The output of these jobs is the Lens. Once the lens build jobs have completed successfully in the Hadoop cluster, the prepared lens data is copied over to the Platfora nodes. At this point the data is in Platfora and available for analysis.

Where does the prepared data (the lenses) reside?

Lens data is distributed across the Platfora worker nodes.
This allows Platfora to use the resources of multiple servers to process lens queries in parallel, and to scale up as your data grows. Lenses are stored on disk on the Platfora nodes, and are also loaded into memory whenever Platfora users interact with them. Having the lenses in memory makes the queries run faster. A copy of each lens is also stored in the primary Hadoop file system as a backup.

How do I explore the data in a lens?

Once a lens is built, the data is in Platfora and ready to explore in a Vizboard. The main way to interact with the data in a lens is to create a Visualization (or Viz for short). A viz is just a lens query that is represented visually as a chart, graph, or table.

Is building a viz the only way to look at the data in a lens?

No, but we think it is the best way! Platfora also has a REST API that you can use to programmatically query a lens, or you can export lens data in CSV format for use in other applications or data workflows.
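As a rough illustration, a scripted lens query could be driven from any HTTP client. In the curl sketch below, the host, credentials, and endpoint path are hypothetical placeholders, not the documented interface; see Query a Lens Using the REST API in the Export Lens Data chapter for the actual API:

$ # hypothetical endpoint, shown for illustration only
$ curl -u analyst:secret -o lens_results.csv \
      "http://platfora-master:8001/api/lens-query"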
How is Platfora different from Hadoop tools like Hive?

First of all, users do not need any special technical knowledge to build a lens. Platfora enables all levels of users to request data from Hadoop -- no programming or SQL skills required.

Secondly, with a query tool like Hive, each query is its own MapReduce job in Hadoop. You have to wait for each query to run in order to see the results. If you want to change the query, you have to rewrite it and run it again (and wait). It is not very responsive or interactive.

A lens is more like an on-demand data mart than a single query result. It contains optimized data that is loaded into memory, so the query experience is fast and interactive. The data contained in a lens can support many combinations of queries, and the results are rendered visually so that insights are easier to find.

What if a lens doesn't have the data I need?

If a lens doesn't quite meet your data requirements, there are a couple of things you can do:
• You can edit an existing lens definition to add additional fields or expand the scope of the data requested.
• You can add computed fields directly in the vizboard to further manipulate the data you already have.
• You can go back to the data catalog and create an entirely new lens. You can even upload new data from your desktop and combine it with datasets already in Platfora.

How can I know if the data is correct?

One of the advantages of having the entire data pipeline in one application is complete visibility at each stage of the workflow. Platfora allows you to see the data lineage of every field in a lens, all the way back to the source file that the data originated from.

How do I share my insights with others?

Platfora's vizboards were purpose-built for sharing and collaboration. You can invite others to join you in a vizboard, and use comment threads to collaborate. You can prepare view-only dashboards and email them to your colleagues. You can also export data and images from the vizboard for use in other business applications, like R, Excel, or PowerPoint.

About the Data Workflow

What are the steps involved in going from raw data in Hadoop to visualizations in Platfora? What skills do you need to perform each step? This section explains each stage of the data workflow, from data ingest, to analysis, to collaboration.

Step 1: Define Data Sources to Connect to Raw Data in Hadoop

The first step in the data pipeline is to make the raw data accessible to Platfora. This is done by defining a Data Source. A data source uses a data connector to point to some location in the Hadoop file system or other external data server. Platfora has out-of-the-box data connectors for:
• HDFS
• MapR FS
• Amazon S3
• Hive Metastore

Platfora also provides APIs for defining your own custom data connectors.

Who does this step? System Administrators (someone who knows where the raw data resides and how to provide access to it). System administrators also define the security permissions for the data sources. Platfora users can only interact with data that they are authorized to see.

Step 2: Create Datasets to Describe the Structure of the Data

After you have connected to a data source, the next step is to describe and model the data by creating Datasets in Platfora. A dataset is a pointer to a collection of raw data files, along with a metadata description of how those files are structured. Platfora provides a number of built-in file parsers for common file formats, such as:
• Delimited Text
• Comma-Separated Values (CSV)
• JSON
• XML
• Avro
• Web Access Logs
• Hive Table Definitions

In addition to describing the structure of the data, datasets also contain information on how to process the data and how to join different datasets together. If you are familiar with ETL workflows (extract, transform, and load), the dataset encompasses the extract and transform logic.

Who does this step? Data Administrators (someone who understands the data and how to make it ready for consumption).

Step 3: Build a Lens to Pull Data from Hadoop into Platfora

All datasets that have been defined in Platfora are available in Platfora's Data Catalog. The data catalog is where Platfora users can see what data is available and make requests for the data they need. The way you request data is by choosing a dataset, then building a Lens from that dataset. A lens can be thought of as an on-demand data mart, a summary table, or a materialized view.

A Lens Build automates a number of Hadoop processing tasks -- it submits a series of MapReduce jobs to Hadoop, collects the results, and brings the results back into Platfora. The data populated to a lens is pre-aggregated, compressed, and stored in a columnar format. From the perspective of an ETL workflow, the lens build is the load part of the process.

Who does this step? Data Analysts or Data Administrators (someone who understands the business need for the data or has an analysis use case they want to achieve). Lenses provide self-service access to the data in Hadoop -- users do not need any specialized technical skills to build a lens. Data administrators may want to set up a schedule of production lenses that are built on a regular basis. However, data analysts can also build their own lenses as needed.

Step 4: Create Vizboards to Analyze and Visualize the Data

Once a lens is built, the data is available in Platfora for analysis. Platfora users create Vizboards to manage their data analysis projects. The vizboard can be thought of as a project workspace where you can explore the data in a lens by creating visualizations. A Visualization (or Viz for short) is the result of a lens query, with the data represented in a visual way. Visualizations can take various forms, such as charts, graphs, maps, or cross-tabs. As users build vizzes using the data in a lens, the data is loaded into memory so the experience is fast and interactive.
Within a vizboard, analysts can build dashboards (or pages) of visualizations that reveal particular business insights or tell a data story. For example, a vizboard may show two or three charts that support a future business direction or confirm the results of a past business campaign or decision.

Who does this step? Data Analysts (anyone who has access to the data and has a question or hunch they want to investigate).

Step 5: Share Your Insights with Others

The Platfora Vizboard is a place where you can collaborate with your fellow analysts or share prepared insights with business users. You can invite other Platfora users to view and comment on your vizboards, or you can export images from a vizboard to send to others via email or PDF. You can also export query results (the viz data) for use in other applications, such as Excel or R.

Who does this step? Data Analysts (anyone who has an insight they want to share).

Chapter 2: Manage Data Sources

The first step in making Hadoop data available in Platfora is identifying what source data you want to expose to your business users, and making sure the data is in a format that Platfora can work with. Although the source data may be coming into Hadoop from a variety of source systems, and in a variety of different file formats, Platfora needs to be able to parse it into rows and columns in order to create a dataset in Platfora. Platfora supports a number of data sources and source file formats.

Topics:
• Supported Data Sources
• Add a Data Source
• Configure Data Source Security
• Delete a Data Source
• Edit a Data Source

Only System Administrators can manage data sources.

Supported Data Sources

Hadoop supports many different distributed file systems, of which HDFS is the primary implementation. Platfora provides data adapters for a subset of the file systems that Hadoop supports. Hadoop also has various database and data warehouse implementations, some of which can be used as data sources for Platfora. This section describes the data sources supported by Platfora.

Hive -- Platfora can use a Hive metastore server as a data source, and map a Hive table definition to a Platfora dataset definition. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. It is important to note that Platfora does not execute queries through Hive; it only uses Hive tables to obtain the metadata needed for defining datasets. Platfora generates and runs its own MapReduce jobs directly in Hadoop.

HDFS -- Hadoop Distributed File System (HDFS) is the primary storage system for Hadoop. Platfora can be configured to connect to the HDFS NameNode server and use the HDFS file system as its primary data source.

Amazon S3 -- Amazon Simple Storage Service (Amazon S3) is a distributed file system hosted by Amazon where you pay a monthly fee for storage space and data transfer bandwidth. It can be used as a data source for users who run their Hadoop clusters on Amazon EC2 or who utilize the Amazon EMR service. Hadoop supports two S3 file systems as an alternative to HDFS: S3 Native File System (s3n) and S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.

MapR FS -- MapR FS is the proprietary Hadoop distributed file system of MapR.
Platfora can be configured to connect to a MapR Container Location Database (CLDB) server and use the MapR file system as its primary data source.

Uploaded Files -- Platfora allows you to upload files from your local file system into Platfora. These files are added to a special Uploads data source, which resides in the distributed file system (DFS) that the Platfora server is configured to use when it first starts up.

Custom Data Connector Plugins -- Platfora provides Java APIs that allow developers to create custom data connector plugins. For example, you can create a plugin that connects to a relational database such as MySQL or PostgreSQL. Datasets created from a custom data source should be relatively small (less than 100,000 rows). External data is pulled over to Hadoop at lens build time via the Platfora master (which is not a parallel operation).

Add a Data Source

A data source is a connection to a mount point or directory on an external data server, such as a file system or database server. Platfora currently provides data source adapters for Hive, HDFS, Amazon S3, and MapR FS.

1. Go to the Data Catalog > Datasets page.
2. Click Add Dataset to open the dataset workspace.
3. Click New Source.
4. Enter the connection information for the data source server. The required connection information depends on the Source Type you choose.
5. Click Connect.
6. Click Cancel to exit the dataset workspace.

Connect to a Hive Data Source

When Platfora uses Hive as a data source, it connects to the Hive metastore to query information about the source data. There are multiple ways to configure the Hive metastore service in your Hadoop environment. If you are using the Hive Thrift Metastore (known as the remote metastore client configuration), you can add a Hive data source directly in the Platfora application. Connecting directly to the Hive metastore relational database management system (RDBMS) (known as a local metastore client configuration) requires additional configuration on the Platfora master server; you cannot define this type of Hive data source in the Platfora application alone. See the Hive wiki for more information about the different Hive metastore client configurations.

Connect to a Hive Thrift Metastore

By default, the Platfora application allows you to connect to the Hive Thrift Metastore service. To use the Thrift server as a data source for Platfora, you must start the Hive Thrift Metastore server in your Hadoop environment and know the URI to connect to this server.

In a remote Hive metastore setup, Hive clients (such as Platfora) make a connection to the Hive Thrift Metastore server, which then queries the metastore database (typically a MySQL database) for the Hive metadata. The client and metastore server communicate using the Thrift protocol.

You can add a Hive Thrift Metastore data source in the Platfora application. You will need to supply the URI to the Hive Thrift metastore service in the format:

thrift://hive_host:thrift_port

where hive_host is the DNS host name or IP address of the Hive server, and thrift_port is the port that the Hive Thrift metastore service is listening on. For Cloudera 4, Hortonworks 1.2, and MapR installations, the default Thrift port is 9083. For Hortonworks 2 installations, it is 9983.
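For example, if your Hive Thrift Metastore service runs on a host named hive-master.example.com (a hypothetical host name used here for illustration) on the default Cloudera port, the data source URI would be:

thrift://hive-master.example.com:9083

You can also confirm that the Platfora master server can reach that host and port over the network before adding the data source. A minimal check, assuming the netcat (nc) utility is installed on the Platfora master:

$ nc -z hive-master.example.com 9083   # exit status 0 means the port is reachable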
If the connection to Hive is successful, you will see a list of available Hive databases in that data source. Click on a database name to show the Hive tables within that database. The default database in Hive is named default. If you have not created your own databases in Hive, this is where all of your tables will reside.

If you are using Hive views, they will also be listed. However, Hive views cannot be used as the basis of a Platfora dataset; you can only create a dataset from Hive tables.

If you have trouble connecting to Hive, make sure that the Hive Thrift metastore server process is running, and that the Platfora server machine has access over the network to the designated Hive Thrift server port. Also, make sure that the system user that the Platfora server runs as has read permissions to the underlying data files in HDFS. The Hive Thrift metastore is an optional service and is not usually started by default when you install Hive, so it is possible that the service is not started.

To check whether Platfora can connect to the Hive Thrift Metastore, run the following command from the Platfora master server:

$ hive --hiveconf hive.metastore.uris="thrift://your_hive_server:9083" --hiveconf hive.metastore.local=false

Make sure the Hive server host name or IP address and Thrift port are correct for your Hadoop installation. For Cloudera 4, Hortonworks 1.2, and MapR installations, the default Thrift port is 9083. For Hortonworks 2 installations, it is 9983.

If the Platfora server can connect, you should see the Hive console command prompt and be able to query the Hive metastore. For example:

hive> SHOW DATABASES;
hive> exit;

If you cannot connect, it is possible that your Hive Thrift Metastore service is not running. Depending on the Hadoop distribution you are using and the version of Hive server you are running, there are different ways to start the Hive Thrift metastore. For example, run one of the following commands on the server where Hive is installed:

$ sudo hive --service metastore

or

$ sudo hive --service hiveserver2

Check on your Hive server to make sure it is started, and view your Hive server logs for any issues with starting the metastore.

Connect to a Hive RDBMS Metastore

If you are not using the Hive Thrift Metastore server in your Hadoop environment, you can configure Platfora to connect directly to a Hive metastore relational database management system (RDBMS), such as MySQL. This requires additional configuration on the Platfora master server that must be done before you can create the data source in the Platfora application: the Platfora master server needs a hive-site.xml file with the correct RDBMS connection information, and you need to install the appropriate JDBC driver on the Platfora master server and make sure that Platfora can find the Java libraries and class files for the JDBC driver.

Here is an example hive-site.xml for connecting to a MySQL metastore. A hive-site.xml containing these properties must reside in the local Hadoop configuration directory of the Platfora master server.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hive_hostname:port/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_password</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>120</value>
  </property>
  <property>
    <name>hive.metastore.batch.retrieve.max</name>
    <value>100</value>
  </property>
</configuration>

The Platfora server would also need the MySQL JDBC driver installed in order to use this configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to install them (requires a Platfora restart).
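As a concrete sketch, installing the driver might look like the following (the driver file name and version here are hypothetical; use the driver that matches your metastore database):

$ cp mysql-connector-java-5.1.35-bin.jar $PLATFORA_DATA_DIR/extlib/
$ # then restart the Platfora server using your normal restart procedure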
You can add a Hive RDBMS Metastore data source in the Platfora application after you have done the appropriate configuration on the Platfora master server. When you leave the Thrift Metastore URI blank, the Platfora application will look for the metastore connection information in the hive-site.xml file on the Platfora master server.

If the connection to Hive is successful, you will see a list of available Hive databases in that data source. Click on a database name to show the Hive tables within that database. The default database in Hive is named default. If you have not created your own databases in Hive, this is where all of your tables will reside.

If you are using Hive views, they will also be listed. However, Hive views cannot be used as the basis of a Platfora dataset. You can only create a dataset from Hive tables.

If you have trouble connecting to the Hive RDBMS metastore, make sure that:
• The Hive RDBMS metastore server process is running (that is, the MySQL database server is running).
• The Platfora server machine has access over the network to the designated database server host and port.
• The system user that the Platfora server runs as has database permissions granted on the appropriate database objects in the RDBMS. For example, if using a MySQL metastore you could run a command such as the following in MySQL:
GRANT ALL ON *.* TO 'platfora'@'%';
• The system user that the Platfora server runs as has read permissions to the underlying data files in HDFS.

Connect to an HDFS Data Source

Creating an HDFS data source involves specifying the connection information to the HDFS NameNode server. Once you have successfully connected, you will be able to browse the files and directories in HDFS, and choose the files that you want to add to Platfora as datasets.

When you add a new data source that connects to an HDFS NameNode server, you will need to supply the following connection information:

Source Type: HDFS
Name: A name for the data source location. This can be any name you choose, such as HDFS User Data or HDFS Root Directory.
Host: The external DNS hostname or IP address of the HDFS NameNode server.
Port: The port that the HDFS NameNode server listens on for connections. For Cloudera installations, the default port is 8020. For Apache installations, the default port is 9000.
Root Path: The HDFS directory that Platfora should access. For example, to access the entire HDFS file system, use / (the root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to HDFS is successful, you will see a list of the files and directories that reside in the specified location of the HDFS file system when defining a dataset from the data source.

If you have trouble connecting to HDFS, make sure that the HDFS NameNode server process is running, and that the Platfora server machine has access over the network to the designated NameNode port. Also, make sure that the system user that the Platfora server runs as has read permissions to the HDFS directory location you specified.

Connect to an S3 Data Source

Amazon Simple Storage Service (Amazon S3) is a distributed file system hosted on Amazon Web Services (AWS). Data transfer is free between S3 and Amazon cloud servers, making S3 an attractive choice for users who run their Hadoop clusters on EC2 or use the Amazon EMR service.

If you are not using Amazon EC2 or EMR as your primary Hadoop implementation for Platfora, you can still use S3 as a data source, but keep in mind that the source data will be copied to the Platfora primary Hadoop implementation during the lens build process. If you are transferring a lot of data between S3 and another network outside of Amazon, this could be slow.

Hadoop supports two S3 file systems as an alternative to HDFS: the S3 Native File System (s3n) and the S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.

When you add a new data source that connects to an S3 data source, you will need to supply the following connection information:

Source Type: Amazon S3
Name: A name for the data source location. This can be any name you choose, such as S3 Sample Data or Marketing Bucket on S3.
Bucket Name: A bucket is a named container for objects stored in Amazon S3. If you go to your AWS Management Console S3 Home Page, you can see the list of buckets you have created for your account.
Path: The directory in the specified bucket that Platfora should access. For example, to access the entire bucket, use / (the root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to Amazon S3 is successful, you will see a list of the files and directories that reside in the specified location of the S3 file system when defining a dataset from the data source.

If you have trouble connecting to Amazon S3, make sure that the Platfora server machine has access over the network to Amazon Web Services, and that your S3 connection information and AWS security credentials are specified in the core-site.xml configuration file of the Platfora master server. If you are using Amazon EMR as the default Hadoop implementation for Platfora, your Platfora administrator should have configured the S3 connection information and AWS security credentials during installation.
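For reference, the s3n credentials live in core-site.xml as ordinary Hadoop properties. A minimal sketch with placeholder values (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the standard Hadoop property names for the s3n file system; confirm against your Hadoop version's documentation):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>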
Connect to a MapR Data Source

Creating a MapR data source involves specifying the connection information to the MapR Container Location Database (CLDB) server. Once you have successfully connected, you will be able to browse the files and directories in the MapR file system, and choose the files that you want to add to Platfora as datasets.

When you add a new data source that connects to a MapR cluster, you will need to supply the following connection information:

Source Type: MapR
Name: A name for the data source location. This can be any name you choose, such as MapR File System or MapRFS Marketing Directory.
Host: The external DNS hostname or IP address of the MapR Container Location Database (CLDB) server.
Port: The port that the MapR CLDB server listens on for client connections. The default port is 7222.
Root Path: The MapR file system (MapRFS) directory that Platfora should access. For example, to access the entire file system, use / (the root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to MapR is successful, you will see a list of the files and directories that reside in the specified location of the MapR file system when defining a dataset from the data source.

If you have trouble connecting to MapR, make sure that the CLDB server process is running, and that the Platfora server machine has access over the network to the designated CLDB port. Also, make sure that the system user that the Platfora server runs as has read permissions to the MapRFS directory location you specified.

Connect to Other Data Sources

The Other data source type allows you to specify a connection URL to an external data source server. You can use this to create a data source when you already know the protocol and URL to connect to a supported data source type.

When you add a data source using Other, you will need to supply the following connection information:

Source Type: Other
Name: A name for the data source location. This can be any name you choose, such as My File System or Marketing Data.
URL: A connection URL for the data source using one of the supported data source protocols (hdfs, maprfs, thrift, or s3n). You can also use the file protocol to access a directory or file on the local Platfora master server file system. For example:
file://localhost:8001/file_path_on_platfora_master

If the connection to the data source is successful, you will see a list of the files and directories that reside in the specified location of the file system when defining a dataset from the data source.

If you have trouble connecting, make sure that the Platfora server machine has access over the network to the designated server. Also, make sure that the system user that the Platfora server runs as has read permissions to the directory location specified.

About the Uploads Data Source

When you first start the Platfora server, it connects to the configured distributed file system (DFS) for Hadoop and creates a default data source named Uploads. This data source cannot be deleted.

You can upload single files residing on your local file system, and they will be copied to the Uploads data source in Hadoop. For large files, it may take a few minutes for the file to upload. The largest file that you can upload through the Platfora web application is 50 MB. If you have larger files, consider adding them directly in the Hadoop file system rather than uploading them through the browser.

Other than adding new files, you cannot manage the files in the Uploads data source through the Platfora application. You cannot remove files from the Uploads data source once they have been uploaded, or create sub-directories to organize uploaded files. If you want to remove a file, you must delete it in the DFS source system. Re-uploading a file with the same file name will overwrite the previously uploaded copy of the file.
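For example, removing an uploaded file means deleting it with your Hadoop tooling directly against the DFS. A sketch using the standard hadoop fs commands, assuming a hypothetical uploads location of /platfora/uploads (the actual directory depends on your installation):

$ hadoop fs -ls /platfora/uploads
$ hadoop fs -rm /platfora/uploads/old_sales.csv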
Upload a Local File

You can upload a file through the Platfora application, and it will be copied to the default Uploads data source in Hadoop. Once a file is uploaded, you can select it as the basis for a dataset.

1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Click Upload File.
4. Browse your local file system and select the file you want to upload.
5. Click Upload.

After the file is uploaded, you can either Cancel to exit the dataset workspace, or Continue to define a dataset from the uploaded file.

About Security on Uploaded Files

By default, data access permissions on the Uploads data source are granted to the Everyone group. Object permissions allow the Everyone group to define datasets from the data source (and thereby upload files to this data source). Keep in mind that only users with a system role of System Administrator or Data Administrator are allowed to create datasets, so only these roles can upload files.

Configure Data Source Security

Only system administrators can create data sources in Platfora. Access to the files in a data source location is controlled by granting data access permissions to the data source. The ability to manage or define datasets from a data source is controlled by its object permissions.

1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source in the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Permission Settings.
6. The Data Access section lists the users and groups allowed to see the data coming from this data source location. If a user does not have data access, they will not be able to see any data values in Platfora that originate from this data source. Data access permissions apply to any Platfora object created from this source (dataset, lens, or viz). Data access defaults to the Everyone group (click the X to remove it). Click Add Data Access to grant data access to other users and groups.
7. The Collaborators section lists the users and groups allowed to access the data source object. Click Add Collaborators to grant object access to users or groups. The following data source object permissions can be granted:
• Define Datasets on Data Source. The ability to define datasets from files and directories in the data source.
• Manage Permissions on Data Source. Includes the ability to define datasets, plus the ability to grant data access and object access permissions to other Platfora users.

Delete a Data Source

Deleting a data source from Platfora removes the data source connection as well as any Platfora dataset definitions you have created from that data source. It does not remove source files or directories from the source file system, only the Platfora definitions. The default Uploads data source cannot be deleted.

1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source you want to delete from the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Delete.
6. Click Confirm to delete the data source and all of its dataset definitions.
7. Click Cancel to exit the dataset workspace.

Edit a Data Source

You typically do not need to edit a data source once you have successfully established a connection. If the connection information changes, however, you can edit an existing data source to update its connection information, such as the server name or port of the data source. You cannot, however, change the name of a data source after it has been created.

1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source you want to edit from the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Edit.
6. Change the connection information for the data source. You cannot change the name of a data source after it has been saved.
7. Click Save.
8. Click Cancel to exit the dataset workspace.

Chapter 3

Define Datasets to Describe Data

Data in Hadoop is added to Platfora by defining a Dataset. A dataset describes the characteristics of the source data, such as its file locations, the structure of individual rows or records, the fields and data types, and the processing logic to cleanse, transform, and aggregate the data when it is loaded into Platfora. The collection of modeled datasets makes up the Data Catalog (the data items available to Platfora users).

This section explains how to create and manage datasets in Platfora. Datasets point to source data in Hadoop.

Topics:
• FAQs - Dataset Basics
• Understand the Dataset Workspace
• Understand the Dataset Creation Process
• Understand Dataset Permissions
• Select Source Data
• Parse the Data
• Prepare Base Dataset Fields
• Transform Data with Computed Fields
• Add Measures for Quantitative Analysis
• Prepare Date/Time Data for Analysis
• Prepare Location Data for Analysis
• Prepare Drill Paths for Analysis
• Model Relationships Between Datasets
• Define the Dataset Key

FAQs - Dataset Basics

This section answers the most frequently asked questions (FAQs) about creating and managing Platfora datasets.

What is a dataset?

A dataset points to a set of files in a data source and describes the structure of the data, as well as any processing logic needed to prepare the data for consumption. A dataset is just a metadata description of the data -- it contains all of the data about the data -- plus a small sampling of raw rows to facilitate data discovery.

What are the prerequisites for creating a dataset?

You need access to the source data. Before you add a new dataset to Platfora, the source data files on which the dataset is based must be in the Hadoop file system and accessible to Platfora via a Data Source. You can also upload files from your desktop to the default Uploads data source.

Who can create a dataset?

Only System Administrators or Data Administrators can create and edit datasets in Platfora. You must also have data access permissions to the source data in order to define a dataset from data files in Hadoop. The person who creates the dataset becomes the dataset owner. The dataset owner can grant permissions to other Platfora users.

How do I create a dataset?

Go to the Data Catalog and click Add Dataset. The dataset workspace guides you through a series of steps to define the structure and processing rules for the data. See Understand the Dataset Creation Process.
How do I edit an existing dataset?

Open the dataset detail page and click Edit..., or find the dataset in the Data Catalog and choose Edit from the dataset action menu. If the edit option is not available, it means you don't have the appropriate permissions. Ask the dataset owner to grant you edit permission.

How do I rename a dataset?

You cannot rename a dataset after it has been saved for the first time. You can, however, make a duplicate copy of a dataset and save it under a new name. You can then delete the old dataset and keep the renamed one. Note that any references to the renamed dataset will be broken in other datasets, so you will have to manually update those.

Can I make a copy of a dataset?

Yes, you can make a copy of an existing dataset. Edit the dataset you want to copy, and choose Save As from the dataset workspace Save menu. Platfora makes a copy of the current version of the dataset using the new name. Any dataset changes that were made since saving the previous dataset are applied to the new dataset only.

You might want to copy an existing dataset to:
• Experiment with changes to the dataset computed fields without affecting the original dataset.
• Create another dataset that accesses different source files, for users that only have access to source files in a different path.
• Change the name of the dataset (then delete the original dataset).

Since duplicating a dataset changes its name, references to the previous dataset will not be automatically updated to point to the duplicated dataset. You must manually edit the other datasets and update their references to point to the new dataset name instead.

How do I delete a dataset?

Open the dataset detail page and click Delete..., or find the dataset in the Data Catalog and choose Delete from the dataset action menu. If the delete option is not available, it means you don't have the appropriate permissions. Only a dataset owner can delete a dataset.

Deleting a dataset does not remove files or directories in the source file system, and it does not remove lenses built from the dataset. Any lenses that have been built from the dataset will remain in Platfora; however, future lens builds that use a deleted dataset will fail. Also, any references to the deleted dataset will be broken in other datasets.

What kinds of data can be used to define a dataset?

You can define a dataset from data files that reside in Hadoop. Platfora supports a number of file formats out-of-the-box. See Supported Source File Formats.

How do I join datasets together?

The logic of a join is described within the dataset definition as a Reference. A reference joins two datasets together using fields they share in common. A reference creates a link in one dataset to the primary key of another dataset. The actual joining of the datasets happens at lens build time, not when the reference is created. See Model Relationships Between Datasets.

What are the different kinds of columns or fields that a dataset can have?

A field is an atomic unit of data that has a name, a value, and a data type. A column is a set of data values of a particular data type, with one value for each row in the dataset. Columns provide the structure for composing a dataset row. The terms column and field are often used interchangeably.
Within a dataset, there are three basic classes of fields or columns:
• Base fields are the raw fields parsed directly from the source data.
• Computed fields are fields that you add to the dataset to perform some kind of extraction, cleansing, or transformation on the base data fields.
• Measure fields are a special type of computed field that specifies how the data should be aggregated when it is analyzed. For example, suppose you had a Dollars Sold field in your dataset. At analysis time, you may want to know the Total Dollars Sold per day (a SUM aggregation). Measures serve as the quantitative data in an analysis, and every dataset, lens, and viz must have at least one measure.

Every dataset column or field also has a data type, which describes the kind of values allowed in that column. See About Platfora Data Types. You can change the data types of base fields. Computed field data types are set by the output type of their computed expression.

How do I transform or manipulate the data?

To transform or manipulate the data, add computed fields to the dataset. Platfora's expression language has an extensive library of built-in functions and operators that you can use to define computed fields. Think of a computed field as a single step in an ETL (extract, transform, load) workflow. Sometimes several steps, or computed fields, are needed to achieve the result you want. You can hide the computed fields that perform interim data processing steps.

How do I request data from a dataset?

You request data from a dataset by choosing one dataset in the Data Catalog and creating a lens from that dataset. When you create a lens, you can choose any fields you want from the focus dataset, plus dimension fields from any dataset that it references. When you build the lens, Platfora fetches the data from Hadoop and prepares it for analysis. See Define Lenses to Load Data.

Understand the Dataset Workspace

When you add a new dataset or edit an existing one, you are brought to the dataset workspace. This is where you describe the structure and characteristics of your source data in the form of a Platfora dataset.

1. The dataset workspace is divided into six areas to guide you through the dataset definition process. You can go back and forth between the areas as you work on the dataset. You do not have to do the steps in order.
2. The dataset is horizontally divided into columns (or fields). Columns are listed in the order that they occur in the source data (for original base fields), then in the order that they were added to the dataset (for computed fields).
3. When you select a column, the Field Info panel shows the field detail information. This is where you can edit things like the field name, description, data type, or quick measure aggregations.
4. Platfora shows twenty sample rows to help with data discovery. These are records taken directly from the source data files, shown with the parsing or expression logic applied. Some computed columns do not show sample values because the values are computed at lens build time, such as measures (aggregate computed fields), event series processing (ESP) computed fields, and any computed field that operates on fields not in the current dataset (via references).
5. At the bottom of the dataset workspace is where you can navigate between areas (Back or Continue), save your changes (Save, Save As, or Save and Exit), or exit the dataset without saving (Cancel).

Understand the Dataset Creation Process

There are several steps involved in creating a Platfora dataset. This section helps data administrators understand all of the tasks to consider when defining a dataset in Platfora. The goal of the dataset is to make the data consumable for data analysts and business users.

The dataset workspace is divided into six areas to guide you through the dataset definition process. You can go back and forth between the areas as you work on the dataset. You do not have to do the steps in order.

Step 1 - Select Data

The Select Data step is where you point Platfora to a specific location in a data source. You can only browse the data sources that have been added to Platfora by a system administrator. Once the dataset has been saved, the Select Data step becomes disabled. You can change the Source Data location within the same data source, but you cannot switch data sources for an existing dataset.

Step 2 - Parse Data

The Parse Data step is where you specify the parsing logic used to extract rows and columns from the source data. Platfora comes with several built-in parsers for the most common file formats. After you have done the initial parsing, you usually don't need to revisit this step unless the underlying structure of the data changes. The Wrangled tab shows the data with the parsing logic applied. The Raw tab shows the original raw data records.

Step 3 - Manage Fields

The Manage Fields step is where you prepare the actual fields that users can see and request from the Platfora data catalog. This is where the majority of the dataset definition work is performed. To make sure that the data is in a consumable format for analysis, you may need to:

1. Verify the base field data types
2. Give fields meaningful names
3. Add field descriptions to help users understand the data
4. Add computed fields to further transform and process the data
5. Identify the dataset measures
6. Hide fields you don't want users to see
7. Specify how NULL values are handled
8. Prepare geo-location data for analysis
9. Prepare datetime data for analysis
10. Define drill path hierarchies

Step 4 - Create References

The Create References step is where you create joins to other datasets. You may need to come back to this step later, once all of the dependent datasets have been added to Platfora. When adding the dependent datasets, you must make sure that a) they have a primary key, and b) the data types of the primary key and foreign key fields are the same in both datasets.

Step 5 - Define Key

The Define Key step is where you choose the column(s) that uniquely identify each row in the dataset, also known as the primary key of the dataset. A dataset only needs a primary key if:
• You plan to join to it from another dataset (it is the target of a reference)
• You want to use it as the focus of an event series lens
• You want to use it to define segments

Step 6 - Finish & Save

The Finish & Save step is where you can add a description of the dataset and verify the dataset name. Dataset names cannot be changed after the dataset has been saved for the first time.
Understand Dataset Permissions

Only system and data administrators can create datasets in Platfora. The ability to edit or create a lens from a dataset is controlled by the dataset's object permissions. In addition to the dataset object permissions, users must also have access to the source data itself in order to see and work with the data in Platfora.

Platfora controls access to a dataset at two levels:
• Source Data Access Permission - Source data access permission determines who is authorized to view the raw source data. By default, data access permission is controlled at the data source level only, and is inherited by the datasets coming from that data source. Your Platfora system administrator may also configure Platfora to authorize data access using the permissions set in HDFS. In these two cases, data access permission is disabled at the dataset level. If Platfora is configured for more granular per-dataset access control, then data access can be set independently of the data source, but this is not the default behavior.
• Dataset Object Permissions - Dataset object permissions control who can edit, delete, or create a lens from a dataset within the Platfora application.

Users must have dataset permissions at both levels in order to work with a dataset.

To manage permissions for a dataset, find the dataset in the data catalog and select Permissions. Click Add Collaborators to choose new users or groups to add. By default, the user who created the dataset is the owner, and the Everyone group is granted Define Lens from Dataset access.

The following dataset object permissions can be granted:
• Define Lens from Dataset. The ability to define a lens from the visible fields of a dataset. The fields of referenced datasets are not included in this permission by default. A user must have appropriate permissions on each individual dataset in order to choose dataset fields for a lens. By default, all datasets have this permission granted to Everyone.
• Edit. Define lens plus the ability to edit the dataset definition. Editing the dataset definition means a user can see the raw data, including hidden fields.
• Own. Edit plus the ability to delete a dataset or manage its permissions.

Select Source Data

After you have created a data source, the first step in creating a dataset is selecting some Hadoop source data to expose in Platfora. This is accomplished by choosing files from the source file system.

For Hive data sources, a single Hive table definition maps to a single Platfora dataset. For file system data sources such as HDFS or S3, a dataset can map to either a single file or to multiple files residing in the same parent directory location. For the default Uploads data source, a dataset usually maps to a single uploaded file, although you can select multiple uploaded files if they use a similar file naming convention.

Supported Source File Formats

To ingest source data, Platfora uses its parsing facilities to parse the data into records (rows) and fields (columns). Platfora supports the following source file formats and file compression formats.
Hive Tables: When creating a dataset from a Hive table, there is no need to define parsing controls in Platfora. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. Since Platfora relies on Hive to do the file parsing, you must make sure that Hive is able to correctly handle the source file format of the underlying table data files. Platfora is able to parse Hive tables that refer to data in the following file formats:
• Delimited Text file format
• SequenceFile format
• Record Columnar File (RCFile) format
• Optimized Row Columnar (ORC) file format
• Custom Input Format (provided that the SerDe used to define the row format is also installed in Platfora)

Delimited Text: A delimited file is a plain text file format for describing tabular data. It refers to any file that is plain text (typically ASCII or Unicode characters), has one record per line, has records divided into fields, and has the same sequence of fields for every record. Records (or rows) are separated by line breaks, and fields (or columns) within a line are separated by a special character called the delimiter (usually a comma or tab character). If the delimiter also appears in the field values, it must be escaped. The Platfora delimited parser supports single-character escapes (such as a backslash), as well as enclosing field values in double quotes (as is common with CSV files).

CSV: Comma-separated value (CSV) files are a type of delimited text file. The Platfora delimited file parser also supports typical CSV formatting conventions, such as enclosing field values in double quotes, using double quotes to escape literal quotes, and the use of header rows.

JSON: JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the JavaScript Programming Language. JSON is a text format comprised of two basic data structures: objects and arrays. The Platfora JSON parser supports the selection of a top-level JSON object to signify a record or row, and the selection of name:value pairs within an object to signify columns or fields (including nested objects and arrays).

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is a text format comprised of two basic data structures: elements and attributes. The Platfora XML parser supports the selection of a top-level XML element to signify a record or row, and the selection of attribute:value or element:value pairs within a parent element to signify columns or fields (including nested elements).

Avro: Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary use is to provide a data serialization format for persistent data stored in Hadoop. It uses JSON for defining schemas, and JSON or binary format for data encoding. When Avro data is stored in a persistent file (called a container file), its schema is stored along with it. This allows any program to be able to read the serialized data in the file.

Hadoop Sequence Files: Sequence files are a file format generated by Hadoop MapReduce tasks, and are a common format for storing data in Hadoop. It is a flat file format containing binary records. Platfora can import records contained within a sequence file as long as the format of the records is delimited text, CSV, JSON, XML, or Avro.
Web Access Logs: A web access log contains records about incoming requests made to a web server. Platfora has a built-in parser that automatically recognizes web access logs that adhere to the NCSA common or combined log formats used by many popular web servers (such as Apache HTTP Server).

Other File Types: For semi-structured file formats, you can still define parsing logic using regular expressions or Platfora's built-in expression language. Platfora provides a Regex or Line parser to allow you to define your own parsing logic to extract data columns from the records in your source files (as long as your source files have one record per line).

Custom Data Sources: For source data coming in from a custom data connector, the logic of the data connector dictates the format of the data. For example, if using a JDBC data connector to access data in a relational database, the data is returned in delimited format.

For Platfora to read a compressed source file, both your Platfora and your Hadoop configuration must support the compression format. By default, Platfora supports the following formats:

Deflate (zlib), Gzip, Bzip: Platfora and Hadoop support these formats out-of-the-box.
Snappy: Platfora includes support for Snappy in its distribution. Hadoop does not. Your administrator must configure Hadoop to support Snappy. Refer to your Hadoop distribution documentation for information on configuring Snappy.
LZO (Hadoop-LZO), LZ4: Due to licensing restrictions, Platfora does not bundle support for these with the product. Your administrator must configure these compression formats both in Platfora and Hadoop. Although neither compression format is explicitly qualified with each new release, Platfora will fix issues and release patches if a problem is discovered.

Select a Hive Source Table

For Hive data sources, Platfora points to a Hive metastore server. From that point, you can browse the available databases and tables, and select a single Hive table on which to base your dataset. Since Hive tables are already in tabular format, the parsing step is skipped for Hive data sources. Platfora does not execute any queries through Hive; it only uses the table definition to obtain the metadata needed to define the dataset.

1. On the Select Data step of the dataset workspace, select a Hive data source from the Source List.
2. Select the Hive database that contains the table you want to use. The default Hive database is named default.
3. Select a single Hive table. Only tables can be used to define datasets, not views. Platfora will use the Hive table definition to determine the source files, columns, data types, partitioning, and so on.
4. Click Continue. Platfora skips the Parse Data step for Hive data sources and goes directly to the Manage Fields step.

Select DFS Source Files

For distributed file system data sources, such as HDFS and S3, a data source points to a particular directory in the file system. From that point, you can browse and select the source files to include in your dataset. You can enter a wildcard pattern to select multiple files, including files from multiple directory locations; however, all of the files selected must be of the same file format.

1. On the Select Data step of the dataset workspace, select an HDFS or S3 data source from the Source List.
2. Browse the file system to choose a directory or file you want to use as the basis for your dataset.
3. To select multiple files within the selected directory, use a wildcard pattern in the Source Location path, where ? represents a single character and * represents any number of characters. For example, suppose you wanted to base a dataset on log files that are partitioned into monthly directories. To select all log files for 2014, you could use a wildcard path such as:
hdfs://myhdfs.mycompany.com/data/*2014/*.log
4. In the Selected Source Files list, confirm that the files you want are selected. If a large number of source files are selected, Platfora will only display the first 200 file names.
5. Click Continue.
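Before saving the selection, it can help to confirm from a terminal which files a wildcard actually matches. The hadoop fs -ls command accepts the same style of glob pattern; a minimal check using the example path above (quote the pattern so your local shell does not expand it):

$ hadoop fs -ls 'hdfs://myhdfs.mycompany.com/data/*2014/*.log'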
Edit the Dataset Source Location

Once the dataset has been saved, the Select Data step of the dataset workspace becomes disabled. You can edit the dataset to point to a different source location as long as it is in the same data source. You cannot switch data sources for a dataset after it has been saved. For example, you cannot change a dataset that points to the Uploads data source to use another HDFS data source instead.

1. Open the dataset workspace, and click Source Data in the dataset header.
2. Edit the Source Location to point to the new directory path or file name within the same data source.
3. Click Update.
4. Click Save.

Parse the Data

The Parse Data step of the dataset workspace is where you specify the parsing options for a dataset. This section describes how to use Platfora's built-in file parsers to describe your source data in tabular format (rows and columns). The built-in parsers assume that each record has a similar data structure.

View Raw Source Data Rows

On the Parse Data step of the dataset workspace, Platfora shows a sample of raw lines or records from a source data file. This allows you to compare the data in its original format (the Raw data) to the data with the parsing logic applied (the Wrangled data). Viewing the raw data is helpful in determining the parsing logic, and when writing computed field expressions that do transformations on base fields.

For delimited data, Platfora shows a sampling of 20 lines taken from one source file. For structured file formats, such as JSON and XML, Platfora shows a sampling of the first 20 top-level objects taken from one source file. If your data is one record per file, only one file is shown (one sample record).

The Raw tab shows a sample of records from one source data file. The Wrangled tab shows the data values after the parsing logic has been applied.

1. Open the dataset and go to the Parse Data step of the dataset workspace.
2. Select the Raw tab.
3. To see where the sample records are coming from, click Source Data.
4. To make sure you are seeing the latest source data, click the refresh button. The sample data rows are cached, and this ensures that the cache is refreshed from the source.

Update the Dataset Sample Rows

Platfora displays a sample of dataset rows to facilitate the data ingest process. The sample consists of 20 records taken from the first file in the source location. If a dataset is comprised of multiple source files, you can change which file the sample rows are taken from. You can also refresh the sample rows to read from the latest source data.
You can only change the sample file for an existing dataset. When the dataset is first created, Platfora takes a sample of rows and stores it in the dataset cache. You may want to take the sampling from a different source file, or refresh the data if the original source data has changed.

1. Open the dataset and go to the Parse Data step of the dataset workspace.
2. Click Source Data in the dataset header.
3. Choose another file from the Display sample data using drop-down. This option is only available for source locations that point to multiple files.
4. Click Update.
5. (Optional) Click the refresh button to resample rows from the original source data file. Refreshing the sample rows is particularly useful when you replace a file in the Uploads data source. The cached sample rows are not updated automatically when the source data changes.

Update the Dataset Source Schema

Over time, a dataset's source schema may evolve and change. You may need to periodically re-parse the source data to pick up schema changes, such as when new columns are added in the source data. Updating the dataset source schema in this way only applies to Hive and Delimited source data types.

Update Schema for Hive Datasets

Datasets based on Hive tables have the Parse Data step disabled in the Platfora dataset workspace. This is because the Hive table definition is used to determine the dataset columns and their respective column order, column names, and data type information. If the source data schema changes for a Hive-based data source, you would first update the table definition in Hive. Then, in Platfora, you can refresh the dataset schema to get the latest dataset columns from the Hive table definition.

1. Update the table in the Hive source system.
2. Edit the dataset in Platfora.
3. Click Source Data at the top of the dataset workspace.
4. Click Refresh Hive.
5. Click Update. Platfora re-reads the table definition from Hive and displays the updated column order, names, and data types.
6. Save your changes.

Update Schema for Delimited Datasets

For datasets based on delimited text or comma-separated value (CSV) files, the only schema change that is supported is appending new columns to the end of a row. If new columns are added in the source data files, you can refresh the schema to pick up the new columns. Changing the column order (adding new columns in the middle of the row) is not supported for delimited datasets.

For delimited datasets that have a header row, the base column names in the Platfora dataset definition must match the header column names in the source data file in order to use this feature. Older source data rows that do not have the new appended columns will simply have NULL (empty) values for those columns.

1. Edit the dataset in Platfora.
2. Click Source Data at the top of the dataset workspace.
3. Choose a source data file that has the most recent schema containing the new columns.
4. Click Refresh Schema.
5. Click Update. Platfora re-reads the schema from the sample file and displays the new base columns (as long as the new columns are appended at the end of the rows).
6. Save your changes.
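As a small illustration (hypothetical file contents), suppose the dataset was originally created from a file like this:

name,email
jo,jo@example.com

and a newer source file appends a signup_date column at the end:

name,email,signup_date
kim,kim@example.com,2014-06-01

Choosing the newer file as the sample file and clicking Refresh Schema adds signup_date as a new base column; rows from the older file get NULL for it. The header names name and email must match the existing base column names for the refresh to work.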
Parse Delimited Data

To use Platfora's delimited file parser, your data must be in plain text file format, have one record per line, and have the same sequence of fields for every record, separated by a common delimiter (such as a comma or tab).

Delimited records (or rows) are separated by line breaks, and fields (or columns) within a line are separated by a special character called the delimiter (usually a comma or tab character). If the delimiter also appears in the field values, it must be escaped. The Platfora delimited parser supports single-character escapes (such as a backslash), as well as enclosing field values in double quotes (as is common with CSV files).

On the Parse Data step of the dataset workspace, the Parsing Controls for the Delimited parser are as follows:

File Type: Choose the Delimited parser for delimited text and CSV files. The Wrangled view shows the data with the parsing logic applied.

Row Delimiter: Specifies a single character used to separate rows (or records) in your source data files. In most delimited files, rows are separated by a new line, such as the line feed character, the carriage return character, or carriage return plus line feed. Line feed is the standard new line representation on UNIX-like operating systems. Other operating systems (such as Windows) may use carriage return individually, or carriage return plus line feed. Selecting Any New Line will recognize any of these representations of a new line as the row delimiter.

Ignore Top Rows: Specifies the number of lines at the beginning of the file to ignore when reading the source file during data ingest and lens builds. Enter the number of lines to ignore and click Update. To use this with the Raw File Contains Header option, ensure that the line containing the column names is visible and is the first remaining line.

Column Delimiter: Specifies the single character used to separate the columns (or fields) of a row in your source data files. Comma and tab are the most commonly used column delimiters.

Escape Character: Specifies the single character used to escape delimiter characters that occur within your data values. If your data values contain delimiter characters, those characters must be escaped; otherwise the parser will assume the special character denotes a new row or column. For comma-separated value (CSV) files, it is common practice to escape delimiters by enclosing the entire field value within double quotes. If your source data uses this convention, then you should specify a Quote Character instead of an Escape Character.

Quote Character: The quote character is used to enclose individual data values in CSV-formatted files. The quote character is usually the double quote character ("). If a data value contains a delimiter, then enclosing the value in double quotes treats every character within the quotes as data, including the delimiters. If the data also contains the quote character, the quote character can also be used to escape itself.
For example, suppose you have a row with these three data values:

weekly special
wine, beer, and soda
"2 for 1" or 9.99 each

If the column delimiter is a comma, and the quote character is a double quote, a correctly formatted row in the source data would look like this:

"weekly special","wine, beer, and soda","""2 for 1"" or 9.99 each"

Raw File Contains Header: A header is a special row containing column names at the beginning of a data source file. If your source data files have a header row as the first line in the file, select this check-box. This will treat the first line in each source file as a header row instead of as a row of data.

Upload Field Names: Allows you to upload a comma- or tab-delimited text file containing the field information you want to set. When a dataset has a lot of fields to manage, it may be easier to update several field names, descriptions, data types, and visibility settings all at once rather than editing each field one-by-one. For more information, see Bulk Upload Field Header Information.

Specify a Single-Character Custom Delimiter

If your delimited data uses a special delimiter character that is not available in the default choices, you can define a custom delimiter as either a single-character string or a decimal-encoded ASCII value.

1. Go to the Parsing Controls panel. Make sure you are using the Delimited parser.
2. Choose Add Custom from the Column Delimiter or Row Delimiter menu.
3. Choose the encoding to use: String or ASCII Decimal Code.
4. Enter the delimiter value. For String, you can enter any single character that you can type on your keyboard. For ASCII Decimal Code, enter the decimal-encoded representation of the ASCII character. For example, 29 is the ASCII code for the group separator, 30 is for the record separator, and 65 is for the letter A.
5. Click OK to add the custom delimiter to the selected parser delimiter menu.

Specify a Multi-Character Column Delimiter

In some cases, your source data may use a multi-character column delimiter. The delimited parser does not support multi-character delimiters, but you can work around this by using the Regex parser instead.

1. Go to the Parse Data step of the dataset workspace.
2. In the Parsing Controls panel, choose the Regex parser.
3. Enter a Regular Expression that matches the structure of your data lines. For example, if your multi-character column delimiter was two colons (::), and your data had 6 fields, then you could use a regular expression such as:
(.*)::(.*)::(.*)::(.*)::(.*)::(.*)
4. Click Continue.

Parse Hive Tables

When creating a dataset from a Hive table, there is no need to define parsing controls in Platfora. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. You can only create a dataset based on Hive tables, not Hive views.

The following example shows how to define a table in Hive based on comma-delimited files that reside in a directory of HDFS. The EXTERNAL keyword lets you provide a LOCATION so that Hive accesses the files at their current location in HDFS. Without the EXTERNAL clause, Hive moves the files into its own area of HDFS. When dropping an EXTERNAL table in Hive, the data in the table is not deleted from the file system.
CREATE EXTERNAL TABLE users(user_id INT, name STRING, gender STRING, birthdate STRING)
COMMENT 'This table stores user data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users/*';

For Platfora to be able to access the Hive table defined above, you would need to make sure the system user that the Platfora server runs as has read access to the /data/users directory of HDFS. See the Apache Hive Wiki for more information about using Hive.

Hive to Platfora Data Type Mapping

When you create a dataset based on a Hive table, Platfora maps the data types of the Hive columns to one of the Platfora internal data types. Platfora has a number of built-in data types that can be used to classify the fields in a dataset. Hive also has a set of primitive and complex data types it supports for Hive table columns.

Platfora does not currently support the Hive BINARY primitive data type. The Hive DECIMAL data type is mapped to DOUBLE by default. This may result in a loss of precision due to round-off errors. You can choose to map DECIMAL columns to FIXED instead. This retains precision for numbers that have four or fewer digits after the decimal point, and loses precision for more precise numbers.

Hive complex data types (MAP, ARRAY, STRUCT, and UNIONTYPE) are imported into Platfora as a single JSON-formatted STRING. You can then use the Platfora expression language to define new computed columns in the dataset that extract a particular key:value pair from the imported JSON structure.

Hive Data Type -> Platfora Data Type
TINYINT -> INTEGER
SMALLINT -> INTEGER
INT -> INTEGER
BIGINT -> LONG
DECIMAL -> DOUBLE
FLOAT -> DOUBLE
DOUBLE -> DOUBLE
STRING -> STRING
MAP -> STRING (JSON-formatted)
ARRAY -> STRING (JSON-formatted)
STRUCT -> STRING (JSON-formatted)
UNIONTYPE -> STRING (JSON-formatted)
TIMESTAMP -> DATETIME (must be in the Hive timestamp format of yyyy-MM-dd HH:mm:ss[:SSS])

Enable Hive SerDes in Platfora

If you are using Hive as a data source, Platfora must be able to parse the underlying source data files that a Hive table definition refers to. For Hive to be able to support custom file formats, you implement Serialization/Deserialization (SerDe) libraries in Hive that describe how to read (or parse) the data. Any custom SerDe libraries that you implement in Hive must also be installed in Platfora.

In order for Platfora to be able to read and process data files referenced by a Hive table, any custom SerDe library (.jar file) that you are using in your Hive table definitions must also be installed in Platfora. To install a Hive SerDe in Platfora, copy the SerDe .jar file to the following location on the Platfora master server (create the extlib directory in the Platfora data directory if it doesn't exist):

$PLATFORA_DATA_DIR/extlib

Restart Platfora after installing all of your Hive SerDe jars:

platfora-services restart

How Platfora Uses Hive Partitions and Buckets

Hive source tables can be partitioned, bucketed, neither, or both. In Platfora, datasets defined from Hive table sources take advantage of the partitioning defined in Hive. However, Platfora does not exploit the clustering or sorting of bucketed tables at this time.

Defining a partitioning field on a Hive table organizes the data into separate files in the source file system. The goal of partitioning is to improve query performance by keeping records together in the way that they are accessed. When a Hive query uses a WHERE clause to filter data on a partitioning field, the filter effectively describes which data files are relevant. If a Platfora lens includes a filter on any of the partitioning columns defined in Hive, Platfora will only read the partitions that match the filter.

A bucketed table is created using the CLUSTER BY field [SORT BY field] INTO n BUCKETS clause of the Hive table definition. Bucketing defines a hash partitioning of data based on values in the table. A bucketed table may also be sorted within each bucket. When the table is bucketed, each partition must be reorganized during the load phase to enforce the clustering and sorting. Platfora does not exploit the clustering or sorting of bucketed tables at this time.

Platfora doesn't support Hive partitions with spaces in the partition name. Use the underscore character (_) instead of white spaces.
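To make the partition behavior concrete, here is a minimal sketch of a partitioned Hive table (the table and column names are illustrative, not from this guide):

CREATE EXTERNAL TABLE weblogs (ip STRING, url STRING)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/weblogs';

Each distinct log_date value is stored as its own sub-directory (for example, /data/weblogs/log_date=2014-06-01), so a Platfora lens filter on log_date lets the lens build read only the matching directories.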
Parse JSON Files

This section explains how to use the Platfora JSON parser to create datasets based on JSON files. JSON is a plain-text file format comprised of two basic data structures: objects and arrays. The Platfora JSON parser allows you to choose a top-level object to signify a record or row, and name:value pairs within an object to signify columns or fields (including nested objects and arrays).

What is JSON?

JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the JavaScript Programming Language. JSON is a plain-text file format comprised of two basic data structures: objects and arrays. A name is just a string identifier, also sometimes called a key. A value can be a string, a number, true, false, null, an object, or an array. An array is an ordered, comma-separated collection of values enclosed in brackets []. Objects and arrays can be nested in a tree-like structure within a JSON record or document.

For example, here is a user record in JSON format:

{
  "userid" : "joro99",
  "firstname" : "Joelle",
  "lastname" : "Rose",
  "email" : "jojorose@gmail.com",
  "phone" : [
    { "type" : "home", "number" : "415 123-4567" },
    { "type" : "mobile", "number" : "650 456-7890" },
    { "type" : "work", "number" : null }
  ]
}

And the same user record in XML format:

<user>
  <userid>joro99</userid>
  <firstname>Joelle</firstname>
  <lastname>Rose</lastname>
  <email>jojorose@gmail.com</email>
  <phone>
    <number type="home">415 123-4567</number>
    <number type="mobile">650 456-7890</number>
    <number type="work"></number>
  </phone>
</user>

Supported JSON File Formats

This section describes how the Platfora JSON parser expects JSON files to be formatted, and how to specify what makes a record or row in a JSON file. There are two general JSON file formats supported by Platfora: JSON Object per line and JSON Object.

The JSON Object per line format supports files containing top-level JSON objects, with one object per line. For example, here is a JSON file where each top-level object represents a user record, with one user object per line:
{"name": "John Smith","email": "john@gmail.com", "phone": [{"type":"mobile","number":"123-456-7890"}]} {"name": "Sally Jones", "email: "sally@yahoo.com", "phone": [{"type":"home","number":"456-789-1007"}]} {"name": "Jeff Hamm","email": "jhamm@hotmail.com", "phone": [{"type":"mobile","number":"789-123-3456"}]} The JSON Object format supports files containing a top-level array of JSON objects: [ ] {"name": "John Smith","email": "john@gmail.com"}, {"name": "Sally Jones", "email: "sally@yahoo.com"}, {"name": "Jeff Hamm","email": "jhamm@hotmail.com"} or one large JSON object with the records to import contained within a sub-array: { "registration-date": "Sept 24, 2014", "users": [ {"name": "John Smith","email": "john@gmail.com"}, {"name": "Sally Jones", "email: "sally@yahoo.com"}, {"name": "Jeff Hamm","email": "jhamm@hotmail.com"} ] } In some cases, the structure of your JSON file might be more complicated. You must always specify one level from the JSON object tree to use as the basis for rows. You can, however, still extract columns from a top-level object as well. As an example, suppose you had the following JSON file containing movie review records. You want a row to be created for each reviews record, but still want to retain the value of movie_title and year for each row: [{"movie_title":"Friday the 13th", "year":1980, "reviews":[{"user":"Pam","stars":3,"comment":"a bit predictable"}, {"user":"Alice","stars":4,"comment":"classic slasher flick"}]}, {"movie_title":"The Exorcist", "year":1984, "reviews":[{"user":"Jo","stars":5,"comment":"best horror movie ever"}, {"user":"Bryn","stars":4,"comment":"I will never eat pea soup again"}, {"user":"Sam","stars":4,"comment":"loved it"}]}, {"movie_title":"Rosemary's Baby", "year":1969, "reviews":[{"user":"Fred","stars":4,"comment":"Mia Farrow is great"}, {"user":"Lou","stars":5,"comment":"the one that started it all"}]} ] Page 63 Data Ingest Guide - Define Datasets to Describe Data Using the JSON Object parser, you would choose the reviews array as the record filter. You could then add the movie_title and year columns by their path as follows: $. movie_title $. year The $. notation starts the path from the base of the object tree hierarchy. Use the JSON Parser The Platfora JSON parser takes a sample of the source data to determine the format of your JSON files, and then shows the object hierarchy so you can choose the rows and columns to include in the dataset. 1. When you select data that is in valid JSON format, Platfora recognizes the file format and chooses a JSON parser. 2. The basis of a record or row depends on the format of your JSON files. You can either choose to use each line in the file as a record (JSON Object per line), or choose a sub-array in the file to use as the basis of a record (JSON Object). 3. If your object hierarchy is nested, you can add a Filter to a specific object in the hierarchy. This allows you to use objects nested within a sub-array as the basis for rows. 4. Use the Record object tree to select the columns to include in the dataset. You can browse up to 20 JSON records when choosing columns. 5. You can add additional columns based on objects above what was used as the row Filter. Use the Data Field Path to add a column by its path in the top-level object hierarchy. The $. notation is used to specify a path from the root of the file. 6. Sometimes you can't delete columns that are added by mistake. 
For example, the parser may incorrectly guess the row filter, or you might make a mistake adding columns using Data Field Path. If this happens, you can always hide these columns on the Manage Fields step.

For the JSON Object per line format, each line in the file represents a row.

For the JSON Object format, the top-level object is used by default to signify rows. If the objects you want to use as rows are contained within a sub-array, you can specify a Filter with the name of the array containing the objects you want to use. For example, in this JSON structure, the Filter value would be users (use the objects in the users array as the basis for rows):

{ "registration_date": "September 24, 2014",
  "users": [
    {"name": "John Smith","email": "john@gmail.com"},
    {"name": "Sally Jones", "email": "sally@yahoo.com"},
    {"name": "Jeff Hamm","email": "jhamm@hotmail.com"}
  ]
}

Or in the example below, you could use the filter users.address to select the contents of the address array as the basis for rows.

{ "registration_date": "September 24, 2014",
  "users": [
    {"name": "John Smith","email": "john@gmail.com",
     "address": [ {"street":"111 Main St.", "city":"Madison", "state":"IL", "zip":"35460"} ]},
    {"name": "Sally Jones", "email": "sally@yahoo.com",
     "address": [ {"street":"32 Elm St.", "city":"Dallas", "state":"TX", "zip":"23456"} ]},
    {"name": "Jeff Hamm","email": "jhamm@hotmail.com",
     "address": [ {"street":"101 2nd St.", "city":"San Mateo", "state":"CA", "zip":"94403"} ]}
  ]
}

Once the parser knows the root object to use as rows, the JSON object tree is displayed in the Parsing Controls panel. You can add fields contained within nested objects and arrays by selecting the field name in the JSON tree. The field is then added as a dataset column. You can browse through a sample of 20 records to check for fields to add.

If you unselect a field containing a nested object or array (remove it from the dataset), and later decide to select it again (add it back to the dataset), make sure that Format as JSON string is selected. This will format the contents of the field as a JSON string rather than as a regular string. This is important if you plan to do additional processing on the values using the JSON string functions.

In some cases, you may want to extract columns from an object one or more levels above the record filter in the JSON structure. For example, in the JSON structure above, the Filter value would be users (use the objects in the users array as the basis for rows), but you may also want to include the registration_date object as a column. To capture upstream objects as columns, you can add the field by its path in the object tree.

• The $. notation starts the path from the base of the object tree hierarchy.
• To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).
• To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
• If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
• If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
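To make the path notation concrete (an illustrative sketch against the sample user record shown at the start of this section), the path $.phone.0.number would return 415 123-4567 (the number field of the first entry in the phone array), and $.phone.1.type would return mobile. These paths only apply to that sample record; substitute the field names from your own object tree.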
Parse XML Files

This section explains how to use the Platfora XML parser to create datasets based on XML files. XML is a plain-text file format that encodes data using different components, such as elements and attributes, in a document hierarchy. The Platfora XML parser allows you to choose a top-level element to signify the starting point of a record or row, and attributes or elements to signify columns or fields.

What is XML?

Extensible Markup Language (XML) is a markup language for encoding documents. XML is a textual file format that can contain different components, including elements and attributes, in a document hierarchy.

A valid XML document starts with a declaration that states the XML version and document encoding. For example:

<?xml version="1.0" encoding="UTF-8"?>

An element is a logical component in the document. Elements always begin with an opening tag and end with a matching closing tag. Element content can contain text, markup, attributes, or other nested elements, called child elements. For example, here is a parent users element that contains individual child elements for each user:

<users>
  <user name="John Smith" email="john@gmail.com"/>
  <user name="Sally Jones" email="sally@yahoo.com"/>
  <user name="Jeff Hamm" email="jhamm@hotmail.com"/>
</users>

Elements can be empty. For example, this image element has no content:

<image href="mypicture.jpg"/>

Elements can also have attributes in their opening tag. Attributes are name=value pairs that contain useful data about the element. For example, here is how you might list attributes of an element called address:

<address street="45 Pine St." city="Atlanta" state="GA" zip="53291"/>

Elements can also have both attributes and content. For example, this address element has the actual address components as attributes, and the address type as its content:

<address street="45 Pine St." city="Atlanta" state="GA" zip="53291">home address</address>

For details on the XML standard, go to http://www.w3.org/XML/.

Supported XML File Formats

This section describes how the Platfora XML parser expects XML files to be formatted, and how to specify what makes a record or row in an XML file. There are two general XML file formats supported by Platfora: XML Element per line and XML Document.

The XML Element per line format supports files containing one XML record per line, each record having the same top-level element and structure. For example, here is an XML file where each top-level element represents a user record, with one record per line.

<user name="John Smith" email="john@gmail.com"><phone type="mobile" number="123-456-7890"/></user>
<user name="Sally Jones" email="sally@yahoo.com"><phone type="home" number="456-789-1007"/></user>
<user name="Jeff Hamm" email="jhamm@hotmail.com"><phone type="mobile" number="789-123-3456"/></user>

The XML Document format supports valid XML document files (one document per file).
In the following example, the top-level XML element contains nested XML element records:

<?xml version="1.0" encoding="UTF-8"?>
<registration date="Aug 21, 2012">
  <users>
    <user name="John Smith" email="john@gmail.com"/>
    <user name="Sally Jones" email="sally@yahoo.com"/>
    <user name="Jeff Hamm" email="jhamm@hotmail.com"/>
  </users>
</registration>

In the following example, the top-level XML element contains a sub-tree of nested XML element records:

<?xml version="1.0" encoding="UTF-8"?>
<registration date="Sept 24, 2014">
  <region name="us-east">
    <user name="Georgia" age="42" gender="F">
      <address street="111 Main St." city="Madison" state="IL" zip="35460"/>
      <statusupdate type="registered"/>
    </user>
    <user name="Bobby" age="30" gender="M">
      <address street="45 Pine St." city="Atlanta" state="GA" zip="53291"/>
      <statusupdate type="unsubscribed"/>
    </user>
  </region>
</registration>

Use the XML Parser

The Platfora XML parser takes a sample of the source data to determine the format of your XML files, and then shows the element and attribute hierarchy so you can choose the rows and columns to include in the dataset.

1. When you select data that is in valid XML format, Platfora recognizes the file format and chooses an XML parser.
2. The basis of a record or row depends on the format of your XML files. You can either choose to use each line in the file as a record (XML Element per line), or choose a child element in the XML document to use as the basis of a record (XML Document).
3. (XML Document formats only) You can add a Filter that determines which rows to include in the dataset. For more details, see Parsing Rows from XML Documents.
4. Use the Record element tree to select the elements and attributes to include in the dataset as columns. You can browse up to 20 sample records when choosing columns. For more details, see Extracting Columns from XML Using the Element Tree.
5. The Data Field Path field allows you to add a column represented by its path in the element hierarchy. You can use any XPath 1.0 expression that is relative to the result of the row Filter. For more details, see Extracting Columns from XML Using an XPath Expression.
6. Sometimes you can't delete columns that are added by mistake. For example, the parser may incorrectly guess the row filter, or you might make a mistake adding columns using Data Field Path. If this happens, you can always hide these columns on the Manage Fields step.

For the XML Element per line format, each line in the file represents a row.

For the XML Document format, by default, the top-level element below the root of the XML document is used as the basis of rows. If you want to use different elements as the basis for rows, you can enter a Filter to specify the element name you want to use as the basis of rows. The Platfora XML parser supports an XPath-like notation for specifying which XML element to use as rows.
As an example of how to use the Platfora XML parser filter notation, suppose you had the following XML document containing movie review records:

<?xml version="1.0" encoding="UTF-8"?>
<records>
  <movie title="Friday the 13th" year="1980">
    <reviews>
      <review user="Pam" stars="3">a bit predictable</review>
      <review user="Alice" stars="4">classic slasher flick</review>
    </reviews>
  </movie>
  <movie title="The Exorcist" year="1984">
    <reviews>
      <review user="Jo" stars="5">best horror movie ever</review>
      <review user="Bryn" stars="4">I will never eat pea soup again</review>
      <review user="Sam" stars="4">loved it</review>
    </reviews>
  </movie>
</records>

The document hierarchy is assumed to start one level below the root element. The root element would be the records element in this example. From this point in the document, you can use the following XPath-like notation to specify row filters:

Row Filter Notation: //
Description: Specifies all elements with the given name located within the previous element, no matter where they exist within the previous element. When used at the beginning of the row filter, this specifies all elements in the document with the given name.
Example: Use any review element as the basis for rows: //review

Row Filter Notation: /
Description: Specifies an element with the given name one level down in the document hierarchy within the element listed before it. When used as the first character in the row filter, it specifies one level below the root element of the document.
Example: Use the review element as the basis for rows: /movie/reviews/review

Row Filter Notation: $
Description: Specifies an element in the row filter as an extraction point. An extraction point is an element in an XML row filter that defines a variable you can use to write a column definition expression relative to that element in the filter. The last element in a row filter is always considered an extraction point, so it is unnecessary to use the $ notation for the last element. You can specify zero or more extraction points in a row filter. Extraction points give you more flexibility when extracting columns. Use an extraction point element at the beginning of a column definition to signify an expression relative to the extraction point element. You might want to use an extraction point to extract a column or attribute from a parent element one or more levels above the last element defined in the row filter. For example, for the row filter /a/$b/c/d you could write a column definition of the form $b/xpath_expression.
Example: Use the review element as the basis for rows while allowing the ability to extract reviews data for that row as column data: /movie/$reviews/review

Use caution when adding an extraction point to the row filter. Platfora buffers all XML source data in an extraction point element during data ingest and when it builds a lens in order to extract column data. Depending on the source data, this may impact performance during data ingest and may increase lens build times.
Note that for the XML structure above, the following row filter expressions are equivalent:

• movie
• /movie
• $/records/movie
• $//movie

For example, in this XML structure, the Filter value would be $users (use the collection of child elements contained in the users element as the basis for rows):

<?xml version="1.0" encoding="UTF-8"?>
<registration date="Sept 24, 2014">
  <users>
    <user name="John Smith" email="john@gmail.com"/>
    <user name="Sally Jones" email="sally@yahoo.com"/>
    <user name="Jeff Hamm" email="jhamm@hotmail.com"/>
  </users>
</registration>

Once the parser knows the element to use as rows, the XML element tree is displayed in the Record panel. You can add fields based on XML attributes or nested XML elements by selecting the element or attribute name in the XML element tree. The field is then added as a dataset column. You can browse through a sample of 20 records in a single file to check for fields to add.

If you unselect a field containing nested XML elements (remove it from the dataset), and later decide to select it again (add it back to the dataset), make sure that Format as XML string is selected. This will format the contents of the field as XML rather than a regular string. This is important if you plan to do additional processing on the values using the XPATH string functions. For more details, see Parsing of Nested Elements and Content.

Another way to add columns is to enter an XPath expression in Data Field Path that represents a path in the element hierarchy. You might want to do this to extract columns from a parent element one or more levels above the row filter in the XML document hierarchy. Note the following rules and guidelines when using an XPath expression to extract columns:

• The Platfora XML parser only supports XPath 1.0.
• The expression must be relative to the last element or any extraction point element in the row Filter.
• Platfora recommends starting the expression with a variable using the $element/ syntax. The element must be the last element or an extraction point element in the row Filter.
• XML namespaces are not supported. The XML parser strips all XML namespaces from the XML file.
• Variables are only allowed at the beginning of the expression.

For example, assume you have the following row filter:

/movie/$reviews/review

You could create a column definition expression for any element or attribute in the document hierarchy that comes after the review element. Additionally, because the row filter includes an extraction point for $reviews, you could also create a column definition relative to that node: $reviews/xpath_expression.

For more information about XPath, see http://www.w3.org/TR/xpath/.
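To make that concrete (hypothetical column definitions against the movie review document shown earlier), with the row filter /movie/$reviews/review, an expression such as $review/@user would extract the reviewer name from each review row, while $reviews/../@title would walk up from the reviews extraction point to the parent movie element and extract its title attribute. These are illustrative sketches; any XPath 1.0 expression relative to the row filter's last element or an extraction point follows the same pattern.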
If the element you are parsing contains nested XML elements and content, and you want to preserve the XML structure and hierarchy, select Format as XML string. This will allow you to do further processing on this data with the XPATH_STRING, XPATH_STRINGS, and XPATH_XML functions. If the column contains nested elements and Format as XML string is not enabled, Platfora returns NULL. Repeated elements are wrapped inside a <list> ... </list> parent element to maintain valid XML structure.

Parse Avro Files

The Platfora Avro parser supports Avro container files where the top-level object is an Avro record data type. The file must have a JSON-formatted schema declared at the beginning of the file, and the serialized data must be in the Avro binary-encoded format.

1. On the Parse Data step of the dataset workspace, select Avro as the File Type in the Parsing Controls panel.
2. The Avro parser uses the JSON schema of the source file to extract the name:value pairs from each record object in the Avro file.

What is Avro?

Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary use is to provide a data serialization format for persistent data stored in Hadoop. Avro uses JSON for defining schemas, and JSON or binary format for data encoding. When Avro data is stored in a persistent file (called a container file), its schema is stored along with it. This allows any program to read the serialized data in the file. For more information about the Avro schema and encoding formats, see the Apache Avro Specification documentation.

Avro to Platfora Data Type Mapping

Avro has a set of primitive and complex data types it supports. These are mapped to Platfora's internal data types. Complex data types are imported into Platfora as a single JSON-formatted STRING. You can then use the JSON string functions in the Platfora expression language to define new computed columns in the dataset that extract a particular name:value pair from the imported JSON structure.

Avro Data Type | Platfora Data Type
BOOLEAN | INTEGER
INT | INTEGER
LONG | LONG
FLOAT | DOUBLE
DOUBLE | DOUBLE
STRING | STRING
BYTES | STRING (hex-encoded)
RECORD | STRING (JSON-formatted)
ENUM | STRING (JSON-formatted)
ARRAY | STRING (JSON-formatted)
MAP | STRING (JSON-formatted)
UNION | STRING (JSON-formatted)
FIXED | FIXED
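For example (an illustrative sketch; the column name and path are hypothetical, and the exact function signatures and path syntax are documented in the Expression Language Reference), suppose an Avro RECORD column named address imports as the JSON string {"street":"111 Main St.","city":"Madison"}. A computed field with an expression along the lines of JSON_STRING(address, "city") could then pull out Madison as a regular STRING value for analysis.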
Parse Web Access Logs

A web access log contains records about incoming requests made to a web server. Platfora has a built-in Web Access Log parser that automatically recognizes web access logs that adhere to the NCSA common or combined log formats.

1. On the Parse Data step of the dataset workspace, select Web Access Log as the File Type in the Parsing Controls panel.
2. The Web Access Log parser extracts fields according to the supported NCSA log formats.

Supported Web Access Log Formats

Platfora supports web access logs that comply with the NCSA common or combined log formats. This is the log format used by many popular web servers (such as Apache HTTP Server). An example log line for the common format looks something like this:

123.1.1.456 - - [16/Aug/2012:15:01:52 -0700] "GET /home/index.html HTTP/1.1" 200 1043

The NCSA common log format contains the following fields for each HTTP access record:

• Host - The IP address or hostname of the HTTP client that made the request.
• Logname - Identifies the client making the HTTP request. If no value is present, a dash (-) is substituted.
• User - The user name used by the client for authentication. If no value is present, a dash (-) is substituted.
• Time - The timestamp of the request in the format dd/MMM/yyyy:hh:mm:ss +-hhmm.
• Request - The HTTP request. The request field contains three pieces of information: the requested resource (/home/index.html), the HTTP method (GET), and the HTTP protocol version (HTTP/1.1).
• Status - The HTTP status code indicating the success or failure of the request.
• Response Size - The number of bytes of data transferred as part of the HTTP request, not including the HTTP header.

The NCSA combined log format contains the same fields as the common log format with the addition of the following optional fields:

• Referrer - The URL that linked the requestor to your site. For example, http://www.platfora.com.
• User-Agent - The web browser and platform used by the requestor. For example, Mozilla/4.05 [en] (WinNT; I).
• Cookie - Cookies are pieces of information that the HTTP server can send back to a client along with the requested resource. A client browser may store this information and send it back to the HTTP server upon making additional resource requests. The HTTP server can establish multiple cookies per HTTP request. Cookie values take the form KEY=VALUE. Multiple cookie key/value pairs are delineated by semicolons (;). For example, USERID=jsmith;IMPID=01234.

For web access logs that do not conform to the default expected ordering of fields and data types, Platfora will make a best guess at parsing the rows and columns found in the web log files, and use generic column headers (for example, column1, column2, and so on). You can then rename the columns to match your web log format.

Parse Other File Types

For other file types that cannot be parsed using the built-in parsing controls, Platfora provides two generic parsers: Regex and Line. As long as your source data has one record per line, you can use one of these generic parsers to extract columns from semi-structured source data.

Parse Raw Lines with a Regular Expression

The Regex parser allows you to search lines in the source data and extract columns using a regular expression. It evaluates each line in the source data against a regular expression to determine if there is a match, and returns each capturing group of the regular expression as a column. Regular expressions are a way to describe a set of strings based on characteristics they share in common.

1. On the Parse Data step of the dataset workspace, select Regex as the File Type in the Parsing Controls panel.
2. Enter a regular expression that matches the entire line, with parentheses around each column matching pattern you want to return.
3. Confirm the regular expression is correct by comparing the raw data to the wrangled data.

Platfora uses capturing groups to determine what parts of the regular expression to return as columns. The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value. For example, suppose you had user records in a file, and the lines were formatted like this (no common delimiter is used between fields):

Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Comment: Suspended

You could use the following regular expression to extract the Full Name, Last Name Only, Address, Age, and Comment column values:

Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?

Parse Raw Lines with Platfora Expressions

The Line parser simply returns each line in the source file as one column value, essentially not parsing the source data at all. This allows you to bypass the parsing step and instead define a series of computed fields to extract the desired column values out of each line.

1. On the Parse Data step of the dataset workspace, select Line as the File Type in the Parsing Controls panel. This creates a single column where each row contains an entire record.
2. Go to the Manage Fields step.
3. Define computed fields that extract columns from the raw line (see the sketch below).
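As a sketch of step 3 (the column name and expression are hypothetical; see the Expression Language Reference for the exact behavior of the regular expression functions), if the Line parser produced a single column named raw_line holding records like the user lines shown above, you could define a computed field Name with an expression such as:

REGEX(raw_line, "Name:\s+(.*?)\s+Address:.*")

The capturing group returns the name portion of each line; lines that do not match the pattern would yield NULL.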
Prepare Base Dataset Fields

When you first add a dataset, it only has its Base fields. These are the fields parsed directly from the raw source data. This section describes the tasks involved in making sure the base data is correct and ready for Platfora's analyst users. In some cases, the data values contained in the base fields may be ready for consumption. Most likely, however, the raw data values will need some additional processing. It is best practice to confirm and edit all of the base fields in a dataset before you begin defining computed field expressions to do any additional processing on the dataset.

Confirm Data Types

The dataset parser will guess the data type of a field based on the sampled source data, but you may need to change the data type depending on the additional processing you plan to do. The expression language functions require input values to be of a certain data type. It is best practice to confirm and change the data types of your base fields before defining computed fields. Changing them later may introduce errors to your computed field expressions. Note that you can only change the data type of a Base field. Computed field data types are determined by the return type of the computed expression.

1. On the Manage Fields step of the dataset workspace, verify the data types that Platfora has assigned to the base fields.
2. Select a column and change the data type in the column header or the Field Info panel. Note that you cannot accurately convert the data type of a field to DATETIME from the drop-down data type menus. See Cast DATETIME Data Types.

About Platfora Data Types

Each dataset field, whether a base or a computed field, has a data type attribute. The data type defines what kind of values the field can hold. Platfora has a number of built-in data types you can assign to dataset fields. The dataset parser attempts to guess a field's data type by sampling the data. A base field's data type restricts the expressions you can apply to that field. For example, you can only calculate a sum with numeric fields. For computed fields, the expression's result determines the field's data type. You may want to change a base field's data type to accommodate the computed field processing you plan to do. For example, many value manipulation functions require input values to be strings.

Platfora supports the following data types:

Table 1: Platfora Data Types

Type | Description | Range of Values
STRING | variable-length non-Unicode string data | maximum string length of 2,147,483,647
DATETIME | date combined with a time of day, with fractional seconds, based on a 24-hour clock | date range: January 1, 1753, through December 31, 9999; time range: 00:00:00 through 23:59:59.997
FIXED | fixed decimal values | -922,337,203,685,477.5808 through +922,337,203,685,477.5807, with accuracy to a ten-thousandth of a numeric unit
INTEGER | 32-bit integer (whole number) | -2,147,483,648 to 2,147,483,647
LONG | 64-bit long integer (whole number) | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
DOUBLE | double-precision 64-bit floating point number | 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative)

Change Field Names

The field name defined in the dataset is what users see when they browse the Platfora data catalog. In some cases, the field names imported from the source data may be fine. In other cases, you may want to change the field names to something more understandable for your users. It is important to decide on base field names before you begin defining computed fields and references (joins), as changing a field name later on will break computed field expressions and references that rely on that field name.

1. On the Manage Fields step of the dataset workspace, select the column you want to rename.
2. Enter the new name in the column header or the Field Info panel.

If a name change breaks other computed field expressions or reference links in the dataset, the error panel will show all of the affected computed fields and references. You can either change the dependent field back to the original name, or edit the affected fields to use the new name.

Add Field Descriptions

Field descriptions are displayed in the data catalog view of a dataset or lens, and can help users decide if a field is relevant for their needs. Data administrators should add helpful field descriptions that explain the meaning and data value characteristics of a field.

1. On the Manage Fields step of the dataset workspace, select the column you want to update.
2. In the Field Info panel, click inside the Description text box and enter a description.

Hide Columns from Data Catalog View

Hiding a column or field in a dataset definition removes it from the data catalog view of the dataset. Users cannot see hidden columns when browsing datasets in the data catalog, or select them when they build a lens.

1. On the Manage Fields step of the dataset workspace, select the column you want to hide.
2. Check Hide Column in the column header or in the Field Info panel.

Why Hide Dataset Columns?

A data administrator can control what fields of a dataset are visible to Platfora users. Hidden fields are not visible in the data catalog view of the dataset and cannot be selected for a lens. You might choose to hide a field for the following reasons:

• Protect Sensitive Data. In some cases, you may want to hide fields to protect sensitive information. In Platfora, you can hide detail fields, but still allow access to summary information. For example, in a dataset containing employee salary information, you may want to hide sensitive identifying information such as names, job titles, and individual salaries, but still allow analysts to view average salary by department or job level. In database applications, this is often referred to as column-level security or column access control.
• Hide Unpopulated or Sparse Data Columns. You may have columns in your raw data that did not have any data collected, or the data collected is too sparse to be valid for analysis. For example, a web application may have a placeholder column for comments, but it was never implemented on the website so the comments column is empty.
Hiding the column prevents analysts from choosing a field with mostly null values when they go to build a lens.
• Control Lens Size. High cardinality dimension fields can significantly increase the size of a lens. Hiding such fields prevents analysts from creating large lenses unintentionally. For example, you may have a User ID field with millions of unique values. If you do not want analysts to be able to create a lens at that level of granularity, you can hide User ID, but still keep other dimension fields about users available, such as age or gender.
• Use Computed Values Instead of Base Values. You may add a computed field to transform the values of the raw data. You want your users to choose the transformed values, not the raw values. For example, you may have a return reason code column where the reason codes are numbers (1, 2, 3, and so on). You want to transform the numbers to the actual reason information (Did not Fit, Changed Mind, Poor Quality, and so on) so the data is more usable during analysis.
• Hide Computed Fields that do Interim Processing. As you work on your dataset to cleanse and transform the data, you may need to add interim computed fields to achieve a final result. These are fields that are necessary to do a processing step, but are not intended for final consumption. These working fields can be hidden so they do not clutter the data catalog view of the dataset.

Default Values and NULL Processing

If a field or column value in a dataset is empty, it is considered a NULL value. During lens processing, Platfora replaces all NULL values with a default value instead. Platfora lenses and vizboards have no concept of NULL values. NULLs are always substituted with the default field values specified in the dataset definition.

How Platfora Processes NULL Values

A value can be NULL for the following reasons:

• The raw data is missing values for a particular field.
• A computed field expression returns an empty or invalid result.
• A record in the focus (or fact) dataset does not have a corresponding record in a referenced (or dimension) dataset. During lens processing, any rows that do not join will use the default values in place of the unjoined dimension fields.

For lenses that include fields from referenced datasets, Platfora performs an outer join between the focus dataset and any referenced datasets included in the lens. This means that rows in the fact dataset are compared to related rows in the referenced datasets. Any row that does not have a corresponding row in the referenced dataset is considered an unjoined foreign key. The dimension columns for unjoined foreign keys are treated as NULL and replaced with the default values (a worked example follows below).

A Platfora aggregate lens is analogous to a summary or roll-up table in a data warehouse. During lens processing, the measure values are pre-aggregated and grouped by each dimension field value included in the lens. For dimension fields, NULL values are replaced with the default values before the measure aggregations are calculated. For measure fields, 0 is used in place of NULL to compute the measure value. Average (AVG) calculations exclude NULL values from the row count.
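As a worked example (a sketch with hypothetical datasets and values): suppose a web_visits fact dataset has a row with user_id 123, but the referenced users dimension dataset has no record with that key. The outer join keeps that visit row in the lens, and its users fields fall back to the defaults: a STRING field such as gender shows the string NULL, and an INTEGER field such as age shows 0, per the default values listed in the next section.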
Default Values by Data Type

If you do not specify your own default values in the dataset, the following default values are used in place of any NULL value. The default value depends on the data type of the field or column.

Data Type | Default Value
LONG, INTEGER, DOUBLE, FIXED | 0
STRING | NULL (as a string)
DATETIME | January 1, 1970 12:00:00:000 GMT
LOCATION (latitude,longitude coordinate position) | 0,0

Change the Default Value for a Column

You can specify different default values on a per-column basis. These values will replace any NULL values in that column during lens build processing. Analysts will see the default values instead of NULL (empty) values when they are working with the data in a vizboard. To change the default value for a column:

1. Go to the Manage Fields step of the dataset workspace and select a column.
2. Click the Default Value text box in the Field Info panel to edit it.

Bulk Upload Field Header Information

When a dataset has a lot of fields to manage, it may be easier to update several field names, data types, descriptions, and visibility settings all at once rather than editing each field one-by-one in the Platfora application. To do this, you can upload a comma or tab delimited text file containing the field header information you want to set.

1. Create an update file on your local machine containing the field information you want to update. This file must meet the following specifications:
• It must be a comma-delimited or tab-delimited text file.
• It can contain up to four lines (separated by a new line). Any additional lines in the file will be ignored.
• Field names are specified on the first line of the file.
• Field data types are specified on the second line of the file (DOUBLE, FIXED, INTEGER, LONG, STRING, or DATETIME).
• Field descriptions are specified on the third line of the file.
• Field visibility settings are specified on the fourth line of the file (Hidden or Not Hidden).
• On a line, values must be specified in the column order of the dataset.
2. On the Parse Data step of the dataset workspace, click Upload Field Names. Find and open your update file.
3. After uploading the file, advance to the Manage Fields step to confirm the results.

Example Update Files

Here is an update file that updates the field names, data types, descriptions, and visibility settings for the first four columns of a dataset.

UserID,Name,Email,Address
INTEGER,STRING,STRING,STRING
The unique user ID,The user's name,The email linked to this user's account,The user's mailing address
Hidden,Not Hidden,Not Hidden,Not Hidden

The double-quote character can be used to quote field names or descriptions. This is useful if a field name or description contains the delimiter character (comma or tab). For example:

UserID,Name,Email

,"The user's name, both first and last.",,The user's mailing address
HIDDEN

Notice how lines can be left blank (here, the data type line), and values can be skipped over by leaving them out or by specifying nothing between the two delimiters. Missing values will not be updated in the Platfora dataset definition (the dataset will use the previously set values).

Transform Data with Computed Fields

The way you transform data in Platfora is by adding computed fields to your dataset definition. A dataset computed field contains an expression that describes a single data processing step. Sometimes several steps are needed to achieve the result that you want. The result of a dataset computed field can be used in the expressions of other dataset computed fields, allowing you to define a chain of processing steps.
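To illustrate chaining (a minimal sketch; the field names are hypothetical), you might first define a computed field named initial with the expression SUBSTRING(name, 0, 1) to extract the first letter of a name field, and then define a second computed field that builds on it, such as CONCAT(initial, ". ", surname). The second expression treats the first computed field just like any other field in the dataset.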
FAQs - Dataset Computed Fields

This section answers the most frequently asked questions (FAQs) about creating and editing dataset computed fields.

What kinds of things can I do with dataset computed fields?

Computed fields are useful for deriving meaningful values from base fields (such as calculating someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping similar values together or substituting one value for another), or for computing new data values based on a number of input variables (such as calculating a profit margin value based on revenue and costs). Platfora has an extensive library of built-in functions that you can use to define data processing tasks. These functions are organized by the type of data they operate on, or the kind of processing they do. See the Expression Quick Reference for a list of what's available.

Can I do further processing on the results of a computed field?

Yes. A computed field is treated just like any other field in the dataset. You can refer to it in other computed field expressions or aggregate the results to create a measure. To analyst users, computed fields are just like any other dataset field. Users can include them in a lens and analyze their results in a vizboard. One exception is a computed field that uses an aggregate function in its expression (a measure). You cannot combine row functions and aggregate functions in the same expression. A row function cannot take a measure field as input. Per-row processing on aggregated data is not allowed.

How do I edit a computed field expression in a dataset?

Go to the Manage Fields step of the dataset workspace, and find the computed column you want to edit. With the column selected, click the expression in the Field Info panel. This will open the expression builder.

How do I remove a computed field from a dataset?

Go to the Manage Fields step of the dataset workspace, find the computed column you want to remove, and click the X in the field header. Note that this might cause errors if other computed fields refer to the deleted field. If you need the computed field for an interim processing step, but want to remove it from the selection of fields that the users see, you can hide it. Hiding a field keeps it in the dataset definition and allows it to be referred to by other computed field expressions. However, users cannot see hidden fields in the data catalog, or select them in a lens or vizboard. See Hide Columns from Data Catalog View.

Where can I find examples of useful computed field expressions?

Platfora's expression reference documentation has lots of examples of useful expressions. See Expression Language Reference.

Why isn't my computed field showing any sample values?

Certain types of computed field expressions can only be computed during lens build processing. Because of the complicated processing involved, the dataset workspace can't show sample results for:

• Measures (computed fields containing aggregate functions)
• Event Series Processing (computed fields containing PARTITION expressions)
• Computed field expressions that reference fields in other datasets

Why can't I change the data type of a computed field?

A computed field's data type is set by the output type of its expression. For example, a CONCAT function always outputs a STRING. If you want the output data type to be something else, you can nest the expression inside the appropriate data type conversion function.
For example: TO_INT(CONCAT(field1, field2))

Can analyst users add computed fields if they want?

Analyst users can't add computed fields to a dataset. You must be a data administrator and have the appropriate dataset permissions to edit a dataset. In a vizboard, analyst users can create computed fields to manipulate the data they already have in their lens. With some exceptions, analyst users can add a vizboard computed field that can do almost anything that a dataset computed field can do. However, event series processing (ESP) computed fields and most aggregate functions (measure expressions) cannot be used to create vizboard computed fields.

Add a Dataset Computed Field

You can add a new dataset computed field on the Manage Fields step of the dataset workspace. A computed field has a name, a description, and an expression. The computed field expression describes some processing task you want to perform on other fields in the dataset. Computed fields contain expressions that can take other fields as input. These fields can be base fields or they can be other computed fields. When you save a computed field, it appears as a new column in the dataset definition.

1. Go to the Manage Fields step of the dataset workspace.
2. Choose Computed Field from the dataset workspace Add menu. This opens the Add Field dialog containing the expression builder controls.
3. Enter a name for your field and a description. The description is optional but very useful for others that will use the field later.
4. Choose a function from the Functions list. Use the drop-down to restrict the type of functions you see. Functions are organized by the type of data they operate on, or the type of processing they do.
5. Double-click a function in the Functions list to add it to the Expression area. The Expression panel updates with the function's template. Also, the Fields list refreshes with the fields that can be used as input to the selected function. For example, the CONCAT function only accepts STRING type fields.
6. Double-click a field in the Fields list to add it into the Expression area.
7. Continue adding functions and fields into your expression until it is complete.
8. Make sure your expression is correct. The system checks your syntax as you build the expression. The yellow text box below the Expression area displays any error messages. You can save expressions that contain errors, but will not be able to save the dataset until all expressions evaluate successfully.
9. Click Save to add the new computed field to the dataset. Your new computed field appears as a new column in the dataset.
10. Check the computed column values to make sure the expression logic is working as expected. The dataset workspace can't show sample results for measures, event series processing computed fields, or computed fields that operate on fields of a referenced dataset.

Expressions are an advanced topic. For information on working with the Platfora expression syntax, see Expressions Guide.

Add Binned Fields

A binned field is a special kind of computed field that groups ranges of values together to create new categories. The dataset workspace has tools for quickly creating bins on numeric type fields. Binned fields are a way to reduce the number of values in a high-cardinality column, or to group data in a way that makes it easier to analyze.
For example, you might want to bin the values in an age field into categories such as under 18, 18 to 29, 30 to 39, and 40 and over.

Bin Numeric Values

You can bin numeric values by adding a binned quick field.

1. On the Manage Fields step of the dataset workspace, select a column that is a numeric data type.
2. In the Field Info panel, click Add Bins.
3. Choose a Bin Method and enter your bin intervals.
• Even Intervals groups numeric values into evenly sized bins. The bin name that is returned is determined by rounding the value down to the starting value of its containing bin. For example, if the interval is 10, then a value of 5 would return 0 (it is in the bin 0-9), and a value of 11 would return 10 (it is in the bin 10-19).
• Custom Intervals groups values into user-defined ranges and assigns a text label to each range. For example, suppose you had a Flight Duration field that was in minutes and you wanted to bin the values into one-hour intervals (60 minutes). Each value you enter creates a range between the current value and the previous one. So if you entered a starting value of 60 (one hour), the starting range would be less than one hour. If the last value you entered was 600 (10 hours), the ending range would be over 10 hours. With custom intervals, the values you enter should correspond to the data type. For example, an integer would have values such as 60, 120, etc. A double would have values such as 60.00, 120.00, etc. The output of a custom interval is always a string (text label).
4. Edit the name of the new binned field.
5. Edit the description of the new binned field.
6. Click Add. The binned column is added to the dataset.
7. Verify that the bin values are calculated as expected.

Bin Text Values

If you want to bin text or STRING values, you can define a computed field that groups values together using a CASE expression. For example, here is a CASE expression to bucket values of a name field together by their first letter:

CASE
 WHEN SUBSTRING(name,0,1)=="A" THEN "A"
 WHEN SUBSTRING(name,0,1)=="B" THEN "B"
 WHEN SUBSTRING(name,0,1)=="C" THEN "C"
 WHEN SUBSTRING(name,0,1)=="D" THEN "D"
 WHEN SUBSTRING(name,0,1)=="E" THEN "E"
 WHEN SUBSTRING(name,0,1)=="F" THEN "F"
 WHEN SUBSTRING(name,0,1)=="G" THEN "G"
 WHEN SUBSTRING(name,0,1)=="H" THEN "H"
 WHEN SUBSTRING(name,0,1)=="I" THEN "I"
 WHEN SUBSTRING(name,0,1)=="J" THEN "J"
 WHEN SUBSTRING(name,0,1)=="K" THEN "K"
 WHEN SUBSTRING(name,0,1)=="L" THEN "L"
 WHEN SUBSTRING(name,0,1)=="M" THEN "M"
 WHEN SUBSTRING(name,0,1)=="N" THEN "N"
 WHEN SUBSTRING(name,0,1)=="O" THEN "O"
 WHEN SUBSTRING(name,0,1)=="P" THEN "P"
 WHEN SUBSTRING(name,0,1)=="Q" THEN "Q"
 WHEN SUBSTRING(name,0,1)=="R" THEN "R"
 WHEN SUBSTRING(name,0,1)=="S" THEN "S"
 WHEN SUBSTRING(name,0,1)=="T" THEN "T"
 WHEN SUBSTRING(name,0,1)=="U" THEN "U"
 WHEN SUBSTRING(name,0,1)=="V" THEN "V"
 WHEN SUBSTRING(name,0,1)=="W" THEN "W"
 WHEN SUBSTRING(name,0,1)=="X" THEN "X"
 WHEN SUBSTRING(name,0,1)=="Y" THEN "Y"
 WHEN SUBSTRING(name,0,1)=="Z" THEN "Z"
 ELSE "unknown"
END

Expressions are an advanced topic. For information on working with Platfora expressions and their component parts, see Expressions Guide.

Add Measures for Quantitative Analysis

A measure is a special type of computed field that returns an aggregated value for a group of records.
Measures provide the basis for quantitative analysis when you build a lens or visualization in Platfora. Every dataset, lens, or visualization must have at least one measure. There are a few ways to add measures to a dataset.

FAQs - Dataset Measures

This section describes the basic concept of measures, and why they are needed in a Platfora dataset. Measures are necessary if you plan to build aggregate lenses from a dataset and use the data for quantitative analysis.

What is a measure?

Measures provide the basis for quantitative analysis in a visualization or lens query. A measure is a numeric value representing an aggregation of values from multiple rows. For example, measures contain data such as total dollar amounts, average number of users, count distinct of users, and so on. Measure values always result from a computed field that uses an aggregate function in its expression. Examples of aggregate functions include COUNT, DISTINCT, AVG, SUM, MIN, MAX, VARIANCE, and so on.

Why do I need to add measures to a dataset?

In some data analysis tools, measures (or metrics, as they are sometimes called) can be aggregated at the time of analysis because the amount of data to aggregate is relatively small. In Platfora, however, the data in a lens is pre-aggregated to optimize performance of big data queries. Therefore, you must decide how to aggregate the metrics of your dataset up front. You do this by defining measures either in the dataset or at lens build time. When you go to analyze the data in a vizboard, you can only do quantitative analysis on the measures you have available in the lens.

How do I add measures to a dataset?

There are a few ways to add measures to a dataset:

• Add a computed field to the dataset that uses an aggregate function in its expression. Measures computed in this way allow data administrators more control over how the data is aggregated, and what level of detail is available to users. For example, you may want to prevent users from seeing the original values of a salary field, but allow users to see averages or percentiles of salary data. Also, more complex aggregate calculations, such as standard deviation or ranking, can only be done with computed field expressions.
• Choose quick field aggregations on certain columns of the dataset. This way, if a user chooses the field in the lens builder, they will automatically get the measure aggregations you have selected. Users can always override quick field selections if they want.
• Use the default measure. Every dataset has one default measure, which is a simple count of dataset records.

Can analyst users add their own measures if they want?

Analyst users can always choose quick measure aggregations when they go to build a lens, but they can't add computed measures to a dataset. You must be a data administrator and have the appropriate dataset permissions to add computed fields to a dataset. In a vizboard, users can manipulate the measure data they already have in their lens. They can use ROLLUP and window functions to compute measure results over different time frames or categories. Most aggregate calculations must be computed during lens build processing. However, a few aggregate expressions are allowed without having to rebuild the lens: DISTINCT, MIN, and MAX can be used to define new measures in the vizboard.

What does the Original Value quick field do?
If Original Value is selected on a field, then all possible values of the field are included in a lens (if the field is selected in the lens). It also means the field can be used as a dimension (for grouping measure data) in an analysis. If a field only makes sense as a measure, you should deselect Original Value. This will only include the aggregate results in the lens and keep the lens size down.

The Default 'Total Records' Measure

Platfora automatically adds a default measure to every dataset you create. This measure is called Total Records, and it counts the number of records (or rows) in the dataset. You can change the name, description, or visibility of this default measure, but you cannot delete it. When you build a lens from a dataset, this measure is always selected by default.

Add Quick Measures

If you have a field in your dataset that you want to use for quantitative analysis, you can select that field and quickly add measures to the dataset. A quick measure sets the default aggregation(s) to use when a user builds a lens. Quick measures are an easy way to add measures to a dataset without having to define new computed fields or write complicated expressions. Quick measures are added to a field in a dataset, and they set the default measures to create if a user chooses that field for their lens. Users can always decide to override the default measure selections when they define a lens.

1. On the Manage Fields step of the dataset workspace, select the field you want to use as a measure.
2. In the Field Info panel, choose how the values should be aggregated by default. DISTINCT (count of distinct values) is available for all field types. MIN (lowest value) and MAX (highest value) are available for numeric-type or datetime-type fields. SUM (total) and AVG (average) are available for numeric-type fields only. Leaving Original Value selected will also add the field as a dimension (grouping column) if it is selected for a lens. In most cases, fields that are intended to be used as measures (aggregated data only) should not have Original Value selected, as this can cause the lens to be larger than intended.

Add Computed Measures

In addition to quick measures, you can create more sophisticated measures using computed field expressions. A computed field expression containing an aggregate function is considered a measure.

1. Go to the Manage Fields step of the dataset workspace. Review the sample data values before writing your measure expression.
2. Choose Computed Field from the dataset workspace Add menu. This opens the Add Field dialog containing the expression builder controls.
3. Enter a name for your field and a description. The description is optional but very useful for others that will use the field later.
4. Choose Aggregate from the Functions dropdown. The list shows the available Aggregate functions.
5. Double-click a function from the list to add it to the Expression area. The Expression panel updates with the function's template. Also, the Fields list refreshes with those fields you can use with the function. For example, MIN and MAX functions can aggregate numeric or datetime data types.
6. Double-click a field to add it into the Expression area.
7. Continue adding functions and fields into your expression until it is complete. Aggregate functions can only take fields or literal values as input.
8. Make sure your expression is correct. The system checks your syntax as you build the expression. The yellow text box below the Expression area displays any error messages. You can save expressions that contain errors, but will not be able to save the dataset until all expressions evaluate successfully.
9. Click Save to add the new computed measure field to the dataset. Your new field appears in the dataset. At this point, the field has no sample values. This is expected for measure fields. As an aggregate field, it depends on a defined group of input rows to calculate a value.
10. (Optional) Hide the field you used as input to the aggregate function. Hiding the input field is useful when only the aggregated data is useful for future analysis.

Expressions are an advanced topic. For information on working with Platfora's expression syntax, see Expressions Guide.
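For example (a sketch using the salary scenario mentioned earlier; the field and measure names are hypothetical), you could add a computed measure named Average Salary with the expression AVG(salary), and then hide the base salary field. Analysts could then see average salaries by department or job level without having access to individual salary values.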
Prepare Date/Time Data for Analysis

Working with time series data is an important part of data analysis. To prepare time-based data for analysis, you must tell Platfora which fields of your dataset contain DATETIME type data, and how your timestamp fields are formatted. This allows users to analyze data chronologically and see trends in the data over time.

FAQs - Date and Timestamp Processing

This section answers common questions about how Platfora handles date and time data in a dataset. Date and time data must be assigned the DATETIME data type for Platfora to recognize it as a date or timestamp.

In what format does Platfora store timestamp data?

Internally, Platfora stores all DATETIME type data in UTC (Coordinated Universal Time). If your timestamp data does not have a time zone component, Platfora uses the local time zone of the Platfora server. When time-based data is in DATETIME format, it can be ordered chronologically. You can also use the DATETIME processing functions to calculate time intervals between two DATETIME fields. For example, you can calculate the time difference between an order date field and a ship date field.

How does Platfora parse timestamp data?

There are a handful of timestamp formats that Platfora can recognize automatically. On the Parse Data step of the dataset workspace, pay attention to the data type assigned to your timestamp columns. If the data type is DATETIME, then Platfora was able to parse the timestamp correctly. If the data type is STRING, then Platfora was not able to parse the timestamp, and you will have to create a computed field to tell Platfora how your date/time data is formatted. See Cast DATETIME Data Types.

Why are all my dates/times 1970-01-01T00:00:00.000Z (January 1, 1970 at 12:00 AM)?

This is the default value for the DATETIME data type in Platfora. If you see this value in your date or timestamp columns, it could mean:
• Platfora does not recognize the format of your timestamp string, and was not able to parse it correctly.
• Your data values are NULL (empty). Check the raw source data to confirm.
• Your data does not have a time component (or a date component) to it. Platfora has only one data type for dates and times: DATETIME. There is no separate DATE or TIME type. If one of these components is missing in your timestamp data, the defaults are substituted for the missing information. For example, a date value such as 04/30/2014 is converted to 2014-04-30T00:00:00.000Z (the time is set to midnight).

What are the Date and Time datasets for?

Slicing and dicing data by date and time is a very common reporting requirement. Platfora's built-in Date and Time datasets allow users to explore time-based data at different granularities in a vizboard. For example, you can explore date-based data by day, week, or month, or time-based data by hour, minute, or second.

Why does my dataset have all these date and time references when I didn't add them?

Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to explore dates at different granularities.

How do I remove the automatic references to Date and Time?

You cannot remove the automatic references to Date and Time, but you can rename them or hide them.

Cast DATETIME Data Types

If Platfora can recognize the format of a date field, it automatically casts it to the DATETIME data type. Some date formats, however, are not automatically recognized by Platfora and must be converted to DATETIME using a computed field expression.

1. On the Manage Fields step of the dataset workspace, find your base date field. If the data type is STRING and not DATETIME, Platfora could not automatically parse the date format.
2. Choose Computed Field from the dataset Add menu.
3. Enter a name for the new computed field.
4. Write an expression using the TO_DATE function. This function converts values to DATETIME using the date format you specify.
5. Click Save to add the computed field to the dataset.
6. Verify that the new DATETIME field is formatting the date values correctly. Also check that the automatic references to the Date and Time datasets are created.
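As a sketch of step 4, assuming a STRING column named order_date containing values like 04/30/2014, the computed field expression might look like:

  TO_DATE(order_date, "MM/dd/yyyy")

The column name and format pattern here are illustrative; the pattern must match how your own data is actually formatted. See the Expressions Guide for the supported format pattern syntax.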
About Date and Time References

Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to explore dates at different granularities. You cannot remove these auto-generated references, but you can rename or hide them.

1. Go to the Create References step of the dataset workspace. For each DATETIME type field in the dataset, you will see two references: one to Date and one to Time.
2. Click a reference to select it.
3. In the Define References panel, you can edit the reference name or description. This is the reference name as it will appear to users in the data catalog.
4. If you make changes, you must click Update for the changes to take effect.
5. If you don't want a reference to appear in the data catalog at all, you can hide it.

About the Default 'Date' and 'Time' Datasets

Slicing and dicing data by date and time is a very common reporting requirement. Platfora allows you to analyze date- and time-based data at different granularities by automatically linking DATETIME fields to Platfora's built-in Date and Time dimension datasets. The source data for these datasets is added to the Hadoop file system when Platfora first starts up (in /platfora/system by default).

If you have a different fiscal calendar for your business, you can either replace the built-in datasets or add additional ones and link your datasets to those instead. You cannot delete the default Date and Time references, but you can hide them if you do not need them.

The Date dataset has Gregorian calendar dates ranging from January 1, 1800 to December 31, 2300. Each date is broken down into the following columns:

• Date (DATETIME) - A single date in the format yyyy-MM-dd, for example 2014-10-31. This is the key of the Date dataset.
• Day_of_Month (INTEGER) - The day of the month, from 1-31.
• Day_of_Year (INTEGER) - The day of the year, from 1-366.
• Month (INTEGER) - The calendar month, for example January 2014.
• Month_Name (STRING) - The month name (January, February, and so on).
• Month_Number (INTEGER) - The month number, where January=1 and December=12.
• Quarter (STRING) - The quarter number with year (Q1 2014), where quarters start on January 1, April 1, July 1, or October 1.
• Quarter_Name (STRING) - The quarter number without year (Q1), where quarters start on January 1, April 1, July 1, or October 1.
• Week (INTEGER) - The week number within the year, where week 1 starts on the first Monday of the calendar year.
• Weekday (STRING) - The weekday name (Monday, Tuesday, and so on).
• Weekday_Number (INTEGER) - The day of the week, where Sunday is 1 and Saturday is 7.
• Work_Day (STRING) - One of two values: Weekend (Saturday, Sunday) or Weekday (Monday - Friday).
• Year (INTEGER) - The calendar year, for example 2014.

The Time dataset has each time of day divided into different levels of granularity, from the most general (AM/PM) to the most detailed (Time in Seconds).
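To make the column definitions concrete, here is a worked example derived from the definitions above (not a dump of the actual dataset): the Date row for 2014-10-31 would contain Day_of_Month = 31, Day_of_Year = 304, Month_Name = October, Month_Number = 10, Quarter = Q4 2014, Quarter_Name = Q4, Weekday = Friday, Weekday_Number = 6, Work_Day = Weekday, and Year = 2014.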
Prepare Location Data for Analysis

Adding geographic location information to a dataset allows vizboard users to use maps and geospatial analytics to discover new insights in the data. To prepare location data for analysis, you must tell Platfora which fields of your dataset contain geographic coordinates (latitude and longitude), and optionally a place name to associate with those coordinates (such as the name of a business).

FAQs - Location Data and Geographic Analysis

This section answers common questions about how Platfora handles location data in a dataset. Location information can be added to a dataset by geo-encoding certain fields of the dataset, or by creating a geo location reference to another dataset that contains geo-encoded location data.

What is location data?

Location data represents a geographic point on a map of the Earth's surface. It is composed of latitude/longitude coordinates, plus an optional label that associates a place name with the set of coordinates.

What is geographic analysis?

Geographic analysis is a type of data analysis that involves understanding the role that location plays in the occurrence of other factors. By looking at the geospatial distribution of data on a map, analysts can see how location impacts different variables. In a vizboard, analysts can use the geo map viz type to do geographic analysis.

How does Platfora do geographic analysis?

Platfora enables geographic analysis by allowing data administrators to encode their datasets with location information. This geo-encoded data then appears as special location fields in the dataset, lens, and vizboard. These location fields can then be used to create map visualizations in a Platfora vizboard. Platfora uses Google Maps to render map visualizations.

What are the prerequisites to doing geographic analysis in Platfora?

In order to do geographic analysis in Platfora, you must have:
• Access to the Google Maps web service from your Platfora master server. Your Platfora System Administrator must configure this for you.
• Datasets with latitude/longitude coordinates in them. Platfora provides some curated datasets for US states, counties, cities, and zip codes. You can import these datasets and use them to create geo references if needed (assuming your datasets have a column that can be used to link to them).

What are the high-level steps to prepare data for geographic analysis?

1. Geo-encode the location data in your datasets by creating geo location fields or geo location references.
2. Make sure to include location fields in your lens when you build it.
3. In the vizboard, choose the map viz type.

What is a location field?

A location field is a new type of field you create in a dataset. It has a field name, latitude/longitude coordinates, plus an optional label that associates a place name with a set of coordinates. To create a geo location field in a dataset, you must tell Platfora which columns of the dataset contain this information.

What is a geo reference?

A geo location reference is a reference to another dataset that contains location fields. Geo references should be used when the dataset you are referencing is primarily used for location purposes.

How are geo references different from regular references?

References and geo references are basically the same -- they both create a link to another dataset. A geo reference, however, can only point to datasets that have geo location fields in them. The purpose of a geo reference is to link to datasets that primarily contain location information. Geo references and regular references are also displayed differently in the data catalog view of the dataset, the lens, and the vizboard. Note that either type of reference can contain location fields; geo references just use a different icon. This visual cue helps users find location data more easily.

When should I create geo references versus regular references?

The purpose of a geo reference is to link to datasets that primarily contain just location fields. This helps users identify location data more easily when they want to do geographic analysis. Geo location type fields are featured more prominently via a geo reference. If you have a dataset that has many other fields besides location fields, you may want to use a regular reference instead. Users will still be able to use the location fields in the referenced dataset to create map vizzes if they want; the location data is just featured less prominently in the data catalog and lens.

Can I change a regular reference into a geo reference (or vice versa)?

No. If you want to change the type of a reference, you will have to delete the regular reference and recreate it as a geo reference (or the other way around). You should use the same reference name so that lenses and vizboards that use the old reference name do not break. You cannot have two references with the same name, even if they are different types of references, so you will have to delete or rename the old reference before you create the new one.

Understand Geo Location Fields

A geo location field is a new type of field you create in a dataset.
It has a field name, latitude/longitude coordinates, plus a label that associates a place name value with a set of coordinates. To create a geo location field in a dataset, you must tell Platfora which columns of the dataset contain this location information. You can think of a geo location field as a complex data type comprised of multiple dataset columns.

In order to create a geo location field, your dataset must have:
• A latitude column with numeric data type values.
• A longitude column with numeric data type values.
• A place name column containing STRING data type values. Place name values must be unique for each latitude, longitude coordinate pair. The values of this column are used to label tooltips for points on a map viz, to create filters on a location field, and to label marks and axes when a location field is used in non-map visualizations.

The reason for creating geo location fields is so analyst users can plot location data in a map visualization. Location fields are shown with a special pin icon in the data catalog, lens, and vizboard. This location icon lets users know that the field can be used on a map.

Add a Location Field to a Dataset

The process of adding a geo location field to a dataset involves mapping information from other dataset columns. To create a location field, a dataset needs a latitude column, a longitude column, and a label column containing place names.

1. Make sure that your dataset has the necessary columns. Latitude and longitude columns are required to create a geo location field. Each coordinate must be in its own column, and the columns must be a numeric data type. A location name column is optional, but highly recommended. If you do use location names, the values must be unique for each latitude, longitude coordinate pair. For example, a column containing just city names may not be unique (there may be a city named Paris in multiple states and countries). You may need to create a unique place name column by combining the values of multiple fields in a computed field expression (see the sketch after these steps).
2. Choose Add > Geo Location.
3. Under Geo Location Type > Geo Location Fields, choose either Latitude, Longitude (if you only have the coordinates) or Latitude, Longitude with Name (if you also have a column to use as place name labels).
4. Give the location field a Name. Consider using a standard naming convention for all location type fields. For example, always use Location or Geo in the field name. This will make it easier for users to find location fields using search.
5. (Optional) Enter a Description for the location field.
6. Choose the fields in the current dataset that map to Latitude, Longitude, and Location Name. If you don't see the expected dataset columns as choices, make sure the dataset columns are the correct data type -- DOUBLE, FIXED, LONG, or INTEGER for Latitude and Longitude, STRING for Location Name.
7. Click Add.
8. Make sure the geo location field was added to the dataset as expected. Location fields and geo references are both added in the references section of the dataset, on the Geo Locations tab.
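As a sketch of step 1, assuming hypothetical city and state columns and a CONCAT string function in the expression language (check the Expressions Guide for the string functions available in your release), a unique place name could be built with a computed field such as:

  CONCAT(city, ", ", state)

This turns Paris/TX into "Paris, TX" and Paris/TN into "Paris, TN", making the label unique per coordinate pair.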
Understand Geo References

If your datasets do not have geographic coordinates in them, you can reference special geo datasets that do have coordinate information. For example, if your dataset has US zip code information in it, you can reference a special geo dataset that contains latitude/longitude coordinates for each US zip code.

A geo reference is similar to a regular reference in Platfora. The difference is that geo references are used specifically for linking to datasets containing geo location fields. A regular reference links to datasets that have other dimension information besides just location information. Although a regular referenced dataset may have location information in it as well, location information is not the primary reason the dataset exists.

Prepare Geo Datasets to Reference

A geo dataset is a dataset that contains mainly location information. The main purpose of a geo dataset is to be the target of a geo location reference from other datasets in the data catalog. Linking another dataset to a geo dataset allows users to do geographic analysis in a vizboard. Platfora comes with some built-in geo datasets that you can install and use for US States, US Counties, US Cities, and US Zipcodes. Optionally, you may have your own data that you want to use to create your own geo datasets. For example, you may have location information about sites that are relevant to your business, such as store locations or office locations.

Load the US Geo Datasets

Platfora installs with some curated geo-location datasets for US States, US Counties, US Cities, and US Zipcodes. You can load these datasets into the Platfora data catalog, and then use them to create geo references from datasets that do not have location information in them.

The geo location datasets contain United States location data only. If you have international location data, or custom locations you want to create (such as custom business locations), you can look at these datasets as examples for creating your own geo-location datasets.

In order to reference these geo datasets, your own datasets must have a column that can be used to join to the key of the appropriate geo dataset. For example, to join to the US States dataset, your dataset must have a column that has two-letter state codes (CA, NY, TX, etc.).

1. Log in to the Platfora master server in a terminal session as the platfora system user.
2. Run the geo dataset install script:

   $ $PLATFORA_HOME/client/examples/geo/US/install_geo_us.sh

   You should see output such as:

   Importing dataset: "US States"
   Importing dataset: "US Counties"
   Importing dataset: "US Cities"
   Importing dataset: "US Zipcodes"
   Importing permissions for SourceTable: 'US Cities'
   Importing permissions for SourceTable: 'US Counties'
   Importing permissions for SourceTable: 'US Zipcodes'
   Importing permissions for SourceTable: 'US States'

   (The permissions lines repeat for each object imported.)
3. Go to the Platfora web application in your browser.
4. Go to the Datasets tab in the Data Catalog. Look for the US States, US Counties, US Cities, and US Zipcodes datasets.
The default permissions on these datasets allow Everyone to view them and build lenses, but only the default System Administrator (admin) account can edit or delete them. You may want to grant edit permissions to other system or data administrator users.

Create Your Own Geo Datasets

You may have location information about sites that are relevant to your business, such as store or office locations. If you have location data that you want to reference from other datasets, you can create a special geo dataset. Geo datasets are datasets that are intended to be the target of a geo reference.

Creating a geo dataset is basically the same as creating any other dataset. However, you prepare the fields and references within a geo dataset so that only (or mostly) location fields are visible in the data catalog.

Hide all other fields that are not location fields. Prepare the dataset so that only (or mostly) location fields appear as top-level columns in the dataset. For example, in the Airports dataset, there are 3 possible locations for an airport (from most granular to most general): Airport Location, Airport City Location, and Airport State Location.

If the dataset references other datasets, hide the internal references so users don't see a complicated tree of references in the data catalog. The goal is to flatten and simplify the reference structure for users. For example, in the Airports dataset, there is an internal reference to US Cities. That reference is hidden so users don't see it in the data catalog.

Use interim computed fields to 'pull up' the required latitude and longitude columns from the referenced dataset into the current dataset (see the sketch below). For example, in the Airports dataset, the latitude and longitude columns for Airport Location are already in the current dataset. The latitude and longitude columns for Airport City Location, however, are in the referenced US Cities dataset.
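As a sketch of the 'pull up' technique, and assuming the reference-traversal dot notation described in the Expressions Guide (here a hidden reference named US Cities with Latitude and Longitude fields), the interim computed fields might look like:

  City_Latitude:  [US Cities].Latitude
  City_Longitude: [US Cities].Longitude

The reference and field names are illustrative; the point is that each computed field simply surfaces a referenced dataset's coordinate column as a top-level column of the current dataset, where it can then be used to define a geo location field.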
Then create geo location fields in the current dataset. The computed fields add the required columns needed to create a location field in the current dataset. The goal is to create all possible geo location fields in the current dataset, so users don't have to navigate through multiple references to find them. Consider using a common naming convention for location fields, such as always having Location in the name. This will help users easily find location fields using search.

After all of the location fields have been added to your geo dataset, consider adding a drill path from the most general location field (for example, Airport State Location) to the most specific (for example, Airport Location). This will allow users to drill down on points in a map visualization.

Don't forget to designate a key for your geo dataset. A dataset must have a key to be the target of a geo reference from another dataset.

This approach takes a bit of work, but the end result makes it clear to users which fields they can use in map visualizations. Consider, for example, a geo reference from a Flights dataset to a specially prepared Airport Locations geo dataset.

Add a Geo Reference

A geo reference is basically the same as a regular reference. It creates a link from a field in the current dataset (the focus dataset) to the primary key field of another dataset (the target dataset). You should use a geo reference when the dataset you are linking to is mostly used for location purposes. The target dataset of a geo reference must have a primary key defined and must contain at least one geo location type field. Also, the fields used to join the two datasets must be of the same data type.

1. Make sure that your dataset has the necessary foreign key columns to join to the target geo dataset. For example, to join to the US Cities dataset provided by Platfora, your dataset must have a state column containing two-letter, capitalized state values (CA, TX, NY, and so on), and a city column with city names that have initial capital letters, proper spacing, and no abbreviated names (for example, San Francisco, Los Angeles, Mountain View -- not san francisco, LA, or Mt. View).
2. Choose Add > Geo Location.
3. Under Geo Location Type > Geo References, choose the dataset you want to link to. Only datasets that have keys defined and geo location fields in them will appear in the list.
4. Give the geo reference a Name. Consider using a standard naming convention for all geo location references. For example, always use Location or Geo in the name. This will make it easier for users to find geo references and location fields using search.
5. (Optional) Enter a Description for the geo reference.
6. Choose the Foreign Key field(s) in the current dataset to link to the key field(s) of the target dataset. The foreign key field must be of the same data type as the target dataset key field. If the target dataset has a compound key, you must choose a corresponding foreign key for each field in the key.
7. Click Add.
8. Make sure the geo location reference was added to the dataset as expected. Location fields and geo references are both added in the references section of the dataset, on the Geo Locations tab.

In the data catalog view of the dataset, the lens, and the vizboard, the location fields under a geo reference are listed before other dimension fields.

Prepare Drill Paths for Analysis

Adding a drill path to a dataset allows vizboard users to drill down to more granular levels of detail in a viz. A drill path is defined in a dataset by specifying a hierarchy of dimension fields. For example, a Product drill path might have categories for Division, Type, and Model. Drill path levels depend on the granularity of the dimension fields available in the dataset.

FAQs - Drill Paths

This topic answers some frequently asked questions about defining and using drill paths in Platfora.

What is a drill path?

A drill path is a hierarchy of dimension fields, where each level in the hierarchy is a sub-division of the level above. For example, the default drill path on Date starts with a top-level category of Year, subdivided by Quarter, then Month, then Date. Drill paths allow vizboard users to interact with data in a viz. Users can double-click a mark in a viz (or a cell in a cross-tab) to navigate from summarized to detailed levels of categorization.

Who can define a drill path?

To define a drill path, you must have the Data Administrator system role or above, Edit permission on the dataset, and data access permissions to the datasets included in the drill path hierarchy.
Any user can navigate a drill path in a viz or cross-tab (provided they have sufficient data access permissions).

Where are drill paths defined?

Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when editing an existing one. Choose Add > Drill Path in the dataset workspace.

Can a field be included in more than one drill path?

Yes. A dataset can have multiple drill paths, and the same fields can be used in more than one drill path. However, there is currently no way for a user to choose which drill path they want in a vizboard if a field has multiple paths. The effective drill path will always be the path that comes first alphabetically (by drill path name). For example, the Date dataset has two pre-defined drill paths: YQMD (Year > Quarter > Month > Date) and YWD (Year > Week > Date). If a user adds the Year field to a viz, they should be able to choose between Quarter or Week as the next drill level. However, since there is no way to choose between multiple drill paths, the current behavior is to pick the first drill path alphabetically (YQMD in this case). The ability to choose between multiple drill paths will be added in a future release.

Can I define a drill path on a single column of a dataset?

No. A drill path is a hierarchy of more than one dimension field. If you want to drill on different granularities of data contained in a single column, you can create computed fields to bin or bucket the values at different granularities (see Add Binned Fields, and the sketch after this answer). For example, suppose you had an age field, and wanted to be able to drill from age in 10-year increments, to age in 5-year increments, to actual age. To accomplish this, you'd first need to define two additional computed fields: age-by-10 (10-year buckets) and age-by-5 (5-year buckets). Then you could create a drill path hierarchy of age-by-10 to age-by-5 to age.
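As an illustrative sketch of those bucket fields, and assuming the expression language provides a FLOOR function (check the Expressions Guide), age-by-10 and age-by-5 could be computed as:

  age-by-10: 10 * FLOOR(age / 10)
  age-by-5:  5 * FLOOR(age / 5)

so an age of 37 falls into the 30 bucket at the 10-year level and the 35 bucket at the 5-year level. Alternatively, the Add Binned Fields dialog can create equivalent bins without writing expressions.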
Can a drill path include fields from more than one dataset?

Yes. A drill path can include fields from the focus dataset, as well as from any datasets that it references. For example, you can define one drill path that includes fields from both the Date and Time datasets via their associated references.

Are there any default drill paths defined?

Yes. The built-in datasets for Date and Time have default drill paths defined. Any DATETIME type fields that reference these datasets automatically include these default drill paths. Platfora recommends leaving the default Date and Time drill paths as is. You can always override them by defining your own drill paths in the datasets that you create.

Why do the Date and Time datasets have multiple drill paths defined?

The built-in datasets for Date and Time are automatically referenced by any dataset that contains a DATETIME type field. These datasets include some built-in drill paths to facilitate navigation between different granularities of dates and times. You may notice that the Date dataset has two pre-defined drill paths, and the Time dataset has four. The multiple drill paths accommodate different ways of dividing date and time. In each drill path hierarchy, each level is evenly divisible by the next level down. This ensures consistent drill behavior for whatever field is used in a viz.

What things should I consider when defining a drill path?

A few things to consider when defining drill paths:
• Consistent Drill Levels. Levels in the hierarchy should ideally be evenly divisible subsets of each other. For example, in the Time dataset, the drill increments go from AM/PM to Hour by 6 to Hour by 3 to Hour. Each level in the hierarchy is evenly divisible by the levels below it. This ensures consistent drill-down navigation in a viz.
• Alphabetical Drill Path Names. When a field participates in multiple drill paths, the effective drill path is the one that comes first alphabetically. Plan your drill path names accordingly.
• The Lens Decides the Drill Behavior. Ultimately, the fields that are included in the lens dictate the drill path levels available in a vizboard. If a level in the drill path hierarchy is not included in the lens, it is simply skipped by the drill-down navigation. Consider defining one large drill path hierarchy with all possible levels, and then use the lens field selections to control the levels of granularity applicable to your analysis.
• Aggregate Lenses Only. Viz users can only navigate through a drill path in a viz that uses an aggregate lens. Drill paths are not applicable to event series lenses.

How do I include a drill path in my lens or vizboard?

To include a drill path in a lens or vizboard, simply choose the fields that you are interested in analyzing. As long as there is more than one field from a given drill path in the lens, drill-down capabilities are automatically included. The lens builder does not currently indicate whether a field is a member of a drill path. You do not have to include every level of the drill path hierarchy in a lens -- the vizboard drill-down behavior can skip levels that are not present. For example, if you have defined a drill path that goes from year to month to day, but you only have year and day in your lens, the effective drill path for that lens becomes year to day (month is skipped).

Add a Drill Path

Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when editing an existing one.

1. Edit the dataset.
2. On the Manage Fields step of the dataset workspace, choose Add > Drill Path.
3. Enter a name for the drill path. Keep in mind that drill path precedence is determined alphabetically by name whenever a field is part of multiple drill paths.
4. Add the fields that you want to include in the drill path. You can include fields from a referenced dataset as well.
5. Use the up and down arrows to set the drill hierarchy order. The most general categorization should be on top, and the most detailed categorization should be on the bottom.
6. Save the drill path.

Model Relationships Between Datasets

This section explains the relationships between datasets, and how to model dataset references, events, and elastic datasets in Platfora to support the type of analysis you want to do on the data.

Understand Data Modeling in Platfora

This section explains the different kinds of relationships you can model between datasets to support quantitative analysis, event series analysis, and/or behavioral segment analysis.

The Fact-Centric Data Model

A fact-centric data model is centered around a particular real-world event that has happened, such as web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of an analysis, and dimension datasets are referenced to provide more information about the fact.
In data warehousing and business intelligence (BI) applications, this type of data model is often referred to as a star schema. For example, you may have web server logs that serve as the source of your central fact data about pages viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact to provide more in-depth analysis opportunities. In Platfora, you would model dataset relationships in this way to support the building of aggregate lenses for quantitative data analysis.

Fact-centric data modeling involves the following high-level steps in Platfora:

1. Define a key in the dimension dataset. A key is one or more dataset columns that uniquely identify a record.
2. Create a reference in your fact dataset that points to the key of the dimension dataset.

How References Work in Platfora

Creating a reference allows datasets to be joined when building aggregate lenses and executing aggregate lens queries, similar to a foreign key to primary key relationship between tables in a relational database. Once you have added your datasets, you can model the relationships between them by adding references in your dataset definitions.

A reference is a special kind of field in Platfora that points to the key of another dataset. A reference is created in a fact dataset and points to the key of a dimension dataset. For example, you may have web server logs that serve as the source of your central fact data about pages viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact to provide more in-depth analysis opportunities.

Upstream datasets point to other datasets. Downstream datasets are the datasets being pointed to. For example, the Page Views dataset is upstream of the Visitors dataset, and the Visitors dataset is downstream of Page Views. Once a reference is created, the fields of all downstream datasets are available through the dataset where the reference was created. Data administrators can define computed expressions using downstream dimension fields, and analyst users can choose downstream dimension fields when they build a lens. Measure fields, however, are not available through a reference.

The Entity-Centric Data Model

An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular dimension (or entity). Modeling the data in this way allows you to do event series analysis, behavioral analysis, or segment analysis in Platfora.

For example, suppose you had a common dimension that spanned multiple facts. In a relational database, this is sometimes referred to as a conforming dimension. In this example, our conforming dimension is customer. Modeling the fact datasets around a central customer dataset allows you to analyze different aspects of a customer's behavior. For example, instead of asking "how many customers visited my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once a day?" or "which customers are most likely to respond to a direct marketing campaign?" (entity-centric). In Platfora, you would model dataset relationships in this way to support the building of event series lenses and/or segments for behavioral data analysis.

Entity-centric data modeling involves the following high-level steps in Platfora:
1. Identify or create a dimension dataset to serve as the common entity you want to analyze. If your existing data is only comprised of fact datasets, you can create an elastic dataset (a virtual dimension used to model entity-centric relationships).
2. Define a key for the dimension dataset. A key is one or more dataset columns that uniquely identify a record.
3. Create references in your fact datasets that point to the key of the common entity dimension dataset.
4. Model events in your common entity dimension dataset.

How Events Work in Platfora

An event is similar to a reference, but the direction of the join is reversed. An event joins the primary key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, plus designates a timestamp field for ordering the event records. Adding event references to a dataset allows you to define an event series lens from that dataset. An event series lens can contain records from multiple fact datasets, as long as the event references have been modeled in the dimension dataset.

For example, suppose you had a common dimension dataset (customer) that was referenced by multiple fact datasets (clicks, emails, calls). By creating different events within the customer dataset, you can build an event series lens from customer that allows you to analyze different aspects of a customer's behavior. By looking at different customer events together in a single lens, you can discover additional insights about your customers. For example, you could analyze customers who were the target of an email or direct marketing campaign and who then visited your website or made a call to your call center.

How Elastic Datasets Work in Platfora

Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are used to consolidate unique key values from other datasets into one place for the purpose of defining segments, event series lenses, references, or computed fields. They are elastic because the data they contain is dynamically generated at lens build time.

Elastic datasets can be created when you have a flat data model with the majority of your data in a single dataset. Platfora requires you to have separate dimension datasets in order to create segments and event series lenses. Elastic datasets allow you to create 'virtual dimensions' so you can do the entity-centric data modeling required to use these features of Platfora.

Elastic datasets can also be used to work with data that is not backed by a single data source, but instead is embedded in various other datasets. For example, suppose you wanted to do an analysis of the IP addresses that accessed your network. You have various server logs that contain IP addresses, but no separate IP Address dataset modeled out in Platfora. To consolidate the unique IP addresses that occur in your server log datasets, you could create an elastic dataset called IP Address. You could then model references and events that point to this elastic, virtual dataset of IP addresses.

There are two ways to create an elastic dataset:

1. From one or more columns in an existing dataset. This generates the elastic dataset, a static example data file, and the corresponding reference at the same time.
2. From a file containing static example data. You can also use a dummy file of key examples to define an elastic dataset.
The file is then used for example purposes only. After the elastic dataset has been created, you then need to model references (if you want to create segments) and events (if you want to create event series lenses).

Elastic datasets are virtual -- all of their data values are consolidated from the other datasets that reference them. They are not file-based like other datasets. The actual key values that comprise the elastic dataset are computed at lens build time. The example data shown in the Platfora data catalog is file-based, but it is only used for example purposes.

Elastic datasets inherit the data access permissions of the datasets that reference them. So, for example, if a user has access to the Web Logs and Network Logs datasets, they will have access to the IP address values consolidated from those datasets via the IP Address elastic dataset.

One thing to keep in mind is the sample file used to show the values seen in the dataset workspace and the Platfora data catalog. The values in this sample data file are viewable by all Platfora users by default. If you are concerned about this, don't use real data values to create the sample data file.

Since elastic datasets contain no actual data of their own, they cannot be used as the focus of an aggregate lens. They can be included by reference in an aggregate lens, or be used as the focus when building an event series lens. Also, since they are used to consolidate key values from other datasets, every base field in the dataset must be included in the elastic dataset key. Additional base fields that are not part of the key are not allowed in an elastic dataset (additional computed fields are OK, though).

Add a Reference

A reference creates a link from a field in the current dataset (the focus dataset) to the primary key field of another dataset (the target dataset). The target dataset must have a primary key defined. Also, the fields used to join the two datasets must be of the same data type.

1. Go to the Create References step of the dataset workspace.
2. Choose Add > Reference.
3. In the Define References panel, select the Referenced Dataset to link to. Only datasets that have keys defined will appear in the list. If you do not see the dataset you want to reference in the target list, make sure that it has a key defined and that the data type of the key field(s) is the same as the foreign key field(s) in the focus dataset. For example, if the key of the target dataset is an INTEGER data type, but the focus dataset only has STRING fields, you will not see the dataset in the target list because the data types are not compatible.
4. Choose the Foreign Key field(s) in the current dataset to link to the Key field(s) of the target dataset. The foreign key field must be of the same data type as the target dataset key field. If the target dataset has a compound key, you must choose a corresponding foreign key for each field in the key.
5. Enter a Name for the reference.
6. (Optional) Enter a Description for the reference.
7. Click Add. The new reference is added to the dataset in the References section.

You may also want to hide the foreign key field(s) in the current dataset so that users only see the reference fields in the data catalog. From here on out, refer to the referenced dataset by the reference name (not the original dataset name).
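For example (names hypothetical), if a fact dataset references a Customer Accounts dataset through a reference named Customer, downstream fields are addressed through the reference name. Assuming the dot notation described in the Expressions Guide, a computed field would refer to something like:

  Customer.Region

rather than referring to the Customer Accounts dataset directly. The reference name is also what analysts see in the data catalog and the lens builder.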
Add an Event Reference

An event is a special reverse reference that is created in a dimension dataset. Before you can model event references, you must define regular references first. Event references allow you to define an event series lens from a dataset.

In order to create an event in a dataset, the current dataset and the event dataset you are linking to must meet the following requirements:
• The current dataset must have a key defined.
• The current dataset must be the target of a reference. See Add a Reference.
• The event dataset that you are linking to must have a timestamp field in it (a DATETIME type field).

If the dataset does not meet these requirements, you will not see the Add > Event option.

1. Edit the dimension dataset in which you want to model event references.
2. Choose Add > Event.
3. Provide the event information and click Add Event.
   • Event Name - A logical name for the event. This is the name users will see in the data catalog or a lens.
   • Event Dataset - The fact dataset that you are linking to. You will only see datasets that have references to the current dataset.
   • Event Dataset Reference - The name of the reference in the event dataset. If the event dataset has multiple references to the current dataset, choose the appropriate one for your event.
   • Ordering Field - A timestamp field in the event dataset. When an event series lens is built, this is the field used to order event records. Only DATETIME type fields in the event dataset are shown.
4. The event is added to a separate Events tab in the References section. Click an event to edit its details.

Add an Elastic Dataset

Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. There are two ways to create an elastic dataset: from a column in an existing dataset, or from a file containing sample values.

Elastic datasets are used to consolidate unique key values from other datasets into one place for the purpose of defining segments or event series lenses. They are elastic because the data they contain is dynamically generated at lens build time. The data values used when adding the elastic dataset are for example purposes only. They are visible to users as example values when they view or edit the elastic dataset in the data catalog. The actual data values of an elastic dataset come from the datasets that reference it.

Create an Elastic Dataset from an Existing Dataset

As a convenience, you can create an elastic dataset while working in another dataset. This creates the elastic dataset, a static example data file, and the corresponding fact-to-dimension reference at the same time.

1. Edit the dataset that contains the key values you want to consolidate.
2. Choose Add > Elastic Dataset.
3. Choose the column in the current dataset that you want to base the elastic dataset on. If the key values are comprised of multiple columns (a compound key), click Add Additional Fields to choose the additional columns.
4. Enter a name for the new elastic dataset that will be created.
5. Enter a name for the new reference that will be created in the current dataset.
6. Enter a description for the new elastic dataset that will be created.
7. Click Add.
8. Notice that the reference to the elastic dataset is created in the current dataset.
9. Save the current dataset.
10. You are notified that the new elastic dataset is about to be created using sample values from the column(s) you selected in the current dataset. Click Confirm.

The sample values are written to Platfora's system directory in the Hadoop file system, for example in HDFS at:

/platfora/system/current+dataset+name/sample.csv

This file is only used as sample data when viewing or editing the elastic dataset. The sample file is not removed from HDFS if you delete the elastic dataset in Platfora; you will have to remove it from HDFS directly.

Create an Elastic Dataset Using a Sample File

If you are creating an elastic dataset based on sensitive values, such as social security numbers or email addresses, you may want to use a sample file of fake data to create the elastic dataset (see the example after these steps). This way unauthorized users will not be able to see any real data via the sample values. This is especially important for Platfora systems using HDFS delegated authorization.

1. Upload a file to Platfora as the basis for the elastic dataset. This file should contain a newline-separated list of sample values.
2. Go to the Define Key step of the dataset workspace.
3. Select Include in Key for all Base fields of the dataset.
4. Select Elastic Dataset to change the dataset's type from file-based to elastic.
5. Click OK to confirm the dataset type change.
6. Change the Dataset Name for the elastic dataset. For example, you may want to use a special naming convention for elastic datasets to help you find them in the Platfora data catalog.
7. Save and Exit the dataset.

After you create the elastic dataset, you have to add references in your fact datasets that point to it. This is how the elastic dataset gets populated with real data values at lens build time: it consolidates the foreign key values from the datasets that reference it.
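For example, a sample file for a hypothetical elastic dataset of social security numbers could be a plain text upload containing a few obviously fake, newline-separated values:

000-00-0001
000-00-0002
000-00-0003

Only these placeholder values are ever shown as sample data in the data catalog; the real key values are consolidated from the referencing datasets at lens build time.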
Delete or Hide a Reference

Deleting a reference removes the link between two datasets. If you want to keep the reference link but do not want the reference to appear in the data catalog, you can always hide it instead. The automatic references to Date and Time cannot be deleted, but they can be hidden.

Before deleting a reference, make sure that you do not have computed fields, lenses, or vizboards that are using referenced fields. A missing reference can cause errors the next time someone updates a lens or vizboard that uses fields downstream of the reference.

1. Go to the Create References step of the dataset workspace.
2. Find the reference on the References tab, event references on the Events tab, or geo references on the Geo Locations tab.
3. Delete or hide the reference.
   • To delete the reference, click the delete icon.
   • To hide the reference, select Hidden. This keeps the reference in the dataset definition, but hides it in the data catalog view of the dataset.

Update a Reference

You can edit an existing reference, event, or geo reference to change its name or description. Make sure to click Update to apply any changes you make.

Before changing the name of a reference, make sure that you do not have computed fields, lenses, or visualizations that are using it. Changing the name can cause errors the next time someone updates a lens or vizboard that uses the old reference name.

1. Go to the Create References step of the dataset workspace.
2. Find the reference on the References tab, event references on the Events tab, or geo references on the Geo Locations tab.
3. Click the reference you want to update.
4. In the Define References panel, update the name or description.
5. Click Update to apply your changes.

Define the Dataset Key

A key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a primary key in a relational database. A dataset must have a key defined in order to be the target of a reference, the base dataset of a segment, or the focus of an event series lens. Also, the key field(s) used to join two datasets must be of the same data type.

1. Go to the Define Key step of the dataset workspace.
2. Select Include in Key in the column header of the key field(s). A dataset may have a compound key (a combination of fields that together are the unique identifier). Select each field that comprises the key.
3. Click Save.

Chapter 4: Use the Data Catalog to Find What's Available

The data catalog is a collection of data items available and visible to Platfora users. Data administrators build the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop. When users request data from a dataset, that request is materialized in Platfora as a lens. The data catalog shows all of the datasets (data available for request) and lenses (data that is ready for analysis) that have been created by Platfora users.

Topics:
• FAQs - Data Catalog Basics
• Find Available Datasets
• Find Available Lenses
• Find Available Segments
• Organize Datasets, Lenses and Vizboards with Labels

FAQs - Data Catalog Basics

The data catalog is where users can find datasets, lenses, and segments that have been created in Platfora. This topic answers the basic questions about the data catalog.

How can I see the relationships between datasets?

There isn't one place in the data catalog where you can see how all of the datasets are related to each other. You can, however, open a particular dataset to see how it relates to other datasets in the data catalog. The dataset detail page shows the Referenced Datasets that the current dataset is pointing to. If a dataset is the target of an incoming reference, it is considered a dimension dataset. Dimension datasets show both upstream and downstream relationships on their dataset detail page, whereas fact datasets only show downstream relationships (or no relationships at all). If a dimension dataset has an event or segment associated with it, then it is also considered an entity dataset. Entity datasets serve as a conforming dimension to join multiple fact datasets together. Entity datasets can be used as the focus of an event series lens.

What does it mean when a dataset or lens has a lock on it?

If you are browsing the data catalog and see datasets or lenses that are grayed-out and locked, this means that you do not have sufficient data access permissions to see the data in that dataset or lens. Contact your Platfora system administrator to ask if you can have access to the data.

What does it mean when a dataset has (static) or (dynamic) after its name?

This means that the dataset is a derived dataset.
A derived dataset is defined from a viz (or lens query) in Platfora, whereas a regular dataset is defined from a data source outside of Platfora. A static derived dataset takes a snapshot of the viz data at a point in time -- the data does not change if the parent lens is updated. A dynamic derived dataset does not save the actual viz data, but instead saves the lens query used to produce the data -- the data is dynamically updated whenever the parent lens is updated.

Why doesn't 'My Datasets' or 'My Lenses' have anything listed?

Even though you may work with certain datasets and lenses on a regular basis, they won't show in the My Datasets or My Lenses panels unless you were the original user who created them.

Find Available Datasets

Datasets represent a collection of source data in Hadoop that has been modeled for use in Platfora. You can browse or search the data catalog to find datasets that are available to you and that meet your data requirements. Once you find a dataset of interest, you can request that data by building a lens (or check if a lens already exists that has the data you need).

Search within Datasets

Using the Quick Find search, you can find datasets by name. Quick Find also searches the field names within the datasets.

1. Go to the Datasets tab in the Data Catalog.
2. Search by dataset name, or by a field name within the dataset, using the search.

Dataset List View

List view allows you to sort the available datasets by different criteria to find the dataset you want.

1. Go to the Datasets tab in the Data Catalog.
2. Select List view.
3. Click a column header to sort by that column.
4. Once you find the dataset you want, use the dataset action menu to access it.
5. While in list view, you can select multiple datasets to delete at once.

Find Available Lenses

Lenses contain data that is already loaded into Platfora and immediately available for analysis. Lenses are always built from the focus of a single dataset in the data catalog. Before you build a new lens, you should check whether there is already a lens that has the data you need. You can browse the available lenses in the Platfora data catalog.

1. Go to the Lenses tab in the Data Catalog.
2. Choose the List view to easily sort and search for lenses.
3. Search by lens name, or by a field name within the lens, using the search.
4. Click a column header to sort by that column, such as finding lenses by their focus dataset.
5. Once you find a lens you want, use the lens action menu to access it.

Find Available Segments

The data catalog does not have a centralized tab or view that shows all of the segments that have been created in Platfora. You can find segments by looking in a particular dataset to see if any segments have been created from that dataset.

1. Go to the Data Catalog.
2. Find the dataset that you would want to segment and open it.
3. Choose the Segments tab. If you do not see any segments listed, the dataset has not been used to define segments. A dataset must be the target of an incoming reference in order to define segments from it.
4. Click a segment to see its details.
5. The segment details show:
   • Segment Name - The name given to the segment when it was defined in the vizboard.
   • Built On - The last time the segment was updated.
   • Segment of - The focus dataset (and its associated reference to the current dataset) that was used to define the segment.
   • Occurring in Dataset - Always the same as the currently selected dataset name.
   • Origin Lens - The lens that was queried to define this segment.
   • Segment Conditions - The criteria that a record must meet to be counted in the segment.
   • IN and NOT IN Value Labels - The value labels given to records that are in the segment, and those that are not.
   • Segment Count - The number of rows of this dataset that met the segment conditions.
6. From the segment action menu you can Delete the segment or edit its Permissions.

Segments cannot be created or edited from the data catalog. To edit the segment conditions, you must go to a vizboard where the segment is in use. Choose Show Vizboards to find vizboards that are using the segment.

Organize Datasets, Lenses and Vizboards with Labels

If you have datasets, lenses, and vizboards that you use all of the time, you can tag them with a label so you can easily find them. Labels allow you to organize and categorize Platfora objects for easier search and collaboration. For example, you can label all datasets, lenses, and vizboards associated with a particular department or project.

Anyone in Platfora can create a label and apply it to any object to which they have view data access. Labels are just an organizational tool. They do not have any security or privacy settings associated with them. Labels can be created, viewed, applied, or deleted by any Platfora user, even labels created by other users. There is no ownership associated with labels.

Create a Label

Before you create new labels, first decide how you want to categorize and organize your data objects in Platfora. For example, do you want to tag objects by user names? By project? By department? By use case? A combination of these?

Labels can be created for each category you want to search by, and within a category, you can create up to 10 levels of nested sub-label categories. By default, there is one parent label category called All, which cannot be renamed or deleted. Any label you add will be a sub-label of All.

1. You can manage labels from the Data Catalog or the Vizboards area of Platfora.
2. Select Manage Labels from the Labels menu.
3. Select Create Sublabel from the desired parent label in the hierarchy. The default parent label category is All.
4. Enter a name for the label.
5. Click Create.
6. Click OK.

Apply a Label to a Dataset, Lens, or Vizboard

You can apply as many labels as you like to a dataset, lens, or vizboard. Applying a label to an object allows you to search for that object by that label name.

1. You can apply labels from the Data Catalog or the Vizboards area of Platfora.
2. Select Labels from the dataset, lens, or vizboard action menu.
3. Click the plus sign (+) to apply a label. Click the minus sign (-) to remove a label that has been previously applied.
4. Click OK.

Delete or Rename a Label

When you delete a label, the label is removed from all objects to which it was applied. The objects themselves are not affected.
When you rename a label, the label will be updated to the new name wherever it is applied. You do not need to re-apply it to the objects after renaming. If you are getting errors using a label after it has been renamed, try reloading the browser page. Sometimes old label names are cached by the browser and can cause unexpected results.

Search by Label Name

Once you have applied labels to your objects, you can use the label breadcrumbs and search to find objects by their assigned labels. You can search by label in the Data Catalog or Vizboards areas of the Platfora application.

1. Click any level in the breadcrumb hierarchy to filter by that label category.
2. Select an existing label to filter on.

Chapter 5

Define Lenses to Load Data

To request data from Hadoop and load it into Platfora, you must define and build a lens. A lens can be thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project.

Topics:
• FAQs - Lens Basics
• Lens Best Practices
• About the Lens Builder Panel
• Understand the Lens Build Process
• Create a Lens
• Estimate Lens Size
• Manage Lenses
• Manage Segments—FAQs

FAQs - Lens Basics

A lens is a type of data storage that is specific to Platfora. This topic answers some frequently asked questions about lenses.

What is a lens?

A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and processing engine to build and store its lenses. Once a lens is built, the prepared data is copied to Platfora, where it is then available for analysis. A lens can be thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project.

Who can create a lens?

Lenses can be created by any Platfora user with the Analyst system role (or above), provided that user also has the appropriate security permissions on the underlying source data and the dataset.

How do I create a lens?

You create a lens by first choosing a dataset in the Platfora data catalog, then choosing Create Lens from the dataset detail page or the dataset action menu. If the Create Lens option is grayed out, you don't have the appropriate security permissions on the dataset. Ask your system administrator or the dataset owner to grant you access.

How big can a lens be?

It depends on how much disk space and memory you have available in Platfora, and whether your system administrator has set a limit on how much data you can request at once. As a general rule, a lens should not be bigger than the amount of memory available in your entire Platfora cluster.

For most Platfora users, the system administrator sets a lens quota which limits how big of a lens you can build. The default lens quota depends on your system role: 1 GB for Analysts, 1 TB for Data Administrators, and unlimited for System Administrators. You can see your lens quota when you go to build a lens.

Likely, your organization uses Hadoop because you are collecting and storing a lot of data. It probably doesn't make sense to request all of that data at once. You can limit the amount of data you request by using lens filters and only choosing the fields you need for your analysis.

How long does it take to build a lens?

It really depends - a lens build can take a few minutes or several hours.
There are a lot of factors that determine how long a lens build will take, and many of those factors depend on your Hadoop cluster, not necessarily on Platfora. Since the lens build jobs happen in Hadoop, the biggest factor is the resources available in your Hadoop cluster to run Platfora's MapReduce jobs. If the Hadoop cluster is busy with other workloads, or if there is not enough memory on the Hadoop task nodes, then Platfora's lens builds will take longer. The time it takes to build a lens also depends on the size of the input data, the number and cardinality of the dimension fields you choose, and the complexity of the processing logic you have defined in your dataset definitions.

What are the different kinds of lenses?

Platfora has two types of lenses you can build: an Aggregate Lens or an Event Series Lens. The type of lens you build determines what kinds of visualizations you can create and what kinds of analyses you can perform when using the lens in a vizboard.

An aggregate lens can be built from any dataset. It contains aggregated measure data grouped by the various dimension fields you select from the dataset. Choose this lens type if you want to do ad hoc data analysis.

An event series lens can only be built from dimension datasets that have an Event reference defined in them. It contains non-aggregated events (fact dataset records), partitioned by the primary key of the selected dimension dataset, and sorted by the time the event occurred. Choose this lens type if you want to do time series analysis, such as funnel paths.

How does Platfora handle rows that can't be loaded?

When Platfora processes the data during a lens build, it logs any problem rows that it could not process according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings. Platfora administrators can investigate these warnings to determine the extent of the problem.

Lens Best Practices

When you define a lens, you want the selection of fields to be broad enough to support all of the business questions you want to answer. A lens can be used by many visualizations and many users at the same time. On the other hand, you want to constrain the overall size of the lens so that it fits into the available memory and so queries against the lens are fast.

Check for existing lenses before you build a new one.

Once you find a dataset that contains the data you want, first check for any existing lenses that have been built from that dataset. There may already be a lens that you can use for your analysis. Also, if there is an existing lens that contains some but not all of the fields you want, you can always modify the lens definition to add the additional fields. This is more efficient than building a whole new lens from scratch.

Define lens filters to reduce the amount of data you request.

You can add a lens filter on any dimension field of a dataset. Lens filters constrain the number of records pulled into the lens from the data source. For example, if you store 10 years' worth of data in Hadoop, but only need to access the past year's worth of data, you can set a date-based filter so the lens gets only the data you need. Keep in mind that you can also create filters within visualizations. Lens filters should be used to limit the number of records (and overall lens size); you don't want a lens so narrow in scope that it limits its analysis opportunities.
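For instance, a dataset with an order date field (OrderDate is a hypothetical field name here) could be restricted to roughly the past year using either a relative or an absolute date filter expression (the filter syntax is covered in Define Lens Filters later in this chapter):

LAST 365 DAYS

BETWEEN 2014-07-01 AND 2015-06-30

A relative filter such as LAST 365 DAYS, combined with a scheduled nightly build, keeps the lens as a rolling one-year window; an absolute BETWEEN range pins the lens to fixed dates.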
Don't include high cardinality fields that are not essential to your analysis.

The size of an aggregate lens depends mostly on the cardinality (number of unique values) of the dimension fields selected. The more granular the dimension data, the bigger the aggregate lens will be. For example, aggregating time-based data to the second granularity will make the lens significantly bigger than if you chose to analyze the data at the hour granularity. For fields that you intend to use as measures only (you only need the aggregated values), make sure to deselect Original Value. When Original Value is selected, the field is also included in your lens as a dimension field.

Don't include DISTINCT measures unless they are essential to your analysis.

Measures that calculate DISTINCT counts must also include the original field values that they are counting. If you add a DISTINCT measure on a high-cardinality field, this can make your aggregate lens larger than expected. Only include DISTINCT measures in your lens when you are sure you need them for your analysis. For any dimension field you have in your lens, you can also calculate a DISTINCT count in the vizboard using a vizboard computed field. DISTINCT is the one measure aggregation that doesn't have to be calculated at lens build time.

About the Lens Builder Panel

When you create a new lens or edit an existing one, it opens the lens builder panel. The lens builder is where you choose and confirm the dataset fields that you want in your lens. The lens builder panel looks slightly different depending on the type of lens you are building (aggregate or event series lens). You can click any field to see its definition and description.

1. Lens Name
2. Focus Dataset Name
3. Focus Dataset Size
4. Lens Type (Aggregate or Event Series)
5. Lens Size and Quota Information
6. Field Selection Controls
7. Field Information and Descriptions
8. Quick Measure Controls
9. Lens Filter Controls
10. Lens Management Controls
11. Segment Controls
12. Lens Actions (save the lens definition and/or initiate a build job to get data from Hadoop)

Understand the Lens Build Process

The act of building a lens in Platfora generates a series of MapReduce jobs in Hadoop to select, process, aggregate, and prepare the data for use by Platfora's visual query engine, the vizboard. This section explains how source data is selected for processing, what happens to the data during lens build processing, and what resulting data to expect in the lens. By understanding the lens build process, administrators can make decisions to improve lens build performance and ensure the resulting data meets the expectations of business users.

Understand Lens MapReduce Jobs

When you build or update a lens in Platfora, it generates a series of MapReduce jobs in the Hadoop MapReduce cluster. The number of jobs and the time to complete each job depend on the number of datasets involved, the number and size of the fields selected, and whether that lens definition has been built before (incremental vs. non-incremental lens builds). This topic explains all of the MapReduce jobs or steps that you might possibly see in a lens build, and what is happening in each step. These steps are listed in the order that they occur in the overall lens build process. These MapReduce jobs appear on the Platfora System page as distinct steps of a lens build.
Depending on the lens build, you might see all of these steps or just a few of them. Depending on the number of datasets involved in the lens build, you may see some steps more than once:

1. Inspect Source Data - This step scans the data source to determine the number and size of the files to be processed. If a lens was built previously using the same dataset and field selections, then the inspection checks for any new or changed files since the last build. If you have defined lens filters on an input partitioning field, these filters are applied at this time, before any other processing occurs.

2. Waiting for lens build slot to become available - To prevent Platfora from overwhelming the Hadoop cluster with too many concurrent lens build jobs, Platfora limits the number of concurrent jobs it runs. Any lens build submitted after that limit is reached waits for existing lens builds to finish before starting. The limit is 3 by default, and is controlled by the platfora.builder.lens.build.concurrency property.

3. Event series processing for computed_field_name - This step only occurs in lenses that include event series processing computed fields (computed fields defined using a PARTITION statement). This job does the value partitioning and multi-row pattern match processing of event series computed fields.

4. Build Data Dictionaries - This step scans the source files and determines the distinct values for each dimension (grouping column) in the lens. For example, a gender field might have two distinct values (Male, Female) and a state field might have 50 distinct values (CA, NY, WA, TX, etc.). For high-cardinality fields, you may see an additional Build Partitioned Data Dictionaries step preceding this step. This splits up the distinct values so that the dictionary can be distributed across multiple nodes. This job is run for each dataset included in the lens.

5. Encoding Attribute - This step encodes the dimension values (or attributes) using the data dictionaries. When data dictionaries are small, this step does not require its own job (it is performed as part of dictionary building). When a data dictionary is large, encoding attributes is a separate MapReduce job.

6. Encoding Reference - This step joins datasets that are connected by references. When data dictionaries are small, this step does not require its own job (it is performed as part of dictionary building). When a data dictionary is large, joining datasets is a separate MapReduce job.

7. Aggregate Datasets - For aggregate lenses, this step calculates the aggregated measure values for each dimension value and each unique combination of dimension values. For example, if the lens included a measure for SUM(sales), and the dimension fields gender and state, then the sum of sales would be calculated for each gender, each state, and each state/gender combination. For event series lenses, this step partitions the individual event records by the focus dataset key and orders the event records in each partition by time. This job is run for each dataset included in the lens.

8. Load Datasets - This step creates a columnar representation of the data and writes out the lens data structures to disk in the Hadoop file system. This job is run for each dataset included in the lens.
9. Index Datasets - For lenses that include fields from multiple datasets, this step creates indexes on key fields to allow joins between the datasets when they are queried. This job is run for each referenced dataset included in the lens.

10. Transfer to Final Location - This step is specific to Amazon Elastic MapReduce (EMR) lens builds. It copies lens output files from the intermediate directory in the EMR job flow to the final destination in S3.

11. Preload Built Data Files to Local Disk - This step copies the lens data structures from the Hadoop file system to the data directory locations on the Platfora servers. Pre-fetching the lens data from Hadoop reduces the initial query time when a lens is first accessed in a vizboard.

Understand Source Data Input to a Lens Build

This section describes how Platfora determines what source data files to process for a given lens build. Source data input refers to the raw data files in Hadoop that are considered for a particular lens build. A Platfora dataset points to a location in a data source (a directory in HDFS, an S3 bucket, a Hive table, etc.). By choosing a focus dataset for your lens, you set the scope of source data to be considered for that lens. Which source data files actually get processed by a lens build depends on other characteristics of the lens, such as whether the lens has been built before and whether any lens filters exclude source data files.

Understand Incremental vs Full Lens Builds

Whenever possible, Platfora tries to conserve processing resources on the Hadoop cluster by only processing the source data files it needs for the lens. If a source data file has already been processed once for a particular lens definition, Platfora can reuse that work from past lens builds and not process that file again. However, if the underlying data has changed in some way, Platfora must re-process all of the source data files in order to ensure data accuracy. This section describes how a lens build determines whether it needs to process all of the source data (a full lens build) or just the new source data added since the last time the lens was built (an incremental lens build).

Incremental lens builds are more desirable because they are faster and use fewer resources. When you first create a lens, Platfora builds the lens data using a full build. During the build, Platfora stores a record of the build inputs. Then, as it manages that lens, Platfora can determine if any build inputs changed. Platfora rebuilds a lens whenever a user manually fires a build by pressing a lens's Build button or a scheduled build is fired by the system. Whenever a build is fired, Platfora first compares the last build inputs to the new build inputs. If nothing changed between the two builds, Platfora reuses the results of the last build. If there are changes and those changes fall within certain conditions, Platfora does an incremental lens build. If it cannot do an incremental build, Platfora does a full rebuild. Platfora defaults to incremental builds because they are faster than full rebuilds. You can optimize lens build performance in your environment by understanding the conditions that determine whether a lens build is full or incremental.

An incremental build appends new data to an existing lens without changing any previously built data. So, Platfora can only incrementally build changes that add new data but do not modify or delete old build inputs.
For this reason, Platfora can only incrementally build lenses that rely solely on HDFS or Hive data sources. HDFS directories and Hive partitions permit incremental builds because they support wildcard configurations. Wildcard configurations typically acquire new data by pattern matching incoming data; they do not modify or delete existing data. An incremental lens build retrieves the newly added data, processes it, and appends it to the old data in Platfora. The old data is not changed.

Even though a data source is Hive or HDFS, that does not guarantee that a lens will always build incrementally. Under certain conditions, Platfora always builds the full lens. When any of the following happens between the last build and a new build, Platfora does a full lens build:

• The lens has a LAST X DAYS filter and the last build occurred outside the filter's parameters.
• The lens is modified. For example, a user changes the description or adds a field.
• The dataset is modified. For example, a user adds a field to the dataset.
• A referenced dimension dataset changes in any way.
• A data source is modified. For example, a file is modified or a file is deleted from an HDFS directory.

Additionally, Platfora builds the full lens under the following conditions:

• The lens includes event series processing fields. Due to the nature of pattern matching logic, lenses with ESP fields require full lens builds that scan all of a dataset's input data.
• The HDFS Delegated Authorization feature is enabled.

A full lens build can be resource intensive and can take a long time, which is why Platfora always tries to do an incremental build if it can. You can increase the chances that Platfora does an incremental build by relaxing the build behavior for dimension data.

Understand Input Partitioning Fields

An input partitioning field is a field in a dataset that contains information about how to locate the source files in the remote file system. Defining a filter on one of these special fields eliminates files from lens build processing as the very first step of the lens build process, whereas other lens filters are evaluated later in the process. Defining a lens filter on an input partitioning field is a way to reduce the amount of source data that is scanned by a lens build.

For Hive data sources, partitioning fields are defined on the data source by the Hive administrator. Hive partitioning fields appear in the PARTITIONED BY clause of the Hive table definition. Hive administrators use partitioning fields to organize the Hive data into separate files in the Hadoop file system. The goal of Hive table partitioning is to improve query performance by keeping records together for faster access.

For HDFS or S3 data sources, Platfora administrators can define a partitioning field when they create a dataset. A partitioning field for HDFS or S3 is any computed field that uses a FILE_NAME() or FILE_PATH() function. File or directory path partitioning is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory or file names themselves. For example, useful path information includes dates or server names (a brief sketch follows below).

Not all datasets will have partitioning fields. If there are partitioning fields available, the lens page displays a special icon next to them. Platfora applies filters on input partitioning fields as the first step of a lens build.
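As an illustrative sketch (the file layout and field name here are hypothetical): suppose a dataset's source files are daily logs named like access-2015-06-01.log. A computed field named source_file whose expression is simply:

FILE_NAME()

exposes each record's source file name as an input partitioning field. A lens filter on that field, such as:

LIKE("access-2015-06*")

would then eliminate all files outside of June 2015 before any other lens build processing occurs.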
Platfora then computes any event series processing computed fields; any other lens field filters are applied later in the build process. Event series processing computed fields are those defined using a PARTITION statement. The interaction of input partitioning fields and event series processing is important to understand if you are using event series processing computed fields.

Understand How Datasets are Joined

This topic explains how datasets are joined together during the lens build process, and what to expect in the resulting lens data. Joins only occur for datasets that have references to other datasets, and only when fields from the referenced datasets are included in the lens definition.

About Focus Datasets and Referenced Datasets

When you build a lens, you must choose one dataset as the starting point. This is called the focus dataset for the lens. The focus dataset may have references to other datasets, allowing you to choose dimension fields from both the focus dataset and the referenced datasets. If a lens includes fields from multiple datasets, then all of the selected fields are combined into one consolidated row in the lens output. This consolidation of fields is done by joining together the rows of the various datasets on the fields that they share in common.

The Default Join Behavior: (Left) Outer Joins

Consider a lens that includes fields from both the focus dataset and a referenced dataset. When Platfora builds this lens, it does an OUTER JOIN between the focus dataset and any referenced datasets. The OUTER JOIN operation compares rows in the focus dataset to related rows in the referenced datasets. If a row in the focus dataset cannot join to a row in the referenced dataset, then Platfora still includes these unjoined focus rows in the lens results. However, the values for the referenced fields that did not join are treated as NULL values. These NULL values are then replaced with default values and joined to the consolidated focus row. Platfora notifies you with an 'unjoined foreign keys' warning whenever there is a focus row that did not join.

How Lens Filters can Change Join Behavior to (Right) Inner Joins

If a lens filter is on a field from the focus dataset, then the default join behavior is still an OUTER JOIN. The focus dataset rows are used as the basis of the join. However, if the lens filter is on a field from a referenced dataset, the lens build process uses an INNER JOIN instead. The referenced dataset rows are used as the basis for comparison. This means that focus rows can potentially be excluded from the lens entirely. Before doing the join, the lens build first filters the rows of the referenced dataset and discards any rows that don't match the filter criteria. Then, the build joins the filtered referenced dataset to the focus dataset. When it uses an INNER JOIN, Platfora entirely excludes all unjoined rows from the lens results. Because the lens build performs the filter first and excludes unjoined rows, an INNER JOIN can return fewer focus rows than you may expect.

Create a Lens

A lens is always defined from the focal point of a single dataset in the data catalog. Once you have located a dataset that has the data you need, first check whether there are any existing lenses that you can use. If not, click Create Lens on the dataset details page to define and build a new lens from that dataset.
To create a new lens, you must have the Analyst system role (or above). You must have data access permissions on the source data and at least Define Lens from Dataset object permissions on the focus dataset, as well as on any datasets that are included in the lens by reference.

1. Go to the Platfora Data Catalog.
2. Find the dataset you want, and open it.
3. Go to the Lenses section and click Add.
4. In the lens builder panel, define your lens and choose the data fields you want to analyze.
a) Name your lens. Choose the name carefully - lenses cannot be renamed.
b) Choose the lens type. An aggregate lens is the default lens type, but you can choose to build an event series lens if your datasets are modeled in a certain way.
c) Choose lens fields. The types of fields you choose depend on the type of lens you are building.
d) (Optional) Define lens filters. Filters limit the scope of data being requested.
e) (Optional) Allow ad-hoc segments. Choose whether or not to allow vizboard users to create segments based on members in a particular referenced dataset.
5. Save and build the lens.

Name a Lens

The first step of defining a lens is to give it a meaningful name. The lens name should help users understand what kind of data they can find in the lens, so they can decide if it will meet their analysis needs. Choose the lens name carefully - you cannot rename a lens after it has been saved or built for the first time. You won't be able to save or build a lens until you give it a name. The lens name must also be unique - it can't be the same as any existing lens, dataset, or segment in Platfora. It is a good idea to give the lens a description as well, to help users understand what data is in the lens. You can always edit the description later.

Choose the Lens Type

There are two types of lenses you can create in Platfora: an Aggregate Lens or an Event Series Lens. The type of lens you can choose depends on the underlying characteristics of the dataset you pick as the focus of your lens. The type of lens you build also determines what kinds of visualizations you can create and what kinds of analyses you can perform when using the lens in a vizboard.

1. Aggregate lenses can be built from any dataset. Event series lenses can only be built from datasets that meet certain data modeling requirements. If your dataset does not meet the requirements for an event series lens, you will not see it as a choice.
2. When building an aggregate lens, you can choose any measure or dimension field from the current dataset. You can also choose additional dimension fields from any datasets that are referenced from the current dataset.
3. To build an event series lens, the dataset must have one or more event references created in it. Events are a special kind of reverse-reference that includes timestamp information. Events do not apply to aggregate lenses, only to event series lenses. When building an event series lens, you can choose dimension fields from the focus dataset or any related event dataset. Measure fields are not always applicable to event series analysis, since the data in an event series lens is not aggregated.

About Aggregate Lenses

An aggregate lens can be built from any dataset. There are no special data modeling requirements to build an aggregate lens.
Aggregate lenses contain aggregated measure data grouped by the various dimension fields you select from the dataset. Choose this lens type when you want to do ad hoc data analysis. An aggregate lens contains a selection of measure and dimension fields chosen from the focal point of a single fact dataset. A completed or built lens can be thought of as a table that contains aggregated measure data values grouped by the selected dimension values.

For example, suppose you had the following simple dataset containing 6 rows:

id | date | customer | product | quantity | unit price | total amount
1 | Jan 1 2013 | smith | tea | 2 | 1.00 | 2.00
2 | Jan 1 2013 | hutchinson | coffee | 1 | 1.00 | 1.00
3 | Jan 2 2013 | smith | coffee | 1 | 1.00 | 1.00
4 | Jan 2 2013 | smith | coffee | 3 | 1.00 | 3.00
5 | Jan 2 2013 | smith | tea | 1 | 1.00 | 1.00
6 | Jan 3 2013 | hutchinson | tea | 1 | 1.00 | 1.00

In Platfora, a measure is always aggregated data. So in the example above, the field total amount would only be considered a measure if an aggregate function, such as SUM, were applied to that field. A dimension is always used to group the aggregated measure data. Suppose we chose the product field as a dimension in our lens. There would be two groups in this case: coffee and tea.

If our lens only contained that one measure (sum of total amount) and that one dimension (product), then the data in the lens would look something like this:

dimension = product | measure = total amount (Sum)
tea | 4.00
coffee | 5.00

Suppose we added one more measure (sum of quantity) and one more dimension (customer) to our lens. The measure values are then calculated for each combination of dimension values. In this case, the data in the lens would look something like this:

dimensions = product, customer, product+customer | measure = total amount (Sum) | measure = quantity (Sum)
tea | 4.00 | 4
coffee | 5.00 | 5
smith | 7.00 | 7
hutchinson | 2.00 | 2
smith, tea | 3.00 | 3
smith, coffee | 4.00 | 4
hutchinson, tea | 1.00 | 1
hutchinson, coffee | 1.00 | 1

About Event Series Lenses

An event series lens can only be built from dimension datasets that have at least one event reference defined in them. It contains non-aggregated fact records, partitioned by the key of the focus dataset and sorted by the time an event occurred. Choose this lens type if you want to do time series analysis, such as funnel paths. To build an event series lens, the dataset you choose as the focus of your lens must meet the following data model requirements:

• The dataset must have a primary key.
• The dataset must have at least one event reference modeled in it. Events are a special kind of reverse-reference that associates a dimension dataset with a fact dataset, and designates a timestamp field for ordering of the fact records.
A completed or built lens can be thought of as a table that contains individual event records partitioned by the primary key of the dimension dataset, and ordered by a timestamp field. An event series lens can contain records from multiple event datasets, as long as the event references have been modeled in the dimension dataset.

For example, suppose you had a dimension dataset that contained these 2 user records. This dataset has a primary key (a user_id field that is unique for each user record in the dataset):

user_id | name
A | smith
B | hutchinson

This user dataset contains a purchase event reference that points to a dataset containing these 6 purchase event records:

transaction | date | user_id | product | quantity | unit price | total amount
1 | Jan 1 2014 | A | tea | 2 | 1.00 | 2.00
2 | Jan 1 2014 | B | coffee | 1 | 1.00 | 1.00
3 | Jan 2 2014 | A | coffee | 1 | 1.00 | 1.00
4 | Jan 3 2014 | A | coffee | 3 | 1.00 | 3.00
5 | Jan 4 2014 | A | tea | 1 | 1.00 | 1.00
6 | Jan 3 2014 | B | tea | 1 | 1.00 | 1.00

In an event series lens, individual event records are partitioned by the primary key of the dimension dataset and sorted by time. If our event series lens contained one measure (sum of total amount) and one dimension (product), then the data in the lens would look something like this:

user_id | date | product | total amount
A | Jan 1 2014 | tea | 2.00
A | Jan 2 2014 | coffee | 1.00
A | Jan 3 2014 | coffee | 3.00
A | Jan 4 2014 | tea | 1.00
B | Jan 1 2014 | coffee | 1.00
B | Jan 3 2014 | tea | 1.00

Notice that there are a couple of differences between event series lens data and aggregate lens data:

• The key field (user_id) and timestamp field (date) of the event are automatically included in the lens.
• Measure data is not pre-aggregated. Instead, individual event records are partitioned by the key field and ordered by time.

Having the lens data structured in this way allows analysts to create special event series viz types in the vizboard. Event series lenses allow you to analyze sequences of events, including finding patterns between multiple types of events (purchases and returns, for example).

Choose Lens Fields

Choosing fields for a lens depends on the lens type you pick (Aggregate Lens or Event Series Lens) and the type of analysis you plan to do. Aggregate lenses need both measure fields (aggregated variable data) and dimension fields (categorical data). Event series lenses only need dimension fields; measures are optional and not always applicable to event series analysis.

About Lens Field Types

Fields are categorized into two basic roles: measures and dimensions. Measure fields are the quantitative data. Dimension fields are the categorical data. A field also has an associated data type, which describes the types of values the field contains (STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE). Fields are grouped by the dataset they originate from. As you choose fields for your lens, you will notice that each field has an icon to denote what kind of field it is and where it originates from.

Measure (Numeric) - Measure fields are quantitative data that have an aggregation applied, such as SUM or AVG. Measures always produce aggregated values in Platfora. Measure values are always a numeric data type (INTEGER, LONG, FIXED, or DOUBLE) and are always the result of an aggregation. Every aggregate lens must have at least one measure. The default measure is Total Records (a count of the records in the dataset).
Measures are not applicable to event series lenses and funnel analysis visualizations.

Datetime Measure - Datetime measure fields are a special variety of measure fields. They are datetime data that have either the MIN or MAX aggregate function applied to them. Datetime measure values are always the DATETIME data type. Datetime measures are not applicable to event series lenses and funnel analysis visualizations.

Categorical Dimension - Dimension fields are used to filter dataset records, group measure data (in an aggregate lens), or define set conditions (in an event series lens). Categorical dimension fields contain STRING type data.

Numeric Dimension - Dimension fields are used to filter dataset records, group measure data (in an aggregate lens), or define set conditions (in an event series lens). Numeric dimension fields contain INTEGER, LONG, FIXED, or DOUBLE type data. You can apply an aggregate function to a numeric dimension field to turn it into a measure.

Date Dimension - Dimension fields are used to filter dataset records and group measure data. Date dimension fields contain DATETIME type data. Every datetime field also auto-creates a reference to Platfora's built-in Date and Time datasets. These date and time references allow you to analyze the time-based data at different granularities (week, day, hour, and so on).

Location Field - Location fields are a special kind of dimension field used only in geo map visualizations. They are comprised of a set of geo coordinates (latitude, longitude) and optionally a label name.

Current Dataset Fields - Fields that are within the currently selected dataset are grouped together at the top of the lens field list.

References - A reference groups together fields that come from another dataset. A reference joins two datasets together on a common key. You can select dimension fields from any dataset; however, you can only choose measure fields from the current dataset (if building an aggregate lens).

Geo References - A geo reference is similar to a regular reference. The difference is that geo references are used specifically for the purpose of linking to datasets containing location fields.

Events - An event is like a reference, except that the direction of the join is reversed. An event groups together fields that come from another dataset containing fact records that are associated with a point in time. Event fields are only applicable to event series lenses.

Segments - Platfora groups together all segment fields that are based on members of a referenced dimension dataset. You can select segment fields originally defined in any lens, as long as the segment is based on members in the referenced dimension dataset.

Segment Field - A segment is a special type of dimension field that groups together members of a population that meet some defined common criteria. A segment is based on members of a dimension dataset (such as customers) that have some behavior in common (such as purchasing a particular product). Any segment defined on a particular dimension dataset is available as a segment field in any lens that references that dataset. Segments are created in a viz based on the lens used in the viz. After a segment is created, Platfora runs a special lens build to populate the segment members. After segments are defined, you can optionally choose to include a segment field in any lens that references that dimension dataset. For more information, see Allow Ad-Hoc Segments.
Choose Fields for Aggregate Lenses

Every aggregate lens must have at least one measure field and one dimension field to be a valid lens. Choose only the fields you need to do your analysis. You can always come back and modify the lens later if you decide you need other fields. You can choose fields from the currently selected dataset, as well as from any datasets it references.

1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset or reference. Note that this does not apply to nested references and events. You must select fields from each referenced dataset independently.
2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.
3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the lens.
4. Open the quick field selector to confirm the measure aggregations you have chosen on a field. Original Value (the default) means the field will be included in the lens as a dimension.
5. Expand references to find additional dimension fields.
6. Use the Fields added to lens tab to confirm the field selections you have made.

Choose Measure Fields (Aggregate Lens)

Every aggregate lens needs at least one measure. In Platfora, measure fields are always the result of an aggregate calculation. If you have metric fields in your dataset that you want to use as the basis for quantitative analysis, you must decide how to aggregate those metrics before you build a lens.

1. Some measures are pre-defined in the dataset. Pre-defined measures are always at the top of the dataset field list.
2. Other non-measure fields can be converted into a measure by choosing additional aggregation types in the lens definition.

Define Quick Measures

A quick measure is an aggregation applied to a dimension field to turn it into a measure. You can add quick measures to your lens based on any dimension field in the current focus dataset (for an aggregate lens) or event dataset (for an event series lens).

1. First check if the dataset has pre-defined measures that meet your needs. Pre-defined measure fields are always listed at the top of the dataset. These measures are aggregated computed fields that have already been defined in the dataset. Clicking on a pre-defined measure will show the aggregate expression used to define the measure.
2. Find the field that you want to use as a measure and add it to your lens definition.
3. Click the gear icon to open the quick field selector.
4. Choose the measure aggregations you want to apply to that field:
• Sum (total) is available for numeric type dimension fields only.
• Avg (average) is available for numeric type dimension fields only.
• Distinct (a count of the distinct values in a column) is available for all field types.
• Max (highest value) is available for numeric type dimension fields only.
• Min (lowest value) is available for numeric type dimension fields only.
Each selection will create a new measure in the lens when it is built. Quick measure fields appear in the built lens with a name such as field_name(Avg).
5. Original Value also keeps the field in the lens as a dimension (grouping column) as well as aggregating its values for use as a measure.
For fields that have lots of unique values, it is probably best to deselect this option when building an aggregate lens.

Choose the Default Measure

The default lens measure is automatically added to new visualizations created from the lens. This allows a default chart to be shown in the vizboard immediately after the data analyst chooses a lens for their viz. If a lens does not have a default measure, the record count of the lens is used as the default measure.

1. Select the measure that you want to designate as the default lens measure. Only pre-defined measures can be used. You cannot designate quick measures as the default lens measure.
2. Make sure the measure field is added to the lens definition.
3. Click Default to This Measure.

Choose Dimension Fields (Aggregate Lens)

Every aggregate lens needs at least one dimension field. Dimension fields are used to group and filter measure data in an aggregate lens. You can add dimension fields from the currently selected focus dataset or any of its referenced datasets.

1. Dimension fields are denoted by a cube icon.
2. Click Add+ or Add- to add or remove all of the dimension fields grouped under a particular dataset or reference. Add+ and Add- do not apply to nested references. You must select fields from each referenced dataset independently.
3. Expand references to see the dimension fields available in referenced datasets.
4. Click the plus icon to add a dimension field to your lens.
5. Click a dimension field to see the details about it. The following information is available about a field, depending on its field type and whether or not the dataset has been profiled. Data heuristic information is only applicable to aggregate lenses.

Field Type - Either Base, Computed, or Measure. Base field values come directly from the source data. Computed field and measure values have been transformed or processed in some way.
Field Name - The name of the field as defined in the dataset.
Expression - If it is a computed field, the expression used to derive the field values.
Description - The description of the field that has been added to the Platfora dataset definition.
Example Data - Shows a sampling of the field values from 20 dataset rows. This is not available for certain types of computed fields, such as measures, event series computed fields, or computed fields that reference other datasets.
Data Type - The data type: STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE.
Default Value - The default value that will be substituted for NULL dimension values when the lens is built. If n/a, then Platfora will use the defaults of January 1, 1970 for datetimes, NULL for strings, and 0 for numeric data types.
Estimated Distinct Values - If the dataset has been profiled, an estimate of how many unique values the field has. This information is only applicable to aggregate lenses.
Data Distribution - If the dataset has been profiled, the top 20 values for that field and an estimation of how the top values are distributed across all the rows of the dataset. This information is only applicable to aggregate lenses.
Path - If it is a field from a referenced dataset, shows the dataset name, reference name, and field name.

Choose Segment Fields (Aggregate Lens)

Segments are members of a referenced dataset that have some behavior in common.
Once a segment has been created in a visualization, the segment field is available to include in any lens that references that dataset. You might want to include a segment field in a lens if it is commonly used in visualizations or you want to increase viz query performance.

1. Expand references to see the segments available in referenced datasets.
2. Expand Segments to see the segments available for a particular referenced dataset.
3. Segment fields are denoted by a cut-out cube icon.
4. Click the plus icon to add a segment field to your lens.
5. Click Add+ or Add- to add or remove all of the segment fields grouped under a particular referenced dataset. Add+ and Add- do not apply to nested references. You must select fields from each referenced dataset independently.
6. Click a segment field to see the details about it. The following information is available about a segment field.

Field Type - Always Segment.
Field Name - The name of the segment field as defined in the segment.
Built On - The date of the special segment lens build that populated the current members of the segment.
Segment Of - The referenced dimension dataset of which the segment values are a member. This dataset matches the referenced dataset under which the segment field is located.
Occurring in Dataset - The fact dataset that includes the behaviors the segment members have in common. This dataset may be the focus dataset in the current lens, or it may be a different dataset that references this dimension dataset.
Origin Lens - The lens used in the vizboard in which the segment was originally created.
Segment Conditions - The conditions defined in the segment that determine the segment members.
"IN" Value Label - The value label for records that are members of the segment.
"NOT IN" Value Label - The value label for records that are not members of the segment.
Selected Members - The number of segment members out of the total number of records in the referenced dataset.

Choose Fields for Event Series Lenses

For an event series lens, field selections are mostly dimension and timestamp fields. You can choose dimension fields from the currently selected dataset, and any fields from the event datasets it references. Measure fields (aggregated variables) are not applicable to event series analysis, since data is not aggregated in an event series lens.

1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset, reference, or event reference. Note that this does not apply to nested references and events. You must select fields from each referenced dataset independently.
2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.
3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the lens.
4. Open the quick field selector to confirm the selections are appropriate for an event series lens. In an event series lens, aggregated measures (such as SUM or AVG) are not applicable. For example, if you want to do funnel analysis on some metric of the dataset, make sure that Original Value (the default) is selected. This means the field will be included in the lens as a dimension.
5. Expand event references (or regular references) to find additional dimension fields.
6. Use the Fields added to lens tab to confirm the field selections you have made.

Timestamp Fields and Event Series Lenses

Timestamp fields have a special purpose in an event series lens. They are used to order all fact records included in the lens, including fact records coming from multiple datasets. Event series lenses have a global Timestamp field that applies to all event records included in the lens. There are also global Timestamp Date and Timestamp Time references, which can be used to filter records on different granularities of date and time.

Dataset records are not aggregated in an event series lens. Records are partitioned (or grouped) by the key of the focus dataset and ordered by a datetime field in the event dataset(s). For example, suppose you built an event series lens based on a customer dataset that had event references to a purchases dataset and a returns dataset. The lens would partition the event records by customer and order both types of events (purchases and returns) by the timestamp of the event record.

The global Timestamp field and the global Timestamp Date and Timestamp Time references apply to all event records included in the lens. This is especially relevant if the lens includes links to multiple event datasets. Because event series lenses order records by a designated event time (represented by the global Timestamp), other references to date and time may or may not be relevant to your event series analysis.

For example, suppose you were building an event series lens based on customers that contained both purchase and return events. The global Timestamp represents the purchase timestamp or the return timestamp of the corresponding event record. As an attribute of a customer, suppose you also had the date the customer first registered on your web site. This customer registration date may be useful for your analysis if you wanted to group or filter customers by how long they have been customers - for example, if you wanted to know 'How does new customers' purchase behavior differ from that of customers who registered over a year ago?'

Measure Fields and Event Series Lenses

In Platfora, measure fields are always the result of an aggregate calculation. Since event series lenses do not contain aggregated data, measure fields are not always applicable to event series analysis. Measure fields may be included in an event series lens; however, they may not show up in the vizboard (depending on the type of analysis you choose).

1. For event series lenses, you can only choose measures from a referenced event dataset, not from the currently selected dataset.
2. Pre-defined measures are listed at the beginning of an event dataset. If you add a measure to an event series lens, the aggregation will not be calculated at lens build time. Measure fields that are added to the lens will not show up in the Funnel viz type in Vizboards. Even though measure fields are not needed for event series analysis, the lens builder still requires every lens to have at least one measure. Open an event dataset and choose any measure so you won't get a 'No measure fields' error when you go to build the lens.
3. For event series lenses, quick field aggregations are not applicable. If you want to use a field for funnel analysis, make sure that Original Value is selected and any aggregations are unselected. This adds the field to the lens as a dimension.
Define Lens Filters

One way to limit the size of a lens is to define a filter to constrain the number of rows pulled in from the data source. You can only define filters on dimension fields - one filter condition per field. Filters are evaluated independently during a lens build, so the order in which they are added to the lens does not matter.

1. Select the dimension field to filter on.
2. Click the filter icon to the right of the field name. You can only define one filter per field.
3. Define the Filter expression. Filter expressions are always Boolean expressions, meaning they must evaluate to either true or false. Note that the selected field name serves as the first argument of the expression, followed by a comparison operator or logical operator, and then the comparison value. The comparison value must be of the same data type as the field you are filtering on. Some examples of lens filter expressions:

BETWEEN 2012-06-01 AND 2012-07-31
LAST 7 DAYS
LIKE("Plat*")
IN("Larry","Curly","Moe")
NOT IN ("Saturday","Sunday")
< 50.00
>= 21
BETWEEN 2009 AND 2012
IS NOT NULL

4. Click Save.
5. Make sure the filter is added in the Filters panel.

Lens Filters on DATETIME Type Fields

This section contains special considerations you must make when filtering on datetime type values. Filter conditions on DATETIME type fields must be in the format YYYY-MM-DD, without enclosing quotes or any other punctuation. If the date value is in string format rather than a datetime format, the value must be enclosed in quotes.

Date-based filters can do absolute or relative comparisons. An absolute date comparison specifies a specific boundary, such as:

>= 2013-01-01

A filter expression can also specify a range of dates using particular dates in addition to the allowed comparison operators:

BETWEEN 2013-06-01 AND 2013-07-31

When specifying a range of dates, the earlier date should always come first.

Relative comparisons are always relative to the current date. Relative date filters use the following format:

LAST <integer> DAYS

LAST 7 DAYS
LAST 0 DAYS

When using a relative date filter, the filter includes all data from the current day. The current day is defined as the day in Coordinated Universal Time (UTC) when the lens build began. Therefore, the expression LAST 0 DAYS includes data from the current day only, and the expression LAST 1 DAYS includes data from the current day and the previous day.

You can use a relative date filter together with a lens build schedule to define a rolling time window. For example, you could define a lens filter expression of LAST 7 DAYS and schedule the lens to build nightly. This way, the lens always contains the previous week's worth of data.

Lens Filters on Input Partition Fields

An input partitioning field is a field in a dataset that contains information about how to locate the source files in the remote file system. Defining a filter on these special fields eliminates source data files as the very first step of the lens build process. Not all datasets will have input partitioning fields. If there are partitioning fields available, the lens page displays a special icon next to them. You should look for these special fields when you build a lens. Adding a filter on these fields reduces the amount of source data to be scanned and processed during a lens build.
See Understand Input Partitioning Fields for more information about these special fields and how they affect lens build processing.

Troubleshoot Lens Filter Expressions

Invalid lens filter expressions don't always result in an error in the web application. Some invalid filter expressions are only caught during a lens build and can cause the lens build to fail. This section covers some common lens filter expression mistakes that can cause an error or a lens build failure.

A comparison expression must compare values of the same data type. For example, if you create a filter on a field that is an INTEGER data type, you can't specify a comparison argument that is a STRING.

Lens filters often compare a field value to a literal value. Specifying a literal value correctly depends on its data type (string, numeric, or datetime). For example:

• Date literals must be in the format yyyy-MM-dd, without any enclosing quotation marks or other punctuation.
• String literals are enclosed in double quotes ("). If the string itself contains a quote, it must be escaped by doubling the double quote ("").

When specifying a range of dates, the earlier date should always come first. For example, when using the BETWEEN operator:

use BETWEEN 2013-07-01 AND 2013-07-15 (correct)
not BETWEEN 2013-07-15 AND 2012-07-01 (incorrect)

For relative date filter expressions, the only valid date range keyword is DAY or DAYS. For example:

use LAST 7 DAYS (correct)
not LAST 1 WEEK (incorrect)

Below are some more examples of incorrect lens filter expressions, and their corrected versions.

Filtered Field | Lens Filter with Error | Corrected Lens Filter | What's Wrong?
Date.Year | Date.Year = "2012" | Date.Year = 2012 | Can't compare an INTEGER field to a STRING literal
OrderDate | BETWEEN "2013-07-01" AND "2013-07-15" | BETWEEN 2013-07-01 AND 2013-07-15 | Can't compare a DATETIME field to a STRING literal
Date.Year | 2012 | = 2012 | No comparison operator
Title | IN(Mrs,Ms,Miss) | IN("Mrs","Ms","Miss") | String literals must be quoted
Height | = "60"" | = "60""" | Quotes in a string literal must be escaped
Height | LIKE("\d\'(\d)+"") | LIKE("\d\'(\d)+""") | Quotes in a regular expression must be escaped
PurchaseDate | LAST 1 WEEK | LAST 7 DAYS | Unsupported keyword for relative dates
PurchaseDate | BETWEEN 2013-07-15 AND 2012-07-01 | BETWEEN 2013-07-01 AND 2013-07-15 | Invalid date range

Allow Ad-Hoc Segments

When the focus dataset in a lens references other datasets, you can choose whether or not to allow vizboard users to create and use ad-hoc segments based on the members of the referenced datasets. You can enable this option per reference in the lens.

A segment is a special type of dimension field that vizboard users can create in a viz to group together members of a population that meet some defined common criteria. You might want to allow users to create and use ad-hoc segments so they can use segmentation analysis to analyze members of a population and perform side-by-side comparisons. When ad-hoc segments are enabled for a reference in a lens, vizboard users have the ability to create ad-hoc segments in a viz. Additionally, they can use other segments that have been created for that reference if they are granted permission on the segment.

Allowing ad-hoc segments may increase the lens size, depending on the cardinality of the referenced dataset. By default, ad-hoc segments are not allowed for references in a lens due to lens size considerations.
If the lens already includes the primary key from the referenced dataset, allowing ad-hoc segments for that reference doesn't significantly increase the lens size.

After a segment has been created, you can choose to include the segment field in the lens. Segment fields included in a lens perform faster in a viz than the equivalent ad-hoc segment. For more information on including a segment field in a lens, see Choose Segment Fields (Aggregate Lens).

To allow vizboard users to create and use segments based on members of a particular referenced dataset, click Ad-Hoc for that reference in the lens builder.

Estimate Lens Size

The size of an aggregate lens is determined by how much source data you request (the number of input rows), the number of dimension fields you select, and the cardinality (or number of unique values) of the dimension fields you select. Platfora can help estimate the size of a lens by profiling the data in the dataset.

About Dataset Profiles

Dataset profiling takes a sampling of rows (50,000 by default) to determine the characteristics of the data, such as the number of distinct values per field, the distribution of values, and the size of the various fields in the dataset.

Profiling a dataset runs a series of MapReduce jobs in Hadoop, and builds a special-purpose lens called a profile lens. This lens cannot be opened or used in a vizboard like a regular lens. The sole purpose of the profile lens is to scan the source data and capture the data heuristics. These heuristics are then used to estimate lens output size in the lens builder. Having more information about the source data can guide users to make better choices when they create a lens, which can reduce the overall lens size and build time.

Some good things to know about dataset profiles:

• When you profile a dataset, any of its referenced datasets are profiled too.
• You do not need to rerun dataset profile jobs every time new source data arrives in Hadoop. The data characteristics of a dataset typically do not change that often.
• You must have Define Lens from Dataset object permission on a dataset in order to profile it.
• The time it takes to run a profile job depends on the amount of source data in Hadoop. If there is a lot of data to scan and sample, it can take a while.
• Profile lenses use the naming convention dataset_name profile.
• The data heuristics collected during profiling are only applicable to estimating the output size of an aggregate lens. There is no lens output size estimation for event series lenses.

Profile a Dataset

You can profile a dataset as long as you have data access and Define Lens from Dataset permissions on the dataset. Profiling a dataset initiates a special lens build to sample the source data and collect its data characteristics.

1. Go to the Data Catalog and choose the Datasets tab.
2. Use the List view or the Quick Find to locate the dataset you want to profile.
3. Choose Profile Dataset from the dataset action menu.
4. Click Confirm.
5. To check the status of the profile job, go to the System page and choose the Activities tab.

After the profile job is finished, you can review the results that are collected when you build or edit a lens from that dataset. Dataset profile information can only be seen in the lens builder panel, and is only applicable when building aggregate lenses.
About Lens Size Estimates

Platfora uses the information collected by the dataset profile job to estimate the input and output size of a lens. Dataset profile and estimation information is shown in the lens builder workspace.

Lens size estimates change dynamically as you add or remove fields and filters in the lens definition. Lens estimates are calculated before the lens is built, so they may differ from the actual size of the built lens.

Profile data is not available for certain kinds of computed fields that require multi-row or multi-dataset processing, such as aggregated computed fields (measures), event series processing computed fields, or computed fields that refer to fields in other datasets. As a result, estimates may be off by a large factor when a lens contains many of these types of computed fields. This is especially true for DISTINCT measures.

Lens input size estimates apply to both aggregate and event series lens types. However, lens output size estimates are only applicable to aggregate lenses.

Lens Input Size Estimates

Lens input size estimates reflect how much source data will be scanned in the very first stage of a lens build. Lens input size estimates are available on all datasets (even datasets that have not been profiled yet). Input size estimation is applicable to all lens types.

1. Platfora looks at the source files in the Hadoop data source location to estimate the overall dataset size. You do not need to profile a dataset to estimate this number. If the source data files are compressed, this represents the compressed file size.
2. The size of the input data to a lens build can be reduced if the dataset has input partitioning fields. The partitioning fields of a dataset are denoted with a special filter icon. Not all datasets have partitioning fields, but if they do, the names of those fields are shown under the Input data size estimate.
3. After applying a lens filter, the lens builder estimates the percentage of the total dataset size that will be excluded from the lens based on the filter.

Lens Output Size Estimates

Lens output size refers to how big the final lens will be after it is built. Lens output size estimates are only available for datasets that have been profiled. Output size estimation is only applicable to aggregate lenses (not to event series lenses).

1. Based on the fields you have added to the lens, Platfora uses the profile data to estimate the output size of the lens. The estimate is shown as a range.
2. The relative dimension size helps you identify which dimension fields are the largest (have the most distinct values), and therefore are contributing most to the overall lens size. Hover your mouse over a mark in the distribution chart to see the field it represents.
3. Dimension fields are marked with small/medium/large size icons to denote the cost of adding that field to a lens. Small means the field has less than 1,000 unique values. Medium means the field has between 1,001 and 9,999 unique values. Large means the field has over 10,000 unique values.
4. When you select a dimension field, you can see the data characteristics for that field, such as:

• An estimate of how many unique values the field has.
• The top 20 values for that field and an estimate of how the top values are distributed across all the rows of the dataset.
• A sampling of the field values from 20 dataset rows. This is not available for certain types of computed fields, such as event series computed fields, or computed fields that reference other datasets.
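To see why large dimensions dominate lens size, a rough rule of thumb (reasoning not from this guide) is that the number of aggregated output rows is bounded both by the number of input rows and by the product of the dimension cardinalities, since only combinations that actually occur are stored:

max output rows = min(input rows, |dim1| x |dim2| x ... x |dimN|)

For example, choosing three dimensions with roughly 7, 500, and 20 unique values yields at most 7 x 500 x 20 = 70,000 aggregated rows, and usually far fewer because many combinations never occur together. A single large dimension (10,000+ unique values) can therefore inflate a lens far more than several small ones.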
Manage Lenses

After a lens is defined, you can edit it, build it, check its status, update/rebuild it (refresh its data), or delete it. Lenses can be managed from the Data Catalog or System page.

Edit a Lens Definition

After working with a lens, you may realize that there are fields that you do not need, or additional fields that you'd like to add to the lens. You can edit the lens definition to add or remove fields as long as you have data access and edit permissions on the lens. Lens definition changes will not be available in a vizboard until the lens is rebuilt.

You can edit a lens definition from the Data Catalog page or from a vizboard. To edit a lens, you must have an Analyst role or above. You must have Edit object permissions on the lens and data access permissions on the dataset, as well as on any datasets that are included in the lens by reference.

Some things to know about changing a lens definition:

• If you remove fields from a lens definition, it may break visualizations that are using that lens and those fields.
• Changing the lens definition requires a full lens rebuild. Rather than just incrementally processing the most recently added source data, all of the source data must be re-processed.

Update Lens Data

Once a lens has been built, you can update it to refresh its data at any time. You may want to update a lens if new data has arrived in the source Hadoop system, or if you have changed the lens definition to include additional fields. Depending on what has changed since the last build, subsequent lens builds are usually a lot faster.

To rebuild a lens, you must have an Analyst role or above. You must have Edit object permissions on the lens and data access permissions on the dataset, as well as on any datasets that are included in the lens by reference.

1. Go to the Data Catalog and choose the Lenses tab.
2. Use the List view or the Quick Find to locate the lens you want to update.
3. Choose Rebuild from the lens action menu.
4. Click Confirm.
5. To check the status of the lens build job, go to the System page and choose the Activities tab.

Delete or Unbuild a Lens

Deleting or unbuilding a lens is a way to free up space in Platfora for lenses that are no longer needed or used. Deleting a lens removes the lens definition as well as the lens data. Unbuilding a lens only removes the lens data (but keeps the lens definition in case you want to rebuild the lens at a later time).

To delete or unbuild a lens, you must have an Analyst role or above and have Own object permissions on the lens. Deleting or unbuilding a lens should be done with care, as doing so will invalidate all of the visualizations that depend on that lens.

1. Go to the Data Catalog and choose the Lenses tab.
2. Use the List view or the Quick Find to locate the lens you want to delete or unbuild.
3. Choose Unbuild or Delete from the lens action menu.
4. Click Confirm or Delete (depending on the action you chose).

When you unbuild or delete a lens, the lens data is immediately cleared from the Platfora servers' disk and memory.
However, the built lens data will remain on disk in Hadoop for a period of time (the default is 24 hours) in case you change your mind and decide to rebuild it.

Check the Status of a Lens Build

Depending on the size of the source data requested, it may take a while to process the requested data in Hadoop to build a Platfora lens. The lens is not available for use in visualizations until the lens build is complete. You can check the status of a lens build on the System page.

1. Go to the System page.
2. Go to the Lens Builds tab.
3. In the Activities section, click In Progress.
4. Find the lens in the list and expand it to see the build jobs running in Hadoop.
5. Click on a build status message to see more detailed messages about the tasks of a particular job. This shows the messages from the Hadoop MapReduce JobTracker.

Manage Lens Notifications

You can configure a lens so Platfora sends an email message to users when it detects an anomaly in lens data. Do this by defining a lens notification. You might want to define a lens notification so data analysts know when to view a vizboard to analyze the data further.

When Platfora builds a lens, it queries the lens data to search for data that meets the defined criteria. When it detects that some data meets the criteria, it notifies users with the results of the query it ran against the lens. Analysts might then choose to view a vizboard based on the same lens so they can investigate the data further.

Consider the following rules and guidelines when working with lens notifications:

• Platfora must be configured to connect to an email server using SMTP.
• To define a rule, the lens must already be built.
• You can define multiple notification rules per lens. Each rule results in a different email message.
• The rule can only return results for one measure in the lens. However, it can filter on multiple measures.
• The email message only contains data on the fields chosen in the lens notification rule.
• Platfora sends email notifications after a lens is built, not after the lens notification is created or edited.
• You can disable a lens notification after it has been defined. You might want to do that to temporarily stop notifying users while retaining the logic defined in the notification rule.

Add a Lens Notification Rule

Define a lens notification rule so Platfora sends an email to notify users when the data in a lens build meets specified criteria. You can define multiple rules per lens.

To add a lens notification rule, you must have an Analyst role or above. You must have Edit object permissions on the lens and data access permissions on the dataset, as well as on any datasets that are included in the lens by reference.

1. Go to the Lenses tab in the Data Catalog and find the lens for which you want to define a lens notification rule.
2. Choose Notifications from the lens action menu.
3. In the Notifications dialog, click Create. A dialog appears where you can define a lens notification rule.
4. Enter a name for this notification rule.
5. Define the query to run against the lens after the lens is built. You must select one measure field, and you may select zero or more dimension fields. You can group by dimension fields, and filter the scope of rows by either dimension or measure fields. Click the icon to include more fields in the query.
6. Define the criteria in the query results that triggers Platfora to send the email notification. Select the icon to define additional criteria.
7. Enter one or more email addresses that should receive the notification messages. Separate multiple email addresses with commas (,).
8. Choose whether the lens notification email should be sent when the criteria defined here are met or not met.
9. Click Save.

The lens notification rule is created and enabled by default.

Disable a Lens Notification Rule

Disabling a lens notification rule allows you to temporarily stop notifications while retaining the logic defined in the notification rule.

1. Go to the Lenses tab in the Data Catalog and find the lens for which you want to disable a lens notification rule.
2. Choose Notifications from the lens action menu.
3. In the Notifications dialog, clear the check box for the notification rule you want to disable.
4. Click Close.

Schedule Lens Builds

You can configure a lens to be built at specific times on specific days. By default, lenses are built on demand, but when you define a schedule for a lens, it is built automatically at the times and days specified in the schedule. You might want to define a schedule so the lens is built nightly, outside of regular working hours.

About Lens Schedules

When you create or edit a schedule, you define one or more rules. A rule is a set of times and days that specify when to build a lens. You might want to create multiple rules so the lens builds at different times on different days. For example, you might want to build the lens at 1:00 a.m. on weekdays, and 8:00 p.m. on weekends. Lens build start times are determined by the clock on the Platfora server.

Users who have permission to build a lens can define and edit its schedule. Lens builds are run by the user who last updated or created the lens build schedule. This is important because that user's lens build size limit applies to the lens build. For example, if a user with a role type that has permission to build unlimited-size lenses creates a schedule, and then a user with a role type that has permission to build 100 GB lenses edits the schedule, the lens will only build successfully if it is less than 100 GB.

If a scheduled lens build starts while the same lens is already being built, the scheduled lens build is skipped and the in-progress lens build continues.

You can define rules with the following day and time repeat patterns:

• Specific days of the week at specific times. For example, every Monday, Tuesday, Wednesday, Thursday, and Friday at 11:00 pm.
• Specific days of the week at repeated hourly intervals. For example, every Saturday and Sunday, every four hours starting at 12:15 am.
• Specific days of the month at specific times. For example, on the first and 15th of every month at 1:00 am.

Create a Lens Schedule

You can configure a schedule for a lens so it is built at specific times on specific days. You define the lens schedule when you edit the lens. The schedule is saved whether or not you save your changes on the lens page.

To create a lens schedule, you must have an Analyst role or above. You must have Edit object permissions on the lens and data access permissions on the dataset, as well as on any datasets that are included in the lens by reference.

1. Go to the Lenses tab in the Data Catalog and find the lens for which you want to define a schedule.
2. Choose Schedule from the lens action menu.
3. In the Lens Build Schedule dialog, define a rule for the schedule using the Day of Week or Day of Month rules.
4. (Optional) Click Add another rule to define an additional rule for the schedule. Lenses are only built once if you define multiple overlapping rules for the same time and day.
5. (Optional) Select Export this lens once the build completes if you want to export the lens data in CSV format. The export files will be created in the remote file system location you specify. For example, to export to HDFS, the location URL would look something like this:

hdfs://10.80.231.123:8020/platfora/exports

6. Click OK.

View All Scheduled Builds

Users with the System Administrator role can view all scheduled lens builds. Additionally, they can pause (and later resume) a scheduled lens build, which might be useful during maintenance windows or times of unusually high lens build demand.

1. Go to the System page.
2. Go to the Activities tab.
3. Click View Full Schedule.
4. The Scheduled Activities dialog displays the upcoming scheduled lens builds and vizboard PDF emails.
a) (Optional) Click a column name to sort by that column.
b) (Optional) Click Pause for a scheduled lens build to prevent the lens from building at the scheduled time. It will remain paused until someone resumes the schedule.
5. Click OK.

Manage Segments—FAQs

After a segment is defined in a vizboard, you can edit it, update the segment members, delete it, schedule updates, and show the data lineage. After creating a segment, the segment appears as a catalog object on the Data Catalog > Segments page. This topic answers some frequently asked questions about managing segments.

How do I view a segment definition?

Any user with data access permission on the underlying datasets can view a segment definition. Click the segment on the Data Catalog > Segments page. You can view the segment and its conditions, but cannot edit it.

How do I edit a segment definition?

To edit a segment, you must have an Analyst role or above. You must have Edit object permissions on the segment and data access permissions on the datasets used in the segment.

You can edit a segment definition from the Data Catalog page or from a vizboard. When editing a segment, you can add or remove conditions and edit the segment value labels for members and nonmembers.

How do I update the segment members when the source data changes?

Once a segment is created, Platfora creates its own special type of lens behind the scenes to create and populate the members of the segment. To update the segment members from the latest source data, you rebuild the segment lens. To rebuild a segment, you must have Edit object permission on the segment.

Choose Rebuild from the segment's menu on the Data Catalog > Segments page.

Can I delete a segment?

Yes. Deleting a segment removes the segment definition and its data from the Platfora catalog. To delete a segment, you must have Own object permission on the segment.

Any visualization using the segment in an analysis will show an error if the segment is deleted from the dataset. To use these visualizations without error, remove the deleted segment from the drop zone that contains it.

You can't delete segments that are currently included in a lens.
To delete a segment that is included in a lens, remove it from the lens and then rebuild the lens.

Choose Delete from the segment's menu on the Data Catalog > Segments page. Segments created from an aggregate lens can also be deleted using the Delete button when editing the segment from a viz or from the Data Catalog page.

Can I configure a segment lens build to build on a defined schedule?

Yes, you can configure segment lenses to be built at specific times on specific days like other lens builds. To schedule a segment, you must have Edit object permission on the segment.

Choose Schedule from the segment's menu on the Data Catalog > Segments page. For more details on how to define a schedule for a segment lens build, see Schedule Lens Builds.

Can I show the data lineage for a segment?

To show the data lineage for a segment, you must have Edit object permission on the segment.

The data lineage report for a segment shows the following types of information:

• Segment conditions
• Lens field names
• Reference field names
• Filter expressions
• Field expressions
• Lens names
• Dataset names
• Data source names
• Data source locations
• Lens build specific source file names, including their paths
• Timestamps

Choose Show info & lineage from the segment's menu on the Data Catalog > Segments page.

Chapter 6: Export Lens Data

Platfora allows you to export lens data for use in other tools or applications. Full lens data can be exported in comma-separated values (csv) format to a remote file system such as HDFS or Amazon S3. You can also get a portion of lens data out of Platfora by exporting the results of a lens query or visualization.

Topics:
• Export an Entire Lens as CSV
• Export a Partial Lens as CSV
• Query a Lens Using the REST API
• FAQs - Lens Export Basics

Export an Entire Lens as CSV

Exporting a lens writes out the data in parallel to a distributed file system such as HDFS or S3. You can export an entire lens from the Platfora data catalog.

Make sure you have the correct URL and path information for the remote file system, and that Platfora has write permissions to the specified export location. Also make sure there is enough free space in the export location. If the export location does not exist, Platfora will create it if it has the appropriate permissions.

1. Go to the Lenses tab in the Data Catalog.
2. From the lens Actions menu, select Export Data as CSV.
3. Enter the Export Destination, which is a URI of the export location in the remote file system. The format of the URI is:

native_filesystem_protocol://hostname:port/path-to-export-location

For example, a URI to a location in HDFS:

hdfs://10.80.231.123:8020/platfora/exports

For example, a URI to a location in an S3 bucket:

s3n://your-bucket-name/platfora/exports

If exporting to S3, make sure Platfora also has your Amazon access key id and secret key entered in the properties platfora.S3.accesskey and platfora.S3.secretkey. Platfora needs these to authenticate to your Amazon Web Services (AWS) account.

4. Click Write.
5. A notification message will appear when the lens export completes.

In the remote file system, a directory is created in the specified export location using the directory naming convention:

export-location/lens-name/timestamp

The lens data is exported in parallel and is usually split across multiple export files.
The export location contains a series of csv.gz lens data files, and a .success file if the export completed successfully.

Export a Partial Lens as CSV

Exporting a partial lens writes out the results of a lens query to a distributed file system such as HDFS or S3. You can export a partial lens from a single viz in a vizboard.

Make sure you have the correct URL and path information for the remote file system, and that Platfora has write permissions to the specified export location. Also make sure there is enough free space in the export location. If the export location does not exist, Platfora will create it if it has the appropriate permissions.

1. Go to Vizboards and open the vizboard containing the data you want to export.
2. From the viz export menu, select Export Data as CSV.
3. Enter the Export Destination, which is a URI of the export location in the remote file system. The format of the URI is:

native_filesystem_protocol://hostname:port/path-to-export-location

4. Click Write.
5. A notification message will appear when the lens export completes.

Query a Lens Using the REST API

Platfora provides a SQL-like query language that you can use to programmatically access data in a lens. You can submit a SELECT statement using the REST API, and the query results are returned in CSV format.

The syntax used to define a lens query is similar to a SQL SELECT statement. Here is an overview of the syntax used to define a lens query:

[ DEFINE new-computed-field_alias AS computed_field_expression ]
SELECT measure-fields, dimension-fields
FROM aggregate-lens-name
[ WHERE dimension-filter-expression ]
GROUP BY dimension-fields
[ SORT BY measure-field [ASC | DESC] [LIMIT number] ]
[ HAVING measure-filter-expression ]

The LIMIT clause applies to the group formed by the GROUP BY clause, not the entire lens.

If you have been using Platfora vizboards, you have already been generating lens queries by creating visualizations; the query language clauses map directly to actions in the viz builder. For more information about the lens query language syntax and usage, see the Lens Query Language Reference.

1. Write a lens query SELECT statement. For example:

SELECT [Total Records],[Lease Status],Carrier.Name
FROM Planes
WHERE Carrier.Name NOT IN ("NULL")
GROUP BY [Lease Status],Carrier.Name
HAVING [Total Records] > 100

Notice how lens field names containing spaces are escaped by enclosing them in brackets. Also notice the dot notation used to refer to a field from a referenced dataset.

2. Depending on the REST client you are using, you may need to URL encode the query before submitting it via the REST API. For example, here is the URL-encoded version of the previous lens query:

SELECT+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100

3. Submit the encoded query string via the REST API. For example, using the cURL command-line utility:

curl -u admin:admin "http://localhost:8001/api/v1/query?query=SELECT+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100" >> query_output.csv

Notice that the first part of the URL specifies the Platfora server hostname and port.
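If your REST client can do the URL encoding for you, you can skip the manual encoding step. For example, here is a minimal sketch using cURL's standard -G and --data-urlencode options against the same endpoint and credentials as above:

curl -u admin:admin -G "http://localhost:8001/api/v1/query" \
  --data-urlencode 'query=SELECT [Total Records],[Lease Status],Carrier.Name FROM Planes WHERE Carrier.Name NOT IN ("NULL") GROUP BY [Lease Status],Carrier.Name HAVING [Total Records] > 100' \
  >> query_output.csv

The -G option forces a GET request and appends the encoded query parameter to the URL, producing the same request as the manually encoded example.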
These examples connect to localhost using the default admin username and password. Notice the latter part of the URL, which specifies the REST API endpoint:

/api/v1/query

The GET method for this endpoint expects one input parameter, query, which is the encoded query string. The output is returned in CSV format, which you can redirect to a file if you want to save the query results.

FAQs - Lens Export Basics

Lens data exports allow you to copy data out of Platfora. This topic answers some frequently asked questions about lens data exports.

How can I allow or prevent exporting lens data?

Lens exports and downloads are controlled by two configuration settings: platfora.permit.export.data and platfora.permit.lens.to.desktop.

The platfora.permit.export.data setting is a global setting that controls the display of all data download and export GUI controls. When this setting is true, users can export an entire lens as CSV from the Data Catalog to a DFS (for example, HDFS or S3). The platfora.permit.export.data setting also allows users to export/download individual viz data from the Vizboards interface. Downloading/exporting viz data from a vizboard exports only a portion of a lens, not the entire lens.

When both the platfora.permit.export.data setting and the platfora.permit.lens.to.desktop setting are true, users can download the full lens from the Data Catalog to their desktop as CSV. The platfora.permit.lens.to.desktop setting is an experimental setting because downloading a large amount of lens data to a desktop can cause the Platfora application to run out of memory. Use this setting with caution.

To control the ability of individual users or groups to export lens data, you must use permissions.

What kind of permission is needed to export lens data?

You must have an Analyst Limited system role or above to export or download lens data in CSV format, as well as data access permissions to all of the datasets included in the lens. To export the lens data to a location in a remote file system (such as HDFS or S3), Platfora must have write permissions to the export directory you specify.

How much lens data can I export?

For lens data that you download to your desktop, there is a default maximum row limit of one million rows. This limit can be adjusted using the platfora.csv.download.max.rows property. Of course, the size of a lens row can vary greatly - you may have 'wide' rows (lots of lens fields) or 'skinny' rows (just a few lens fields). The download row limit is just a general safety threshold to prevent too much export data from crashing your browser client.

For lens data that you export to a remote file system, there is no hard size limit, but you are limited by the amount of disk space in the remote export location and the amount of memory in the Platfora cluster. Very large lenses can be costly to export in terms of memory usage. To prevent a large lens export from using all of the memory on the Platfora servers, only one lens export query can execute at a time.

Can I download an entire lens to my desktop?

When both the platfora.permit.export.data setting and the platfora.permit.lens.to.desktop setting are true, users can download a lens from the Data Catalog to their desktop as CSV. The platfora.permit.lens.to.desktop setting is experimental; use it with caution. Downloading a large amount of lens data to a desktop can cause the Platfora application to run out of memory.
Users can always download a partial lens from a viz in a vizboard. This requires that platfora.permit.export.data is true.

What is the performance impact of a large lens export?

In order to export data from a lens, Platfora uses the in-memory query engine to select the export data. This is the same query engine that other Platfora users rely upon to build visualizations. Lens export data is treated just like any other lens query. This means that large lens exports can impact performance for other vizboard users by competing for in-memory resources.

Is there a way to request just some of the lens data rather than the entire lens?

Yes. If you don't want to export all of the data in a lens, you can use the vizboard to construct a viz that limits the number of fields to export and filters the number of rows requested. Then you can export just the data comprising that viz. You can also programmatically export lens data by submitting a lens query using Platfora's REST API.

What is the file format of the exported data?

Lens data is exported in comma-separated values (csv) format, and when the files are exported to a remote file system they are compressed using gzip (gz). The first row of an export file is a header row containing the lens field names. Measure field names are enclosed in brackets []. Data values are enclosed in double quotes (") and separated by commas. If a data value contains a double-quote character, it is escaped using a double double-quote (""). The column order in the export file is dimension fields first (in alphabetical order) followed by measure fields (in alphabetical order).

Where are the export files located?

When you choose to export lens data via a download to your local computer, a single export file is created on your Desktop (for Windows) or in Downloads (for Mac). The file naming convention is:

dataset-name_lens-name_epoch-timestamp.csv

When you choose to export lens data to a remote file system such as HDFS or S3, a directory is created in the specified export location using the directory naming convention:

export-location/lens-name/timestamp

When exporting data to a remote file system, the lens data is exported in parallel and is usually split across multiple export files. The export location contains a series of csv.gz lens data files, and a .success file if the export completed successfully.

Can I automate data exports following a scheduled lens build?

Yes. When you create a lens build schedule, there is an option to export the lens data after the lens build completes. You must supply a location in the remote file system to copy the export files to.

Chapter 7: Platfora Expressions

Platfora comes with a powerful, flexible built-in expression language that you can use to transform, manipulate, and query data. This section describes Platfora's expression language and how to use it to define dataset computed fields, vizboard computed fields, measures, lens filters, and lens query statements.

Topics:
• Expression Building Blocks
• PARTITION Expressions and Event Series Processing (ESP)
• ROLLUP Measures and Window Expressions
• Computed Field Examples
• Troubleshoot Computed Field Errors
• Write a Lens Query
• FAQs - Expression Basics
• Platfora Expression Language Reference

Expression Building Blocks

This section explains the building blocks of an expression, and the general rules for constructing a valid expression.
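To give a feel for how these building blocks combine, here is a small illustrative expression (the field names year, month, and day are hypothetical; the functions are the ones described below). It nests one function inside another, mixes field values with literal strings, and produces a DATETIME value:

TO_DATE(CONCAT(year,"-",month,"-",day),"yyyy-MM-dd")

CONCAT joins the three field values with literal "-" separators, and TO_DATE then parses the resulting string using the supplied format pattern.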
Functions in an Expression

Functions perform common data processing tasks. While not all expressions contain functions, most do. This section describes the basic concepts you need to know to use functions.

Function Inputs and Outputs

Functions take one or more input values and return an output value. Input values can be a literal value or the name of a field that contains a value. In both cases, the function expects the input value to be a particular data type, such as STRING or INTEGER. For example, the CONCAT() function combines STRING inputs and outputs a new STRING.

This example shows how to use the CONCAT() function to concatenate the values in the month, day, and year fields, separated by the literal forward slash character:

CONCAT(month,"/",day,"/",year)

A function's return value may be the same as its input type, or it may be an entirely new data type. For example, the TO_DATE() function takes a STRING as input, but outputs a DATETIME value. If a function expects a STRING but is passed another data type as input, the function returns an error.

Typically, functions are classified by the data type they take or the purpose they serve. For example, CONCAT() is a string function and TO_DATE() is a data type conversion function. You'll find a complete list of functions by type in Platfora's Expression Language Reference.

Nesting Functions

Functions can take other functions as arguments. For example, you can use the CONCAT() function as an argument to the TO_DATE() function. The final result is a DATETIME value in the format 10/31/2014.

TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")

The nested function must return the correct data type. Because TO_DATE() expects string input and CONCAT() returns a string, the nesting succeeds. Only row functions allow nesting. Aggregate functions do not allow nested expressions as input.

Aggregate Functions versus Row Functions

Most functions process one value from one row at a time. These are called row functions because they operate on one value from a single row at a time.

Aggregate functions are a special class of functions. Unlike row functions, aggregate functions process the values from multiple rows together into a single return value. Some examples of aggregate functions are:

• SUM()
• MIN()
• VARIANCE()

Aggregate functions are also special because you use them to define measures. Measures always return numeric values that serve as the quantitative data in an analysis. Aggregate expressions are often referred to as measure expressions in Platfora.

Limitations of Aggregation Functions

Unlike row functions, aggregate functions can only take simple expressions as input (such as field names or literal values). Aggregate functions cannot take row functions as arguments. You also cannot use an aggregate function as input to a row function. You cannot mix aggregate functions and row functions together in one expression. A sketch of how to work around this limitation appears after the following list.

Finally, while you can build expressions in both the dataset and the vizboard, only the following aggregate functions are allowed in a vizboard computed field expression:

• DISTINCT()
• MIN()
• MAX()
• ROLLUP
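For example, here is an illustrative sketch of the two-step pattern to use when you want to aggregate the result of a row-level calculation (the field names bytes and kilobytes are hypothetical). Because an aggregate function can only take a simple field name or literal as input, this is not allowed:

SUM(bytes / 1024)

Instead, first define an interim computed field, for example one named kilobytes:

bytes / 1024

and then aggregate that field in a second computed field:

SUM(kilobytes)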
Operators in an Expression

Platfora has a number of built-in operators for doing arithmetic, logical, and comparison operations. Often, you'll use operators to combine or compare values. The values can be literal values, field values, or even other expressions.

Arithmetic Operators

Arithmetic operators perform basic math operations on two values of the same data type. For example, you could calculate the gross profit margin percentage using the values of a total_revenue and total_cost field as follows:

((total_revenue - total_cost) / total_cost) * 100

Or you can use the plus (+) operator to combine STRING values:

"Firstname" + " " + "Lastname"

You can also use the plus (+) and minus (-) operators to add or subtract DATETIME values.

The arithmetic operators are:

• + (addition): amount + 10 adds 10 to the value of the amount field.
• - (subtraction): amount - 10 subtracts 10 from the value of the amount field.
• * (multiplication): amount * 100 multiplies the value of the amount field by 100.
• / (division): bytes / 1024 divides the value of the bytes field by 1024 and returns the quotient.

Comparison Operators

Comparison operators are used to define Boolean (true/false) expressions. They test whether two values are equivalent. Comparisons return 1 for true and 0 for false. If the comparison is invalid, for example comparing a STRING to an INTEGER, the comparison operator returns NULL.

For example, you could use comparison operators within a CASE expression:

CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END

This expression compares the value in the age field to a literal number value. If true, it returns the appropriate STRING value. You cannot use comparison operators to test for equality between DATETIME values.

The comparison operators are:

• = or == (equal to): order_date = "12/22/2011"
• > (greater than): age > 18
• !> (not greater than): age !> 8
• < (less than): age < 30
• !< (not less than): age !< 12
• >= (greater than or equal to): age >= 20
• <= (less than or equal to): age <= 29
• <> or != or ^= (not equal to): age <> 30

Logical Operators

Logical operators are used in expressions to test for a condition. Logical operators are often used in lens filters, CASE expressions, and PARTITION expressions. Filters test whether a field or value meets some condition. For example, this tests whether a date falls between two other dates:

BETWEEN 2013-06-01 AND 2013-07-31

Logical operators are also used to construct WHERE clauses in Platfora's query language.

The logical operators are:

• AND: tests whether two conditions are both true.
• OR: tests whether either of two conditions is true.
• BETWEEN min_value AND max_value: tests whether a date or numeric value is within the min and max values (inclusive). For example, year BETWEEN 2000 AND 2012.
• IN(list): tests whether a value is within a set. For example, product_type IN("tablet","phone","laptop").
• LIKE("pattern"): simple inclusive case-insensitive character pattern matching. The * character matches any number of characters; the ? character matches exactly one character. For example, company_name LIKE("platfora") matches Platfora or platfora, and last_name LIKE("?utch*") matches Kutcher or hutch but not Krutcher or crutch.
• IS NULL: checks whether a field value or expression is null (empty). For example, ship_date IS NULL evaluates to true when the ship_date field is empty.
• NOT: reverses the value of other operators. For example: year NOT BETWEEN 2000 AND 2012; first_name NOT LIKE("Jo?n*") excludes John and jonny but not Jon or Joann; Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty.

Fields in an Expression

Expressions often operate on the values of a field. This section explains how to use field names in expressions.

Referring to Fields in the Current Dataset

When you specify a field name in an expression, if the field name does not contain spaces or special characters, you can simply refer to the field by its name. For example, the following expression sums the values of the sales field:

SUM(sales)

Enclose field names in square brackets ([]) if they contain spaces, special characters, or reserved keywords (such as function names), or start with numeric characters. For example:

SUM([Sale Amount])
SUM([2013_data])
SUM([count])

If a field name contains a ] (closing square bracket), you must escape the closing square bracket by doubling it (]]). So if the field name is:

Min([crs_flight_duration])

you enclose the entire field name in square brackets and escape the closing bracket that is part of the actual field name:

[Min([crs_flight_duration]])]

If you are using the expression builder, it provides the correct escapes for you.

Field is a synonym for dataset column. The documentation uses the word field because that is the terminology used in Platfora's user interface.

Use Dot Notation for Fields in a Referenced Dataset

Your expression might refer to a field in the focus dataset. (The focus dataset is simply the current dataset you are working with.) You might also include a field from a referenced dataset. When including fields from a referenced dataset, you must qualify the field name with the proper notation. The convention is reference_name.field_name. Don't confuse a reference name with the dataset name; they are not the same. When you create a reference link in a dataset, you give that reference its own name. Use . (dot) notation to separate the two components.

For example, consider the Airports dataset, which goes by the Departure Airport reference name. To refer to the City field of the Departure Airport reference to the Airports dataset, you would use the notation:

[Departure Airport].City

Just as with field names, you must escape reference names if they contain spaces, special characters, or reserved keywords (such as function names), or start with numeric characters.

Aggregate Functions and Fields in a Referenced Dataset

Aggregate functions can only operate on fields in the current focus dataset. You cannot directly calculate a measure on a field belonging to a referenced dataset. For example, the following expression is not allowed:

DISTINCT([Departure Airport].City)

Instead, use a two-step process to 'pull up' a referenced field into the current dataset. First, define a Departure Airport City computed field whose expression is just the path to the referenced dataset field:

[Departure Airport].City

Then, you can use the interim Departure Airport City computed field as an argument to the aggregate expression. For example:

DISTINCT([Departure Airport City])

Literal Values in an Expression

Sometimes you need to use a literal value in an expression, as opposed to a field value.
How you specify a literal value depends on its data type (text, numeric, or date). This section explains how to use literals in expressions.

Literal STRING Values

To specify a literal or actual STRING value, enclose the value in double quotes ("). For example, this expression converts the values of a gender field to the literal values male, female, or unknown:

CASE WHEN gender="M" THEN "male" WHEN gender="F" THEN "female" ELSE "unknown" END

To escape a literal quote within a literal value itself, double the literal quote character. For example:

CASE WHEN height="60""" THEN "5 feet" WHEN height="72""" THEN "6 feet" ELSE "other" END

The REGEX() function is a special case. In the REGEX() function, string expressions are also enclosed in quotes. When a string expression contains literal quotes, double the literal quote character. For example:

REGEX(height, "\d\'(\d)+""")

Literal DATE and DATETIME Values

To refer to a DATETIME value in a lens filter expression, the date format must be yyyy-MM-dd without any enclosing quotation marks or other punctuation:

order_date BETWEEN 2012-12-01 AND 2012-12-31

To refer to a literal date value in a computed field expression, you must specify the format of the date and time components using TO_DATE, which takes a string literal argument and a format string. For example:

CASE WHEN order_date=TO_DATE("2013-01-01 00:00:59 PST","yyyy-MM-dd HH:mm:ss z") THEN "free shipping" ELSE "standard shipping" END

Literal Numeric Values

For literal numeric values, you can just specify the number itself without any special escaping or formatting. For example:

CASE WHEN is_married=1 THEN "married" WHEN is_married=0 THEN "not_married" ELSE NULL END

PARTITION Expressions and Event Series Processing (ESP)

Computed fields that contain a PARTITION expression are considered event series processing (ESP) computed fields. You can add ESP computed fields to Platfora datasets only (not vizboards). Event series processing is also referred to as pattern matching or event correlation.

Use event series processing (ESP) to partition the rows of a dataset, order the rows sequentially (typically by a timestamp), and search for matching patterns among the rows. ESP fields evaluate multiple rows in the dataset, and output one value (or column) per row. You can use the results of an ESP computed field in other expressions or (after lens build processing) in a viz.

How Event Series Processing Works

This section explains how event series processing works by walking you through a simple use of the PARTITION expression.

This example uses some weblog page view data. Each row represents a page view at a given point in time within a user session. Each session is unique and belongs to only one user. Users can have multiple sessions. Within any session a user can visit any page one or more times.

SessionID  UserID  Timestamp        Page
2A         2       3/4/13 2:02 AM   products.html
1A         1       12/1/13 9:00 AM  home.html
1A         1       12/1/13 9:10 AM  products.html
1A         1       12/1/13 9:05 AM  company.html
1B         1       3/1/13 9:45 PM   products.html
1B         1       3/1/13 9:40 PM   home.html
2A         2       3/4/13 2:56 AM   checkout.html
1B         1       3/1/13 9:46 PM   checkout.html
1A         1       12/1/13 9:20 AM  checkout.html
2A         2       3/4/13 2:20 AM   home.html
2A         2       3/4/13 2:33 AM   blogs.html
1A         1       12/1/13 9:15 AM  blogs.html

Consider the following partial PARTITION expression:

PARTITION BY SessionID ORDER BY Timestamp ...

This partitions the rows by SessionID.
Within each partition, the function orders each row by Timestamp in ascending order (the default order). Suppose you wanted to find sessions where users traversed the pages in order from home.html to products.html and then to the checkout.html page. To look for this page view pattern, you complete the expression like this:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
       B AS Page = "products.html",
       C AS Page = "checkout.html"
OUTPUT "TRUE"

The PATTERN clause describes the sequence, and the DEFINE clauses assign values to the PATTERN elements. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A, then B, then C. If the computed field containing this PARTITION expression was called Path=home,product,checkout, you would get output that looks like this:

SessionID  UserID  Timestamp        Page           Path=home,product,checkout
1A         1       12/1/13 9:00 AM  home.html      NULL
1A         1       12/1/13 9:05 AM  company.html   NULL
1A         1       12/1/13 9:10 AM  products.html  NULL
1A         1       12/1/13 9:15 AM  blogs.html     NULL
1A         1       12/1/13 9:20 AM  checkout.html  NULL
1B         1       3/1/13 9:40 PM   home.html      NULL
1B         1       3/1/13 9:45 PM   products.html  NULL
1B         1       3/1/13 9:46 PM   checkout.html  TRUE
2A         2       3/4/13 2:02 AM   products.html  NULL
2A         2       3/4/13 2:20 AM   home.html      NULL
2A         2       3/4/13 2:33 AM   blogs.html     NULL
2A         2       3/4/13 2:56 AM   checkout.html  NULL

The lens build processing that happens to produce these results is as follows:

1. Partition (or group) the rows of the dataset by session.
2. Order the rows in each partition by time (in ascending order by default).
3. Evaluate the rows against each DEFINE clause and bind the row to the symbol where there is a match.
4. Check whether the PATTERN clause conditions are met in the specified order and frequency.
5. If the PATTERN criteria are met, output TRUE as the result value for the last row that caused the pattern to be true. Write the output results to a new computed field: Path=home,product,checkout. If a row does not cause the pattern to be true, output nothing (NULL).

Understand Pattern Match Processing Order

During lens processing, the build evaluates patterns row by row, from the partition's top row going downwards. A pattern match is evaluated based on the current row and any rows that come before it (in terms of their position in the partition). The pattern match only looks back from the current row; it does not look ahead to the next row in the partition.

Processing order is important to consider when you want to look for events that happened later or next (chronologically speaking). With the default sort order (ascending), the build sorts rows within a partition from oldest to most recent. This means that you can only pattern match backwards chronologically (or look for events that happened previously in time). For example, to answer a question such as "what page did a user visit before they visited the product page?", the following expression would return the previous (chronologically) viewed page before the product page:

PARTITION BY SessionID
ORDER BY Timestamp ASC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "products.html",
       A AS TRUE
OUTPUT A.Page

If you want to pattern match forwards chronologically (or look for events that happened later in time), specify DESC sort order in the ORDER BY clause of your PARTITION expression.
For example, to answer a question such as "what page did a user visit after they visited the product page?", the following expression would return the next (chronologically) viewed page after the product page:

PARTITION BY SessionID
ORDER BY Timestamp DESC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "products.html",
       A AS TRUE
OUTPUT A.Page

Understand Pattern Match Precedence

By default, pattern expressions are matched from left to right, with the innermost parenthetical expressions evaluated first, moving outward from there. For example, the pattern:

PATTERN (((A,B)|(C,D)),E)

would evaluate differently than:

PATTERN (A,B|C,D,E)

Understand Regex-Style Quantifiers (Greedy and Reluctant)

The PATTERN clause can use regex-style quantifiers to denote the frequency of a match. By default, quantifiers are greedy, meaning they match as many rows as possible. For example:

PATTERN (A*,B?)

causes symbol A to match zero or more rows; symbol B can match exactly one row.

Adding an additional question mark (?) to a quantifier makes it reluctant. This means the PATTERN only matches a row when the row cannot match any other subsequent match criteria in the pattern. For example:

PATTERN (A*?,B)

causes symbol A to match zero or more rows, but only when symbol B does not produce a match. You can use reluctant quantifiers to break ties when there is more than one possible match to the pattern.

A quantifier applies to a single match criteria symbol only. You cannot apply quantifiers to parenthetical expressions. For example, you cannot write ((A,B,C)*, D) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.

Best Practices for Event Series Processing (ESP)

Event series processing (ESP) computed fields, unlike other computed fields, require advanced processing during lens builds. This means they require more compute resources on your Hadoop cluster. This section discusses what to consider when adding event series computed fields to your dataset definitions, and best practices for using this feature.

Use Helpful Field Names and Descriptions

In the Data Catalog and Vizboards areas of the Platfora application, event series computed fields look just like any other dataset field. When defining event series computed fields, give them names and descriptions that help users understand the field's purpose. This cues users on how to use a field in an analysis. For example, if describing an event series computed field that computes Next Page Viewed, it may be helpful for users to know that this field is best used in conjunction with the Page field. Whatever the current value is for the Page field, the Next Page Viewed field has the value of Page for the click record immediately following the current page.

Increase Partition Limit for Larger Event Series Processing Jobs

The global configuration property platfora.max.pattern.events sets the maximum number of rows in a partition to evaluate for a pattern match. The default is one million rows. If a partition exceeds this number of rows, the result of the PARTITION function is NULL for all the rows that exceed the limit. For example, if you had an event series computed field that partitioned by UserID and ordered by Timestamp, the build processes only the first million rows and ignores any rows beyond that, so the event series computed field is NULL for those rows.
If you are noticing a lot of default values in your lens data (for example, 'January 1, 1970' for dates or 'NULL' for strings), you may want to increase platfora.max.pattern.events so that all of the rows are processed. Keep in mind that increasing this limit will consume more memory resources on the Hadoop cluster during lens processing.

Filter Partitioning Fields to Restrict Lens Build Scope

Platfora cannot incrementally build lenses that include event series processing fields. Due to the nature of pattern matching logic, lenses with ESP fields require full lens builds that scan all of a dataset's input data. You can limit the scope of these lens builds and improve processing time by adding a lens filter on a dataset partitioning field. A dataset partitioning field is different from the partition criteria of the ESP field. For Hive data sources, partitioning fields are defined on the data source by the Hive administrator. For HDFS or S3 data sources, partitioning fields are defined in a Platfora dataset. If there are partitioning fields available in a lens, the lens builder displays a special icon next to them.

Consider How Lens Filters Impact Event Series Processing Results

Lens builds always apply lens filters on dataset partitioning fields as the first step of a lens build. This means a build excludes some source data before processing any computed field expressions. If your lens includes both lens filters on partitioning fields and ESP computed fields, take this behavior into consideration, as it can change the results of PARTITION expressions, and ultimately, your analysis conclusions.

For example, suppose you are analyzing web page visits by user on data from 2012 and 2013:

SessionID  UserID  Timestamp (partition field)  Page
1A         1       12/1/12 9:00 AM              home.html
1A         1       12/1/12 9:05 AM              company.html
1A         1       12/1/12 9:10 AM              products.html
1A         1       12/1/12 9:15 AM              blogs.html
1B         1       3/1/13 9:40 PM               home.html
1B         1       3/1/13 9:45 PM               products.html
1B         1       3/1/13 9:46 PM               checkout.html
2A         2       3/4/13 2:02 AM               products.html
2A         2       3/4/13 2:20 AM               home.html
2A         2       3/4/13 2:33 AM               blogs.html
2A         2       3/4/13 2:56 AM               checkout.html

Timestamp is a partitioning field, and it has a filter that excludes 2012 sessions. Then, you create a computed field with an event series PARTITION function that returns a user's first visit date. When the lens builds, the PARTITION expression would process this filtered data:

SessionID  UserID  Timestamp        Page
1B         1       3/1/13 9:40 PM   home.html
1B         1       3/1/13 9:45 PM   products.html
1B         1       3/1/13 9:46 PM   checkout.html
2A         2       3/4/13 2:02 AM   products.html
2A         2       3/4/13 2:20 AM   home.html
2A         2       3/4/13 2:33 AM   blogs.html
2A         2       3/4/13 2:56 AM   checkout.html

As a result, the lens would report that UserID 1 had a first visit date of 3/1/13, even though the user's first visit was actually 12/1/12. This discrepancy results from the build processing the lens filter on the partitioning field (Timestamp) before the event series processing field. Lens filters on other, non-partitioning dataset fields are applied after event series processing.

ROLLUP Measures and Window Expressions

This section explains how to write ROLLUP and window expressions to calculate complex measures, such as running totals, benchmark comparisons, rank ordering, percentiles, and so on.

Understand ROLLUP Measures

ROLLUP is a modifier to a measure (or aggregate) expression that allows you to operate on a subset of rows within the overall result set of a query. Using ROLLUP, you can build a frame around one or more rows in a dataset or query result, and then compute an aggregate result in relation to that frame only. The result of a ROLLUP expression is always a measure. However, instead of just doing a simple aggregation, it does more complex aggregate processing over a specified set of rows (or marks in a viz).
ROLLUP Measures and Window Expressions

This section explains how to write ROLLUP and window expressions to calculate complex measures, such as running totals, benchmark comparisons, rank ordering, percentiles, and so on.

Understand ROLLUP Measures

ROLLUP is a modifier to a measure (or aggregate) expression that allows you to operate on a subset of rows within the overall result set of a query. Using ROLLUP, you can build a frame around one or more rows in a dataset or query result, and then compute an aggregate result in relation to that frame only. The result of a ROLLUP expression is always a measure. However, instead of just doing a simple aggregation, it does more complex aggregate processing over a specified set of rows (or marks in a viz).

If you are familiar with SQL, a ROLLUP expression in Platfora is equivalent to the OVER clause in SQL. For example, this SQL statement:

SELECT SUM(distance) OVER (PARTITION BY departure_date)

would be equivalent to this ROLLUP expression in Platfora:

ROLLUP SUM(Distance) TO [Departure Date]

What is the difference between a measure and a ROLLUP measure? A measure is the result of an aggregate function (such as SUM) applied to a group of input data rows. For example, using the Flights tutorial data that comes with your Platfora installation, suppose you wanted to calculate the total distance flown by an airline. You could create a measure called Distance(Sum) with an aggregate expression such as this:

SUM(Distance)

The group of input records passed into this aggregate calculation is then determined by the dimension(s) used in a visualization or lens query. Records that have the same dimension members are grouped together in a single row, which then gets represented as a mark in a viz. For example, in such a viz there would be one group or mark for each Carrier/Week combination in the input data.

A ROLLUP clause modifies another aggregate function to define additional partitioning, ordering, and window frame criteria. Like a regular aggregate function, ROLLUP also computes aggregate values over groups of input rows. However, a ROLLUP measure then partitions the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression separately within each individual bucket.

A ROLLUP is useful when you want to compute an aggregation over a subset of rows (or marks) independently of the overall result of the viz query. The ROLLUP function specifies how to partition the subset of rows and how to compute the aggregation within that subset. For example, suppose you wanted to calculate the percentage of all miles that were flown in a given week. You could write a ROLLUP expression that calculates the percent of total distance within the partition of a week (total distance for the week is 100%). The ROLLUP expression to define such a calculation would look something like this:

100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date].Week)

Then when this ROLLUP expression is used in a viz, the group of input records passed into the aggregate calculation is determined by the dimension(s) used in the viz (such as Carrier in this case); however, the aggregation is calculated independently within each week. In this case, you can see the percentage that each carrier contributed to the total distance flown in a given week.
How to calculate a ROLLUP over an 'adaptive' partition

A ROLLUP expression can have fixed or adaptive partitioning criteria. When you define the ROLLUP measure expression, the TO clause of the expression specifies how to partition the data. You can specify an exact field name (fixed), a reference field name (adaptive), or no field name at all (adaptive).

In the previous example, the ROLLUP expression used a fixed partition of [Departure Date].Week. If we changed the partition criteria to use just [Departure Date] (a reference), the partition criteria becomes adaptive to any field of that reference that is used in a viz. The expression to define an adaptive date partition might look something like this:

100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date])

Since Departure Date is a reference that points to the Date dimension, the calculation dynamically changes if you drill down from week to day in the viz. This expression can then be used to partition by any granularity of Departure Date without having to rewrite the ROLLUP expression; it adapts to any granularity of Departure Date used in a viz.

Understand ROLLUP Window Expressions

Adding an ORDER BY plus an optional RANGE or ROWS clause to a ROLLUP expression turns it into a window expression. These clauses specify an order inside each partition, and a window frame around all, one, or several rows over which to compute the aggregate calculation. The window frame defines how to crop, shift, or fix the row set in relation to the position of the current row.

For example, suppose you wanted to calculate a cumulative total on a day-to-day basis. You could do this by adding a window frame to your ROLLUP expression that orders the rows in each partition by date (using the ORDER BY clause), and then sums the current row and all the days that came before it (using a ROWS UNBOUNDED PRECEDING clause). In the Flights tutorial data, an expression that calculates a cumulative total of flights per day would look something like this:

ROLLUP [Total Records] TO () ORDER BY ([Departure Date].Date) ROWS UNBOUNDED PRECEDING

When this ROLLUP expression is used in a viz, the Total Records measure is computed cumulatively by day for each partition group (the Date and Cancel Status dimensions in this case), allowing us to see the progression of cancelled flights in the month of October 2012. This reveals unusual growth patterns in the data, such as the dramatic spike in cancellations at the end of the month.

The RANK, DENSE_RANK, and NTILE functions are considered exclusively window functions because they can only be used in a ROLLUP expression, and they always require an ordered set of rows (or window) over which to compute their result.
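For instance, a rank-ordering measure might look like this (a sketch following the window function reference later in this guide; it assumes an existing [Distance(Sum)] measure and ranks each mark by that measure in descending order):

ROLLUP RANK() TO () ORDER BY ([Distance(Sum)] DESC) ROWS UNBOUNDED PRECEDING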
Computed Field Examples

This section contains examples of some common data processing tasks you can accomplish using Platfora computed fields. The Expression Language Reference has examples for all of the built-in functions that Platfora provides.

Finding and Replacing Values

You may have particular values in your data that you want to find and change to something else, or reformat so they are all consistent. For example, find and replace values in a name field where name values are formatted as firstname lastname, and replace them with name values formatted as lastname, firstname:

REGEX_REPLACE(name,"(.*) (.*)","$2, $1")

Or you may have field values that are not formatted exactly the same, and want to change them so that like values can be grouped and sorted together. For example, change all profession_title field values that contain the word "Retired" anywhere in the string to just be a value of "Retired":

REGEX_REPLACE(profession_title,".*(Retired).*","Retired")

Extracting Information from File Names and Directories

You may have a dataset where the information you need is not inside the source files, but in the Hadoop file name or directory path, such as dates or server names. Suppose your dataset is based on daily log files that are organized into directories by date, and the file names are the server IP address of the server that produced the log file. For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is:

hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log

The following expression uses FILE_PATH() in combination with REGEX() and TO_DATE() to create a date field from the date directory name:

TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

And the following expression uses FILE_NAME() and REGEX() to extract the server IP address from the file name:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

Extracting a Portion of Field Values

You may have field values where only part of the value contains useful information. You can pull out a portion of a field value to define a new field. For example, suppose you had an email field with values in the format of username@provider.domain, and you wanted to extract just the provider portion of the email address:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Renaming Field Values

Sometimes field values are not very user-friendly. For example, a Boolean field may have values of 0 and 1 that you want to change to more human-readable values:

CASE WHEN cancelled=0 THEN "Not Cancelled" WHEN cancelled=1 THEN "Cancelled" ELSE NULL END

Deriving a New Field from Other Fields

You may want to combine the values of other fields to create a new field. For example, you could combine a month, day, and year field into a single date field. This would then allow you to reference Platfora's built-in Date dimension dataset.

TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")

You can also use the values of other fields to calculate a new value. For example, you could calculate a gross profit margin percentage using the values of a revenue and cost field as follows:

((revenue - cost) / cost) * 100

Cleansing and Casting Field Values

Sometimes the data values in a column need to be transformed and cast to another data type to allow further calculations on the data. For example, you might have numeric data that you want to use as a measure; however, it has string values of "NA" to represent what should really be NULL values. You could transform the "NA" values to NULL and then cast the column to a numeric data type:

TO_INT(CASE WHEN delay_minutes="NA" THEN NULL ELSE delay_minutes END)

Troubleshoot Computed Field Errors

When you create a computed field, Platfora catches any syntax errors in your expression when you try to save the field. This section describes the most common causes of expression syntax errors.

Function Arguments Don't Match the Expected Data Type

Functions expect input arguments to be of a certain data type. When a function uses another field as its input argument, and that field is not of the expected data type, you might see an error such as:

Function REGEX takes 2 arguments with types STRING, STRING, but one argument of type INTEGER was provided.

Look at the function's arguments that appear in the error message and verify they are the proper data types. If the argument is a field, you might need to change the data type of the base field, or use a data type conversion function to convert the argument to the expected data type within the expression itself.

See also: Functions in an Expression
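For instance, if an INTEGER field were passed to REGEX (which expects a STRING argument), wrapping the field in a conversion function resolves the mismatch (a hypothetical sketch; the zip_code field name is illustrative):

REGEX(TO_STRING(zip_code),"(\d{5})")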
Not Escaping Field or Dataset Names

Field and dataset names used in an expression must be enclosed in square brackets ([ ]) if they contain spaces, special characters, reserved keywords, or start with numeric characters. When an expression contains a field or dataset name that meets one of these criteria and is not enclosed in square brackets, you might see an error such as:

Platfora expected the string `)', but instead received `F'. TO_LONG(New Field)

Look at the bolded character in the expression to find the location of the error. Note the text that comes after this position. If it is part of a field or dataset name, you need to enclose the name in square brackets. To correct the expression in this example, use:

TO_LONG([New Field])

See also: Escaping Spaces or Special Characters in Field and Dataset Names

Not Specifying the Full Path to Fields of a Referenced Dataset

Functions can use a field that is in a dataset referenced from the focus dataset. You must specify the field's full path by including the referenced dataset's reference name. If you forget to use the full path, you might see an error like:

Field not found: carrier_name

When you see the Field not found error, make sure the field is qualified with the reference name. In this example, carrier_name is a field in a referenced dataset, and the reference name is carriers. To correct this expression, use carriers.carrier_name for the field name.

See also: Referring to Fields in a Referenced Dataset

Unenclosed Literal Strings

You can include a literal string value as a function argument, but it must be enclosed in double quotes ("). When an expression uses a literal string that isn't enclosed in double quotes, you might see an error such as:

Field not found: Platfora

When you see the Field not found error, one possibility is that the alleged field is meant to be a literal string and needs to be enclosed in double quotes. To correct this expression, use "Platfora" for the string.

See also: Literal Values in an Expression

Unescaped Special Characters

Field and dataset names may contain a right square bracket (]), but it must be preceded by another right square bracket (]]). Literal strings may contain a double quote ("), but it must be preceded by another double quote (""). Suppose you want to concatenate the strings "Hello and world." to make the string "Hello world.". The double quotes in each string are special characters and must be escaped in the expression. If not, you might see an error like:

Platfora expected the string `)', but instead received `H'. CONCAT(""Hello", " world."")

Look at the bolded character in the expression to find the location of the error. To correct this error, escape the double quotes with another double quote:

CONCAT("""Hello", " world.""")
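Similarly, a right square bracket inside a field name must be doubled (a hypothetical illustration of the rule above; assume a field literally named Scores [raw]):

[Scores [raw]]]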
Invalid Syntax

Functions have specific requirements, including required arguments and keywords. When an expression is missing a keyword, you might see an error such as:

Platfora expected a string matching the regular expression `(?i)\Qend\E', but instead received end of source. CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN "Cancelled" ELSE NULL

Look at the bolded character in the expression to find the location of the error. In this example, the parser expected the string END (indicated by (?i)\Qend\E), but instead it reached the end of the expression. The CASE function requires the END keyword at the end of its syntax. To correct this error, add END to the end of the expression:

CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN "Cancelled" ELSE NULL END

See also: Expression Language Reference

Using Row and Aggregate Functions Together in the Same Expression

Aggregate functions (functions used to define measures) cannot use nested expressions as their input arguments; they can only accept field names as input. You also cannot use an aggregate expression as input to a row function expression. Aggregate functions and row functions cannot be mixed together in one expression.

Write a Lens Query

Platfora includes a programmatic query access feature you can use to query a lens. This section describes support for querying lenses using Platfora's lens query language and the REST API.

Platfora allows you to make a query against an aggregate lens in your Platfora instance. This feature is not meant as an end-user feature. Rather, it is intended to allow you to write programs that issue SQL-like queries to a Platfora lens. For example, you could write a simple command-line client for querying a lens. Since programmatic query access is meant for use by programs rather than people, a caller makes the queries through REST API calls.

A query consists of a SELECT statement with one or more optional clauses. The statement and its clauses use the same expression language elements you encounter when building a computed field expression and/or a lens filter expression.

[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT { measure-field [ AS alias-name ] | measure-expression AS alias-name }
  [ , { dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ...] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY dimension-field [ , group-ordering ] ]
[ HAVING measure-filter-expression ]

For example, you might make a query like the following:

SELECT [device].[manufacturer], [user].[gender], [Num Users] FROM bo_view2G_PSM WHERE video.genre %3D "Action/Comedy" AND user.gender !%3D "male" GROUP BY [device].[manufacturer], [user].[gender]

Once you know the query structure, you make a REST call to the query endpoint. You can pass the query as a parameter to a GET or as a JSON body to a POST.

https://hostname:port/api/v1/query?query="HTML-encoded SELECT statement ..."

Considerations for Using Programmatic Query Access

Here are some considerations to keep in mind when constructing lens queries:
• You can only query aggregate lenses. You cannot query event series lenses.
• Queries run against the currently built version of the lens.
• Queries that once worked can later fail because the underlying dataset or lens changed.
• You cannot do a SELECT * on a lens.
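For example, a query might be submitted from the command line like this (a minimal sketch, not an official example: hostname, port, and credentials are placeholders, basic authentication is an assumption, and the SELECT statement must be URL-encoded as described above):

curl -u username:password "https://hostname:port/api/v1/query?query=SELECT%20%5BNum%20Users%5D%20FROM%20bo_view2G_PSM"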
FAQs - Expression Basics

This section covers the basic concepts and common questions about the Platfora expression language.

What is an expression?

An expression computes or produces a value by combining fields (or columns), constant values, operators, and functions. An expression outputs a value of a particular data type, such as a numeric, string, datetime, or Boolean (true/false) value. Simple expressions can be a single constant value, the values of a given column or field, or a function call. You can use operators to join two or more simple expressions into a complex expression (see the sketch at the end of these FAQs).

How are expressions used in the Platfora application?

Platfora expressions allow you to select, process, transform, and manipulate data. Expressions are used in several ways in the Platfora application:
• In Datasets, they are used to define computed fields and measures that operate on the raw source data.
• In Lenses, they are used to define lens filters that limit the scope of raw data requested from Hadoop.
• In Vizboards, they are used to define computed fields that further manipulate the prepared data in a lens.
• In the Lens Query Language via the REST API, they are used to programmatically access and manipulate the prepared data in a lens from external applications or plugins.

What is the expression builder?

The expression builder helps you create computed field expressions in the Platfora application. It shows the available fields in the dataset or lens you are working with, plus the list of Platfora's built-in functions and statements. It validates your expressions for correct syntax, input data types, and so on. You can also access the help to view correct syntax and examples for all of the built-in functions and statements.

What is a computed field expression?

A computed field expression generates its values based on a calculation or condition, and returns a value for each input row. Computed field expressions can contain values from other fields, constants, mathematical operators, comparison operators, or built-in row functions.

What is a measure expression?

A measure expression generates its values as the result of an aggregate function. It takes input values from multiple rows and returns a single aggregated value.

How are expressions used in programmatic lens queries?

Platfora's lens query language does not have a graphical user interface like the expression builder. Instead, you can use the cURL command line, Chrome's Postman extension, or write your own plugin extension to submit a SQL-like SELECT query statement through Platfora's REST API. The lens query language makes use of expressions in its SELECT statement, DEFINE clause, WHERE clause, and HAVING clause. Programmatic lens queries are subject to some of the same expression limitations as vizboard computed fields, since they also operate on the pre-processed data in a lens.
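To make the distinction between simple and complex expressions concrete (an illustrative sketch; the delay_minutes and cancelled field names are hypothetical): the first line below is a simple expression (a function call on a field), while the second combines simple expressions with comparison and logical operators into a complex Boolean expression:

TO_LONG(delay_minutes)
TO_LONG(delay_minutes) > 15 AND cancelled == 0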
Platfora Expression Language Reference

An expression computes or produces a value by combining field or column values, constant values, operators, and functions. Platfora has a built-in expression language. You use the language's functions and operators in dataset computed fields, vizboard computed fields, lens filters, and programmatic lens queries.

Expression Quick Reference

An expression is a combination of columns (or fields), constant values, operators, and functions used to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex expressions. This quick reference describes the functions and operators that can be used to write expressions.

Platfora's built-in statements, functions, and operators are divided into the following categories:
• Conditional and NULL Processing
• Event Series Processing
• String Processing
• Date and Time Processing
• URL Processing
• IP Address Processing
• Mathematical Processing
• Data Type Conversion
• Aggregation and Measure Processing
• ROLLUP and Window Calculations
• User Defined Functions
• Comparison Operators
• Logical Operators
• Arithmetic Operators

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE: evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met. Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

COALESCE: returns the first valid value (NOT NULL value) from a comma-separated list of expressions. Example: COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID: returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL. Example: IS_VALID(sale_amount)

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. They are processed differently than regular computed fields: instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard.

PACK_VALUES: returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators - useful when the OUTPUT clause of a PARTITION expression returns multiple output values. Example: PACK_VALUES("ID",custid,"Age",age)

PARTITION: partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

ARRAY_CONTAINS: performs a whole string match against a string containing delimited values and returns 1 or 0 depending on whether or not the string contains the search value. Example: ARRAY_CONTAINS(device,",","iPad")
CONCAT: concatenates (combines together) the results of multiple string expressions. Example: CONCAT(month,"/",day,"/",year)

FILE_NAME: returns the original file name from the source file system. Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

FILE_PATH: returns the full URI path from the source file system. Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

EXTRACT_COOKIE: extracts the value of the given cookie identifier from a semicolon-delimited list of cookie key=value pairs. Example: EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

EXTRACT_VALUE: extracts the value for the given key from a string containing delimited key/value pairs. Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

INSTR: returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Example: INSTR(url,"http://",-1,1)

JAVA_STRING: returns the unescaped version of a Java unicode character escape sequence as a string value. Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS: concatenates (combines together) the results of multiple string expressions with the separator in between each non-null value. Example: JOIN_STRINGS("/",month,day,year)

JSON_ARRAY_CONTAINS: performs a whole string match against a string formatted as a JSON array and returns 1 or 0 depending on whether or not the string contains the search value. Example: JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE: extracts a DOUBLE value from a field in a JSON object. Example: JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED: extracts a FIXED value from a field in a JSON object. Example: JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER: extracts an INTEGER value from a field in a JSON object. Example: JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG: extracts a LONG value from a field in a JSON object. Example: JSON_LONG(top_scores,"test_scores.2")

JSON_STRING: extracts a STRING value from a field in a JSON object. Example: JSON_STRING(misc,"hobbies.0")

LENGTH: returns the count of characters in a string value. Example: LENGTH(name)

REGEX: performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression. Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")

REGEX_REPLACE: evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value. Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

SPLIT: breaks down a delimited input string into sections and returns the specified section of the string. Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

SUBSTRING: returns the specified characters of a string value based on the given start and end position. Example: SUBSTRING(name,0,1)

TO_LOWER: converts all alphabetic characters in a string to lower case. Example: TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER: converts all alphabetic characters in a string to upper case. Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM: removes leading and trailing spaces from a string value. Example: TRIM(area_code)

XPATH_STRING: takes an XML-formatted string and returns the first string matching the given XPath expression. Example: XPATH_STRING(address,"//address[@type='home']/zipcode")
XPATH_STRINGS: takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression. Example: XPATH_STRINGS(address,"/list/address[1]/street")

XPATH_XML: takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression. Example: XPATH_XML(address,"//address[last()]")

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

DAYS_BETWEEN: calculates the whole number of days (ignoring time) between two DATETIME values. Example: DAYS_BETWEEN(ship_date,order_date)

DATE_ADD: adds the specified time interval to a DATETIME value. Example: DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN: calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values. Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT: returns the specified portion of a DATETIME value. Example: EXTRACT("hour",order_date)

MILLISECONDS_BETWEEN: calculates the whole number of milliseconds between two DATETIME values. Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN: calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values. Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

NOW: returns the current system date and time as a DATETIME value. Example: YEAR_DIFF(NOW(),users.birthdate)

SECONDS_BETWEEN: calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values. Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

TRUNC: truncates a DATETIME value to the specified format. Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF: calculates the fractional number of years between two DATETIME values. Example: YEAR_DIFF(NOW(),users.birthdate)

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY: returns the authority portion of a URL string. Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT: returns the fragment portion of a URL string. Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

URL_HOST: returns the host, domain, or IP address portion of a URL string. Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH: returns the path portion of a URL string. Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT: returns the port portion of a URL string. Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL: returns the protocol (or URI scheme name) portion of a URL string. Example: URL_PROTOCOL("http://www.platfora.com") returns http
URL_QUERY: returns the query portion of a URL string. Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE: decodes a string that has been encoded with the application/x-www-form-urlencoded media type. Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22")

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH: compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not. Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

HEX_TO_IP: converts a hexadecimal-encoded STRING to a text representation of an IP address. Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use the arithmetic operators to perform simple math calculations, such as addition, subtraction, division, and multiplication.

DIV: divides two LONG values and returns a quotient value of type LONG. Example: DIV(TO_LONG(file_size),1024)

EXP: raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE. Example: EXP(Value)

FLOOR: returns the largest integer that is less than or equal to the input argument. Example: FLOOR(32.6789) returns 32.0

HASH: evenly partitions data values into the specified number of buckets. Example: HASH(username,20)

LN: returns the natural logarithm of a number. Example: LN(2.718281828) returns 1

MOD: divides two LONG values and returns the remainder value of type LONG. Example: MOD(TO_LONG(file_size),1024)

POW: raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE. Example: 100 * POW(end_value/start_value, 0.2) - 1

ROUND: rounds a DOUBLE value to the specified number of decimal places. Example: ROUND(32.4678954,2) returns 32.47

Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE: converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch. Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00.000Z

TO_FIXED: converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Example: TO_FIXED(opening_price)

TO_DATE: converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string. Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

TO_DOUBLE: converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values. Example: TO_DOUBLE(average_rating)

TO_INT: converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. Example: TO_INT(average_rating)

TO_LONG: converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. Example: TO_LONG(average_rating)

TO_STRING: converts values of other data types to STRING (character) values. Example: TO_STRING(sku_number)

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression.
Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.

AVG: returns the average of all valid numeric values. Example: AVG(sale_amount)

COUNT: returns the number of rows in a dataset. Example: COUNT(sales.customers)

COUNT_VALID: returns the number of rows for which the given expression is valid. Example: COUNT_VALID(page_views)

DISTINCT: returns the number of distinct values for the given expression. Example: DISTINCT(user_id)

MAX: returns the biggest value from the given input expression. Example: MAX(sale_amount)

MIN: returns the smallest value from the given input expression. Example: MIN(sale_amount)

SUM: returns the total of all values from the given input expression. Example: SUM(sale_amount)

STDDEV: calculates the population standard deviation for a group of numeric values. Example: STDDEV(sale_amount)

VARIANCE: calculates the population variance for a group of numeric values. Example: VARIANCE(sale_amount)

ROLLUP and Window Functions

ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate. Window functions (RANK, DENSE_RANK, and NTILE) can only be used within a ROLLUP statement. The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied.

ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top N per group results.

ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are using in the vizboard.

DENSE_RANK: assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie. Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE: divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK: assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie. Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROLLUP: a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

ROW_NUMBER: assigns a sequential number (position) to each row in a group (partition) of rows. Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. See User Defined Functions (UDFs) for more information.
Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filters.

= or == : Equal to. Example: order_date = "12/22/2011"
> : Greater than. Example: age > 18
!> : Not greater than. Example: age !> 8
< : Less than. Example: age < 30
!< : Not less than. Example: age !< 12
>= : Greater than or equal to. Example: age >= 20
<= : Less than or equal to. Example: age <= 29
<> or != or ^= : Not equal to. Example: age <> 30

Logical Operators

Logical operators are used to define Boolean (true/false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND: Test whether two conditions are true.
OR: Test if either of two conditions is true.
value BETWEEN min_value AND max_value: Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012
value IN(list): Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")
value LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher and hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora
value IS NULL: Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty
NOT: Reverses the value of other operators. Examples: year NOT BETWEEN 2000 AND 2012; first_name NOT LIKE("Jo?n*") excludes John and jonny but not Jon or Joann; Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

+ : Addition. Example: amount + 10 (add 10 to the value of the amount field)
- : Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)
* : Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)
/ : Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)

Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.
Operator Definitions

= or == : Equal to. Example: order_date = "12/22/2011"
> : Greater than. Example: age > 18
!> : Not greater than. Example: age !> 8
< : Less than. Example: age < 30
!< : Not less than. Example: age !< 12
>= : Greater than or equal to. Example: age >= 20
<= : Less than or equal to. Example: age <= 29
<> or != or ^= : Not equal to. Example: age <> 30

If you are writing queries with REST and the query string includes an = (equal) character, you must URL-encode it as %3D. Failure to encode the character can result in this error: string matching regex `(?i)\Qnot\E\b' expected but end of source found.

Logical Operators

Logical operators are used to define Boolean (true/false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND: Test whether two conditions are true.
OR: Test if either of two conditions is true.
value BETWEEN min_value AND max_value: Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012
value IN(list): Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")
value LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher and hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora
value IS NULL: Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty
NOT: Reverses the value of other operators. Examples: year NOT BETWEEN 2000 AND 2012; first_name NOT LIKE("Jo?n*") excludes John and jonny but not Jon or Joann; Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

+ : Addition. Example: amount + 10 (add 10 to the value of the amount field)
- : Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)
* : Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)
/ : Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.
CASE

CASE is a row function that evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.

CASE WHEN input_condition [AND|OR input_condition] THEN output_expression [...] [ELSE other_output_expression] END

Returns one value per row of the same type as the output expression. All output expressions must return the same data type. If there are multiple output expressions that return different data types, then you will need to enclose your entire CASE expression in one of the data type conversion functions to explicitly cast all output values to a particular data type.

WHEN input_condition

Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's supported conditional operators). If an input value meets the condition, then the output expression is applied. Input conditions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple input conditions.

THEN output_expression

Required. The THEN keyword is used to specify an output expression when the specified conditions are met. Output expressions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions.

ELSE other_output_expression

Optional. The ELSE keyword can be used to specify an alternate output expression to use when the specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.

END

Required. Denotes the end of CASE function processing.

Convert values in the age column into range-based groupings (binning):

CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END

Transform values in the gender column from one string to another:

CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and motorcycle. The following example converts multiple values in the vehicle column into a single value:

CASE WHEN vehicle IN ("bike","scooter","motorcycle") THEN "two-wheelers" ELSE "other" END

COALESCE

COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.

COALESCE(expression[,expression][,...])

Returns one value per row of the same type as the first valid input expression.

expression

At least one required. A field name or expression.

The following example shows an expression to calculate employee yearly income for exempt employees that have a salary and non-exempt employees that have an hourly_wage. This expression checks the values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL).

COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID

IS_VALID is a row function that returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL. This is useful for computing other calculations where you want to exclude NULL values (such as when computing averages).

IS_VALID(expression)

Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.

expression

Required. A field name or expression.

Define a computed field using IS_VALID. This returns a row count only for the rows where this field value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed field (sale_amount_not_null) using the sale_amount field as the basis:
IS_VALID(sale_amount)

Then you can use the sale_amount_not_null computed field to calculate an accurate average for sale_amount that excludes NULL values:

SUM(sale_amount)/SUM(sale_amount_not_null)

This is what happens automatically when you use the AVG function.

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard or a lens query.

PARTITION

PARTITION is an event series processing statement that partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields, and are processed differently than regular computed fields: instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. The PARTITION function can only be used to define a computed field in the dataset definition (pre-lens build). PARTITION cannot be used to define a vizboard computed field. Unlike other expressions, a PARTITION expression cannot be embedded within other functions or expressions - it must be a top-level expression.

PARTITION BY field_name ORDER BY field_name [ASC|DESC] PATTERN (pattern_expression) DEFINE symbol_1 AS filter_expression [,symbol_n AS filter_expression ] [, ...] OUTPUT output_expression

To understand how event series processing works, we'll walk through a simple example of a PARTITION expression, using some weblog page view data. Each row represents a page view by a user at a given point in time. Session IDs are used to group together page views that happened in the same user session.

Suppose you wanted to know how many sessions included the path of page visits to 'home.html' then 'products.html' then 'checkout.html'. You could define a PARTITION expression that groups the rows by session, orders by time, and then iterates through the rows from top to bottom to find sessions that match the pattern:

PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"

1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.
2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).
3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a symbol that is then used in the PATTERN clause.
4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B then C.
5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied.
Otherwise, the output is NULL for rows that don't meet all of the PATTERN criteria.

Returns one value per row of the same type as the output_expression for rows that match the defined match pattern, and NULL for rows that do not match the pattern. Output values are calculated during the lens build process using a special event series MapReduce job. Therefore, sample output values for a PARTITION computed field cannot be shown in the dataset workspace.

PARTITION BY field_name

Required. The PARTITION BY clause is used to specify a field in the current dataset by which to partition the rows. Rows that share the same value for this field will be grouped together, and each group will then be processed independently according to the matching pattern criteria. The partition field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

ORDER BY field_name

Optional. The ORDER BY clause specifies a field by which to sort the rows within each partition before applying the match pattern criteria. For event series processing, records are typically ordered by a DATETIME type field, such as a date or a timestamp. The default sort order is ascending (first to last or low to high). The ordering field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

PATTERN (pattern_expression)

Required. The PATTERN clause specifies the matching pattern to search for within a partition of rows. The pattern_expression is expressed in a format similar to a regular expression. The pattern_expression can include:

• A symbol that represents some match criteria (as declared in the DEFINE clause).
• A symbol followed by one of the following regex quantifiers:
  ? (matches once or not at all - greedy construct)
  ?? (matches once or not at all - reluctant construct)
  * (matches zero or more times - greedy construct)
  *? (matches zero or more times - reluctant construct)
  + (matches one or more times - greedy construct)
  +? (matches one or more times - reluctant construct)
  ** (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match need not begin or end with the quantified symbol)
  *+ (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  ++ (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  +* (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match need not end with the quantified symbol)
• A symbol or pattern of symbols anchored by the regex special character for the beginning of the row set:
  ^ (marks the beginning of the set of rows that match to the pattern)
• patternA|patternB - The alternation operator (pipe symbol) between two symbols or patterns signifies an OR match.
• patternA,patternB - The concatenation operator (comma) between two symbols or patterns signifies a match when pattern B immediately follows pattern A.
• patternA->patternB - The follows operator (minus and greater-than sign) between two symbols or patterns signifies a match when pattern B eventually follows pattern A.
• (pattern_expression) - By default, pattern expressions are matched from left to right.
If parentheses are used to group sub-expressions, the sub-expression within the parentheses is evaluated first. You cannot apply quantifiers to parenthetical expressions. For example, you cannot write ((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.

DEFINE symbol AS filter_expression

Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause (or in the filter_expression of a subsequent symbol definition). A symbol is a name used to refer to some pattern matching criteria. This can be any name or token that follows Platfora's object naming rules. For example, if the name contains spaces, special characters, keywords, or starts with a number, you must enclose the name in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a piece of pattern matching logic in your expression.

The filter_expression is a Boolean (true or false) expression that operates on each row of the partition. A filter_expression can contain:

• The special expression TRUE or 1, meaning allow the match to occur for any row in the partition.
• Any field_name in the current dataset.
• symbol.field_name - A field from the dataset qualified by the name of a symbol that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN clause. For example:

PATTERN (A, B) DEFINE A AS TRUE, B AS product = A.product

This means that the expression for symbol B will match to a row if the product field for that row is also equal to the product field for the row that is bound to symbol A.

• Any of the comparison operators, such as greater than, less than, equals, and so on.
• The keywords AND or OR (for combining multiple criteria in a single filter expression).
• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name of a symbol that (1) only appears once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN clause (*, *?, +, or +?). This returns the field value for the first or last row when the pattern matches to a set of rows. For example:

PATTERN (A+) DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0

The pattern A+ will match to a series of consecutive rows that all have the same value for the product field as the first row in the sequence. If the current row happens to be the first row in the sequence, then it will also be included in the match. A FIRST or LAST expression evaluates to NULL if it refers to a symbol that ends up matching an empty sequence. Make sure your expression handles the row at the beginning or end of a sequence if you want that row to match as well.

• Any computed expression that operates on the fields or expressions listed above and/or on literal values.

OUTPUT output_expression

Required. An expression that specifies what the output value should be. The output expression can refer to:

• The field declared in the PARTITION BY clause.
• symbol.field_name - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, and (2) is not followed by a repetition quantifier in the PATTERN clause. This will output the matching field value.
• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause.
OUTPUT output_expression

Required. An expression that specifies what the output value should be. The output expression can refer to:

• The field declared in the PARTITION BY clause.
• symbol.field_name - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, and (2) is not followed by a repetition quantifier in the PATTERN clause. This outputs the matching field value.
• COUNT(symbol), where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This outputs the sequence number of the row that matched the symbol pattern.
• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name), where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This outputs an aggregated value for the set of rows that matched the symbol pattern.
• Because a PARTITION expression can only output a single column value, you can use the PACK_VALUES function to output multiple results in a single column as key/value pairs.

'Session Start Time' Expression

Calculate a user session by partitioning by user and ordering by time. The matching logic represented by symbol A checks whether the time of the current row is less than 30 minutes from the preceding row. If it is, the row is considered part of the same session as the previous row; otherwise, the current row is considered the start of a new session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the time of the first row in a session.

PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0 OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)

'Click Number in Session' Expression

Calculate where a click happened in a session by partitioning by session and ordering by time. The matching logic represented by symbol A simply matches any row in the session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the count of the row within the partition (based on its order or position in the partition).

PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)

'Path to Page' Expression

This is a more complicated expression that looks back from the current row's position to determine the previous four pages viewed in a session. Since a PARTITION expression can only output one column value as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions 1, 2, 3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create individual columns for each prior page view in the path.

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??, Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
       Page4Back AS TRUE,
       Page3Back AS TRUE,
       Page2Back AS TRUE,
       Page1Back AS TRUE,
       CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page, "Back2",Page2Back.Page, "Back1",Page1Back.Page)

'Page -1 Back' Expression

Use the output from the 'Path to Page' expression and extract the last page viewed before the current page.

EXTRACT_VALUE([Path to Page],"Back1")
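The follows operator can be used to relate events that need not be adjacent. For example, the following sketch (assuming hypothetical UserID, Timestamp, and EventType fields) returns the time of the first purchase event that eventually follows a search event within each user's partition:

PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A->B)
DEFINE A AS EventType = "search",
       B AS EventType = "purchase"
OUTPUT B.Timestamp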
PACK_VALUES

PACK_VALUES is a row function that returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT clause of a PARTITION expression returns multiple output values. The string returned is in a format that can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator values that EXTRACT_VALUE uses (the Unicode escape sequences u0003 and u0002, respectively).

PACK_VALUES(key_string,value_expression[,key_string,value_expression][,...])

Returns one value per row of type STRING. If the value for either key_string or value_expression of a pair is NULL or contains either of the two separators, the full key/value pair is omitted from the return value.

key_string

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value. The expression must include one value_expression instance for each key_string instance.

Combine the values of the custid and age fields into a single string field. The following expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is 5555 and the value of the age field is 29:

PACK_VALUES("ID",custid,"Age",age)

The following expression returns Age\u000329 when the value of the age field is 29:

PACK_VALUES("ID",NULL,"Age",age)

The following expression returns 29 as a STRING value when the age field is an INTEGER and its value is 29:

EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")

You might want to use the PACK_VALUES function to combine multiple field values into a single value in the OUTPUT clause of the PARTITION (event series processing) function. You can then use the EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned by the PARTITION function. For example, the following PARTITION function outputs the four web pages accessed before the current page in a particular user session:

PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true,
       B AS true,
       C AS true,
       D AS true,
       E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

CONCAT

CONCAT is a row function that returns a string by concatenating (combining together) the results of multiple string expressions.

CONCAT(value_expression[,value_expression][,...])

Returns one value per row of type STRING.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY:

CONCAT(month,"/",day,"/",year)
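Because value expressions can be of any type, CONCAT is also a convenient way to assemble display values. For example, this sketch (assuming hypothetical first_name and last_name fields) builds a single full name value:

CONCAT(first_name," ",last_name)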
ARRAY_CONTAINS

ARRAY_CONTAINS is a row function that performs a whole string match against a string containing delimited values and returns 1 or 0 depending on whether or not the string contains the search value.

ARRAY_CONTAINS(array_string,"delimiter","search_string")

Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid array.

delimiter

Required. The delimiter used between values in the array string. This can be the name of a field or expression of type STRING.

search_string

Required. The literal string that you want to search for. This can be the name of a field or expression of type STRING.

If you had a device field that contained a comma-delimited list formatted like this:

Safari,iPad

You could determine whether or not the device used was an iPad using the following expression:

ARRAY_CONTAINS(device,",","iPad")

The following expressions return 1:

ARRAY_CONTAINS("platfora","|","platfora")
ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")

The following expressions return 0:

ARRAY_CONTAINS("platfora","|","plat")
ARRAY_CONTAINS("platfora,hadoop","|","platfora")

FILE_NAME

FILE_NAME is a row function that returns the original file name from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the file names themselves (such as dates or server names). You can use FILE_NAME in combination with other string processing functions to extract useful information from the file name.

FILE_NAME()

Returns one value per row of type STRING.

Suppose your dataset is based on daily log files that use an 8-character date as part of the file name. For example, 20120704.log is the file name used for the log file created on July 4, 2012. The following expression uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8 characters of the file name:

TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

Suppose your dataset is based on log files that use the server IP address as part of the file name. For example, 172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses FILE_NAME in combination with REGEX to extract the IP address from the file name:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

FILE_PATH

FILE_PATH is a row function that returns the full URI path from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory names or file names themselves (such as dates or server names). You can use FILE_PATH in combination with other string processing functions to extract useful information from the file path.

FILE_PATH()

Returns one value per row of type STRING.

Suppose your dataset is based on daily log files that are organized into directories by date on the source file system, and the file names are the IP address of the server that produced the log file. For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log. The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date field from the date directory name:

TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server\.com/data/logs/(\d{8})/(?:\d{1,3}\.?)+\.log"),"yyyyMMdd")

And the following expression uses FILE_NAME and REGEX to extract the server IP address from the file name:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
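If the directory structure encodes other attributes, the same approach applies. For example, this sketch (assuming hypothetical paths such as hdfs://myhdfs-server.com/data/logs/webserver01/20120704.log) extracts the server name from the directory portion of the path:

REGEX(FILE_PATH(),"hdfs://myhdfs-server\.com/data/logs/([^/]+)/\d{8}\.log")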
EXTRACT_COOKIE

EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a semicolon-delimited list of cookie key=value pairs. This function can be used to extract a particular cookie value from a combined web access log Cookie column.

EXTRACT_COOKIE("cookie_list_string",cookie_key_string)

Returns the value of the specified cookie key as type STRING.

cookie_list_string

Required. A field or literal string that has a semicolon-delimited list of cookie key=value pairs.

cookie_key_string

Required. The cookie key name for which to extract the cookie value.

Extract the value of the vID cookie from a literal cookie string:

EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

Extract the value of the vID cookie from a field named Cookie:

EXTRACT_COOKIE(Cookie,"vID")

EXTRACT_VALUE

EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing delimited key/value pairs.

EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])

Returns the value of the specified key as type STRING.

string

Required. A field or literal string that contains a delimited list of key/value pairs.

key_name

Required. The key name for which to extract the value.

delimiter

Optional. The delimiter used between the key and the value. If not specified, the value u0003 is used. This is the Unicode escape sequence for the end of text character (a default delimiter used by Hive).

pair_delimiter

Optional. The delimiter used between key/value pairs when the input string contains more than one key/value pair. If not specified, the value u0002 is used. This is the Unicode escape sequence for the start of text character (a default delimiter used by Hive).

Extract the value of the lastname key from a literal string of key/value pairs:

EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

Extract the value of the email key from a string field named contact_info that contains strings in the format of key:value,key:value:

EXTRACT_VALUE(contact_info,"email",":",",")

INSTR

INSTR is a row function that returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.

INSTR(string,substring,position,occurrence)

Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).

string

Required. The name of a field or expression of type STRING (or a literal string).

substring

Required. A literal string or name of a field that specifies the substring to search for in string.

position

Optional. An integer that specifies at which character in string to start searching for substring. A value of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from the beginning of string, and use a negative integer to start searching from the end of string. When no position is specified, INSTR searches from the beginning of the string (0).

occurrence

Optional. A positive integer that specifies which occurrence of substring to search for. When no occurrence is specified, INSTR searches for the first occurrence of the substring (1).

Return the position of the first occurrence of the substring "http://" starting at the end of the url field:

INSTR(url,"http://",-1,1)

The following expression searches for the second occurrence of the substring "st" starting at the beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in the string, so it returns 6:

INSTR("bestteststring","st",0,2)

The following expression searches backward for the second occurrence of the substring "st" starting at 7 characters before the end of the string "bestteststring".
INSTR finds that the substring starts at the third character in the string, so it returns 2:

INSTR("bestteststring","st",-7,2)

JAVA_STRING

JAVA_STRING is a row function that returns the unescaped version of a Java Unicode character escape sequence as a string value. This is useful when you want to specify Unicode characters in an expression. For example, you can use JAVA_STRING to specify the Unicode value representing a control character.

JAVA_STRING(unicode_escape_sequence)

Returns the unescaped version of the specified Unicode character, one value per row of type STRING.

unicode_escape_sequence

Required. A STRING value containing a Unicode character expressed as a Java Unicode escape sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a 'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits (the characters '0' through '9', 'a' through 'f', or 'A' through 'F'). Such sequences represent the UTF-16 encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.

Evaluate whether the currency field is equal to the yen symbol:

CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS

JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results of multiple values with the separator in between each non-null value.

JOIN_STRINGS(separator,value_expression[,value_expression][,...])

Returns one value per row of type STRING.

separator

Required. A field name of type STRING, a literal string, or an expression that returns a string.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY:

JOIN_STRINGS("/",month,day,year)

The following expression returns NULL:

JOIN_STRINGS("+",NULL,NULL,NULL)

The following expression returns a+b:

JOIN_STRINGS("+","a","b",NULL)

JSON_ARRAY_CONTAINS

JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string formatted as a JSON array and returns 1 or 0 depending on whether or not the string contains the search value.

JSON_ARRAY_CONTAINS(json_array_string,"search_string")

Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

json_array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in square brackets.

search_string

Required. The literal string that you want to search for. This can be the name of a field or expression of type STRING.

If you have a software field that contains a JSON array formatted like this:

["hadoop","platfora"]

The following expression returns 1:

JSON_ARRAY_CONTAINS(software,"platfora")
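Because the function performs a whole string match against each array value, partial matches return 0. For example, given the same software field:

JSON_ARRAY_CONTAINS(software,"plat") returns 0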
JSON_DOUBLE

JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.

JSON_DOUBLE(json_string,"json_field")

Returns one value per row of type DOUBLE.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract. For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier (for example, []).

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED

JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.

JSON_FIXED(json_string,"json_field")

Returns one value per row of type FIXED.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract. The same path syntax applies as for JSON_DOUBLE (see above).

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_FIXED(top_scores,"test_scores.2")
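The dot-path syntax accepted by the JSON functions also reaches into nested objects. For example, given a hypothetical player field containing {"name":{"first":"Ana","last":"Ruiz"},"stats":{"avg":0.312}}, the batting average could be extracted with:

JSON_DOUBLE(player,"stats.avg")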
JSON_INTEGER

JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.

JSON_INTEGER(json_string,"json_field")

Returns one value per row of type INTEGER.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract. The same path syntax applies as for JSON_DOUBLE (see above).

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA", "zip_code":"94403"}

You could extract the zip_code value using the expression:

JSON_INTEGER(address,"zip_code")

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG

JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.

JSON_LONG(json_string,"json_field")

Returns one value per row of type LONG.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract. The same path syntax applies as for JSON_DOUBLE (see above).

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_LONG(top_scores,"test_scores.2")
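JSON_LONG is the natural choice for large numeric values such as epoch timestamps in milliseconds. For example, this sketch (assuming a hypothetical event field containing {"ts":1435500000000}) extracts the timestamp:

JSON_LONG(event,"ts")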
JSON_STRING

JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.

JSON_STRING(json_string,"json_field")

Returns one value per row of type STRING.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract. The same path syntax applies as for JSON_DOUBLE (see above).

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA", "zip":"94403"}

You could extract the state value using the expression:

JSON_STRING(address,"state")

If you had a misc field that contained a JSON object formatted like this (with the values contained in an array):

{"hobbies":["sailing","hiking","cooking"], "interests":["art","music","travel"]}

You could extract the first value of the hobbies array using the expression:

JSON_STRING(misc,"hobbies.0")

LENGTH

LENGTH is a row function that returns the count of characters in a string value.

LENGTH(string)

Returns one value per row of type INTEGER.

string

Required. The name of a field or expression of type STRING (or a literal string).

Return the count of characters from values in the name field. For example, the value Bob would return a length of 3, Julie would return a length of 5, and so on:

LENGTH(name)

REGEX

REGEX is a row function that performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.

REGEX(string_expression,"regex_matching_pattern")

Returns the matched STRING value of the first capturing group of the regular expression. If there is no match, returns NULL.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_matching_pattern

Required. A regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. To return a non-NULL value, the regular expression pattern must match the entire string value.

This section lists a summary of the most commonly used constructs for defining a regular expression matching pattern. See the Regular Expression Reference for more information about regular expression support in Platfora.

Literal and Special Characters

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match succeeds because the strings are identical.

Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it in \Q ... \E. To escape literal double-quotes, double the double-quotes ("").

Character Name        Character   Reserved For
opening bracket       [           start of a character class
closing bracket       ]           end of a character class
hyphen                -           character ranges within a character class
backslash             \           general escape character
caret                 ^           beginning of string, negation of a character class
dollar sign           $           end of string
period                .           matching any single character
pipe                  |           alternation (OR) operator
question mark         ?           optional quantifier, quantifier minimizer
asterisk              *           zero or more quantifier
plus sign             +           once or more quantifier
opening parenthesis   (           start of a subexpression group
closing parenthesis   )           end of a subexpression group
opening brace         {           start of min/max quantifier
closing brace         }           end of min/max quantifier
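Escaping matters whenever the text contains metacharacters. For example, this sketch (assuming a hypothetical price field with values such as $5.99) escapes the dollar sign and period to return just the numeric portion:

REGEX(price,"\$(\d+\.\d{2})") returns 5.99 when the value of the price field is $5.99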
Character Class Constructs

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Construct      Type           Description
[abc]          simple         matches a or b or c
[^abc]         negation       matches any character except a or b or c
[a-zA-Z]       range          matches a through z, or A through Z (inclusive)
[a-d[m-p]]     union          matches a through d, or m through p
[a-z&&[def]]   intersection   matches d, e, or f
[a-z&&[^xq]]   subtraction    matches a through z, except for x and q

Predefined Character Classes

Predefined character classes offer convenient shorthands for commonly used regular expressions.

.   matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"
\d  matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"
\D  matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45"
\s  matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook"
\S  matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book"
\w  matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root"
\W  matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

^   matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 matches the "172" in IP address "172.18.1.11" but not in "192.172.2.33"
$   matches from the end of a line (multi-line matches are currently not supported). Example: d$ matches the "d" in "maid" but not in "made"
\b  matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island"; \bis matches both "is" and the "is" in "island", but not in "this"
\B  matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash"

Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

Greedy   Reluctant   Possessive   Description and Example
?        ??          ?+           matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"
*        *?          *+           matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print"
+        +?          ++           matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print"
{n}      {n}?        {n}+         matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"
{n,}     {n,}?       {n,}+        matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"
{n,m}    {n,m}?      {n,m}+       matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can have more than one group, and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:

(a(b*))+(c)

contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

Capturing Groups

By default, a group captures the text that produces a match, and only the most recent match is captured. The REGEX function returns the string that matches the first capturing group in the regular expression. For example, if the input string to the expression above was abc, the entire REGEX function would match to abc, but only return the result of group 1, which is ab.

Non-Capturing Groups

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.

Match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Match the request line of a web log, where the value is in the format of:

GET /some_page.html HTTP/1.1

and return just the requested HTML page names:

REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")

Extract the inches portion from a height field where example values are 6'2" and 5'11" (notice the escaping of the literal quote with a double double-quote):

REGEX(height,"\d\'(\d+)""")

Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:

REGEX(device,"(iP[ao]d|iPhone)")

REGEX_REPLACE

REGEX_REPLACE is a row function that evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.

REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")

Returns the regex_replace_pattern as a STRING value when regex_match_pattern produces a match. If there is no match, returns the value of string_expression as a STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_match_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can use capturing groups to create backreferences that can be used in the regex_replace_pattern.
You might want to use a string literal to make a case-sensitive match. For example, when you enter jane as the match value, the function matches jane but not Jane. The function matches all occurrences of a string literal in the string expression.

regex_replace_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can refer to backreferences from the regex_match_pattern using the syntax $n (where n is the group number).

The matching pattern constructs are the same as for the REGEX function. See the summary of regular expression constructs in the REGEX section above, and the Regular Expression Reference for more information.
Match the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

Match the values in a name field where name values are formatted as firstname lastname and replace them with name values formatted as lastname, firstname:

REGEX_REPLACE(name,"(.*) (.*)","$2, $1")

Match the string literal mrs in a title field and replace it with the string literal Mrs:

REGEX_REPLACE(title,"mrs","Mrs")

SPLIT

SPLIT is a row function that breaks down a delimited input string into sections and returns the specified section of the string. A section is considered any sub-string between the specified delimiter.

SPLIT(input_string_expression,"delimiter_string",position_integer)

Returns one value per row of type STRING.

input_string_expression

Required. The name of a field or expression of type STRING (or a literal string).

delimiter_string

Required. A literal string representing the delimiter used to separate values in the input string. The delimiter can be a single character or multiple characters.

position_integer

Required. An integer representing the position of the section in the input string that you want to extract. Positive integers count the position from the beginning of the string, and negative integers count the position from the end of the string. A value of 0 returns NULL.

Return the last section (in this case, the third) of the literal delimited string Restaurants>Location>San Francisco:

SPLIT("Restaurants>Location>San Francisco",">",-1) returns San Francisco

Return the first section of a phone_number field where phone number values are in the format of 123-456-7890:

SPLIT(phone_number,"-",1)

SUBSTRING

SUBSTRING is a row function that returns the specified characters of a string value based on the given start and end position.

SUBSTRING(string,start,end)

Returns one value per row of type STRING.

string

Required. The name of a field or expression of type STRING (or a literal string).

start

Required. An integer that specifies where the returned characters start (inclusive), with 0 being the first character of the string. If start is greater than the number of characters, an empty string is returned. If start is greater than end, an empty string is returned.

end

Required. A positive integer that specifies where the returned characters end (exclusive), with the end character not being part of the return value. If end is greater than the number of characters, the whole string value (from start) is returned.

Return the first letter of the name field:

SUBSTRING(name,0,1)

TO_LOWER

TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.

TO_LOWER(string_expression)

Returns one value per row of type STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Return the literal input string 123 Main Street in all lower case letters:

TO_LOWER("123 Main Street") returns 123 main street
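Case normalization is commonly used to make string comparisons case-insensitive. For example, this sketch (assuming a hypothetical email field) lets values such as Jane@Example.COM and jane@example.com compare as equal:

TO_LOWER(email)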
TO_UPPER

TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.

TO_UPPER(string_expression)

Returns one value per row of type STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Return the literal input string 123 Main Street in all upper case letters:

TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM

TRIM is a row function that removes leading and trailing spaces from a string value.

TRIM(string_expression)

Returns one value per row of type STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Return the value of the area_code field without any leading or trailing spaces. For example, if the input string is " 650 ", the return value would be "650":

TRIM(area_code)

Return the value of the phone_number field without any leading or trailing spaces. For example, if the input string is " 650 123-4567 ", the return value would be "650 123-4567" (note that the extra spaces in the middle of the string are not removed, only the spaces at the beginning and end of the string):

TRIM(phone_number)

XPATH_STRING

XPATH_STRING is a row function that takes an XML-formatted string and returns the first string matching the given XPath expression.

XPATH_STRING(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING. If the XPath expression matches more than one string in the given XML node, this function returns the first match only. To return all matches, use XPATH_STRINGS instead.

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

These example XPATH_STRING expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list>
 <address type="work">
  <street>1300 So. El Camino Real</street>
  <street>Suite 600</street>
  <city>San Mateo</city>
  <state>CA</state>
  <zipcode>94403</zipcode>
 </address>
 <address type="home">
  <street>123 Oakdale Street</street>
  <street/>
  <city>San Francisco</city>
  <state>CA</state>
  <zipcode>94123</zipcode>
 </address>
</list>

Get the zipcode value from any address element where the type attribute equals home:

XPATH_STRING(address,"//address[@type='home']/zipcode") returns: 94123

Get the city value from the second address element:

XPATH_STRING(address,"/list/address[2]/city") returns: San Francisco

Get the values from all child elements of the first address element (as one string):

XPATH_STRING(address,"/list/address") returns: 1300 So. El Camino RealSuite 600 San MateoCA94403
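XPath expressions can also select attribute values. For example, this sketch returns the type attribute of the second address element from the same field:

XPATH_STRING(address,"/list/address[2]/@type") returns: home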
XPATH_STRINGS

XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.

XPATH_STRINGS(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING. If the XPath expression matches more than one string in the given XML node, this function returns all matches separated by a newline (you cannot specify a different delimiter).

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

These example XPATH_STRINGS expressions assume the same address field shown in the XPATH_STRING example above.

Get all zipcode values from all address elements:

XPATH_STRINGS(address,"//address/zipcode") returns:
94123
94403

Get all street values from the first address element:

XPATH_STRINGS(address,"/list/address[1]/street") returns:
1300 So. El Camino Real
Suite 600

Get the values from all child elements of all address elements (as one string per line):

XPATH_STRINGS(address,"/list/address") returns:
123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_XML

XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.

XPATH_XML(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING in XML format.

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

These example XPATH_XML expressions also assume the same address field shown in the XPATH_STRING example above.

Get the last address node and its child nodes in XML format:

XPATH_XML(address,"//address[last()]") returns:
<address type="home">
 <street>123 Oakdale Street</street>
 <street/>
 <city>San Francisco</city>
 <state>CA</state>
 <zipcode>94123</zipcode>
</address>

Get the city value from the second address node in XML format:

XPATH_XML(address,"/list/address[2]/city") returns: <city>San Francisco</city>

Get the first address node and its child nodes in XML format:

XPATH_XML(address,"/list/address[1]") returns:
<address type="work">
 <street>1300 So. El Camino Real</street>
 <street>Suite 600</street>
 <city>San Mateo</city>
 <state>CA</state>
 <zipcode>94403</zipcode>
</address>

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.
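As an overview of how these functions decompose a URL, consider the hypothetical string http://admin:secret@www.example.com:8012/docs/index.html?topic=press#phone. Based on the definitions below: URL_PROTOCOL returns http, URL_AUTHORITY returns admin:secret@www.example.com:8012, URL_HOST returns www.example.com, URL_PORT returns 8012, URL_PATH returns /docs/index.html, URL_QUERY returns topic=press, and URL_FRAGMENT returns phone.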
URL_AUTHORITY

URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority portion of a URL is the part that has the information on how to locate and connect to the server.

URL_AUTHORITY(string)

Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the authority portion is www.platfora.com. In the string http://user:password@mycompany.com:8012/mypage.html, the authority portion is user:password@mycompany.com:8012. In the string mailto:username@mycompany.com?subject=Topic, the authority portion is NULL.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be preceded by optional user information terminated with @ (for example, username:password@platfora.com), and followed by an optional port number preceded by a colon (for example, localhost:8001).

Return the authority portion of URL string values in the referrer field:

URL_AUTHORITY(referrer)

Return the authority portion of a literal URL string:

URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT

URL_FRAGMENT is a row function that returns the fragment portion of a URL string.

URL_FRAGMENT(string)

Returns the fragment portion of a URL as a STRING value. Returns NULL if the URL does not contain a fragment, or if the input string is not a valid URL.

For example, in the string http://www.platfora.com/contact.html#phone, the fragment portion is phone. In the string http://www.platfora.com/contact.html, the fragment portion is NULL. In the string http://platfora.com/news.php?topic=press#Platfora%20News, the fragment portion is Platfora%20News.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a secondary resource, such as a heading or anchor identifier.

Return the fragment portion of URL string values in the request field:

URL_FRAGMENT(request)

Return the fragment portion of a literal URL string:

URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

Return and decode the fragment portion of a literal URL string:

URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")) returns Platfora News

URL_HOST

URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.

URL_HOST(string)

Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the host portion is www.platfora.com. In the string http://admin:admin@127.0.0.1:8001/index.html, the host portion is 127.0.0.1. In the string mailto:username@mycompany.com?subject=Topic, the host portion is NULL.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).
Return the host portion of URL string values in the referrer field:

URL_HOST(referrer)

Return the host portion of a literal URL string:

URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH

URL_PATH is a row function that returns the path portion of a URL string.

URL_PATH(string)

Returns the path portion of a URL as a STRING value. Returns NULL if the URL does not contain a path, or if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the path portion is /company/contact.html. In the string http://admin:admin@127.0.0.1:8001/index.html, the path portion is /index.html. In the string mailto:username@mycompany.com?subject=Topic, the path portion is username@mycompany.com.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional path portion of the URL is a sequence of resource location segments separated by a forward slash (/), conceptually similar to a directory path.

Return the path portion of URL string values in the request field:

URL_PATH(request)

Return the path portion of a literal URL string:

URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT

URL_PORT is a row function that returns the port portion of a URL string.

URL_PORT(string)

Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, returns -1. If the input string is not a valid URL, returns NULL.

For example, in the string http://localhost:8001, the port portion is 8001.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be followed by an optional port number preceded by a colon (for example, localhost:8001).

Return the port portion of URL string values in the referrer field:

URL_PORT(referrer)

Return the port portion of a literal URL string:

URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012
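When the URL omits the port, the function returns -1 rather than NULL. For example:

URL_PORT("http://www.platfora.com/index.html") returns -1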
URL_PROTOCOL

URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL string.

URL_PROTOCOL(string)

Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com, the protocol portion is http. In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is ftp.

string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon (:). For example: http:, ftp:, mailto:

Return the protocol portion of URL string values in the referrer field:

URL_PROTOCOL(referrer)

Return the protocol portion of a literal URL string:

URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY

URL_QUERY is a row function that returns the query portion of a URL string.

URL_QUERY(string)

Returns the query portion of a URL as a STRING value, NULL if the URL does not contain a query, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/contact.html, the query portion is NULL. In the string http://platfora.com/news.php?topic=press&timeframe=today#Platfora%20News, the query portion is topic=press&timeframe=today. In the string mailto:username@mycompany.com?subject=Topic, the query portion is subject=Topic.

string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional query portion of the URL is separated by a question mark (?) and typically contains an unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).

Return the query portion of URL string values in the request field:

URL_QUERY(request)

Return the query portion of a literal URL string:

URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE

URLDECODE is a row function that decodes a string that has been encoded with the application/x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP GET request, application/x-www-form-urlencoded data is included in the query component of the request URI. When sent in an HTTP POST request, the data is placed in the body of the message, and the name of the media type is included in the message Content-Type header.

URLDECODE(string)

Returns a value of type STRING with characters decoded as follows:

• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.
• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain unchanged.
• The plus sign (+) character is converted to a space character.
• The percent character (%) is interpreted as the start of a special escaped sequence, where in the sequence %HH, HH represents the hexadecimal value of the byte.

For example, some common escape sequences are:

Percent encoding sequence    Value
%20                          space
%0A or %0D or %0D%0A         newline
%22                          double quote (")
%25                          percent (%)
%2D                          hyphen (-)
%2E                          period (.)
%3C                          less than (<)
%3E                          greater than (>)
%5C                          backslash (\)
%7C                          pipe (|)

string
Required. A field or expression that returns a STRING value. It is assumed that all characters in the input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits (0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character (+) is allowed, but is interpreted as a space character.

Decode the values of the url_query field:

URLDECODE(url_query)

Convert a literal URL encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable value (N/A or "not applicable"):

URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not applicable"
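As a cross-check, the decoding rules above match the application/x-www-form-urlencoded decoding found in most standard libraries. A minimal Python sketch (illustrative only; not part of Platfora):

  from urllib.parse import unquote_plus

  # %HH sequences decode to the byte value; + decodes to a space.
  print(unquote_plus("N%2FA%20or%20%22not%20applicable%22"))  # N/A or "not applicable"
  print(unquote_plus("a+b%3Dc"))                              # a b=c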
IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH

CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.

CIDR_MATCH(CIDR_string, IP_string)

Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask and 0 if it does not.

CIDR_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR mask (Classless Inter-Domain Routing subnet notation). An IPv4 CIDR mask can only successfully match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.

IP_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet protocol (IP) address.

Compare an IPv4 CIDR subnet mask to an IPv4 IP address:

CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1
CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0

Compare an IPv6 CIDR subnet mask to an IPv6 IP address:

CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1
CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0

HEX_TO_IP

HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of an IP address.

HEX_TO_IP(string)

Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address returned depends on the input string: an 8-character hexadecimal string returns an IPv4 address, and a 32-character hexadecimal string returns an IPv6 address. IPv6 addresses are represented in full length, without removing any leading zeros and without using the compressed :: notation. For example, 2001:0db8:0000:0000:0000:ff00:0042:8329 rather than 2001:db8::ff00:42:8329. Input strings that do not contain either 8 or 32 valid hexadecimal characters return NULL.

string
Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters long (in which case it is converted to an IPv6 address).

Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips column:

HEX_TO_IP(byte_encoded_ips)

Convert an 8-character hexadecimal-encoded string to a plain text IPv4 address:

HEX_TO_IP("AB20FE01") returns 171.32.254.1

Convert a 32-character hexadecimal-encoded string to a plain text IPv6 address:

HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns fe80:0000:0000:0000:0202:b3ff:fe1e:8329
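A minimal Python sketch of both functions' documented behavior, using the standard ipaddress module (illustrative only; the helper names and error handling are assumptions, not Platfora internals):

  import ipaddress

  def cidr_match(cidr, ip):
      # 1 if the address falls within the subnet, 0 if not; a mixed
      # IPv4/IPv6 comparison never matches.
      return int(ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False))

  def hex_to_ip(s):
      # 8 hex characters -> IPv4; 32 -> full-length IPv6; anything else -> None (NULL).
      if len(s) == 8:
          return str(ipaddress.IPv4Address(int(s, 16)))
      if len(s) == 32:
          return ipaddress.IPv6Address(int(s, 16)).exploded  # keeps leading zeros
      return None

  print(cidr_match("60.145.56.0/24", "60.145.56.246"))  # 1
  print(cidr_match("60.145.56.0/30", "60.145.56.246"))  # 0
  print(hex_to_ip("AB20FE01"))                          # 171.32.254.1
  print(hex_to_ip("FE800000000000000202B3FFFE1E8329"))  # fe80:0000:0000:0000:0202:b3ff:fe1e:8329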
Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

DAYS_BETWEEN

DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between two DATETIME values (value1-value2).

DAYS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of days to ship a product by subtracting the value of the order_date field from the ship_date field:

DAYS_BETWEEN(ship_date,order_date)

Calculate the number of days since a product's release by subtracting the value of the release_date field in the product dataset from the current date (the result of NOW()):

DAYS_BETWEEN(NOW(),product.release_date)

DATE_ADD

DATE_ADD is a row function that adds the specified time interval to a DATETIME value.

DATE_ADD(datetime,quantity,"interval")

Returns a value of type DATETIME.

datetime
Required. A field name or expression that returns a DATETIME value.

quantity
Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use a negative integer.

interval
Required. One of the following time intervals:

• millisecond - Adds the specified number of milliseconds to a datetime value.
• second - Adds the specified number of seconds to a datetime value.
• minute - Adds the specified number of minutes to a datetime value.
• hour - Adds the specified number of hours to a datetime value.
• day - Adds the specified number of days to a datetime value.
• week - Adds the specified number of weeks to a datetime value.
• month - Adds the specified number of months to a datetime value.
• quarter - Adds the specified number of quarters to a datetime value.
• year - Adds the specified number of years to a datetime value.
• weekyear - Adds the specified number of weekyears to a datetime value.

Add 45 days to the value of the invoice_date field to calculate the date a payment is due:

DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN

HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values (value1-value2).

HOURS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of hours to ship a product by subtracting the value of the order_date field from the ship_date field:

HOURS_BETWEEN(ship_date,order_date)

Calculate the number of hours since an advertisement was viewed by subtracting the value of the adview_timestamp field in the impressions dataset from the current date and time (the result of NOW()):

HOURS_BETWEEN(NOW(),impressions.adview_timestamp)
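The "whole number of hours" behavior is easy to misread. One plausible reading, shown in this minimal Python sketch (an illustration, not Platfora's implementation), truncates both values to the hour before differencing, so minutes and seconds never contribute:

  from datetime import datetime

  def hours_between(dt1, dt2):
      # Whole hours between dt1 and dt2 (value1 - value2), ignoring minutes and seconds.
      t1 = dt1.replace(minute=0, second=0, microsecond=0)
      t2 = dt2.replace(minute=0, second=0, microsecond=0)
      return int((t1 - t2).total_seconds() // 3600)

  order_date = datetime(2013, 5, 1, 9, 55)
  ship_date  = datetime(2013, 5, 1, 11, 5)
  print(hours_between(ship_date, order_date))  # 2, although only 70 minutes elapsed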
EXTRACT

EXTRACT is a row function that returns the specified portion of a DATETIME value.

EXTRACT("extract_value",datetime)

Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example, the month of April returns a value of 4, not 04.

extract_value
Required. One of the following extract values:

• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 213.
• second - Returns the second portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 40.
• minute - Returns the minute portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 38.
• hour - Returns the hour portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 20.
• day - Returns the day portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 15.
• week - Returns the ISO week number for the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52 (January 1, 2012 is part of the last ISO week of 2011).
• month - Returns the month portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 8.
• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1, April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an integer value of 3.
• year - Returns the year portion of a datetime value. For example, an input datetime value of 2012-01-01 would return an integer value of 2012.
• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).

datetime
Required. A field name or expression that returns a DATETIME value.

Extract the hour portion from the order_date datetime field:

EXTRACT("hour",order_date)

Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO week year:

EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))

MILLISECONDS_BETWEEN

MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between two DATETIME values (value1-value2).

MILLISECONDS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of milliseconds it took to serve a web page by subtracting the value of the request_timestamp field from the response_timestamp field:

MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN

MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values (value1-value2).

MINUTES_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of NOW()):

MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)
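The ISO week and weekyear rules described for EXTRACT can be verified with Python's datetime module, whose isocalendar() method implements the same ISO 8601 convention (shown only as an illustration):

  from datetime import date

  print(date(2012, 1, 2).isocalendar()[:2])  # (2012, 1): weekyear 2012, week 1
  print(date(2012, 1, 1).isocalendar()[:2])  # (2011, 52): weekyear 2011, week 52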
NOW

NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be used in other expressions involving DATETIME type fields, such as DAYS_BETWEEN, HOURS_BETWEEN, or YEAR_DIFF. Note that the value of NOW is only evaluated at the time a lens is built (it is not re-evaluated with each query).

NOW()

Returns the current system date and time as a DATETIME value.

Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users dataset from the current date:

YEAR_DIFF(NOW(),users.birthdate)

Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of the release_date field from the current date:

DAYS_BETWEEN(NOW(),release_date)

SECONDS_BETWEEN

SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values (value1-value2).

SECONDS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of NOW()):

SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)

TRUNC

TRUNC is a row function that truncates a DATETIME value to the specified format.

TRUNC(datetime,"format")

Returns a value of type DATETIME truncated to the specified format.

datetime
Required. A field or expression that returns a DATETIME value.

format
Required. One of the following format values:

• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect, since millisecond is already the most granular format for datetime values. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.
• second - Returns a datetime value truncated to second granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.
• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.
• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.
• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.
• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of 2012-08-13 (the Monday prior).
• month - Returns a datetime value truncated to the first day of the month. For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-08-01.
• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1, or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-07-01.
• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.
• weekyear - Returns a datetime value truncated to the first day of the ISO weekyear (the ISO week starting with the Monday which is nearest in time to January 1). For example, an input datetime value of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO weekyear for 2008 is December 31, 2007 (the prior Monday closest to January 1).

Truncate the order_date datetime field to day granularity:

TRUNC(order_date,"day")

Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day granularity:

TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")
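A minimal Python sketch of a few of the TRUNC granularities described above (an illustration of the semantics, not Platfora's implementation):

  from datetime import datetime, timedelta

  def trunc(dt, fmt):
      if fmt == "day":
          return dt.replace(hour=0, minute=0, second=0, microsecond=0)
      if fmt == "week":   # back up to the preceding (or same) Monday
          day = trunc(dt, "day")
          return day - timedelta(days=day.weekday())
      if fmt == "month":
          return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
      raise ValueError(fmt)

  ts = datetime(2012, 8, 15, 20, 38, 40, 213000)
  print(trunc(ts, "day"))    # 2012-08-15 00:00:00
  print(trunc(ts, "week"))   # 2012-08-13 00:00:00 (the Monday prior)
  print(trunc(ts, "month"))  # 2012-08-01 00:00:00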
YEAR_DIFF

YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME values (value1-value2).

YEAR_DIFF(datetime_1,datetime_2)

Returns one value per row of type DOUBLE.

datetime_1
Required. A field or expression of type DATETIME.

datetime_2
Required. A field or expression of type DATETIME.

Calculate the number of years a user has been a customer by subtracting the value of the registration_date field from the current date (the result of NOW()):

YEAR_DIFF(NOW(),registration_date)

Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current date (the result of NOW()):

YEAR_DIFF(NOW(),users.birthdate)

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use arithmetic operators to perform simple math calculations.

DIV

DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result is truncated to 0 decimal places).

DIV(dividend,divisor)

Returns one value per row of type LONG.

dividend
Required. A field or expression of type LONG.

divisor
Required. A field or expression of type LONG.

Cast the value of the file_size field to LONG and divide by 1024:

DIV(TO_LONG(file_size),1024)

EXP

EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.

EXP(power)

Returns one value per row of type DOUBLE.

power
Required. A field or expression of a numeric type.

Raise e to the power given in the Value field:

EXP(Value)

When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.

FLOOR

FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.

FLOOR(double)

Returns one value per row of type DOUBLE.

double
Required. A field or expression of type DOUBLE.

Return the floor value of 32.6789:

FLOOR(32.6789) returns 32.0

HASH

HASH is a row function that evenly partitions data values into the specified number of buckets. It creates a hash of the input value and assigns that value a bucket number. Equal values always hash to the same bucket number.

HASH(field_name,integer)

Returns one value per row of type INTEGER corresponding to the bucket number that the input value hashes to.

field_name
Required. The name of the field whose values you want to partition.

integer
Required. The desired number of buckets. This parameter can be a numeric value of any data type, but when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero, the function returns NULL. When the value is negative, the function uses the absolute value.

Partition the values of the username field into 20 buckets:

HASH(username,20)
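The bucketing behavior amounts to hashing the value and taking the remainder modulo the bucket count. A minimal Python sketch (zlib.crc32 stands in for Platfora's unspecified internal hash; the helper name and sample value are hypothetical):

  import zlib

  def hash_bucket(value, buckets):
      n = int(buckets)          # non-integer bucket counts are truncated
      if n == 0:
          return None           # zero buckets -> NULL
      n = abs(n)                # negative counts use the absolute value
      # Equal inputs always hash to the same bucket number.
      return zlib.crc32(str(value).encode("utf-8")) % n

  print(hash_bucket("some_user", 20))
  print(hash_bucket("some_user", 20))  # same input, same bucket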
LN

LN is a row function that returns the natural logarithm of a number. The natural logarithm is the logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to 2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in order to equal x.

LN(positive_number)

Returns the exponent to which base e must be raised to obtain the input value, where e denotes the constant number 2.718281828. The return value is the same data type as the input value. For example, LN(7.389) is approximately 2, because e to the power of 2 is approximately 7.389.

positive_number
Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER, LONG, DOUBLE, or FIXED.

Return the natural logarithm of the base number e, which is approximately 2.718281828:

LN(2.718281828) returns 1
LN(3.0000) returns 1.098612
LN(300.0000) returns 5.703782

MOD

MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the result is truncated to 0 decimal places).

MOD(dividend,divisor)

Returns one value per row of type LONG.

dividend
Required. A field or expression of type LONG.

divisor
Required. A field or expression of type LONG.

Cast the value of the file_size field to LONG and return the remainder after dividing it by 1024:

MOD(TO_LONG(file_size),1024)

POW

POW is a row function that raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.

POW(index,power)

Returns one value per row of type DOUBLE.

index
Required. A field or expression of a numeric type.

power
Required. A field or expression of a numeric type.

Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five year span:

100 * (POW(end_value/start_value, 0.2) - 1)

Calculate the square of the Value field:

POW(Value,2)

Calculate the square root of the Value field:

POW(Value,0.5)

The following expression returns 1:

POW(0,0)

ROUND

ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.

ROUND(double,number_decimal_places)

Returns one value per row of type DOUBLE.

double
Required. A field or expression of type DOUBLE.

number_decimal_places
Required. An integer that specifies the number of decimal places to round to.

Round the number 32.4678954 to two decimal places:

ROUND(32.4678954,2) returns 32.47

Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE

EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.

EPOCH_MS_TO_DATE(long_expression)

Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.

long_expression
Required. A field or expression of type LONG representing the number of milliseconds since the epoch datetime (January 1, 1970 00:00:00:000 GMT).

Convert a number representing the number of milliseconds from the epoch to a human-readable date and time:

EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

Or if your data is in seconds instead of milliseconds:

EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT
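The same conversion can be sanity-checked with Python's datetime module (illustrative only; Platfora formats the result as shown above):

  from datetime import datetime, timezone

  ms = 1360260240000
  print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))
  # 2013-02-07 18:04:00+00:00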
TO_CURRENCY

This function is deprecated. Use the TO_FIXED function instead.

TO_DATE

TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.

TO_DATE(string_expression,"date_format")

Returns one value per row of type DATETIME (which by definition is in UTC).

string_expression
Required. A field or expression of type STRING.

date_format
Required. A pattern that describes how the date is formatted. Use the following pattern symbols to define your date format. The count and ordering of the pattern letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear in the resulting output even if they are not escaped with single quotes.

Table 2: Date Pattern Symbols

Symbol  Meaning                           Presentation  Examples
G       era                               text          AD
C       century of era (0 or greater)     number        20
Y       year of era (0 or greater)        year          1996
x       week year                         year          1996
w       week number of week year          number        27
e       day of week (number)              number        2
E       day of week (name)                text          Tuesday; Tue
y       year                              year          1996
D       day of year                       number        189
M       month of year                     month         July; Jul; 07
d       day of month                      number        10
a       half day of day                   text          PM
K       hour of half day (0-11)           number        0
h       clock hour of half day (1-12)     number        12
H       hour of day (0-23)                number        0
k       clock hour of day (1-24)          number        24
m       minute of hour                    number        30
s       second of minute                  number        55
S       fraction of second                number        978
z       time zone                         text          Pacific Standard Time; PST
Z       time zone offset/id               zone          -0800; -08:00; America/Los_Angeles
'       escape for text-based delimiters  delimiter
''      literal single quote              literal       '

• Numeric presentation for year and week year fields is handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits.
• For month of year (M), if the number of pattern letters is 3 or more, the text form is used; otherwise the number is used.
• For day of week name (E) and time zone (z), if the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
• 'Z' outputs the offset without a colon, 'ZZ' outputs the offset with a colon, and 'ZZZ' or more outputs the zone id.

Define a new DATETIME computed field based on the order_date base field, which contains timestamps in the format of: 2014.07.10 at 15:08:56 PDT:

TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

Define a new DATETIME computed field by first combining individual month, day, year, and depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure three-digit times are converted to four-digit times (using REGEX_REPLACE):

TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")

Define a new DATETIME computed field based on the created_at base field, which contains timestamps in the format of: Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's API):

TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")
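The depart_time transformation in the second example pads a three-digit time (such as 955) to four digits so the HHmm pattern can parse it. A minimal Python sketch of the same regular expression (illustrative only; the sample value is hypothetical):

  import re

  depart_time = "955"
  print(re.sub(r"\b(\d{3})\b", r"0\1", depart_time))  # 0955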
TO_DOUBLE

TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.

TO_DOUBLE(expression)

Returns one value per row of type DOUBLE.

expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Convert the values of the average_rating field to a double data type:

TO_DOUBLE(average_rating)

Convert the average_rating field to a double data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_DOUBLE(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)

TO_FIXED

TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.

TO_FIXED(expression)

Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).

expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Convert the opening_price field to a fixed decimal data type:

TO_FIXED(opening_price)

Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A string values to NULL values using a CASE expression:

TO_FIXED(CASE WHEN sale_price="N/A" then NULL ELSE sale_price END)

TO_INT

TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. When converting DOUBLE values, everything after the decimal is truncated (not rounded up or down).

TO_INT(expression)

Returns one value per row of type INTEGER.

expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Convert the values of the average_rating field to an integer data type:

TO_INT(average_rating)

Convert the flight_duration field to an integer data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_INT(CASE WHEN flight_duration="N/A" then NULL ELSE flight_duration END)

TO_LONG

TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. When converting DOUBLE values, everything after the decimal is truncated (not rounded up or down).

TO_LONG(expression)

Returns one value per row of type LONG.

expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Convert the values of the average_rating field to a long data type:

TO_LONG(average_rating)

Convert the average_rating field to a long data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_LONG(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)
TO_STRING

TO_STRING is a row function that converts values of other data types to STRING (character) values.

TO_STRING(expression)
TO_STRING(datetime_expression,date_format)

Returns one value per row of type STRING.

expression
A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.

datetime_expression
A field or expression of type DATETIME.

date_format
If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE for the date format patterns.

Convert the values of the sku_number field to a string data type:

TO_STRING(sku_number)

Convert values in the age column into range-based groupings (binning), and cast the output values to STRING:

TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END)

Convert the values of a timestamp datetime field to a string, where the timestamp values are in the format of: 2002.07.10 at 15:08:56 PDT:

TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. Aggregate functions cannot be combined with row functions.

AVG

AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT expression instead.

AVG(numeric_field)

Returns a value of type DOUBLE.

numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Get the average of the valid sale_amount field values:

AVG(sale_amount)

Get the average of the valid net_worth field values in the billionaires dataset, which resides in the samples namespace:

AVG([(samples) billionaires].net_worth)

Get the average of all page_views field values in the web_logs dataset (including NULL values):

SUM(page_views)/COUNT(web_logs)

COUNT

COUNT is an aggregate function that returns the number of rows in a dataset.

COUNT([namespace_name]dataset_name)

Returns a value of type INTEGER.

namespace_name
Optional. The name of the namespace in which the dataset resides. If not specified, uses the default namespace.

dataset_name
Required. The name of the dataset for which to obtain a count of rows. If you want to count rows of a down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names in the format of: parent_dataset_name.child_dataset_name.[...]
Count the rows in the sales dataset:

COUNT(sales)

Count the rows in the billionaires dataset, which resides in the samples namespace:

COUNT([(samples) billionaires])

Count the rows in the customer dataset, which is a related dataset down-stream of sales:

COUNT(sales.customers)

COUNT_VALID

COUNT_VALID is an aggregate function that returns the number of rows for which the given expression is valid (excludes NULL values).

COUNT_VALID(field)

Returns a numeric value of type INTEGER.

field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Count the valid values in the page_views field:

COUNT_VALID(page_views)

DISTINCT

DISTINCT is an aggregate function that returns the number of distinct values for the given expression.

DISTINCT(field)

Returns a numeric value of type INTEGER.

field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Count the unique values of the user_id field in the currently selected dataset:

DISTINCT(user_id)

Count the unique values of the name field in the billionaires dataset, which resides in the samples namespace:

DISTINCT([(samples) billionaires].name)

Count the unique values of the customer_id field in the customer dataset, which is a related dataset down-stream of web sales:

DISTINCT([web sales].customers.customer_id)

MAX

MAX is an aggregate function that returns the biggest value from the given input expression.

MAX(numeric_or_datetime_field)

Returns a numeric or datetime value of the same type as the input expression.

numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Get the highest value from the sale_amount field:

MAX(sale_amount)

Get the latest date from the Session Timestamp datetime field:

MAX([Session Timestamp])

MIN

MIN is an aggregate function that returns the smallest value from the given input expression.

MIN(numeric_or_datetime_field)

Returns a numeric or datetime value of the same type as the input expression.

numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Get the lowest value from the sale_amount field:

MIN(sale_amount)

Get the earliest date from the Session Timestamp datetime field:

MIN([Session Timestamp])

SUM

SUM is an aggregate function that returns the total of all values from the given input expression.

SUM(numeric_field)

Returns a numeric value of the same type as the input expression.

numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Add the values of the sale_amount field:

SUM(sale_amount)

Add values of the session count field in the users dataset, which is a related dataset down-stream of clicks:

SUM(clicks.users.[session count])

STDDEV

STDDEV is an aggregate function that calculates the population standard deviation for a group of numeric values. Standard deviation is the square root of the variance.

STDDEV(numeric_field)

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.

numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.
Calculate the standard deviation of the values contained in the sale_amount field:

STDDEV(sale_amount)

VARIANCE

VARIANCE is an aggregate function that calculates the population variance for a group of numeric values. Variance measures the amount by which all values in a group vary from the average value of the group. Data with low variance contains values that are identical or similar. Data with high variance contains values that are not similar. Variance is calculated as the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

VARIANCE(numeric_field)

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.

numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Get the population variance of the values contained in the sale_amount field:

VARIANCE(sale_amount)
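The definition above (the average of squared deviations from the mean) is straightforward to verify. A minimal Python sketch with hypothetical sample values:

  values = [100.0, 200.0, 300.0, 400.0]
  mean = sum(values) / len(values)
  variance = sum((v - mean) ** 2 for v in values) / len(values)  # population variance
  stddev = variance ** 0.5                                       # STDDEV is its square root
  print(variance, stddev)  # 12500.0 111.80339887498948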
ROLLUP and Window Functions

Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate expression that determines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied. ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top N per group results.

ROLLUP

ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation over a subset of rows within the overall result of a viz query.

ROLLUP aggregate_expression
[ WHERE input_group_condition [...] ]
[ TO ([partitioning_columns])
[ ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ] ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric column of a dataset. For example, suppose we had a dataset with the following rows and columns:

Date        Sale Amount  Product  Region
05/01/2013  100          gadget   west
05/01/2013  200          widget   east
06/01/2013  100          gadget   east
06/01/2013  400          widget   west
07/01/2013  300          widget   west
07/01/2013  200          gadget   east

To define a regular measure called Total Sales, we would use the expression:

SUM([Sale Amount])

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimensions selected by the user when they create the viz. For example, if the user chose Region as a dimension in the viz, there would be two input groups for which the measure would be calculated:

Region  Total Sales
east    500
west    800

If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the ROLLUP expression determine the additional partitions over which to compute the aggregate expression. It divides the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression within each bucket.

Every ROLLUP expression has implicit partitioning defined: an absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever dimension columns are present in the viz query.

The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the WHERE clause criteria will be partitioned, and rows that don't will not be partitioned.

The ORDER BY with a RANGE or ROW clause is used to define a window frame within each partition over which to compute the aggregate expression.

When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is similar to the type of calculation that is done with a regular measure. However, unlike a regular measure, a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows still retain their separate identities. The ROLLUP clause determines how the input rows are split up for processing by the ROLLUP's aggregate function.

ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns are selected in the visualization. This is done by using a reference name as the partitioning column, as opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions total sales by date as follows:

ROLLUP SUM([Sale Amount]) TO (Date)

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure expression would aggregate the sales by month only. Notice that the results for the east and west regions are the same - this is because the aggregation expression is only considering rows that share the same month when calculating the sum of sales.

Month      Region  Rollup Sales to Date
May 2013   east    300
May 2013   west    300
June 2013  east    500
June 2013  west    500
July 2013  east    500
July 2013  west    500

Suppose within the date partition, we wanted to calculate the cumulative total day to day. We could define a window measure called Running Total to Date that looks at each day and all preceding days as follows:

ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED PRECEDING

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or mark), and all rows that come before it within the partition. For example, if the user chose the dimension Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales within each month.
Month      Date.Date   Running Total to Date
May 2013   2013-05-01  300
June 2013  2013-06-01  500
July 2013  2013-07-01  500

Returns a numeric value per partition based on the output type of the aggregate_expression.

aggregate_expression
Required. An expression containing an aggregate or window function. Simple aggregate functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions such as RANK, DENSE_RANK, and NTILE are supported and can only be used in conjunction with ROLLUP. Complex aggregate functions such as STDDEV and VARIANCE are not supported.

WHERE input_group_condition
The WHERE clause limits the group of input rows over which to compute the aggregate expression. The input group condition is a Boolean (true or false) condition defined using a comparison operator expression. Any row that does not satisfy the condition will be excluded from the input group used to calculate the aggregated measure value. For example (note that datetime values must be specified in yyyy-MM-dd format):

WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31
WHERE Date.Year BETWEEN 2009 AND 2013
WHERE Company LIKE("Plat*")
WHERE Code IN("a","b","c")
WHERE Sales < 50.00
WHERE Age >= 21

You can specify multiple WHERE clauses in a ROLLUP expression.

TO ([partitioning_columns])
The TO clause is used to specify the dimension column(s) used to partition a group of input rows. This allows you to calculate a measure value for a specific dimension group (a subset of input rows) that is somehow related to the other dimension groups used in a visualization (all input rows). It is possible to define an empty group (meaning all rows) by using empty parentheses.

When used in a visualization, measure values are computed for groups of input rows that return the same value for the columns specified in the partitioning list. For example, if the Date.Month column is used as a partitioning column, then all records that have the same value for Date.Month will be grouped together in order to calculate the measure value. The aggregate expression is applied to the group specified in the TO clause independently of the other dimension groupings used in the visualization. Note that the partitioning column(s) specified in the TO clause of an adaptive measure expression must also be included as dimensions (or grouping columns) in the visualization.

A partitioning column can also be the name of a reference field. Using a reference field allows the partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz. For example, if the partition column is a reference field pointing to the Date dimension, then any subfield of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a viz.

A TO clause with an empty partitioning list treats each mark in the result set as an input group. For example, if the viz includes the Month and Region columns, then TO() would be equivalent to TO(Month,Region).

ORDER BY (ordering_column)
The optional ORDER BY clause orders the input rows using the values in the specified column within each partition identified in the TO clause. Use the ORDER BY clause along with the ROWS or RANGE clauses to define windows over which to compute the aggregate function.
This is useful for computing moving averages, cumulative aggregates, running totals, or a top value per group of input rows. The ordering column specified in the ORDER BY clause can be a dimension, measure, or an aggregate expression (for example ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in the viz. By default, rows are sorted in ascending order (low to high values). You can use the DESC keyword to sort in descending order (high to low values).

ROWS | RANGE
Required when using ORDER BY. Further limits the rows within the partition by specifying start and end points within the partition. This is done by specifying a range of rows with respect to the current row either by logical association (RANGE) or physical association (ROWS). Use either a ROWS or RANGE clause to express the window boundary (the set of input rows in each partition, relative to the current row, over which to compute the aggregate expression). The window boundary can include one, several, or all rows of the partition. When using the RANGE clause, the ordering column used in the ORDER BY clause must be a sub-column of a reference to Platfora's built-in Date dimension dataset.

window_boundary
A window boundary is required when using either ROWS or RANGE. This defines the set of rows, relative to the current row, over which to compute the aggregate expression. The row order is based on the ordering specified in the ORDER BY clause.

A PRECEDING clause defines a lower window boundary (the number of rows to include before the current row). The FOLLOWING clause defines an upper window boundary (the number of rows to include after the current row). The window boundary expression must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING is omitted, the current row is considered the first row in the window. Similarly, if FOLLOWING is omitted, the current row is considered the last row in the window. The UNBOUNDED keyword includes all rows in the direction specified. When you need to specify both a start and end of a window, use the BETWEEN and AND keywords. For example:

ROWS 2 PRECEDING means that the window is three rows in size, starting with two rows preceding until and including the current row.

ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight rows in size, starting with two rows preceding, the current row, and five rows following the current row.

The current row is included in the set of rows by default. You can exclude the current row from the window by specifying a window start and end point before or after the current row. For example:

ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window with all rows that come before the current row, and ends the window one row before the current row, thereby excluding the current row from the window.

Calculate the percentage of flight records in the same departure date period. Note that the departure_date field is a reference to the Date dataset, meaning that the group to which the measure is applied can adapt to any downstream field of departure_date (departure_date.Year, departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for each dimension group in the viz that share the same value for departure_date:

100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will allow you to compare the number of flights for other carriers against the fixed baseline number of flights for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage):

100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA"

Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The input rows depend on the dimensions selected in the viz.

100 * SUM(sales) / ROLLUP SUM(sales) TO ()

Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales totals):

ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED PRECEDING

Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current year from the total:

ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
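The cumulative (ROWS UNBOUNDED PRECEDING) examples above follow the familiar running-total pattern: partition the rows, order within each partition, and accumulate. A minimal Python sketch of that pattern with hypothetical values (an illustration of the window semantics, not Platfora's implementation):

  from itertools import groupby

  rows = [(2013, 5, 300), (2013, 6, 500), (2013, 7, 500)]  # (year, month, sales)
  for year, group in groupby(sorted(rows), key=lambda r: r[0]):
      running = 0                      # one partition per year (the TO clause)
      for _, month, sales in group:    # ordered by month (the ORDER BY clause)
          running += sales             # all preceding rows plus the current row
          print(year, month, running)  # 300, 800, 1300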
DENSE_RANK

DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are not skipped in the event of a tie. DENSE_RANK must be used within a ROLLUP expression.

ROLLUP DENSE_RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
[ ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and subsequent rank positions are not skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Dense_Rank using the following expression:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and no rank positions are skipped:

Quarter  Region  Sales_Dense_Rank
2010 Q1  North   6
2010 Q1  South   4
2010 Q1  East    2
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   3
2010 Q2  East    5
2010 Q2  West    3

Returns a value of type LONG.

ROLLUP
Required. DENSE_RANK must be used within a ROLLUP expression, in place of the aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1:

ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
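The tie-handling rule (tied rows share a rank and no positions are skipped) is the defining difference from RANK, described below. A minimal Python sketch over one already-sorted partition (hypothetical values):

  sales = [400, 400, 300, 250, 250, 150]   # sorted descending
  rank, prev, ranks = 0, None, []
  for v in sales:
      if v != prev:
          rank += 1     # next distinct value gets the next consecutive rank
          prev = v
      ranks.append(rank)
  print(ranks)  # [1, 1, 2, 2, 3, 4]; RANK would give [1, 1, 3, 4, 4, 6]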
For example, suppose you had a dataset with the following rows and columns, and you want to divide the year-to-date sales into four buckets (quartiles), with the highest quartile ranked as 1 and the lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following expression:

ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Name     Gender  Sales YTD  YTD_Sales_Quartile
Chen     F       3,500,000  1
John     M       3,100,000  1
Pete     M       2,900,000  1
Daria    F       2,500,000  2
Jennie   F       2,200,000  2
Mary     F       2,100,000  2
Mike     M       1,900,000  3
Brian    M       1,700,000  3
Molly    F       1,500,000  3
Theresa  F       1,200,000  4
Hans     M       900,000    4
Ben      M       500,000    4

Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the quartiles are then computed per gender. The following example divides the rows into quartiles within each gender, where each gender has six year-to-date sales values. The two extra values (the remainder of 6 / 4) are allocated to buckets 1 and 2, which therefore have one more value than buckets 3 and 4.

Name     Gender  Sales YTD  YTD_Sales_Quartile (partitioned by Gender)
Chen     F       3,500,000  1
Daria    F       2,500,000  1
Jennie   F       2,200,000  2
Mary     F       2,100,000  2
Molly    F       1,500,000  3
Theresa  F       1,200,000  4
John     M       3,100,000  1
Pete     M       2,900,000  1
Mike     M       1,900,000  2
Brian    M       1,700,000  2
Hans     M       900,000    3
Ben      M       500,000    4

Returns a value of type LONG.

ROLLUP
Required. NTILE must be used within a ROLLUP expression, in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

integer
Required. An integer that specifies the number of buckets to divide the partitioned rows into.

Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example, if you wanted to get the percentile of Total Records per City, you might think the expression to use is:

ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s) you use in the viz. To calculate the Total Records percentiles by City, you could instead define a global Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in the viz (or any other dimension, for that matter):

ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
RANK

RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used within a ROLLUP expression.

ROLLUP RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), the tied rows are given the same rank value and the subsequent rank position is skipped.

The TO clause of the ROLLUP specifies the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in each partition are numbered starting at one.

For example, suppose you had a dataset with the following rows and columns, and you want to rank the Quarters and Regions according to the values in the Sales column:

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Rank using the following expression:

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number, and the rank positions 2 and 5 are skipped:

Quarter  Region  Sales_Rank
2010 Q1  North   8
2010 Q1  South   6
2010 Q1  East    3
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   4
2010 Q2  East    7
2010 Q2  West    4

Returns a value of type LONG.

ROLLUP
Required. RANK must be used within a ROLLUP expression, in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1:

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1:

ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
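Ranking measures such as these are commonly used for top-N-per-group analysis. As a sketch (assuming the per-quarter Sales_Rank measure above, and that the viz allows filtering on that measure), a filter expression such as the following would keep only the top three regions in each quarter:

Sales_Rank <= 3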
ROW_NUMBER

ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. The column in the ROLLUP expression's ORDER BY clause determines the order in which the row numbers are assigned.

ROLLUP ROW_NUMBER()
TO ([partitioning_column])
ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

For example, suppose you had a dataset with the following rows and columns:

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could then define a measure called SalesNumber using the following expression:

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data points:

Quarter  Region  SalesNumber
2010 Q1  North   4
2010 Q1  South   3
2010 Q1  East    2
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   2
2010 Q2  East    4
2010 Q2  West    3

Returns a value of type LONG.

ROW_NUMBER takes no input parameters.

Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is given the number 1:

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions (UDFs)

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. User defined functions can only be used to implement new row functions, not aggregate functions. If a computed field that uses a UDF is included in a lens, the UDF is executed once for each row during the lens build process. Keep this in mind when writing UDF Java programs, so you do not write programs that negatively impact lens build resources or execution times.

Writing a Platfora UDF Java Program

User defined functions (UDFs) are written in the Java programming language and implement the Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction. Verify that any JAR file the UDF will use is compatible with the existing libraries Platfora uses. You can find those libraries in $PLATFORA_HOME/lib.

To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6 or 7 installed on the machine where you plan to do your development. You will also need the com.platfora.udf.UserDefinedFunction interface Java code from your Platfora master server installation. In the $PLATFORA_HOME/tools/udf directory of your Platfora master server installation, you will find two files:

• platfora-udf.jar – The compiled code for the com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it in the CLASSPATH) when you compile your UDF Java program.
• /com/platfora/udf/UserDefinedFunction.java – The source code for the Java interface that your UDF classes need to implement. The source code is provided as reference documentation of the Platfora UserDefinedFunction interface. You can refer to this file when writing your UDF Java programs.

1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the machine where you plan to develop and compile your UDF program.

2. Write a Java program that implements the com.platfora.udf.UserDefinedFunction interface. For example, here is a sample Java program that defines a REPEAT_STRING user defined function. This simple function repeats an input string a specified number of times.

import java.util.List;

/**
 * Sample user-defined function implementation that demonstrates
 * how to create a REPEAT_STRING function.
 */
public class RepeatString implements com.platfora.udf.UserDefinedFunction {

  /**
   * Returns the name of the user-defined function.
   * The first character in the name must be a letter,
   * and subsequent characters must be either letters,
   * digits, or underscores.
   * You cannot name your function the same name as an
   * existing Platfora built-in function. Names are
   * case-insensitive.
   */
  @Override
  public String getFunctionName() {
    return "REPEAT_STRING";
  }

  /**
   * Returns one of the following values, reflecting the
   * return type of the user-defined function:
   * DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING.
   */
  @Override
  public String getReturnType() {
    return "STRING";
  }

  /**
   * Returns an array of Strings, one for each of the
   * input arguments to the user-defined function,
   * specifying the required data type for each argument.
   * The Strings should be of the following values:
   * DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING.
   */
  @Override
  public String[] getArgumentTypes() {
    return new String[] { "STRING", "INTEGER" };
  }

  /**
   * Returns a human-readable description of what the function
   * does, to be displayed to Platfora users in the
   * Expression Builder. May return null.
   */
  @Override
  public String getDescription() {
    return "The REPEAT_STRING function returns an input string repeated " +
        "a specified number of times.";
  }

  /**
   * Returns a human-readable description explaining the
   * value that the function returns, to be displayed to
   * Platfora users in the Expression Builder. May return null.
   */
  @Override
  public String getReturnValueDescription() {
    return "Returns one value per row of type STRING";
  }

  /**
   * Returns a human-readable example of the function syntax,
   * to be displayed to Platfora users in the Expression
   * Builder. May return null.
   */
  @Override
  public String getExampleUsage() {
    return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \" World\")";
  }

  /**
   * The compute method performs the actual work of evaluating
   * the user-defined function. The method should operate on the
   * argument values provided to calculate the function return value,
   * and return a Java object of the appropriate type to represent
   * the return value. The following mapping describes the Java
   * object type that is used to represent each Platfora data type:
   * DATETIME -> java.util.Date
   * DOUBLE -> java.lang.Double
   * FIXED -> java.lang.Long
   * INTEGER -> java.lang.Integer
   * LONG -> java.lang.Long
   * STRING -> java.lang.String
   * Note on FIXED type: fixed-precision numbers in Platfora
   * are represented as Longs that have been scaled by a
   * factor of 10,000.
   *
   * In the event that the user-defined function
   * encounters invalid inputs, or the function return value is not
   * defined given the inputs provided, the compute method should return
   * null rather than throwing an exception. The compute method should
   * avoid throwing any exceptions.
   *
   * @param arguments The values of the function inputs.
   *
   * The entries in this list will match the specification
   * provided by the getArgumentTypes method in type, number, and order:
   * for example, if getArgumentTypes returned an array of
   * length 3 with the values STRING, DOUBLE, STRING, then
   * the arguments parameter will be a list of 3 Java
   * objects: a java.lang.String, a java.lang.Double, and a
   * java.lang.String. Any of the values within the
   * arguments List may be null.
   */
  @Override
  public String compute(List arguments) {
    // cast the inputs to the correct types
    final String toRepeat = (String) arguments.get(0);
    final Integer numberOfRepeats = (Integer) arguments.get(1);
    // check for invalid inputs
    if (toRepeat == null || numberOfRepeats == null || numberOfRepeats < 0)
      return null;
    // repeat the input string the specified number of times
    final StringBuilder builder = new StringBuilder();
    for (int i = 0; i < numberOfRepeats; i++) {
      builder.append(toRepeat);
    }
    return builder.toString();
  }
}

3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH). The target Java version must be Java 1.6. Compiling with a target of Java 1.7 will result in an error when the UDF is used. For example, to compile the RepeatString.java program using Java 1.6:

javac -source 1.6 -target 1.6 -cp platfora-udf.jar RepeatString.java

4. Create a Java archive file (.jar) containing your .class file. For example:

jar cf repeat-string-udf.jar RepeatString.class

After you have written and compiled your UDF Java program, you must then install and enable it on the Platfora master server. See Adding a UDF to the Platfora Expression Builder.

Adding a UDF to the Platfora Expression Builder

After you have written and compiled a user defined function (UDF) Java class, you must install your class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression builder. This task is performed on the Platfora master server. Before you begin, you must have written and compiled a Java class for your user defined function. See Writing a Platfora UDF Java Program.

1. Create a directory named extlib in the Platfora data directory on the Platfora master server. For example:

$ mkdir $PLATFORA_DATA_DIR/extlib

2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/extlib directory on the Platfora master server. For example:

$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/

3. Set the Platfora server configuration property platfora.udf.class.names so it contains the name of your UDF Java class. If you have more than one class, separate the class names with a comma. For example, to set this property using the platfora-config command-line utility:

$ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString

4. Restart the Platfora server:

$ platfora-services restart

The user defined function will then be available for defining computed field expressions in the Add Field dialog of the Platfora application. Due to the way some web browsers cache Javascript files, the newly added function may not appear in the Functions list for up to 24 hours. However, the function is immediately available for use and recognized by the Expression autocomplete feature.
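Once enabled, a UDF is called like any built-in row function. For example, a dataset computed field could use the sample REPEAT_STRING function to build a string of repeated markers (a sketch only; priority is a hypothetical INTEGER field and [Product Name] a hypothetical STRING field):

CONCAT(REPEAT_STRING("*", priority), " ", [Product Name])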
Regular Expression Reference

Regular expressions vary in complexity, using a combination of basic constructs to describe a string matching pattern. This reference describes the most common regular expression matching patterns, but is not a comprehensive list.

Regular expressions, also referred to as regex or regexp, are a standardized collection of special characters and constructs used for matching strings of text. They provide a flexible and precise language for matching particular characters, words, or patterns of characters.

Platfora regular expressions are based on the pattern matching syntax of the Java programming language. For more in-depth information on writing valid regular expressions, refer to the Java regular expression pattern documentation.

Platfora makes use of regular expressions in the following contexts:
• In computed field expressions that use the REGEX or REGEX_REPLACE functions.
• In PARTITION expression statements for event series processing computed fields.
• In the Regex file parser in data ingest.
• In the data source location path descriptor in data ingest.
• In lens filter expressions.

Regex Literal and Special Characters

The most basic form of regular expression pattern matching is the match of a literal character or string. Regular expressions also have a number of special characters that affect the way a pattern is matched. This section describes the regular expression syntax for referring to literal characters, special characters, non-printable characters (such as a tab or a newline), and special character escaping.

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match succeeds because the strings are identical.

Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped.

Character Name       Character  Reserved For
opening bracket      [          start of a character class
closing bracket      ]          end of a character class
hyphen               -          character ranges within a character class
backslash            \          general escape character
caret                ^          beginning of string, negation of a character class
dollar sign          $          end of string
period               .          matching any single character
pipe                 |          alternation (OR) operator
question mark        ?          optional quantifier, quantifier minimizer
asterisk             *          zero or more quantifier
plus sign            +          once or more quantifier
opening parenthesis  (          start of a subexpression group
closing parenthesis  )          end of a subexpression group
opening brace        {          start of min/max quantifier
closing brace        }          end of min/max quantifier

There are two ways to force a special character to be treated as an ordinary character:
• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a literal character instead of a quantifier, use \*.
• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything between \Q and \E is then treated as literal characters.

Additionally, to escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For example, to extract the inches portion from a height field where example values are 6'2" and 5'11":

REGEX(height, "\'(\d+)""$")

You can use special character sequence constructs to specify non-printable characters in a regular expression. Some of the most commonly used constructs are:

Construct  Matches
\n         newline character
\r         carriage return character
\t         tab character
\f         form feed character
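For example, the \Q and \E construct can be used to match a literal dollar sign without escaping each character individually. The following sketch (illustrative only; it assumes a STRING field named price with values such as $19.99) extracts the numeric portion:

REGEX(price, "\Q$\E([0-9.]+)")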
Regex Character Classes

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

A character class matches only a single character. For example, gr[ae]y will match gray or grey, but not graay or graey. The order of the characters inside the brackets does not matter.

You can use a hyphen inside a character class to specify a range of characters. For example, [a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter X. Again, the order of the characters and the ranges does not matter.

A caret following an opening bracket specifies characters to exclude from a match. For example, [^abc] will match any character except a, b, or c.

Construct     Type          Description
[abc]         simple        matches a or b or c
[^abc]        negation      matches any character except a, b, or c
[a-zA-Z]      range         matches a through z, or A through Z (inclusive)
[a-d[m-p]]    union         matches a through d, or m through p
[a-z&&[def]]  intersection  matches d, e, or f
[a-z&&[^xq]]  subtraction   matches a through z, except for x and q

Predefined character classes offer convenient shorthands for commonly used regular expressions.

.
  matches any single character (except newline).
  Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"

\d
  matches any digit character (equivalent to [0-9]).
  Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"

\D
  matches any non-digit character (equivalent to [^0-9]).
  Example: \D matches "S" in "900S" and "Q" in "Q45"

\s
  matches any single white-space character (equivalent to [ \t\n\x0B\f\r]).
  Example: \sbook matches "book" in "blue book" but nothing in "notebook"

\S
  matches any single non-white-space character.
  Example: \Sbook matches "book" in "notebook" but nothing in "blue book"

\w
  matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]).
  Example: r\w* matches "rm" and "root"

\W
  matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]).
  Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and predefined character classes, except they take into account the locale (the local language/coding system).

\p{Lower}   a lower-case alphabetic character, [a-z]
\p{Upper}   an upper-case alphabetic character, [A-Z]
\p{ASCII}   an ASCII character, [\x00-\x7F]
\p{Alpha}   an alphabetic character, [a-zA-Z]
\p{Digit}   a decimal digit, [0-9]
\p{Alnum}   an alphanumeric character, [a-zA-Z0-9]
\p{Punct}   a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}   a visible character, [\p{Alnum}\p{Punct}]
\p{Print}   a printable character, [\p{Graph}\x20]
\p{Blank}   a space or tab, [ \t]
\p{Cntrl}   a control character, [\x00-\x1F\x7F]
\p{XDigit}  a hexadecimal digit, [0-9a-fA-F]
\p{Space}   a whitespace character, [ \t\n\x0B\f\r]
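For example, character classes can be combined with quantifiers (see Regex Quantifiers below) in a Platfora REGEX expression. The following sketch (illustrative only; it assumes a phone_number field formatted as xxx.xxx.xxxx) extracts the three-digit area code:

REGEX(phone_number, "^([0-9]{3})\..*")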
^
  matches from the beginning of a line (multi-line matches are currently not supported).
  Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"

$
  matches from the end of a line (multi-line matches are currently not supported).
  Example: d$ will match the "d" in "maid" but not in "made"

\b
  matches within a word boundary.
  Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".

\B
  matches within a non-word boundary.
  Example: \Bb matches "b" in "sbin" but not in "bash"

Regex Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

By default, quantifiers are greedy. A greedy quantifier first tries for a match with the entire input string. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine backs off one character at a time until a match is found. So a greedy quantifier checks for possible matches in order from the longest possible input string to the shortest possible input string, recursively trying from right to left.

Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier first tries for a match from the beginning of the input string, starting with the shortest possible piece of the string that matches the regex construct. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks for possible matches in order from the shortest possible input string to the longest possible input string, recursively trying from left to right.

Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy quantifier on the first attempt (it tries for a match with the entire input string). The difference is that unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found. If the initial match fails, the possessive quantifier reports a failed match. It does not make any more attempts.

? (reluctant ??, possessive ?+)
  matches the previous character or construct once or not at all.
  Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"

* (reluctant *?, possessive *+)
  matches the previous character or construct zero or more times.
  Example: if* matches "if", "iff" in "diff", or "i" in "print"

+ (reluctant +?, possessive ++)
  matches the previous character or construct one or more times.
  Example: if+ matches "if", "iff" in "diff", but nothing in "print"

{n} (reluctant {n}?, possessive {n}+)
  matches the previous character or construct exactly n times.
  Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"

{n,} (reluctant {n,}?, possessive {n,}+)
  matches the previous character or construct at least n times.
  Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"

{n,m} (reluctant {n,m}?, possessive {n,m}+)
  matches the previous character or construct at least n times, but no more than m times.
  Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
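The greedy/reluctant distinction matters when extracting values with the REGEX function, which returns the first capturing group (see Regex Capturing Groups below). As a sketch (illustrative only; it assumes a STRING field named code with values such as abc123):

REGEX(code, "(.*)[0-9]+") captures abc12 (the greedy quantifier leaves only one digit for [0-9]+), while
REGEX(code, "(.*?)[0-9]+") captures abc (the reluctant quantifier stops at the first digit)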
Regex Capturing Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing part of a regular expression inside parentheses, you group that part of the regular expression together. This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping part of a regular expression together, parentheses also create a capturing group. Capturing groups are used to determine which matching values to save or return from your regular expression.

A regular expression can have more than one group, and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern (a(b*))+(c) contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

By default, a group captures the text that produces a match: the portion of the string matched by the grouped subexpression is captured in memory for later retrieval or use as a backreference.

Capturing Groups and the Regex Line Parser

When you choose the Regex line parser during the Parse Data phase of the data ingest process, Platfora uses capturing groups to determine what parts of the regular expression to return as columns. The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value. For example, suppose you had user records in a file, and the lines were formatted like this:

Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Comment: Suspended

You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and Comment column values:

Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?

Capturing Groups and the REGEX Function

The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the value of the first capturing group is returned. For example, if you wanted to match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Capturing Groups and the REGEX_REPLACE Function

The REGEX_REPLACE function is used to match a string value, and replace matched strings with another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex, and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences), but do not control what portions of the match are returned (the entire match is always returned).

Backreferences allow you to capture and reuse a subexpression match inside the same regular expression. You can reuse a capturing group as a backreference by referring to its group number preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2, and so on).
For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the opening tag into a backreference, and then reuse it to match the corresponding closing tag:

(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>)

This regular expression contains two capturing groups: the outermost capturing group (which captures the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference number two. This backreference can then be reused with \2 (backslash two) to match the corresponding closing HTML tag.

When referring to capturing groups of the previous regular expression, the backreference syntax is slightly different: the backreference group number is preceded by a dollar sign instead of a backslash (for example, $1 refers to capturing group 1 of the previous expression). An example of this is the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and one for the replacement string. The following example matches the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx. Notice the backreferences in the replacement expression; they refer to the capturing groups of the previous matching expression:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.

Appendix A: Platfora Expression Language Reference

An expression computes or produces a value by combining field or column values, constant values, operators, and functions. Platfora has a built-in expression language. You use the language's functions and operators in dataset computed fields, vizboard computed fields, lens filters, and programmatic lens queries.

Topics:
• Expression Quick Reference
• Comparison Operators
• Logical Operators
• Arithmetic Operators
• Conditional and NULL Processing
• Event Series Processing
• String Functions
• URL Functions
• IP Address Functions
• Date and Time Functions
• Math Functions
• Data Type Conversion Functions
• Aggregate Functions
• ROLLUP and Window Functions
• User Defined Functions (UDFs)
• Regular Expression Reference

Expression Quick Reference

An expression is a combination of columns (or fields), constant values, operators, and functions used to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex expressions. This quick reference describes the functions and operators that can be used to write expressions.

Platfora's built-in statements, functions, and operators are divided into the following categories:
• Conditional and NULL Processing
• Event Series Processing
• String Processing
• Date and Time Processing
• URL Processing
• IP Address Processing
• Mathematical Processing
• Data Type Conversion
• Aggregation and Measure Processing
• ROLLUP and Window Calculations
• User Defined Functions
• Comparison Operators
• Logical Operators
• Arithmetic Operators
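For example, functions from several of these categories are often nested in a single expression. The following sketch (illustrative only; it assumes a STRING field order_date in MM/dd/yyyy format) combines date, conversion, and string processing to label each row with the age of its order in days:

CONCAT("Order age (days): ", TO_STRING(DAYS_BETWEEN(NOW(), TO_DATE(order_date,"MM/dd/yyyy"))))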
Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE
  evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.
  Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

COALESCE
  returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
  Example: COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID
  returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
  Example: IS_VALID(sale_amount)

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard.

PACK_VALUES
  returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators - useful when the OUTPUT clause of a PARTITION expression returns multiple output values.
  Example: PACK_VALUES("ID",custid,"Age",age)

PARTITION
  partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows.
  Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

ARRAY_CONTAINS
  performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.
  Example: ARRAY_CONTAINS(device,",","iPad")

CONCAT
  concatenates (combines together) the results of multiple string expressions.
  Example: CONCAT(month,"/",day,"/",year)

FILE_NAME
  returns the original file name from the source file system.
  Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

FILE_PATH
  returns the full URI path from the source file system.
  Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

EXTRACT_COOKIE
  extracts the value of the given cookie identifier from a semicolon-delimited list of cookie key=value pairs.
  Example: EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

EXTRACT_VALUE
  extracts the value for the given key from a string containing delimited key/value pairs.
  Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

INSTR
  returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring.
  Example: INSTR(url,"http://",-1,1)

JAVA_STRING
  returns the unescaped version of a Java unicode character escape sequence as a string value.
  Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS
  concatenates (combines together) the results of multiple string expressions with the separator in between each non-null value.
  Example: JOIN_STRINGS("/",month,day,year)

JSON_ARRAY_CONTAINS
  performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value.
  Example: JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE
  extracts a DOUBLE value from a field in a JSON object.
  Example: JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED
  extracts a FIXED value from a field in a JSON object.
  Example: JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER
  extracts an INTEGER value from a field in a JSON object.
  Example: JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG
  extracts a LONG value from a field in a JSON object.
  Example: JSON_LONG(top_scores,"test_scores.2")

JSON_STRING
  extracts a STRING value from a field in a JSON object.
  Example: JSON_STRING(misc,"hobbies.0")

LENGTH
  returns the count of characters in a string value.
  Example: LENGTH(name)

REGEX
  performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.
  Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")

REGEX_REPLACE
  evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.
  Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

SPLIT
  breaks down a delimited input string into sections and returns the specified section of the string.
  Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

SUBSTRING
  returns the specified characters of a string value based on the given start and end position.
  Example: SUBSTRING(name,0,1)

TO_LOWER
  converts all alphabetic characters in a string to lower case.
  Example: TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER
  converts all alphabetic characters in a string to upper case.
  Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM
  removes leading and trailing spaces from a string value.
  Example: TRIM(area_code)

XPATH_STRING
  takes an XML-formatted string and returns the first string matching the given XPath expression.
  Example: XPATH_STRING(address,"//address[@type='home']/zipcode")

XPATH_STRINGS
  takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.
  Example: XPATH_STRINGS(address,"/list/address[1]/street")

XPATH_XML
  takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.
  Example: XPATH_XML(address,"//address[last()]")

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.
DAYS_BETWEEN
  calculates the whole number of days (ignoring time) between two DATETIME values.
  Example: DAYS_BETWEEN(ship_date,order_date)

DATE_ADD
  adds the specified time interval to a DATETIME value.
  Example: DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN
  calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values.
  Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT
  returns the specified portion of a DATETIME value.
  Example: EXTRACT("hour",order_date)

MILLISECONDS_BETWEEN
  calculates the whole number of milliseconds between two DATETIME values.
  Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN
  calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values.
  Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

NOW
  returns the current system date and time as a DATETIME value.
  Example: YEAR_DIFF(NOW(),users.birthdate)

SECONDS_BETWEEN
  calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values.
  Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

TRUNC
  truncates a DATETIME value to the specified format.
  Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF
  calculates the fractional number of years between two DATETIME values.
  Example: YEAR_DIFF(NOW(),users.birthdate)

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY
  returns the authority portion of a URL string.
  Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT
  returns the fragment portion of a URL string.
  Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

URL_HOST
  returns the host, domain, or IP address portion of a URL string.
  Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH
  returns the path portion of a URL string.
  Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT
  returns the port portion of a URL string.
  Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL
  returns the protocol (or URI scheme name) portion of a URL string.
  Example: URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY
  returns the query portion of a URL string.
  Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE
  decodes a string that has been encoded with the application/x-www-form-urlencoded media type.
  Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not applicable"

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH
  compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.
  Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

HEX_TO_IP
  converts a hexadecimal-encoded STRING to a text representation of an IP address.
  Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1

Math Functions

Math functions allow you to perform basic math calculations on numeric values.
You can also use the arithmetic operators to perform simple math calculations, such as addition, subtraction, division, and multiplication.

DIV
  divides two LONG values and returns a quotient value of type LONG.
  Example: DIV(TO_LONG(file_size),1024)

EXP
  raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.
  Example: EXP(Value)

FLOOR
  returns the largest integer that is less than or equal to the input argument.
  Example: FLOOR(32.6789) returns 32.0

HASH
  evenly partitions data values into the specified number of buckets.
  Example: HASH(username,20)

LN
  returns the natural logarithm of a number.
  Example: LN(2.718281828) returns 1

MOD
  divides two LONG values and returns the remainder value of type LONG.
  Example: MOD(TO_LONG(file_size),1024)

POW
  raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.
  Example: 100 * POW(end_value/start_value, 0.2) - 1

ROUND
  rounds a DOUBLE value to the specified number of decimal places.
  Example: ROUND(32.4678954,2) returns 32.47

Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE
  converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.
  Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z

TO_FIXED
  converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values.
  Example: TO_FIXED(opening_price)

TO_DATE
  converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.
  Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

TO_DOUBLE
  converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.
  Example: TO_DOUBLE(average_rating)

TO_INT
  converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values.
  Example: TO_INT(average_rating)

TO_LONG
  converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values.
  Example: TO_LONG(average_rating)

TO_STRING
  converts values of other data types to STRING (character) values.
  Example: TO_STRING(sku_number)

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.
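Aggregate expressions can also be combined arithmetically to define ratio measures. For example (a sketch, assuming numeric fields shipping_cost and sale_amount exist in the dataset), the following measure expression computes shipping cost as a percentage of sales:

100 * SUM(shipping_cost) / SUM(sale_amount)

The individual aggregate functions are summarized below.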
AVG
  returns the average of all valid numeric values.
  Example: AVG(sale_amount)

COUNT
  returns the number of rows in a dataset.
  Example: COUNT(sales.customers)

COUNT_VALID
  returns the number of rows for which the given expression is valid.
  Example: COUNT_VALID(page_views)

DISTINCT
  returns the number of distinct values for the given expression.
  Example: DISTINCT(user_id)

MAX
  returns the biggest value from the given input expression.
  Example: MAX(sale_amount)

MIN
  returns the smallest value from the given input expression.
  Example: MIN(sale_amount)

SUM
  returns the total of all values from the given input expression.
  Example: SUM(sale_amount)

STDDEV
  calculates the population standard deviation for a group of numeric values.
  Example: STDDEV(sale_amount)

VARIANCE
  calculates the population variance for a group of numeric values.
  Example: VARIANCE(sale_amount)

ROLLUP and Window Functions

ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate. Window functions (RANK, DENSE_RANK, and NTILE) can only be used within a ROLLUP statement. The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied.

ROLLUP defines a window, or user-specified set of rows, within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are using in the vizboard.

DENSE_RANK
  assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie.
  Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE
  divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs.
  Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK
  assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie.
  Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROLLUP
  a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function.
  Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

ROW_NUMBER
  assigns a unique, sequential number to each row in a group (partition) of rows, starting at 1 for the first row in each partition.
  Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. See User Defined Functions (UDFs) for more information.

Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filters.
Operator        Meaning                   Example Expression
= or ==         Equal to                  order_date = "12/22/2011"
>               Greater than              age > 18
!>              Not greater than          age !> 8
<               Less than                 age < 30
!<              Not less than             age !< 12
>=              Greater than or equal to  age >= 20
<=              Less than or equal to     age <= 29
<> or != or ^=  Not equal to              age <> 30

Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND
  Test whether two conditions are true.

OR
  Test if either of two conditions is true.

BETWEEN min_value AND max_value
  Test whether a date or numeric value is within the min and max values (inclusive).
  Example: year BETWEEN 2000 AND 2012

IN(list)
  Test whether a value is within a set.
  Example: product_type IN("tablet","phone","laptop")

LIKE("pattern")
  Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
  Example: last_name LIKE("?utch*") matches Kutcher and hutch, but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora

value IS NULL
  Check whether a field value or expression is null (empty).
  Example: ship_date IS NULL evaluates to true when the ship_date field is empty

NOT
  Reverses the value of other operators.
  Examples:
  • year NOT BETWEEN 2000 AND 2012
  • first_name NOT LIKE("Jo?n*") excludes John and jonny, but not Jon or Joann
  • Date.Weekday NOT IN("Saturday","Sunday")
  • purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

Operator  Description     Example
+         Addition        amount + 10 (add 10 to the value of the amount field)
-         Subtraction     amount - 10 (subtract 10 from the value of the amount field)
*         Multiplication  amount * 100 (multiply the value of the amount field by 100)
/         Division        bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)

Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.

Operator Definitions

Operator        Meaning                   Example Expression
= or ==         Equal to                  order_date = "12/22/2011"
>               Greater than              age > 18
!>              Not greater than          age !> 8
<               Less than                 age < 30
!<              Not less than             age !< 12
>=              Greater than or equal to  age >= 20
<=              Less than or equal to     age <= 29
<> or != or ^=  Not equal to              age <> 30

If you are writing queries with REST and the query string includes an = (equal) character, you must URL encode it as %3D. Failure to encode the character can result in this error: string matching regex `(?i)\Qnot\E\b' expected but end of source found.
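For example, comparison operators are typically combined with the logical operators (described next) in a lens filter expression. A sketch, assuming age and region fields exist in the dataset:

age >= 21 AND region <> "EU"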
Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND
  Test whether two conditions are true.

OR
  Test if either of two conditions is true.

BETWEEN min_value AND max_value
  Test whether a date or numeric value is within the min and max values (inclusive).
  Example: year BETWEEN 2000 AND 2012

IN(list)
  Test whether a value is within a set.
  Example: product_type IN("tablet","phone","laptop")

LIKE("pattern")
  Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
  Example: last_name LIKE("?utch*") matches Kutcher and hutch, but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora

value IS NULL
  Check whether a field value or expression is null (empty).
  Example: ship_date IS NULL evaluates to true when the ship_date field is empty

NOT
  Reverses the value of other operators.
  Examples:
  • year NOT BETWEEN 2000 AND 2012
  • first_name NOT LIKE("Jo?n*") excludes John and jonny, but not Jon or Joann
  • Date.Weekday NOT IN("Saturday","Sunday")
  • purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

Operator  Description     Example
+         Addition        amount + 10 (add 10 to the value of the amount field)
-         Subtraction     amount - 10 (subtract 10 from the value of the amount field)
*         Multiplication  amount * 100 (multiply the value of the amount field by 100)
/         Division        bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE

CASE is a row function that evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.

Syntax

CASE WHEN input_condition [AND|OR input_condition] THEN output_expression [...] [ELSE other_output_expression] END

Return Value

Returns one value per row of the same type as the output expression. All output expressions must return the same data type. If there are multiple output expressions that return different data types, then you will need to enclose your entire CASE expression in one of the data type conversion functions to explicitly cast all output values to a particular data type.
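For example, the following sketch (assuming an INTEGER age field) mixes INTEGER and STRING outputs, so the whole CASE expression is wrapped in TO_STRING to cast every output value to STRING:

TO_STRING(CASE WHEN age >= 18 THEN age ELSE "minor" END)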
Input Parameters
WHEN input_condition
Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's supported conditional operators). If an input value meets the condition, then the output expression is applied. Input conditions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple input conditions.
THEN output_expression
Required. The THEN keyword is used to specify an output expression when the specified conditions are met. Output expressions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions.
ELSE other_output_expression
Optional. The ELSE keyword can be used to specify an alternate output expression to use when the specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.
END
Required. Denotes the end of CASE function processing.
Examples
Convert values in the age column into range-based groupings (binning):
CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END
Transform values in the gender column from one string to another:
CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END
The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and motorcycle. The following example converts multiple values in the vehicle column into a single value:
CASE WHEN vehicle IN("bike","scooter","motorcycle") THEN "two-wheelers" ELSE "other" END

COALESCE
COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
Syntax
COALESCE(expression[,expression][,...])
Return Value
Returns one value per row of the same type as the first valid input expression.
Input Parameters
expression
At least one required. A field name or expression.
Examples
The following example shows an expression to calculate employee yearly income for exempt employees that have a salary and non-exempt employees that have an hourly_wage. This expression checks the values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL):
COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID
IS_VALID is a row function that returns 0 if the value of an expression is NULL, and 1 if the value is NOT NULL. This is useful for computing other calculations where you want to exclude NULL values (such as when computing averages).
Syntax
IS_VALID(expression)
Return Value
Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
Input Parameters
expression
Required. A field name or expression.
Examples
Define a computed field using IS_VALID. This returns a row count only for the rows where this field value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed field (sale_amount_not_null) using the sale_amount field as the basis:
IS_VALID(sale_amount)
Then you can use the sale_amount_not_null computed field to calculate an accurate average for sale_amount that excludes NULL values:
SUM(sale_amount)/SUM(sale_amount_not_null)
This is what happens automatically when you use the AVG function.
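Returning to COALESCE for a moment: because it returns the first NOT NULL value, it can also supply a constant default in place of NULL values. A minimal sketch, assuming an optional discount field (hypothetical name):

COALESCE(discount, 0)

This returns the discount value when one is present, and 0 for rows where discount is NULL.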
Event Series Processing
Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard or a lens query.

PARTITION
PARTITION is an event series processing function that partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. The PARTITION function can only be used to define a computed field in the dataset definition (pre-lens build). PARTITION cannot be used to define a vizboard computed field. Unlike other expressions, a PARTITION expression cannot be embedded within other functions or expressions; it must be a top-level expression.
Syntax
PARTITION BY field_name
ORDER BY field_name [ASC|DESC]
PATTERN (pattern_expression)
DEFINE symbol_1 AS filter_expression
       [,symbol_n AS filter_expression] [, ...]
OUTPUT output_expression
Description
To understand how event series processing works, we'll walk through a simple example of a PARTITION expression using some weblog page view data. Each row represents a page view by a user at a given point in time, and session IDs are used to group together page views that happened in the same user session. Suppose you wanted to know how many sessions included the path of page visits to 'home.html' then 'products.html' then 'checkout.html'. You could define a PARTITION expression that groups the rows by session, orders by time, and then iterates through the rows from top to bottom to find sessions that match the pattern:
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
       B AS Page = "products.html",
       C AS Page = "checkout.html"
OUTPUT "TRUE"
1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.
2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).
3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a symbol that is then used in the PATTERN clause.
4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B then C.
5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied. Otherwise the output is NULL for rows that don't meet all of the PATTERN criteria.
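To make the walkthrough concrete, consider a hypothetical partition for one session, ordered by timestamp (illustrative rows only, not from the source data):

SessionID  Timestamp            Page
06D8       2012-03-13 12:50:01  home.html
06D8       2012-03-13 12:50:32  products.html
06D8       2012-03-13 12:51:10  checkout.html

These three consecutive rows satisfy A, then B, then C, so the pattern matches and the OUTPUT value "TRUE" is applied. A session that went from home.html straight to checkout.html would produce no match, and the output would be NULL.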
Return Value
Returns one value per row of the same type as the output_expression for rows that match the defined match pattern, otherwise returns NULL for rows that do not match the pattern. Output values are calculated during the lens build process using a special event series MapReduce job. Therefore, sample output values for a PARTITION computed field cannot be shown in the dataset workspace.
Input Parameters
PARTITION BY field_name
Required. The PARTITION BY clause is used to specify a field in the current dataset by which to partition the rows. Rows that share the same value for this field will be grouped together, and each group will then be processed independently according to the matching pattern criteria. The partition field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.
ORDER BY field_name
Optional. The ORDER BY clause specifies a field by which to sort the rows within each partition before applying the match pattern criteria. For event series processing, records are typically ordered by a DATETIME type field, such as a date or a timestamp. The default sort order is ascending (first to last or low to high). The ordering field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.
PATTERN (pattern_expression)
Required. The PATTERN clause specifies the matching pattern to search for within a partition of rows. The pattern_expression is expressed in a format similar to a regular expression. The pattern_expression can include:
• A symbol that represents some match criteria (as declared in the DEFINE clause).
• A symbol followed by one of the following regex quantifiers:
  ? (matches once or not at all - greedy construct)
  ?? (matches once or not at all - reluctant construct)
  * (matches zero or more times - greedy construct)
  *? (matches zero or more times - reluctant construct)
  + (matches one or more times - greedy construct)
  +? (matches one or more times - reluctant construct)
  ** (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match need not begin or end with the quantified symbol)
  *+ (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  ++ (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  +* (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match need not end with the quantified symbol)
• A symbol or pattern of symbols anchored by the regex special character for the beginning of string: ^ (marks the beginning of the set of rows that match to the pattern).
• patternA|patternB - The alternation operator (pipe symbol) between two symbols or patterns signifies an OR match.
• patternA,patternB - The concatenation operator (comma) between two symbols or patterns signifies a match when pattern B immediately follows pattern A.
• patternA->patternB - The follows operator (minus and greater-than sign) between two symbols or patterns signifies a match when pattern B eventually follows pattern A.
• (pattern_expression) - By default, pattern expressions are matched from left to right. If parentheses are used to group sub-expressions, the sub-expression within the parentheses is evaluated first. You cannot use quantifiers outside of parentheses; for example, you cannot write ((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.
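Combining these constructs, a small sketch (assuming symbols A, B, and C are declared in a DEFINE clause):

PATTERN (^A, B*?, C)

This matches a set of rows that starts the partition with a row satisfying A, followed by zero or more rows reluctantly matched to B, followed by a row satisfying C. Similarly, PATTERN (A,(B|C)) matches a row satisfying A immediately followed by a row satisfying either B or C.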
DEFINE symbol AS filter_expression
Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause (or in the filter_expression of a subsequent symbol definition). A symbol is a name used to refer to some pattern matching criteria. This can be any name or token that follows Platfora's object naming rules. For example, if the name contains spaces, special characters, keywords, or starts with a number, you must enclose the name in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a piece of pattern matching logic in your expression.
The filter_expression is a Boolean (true or false) expression that operates on each row of the partition. A filter_expression can contain:
• The special expression TRUE or 1, meaning allow the match to occur for any row in the partition.
• Any field_name in the current dataset.
• symbol.field_name - A field from the dataset qualified by the name of a symbol that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN clause. For example:
PATTERN (A, B)
DEFINE A AS TRUE,
       B AS product = A.product
This means that the expression for symbol B will match to a row if the product field for that row is also equal to the product field for the row that is bound to symbol A.
• Any of the comparison operators, such as greater than, less than, equals, and so on.
• The keywords AND or OR (for combining multiple criteria in a single filter expression).
• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN clause (*, *?, +, or +?). This returns the field value for the first or last row when the pattern matches to a set of rows. For example:
PATTERN (A+)
DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0
The pattern A+ will match to a series of consecutive rows that all have the same value for the product field as the first row in the sequence. If the current row happens to be the first row in the sequence, then it will also be included in the match. A FIRST or LAST expression evaluates to NULL if it refers to a symbol that ends up matching an empty sequence. Make sure your expression handles the row at the beginning or end of a sequence if you want that row to match as well.
• Any computed expression that operates on the fields or expressions listed above and/or on literal values.
OUTPUT output_expression
Required. An expression that specifies what the output value should be. The output expression can refer to:
• The field declared in the PARTITION BY clause.
• symbol.field_name - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, and (2) is not followed by a repetition quantifier in the PATTERN clause. This will output the matching field value.
• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output the sequence number of the row that matched the symbol pattern.
• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output an aggregated value for a set of rows that matched the symbol pattern.
• Since you can only output a single column value, you can use the PACK_VALUES function to output multiple results in a single column as key/value pairs.
Examples
'Session Start Time' Expression
Calculate a user session by partitioning by user and ordering by time. The matching logic represented by symbol A checks if the time of the current row is less than 30 minutes from the preceding row. If it is, then it is considered part of the same session as the previous row. Otherwise, the current row is considered the start of a new session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the time of the first row in a session.
PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0 OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)
'Click Number in Session' Expression
Calculate where a click happened in a session by partitioning by session and ordering by time. The matching logic represented by symbol A simply matches to any row in the session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the count of the row within the partition (based on its order or position in the partition).
PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)
'Path to Page' Expression
This is a complicated expression that looks back from the current row's position to determine the previous 4 pages viewed in a session. Since a PARTITION expression can only output one column value as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions 1, 2, 3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create individual columns for each prior page view in the path.
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??, Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
       Page4Back AS TRUE,
       Page3Back AS TRUE,
       Page2Back AS TRUE,
       Page1Back AS TRUE,
       CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page, "Back2",Page2Back.Page, "Back1",Page1Back.Page)
'Page -1 Back' Expression
Use the output from the 'Path to Page' expression and extract the last page viewed before the current page.
EXTRACT_VALUE([Path to Page],"Back1")
PACK_VALUES
PACK_VALUES is a row function that returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT clause of a PARTITION expression returns multiple output values. The string returned is in a format that can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator values that EXTRACT_VALUE uses (the Unicode escape sequences u0003 and u0002, respectively).
Syntax
PACK_VALUES(key_string,value_expression[,key_string,value_expression][,...])
Return Value
Returns one value per row of type STRING. If the value for either key_string or value_expression of a pair is null or contains either of the two separators, the full key/value pair is omitted from the return value.
Input Parameters
key_string
At least one required. A field name of any type, a literal string or number, or an expression that returns any value.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns any value. The expression must include one value_expression instance for each key_string instance.
Examples
Combine the values of the custid and age fields into a single string field:
PACK_VALUES("ID",custid,"Age",age)
The same expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is 5555 and the value of the age field is 29.
The following expression returns Age\u000329 when the value of the age field is 29:
PACK_VALUES("ID",NULL,"Age",age)
The following expression returns 29 as a STRING value when the age field is an INTEGER and its value is 29:
EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")
You might want to use the PACK_VALUES function to combine multiple field values into a single value in the OUTPUT clause of the PARTITION (event series processing) function. Then you can use the EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned by the PARTITION function. For example, the PARTITION function below creates a set of rows that defines the previous web pages accessed in a particular user session:
PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true, B AS true, C AS true, D AS true, E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)
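Assuming the PARTITION expression above is saved as a computed field named [Recent Pages] (a hypothetical name), each packed value can then be unpacked into its own column with a separate computed field:

EXTRACT_VALUE([Recent Pages],"A")

This returns the page that was bound to symbol A for rows where the pattern matched.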
String Functions
String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

CONCAT
CONCAT is a row function that returns a string by concatenating (combining together) the results of multiple string expressions.
Syntax
CONCAT(value_expression[,value_expression][,...])
Return Value
Returns one value per row of type STRING.
Input Parameters
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns any value.
Examples
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY:
CONCAT(month,"/",day,"/",year)

ARRAY_CONTAINS
ARRAY_CONTAINS is a row function that performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.
Syntax
ARRAY_CONTAINS(array_string,"delimiter","search_string")
Return Value
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.
Input Parameters
array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid array.
delimiter
Required. The delimiter used between values in the array string. This can be a name of a field or expression of type STRING.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.
Examples
If you had a device field that contained a comma-delimited list formatted like this:
Safari,iPad
You could determine whether or not the device used was an iPad using the following expression:
ARRAY_CONTAINS(device,",","iPad")
The following expressions return 1:
ARRAY_CONTAINS("platfora","|","platfora")
ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")
The following expressions return 0:
ARRAY_CONTAINS("platfora","|","plat")
ARRAY_CONTAINS("platfora,hadoop","|","platfora")

FILE_NAME
FILE_NAME is a row function that returns the original file name from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the file names themselves (such as dates or server names). You can use FILE_NAME in combination with other string processing functions to extract useful information from the file name.
Syntax
FILE_NAME()
Return Value
Returns one value per row of type STRING.
Examples
Your dataset is based on daily log files that use an 8 character date as part of the file name. For example, 20120704.log is the file name used for the log file created on July 4, 2012. The following expression uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8 characters of the file name:
TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")
Your dataset is based on log files that use the server IP address as part of the file name. For example, 172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses FILE_NAME in combination with REGEX to extract the IP address from the file name:
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

FILE_PATH
FILE_PATH is a row function that returns the full URI path from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory names or file names themselves (such as dates or server names). You can use FILE_PATH in combination with other string processing functions to extract useful information from the file path.
Syntax
FILE_PATH()
Return Value
Returns one value per row of type STRING.
Examples
Your dataset is based on daily log files that are organized into directories by date on the source file system, and the file names are the server IP address of the server that produced the log file. For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log. The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date field from the date directory name:
TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")
And the following expression uses FILE_NAME and REGEX to extract the server IP address from the file name:
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

EXTRACT_COOKIE
EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a semicolon-delimited list of cookie key=value pairs. This function can be used to extract a particular cookie value from a combined web access log Cookie column.
Syntax
EXTRACT_COOKIE("cookie_list_string",cookie_key_string)
Return Value
Returns the value of the specified cookie key as type STRING.
Input Parameters
cookie_list_string
Required. A field or literal string that has a semicolon-delimited list of cookie key=value pairs.
cookie_key_string
Required. The cookie key name for which to extract the cookie value.
Examples
Extract the value of the vID cookie from a literal cookie string:
EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44
Extract the value of the vID cookie from a field named Cookie:
EXTRACT_COOKIE(Cookie,"vID")

EXTRACT_VALUE
EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing delimited key/value pairs.
Syntax
EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])
Return Value
Returns the value of the specified key as type STRING.
Input Parameters
string
Required. A field or literal string that contains a delimited list of key/value pairs.
key_name
Required. The key name for which to extract the value.
delimiter
Optional. The delimiter used between the key and the value. If not specified, the value u0003 is used. This is the Unicode escape sequence for the end-of-text control character (which is the default delimiter used by Hive).
pair_delimiter
Optional. The delimiter used between key/value pairs when the input string contains more than one key/value pair. If not specified, the value u0002 is used. This is the Unicode escape sequence for the start-of-text control character (which is the default delimiter used by Hive).
Examples
Extract the value of the lastname key from a literal string of key/value pairs:
EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch
Extract the value of the email key from a string field named contact_info that contains strings in the format of key:value,key:value:
EXTRACT_VALUE(contact_info,"email",":",",")

INSTR
INSTR is a row function that returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.
Syntax
INSTR(string,substring,position,occurrence)
Return Value
Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).
Input Parameters
string
Required. The name of a field or expression of type STRING (or a literal string).
substring
Required. A literal string or name of a field that specifies the substring to search for in string.
position
Optional. An integer that specifies at which character in string to start searching for substring. A value of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from the beginning of string, and use a negative integer to start searching from the end of string. When no position is specified, INSTR searches from the beginning of the string (0).
occurrence
Optional. A positive integer that specifies which occurrence of substring to search for. When no occurrence is specified, INSTR searches for the first occurrence of the substring (1).
Examples
Return the position of the first occurrence of the substring "http://" starting at the end of the url field:
INSTR(url,"http://",-1,1)
The following expression searches for the second occurrence of the substring "st" starting at the beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in the string, so it returns 6:
INSTR("bestteststring","st",0,2)
The following expression searches backward for the second occurrence of the substring "st" starting at 7 characters before the end of the string "bestteststring". INSTR finds that the substring starts at the third character in the string, so it returns 2:
INSTR("bestteststring","st",-7,2)
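As one more worked check of the zero-based counting (literal input, not from the original manual):

INSTR("platfora","fora",0,1) returns 4

The substring "fora" begins at the fifth character, and the first character is position 0.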
JAVA_STRING
JAVA_STRING is a row function that returns the unescaped version of a Java unicode character escape sequence as a string value. This is useful when you want to specify unicode characters in an expression. For example, you can use JAVA_STRING to specify the unicode value representing a control character.
Syntax
JAVA_STRING(unicode_escape_sequence)
Return Value
Returns the unescaped version of the specified unicode character, one value per row of type STRING.
Input Parameters
unicode_escape_sequence
Required. A STRING value containing a unicode character expressed as a Java unicode escape sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a 'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits (the characters '0' through '9' or 'a' through 'f' or 'A' through 'F'). Such sequences represent the UTF-16 encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.
Examples
Evaluate whether the currency field is equal to the yen symbol:
CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS
JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results of multiple values with the separator in between each non-null value.
Syntax
JOIN_STRINGS(separator,value_expression[,value_expression][,...])
Return Value
Returns one value per row of type STRING.
Input Parameters
separator
Required. A field name of type STRING, a literal string, or an expression that returns a string.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns any value.
Examples
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY:
JOIN_STRINGS("/",month,day,year)
The following expression returns NULL:
JOIN_STRINGS("+",NULL,NULL,NULL)
The following expression returns a+b:
JOIN_STRINGS("+","a","b",NULL)
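Because NULL values are skipped, the separator is placed only between non-null values, so a NULL in the middle of the argument list should behave the same way. A small sketch with hypothetical literal values:

JOIN_STRINGS("-","555",NULL,"1234") returns 555-1234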
JSON_ARRAY_CONTAINS
JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value.
Syntax
JSON_ARRAY_CONTAINS(json_array_string,"search_string")
Return Value
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.
Input Parameters
json_array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in square brackets.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.
Examples
If you have a software field that contains a JSON array formatted like this:
["hadoop","platfora"]
The following expression returns 1:
JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE
JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.
Syntax
JSON_DOUBLE(json_string,"json_field")
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.
json_field
Required. The key or name of the field value you want to extract. For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED
JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.
Syntax
JSON_FIXED(json_string,"json_field")
Return Value
Returns one value per row of type FIXED.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.
json_field
Required. The key or name of the field value you want to extract. For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_FIXED(top_scores,"test_scores.2")
JSON_INTEGER
JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.
Syntax
JSON_INTEGER(json_string,"json_field")
Return Value
Returns one value per row of type INTEGER.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.
json_field
Required. The key or name of the field value you want to extract. For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA", "zip_code":"94403"}
You could extract the zip_code value using the expression:
JSON_INTEGER(address,"zip_code")
If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):
{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG
JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.
Syntax
JSON_LONG(json_string,"json_field")
Return Value
Returns one value per row of type LONG.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.
json_field
Required. The key or name of the field value you want to extract. For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):
{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_LONG(top_scores,"test_scores.2")

JSON_STRING
JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.
Syntax
JSON_STRING(json_string,"json_field")
Return Value
Returns one value per row of type STRING.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field. To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name). To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0). If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]). If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA", "zip":"94403"}
You could extract the state value using the expression:
JSON_STRING(address,"state")
If you had a misc field that contained a JSON object formatted like this (with the values contained in an array):
{"hobbies":["sailing","hiking","cooking"], "interests":["art","music","travel"]}
You could extract the first value of the hobbies array using the expression:
JSON_STRING(misc,"hobbies.0")

LENGTH
LENGTH is a row function that returns the count of characters in a string value.
Syntax
LENGTH(string)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
string
Required. The name of a field or expression of type STRING (or a literal string).
Examples
Return the count of characters from values in the name field. For example, the value Bob would return a length of 3, Julie would return a length of 5, and so on:
LENGTH(name)

REGEX
REGEX is a row function that performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.
Syntax
REGEX(string_expression,"regex_matching_pattern")
Return Value
Returns the matched STRING value of the first capturing group of the regular expression. If there is no match, returns NULL.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_matching_pattern
Required. A regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. To return a non-NULL value, the regular expression pattern must match the entire string value.
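The whole-string requirement is easy to trip over. Two sketches with literal input (not from the original manual):

REGEX("batch files","batch") returns NULL, because the pattern does not account for the rest of the string.
REGEX("batch files","(batch).*") returns batch, because the pattern now matches the entire value and the function returns the text of the first capturing group.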
Regular Expression Constructs
This section lists a summary of the most commonly used constructs for defining a regular expression matching pattern. See the Regular Expression Reference for more information about regular expression support in Platfora.
Literal and Special Characters
The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it in \Q ... \E. To escape literal double-quotes, double the double-quotes ("").

Character Name        Character   Reserved For
opening bracket       [           start of a character class
closing bracket       ]           end of a character class
hyphen                -           character ranges within a character class
backslash             \           general escape character
caret                 ^           beginning of string, negation of a character class
dollar sign           $           end of string
period                .           matching any single character
pipe                  |           alternation (OR) operator
question mark         ?           optional quantifier, quantifier minimizer
asterisk              *           zero or more quantifier
plus sign             +           once or more quantifier
opening parenthesis   (           start of a subexpression group
closing parenthesis   )           end of a subexpression group
opening brace         {           start of min/max quantifier
closing brace         }           end of min/max quantifier

Character Class Constructs
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Construct      Type           Description
[abc]          simple         matches a or b or c
[^abc]         negation       matches any character except a or b or c
[a-zA-Z]       range          matches a through z, or A through Z (inclusive)
[a-d[m-p]]     union          matches a through d, or m through p
[a-z&&[def]]   intersection   matches d, e, or f
[a-z&&[^xq]]   subtraction    matches a through z, except for x and q

Predefined Character Classes
Predefined character classes offer convenient shorthands for commonly used regular expressions.

.
  Matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files".
\d
  Matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt".
\D
  Matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45".
\s
  Matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook".
\S
  Matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book".
\w
  Matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root".
\W
  Matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME".

Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

^
  Matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33".
$
  Matches from the end of a line (multi-line matches are currently not supported). Example: d$ matches the "d" in "maid" but not in "made".
\b
  Matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island"; \bis matches both "is" and the "is" in "island", but not in "this".
\B
  Matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash".
Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match. The constructs below are listed as greedy / reluctant / possessive.

? / ?? / ?+
  Matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston", but nothing in "clinton" or "version".
* / *? / *+
  Matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print".
+ / +? / ++
  Matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print".
{n} / {n}? / {n}+
  Matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo", but nothing in "mount".
{n,} / {n,}? / {n,}+
  Matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount".
{n,m} / {n,m}? / {n,m}+
  Matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF".

Capturing and Non-Capturing Groups
Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:
(a(b*))+(c)
contains three groups:
group 1: (a(b*))
group 2: (b*)
group 3: (c)
Capturing Groups
By default, a group captures the text that produces a match, and only the most recent match is captured. The REGEX function returns the string that matches the first capturing group in the regular expression. For example, if the input string to the expression above was abc, the entire REGEX function would match to abc, but only return the result of group 1, which is ab.
Non-Capturing Groups
In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.
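Written out as an expression, the grouping example above behaves like this:

REGEX("abc","(a(b*))+(c)") returns ab

The whole pattern matches abc, but the function returns only the text captured by group 1.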
Examples
Match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field:
REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")
Match the request line of a web log, where the value is in the format of GET /some_page.html HTTP/1.1, and return just the requested HTML page names:
REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")
Extract the inches portion from a height field where example values are 6'2", 5'11" (notice the escaping of the literal quote with a double double-quote):
REGEX(height, "\d\'(\d+)""")
Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:
REGEX(device,"(iP[ao]d|iPhone)")

REGEX_REPLACE
REGEX_REPLACE is a row function that evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.
Syntax
REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")
Return Value
Returns the regex_replace_pattern as a STRING value when regex_match_pattern produces a match. If there is no match, returns the value of string_expression as a STRING.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_match_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can use capturing groups to create backreferences that can be used in the regex_replace_pattern. You might want to use a string literal to make a case-sensitive match. For example, when you enter jane as the match value, the function matches jane but not Jane. The function matches all occurrences of a string literal in the string expression.
regex_replace_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can refer to backreferences from the regex_match_pattern using the syntax $n (where n is the group number).
Regular Expression Constructs
The regular expression constructs available to REGEX_REPLACE are the same as those summarized for the REGEX function above (literal and special characters, character classes, predefined character classes, line and word boundaries, and quantifiers). See the Regular Expression Reference for more information.
Examples
Match the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:
REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")
Match the values in a name field where name values are formatted as firstname lastname and replace them with name values formatted as lastname, firstname:
REGEX_REPLACE(name,"(.*) (.*)","$2, $1")
Match the string literal mrs in a title field and replace it with the string literal Mrs:
REGEX_REPLACE(title,"mrs","Mrs")

SPLIT
SPLIT is a row function that breaks down a delimited input string into sections and returns the specified section of the string. A section is considered any sub-string between the specified delimiter.
Syntax
SPLIT(input_string_expression,"delimiter_string",position_integer)
Return Value
Returns one value per row of type STRING.
Input Parameters
input_string_expression
Required. The name of a field or expression of type STRING (or a literal string).
delimiter_string
Required. A literal string representing the delimiter used to separate values in the input string. The delimiter can be a single character or multiple characters.
position_integer
Required. An integer representing the position of the section in the input string that you want to extract. Positive integers count the position from the beginning of the string, and negative integers count the position from the end of the string. A value of 0 returns NULL.
Examples
Return the third section of the literal delimited string Restaurants>Location>San Francisco:
SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco
Return the first section of a phone_number field where phone number values are in the format of 123-456-7890:
SPLIT(phone_number,"-",1)
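Since the delimiter may be more than one character, a multi-character delimiter works the same way. A sketch with a literal input string:

SPLIT("one::two::three","::",2) returns two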
TO_LOWER

TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.

Syntax
TO_LOWER(string_expression)

Return Value
Returns one value per row of type STRING.

Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).

Examples
Return the literal input string 123 Main Street in all lower case letters:
TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER

TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.

Syntax
TO_UPPER(string_expression)

Return Value
Returns one value per row of type STRING.

Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).

Examples
Return the literal input string 123 Main Street in all upper case letters:
TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM

TRIM is a row function that removes leading and trailing spaces from a string value.

Syntax
TRIM(string_expression)

Return Value
Returns one value per row of type STRING.

Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).

Examples
Return the value of the area_code field without any leading or trailing spaces. For example, if the input string is " 650 ", then the return value would be "650":
TRIM(area_code)

Return the value of the phone_number field without any leading or trailing spaces. For example, if the input string is " 650 123-4567 ", then the return value would be "650 123-4567" (note that the extra spaces in the middle of the string are not removed, only the spaces at the beginning and end of the string):
TRIM(phone_number)
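These string functions are often combined to normalize values before grouping or comparing them. A minimal sketch (the email field is a hypothetical example):

TO_LOWER(TRIM(email))

returns the trimmed, lower-cased value, so that " Ann@Example.COM " and "ann@example.com" group together as the same value.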
XPATH_STRING

XPATH_STRING is a row function that takes an XML-formatted string and returns the first string matching the given XPath expression.

Syntax
XPATH_STRING(xml_formatted_string,"xpath_expression")

Return Value
Returns one value per row of type STRING. If the XPath expression matches more than one string in the given XML node, this function will return the first match only. To return all matches, use XPATH_STRINGS instead.

Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

Examples
These example XPATH_STRING expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list>
  <address type="work">
    <street>1300 So. El Camino Real</street>
    <street>Suite 600</street>
    <city>San Mateo</city>
    <state>CA</state>
    <zipcode>94403</zipcode>
  </address>
  <address type="home">
    <street>123 Oakdale Street</street>
    <street/>
    <city>San Francisco</city>
    <state>CA</state>
    <zipcode>94123</zipcode>
  </address>
</list>

Get the zipcode value from any address element where the type attribute equals home:
XPATH_STRING(address,"//address[@type='home']/zipcode") returns: 94123

Get the city value from the second address element:
XPATH_STRING(address,"/list/address[2]/city") returns: San Francisco

Get the values from all child elements of the first address element (as one string):
XPATH_STRING(address,"/list/address") returns: 1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_STRINGS

XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.

Syntax
XPATH_STRINGS(xml_formatted_string,"xpath_expression")

Return Value
Returns one value per row of type STRING. If the XPath expression matches more than one string in the given XML node, this function will return all matches separated by a newline (you cannot specify a different delimiter).

Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

Examples
These example XPATH_STRINGS expressions assume you have a field in your dataset named address that contains XML-formatted strings such as the <list> snippet shown for XPATH_STRING above.

Get all zipcode values from all address elements:
XPATH_STRINGS(address,"//address/zipcode") returns:
94123
94403

Get all street values from the first address element:
XPATH_STRINGS(address,"/list/address[1]/street") returns:
1300 So. El Camino Real
Suite 600

Get the values from all child elements of all address elements (as one string per line):
XPATH_STRINGS(address,"/list/address") returns:
123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_XML

XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.

Syntax
XPATH_XML(xml_formatted_string,"xpath_expression")

Return Value
Returns one value per row of type STRING in XML format.

Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.
Examples
These example XPATH_XML expressions assume you have a field in your dataset named address that contains XML-formatted strings such as the <list> snippet shown for XPATH_STRING above.

Get the last address node and its child nodes in XML format:
XPATH_XML(address,"//address[last()]") returns:
<address type="home"> <street>123 Oakdale Street</street> <street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address>

Get the city value from the second address node in XML format:
XPATH_XML(address,"/list/address[2]/city") returns: <city>San Francisco</city>

Get the first address node and its child nodes in XML format:
XPATH_XML(address,"/list/address[1]") returns:
<address type="work"> <street>1300 So. El Camino Real</street> <street>Suite 600</street> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address>

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY

URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority portion of a URL is the part that has the information on how to locate and connect to the server.

Syntax
URL_AUTHORITY(string)

Return Value
Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the authority portion is www.platfora.com.
In the string http://user:password@mycompany.com:8012/mypage.html, the authority portion is user:password@mycompany.com:8012.
In the string mailto:username@mycompany.com?subject=Topic, the authority portion is NULL.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be preceded by optional user information terminated with @ (for example, username:password@platfora.com), and followed by an optional port number preceded by a colon (for example, localhost:8001).

Examples
Return the authority portion of URL string values in the referrer field:
URL_AUTHORITY(referrer)

Return the authority portion of a literal URL string:
URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012
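Since the authority can carry user information as well as the host and port, it can be broken apart further with SPLIT. A minimal sketch (the referrer field is assumed to hold full URLs like the literal above):

SPLIT(URL_AUTHORITY(referrer),"@",-1)

For the literal URL above, this returns mycompany.com:8012, dropping the user:password prefix.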
In the string http://platfora.com/news.php?topic=press#Platfora%20News, the fragment portion is Platfora%20News.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a secondary resource, such as a heading or anchor identifier.

Examples
Return the fragment portion of URL string values in the request field:
URL_FRAGMENT(request)

Return the fragment portion of a literal URL string:
URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

Return and decode the fragment portion of a literal URL string:
URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")) returns Platfora News

URL_HOST

URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.

Syntax
URL_HOST(string)

Return Value
Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the host portion is www.platfora.com.
In the string http://admin:admin@127.0.0.1:8001/index.html, the host portion is 127.0.0.1.
In the string mailto:username@mycompany.com?subject=Topic, the host portion is NULL.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).

Examples
Return the host portion of URL string values in the referrer field:
URL_HOST(referrer)

Return the host portion of a literal URL string:
URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH

URL_PATH is a row function that returns the path portion of a URL string.

Syntax
URL_PATH(string)

Return Value
Returns the path portion of a URL as a STRING value, NULL if the URL does not contain a path, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the path portion is /company/contact.html.
In the string http://admin:admin@127.0.0.1:8001/index.html, the path portion is /index.html.
In the string mailto:username@mycompany.com?subject=Topic, the path portion is username@mycompany.com.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional path portion of the URL is a sequence of resource location segments separated by a forward slash (/), conceptually similar to a directory path.

Examples
Return the path portion of URL string values in the request field:
URL_PATH(request)

Return the path portion of a literal URL string:
URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html
URL_PORT

URL_PORT is a row function that returns the port portion of a URL string.

Syntax
URL_PORT(string)

Return Value
Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, then returns -1. If the input string is not a valid URL, returns NULL.
For example, in the string http://localhost:8001, the port portion is 8001.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be followed by an optional port number preceded by a colon (for example, localhost:8001).

Examples
Return the port portion of URL string values in the referrer field:
URL_PORT(referrer)

Return the port portion of a literal URL string:
URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL

URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL string.

Syntax
URL_PROTOCOL(string)

Return Value
Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com, the protocol portion is http.
In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is ftp.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon (:). For example: http:, ftp:, mailto:

Examples
Return the protocol portion of URL string values in the referrer field:
URL_PROTOCOL(referrer)

Return the protocol portion of the literal URL string:
URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY

URL_QUERY is a row function that returns the query portion of a URL string.

Syntax
URL_QUERY(string)

Return Value
Returns the query portion of a URL as a STRING value, NULL if the URL does not contain a query, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/contact.html, the query portion is NULL.
In the string http://platfora.com/news.php?topic=press&timeframe=today#Platfora%20News, the query portion is topic=press&timeframe=today.
In the string mailto:username@mycompany.com?subject=Topic, the query portion is subject=Topic.

Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment]. The optional query portion of the URL is separated by a question mark (?) and typically contains an unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).

Examples
Return the query portion of URL string values in the request field:
URL_QUERY(request)

Return the query portion of a literal URL string:
URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today
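Individual parameters can then be pulled out of the query string with SPLIT. A minimal sketch (the request field is assumed to hold URLs such as http://platfora.com/news.php?topic=press&timeframe=today):

SPLIT(SPLIT(URL_QUERY(request),"&",1),"=",2)

This returns press: URL_QUERY extracts topic=press&timeframe=today, the inner SPLIT takes the first key=value pair, and the outer SPLIT takes the value after the equals sign.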
URLDECODE

URLDECODE is a row function that decodes a string that has been encoded with the application/x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP GET request, application/x-www-form-urlencoded data is included in the query component of the request URI. When sent in an HTTP POST request, the data is placed in the body of the message, and the name of the media type is included in the message Content-Type header.

Syntax
URLDECODE(string)

Return Value
Returns a value of type STRING with characters decoded as follows:
• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.
• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain unchanged.
• The plus sign (+) character is converted to a space character.
• The percent character (%) is interpreted as the start of a special escaped sequence, where in the sequence %HH, HH represents the hexadecimal value of the byte. For example, some common escape sequences are:

%20 space
%0A or %0D or %0D%0A newline
%22 double quote (")
%25 percent (%)
%2D hyphen (-)
%2E period (.)
%3C less than (<)
%3E greater than (>)
%5C backslash (\)
%7C pipe (|)

Input Parameters
string
Required. A field or expression that returns a STRING value. It is assumed that all characters in the input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits (0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character (+) is allowed, but is interpreted as a space character.

Examples
Decode the values of the url_query field:
URLDECODE(url_query)

Convert a literal URL-encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable value (N/A or "not applicable"):
URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not applicable"

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH

CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.

Syntax
CIDR_MATCH(CIDR_string, IP_string)

Return Value
Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask and 0 if it does not.

Input Parameters
CIDR_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR mask (Classless Inter-Domain Routing subnet notation). An IPv4 CIDR mask can only successfully match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.
IP_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet protocol (IP) address.

Examples
Compare an IPv4 CIDR subnet mask to an IPv4 IP address:
CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1
CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0

Compare an IPv6 CIDR subnet mask to an IPv6 IP address:
CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1
CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0
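A common use is to flag traffic by network in a computed field. A minimal sketch using a CASE expression (the client_ip field and the 10.0.0.0/8 range are hypothetical):

CASE WHEN CIDR_MATCH("10.0.0.0/8",client_ip)=1 THEN "internal" ELSE "external" END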
HEX_TO_IP

HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of an IP address.

Syntax
HEX_TO_IP(string)

Return Value
Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address returned depends on the input string. An 8 character hexadecimal string will return an IPv4 address. A 32 character hexadecimal string will return an IPv6 address. IPv6 addresses are represented in full length, without removing any leading zeros and without using the compressed :: notation. For example, 2001:0db8:0000:0000:0000:ff00:0042:8329 rather than 2001:db8::ff00:42:8329. Input strings that do not contain either 8 or 32 valid hexadecimal characters will return NULL.

Input Parameters
string
Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters long (in which case it is converted to an IPv6 address).

Examples
Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips column:
HEX_TO_IP(byte_encoded_ips)

Convert an 8 character hexadecimal-encoded string to a plain text IPv4 address:
HEX_TO_IP("AB20FE01") returns 171.32.254.1

Convert a 32 character hexadecimal-encoded string to a plain text IPv6 address:
HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns fe80:0000:0000:0000:0202:b3ff:fe1e:8329

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

DAYS_BETWEEN

DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between two DATETIME values (value1-value2).

Syntax
DAYS_BETWEEN(datetime_1,datetime_2)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.

Examples
Calculate the number of days to ship a product by subtracting the value of the order_date field from the ship_date field:
DAYS_BETWEEN(ship_date,order_date)

Calculate the number of days since a product's release by subtracting the value of the release_date field in the product dataset from the current date (the result of the NOW expression):
DAYS_BETWEEN(NOW(),product.release_date)

DATE_ADD

DATE_ADD is a row function that adds the specified time interval to a DATETIME value.

Syntax
DATE_ADD(datetime,quantity,"interval")

Return Value
Returns a value of type DATETIME.

Input Parameters
datetime
Required. A field name or expression that returns a DATETIME value.
quantity
Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use a negative integer.
interval
Required. One of the following time intervals:
• millisecond - Adds the specified number of milliseconds to a datetime value.
• second - Adds the specified number of seconds to a datetime value.
• minute - Adds the specified number of minutes to a datetime value.
• hour - Adds the specified number of hours to a datetime value.
• day - Adds the specified number of days to a datetime value.
• week - Adds the specified number of weeks to a datetime value.
• month - Adds the specified number of months to a datetime value.
• quarter - Adds the specified number of quarters to a datetime value.
• year - Adds the specified number of years to a datetime value.
• weekyear - Adds the specified number of weekyears to a datetime value.

Examples
Add 45 days to the value of the invoice_date field to calculate the date a payment is due:
DATE_ADD(invoice_date,45,"day")
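Because a negative quantity subtracts the interval, the same function can also shift dates backward. A minimal sketch (reusing the invoice_date field from the example above):

DATE_ADD(invoice_date,-30,"day")

returns the datetime 30 days before the invoice date.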
HOURS_BETWEEN

HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values (value1-value2).

Syntax
HOURS_BETWEEN(datetime_1,datetime_2)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.

Examples
Calculate the number of hours to ship a product by subtracting the value of the order_date field from the ship_date field:
HOURS_BETWEEN(ship_date,order_date)

Calculate the number of hours since an advertisement was viewed by subtracting the value of the adview_timestamp field in the impressions dataset from the current date and time (the result of the NOW expression):
HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT

EXTRACT is a row function that returns the specified portion of a DATETIME value.

Syntax
EXTRACT("extract_value",datetime)

Return Value
Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example, the month of April returns a value of 4, not 04.

Input Parameters
extract_value
Required. One of the following extract values:
• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 213.
• second - Returns the second portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 40.
• minute - Returns the minute portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 38.
• hour - Returns the hour portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 20.
• day - Returns the day portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 15.
• week - Returns the ISO week number for the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52 (January 1, 2012 is part of the last ISO week of 2011).
• month - Returns the month portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 8.
• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1, April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an integer value of 3.
• year - Returns the year portion of a datetime value. For example, an input datetime value of 2012-01-01 would return an integer value of 2012.
• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012 (the first ISO week of 2012 starts on Monday January 2).
An input datetime value of 2012-01-01 would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).
datetime
Required. A field name or expression that returns a DATETIME value.

Examples
Extract the hour portion from the order_date datetime field:
EXTRACT("hour",order_date)

Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO week year:
EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))

MILLISECONDS_BETWEEN

MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between two DATETIME values (value1-value2).

Syntax
MILLISECONDS_BETWEEN(datetime_1,datetime_2)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.

Examples
Calculate the number of milliseconds it took to serve a web page by subtracting the value of the request_timestamp field from the response_timestamp field:
MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN

MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values (value1-value2).

Syntax
MINUTES_BETWEEN(datetime_1,datetime_2)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.

Examples
Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:
MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of the NOW expression):
MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)

NOW

NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be used in other expressions involving DATETIME type fields, such as DAYS_BETWEEN, MINUTES_BETWEEN, or YEAR_DIFF. Note that the value of NOW is only evaluated at the time a lens is built (it is not re-evaluated with each query).

Syntax
NOW()

Return Value
Returns the current system date and time as a DATETIME value.

Examples
Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users dataset from the current date:
YEAR_DIFF(NOW(),users.birthdate)

Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of the release_date field from the current date:
DAYS_BETWEEN(NOW(),release_date)

SECONDS_BETWEEN

SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values (value1-value2).

Syntax
SECONDS_BETWEEN(datetime_1,datetime_2)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:
SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of the NOW expression):
SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)

TRUNC

TRUNC is a row function that truncates a DATETIME value to the specified format.

Syntax
TRUNC(datetime,"format")

Return Value
Returns a value of type DATETIME truncated to the specified format.

Input Parameters
datetime
Required. A field or expression that returns a DATETIME value.
format
Required. One of the following format values:
• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect since millisecond is already the most granular format for datetime values. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.
• second - Returns a datetime value truncated to second granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.
• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.
• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.
• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.
• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of 2012-08-13 (the Monday prior).
• month - Returns a datetime value truncated to the first day of the month. For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-08-01.
• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1, or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-07-01.
• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.
• weekyear - Returns a datetime value truncated to the first day of the ISO weekyear (the ISO week starting with the Monday which is nearest in time to January 1). For example, an input datetime value of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO weekyear for 2008 is December 31, 2007 (the prior Monday closest to January 1).

Examples
Truncate the order_date datetime field to day granularity:
TRUNC(order_date,"day")

Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day granularity:
TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF

YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME values (value1-value2).
Syntax
YEAR_DIFF(datetime_1,datetime_2)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.

Examples
Calculate the number of years a user has been a customer by subtracting the value of the registration_date field from the current date (the result of the NOW expression):
YEAR_DIFF(NOW(),registration_date)

Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current date (the result of the NOW expression):
YEAR_DIFF(NOW(),users.birthdate)

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use arithmetic operators to perform simple math calculations.

DIV

DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result is truncated to 0 decimal places).

Syntax
DIV(dividend,divisor)

Return Value
Returns one value per row of type LONG.

Input Parameters
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.

Examples
Cast the value of the file_size field to LONG and divide by 1024:
DIV(TO_LONG(file_size),1024)

EXP

EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.

Syntax
EXP(power)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
power
Required. A field or expression of a numeric type.

Examples
Raise e to the power given in the Value field:
EXP(Value)
When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.

FLOOR

FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.

Syntax
FLOOR(double)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
double
Required. A field or expression of type DOUBLE.

Examples
Return the floor value of 32.6789:
FLOOR(32.6789) returns 32.0

HASH

HASH is a row function that evenly partitions data values into the specified number of buckets. It creates a hash of the input value and assigns that value a bucket number. Equal values will always hash to the same bucket number.

Syntax
HASH(field_name,integer)

Return Value
Returns one value per row of type INTEGER corresponding to the bucket number that the input value hashes to.

Input Parameters
field_name
Required. The name of the field whose values you want to partition.
integer
Required. The desired number of buckets. This parameter can be a numeric value of any data type, but when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero, the function returns NULL. When the value is negative, the function uses the absolute value.

Examples
Partition the values of the username field into 20 buckets:
HASH(username,20)
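Because equal values always hash to the same bucket, HASH can also drive a deterministic sample. A minimal sketch using a CASE expression (the user_id field is hypothetical; 100 buckets yields roughly a 1% sample):

CASE WHEN HASH(user_id,100)=0 THEN "sample" ELSE "other" END

Every row for a given user lands in the same bucket, so the sample keeps whole users rather than scattered rows.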
LN

LN is a row function that returns the natural logarithm of a number. The natural logarithm is the logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to 2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in order to equal x.

Syntax
LN(positive_number)

Return Value
Returns the exponent to which base e must be raised to obtain the input value, where e denotes the constant number 2.718281828. The return value is the same data type as the input value. For example, LN(7.389) is approximately 2, because e to the power of 2 is approximately 7.389.

Input Parameters
positive_number
Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER, LONG, DOUBLE, or FIXED.

Examples
Return the natural logarithm of the base number e, which is approximately 2.718281828:
LN(2.718281828) returns 1
LN(3.0000) returns 1.098612
LN(300.0000) returns 5.703782

MOD

MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the result is truncated to 0 decimal places).

Syntax
MOD(dividend,divisor)

Return Value
Returns one value per row of type LONG.

Input Parameters
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.

Examples
Cast the value of the file_size field to LONG and return the remainder after dividing by 1024:
MOD(TO_LONG(file_size),1024)
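DIV and MOD are natural companions: one gives the whole quotient, the other the remainder. A minimal sketch (a file_size value of 2600 bytes is a hypothetical example):

DIV(TO_LONG(file_size),1024) returns 2 (whole kilobytes)
MOD(TO_LONG(file_size),1024) returns 552 (leftover bytes)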
POW

POW is a row function that raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.

Syntax
POW(index,power)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
index
Required. A field or expression of a numeric type.
power
Required. A field or expression of a numeric type.

Examples
Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five year span:
100 * (POW(end_value/start_value, 0.2) - 1)

Calculate the square of the Value field:
POW(Value,2)

Calculate the square root of the Value field:
POW(Value,0.5)

The following expression returns 1:
POW(0,0)

ROUND

ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.

Syntax
ROUND(double,number_decimal_places)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
double
Required. A field or expression of type DOUBLE.
number_decimal_places
Required. An integer that specifies the number of decimal places to round to.

Examples
Round the number 32.4678954 to two decimal places:
ROUND(32.4678954,2) returns 32.47

Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE

EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.

Syntax
EPOCH_MS_TO_DATE(long_expression)

Return Value
Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.

Input Parameters
long_expression
Required. A field or expression of type LONG representing the number of milliseconds since the epoch datetime (January 1, 1970 00:00:00:000 GMT).

Examples
Convert a number representing the number of milliseconds from the epoch to a human-readable date and time:
EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

Or if your data is in seconds instead of milliseconds:
EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

TO_CURRENCY

This function is deprecated. Use the TO_FIXED function instead.

TO_DATE

TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.

Syntax
TO_DATE(string_expression,"date_format")

Return Value
Returns one value per row of type DATETIME (which by definition is in UTC).

Input Parameters
string_expression
Required. A field or expression of type STRING.
date_format
Required. A pattern that describes how the date is formatted.

Date Pattern Format
Use the following pattern symbols to define your date format. The count and ordering of the pattern letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear in the resulting output even if they are not escaped with single quotes.

Table 3: Date Pattern Symbols

G - era (text). Example: AD
C - century of era, 0 or greater (number). Example: 20
Y - year of era, 0 or greater (year). Example: 1996
x - week year (year). Example: 1996
w - week number of week year (number). Example: 27
e - day of week as a number (number). Example: 2
E - day of week as a name (text). Examples: Tuesday; Tue
y - year (year). Example: 1996
D - day of year (number). Example: 189
M - month of year (month). Examples: July; Jul; 07
d - day of month (number). Example: 10
a - half day of day (text). Example: PM
K - hour of half day, 0-11 (number). Example: 0
h - clock hour of half day, 1-12 (number). Example: 12
H - hour of day, 0-23 (number). Example: 0
k - clock hour of day, 1-24 (number). Example: 24
m - minute of hour (number). Example: 30
s - second of minute (number). Example: 55
S - fraction of second (number). Example: 978
z - time zone (text). Examples: Pacific Standard Time; PST
Z - time zone offset/id (zone). Examples: -0800; -08:00; America/Los_Angeles
' - escape character for text-based delimiters
'' - literal representation of a single quote

Notes on pattern letter counts: numeric presentations for the year and week year fields are handled specially; for example, if the count of 'y' is 2, the year is displayed as the zero-based year of the century, which is two digits. For month, if the number of pattern letters is 3 or more, the text form is used; otherwise the number is used. For text presentations such as the day-of-week name and time zone, if the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used. For the time zone offset/id, 'Z' outputs the offset without a colon, 'ZZ' outputs the offset with a colon, and 'ZZZ' or more outputs the zone id.
Examples
Define a new DATETIME computed field based on the order_date base field, which contains timestamps in the format of 2014.07.10 at 15:08:56 PDT:
TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

Define a new DATETIME computed field by first combining individual month, day, year, and depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure three-digit times are converted to four-digit times by prepending a zero (using REGEX_REPLACE):
TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")

Define a new DATETIME computed field based on the created_at base field, which contains timestamps in the format of Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's API):
TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")

TO_DOUBLE

TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.

Syntax
TO_DOUBLE(expression)

Return Value
Returns one value per row of type DOUBLE.

Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Examples
Convert the values of the average_rating field to a double data type:
TO_DOUBLE(average_rating)

Convert the average_rating field to a double data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:
TO_DOUBLE(CASE WHEN average_rating="N/A" THEN NULL ELSE average_rating END)

TO_FIXED

TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.

Syntax
TO_FIXED(expression)

Return Value
Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).

Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Examples
Convert the opening_price field to a fixed decimal data type:
TO_FIXED(opening_price)

Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A string values to NULL values using a CASE expression:
TO_FIXED(CASE WHEN sale_price="N/A" THEN NULL ELSE sale_price END)

TO_INT

TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

Syntax
TO_INT(expression)

Return Value
Returns one value per row of type INTEGER.

Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Examples
Convert the values of the average_rating field to an integer data type:
TO_INT(average_rating)

Convert the flight_duration field to an integer data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:
TO_INT(CASE WHEN flight_duration="N/A" THEN NULL ELSE flight_duration END)

TO_LONG

TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values.
When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

Syntax
TO_LONG(expression)

Return Value
Returns one value per row of type LONG.

Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Examples
Convert the values of the average_rating field to a long data type:
TO_LONG(average_rating)

Convert the average_rating field to a long data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:
TO_LONG(CASE WHEN average_rating="N/A" THEN NULL ELSE average_rating END)

TO_STRING

TO_STRING is a row function that converts values of other data types to STRING (character) values.

Syntax
TO_STRING(expression)
TO_STRING(datetime_expression,date_format)

Return Value
Returns one value per row of type STRING.

Input Parameters
expression
A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.
datetime_expression
A field or expression of type DATETIME.
date_format
If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE for the date format patterns.

Examples
Convert the values of the sku_number field to a string data type:
TO_STRING(sku_number)

Convert values in the age column into range-based groupings (binning), and cast the output values to a STRING:
TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END)

Convert the values of a timestamp datetime field to a string, where the timestamp values are in the format of 2002.07.10 at 15:08:56 PDT:
TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. Aggregate functions cannot be combined with row functions.

AVG

AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT expression instead.

Syntax
AVG(numeric_field)

Return Value
Returns a value of type DOUBLE.

Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples
Get the average of the valid sale_amount field values:
AVG(sale_amount)

Get the average of the valid net_worth field values in the billionaires dataset, which resides in the samples namespace:
AVG([(samples) billionaires].net_worth)

Get the average of all page_views field values in the web_logs dataset (including NULL values):
SUM(page_views)/COUNT(web_logs)

COUNT

COUNT is an aggregate function that returns the number of rows in a dataset.

Syntax
COUNT([namespace_name]dataset_name)

Return Value
Returns a value of type INTEGER.

Input Parameters
namespace_name
Optional. The name of the namespace in which the dataset resides. If not specified, uses the default namespace.
dataset_name
Required. The name of the dataset for which to obtain a count of rows.
If you want to count rows of a down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names in the format of: parent_dataset_name.child_dataset_name.[...]

Examples
Count the rows in the sales dataset:
COUNT(sales)

Count the rows in the billionaires dataset, which resides in the samples namespace:
COUNT([(samples) billionaires])

Count the rows in the customers dataset, which is a related dataset down-stream of sales:
COUNT(sales.customers)

COUNT_VALID

COUNT_VALID is an aggregate function that returns the number of rows for which the given expression is valid (excludes NULL values).

Syntax
COUNT_VALID(field)

Return Value
Returns a numeric value of type INTEGER.

Input Parameters
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Examples
Count the valid values in the page_views field:
COUNT_VALID(page_views)

DISTINCT

DISTINCT is an aggregate function that returns the number of distinct values for the given expression.

Syntax
DISTINCT(field)

Return Value
Returns a numeric value of type INTEGER.

Input Parameters
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Examples
Count the unique values of the user_id field in the currently selected dataset:
DISTINCT(user_id)

Count the unique values of the name field in the billionaires dataset, which resides in the samples namespace:
DISTINCT([(samples) billionaires].name)

Count the unique values of the customer_id field in the customers dataset, which is a related dataset down-stream of web sales:
DISTINCT([web sales].customers.customer_id)

MAX

MAX is an aggregate function that returns the largest value from the given input expression.

Syntax
MAX(numeric_or_datetime_field)

Return Value
Returns a numeric or datetime value of the same type as the input expression.

Input Parameters
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Examples
Get the highest value from the sale_amount field:
MAX(sale_amount)

Get the latest date from the Session Timestamp datetime field:
MAX([Session Timestamp])

MIN

MIN is an aggregate function that returns the smallest value from the given input expression.

Syntax
MIN(numeric_or_datetime_field)

Return Value
Returns a numeric or datetime value of the same type as the input expression.

Input Parameters
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Examples
Get the lowest value from the sale_amount field:
MIN(sale_amount)

Get the earliest date from the Session Timestamp datetime field:
MIN([Session Timestamp])

SUM

SUM is an aggregate function that returns the total of all values from the given input expression.

Syntax
SUM(numeric_field)

Return Value
Returns a numeric value of the same type as the input expression.

Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.
Examples
Add the values of the sale_amount field:
SUM(sale_amount)

Add the values of the session count field in the users dataset, which is a related dataset down-stream of clicks:
SUM(clicks.users.[session count])

STDDEV

STDDEV is an aggregate function that calculates the population standard deviation for a group of numeric values. Standard deviation is the square root of the variance.

Syntax
STDDEV(numeric_field)

Return Value
Returns a value of type DOUBLE. If there are less than two values in the input group, returns NULL.

Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples
Calculate the standard deviation of the values contained in the sale_amount field:
STDDEV(sale_amount)

VARIANCE

VARIANCE is an aggregate function that calculates the population variance for a group of numeric values. Variance measures the amount by which all values in a group vary from the average value of the group. Data with low variance contains values that are identical or similar. Data with high variance contains values that are not similar. Variance is calculated as the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

Syntax
VARIANCE(numeric_field)

Return Value
Returns a value of type DOUBLE. If there are less than two values in the input group, returns NULL.

Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples
Get the population variance of the values contained in the sale_amount field:
VARIANCE(sale_amount)

ROLLUP and Window Functions

Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate expression that determines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied. ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

ROLLUP

ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation over a subset of rows within the overall result of a viz query.

Syntax
ROLLUP aggregate_expression
[ WHERE input_group_condition [...] ]
[ TO ([partitioning_columns]) ]
[ ORDER BY (ordering_column [ASC | DESC])
  ROWS|RANGE window_boundary [window_boundary]
  | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description
A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric column of a dataset.
ROLLUP and Window Functions

Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate expression that determines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied. ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

ROLLUP

ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation over a subset of rows within the overall result of a viz query.

Syntax

ROLLUP aggregate_expression
[ WHERE input_group_condition [...] ]
[ TO ([partitioning_columns]) ]
[ ORDER BY (ordering_column [ASC | DESC])
  ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric column of a dataset. For example, suppose we had a dataset with the following rows and columns:

Date        Sale Amount  Product  Region
05/01/2013  100          gadget   west
05/01/2013  200          widget   east
06/01/2013  100          gadget   east
06/01/2013  400          widget   west
07/01/2013  300          widget   west
07/01/2013  200          gadget   east

To define a regular measure called Total Sales, we would use the expression:

SUM([Sale Amount])

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimensions selected by the user when they create the viz. For example, if the user chose Region as a dimension in the viz, there would be two input groups for which the measure would be calculated:

Total Sales / Region
east: 500    west: 800

If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the ROLLUP expression determine the additional partitions over which to compute the aggregate expression. It divides the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression within each bucket. Every ROLLUP expression has implicit partitioning defined: an absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever dimension columns are present in the viz query.

The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the WHERE clause criteria will be partitioned, and rows that don't will not be partitioned.

The ORDER BY with a RANGE or ROWS clause is used to define a window frame within each partition over which to compute the aggregate expression.

When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is similar to the type of calculation that is done with a regular measure. However, unlike a regular measure, a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows still retain their separate identities. The ROLLUP clause determines how the input rows are split up for processing by the ROLLUP's aggregate function.

ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns are selected in the visualization. This is done by using a reference name as the partitioning column, as opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions total sales by date as follows:

ROLLUP SUM([Sale Amount]) TO (Date)

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure expression would aggregate the sales by month only. Notice that the results for the east and west regions are the same - this is because the aggregation expression is only considering rows that share the same month when calculating the sum of sales.

Month / Rollup Sales to Date (Region east | west)
May 2013:  300 | 300
June 2013: 500 | 500
July 2013: 500 | 500
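ROLLUP measures like this are typically combined with regular measures in one computed field. As a sketch of the pattern (it uses the sample columns above; the same percentage idiom appears in the examples later in this section), each region's share of its month's sales could be written as:

100 * SUM([Sale Amount]) / ROLLUP SUM([Sale Amount]) TO (Date)

With Date.Month and Region in the viz, May 2013 would show east at roughly 67 (200 of 300) and west at roughly 33 (100 of 300).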
Suppose within the date partition, we wanted to calculate the cumulative total day to day. We could define a window measure called Running Total to Date that looks at each day and all preceding days as follows:

ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED PRECEDING

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or mark), and all rows that come before it within the partition. For example, if the user chose the dimension Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales within each month.

Month / Date.Date / Running Total to Date
May 2013:  2013-05-01  300
June 2013: 2013-06-01  500
July 2013: 2013-07-01  500

Return Value

Returns a numeric value per partition based on the output type of the aggregate_expression.

Input Parameters

aggregate_expression
Required. An expression containing an aggregate or window function. Simple aggregate functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions such as RANK, DENSE_RANK, and NTILE are supported and can only be used in conjunction with ROLLUP. Complex aggregate functions such as STDDEV and VARIANCE are not supported.

WHERE input_group_condition
The WHERE clause limits the group of input rows over which to compute the aggregate expression. The input group condition is a Boolean (true or false) condition defined using a comparison operator expression. Any row that does not satisfy the condition will be excluded from the input group used to calculate the aggregated measure value. For example (note that datetime values must be specified in yyyy-MM-dd format):

WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31
WHERE Date.Year BETWEEN 2009 AND 2013
WHERE Company LIKE("Plat*")
WHERE Code IN("a","b","c")
WHERE Sales < 50.00
WHERE Age >= 21

You can specify multiple WHERE clauses in a ROLLUP expression.

TO ([partitioning_columns])
The TO clause is used to specify the dimension column(s) used to partition a group of input rows. This allows you to calculate a measure value for a specific dimension group (a subset of input rows) that is somehow related to the other dimension groups used in a visualization (all input rows). It is possible to define an empty group (meaning all rows) by using empty parentheses. When used in a visualization, measure values are computed for groups of input rows that return the same value for the columns specified in the partitioning list. For example, if the Date.Month column is used as a partitioning column, then all records that have the same value for Date.Month will be grouped together in order to calculate the measure value. The aggregate expression is applied to the group specified in the TO clause independently of the other dimension groupings used in the visualization.
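The WHERE and TO clauses can be combined in a single ROLLUP. As an illustrative sketch against the sample sales columns above (not an example from the original guide), this measure sums only widget sales within each date partition, regardless of which other dimensions appear in the viz:

ROLLUP SUM([Sale Amount]) WHERE Product="widget" TO (Date)

The benchmark-normalization example later in this section (WHERE [Carrier Code]="AA") uses the same input-group filtering idea.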
Note that the partitioning column(s) specified in the TO clause of an adaptive measure expression must also be included as dimensions (or grouping columns) in the visualization.

A partitioning column can also be the name of a reference field. Using a reference field allows the partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz. For example, if the partition column is a reference field pointing to the Date dimension, then any subfield of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a viz.

A TO clause with an empty partitioning list treats each mark in the result set as an input group. For example, if the viz includes the Month and Region columns, then TO() would be equivalent to TO(Month,Region).

ORDER BY (ordering_column)
The optional ORDER BY clause orders the input rows using the values in the specified column within each partition identified in the TO clause. Use the ORDER BY clause along with the ROWS or RANGE clauses to define windows over which to compute the aggregate function. This is useful for computing moving averages, cumulative aggregates, running totals, or a top value per group of input rows.

The ordering column specified in the ORDER BY clause can be a dimension, measure, or an aggregate expression (for example ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in the viz. By default, rows are sorted in ascending order (low to high values). You can use the DESC keyword to sort in descending order (high to low values).

ROWS | RANGE
Required when using ORDER BY. Further limits the rows within the partition by specifying start and end points within the partition. This is done by specifying a range of rows with respect to the current row either by logical association (RANGE) or physical association (ROWS). Use either a ROWS or RANGE clause to express the window boundary (the set of input rows in each partition, relative to the current row, over which to compute the aggregate expression). The window boundary can include one, several, or all rows of the partition. When using the RANGE clause, the ordering column used in the ORDER BY clause must be a sub-column of a reference to Platfora's built-in Date dimension dataset.

window_boundary
A window boundary is required when using either ROWS or RANGE. This defines the set of rows, relative to the current row, over which to compute the aggregate expression. The row order is based on the ordering specified in the ORDER BY clause.

A PRECEDING clause defines a lower window boundary (the number of rows to include before the current row). The FOLLOWING clause defines an upper window boundary (the number of rows to include after the current row). The window boundary expression must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING is omitted, the current row is considered the first row in the window. Similarly, if FOLLOWING is omitted, the current row is considered the last row in the window.

The UNBOUNDED keyword includes all rows in the direction specified. When you need to specify both a start and end of a window, use the BETWEEN and AND keywords. For example:

ROWS 2 PRECEDING means that the window is three rows in size, starting with two rows preceding until and including the current row.
ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight rows in size, starting with two rows preceding, the current row, and five rows following the current row. The current row is included in the set of rows by default. You can exclude the current row from the window by specifying a window start and end point before or after the current row. For example: ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window with all rows that come before the current row, and ends the window one row before the current row, thereby excluding the current row from the window. Examples Calculate the percentage of flight records in the same departure date period. Note that the departure_date field is a reference to the Date dataset, meaning that the group to which the measure is applied can adapt to any downstream field of departure_date (departure_date.Year, departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for each dimension group in the viz that share the same value for departure_date: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date]) Page 440 Data Ingest Guide - Platfora Expression Language Reference Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will allow you to compare the number of flights for other carriers against the fixed baseline number of flights for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage): 100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA" Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The input rows depend on the dimensions selected in the viz. 100 * SUM(sales) / ROLLUP SUM(sales) TO () Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales totals): ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED PRECEDING Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current year from the total. ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING DENSE_RANK DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are not skipped in the event of a tie. DENSE_RANK must be used within a ROLLUP expression. Syntax ROLLUP DENSE_RANK() TO ([partitioning_column]) ORDER BY (measure_expression [ASC | DESC]) ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ] where window_boundary can be one of: UNBOUNDED PRECEDING value PRECEDING value FOLLOWING UNBOUNDED FOLLOWING Description DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and subsequent rank positions are not skipped. Page 441 Data Ingest Guide - Platfora Expression Language Reference The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause. The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. 
The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Dense_Rank using the following expression:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and no rank positions are skipped:

Quarter  Region  Sales_Dense_Rank
2010 Q1  North   6
2010 Q1  South   4
2010 Q1  East    2
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   3
2010 Q2  East    5
2010 Q2  West    3

Return Value

Returns a value of type LONG.

Input Parameters

ROLLUP
Required. DENSE_RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Examples

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1.

ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
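The TO clause can also name a different partitioning column. As a hypothetical variation on the example above, this sketch ranks each region's quarters against each other (partitioning by Region, so each region's best quarter gets rank 1):

ROLLUP DENSE_RANK() TO (Region) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

With the sample data, North's 2010 Q2 (400) would rank 1 and its 2010 Q1 (100) would rank 2.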
NTILE

NTILE is a windowing aggregate function that divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. NTILE must be used within a ROLLUP expression.

Syntax

ROLLUP NTILE(integer)
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

NTILE is a window aggregate function typically used to calculate percentiles. A percentile (or centile) is a measure used in statistics indicating the value below which a given percentage of records in a group falls. For example, the 20th percentile is the value (or score) below which 20 percent of the records may be found. The term percentile is often used in the reporting of test scores. For example, if a score is in the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). In general, percentiles, deciles, and quartiles are specific types of ntiles.

NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP is used to specify a fixed dimension column used to partition a group of input rows. To define a global NTILE ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause. The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are divided into buckets. The ORDER BY clause should specify the measure field for which you want to calculate NTILE bucket values. A centile would be 100 buckets, a decile would be 10 buckets, a quartile 4 buckets, and so on. The buckets in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to divide the year-to-date sales into four buckets (quartiles) with the highest quartile ranked as 1 and the lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following expression:

ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Name     Gender  Sales YTD  YTD_Sales_Quartile
Chen     F       3,500,000  1
John     M       3,100,000  1
Pete     M       2,900,000  1
Daria    F       2,500,000  2
Jennie   F       2,200,000  2
Mary     F       2,100,000  2
Mike     M       1,900,000  3
Brian    M       1,700,000  3
Molly    F       1,500,000  3
Theresa  F       1,200,000  4
Hans     M       900,000    4
Ben      M       500,000    4

Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the quartiles would then be computed per gender. The following example divides each gender into buckets, with each gender having 6 year-to-date sales values. The two extra values (the remainder of 6 / 4) are allocated to buckets 1 and 2, which therefore have one more value than buckets 3 or 4.

Name     Gender  Sales YTD  YTD_Sales_Quartile (partitioned by Gender)
Chen     F       3,500,000  1
Daria    F       2,500,000  1
Jennie   F       2,200,000  2
Mary     F       2,100,000  2
Molly    F       1,500,000  3
Theresa  F       1,200,000  4
John     M       3,100,000  1
Pete     M       2,900,000  1
Mike     M       1,900,000  2
Brian    M       1,700,000  2
Hans     M       900,000    3
Ben      M       500,000    4

Return Value

Returns a value of type LONG.

Input Parameters

ROLLUP
Required. NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

integer
Required. An integer that specifies the number of buckets to divide the partitioned rows into.

Examples

Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example, if you wanted to get the percentile of Total Records per City, you may think the expression to use is:

ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s) you use in the viz. To calculate the Total Records percentiles by City, you could define a global Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in the viz (or any other dimension for that matter).

ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
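Other ntiles follow the same pattern by changing only the bucket count. For instance, an adaptive decile version of the measure above would be the same sketch with 10 buckets:

ROLLUP NTILE(10) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING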
RANK

RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used within a ROLLUP expression.

Syntax

ROLLUP RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and the subsequent rank position is skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause. The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Rank using the following expression:

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and the rank positions 2 and 5 are skipped:

Quarter  Region  Sales_Rank
2010 Q1  North   8
2010 Q1  South   6
2010 Q1  East    3
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   4
2010 Q2  East    7
2010 Q2  West    4

Return Value

Returns a value of type LONG.

Input Parameters

ROLLUP
Required. RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP. The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group. The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Examples

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1.

ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
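RANK and DENSE_RANK differ only in how ties advance the rank counter. Ordering the sample Sales values above in descending order (400, 400, 300, 250, 250, 200, 150, 100):

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

yields 1, 1, 3, 4, 4, 6, 7, 8 (positions after a tie are skipped), while:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

yields 1, 1, 2, 3, 3, 4, 5, 6 (no positions are skipped).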
ROW_NUMBER

ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. The column in the ROLLUP expression's ORDER BY clause determines the order in which the row numbers are assigned.

Syntax

ROLLUP ROW_NUMBER()
TO ([partitioning_column])
ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary] | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

Unlike RANK and DENSE_RANK, ROW_NUMBER assigns a distinct number to every row in the partition, including rows with tied values. For example, suppose we had a dataset with the following rows and columns:

Quarter  Region  Sales
2010 Q1  North   100
2010 Q1  South   200
2010 Q1  East    300
2010 Q1  West    400
2010 Q2  North   400
2010 Q2  South   250
2010 Q2  East    150
2010 Q2  West    250

Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could then define a measure called SalesNumber using the following expression:

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data points:

Quarter  Region  SalesNumber
2010 Q1  North   4
2010 Q1  South   3
2010 Q1  East    2
2010 Q1  West    1
2010 Q2  North   1
2010 Q2  South   2
2010 Q2  East    4
2010 Q2  West    3

Return Value

Returns a value of type LONG.

Input Parameters

None

Examples

Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is given the number of 1.

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions (UDFs)

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. User defined functions can only be used to implement new row functions, not aggregate functions.

If a computed field that uses a UDF is included in a lens, the UDF will be executed once for each row during the lens build process. This is good to keep in mind when writing UDF Java programs, so you do not write programs that negatively impact lens build resources or execution times.

Writing a Platfora UDF Java Program

User defined functions (UDFs) are written in the Java programming language and implement the Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction. Verify that any JAR file that the UDF will use is compatible with the existing libraries Platfora uses. You can find those libraries in $PLATFORA_HOME/lib.

To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6 or 7 installed on the machine where you plan to do your development. You will also need the com.platfora.udf.UserDefinedFunction interface Java code from your Platfora master server installation.
If you go to the $PLATFORA_HOME/tools/udf directory of your Platfora master server installation, you will find two files: • platfora-udf.jar – This is the compiled code for the com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it in the CLASSPATH) when you compile your UDF Java program. • /com/platfora/udf/UserDefinedFunction.java – This is the source code for the Java interface that your UDF classes need to implement. The source code is provided as reference documentation of the Platfora UserDefinedFunction interface. You can refer to this file when writing your UDF Java programs. 1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the machine where you plan to develop and compile your UDF program. 2. Write a Java program that implements com.platfora.udf.UserDefinedFunction interface. For example, here is a sample Java program that defines a REPEAT_STRING user defined function. This simple function repeats an input string a specified number of times. import java.util.List; /** * Sample user-defined function implementation that demonstrates * how to create a REPEAT_STRING function. */ Page 451 Data Ingest Guide - Platfora Expression Language Reference public class RepeatString implements com.platfora.udf.UserDefinedFunction { /** * Returns the name of the user-defined function. * The first character in the name must be a letter, * and subsequent characters must be either letters, * digits, or underscores. You cannot name your function * the same name as an existing Platfora * built-in function. Names are case-insensitive. */ @Override public String getFunctionName() { return "REPEAT_STRING"; } /** * Returns one of the following values, reflecting the * return type of the user-defined function: * DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING. */ @Override public String getReturnType() { return "STRING"; } /** * Returns an array of Strings, one for each of the * input arguments to the user-defined function, * specifying the required data type for each argument. * The Strings should be of the following values: * DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING. */ @Override public String[] getArgumentTypes() { return new String[] { "STRING", "INTEGER" }; } /** * Returns a human-readable description of what the function * does, to be displayed to Platfora users in the * Expression Builder. May return null. */ @Override public String getDescription() { return "The REPEAT_STRING function returns an input string repeated " + " a specified number of times."; } Page 452 Data Ingest Guide - Platfora Expression Language Reference /** * Returns a human-readable description explaining the * value that the function returns, to be displayed to * Platfora users in the Expression Builder. May return null. */ @Override public String getReturnValueDescription() { return "Returns one value per row of type STRING"; } /** * Returns a human-readable example of the function syntax, * to be displayed to Platfora users in the Expression * Builder. May return null. */ @Override public String getExampleUsage() { return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \" World\")"; } /** * The compute method performs the actual work of evaluating * the user-defined function. The method should operate on the * argument values provided to calculate the function return value * and return a Java object of the appropriate type to represent * the return value. 
* The following mapping describes the Java
* object type that is used to represent each Platfora data type:
* DATETIME -> java.util.Date
* DOUBLE -> java.lang.Double
* FIXED -> java.lang.Long
* INTEGER -> java.lang.Integer
* LONG -> java.lang.Long
* STRING -> java.lang.String
* Note on FIXED type: fixed-precision numbers in Platfora
* are represented as Longs that have been scaled by a
* factor of 10,000.
*
* In the event that the user-defined function
* encounters invalid inputs, or the function return value is not
* defined given the inputs provided, the compute method should return
* null rather than throwing an exception. The compute method should
* avoid throwing any exceptions.
*
* @param arguments The values of the function inputs.
*
* The entries in this list will match the specification
* provided by the getArgumentTypes method in type, number, and order:
* for example, if getArgumentTypes returned an array of
* length 3 with the values STRING, DOUBLE, STRING, then
* the arguments parameter will be a list of 3 Java
* objects: a java.lang.String, a java.lang.Double, and a
* java.lang.String. Any of the values within the
* arguments List may be null.
*/
@Override
public String compute(List arguments) {
    // cast the inputs to the correct types
    final String toRepeat = (String) arguments.get(0);
    final Integer numberOfRepeats = (Integer) arguments.get(1);
    // check for invalid inputs
    if (toRepeat == null || numberOfRepeats == null || numberOfRepeats < 0)
        return null;
    // repeat the input string the specified number of times
    final StringBuilder builder = new StringBuilder();
    for (int i = 0; i < numberOfRepeats; i++) {
        builder.append(toRepeat);
    }
    return builder.toString();
    }
}

3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH). The target Java version must be Java 1.6. Compiling with a target of Java 1.7 will result in an error when the UDF is used. For example, to compile the RepeatString.java program using Java 1.6:

javac -source 1.6 -target 1.6 -cp platfora-udf.jar RepeatString.java

4. Create a Java archive file (.jar) containing your .class file. For example:

jar cf repeat-string-udf.jar RepeatString.class

After you have written and compiled your UDF Java program, you must then install and enable it on the Platfora master server. See Adding a UDF to the Platfora Expression Builder.

Adding a UDF to the Platfora Expression Builder

After you have written and compiled a user defined function (UDF) Java class, you must install your class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression builder. This task is performed on the Platfora master server. Before you begin, you must have written and compiled a Java class for your user defined function. See Writing a Platfora UDF Java Program.

1. Create a directory named extlib in the Platfora data directory on the Platfora master server. For example:

$ mkdir $PLATFORA_DATA_DIR/extlib

2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/extlib directory on the Platfora master server. For example:

$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/

3. Set the Platfora server configuration property, platfora.udf.class.names, so it contains the name of your UDF Java class. If you have more than one class, separate the class names with a comma.
For example, to set this property using the platfora-config command-line utility: $ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString 4. Restart the Platfora server: $ platfora-services restart Page 455 Data Ingest Guide - Platfora Expression Language Reference The user defined function will then be available for defining computed field expressions in the Add Field dialog of the Platfora application. Due to the way some web browsers cache Javascript files, the newly added function may not appear in the Functions list for up to 24 hours. However, the function is immediately available for use and recognized by the Expression autocomplete feature. Regular Expression Reference Regular expressions vary in complexity using a combination of basic constructs to describe a string matching pattern. This reference describes the most common regular expression matching patterns, but is not a comprehensive list. Regular expressions, also referred to as regex or regexp, are a standardized collection of special characters and constructs used for matching strings of text. They provide a flexible and precise language for matching particular characters, words, or patterns of characters. Page 456 Data Ingest Guide - Platfora Expression Language Reference Platfora regular expressions are based on the pattern matching syntax of the Java programming language. For more in depth information on writing valid regular expressions, refer to the Java regular expression pattern documentation. Platfora makes use of regular expressions in the following contexts: • In computed field expressions that use the REGEX or REGEX_REPLACE functions. • In PARTITION expression statements for event series processing computed fields. • In the Regex file parser in data ingest. • In the data source location path descriptor in data ingest. • In lens filter expressions. Regex Literal and Special Characters The most basic form of regular expression pattern matching is the match of a literal character or string. Regular expressions also have a number of special characters that affect the way a pattern is matched. This section describes the regular expression syntax for referring to literal characters, special characters, non-printable characters (such as a tab or a newline), and special character escaping. Literal Characters The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Special Characters Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. Character Name Character Reserved For opening bracket [ start of a character class closing bracket ] end of a character class hyphen - character ranges within a character class backslash \ general escape character caret ^ beginning of string, negating of a character class dollar sign $ end of string period . matching any single character pipe | alternation (OR) operator question mark ? 
optional quantifier, quantifier minimizer
asterisk * - zero or more quantifier
plus sign + - once or more quantifier
opening parenthesis ( - start of a subexpression group
closing parenthesis ) - end of a subexpression group
opening brace { - start of min/max quantifier
closing brace } - end of min/max quantifier

Escaping Special Characters

There are two ways to force a special character to be treated as an ordinary character:

• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a literal character instead of a quantifier, use \*.
• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything between \Q and \E is then treated as literal characters.

To escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For example, to extract the inches portion from a height field where example values are 6'2", 5'11":

REGEX(height, "\'(\d+)""$")

Non-Printing Characters

You can use special character sequence constructs to specify non-printable characters in a regular expression. Some of the most commonly used constructs are:

\n - newline character
\r - carriage return character
\t - tab character
\f - form feed character

Regex Character Classes

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Character Class Constructs

A character class matches only a single character. For example, gr[ae]y will match gray or grey, but not graay or graey. The order of the characters inside the brackets does not matter.

You can use a hyphen inside a character class to specify a range of characters. For example, [a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter X. Again, the order of the characters and the ranges does not matter.

A caret following an opening bracket specifies characters to exclude from a match. For example, [^abc] will match any character except a, b, or c.

[abc] - simple - matches a or b or c
[^abc] - negation - matches any character except a, b, or c
[a-zA-Z] - range - matches a through z, or A through Z (inclusive)
[a-d[m-p]] - union - matches a through d, or m through p
[a-z&&[def]] - intersection - matches d, e, or f
[a-z&&[^xq]] - subtraction - matches a through z, except for x and q

Predefined Character Classes

Predefined character classes offer convenient shorthands for commonly used regular expressions.

. - matches any single character (except newline). For example, .at matches "cat", "hat", and also "bat" in the phrase "batch files".
\d - matches any digit character (equivalent to [0-9]). For example, \d matches "3" in "C3PO" and "2" in "file_2.txt".
\D - matches any non-digit character (equivalent to [^0-9]). For example, \D matches "S" in "900S" and "Q" in "Q45".
\s - matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). For example, \sbook matches "book" in "blue book" but nothing in "notebook".
\S - matches any single non-white-space character. For example, \Sbook matches "book" in "notebook" but nothing in "blue book".
\w - matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). For example, r\w* matches "rm" and "root".
\W - matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). For example, \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME".

POSIX Character Classes (US-ASCII)

POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and predefined character classes, except they take into account the locale (the local language/coding system).

\p{Lower} - a lower-case alphabetic character, [a-z]
\p{Upper} - an upper-case alphabetic character, [A-Z]
\p{ASCII} - an ASCII character, [\x00-\x7F]
\p{Alpha} - an alphabetic character, [a-zA-Z]
\p{Digit} - a decimal digit, [0-9]
\p{Alnum} - an alphanumeric character, [a-zA-Z0-9]
\p{Punct} - a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} - a visible character, [\p{Alnum}\p{Punct}]
\p{Print} - a printable character, [\p{Graph}\x20]
\p{Blank} - a space or tab, [ \t]
\p{Cntrl} - a control character, [\x00-\x1F\x7F]
\p{XDigit} - a hexadecimal digit, [0-9a-fA-F]
\p{Space} - a whitespace character, [ \t\n\x0B\f\r]

Regex Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

^ - matches from the beginning of a line (multi-line matches are currently not supported). For example, ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33".
$ - matches from the end of a line (multi-line matches are currently not supported). For example, d$ will match the "d" in "maid" but not in "made".
\b - matches within a word boundary. For example, \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island"; \bis matches both "is" and the "is" in "island", but not in "this".
\B - matches within a non-word boundary. For example, \Bb matches "b" in "sbin" but not in "bash".

Regex Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.
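As a preview of the difference, here is a hedged sketch using the REGEX function (the tag_line field and its value are hypothetical). Given a tag_line value of "<b>bold</b>":

REGEX(tag_line, "<(.+)>.*")

uses a greedy quantifier, so the capturing group grabs as much as possible and returns "b>bold</b". Making the quantifier reluctant:

REGEX(tag_line, "<(.+?)>.*")

returns just "b", the shortest possible match. The following sections explain the retry behavior behind this difference.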
Quantifier Constructs

By default, quantifiers are greedy. A greedy quantifier will first try for a match with the entire input string. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine backs off one character at a time until a match is found. So a greedy quantifier checks for possible matches in order from the longest possible input string to the shortest possible input string, recursively trying from right to left.

Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier will first try for a match from the beginning of the input string, starting with the shortest possible piece of the string that matches the regex construct. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks for possible matches in order from the shortest possible input string to the longest possible input string, recursively trying from left to right.

Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy quantifier on the first attempt (it tries for a match with the entire input string). The difference is that unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found. If the initial match fails, the possessive quantifier reports a failed match. It does not make any more attempts.

The quantifier constructs, with their reluctant and possessive forms, are:

? (reluctant ??, possessive ?+) - matches the previous character or construct once or not at all. For example, st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version".
* (reluctant *?, possessive *+) - matches the previous character or construct zero or more times. For example, if* matches "if", "iff" in "diff", or "i" in "print".
+ (reluctant +?, possessive ++) - matches the previous character or construct one or more times. For example, if+ matches "if", "iff" in "diff", but nothing in "print".
{n} (reluctant {n}?, possessive {n}+) - matches the previous character or construct exactly n times. For example, o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount".
{n,} (reluctant {n,}?, possessive {n,}+) - matches the previous character or construct at least n times. For example, o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount".
{n,m} (reluctant {n,m}?, possessive {n,m}+) - matches the previous character or construct at least n times, but no more than m times. For example, F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF".

Regex Capturing Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing part of a regular expression inside parentheses, you group that part of the regular expression together. This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping part of a regular expression together, parentheses also create a capturing group. Capturing groups are used to determine which matching values to save or return from your regular expression.

Group Numbering

A regular expression can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern (a(b*))+(c) contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

Capturing Groups

By default, a group captures the text that produces a match.
Besides grouping part of a regular expression together, parenthesis also create a capturing group or a backreference. The portion of the string matched by the grouped subexpression is captured in memory for later retrieval or use. Capturing Groups and the Regex Line Parser When you choose the Regex line parser during the Parse Data phase of the data ingest process, Platfora uses capturing groups to determine what parts of the regular expression to return as columns. Page 464 Data Ingest Guide - Platfora Expression Language Reference The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value. For example, suppose you had user records in a file, and the lines were formatted like this: Name: John Smith Address: 123 Main St. Age: 25 Comment: Active Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32 Name: Rod Rogers Address: 55 Elm Street Comment: Suspended You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and Comment column values: Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s +(.*))? Capturing Groups and the REGEX Function The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the value of the first capturing group is returned. For example, if you wanted to match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field: REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$") Capturing Groups and the REGEX_REPLACE Function The REGEX_REPLACE function is used to match a string value, and replace matched strings with another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex, and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences), but do not control what portions of the match are returned (the entire match is always returned). Backreferences Backreferences allow you to capture and reuse a subexpression match inside the same regular expression. You can reuse a capturing group as a backreference by referring to its group number preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2, and so on). For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the opening tag into a backreference, and then reuse it to match the corresponding closing tag: (<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>) This regular expression contains two capturing groups, the outermost capturing group (which captures the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference number two. This backreference can then be reused with \2 (backslash two) to match the corresponding closing HTML tag. When referring to capturing groups in the previous regular expression, the backreference syntax is slightly different. The backreference group number is preceded by a dollar sign instead of a backslash (for example, $1 refers to capturing group 1 of the previous expression). An example of this would be Page 465 Data Ingest Guide - Platfora Expression Language Reference the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and one for the replacement string. 
The following example matches the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx. Notice the backreferences in the replacement expression; they refer to the capturing groups of the previous matching expression:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

Non-Capturing Groups

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.

Appendix B: Lens Query Language Reference

Platfora's lens query language is a SQL-like language for programmatically querying the prepared data in a lens. This reference describes the query language syntax and usage.

Topics:
• SELECT Statement

SELECT Statement

Queries an aggregate lens. A SELECT statement is input to a programmatic lens query.

Syntax

[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT { measure-field [ AS alias-name ] | measure-expression AS alias-name }
[ , { dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ...] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY dimension-field [ , group-ordering ] ]
[ HAVING measure-filter-expression ]

Description

Use SELECT to query an aggregate lens. You cannot query an event series lens. The SELECT must include at least one measure field (column) or expression. Once you've supplied a measure value, your SELECT can contain additional measures or dimensions. If you include non-measure columns in the SELECT, you must include those columns in a GROUP BY clause. Use the DEFINE clause to add one or more computed fields to the lens.

Platfora always queries the current version of the lens-name. Keep in mind lens definitions can change. If you write a query against a column that is later dropped from the lens, a previously working query can fail as a result and return an error message.

Querying via REST

A lens query is meant to support external applications that want to access Platfora lens data. For this reason, you query a lens by making API calls to the query REST resource:

https://hostname:port/api/v1/query

The resource supports passing the statement as a GET or POST with application/x-www-form-urlencoded URL/form parameters. The caller must authenticate as a user with the Analyst (Limited) role or higher to execute a query. To query a specific lens, the caller must have Data Access on all datasets the lens references. A query returns comma separated values (CSV) by default. You have the option of receiving the results in a JSON body. For detailed information about using this REST API, see the Platfora API Reference.

Writing a SELECT Expression

The SELECT expression can contain multiple columns and/or expressions. You must specify at least one measure column or measure expression. Once you meet this requirement, you can include additional dimension columns or row expressions. Recall that a measure is an aggregated numeric value, whereas a dimension is a numeric, text, or time-based value. A measure expression supports addition, subtraction, multiplication, and division.
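For instance, a measure expression can combine aggregates arithmetically. Here is a hedged sketch; the lens and field names are invented for illustration:

SELECT 100 * SUM(clicks) / count() AS click_rate FROM [Web Traffic]

Because the SELECT list contains no dimension columns, no GROUP BY clause is needed and the query returns a single aggregated value.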
Input values can be column references (fields) or expressions that contain any of these supported functions:

• aggregate functions (that is, AVG(), COUNT(), SUM(), and so forth)
• ROLLUP()
• EXP()
• POW()
• SQRT()

Measure expressions can also include literal (integer or string) values. When constructing a measure expression, make sure you understand the expression syntax rules and limitations of aggregate functions. See the Expression and Query Language Reference for information on the aggregate function limitations and expression syntax.

If the SELECT statement includes a dimension value, you must include the column in your GROUP BY clause. A dimension or row expression supports addition, subtraction, multiplication, and division of row values. Your SELECT can reference columns or supply row expressions that include the following functions:

• data type conversion
• date and time
• general processing
• math
• string
• URL

An expression can include literal (integer or string) values or other expressions. Make sure you understand the expression syntax rules. See the Expression and Query Language Reference for information on the expression syntax. When specifying an expression, supply an alias (AS clause) if you want to refer to the expression elsewhere in other clauses. You cannot use an * (asterisk) to retrieve all of the rows in a lens.

Specifying Lens and Column Names

When you specify the lens-name, use the name as it appears in the Data Catalog user interface. Enclose the name in [ ] (brackets) if it contains spaces or special characters. For example, you would refer to the Web Log-2014 lens as:

[Web Log-2014]

When specifying a column name, you should follow the expression language rules for field (column) references. This means that for columns belonging to a reference dataset, you must qualify the name using dot-notation as follows:

{ [ reference-dataset . [...] ] column-name | alias-name }

For example, use device.manufacturer to refer to the manufacturer column in the device dataset. If you define an alias, use the alias to refer to the column in other parts of your query.

DEFINE Clause

Defines a computed field to include in a SELECT statement.

Syntax

DEFINE alias-name AS { expression }

Description

Use a DEFINE clause to include new computed fields that aren't in the original lens. Using the DEFINE clause is optional. Platfora applies the DEFINE statement before the main SELECT clause. New computed fields can only use fields already in the lens. The expression you write must be a valid expression for a vizboard computed field. This means your computed field is subject to the following restrictions:

• You can only define a computed field that operates on fields that exist in the lens.
• A vizboard computed field can break if it operates on fields that are later removed from the lens or a focus or referenced dataset.
• You cannot use aggregate functions to add new measures from dimension data in the lens.
• You can compute new measures from existing measures already in the lens. For example, if an AVG(sales) aggregate exists in the data, you can define a SUM(sales) field because SUM(sales) is necessary to compute AVG(sales).
• You cannot use custom user-defined functions (UDFs) in vizboard computed field expressions.

If you specify multiple DEFINE clauses, separate each new DEFINE with a space.
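To make the mechanics concrete, here is a hedged sketch of a DEFINE in context. It reuses the View Summary lens and prior_views field that appear in the WHERE examples below; the computed field name is invented:

DEFINE views_doubled AS prior_views * 2
SELECT count(), views_doubled FROM [View Summary] GROUP BY views_doubled

The computed field is defined first, then referenced by its alias in both the SELECT list and the GROUP BY clause.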
Specifying Lens and Column Names

When you specify the lens-name, use the name as it appears in the Data Catalog user interface. Enclose the name in [ ] (brackets) if it contains spaces or special characters. For example, you would refer to the Web Log-2014 lens as:

[Web Log-2014]

When specifying a column name, follow the expression language rules for field (column) references. This means that for columns belonging to a reference dataset, you must qualify the name using dot notation as follows:

{ [ reference-dataset . [...] ] column-name | alias-name }

For example, use device.manufacturer to refer to the manufacturer column in the device dataset. If you define an alias, use the alias to refer to the column in other parts of your query.

DEFINE Clause

Defines a computed field to include in a SELECT statement.

Syntax

DEFINE alias-name AS { expression }

Description

Use a DEFINE clause to include new computed fields that aren't in the original lens. Using the DEFINE clause is optional.

Platfora applies the DEFINE statement before the main SELECT clause. New computed fields can only use fields already in the lens. The expression you write must be a valid expression for a vizboard computed field. This means your computed field is subject to the following restrictions:

• You can only define a computed field that operates on fields that exist in the lens.
• A vizboard computed field can break if it operates on fields that are later removed from the lens or a focus or referenced dataset.
• You cannot use aggregate functions to add new measures from dimension data in the lens.
• You can compute new measures from existing measures already in the lens. For example, if an AVG(sales) aggregate exists in the data, you can define a SUM(sales) field because SUM(sales) is necessary to compute AVG(sales).
• You cannot use custom user-defined functions (UDFs) in vizboard computed field expressions.

If you specify multiple DEFINE clauses, separate each new DEFINE with a space. A computed field can depend on any fields pre-existing in the lens or on other fields created in the query's scope. For example, a computed field you DEFINE can depend on fields created through other DEFINE statements.
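A minimal sketch of a valid DEFINE (the lens and field names are taken from the examples later in this appendix):

DEFINE Manu_Genre AS CONCAT([device].[manufacturer], [video].[genre])
SELECT Manu_Genre, [Num Views]
FROM movie_view2G_PSM
GROUP BY Manu_Genre

The computed field Manu_Genre operates only on dimension fields that already exist in the lens and uses no aggregate functions or UDFs, so it satisfies the vizboard computed field restrictions above.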
WHERE Clause

Filters a lens query by one or more predicate expressions.

Syntax

WHERE predicate-expression [ AND predicate-expression ]

A predicate-expression can be a comparison:

column-name { = | < | > | <= | >= | != } literal

Or the predicate-expression can be a list expression such as this:

column-name [ NOT ] { IN list | LIKE pattern | BETWEEN literal AND literal }

Description

Use a WHERE clause to filter a lens query by one or more predicate expressions. Use the AND keyword to join multiple expressions. A WHERE clause can include expressions that make use of the comparison operators or list expressions. For detailed information about expression syntax, see the Platfora Expression and Query Language Reference.

You cannot use IS NULL or IS NOT NULL comparisons in the WHERE clause. You also cannot use relative date filters (LAST integer DAYS).

You can use the NOT keyword to negate any list expression. The following example illustrates several different permutations of expression structures you can use:

SELECT count() FROM [View Summary]
WHERE prior_views NOT IN (3,5,7,11,13,17)
AND TO_LONG(prior_views) NOT IN (4294967296)
AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0)
AND video.genre NOT IN ("Silent", "Exercise")
AND video.genre NOT LIKE ("*a*")
AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04)
AND prior_views > 23
AND avebitrate_double < 3101.0
AND TO_FIXED(avebitrate_double) != 3101.0
AND TO_LONG(prior_views) != 4294967296
AND video.genre <= "Silent"
AND date.Date > 2011-08-04
AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01
AND video.genre BETWEEN "Exercise" AND "Silent"
AND prior_views BETWEEN 0 AND 100
AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789

When comparing literal dates, make sure you use the format yyyy-MM-dd without any enclosing quotation marks or other punctuation.

GROUP BY Clause

Orders and optionally limits the results of a SELECT statement.

Syntax

GROUP BY group-ordering [ , group-ordering ]

The group-ordering clause has the following syntax:

column-name [ SORT [ BY measure-name ] [ { ASC | DESC } ] [ LIMIT integer [ WITH OTHERS ] ] ]

Description

Use a GROUP BY clause to order and optionally limit the results of a SELECT. If the SELECT statement includes a dimension column, you must supply a GROUP BY clause that includes that dimension column. Otherwise, the GROUP BY clause is optional.

A GROUP BY can include more than one column. To do this, delimit each column with a , (comma) as illustrated here:

GROUP BY col_A, col_B, col_C

You can GROUP BY a new computed field that is not defined in the lens. To do this, add the field using the DEFINE clause and then use the field in the GROUP BY clause. Alternatively, you can define the computed field in the SELECT list, associate an alias with the field, and use the alias in the GROUP BY clause.

A SORT specification is optional. If you do not specify SORT, the query returns results in an unspecified order. To sort columns by their values ("natural sorting order"), simply specify ASC (ascending) or DESC (descending). ASC is the default SORT order when sorting by natural values.

To SORT a particular column by another measure or measure expression, use the SORT BY phrase. You can specify a measure-name in the SORT BY clause that need not be in the SELECT list. You can also order the sort in either ASC or DESC order. Unlike natural value sorts, SORT BY defaults to the DESC (descending) sorting order.

GROUP BY col_A SORT BY meas_1 ASC, col_B SORT DESC, col_C SORT BY measure_expression ASC

Using GROUP BY with multiple SORT BY combinations allows you to group values with respect to one another. Consider three potential grouping columns, say Fee, Fi, and Foe. Sorting on column Fee sorts the records on the Fee value. Another SORT BY clause on column Fi sorts the Fi values within the existing Fee sort.

Use the LIMIT keyword to reduce the number of groups returned. For example, if you are sorting airports by the number of departing flights in DESC order (most flights to least flights), you could LIMIT the SORT to the 10 busiest airports:

GROUP BY airports SORT BY total_departures DESC LIMIT 10

The LIMIT restricts the results to the top 10 busiest departure airports; the LIMIT clause excludes the other airports. You can use the WITH OTHERS keywords to combine all the other airports not in the top 10 into a single Others group.
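As a sketch, here is the same grouping with the WITH OTHERS keywords added (the airports and total_departures names are reused from the example above):

GROUP BY airports SORT BY total_departures DESC LIMIT 10 WITH OTHERS

This returns the ten busiest airports as individual groups, plus a single Others group that rolls up all of the remaining airports.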
"fields": { "Active Clusters (Total)": { "name": "Active Clusters (Total)", "expression": "DISTINCT([Temp Field for Count Active Clusters])", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "LONG" }, "Are there new Active Clusters since Yesterday?": { "name": "Are there new Active Clusters since Yesterday?", "expression": "[Active Clusters (Total)] - ROLLUP [Active Clusters (Total)] TO ([Log DateTime Date].Date) ORDER BY ([Log DateTime Date].Date) ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "LONG" }, "Avg Page Views per Session": { "name": "Avg Page Views per Session", "expression": "[Total Records]/(DISTINCT(sessionId))", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "DOUBLE" }, ... Using the JSON description of a lens, you can quickly see measures used in your lens versus navigating the lens in Platfora's UI. Complex DEFINE Clauses This example illustrates the use of multiple DEFINE clauses. Notice the descriptive name for the ROLLUP() computed field. DEFINE Manu_Genre AS CONCAT DEFINE [ROLLUP num_views TO (device.manufacturer) DEFINE [ROLLUP num_views TO ([Manu_Genre]) SELECT device.manufacturer, TO Manu], [ROLLUP num_views FROM moview_view2G_PSM ([device].[manufacturer], [video].[genre]) Manu] as ROLLUP COUNT() TO Manu_Genre] as ROLLUP COUNT() TO Manu_Genre, [Num Views], [ROLLUP num_views TO Manu_Genre] Page 473 Data Ingest Guide - Lens Query Language Reference WHERE Manu_Genre LIKE (\"*Action/Comedy\", \"*Anime\", \"*Drama/Silent \") GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC HAVING [ROLLUP num_views TO Manu] > 30000 AND [ROLLUP num_views TO Manu_Genre] > 1000 Build a WHERE Clause The following example shows a WHERE clause using mixed predicates and row comparison. It also uses the NOT keyword to negate list expressions SELECT count() FROM [(test) View Summary] WHERE prior_views NOT IN (3,5,7,11,13,17) AND TO_LONG(prior_views) NOT IN (4294967296) AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0) AND video.genre NOT IN ("Silent", "Exercise") AND video.genre NOT LIKE ("*a*") AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04) AND prior_views > 23 AND avebitrate_double < 3101.0 AND TO_FIXED(avebitrate_double) != 3101.0 AND TO_LONG(prior_views) != 4294967296 AND video.genre <= "Silent" and date.Date > 2011-08-04 AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01 AND video.genre BETWEEN "Exercise" AND "Silent" AND prior_views BETWEEN 0 AND 100 AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789 You cannot use IS NULL or IS NOT NULL comparisons. You also cannot use relative date filters (LAST integer DAYS). Complex Measure Expression The following example illustrates a measure expression that includes both a ROLLUP and use of aggregate functions. 
SELECT device.manufacturer, CONCAT([device].[manufacturer], [video].[genre]) AS Manu_Genre, [Num Views], ROLLUP COUNT() TO (device.manufacturer) AS [ROLLUP num_views TO Manu], ROLLUP COUNT() TO ([Manu_Genre]) AS [ROLLUP num_views TO Manu_Genre]
FROM movie_view2G_PSM
WHERE Manu_Genre LIKE ("*Action/Comedy", "*Anime", "*Drama/Silent")
GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC
HAVING [ROLLUP num_views TO Manu] > 30000 AND [ROLLUP num_views TO Manu_Genre] > 1000

Complex Row Expressions

This row expression uses multiple row terms and factors:

SELECT duration + [days after release] + user.age + user.location.estimatedpopulation AS [Row-Expression multi-factors], [Num Views]
FROM movie_view2G_PSM
GROUP BY [Row-Expression multi-factors] SORT ASC

You'll notice that the Row-Expression multi-factors alias for the complex SELECT expression is reused in the GROUP BY clause.