DATAIKU - DEC 2014 - GFII

Machine Learning on Dirty Data
www.dataiku.com
Dataiku in short
Software vendor behind Data Science Studio, the « Photoshop for Data Science ».
Our objective: to make data science accessible to all types of profiles.
Our clients
• They build applications with their data:
– Predicting parking spot availability
– Analysis of web activity and behaviour segmentation
– Customer churn anticipation and marketing activation
– Maintenance prevention and material breakdown impact reduction
– Fraud detection
– …
• They shorten their innovation cycles:
– DSS lowers their entry barriers and makes it easy to retrain internal teams
– Standardisation of practices and reduction of the number of tools necessary
– Easy collaboration between data analysts, business experts, and IT engineers on one platform
Turn Device Logs Into Next Year's Business
by
Parking ticket machine data
OpenStreetMap data
Data Science Studio
Cleaning and enrichment of data: each street is segmented into small pieces that are enriched with geospatial information.
Crossing data: the parking ticket history is joined with the points of interest from OpenStreetMap.
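The segmentation and join described above can be sketched roughly as follows. This is only an illustration, not the project's actual pipeline: the 50 m segment length, the 50 m radius, the planar coordinates in metres, and every function and field name are assumptions made for the example.

```python
import math

SEGMENT_LENGTH_M = 50.0  # assumed size; the slide only says "small pieces"
POI_RADIUS_M = 50.0      # assumed radius for attaching a point of interest


def segment_street(start, end, segment_length=SEGMENT_LENGTH_M):
    """Split a straight street (planar x, y in metres) into segment midpoints."""
    length = math.dist(start, end)
    n = max(1, math.ceil(length / segment_length))
    return [
        (start[0] + (i + 0.5) / n * (end[0] - start[0]),
         start[1] + (i + 0.5) / n * (end[1] - start[1]))
        for i in range(n)
    ]


def join_tickets_with_pois(midpoints, ticket_positions, pois, radius=POI_RADIUS_M):
    """Build one row per segment: ticket count plus nearby POI counts by kind."""
    rows = [{"segment": i, "tickets": 0, "poi_counts": {}}
            for i in range(len(midpoints))]
    # Each ticket is attributed to the segment closest to its ticket machine.
    for position in ticket_positions:
        nearest = min(range(len(midpoints)),
                      key=lambda i: math.dist(position, midpoints[i]))
        rows[nearest]["tickets"] += 1
    # Each POI enriches every segment whose midpoint lies within the radius.
    for kind, position in pois:
        for i, midpoint in enumerate(midpoints):
            if math.dist(position, midpoint) <= radius:
                counts = rows[i]["poi_counts"]
                counts[kind] = counts.get(kind, 0) + 1
    return rows


midpoints = segment_street((0.0, 0.0), (200.0, 0.0))
rows = join_tickets_with_pois(
    midpoints,
    ticket_positions=[(20.0, 5.0), (30.0, -5.0), (170.0, 0.0)],
    pois=[("restaurant", (80.0, 10.0))],
)
```

Each resulting row is one street segment with its ticket count and its surrounding points of interest, which is the kind of joined table a predictive step could then consume.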
Creation of a predictive algorithm: the availability of parking spots is predicted per street segment from the joined data.
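The slides do not say which algorithm was used. As a stand-in, the sketch below shows the simplest possible per-segment predictor: a historical average of availability for each (segment, hour) pair. The input format (an occupancy rate between 0 and 1), the function names and the default value are all hypothetical.

```python
from collections import defaultdict


def fit_availability_baseline(history):
    """history: iterable of (segment_id, hour, occupancy_rate in [0, 1])."""
    sums = defaultdict(lambda: [0.0, 0])  # (segment, hour) -> [total, count]
    for segment_id, hour, occupancy in history:
        acc = sums[(segment_id, hour)]
        acc[0] += occupancy
        acc[1] += 1
    # Availability is taken as 1 minus the mean observed occupancy.
    return {key: 1.0 - total / count for key, (total, count) in sums.items()}


def predict_availability(model, segment_id, hour, default=0.5):
    """Return the learned availability, or a default for unseen pairs."""
    return model.get((segment_id, hour), default)


model = fit_availability_baseline([
    ("s1", 9, 0.8),
    ("s1", 9, 0.6),
    ("s2", 9, 0.1),
])
```

A real model would likely use the enriched features (points of interest, time, location) rather than a plain average; the baseline is mainly useful as a reference point to beat.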
Availability of the predictions: the algorithm is finally integrated in the iPhone app « Find me a space ».
Predictive Monitoring for Search Engine Relevance
by
User searches: words within the requests are analysed by the studio.
Web logs: web logs with clicks and bounce rates are imported into the studio.
Data cleaning and enrichment: web logs are enriched (time spent on the website per user, location, etc.).
Customized algorithm: the Data Science team of PagesJaunes identifies unsuccessful searches and trains a customized algorithm.
Algorithm used with all data: long-term monitoring of unsuccessful searches.
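To make the monitoring step concrete, here is a minimal sketch of how unsuccessful searches could be flagged and tracked over time from web logs. The rule used (no click, or a bounce) is an assumption chosen for illustration; it stands in for the customized algorithm, whose details the slides do not give. The log field names are hypothetical as well.

```python
from collections import defaultdict


def is_unsuccessful(entry):
    """Assumed rule: a search fails if nothing was clicked or the user bounced."""
    return entry["clicks"] == 0 or entry["bounced"]


def unsuccessful_rate_by_day(log_entries):
    """Return {day: share of searches flagged as unsuccessful}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for entry in log_entries:
        totals[entry["day"]] += 1
        if is_unsuccessful(entry):
            failures[entry["day"]] += 1
    return {day: failures[day] / totals[day] for day in totals}


rates = unsuccessful_rate_by_day([
    {"day": "2014-12-01", "query": "plumber paris", "clicks": 2, "bounced": False},
    {"day": "2014-12-01", "query": "plombier 75", "clicks": 0, "bounced": False},
    {"day": "2014-12-02", "query": "pizza", "clicks": 1, "bounced": True},
])
```

Tracking this rate day after day is one simple way to read "long-term monitoring": a rising share of flagged searches signals that relevance is degrading.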
“Dataiku’s technology enabled us to rationalise our work thanks to machine learning on millions of searches. The process is optimized, we know what and how to do it.”
Erwan Pigneul, Project Manager, PagesJaunes
Optimizing Last Mile with Data Science Studio
by
Data Science Studio
Historical delivery and retrieval data
Cleaning and temporal enrichment of data
Data aggregation by geographic location
Incorporation of new deliveries to the existing model
Modeling of a score for each delivery
Predictive Model To Optimize Restaurant Pages
by
Restaurant data (place, type…)
Data Science Studio
Cleaning and Enriching
Centralizing the data
Analysis and modeling
User feedback (comments, length…)
Scoring of a restaurant’s page parameters in terms of customer satisfaction
Traffic logs (visits, clicks, time…)
Increase website traffic by optimizing the correct parameters
Create value with data-driven applications
DATA IN
Parking ticket machine data
OpenStreetMap data
ENRICH / COMBINE / COMPUTE
Data Science Studio
Cleaning and enrichment of data: each street is segmented into small pieces that are enriched with geospatial information.
Crossing data: the parking ticket history is joined with the points of interest from OpenStreetMap.
Creation of a predictive algorithm: the availability of parking spots is predicted per street segment from the joined data.
VALUE OUT
Availability of the predictions: the algorithm is finally integrated in the iPhone app
« Find me a space ».
A MODEL
An automated way to make a computer take a decision from raw (historical) data.
The model can be used to take immediate (real-time) actions through an API.
Churn, Segmentation, Recommender, Lifetime Value, Volume, Forecast, Score, Location, Risk, Hot, Pricing, Ranking, Event Paths, Fraud
2015: BUILD YOUR FACTORY
Multiple Data Sources: CRM, Logs
Many Models: Personalised Experience Model, Acquisition Cost Opportunity Model, Stock Optimisation Model, Optimize Delivery
Analyst Team
Server Cluster
Light Software
But …
“Data Science Superstars are really hard to hire.”
“I spend too much time cleaning up my data with inappropriate tools.”
“Our models are quite difficult to set up so they are rarely deployed into production.”
“There is too much plumbing involved in making all these Big Data technologies work together and then in successfully deploying applications with them.”
Data Science Studio
A studio for all your data-driven applications
Load and prepare your data
Analyse and build your models
Publish and run your projects
For all profiles
Collaborative
Open and controlled
Data preparation
• Connect to all your data sources
• Explore them visually
• Transform and enrich them interactively
• Save your ‘recipes’ and reuse them later
Analyse and model
• Discover correlations and significant variables
• Easily build your first models in a visual interface
• Test and improve several models alongside one another
• Deploy the models’ results directly inside your infrastructure
Deploy into production
• Go quickly from prototypes to large scale production
• Manage data inputs and outputs from the interface
• Export and publish your results in several forms
• Control the updates with options such as scheduling, partitions, and replications…
Collaborative work
• Enjoy a web interface and a shared platform
• Organise your work by projects and by teams
• Reuse the team’s work at any time
• Make sure everyone is always on the same page: share insights, graphics, comments, etc. with your team
Open and controlled
• Take advantage of open source technologies such as Hadoop, IPython, scikit-learn, R…
• Integrate your own libraries and scripts
• Keep the data safe in your own infrastructure
• Keep your innovations under control: algorithms and predictions belong to you
“We could probably understand our users better. But how?”
“My data is too dirty. I don’t even know where to start.”
“There’s a trend here, but our full historical data is just too big.”
You have data
You have ideas
You need a tool
http://www.dataiku.com/dss/trynow/
Florian
florian.douetteau@dataiku.com
Dataiku HQ
2 rue Jean Lantier
75001 Paris France
Dataiku West
2423A Durant Avenue
Berkeley, CA 94704
ANNEXES
A predictive application?
Data + Algorithms (Machine Learning) = Predictions
Requirements:
Data: Collection, Preparation, Crossing
Algorithms: Knowledge, Iterations, Calculation
Predictions: Industrialization, Deployment