Big data now playing
..... at the sandbox
John.Dunne@cso.ie
17th October 2014
IAOS, Vietnam
Overview
• Context
• How CSO got interested in big data
• The sandbox
• Learning from other industries
• Learning from the past
• The sandbox – looking to the future
• Concluding comments
Keywords – big data, modernisation, sandbox
Big data – working definition
Data that is difficult to collect, store or process
within the conventional systems of statistical
organizations.
Either their volume, velocity, structure or variety
requires the adoption of new statistical software
processing techniques and/or IT infrastructure to
enable cost-effective insights to be made.
Do more with less
Mindset - Opportunities exist
with secondary data sources
Legal environment
Key: 3 legislative pillars
• Data Protection
• Freedom of Information
• Official Statistics
Modernisation and big data
2011 Conference of European Statisticians endorses modernisation strategy
2012 Big data on modernisation agenda
2013 ESSC Scheveningen memorandum on Big data and official statistics
2013 International Big data team gets going
2014 Big data on UNSC agenda
2014 The sandbox goes live at MSIS Dublin
2013 CSO Project - To determine household
composition using smart metering data
Origin of data : Consumer Behaviour Trials in 2009
and 2010
• Over 5000 households in pilot
• 3 months baseline data (reading every 30 mins)
• Pre-trial survey using CATI
http://www.unece.org/stats/documents/2013.09.coll.html
Project with pilot data brought challenges
Pilot: 7 million data points per month
Go live: 2160 million data points per month
"Joe, we need a bigger computer"
ICHEC helped out
https://www.ichec.ie/
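The jump in scale between pilot and go-live can be checked with some back-of-the-envelope arithmetic. The sketch below assumes a 30-day month and one reading per meter every 30 minutes, as in the trial. The pilot size is rounded to 5000 households, and the number of meters at go-live is not stated on the slide; it is only inferred here from the 2160 million figure.

```python
# Back-of-the-envelope check of the data volumes quoted on the slide.
# Assumptions (not from the source): a 30-day month, exactly 5000 pilot
# households, and one meter per household.

READINGS_PER_DAY = 24 * 2   # one reading every 30 minutes
DAYS_PER_MONTH = 30         # assumed month length


def monthly_data_points(meters: int) -> int:
    """Number of half-hourly readings produced by `meters` in one month."""
    return meters * READINGS_PER_DAY * DAYS_PER_MONTH


# Pilot: roughly 5000 households.
pilot_points = monthly_data_points(5000)

# Go-live: the slide quotes 2160 million data points per month.
golive_points = 2160 * 10**6

# Number of meters implied by the go-live figure (an inference, not a quoted fact).
implied_meters = golive_points // (READINGS_PER_DAY * DAYS_PER_MONTH)

# How much larger the go-live volume is than the pilot volume.
scale_factor = golive_points / pilot_points

print(f"Pilot data points per month: {pilot_points:,}")
print(f"Meters implied at go-live:   {implied_meters:,}")
print(f"Go-live / pilot volume:      {scale_factor:.0f}x")
```

Under these assumptions the pilot comes to about 7.2 million readings per month, which matches the rounded "7 million" on the slide, and the go-live volume is about 300 times larger.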
The sandbox
The hardware on which the sandbox system is
based is a High Performance Computing cluster
called Stoney. The cluster has been hosted at the National
University of Ireland, Galway since April 2009 and
is composed of 60 compute nodes, each of which
has two 2.8GHz Intel (Nehalem EP) Xeon X5560
quad-core processors, 48GB of RAM and a 1TB
local disk. Each node is connected to two networks
– an InfiniBand network for accessing the shared
Lustre filesystem and for high performance
communications as well as a Gigabit Ethernet
network for management tasks. In addition, a 20TB
shared filesystem is available to all nodes.
ICHEC will dedicate 20 compute nodes to enable a
Hadoop cluster with 160 cores, almost 1TB of RAM
and 20TB of HDFS distributed storage.
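The Hadoop cluster totals follow directly from the per-node specification given above. The sketch below reproduces them. The variable names are illustrative, and the final step assumes HDFS runs with its default replication factor of 3, which the source does not confirm.

```python
# Derive the Hadoop cluster totals from the per-node Stoney specification.
# Per-node figures are taken from the slide; the replication factor is an
# assumption (the HDFS default), not something the source states.

HADOOP_NODES = 20        # nodes dedicated to the sandbox
SOCKETS_PER_NODE = 2     # two Xeon X5560 processors per node
CORES_PER_SOCKET = 4     # quad-core processors
RAM_GB_PER_NODE = 48
DISK_TB_PER_NODE = 1     # local disk per node

total_cores = HADOOP_NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
total_ram_gb = HADOOP_NODES * RAM_GB_PER_NODE
raw_hdfs_tb = HADOOP_NODES * DISK_TB_PER_NODE

# Assumed: default HDFS replication factor of 3, so each block is stored
# three times and usable capacity is roughly a third of raw capacity.
ASSUMED_REPLICATION = 3
usable_hdfs_tb = raw_hdfs_tb / ASSUMED_REPLICATION

print(f"Cores:            {total_cores}")
print(f"RAM:              {total_ram_gb} GB")
print(f"Raw HDFS storage: {raw_hdfs_tb} TB")
print(f"Usable at replication {ASSUMED_REPLICATION}: {usable_hdfs_tb:.1f} TB")
```

This gives 160 cores, 960GB of RAM (the "almost 1TB" on the slide) and 20TB of raw storage. If the replication assumption holds, the space available for distinct data would be closer to 7TB.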
The sandbox provides an environment to
o test feasibility of remote access and processing
o test whether existing standards/models/methods
can be applied to big data
o evaluate the usefulness of big data software
tools
o learn by doing with respect to potential uses,
advantages and disadvantages of big data
o facilitate further collaboration in the
international community
The toys (data sources)
o Twitter data
o mobile phone data
o satellite imagery / aerial photography
o price data / job vacancy data via scraping
o scanner data/price data sourced via large
vendors
o data from road traffic sensors
o smart meter data on electricity/gas consumption
Some of the players
To play, contact
Steven.Vale@unece.org
Learning from other industries
- technical partners can have a role to play
Exchange of data for billing purposes between:
o Irish Mobile Network Operators (MNOs)
o Data Clearing Houses
o ROW Mobile Network Operators
Learning from the past
- think about the bigger picture
Nordbotten, Thygesen and the statistical archive concept
Learning from the past
- do not underestimate privacy concerns
http://www.census.gov/history/pdf/kraus-natdatacenter.pdf
http://blog.modernmechanix.com/the-national-data-center-and-personal-privacy/
"The National Data Center and Personal Privacy" by Arthur R Miller
The sandbox - looking to the future
o Centres for Research and Development ?
o Centres of Excellence ?
o Partner organisations for collecting, processing or
storing data of a less sensitive or non-sensitive nature ???
o Significant partner organisations enabling the
collection, processing or storing of data of a sensitive
nature ?????
Concluding remarks
• Think about bigger picture / broader system
• An open mind to the possibility of new partners
• Be open and transparent
• Don’t underestimate privacy concerns
• Continue to collaborate and share