Encounters with big data: An introduction to using big data in the social sciences
30 January - 3 February 2017
Cape Town Centre, South Africa
This 5-day course by the UK Data Service and DataFirst will introduce key concepts and discussions around using big data in the social sciences, and introduce attendees to Apache Hadoop, an open source framework for analysing big data. This course will focus on quantitative data and will not cover in any detail text, social media, audio or other forms of non-numeric data.
This introductory level course is aimed at experienced researchers, statisticians, or data managers with:
Day 1: Introducing big data for the social sciences
We will introduce concepts around big data and the challenges of working with it. We reiterate that we should still structure data analyses with a purpose and hypothesis in mind. The difference with big data is that we are not always as ‘in control’ of the data source. We give some examples of what social scientists are doing with big data and ask participants to undertake some group work to design an experiment to determine a national statistic based on web-sourced open data. In the afternoon, we will consider common errors in survey data and discuss how we might deal with these. We will also consider how to manage different data structures. For the final session of the day we will look at ethics and rights in big data including a group exercise on the subject.
Day 2: Setting the scene: using the Apache Hadoop environment
We introduce Apache Hadoop and the structure and format of data it can handle. Participants will explore the components of Hadoop that help import, manage, and explore data. We provide an introduction to data curation, an important stage in the big data analysis process that involves pre-processing, transforming, and loading data onto the system. We also introduce newer types of databases, such as graph and NoSQL databases, and tools for handling them, such as Spark and MongoDB. We then show participants how to work with data published via the web in JSON format by making API calls. Having looked at loading data into the Hadoop environment, we move on to initial data exploration and cleaning using Apache Hive, to make the data fit for research purposes. Finally, we finish the day by showing how to use Apache Zeppelin, a web-based notebook interface to Apache Hadoop. Zeppelin allows users to create interactive notebooks that both document their work and run the code they have developed in a variety of programming languages.
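As a rough illustration of the JSON step described above, the following minimal Python sketch flattens a nested JSON payload into tabular rows and writes them as CSV text, a delimited format that can then be loaded into a Hive table. The payload, its field names, and the functions `flatten_records` and `rows_to_csv` are hypothetical and invented for illustration only; they are not taken from the course materials, and a real API would have its own structure.

```python
import csv
import io
import json

# Hypothetical response body with made-up values. A real call would fetch
# this text over HTTP from a data provider's API endpoint.
SAMPLE_RESPONSE = """
{
  "records": [
    {"region": "Region A", "year": 2016, "indicator": {"name": "households", "value": 120}},
    {"region": "Region B", "year": 2016, "indicator": {"name": "households", "value": 340}},
    {"region": "Region C", "year": 2016, "indicator": {"name": "households", "value": null}}
  ]
}
"""

FIELDS = ["region", "year", "indicator", "value"]


def flatten_records(payload_text):
    """Parse a JSON payload and return one flat dict per record."""
    payload = json.loads(payload_text)
    rows = []
    for record in payload["records"]:
        indicator = record.get("indicator", {})
        rows.append({
            "region": record["region"],
            "year": record["year"],
            "indicator": indicator.get("name"),
            "value": indicator.get("value"),  # None when the API sent null
        })
    return rows


def rows_to_csv(rows):
    """Serialise flat rows as CSV text; None values become empty fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(flatten_records(SAMPLE_RESPONSE)))
```

Missing values are kept as empty fields rather than dropped, so that the decision about how to treat them can be made later, during the cleaning stage.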
Day 3: Generating useful insight from big data
We introduce Apache Spark, a high-performance distributed computation engine for big data. We will demonstrate how Spark can be used to load and clean data as an alternative to Hive. We then take the use of Spark further and introduce machine learning algorithms to analyse big datasets. To complete the data analysis journey, we will introduce a selection of visualisation packages, including interactive graph libraries, that allow the data to be graphed and presented in a more sophisticated manner.
Day 4: Working with your data
We start by geo-linking data onto maps. We then demonstrate the use of ODBC to move data from the Hadoop system back down to the PC, where it can be processed using programming languages such as R or Python, or loaded directly into any ODBC-compliant program such as Excel or Stata. We will then recap the data journey outlined so far, from sourcing data, through curation and cleaning, to analysis and visualisation. Finally, we split into groups and begin formulating group projects.
Day 5: Presenting and publishing findings and code from big data
We finish work on our group projects and give an initial presentation to the class. We will allow time to reflect upon and discuss the projects and think about how we might enhance them. In the afternoon, we look at publishing outputs from big data. What do journals need to support replication? Where else can one publish, for example in white papers? And how can algorithms and code be published to support your data preparation or analyses? We demonstrate the use of GitHub as a means of storing and sharing open source software repositories.
Eligibility and booking
Priority will be given to those residing and working or studying in South Africa, but anyone can apply. There are a number of prerequisites for the course which you will need to meet, such as competency in statistics and data analysis (specified in the booking form), which we will assess prior to confirmation of a place.
Cost and scholarships
This workshop is supported by two research councils, in the UK and South Africa, and we are fortunate to be able to offer the course at no cost. A refundable deposit of 500 rand is required to confirm attendance. It will be returned only if the participant attends the workshop for the full 5 days and completes the syllabus.
A limited number of scholarships for travel and accommodation are available for those who will not be supported by their own organisation. Eligibility is restricted to those residing, working or studying in South Africa. Please mention in your initial email whether you expect to apply for a scholarship. Estimated costs and proof of eligibility should be provided on the booking form.
All lunches and one group meal will be provided by the workshop organisers. Further information on logistics will follow.