21/06/2018: Recursion

Thursday 21/06/2018 at 15:00 in room B1.09

The next MSDSlab meeting will be next week Thursday and will be presented by Erik-Jan van Kesteren of Utrecht University. He will provide a brief overview and an interactive workshop on the magical mystery that is recursion. We’ll go over the basics, when it’s useful, and we will also program some recursive functions.


Preparation: bring your laptop and your programming language of choice. For the illustration I will use R and RStudio.
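As a small warm-up, here is a minimal sketch of two recursive functions. It is written in Python rather than the R used in the session, and the function names `factorial` and `flatten` are illustrative choices, not material from the workshop. Each function has a base case that stops the recursion and a recursive case that calls the function on a smaller version of the same problem.

```python
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n, computed recursively."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Base case: 0! and 1! are both 1, so the recursion stops here.
    if n <= 1:
        return 1
    # Recursive case: n! = n * (n - 1)!
    return n * factorial(n - 1)


def flatten(nested: list) -> list:
    """Recursively flatten arbitrarily nested lists into one flat list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            # Recursive case: a sub-list is flattened by the same function.
            flat.extend(flatten(item))
        else:
            # Base case: a non-list item is kept as it is.
            flat.append(item)
    return flat


print(factorial(5))                     # 120
print(flatten([1, [2, [3, 4]], 5]))     # [1, 2, 3, 4, 5]
```

The second function shows where recursion tends to be useful: the depth of nesting is not known in advance, so a fixed number of loops would not be enough.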



11/06/2018: Learning from Partitioned Data

Lianne’s PPT can now be downloaded here.


Original post:

Monday 11/06/2018 at 15:00 in room B1.09

The next MSDSlab meeting will be on Monday (instead of Thursday) the 11th of June and will be presented by Lianne Ippel from Maastricht University. Lianne will present on two themes within the topic of learning from partitioned data.

  1. Row-by-row (streaming) learning (horizontal) and
  2. Privacy-preserving machine learning (vertical).


Over the last decade, the social research workflow has changed greatly. While data were previously often collected using paper-and-pencil questionnaires, nowadays they are often collected using webpages and smartphone applications. This change in data gathering has had many consequences, but in this talk I focus in particular on the partitioning of data. I will discuss two types of partitioned data. Horizontally partitioned data implies that the same variables are available for each respondent, but not all respondents are available in one central place (e.g., streaming data). Vertically partitioned data, on the other hand, means that the same respondents are available at different sites or institutes, where each site can have its own set of features, which might or might not be sharable with other sites, e.g., due to the sensitive nature of the features. For these non-sharable features, privacy-preserving data mining/machine learning techniques are required. Your input during this part of the talk will be much appreciated!
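To make the horizontal (streaming) case concrete, here is a minimal sketch of row-by-row learning in Python: a running mean and variance that are updated one observation at a time, so the full data set never has to be stored in one place. It uses Welford's online algorithm as an illustrative stand-in and is not taken from the talk; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Mean and sample variance updated one observation at a time (Welford)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold a single new observation into the summary statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # Uses the old deviation (delta) and the deviation from the new mean.
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; undefined (NaN) with fewer than two observations."""
        if self.count < 2:
            return float("nan")
        return self._m2 / (self.count - 1)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for rows arriving in a stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Each row can be discarded once it has been processed, because only the count, the mean, and the sum of squared deviations are kept.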

Lianne Ippel recently started as a postdoctoral researcher at the Institute for Data Science at Maastricht University. She received her PhD degree from Tilburg University for her thesis “Multilevel Modeling for Data Streams with Dependent Observations”, for which she won the ‘Best Thesis Award’ at the General Online Research conference in Cologne (2018). Her research interests are centered around ethical and responsible use of machine learning and machine learning models in relation to methodological issues such as response style, measurement invariance, and missing data.


31/05/2018: Crowdsourcing for Medical Image Analysis

Thank you again to Veronika and everyone who was present. Veronika’s PPT can now be downloaded here.

Original post:


Thursday 31/05/2018 at 15:00 in room B1.09

The next MSDSlab meeting will be this Thursday, presented by Veronika Cheplygina (Eindhoven University of Technology), who will discuss the possibilities of crowdsourcing for medical image analysis in an interactive MSDSlab.


Machine learning (ML) has vast potential in medical image analysis, improving possibilities for early diagnosis and prognosis of disease. However, ML needs large amounts of representative, annotated examples for good performance, which may not always be possible with medical images. In this talk I will discuss how crowdsourcing is being used to address this problem. I will cover several existing approaches that do this, as well as discuss (what I think is) a promising alternative. At the end there will be an opportunity to play with some data to investigate this claim.

Veronika Cheplygina has been an assistant professor at the Medical Image Analysis group, Eindhoven University of Technology, since February 2017. She received her Ph.D. from the Delft University of Technology in 2015 for her thesis “Dissimilarity-Based Multiple Instance Learning”. As part of her PhD, she was a visiting researcher at the Max Planck Institute for Intelligent Systems in Tuebingen, Germany. From 2015 to 2016 she was a postdoc at the Biomedical Imaging Group Rotterdam, Erasmus Medical Center. Her research interests are centered around learning scenarios where few labels are available, such as multiple instance learning, transfer learning, and crowdsourcing. Next to research, Veronika blogs about academic life at http://www.veronikach.com


24/05/2018 – Grand Challenge Design for Medical Image Analysis – Sharing Data, Metrics and Ground Truth for Algorithm Evaluation

Thursday 24/05/2018 at 15:00 in room A3.08


Adriënne Mendrik of the Netherlands eScience Center will give a presentation on Grand Challenge Design for Medical Image Analysis. This meeting will be held at a slightly different location than usual: Sjoerd Groenmangebouw A3.08.

Preparation: Have a look at https://grand-challenge.org/All_Challenges/ for an overview of all challenges organized in medical image analysis.


17/05/2018 – Better predictions using big(ger) data sets

Thursday 17/05/2018 at 15:00 in room B1.09

Thomas Debray from the UMCU will host the next MSDSlab. He will discuss how we can investigate, quantify and improve the generalizability of prediction models by utilizing big datasets from e-health records or meta-analyses with individual participant data.

Preparation: Have a look at the background readings.

Clinical prediction models (CPMs) are an important tool in contemporary medical decision making and are abundant in the medical literature. These models estimate the probability/risk that a certain condition is present or will occur in the future by combining information from multiple variables (predictors) from an individual, e.g. predictors from patient history, physical examination or medical testing. Unfortunately, many CPMs perform much worse than anticipated during their development. A major reason for their unsatisfactory performance and limited use in clinical practice is that they are typically developed from relatively small datasets and subsequently used in populations/settings too different from the original development population/setting, without proper validation and adaptation to the new situation.

Background literature (assessing generalizability of clinical prediction models)
All are optional. For novices I would recommend the BMJ and PLoS Med papers.

  • Riley RD, et al. External validation of clinical prediction models using big datasets from e-health records or IPD meta-analysis: opportunities and challenges. BMJ. 2016;353:i3140. (Riley2016a)
  • Debray TPA, et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68(3):279–89. (debray_new_2015)
  • Debray TPA, et al. Individual Participant Data (IPD) Meta-analyses of Diagnostic and Prognostic Modeling Studies: Guidance on Their Use. PLoS Med. 2015;12(10):e1001886. (Debray2015c)
  • Debray TPA, et al. A framework for developing, implementing, and evaluating clinical prediction models in an individual participant data meta-analysis. Stat Med. 2013 Aug 15;32(18):3158–80. (Debray2012b)


03/05/2018 – Digital Humanities and Text Mining: Stylistic and Intertextual Analysis of Large Corpora

Paul’s presentation and code can now be found on the MSDSlab GitHub page.

Original post:

Thursday 03/05/2018 at 15:30 in room B1.09

Paul Vierthaler, a university lecturer in the Digital Humanities at Leiden University, will discuss the methodological approaches he takes in his research on late Imperial Chinese literature. Paul studies the relationships among historical and fictional documents written in late Ming and early Qing China (1550 to 1700) at the corpus level. To do this, he uses a variety of methods developed by linguists, computer scientists, and biologists. In his talk, Paul will cover stylometric analysis and an intertextuality detection algorithm based on the bioinformatics algorithm BLAST (Basic Local Alignment Search Tool). While this talk will ground the methodology in specific research questions, he will mainly focus on describing his approach to blending information retrieval with literary studies.

This talk will start 30 minutes later than our regular starting time!

Preparation: These are some suggested, but not essential, readings:




26/04/2018 – Visualizing (not so) Big Data

Meys’ slides are now available here.

Original post:

Thursday 26/04/2018 at 15:00 in room A3.17

Wouter Meys of the Amsterdam-based Citizen Data Lab will give a talk on data visualization. The Citizen Data Lab consists of ‘interdisciplinary teams of researchers, programmers, and designers working on the mapping of urban issues. They develop tools and methods for participatory data collection, visualization and interpretation.’

Preparation: Bring a laptop with R installed.

13/04/2018 – Finding joint and specific sources of variation in linked high-dimensional data

A small GitHub Repository with the R code and PPT used by Katrijn van Deun at the last MSDSlab session can be found at the MSDSlab GitHub page or at: https://github.com/msdslab/MSDS-13-04-2018-RSCA

Original post:

Friday 13/04/2018 at 15:30 in room B1.09


After the high-dimensional data symposium, Katrijn van Deun of Tilburg University will give an interactive talk on ‘Finding joint and specific sources of variation in linked high-dimensional data’ for the MSDSlab members.

Attention: This presentation is on a different day and at a different time than usual.

Preparation: Bring a laptop with R installed.

22/03/2018 – Hidden Markov Models

Emmeke explaining the model in her graph

Here you can find the slides of Emmeke’s talk


Original announcement:

Thursday 22/03/2018 at 15:00 in room B1.09

In this meeting, Emmeke will introduce us to Hidden Markov Models.


The HMM is a very flexible model and as such is applicable to a wide variety of longitudinally collected data. For example, one can extract student behaviour states from MOOC data and investigate the composition of the different learning states and the transitions between them. Or one can extract sleep states based on EEG measurements and subsequently compare the duration of, and transitions between, different sleep states for patients who do and do not suffer from insomnia.
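To give a feel for how hidden states and transitions enter the computation, here is a minimal sketch of the forward algorithm for a discrete HMM in Python. It is an illustration only, not code from the talk: the two hidden states (say, light and deep sleep), the two observation symbols, and all probabilities in `START`, `TRANS` and `EMIT` are made-up assumptions.

```python
# Hypothetical two-state model: state 0 = light sleep, state 1 = deep sleep.
START = [0.6, 0.4]                 # P(first hidden state)
TRANS = [[0.7, 0.3],               # TRANS[i][j] = P(next state j | current state i)
         [0.4, 0.6]]
EMIT = [[0.9, 0.1],                # EMIT[i][k] = P(observed symbol k | state i)
        [0.2, 0.8]]


def forward_likelihood(obs, start, trans, emit):
    """Likelihood of an observation sequence under a discrete HMM."""
    n_states = len(start)
    # alpha[i]: probability of the observations so far, ending in state i.
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        # Sum over all previous states, move one step, then emit the symbol.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


print(forward_likelihood([0, 0, 1], START, TRANS, EMIT))
```

The hidden states are never observed directly; the algorithm sums over every possible state path, which is what allows state durations and transitions to be estimated from the observed measurements alone.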


08/03/2018 – A Probabilistic Active Learning Approach for Learning from Data with Limited Supervision

You can find the slides of the presentation Georg gave here (.pdf, 3MB).

Georg explaining the results in his graphs

Original announcement:

Thursday 08/03/2018 at 15:00 in room B1.09

The speaker for this meeting will be Georg Krempl, who will talk about an approach for learning from data with limited supervision. Here is a shortened abstract:

Machine learning has become widely used throughout commerce, science, and technology. However, the ever-increasing volumes of data are contrasted by various constraints, such as limited supervision, processing or storage capacities. This requires techniques to optimise the allocation of these capacities.

Active machine learning aims to provide techniques for selecting the most insightful information (like label annotations of data instances) to be queried from oracles (like human supervisors).

In this talk, I will present our recently developed probabilistic active learning approach PAL. This decision-theoretic approach combines the fast asymptotic runtime of popular heuristics like uncertainty sampling with a direct optimisation of the expected gain in classification performance.

I will conclude this presentation by demonstrating the use of PAL in different active learning scenarios, ranging from label selection in large data pools and evolving data streams to broader settings such as active class selection.
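For readers new to active learning, here is a minimal sketch in Python of uncertainty sampling, the popular heuristic mentioned in the abstract. It is not PAL and not the speaker's code: it simply ranks unlabelled instances by how close the classifier's predicted probability is to 0.5 and queries the most uncertain ones. The function name `select_queries` and the example probabilities are hypothetical.

```python
def select_queries(probabilities, budget):
    """Pick the `budget` pool instances the classifier is least sure about.

    `probabilities` holds the predicted P(class = 1) for each unlabelled
    instance in a binary problem; the return value is a list of indices.
    """
    # Smaller distance to 0.5 means higher uncertainty.
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Made-up predictions for five unlabelled instances.
pool_predictions = [0.95, 0.52, 0.10, 0.40, 0.71]
print(select_queries(pool_predictions, budget=2))  # [1, 3]
```

The selected instances would then be sent to the oracle (for example a human annotator) for labelling, after which the classifier is retrained and the procedure repeats.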