OPEN BACHELOR'S & MASTER'S PROJECTS

INSTITUTE FOR INTERACTIVE SYSTEMS AND DATA SCIENCE

GRAZ UNIVERSITY OF TECHNOLOGY


DI Dr. Simon Walk

simon.walk@tugraz.at
+43 (316) 873 - 5619
Inffeldgasse 16c/I - Room: ID01104
A-8010 Graz
DI Lukas Eberhard

lukas.eberhard@tugraz.at
+43 (316) 873 - 5642
Inffeldgasse 16c/I - Room: ID01104
A-8010 Graz


General Requirements & Project Slides
If you are interested in working on one of the projects listed below, we are happy to make an appointment (depending on the project, either in person or via Skype) to discuss the details of the project, what we require and expect from you, and how this aligns with your (and our) schedule. The main goal of each project is to familiarize yourself with new techniques for analyzing content on the web and to produce new results using scientifically grounded approaches (with our guidance).

You can find a PDF that includes bullet-point summaries of the listed projects here.
Quicklinks
For more details, please browse the separate project entries on this page and/or get in touch with the contact person listed for each project.

Bachelor's Theses & Projects

  1. Interactive Graph Analysis Framework
  2. Crawling and Analyzing Online Communities

Master's Theses & Projects

  1. Factors of Success in Crowdfunding Campaigns
  2. Predicting Different Aspects of Online Products
  3. Fraud Detection via Sequential User-actions
  4. Change-Log Analyses of Software Development Projects

Bachelor's Theses & Projects

Interactive Graph Analysis Framework

The goal of this project is the development of an interactive framework for analyzing arbitrary graphs or networks (e.g., web-graphs or social networks).
  • The framework should provide functionality to import different binary graph formats (e.g., *.gt files) as well as different textual representations, such as edge and node lists (including potential node and edge attributes), numpy/scipy matrices, etc.
  • Further, the framework should perform various analyses on the imported/loaded network, such as the calculation (and storage) of the degree distribution, various centrality metrics, measures that quantify connectivity (e.g., LCC, SCC), distance metrics, eigenvalues, etc.
  • Finally, all obtained results should be stored inside the graph for easy retrieval in follow-up analyses of the networks under investigation (see the sketch below this list). Additionally, all results should be visualized using 2D and 3D visualization libraries and presented in a coherent way (e.g., as an IPython notebook).
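As a rough illustration of the import, annotate, and store workflow, the following minimal sketch uses graph-tool; the file names and property names are placeholders, not part of the project specification.

    import graph_tool.all as gt

    # Import a binary *.gt file.
    g = gt.load_graph("network.gt")

    # Compute metrics and store them inside the graph as property maps.
    g.vp["pagerank"] = gt.pagerank(g)
    g.vp["degree"] = g.degree_property_map("total")
    g.vp["in_lcc"] = gt.label_largest_component(g, directed=False)

    # Persist the annotated graph for follow-up analyses and visualization.
    g.save("network_annotated.gt")

Storing results as internal property maps means a later notebook session can reload the file and retrieve all previously computed metrics without recomputation.
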
Requirements: General knowledge of (Social) Network Analysis, Python and some Python libraries (e.g., scipy, numpy, graph-tool, multiprocessing).

Contact: Simon Walk (simon.walk@tugraz.at), Lukas Eberhard (lukas.eberhard@tugraz.at)

Crawling and Analyzing Online Communities

Students will conduct an empirical analysis of a given online community/website. Due to the empirical nature of this project, the results will mainly consist of descriptive statistics about usage and about the interactions between users and the online community.
  • Depending on the community under investigation, crawling strategies and code to query any publicly available APIs have to be developed. Modern APIs usually return their data in XML or JSON format, which will likely contain unwanted data that has to be parsed, pre-processed and cleaned (see the sketch below this list).
  • The main part of the project is centered on the empirical investigation of the crawled dataset. This includes the calculation of general statistics (e.g., average number of users active per day, number of contributions per day, etc.) as well as more sophisticated approaches to broaden our understanding of the investigated dataset (e.g., time-series analyses, correlation analyses, social network analyses, as well as prediction and/or classification experiments and their evaluation).
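The crawling step typically boils down to paging through a JSON (or XML) API and aggregating the responses, as in the following minimal sketch; the endpoint URL and field names are placeholders that depend on the community under investigation.

    import collections
    import datetime
    import requests

    # Hypothetical endpoint; a real community needs its own crawling strategy.
    resp = requests.get("https://example.com/api/posts", params={"page": 1})
    items = resp.json()["items"]          # field names are assumptions

    # Basic descriptive statistic: number of contributions per day.
    posts_per_day = collections.Counter()
    for item in items:
        day = datetime.date.fromtimestamp(item["created_at"])  # unix timestamp assumed
        posts_per_day[day] += 1

    for day, count in sorted(posts_per_day.items()):
        print(day, count)

In practice, the raw responses would first be stored (e.g., in MySQL), and the cleaning and statistics would then run on the stored data.
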
Requirements: General knowledge of (Social) Network Analysis, Python and some Python libraries (e.g., scipy, numpy, graph-tool, multiprocessing), XML/JSON parsing and MySQL.

Contact: Simon Walk (simon.walk@tugraz.at)

Master's Theses & Projects

Factors of Success in Crowdfunding Campaigns

Kickstarter and similar crowdfunding websites represent very attractive venues for diverting the risk of creating a new business or product onto the customers. However, only a fraction of all projects on Kickstarter are successfully funded. Identifying the factors of success is key to creating better campaigns/products and opens up a relevant and interesting opportunity for researchers.
  • For this project, the first step is to crawl, scrape and aggregate a large sample of successful (and unsuccessful) campaigns from a given crowdfunding platform. This data will likely need some form of preprocessing and data cleaning before analyses can be conducted.
  • First empirical analyses (basic statistics that can be aggregated and visualized over time) as well as more sophisticated time-series and correlation analyses will provide insights into the features that determine success.
  • Additionally, we want to uncover the latent/emergent features of crowdfunding campaigns that best predict success. To identify these factors, prediction and classification experiments have to be conducted (see the sketch below this list), which will also provide the basis for evaluating the obtained findings.
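As an example of what such a prediction experiment could look like, here is a minimal sklearn sketch with synthetic placeholder data; in the actual project, the feature matrix (e.g., funding goal, campaign duration, number of updates) and the funded/not-funded labels would come from the crawled campaigns.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data standing in for crawled campaign features and labels.
    rng = np.random.default_rng(0)
    X = rng.random((500, 3))              # e.g., goal, duration, n_updates
    y = (X[:, 0] < 0.5).astype(int)       # dummy "successfully funded" label

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print("Mean AUC over 5 folds:", scores.mean())

The feature importances of the fitted classifier then hint at which campaign properties best predict success.
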
Requirements: Web/Data Mining, Python and some Python libraries (e.g., scipy, numpy, sklearn, graph-tool, multiprocessing).

Contact: Simon Walk (simon.walk@tugraz.at)

Predicting Different Aspects of Online Products

Many websites visualize the price history of products in online markets, such as camelcamelcamel for Amazon or the price history of products on Geizhals. For this project, we are looking for a student who is interested in conducting experiments to identify features of products that allow for an accurate prediction of their market price.
  • The first step of this project will be the crawling, scraping and preprocessing of (historical) marketplace data.
  • In the next step, additional information about the products under investigation is required (likely resulting in more crawling/scraping).
  • Next, using data mining and knowledge extraction techniques, first basic statistics are computed, including time-series analyses as well as feature extraction methods.
  • Finally, the generated/trained model (machine learning/neural networks) will be used to predict the market price of the products, which can be evaluated by calculating the RMSE between the predictions and the actual data (see the sketch below this list).
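Evaluating the price predictions with the RMSE is straightforward; a minimal sketch with made-up numbers:

    import numpy as np
    from sklearn.metrics import mean_squared_error

    y_true = np.array([99.9, 89.5, 92.0, 87.3])   # observed market prices
    y_pred = np.array([97.0, 91.0, 90.5, 88.0])   # model predictions

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print("RMSE:", rmse)
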
Requirements: Web/Data Mining, Python and some Python libraries (e.g., scipy, numpy, graph-tool, sklearn, multiprocessing, keras).

Contact: Simon Walk (simon.walk@tugraz.at)

Fraud Detection via Sequential User-actions

Online marketplaces and trading websites struggle with the detection of fraudulent activities. Usually, companies resort to an external business to validate credit card details and user credentials. This project is intended to create a real-time module that parses the change-logs (and potential interaction data, such as log-ins or account creation data) and determines whether a visitor will engage in fraudulent activities.
  • The first step will be the preprocessing of the Apache logs (and SQL transactions) to generate labeled data (labels exist but need to be assigned to the logged actions).
  • Using data mining and knowledge extraction approaches, basic statistics (e.g., time-series analyses, the evolution of activities over time, etc.), social network analyses, pattern mining and Markov chain analyses are conducted (see the sketch below this list).
  • Using the labeled data, prediction and classification experiments are conducted and properly evaluated.
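To illustrate the Markov chain part, the following minimal sketch estimates first-order transition probabilities between user actions; the hard-coded sessions are placeholders for the sequences that parsing the Apache logs would yield.

    import collections

    # Placeholder action sequences, one list per user session.
    sessions = [
        ["login", "search", "buy"],
        ["login", "buy", "buy"],
        ["signup", "login", "search"],
    ]

    # Count transitions between consecutive actions.
    counts = collections.defaultdict(collections.Counter)
    for actions in sessions:
        for src, dst in zip(actions, actions[1:]):
            counts[src][dst] += 1

    # Row-normalize the counts into transition probabilities.
    probs = {src: {dst: n / sum(nxt.values()) for dst, n in nxt.items()}
             for src, nxt in counts.items()}
    print(probs)

Comparing the transition probabilities of sessions labeled as fraudulent against those of legitimate sessions is one way to derive features for the classification experiments.
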
Requirements: Web/Data Mining, Python and some Python libraries (e.g., scipy, numpy, sklearn, multiprocessing, keras), Neural Networks, Markov chains.

Contact: Simon Walk (simon.walk@tugraz.at)

Change-Log Analyses of Software Development Projects

There exists an abundance of different software development methodologies (e.g., Waterfall, eXtreme Programming, Scrum, etc.) which are applied to manage and maintain the engineering process. By formulating and devising different hypotheses that characterize software development methodologies, and testing these hypotheses using HypTrails, it is possible to analyze how software is developed "in the wild".
  • First, logs of changes (e.g., subversion or git commit logs) have to be aggregated, preprocessed and analyzed.
  • Then, all conducted changes have to be classified into several different categories (e.g., quick fix, refactoring, adding functionality, editing existing code, etc.), and a classifier has to be trained to allow for the automatic labeling of such changes (see the sketch below this list).
  • To validate the classification, crowdsourcing strategies (e.g., Mechanical Turk) could be employed.
  • Using the labeled commit logs, it is then possible to conduct analyses with HypTrails to identify which software development methodology best represents how software is developed "in the wild".
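As a starting point for the labeling step, the following minimal sketch extracts commit messages via git and assigns rough categories by keyword; the trained classifier would replace these keyword rules, and the category names are only examples.

    import subprocess

    # Read one commit subject line per commit from the current repository.
    log = subprocess.run(["git", "log", "--pretty=format:%s"],
                         capture_output=True, text=True).stdout

    for message in log.splitlines():
        lower = message.lower()
        if "fix" in lower:
            label = "quick fix"
        elif "refactor" in lower:
            label = "refactoring"
        else:
            label = "other"
        print(label, "|", message)
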
Requirements: Python and some Python libraries (e.g., scipy, numpy, sklearn, multiprocessing, keras), Git/Subversion, HDFS (for large datasets).

Contact: Simon Walk (simon.walk@tugraz.at)