Home | Portfolio | Interests

Portfolio | | Research Projects


A showcase of my work in Data Science, Empirical Software Engineering, and Mining Open Source Software Repositories. Here, you will find selected research projects I contributed to during my PhD, each accompanied by a brief description, figures, and publication details. My work primarily focuses on developing data-driven insights, methods and tools to improve reliability in open-source software supply chains.


LLM Pre-training Datasets

A critical part of creating code suggestion systems is the pre-training of Large Language Models (LLMs) on vast amounts of source code and natural language text, often of questionable origin, quality, or compliance. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. In this work, we propose an automated source code autocuration technique that leverages the complete version history of open-source software (OSS) projects to improve the quality of training data. We evaluate this method using “The Stack v2” dataset, comprising almost 600M code sample.

In addition to the full dataset, the Stack v2 has several deduplicated versions. The-stack-v2-train-smol-ids is the most filtered version spanning 17 programming languages. We evaluate our fixing approach on the full and smol (maximally deduplicated) datasets.

This infographic visualizes risks in LLM training data from our award-winning study.

Your browser does not support SVG!
Cracks in The Stack Infographic
Publications:
  • Jahanshahi, M. & Mockus, A.. "Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets." Accepted in Second International Workshop on Large Language Models for Code (LLM4Code 2025). Won the LLM4Code Best Paper Award.
    Preprint - Replication Package - Repository

Collaboration Graph

In this project, we aim to facilitate the understanding of the developer collaboration structure and relationships among projects based on the bi-graph of what projects developers contribute to by providing an interactive collaboration graph of this ecosystem, using the data obtained from World of Code infrastructure. Our attempts to visualize the entirety of projects and developers were stymied by the inability of the layout and visualization tools to process the exceedingly large scale of the full graph. We used WoC to filter the nodes and edges to reduce the scale of the graph that made it amenable to an interactive visualization.

You can access and use the interactive tool at: Authors’ Graph and Projects’ Graph

Authors’ Collaboration Graph
Authors’ Collaboration Graph
Projects’ Collaboration Graph
Projects’ Collaboration Graph
Publications:
  • Lyulina, E., & Jahanshahi, M. (2021, May). "Building the collaboration graph of open-source software ecosystem." In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) (pp. 618-620). IEEE.
    Paper - Repository

Orphan Vulnerabilities

A key premise of open source software is the ability to copy code to other open source projects (white-box reuse). Such copying accelerates development of new projects, but the code flaws in the original projects, such as vulnerabilities, may also spread even if fixed in the projects from where the code was appropriated. The extent of the spread of vulnerabilities through code reuse, the potential impact of such spread, or avenues for mitigating risk of these secondary vulnerabilities has not been studied in the context of a nearly complete collection of open source code. In this project, we develop a tool, VDiOS, to help identify and fix white-box-reuse-induced vulnerabilities that have been already patched in the original projects (orphan vulnerabilities). We hope that VDiOS will lead to further study and mitigation of risks from orphan vulnerabilities and other orphan code flaws.

VDiOS architecture diagram
VDiOS Architecture Diagram

Publications:
  • Reid, D., Jahanshahi, M., & Mockus, A. (2022, May). "The extent of orphan vulnerabilities from code reuse in open source software." In Proceedings of the 44th International Conference on Software Engineering (ICSE) (pp. 2104-2115). Nominated for ACM SIGSOFT Distinguished Paper Award.
    Paper - Repository

Copy-Based Reuse

In contrast to some studies of dependency-based reuse supported via package managers, no studies of OSS-wide copy-based reuse exist. In this project, we create a dataset that seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS. To accomplish that, we develop approaches to detect copy-based reuse by developing an efficient algorithm that exploits World of Code infrastructure. We expect this data will enable future research and tool development that support such reuse and minimize associated risks.

Reuse Identification Data Flow Diagram
Reuse Identification Data Flow Diagram

To gain deeper insights into copy-based reuse, we analyze its prevalence and identify the factors influencing the propensity to reuse. We begin with a set of potential influencing factors, grounded in Social Contagion Theory, related to the propensity to reuse and sample instances of different reuse types. We then survey developers to better understand their intentions for this particular practice.

Generated and Reused Blob Trends
Generated and Reused Blobs Trends
Reused to Generated Blobs Ratio Trend
Reused to Generated Blobs Ratio Trend

Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, consistently decreasing over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from "small" and "medium" projects. Developers had various reasons for reuse but were generally positive about using a package manager in case it was available.

Odds Ratios - Blob-level Logistic Regression Model
Odds Ratios - Blob-level Logistic Regression Model
Odds Ratios - Project-level Logistic Regression Model
Odds Ratios - Project-level Logistic Regression Model
Publications:
  • Jahanshahi, M., Reid, D., & Mockus, A.. "Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development." Accepted in ACM Transactions on Software Engineering and Methodology (TOSEM).
    Preprint - Replication Package - Repository
  • Jahanshahi, M. & Mockus, A. (2024, April). "Dataset: Copy-based Reuse in Open Source Software." In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (pp. 42-47). IEEE.
    Paper - Repository