Demos | Research Projects

A showcase of my work in Data Science, AI, and Empirical Software Engineering. This page highlights selected research projects from my Ph.D. alongside hands-on AI demos and prototypes. Each entry includes a brief description, relevant figures, and publication or implementation details. My work spans software systems, business data, and large-scale unstructured data, with a focus on building robust, data-driven solutions using AI.

Demos & Prototypes

TaskPilot: Multi-Agent Action Extraction Assistant

TaskPilot is a multi-agent assistant that transforms messy meeting notes or transcripts into cleaned text, concise summaries, and structured action items with owners and deadlines. It demonstrates modular multi-agent reasoning with LLMs using Azure OpenAI and Semantic Kernel.

Use Case: Automated task and summary extraction from unstructured meeting content
Key Technologies: Azure OpenAI, Semantic Kernel (Python SDK), FastAPI, Tailwind CSS

The system coordinates specialized agents for cleaning, summarizing, and extracting tasks, each with defined roles and responsibilities, and is deployed via Azure App Service.
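The hand-off between the three agents can be sketched as a simple sequential pipeline over shared state. This is a minimal illustration, not TaskPilot's actual code: the agent functions, the MeetingState class, and the regular expressions are hypothetical stand-ins for the LLM calls that Azure OpenAI and Semantic Kernel would make.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MeetingState:
    """Shared state handed from one agent to the next (hypothetical)."""
    raw: str
    cleaned: str = ""
    summary: str = ""
    actions: List[Dict[str, str]] = field(default_factory=list)


def cleaning_agent(state: MeetingState) -> MeetingState:
    # Stand-in for an LLM call: drop filler words, collapse whitespace.
    text = re.sub(r"\b(um|uh)\b,?\s*", "", state.raw, flags=re.IGNORECASE)
    state.cleaned = re.sub(r"\s+", " ", text).strip()
    return state


def summarizing_agent(state: MeetingState) -> MeetingState:
    # Stand-in for an LLM call: use the first sentence as the summary.
    state.summary = state.cleaned.split(". ")[0].rstrip(".") + "."
    return state


ACTION = re.compile(
    r"(?P<owner>[A-Z]\w+) will (?P<task>.+?) by (?P<deadline>[A-Z]\w+)"
)


def extraction_agent(state: MeetingState) -> MeetingState:
    # Stand-in for an LLM call: match "<Owner> will <task> by <Deadline>".
    state.actions = [m.groupdict() for m in ACTION.finditer(state.cleaned)]
    return state


AGENTS: List[Callable[[MeetingState], MeetingState]] = [
    cleaning_agent,
    summarizing_agent,
    extraction_agent,
]


def run_pipeline(raw_notes: str) -> MeetingState:
    """Run each agent in order; every agent fills in one part of the state."""
    state = MeetingState(raw=raw_notes)
    for agent in AGENTS:
        state = agent(state)
    return state
```

Calling run_pipeline on a short transcript returns a state object whose cleaned, summary, and actions fields are each filled in by a different agent, which mirrors the division of roles described above.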

Live Demo | GitHub Repo

*Hosted on a student Azure account. Demo performance may vary due to limited resource quotas.

RAG Interview Assistant Chatbot

A retrieval-augmented generation (RAG) chatbot designed to assist with interview preparation using custom documents such as resumes, job descriptions, and company profiles. This demo showcases LLM integration with Azure services to deliver intelligent, context-aware answers.

Use Case: Personalized interview Q&A from private documents
Key Technologies: Azure Cognitive Search, Azure OpenAI, Azure Cosmos DB, Streamlit, Python

The assistant indexes user data for semantic retrieval, generates answers with GPT-based models, and is deployed via Azure Web App with a Streamlit UI.
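The retrieve-then-generate flow can be illustrated with a small self-contained sketch. It is an assumption-laden stand-in, not the deployed implementation: a bag-of-words cosine similarity replaces the semantic index in Azure Cognitive Search, the function names are hypothetical, and the final step only assembles the prompt that would be sent to a GPT-based model.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Assemble the grounded prompt; the model call itself is omitted."""
    context = "\n".join(f"- {doc}" for _, doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Restricting the prompt to the top-k retrieved passages is what keeps answers tied to the user's own documents rather than to the model's general knowledge.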

Live Demo | GitHub Repo

*Hosted on a student Azure account. Demo performance may vary due to limited resource quotas.

Research Projects

LLM Pre-training Datasets

A critical part of creating code suggestion systems is the pre-training of Large Language Models (LLMs) on vast amounts of source code and natural language text, often of questionable origin, quality, or compliance. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. In this work, we propose an automated source code autocuration technique that leverages the complete version history of open-source software (OSS) projects to improve the quality of training data. We evaluate this method on “The Stack v2” dataset, which comprises almost 600M code samples.

In addition to the full dataset, The Stack v2 has several deduplicated versions. The-stack-v2-train-smol-ids is the most heavily filtered version and spans 17 programming languages. We evaluate our fixing approach on both the full and the smol (maximally deduplicated) datasets.
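One way to picture history-based curation is to replace each training sample with the most recent version of the same file found in the project's version history. The sketch below is only an illustration under that assumption; the successor mapping, the function names, and the blob identifiers are hypothetical and do not reproduce the method from the paper.

```python
from typing import Dict, List


def latest_version(blob: str, successor: Dict[str, str]) -> str:
    """Follow 'old blob -> newer blob' links until no newer version exists."""
    seen = {blob}
    while blob in successor:
        blob = successor[blob]
        if blob in seen:
            # The history loops back on itself (e.g. a revert); stop here.
            break
        seen.add(blob)
    return blob


def curate(samples: List[str], successor: Dict[str, str]) -> List[str]:
    """Swap every sample for its latest known version, dropping duplicates."""
    curated: List[str] = []
    emitted = set()
    for blob in samples:
        head = latest_version(blob, successor)
        if head not in emitted:
            emitted.add(head)
            curated.append(head)
    return curated
```

In this toy model, three samples that are successive versions of one file collapse into the newest version, while a sample with no recorded history is kept unchanged.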

This infographic visualizes risks in LLM training data from our award-winning study.

Cracks in The Stack Infographic

Publications:

Jahanshahi, M. & Mockus, A. "Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets." Accepted at the Second International Workshop on Large Language Models for Code (LLM4Code 2025). Won the LLM4Code Best Paper Award.
Preprint - Replication Package - GitHub Repo

Collaboration Graph

In this project, we aim to make the developer collaboration structure and the relationships among projects easier to understand. Starting from the bipartite graph of which projects developers contribute to, we provide an interactive collaboration graph of the open-source ecosystem, built from data obtained from the World of Code (WoC) infrastructure. Our attempts to visualize all projects and developers at once were stymied by the inability of layout and visualization tools to process the exceedingly large full graph. We therefore used WoC to filter the nodes and edges, reducing the graph to a scale amenable to interactive visualization.
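The step from the developer-project graph to a smaller developer-developer graph can be sketched as a projection followed by an edge filter. This is a minimal sketch under stated assumptions: the shared-project threshold and the function names are hypothetical, and the actual filtering criteria applied with WoC are not described here.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def project_authors(contributions: Iterable[Tuple[str, str]]) -> Dict[Edge, int]:
    """Project (developer, project) pairs onto developer-developer edges.

    The weight of an edge is the number of projects both developers share.
    """
    members: Dict[str, Set[str]] = defaultdict(set)
    for developer, project in contributions:
        members[project].add(developer)

    weights: Dict[Edge, int] = defaultdict(int)
    for developers in members.values():
        for a, b in combinations(sorted(developers), 2):
            weights[(a, b)] += 1
    return dict(weights)


def filter_edges(weights: Dict[Edge, int], min_shared: int = 2) -> Dict[Edge, int]:
    """Keep only edges backed by at least min_shared common projects."""
    return {edge: w for edge, w in weights.items() if w >= min_shared}
```

Raising min_shared removes weak ties first, which is one simple way to shrink a graph until a layout tool can handle it.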

You can access and use the interactive tool at: Authors’ Graph and Projects’ Graph

Publications:

Lyulina, E., & Jahanshahi, M. (2021, May). "Building the collaboration graph of open-source software ecosystem." In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR) (pp. 618-620). IEEE.
Paper - GitHub Repo

Orphan Vulnerabilities

A key premise of open source software is the ability to copy code to other open source projects (white-box reuse). Such copying accelerates the development of new projects, but code flaws in the original projects, such as vulnerabilities, may also spread and persist in the copies even after they are fixed in the projects from which the code was taken. The extent of the spread of vulnerabilities through code reuse, the potential impact of such spread, and avenues for mitigating the risk of these secondary vulnerabilities have not been studied in the context of a nearly complete collection of open source code. In this project, we develop a tool, VDiOS, to help identify and fix white-box-reuse-induced vulnerabilities that have already been patched in the original projects (orphan vulnerabilities). We hope that VDiOS will lead to further study and mitigation of risks from orphan vulnerabilities and other orphan code flaws.
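A simplified way to locate exact copies of a known-vulnerable file version is to compare content hashes across projects. The sketch below hashes content the same way Git hashes blobs and flags matching files outside the project that shipped the fix. It is an illustrative assumption, not the VDiOS implementation; find_orphans and the in-memory project layout are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def git_blob_sha1(content: bytes) -> str:
    """Hash content the way Git does for blobs: 'blob <size>\\0<content>'."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_orphans(
    vulnerable_blobs: Set[str],
    fixed_in: str,
    projects: Dict[str, Dict[str, bytes]],
) -> List[Tuple[str, str]]:
    """List (project, path) pairs that still carry a known-vulnerable blob.

    vulnerable_blobs: hashes of pre-fix file versions from the original project
    fixed_in: name of the project where the fix was applied (skipped)
    projects: hypothetical in-memory map of project -> {path: file content}
    """
    hits = []
    for project, files in projects.items():
        if project == fixed_in:
            continue
        for path, content in files.items():
            if git_blob_sha1(content) in vulnerable_blobs:
                hits.append((project, path))
    return sorted(hits)
```

Exact hash matching only catches unmodified copies; files that were edited after copying would need a fuzzier comparison, which this sketch does not attempt.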

VDiOS Architecture Diagram

Publications:

Reid, D., Jahanshahi, M., & Mockus, A. (2022, May). "The extent of orphan vulnerabilities from code reuse in open source software." In Proceedings of the 44th International Conference on Software Engineering (ICSE) (pp. 2104-2115). Nominated for ACM SIGSOFT Distinguished Paper Award.
Paper - GitHub Repo

Copy-Based Reuse

In contrast to dependency-based reuse supported via package managers, which has been examined in some studies, OSS-wide copy-based reuse has not been studied. In this project, we create a dataset that aims to encourage studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS. To accomplish that, we develop an efficient algorithm that exploits the World of Code infrastructure to detect copy-based reuse. We expect this data will enable future research and tool development that support such reuse and minimize the associated risks.
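Whole-file reuse can be pictured as the same blob appearing in more than one project, with the project that committed it first treated as the originator. The following is a simplified in-memory sketch under that assumption; the Sighting tuple and detect_reuse are hypothetical names, and the actual algorithm relies on World of Code data structures rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (blob hash, project name, commit timestamp) for one appearance of a file
Sighting = Tuple[str, str, int]


def detect_reuse(sightings: Iterable[Sighting]) -> Dict[str, Tuple[str, List[str]]]:
    """Map each reused blob to (originating project, later adopters).

    A blob counts as reused when it appears in at least two projects; the
    project with the earliest commit of that blob is taken as the originator.
    """
    earliest: Dict[str, Dict[str, int]] = defaultdict(dict)
    for blob, project, timestamp in sightings:
        previous = earliest[blob].get(project)
        if previous is None or timestamp < previous:
            earliest[blob][project] = timestamp

    reused: Dict[str, Tuple[str, List[str]]] = {}
    for blob, by_project in earliest.items():
        if len(by_project) < 2:
            continue  # seen in a single project only, so not reuse
        ordered = sorted(by_project, key=lambda p: (by_project[p], p))
        reused[blob] = (ordered[0], ordered[1:])
    return reused
```

Grouping by blob hash means each file version is examined once no matter how many projects contain it, which is what makes this kind of analysis feasible at large scale.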

To gain deeper insights into copy-based reuse, we analyze its prevalence and identify the factors influencing the propensity to reuse. We begin with a set of factors, grounded in Social Contagion Theory, that may influence the propensity to reuse, and we sample instances of different reuse types. We then survey developers to better understand their intentions behind this practice.

Generated and Reused Blob Trends

Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, and it consistently decreases over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from "small" and "medium" projects. Developers had various reasons for reuse but were generally positive about using a package manager when one was available.

Odds Ratios - Blob-level Logistic Regression Model

Odds Ratios - Project-level Logistic Regression Model

Publications:

Jahanshahi, M., Reid, D., & Mockus, A. "Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development." Accepted in ACM Transactions on Software Engineering and Methodology (TOSEM).
Preprint - Replication Package - GitHub Repo
Jahanshahi, M. & Mockus, A. (2024, April). "Dataset: Copy-based Reuse in Open Source Software." In 2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR) (pp. 42-47). IEEE.
Paper - GitHub Repo