
Unlocking 25 Years of R&D Data Through Graph Visualization
A leading European pharmaceutical group had accumulated over 25 years of R&D data across siloed databases, legacy file systems, and institutional knowledge locked in retired employees' notebooks. The challenge: make this data searchable, connected, and actionable for current research teams.
The problem
The company's research data was spread across 14 distinct systems — from Oracle databases storing clinical trial results to SharePoint sites holding patent filings, and even scanned PDFs of handwritten lab notes from the early 2000s. Researchers spent an average of 3.2 hours per day just searching for prior work before starting new experiments.
We estimated that 40% of our exploratory research was unknowingly duplicating work already done internally. The cost wasn't just time — it was missed opportunities for cross-pollination between therapeutic areas.
Our approach
We designed a three-phase pipeline to ingest, link, and visualize the data:
- Ingestion & normalization — ETL pipelines to extract structured data from databases and unstructured data from documents using OCR and NLP
- Entity resolution & linking — Graph construction using Neo4j, linking compounds, researchers, publications, patents, and trial outcomes
- Interactive visualization — A custom web interface allowing researchers to explore the knowledge graph visually
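The first two phases above can be illustrated with a minimal sketch. It is not the production pipeline: the `Record` shape, the source labels, and the `normalize` and `link` helpers are all hypothetical names introduced here, and real ingestion would sit behind the OCR and NLP steps. The idea shown is only that each source is reduced to a common record, and records that agree on type and normalized name are grouped into one entity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Common shape every source is normalized into (hypothetical)."""
    source: str       # originating system, e.g. "oracle" (illustrative label)
    entity_type: str  # e.g. "Compound" or "Patent"
    name: str         # normalized name


def normalize(source: str, entity_type: str, raw_name: str) -> Record:
    """Phase 1: collapse whitespace and case so sources agree on spelling."""
    cleaned = " ".join(raw_name.split()).lower()
    return Record(source, entity_type, cleaned)


def link(records: list[Record]) -> dict[tuple[str, str], set[str]]:
    """Phase 2: group records that refer to the same (type, name) entity."""
    entities: dict[tuple[str, str], set[str]] = defaultdict(set)
    for rec in records:
        entities[(rec.entity_type, rec.name)].add(rec.source)
    return dict(entities)


# Made-up sample data, not taken from the project.
records = [
    normalize("oracle", "Compound", "  Compound-X "),
    normalize("sharepoint", "Compound", "compound-x"),
    normalize("scanned_pdf", "Patent", "Example Patent"),
]
linked = link(records)

# Entities seen in more than one system are candidates for graph edges.
shared = {key for key, sources in linked.items() if len(sources) > 1}
print(shared)
```

Exact matching on a normalized name is the simplest possible linking rule; it is the baseline that fuzzy matching and embeddings improve on.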
The entity resolution step was particularly challenging. Chemical compound names had evolved over the decades, and the same molecule could appear under 5–7 different identifiers. We used a combination of fuzzy matching and domain-specific embeddings:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("allenai/scibert_scivocab_uncased")

def find_similar_compounds(query: str, corpus: list[str], threshold: float = 0.82):
    # Normalize embeddings to unit length so the dot product is cosine similarity.
    query_emb = model.encode([query], normalize_embeddings=True)
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    similarities = np.dot(corpus_emb, query_emb.T).flatten()
    # Keep names scoring above the threshold, best match first.
    matches = [(name, float(score)) for name, score in zip(corpus, similarities) if score > threshold]
    return sorted(matches, key=lambda x: x[1], reverse=True)
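The snippet above covers only the embedding half of the approach. The fuzzy-matching half could look something like the following sketch, which uses Python's standard-library `difflib` as a stand-in; the source does not say which fuzzy-matching technique was actually used. The `canonical` and `fuzzy_matches` helpers, the 0.85 threshold, and the sample identifiers are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def canonical(name: str) -> str:
    """Lowercase and drop separators so 'ABC-123' and 'abc 123' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def fuzzy_matches(query: str, corpus: list[str], threshold: float = 0.85):
    """Return (name, ratio) pairs whose string similarity meets the threshold."""
    q = canonical(query)
    scored = [
        (name, SequenceMatcher(None, q, canonical(name)).ratio())
        for name in corpus
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    # Best match first; ties keep their original corpus order.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up identifiers, not real compound names.
names = ["ABC-123", "abc 123", "ABC-1234", "XYZ-999"]
hits = fuzzy_matches("abc123", names)
for name, ratio in hits:
    print(f"{name}: {ratio:.3f}")
```

String similarity catches formatting variants cheaply, while embeddings can catch renamed identifiers that share no characters. A candidate pair could be accepted when either score passes its threshold, though that combination rule is also an assumption.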
Graph schema
The final knowledge graph contained over 2.3 million nodes and 18 million relationships. The core schema linked six entity types through five relationship types:
(Researcher)-[:AUTHORED]->(Publication)
(Publication)-[:REFERENCES]->(Compound)
(Compound)-[:TESTED_IN]->(ClinicalTrial)
(ClinicalTrial)-[:FILED_AS]->(Patent)
(Patent)-[:OWNED_BY]->(TherapeuticArea)
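To show how a chain of these relationships turns into a query, the sketch below walks the schema triples listed above and renders a Cypher `MATCH` pattern between two labels. The `cypher_path` helper is hypothetical and only builds the query string; it does not connect to Neo4j, and it assumes each label has at most one outgoing relationship, which holds for the five triples shown here.

```python
# Schema triples copied from the list above: (source label, relationship, target label).
SCHEMA = [
    ("Researcher", "AUTHORED", "Publication"),
    ("Publication", "REFERENCES", "Compound"),
    ("Compound", "TESTED_IN", "ClinicalTrial"),
    ("ClinicalTrial", "FILED_AS", "Patent"),
    ("Patent", "OWNED_BY", "TherapeuticArea"),
]


def cypher_path(start: str, end: str) -> str:
    """Follow outgoing schema edges from start to end and render a MATCH pattern."""
    # Assumes one outgoing edge per label, true for the chain-shaped schema above.
    outgoing = {src: (rel, dst) for src, rel, dst in SCHEMA}
    pattern = f"(:{start})"
    label = start
    while label != end:
        if label not in outgoing:
            raise ValueError(f"no schema path from {start} to {end}")
        rel, label = outgoing[label]
        pattern += f"-[:{rel}]->(:{label})"
    return f"MATCH p = {pattern} RETURN p"


query = cypher_path("Compound", "Patent")
print(query)
```

A string like this could then be passed to a Neo4j session, with a `LIMIT` clause added before running it against a graph of this size.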
Results
Within three months of deployment, the platform uncovered 12 previously unknown connections between compounds in oncology and autoimmune research. Two of these led to fast-tracked Phase I trials, potentially saving the company 18–24 months of discovery time.
Search time for prior work dropped from an average of 3.2 hours per day to under 15 minutes. More importantly, researchers reported a qualitative shift: they now started projects by exploring the graph, which led to more collaborative and cross-functional research initiatives.


