A leading European pharmaceutical group had accumulated over 25 years of R&D data across siloed databases, legacy file systems, and institutional knowledge locked in retired employees' notebooks. The challenge: make this data searchable, connected, and actionable for current research teams.

The problem

The company's research data was spread across 14 distinct systems — from Oracle databases storing clinical trial results to SharePoint sites holding patent filings, and even scanned PDFs of handwritten lab notes from the early 2000s. Researchers spent an average of 3.2 hours per day just searching for prior work before starting new experiments.

The company estimated that 40% of its exploratory research unknowingly duplicated work already done internally. The cost was not just time: it also meant missed opportunities for cross-pollination between therapeutic areas.

Our approach

We designed a three-phase pipeline to ingest, link, and visualize the data:

1. Ingestion & normalization: ETL pipelines to extract structured data from databases and unstructured data from documents using OCR and NLP
2. Entity resolution & linking: Graph construction using Neo4j, linking compounds, researchers, publications, patents, and trial outcomes
3. Interactive visualization: A custom web interface allowing researchers to explore the knowledge graph visually
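
As a rough illustration of the normalization phase, the sketch below maps records from two differently shaped sources into one common structure before linking. The `NormalizedRecord` fields, the source names, and the input field names (`COMPOUND_ID`, `TRIAL_RESULT`, `compound_mention`, `text`) are hypothetical and not taken from the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedRecord:
    """Common shape for every ingested item (hypothetical fields)."""
    source: str        # which system the record came from
    record_type: str   # e.g. "trial_result" or "lab_note"
    compound: str      # compound identifier as written in the source
    text: str          # free text kept for later NLP and linking


def clean(value: str) -> str:
    """Collapse whitespace runs and trim; OCR output is often ragged."""
    return " ".join(value.split())


def from_database_row(row: dict) -> NormalizedRecord:
    # Assumed column names for a structured clinical-trial table.
    return NormalizedRecord(
        source="oracle",
        record_type="trial_result",
        compound=clean(row["COMPOUND_ID"]),
        text=clean(row["TRIAL_RESULT"]),
    )


def from_ocr_page(page: dict) -> NormalizedRecord:
    # Assumed output of an upstream OCR step: raw text plus a detected mention.
    return NormalizedRecord(
        source="scanned_pdf",
        record_type="lab_note",
        compound=clean(page["compound_mention"]),
        text=clean(page["text"]),
    )


records = [
    from_database_row({"COMPOUND_ID": " CMP-0042 ", "TRIAL_RESULT": "Phase I  complete"}),
    from_ocr_page({"compound_mention": "cmp 42", "text": "yield  low,\nretry at 40C"}),
]
```

Note that the two records still carry different spellings of the compound identifier; reconciling those is the job of the entity resolution phase, not of normalization.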

The entity resolution step was particularly challenging. Chemical compound names had evolved over the decades, and the same molecule could appear under 5–7 different identifiers. We used a combination of fuzzy matching and domain-specific embeddings:

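from sentence_transformers import SentenceTransformer
import numpy as np

# Load SciBERT once at import time; loading is slow and the model is reusable.
model = SentenceTransformer("allenai/scibert_scivocab_uncased")

def find_similar_compounds(query: str, corpus: list[str], threshold: float = 0.82):
    # L2-normalize the embeddings so the dot product below is cosine similarity;
    # without this, raw dot products are unbounded and the threshold is meaningless.
    query_emb = model.encode([query], normalize_embeddings=True)
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    similarities = np.dot(corpus_emb, query_emb.T).flatten()
    # Keep only names scoring above the threshold, best match first.
    matches = [(name, float(score)) for name, score in zip(corpus, similarities) if score > threshold]
    return sorted(matches, key=lambda x: x[1], reverse=True)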
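
The function above covers only the embedding half of the combination. For the fuzzy-matching half, a minimal sketch using Python's standard-library `difflib` might look like the following. The `normalize` helper, the 0.85 cutoff, and the example identifiers are illustrative assumptions, and the original project may well have used a different string-similarity method.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'CMP-0042' and 'cmp 0042' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def fuzzy_score(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized identifiers."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_candidates(query: str, corpus: list[str], cutoff: float = 0.85):
    """Return (name, score) pairs at or above the cutoff, best first."""
    scored = [(name, fuzzy_score(query, name)) for name in corpus]
    return sorted(
        [(name, score) for name, score in scored if score >= cutoff],
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical identifiers for illustration only.
candidates = fuzzy_candidates("CMP-0042", ["cmp 0042", "CMP0042A", "XYZ-9000"])
```

One plausible way to combine the two signals is to treat a pair as a candidate match when either score clears its threshold and then send borderline pairs to a domain expert for review. String similarity tends to catch spelling and formatting variants, while embeddings can catch renamed compounds that share little surface text.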

Graph schema

The final knowledge graph contained over 2.3 million nodes and 18 million relationships. The core schema connected the primary entity types through five relationship patterns:

(Researcher)-[:AUTHORED]->(Publication)
(Publication)-[:REFERENCES]->(Compound)
(Compound)-[:TESTED_IN]->(ClinicalTrial)
(ClinicalTrial)-[:FILED_AS]->(Patent)
(Patent)-[:OWNED_BY]->(TherapeuticArea)
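
To make the schema concrete, the sketch below turns each pattern into a parameterized Cypher `MERGE` statement of the kind that could be sent to Neo4j. The `id` property and the parameter names `$src_id` and `$dst_id` are assumptions; the source does not describe node properties or how the graph was actually loaded.

```python
# The five schema patterns as (source label, relationship type, target label).
SCHEMA = [
    ("Researcher", "AUTHORED", "Publication"),
    ("Publication", "REFERENCES", "Compound"),
    ("Compound", "TESTED_IN", "ClinicalTrial"),
    ("ClinicalTrial", "FILED_AS", "Patent"),
    ("Patent", "OWNED_BY", "TherapeuticArea"),
]


def merge_statement(src: str, rel: str, dst: str) -> str:
    """Build an idempotent Cypher statement linking two nodes by a hypothetical `id` key."""
    return (
        f"MERGE (a:{src} {{id: $src_id}}) "
        f"MERGE (b:{dst} {{id: $dst_id}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )


statements = [merge_statement(*triple) for triple in SCHEMA]

# Distinct node labels that appear anywhere in the schema.
node_labels = sorted({label for src, _, dst in SCHEMA for label in (src, dst)})
```

Labels and relationship types cannot be passed as Cypher parameters, which is why they are interpolated from a fixed, trusted list while the identifiers stay parameterized. Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes or relationships.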

Results

Within three months of deployment, the platform uncovered 12 previously unknown connections between compounds in oncology and autoimmune research. Two of these led to fast-tracked Phase I trials, potentially saving the company 18–24 months of discovery time.

The search time for prior work dropped from 3.2 hours to under 15 minutes. More importantly, researchers reported a qualitative shift: they now started projects by exploring the graph, leading to more collaborative and cross-functional research initiatives.

Unlocking 25 Years of R&D Data through graph visualization
