Engineering with purpose: How JSTOR Seeklight combines AI and human expertise to transform digital stewardship

May 30, 2025

A scanned document with handwritten notes and phone numbers on a brownish background. The phrase "Original to Xerox" is written prominently in large red marker. Various annotations in black ink surround the text, including names, phone numbers, and box outlines. In the upper-right corner, there's a badge with red sparkle icons and the caption "Generate metadata with JSTOR Seeklight."

Two and a half years after ChatGPT fundamentally reshaped our conversations around generative AI, there’s no shortage of new tools, bold ideas, and proposed applications. But applying AI to build genuinely effective problem-solving tools in specialized fields—such as digital archives and special collections—demands more than off-the-shelf technology.

In this post, I’ll share what we’ve learned so far in developing JSTOR Seeklight, an AI-powered tool that is purpose-built for, and shaped by, archivists and special collections librarians to address the unique challenges they face today.

JSTOR Seeklight doesn’t just “use AI”—it strategically blends practitioner insights, thoughtful technology, and advanced engineering to activate collections and empower those who steward them.

Purpose-built, community-shaped

Ever since JSTOR began digitizing academic journals decades ago, we’ve consistently partnered with our community to explore how emerging technologies can be used to make knowledge more accessible. Over the past few years, hundreds of conversations with librarians and archivists around the world have revealed that most archives today grapple with overwhelming and ever-growing backlogs of primary source materials. These under-processed collections, lacking robust description or digitization, are practically undiscoverable by researchers and risk being excluded from today’s predominantly digital scholarly discourse. It’s not merely a resource issue—it’s about efficiently and effectively surfacing insights held in these dormant collections at scale.

Our goal in exploring AI applications for digital stewardship was clear from the outset: thoughtfully addressing these challenges without compromising archival integrity or human expertise.

We knew that simply adding AI to the mix wouldn’t suffice; we needed to intervene strategically in existing workflows, complementing rather than replacing human expertise.

Those hundreds of initial conversations were just the beginning. Actually building something valuable requires ongoing collaboration, not only with seasoned professionals but also with student workers and generalists, to ensure usability across the entire archival workforce.

As a starting point, we focused on the creation of descriptive metadata: the information that describes the content of material and aids in its discovery. This descriptive text is incredibly time-consuming to produce, whether through many hours of manual effort by a trained professional whose time may be better spent elsewhere, or through extensive training to prepare student or contingent workers. We saw this widespread processing bottleneck as an essential step in the stewardship process, and one ripe for rethinking.

Beyond AI: Intelligent classification, prompts, and a sophisticated pipeline

Yes, anyone can log into ChatGPT and experiment with automating aspects of workflows on their own. We believe that individuals doing exactly that, library professionals included, will build the AI literacy they need to better evaluate emerging tools. However, JSTOR Seeklight offers something fundamentally different. It’s engineered specifically for scale and built around archival realities.

With off-the-shelf AI tools, it’s easy to upload anything and use prompts to engage in dialogue about its contents. However, the results depend heavily on prompt specificity, a challenge compounded in specialized contexts. For archival content, you would need to be mindful of many factors to use AI productively, from the parameters for the information you’re requesting to the relevance of file types.

With JSTOR Seeklight, users also begin by uploading content—but from there, we’ve developed a nuanced, multi-layered pipeline integrating JSTOR-developed technologies with advanced AI.

Here are the basic steps:

Intelligent classification: Every uploaded item is first classified according to format, file type, layout, quality, subject matter, and more. This sets the stage for precise, tailored metadata creation. Our classification system draws on multiple technologies—optical character recognition (OCR), handwriting detection, file analysis—to apply AI only when and where it’s most useful.
Refined prompt engineering: Based on how an uploaded item is classified, JSTOR Seeklight selects the custom prompts and instructions suited to that type of item and sends them to the large language model (LLM) to generate meaningful data.
Context-aware post-processing: These bespoke prompts guide the model to generate descriptive metadata grounded in archival standards like Dublin Core. We’ve refined our prompt engineering to coach the AI to “think” like an archivist, not just output words that sound descriptive. The goal is discoverability, not just completion.
Human-centered design: Crucially, the output from JSTOR Seeklight is not the final step. Instead, the Stewardship platform incorporates comprehensive review and editing tools, clearly identifies AI-generated metadata, and facilitates efficient human oversight.

This purposeful, layered process ensures the resulting metadata meets archival standards, is evaluated by humans, and can meaningfully support research and teaching. It also sets the foundation for future enhancements, like linking archival content dynamically to broader institutional and scholarly contexts.
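
To make the first two steps more concrete, here is a minimal Python sketch of classification-driven prompt selection. It is an illustration only, not JSTOR Seeklight's actual code: the ItemProfile fields, the category names, and the prompt wording are all hypothetical, and the real classifier considers many more signals than the two flags used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemProfile:
    """Hypothetical result of classifying one uploaded item."""
    file_type: str          # e.g. "pdf" or "tiff"
    has_handwriting: bool   # assumed output of a handwriting detector
    has_ocr_text: bool      # assumed: OCR found machine-readable text


# Hypothetical prompt templates, one per item category.
PROMPTS = {
    "handwritten": "Describe this handwritten item using Dublin Core fields.",
    "typed_text": "Describe this typed document using Dublin Core fields.",
    "image": "Describe this visual item using Dublin Core fields.",
}


def categorize(profile: ItemProfile) -> str:
    """Map a classification result to a prompt category (illustrative rules)."""
    if profile.has_handwriting:
        return "handwritten"
    if profile.has_ocr_text:
        return "typed_text"
    return "image"


def select_prompt(profile: ItemProfile) -> str:
    """Pick the prompt that would be sent to the LLM for this item."""
    return PROMPTS[categorize(profile)]


if __name__ == "__main__":
    note = ItemProfile(file_type="tiff", has_handwriting=True, has_ocr_text=False)
    print(select_prompt(note))
```

The point of the sketch is the routing itself: the prompt is chosen from what the classifier found, rather than written ad hoc for each upload.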

Human-centered with AI in-the-loop

The archival professionals we work with help us keep robust quality control mechanisms front-of-mind. JSTOR Seeklight doesn’t automate away human expertise—it strategically empowers it. By handling routine metadata tasks, it frees archivists to apply their specialized knowledge more effectively. It provides a high-quality initial pass of clearly labeled AI-generated metadata, enabling streamlined human review and validation.

Practitioner feedback continually shapes the practical details of this approach. While some refer to this as “human-in-the-loop” architecture, recent conversations advocate for framing it as “AI in-the-loop” instead.

I believe that JSTOR Seeklight exemplifies this subtle difference: AI is carefully integrated into established archival workflows, enhancing rather than redefining the archivist’s role, and respecting archival values as nonnegotiable.

One decision we’ve made highlights this commitment: JSTOR Seeklight deliberately leaves metadata fields blank when information is insufficient. Though counterintuitive from a pure AI standpoint, this respects the rigorous standards of archival integrity, avoiding speculative or unfounded metadata. We prioritize accuracy, trust, and usability above automation for automation’s sake.
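
As a rough illustration of that policy, the Python sketch below leaves a field empty unless the model returned a value with sufficient support. Everything here is an assumption made for the example: the confidence score, the 0.8 threshold, and the FieldValue structure are hypothetical, and the field names are simply a handful of Dublin Core elements.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# A small subset of Dublin Core elements, chosen for illustration.
DUBLIN_CORE_FIELDS = ("title", "creator", "date", "description", "subject")


@dataclass
class FieldValue:
    value: Optional[str]   # None means the field is deliberately left blank
    ai_generated: bool     # label shown to the human reviewer


def finalize(
    candidates: Dict[str, Tuple[str, float]],
    min_confidence: float = 0.8,  # hypothetical threshold
) -> Dict[str, FieldValue]:
    """Keep a candidate only when it is non-empty and well supported."""
    record = {}
    for name in DUBLIN_CORE_FIELDS:
        text, confidence = candidates.get(name, ("", 0.0))
        text = text.strip()
        if text and confidence >= min_confidence:
            record[name] = FieldValue(value=text, ai_generated=True)
        else:
            # Insufficient information: leave blank rather than speculate.
            record[name] = FieldValue(value=None, ai_generated=False)
    return record


if __name__ == "__main__":
    draft = finalize({
        "title": ("Original to Xerox", 0.95),
        "creator": ("Unknown hand", 0.30),
    })
    for name, field in draft.items():
        print(name, field)
```

In this sketch a blank field is an explicit outcome, not a failure: the reviewer sees which values came from the AI and which fields still need human judgment.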

Integrated and built to evolve

Because JSTOR Seeklight is part of JSTOR Digital Stewardship Services, it doesn’t live in a silo. It’s connected to our cloud-based digital asset management system, integrates with Portico for long-term preservation, and supports sharing content to JSTOR for unparalleled discovery. That connection matters.

JSTOR Seeklight doesn’t just help you create metadata—it facilitates continuous and dynamic interaction with archival content, enhancing collection management, preservation, and scholarly usability throughout the collection’s entire lifecycle.

Collaboration with librarians, archivists, and metadata specialists continuously guides our roadmap. We’re still in the early days of AI, and much of what JSTOR Seeklight may be able to do in the future will depend on the rapid innovation in this space. While we closely track technological breakthroughs, our primary focus remains firmly on delivering tangible value today—expanding to support additional formats, generating richer insights, and continually improving the accuracy and usability of our AI-generated metadata.

By successfully meeting the core needs of our early adopters, we aim to earn community confidence that allows exploration of more ambitious features, such as automating workflows through intelligent agents, uncovering hidden connections to institutional research and curricular needs, and surfacing archival insights relevant to contemporary issues.

No matter what, transparency and continuous learning will remain central to our approach. As we explore new AI capabilities, we commit to doing so together with you—and to openly sharing our findings. This fosters community trust, accelerates innovation, and ensures JSTOR Seeklight evolves in tandem with practitioner needs.

Ultimately, we’re not building technology for its own sake. We’re engineering it to serve the work of stewardship, to make hidden knowledge more visible, and to ensure historical perspectives remain prominent, accessible, and influential within ongoing scholarly discussions.

Without effective digital stewardship, these valuable historical voices risk fading from view, losing their power to meaningfully inform our contemporary understanding.

Interested in shaping the future of AI-assisted stewardship? Learn more about JSTOR Seeklight and the integrated JSTOR Digital Stewardship Services platform.

About the author

Syed Amaanullah is a Senior Product Manager at ITHAKA, where he leads the strategic product development of AI-driven solutions within JSTOR Digital Stewardship Services. With over 12 years of experience in leading teams and building innovative ed-tech products, he has deep expertise in leveraging the capabilities of technology to drive meaningful outcomes for the academic community. He is proud to help advance ITHAKA’s mission to expand access to knowledge and education.