OCUL launched its AI and Machine Learning (AIML) Program in the summer of 2024, starting with five pilot projects. The program’s pilot projects were chosen from community-developed use cases and designed to create reusable tools and techniques that can be transformed and adapted for future use. As the projects get underway, OCUL members will be invited to participate in the exploration and evaluation of these new technologies.
The AIML Program is led by Program Director Catherine Steeves and staff from OCUL and Scholars Portal. AIML Technical Manager Pieter Botha will soon be joined by a Program Manager and AI Special Projects Developer. This core group will work with OCUL members to support the pilot projects and goals of the program.
The program team has been exploring opportunities to collaborate with the broader higher education and library communities through potential grants and partnerships. Not only can these opportunities extend the capacity of the program and help support project work, but they also ensure Ontario’s university libraries are recognized partners in the evolution of research and teaching within AIML in higher education.
Presentations and community engagement play a significant role in developing collaboration. So far, the program team has created connections with the Canadian Council of Chief Information Officers, the Canadian Association of Research Libraries, the BC Electronic Library Network, the Coalition for Networked Information, and the Open Forum for Artificial Intelligence. In addition, Catherine Steeves and OCUL Executive Director Amy Greenberg joined Emily Tufts (University Librarian) for AI in the Library as part of Trent University's ongoing AI Hopes and Fears online discussion series. Catherine also presented Academic Libraries and Machine Learning: Strategy, Projects, and Capacity Building at the Canadian Higher Education Information Technology conference.
In the coming months, the program team will create the overarching plan that will build on the OCUL Artificial Intelligence/Machine Learning Report and Strategy to further articulate the program goals and structure, reiterate its guiding principles, and map out the pilot projects. The plan will include a communication strategy, a process for establishing the advisory committees, pilot project activities, coordination and engagement strategies, and updated milestones and timelines.
Find ongoing updates and information about each of the pilot projects on the OCUL AIML Program wiki.
Pilot Projects
The OCUL AIML Program focuses on five projects. A number of these projects run concurrently, and others will begin later in 2025.
Audio to Text
Transcriptions of audio files are an essential element of digital accessibility. This project involves using the Whisper open-access automatic speech recognition system to transcribe audio files, setting up a pipeline through Scholars Portal, and providing access and technical documentation to OCUL member libraries. This will allow members to quickly begin using the system without having to host Whisper locally themselves.
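As an illustration of what a post-processing step in such a pipeline might look like, the sketch below converts timed transcript segments into a WebVTT caption file, a common format for accessible media. It is not the project's actual script. It assumes segments shaped like the ones the open-source Whisper Python package returns under result["segments"] (dictionaries with "start", "end" and "text" keys); the function names and the sample data are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def segments_to_vtt(segments) -> str:
    """Build the text of a WebVTT caption file from timed segments."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(seg["text"].strip())
        lines.append("")  # a blank line separates cues
    return "\n".join(lines)


# Hypothetical sample data, in the shape Whisper uses for its segments.
# With the Whisper package installed, real segments would come from
# something like: whisper.load_model("base").transcribe(path)["segments"]
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the library."},
    {"start": 2.5, "end": 61.04, "text": " Today we discuss open access."},
]
vtt = segments_to_vtt(sample_segments)
```

Keeping the caption formatting separate from the speech recognition call means the same step could be reused if the underlying model or hosting arrangement changes.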
The project team, led by Scholars Portal Associate Director, Systems and Technical Operations Harpinder Singh, is nearing completion of the pipeline, with the development of the Python processing script and the containerization of the entire application. Once complete, the team will test the pipeline on the newly established local graphics processing unit (GPU) ecosystem before making it available to OCUL member libraries for testing. Feedback from OCUL member libraries is crucial to informing next steps and ultimately deploying a high-quality solution. Member libraries interested in participating in this testing can email the project team at aimlprogram@scholarsportal.info.
Government Documents
Through a large-scale digitization project at the University of Toronto, a collection of government documents is available on the Internet Archive, but searchable metadata is often unavailable or limited in quality. This project uses AIML tools to provide enhanced metadata and new discovery tools to increase the accessibility and usability of the collection.
While the initial scope of the project targeted metadata extraction, a need to improve the quality of the optical character recognition (OCR) of the current collection became apparent. The existing text was barely understandable by a human, let alone an AI. Several tools were evaluated to assess the OCR performance on sample data. Two tools showed potential, Marker and GOT OCR 2.0, and continue to be evaluated for their metadata extraction capabilities.
While exploring solutions for improving OCR, the team has also developed a Python script to extract metadata from a sample set of government documents. In the first step, it downloads PDF documents from the Internet Archive that match a list from the sample set. Then, the script extracts images from the PDF files and re-OCRs the contents of the documents. In the next step, the script extracts snippets of text from the documents and embeds the content as vectors in a database for future retrieval-augmented generation applications. Metadata is then extracted by using large language models to analyze the content, and is stored in a database. Finally, the output is converted into easy-to-process formats like JSON and CSV for further processing and evaluation.
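Two of the steps described above, splitting text into snippets and exporting results, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the team's script: the character-window chunking strategy, the function names, the field names and the sample record are all hypothetical, and the download, OCR, embedding and language model steps are left out.

```python
import csv
import io
import json


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for later embedding."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def to_json(records: list[dict]) -> str:
    """Serialize metadata records as JSON, keeping accented characters."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records: list[dict], fields: list[str]) -> str:
    """Serialize metadata records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record standing in for metadata extracted by a language model.
records = [
    {"identifier": "example-doc-001", "title": "Rapport annuel", "language": "fr"},
]
snippets = chunk_text("abcdefghij", size=4, overlap=1)
csv_text = to_csv(records, ["identifier", "title", "language"])
json_text = to_json(records)
```

Overlapping windows are one simple way to avoid cutting a sentence off from its context at a snippet boundary. Splitting on sentences or document structure is a common alternative.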
The Government Documents project is gradually expanding the scope of pilot documents, ensuring representation of various formats, lengths, images and tables, and both official languages. The project is also testing various models and strategies to extract metadata for different kinds of documents and use cases. Upcoming research focuses on text embedding techniques, information extraction with vision models, and direct image embedding without OCR/text extraction.
Along with Pieter Botha as technical lead and Jacqueline Whyte Appleby, Scholars Portal Associate Director, the project is supported by University of Toronto librarians with a keen interest in the intersection between AI technology and metadata. A project charter is being developed and will be available on the project’s wiki space.
Enhancing Virtual Reference
Ask a Librarian is the Scholars Portal bilingual virtual reference service that connects students, faculty and researchers with real-time library and research assistance through online chat. LibraryH3lp, the software vendor, is developing a chatbot. This pilot project involves analyzing and testing the tool, as well as exploring the impact and value of a chatbot service for virtual reference.
A project charter that structures and guides the project has been approved by the OCUL Executive Committee. The charter calls for two working groups. One, led by Scholars Portal, will test LibraryH3lp’s chatbot, provide feedback, assist with the development, and ultimately evaluate the tool’s suitability for use by OCUL. The other, to be led by the Program Manager, will synthesize expert knowledge and issues related to the use of AI in reference service, share the results of OCUL community consultations and the output of the beta testing working group, and ultimately make a recommendation to OCUL regarding LibraryH3lp AI tool integration. Membership in the two groups will not be mutually exclusive, and they are expected to work closely together. A call for working group membership will be shared with OCUL member libraries in February 2025.
To date, the Scholars Portal Ask a Librarian team has been liaising with LibraryH3lp regarding the development of their two planned AI tools: the chatbot and the chat transcript anonymization and analysis pipeline. The chatbot, which relies on both an institution-specific knowledge base and GPT-4, is in the alpha testing phase. Currently it is staff-facing only; the chatbot is intended to act as a co-pilot to help human chat operators answer questions from students at other institutions, although an end user-facing bot is in the LibraryH3lp development plan. Three OCUL member libraries have opted into the alpha testing with LibraryH3lp. These institutions have complete control over which of their web pages are scraped to develop their institution’s knowledge base.
LibraryH3lp is developing their transcript anonymization and analysis pipeline with US partners and will install it on their Canadian servers when it is in beta format. This tool uses natural language processing (NLP) techniques to identify and remove personal information from chat transcripts and will eventually use NLP to identify and categorize topics and question types from a set of transcripts. Currently, any work to remove personal information or to identify topics or question types must be done manually by the Ask team and local coordinators, which is a very laborious process.
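To give a sense of what automated removal of personal information involves, the sketch below redacts email addresses and North American phone numbers from a chat message. It is a deliberately simple, pattern-based stand-in and not LibraryH3lp's method, which is described only as using NLP techniques. The patterns, placeholder labels and sample message are assumptions made for illustration, and patterns like these will miss names, student numbers and other identifiers.

```python
import re

# Illustrative patterns only; real transcripts need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical chat message.
message = "Email me at jane.doe@example.org or call 416-555-0199."
cleaned = redact(message)
```

Replacing each match with a labelled placeholder, rather than deleting it, keeps the transcript readable for later topic and question-type analysis.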
ACE Remediation and Summaries
The Accessible Content E-Portal – known as ACE – is a Scholars Portal service that provides access to print library books in digitized formats. To increase the usability of ACE, this project assesses the variety of AIML tools available to assist with document remediation and explores using open-source tools to create book/chapter summaries and add them to the item’s metadata.
While this project has not officially started, much of the work being done for the Government Documents project around improving OCR, document structure, and metadata generation will also be applicable and help with the initial stages of this accessibility project.
Capacity Building
This project will take an evidence-based approach to understanding several facets of the professional development and training needed to evolve libraries’ academic and research services, workflows, and professional practice for the generative AI era.
The focus of this work to date has been exploring engagement strategies and tools to understand the needs of OCUL member libraries and how the program can support these needs. The OCUL AIML Program Director and Program Manager will establish an engagement plan and work with the AIML Training Coordination Committee to advance this project in coming months. This capacity building work runs in parallel to the other pilot projects and is at the centre of what the AIML Program aims to achieve: the promotion of responsible, ethical AIML use in the academic library environment while building related knowledge and skills across the OCUL membership and beyond. Next steps include a call for membership from OCUL member libraries for the AIML Training Coordination Committee and the scoping of future work.
In the meantime, those attending the 2025 OLA Super Conference can join a conversation with the OCUL AIML team about capacity building for AI in academic libraries on Friday, January 31 at 10:45 a.m.
Infrastructure
Scholars Portal has identified and put in place infrastructure to support the work of the AIML Program projects. During the initial phase of the program, it quickly became clear that the existing technology infrastructure was not enough to move the projects forward. To overcome this hurdle, Scholars Portal assembled an AI machine using off-the-shelf components, creating a novel graphics processing unit (GPU) environment dubbed FORGE: Future-Oriented Reliable GPU Ecosystem.
With FORGE, OCUL and Scholars Portal can now run large language models locally, which has significantly expanded their capacity for experimentation. Although FORGE is still in the initial stages of setup, the experience of working with it thus far has provided invaluable insights into optimizing systems.
Scholars Portal has also recently partnered with the University of Toronto Libraries to acquire an enterprise-grade GPU server that allows for running larger models and concurrent applications. In the coming months, the Whisper audio-to-text pipeline will be deployed on this new hardware and the Government Documents project will leverage this tech infrastructure to assess its capabilities.
For More Information
Learn more about the OCUL AIML Program, the pilot projects, and the work of the committees on SPOTDocs.
Questions about the program can be sent to aimlprogram@scholarsportal.info.