Next-Generation Metadata Catalogue: Revolutionising Data Discovery in the Scottish Public Sector
September 3, 2024 by Stewart Hamilton | Category Data, Digital Scotland
Blog by Masood Alam, Chief Data Architect, and Catherine Ojo, Lead Data Scientist, from the Scottish Government’s Data Division.
At a time when data drives decision-making, the Scottish public sector is making a significant move with the prospective development of a state-of-the-art Federated Data Catalogue and Metadata Repository that uses machine learning and graph analytics. This pioneering project aims to revolutionise how public sector organisations manage, discover, and integrate data, setting a new benchmark for data accessibility and use.
The Problem Statement
The data architecture team has found that while there are some metadata catalogues in place across the organisation, there is a need for a more robust solution to better manage data governance and data management across different platforms.
Like many governments and large organisations, the Scottish Government faces challenges in managing data effectively. We lack a complete overview of our data: where it is held, who owns it, and what insights it might offer. Even when data is accessible, its metadata is rarely documented, which limits the use of AI, advanced analytics, and evidence-based decision-making.
To tackle this, the data architecture team worked with Gartner to understand trends in metadata management and maturity. According to Gartner’s “Metadata Management Technology Maturity” model, organisations range from Level 0 (Unaware) to Level 5 (Augmented):
- Level 0 (Unaware): No standards, uncoordinated, project-based.
- Level 1 (Inventory): Minimal data collection, “as is” accepted, separate tools.
- Level 2 (Catalog): Technical descriptors, some data lineage, coordinated business descriptors, scheduled updates.
- Level 3 (Proactive): Resolves critical assets, multiple definitions, technical taxonomy, trend analysis.
- Level 4 (Active): Uses machine learning for profiling, content analysis, clustering, resource allocation metrics, and alerts.
- Level 5 (Augmented): Machine learning by example, orchestrates recommendations and responses, infers new assets from use cases.
Organisations without a metadata catalogue are typically at Level 0 or 1, highlighting the need for major improvements in data management to fully realise the value of their data and make better decisions. With the introduction of this catalogue, we aim to support many organisations in reaching Levels 4 and 5, where advanced metadata management and automation can significantly improve data discovery, governance, and overall effectiveness.
What is the Big Idea?
We aim to create the public sector’s first self-service, plug-and-play data catalogue. Using AI and knowledge graphs, we want to streamline data handling, enhance decision-making, and build a more flexible, data-driven public sector.
What’s Under the Hood?
Our catalogue will feature cutting-edge technology:
- Automatic Metadata Cataloguing & Generation: Utilising pre-trained large language models for summarising text to generate metadata, machine learning to profile structured data and identify entities with different names, and graph modelling for continuous catalogue improvement.
- Active Data Discovery: Using knowledge graph models and other tools to detect and suggest relationships between datasets dynamically.
- Automated Data Tagging: Generating dataset synonyms automatically with pre-trained large language models for data classification and knowledge graphs.
- DCAT-3 Metadata Standards: Implementing the latest version of the W3C Data Catalog Vocabulary (DCAT) to enable seamless metadata sharing between organisations.
Key proof of concept features
We plan to develop four key features as a proof of concept (PoC) to create a practical demonstrator with our vendor and trial partners.
1.1 Entity Recognition & Mapping
The system will include an entity recognition and mapping module that uses multi-modal large language models (LLMs) to automatically detect and suggest relationships across datasets.
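As a rough illustration of what suggesting relationships across datasets could look like, the sketch below compares column names from two datasets and proposes likely matches. It uses plain string similarity from Python's standard library as a stand-in for the language models the project intends to use. The column lists, the `suggest_mappings` function and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a column name and drop separators, so 'Post_Code' becomes 'postcode'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mappings(cols_a, cols_b, threshold=0.6):
    """Return (col_a, col_b, score) triples whose names look similar.

    String similarity is only a stand-in here; it is not the project's method.
    """
    suggestions = []
    for a in cols_a:
        for b in cols_b:
            score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
            if score >= threshold:
                suggestions.append((a, b, round(score, 2)))
    # Highest-scoring suggestions first
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Hypothetical column lists from two unrelated datasets
housing = ["Post_Code", "council_area", "dwelling_type"]
health = ["postcode", "CouncilArea", "gp_practice"]

for a, b, score in suggest_mappings(housing, health):
    print(f"{a} <-> {b} (similarity {score})")
```

A real module would also need to compare the values held in each column, since columns with different names can describe the same entity.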
1.2 Knowledge Graphs with Semantic Notes
The catalogue will include knowledge graphs that:
- Visualise data relationships.
- Provide semantic notes to explain complex connections.
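One way to picture a knowledge graph with semantic notes is as a set of dataset-to-dataset links, each carrying a short plain-language explanation. The sketch below is a minimal in-memory version under that assumption; the `KnowledgeGraph` class, the relation names and the dataset names are invented for illustration and do not describe the project's actual design.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph: nodes are dataset names, edges carry a semantic note."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, source, target, relation, note):
        """Record a directed relationship and a plain-language note explaining it."""
        self._edges[source].append(
            {"target": target, "relation": relation, "note": note}
        )

    def explain(self, source):
        """Return one readable line per relationship starting at `source`."""
        return [
            f"{source} --{e['relation']}--> {e['target']}: {e['note']}"
            for e in self._edges.get(source, [])
        ]


# Hypothetical datasets and relationships
graph = KnowledgeGraph()
graph.relate("school_rolls", "council_areas", "joins_on",
             "Both datasets identify the local authority by council area code.")
graph.relate("school_rolls", "population_estimates", "comparable_with",
             "Pupil counts can be compared with population estimates for the same year.")

for line in graph.explain("school_rolls"):
    print(line)
```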
1.3 Dataset Classification and Metadata
Machine learning will:
- Automatically classify datasets.
- Generate metadata to enhance data discovery.
This feature will serve as a metadata generation tool, allowing users to provide real-time feedback to improve results.
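The feedback loop described here can be sketched in a few lines. In the example below a simple keyword-overlap scorer stands in for the machine learning model, and a user's correction is folded straight back into the classifier. The `KeywordClassifier` class, the categories and the keywords are hypothetical.

```python
import re


class KeywordClassifier:
    """Scores a dataset description against keyword sets, one set per category."""

    def __init__(self, keywords_by_category):
        self.keywords = {cat: set(words) for cat, words in keywords_by_category.items()}

    @staticmethod
    def _tokens(text):
        # Split a description into lower-case alphabetic words
        return set(re.findall(r"[a-z]+", text.lower()))

    def classify(self, description):
        """Return the best-matching category, or None if nothing matches."""
        tokens = self._tokens(description)
        scores = {cat: len(tokens & words) for cat, words in self.keywords.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    def feedback(self, description, correct_category):
        """Learn from a user's correction by adding the description's words."""
        self.keywords.setdefault(correct_category, set()).update(
            self._tokens(description)
        )


clf = KeywordClassifier({
    "health": ["hospital", "patients", "gp"],
    "education": ["school", "pupils", "teachers"],
})

text = "Waiting times for outpatient clinics"
print(clf.classify(text))      # no keyword overlap yet
clf.feedback(text, "health")   # a user supplies the right category
print(clf.classify("Outpatient waiting times by board"))
```

Adding every word from a corrected description is deliberately crude, since it also learns common words such as "for". A trained model would weight terms rather than treat them equally.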
1.4 Real-Time DCAT-3 Conversion
The system will adopt DCAT-3 standards to improve metadata sharing across organisations, providing richer metadata descriptions.
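To give a sense of what a DCAT description looks like, the sketch below maps an internal catalogue record to a `dcat:Dataset` expressed as JSON-LD. The DCAT and Dublin Core Terms namespaces and property names come from the W3C vocabulary. The shape of the internal record, the `to_dcat` function and the example.org addresses are assumptions made for illustration.

```python
import json

# Namespace prefixes used by DCAT descriptions
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def to_dcat(record: dict) -> dict:
    """Map a hypothetical internal catalogue record to a DCAT dataset in JSON-LD."""
    return {
        "@context": CONTEXT,
        "@id": record["uri"],
        "@type": "dcat:Dataset",
        "dct:title": record["name"],
        "dct:description": record["summary"],
        "dct:issued": record["published"],
        "dcat:keyword": record["tags"],
        "dcat:version": record["version"],  # versioning property added in DCAT 3
        "dcat:distribution": [
            {"@type": "dcat:Distribution", "dcat:accessURL": {"@id": url}}
            for url in record["download_urls"]
        ],
    }


# Hypothetical internal record; field names are not taken from any real system
internal_record = {
    "uri": "https://example.org/dataset/school-rolls",
    "name": "School rolls by council area",
    "summary": "Annual pupil counts for publicly funded schools.",
    "published": "2024-09-03",
    "tags": ["education", "schools"],
    "version": "1.0",
    "download_urls": ["https://example.org/files/school-rolls.csv"],
}

print(json.dumps(to_dcat(internal_record), indent=2))
```

Because the output is ordinary JSON-LD, any other catalogue that understands DCAT could read it without knowing the internal field names.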
Federated Approach and UK-Wide Integration
Our proposal for a federated approach to metadata search aligns with the UK Government’s plans for a National Data Library. By developing APIs to link our Scottish metadata repository with the UK Government, we will contribute to a National Metadata Repository, improving data discovery across the public sector.
A federated model brings several benefits:
- Local Control: Devolved governments and agencies can manage data in their own systems.
- Better Discovery: Enables cross-catalogue searches for more comprehensive data finding.
- Controlled Data Sharing: Facilitates data sharing while keeping local governance in place.
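The benefits listed above can be sketched as a small program: each organisation keeps its own catalogue and decides what is shareable, while a federated search simply asks every catalogue in turn and merges the answers. The `Catalogue` class, the `shareable` flag and the example records are hypothetical. In a real deployment each `search` call would be a request to another organisation's API rather than a local method call.

```python
class Catalogue:
    """A stand-in for one organisation's locally managed metadata catalogue."""

    def __init__(self, owner, records):
        self.owner = owner
        self._records = records  # list of {"title": str, "shareable": bool}

    def search(self, term):
        """Return only records the owner has chosen to share that match the term."""
        term = term.lower()
        return [
            {"title": r["title"], "source": self.owner}
            for r in self._records
            if r["shareable"] and term in r["title"].lower()
        ]


def federated_search(catalogues, term):
    """Query every catalogue in turn and merge the results, sorted by title."""
    hits = []
    for catalogue in catalogues:
        hits.extend(catalogue.search(term))
    return sorted(hits, key=lambda h: h["title"])


# Hypothetical catalogues held by two different organisations
agency_a = Catalogue("Agency A", [
    {"title": "Housing completions", "shareable": True},
    {"title": "Housing benefit case notes", "shareable": False},
])
agency_b = Catalogue("Agency B", [
    {"title": "Affordable housing supply", "shareable": True},
])

for hit in federated_search([agency_a, agency_b], "housing"):
    print(f"{hit['title']} ({hit['source']})")
```

The record marked as not shareable never leaves its own catalogue, which is the local governance the federated model is meant to preserve.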
Finally
Phased Implementation
We will begin with a Proof of Concept (PoC) with Scottish public sector organisations. This careful start lets us refine the system and prove its value before larger trials and a full rollout.
Why It Matters
Traditional data catalogues are expensive, inflexible, and require manual input from skilled Data Architects. Our Next-Generation Metadata Catalogue addresses these problems:
- Cost-Effective: No high licensing fees.
- Customisable: Easily tailored to public sector needs.
- Open Source: Prevents vendor lock-in and encourages collaboration.
- Standards-Compliant: Built with DCAT-3 standards for compatibility and future use.
The Future
This project goes beyond creating a data catalogue; it aims to build a more efficient, connected, and data-driven public sector. The Next-Generation Metadata Catalogue will unlock data potential, foster innovation, and improve public services in Scotland and beyond.
Tags: Data, Data architecture, Metadata, scottish government, Scottish Public Sector
Look at using vocbench https://vocbench.uniroma2.it/ and joining in the EU project. https://op.europa.eu/en/web/eu-vocabularies/vocbench
This is affordable, effective, economical, and just right for Scotland.
So it’s not as though this is new to the Scottish public sector. After all, the Scottish Government played a pivotal role in https://www.w3.org/2013/share-psi/ and was also important in developing https://www.w3.org/TR/dwbp/ and https://www.w3.org/TR/vocab-dcat-2/, which is now at Version 3. So what happened? Why were 15 years and more lost? Unless we acknowledge lost time, play catch-up very quickly, and move to the current state of play with effective persistent ID systems, knowledge graphs, and so on, another 15 years will go by with little to show.
Hi there, how does this initiative interrelate with all the other metadata solutions that SG has been building over the years – like spatialdata.gov.scot, statistics.gov.scot and find.data.gov.scot?
The catalogue we are suggesting is an internal metadata catalogue that connects to various systems within an organisation, making all datasets visible and easily accessible using a federated model. It helps analysts, statisticians, and data owners manage, maintain, and discover data assets efficiently.
On the other hand, spatialdata.gov.scot and statistics.gov.scot focus on specific data types, spatial and statistical respectively. They rely on data producers to upload datasets and don’t integrate with internal systems. Meanwhile, find.data.gov.scot serves more as a directory or aggregator of open data, without offering the comprehensive metadata management provided by tools such as Collibra or Informatica.
Features of an Internal Catalogue:
1. Data Discovery and Search: Advanced search with filters for easy dataset discovery.
2. Automated Metadata Ingestion: Integrates with various data sources (live and staging databases) for seamless metadata collection.
3. Data Lineage and Lineage UI: Visualises data flow and dependencies across systems, making it possible to trace failures and understand how data moves.
4. Data Governance and Policy Management: Supports compliance, access control, and policy enforcement.
5. Collaboration Tools: Enables comments, tagging, discussions, notifications, and subscriptions with different stakeholders within the organisation.
6. Data Quality Insights: Provides visibility into the quality and reliability of data, helping administrators to fix problems.
7. Role-Based Access Control (RBAC): Secure, detailed access management for data assets. Roles such as business owner and technical owner can be defined for data down to the column level of a database.
8. Custom Metadata and Extensibility: Allows for custom fields and plugins tailored to organisational needs.
9. Versioning and Data Preview: Tracks changes in datasets and offers previews.
10. Metadata Graph and Open APIs: Uses graph-based representations and APIs for seamless integration with existing systems.
11. Data Standards and Documentation: Supports standardised metadata practices (e.g., DCAT) and thorough documentation for consistent metadata management.
This comparison highlights how the proposed data catalogue provides a comprehensive internal solution for metadata management, while the existing Scottish platforms focus on open data access for public use. We are more focused on internal discovery, governance, management and data-quality improvement.
While there are proprietary solutions such as Collibra and Informatica available on the market, they could cost tens of millions for the entire public sector, and they still wouldn’t cater to our specific needs.
I hope this helps clarify things. We’d be happy to offer you a demo of the basic version we currently have. This will give you a clearer idea of what the product looks like and how it can help improve data maturity, quality, and governance within your organisation.
Mr Masood Alam | Chief Data Architect
Phone: +44 7825 018 241
Linkedin: MasoodAlam
Digital Directorate | Scottish Government
Reply sent by Digital Directorate comms team.
This is a really interesting concept! I’m curious about how you manage different taxonomies and ontologies, especially since these can get quite complex in metadata management. Are you creating custom models for this? I noticed you mentioned a multi-modal approach—how are these models trained to handle active metadata scenarios? I’d love to know more about how you make sure these models stay accurate and relevant, especially when dealing with evolving taxonomies or dynamic, context-driven data.
Thank you for your interest! To manage different taxonomies and ontologies, we’ll be using a combination of custom models and a multi-modal approach. These models will be specifically designed to fit our metadata management needs, trained on diverse data sources to effectively handle active metadata scenarios. We’ll employ machine learning techniques to keep the models adaptive, ensuring they remain accurate and relevant. Regular updates, expert input, and continuous retraining will be key in managing evolving taxonomies and dynamic, context-driven data.