Database Schema Documentation
Overview
The BGC Viewer uses a SQLite database to index and efficiently search antiSMASH JSON files. The schema is designed to support:
- Fast random access to specific records using byte positions
- Efficient file and record indexing
- Flexible attribute searching for record annotations and feature qualifiers
- Compact storage with minimal redundancy
Schema Structure
Table: metadata
Stores global database metadata.
| Column | Type | Description |
|---|---|---|
key | TEXT | Metadata key (PRIMARY KEY) |
value | TEXT | Metadata value |
Populated with:
version: Package version of bgc-viewer used to create the databasedata_root: Absolute path to the data directory containing the JSON filescreation_date: ISO-formatted timestamp of database creationmodified_date: ISO-formatted timestamp of last database modification
Source in JSON: Not from JSON files - generated during preprocessing
Table: files
Stores file paths relative to the data root.
| Column | Type | Description |
|---|---|---|
id | INTEGER | Auto-incrementing primary key |
path | TEXT | File path relative to data_root (UNIQUE) |
Populated with:
- The relative path of each JSON file processed (e.g.,
NC_003888.3.json,subdir/sample.json)
Source in JSON: Not from JSON - derived from file system path
Indexes:
idx_files_pathonpath
Table: records
Stores record-level information and byte positions for fast random access.
| Column | Type | Description |
|---|---|---|
id | INTEGER | Auto-incrementing primary key |
file_id | INTEGER | Foreign key to files.id |
record_id | TEXT | Record identifier from JSON |
byte_start | INTEGER | Starting byte position of record in file |
byte_end | INTEGER | Ending byte position of record in file |
Populated with:
file_id: Link to the file in thefilestablerecord_id: From JSON pathrecords[].idbyte_start,byte_end: Calculated by scanning the JSON file to find record boundaries
Source in JSON:
{
"records": [
{
"id": "NC_003888.3", // → record_id
...
}
]
}Constraints:
- FOREIGN KEY to
files(id)ON DELETE CASCADE
Indexes:
idx_records_file_idonfile_ididx_records_record_idonrecord_id
Table: attributes
Stores searchable attributes extracted from record annotations and feature qualifiers.
| Column | Type | Description |
|---|---|---|
record_id | INTEGER | Foreign key to records.id |
attribute_name | TEXT | Name/key of the attribute |
attribute_value | TEXT | Value of the attribute (stored as text) |
Populated with:
Attributes are extracted from two sources within each record:
1. Record Annotations:
record_id: Links torecords.idattribute_name,attribute_value: From flattenedannotationsobject
Source in JSON:
{
"records": [
{
"id": "NC_003888.3",
"annotations": {
"molecule_type": "DNA", // → attribute_name='molecule_type', attribute_value='DNA'
"organism": "Streptomyces coelicolor A3(2)", // → attribute_name='organism', attribute_value='Streptomyces coelicolor A3(2)'
"topology": "linear" // → attribute_name='topology', attribute_value='linear'
}
}
]
}2. Feature Qualifiers:
record_id: Links torecords.id- All key-value pairs from feature
qualifiersobjects (excludingtranslationand values over 100 characters)
Source in JSON:
{
"records": [
{
"features": [
{
"type": "CDS",
"qualifiers": {
"locus_tag": "SC_RS00001", // → attribute_name='locus_tag', attribute_value='SC_RS00001'
"product": "hypothetical protein", // → attribute_name='product', attribute_value='hypothetical protein'
"db_xref": ["GeneID:123", "PF00001"] // → Multiple rows with attribute_name='db_xref'
}
}
]
}
]
}Note: The translation qualifier (which contains long amino acid sequences) is excluded to save space. Attribute values longer than 100 characters are also excluded.
Constraints:
- UNIQUE(
record_id,attribute_name,attribute_value) - prevents duplicate entries - FOREIGN KEY to
records(id)ON DELETE CASCADE
Indexes:
idx_attributes_record_idonrecord_ididx_attributes_nameonattribute_nameidx_attributes_valueonattribute_valueidx_attributes_name_valueon(attribute_name, attribute_value)
Entity Relationships
metadata (standalone - global settings)
files (1)
└── records (many)
└── attributes (many) [record_id references records.id]Data Flattening Rules
Complex nested structures in JSON are flattened into multiple attribute rows:
Arrays
Each item in an array creates a separate row with the same attribute_name:
"db_xref": ["GeneID:123", "PF00001"]Creates two rows:
attribute_name='db_xref',attribute_value='GeneID:123'attribute_name='db_xref',attribute_value='PF00001'
Nested Objects
Keys are concatenated with underscores:
"subregion": {
"category": "biosynthetic"
}Creates:
attribute_name='subregion_category',attribute_value='biosynthetic'
Query Examples
Get all records for a file:
SELECT r.*
FROM records r
JOIN files f ON r.file_id = f.id
WHERE f.path = 'NC_003888.3.json';Search for records with a specific attribute value:
SELECT DISTINCT r.*
FROM records r
JOIN attributes a ON a.record_id = r.id
WHERE a.attribute_value LIKE '%biosynthetic%';Search for records by attribute name and value:
SELECT DISTINCT r.*
FROM records r
JOIN attributes a ON a.record_id = r.id
WHERE a.attribute_name = 'product' AND a.attribute_value = 'T1PKS';Get all unique attribute names:
SELECT DISTINCT attribute_name
FROM attributes
ORDER BY attribute_name;Count records per file:
SELECT f.path, COUNT(r.id) as record_count
FROM files f
LEFT JOIN records r ON r.file_id = f.id
GROUP BY f.id, f.path;Performance Considerations
Byte Positions: The
byte_startandbyte_endcolumns in therecordstable enable fast random access to specific records without parsing the entire JSON file.Indexes: Comprehensive indexes on frequently queried columns ensure fast lookups and filtering.
Compact Storage: By excluding the
translationqualifier and limiting attribute values to 100 characters, the database remains compact while retaining searchable data.Normalized Structure: The simple three-table structure (
files,records,attributes) provides efficient joins while avoiding the complexity of a fully polymorphic design.Unique Constraints: The UNIQUE constraint on
(record_id, attribute_name, attribute_value)prevents duplicate entries and reduces database size.Foreign Keys with Cascade: Deleting a file automatically removes all associated records and attributes, maintaining referential integrity.
Schema Version
This schema is used by bgc-viewer version 0.3.0 and later. The schema version can be checked via:
SELECT value FROM metadata WHERE key = 'version';