RCSB PDB Sequence Coordinates Server API

The RCSB PDB Sequence Coordinates Server compiles alignments between structural and sequence databases and integrates protein positional features from multiple resources. Alignment data is available for NCBI RefSeq (including protein and genomic sequences), UniProt and PDB sequences. Protein positional features are integrated from UniProt, CATH, SCOPe and RCSB PDB and collected from the RCSB PDB Data Warehouse. The server offers a GraphQL-based application programming interface (API) to access the integrated content.

GraphQL-based API: use the in-browser GraphiQL tool to browse the full schema documentation

GraphQL-based API

The GraphQL server operates on a single URL/endpoint, https://sequence-coordinates.rcsb.org/graphql, and all GraphQL requests for this service should be directed at this endpoint. The GraphQL HTTP server handles the POST method.

POST request

Requests must use HTTP POST with "application/json" as content type and GraphQL request details included as JSON in the request body, as defined in the proposed GraphQL over HTTP specification.
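As a minimal sketch, such a request can be assembled with Python's standard library. The helper name `build_request` and the chosen selection set are illustrative assumptions; the field names (`target_alignment`, `target_id`) are taken from the Data Organization section of this page.

```python
import json
import urllib.request

ENDPOINT = "https://sequence-coordinates.rcsb.org/graphql"

# Illustrative query: UniProt accession P01112 aligned to PDB entities.
QUERY = """
{
  alignment(from: UNIPROT, to: PDB_ENTITY, queryId: "P01112") {
    target_alignment {
      target_id
    }
  }
}
"""


def build_request(query: str) -> urllib.request.Request:
    """Wrap a GraphQL query string in an HTTP POST request (hypothetical helper)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(QUERY)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```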

Variables

In the example above, the query arguments are written inside the query string. The query arguments can also be passed as dynamic values that are called variables. The variable definition looks like ($id: String!) in the example below. It lists a variable, prefixed by $, followed by its type, in this case String (! indicates that a non-null argument is required).

The following is equivalent to the previous query:

Where:

With variable defined like so:

Query variables should be sent as part of the POST request in an additional parameter called variables.

A valid GraphQL POST request should use the application/json content type, must include query, and may include variables encoded as a JSON document in the request body. Here's an example for a valid body of a POST request:
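A minimal Python sketch of such a body follows. It declares one variable, `$id`, of type `String!` and passes it as the `queryId` argument; the helper name `build_body` and the selection set are illustrative assumptions, with field names taken from the Data Organization section.

```python
import json
from typing import Optional

# Parameterised alignment query; $id is declared as a required String.
QUERY = """
query ($id: String!) {
  alignment(from: UNIPROT, to: PDB_ENTITY, queryId: $id) {
    query_sequence
    target_alignment {
      target_id
      aligned_regions {
        query_begin
        query_end
        target_begin
        target_end
      }
    }
  }
}
"""


def build_body(query: str, variables: Optional[dict] = None) -> str:
    """Serialise a GraphQL POST body: `query` is required, `variables` optional."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return json.dumps(payload)


body = build_body(QUERY, {"id": "P01112"})
print(body)
```

Keeping the identifier in `variables` means the query text stays constant across requests and only the JSON values change.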

Response

Regardless of the method by which the query and variables were sent, the response is returned in JSON format. A query might result in some data and some errors. The successful response will be returned in the form of:

Error Handling

Error handling in REST is fairly straightforward: we check the HTTP headers to get the status of a response, and the HTTP status code (for example 200 or 404) tells us what the error is and how to resolve it. The GraphQL server, on the other hand, will always respond with a 200 OK status code. When an error occurs while processing GraphQL queries, the complete error message is sent to the client with the response. Below is a sample of a typical GraphQL error message when requesting a field that is not defined in the GraphQL schema:
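Because the status code alone does not signal failure, a client has to inspect the response body. The sketch below assumes the standard GraphQL response layout (top-level `data` and `errors` keys); the error message text is a hypothetical placeholder, not the server's actual wording, and `unwrap` is an illustrative helper name.

```python
class GraphQLError(Exception):
    """Raised when a GraphQL response carries a non-empty `errors` list."""


def unwrap(response: dict) -> dict:
    """Return the `data` part of a decoded response, or raise on errors."""
    errors = response.get("errors")
    if errors:
        messages = "; ".join(e.get("message", "unknown error") for e in errors)
        raise GraphQLError(messages)
    return response.get("data") or {}


# Mock responses for illustration only.
ok = {"data": {"alignment": {"target_alignment": []}}}
failed = {
    "data": None,
    "errors": [{"message": "Field 'not_a_field' is undefined"}],  # hypothetical text
}

print(unwrap(ok))
try:
    unwrap(failed)
except GraphQLError as exc:
    print("query failed:", exc)
```

This version treats any reported error as fatal. Since a response may contain both data and errors, a more lenient client could instead return the partial data alongside the error list.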

Using GraphQL

GraphQL enables declarative data fetching and lets clients request exactly the data they need. The GraphQL endpoint defines two different queries, one for sequence alignments and one for positional features:

alignment
annotations

Alignment Query

`alignment(from: SequenceReference!, to: SequenceReference!, queryId: String!, range:[Int!])`

from and to parameters codify the origin and target sequence databases, respectively, through a set of enumerated values

The following table describes the type of database identifier used for each SequenceReference value:

| `SequenceReference` | Database Identifier | Example |
|---|---|---|
| `NCBI_GENOME` | NCBI RefSeq Chromosome Accession | NC_000001 |
| `NCBI_PROTEIN` | NCBI RefSeq Protein Accession | NP_789765 |
| `UNIPROT` | UniProt Accession | P01112 |
| `PDB_ENTITY` | RCSB PDB Entity Id / CSM Entity Id | 2UZI_3 / AF_AFP68871F1_1 |
| `PDB_INSTANCE` | RCSB PDB Instance Id / CSM Instance Id | 2UZI.C / AF_AFP68871F1.A |

queryId is a valid identifier in the sequence database defined by from

range is an optional integer list (2-tuple) to filter the alignment to a particular region
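The parameters above can be combined into a query string programmatically. The following Python sketch validates the `from` and `to` values against the SequenceReference table and appends `range` only when it is given; the function name `alignment_query` and the minimal selection set are illustrative assumptions.

```python
import json
from typing import Optional, Tuple

# Enumerated values listed in the SequenceReference table.
SEQUENCE_REFERENCES = {
    "NCBI_GENOME", "NCBI_PROTEIN", "UNIPROT", "PDB_ENTITY", "PDB_INSTANCE",
}


def alignment_query(from_db: str, to_db: str, query_id: str,
                    region: Optional[Tuple[int, int]] = None) -> str:
    """Build an alignment query string (hypothetical helper)."""
    for name in (from_db, to_db):
        if name not in SEQUENCE_REFERENCES:
            raise ValueError(f"unknown SequenceReference: {name}")
    # Enum values are written bare; the identifier is a quoted string.
    args = [f"from: {from_db}", f"to: {to_db}",
            f"queryId: {json.dumps(query_id)}"]
    if region is not None:
        begin, end = region  # the optional 2-tuple
        if begin > end:
            raise ValueError("range begin must not exceed range end")
        args.append(f"range: [{int(begin)}, {int(end)}]")
    return ("{ alignment(" + ", ".join(args)
            + ") { target_alignment { target_id } } }")


print(alignment_query("UNIPROT", "PDB_ENTITY", "P01112", (1, 100)))
```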

Annotations Query

`annotations(reference: SequenceReference!, queryId: String!, sources: [Source!]!, range:[Int!], filters:[FilterInput!])`

reference and queryId indicate the sequence over which annotations will be mapped

reference is defined by the same enumerated values (SequenceReference) used for the from and to parameters of the alignment query
queryId is a valid identifier in the reference database, identifying the sequence for which annotations will be requested

sources array is an enumerated list defining the annotation collections to be requested

range is an optional integer list (2-tuple) to filter annotations that fall in a particular region

filters is an optional array of FilterInput that can be used to select what annotations will be retrieved

operation is an enumerated value (OperationType = contains|equals) that defines the comparison method

field is an enumerated value (FieldName = target_id|type) that defines the field to be compared

values is the list of allowed values

source limits the filter to features with the same Source
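The annotations arguments can also be supplied as GraphQL variables. The Python sketch below copies the type names (`SequenceReference`, `Source`, `FilterInput`) from the signature above. Several details are assumptions to be checked against the schema in GraphiQL: the Source value `UNIPROT`, the lowercase spelling of the `operation` and `field` enum values (taken from the text above), and the helper names `make_filter` and `annotation_variables`.

```python
from typing import Iterable, Optional, Tuple

ANNOTATIONS_QUERY = """
query ($reference: SequenceReference!, $queryId: String!,
       $sources: [Source!]!, $range: [Int!], $filters: [FilterInput!]) {
  annotations(reference: $reference, queryId: $queryId, sources: $sources,
              range: $range, filters: $filters) {
    source
    target_id
  }
}
"""


def make_filter(field: str, operation: str, values: Iterable[str],
                source: str) -> dict:
    """Assemble one FilterInput object from its four documented fields."""
    return {"field": field, "operation": operation,
            "values": list(values), "source": source}


def annotation_variables(reference: str, query_id: str,
                         sources: Iterable[str],
                         region: Optional[Tuple[int, int]] = None,
                         filters: Optional[list] = None) -> dict:
    """Build the variables object, leaving out optional arguments not given."""
    variables = {"reference": reference, "queryId": query_id,
                 "sources": list(sources)}
    if region is not None:
        variables["range"] = [int(region[0]), int(region[1])]
    if filters:
        variables["filters"] = list(filters)
    return variables


# Assumed example: UniProt features mapped onto PDB instance 2UZI.C,
# restricted to residues 1-50 and to features whose target_id is P01112.
variables = annotation_variables(
    "PDB_INSTANCE", "2UZI.C", ["UNIPROT"],
    region=(1, 50),
    filters=[make_filter("target_id", "equals", ["P01112"], "UNIPROT")],
)
print(variables)
```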

Data Organization

Schemas used to encode sequence alignments and positional features are extensions of the data schemas used in the RCSB PDB Data API. The following definitions and structures are relevant to the way that alignments and annotations are encoded:

Alignments

AlignmentResponse is the root document used to encode alignments

query_sequence contains the sequence of the database entry defined by the from and queryId parameters (ref). This field is null when genome-scale alignments are requested (i.e. when the from value is NCBI_GENOME)
target_alignment is a list of TargetAlignment documents that describes the different alignments between the sequence identified by the from and queryId parameters (ref) and the database defined by to

TargetAlignment is the document structure that describes a sequence alignment between the database entry defined by from and queryId parameters (ref) and the entry defined by to and target_id (see next set of bullet points)

target_id identifies the entry from the database defined by the parameter to that is being aligned with the query (defined by from and queryId parameters ref)
target_sequence contains the sequence of the database entry defined by to and target_id
aligned_regions is a list of AlignedRegion documents that defines the sequence alignment through a collection of regions
coverage document object that contains different scores related to the sequence alignment (see Coverage)
orientation integer that identifies the DNA strand of genome alignments (1 positive strand / -1 negative strand)

AlignedRegion sequence alignments are defined by a list of regions that identify the beginning and end positions in the query and target sequences. When alignment data maps residues between protein sequences, indexes are aligned one to one, incrementally, from the starting to the ending position (see next Figure). When alignments involve genome sequences, 3 consecutive nucleotide indexes are paired with each protein residue, with the possible addition of 1 or 2 nucleotide indexes stored in a separate array, exon_shift, to complete the final nucleotide triad (see Figure).

Protein-Protein Alignment diagram of a sequence alignment between an NCBI protein and a PDB Entity. Residues are mapped one by one from the starting to the end positions within two different regions.

query_begin and query_end identify the start and end positions of the alignment in the query sequence (defined by from and queryId parameters ref)
target_begin and target_end identify the start and end positions of the alignment in the target sequence (defined by to and target_id parameters)
exon_shift list of genomic indexes that are needed to complete the last nucleotide triad of a genome-protein sequence alignment (see next Figure)

Genome-Protein Alignment diagram of a sequence alignment between an NCBI genome region and a PDB Entity. Protein residues are mapped to 3 consecutive genome indexes from the starting to the end position. When the indexes of the last nucleotide triad would surpass the ending position, the missing nucleotides are stored in exon_shift. In this example, this occurs in the first AlignedRegion, where PDB Entity residue index 7 is mapped to genome nucleotide indexes [8,13,14].
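For protein-protein alignments, the one-to-one rule makes it simple to translate a query index into a target index. The Python sketch below covers that case only; it does not handle genome alignments or exon_shift. The function name and the region values are hypothetical.

```python
from typing import Optional


def map_query_to_target(aligned_regions: list, query_index: int) -> Optional[int]:
    """Map a query residue index to its target index, or None if unaligned.

    Within a region, indexes advance one to one from the begin positions,
    so the offset from query_begin is added to target_begin.
    """
    for region in aligned_regions:
        if region["query_begin"] <= query_index <= region["query_end"]:
            offset = query_index - region["query_begin"]
            return region["target_begin"] + offset
    return None


# Hypothetical alignment made of two regions separated by a gap in the query.
regions = [
    {"query_begin": 1, "query_end": 5, "target_begin": 3, "target_end": 7},
    {"query_begin": 10, "query_end": 12, "target_begin": 8, "target_end": 10},
]

print(map_query_to_target(regions, 2))   # inside the first region
print(map_query_to_target(regions, 11))  # inside the second region
print(map_query_to_target(regions, 7))   # falls in the gap between regions
```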

Coverage object that contains different scores related to the sequence alignments

query_coverage and query_length contain the percentage of the query sequence that has been aligned and its length (the query sequence is defined by from and queryId parameters ref)
target_coverage and target_length contain the percentage of the target sequence that has been aligned and its length (the target sequence is defined by the to and target_id parameters)
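Coverage scores are useful for ranking the targets returned for one query. The Python sketch below picks the target with the highest score from a list of TargetAlignment documents; it only compares values, so it does not depend on the scale the server uses. The helper name, target identifiers, and score values are all made up for illustration.

```python
from typing import Optional


def best_target(target_alignments: list,
                score: str = "target_coverage") -> Optional[str]:
    """Return the target_id with the highest coverage score, if any."""
    scored = [
        t for t in target_alignments
        if (t.get("coverage") or {}).get(score) is not None
    ]
    if not scored:
        return None
    return max(scored, key=lambda t: t["coverage"][score])["target_id"]


# Mock TargetAlignment documents; identifiers and scores are placeholders.
targets = [
    {"target_id": "AAAA_1",
     "coverage": {"query_coverage": 0.9, "target_coverage": 0.5}},
    {"target_id": "BBBB_1",
     "coverage": {"query_coverage": 0.4, "target_coverage": 1.0}},
    {"target_id": "CCCC_1", "coverage": None},
]

print(best_target(targets))
print(best_target(targets, "query_coverage"))
```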

Annotations

[AnnotationFeatures] is the root list of objects that contains the requested annotations

Feature list of documents that describe positional features
source enumerated value that identifies the provenance type of the positional features (ref)
target_id source entry identifier associated to the positional features

Feature document that describes a positional feature

feature_id Identifier of the feature. When available the same Id as in the provenance_source is used
description Free-form text describing the feature
type Feature category identifier (see Feature Type controlled vocabulary)
feature_positions List of FeaturePosition documents that describes the location of the feature
provenance_source Original database or software name used to obtain the feature
name Name associated to the feature (e.g. protein domain name)
value Numerical value associated to the feature

FeaturePosition document that describes a segment where a feature occurs

beg_seq_id Index at which this segment of the feature begins
end_seq_id Index at which this segment of the feature ends. If the positional feature maps to a single residue this field will be null
beg_ori_id Index at which this segment of the feature begins on the original provenance_source. When reference and source point to the same reference system this field will be null
end_ori_id Index at which this segment of the feature ends on the original provenance_source. If the positional feature maps to a single residue this field will be null. When reference and source point to the same reference system this field will be null
value A numerical value of the feature for this segment
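Since end_seq_id is null for single-residue features, client code needs to treat that case explicitly. The Python sketch below expands a Feature document into the sorted list of sequence indexes it covers. It assumes that segment ends are inclusive, and the helper names and feature values are hypothetical.

```python
def segment_bounds(position: dict) -> tuple:
    """Return (begin, end) for a FeaturePosition; a null end means one residue."""
    begin = position["beg_seq_id"]
    end = position.get("end_seq_id")
    return (begin, begin if end is None else end)


def covered_indexes(feature: dict) -> list:
    """Collect every sequence index covered by a feature's segments."""
    indexes = set()
    for position in feature.get("feature_positions") or []:
        begin, end = segment_bounds(position)
        indexes.update(range(begin, end + 1))  # assumes an inclusive end
    return sorted(indexes)


# Hypothetical feature with one multi-residue segment and one single residue.
feature = {
    "feature_positions": [
        {"beg_seq_id": 10, "end_seq_id": 12},
        {"beg_seq_id": 20, "end_seq_id": None},
    ],
}

print(covered_indexes(feature))
```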

GraphQL Schema

All GraphQL queries are validated and executed against the GraphQL schema. The GraphQL schema contains the elements that define sequence alignments and positional features.

You can use GraphiQL, which is a "graphical interactive in-browser GraphQL IDE", to explore the GraphQL schema. It lets you try different queries and helps with autocompletion and built-in validation. The collapsible Docs panel (Documentation Explorer) on the right side of the page allows you to navigate through the schema definitions. Click on the root Query link to start exploring the GraphQL schema.
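The schema can also be inspected programmatically with GraphQL's standard introspection fields, for example to list the values of an enumerated type. The Python sketch below assumes that introspection is enabled on this server, which this page does not state explicitly; the helper names and the mock response are illustrative.

```python
import json


def enum_values_body(type_name: str) -> str:
    """Build a POST body asking for the enum values of a named schema type."""
    return json.dumps({
        "query": "query ($name: String!) "
                 "{ __type(name: $name) { enumValues { name } } }",
        "variables": {"name": type_name},
    })


def parse_enum_values(response: dict) -> list:
    """Extract enum value names from a decoded introspection response."""
    type_info = (response.get("data") or {}).get("__type")
    if not type_info or not type_info.get("enumValues"):
        return []
    return [value["name"] for value in type_info["enumValues"]]


body = enum_values_body("SequenceReference")

# Mock response for illustration; a real one would come from the endpoint.
mock = {"data": {"__type": {"enumValues": [{"name": "UNIPROT"},
                                           {"name": "PDB_ENTITY"}]}}}
print(parse_enum_values(mock))
```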

Examples

This section contains additional examples for using the GraphQL-based RCSB PDB Sequence Coordinates Server API.

UniProt - PDB Entity alignment

Fetch alignments between a UniProt Accession and PDB Entities:

Computed Structure Model - NCBI protein alignment

Fetch alignments between a Computed Structure Model and NCBI proteins:

Mapping UniProt annotations to a PDB Instance

Fetch all positional features for a particular PDB Instance:

Human Chromosome 1 - PDB Entity alignment

Map all PDB Entities that fall in Human Chromosome 1:
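A sketch of such a request and of a simple way to post-process it is shown below in Python. The query uses the chromosome accession from the SequenceReference table and fields described in the Data Organization section; the grouping helper, the mock response, and its target identifiers are hypothetical.

```python
# Genome-scale query: query_sequence is omitted because it is null
# when `from` is NCBI_GENOME.
QUERY = """
{
  alignment(from: NCBI_GENOME, to: PDB_ENTITY, queryId: "NC_000001") {
    target_alignment {
      target_id
      orientation
      aligned_regions {
        query_begin
        query_end
        target_begin
        target_end
        exon_shift
      }
    }
  }
}
"""


def targets_by_strand(alignment: dict) -> dict:
    """Group target ids by DNA strand (1 positive, -1 negative)."""
    strands = {1: [], -1: []}
    for target in alignment.get("target_alignment") or []:
        orientation = target.get("orientation")
        if orientation in strands:
            strands[orientation].append(target["target_id"])
    return strands


# Mock AlignmentResponse with placeholder identifiers.
mock_alignment = {
    "query_sequence": None,
    "target_alignment": [
        {"target_id": "AAAA_1", "orientation": 1},
        {"target_id": "BBBB_2", "orientation": -1},
        {"target_id": "CCCC_1", "orientation": 1},
    ],
}

print(targets_by_strand(mock_alignment))
```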

Mapping PDB Instance ligand binding sites to Human Chromosome 1

Fetch protein-ligand binding sites for PDB Instances that fall within Human Chromosome 1:

Note that label_asym_id is used to identify polymer entity instances.

Mapping a PDB Instance to NCBI RefSeq proteins

Fetch alignments between a PDB Instance and NCBI RefSeq proteins:

Migration Guides

Migrating from 1D Coordinates Service

The following guide will help you migrate from the 1D Coordinates Service API to the Sequence Coordinates Service. This page describes the changes between both APIs.

License

Sequence Coordinates Server usage is available under the same terms and conditions as RCSB PDB (see usage policies).

Acknowledgements

To cite this service, please reference:

Joan Segura, Yana Rose, John Westbrook, Stephen K Burley, Jose M Duarte, RCSB Protein Data Bank 1D tools and services, Bioinformatics, Volume 36, Issue 22-23, 1 December 2020, Pages 5526–5527. doi: 10.1093/bioinformatics/btaa1012

Contact Us

Contact info@rcsb.org with questions or feedback about this service.