

Besides the spectra themselves, the sequence database to search in is the most important input of the search. If a sequence is not in the database, the corresponding peptide/protein cannot be identified. On the other hand, adding sequences of proteins that cannot occur in your experiment will increase the rate of false identifications. It is therefore essential to use the correct sequence database.

The standard format for sequence databases is called FASTA, and SearchGUI requires that the sequences are stored in this format. In a FASTA file each sequence is represented by a header and the sequence itself. The header contains information about the protein, e.g., protein accession number, database and species. However, the format of the header varies from database to database. SearchGUI supports the most commonly encountered databases like UniProt, Ensembl, NextProt, NCBI and IPI, plus a long list of other databases.

It is strongly recommended to use one of the standard databases, and of these UniProt is the preferred option.

Non-standard, home-made sequence databases with non-standard headers can also be used, but the downstream usage may be limited, e.g., in PeptideShaker.

Databases that do not match the standard header formats of the common databases (like UniProt, NCBInr etc.) can be added using a generic header format (supported from SearchGUI version 1.7.3 and PeptideShaker version 0.14.6 onwards). A generic header starts with >generic, optionally followed by a user-defined tag, and then lists the protein accession and the protein description, separated by | characters.

Example:
>generic_contig-535081|AC:123132|Hypothetical protein

AC:123132 will then be used as the protein accession number and Hypothetical protein as the protein description (if provided). Replace or remove the tag part depending on whether you have a user-defined tag or not.

To parse databases with generic FASTA headers in Mascot we recommend the following Mascot database parse rules:
accession: ">generic|\(*\)|(.*)"
description: ">generic|*|\(.*\)"

If you have a sequence database that cannot be parsed, please let us know by setting up an issue at the SearchGUI home page.

As mentioned above, it is strongly recommended to use sequence databases based on SwissProt/UniProt. This ensures that you get reviewed, maintained and well-annotated sequences that can easily be linked to a long list of other resources.

To get the SwissProt/UniProt sequence database for your species, go to the UniProt website and type organism:'name of your species' into the Query field at the top, e.g., organism:"homo sapiens". Next select one of the three provided options: show only reviewed entries (UniProtKB/Swiss-Prot), show unreviewed entries (UniProtKB/TrEMBL), or show only entries from a complete proteome set. Which option to choose depends on the properties of your experiment and on how well-annotated your given species is.

Next click on the Download link in the upper right corner, and under the FASTA header select and download either Canonical sequence data in FASTA format or Canonical and isoform sequence data in FASTA format. Again the choice depends on the properties of your experiment. Note that the presence of isoforms makes the downstream protein inference task more complex. It is generally advised to start with a simple database and increase the complexity only if needed.

You now have a UniProt sequence database containing only sequences from your species that can be used as input to a SearchGUI search. For more details on UniProt databases, see the UniProt website.

Sequence databases based on the International Protein Index (IPI) used to be very common in proteomics. However, this format has been discontinued and ought to be replaced by the corresponding SwissProt/UniProt databases as explained above. For more information on why IPI was discontinued, see the IPI home page.
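
To make the FASTA structure and the generic header format described in this section more concrete, here is a minimal Python sketch. It is not SearchGUI's own parser: the function names read_fasta and parse_generic_header are hypothetical, and the sketch assumes that a generic header consists of an optional tag directly after ">generic", followed by the accession and an optional description separated by "|" characters. It may be handy for checking a home-made database before searching.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class GenericHeader(NamedTuple):
    tag: str                    # text directly after ">generic" (may be empty)
    accession: str              # protein accession number
    description: Optional[str]  # protein description, None if not provided


def parse_generic_header(header: str) -> GenericHeader:
    """Split a generic FASTA header into tag, accession and description.

    Assumed layout (hypothetical helper, not SearchGUI code):
    >generic<tag>|<accession>|<description>
    """
    prefix = ">generic"
    if not header.startswith(prefix):
        raise ValueError(f"not a generic header: {header!r}")
    # Split on the first two '|' only, so a description may contain '|'.
    parts = header[len(prefix):].strip().split("|", 2)
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"missing accession in header: {header!r}")
    description = parts[2] if len(parts) == 3 and parts[2] else None
    return GenericHeader(parts[0], parts[1], description)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [
        ">generic_contig-535081|AC:123132|Hypothetical protein",
        "MKTAYIAKQR",
        "QISFVKSHFS",
    ]
    for header, sequence in read_fasta(example):
        print(parse_generic_header(header), len(sequence))
```

With a real database, the lines of the file can be passed directly, e.g. read_fasta(open(path)), where path is the location of your FASTA file. Headers that raise ValueError are the ones that do not follow the assumed generic layout.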
