Sample Queries
When to use MDX or SQL
OLAP stores data in a multidimensional cube, in SQL the data are stored in multiple tables.
Querying both data structures is relatively simple. With MDX you use the WHERE clause to effectively
slice the data cube to return your data of interest. In SQL you use the WHERE clause to return data
from a particular table. Joins can be used to put multiple tables together. If you are looking over
a large number of simulations, creating joins can be tedious, and will require prior knowledge of table
names and their locations on servers. Using MDX you can select your tables of interest without having to
know the names or locations of the underlying tables - you simply use the WHERE clause to slice out the your
simulations of interest by multiple variables - Temperature, Run, Maximum time step and so on.
In retrieving data without any further calculation on the data, both MDX and SQL are comparable.
Querying the database using SQL
Data are stored in multiple tables for each property and each simulation. The easiest
way to identify your protein of interest and it's association is to determine it's
pdb code. Once you have the pdb code you can query the Master_property_v
view.
SELECT * FROM [Directory].dbo.[Master_property_v] WHERE pdb4='1enh'
This view contains a summary of all the tables in the database and the associated
simulations. By using the where clause to limit the selection to just rows just
containing
pdb4='1enh' you can determine the simulations we have
in the database associated with it. You will also be able to limit your search to
a particular
sim_id. The
Sim_id is a unique simulation identifier
and all the simulation data tables use the sim_id. Each coordinate and property
table will end with the
sim_id. For example for 1enh
sim_id 678 relates
to a run 1 298K simulation and the tables associated with this will end in _678
e.g.
dbo.[coord_678], dbo.[rmsd_678], dbo.[SASA_678] and so on. The master
property view will also tell you which server and database the tables are on. Generally
speaking the data tables for a particular simulation will always be in the same
database. Once you have located the tables of interest you can construct you queries
to return the data you require using the
WHERE clause to slice the data correctly
and using the ORDER clause to correctly order the data, by integration step for
example.
Cross simulation queries
You can use JOINS to put multiple tables together for example the RMSDs of runs 1 to 3
of the 498K 1enh runs.
SELECT run1.step, run1.rmsd, run2.rmsd, run3.rmsd
FROM dbo.[rmsd_678] as run1
JOIN dbo.[rmsd_679] as run2
ON ( run1.step=run2.step )
JOIN dbo.[rmsd_680] as run3
ON ( run1.step= run3.step )
ORDER by run1.step
If you want o look at all the rmsd for all the 298K's and find those runs with high values for example you could use the master_rmsd_v view and select out just the 298K runs and look for values greater than 5 Angstroms RMSD using the WHERE clause (WHERE temp =298 and rmsd > 5). The master views are large tables that are generated on the fly and so can take some time to access.
If you are looking at a large set of targets and want to run the same query on a number of different tables from the set of simulations it is advisable to use a CURSOR. A cursor essentially lets you loop over every table in your set and run the set query over them. The example below shows a cursor that collates all the properties from multiple tables into one table for each simulation.
DECLARE @sim_id INT
DECLARE @table_prefix VARCHAR(64)
DECLARE @server_name VARCHAR(32)
DECLARE @database_name VARCHAR(32)
DECLARE @first BIT
DECLARE @view_cmd VARCHAR(MAX)
DECLARE @struct_id INT
DECLARE @residues INT
SET @first = 1
DECLARE sim_id_cursor CURSOR FOR select distinct a.sim_id, struct_id, server_name,database_name from [directory].dbo.[master_property_v] as a
join [directory].dbo.[master_simulation_simulationgroup] as b
on a.sim_id=b.sim_id
where a.property_abbrev='RMSD'
and b.sim_grp_id=1000
OPEN sim_id_cursor
FETCH NEXT FROM sim_id_cursor INTO @sim_id, @struct_id, @server_name, @database_name
SET @table_prefix = '[' + @server_name + '].[' + @database_name + '].'
SELECT @residues=MAX(residue_number) FROM [helix].[directory].[dbo].[Master_ID] WHERE struct_id = @struct_id
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @residues=MAX(residue_number) FROM [helix].[directory].[dbo].[Master_ID] WHERE struct_id = @struct_id
SET @table_prefix = '[' + @server_name + '].[' + @database_name + '].'
SET @view_cmd = '
SELECT a.sim_id, a.struct_id, a.struct_inst, f.temp, f.run, a.time_step, a.rmsd, b.dissimilarity_score, c.native,
c.non_native, e.radgyr, e.end2end, d.Main_chain, d.side_chain, d.polar, d.non_polar,d.mc_polar,d.mc_non_polar,
d.sc_polar, d.sc_non_polar, d.total, g.fraction_helix, h.fraction_beta INTO dbo.[PS_n_'+ CAST(@sim_id AS VARCHAR) +']
FROM ' + @table_prefix + 'dbo.[rmsd_'+ CAST(@sim_id AS VARCHAR) +'] AS a
Join ' + @table_prefix + 'dbo.[congen_'+ CAST(@sim_id AS VARCHAR)+'] as b on a.time_step=b.time_step
Join ' + @table_prefix + 'dbo.[contact_' + CAST(@sim_id AS VARCHAR)+ '] as c on a.time_step=c.time_step
Join (SELECT time_step, SUM(main_chain)as Main_chain, SUM(side_chain)as Side_chain,
SUM(polar)as Polar, SUM(non_polar)as Non_polar, SUM(mc_polar)as MC_polar,
SUM(mc_non_polar)as MC_non_polar, SUM(sc_polar)as SC_polar,
SUM(sc_non_polar) as SC_non_polar, SUM(total)as Total
from ' + @table_prefix + 'dbo.[sasa_'+ CAST(@sim_id AS VARCHAR)+']
group by [time_step]) as d on a.time_step=d.time_step
Join ' + @table_prefix + 'dbo.[radgee_'+ CAST(@sim_id AS VARCHAR)+'] as e on a.time_step=e.time_step
Join (select distinct Sim_id, struct_id, temp, run from [Helix].[directory].dbo.[master_property_v]) as f on a.struct_id = b.struct_id and a.sim_id=f.sim_id
LEFT OUTER Join (SELECT time_step, CAST(COUNT(*) AS DECIMAL)/CAST(' + CAST(@residues AS VARCHAR) + ' AS DECIMAL) AS fraction_helix FROM
' + @table_prefix + 'dbo.[dssp_'+ CAST(@sim_id AS VARCHAR)+'] WHERE ss_id = 12 GROUP BY time_step) AS g on a.time_step=g.time_step
LEFT OUTER JOIN
(SELECT time_step, CAST(COUNT(*) AS DECIMAL)/CAST(' + CAST(@residues AS VARCHAR) + ' AS DECIMAL) AS fraction_beta FROM
' + @table_prefix + 'dbo.[dssp_' + CAST(@sim_id AS VARCHAR)+'] WHERE ss_id = 3 OR ss_id = 4 GROUP BY time_step)
as h on a.time_step=h.time_step
'
FETCH NEXT FROM sim_id_cursor INTO @sim_id, @struct_id, @server_name, @database_name
EXEC (@view_cmd)
END
CLOSE sim_id_cursor
DEALLOCATE sim_id_cursor
Using Functions
It is worth noting that queries that use VBA or EXCEL math functions can be
slow in MDX, especially if one wanted to calculate properties of an entire
trajectory (20,000+ time frames). In SQL however, the calculations are generally
faster. As an example the SQL query to calculate the average phi psi angles
using circular statistics (using the SIN, COS, ARCTAN and PI functions) takes
around 20 seconds to run on an entire trajectory, whilst in MDX the same query
can take several hours.
MDX Queries
SQL Queries