XueyiProgress < Main

Main

Main.XueyiProgressr1.1 - 22 Mar 2005 - 04:05 - Main.guesttopic end

Things need to do

RNASC -- RNA Structure Correction. A C++ program for correcting bad clashes in existing RNA structure.
ColCheck^? -- Collision Check. Comparing the performances of a bunch of collision check algorithms using real proteins for simulation.
Derive Ramachandran-like plots for RNA residue/suite.

Weekly Reports

Mar. 21, 2005

I finished a book "Molecular Modeling" (by four authors in Europe) during the spring break and found a method for clustering the 6/7 dimension dihedral data of RNA residue/suite. Instead of calculating the isosurfaces on these data, maybe a good way is to calculate the energy for each conformation in the 6/7 dimension data and search for the local minima. These local minima should be good candidates of clustered conformations.

From computer science perspective, calculating energy is trival as long as energy functions are correct, but finding the local minima in 6/7 dimension data and further calculating the isosurfaces (must be projected to 3/4 dimensions?) by the energy level may be hard. I think Ramachandran-like plots represented by energy isosurfaces should be more convincing to the biologists.

Mar. 02, 2005

Finished two collision checking algorithms -- grid algorithm (#1) and its improvement (#2) (only update the rotated chain -- in one rotation per step, 3 atoms per residue, 100,000 steps and [-10, 10] per rotation.

#1, SEED = 1.

	residues	clash	runtime per step (ms)	check pairs per step
1A92	50	3	0.2014	851
1CTF	68	3.039	0.271	1121
1O16	154	4.961	0.64132	2952
1SBP	309	10.631	1.28444	5549
1QRR	390	5.762	1.63581	7151
1LOX	647	7.936	2.75856	12106
1JB0	740	7.517	3.29283	15440

#2, SEED = 1.

	residues	clash %	runtime per step	check pairs per step
1A92	50	3.0	0.11465	29
1CTF	68	3.039	0.15682	35
1O16	154	4.961	0.3493	64
1SBP	309	10.631	0.66927	91
1QRR	390	5.617	0.87015	113
1LOX	647	7.936	1.45691	102
1JB0	740	7.517	1.67429	126

#1, SEED = NULL, reflesh every 50 steps.

	residues	clash %	runtime per step	check pairs per step
1A92	50	59.767	0.18987	789
1CTF	68	60.023	0.25247	1034
1O16	154	62.105	0.54649	1902
1SBP	309	63.116	1.09537	3666
1QRR	390	44.984	1.47702	5640
1LOX	647	35.439	2.56809	10210
1JB0	740	35.213	3.07102	12381

#2, SEED = NULL, reflesh every 50 steps.

	residues	clash %	runtime per step	check pairs per step
1A92	50	74.131	0.03765	46
1CTF	68	77.017	0.05268	56
1O16	154	82.929	0.09274	52
1SBP	309	85.112	0.23554	110
1QRR	390	81.432	0.31726	94
1LOX	647	73.863	0.60418	85
1JB0	740	73.785	0.65384	86

Some observations:

1. For Algorithm #1. SEED = 1 yields ~5% collisions while SEED = NULL yields > 50% collisions. The atoms will be updated and hashed every step, so the advantage of having many collisions doesn't significantly reduce the processing time (improved ~7%).

2. For Algorithm #2. SEED = 1 yields ~5% collisions while SEED = NULL yields > 50% collisions. In each step, only rotated atoms will be updated, so the advantage of having many collisions significantly reduces the processing time (2.5~3.5 times faster). The similar comparion pairs mean that the comparing time is almost the same (bug???), the only difference is the update time when the rotated step is accepted. Most of operations in this step is copy coodinates and update the grid.

3. When SEED = 1, #2 is about 2 times faster than #1. The significant difference of comparing pairs per step means that about half of the time is spend on the calculating new coordinates and updating grids and coordinates (when the rotation is accepted). It is because most of the rotations are accepted, we need to update the grids and coordinates in #2 in every step (this work is done before checking collisions in #1).

4. According to 1, 2 and 3, it is not surprise that when SEED = NULL, #2 is about 5 times faster than #1. It is because more than half rotations are failed, so we no need to update the grids and coordinates.

Feb. 22, 2005

RNASC -- still working on the rough step, it turns out debugging in C++ is harder than Matlab.

Collision Algorithm Comparision (ColCheck^?) -- working on three grid algorithms when considering backbone atoms only (3 atoms per residue).

I realize there is some trickes in the chain-tree algorithm. They use Oriented Bounding Box (OBB), as shown in the figure 7 of Lotan 04. When the bond between F and G is rotated, only the transformation matrices in F and K are changed and only the OBBs in N and O are re-computed. It means they delayed the update of coordinates of atoms in G and H and the update of the OBB in M (OBB in M doesn't need to be re-computed, it just needs a rotation) to the step for checking collisions. So they can achieve the update in O(logn) for one rotation.

The trick here is that, when they decend from the root to leaves for checking bad collisions, they will stop and discard the rotation as soon as they find a bad collision, so the delay of update has some advantages. But considering the practical situations, we can find some interesting findings:

1. When the length of protein backbone is short (= 374 residues in 1HTB, i.e. ~1500 nodes on the backbone), the update time for grid and chain-tree is close, as shown in figure 9. So for longer chain, chain-tree algorithm is much better. But I think most simulated protein should have short backbone, since we still don't have accurate energy function and knowledges to simulate large protein. Note in this simulation, they only consider the atoms on the backbone, 3 atoms per residue, one rotation each step, 0-30 degree for each rotation.

2. They choose randomly from 0-30 degree for each rotation in this paper, but they also choose randomly from 0-12 degree for each rotation (calculating energy instead of collision) in Lotan 03 and they claim that the rejection ratio is 60% (bad collision) for one rotation each step in 1CTF, which contains only 68 residues. So I assume the rejection ratio should be much higher in the simulations done in this paper. Question: What's the normal/acceptable rejection ratio in actual simulation?

3. In figure 9b, chain-tree algorithm is 2-5 times faster than grid algorithm, and we can see that for 1HTB, chain-tree is approximately 3 times faster. Note here the algorithm stops as soon as they find the bad collision. But in figure 10, they try to find all collisions in each step and they use 11 different conformations (gyration radius ranges from 20A to 85A) for simulation. We can see that for one rotation per step, the chain-tree can be 10 times faster than grid. Does it mean the grid algorithm can find collision faster than chain-tree? Or Does it mean grid is faster in compact conformation (native conformation) than in noncompact conformation? I think this figure is a little misleading.

-- XueyiWang - 20 Feb 2005

Feb. 15, 2005

RNASC:

I'm still working on the rough step of adjusting RNA backbone. In rough step for adjusting suite, the brute force algorithm fixes the phosphorus position, samples the dihedrals of O5*, C5*, C4* and O3*, C3* at every 5 (or user defined) degree, and uses some criterions like the maximum and minimum distances of C4*-C1* and C3*-C1*, the maximum and minimum angles of C5*-C4*-C1* and O3*-C3*-C1*. For extended adjustment, say, adjusting atoms within two residues instead of one suite, the adjusted atoms are one times more than before, but the criterions are the same. When rough step finishes, the valid dihedrals of O5*, C5*, C4* and O3*, C3* are kept for refining.

I added some more flags in command line, but I still didn't figure out the better way to do clustering.

-ROUGHSTEP#		Span of degrees in rough step.
-REFINESTEP#		Span of degrees in refine step.

-FIRSTPHOSPHORUS [#.# #.# #.#]
		Adjust the positions of first phosphorus. In suite, it is
		the only phosphorus can be adjusted.
		In suite, the x axis is defined as from C1* in first base
		to C1* in second base. When defining triangle
		P-C1(1)-C1(2) in counter clockwise, z axis points out of
		the triangle and perpendicular to it.
-SECONDPHOSPHORUS [#.# #.# #.#]
		Adjust the positions of second phosphorus.
-FIRSTBASE [#.# #.# #.#]
		Adjust the positions of first base. In residue, it is
		the only base can be adjusted.
		In residue, the x axis is defined as from first phosphorus
		to second phosphorus. When defining triangle P(1)-C1*-P(2)
		in counter clockwise, z axis points out of the triangle
		and is perpendicular to it.
-SECONDBASE [#.# #.# #.#]
		Adjust the positions of second base.

Collision algorithms -- I'm not sure I can finish all of these:

Experiment contents:

1. Use real PDB files. Remove the atoms on the side chains except Cb.

2. Randomly generate the rotation position and degree of dihedrals.

3. Compare performance of different algorithms in two situations: a. rotating 1 dihedral in each step; b. rotating 3 dihedrals in each step; c. each rotate chooses from [-10,10] randomly.

4. Another situations: a. only atoms on the main chain are considered. b. atoms on the backbone and the cb atoms are considered.

5. 100,000 simulation steps each experiment.

Different algorithms to be compared:

1. General grid algorithm: resampling all atoms and re-checking collisions in each step.

2. Improved grid algorithm by updating rotated atoms only and checking collisions with un-rotated atoms.

3. Further improvement on grid algorithm by setting the origin to the center of protein strand instead of one end. This one is trival but I'd like to see the performance.

4. Sweep-plane algorithm. It seems that the reason why the general grid algorithm is much more slower than chain-tree algorithm seems lies on the too many un-needed interacting pairs in grid algorithm. From the comparison in Lotan's paper, 20 times more comparison causes 10 times slower in running time. The improvement in 2 should reduce much more comparison, but because the span of grid should equal to the largest bad-collision threshold in all atom pairs, there are still lots of un-needed pair comparison. So I'm thinking of using a sweep-plane algorithm to improve the performance. First split the protein backbone into a series of strands, as shown in the figure. Each strand contains an atom with local maximum z coordinate in the middle and two atoms with local minimum z coordinates in the two ends. After each rotation, we reconstruct the rotated protein structure into strands. Then we sweep down the protein structure, compare the rotated and un-rotated protein strands. In the sweep plane, we can also use a sweep line to detect possible pair interactions. It seems that this method reduces the pair comparisons (I assume the time complexity is close to O(nlogn)??), but in practical, I'm not quite sure if the overhead will eat up the saved comparison time.

5. chain-tree algorithm by Itay Lotan.

6. Modified chain-tree algorithm: using Axis Aligned Bounding Volume instead of Oriented Bounding Volume and calculating overlapped volume. If at a certain level, bounding boxes A and B have overlapped volume V, then we walk down the tree one level, get the bounding boxes C & D under A and E & F under B. For the algorithm in 5, we need to check four pairs at most, C&E, C&F, D&E and D&F. But now if C has no overlaps with V, we can reduce the pairs to two: D&E and D&F. Although AABB has larger volume than OBB, I think we can improve the performance by reducing the number of pairs and simplifying the overlap check. Question: does this method reduce the time complexity, which is O(n4/3). I still didn't figure it out.

-- XueyiWang - 14 Feb 2005

Feb. 08, 2005

I decided to use the CCP4 Coordinate Library and Clipper Library for reading PDB files, constructing RNA structure, and manipulating X-ray crystallography data.

I finished the command line parts, defined as follows:

RNASC [-Flags [value]] filename

Flags:

-RESNUM#		Number of residue to be adjusted (default=first residue)
		For clash in suite, it's the residue of the phosphate.
-CHAIN#		The chain ID (default = the first chain)
-MODEL#		The model ID (default = the first model)
-PUCKER#[-#]		Specific pucker types of first and second sugar (default=1))
		For clash in residue, there is only one sugar pucker.)
		0 - determined by the perpendicular distance from phosphate)
		to first base.)
		1 - determined by the perpendicular distance from phosphate)
		to bond N1/9 - C1*)
		2,3 - C2endo and C3endo.
-RESIDUECLASH		Clash in residue.

-PARAMETER# [file]		Specify the set of bond lengths and angles (default=1))
		1 -- canonical paramters)
		2 -- bond lengths and angles used in pdb file)
		4 -- the average 1 and 2)
		8 -- user defined parameters (unavailable)
		Different sets can be used at the same time, e.g. 7 means
		calculating the structure three times, each time use one
		set of parameters from 1, 2 and 4.
-SIG#		The allowable range of the bond angles in sugar puckers,
		designated by the times of the sigma values of each bond
		angle (default=3).
-SHORTEXTENSION		Adjust the atoms in one suite/residue range, if this item
		is not specified, adjusting will be in two residues/suites
		range.

-ENERGYFUNC#.#		Energy function used for checking bad clashes in pairwise
		(default=0.4). If desginated by number, e.g. default=0.4
		means the function is vdwi+vdwj-0.4. Otherwise, new energy
		function must be specified (unavailable)

-CONFORMNUM#		The number of comformations to be output (default=100).
-NOMEASUREDIHEDRAL		Don't measure dihedrals for each conformation.
-NOMEASURECONFORM		Don't determine conformation type for each conformation.

-Help		all command line parameters.

-- XueyiWang - 07 Feb 2005

Other options

You might also want options to cluster results, so that nearby solutions (maybe measured by RMSD of coordinates or of parameters) are considered the same and -CONFORMNUM counts distinct solutions.

-- JackSnoeyink - 08 Feb 2005
to top

End of topic
Skip to action links | Back to top

Edit | Attach image or document | Printable version | Raw text | More topic actions
Revisions: | r1.12 | > | r1.11 | > | r1.10 | Total page history | Backlinks

You are here: Main > ProgressReports > XueyiProgress

to top