Advanced Campus Services
Information Systems & Technology
Georgia State University
P. O. Box 3968
Atlanta, Georgia 30302-3968
Phone +1 404 463 9685
Email: avandenberg@gsu.edu

Directory Services Project
September 5, 2003
3:00 pm - 4:30 pm
Classroom South 514

1. Review Tasks/assignments from Aug. 29 meeting (15 minutes)

2. Funding status (NSF ITR, NMI REU supplement, NMI Grids?)

3. Experiments (overview and discussion) (see Experiments below)

    signatures

    reference sets

    SOM for Novell, OpenLDAP, etc.

4. Status of Semantic Facilitator (TM) (SM) prototype

    schema extract & database store

    schema tokenizing for SOM

    schema metadata (vendor & version, institution, timestamp, connect info)

    SFA labels

    visualization

5. NMI REU + NMI Grid + Semantic Facilitator + ITR = research testbed

Next meeting: Friday, September 19, 2003 - 3:45 - 5:15 pm


Experiments:

On Tuesday September 2, 2003, Vijay, Lei, Jijie, and Art talked about what experiments may be appropriate to our current state of work. We mentioned some possible candidates and agreed to continue thinking about this. The following comments are from Art's notes:

We noted that "reference sets" was a potentially rich area for work (see Vijay's Aug 24 email attached).
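One of the reference-set ideas listed among the experiments further down is to take the intersection of several vendors' schemas and use it as the "expert solution" against which a clustering is judged. A minimal sketch of that idea follows. The object class lists are hypothetical placeholders (the `...OnlyClass` names are invented), and Jaccard overlap is only one possible way to score a cluster against the reference set; the notes do not name a measure.

```python
# Hypothetical object class names per directory product; real schemas are far larger.
SCHEMAS = {
    "iPlanet":   {"person", "organizationalPerson", "inetOrgPerson", "iplanetOnlyClass"},
    "Novell":    {"person", "organizationalPerson", "inetOrgPerson", "novellOnlyClass"},
    "OpenLDAP":  {"person", "organizationalPerson", "inetOrgPerson", "openldapOnlyClass"},
    "SecureWay": {"person", "organizationalPerson", "securewayOnlyClass"},
}

def reference_set(schemas):
    """Object classes present in every schema."""
    return set.intersection(*schemas.values())

def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (assumed scoring measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

reference = reference_set(SCHEMAS)

# A cluster produced by some clustering run (made up for the example).
candidate_cluster = {"person", "organizationalPerson", "inetOrgPerson"}

print(sorted(reference))
print(jaccard(candidate_cluster, reference))
```

A score near 1 would mean the cluster reproduces the reference set closely; what counts as "close enough" is left open here, as it is in the notes.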

Experiments with human subjects (say, experts validating clustering or users using an interface à la Roussinov) would need Institutional Review Board (IRB) approval. Since the IRB process is being reviewed (and tightened up), we might need to adjust.

We discussed whether simple heuristics may be "just as good" as SOM; that is, is the clustering that results from SOM really just a reflection of inheritance? If so, a heuristic algorithm that develops an "inheritance tree" may be sufficient: just observe the resulting "branching nodes."
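The inheritance heuristic can be sketched in a few lines. This is a minimal illustration under assumptions, not the algorithm discussed at the meeting: the `SUPERIOR` table is a small hand-written sample of object classes and their superior (SUP) classes, and the function names are hypothetical. It treats any class with more than one direct subclass as a "branching node."

```python
from collections import defaultdict

# Small illustrative sample: object class name -> superior (SUP) class name.
SUPERIOR = {
    "top": None,
    "person": "top",
    "organizationalPerson": "person",
    "inetOrgPerson": "organizationalPerson",
    "residentialPerson": "person",
    "organization": "top",
    "organizationalUnit": "top",
}

def build_children(superior):
    """Invert the child -> parent map into parent -> sorted list of children."""
    children = defaultdict(list)
    for name, parent in superior.items():
        if parent is not None:
            children[parent].append(name)
    return {parent: sorted(kids) for parent, kids in children.items()}

def branching_nodes(children):
    """Classes with more than one direct subclass."""
    return sorted(p for p, kids in children.items() if len(kids) > 1)

def subtree(children, root):
    """Every class at or below root; one subtree is one candidate cluster."""
    members = [root]
    for kid in children.get(root, []):
        members.extend(subtree(children, kid))
    return members

children = build_children(SUPERIOR)
print(branching_nodes(children))
print(subtree(children, "person"))
```

Comparing the subtrees under each branching node with the SOM clusters would be one way to test whether the SOM is mostly rediscovering inheritance.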

The challenge with clustering attributes' metadata is that the metadata is sparse: OID, NAME, SYNTAX, DESCRIPTION. And the latter two are perhaps limited as distinguishing factors (each being "one of only several different values").
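To make the sparseness concrete, here is a minimal sketch that turns attribute metadata into binary vectors over a shared vocabulary (possibly what these notes call a "universal vector"; that reading is an assumption). The three attribute records use standard OIDs and syntax OIDs, but the descriptions, the stop-word list, and the tokenizing rules are invented for illustration.

```python
import re

# Illustrative attribute records; the descriptions are made up for the example.
ATTRIBUTES = [
    {"oid": "2.5.4.3", "name": "cn",
     "syntax": "1.3.6.1.4.1.1466.115.121.1.15", "desc": "common name of the entry"},
    {"oid": "2.5.4.4", "name": "sn",
     "syntax": "1.3.6.1.4.1.1466.115.121.1.15", "desc": "surname of the person"},
    {"oid": "2.5.4.20", "name": "telephoneNumber",
     "syntax": "1.3.6.1.4.1.1466.115.121.1.50", "desc": "telephone number of the person"},
]

STOP_WORDS = {"of", "the", "a", "an"}  # assumed list, not from the notes

def tokens(attr):
    """Tokens from NAME (split on camelCase), DESCRIPTION words, and the SYNTAX OID."""
    name_parts = re.findall(r"[A-Za-z][a-z]*", attr["name"])
    desc_parts = re.findall(r"[a-z]+", attr["desc"].lower())
    words = {w.lower() for w in name_parts + desc_parts} - STOP_WORDS
    words.add("syntax:" + attr["syntax"])
    return words

def vocabulary(attrs):
    """Sorted list of every token seen in any attribute."""
    return sorted(set().union(*(tokens(a) for a in attrs)))

def vectorize(attr, vocab):
    """Binary vector: 1 where the attribute has the vocabulary term, else 0."""
    present = tokens(attr)
    return [1 if term in present else 0 for term in vocab]

vocab = vocabulary(ATTRIBUTES)
vectors = {a["name"]: vectorize(a, vocab) for a in ATTRIBUTES}
for name, vec in vectors.items():
    print(name, vec, "nonzero:", sum(vec), "of", len(vec))
```

Even in this tiny sample each attribute sets only a few positions, and the only terms shared between attributes are a syntax OID and one description word, which is the distinguishing-power problem described above.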

With that said, here are some possible experiments:

a) Repeat the 320 SOM experiment exactly. Confirm results.

b) Conduct the 320 SOM experiment, except using <Novell, OpenLDAP, SecureWay> objects.

c) Conduct a) or b) but vary the domain of the 320 values (x, y, neighborhood size, iterations)

d) SOM using attributes for <iPlanet, Novell, OpenLDAP, SecureWay>.

e) Treat the whole schema (thanks, Susan Qu), i.e. objectclasses, attributes, and matching rules, and cluster. This has aspects of a "DNA" (directory node analysis) signature. Hypothesis: using the same configuration (same LDAP, same SOM parameters) results in exactly the same mapping.

f) Continue with genetic algorithm.

g) Find reference set as intersection of <iPlanet, Novell, OpenLDAP, SecureWay>. Conduct clustering using this reference set as the "expert solution" to achieve. Compare to results of experiments b) or c) which used a "universal vector."

h) Find reference set as <person, organizationalPerson, inetOrgPerson, eduPerson, other_eduPerson>. Conduct clustering using this reference set as the "expert solution" to achieve. Compare to results of experiments b) or c) which used a "universal vector." Hypothesis: at least from the perspective of a specific filter, this then clusters appropriately. I.e. "As an expert in a certain area (person info), I'm only interested in those objects anyway."

i) Calculate distances between the resulting SOM object mappings (i.e. don't just use the fixed rectangular matrix) and determine clustering. Compare for different LDAPs. Hypothesis: we can determine clusters more accurately.

j) Using a similar calculation of object distances on the resulting mappings, determine a threshold of "nearness" that identifies clusters. If the g) or h) reference sets have a certain nearness factor, is that helpful?

k) Consider reference sets that aggregate. Start with a core, cluster, include new items in the core that are close, recluster. Is there a reasonable point at which one now has a robust reference set that works in general? Hypothesis: we can build around an "armature" and soon the form becomes self-evident.
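Several of the experiments above vary the SOM parameters (x, y, neighborhood size, iterations), ask whether an identical configuration gives an identical mapping, and propose clustering by distance rather than by grid cell. The sketch below illustrates all three under stated assumptions: it is a plain online SOM with a Gaussian neighborhood and linear decay, not the configuration used in the earlier experiment; the `ITEMS` vectors are hypothetical; and "nearness" is taken to mean Euclidean distance between the weight vectors of the cells that two objects map to, which is only one reading of the notes.

```python
import math
import random

def best_unit(weights, vector):
    """Return the grid cell whose weight vector is closest to `vector`."""
    return min(weights, key=lambda cell: math.dist(weights[cell], vector))

def train_som(data, x, y, radius, iterations, seed=0, rate=0.5):
    """Train an x-by-y map; `radius` is the initial neighborhood size in grid cells."""
    rng = random.Random(seed)  # fixed seed: same inputs give the same map
    dim = len(data[0])
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(x) for j in range(y)}
    for t in range(iterations):
        decay = 1.0 - t / iterations          # linear decay of rate and radius
        sample = data[rng.randrange(len(data))]
        wi, wj = best_unit(weights, sample)
        sigma = max(radius * decay, 0.5)      # keep the neighborhood from vanishing
        for (i, j), w in weights.items():
            grid_d2 = (i - wi) ** 2 + (j - wj) ** 2
            step = rate * decay * math.exp(-grid_d2 / (2 * sigma ** 2))
            for k in range(dim):
                w[k] += step * (sample[k] - w[k])
    return weights

def place(weights, items):
    """Map each named item to its best-matching grid cell."""
    return {name: best_unit(weights, vec) for name, vec in items.items()}

def clusters_by_nearness(weights, placement, threshold):
    """Single-linkage grouping: items join a group if any member is within threshold."""
    groups = []
    for name, cell in placement.items():
        near = [g for g in groups
                if any(math.dist(weights[cell], weights[placement[o]]) <= threshold
                       for o in g)]
        merged = [name] + [o for g in near for o in g]
        groups = [g for g in groups if g not in near] + [merged]
    return groups

# Hypothetical binary vectors for a handful of object classes.
ITEMS = {
    "person":               [1, 1, 0, 0, 0, 0],
    "organizationalPerson": [1, 1, 1, 0, 0, 0],
    "inetOrgPerson":        [1, 1, 1, 1, 0, 0],
    "organization":         [0, 0, 0, 0, 1, 1],
    "organizationalUnit":   [0, 0, 0, 0, 1, 0],
}

weights = train_som(list(ITEMS.values()), x=4, y=4, radius=2, iterations=500, seed=42)
placement = place(weights, ITEMS)
print(placement)
print(clusters_by_nearness(weights, placement, threshold=0.5))
```

With the seed, data order, and parameters held fixed, two runs of `train_som` produce the same weights and therefore the same mapping. Whether that holds across real toolkits depends on how each one seeds and orders its training, so the "exactly same mapping" hypothesis still needs testing against the actual software.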


From: "Vijay K. Vaishnavi" <vvaishna@gsu.edu>
Date: Sun Aug 24, 2003 7:43:56 PM US/Eastern
To: "Cdshaw" <cdshaw@cc.gatech.edu>, "Art Vandenberg" <avandenberg@langate.gsu.edu>
Cc: "Lei Li" <lli@cis.gsu.edu>, "Vijay K. Vaishnavi" <vvaishna@gsu.edu>, "Jijie Wang" <jijiew@yahoo.com>
Subject: Reference Sets: Initial thoughts
Attachments: There is 1 attachment

Art, Chris:

I think the discussion on reference sets in Friday's meeting and the suggestion of tags by Chris have resulted in a viable approach to verifying the correctness of the clustering process. Here are some related thoughts:

1. We need to have only one "story"/"scenario" that each Internet2 school must use to enter the data from the scenario into its directory schema for eduPerson. We may want to use two or three such scenarios but there is no real need for it.

2. If we can get each of the 200-odd schools to enter the scenario data into its directory schema for eduPerson, and if we have created the scenario such that it captures all the attributes that any school has for eduPerson, then the approach itself may be theoretically enough for "correctly" clustering all the eduPerson attributes in use by the Internet2 schools. To hope to get the desired response, we need to make the task as simple as possible. For example, we could create a form for each school that it can fill in for the given scenario, which corresponds to the schema for eduPerson for the school directory.

3. While we cannot guarantee that we will get a 100% response, we can make sure that we create the scenario such that all attributes being used by any of these schools get used. This will mean that we need to download the scenario for each of these schools and possibly the eduPerson schema for schools like Michigan or Wisconsin will point to what we need to include in the scenario.

4. Even though the approach can itself lead to an acceptable clustering of attributes, we need to go beyond the manual process to facilitate the process as well as to make it generalizable to schools that do not participate in the manual process and to make the process dynamic. To achieve this, we can use the manual approach for meta-training of the automated process and for validation of that training. We could divide the schools (from which we get the responses) randomly into two sets with the first set containing about double the number of elements in the second set. We could then use the first set to "meta-train" the genetic algorithm and then use the other set to verify the correctness of the clustering. We need to think about these details from a statistical standpoint.

5. To hope that the algorithm has any chance of success, we need to use the significant words in the description of the attributes in defining the universal vector set in addition to the attributes themselves.

We need to get started on downloading the eduPerson attributes for all the schools and the corresponding descriptions to see how the problem space is shaping. I think there is some inherent merit to this approach and the approach is generalizable to other domains.

Let us discuss the approach. If we agree on the approach then we should proceed to completing the tasks that support the approach.

I wanted to circulate this message at least to everybody who was present in the meeting, but I did not have at hand the e-mail addresses of Susana and Roop.

I am proposing to have an extra class on Friday, September 5 (at 4:30 pm) for a course I am teaching and so I will have to leave at 4:15 pm if we have the next meeting on September 3.

Vijay

Vijay K. Vaishnavi
Professor of Computer Information Systems & Professor of Computer Science
Georgia State University
URL: http://www.cis.gsu.edu/~vvaishna
E-Mail: vvaishna@gsu.edu
Telephone: 404-651-3891


Last Updated: March 2, 2006