Advanced Campus Services
Directory Services Project

1. Review tasks/assignments from Aug. 29 meeting (15 minutes)
2. Funding status (NSF ITR, NMI REU supplement, NMI Grids?)
3. Experiments (overview and discussion) (see Experiments below)
   signatures
   reference sets
   SOM for Novell, OpenLDAP, etc.
4. Status of Semantic Facilitator (TM) (SM) prototype
   schema extract & database store
   schema tokenizing for SOM
   schema metadata (vendor & version, institution, timestamp, connect info)
   SFA labels
   visualization
5. NMI REU + NMI Grid + Semantic Facilitator + ITR = research testbed

Next meeting: Friday, September 19, 2003 - 3:45 - 5:15 pm

Experiments:

On Tuesday, September 2, 2003, Vijay, Lei, Jijie, and Art talked about what experiments may be appropriate to our current state of work. We mentioned some possible candidates and agreed to continue thinking about this. The following comments are from Art's notes:

We noted that "reference sets" was a potentially rich area for work (see Vijay's Aug 24 email attached).

Experiments with human subjects (say, experts validating clustering or users using an interface à la Roussinov) would need Institutional Review Board (IRB) approval. Since the IRB process is being reviewed (and tightened up), we might need to adjust.

We discussed the fact that simple heuristics may be "just as good" as SOM, i.e. is the clustering that results from SOM really just a reflection of inheritance? If so, a heuristic algorithm that develops an "inheritance tree" may be sufficient: just observe the resulting "branching nodes."

The challenge with clustering of attributes' metadata is that the metadata is sparse: OID, NAME, SYNTAX, DESCRIPTION. And the latter two are perhaps limited distinguishing factors (being "one of only several different values").

With that said, here are some possible experiments:

a) Repeat the 320 SOM experiment exactly. Confirm results.
b) Conduct the 320 SOM experiment, except using <Novell, OpenLDAP, SecureWay> objects.
c) Conduct a) or b) but vary the domain of the 320 values (x, y, neighborhood size, iterations).
d) SOM using attributes for <iPlanet, Novell, OpenLDAP, SecureWay>.
e) Treat the whole schema (thanks, Susan Qu), i.e. objectclasses, attributes, and matching rules, and cluster. This has aspects of a "DNA (directory node analysis) signature." Hypothesis: using the same configuration (same LDAP, same SOM parameters) results in exactly the same mapping.
f) Continue with genetic algorithm.
g) Find reference set as intersection of <iPlanet, Novell, OpenLDAP, SecureWay>. Conduct clustering using this reference set as the "expert solution" to achieve. Compare to results of experiments b) or c), which used a "universal vector."
h) Find reference set as <person, organizationalPerson, inetOrgPerson, eduPerson, other_eduPerson>. Conduct clustering using this reference set as the "expert solution" to achieve. Compare to results of experiments b) or c), which used a "universal vector." Hypothesis: at least from the perspective of a specific filter, this then clusters appropriately. I.e. "As an expert in a certain area (person info), I'm only interested in those objects anyway."
i) Calculate distances of the resulting SOM object mappings (i.e. don't just use a fixed rectangular matrix) and determine clustering. Compare for different LDAPs. Hypothesis: we can determine clusters more accurately.
j) Using a similar calculation of object distances on the resulting mappings, determine a threshold of "nearness" that identifies clusters. If the g) or h) reference sets have a certain nearness factor, is that helpful?
k) Consider reference sets that aggregate. Start with a core, cluster, include new items in the core that are close, recluster. Is there a reasonable point at which one now has a robust reference set that works in general? Hypothesis: we can build around an "armature" and soon the form becomes self-evident.

From: "Vijay K. Vaishnavi" <vvaishna@gsu.edu>
Date: Sun Aug 24, 2003 7:43:56 PM US/Eastern
To: "Cdshaw" <cdshaw@cc.gatech.edu>, "Art Vandenberg" <avandenberg@langate.gsu.edu>
Cc: "Lei Li" <lli@cis.gsu.edu>, "Vijay K. Vaishnavi" <vvaishna@gsu.edu>, "Jijie Wang" <jijiew@yahoo.com>
Subject: Reference Sets: Initial thoughts
Attachments: There is 1 attachment

Art, Chris:

I think the discussion on reference sets in Friday's meeting and the suggestion of tags by Chris have resulted in a viable approach to verifying the correctness of the clustering process. Here are some related thoughts:

1. We need to have only one "story"/"scenario" that each Internet2 school must use to enter the data from the scenario into its directory schema for eduPerson. We may want to use two or three such scenarios but there is no real need for it.

2. If we can get each of the 200-odd schools to enter the scenario data into its directory schema for eduPerson, and if we have created the scenario such that it captures all the attributes that any school has for eduPerson, then the approach itself may be theoretically enough for "correctly" clustering all the eduPerson attributes in use by the Internet2 schools. To hope to get the desired response, we need to make the task as simple as possible. For example, we could create a form for each school that it can fill in for the given scenario, which corresponds to the schema for eduPerson for the school directory.

3. While we cannot guarantee that we will get a 100% response, we can make sure that we create the scenario such that all attributes being used by any of these schools get used. This will mean that we need to download the scenario for each of these schools; possibly the eduPerson schema for schools like Michigan or Wisconsin will point to what we need to include in the scenario.

4. Even though the approach can itself lead to an acceptable clustering of attributes, we need to go beyond the manual process to facilitate the process, to make it generalizable to schools that do not participate in the manual process, and to make the process dynamic. To achieve this, we can use the manual approach for meta-training of the automated process and for validation of that training. We could divide the schools (from which we get the responses) randomly into two sets, with the first set containing about double the number of elements in the second set. We could then use the first set to "meta-train" the genetic algorithm and then use the other set to verify the correctness of the clustering. We need to think about these details from a statistical standpoint.

5. To hope that the algorithm has any chance of success, we need to use the significant words in the description of the attributes in defining the universal vector set, in addition to the attributes themselves. We need to get started on downloading the eduPerson attributes for all the schools and the corresponding descriptions to see how the problem space is shaping up.

I think there is some inherent merit to this approach and the approach is generalizable to other domains. Let us discuss the approach. If we agree on the approach then we should proceed to completing the tasks that support the approach.

I wanted to circulate this message at least to everybody who was present in the meeting, but I did not have at hand the e-mail addresses of Susana and Roop.

I am proposing to have an extra class on Friday, September 5 (at 4:30 pm) for a course I am teaching and so I will have to leave at 4:15 pm if we have the next meeting on September 3.

Vijay

Vijay K. Vaishnavi
Professor of Computer Information Systems & Professor of Computer Science
Georgia State University
URL: http://www.cis.gsu.edu/~vvaishna
E-Mail: vvaishna@gsu.edu
Telephone: 404-651-3891
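Note on item 4 of the message above: the random two-to-one division of responding schools could be sketched roughly as follows. This is only an illustration under stated assumptions, not something the group has built. The function name `split_schools`, the placeholder school names, and the fixed seed are all hypothetical; the message only says the first set should hold "about double" the second.

```python
import random


def split_schools(schools, seed=None):
    """Randomly divide schools into a meta-training set and a validation set.

    The first set gets about two thirds of the schools, so it is roughly
    double the size of the second (hypothetical helper, for illustration).
    """
    rng = random.Random(seed)
    shuffled = list(schools)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # integer arithmetic avoids rounding surprises
    return shuffled[:cut], shuffled[cut:]


# Placeholder respondents; real names would come from the returned forms.
responding = [f"school_{i:03d}" for i in range(30)]
train, validate = split_schools(responding, seed=2003)
print(len(train), len(validate))  # prints: 20 10
```

Fixing the seed makes the split repeatable, which may matter if the validation results are to be compared across runs of the genetic algorithm.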
Last Updated: March 2, 2006