Abstract

This concept page outlines rough ideas for either refining search in the resources module using Solr, or building a more generic faceted search that relies only on Solr's faceting.

Choice 1: Enhancing the resources module

Goal

The goal is to refine search within the selected items: the existing resources module is enhanced with a customized search that looks only inside the selected content.

Benefits

Users can search for keywords inside the resources and are assisted while searching there.

Implementation possibilities

Create a new search box that talks to Solr and filters on the URL of the page it sits on, used as a prefix. The box either calls its own page to display the results or shows them on another page outside the resources module page.
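As an illustration of the prefix filter, here is a minimal sketch that builds the request parameters for such a search box. It assumes the index has an untokenized `url` field (a prefix query behaves differently on a tokenized field); the class and method names are hypothetical and not part of any existing module.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds the query string for a keyword search restricted to a URL prefix (sketch). */
final class PrefixSearchQuery {

    // Characters with a special meaning in the Lucene/Solr query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    /** Backslash-escapes query syntax characters so the prefix is matched literally. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Returns q=<keywords>&fq=url:<escaped prefix>* with both values URL-encoded. */
    static String build(String keywords, String urlPrefix) {
        // Assumption: "url" is the name of an untokenized field in the index.
        String filter = "url:" + escape(urlPrefix) + "*";
        return "q=" + encode(keywords) + "&fq=" + encode(filter);
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Calling `PrefixSearchQuery.build("magnolia", "/resources/")` yields a parameter string that can be appended to the select URL of the Solr server. Using a filter query (`fq`) rather than adding the prefix to `q` keeps the restriction out of the relevance scoring.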

Drawbacks/Difficulties

Solr's faceted search is not used, so the implementation will not be very generic.

Choice 2: Using Solr's faceted search

Goal

The goal is to use the faceted search offered by Solr to implement a generic faceted search on all website content.

Benefits

A generic approach with different visualizations is possible, and faceting can be done on all content, not only on resources.

Implementation possibilities

The following section explains how Solr's faceting could be added to Magnolia.

Enhancing the existing search with Solr's faceting possibilities

Use Cases

With this in place, the following use cases could be covered.

Easy configuration: adding facets is easy through access to the Solr fields, and it is possible to facet on everything that is indexed.

Facet/categorize on all fields submitted to the index and for all content inside the index (DMS/DATA, WEBSITE, third party).

Do keyword-based searches in faceted content and get the current facets for a specific keyword search.
- The keyword search returns the categories that are available for that specific search. Refining is possible by clicking on one of the items; for instance, clicking on IT Systems leaves the only result matching both IT-Systems and "Magnolia Presentation".

Do range faceting (price/dates) and propose a general search interface for product/e-commerce sites.

Provide content paths based on the user context; this could be a concept page of its own.
- First, all content is categorized; each content item must have at least one user profile categorization (developer, marketing, buyer, ...).
- Then, navigation is done through Solr's faceted search. Based on a few initial choices, different layouts can be proposed after each refinement.
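The keyword, refinement and range use cases above map onto standard Solr request parameters (`facet`, `facet.field`, `fq`, `facet.range`). The sketch below shows one way to assemble them. The builder class is hypothetical, and the field names `category_resources_subject` and `price` are assumptions chosen for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Collects Solr request parameters for a faceted keyword search (sketch). */
final class FacetQueryBuilder {

    private final List<String> params = new ArrayList<>();

    FacetQueryBuilder(String keywords) {
        add("q", keywords);
        add("facet", "true");
        // Hide constraints that have no hits for the current search.
        add("facet.mincount", "1");
    }

    /** Asks Solr to return constraint counts for the given field. */
    FacetQueryBuilder facetOn(String field) {
        add("facet.field", field);
        return this;
    }

    /** Narrows the result set to one constraint, as when the user clicks a facet value. */
    FacetQueryBuilder refine(String field, String value) {
        String quoted = value.replace("\\", "\\\\").replace("\"", "\\\"");
        add("fq", field + ":\"" + quoted + "\"");
        return this;
    }

    /** Adds range faceting, for example on a price or date field. */
    FacetQueryBuilder rangeFacet(String field, String start, String end, String gap) {
        add("facet.range", field);
        add("f." + field + ".facet.range.start", start);
        add("f." + field + ".facet.range.end", end);
        add("f." + field + ".facet.range.gap", gap);
        return this;
    }

    String build() {
        return String.join("&", params);
    }

    private void add(String name, String value) {
        params.add(encode(name) + "=" + encode(value));
    }

    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

A refinement click on "IT Systems" would then be expressed as `new FacetQueryBuilder("magnolia").facetOn("category_resources_subject").refine("category_resources_subject", "IT Systems").build()`. Each further click adds another `fq` parameter, so the facet counts returned by Solr always reflect the current selection.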

Proposed architecture

We will try to keep things as generic as possible so that other search providers can be used as well. We extend the ExtSearchResultModel class with the FacetedSearchResultModel class, which only contains the getters/setters specific to faceting.
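A rough sketch of that class relationship follows. The `ExtSearchResultModel` shown here is only a stand-in so the example compiles on its own; the real class belongs to the Extended Search module and its API is not described on this page. The property names on the faceted subclass are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for the existing model (hypothetical shape, not the real API). */
class ExtSearchResultModel {
    private List<String> results = new ArrayList<>();

    public List<String> getResults() { return results; }
    public void setResults(List<String> results) { this.results = results; }
}

/** Adds only the faceting state on top of the generic result model. */
class FacetedSearchResultModel extends ExtSearchResultModel {

    // facet field -> (constraint -> hit count), in the order returned by the provider
    private Map<String, Map<String, Long>> facets = new LinkedHashMap<>();

    // constraints the user has already clicked, e.g. "category_resources_role:developer"
    private List<String> selectedConstraints = new ArrayList<>();

    public Map<String, Map<String, Long>> getFacets() { return facets; }
    public void setFacets(Map<String, Map<String, Long>> facets) { this.facets = facets; }

    public List<String> getSelectedConstraints() { return selectedConstraints; }
    public void setSelectedConstraints(List<String> selectedConstraints) {
        this.selectedConstraints = selectedConstraints;
    }
}
```

Keeping the facet data as plain maps and lists means templates do not depend on Solr classes, which matches the aim of staying provider-neutral.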

How do we push the categories to the index ?

Two kinds of content have to be distinguished here: content from the website, and assets such as documents, movies and similar files.

Website content

Content from the website is already picked up by the Heritrix crawler, which calls the provider instance through the Extended Search configuration and pushes URLs to the Solr server; Solr then extracts all content and indexes it.

To add categorization tags, we can add the categories to a meta field in the page through a script in the HtmlHeader template (the script is shown under Choice 3).

The URL, split into its path segments, could already define a useful categorization. For instance, a faceted search on the URL field of the corporate website currently gives the following results.

Code Block

language	html/xml

<lst name="facet_fields"><lst name="url"><int name="cms">529</int><int name="magnolia">529</int><int name="20011">524</int><int name="test">524</int><int name="community">168</int><int name="conference">137</int><int name="program">124</int><int name="company">97</int><int name="our">97</int><int name="news">93</int><int name="clients">91</int><int name="press">72</int><int name="day">68</int><int name="references">66</int><int name="www">64</int><int name="archive">62</int><int name="releases">62</int><int name="youtube">56</int><int name="embed">54</int><int name="2010">53</int><int name="speakers">53</int><int name="partner">50</int><int name="presentation">45</int><int name="old">43</int><int name="country">39</int><int name="partners">38</int><int name="amplify">35</int><int name="miami">35</int><int name="presentations">31</int><int name="4">30</int><int name="case">24</int><int name="studies">24</int><int name="dms">23</int><int name="landing">23</int><int name="features">22</int><int name="newsletter">22</int><int name="5">21</int><int name="de">21</int><int name="industry">21</int><int name="0">18</int><int name="pdf">17</int><int name="1">16</int><int name="2">16</int><int name="3">14</int><int name="level">13</int><int name="top">12</int><int name="8">11</int><int name="and">11</int><int name="briefs">11</int><int name="tech">11</int><int name="us">11</int><int name="coverage">10</int><int name="management">10</int><int name="directory">9</int><int name="presence">9</int><int name="products">9</int><int name="resource">9</int><int name="services">9</int><int name="virtual">9</int><int name="2009">8</int><int name="9">8</int><int name="mbc">8</int><int name="release">8</int><int name="spotlight">8</int><int name="support">8</int><int name="webinars">8</int><int name="contact">7</int><int name="industries">7</int><int name="location">7</int><int name="logos">7</int><int name="workshops">7</int><int name="7">6</int><int name="brief">6</int><int 
name="eps">6</int><int name="jobs">6</int><int name="navy">6</int><int name="t">6</int><int name="20">5</int><int name="a">5</int><int name="c">5</int><int name="development">5</int><int name="e">5</int><int name="enterprise">5</int><int name="evaluation">5</int><int name="open">5</int><int name="pr">5</int><int name="robots">5</int><int name="roles">5</int><int name="shirt">5</int><int name="static">5</int><int name="the">5</int><int name="travel">5</int><int name="txt">5</int><int name="venue">5</int><int name="visit">5</int><int name="workshop">5</int><int name="all">4</int></lst>
</lst>

Filtering out the irrelevant tags could give a nice generic auto-categorization. Of course, this does not take into consideration the user-defined categories from the category module.
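One possible way to filter the URL facet counts is sketched below. The heuristics (drop numeric tokens, very short tokens, stop words and rare tokens) and the stop word list are assumptions for illustration; which tokens count as irrelevant would have to be decided per site.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Removes URL tokens that are unlikely to be meaningful categories (sketch). */
final class UrlFacetFilter {

    static Map<String, Integer> filter(Map<String, Integer> counts,
                                       Set<String> stopWords,
                                       int minCount) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            String token = entry.getKey();
            boolean tooShort = token.length() < 3;                        // "t", "a", "de"
            boolean numeric = token.chars().allMatch(Character::isDigit); // "2010", "4"
            boolean tooRare = entry.getValue() < minCount;
            if (tooShort || numeric || tooRare || stopWords.contains(token)) {
                continue;
            }
            kept.put(token, entry.getValue());
        }
        return kept;
    }
}
```

Applied to a few of the counts above (cms 529, 20011 524, community 168, news 93, www 64, 4 30, de 21, and 11, the 5, venue 5) with the stop words `www`, `and`, `the` and a minimum count of 10, only cms, community and news remain. Note that the length rule also drops two-letter segments such as `de` or `us`, which may be undesirable if language or country segments should become facets.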

Drawbacks/Difficulties

Solr indexing is URL based, so there needs to be a way either to add the user-selected categories to the URL, which would probably be difficult, or to derive the categories from the URL and send them to the Solr index.

This would be possible if there is a way to get the root node from the URL and browse the JCR to gather the associated categories.

Info
This will work only if the page does not carry multiple categorizations on different subcontents!
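The first half of that idea, getting from a crawled URL to a repository path, could look roughly like the sketch below. It assumes that a page URL has the form context path + node path + extension; the helper name and that mapping are assumptions, and the JCR lookup of the categories themselves is not shown.

```java
import java.net.URI;

/** Derives a candidate repository path from a crawled page URL (sketch). */
final class UrlToHandle {

    static String toHandle(String url, String contextPath) {
        String path = URI.create(url).getPath();
        // Strip the web application context, e.g. "/magnoliaPublic".
        if (!contextPath.isEmpty() && path.startsWith(contextPath + "/")) {
            path = path.substring(contextPath.length());
        }
        // Strip the extension of the last segment, e.g. ".html".
        int lastSlash = path.lastIndexOf('/');
        int lastDot = path.lastIndexOf('.');
        if (lastDot > lastSlash) {
            path = path.substring(0, lastDot);
        }
        return path;
    }
}
```

For example, `toHandle("http://www.example.com/magnoliaPublic/company/news.html", "/magnoliaPublic")` gives `/company/news`, which could then be used to look up the page node and read its categories.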

Choice 3: Tagging content inside the page ( meta and micro tags )

Goal

The goal is to enhance the categorization module so that it tags content inside the page, either for the whole page (meta tag in the header) or for a part of it (micro tag in the div). The tags can then be picked up by an external parser or search engine, which offers SEO enhancement and in-house faceting.

Benefits

Standardized categorization and content tagging, easily exploitable by standard parser tools and search engines.

Micro tags could also be used to tell the custom Magnolia extractor not to index certain content. For a complete page, "robots.txt" can be used.

Implementation possibilities

Drawbacks/Difficulties

[#assign categories = pageModel.categories!]
[#assign hasCategories = categories?has_content]

[#-- Builds a comma-separated list of category display names. --]
[#function getCats itemlist]
    [#local cats = ""]
    [#list itemlist as item]
       [#local itemName = item.@name]
       [#-- Fall back to the node name when no display name is set. --]
       [#local itemDisplayName = item.displayName!itemName]
       [#-- Add the separator only between entries, so no trailing comma is emitted. --]
       [#if cats?has_content][#local cats = cats + ","][/#if]
       [#local cats = cats + itemDisplayName]
    [/#list]
    [#return cats]
[/#function]

[#if hasCategories]
<meta name="categories" content="${getCats(categories)?html}"/>
[/#if]

With this in place, Solr's Tika parser picks up the content of the categories meta field and indexes it, provided the Solr schema has a corresponding categories field. This works, but what if we want to create other facets, as is done in the resources module? There we have "root" categories that act as facets, such as resources_role and resources_subject. You would not want to modify the schema each time another facet is added to Magnolia.

This is where Solr's dynamic fields help: any field whose name matches a declared pattern is accepted and indexed without being declared individually in the schema. To enable this, we added the following field to Solr's schema.

Code Block

language	html/xml

<dynamicField name="category_*" type="category" stored="true" indexed="true" multiValued="true"/>

This tells Solr to treat every field whose name starts with category_ as a category field. So if a meta field such as the following is sent to the index:

Code Block

language	html/xml

<meta name="category_resources_role" content="constraint1, constraint2"/>

category_resources_role is accepted without any schema change, and constraint1 and constraint2 are indexed under this field, which can then be used as a facet.

Info
Whether the "category_" prefix is prepended to the name of the "root" category in Magnolia or added when the content is submitted (especially when submitting resources from the JCR data repository) is an implementation decision.
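The naming convention can be sketched as follows, here with the prefix added at submission time. The class is hypothetical; it takes each root category with its constraints and renders one meta tag per facet in the form shown above.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Renders one meta tag per root category, named to match the category_* dynamic field (sketch). */
final class CategoryMetaTags {

    static final String PREFIX = "category_";

    /** Adds the prefix unless the root category already carries it. */
    static String fieldName(String rootCategory) {
        return rootCategory.startsWith(PREFIX) ? rootCategory : PREFIX + rootCategory;
    }

    static String render(Map<String, List<String>> facets) {
        StringJoiner out = new StringJoiner("\n");
        for (Map.Entry<String, List<String>> facet : facets.entrySet()) {
            String content = String.join(", ", facet.getValue());
            out.add("<meta name=\"" + escape(fieldName(facet.getKey()))
                    + "\" content=\"" + escape(content) + "\"/>");
        }
        return out.toString();
    }

    // Minimal escaping for HTML attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;")
                .replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Because `fieldName` leaves an already prefixed name untouched, the same helper works whichever side ends up owning the prefix.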

Now what about JCR content that is not accessible by the crawler?

This type of content can be sent in two ways: either by extracting the data, converting it to the correct Solr syntax and submitting it to the index in bulk or in batches, or through a JCREventListener that submits the content once it is available for publishing.

I wrote the following command to push video_resources and slideshow_resources to the Solr index. It still has to be enhanced, for example by finding a way to decide through workflow whether a given piece of content should be indexed.

Code Block

language	java

    private static final String XpathData = "//*[((@jcr:primaryType='slideshow-resource') or (@jcr:primaryType='video-resource'))]";


    /* (non-Javadoc)
     * @see info.magnolia.commands.MgnlCommand#execute(info.magnolia.context.Context)
     */
    @Override
    public boolean execute(Context context) throws Exception {

        Session session = context.getJCRSession("data");
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(XpathData, "xpath");
        NodeIterator nodeIt = query.execute().getNodes();
        
        
        /**
         * This call is important since we ask access to the search provider instance
         * 
         */
        SearchService<?, ?, ?, String> svc = EsUtil.getProviderInstance();

        while(nodeIt.hasNext()){
            Node current = nodeIt.nextNode();
            Map<String,String>things = this.prepareThings(current, session);
            if(things!=null){
                svc.addUpdate(RepositoryEntries.DAM.name(), things);
            }
        }


        return true;
    }

    private Map<String,String> prepareThings(Node current,Session session){

        Map<String,String>things = new HashMap<String,String>();
        ContentMap mp = new ContentMap(current);
        Map<String,List<String>> categorySet = extractCategories(((String[])mp.get("categories")),session);
        //put categories
        /**
         * Here we already give the correct syntax for usage with solr, 
         * This is not ok since this class should be agnostic, we can overcome this by creating a generic format and converters for each format in the provider package's logic  
         * 
         */
        for(String facet:categorySet.keySet()){
            things.put("literal.category_"+facet,StringUtils.join(categorySet.get(facet),","));
        }
        String abstrakt = (String)mp.get("abstract");
        things.put("literal.abstract", abstrakt);
        String itemType = (String) mp.get("itemtype");
        things.put("literal.type", itemType);
        String link = (String) mp.get("link");
        things.put("literal.htmllink", link);
        String url = extractURL(link);
        things.put("literal.url", url);
        // The URL-encoded url serves as the unique document id.
        String id = null;
        try {
            id = URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            log.warn("Could not encode id for url " + url + ": " + e.getMessage());
            return null;
        }
        things.put("literal.id", id);
        String name = (String) mp.get("name");
        things.put("literal.title", name);
        log.debug(name+"<======>"+url+"<=====>"+StringUtils.join(categorySet.values(),"-"));
        //cr:lastModified,width,nodeDataTemplate,jcr:data,depth,jcr:uuid,size,extension,id,height,name,path,jcr:mimeType,fileName,nodeType,jcr:primaryType
        // Guard against resources that have no thumbnail node.
        ContentMap thumbnailNode = (ContentMap) mp.get("thumbnail");
        String thumbnail = thumbnailNode != null ? (String) thumbnailNode.get("name") : null;
        log.debug("node's thumbnail name:" + thumbnail);

        return things;
    }
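The comment in prepareThings notes that the command should not know the Solr syntax and that a generic format with per-provider converters would be cleaner. A minimal sketch of that idea follows. All type names here are hypothetical; only the `literal.` parameter prefix and the `category_` field prefix are taken from the command above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Provider-neutral description of one item to index (hypothetical type). */
final class IndexItem {
    final Map<String, String> fields = new LinkedHashMap<>();
    final Map<String, List<String>> categories = new LinkedHashMap<>();
}

/** Turns a neutral item into the parameters a specific search provider expects. */
interface IndexItemConverter {
    Map<String, String> convert(IndexItem item);
}

/** Solr flavour: fields become literal.<name>, facets become literal.category_<facet>. */
final class SolrItemConverter implements IndexItemConverter {

    @Override
    public Map<String, String> convert(IndexItem item) {
        Map<String, String> params = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> facet : item.categories.entrySet()) {
            params.put("literal.category_" + facet.getKey(),
                       String.join(",", facet.getValue()));
        }
        for (Map.Entry<String, String> field : item.fields.entrySet()) {
            // Skip properties that are not set on the node.
            if (field.getValue() != null) {
                params.put("literal." + field.getKey(), field.getValue());
            }
        }
        return params;
    }
}
```

With such a split, the command would only fill an `IndexItem` from the JCR node, and the provider package would choose the converter, so switching to another search provider would not touch the command.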

Multiple categorizations by page
