Abstract

This concept page explains how Solr's faceting capabilities could be added to Magnolia.

Enhancing the existing search with Solr's faceting capabilities

Use Cases

Enabling this would make the following use cases possible.

 

  • Easy configuration: adding facets is easy through access to Solr fields and the possibility to facet on everything that is indexed.
  • Facet/categorize on all fields submitted to the index and on all content inside the index (DMS/DATA, WEBSITE, third party).

  • Do keyword-based searches in faceted content and get the facets that apply to a specific keyword search.
    • The above keyword search returns the associated categories that are available for the specific search; refining is possible by clicking on one of the items, for instance clicking on IT Systems gives the only result matching both IT-Systems and "Magnolia Presentation".
  • Do range faceting (prices/dates) and propose a general search interface for product/e-commerce sites (see the query sketch after this list).
  • Be able to provide user-context content paths; this could perhaps become a concept page of its own.
    • First, all content is categorized; each content item must have at least one user-profile categorization (developer, marketing, buyer, ...).
    • Then, navigation is done through Solr's faceted search. Based on a few initial choices, different layouts can be proposed after each refinement.
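
The keyword and range faceting use cases map directly onto Solr's standard facet parameters. The following is a minimal query sketch in SolrJ, assuming illustrative field names (categories, category_resources_role, price) rather than the provider abstraction described below:

import org.apache.solr.client.solrj.SolrQuery;

public class FacetedQueryExample {

    // Builds a keyword search that also asks Solr for facet counts.
    public static SolrQuery buildQuery(String keyword) {
        SolrQuery query = new SolrQuery(keyword);
        query.setFacet(true);
        // Facet on assumed category fields; any indexed field could be used here.
        query.addFacetField("categories", "category_resources_role");
        // Range faceting on an assumed numeric "price" field: buckets of 100 between 0 and 1000.
        query.addNumericRangeFacet("price", 0, 1000, 100);
        return query;
    }
}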


 

Proposed architecture

We will try to keep things as generic as possible so that other search providers can be used as well. We extend the ExtSearchResultModel class with a FacetedSearchResultModel class that only contains the faceting-specific getters/setters.
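
As a rough illustration (all names below are assumptions, not the final API), the subclass could simply expose the facet counts returned by the provider:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: adds faceting results on top of the existing
 * ExtSearchResultModel of the Extended Search module.
 * Constructors delegating to the superclass are omitted for brevity.
 */
public class FacetedSearchResultModel extends ExtSearchResultModel {

    // facet name -> (constraint value -> hit count)
    private Map<String, Map<String, Long>> facets = new HashMap<String, Map<String, Long>>();

    public Map<String, Map<String, Long>> getFacets() {
        return facets;
    }

    public void setFacets(Map<String, Map<String, Long>> facets) {
        this.facets = facets;
    }
}

Template scripts could then iterate over getFacets() to render the refinement links.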

 

How do we push the categories to the index?

Two things have to be distinguished here: content from the website, and assets like documents, movies and other files.

Website content

Content from the website is already picked up by the Heritrix crawler, which calls the provider instance through the Extended Search configuration and pushes URLs to the Solr server; Solr then extracts all content and indexes it.

To add categorization tags, we can write the categories into a meta field on the page by adding a script to the HtmlHeader template, as follows.

[#assign categories = pageModel.categories!]
[#assign hasCategories = categories?has_content]

[#function getCats itemlist]
    [#-- Assigns: Get Content from List Item--]
    [#local cats = ""]
    [#list itemlist as item]
       [#local itemName = item.@name]
       [#local itemDisplayName = item.displayName!itemName]
       [#local cats = itemDisplayName + "," + cats]
    [/#list]
    [#return cats]
[/#function]


[#if hasCategories]
<meta name="categories" content="${getCats(categories)}"/>
[/#if]


With this in place, Solr's Tika parser will pick up the content of the categories meta field and index it, provided the Solr schema has a corresponding categories field. This is nice, but what if we want to create other facets, as is done with the resources module? In the resources module we have "root" categories that act as facets, such as resources_role and resources_subject, and you would not want to modify your schema each time you add another facet to Magnolia, would you?

This is where the power of Solr comes into play: Solr lets you declare dynamic fields, which are created automatically when they do not yet exist in the index. To do so, we added the following field to Solr's schema.

<dynamicField name="category_*" type="category" stored="true" indexed="true" multiValued="true"/>

This tells Solr to automatically create a category field each time a facet whose name starts with category_ is added to the index. This means that if a meta field like the following is sent to the index:

<meta name="category_resources_role" content="constraint1, constraint2"/>

category_resources_role is created in the schema, and constraint1 and constraint2 are indexed under this field or facet.
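
On the query side such a dynamic field behaves like any other facet field. A small SolrJ sketch that reads the constraints back (assuming a SolrJ 4.x HttpSolrServer pointing at a local Solr instance; in the module the provider would normally supply the client):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DynamicFacetQueryExample {

    public static void main(String[] args) throws Exception {
        // Assumed local Solr URL.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("category_resources_role");

        QueryResponse response = solr.query(query);
        for (FacetField facet : response.getFacetFields()) {
            for (FacetField.Count constraint : facet.getValues()) {
                // e.g. "constraint1 (2)", "constraint2 (1)"
                System.out.println(constraint.getName() + " (" + constraint.getCount() + ")");
            }
        }
    }
}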

Whether to prepend the "category_" prefix to the categories' "root" category in Magnolia, or to prepend it when submitting the content (especially when submitting resources from the JCR data repository), is an implementation decision.

Now, what about JCR content that is not accessible to the crawler?

This type of content can be submitted either by extracting the data, converting it to the correct Solr syntax and submitting it to the index on a bulk or batch basis, or through a JCR event listener that submits the content once it becomes available for publishing.
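
A minimal sketch of the listener variant, using the plain JCR observation API (the observed path and the wiring are assumptions; in Magnolia one would more likely hook into activation/publishing):

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

public class SolrSubmittingListener implements EventListener {

    private final Session session;

    public SolrSubmittingListener(Session session) {
        this.session = session;
    }

    // Listen for new nodes below an assumed /resources path in the observed workspace.
    public void register() throws Exception {
        ObservationManager om = session.getWorkspace().getObservationManager();
        om.addEventListener(this, Event.NODE_ADDED, "/resources", true, null, null, false);
    }

    @Override
    public void onEvent(EventIterator events) {
        while (events.hasNext()) {
            Event event = events.nextEvent();
            try {
                Node node = session.getNode(event.getPath());
                // Convert the node to field/value pairs and hand it to the search
                // provider, e.g. with the same logic as in the command shown below.
            } catch (Exception e) {
                // log and skip this event
            }
        }
    }
}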

I wrote the following command to index video resources and slideshow resources into the Solr index. This of course has to be enhanced, perhaps by finding a way through workflow to decide whether or not the specified content should be indexed.

 private final static String XpathData = "//*[((@jcr:primaryType='slideshow-resource') or (@jcr:primaryType='video-resource'))]";


    /* (non-Javadoc)
     * @see info.magnolia.commands.MgnlCommand#execute(info.magnolia.context.Context)
     */
    @Override
    public boolean execute(Context context) throws Exception {

        Session session = context.getJCRSession("data");
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(XpathData, "xpath");
        NodeIterator nodeIt = query.execute().getNodes();
        
        
        /**
         * This call is important since it gives us access to the search provider instance.
         */
        SearchService<?, ?, ?, String> svc = EsUtil.getProviderInstance();

        while(nodeIt.hasNext()){
            Node current = nodeIt.nextNode();
            Map<String, String> things = this.prepareThings(current, session);
            if(things!=null){
                svc.addUpdate(RepositoryEntries.DAM.name(), things);
            }
        }


        return true;
    }

    private Map<String,String> prepareThings(Node current,Session session){

        Map<String, String> things = new HashMap<String, String>();
        ContentMap mp = new ContentMap(current);
        Map<String, List<String>> categorySet = extractCategories((String[]) mp.get("categories"), session);
        //put categories
        /**
         * Here we already use the Solr-specific syntax (the literal.* field parameters).
         * This is not ideal since this class should be provider-agnostic; we could overcome this
         * by defining a generic format and per-provider converters in the provider package's logic.
         */
        for(String facet:categorySet.keySet()){
            things.put("literal.category_"+facet,StringUtils.join(categorySet.get(facet),","));
        }
        String abstrakt = (String)mp.get("abstract");
        things.put("literal.abstract", abstrakt);
        String itemType = (String) mp.get("itemtype");
        things.put("literal.type", itemType);
        String link = (String) mp.get("link");
        things.put("literal.htmllink", link);
        String url = extractURL(link);
        things.put("literal.url", url);
        String id=null;
        try {
            id = URLEncoder.encode(url,"UTF-8");
        } catch (UnsupportedEncodingException e) {
            log.warn("could not encode id: " + e.getMessage());
            return null;
        }
        things.put("literal.id", id);
        String name = (String) mp.get("name");
        things.put("literal.title", name);
        log.debug(name+"<======>"+url+"<=====>"+StringUtils.join(categorySet.values(),"-"));
        //cr:lastModified,width,nodeDataTemplate,jcr:data,depth,jcr:uuid,size,extension,id,height,name,path,jcr:mimeType,fileName,nodeType,jcr:primaryType
        String thumbnail = (String) ((ContentMap)mp.get("thumbnail")).get("name");
        log.debug("node's thumbnail name:"+thumbnail);

        return things;
    }
