...

Two kinds of content have to be distinguished here: content from the website, and assets such as documents, videos and other files.

Website content

Content from the website is already picked up by the Heritrix crawler, which calls the provider instance through the Extended Search configuration and pushes URLs to the Solr server. Solr then extracts the content behind each URL and indexes it.

...

Code Block

language	html/xml

[#assign categories = pageModel.categories!]
[#assign hasCategories = categories?has_content]

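[#function getCats itemlist]
    [#-- Builds a comma-separated string of the display names of all items in the list --]
    [#local cats = ""]
    [#list itemlist as item]
       [#local itemName = item.@name]
       [#local itemDisplayName = item.displayName!itemName]
       [#-- Append in list order; add the separator only between entries --]
       [#local cats = cats + itemDisplayName]
       [#if item_has_next][#local cats = cats + ","][/#if]
    [/#list]
    [#return cats]
[/#function]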


[#if hasCategories]
<meta name="categories" content="${getCats(categories)}"/>
[/#if]

With this in place, Solr's Tika parser picks up the content of the categories meta tag and indexes it, provided the Solr schema has a corresponding categories field. This is nice, but what if we want to create other facets, as is done in the resources module? There we have "root" categories that we can treat as facets, such as resources_role and resources_subject. You would not want to modify your schema each time you add another facet to Magnolia.

This is where the power of Solr comes into play. In Solr you can declare dynamic fields: any field whose name matches the declared pattern is accepted and created on the fly if it does not yet exist in the index. To do so, we added the following field to Solr's schema.

Code Block

language	html/xml

<dynamicField name="category_*" type="category" stored="true" indexed="true" multiValued="true"/>

This tells Solr to automatically create a category field each time a field whose name starts with category_ is added to the index. This means that if a meta tag such as the following is sent to the index:

Code Block

language	html/xml

<meta name="category_resources_role" content="constraint1, constraint2"/>

The category_resources_role field is then created in the index, and constraint1 and constraint2 are indexed under this field (or facet).
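To make the matching rule concrete, here is a minimal Java sketch of how a pattern such as category_* can be compared with incoming field names. It is not Solr's own implementation: the class and method names are hypothetical, and it assumes the pattern carries a single * at its start or end, which is the form used in dynamicField declarations.

```java
/** Illustrative only: mimics how a dynamicField glob is compared with a field name. */
final class DynamicFieldPattern {

    private DynamicFieldPattern() {
    }

    /**
     * Returns true if fieldName matches a glob with a single leading or trailing '*'.
     * A pattern without '*' has to match the field name exactly.
     */
    static boolean matches(String pattern, String fieldName) {
        if (pattern.endsWith("*")) {
            // Trailing wildcard: compare the prefix in front of the '*'.
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        if (pattern.startsWith("*")) {
            // Leading wildcard: compare the suffix after the '*'.
            return fieldName.endsWith(pattern.substring(1));
        }
        return pattern.equals(fieldName);
    }

    public static void main(String[] args) {
        System.out.println(matches("category_*", "category_resources_role"));    // true
        System.out.println(matches("category_*", "category_resources_subject")); // true
        System.out.println(matches("category_*", "abstract"));                   // false
    }
}
```

Any facet sent under a name that matches the pattern is accepted without a schema change, while a name such as abstract still needs its own explicit field.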

Info
Whether to prepend the "category_" prefix to the "root" category in Magnolia itself or to prepend it when submitting the content (especially when submitting resources from the JCR data repository) is an implementation decision.
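If the prefix is added at submission time, a small helper can tolerate both conventions. The following is only a sketch under that assumption; CategoryFieldNames and facetFieldName are hypothetical names and not part of Magnolia or Solr.

```java
/** Illustrative helper: derives the Solr field name for a "root" category. */
final class CategoryFieldNames {

    static final String PREFIX = "category_";

    private CategoryFieldNames() {
    }

    /** Adds the prefix unless the root category name already carries it. */
    static String facetFieldName(String rootCategory) {
        return rootCategory.startsWith(PREFIX) ? rootCategory : PREFIX + rootCategory;
    }

    public static void main(String[] args) {
        System.out.println(facetFieldName("resources_role"));          // category_resources_role
        System.out.println(facetFieldName("category_resources_role")); // category_resources_role
    }
}
```

With such a helper, root categories that were already renamed in Magnolia and those that were not end up in the same dynamic field.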

Now what about JCR content that is not accessible to the crawler?

This type of content can be sent in one of two ways: either by extracting the data, converting it to the correct Solr syntax and submitting it to the index in bulk or in batches, or through a JCREventListener that submits the content once it is available for publishing.
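For the bulk variant, the conversion step could look roughly like the sketch below, which turns a list of field maps into a Solr XML update message of the form <add><doc><field name="...">...</field></doc></add>. The class name SolrXmlBatch and the field names id and title are assumptions made for illustration; the fields actually available depend on the schema, and posting the message to Solr is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: builds a Solr XML update message from extracted field maps. */
final class SolrXmlBatch {

    private SolrXmlBatch() {
    }

    /** Escapes the characters that are not allowed verbatim in XML text and attributes. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Wraps every map in a doc element; fields with a null value are skipped. */
    static String toAddXml(List<Map<String, String>> docs) {
        StringBuilder xml = new StringBuilder("<add>");
        for (Map<String, String> doc : docs) {
            xml.append("<doc>");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                if (field.getValue() == null) {
                    continue;
                }
                xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(field.getValue()))
                   .append("</field>");
            }
            xml.append("</doc>");
        }
        return xml.append("</add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc1");              // hypothetical id
        doc.put("title", "Tom & Jerry");    // hypothetical title
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc);
        System.out.println(toAddXml(docs));
    }
}
```

Collecting many documents into one message keeps the number of requests low, which is the main reason to prefer batches over one request per node.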

I wrote the following command to add video_resources and slideshow_resources to the Solr index. It still has to be enhanced, perhaps by finding a way to decide through workflow whether the specified content should be indexed.

Code Block

language	java

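    // Imports used by this excerpt (the class declaration extending MgnlCommand is not shown).
    // StringUtils (Apache Commons Lang), the logger and the Extended Search classes
    // (SearchService, EsUtil, RepositoryEntries) have to be imported as well.
    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import javax.jcr.ItemNotFoundException;
    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import info.magnolia.commands.MgnlCommand;
    import info.magnolia.context.Context;
    import info.magnolia.jcr.util.ContentMap;

    // Selects every slideshow and video resource node in the workspace.
    private final static String XpathData = "//*[((@jcr:primaryType='slideshow-resource') or (@jcr:primaryType='video-resource'))]";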


    /* (non-Javadoc)
     * @see info.magnolia.commands.MgnlCommand#execute(info.magnolia.context.Context)
     */
    @Override
    public boolean execute(Context context) throws Exception {

        Session session = context.getJCRSession("data");
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(XpathData, "xpath");
        NodeIterator nodeIt = query.execute().getNodes();
        SearchService<?, ?, ?, String> svc = EsUtil.getProviderInstance();

        while(nodeIt.hasNext()){
            Node current = nodeIt.nextNode();
            Map<String,String>things = this.prepareThings(current, session);
            if(things!=null){
                svc.addUpdate(RepositoryEntries.DAM.name(), things);
            }
        }


        return true;
    }

    private Map<String,String> prepareThings(Node current,Session session){

        Map<String,String>things = new HashMap<String,String>();
        ContentMap mp = new ContentMap(current);
        Map<String,List<String>> categorySet = extractCategories(((String[])mp.get("categories")),session);
        //put categories
        for(String facet:categorySet.keySet()){
            things.put("literal.category_"+facet,StringUtils.join(categorySet.get(facet),","));
        }
        String abstrakt = (String)mp.get("abstract");
        things.put("literal.abstract", abstrakt);
        String itemType = (String) mp.get("itemtype");
        things.put("literal.type", itemType);
        String link = (String) mp.get("link");
        things.put("literal.htmllink", link);
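        String url = extractURL(link);
        if (url == null) {
            // Without a URL there is no stable id for the Solr document, so skip this node.
            log.warn("could not extract a url from link: " + link);
            return null;
        }
        things.put("literal.url", url);
        String id=null;
        try {
            // The URL-encoded url doubles as the unique id of the document.
            id = URLEncoder.encode(url,"UTF-8");
        } catch (UnsupportedEncodingException e) {
            log.warn("could not encode id"+e.getMessage());
            return null;
        }
        things.put("literal.id", id);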
        String name = (String) mp.get("name");
        things.put("literal.title", name);
        log.debug(name+"<======>"+url+"<=====>"+StringUtils.join(categorySet.values(),"-"));
        //properties available on the thumbnail: jcr:lastModified,width,nodeDataTemplate,jcr:data,depth,jcr:uuid,size,extension,id,height,name,path,jcr:mimeType,fileName,nodeType,jcr:primaryType
        String thumbnail = (String) ((ContentMap)mp.get("thumbnail")).get("name");
        log.debug("node's thumbnail name:"+thumbnail);

        return things;
    }

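    private String extractURL(String link) {
        if (link == null) {
            return null;
        }
        // Look for the first href="..." or src="..." attribute in the markup.
        Pattern p = Pattern.compile("href=\"([^\"]*)\"|src=\"([^\"]*)\"", Pattern.DOTALL);
        Matcher m = p.matcher(link);
        String url = null;
        if (m.find()) {
            // Capturing groups are numbered from 1; group 0 is the whole match.
            for (int i = 1; i <= m.groupCount(); i++) {
                String value = m.group(i);
                int a = (value == null) ? -1 : value.indexOf("http");
                if (a >= 0) {
                    url = value.substring(a);
                }
            }
        }
        // Fall back to the raw value when it is already a plain URL.
        if (url == null && link.startsWith("http")) {
            url = link;
        }
        return url;
    }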

    private Map<String,List<String>> extractCategories(String [] nodes,Session session){
        Map<String,List<String>> categorySet = new HashMap<String,List<String>>();
        if (nodes == null) {
            // No categories are assigned to this node.
            return categorySet;
        }

        for(String nodeId:nodes){
            Node category;
            try {
                category = session.getNodeByIdentifier(nodeId);
                String facet = category.getParent().getName();
                String tag = category.getName();
                if(categorySet.containsKey(facet)){
                    categorySet.get(facet).add(tag);
                }else{
                    List<String>values=new ArrayList<String>();
                    values.add(tag);
                    categorySet.put(facet, values);
                }
            } catch (ItemNotFoundException e) {
                log.warn(e.getMessage() + " not found as category");
            } catch (RepositoryException e) {
                log.error(e.getMessage());
            }

        }
        return categorySet;
    }
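The grouping performed by extractCategories and prepareThings can be tried out without a JCR repository. The sketch below is a simplified stand-in and not the command itself: it assumes each category is given as a path whose last two segments are the facet and the tag, and the class name, method names and sample categories are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: groups category paths by facet and builds literal.* parameters. */
final class CategoryParamsSketch {

    private CategoryParamsSketch() {
    }

    /** Groups the last path segment (tag) under the name of its parent segment (facet). */
    static Map<String, List<String>> groupByFacet(List<String> categoryPaths) {
        Map<String, List<String>> byFacet = new LinkedHashMap<>();
        for (String path : categoryPaths) {
            int last = path.lastIndexOf('/');
            if (last <= 0) {
                continue; // no parent segment, so no facet can be derived
            }
            String tag = path.substring(last + 1);
            String parentPath = path.substring(0, last);
            String facet = parentPath.substring(parentPath.lastIndexOf('/') + 1);
            List<String> tags = byFacet.get(facet);
            if (tags == null) {
                tags = new ArrayList<>();
                byFacet.put(facet, tags);
            }
            tags.add(tag);
        }
        return byFacet;
    }

    /** Produces one literal.category_<facet> parameter per facet, tags joined by commas. */
    static Map<String, String> toLiteralParams(Map<String, List<String>> byFacet) {
        Map<String, String> params = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : byFacet.entrySet()) {
            params.put("literal.category_" + e.getKey(), String.join(",", e.getValue()));
        }
        return params;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/resources_role/teacher",
                "/resources_subject/math",
                "/resources_role/student");
        System.out.println(toLiteralParams(groupByFacet(paths)));
        // prints {literal.category_resources_role=teacher,student, literal.category_resources_subject=math}
    }
}
```

Each resulting parameter name matches the category_* dynamic field, so new root categories show up as facets without touching the schema.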
