Problem Solved: Finding Data in Content Migration

Web Development | April 18, 2017

Migrations are Complex

Migrating content from a legacy CMS platform to a new system like AEM is a crucial step in the implementation process, and one that takes input, time, and expertise from a variety of roles. From the client side, the need to clean up the content as much as possible is a real concern. Quite often, years of multiple authors working in the CMS across many geographical boundaries will create many pages of outdated content and cases of poorly constructed HTML. Both, along with similar problems, will lead to issues for internal users in the new system.

When you look at things from the partner or vendor side, migration is not a simple task by any means. It involves teams working together to solve both simple and complex tasks, both of which can be very time intensive. For example, the migration expert needs to reprocess the XML in Talend to get the desired data that needs to be deployed to AEM. This process takes a lot of time and can hold up progress. It also has to be done with each new or adjusted field we want migrated.

Another time-intensive task is finding specific data. For instance, you may want to locate URLs that contained special characters and validate that those characters were removed. It gets tricky, and lengthy, when you have 8,000+ pages to crawl through. Even if you are someone (like our talented Business Systems Analysts) who knows the CRXDE query tools, you may not be comfortable writing complex queries. So you reach out to the technical lead, asking them to query all content in CRXDE and confirm the issue has been fixed. Again, more time, more back and forth.

Collaborating to Develop a Solution

During a recent client migration, our team had a eureka! moment when working through data we were moving from SharePoint to AEM. As the Technical Lead, I have to find solutions that keep the development of the project moving forward quickly and efficiently. This can range from troubleshooting our continuous integration environments, to resolving highly technical security and search engine bugs, to developing tools that help the development team be more efficient. A Technical Lead wears many hats, but that’s what we have to do in order to solve the wide array of problems that our developers face.

A successful migration includes working closely with others, including a Business Systems Analyst, who is a systems expert with the ability to accurately and consistently translate the business requirements of the client into documentation actionable by design and development teams. Together, we worked through many requests needed in the migration effort, and each one meant spending hours to write, run, and validate the query and provide results.

We also ran into a scenario where some of the migrated pages had tables with fixed widths. These fixed widths were based on the SharePoint site’s design and were making the images within them appear squished. Cleaning content styles migrated from SharePoint was not scoped for this project. We needed to know how many pages were affected by these table styles so that we could report the numbers to the customer and discuss cleaning strategies. Our client was unable to quantify the affected pages by searching within SharePoint with these criteria, so we needed a way to crawl the content and search for specific values ourselves.

Out-of-the-box AEM search functionality is great when you are searching for a specific value on a resource. But when you search for “<table” in RTE (Rich Text Editor) content, you get thousands of results, and looking through all of them to find which tables have a fixed width is impractical. We began using a combination of query and traversal techniques inside a tool within AEM to search for specific content issues.
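To make the fixed-width table problem concrete, here is a minimal sketch of the kind of check involved. It is written in Python against plain HTML strings rather than against AEM itself, so it is an illustration of the idea and not the script we ran. The function names, the page paths, and the decision to treat only pixel widths as "fixed" are all assumptions made for this example.

```python
import re

# Opening <table ...> tags, in any letter case.
TABLE_TAG = re.compile(r"<table\b[^>]*>", re.IGNORECASE)
# width="600" or width=600px, but not a relative width such as width="100%".
WIDTH_ATTR = re.compile(r"""(?<![-\w])width\s*=\s*["']?\d+(?:px)?(?![\d%.])""", re.IGNORECASE)
# Inline style such as "width: 600px" (max-width and min-width are ignored).
STYLE_WIDTH = re.compile(r"(?<![-\w])width\s*:\s*\d+(?:\.\d+)?px", re.IGNORECASE)


def fixed_width_tables(html):
    """Return the opening <table> tags in html that set a pixel width."""
    return [
        tag
        for tag in TABLE_TAG.findall(html)
        if WIDTH_ATTR.search(tag) or STYLE_WIDTH.search(tag)
    ]


def scan_pages(pages):
    """Map each page path to its number of fixed-width tables, skipping clean pages."""
    report = {}
    for path, html in pages.items():
        hits = fixed_width_tables(html)
        if hits:
            report[path] = len(hits)
    return report


# Hypothetical page paths and markup, for illustration only.
sample_pages = {
    "/content/site/a": '<table width="600"><tr><td><img src="x.png"></td></tr></table>',
    "/content/site/b": '<table style="width:100%"><tr><td>ok</td></tr></table>',
    "/content/site/c": '<p>intro</p><TABLE STYLE="border:0; WIDTH: 450px"><tr><td>x</td></tr></TABLE>',
}
report = scan_pages(sample_pages)
```

The point of narrowing the match to the opening tag is that a plain search for “<table” cannot tell a fluid table from a fixed one, while a pattern on the tag's attributes can.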

Making the Process More Efficient

As we continued our migration effort, scenario after scenario arose that required us to find content or style issues within the migrated content. We had written the query and traversal script to run these scans more efficiently and save us time, but the script needed to be customized for each scenario. This is when we had the idea to make the script and its encompassing tool more abstract, allowing configuration and regex-based searches. From these ideas and the simple need for a better way, we developed the Content Scanner and put it into practice.

Right away, this tool allowed us to conduct exact searches within the HTML on any page, and it considerably cut down the time needed for tasks like the fixed-width table search mentioned earlier. We now had a way to easily perform custom searches for exact content within Pages, Components, or DAM Assets that did not require hours of code changes and that allowed anyone to perform these searches with a few dialog changes. All results are displayed in HTML, with the ability to export to CSV and open the results in a spreadsheet application. The flexible reporting this tool gave us allowed us to speed up migration and development efforts.
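The general shape of a configurable, regex-driven scan with CSV output can be sketched in a few lines. This is not the Content Scanner's code, which runs inside AEM; it is a hedged Python illustration where the content is a simple dictionary of path to text, and the names scan and to_csv, the column headers, and the sample paths and URLs are all hypothetical.

```python
import csv
import io
import re


def scan(content, pattern, flags=re.IGNORECASE):
    """Return (path, match_count, first_match) rows for items matching pattern."""
    regex = re.compile(pattern, flags)
    rows = []
    for path, text in sorted(content.items()):
        matches = [m.group(0) for m in regex.finditer(text)]
        if matches:
            rows.append((path, len(matches), matches[0]))
    return rows


def to_csv(rows):
    """Serialize scan rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["path", "matches", "first_match"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical content; only the pattern changes from one request to the next.
content = {
    "/content/a": "Visit http://old.example.com and http://old.example.com/x",
    "/content/b": "nothing to report",
}
rows = scan(content, r"http://old\.example\.com\S*")
csv_text = to_csv(rows)
```

Because the search pattern is just a parameter, a new request means changing one value rather than rewriting the script, which is the same idea behind making the tool configurable through a dialog.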

Adding the Content Scanner to the Process

Prior to the creation of the Content Scanner, each request to find corrected content and perform validation would take anywhere from 2-4 hours. Now, by making a few configuration changes in the tool and clicking run, we can find the content and perform validation in about 30-60 minutes per query.

At iCiDIGITAL, we are constantly developing unique solutions to the problems our customers face at different phases of a project. Our expertise is not just in the technical details, but in using critical thinking to solve issues and adding value to the final deliverable.



Darrin is a Software Engineer and an Adobe CQ5 Consultant with a background in Quality Assurance. Technical language proficiency includes Java, C#, JavaScript, HTML, CSS, and SQL. Additional areas of expertise include OOP, Agile Development, Android SDK, and ASP.NET. Prior to being at iCiDIGITAL, Darrin served as a Software QA Engineer.