AMPLYFI Insights

Extracting Insight from Unstructured Data


Introduction

Life is complex, and since the advent of the digital age its complexity has evolved in areas previously hard to imagine. Few would argue with this, so it should be no surprise that the digital footprints and data produced in this new world do not conform to traditional tabulated data categories that are easily assimilated by humans and processed by traditional computational methods.

In a corporate environment, non-uniform data, known as unstructured data, can take the form of images, videos and, most commonly, text-rich documents covering internet pages, earnings reports, company emails, instant messages, PDFs, internal reports, CRM data, patents, academic journals and social media feeds. As corporations have started to wake up to this, the most frequent question we are asked here at AMPLYFI is how companies can create value from the chaos that unstructured data brings and better inform strategic decisions. In this white paper we discuss the common pitfalls of trying to monetise unstructured data and how best to avoid them.

Before diving into the challenges caused by unstructured data, it is useful to first get a grounding in Big Data, appreciate just how much data is available, and understand how quickly the creation of data is accelerating. Perhaps the most trusted and widely referenced source that has attempted to approximate this is the International Data Corporation (IDC), which has periodically revisited the question. In 2018, a study by the IDC estimated that the global datasphere (all data in existence) would surpass 50 zettabytes in 2020 [1]. For reference, a 2012 study by the same authors [2] suggested that in 2010 the datasphere contained around 0.12 zettabytes, meaning that over the last 10 years the global datasphere has doubled almost nine times. For most people, their appreciation of data sizes stretches to megabytes (MB) and gigabytes (GB), as these are the units most commonly used to describe images, movies and mobile data plans. This leads to the question: how big is a zettabyte? One trillion (10¹²) gigabytes make a single zettabyte. Assuming that streaming a high-definition movie uses around 2 gigabytes of data per hour, a single zettabyte is equivalent to 57 million years of non-stop streaming. Out of this unimaginably large pool of data, academic literature reports that up to 95% is stored in unstructured and non-uniform formats [3] and, in the corporate environment, only 2% of this data is ever touched again after being saved for the first time [4]. This highlights the wastefulness of large organisations when it comes to turning their data into a resource. Until recently, organisations could excuse themselves by citing the complexity and abundance of this material, but with modern analytical techniques this is no longer a viable excuse.
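To make the streaming comparison concrete, here is a quick back-of-the-envelope check in Python (a minimal sketch using only the figures quoted above):

    # Back-of-the-envelope check of the figures quoted above.
    import math

    ZETTABYTE_IN_GB = 1e12        # 1 ZB = one trillion gigabytes
    GB_PER_HOUR_HD = 2            # assumed HD streaming rate quoted in the text
    HOURS_PER_YEAR = 24 * 365.25

    hours = ZETTABYTE_IN_GB / GB_PER_HOUR_HD   # 5e11 hours of streaming
    years = hours / HOURS_PER_YEAR             # ~57 million years
    doublings = math.log2(50 / 0.12)           # 0.12 ZB (2010) to 50 ZB (2020)

    print(f"One zettabyte is roughly {years / 1e6:.0f} million years of non-stop HD streaming")
    print(f"The datasphere has doubled roughly {doublings:.1f} times in a decade")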

Using traditional methods, unstructured data can be overwhelming and hard to compute. For example, alongside the estimate that the global datasphere will exceed 50 zettabytes, there is another widely reported metric that 2.5 quintillion bytes of information are created every single day [5]. As the second measure is not reported in the same format as the first (all-time data creation in zettabytes vs. bytes created per day), they cannot be reconciled without sophisticated Natural Language Processing (NLP) and number-recognition techniques. A quick calculation (using my human comprehension of numbers and written language) shows that the two estimates do not actually support each other [6]. Another problem with this second statistic surrounds its validity. Efficient searching of the internet with AMPLYFI’s DeepResearch engine shows that, as far back as 2011, IBM was approximating that 2.5 quintillion bytes were being created every day [7, 8, 9]. Assuming the IDC is correct and global data usage is increasing rapidly (which anecdotally seems correct), it is unlikely that the figure of 2.5 quintillion bytes of daily data production from 2011 remained true in 2019. One possible explanation is that a steady stream of re-publishers, each referencing someone else, have kept the statistic looking current rather than almost a decade old. Recent high-profile users of this dated statistic include Forbes online in 2018 [10] and a Microsoft white paper in 2019 [11]. As the original IBM material does not provide the rationale behind the approximation, it becomes very hard to review the methodology and independently assess its validity.
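For transparency, the reconciliation described in note [6] can be written out explicitly. The sketch below simply reproduces that arithmetic, assuming the same conservative 20% annual growth rate used in the note:

    # Reproduces the rough reconciliation in note [6]: if 2.5 quintillion
    # (2.5e18) bytes were created per day in 2019 and daily production had
    # grown by ~20% per year, how much data was created from 2000 to 2019?
    BYTES_PER_ZETTABYTE = 1e21

    daily_bytes_2019 = 2.5e18
    yearly_zb_2019 = daily_bytes_2019 * 365 / BYTES_PER_ZETTABYTE   # ~0.91 ZB per year

    growth = 1.20   # assumed annual growth rate from note [6]
    total_zb = sum(yearly_zb_2019 / growth ** (2019 - year) for year in range(2000, 2020))

    print(f"{yearly_zb_2019:.2f} ZB produced in 2019 alone")
    print(f"{total_zb:.1f} ZB produced 2000-2019, far short of the ~50 ZB datasphere estimate")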

Perhaps it would be more accurate to simply say that nobody actually knows how much data we are producing (including global industry leaders), but to agree that there is an awful lot and that the pile is growing at an ever-increasing rate. Traditional curation and analytic tools have fallen behind the rising tide of unstructured data, leaving organisations needing to leverage a new generation of AI-led business intelligence solutions if they want to keep up.

Data Complexity and Abundance

Figure 1 shows the continuum of data from simple (structured) to complex (unstructured) and describes an overriding feature that we observe at AMPLYFI: as the complexity of data increases, so does the abundance at which it is found. Structured data sets, such as tabulated company stock price histories, are simple and confined to a finite universe of publicly listed companies. This can be contrasted with the entire universe of highly complex, text-heavy academic journal articles and patents, which is constantly expanding. The volume and complexity of the data making up stock price histories is far less than that held in these sources.

The challenge many organisations face is being able to convert complex, but abundant, data sources into a viable format capable of informing decision making and creating value. At AMPLYFI we have made this area of complex, unstructured data our home, and leverage state of the art machine learning capabilities to extract value from data that is hard to reach and hard to analyse.

In order to give real-world examples of how to approach unstructured data, this paper requires a large and complex source of data to analyse. The largest and most complex repository of unstructured data is the open source content held across the internet, so for the remainder of this paper we will use the entire internet as our example data source. Using this data, we will address three areas that commonly make it difficult to leverage value from large unstructured data sets: how to search them effectively; how to quantify and contrast unstructured results; and how to stay up to date with newly generated material.

Searching

Most internet users have a favourite way to search for data and will use this method repeatedly; regardless of the type of search they are undertaking or the type of result they are hoping to return. Research carried out at the University of Cambridge identifies four principal types of search and highlights how each can be used to foster innovation within an organisation [12]. Open-ended searches of external data sets (such as the internet) hold the most potential for innovation but are shown to be the most challenging to undertake. This search type is akin to searching for Donald Rumsfeld’s “unknown-unknowns” [13]; it is hard to find unstructured information when the data isn’t yours, isn’t neatly indexed, and you don’t actually know what you should be looking for.

Part of the challenge for organisations is that their employees are not using the right tools for in-depth internet research. Traditional freemium search providers such as Google or Bing work very well for returning familiar websites and papers or answering very specific questions, but fall down when the user doesn’t quite know what they should be looking for or is interested in understanding market trends or key players. These search engines rely on directed and specific search queries that make genuine research on unfamiliar topics problematic. Traditional search engines are also severely limited in the proportion of the internet they are able to access. The limited part of the internet they cover is known as the surface web and is traversed by basic web crawlers that are capable of clicking and following links. The rest of the internet is known as the deep web, which, in 2001, was estimated to be around 500 times larger than the surface web [14]. Real internet users interact with websites in a more complex manner than simply clicking links, so they regularly enter the deep web without ever realising. The deep web can be considered a more extensive and less structured part of the ‘regular internet’ (and is distinctly different from, and should not be confused with, the dark web [15]).

If we consider searching the internet as being equivalent to any other type of unstructured data search, we are witnessing users searching a tiny fraction of their database (effectively the surface web) using tools designed to retrieve records known to be in existence (direct and specific queries of neatly indexed and curated data). The shortcomings experienced when using traditional search engines for open-ended research led AMPLYFI to design a deep web research engine, DeepResearch. This engine utilises intelligent web harvesting technology to take users beyond the surface web and into the deep web to unlock rich content that was previously undiscoverable. The user experience has been designed with open-ended research in mind: the tool employs advanced NLP techniques to deliver a novel filtering method, highlighting related topics and entities previously unknown to the user and allowing them to rapidly drill down to the most relevant content.
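To illustrate the general idea of NLP-driven filtering, the snippet below is purely an illustrative sketch using the open-source spaCy library, not a description of DeepResearch’s proprietary technology. It extracts named entities from a handful of hypothetical retrieved documents so that a researcher can drill down by organisation or topic they did not know to search for:

    # Illustrative only: surface candidate filters by extracting named entities.
    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")   # small English model, assumed installed

    documents = [
        "Canon and Nikon dominated digital camera shipments in 2010.",
        "Apple's iPhone 4 camera narrowed the gap with compact cameras.",
    ]

    entity_counts = Counter()
    for text in documents:
        for ent in nlp(text).ents:
            if ent.label_ in {"ORG", "PRODUCT", "GPE"}:   # organisations, products, places
                entity_counts[(ent.text, ent.label_)] += 1

    # Offer the most frequent entities as candidate filters for the researcher
    for (name, label), count in entity_counts.most_common(10):
        print(f"{name} ({label}): {count}")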

At an organisational level, DeepResearch generates value by enabling users to make better informed decisions driven by smarter and more efficient research. The simple user experience makes this tool ideal to roll out across an entire organisation, with the power to inform decisions at every level. For example, a key AMPLYFI client in financial services uses DeepResearch to undertake customer due diligence. The client declined to lend nearly £1m to an organisation after DeepResearch highlighted a previously unknown link between the organisation and activities that sat outside the client’s lending appetite. The connection could not be detected using a traditional internet search engine and would likely have resulted in a loss for our client.

Quantifying the Unquantifiable

Direct comparison of unstructured data across the internet is very difficult. In the opening section I used an example of only two datapoints to show how challenging this can be. However, what happens when a user wants to compare data across a corpus of potentially millions of unstructured documents? Or if the subject of analysis is not numerical? For this we cannot rely on human interpretation and need to develop novel computational methods to analyse and compare this varied source of data. For instance, if an organisation wanted to use the internet to analyse the electric vehicle market, there would be potentially millions of relevant documents and articles. Each of these documents would contain many related technologies, topics, people, companies and locations. Before it could derive any value from these mentions, the organisation would need to be able to quantify the importance of any given entity. A rudimentary and worryingly common way to do this would be to look at the word count of a given phrase. Given a large corpus of documents relating to electric vehicles, it wouldn’t be too difficult to come up with a comparison of the number of times BMW or Volkswagen were mentioned. However, this comparison would be relatively shallow, as the user has no concept of the context of these mentions, the sentiment, the number of duplicate documents, or even when these mentions occurred. Any large-scale analysis of this type also falls into a familiar trap: the user has to know what search phrases they are looking for in advance, making the analysis limited and biased by preconceived notions.
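To make the shortcoming concrete, the snippet below performs exactly this kind of naive phrase counting on a toy, hypothetical corpus; every limitation listed above still applies to its output:

    # Deliberately naive: raw phrase counts with no context, sentiment,
    # deduplication or dates attached to the mentions (hypothetical corpus).
    corpus = [
        "BMW unveiled its i3 electric vehicle at the motor show.",
        "Volkswagen plans an electric version of the Golf.",
        "BMW and Volkswagen both face new competition in electric vehicles.",
    ]

    for company in ("BMW", "Volkswagen"):
        mentions = sum(doc.lower().count(company.lower()) for doc in corpus)
        print(f"{company}: {mentions} mentions")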

A number of things need to fall into place before a meaningful comparison can take place across a large volume of unstructured data, but understanding the rate at which a data set scales is paramount. If the rate of data set growth is non-uniform (as with the internet), this growth must be understood before drawing direct comparison across different points in time.

Figure 3: Relationship of time and volume of data

Figure 3 represents the size of a data source with an increasing rate of expansion. Data extracted from t0 is not equivalent to data taken from t1 and should be treated accordingly. In word count analysis, a user may find organisation A is mentioned once at t0 and four times at t1. From this data alone the user can draw the conclusion that the influence of organisation A is increasing. However, if the total amount of data gathered at t1 is ten times larger than at t0, an argument can be made to suggest that its influence has in fact dropped. Normalisation with respect to time must occur before comparisons can be drawn and, once this understanding has been gained, it can also be used to forecast the likely future behaviour of the data source. An understanding of chronological context also allows the introduction of a standardised metric that enables comparison across the entire data set. At AMPLYFI we have developed the technology needed to understand chronological context and use NLP and ground-breaking machine learning algorithms to turn qualitative text-based unstructured data into quantifiable and actionable insight. 
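Using the organisation A example above (with hypothetical corpus sizes), the normalisation step looks like this:

    # Organisation A example from the text, with hypothetical corpus sizes.
    periods = {
        "t0": {"mentions": 1, "total_docs": 1_000},
        "t1": {"mentions": 4, "total_docs": 10_000},   # corpus ten times larger at t1
    }

    for name, p in periods.items():
        rate = 1_000 * p["mentions"] / p["total_docs"]
        print(f"{name}: {p['mentions']} raw mentions, {rate:.1f} per 1,000 documents")

    # Raw counts rise from 1 to 4, but the normalised rate falls from 1.0 to 0.4
    # mentions per 1,000 documents, supporting the opposite conclusion.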

DataVoyant is a business intelligence platform designed and developed by AMPLYFI that is able to find a large number of text-based documents across the world’s largest data set (the surface and deep web), extract all entities (people, places, organisations, technologies and topics), quantify the importance of each entity and calculate the strength of inter-entity connections. This is achieved using proprietary machine learning technology that relies on advanced NLP techniques to recognise and classify entities through unsupervised learning, without pre-defined user input or a pre-existing list of known entities. Eliminating user interference in this way releases the analysis from the bias imparted by pre-determined concepts.
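As a simple illustration of one standard way to score inter-entity connections (this is not a description of DataVoyant’s proprietary algorithms), document-level co-occurrence can be counted and normalised against how often each entity appears on its own:

    from collections import Counter
    from itertools import combinations

    # Entities per document are assumed to have been extracted already (hypothetical).
    docs_entities = [
        {"digital camera", "Apple", "CMOS image sensor"},
        {"digital camera", "Nikon"},
        {"digital camera", "Apple", "touch screen"},
    ]

    entity_counts = Counter()
    pair_counts = Counter()
    for ents in docs_entities:
        entity_counts.update(ents)
        pair_counts.update(frozenset(pair) for pair in combinations(sorted(ents), 2))

    # Simple strength score: co-occurrences divided by the rarer entity's total count
    for pair, co in pair_counts.most_common():
        a, b = sorted(pair)
        strength = co / min(entity_counts[a], entity_counts[b])
        print(f"{a} <-> {b}: strength {strength:.2f}")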

To give an example of how DataVoyant can support early stage recognition of major trends and possible disruptions, we can use the digital camera market.

To set the context, a recent CIPA (Camera & Imaging Products Association) report shows that digital camera sales peaked in 2010, before declining by more than 80% by 2018 [16]. The year in which the decline accelerated was 2013, when sales dropped by 36%. One of the most frequently cited reasons for this decline is the disruption caused by smartphone cameras which, at the beginning of their introduction, were underestimated by established camera manufacturers. As a retrospective, this analysis is insightful. However, we want to show how, by using DataVoyant, digital camera manufacturers could have correctly interpreted the importance of the smartphone between 2009 and 2013.

Using a sample of internet documents related to digital cameras [17], DataVoyant applied its NLP and machine learning techniques to search for trends, disruptions and links between entities and topics. If camera manufacturers had had access to DataVoyant at the time (between 2009 and 2013), they would have been able to track the following trends as they emerged on a monthly basis.

At a high level, DataVoyant analysis would have shown a 15% increase in the significance of the digital camera between 2009 and 2013. Whilst this would seem like positive news for camera manufacturers, a closer look at the data would have suggested otherwise. In the year 2013, the organisation most closely connected to the increased prominence of the digital camera was not the likes of Canon or Nikon, but Apple, the smartphone manufacturer (Figure 4). If you follow this track back to 2009, Apple is already in the top 10, and steps into the top 5 for the first time in 2010, three years before digital camera sales see their steepest decline. DataVoyant also picks out the upcoming improvements in smartphone camera quality by signalling and quantifying the emerging technologies that would underpin advances in appeal and ultimately lead to widespread disruption. Increased battery life, the CMOS Image Sensor and touch screens are some of the nascent, converging enabling technologies highlighted by DataVoyant.

Figure 4: Company league table positions for Apple and Nikon for the term ‘digital camera’. Apple becomes the company with the strongest link to ‘digital camera’ in 2012.

This small selection of results alone would have clearly shown the severity and rising likelihood of the disruption smartphones were about to inflict. It would have highlighted that even though digital photography was flourishing, changing user habits and technological trends were beginning to favour smartphones.

Staying up to Date

Once an organisation can search, compare and analyse unstructured data effectively, one daunting task still remains: keeping on top of new data as it becomes available. Traditional approaches to staying informed vary; at an individual level it could take the shape of skimming RSS feeds, while at an organisational level it takes the form of a pool of analysts, usually based in a low-cost location. In many cases this is not the best approach, as the supposed cost advantage of maintaining services offshore is eroded by large teams, hidden costs, groupthink and decreased operational efficiency [18]. Any such team could be tasked with finding competitor announcements, tracking patent applications, reading business intelligence and academic reports, studying earnings calls, understanding government policy and compiling numerical statistics. Given the growing pace at which new content is generated, it is not practical to maintain the ever-growing pool of analysts needed to keep afloat in this environment. A more powerful approach is to augment researchers with AI-driven tools to increase productivity; many of the essential tasks undertaken by business intelligence analysts can be streamlined by at least partially automating the scanning and filtering of unstructured data streams.
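As a minimal, hypothetical sketch of this kind of augmentation (not a description of any AMPLYFI product), an incoming feed can be pre-filtered against a weighted watchlist so that only items above a relevance threshold reach the analyst:

    # Hypothetical watchlist terms and weights chosen for illustration only.
    WATCHLIST = {"patent": 2.0, "earnings": 1.5, "acquisition": 2.5, "regulation": 1.0}
    THRESHOLD = 2.0

    def relevance(text: str) -> float:
        """Sum the weights of watchlist terms that appear in the item."""
        lowered = text.lower()
        return sum(weight for term, weight in WATCHLIST.items() if term in lowered)

    feed = [
        "Competitor files patent for solid-state battery cell",
        "Local weather forecast for the weekend",
        "New rules proposed on earnings disclosure after acquisition scrutiny",
    ]

    for item in feed:
        score = relevance(item)
        if score >= THRESHOLD:
            print(f"[surface to analyst] ({score:.1f}) {item}")

In practice the scoring would be far richer (entities, sentiment, deduplication), but even a crude filter of this shape removes the bulk of irrelevant material before a human ever sees it.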

At AMPLYFI our approach to data and intelligence gathering is not one-size-fits-all, and we recognise that even within an institution individual groups may have different requirements. Figure 5 highlights the relationship between the population of an organisation and the depth of knowledge they require about a given subject. A topical example could be a pharmaceutical company with UK operations preparing for Brexit. There is a large general population of employees that needs to be kept aware of high-level risks directly affecting the overall business and how these risks are evolving. There is a smaller, intermediate population that needs to know about the strategic importance of shifting policy and how this may affect the business, key competitors and partners; in most cases this population includes C-suite personnel and decision makers. A smaller cross-section of an organisation will need to be cognisant of even the smallest policy change and meticulously plan for numerous scenarios. These specialist analysts (e.g. supply chain analysts) are also likely to be data-driven and may require up-to-date trade statistics and modelling capabilities at their fingertips.

AMPLYFI has developed a dashboard-based offering called AMPLYFI Stream which can deliver bespoke information and data monitoring across an entire organisation. This platform leverages AI technologies alongside surface and deep web harvesting to monitor key metrics, filter relevant content and surface key insight in real time. The sensitivity of each dashboard can be modified at a user level to deliver a personalised platform tailored to the needs of each group or user.

The Future of Unstructured Data

As machine learning and NLP technology continue to advance, trust in unstructured data analysis will increase and, as it moves into the mainstream, behaviours will change to allow a holistic view of data and how best to use it. Combining structured and unstructured data generates unique analysis that carries context and comprehensive understanding while being underpinned by hard facts. If we again consider all data as part of a continuum from simple (structured) to complex (unstructured), there is no reason why all of this data can’t be packaged into a single complementary offering.
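As a simple, hypothetical illustration of what that packaging might look like in practice, an entity-mention trend extracted from unstructured text can be joined to a structured time series so both can be read side by side (all figures below are illustrative):

    import pandas as pd

    # Hypothetical mention trend extracted from unstructured text
    mentions = pd.DataFrame({
        "year": [2009, 2010, 2011, 2012, 2013],
        "smartphone_camera_mentions_per_1k_docs": [3.1, 4.8, 6.9, 9.4, 12.2],
    })

    # Hypothetical structured series (e.g. unit shipments in millions)
    shipments = pd.DataFrame({
        "year": [2009, 2010, 2011, 2012, 2013],
        "digital_camera_shipments_m": [105, 121, 115, 98, 63],
    })

    combined = mentions.merge(shipments, on="year")
    print(combined)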

Figure: Data Complexity and Abundance

With expertise in extracting insight from data sources, AMPLYFI is ideally situated to augment unstructured data with structured data sets. At a time when incumbent data providers are struggling to move up this data complexity landscape, AMPLYFI has already solved many of the issues associated with handling unstructured data. It is much easier to move downhill than uphill, and this is how AMPLYFI’s journey contrasts with that of established information and data analytics providers.

Conclusion

Since the invention of computing machines, organisations have been using these devices to analyse structured data sets with ever-increasing complexity, speed and accuracy. In financial services, this journey has culminated in the development of highly complex quant strategies able to make thousands of calculations per second. The technology used to exploit standard data sets is maturing; access to in-depth analysis of structured data is universally available, offering minimal differentiation between organisations. Mastery of unstructured data has been more elusive, not due to a lack of vision (the link between business intelligence and computational analysis of complex data is documented as far back as 1958 [19]), but due to the difficulty associated with the task. It is only over the past few years that technology has advanced to the level where comprehension of large-scale unstructured data sets has become possible. Modern machine learning technologies have turned the realm of the unstructured into the new battleground for business intelligence differentiation. AMPLYFI has several platforms, including DeepResearch, DataVoyant and Stream, that enable organisations to quickly and effectively open the door to this untapped resource. As behaviours towards unstructured data evolve, the complementary nature of structured and unstructured data is gradually being realised. The next generation of business intelligence tools in the AMPLYFI pipeline will innovatively augment unstructured data with known data sets to allow unrivalled levels of insight.

References

1 – Data Age 2025: The Digitization of the World from Edge to Core; D. Reinsel, J. Gantz, J. Rydning; IDC; November 2018. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf (Accessed 28/01/2020)
2 – The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East; J. Gantz, D. Reinsel; IDC; December 2012. https://www.cs.princeton.edu/courses/archive/spring13/cos598C/idc-the-digital-universe-in-2020.pdf (Original site of publication not available but a pdf download is available at the above link from Princeton University. Accessed 28/01/2020)
3 – Beyond the Hype: Big Data Concepts, Methods, and Analytics; A. Gandomi, M. Haider; International Journal of Information Management, Vol. 35, Issue 2; April 2015. http://www.sciencedirect.com/science/article/pii/S0268401214001066 (Accessed 28/01/2020)
4 – The Changing World of Insights, and Why Companies Should Act Now; C. Gray; IBM; September 2018. https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102770 (Accessed 28/01/2020)
5 – An Inquisitive Citizen Data Analyst; J. Allen; Senior Director of Product Marketing for Adobe Analytics; September 2019. https://www.cio.com/article/3440107/an-inquisitive-citizen-data-analyst.html (Accessed 28/01/2020)
6 – 2.5 quintillion bytes per day would equate to 0.9125 zettabytes being produced per year. It is generally accepted that the amount of data produced is growing rapidly, so if 0.9125 zettabytes were produced in 2019 (the date of reference 5) and this number has been growing by a conservative 20% per year, the world would only have created 5.3 zettabytes from the year 2000-2019
7 – Bringing big data to the Enterprise: What is big data?; IBM; http://www-01.ibm.com/software/data/bigdata/ (This website is no longer available, however, an internet archiving tool has a site capture from May 2011. https://web.archive.org/web/20110520084338/http://www-01.ibm.com/software/data/bigdata/. Accessed 28/01/2020)
8 – IBM Study: Digital Era Transforming CMO’s Agenda, Revealing Gap in Readiness; IBM; October 2011. https://www-03.ibm.com/press/us/en/pressrelease/35633.wss (Accessed 28/01/2020)
9 – Customer Analytics Pay Off, Driving Top-Line Growth by Bringing Science to the Art of Marketing; M. Teerlink, M. Haydock; IBM Institute for Business Value; September 2011. http://www.sift-ag.com/public_html/download/spss/doc/Customer%20analytics%20pays%20off.PDF (Original site of publication no longer available, however a PDF is available at the above URL. Accessed 28/01/2020)
10 – How much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read; B. Marr; May 2018. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#44edbd0660ba (Accessed 28/01/2020)
11 – Extract Actionable Insights From All Your Content; K. D’Orazio, L.Cabrera-Cordon, D. Legenzoff, C. Noteboom, T. Woolford; Microsoft; September 2019. https://azure.microsoft.com/en-au/resources/extract-actionable-insights-from-all-your-content/ (Accessed 18/12/2019)
12 – Technology Intelligence Systems: How Companies Keep Track of the Latest Technological Developments…; Institute for Manufacturing University of Cambridge; IfM Briefing, Vol. 1 No. 2; 2007. https://www.ifm.eng.cam.ac.uk/uploads/Resources/Briefings/v1n2_ifm_briefing.pdf (Accessed 28/01/2019)
13 – US Department of Defense News Briefing; 12th February 2002.  https://archive.defense.gov/Transcripts/Transcript.aspx?TranscriptID=2636 (Secretary of Defense Donald Rumsfeld famously answers press questions by explaining the concept of known-knowns and unknown-unknowns. Accessed 28/01/2020)
14 – White Paper: The Deep Web: Surfacing Hidden Value; BrightPlanet; The Journal of Electronic Publishing, Vol. 7, Issue 1; August 2001. https://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104 (Accessed 28/01/2020; remarkably, there has been no credible analysis of the deep vs. surface web since 2001)
15 – How to Safely Access the Deep and Dark Webs; S. Symanovich; Norton (Symantec). https://us.norton.com/internetsecurity-how-to-how-can-i-access-the-deep-web.html (Accessed 28/01/2020)
16 – Shipments of Cameras and Interchangeable Lenses; Camera and Imaging Products Association; 2019. http://www.cipa.jp/stats/documents/common/cr200.pdf (Accessed 31/01/2020)
17 – A sample of approximately 18,000 documents including patents, academic journals, news, blogs and general websites published between 2009 and 2019. All documents directly discuss digital cameras in some capacity.
18 – Invisible Costs in Offshoring Services Work; A. Stringfellow, M. B. Teagarden, W. Nie; Journal of Operations Management, Vol. 26 No. 2; March 2008. https://onlinelibrary.wiley.com/doi/abs/10.1016/j.jom.2007.02.009 (Accessed 30/01/2020)
19 – A Business Intelligence System; H. P. Luhn; IBM Journal; October 1958. https://www.semanticscholar.org/paper/A-Business-Intelligence-System-Luhn/a9c2cbdd49df560aaf1eecf5138aba84ace1bc0b (Accessed 28/01/2020)
