Merge pull request #7784 from eugene7646/kws_documentation_update_8487

KWS documentation update (8487)
2025-07-06 21:00:22 +00:00 · 2023-06-01 23:44:55 -04:00 · 2023-06-01 23:44:55 -04:00 · c1275c1c75
commit c1275c1c75
parent bb36ec447a 67e9deec1f
8 changed files with 39 additions and 26 deletions
--- a/docs/doxygen-user/adHocKeywordSearch.dox
+++ b/docs/doxygen-user/adHocKeywordSearch.dox
@@ -10,6 +10,20 @@ The ad hoc keyword search features allows you to run single keyword terms or lis

 The \ref keyword_search_page must be selected during ingest before doing an ad hoc keyword search. If you don't want to search for any of the existing keyword lists, you can deselect everything to just index the files for later searching.

+\subsection adhoc_limitations Limitations of Ad Hoc Keyword Search
+
+With the release of Autopsy 4.21.0, two types of keyword searching are supported: Solr search with full text indexing and the built-in Autopsy "In-Line" Keyword Search.
+
+Enabling full text indexing with Solr during the ingest process allows for comprehensive ad-hoc manual text searching, encompassing all of the extracted text from files and artifacts.
+
+On the other hand, the In-Line Keyword Search conducts the search during ingest, specifically at the time of text extraction. It only indexes small sections of the files that contain keyword matches (for display purposes). Consequently, unless full text indexing with Solr is enabled, the ad-hoc search will be restricted to these limited sections of the files that had keyword hits. This limitation significantly reduces the amount of searchable text available for ad-hoc searches.
+
+Other situations in which the ad-hoc search cannot cover all of the text extracted from all of the files and artifacts include:
+<ul>
+<li>If file filtering was used during ingest, resulting in only a subset of files getting ingested. See \ref file_filters for information on file filtering.
+<li>If the Autopsy case contains multiple data sources and at least one of them was not indexed during its ingest.
+</ul>
+
 \section ad_hoc_kw_types_section Creating Keywords

 The following sections will give a description of each keyword type, then will show some sample text and how various search terms would work against it. 
@@ -36,7 +50,7 @@ Substring match should be used where the search term is just part of a word, or
 - "UMP", "oX" will match
 - "y dog", "ish-brown" will not match

-## Regex match
+\subsection regex_match Regex match

 Regex match can be used to search for a specific pattern. Regular expressions are supported using Lucene Regex Syntax which is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-regexp-query.html#regexp-syntax. Wildcards are automatically added to the beginning and end of the regular expressions to ensure all matches are found. Additionally, the resulting hits are split on common token separator boundaries (e.g. space, newline, colon, exclamation point etc.) to make the resulting keyword hit more amenable to highlighting. As of Autopsy 4.9, regex searches are no longer case sensitive. This includes literal characters and character classes.

@@ -70,9 +84,12 @@ If you want to override this default behavior:
 ### Non-Latin text
 In general all three types of keyword searches will work as expected but the feature has not been thoroughly tested with all character sets. For example, the searches may no longer be case-insensitive. As with regex above, we suggest testing on a sample file.

+### Differences between "In-Line" and Solr regular expression search
+It's important to be aware that there might be occasional differences in the results of regular expression searches between the "In-Line" Keyword Search and Solr search. This is because the "In-Line" Keyword Search uses Java regular expressions, while Solr search employs Lucene regular expressions.
+
 \section ad_hoc_kw_search Keyword Search

-Individual keyword or regular expressions can quickly be searched using the search text box widget. You can select "Exact Match", "Substring Match" and "Regular Expression" match. See the earlier \ref ad_hoc_kw_types_section section for information on each keyword type.  The search can be restricted to only certain data sources by selecting the checkbox near the bottom and then highlighting the data sources to search within. Multiple data sources can be selected used shift+left click or control+left click. The "Save search results" checkbox determines whether the search results will be saved to the case database. 
+Individual keywords or regular expressions can quickly be searched using the search text box widget. You can select "Exact Match", "Substring Match" and "Regular Expression" match. See the earlier \ref ad_hoc_kw_types_section section for information on each keyword type, as well as \ref adhoc_limitations.  The search can be restricted to only certain data sources by selecting the checkbox near the bottom and then highlighting the data sources to search within. Multiple data sources can be selected using shift+left click or control+left click. The "Save search results" checkbox determines whether the search results will be saved to the case database. 

 \image html keyword-search-bar.PNG

@@ -92,14 +109,5 @@ If the "Save search results" checkbox was enabled, the results of the keyword li

 \image html keyword-search-list-results.PNG

-\section ad_hoc_during_ingest Doing ad hoc searches during ingest
-
-Ad hoc searches are intended to be used after ingest completes, but can be used in a limited capacity while ingest is ongoing.
-
-Manual \ref ad_hoc_kw_search for individual keywords or regular expressions can be executed while ingest is ongoing, using the current index. Note however, that you may miss some results if the entire index has not yet been populated. Autopsy enables you to perform the search on an incomplete index in order to retrieve some preliminary results in real-time.
-
-During the ingest, the normal manual search using \ref ad_hoc_kw_lists behaves differently than after ingest is complete. A selected list can instead be added to the ingest process and it will be searched in the background instead.
-
-Most keyword management features are disabled during ingest. You can not edit keyword lists but can create new lists (but not add to them) and copy and export existing lists.

 */
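Reviewer note on the regular expression difference described in the adHocKeywordSearch.dox changes above: one well-known way the two engines can disagree is anchoring. Lucene regular expressions must match the whole value they are tested against, which is why the documentation says wildcards are added to the beginning and end of the pattern, while a Java regular expression search can find a match anywhere in the text. The sketch below only illustrates that idea. It uses Python's `re` module as a stand-in for both engines, it does not call Autopsy, Solr, or Lucene code, and the helper names and sample text are hypothetical.

```python
import re
from typing import List


def find_hits(pattern: str, text: str) -> List[str]:
    """Unanchored, case-insensitive search: hits may start anywhere in the text.

    Stand-in for a find()-style regular expression search.
    """
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


def matches_whole_value(pattern: str, value: str) -> bool:
    """Anchored, case-insensitive match: the pattern must cover the entire value.

    Stand-in for an engine whose patterns are always anchored.
    """
    return re.fullmatch(pattern, value, flags=re.IGNORECASE) is not None


def wrap_with_wildcards(pattern: str) -> str:
    """Add leading and trailing wildcards so an anchored engine finds inner matches."""
    return ".*(?:" + pattern + ").*"


# Hypothetical sample text, not taken from the Autopsy documentation.
sample = "Contact: brown.fox@example.com"

unanchored_hits = find_hits(r"fox@ex", sample)
anchored_plain = matches_whole_value(r"fox@ex", sample)
anchored_wrapped = matches_whole_value(wrap_with_wildcards(r"fox@ex"), sample)
```

Here the unanchored search reports the inner hit directly, the anchored match on the bare pattern fails, and the wrapped pattern succeeds. Syntax differences between the two engines (supported operators, escapes) are a separate source of mismatches that this sketch does not cover, so testing patterns on a sample file remains the safer approach.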
--- a/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG
+++ b/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG
--- a/docs/doxygen-user/images/keyword-search-hits.PNG
+++ b/docs/doxygen-user/images/keyword-search-hits.PNG
--- a/docs/doxygen-user/images/keyword-search-ingest-settings.PNG
+++ b/docs/doxygen-user/images/keyword-search-ingest-settings.PNG
--- a/docs/doxygen-user/images/keyword-search-ocr-image.png
+++ b/docs/doxygen-user/images/keyword-search-ocr-image.png
--- a/docs/doxygen-user/images/keyword-search-ocr-indexed-text.png
+++ b/docs/doxygen-user/images/keyword-search-ocr-indexed-text.png
--- a/docs/doxygen-user/images/keyword_results.PNG
+++ b/docs/doxygen-user/images/keyword_results.PNG
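Reviewer note on the ad hoc search limitations added in adHocKeywordSearch.dox: the conditions listed there (no full text indexing with Solr, file filtering during ingest, a data source that was not indexed) can be read as a simple per-data-source check. The sketch below is one possible way to model that reasoning. It is not Autopsy code, and the class, field, and function names are hypothetical labels chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSourceIngest:
    """Hypothetical record of how one data source was ingested."""
    name: str
    add_text_to_solr_index: bool    # state of the "Add text to Solr Index" checkbox
    file_filter_used: bool          # True if a file filter limited the ingest
    keyword_search_ran: bool = True  # False if the data source was never indexed


def ad_hoc_limitations(sources: List[DataSourceIngest]) -> List[str]:
    """Return one note per reason the ad hoc search may not see all extracted text."""
    notes = []
    for src in sources:
        if not src.keyword_search_ran:
            notes.append(src.name + ": not indexed during ingest")
            continue
        if not src.add_text_to_solr_index:
            notes.append(src.name + ": only sections around keyword hits are searchable")
        if src.file_filter_used:
            notes.append(src.name + ": only the filtered subset of files was ingested")
    return notes


# Hypothetical case with three data sources.
case_sources = [
    DataSourceIngest("disk1", add_text_to_solr_index=True, file_filter_used=False),
    DataSourceIngest("disk2", add_text_to_solr_index=False, file_filter_used=True),
    DataSourceIngest("usb", add_text_to_solr_index=True, file_filter_used=False,
                     keyword_search_ran=False),
]
limitations = ad_hoc_limitations(case_sources)
```

An empty result means none of the documented limitations apply, so the ad hoc search should cover all extracted text for those data sources.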
--- a/docs/doxygen-user/keyword_search.dox
+++ b/docs/doxygen-user/keyword_search.dox
@@ -5,22 +5,30 @@

 \section keyword_module_overview What Does It Do

-The Keyword Search module facilitates both the \ref ingest_page "ingest" portion of searching and also supports manual text searching after ingest has completed (see \ref ad_hoc_keyword_search_page). It extracts text from files being ingested, selected reports generated by other modules, and results generated by other modules. This extracted text is then added to a Solr index that can then be searched.
+The Keyword Search module facilitates both the \ref ingest_page "ingest" portion of searching and also supports manual text searching after ingest has completed (see \ref ad_hoc_keyword_search_page). It extracts text from files being ingested, selected reports generated by other modules, and results generated by other modules. 

-Autopsy tries its best to extract the maximum amount of text from the files being indexed. First, the indexing will try to extract text from supported file formats, such as pure text file format, MS Office Documents, PDF files, Email, and many others. If the file is not supported by the standard text extractor, Autopsy will fall back to a string extraction algorithm. String extraction on unknown file formats or arbitrary binary files can often extract a sizeable amount of text from a file, often enough to provide additional clues to reviewers. String extraction will not extract text strings from encrypted files.
+Autopsy tries its best to extract the maximum amount of text from the files being indexed. First, it will try to extract text from supported file formats, such as pure text file format, MS Office Documents, PDF files, Email, and many others. If the file is not supported by the standard text extractor, Autopsy will fall back to a string extraction algorithm. String extraction on unknown file formats or arbitrary binary files can often extract a sizeable amount of text from a file, often enough to provide additional clues to reviewers. String extraction will not extract text strings from encrypted files.

 Autopsy ships with some built-in lists that define regular expressions and enable the user to search for Phone Numbers, IP addresses, URLs and E-mail addresses. However, enabling some of these very general lists can produce a very large number of hits, and many of them can be false-positives. Regular expressions can potentially take a long time to complete.

-Once files are placed in the Solr index, they can be searched quickly for specific keywords, regular expressions, or keyword search lists that can contain a mixture of keywords and regular expressions. Search queries can be executed automatically during the ingest run or at the end of the ingest, depending on the current settings and the time it takes to ingest the image.  
-
 Refer to \ref ad_hoc_keyword_search_page for more details on specifying regular expressions and other types of searches. 

+With the release of Autopsy 4.21.0, two types of keyword searching are supported: Solr search with full text indexing and the built-in Autopsy "In-Line" Keyword Search. For detailed information on configuring the search type, refer to \ref keyword_ingest_settings.
+
+\subsection keyword_SolrSearch Solr Search With Indexing
+
+Full text indexing with Solr provides users with the flexibility to perform ad-hoc manual text searches after the ingest process is complete (see \ref ad_hoc_keyword_search_page). However, it's important to note that full text indexing can significantly slow down the ingest speed for large data sources and cases. Once files are indexed in the Solr index, they can be quickly searched for specific keywords, regular expressions, or a combination of both in keyword search lists.
+
+\subsection keyword_InlineSearch In-Line Keyword Search
+
+The In-Line Keyword Search conducts keyword searches during the ingest process, at the time of text extraction. It only indexes small sections of the files that contain keyword matches, for display purposes. Our profiling runs indicate that, in most cases, a data source can be ingested with the In-Line Keyword Search in approximately half the time it takes to ingest and search the same data source using Solr indexing. The drawback of this method is that all search terms must be specified before the ingest begins, and there is no option to perform ad-hoc searches on the entire extracted text after the ingest process is complete.
+
 \section keyword_search_configuration_dialog Keyword Search Configuration Dialog

 The keyword search configuration dialog has three tabs, each with its own purpose:
 \li The \ref keyword_keywordListsTab is used to add, remove, and modify keyword search lists.
 \li The \ref keyword_stringExtractionTab is used to enable language scripts and extraction type.
-\li The \ref keyword_generalSettingsTab is used to configure the ingest timings and display information.
+\li The \ref keyword_generalSettingsTab is used to configure display information.

 \subsection keyword_keywordListsTab Lists tab

@@ -57,23 +65,20 @@ The user can also use the String Viewer first and try different script/language
 \subsubsection keyword_nsrl NIST NSRL Support
 The hash lookup ingest service can be configured to use the NIST NSRL hash set of known files. The keyword search advanced configuration dialog "General" tab contains an option to skip keyword indexing and search on files that have previously marked as "known" and uninteresting files. Selecting this option can greatly reduce size of the index and improve ingest performance. In most cases, user does not need to keyword search for "known" files.

-\subsubsection keyword_update_freq Result update frequency during ingest
-To control how frequently searches are executed during ingest, the user can adjust the timing setting available in the keyword search advanced configuration dialog "General" tab. Setting the number of minutes lower will result in more frequent index updates and searches being executed and the user will be able to see results more in real-time. However, more frequent updates can affect the overall performance, especially on lower-end systems, and can potentially lengthen the overall time needed for the ingest to complete.
-
-One can also choose to have no periodic searches. This will speed up the ingest. Users choosing this option can run their keyword searches once the entire keyword search index is complete.
-
 \section keyword_usage Using the Module

-Search queries can be executed manually by the user at any time, as long as there are some files already indexed and ready to be searched. Searching before indexing is complete will naturally only search indexes that are already compiled.
-
-See \ref ingest_page "Ingest" for more information on ingest in general.
-
-Once there are files in the index, \ref ad_hoc_keyword_search_page will be available for use to manually search at any time.
+After the ingest has completed, \ref ad_hoc_keyword_search_page will be available for manual search. The number of files and the amount of text available for Ad Hoc Search depend on the Keyword Search module settings at the time of the ingest. See section \ref adhoc_limitations for details. 

 \subsection keyword_ingest_settings Ingest Settings

 The Ingest Settings for the Keyword Search module allow the user to enable or disable the specific built-in search expressions, Phone Numbers, IP Addresses, Email Addresses, and URLs. Using the Advanced button (covered below), one can add custom keyword groups.

+With the release of Autopsy 4.21.0, two types of keyword searching are supported: Solr search with full text indexing and the built-in Autopsy "In-Line" Keyword Search. See \ref keyword_ingest_settings for details regarding search type configuration. See sections \ref keyword_SolrSearch and \ref keyword_InlineSearch for details of each search type.
+
+To select the keyword search type, you can use the "Add text to Solr Index" checkbox. When this checkbox is unchecked, Autopsy will perform the "In-Line" Keyword Search during ingest. However, most of the extracted text will not be indexed by Solr, effectively disabling the functionality described in \ref ad_hoc_keyword_search_page. On the other hand, if the checkbox is selected, Autopsy will perform the "In-Line" Keyword Search during ingest and also add all of the extracted text to the Solr index. This allows you to search the indexed text later using \ref ad_hoc_keyword_search_page.
+
+Additionally, it's important to be aware that there might be occasional differences in the results of regular expression searches between the "In-Line" Keyword Search and Solr search. This is because the "In-Line" Keyword Search uses Java regular expressions, while Solr search employs Lucene regular expressions. You can find more details about this in \ref regex_match.
+
 \image html keyword-search-ingest-settings.PNG

 \subsubsection keyword_ocr Optical Character Recognition