1637 Commits

Author SHA1 Message Date
Richard Cordovano
aa4474d54d Change TikaTextExtractor static init parallelStream use to stream 2017-01-11 12:31:15 -05:00
millmanorama
860f4361b3 Merge branch '2132-32k-chunks' into 2184-overlapping-chunks
# Conflicts:
#	KeywordSearch/src/org/sleuthkit/autopsy/keywordsearch/Ingester.java
2017-01-09 11:51:36 +01:00
millmanorama
a8cfbd1e10 refactor TextExtractor to an interface and to remove the intermediary getInputStream() method 2017-01-09 11:07:12 +01:00
millmanorama
7325174dc3 Merge remote-tracking branch 'upstream/develop' into 2132-32k-chunks 2017-01-09 10:47:59 +01:00
Richard Cordovano
837eb1477f Pull in text extraction refactoring and resolve merge conflicts 2017-01-08 10:22:18 -05:00
Richard Cordovano
5463d3a719 Remove kws public Server.getIngester, Ingester is not public 2017-01-07 10:30:58 -05:00
millmanorama
161ba2098c cleanup and comments in Chunker 2017-01-06 14:53:58 +01:00
millmanorama
990433fc36 refactor Chunker read methods to use a common helper method. 2017-01-06 13:16:40 +01:00
millmanorama
64ba5f6e66 Merge remote-tracking branch 'upstream/develop' into 2184-overlapping-chunks 2017-01-06 11:09:27 +01:00
millmanorama
52251bcb2e move Reader reset back to beginning of next() and increase buffer size to 2048. 2017-01-06 00:03:45 +01:00
millmanorama
5e0f9abdf9 reset at end to avoid "This stream has not been marked" error. 2017-01-04 17:22:56 +01:00
millmanorama
151742c21b record length in chars and mark/reset reader to produce overlaps 2017-01-04 17:16:20 +01:00
millmanorama
d8ec4290f2 reduce max window size to prevent off by one error 2017-01-04 17:16:19 +01:00
millmanorama
94e136b451 first pass at overlapping chunks 2017-01-04 17:16:17 +01:00
millmanorama
d14c15fbdb bump chunk size to exactly 32k, single read chars to 1024 2017-01-04 12:25:34 +01:00
millmanorama
8410970b11 Chunker implements Iterator and Iterable 2017-01-03 14:57:55 +01:00
millmanorama
15c2d395fa move Chunk and Chunker out of Ingester 2017-01-03 14:26:48 +01:00
millmanorama
d2a6fe3fda move chunking algorithm into separate class(es) and reduce chunk size to ~32k 2017-01-03 14:26:46 +01:00
Richard Cordovano
46369eff44 Update NBM versioning for 4.3.0 2017-01-02 18:45:21 -05:00
Richard Cordovano
13411450aa 4.3.0 preps: DSPs, public API restore, const name 2017-01-02 17:36:59 -05:00
millmanorama
3557f141e1 use UTF-8 encoding for ArtifactTextExtractor streams and readers 2017-01-02 16:45:51 +01:00
millmanorama
4ae0a688bc don't commit unnecessarily 2016-12-31 14:31:11 +01:00
millmanorama
8526427b4f cleanup and comment TextExtractor
cleanup and comment TextExtractor implementations more.

remove constants left over from merge
2016-12-28 17:30:42 +01:00
millmanorama
f56c2b43c8 move all 'appendix' related code into TikaTextExtractor and simplify TextExtractor interface. 2016-12-28 17:30:32 +01:00
millmanorama
8841f6e773 minor fixes 2016-12-28 17:30:30 +01:00
millmanorama
2d5cd2efc1 comment up Ingester 2016-12-28 17:30:27 +01:00
millmanorama
c94d3de872 move encoding options to StringsTextExtractor 2016-12-28 17:30:25 +01:00
millmanorama
9b85284194 remove unused outerclasses that have copies as innerclasses 2016-12-28 17:30:23 +01:00
millmanorama
c42f687bfb more cleanup
2016-12-28 17:30:15 +01:00
millmanorama
b904c37dd2 remove more unneeded ContentStreams and cleanup logging 2016-12-28 15:03:45 +01:00
millmanorama
0303c96d41 cleanup Ingester.indexChunk 2016-12-28 15:03:04 +01:00
millmanorama
abf21f58ee remove obsolete and unused ContentStreams 2016-12-28 15:03:03 +01:00
millmanorama
2b4bb33798 cleanup up ArtifactExtractor; reduce use of ContentStream 2016-12-28 15:03:01 +01:00
millmanorama
697a7d7a58 reduce method overloads for indexing artifacts 2016-12-28 15:02:59 +01:00
millmanorama
b38171dbd7 make the ByteXXXStream classes inner classes of the TextExtractors that use them. 2016-12-28 15:02:58 +01:00
millmanorama
85af7c57b6 build out ArtifactExtractor 2016-12-28 15:02:56 +01:00
millmanorama
1a70a4e8b2 introduce ArtifactExtractor 2016-12-28 15:02:39 +01:00
millmanorama
359dc16ee5 inline indexChunk 2016-12-28 15:02:23 +01:00
millmanorama
c9795cabcb pull up methods from TextExtractorBase into TextExtractor.java 2016-12-28 15:02:21 +01:00
millmanorama
0f1f8b2211 refactor common chunking algorithm into TextExtractorBase, remove AbstractFileChunk 2016-12-28 15:02:18 +01:00
Richard Cordovano
a5902d50f5 Correctly handle CancellationException in KeywordSearchResultFactory.BlackboardResultWriter 2016-12-19 17:27:42 -05:00
Eugene Livis
d1616cdeb6 Fixed a very misleading error message 2016-12-14 09:56:25 -05:00
Richard Cordovano
bb1975b9c4 Merge pull request #2428 from zhhl/2123-sortSolrResultToKeepConsistantKeywordPreview
2123: Sort the Solr results to keep KeywordSearch Preview pick up the…
2016-12-14 09:51:08 -05:00
U-BASIS\zhaohui
2711788582 2123: correction 2016-12-13 17:42:02 -05:00
U-BASIS\zhaohui
05a6fa8d37 2123: clean up 2016-12-13 17:38:22 -05:00
U-BASIS\zhaohui
8a1f272738 2123: let Solr do ascending sorting to let us have a consistent result 2016-12-13 17:33:41 -05:00
U-BASIS\zhaohui
4a0202cea9 2123: Sort the Solr results to keep KeywordSearch Preview pick up the same result each time 2016-12-11 09:56:57 -05:00
Ann Priestman
231e87187d Add dialog to allow the user to add multiple keywords at a time. 2016-12-08 09:58:31 -05:00
esaunders
a782e52f80 Removed filterOneHitPerDocument() since (a) its use prevents the display of hits across multiple pages/chunks and (b) QueryResults.writeAllHitsToBlackBoard() takes care of ensuring that only a single blackboard artifact is created per document. 2016-12-07 16:17:24 -05:00
esaunders
83f8d575e9 Add quotes around the keyword when the search results are not available to make highlighting work correctly. 2016-12-07 16:14:00 -05:00