Is there any method or software to extract same and similar sentences (segments) from a file?
Thread poster: Rajan Chopra
Rajan Chopra
India
Member (2008)
English to Hindi
Jul 14, 2018

Hi experts,

I would like to know whether there is any method or software to extract identical and similar sentences (segments) from a file.

Wordfast Pro can keep one copy of each repeated segment and remove the other copies in a newly generated file. However, I could not find any software or method that does the same for similar segments based on a chosen percentage (e.g. removing the segments that are 85% similar). I am looking for software or a procedure that removes the similar sentences.

Does such software or a method exist at all? I would appreciate it if you could guide me in this respect.

Thanks and regards,

Chopra


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
Only in very cumbersome ways Jul 14, 2018

chopra_2002 wrote:
I want to know is there any method or a software to extract same and similar sentences (segments) from a file?


I know of no automated method, but this is truly something that I would have expected CAT tools to have, i.e. to analyze a file against a TM and then export a TM that contains all matches from that TM up to a certain percentage. Some tools do have a similar feature, but it only exports the single highest match for any segment.

To do what you want, you would create a dummy TM that contains the source text as both source and target text, and then analyze that same file against that TM. You'll get all 100% matches, of course, but a CAT tool should also show you non-100% matches. Now all you need is a way to export all those matches (100% and fuzzy) to a separate TM.
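
Outside a CAT tool, the dummy-TM idea (matching a file against itself) can be approximated with a short script. The following is only a minimal sketch, assuming one segment per list entry and using the ratio from Python's `difflib.SequenceMatcher` as a stand-in for a fuzzy-match percentage. CAT tools calculate their percentages differently, and the function name `self_matches` and the sample segments are made up for illustration.

```python
from difflib import SequenceMatcher


def self_matches(segments, threshold=0.85):
    """For each segment, list the other segments whose similarity is >= threshold.

    The score is difflib's ratio (0.0 to 1.0), NOT a CAT tool's match percentage.
    """
    results = {}
    for i, seg in enumerate(segments):
        hits = []
        for j, other in enumerate(segments):
            if i == j:
                continue  # skip the segment itself (the trivial 100% match)
            score = SequenceMatcher(None, seg, other).ratio()
            if score >= threshold:
                hits.append((other, round(score, 2)))
        if hits:
            results[seg] = hits
    return results


# Hypothetical sample segments
segments = [
    "Press the red button to start the machine.",
    "Press the green button to start the machine.",
    "Store the device in a dry place.",
]

for seg, hits in self_matches(segments).items():
    print(seg, "->", hits)
```

Because every segment is compared with every other one, this gets slow on large files; it is meant to show the idea, not to replace a TM engine.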

FWIW... from a theoretical point of view:

I have a way to do it in WFP 3, but it is cumbersome and time consuming. I use an AutoIt script that allows me to visit each segment and then automatically copy a certain number of matches from the match pane. I use it to extract segments from a TM server but you can use it to extract segments from a dummy TM created from the current file. I offer no support: wftmserver_extract.zip.

I've just discovered that doing something like this would be quite simple in OmegaT (again using a script, unfortunately, which I don't have time to write right now). You can customise the way fuzzy matches are displayed in the match pane, and if you put the cursor in the fuzzy match pane, you can copy all text from it (i.e. all matches at once). When you press Ctrl+U (go to next segment), the cursor remains in the fuzzy match pane, so you can easily copy the next set of matches without having to worry about automating the clicking between panes.

[Screenshot: TM matching options]

I think I recall that an earlier version of WFC had the ability to create something that was then known as a "project TM", which was something like what you're looking for (except that the extraction would be in TM format), but I could not find any mention of it now.


[Edited at 2018-07-14 08:41 GMT]


 
Philippe Etienne
Spain
Member
English to French
Back to the past Jul 14, 2018

chopra_2002 wrote:
...But I could not find a software or a method to extract the similar segments by entering the desired percentage (e.g. removing the segments which are 85% similar).

Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files.
It could also generate a file with all the segments below a user-defined ceiling (like anything below an 85% match): the "unknown segments". But I understand you want xx% matches only, so it would be what's "left out".

MemoQ 2013 R2
Using views, it can extract all repetitions in a set of files.
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files, then create a view containing only the lines pretranslated up to that specific match value (or its complement), then export the view to a bilingual Word file/memoQ file/bilingual old-Trados doc. file.

Surely all modern CAT Tools also have a workaround for this.

Philippe

[Edited at 2018-07-14 12:50 GMT]


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
@Philippe (and myself) Jul 14, 2018

Philippe Etienne wrote:
Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files. ... It could also generate a file with all the segments below a user-defined ceiling (like anything below a 85% match): the "unknown segments". But I understand you want xx%matches only, so it would be what's "left out".


In addition, this feature works only if you have an existing TM, and it extracts only segments that do not match that TM above the threshold. This means that if you try this with an empty TM, it will extract all segments, and if you try this with an exact-match TM, it will extract no segments.
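
The two edge cases described here can be illustrated with a small sketch. It assumes that "unknown segments" are those whose best match against the TM falls below the threshold, and it uses `difflib` as a stand-in for a real fuzzy-match score. `best_match` and `unknown_segments` are hypothetical helpers, not Trados functions.

```python
from difflib import SequenceMatcher


def best_match(segment, tm_sources):
    """Return the highest difflib ratio between segment and any TM source (0.0 if the TM is empty)."""
    return max(
        (SequenceMatcher(None, segment, src).ratio() for src in tm_sources),
        default=0.0,
    )


def unknown_segments(segments, tm_sources, threshold=0.85):
    """Keep only the segments whose best TM match is below the threshold."""
    return [s for s in segments if best_match(s, tm_sources) < threshold]


# Hypothetical sample segments
segments = ["Open the file.", "Close the file.", "Save your work often."]

# Empty TM: nothing matches, so every segment counts as "unknown".
print(unknown_segments(segments, []))

# Exact-match TM (the file's own segments): every segment has a full match,
# so no segment is extracted.
print(unknown_segments(segments, segments))
```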

I just checked, and Trados pre-2009 does have the "Create Project TM" feature that I had remembered, but it is/was only available in the professional (i.e. agency) version, and I can't tell from the user guide whether this project TM would have had only one match per segment (in which case it would be useless for the OP's purposes) or multiple matches.

MemoQ 2013 R2
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files...


Yes, but again this only works if you're analysing your file against a TM, and I suspect the same will apply: if the TM is empty, all segments will be extracted; if the TM is an exact-match TM, no segments will be extracted.

According to this help page, MemoQ 2014 could also create a "project TM" in the same way as Trados pre-2009, but in the case of MemoQ it claims that the TM would contain all matches, not just the highest matches, which means that creating a project TM against an exact-match TM would result in something similar to what the OP is looking for. This feature is also available in MemoQ 2015 (which I have). Unfortunately the exported TM does not contain any indication of what the match percentages were, so the resulting TM will contain all segments and there would be no way to find and delete segments that fell below a certain threshold.

OmegaT
I just found that OmegaT does have a "Create Project TM" feature that exports all segments above a certain threshold (not just the highest match), although this doesn't help the OP because it also exports exact matches and it does not indicate the match percentage in the export TM.
https://gist.github.com/yu-tang/6526991



[Edited at 2018-07-15 16:24 GMT]


 
Olaf Schutze (X)
Vietnam
English to German
How will you define your percentage "matches"? Feb 12, 2019

Almost every document contains spaces, symbols, formatting characters, typos and so on. You will have to assign a value to each used/possible character and then do the maths.
CAT tools (e.g. MemoQ) do all this to produce their statistics, and those are roughly the best results you can get automatically. MemoQ even uses two TMs when creating statistics, as the results can differ considerably.
The differences can vary again for certain language pairs: is fortnight = fourteen (days) = 14 (days)?
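
One way to pin down what a "percentage match" means is to normalise each segment before comparing it. The sketch below is only an illustration under stated assumptions: it lowercases the text, strips punctuation and collapses whitespace, then scores with `difflib`. The helper names `normalise` and `similarity` and the sample strings are hypothetical, and real CAT tools apply their own (different) rules.

```python
import re
from difflib import SequenceMatcher


def normalise(segment):
    """Lowercase, drop punctuation/symbols, and collapse runs of whitespace."""
    text = segment.lower()
    # Remove every character that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def similarity(a, b, normalised=True):
    """difflib ratio between two segments, optionally after normalisation."""
    if normalised:
        a, b = normalise(a), normalise(b)
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample strings that differ only in case, spacing and punctuation
a = "Delivery takes 14 days."
b = "delivery  takes 14 days !"

print(similarity(a, b, normalised=False))  # score on the raw strings
print(similarity(a, b))                    # score after normalisation
```

Character-level normalisation of this kind does nothing for the fortnight / fourteen days / 14 days case, which is a difference in meaning rather than in characters.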

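Here is some Linux bash code showing a slow, rough way to approach it:
1) Copy all the textual content of your document.
2) Paste it into a plain text file and save the file as "in.txt".
3) Open a terminal in the working directory and paste the following few lines:

# For every distinct word (most frequent first), collect each sentence prefix that ends with that word
for x in $(tr -s ' ' '\n' < in.txt | sort | uniq -c | sort -rn | awk '{ print $2 }'); do
  grep -oh "[^.]*\s${x}\s" in.txt >> out.tmp
done
# Sort the collected prefixes so that growing variants of the same sentence sit together
sort out.tmp > out.txt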


4) The resulting strings look like:

"The
The resulting
The resulting strings
The resulting strings look
The resulting strings look like" and the after the matches, one still has to do

* All single occurrences BEFORE any REPEATING match are gone. The LONGEST of the similar lines are the most valuable strings/segments.

5) Save/copy/import out.txt into your TM project.
This is usually my FIRST file in a project, so that the primary project TM can start to build up. There is the usual progress as the TM 'learns', and it consumes a bit of the computer's processing power/time.

6) In your terminal, run the following lines:

rm out.tmp in.txt
exit

7) Again, this is a messy approach that I use to identify fixed terms more easily and for quoting.

Have fun





[Edited at 2019-02-13 04:51 GMT]

[Edited at 2019-02-13 04:52 GMT]

[Edited at 2019-02-13 04:55 GMT]

[Edited at 2019-02-13 04:55 GMT]

[Edited at 2019-02-13 04:57 GMT]


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
Old thread Feb 13, 2019

chopra_2002 wrote:
I want to know is there any method or a software to extract same and similar sentences (segments) from a file?
...
I am looking for a software or a procedure by which one may remove the similar sentences.


Olaf's response to this old thread made me consider the possibility that I may have misinterpreted Chopra's original message. I had assumed that Chopra meant "retain in a separate file" when he said "extract", but upon rereading his request, it occurred to me that he may have meant "remove" when he said "extract". What do you all think?


 
Rajan Chopra
India
Member (2008)
English to Hindi
TOPIC STARTER
You are right Feb 13, 2019

I made a mistake. I should have written "remove" instead of "extract". Suppose a file has 40 sentences, of which 8 are similar to one another. A CAT tool can remove identical sentences, but it can't remove similar ones. So, in this example, it would be great to find a feature that keeps one of those 8 sentences and removes the other 7 in the newly created file. Later, the TM entry for the retained sentence may help in translating the remaining 7 sentences.

Is this possible with some software or application?

Regards,

Chopra



Samuel Murray wrote:

chopra_2002 wrote:
I want to know is there any method or a software to extract same and similar sentences (segments) from a file?
...
I am looking for a software or a procedure by which one may remove the similar sentences.


Olaf's response to this old thread made me consider the possibility that I may have misinterpreted Chopra's original message. I had assumed that Chopra meant "retain in a separate file" when he said "extract", but upon rereading his request, it occurred to me that he may have meant "remove" when he said "extract". What do you all think?



[Edited at 2019-02-13 07:13 GMT]


 
Olaf Schutze (X)
Vietnam
English to German
It's a cumbersome task Feb 13, 2019

especially for people who depend fully on ready-made lines.
Yes, it's also possible to extract ONLY the plain longest matches. It is very painful for me to do half-page sentences, whilst virtually any TM adds the next word for free, so the TM grows naturally without my doing a thing. Adding another 10-20 words whilst one only has a 5% match: that's the point of growing.
I have been using MemoQ 2015 for about a year or so; before that I was still on 4.x something.
As I work in IT, apart from translation with MemoQ I use nothing but Linux.
I actually ran the lines on some 45,000 segments too, and that is prepared within a few seconds.
My TMs hold about 5.5 million strings, plus those connected/exchanged online. That sounds like a lot, but that is the way I work: lazily mastering the TMs to have my work ready in a timely manner. And yes, they are still growing; there is barely a string with fewer than ten choices, and even less need to add new words, only to improve phrasing.


 
Olaf Schutze (X)
Vietnam
English to German
.... Feb 13, 2019

chopra_2002 wrote:

I made a mistake. I should have written remove instead of extract. Suppose, if a file has 40 sentences out of which there are 8 sentences which are similar. A CAT tool can remove the same sentences but it can't remove the similar sentences. So, in the example given above, it will be great if could find a feature to keep one of those 8 sentences and the 7 sentences are removed in the newly created file. Later, the TM of this sentence may help in translating the remaining those 7 sentences.

Is it possible by some software or application?

Regards,

Chopra



Samuel Murray wrote:

chopra_2002 wrote:
I want to know is there any method or a software to extract same and similar sentences (segments) from a file?
...
I am looking for a software or a procedure by which one may remove the similar sentences.


Olaf's response to this old thread made me consider the possibility that I may have misinterpreted Chopra's original message. I had assumed that Chopra meant "retain in a separate file" when he said "extract", but upon rereading his request, it occurred to me that he may have meant "remove" when he said "extract". What do you all think?



[Edited at 2019-02-13 07:13 GMT]



I would keep (and do keep) them all, because usually only a small bit at the end changes or is missing. What separates them from 100% matches is often the overhang at the end, which is too much or wrong.


 
Olaf Schutze (X)
Vietnam
English to German
"Yes, but again this only works if you're analysing your file against a TM, and I suspect the same " Feb 13, 2019

Samuel Murray wrote:

Philippe Etienne wrote:
Trados pre-2009 (Workbench)
It can extract all repetitions in a set of files. ... It could also generate a file with all the segments below a user-defined ceiling (like anything below a 85% match): the "unknown segments". But I understand you want xx%matches only, so it would be what's "left out".


In addition, this feature works only if you have an existing TM, and it extracts only segments that do not match that TM above the threshold. This means that if you try this with an empty TM, it will extract all segments, and if you try this with an exact-match TM, it will extract no segments.

I just checked, and Trados pre-2009 does have the "Create Project TM" feature that I had remembered, but it is/was only available in the professional (i.e. agency) version, and I can't tell from the user guide whether this project TM would have had only one match per segment (in which case it would be useless for the OP's purposes) or multiple matches.

MemoQ 2013 R2
Using the Pre-translate feature and setting the "Good match" value to whatever threshold of interest in the TM settings, you can pretranslate your set of files...


Yes, but again this only works if you're analysing your file against a TM, and I suspect the same will apply: if the TM is empty, all segments will be extracted; if the TM is an exact-match TM, no segments will be extracted.

According to this help page, MemoQ 2014 could also create a "project TM" in the same way as Trados pre-2009, but in the case of MemoQ it claims that the TM would contain all matches, not just the highest matches, which means that creating a project TM against an exact-match TM would result in something similar to what the OP is looking for. This feature is also available in MemoQ 2015 (which I have). Unfortunately the exported TM does not contain any indication of what the match percentages were, so the resulting TM will contain all segments and there would be no way to find and delete segments that fell below a certain threshold.

OmegaT
I just found that OmegaT does have a "Create Project TM" feature that exports all segments above a certain threshold (not just the highest match), although this doesn't help the OP because it also exports exact matches and it does not indicate the match percentage in the export TM.
https://gist.github.com/yu-tang/6526991



[Edited at 2018-07-15 16:24 GMT]



Usually, you should be able to connect more than a single TM to a translation/project. I always start off with an empty one, whilst older/bigger/merged ones provide what the empty one doesn't have. I always have about 10 in use.
You should number/version your TMs and set the older ones as secondary/minor ...
In my oldest ones, I now often see things that make me think "oooooh my goodness". If that happens too often, I just let a whole TM (the lowest-numbered one) die, regardless of matches, because when working the way I do there is quickly a huge new source to replace it.

[Edited at 2019-02-13 11:50 GMT]


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
I know of no such tool Feb 13, 2019

chopra_2002 wrote:
Suppose, if a file has 40 sentences out of which there are 8 sentences which are similar. It will be great if could find a feature to keep one of those 8 sentences and the 7 sentences are removed in the newly created file.


I know of no tool that can do this automatically, no. I can see the usefulness, though.
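
For what it is worth, the behaviour Chopra describes (keep one sentence from each group of similar ones and drop the rest) can be sketched outside a CAT tool. This is a minimal sketch, not an existing tool: it assumes one sentence per list entry, uses the ratio from Python's `difflib` instead of a CAT tool's match percentage, and keeps the first sentence it meets in each similar group. The function name `drop_similar` and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def drop_similar(sentences, threshold=0.85):
    """Greedy filter: keep a sentence only if it is not >= threshold similar to one already kept.

    Returns (kept, removed). The input order decides which sentence represents a group.
    """
    kept, removed = [], []
    for sentence in sentences:
        is_similar = any(
            SequenceMatcher(None, sentence, k).ratio() >= threshold for k in kept
        )
        (removed if is_similar else kept).append(sentence)
    return kept, removed


# Hypothetical sample sentences
sentences = [
    "Switch off the printer before cleaning.",
    "Switch off the scanner before cleaning.",
    "Switch off the printer before cleaning!",
    "Contact support if the problem persists.",
]

kept, removed = drop_similar(sentences)
print("Kept:", kept)
print("Removed:", removed)
```

The `kept` list would go into the newly created file, and the `removed` list could be translated afterwards with the help of the TM built from the kept sentences. The threshold is only comparable to "85% similar" in a loose sense, since the scoring differs from that of any CAT tool.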


 

