Explore duplicate institutions #189
Comments
Examples of potential duplicates, split across multiple comments as the text is too long
11 groups by title+code
|
146 groups by code
|
404 by title
332 fuzzy title match groups
51 by fuzzy title+city
|
Same institution
Same code, different titles
Different codes, same titles
PCU Muséum National d'Histoire Naturelle
Different institutions
Same code, different titles |
To help decide if something is an institution or collection we can look at
To decide which institution code to use, we could look at
[+ Marie edit:]
|
Thanks @MortenHofft! For each cluster, someone needs to decide:
Once everything is decided, here is what should happen:
This means that we would need 3 functions in the API:
To decide which institution should be the main one, we should use what Morten describes above:
|
Good summary @ManonGros
In this case we need to decide what to do with the identifiers.
So marking the institution as |
Looking in more detail at the likely duplicates, the vast majority (a guess: 95%) are related to Index Herbariorum. Can that help us somehow? |
What about the following? If an institution should in fact be a collection:
|
Summary of our Skype conversation yesterday:
We ended up choosing option 1 for cases where we know this is a duplicate: either something that we have generated, or a case where the institution contacts us to delete the duplicate. We have a case for this. The Bishop Museum contacted us to delete the duplicate "BISH" institution and keep only a "BISH" collection under the "BPBM" institution.
|
Already implemented in /grscicoll/[institution|collection]/possibleDuplicates |
We know we have duplicate institutions, but not how many. Let us try to identify the duplicates and discuss how to clean/merge them (if that is indeed preferred). The first step is to get a feeling for how many duplicates there are and what types we see.
Glance and guess: we can delete 500 institutions (by merging them with existing ones)
Update: the examples below only include institutions without a country (a mistake), so 2500 institutions were ignored in the cluster examples below.
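For anyone wanting to reproduce the kind of grouping discussed in the comments (by code, by title, by title+code, and by fuzzy title), here is a minimal Python sketch. It is not how the clusters in this issue were actually produced. It assumes each institution is a dict with `key`, `code`, and `name` fields, and the "fuzzy" match is only a simple normalization (lowercasing, stripping accents and punctuation), which is an assumption and probably looser or stricter than what was really used. The function names are made up for illustration.

```python
from collections import defaultdict
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_accents.lower())
    return " ".join(cleaned.split())


def group_by(institutions, key_fn):
    """Group institution keys by key_fn; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for inst in institutions:
        group_key = key_fn(inst)
        if group_key:  # skip records where the grouping value is missing
            groups[group_key].append(inst["key"])
    return {k: v for k, v in groups.items() if len(v) > 1}


def title_and_code(inst):
    """Combined key, or None if either part is missing."""
    if inst.get("name") and inst.get("code"):
        return (inst["name"], inst["code"])
    return None


def duplicate_groups(institutions):
    """Return candidate duplicate clusters for each grouping strategy."""
    return {
        "code": group_by(institutions, lambda i: i.get("code")),
        "title": group_by(institutions, lambda i: i.get("name")),
        "title+code": group_by(institutions, title_and_code),
        "fuzzy_title": group_by(
            institutions, lambda i: normalize_title(i.get("name") or "")
        ),
    }
```

Each strategy returns a mapping from the shared value to the list of institution keys in that cluster, so the number of groups per strategy is just `len(result["code"])` and so on. Someone still has to review each cluster by hand, since the same code with different titles can be either one institution or several.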