Chinese PoS Segmentation Technical Notes


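1. Running a compiled Java program

command: java XXXX

Do not include the ".class" suffix in the program name when calling the java command. If the suffix is included, the launcher treats "XXXX.class" as the class name and fails with a "Could not find or load main class" error.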


2. Character encoding conversion on Linux

no.1 Check the file's encoding:

file --mime-encoding filename

no.2 List the encodings available on the system:

iconv -l

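no.3 Convert the file. iconv writes the converted text to standard output, so redirect it to a new file (redirecting to the input file itself would truncate it):

iconv -f old_encoding -t new_encoding filename > new_filename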
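
Where iconv is not available, or the conversion has to happen inside a script, the same step can be sketched in Python. This is a minimal sketch and not part of the original notes: the function name convert_encoding is hypothetical, and it assumes the whole file fits in memory.

```python
def convert_encoding(src_path, dst_path, old_encoding, new_encoding):
    """Re-encode a text file (hypothetical helper, mirrors `iconv -f ... -t ...`)."""
    # newline="" keeps the line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=old_encoding, newline="") as src:
        text = src.read()
    # Encoding errors are strict by default, so an unrepresentable character
    # raises UnicodeEncodeError instead of being silently dropped.
    with open(dst_path, "w", encoding=new_encoding, newline="") as dst:
        dst.write(text)
```

For example, convert_encoding("in.txt", "out.txt", "utf-8", "gbk") would write a GBK copy of a UTF-8 file; both file names here are placeholders.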


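3. Java IO BufferedWriter

Text written through a BufferedWriter is held in an in-memory buffer and is only guaranteed to reach the target file once the buffer is flushed. Calling close() flushes the buffer and releases the file, so a BufferedWriter must always be closed (or at least flushed with flush()) when writing is finished; otherwise the end of the output may be missing.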


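4. Java IO: reading and writing UTF-8 text files

The OutputStreamWriter constructor accepts an encoding parameter, so it can wrap a FileOutputStream to write text in a specific encoding, for example new OutputStreamWriter(new FileOutputStream(path), "UTF-8"). For reading, InputStreamReader wraps a FileInputStream in the same way.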


5. ICTCLAS2015 user-defined dictionary paths

Data/FieldDict.pdat, Data/FieldDict.pos


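6. ICTCLAS2015 user dictionary encoding

The user dictionary must be saved in ANSI encoding to be imported correctly into the Data/FieldDict.pdat and Data/FieldDict.pos files. On a Simplified Chinese Windows system, "ANSI" typically means GBK (code page 936).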


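7. Python ctypes.c_char_p()

c_char_p represents a C "char *". Its constructor accepts a bytes object, None, or an integer memory address; in Python 3 a str must be encoded to bytes first, otherwise a TypeError is raised. The value returned by id() should not be passed as the address: in CPython it happens to be the address of the Python object itself, not of its character data, and this is an implementation detail.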
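
A minimal sketch of building a c_char_p from Python text follows. The helper name to_c_string is hypothetical, and UTF-8 is only an assumed default; a C library may expect a different encoding.

```python
import ctypes


def to_c_string(text, encoding="utf-8"):
    """Encode a Python str and wrap it as a C `char *` (hypothetical helper)."""
    raw = text.encode(encoding)    # str -> bytes; c_char_p rejects str in Python 3
    return ctypes.c_char_p(raw)    # the c_char_p now refers to the encoded bytes


c_text = to_c_string("中文分词")
print(c_text.value)                    # the bytes that C code would see
print(c_text.value.decode("utf-8"))    # decoded back to a Python str

try:
    ctypes.c_char_p("中文分词")         # passing a str directly is an error
except TypeError as exc:
    print("TypeError:", exc)
```

The resulting object can be passed as an argument to a function loaded through ctypes that expects a "char *" parameter.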


8. HanLP segment.seg() and newline characters

By default, HanLP's segment.seg() function removes all "\n" characters from the input text. How to change this behavior is unknown.


9. Java unnamed package

If a source file has no package statement, its classes, interfaces, enumerations, and annotation types are placed in the unnamed package. Types in the unnamed package cannot be imported by code in a named package.


10. Python Requests module timeout

requests.get("XXX", timeout=1)
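
The timeout value is given in seconds. It applies to the connection attempt and to each wait for data from the server, not to the total download time; without it, Requests waits indefinitely. A minimal sketch of handling the resulting exception follows. The helper name fetch_text is hypothetical, and returning None on failure is just one possible policy.

```python
import requests


def fetch_text(url, timeout=1.0):
    """Return the response body as text, or None on failure (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # turn HTTP error statuses into exceptions
    except requests.exceptions.Timeout:
        # The server did not connect or send data within `timeout` seconds.
        return None
    except requests.exceptions.RequestException:
        # Any other failure: refused connection, DNS error, HTTP error status.
        return None
    return response.text
```

A tuple such as timeout=(3, 10) sets the connect and read timeouts separately.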
